Architecture of Large-Scale Information Systems

Using Hadoop on Your Computer and AWS

Project 4 involves writing and running some MapReduce problems in a Hadoop cluster on Amazon EC2. This page has some guidelines on setting up Hadoop on your workstation (for development purposes) and running Hadoop in an EC2 cluster (for production).

Downloading and Installing Hadoop

Release versions of Hadoop for download can be found here. Please use a "stable" release; as of this writing the latest stable release is 0.15.3, and there is a public Amazon EC2 AMI built with that release, so it is the one we recommend.

So download the release and expand it somewhere like /usr/local/hadoop-0.15.3. Henceforth we'll call this location $HADOOP_ROOT.

The Hadoop release provides some convenient scripts that make launching a cluster much easier. Unfortunately, there is a familiar problem: these scripts do not work if there is more than one Hadoop user in the same Amazon EC2 account. And they can fail in nasty ways, for example by trashing some other student's running Hadoop EC2 cluster.

So we have provided replacements for the scripts. This is very important: you must

INSTALL THE REPLACEMENT SCRIPTS FIRST

that is, before you try running Hadoop on EC2. Here is how to do it.

Download the file ec2-new-bin.zip from here (the file is also available from the Project 4 page in CMS).
Expand the zip file in
$HADOOP_ROOT/src/contrib/ec2
The resulting folder will be named $HADOOP_ROOT/src/contrib/ec2/ec2-new-bin
Remove (or rename) the old bin folder $HADOOP_ROOT/src/contrib/ec2/bin -> $HADOOP_ROOT/src/contrib/ec2/foo
Rename the new bin folder $HADOOP_ROOT/src/contrib/ec2/ec2-new-bin -> $HADOOP_ROOT/src/contrib/ec2/bin
WINDOWS users only: Copy $HADOOP_ROOT/src/contrib/ec2/bin/windows-launch to $HADOOP_ROOT/src/contrib/ec2/bin/launch-hadoop-cluster. Do this only if you are a Windows user.
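
The rename steps above can be sketched as a small shell function. This is only an illustration, assuming ec2-new-bin.zip has already been expanded in $HADOOP_ROOT/src/contrib/ec2. The function name swap_ec2_bin and the backup folder name bin-original are made up for this example (the steps above use foo as the backup name).

```shell
# Hypothetical helper: swap the replacement EC2 scripts into place.
# $1 is the Hadoop root directory ($HADOOP_ROOT in the text).
swap_ec2_bin() {
    ec2_dir="$1/src/contrib/ec2"
    if [ ! -d "$ec2_dir/ec2-new-bin" ]; then
        echo "ec2-new-bin not found in $ec2_dir; expand the zip file first" >&2
        return 1
    fi
    # Keep the original scripts under another name instead of deleting them.
    if [ -d "$ec2_dir/bin" ]; then
        mv "$ec2_dir/bin" "$ec2_dir/bin-original" || return 1
    fi
    # The replacement scripts become the new bin folder.
    mv "$ec2_dir/ec2-new-bin" "$ec2_dir/bin"
}
```

It would be called as swap_ec2_bin "$HADOOP_ROOT". Windows users still need the extra copy step described above.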


These new scripts are functionally equivalent to the original ones,
but they allow multiple Hadoop clusters to run in the same AWS account,
provided the clusters all use different security groups.
Our conventions for naming AWS groups
require each student to use a security group name constructed from his or her netid,
so we should be safe.

Trying Hadoop on Your Own Machine

A nice "QuickStart" guide for running Hadoop on your local machine
is available
here
or
here.
(The two versions are slightly different; the second one is a little better at telling you how to retrieve output.)
It is worthwhile trying both "stand-alone" and "pseudo-distributed" operation
on your own machine,
both to gain confidence
and to provide yourself with a platform for development and initial testing.
Notes:

Windows users will need to install Cygwin
if they don't already have it. Cygwin is a Linux-like environment for Windows; with respect to this project,
you can treat it just like any other UNIX shell. Your Cygwin installation needs OpenSSH. So, during installation, when you
reach the window titled "Cygwin Setup - Select Packages", click the View button, scroll down to OpenSSH, and
click its Skip entry so that the package is selected for installation. After ensuring your AWS environment variables
are intact (by displaying them with the "echo" command), run your
Hadoop scripts by following the instructions below. While using Cygwin, use "/cygdrive/c/" in
place of "C:\". If you used PuTTY for the previous project, chances are that you do
not have the private key file in Linux format, since PuTTY requires you to convert the private key file to its own format.
So remember to have a UNIX-style private key file as well, to be able to run the Hadoop scripts.
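
As a small illustration of the "/cygdrive/c/" convention, the sketch below rewrites a Windows-style path by hand. The function name to_cygdrive is made up for this example; Cygwin also ships a cygpath utility (cygpath -u) that does this conversion properly, so treat the function as an explanation of the mapping rather than a replacement for it.

```shell
# Hypothetical helper: turn C:\dir\file into /cygdrive/c/dir/file.
to_cygdrive() {
    case "$1" in
        [A-Za-z]:*)
            # Lower-case the drive letter and flip backslashes to slashes.
            drive=$(printf '%s' "$1" | cut -c1 | tr 'A-Z' 'a-z')
            rest=$(printf '%s' "$1" | cut -c3- | tr '\\' '/')
            printf '/cygdrive/%s%s\n' "$drive" "$rest"
            ;;
        *)
            # Not a drive-letter path: leave it unchanged.
            printf '%s\n' "$1"
            ;;
    esac
}
```

For example, to_cygdrive 'C:\Users\me\key.pem' prints /cygdrive/c/Users/me/key.pem (the path itself is a placeholder).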

Mac users need to set up ssh to allow connections from localhost to localhost
without a password.
Instructions for doing this are here,
or there is a more elaborate but safer version here.
You also need to make sure sshd is running; use

System Preferences -> Sharing -> Remote Login
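
For orientation, here is a sketch of one common way to allow password-less ssh from localhost to localhost: generate a key pair with an empty passphrase and append the public key to authorized_keys. This is not necessarily what the linked instructions do, and an empty passphrase is the simpler, less safe choice. The function name setup_localhost_key is made up for this example.

```shell
# Hypothetical helper: create a passphrase-less RSA key (if none exists)
# and authorize it for logins to this same account.
# $1 is the ssh directory; it defaults to ~/.ssh.
setup_localhost_key() {
    ssh_dir="${1:-$HOME/.ssh}"
    key="$ssh_dir/id_rsa"
    mkdir -p "$ssh_dir" || return 1
    chmod 700 "$ssh_dir" || return 1
    # Never overwrite an existing key.
    if [ ! -f "$key" ]; then
        ssh-keygen -q -t rsa -N '' -f "$key" || return 1
    fi
    # Append the public key only if it is not already authorized.
    if ! grep -qxF -f "$key.pub" "$ssh_dir/authorized_keys" 2>/dev/null; then
        cat "$key.pub" >> "$ssh_dir/authorized_keys" || return 1
    fi
    chmod 600 "$ssh_dir/authorized_keys"
}
```

After running setup_localhost_key with Remote Login enabled, ssh localhost should connect without asking for a password.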



There are several examples provided with the distribution.
These are documented: visit the Hadoop Wiki,
then scroll down to "Examples" under "User Documentation" to find
instructions for running them.

Testing Hadoop on EC2

Your next step is to boot up a Hadoop cluster in Amazon EC2.
Before you try anything,

DID YOU INSTALL THE UPDATED SCRIPTS?

If not, go back and do it now.

A couple of tutorials explaining the details of Hadoop EC2 setup are available
here and
here.
These tutorials are very similar, but each has its strengths and weaknesses.
It is probably worth the effort to read them both before you try running anything.

Warning: the first of them has a BUG concerning the use of S3 files.
This bug is described in the next section.

Simply put, the setup involves editing some configuration files in the Hadoop distribution
on your local machine.
These configuration files are consulted by the scripts
that boot up your AMIs.
By default, the scripts use public AMIs that have been provided (by the Hadoop project)
with Hadoop already installed.
We strongly recommend that you use one of these public AMIs, to save some time.
If you want to build your own AMI and install Hadoop, the instructions are posted here.


You can find which versions of Hadoop are available as public AMIs by using the Amazon EC2 tools.
In Unix or MacOS X you can use the command

ec2-describe-images -a | grep hadoop-ec2-images

Actually, you don't need to remember the id of the AMI. You just need to know
that a public AMI exists with the version of Hadoop you want.
The scripts will find it automatically, using information you provide next. 
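
To give a feel for what such a lookup involves, the sketch below filters the output of ec2-describe-images for a given bucket and Hadoop version. It is not the code the scripts actually use. The function name find_hadoop_ami is made up, and the sketch assumes that each image line carries the AMI ID in its second column and the manifest path (bucket name plus version) somewhere on the line.

```shell
# Hypothetical helper: read `ec2-describe-images -a` output on stdin and
# print the IDs of images whose manifest matches the bucket and version.
# $1 is the Hadoop version; $2 is the S3 bucket (default hadoop-ec2-images).
find_hadoop_ami() {
    version="$1"
    bucket="${2:-hadoop-ec2-images}"
    # -F treats the dots in the version as literal characters.
    grep -F "$bucket/" | grep -F "$version" | awk '{ print $2 }'
}
```

It would be used as ec2-describe-images -a | find_hadoop_ami 0.15.3. No output means no public AMI matched that version.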

Navigate to $HADOOP_ROOT, the main directory of the Hadoop distribution on your local machine.
Then edit the EC2 configuration file in

src/contrib/ec2/bin/hadoop-ec2-env.sh

to specify your Amazon Web Services settings, the Hadoop version to run on the cluster (which does not have to match the version of the distribution you unpacked on your workstation), the hostname for the master, and the size of the cluster. These are the variables that you need to set in the file. The values for most of these variables are ones that you have already used for Project 2.


 AWS_ACCOUNT_ID : The CS530 Amazon account ID
 AWS_ACCESS_KEY_ID : The CS530 Amazon AWS access key ID
 AWS_SECRET_ACCESS_KEY : The CS530 Amazon AWS secret access key
 EC2_KEYDIR=`dirname "$EC2_PRIVATE_KEY"`  (This one has already been set to the directory containing your private key file. DO NOT CHANGE.)
 KEY_NAME : The name of your key pair (contains your netID)
 PRIVATE_KEY_PATH : The path to the private key file. (For example the value might be ~/.aws/id-rsa-kp-vv39-gsg)
 SSH_OPTS=`echo -i "$PRIVATE_KEY_PATH" -o StrictHostKeyChecking=no` : DO NOT CHANGE.
 HADOOP_VERSION=0.15.3 : DO NOT CHANGE.
 S3_BUCKET=hadoop-ec2-images : This is the S3 bucket containing the Hadoop AMI. DO NOT CHANGE unless you are using your own private AMI.
 GROUP : The security group in which you want to run the AMI. (contains your netID)
 MASTER_HOST : DNS Host name for the master: see below for more instructions.
 NO_INSTANCES : The number of nodes in your cluster. Use a very low number (2 or 3) for your initial tests.

The following variables are used only when creating an AMI.

 JAVA_BINARY_URL='' : The download URL for the Sun JDK. Visit Sun and get the URL for the "Linux self-extracting file"
 JAVA_VERSION=1.6.0_02 : The version number of the installed JDK.
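
Before launching anything, it can help to confirm that none of the settings you are expected to fill in was left empty. The sketch below checks the variables from the first list above (not the AMI-only ones) in the current shell. The function name check_ec2_env is made up for this example, and the sketch assumes hadoop-ec2-env.sh can be sourced on its own to load the values.

```shell
# Hypothetical helper: report which required settings are empty or unset.
check_ec2_env() {
    missing=""
    for name in AWS_ACCOUNT_ID AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY \
                KEY_NAME PRIVATE_KEY_PATH GROUP MASTER_HOST NO_INSTANCES; do
        # Indirect lookup of the variable called $name (POSIX sh has no ${!name}).
        eval "value=\${$name:-}"
        [ -n "$value" ] || missing="$missing $name"
    done
    if [ -n "$missing" ]; then
        echo "not set:$missing" >&2
        return 1
    fi
}
```

A possible use, from $HADOOP_ROOT, is . src/contrib/ec2/bin/hadoop-ec2-env.sh followed by check_ec2_env.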


The hostname you select must be one you have control over, as you will be asked to set it to point to a particular IP address during launch.
Free services such as DynDNS make this very easy.
We have already created individual host names for you on the DynDNS service.
Your host name is of the form

your_cu_id-cs530.dyndns.org

for example, vv39-cs530.dyndns.org.
This is the name that should be inserted for MASTER_HOST in the hadoop-ec2-env.sh file above.
On launch, you will be prompted to set the DNS for that hostname to the IP address that the master has been allocated on EC2,
as described below.

With the configuration out of the way we're ready to go.
From the $HADOOP_ROOT directory, type

src/contrib/ec2/bin/hadoop-ec2 run

This invokes a script that does the following:

 Starts a cluster of Hadoop nodes.
 Prompts you to set up the master DNS name to point to a given IP address.
 Formats the HDFS filesystem on the cluster.
 Starts the Hadoop daemons on the cluster.
 Logs you onto the master node.

When you are prompted to set the DNS,
do it by visiting DynDNS
and logging in

username: vidya81       password: demvid123

Then select my hosts from the left hand column,
click on your personal hostname,
and enter the specified IP address. Choose the "Host with IP address" option and save the changes.

If you are having trouble,
it's also possible to execute the steps of the run command
one at a time, which can be useful for debugging.
Type

src/contrib/ec2/bin/hadoop-ec2

for usage instructions.
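
As a sketch of what running the steps one at a time might look like, the function below calls three subcommands in order and stops at the first failure. The subcommand names launch-cluster, start-hadoop and login are an assumption based on the stock Hadoop EC2 scripts; check the usage output of your installed scripts before relying on them. The function name run_cluster_steps and the HADOOP_EC2 override variable are made up for this example.

```shell
# Hypothetical helper: run the cluster start-up steps individually.
# Set HADOOP_EC2 to override the script path (useful for a dry run).
run_cluster_steps() {
    hadoop_ec2="${HADOOP_EC2:-src/contrib/ec2/bin/hadoop-ec2}"
    # Subcommand names are assumed; confirm them with the usage output.
    for step in launch-cluster start-hadoop login; do
        echo "== $step" >&2
        "$hadoop_ec2" "$step" || { echo "step failed: $step" >&2; return 1; }
    done
}
```

Running it with HADOOP_EC2=echo prints the steps without contacting EC2, which is a cheap way to see the order before trying it for real.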

Congratulations! You are now ready to use the Hadoop cluster for your MapReduce problems.

A Bug

The first
Hadoop on EC2 Tutorial has a bug related to S3 use.
In the section on "Data Considerations" the command to copy a file into S3
presumes a different setup from the one used for our installation,
and won't work as given.
Instead you should first create a bucket in S3
for your HDFS data
(using the CS530 bucket naming conventions,
so the name of your bucket will contain your netID).
Then, to store data from the local file system of a machine into HDFS on S3, use the command
(all one line):

bin/hadoop   fs   -fs   s3://AWS_KEY_ID:AWS_SECRET_KEY@BUCKET   /path/to/source   /path/to/target
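
Because the key ID and secret key are embedded in the URI, it is easy to mistype it. The sketch below assembles the s3:// URI from its three parts. One detail here is an assumption not stated on this page: a secret key containing a slash reportedly has to have it escaped as %2F inside such a URI, so the function does that. The function name s3_hdfs_uri is made up for this example.

```shell
# Hypothetical helper: build the s3:// URI used as the -fs argument.
# $1 = AWS access key ID, $2 = AWS secret access key, $3 = bucket name.
s3_hdfs_uri() {
    key_id="$1"
    secret="$2"
    bucket="$3"
    # Assumption: '/' in the secret must be written as %2F in the URI.
    escaped=$(printf '%s' "$secret" | sed 's|/|%2F|g')
    printf 's3://%s:%s@%s\n' "$key_id" "$escaped" "$bucket"
}
```

The result can then be passed as the -fs value in the command above. Keep in mind that the secret key ends up in your shell history and process list when used this way.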

Files stored on S3 using HDFS this way are not the same as ordinary S3 files.
Instead, HDFS stores file and directory metadata, and possibly very large file data blocks,
as individual S3 objects.
For example, if you store a file to

mydir/myfile.xml

and then list the contents of your HDFS bucket,
you should see something like

% s3ls edu-cornell-cs-cs530-ajd28-hdfs
Bucket edu-cornell-cs-cs530-ajd28-hdfs (6):
  /
  /user
  /user/root
  /user/root/mydir
  /user/root/mydir/myfile.xml
  block_-2015980193369110525

The procedure for copying data in the other direction --
from HDFS on S3 to HDFS on your cluster -- using the distcp command
appears to work correctly, so a sequence like

bin/hadoop   fs   -mkdir   destination

bin/hadoop   distcp   s3://AWS_KEY_ID:AWS_SECRET_KEY@TEST_DATA_BUCKET/path/to/source/files   destination

could be used by any student to retrieve test data files we placed in TEST_DATA_BUCKET as part of the project.

Shutting Down Your Cluster

Just as in Project 3, please

DO NOT LEAVE YOUR CLUSTER RUNNING OVERNIGHT!

Shutting down your cluster is easy.
From the $HADOOP_ROOT directory on your local machine,
type

 src/contrib/ec2/bin/hadoop-ec2 terminate-cluster

Then reply to the prompt, and your Hadoop instances (and nobody else's) will be terminated.
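
To double-check that nothing of yours is still running afterwards, something like the sketch below could count running instances in your security group from the output of ec2-describe-instances. The function name count_running_in_group is made up, and the output layout is an assumption: each RESERVATION line is taken to name the security group in its fourth column, followed by INSTANCE lines that contain the word running while the instance is up.

```shell
# Hypothetical helper: read `ec2-describe-instances` output on stdin and
# print how many instances in the given security group are running.
# $1 is your security group name.
count_running_in_group() {
    awk -v group="$1" '
        # Assumed layout: RESERVATION <id> <owner> <group>
        $1 == "RESERVATION" { in_group = ($4 == group) }
        $1 == "INSTANCE" && in_group {
            for (i = 2; i <= NF; i++) {
                if ($i == "running") { n++; break }
            }
        }
        END { print n + 0 }
    '
}
```

It would be used as ec2-describe-instances | count_running_in_group followed by your security group name. A result of 0 after terminate-cluster is what you want to see.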