CS530 S08 | TR 11:40-12:55 | Olin 245
Architecture of Large-Scale Information Systems
Using Hadoop on Your Computer and AWS

Project 4 involves writing and running some MapReduce problems in a Hadoop cluster on Amazon EC2. This page has some guidelines on setting up Hadoop on your workstation (for development purposes) and running Hadoop in an EC2 cluster (for production).

Downloading and Installing Hadoop

Release versions of Hadoop for download can be found here. Please use a "stable" release; as of this writing the latest stable release is 0.15.3, and there is a public Amazon EC2 AMI built with that release, so it is the one we recommend.
So download the release and expand it somewhere
like

The Hadoop release provides some convenient scripts that make launching a cluster much easier. Unfortunately, there is a familiar problem: these scripts do not work if there is more than one Hadoop user in the same Amazon EC2 account. They can also fail in nasty ways, for example, by trashing some other student's running Hadoop EC2 cluster. So we have provided replacements for the scripts. This is very important: you must INSTALL THE REPLACEMENT SCRIPTS FIRST, that is, before you try running Hadoop on EC2. Here is how to do it.
Trying Hadoop on Your Own Machine

A nice "QuickStart" guide for running Hadoop on your local machine is available here or here. (The two versions are slightly different; the second one is a little better at telling you how to retrieve output.) It is worthwhile trying both "stand-alone" and "pseudo-distributed" operation on your own machine, both to gain confidence and to provide yourself with a platform for development and initial testing.

Notes:
There are several examples provided with the distribution. These are documented: visit the Hadoop Wiki, then scroll down to "Examples" under "User Documentation" to find instructions for running them.

Testing Hadoop on EC2

Your next step is to boot up a Hadoop cluster in Amazon EC2. Before you try anything, DID YOU INSTALL THE UPDATED SCRIPTS? If not, go back and do it now.

A couple of tutorials explaining the details of Hadoop EC2 setup are available here and here. These tutorials are very similar, but each has its strengths and weaknesses. It is probably worth the effort to read them both before you try running anything. Warning: the first of them has a BUG concerning the use of S3 files. This bug is described in the next section.

Simply put, the setup involves editing some configuration files in the Hadoop distribution on your local machine. These configuration files are consulted by the scripts that boot up your AMIs. By default, the scripts use public AMIs that have been provided (by the Hadoop project) with Hadoop already installed. We strongly recommend that you use one of these public AMIs, to save some time. If you want to build your own AMI and install Hadoop, the instructions are posted here.

You can find which versions of Hadoop are available as public AMIs by using the Amazon EC2 tools. In Unix or MacOS X you can use the command
Actually, you don't need to remember the id of the AMI. You just need to know
that a public AMI exists with the version of Hadoop you want.
The scripts will find it automatically, using information you provide next.
Navigate to
to specify your Amazon Web Service settings, the Hadoop version to run on the cluster (which does not have to match the version of the distribution we unpacked on our workstation), the hostname for the master, and the size of the cluster. These are the variables that you need to set in the file. The values for most of these variables are ones that you have already used for project 2.
The hostname you select must be one you have control over, as you will be asked to set it to point to a particular IP address during launch. Free services such as DynDNS make this very easy. We have already created individual host names for you on the DynDNS service. Your host name is of the form your_cu_id-cs530.dyndns.org; for example, vv39-cs530.dyndns.org.
This is the name that should be inserted for MASTER_HOST in the hadoop-ec2-env.sh file above.
On launch, you will be prompted to set the DNS for that hostname to the IP address that the master has been allocated on EC2,
as described below.
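
As a small illustration of the naming convention just described, the following Python sketch builds the host name from a netID. The function name master_hostname is hypothetical and introduced only for this example; the only thing taken from this page is the your_cu_id-cs530.dyndns.org pattern.

```python
def master_hostname(cu_id: str) -> str:
    """Return the course DynDNS host name for a CU netID.

    Follows the pattern stated on this page: your_cu_id-cs530.dyndns.org.
    The function itself is a hypothetical helper, not part of Hadoop.
    """
    cu_id = cu_id.strip()
    if not cu_id:
        raise ValueError("cu_id must not be empty")
    return f"{cu_id}-cs530.dyndns.org"


if __name__ == "__main__":
    # The resulting name is the value to use for MASTER_HOST.
    print(master_hostname("vv39"))
```

The printed name is the value that goes into MASTER_HOST in hadoop-ec2-env.sh.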
With the configuration out of the way we're ready to go.
From the src/contrib/ec2/bin/hadoop-ec2 run

This invokes a script that does the following:
username: vidya81 password: demvid123
Then select my hosts from the left-hand column,
click on your personal hostname,
and enter the specified IP address. Choose the "Host with IP address" option and save the changes.
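
If you want to confirm that the DNS change has taken effect before continuing, a check along the following lines may help. This is only a sketch: the function name dns_points_to and the IP address in the usage example are made up for illustration, and the lookup relies on Python's standard socket.gethostbyname, so a resolver that caches old answers can still report the previous address for a while.

```python
import socket


def dns_points_to(hostname, expected_ip, resolve=socket.gethostbyname):
    """Return True if `hostname` currently resolves to `expected_ip`.

    `resolve` is injectable so the check can be exercised without a
    network connection; by default it performs a real DNS lookup.
    """
    try:
        return resolve(hostname) == expected_ip
    except OSError:
        # socket.gaierror (lookup failure) is a subclass of OSError.
        return False


if __name__ == "__main__":
    # Hypothetical values: substitute your own host name and the IP
    # address you were given for the master at launch.
    print(dns_points_to("vv39-cs530.dyndns.org", "192.0.2.10"))
```

A False result right after saving the change does not necessarily mean something is wrong; it may only mean the update has not propagated yet.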
If you are having trouble,
it's also possible to execute the steps of the
for usage instructions.
Congratulations! You are now ready to use the Hadoop cluster for your map-reduce problems.
A Bug

The first Hadoop on EC2 Tutorial has a bug related to S3 use. In the section on "Data Considerations" the command to copy a file into S3 presumes a different setup from the one used for our installation, and won't work as given. Instead you should first create a bucket in S3 for your HDFS data (using the CS530 bucket naming conventions, so the name of your bucket will contain your netID). Then, to store data from the local file system of a machine into HDFS on S3, use the command (all one line):
Files stored on S3 using HDFS this way are not the same as ordinary S3 files.
Instead, HDFS stores file and directory metadata, and possibly very large file data blocks,
as individual S3 objects.
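
To make the distinction concrete, here is a toy Python model of a block-structured file store. It is not Hadoop's actual on-S3 layout: the key names ("meta:..." and "blk-...") and the tiny block size are invented for illustration, and a plain dict stands in for the bucket. It only shows why a bucket used this way holds several objects per file, none of which is the file itself.

```python
def store_file(store: dict, path: str, data: bytes, block_size: int) -> None:
    """Write `data` as separate block objects plus one metadata object."""
    block_keys = []
    for offset in range(0, len(data), block_size):
        # Hypothetical key scheme; the real naming differs.
        key = f"blk-{path}-{offset // block_size}"
        store[key] = data[offset:offset + block_size]
        block_keys.append(key)
    # The metadata object records which blocks make up the file.
    store[f"meta:{path}"] = {"length": len(data), "blocks": block_keys}


def read_file(store: dict, path: str) -> bytes:
    """Reassemble a file by following its metadata object."""
    meta = store[f"meta:{path}"]
    return b"".join(store[key] for key in meta["blocks"])


if __name__ == "__main__":
    bucket = {}  # stand-in for an S3 bucket
    store_file(bucket, "/data/input.txt", b"abcdefghij", block_size=4)
    for key in sorted(bucket):
        print(key)
    print(read_file(bucket, "/data/input.txt"))
```

Listing the toy bucket shows metadata and block entries rather than one object named after the file, which is the same effect you will observe in your real HDFS bucket.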
For example, if you store a file to
and then list the contents of your HDFS bucket,
you should see something like
The procedure for copying data in the other direction --
from HDFS on S3 to HDFS on your cluster -- using the distcp command
appears to work correctly, so a sequence like
bin/hadoop fs -mkdir destination
could be used by any student to retrieve test data files we placed in TEST_DATA_BUCKET as part of the project.
Shutting Down Your Cluster

Just as in Project 3, please DO NOT LEAVE YOUR CLUSTER RUNNING OVERNIGHT!

Shutting down your cluster is easy: from the $HADOOP_ROOT directory on your local machine
type
reply to the prompt, and your Hadoop instances (and nobody else's) will be terminated.