Architecture of Large-Scale Information Systems

Project 4: Hadoop MapReduce on Amazon Web Services

In this project you will do scalable implementations of a few simple computations using the MapReduce programming model implemented by Hadoop running on Amazon EC2.

Part 0: Usage Conventions for AWS

For Project 3 we defined some AWS "usage conventions" that enable multiple student projects to use the same AWS account without conflicts. The conventions are defined here. In this project we assume you have installed the AWS tools and are following the conventions described in that document.

Part 1: Getting Familiar with Hadoop

We have produced a separate document that discusses how to install Hadoop on your local machine and how to run Hadoop jobs on an EC2 cluster; it is available here. That document tells you how to install and run Hadoop, but not how to write MapReduce programs. For that you need to go elsewhere. Lots of documentation is available at the Hadoop Core web page. A particularly good reference is this Hadoop MapReduce Tutorial.

Part 2: Input Data

We have provided several input data files for you. The data is in text files, all with the same format. Each line contains three positive integers

index tag value

The tag field is either 1 or 2. The index and value fields are between 1 and an upper bound of a few K -- your code does not need to know this upper bound. These files can all be interpreted using TextInputFormat as discussed in the Hadoop MapReduce Tutorial. The following files are available:

Test data in a pair of ordinaray text files is available here and here. There are only a few records here, basically enough for debugging but not performance tuning.
The same test data is available in Hadoop HDFS files on S3 in the bucket
edu-cornell-cs-cs530-proj4a-test
which can be retrieved using the
bin/hadoop distcp s3://....
command as discussed in the section "Moving Data and Code to the Cluster" in this Hadoop on EC2 Tutorial.
Production Data -- data in the same format, but a lot more of it -- is available as a set of Hadoop HDFS files on S3 in a different bucket,
edu-cornell-cs-cs530-proj4a-production
There are gigabytes of data here, so you probably don't want to try using it until you have successfully run each problem on the smaller test data file.

Part 3: The Assignments

We are asking you to perform five different MapReduce computations on the production data. To help describe these computations, consider the following example data:

1 2 8
5 1 22
8 1 8
5 2 6
5 1 3

Note one index field occurs three times (5), and one value field occurs twice (8).

(a) A parallel max computation

Compute the maximum over all tuples of the product (index * value). The MapReduce result will be a set of key-data pairs. We don't care what the key is (that's part of how you design the computation), but the maximum value is 110, achieved by the second row of the input data, so your MapReduce job should produce a single line of output with the answer 110 appearing at some prescribed position:

XXX 110

(b) Parallel computation of sum grouped by index

The output should be a set of key-data pairs; the key is an index, and the data is the sum of all value fields associated with that index. In our sample data only one index is repeated (5). A possible output might be:

1 8
5 31
8 8

But note there is no requirement that the keys (1, 5, 8) occur in order in the output.

(c) Parallel computation of average

The output should be a single line with the average value (9.4) appearing at some prescribed position:

XXX 9.4

Note the production input data might be big enough that you should do this computation in double precision floating point to avoid overflow.

(d) Parallel computation of mode

The output should be a single line with the mode of the value fields (i.e. the most frequently occurring value fieldfrom the data) appearing at some prescribed position:

XXX 8

Ties may be broken arbitrarily.
Note: consider using more than one MapReduce pass.

(e) Parallel computation of a JOIN with aggregation

For each index value i occurring in the input data, the output should contain a line

i 0 S1*S2

where

S1 is the sum of the value fields of all inputs for which the index field is i and the tag field is 1;

and

S2 is the sum of the value fields of all inputs for which the index field is i and the tag field is 2;

So there should be exactly one line of output for each index value occurring in the input data.

Note: the constant 0 in the output file format is there to make the type of Reduce consistent, somewhat like the use of NULL values in the discussion of JOIN in lecture.

Motivation: Suppose the tag=1 input lines give tax rates, for example city, county and state sales taxes applicable to a purchase order identified by the index field; and suppose the tag=2 input lines are values of purchases on the purchase order. Then this might be a sales tax computation for this month's purchase orders.

For our example data, only one index field occurs in both tag=1 and tag=2 input lines (5); for the other index fields the final result is 0:

1 0 0
5 0 150
8 0 0

What to Turn In:

Please construct one file proj4a.zip containing the following:

README.txt
in which you tell us the formats of your MapReduce output files and anything else you think we need to know,
output_a.txt
your MapReduce output file for running part (a) of the assignment on the production data set,
master_log_a.txt
a copy of the Hadoop log file from the master machine in your Hadoop cluster (this file is named
/mnt/hadoop/logs/hadoop-root-jobtracker-XXXXXXX.log
on the master of your Hadoop cluster, where XXXXXXXX is the temporary DNS name of the master) containing the results of running part (a) of the assignment on the production data set. Please either make sure the log is empty when you begin the MapReduce run, or cut off the uninteresting prefix of the log so we see only the relevant parts of the log (the parts generated while running your job),
src_a.zip
a .zip file containing your Java source code for part (a),
output_b.txt ... src_e.zip
the analogous files for parts (b) through (e).

Upload this file proj4a.zip to CMS by the project due date.

That's all. Good luck!

A Final Reminder

Remember, please

DO NOT LEAVE YOUR CLUSTER RUNNING OVERNIGHT!

To shut down your cluster from the $HADOOP_ROOT directory on your local machine type

src/contrib/ec2/bin/hadoop-ec2 terminate-cluster

reply to the prompt, and your Hadoop instances (and nobody else's) will be terminated, and will stop consuming funds from our AWS account.