CS530 S08 |
TR 11:40-12:55 |
Olin 245 |
|
Architecture of Large-Scale Information Systems |
| |
Project 4: Hadoop MapReduce on Amazon Web ServicesIn this project you will do scalable implementations of a few simple computations using the MapReduce programming model implemented by Hadoop running on Amazon EC2. Part 0: Usage Conventions for AWSFor Project 3 we defined some AWS "usage conventions" that enable multiple student projects to use the same AWS account without conflicts. The conventions are defined here. In this project we assume you have installed the AWS tools and are following the conventions described in that document. Part 1: Getting Familiar with HadoopWe have produced a separate document that discusses how to install Hadoop on your local machine and how to run Hadoop jobs on an EC2 cluster; it is available here. That document tells you how to install and run Hadoop, but not how to write MapReduce programs. For that you need to go elsewhere. Lots of documentation is available at the Hadoop Core web page. A particularly good reference is this Hadoop MapReduce Tutorial. Part 2: Input DataWe have provided several input data files for you. The data is in text files, all with the same format. Each line contains three positive integers index tag valueThe tag field is either 1 or 2. The index and value fields are between 1 and an upper bound of a few K -- your code does not need to know this upper bound. These files can all be interpreted using TextInputFormat as discussed in the Hadoop MapReduce Tutorial. The following files are available:
Part 3: The AssignmentsWe are asking you to perform five different MapReduce computations on the production data. To help describe these computations, consider the following example data: 1 2 8Note one index field occurs three times (5), and one value field occurs twice (8). (a) A parallel max computationCompute the maximum over all tuples of the product (index * value). The MapReduce result will be a set of key-data pairs. We don't care what the key is (that's part of how you design the computation), but the maximum value is 110, achieved by the second row of the input data, so your MapReduce job should produce a single line of output with the answer 110 appearing at some prescribed position:XXX 110 (b) Parallel computation of sum grouped by indexThe output should be a set of key-data pairs; the key is an index, and the data is the sum of all value fields associated with that index. In our sample data only one index is repeated (5). A possible output might be:1 8But note there is no requirement that the keys (1, 5, 8) occur in order in the output. (c) Parallel computation of averageThe output should be a single line with the average value (9.4) appearing at some prescribed position:XXX 9.4Note the production input data might be big enough that you should do this computation in double precision floating point to avoid overflow. (d) Parallel computation of modeThe output should be a single line with the mode of the value fields (i.e. the most frequently occurring value fieldfrom the data) appearing at some prescribed position:XXX 8Ties may be broken arbitrarily. Note: consider using more than one MapReduce pass. (e) Parallel computation of a JOIN with aggregationFor each index value i occurring in the input data, the output should contain a linei 0 S1*S2where S1 is the sum of the value fields of all inputs for which the index field is i and the tag field is 1;and S2 is the sum of the value fields of all inputs for which the index field is i and the tag field is 2;So there should be exactly one line of output for each index value occurring in the input data. Note: the constant 0 in the output file format is there to make the type of Reduce consistent, somewhat like the use of NULL values in the discussion of JOIN in lecture. Motivation: Suppose the tag=1 input lines give tax rates, for example city, county and state sales taxes applicable to a purchase order identified by the index field; and suppose the tag=2 input lines are values of purchases on the purchase order. Then this might be a sales tax computation for this month's purchase orders. For our example data, only one index field occurs in both tag=1 and tag=2 input lines (5); for the other index fields the final result is 0: 1 0 0 What to Turn In:
Please construct one file
proj4a.zip to CMS
by the project due date.
That's all. Good luck! A Final ReminderRemember, please DO NOT LEAVE YOUR CLUSTER RUNNING OVERNIGHT!To shut down your cluster from the $HADOOP_ROOT directory on your local machine
type
reply to the prompt, and your Hadoop instances (and nobody else's) will be terminated,
and will stop consuming funds from our AWS account.
|