CS530 S08 |
TR 11:40-12:55 |
Olin 245 |
|
Architecture of Large-Scale Information Systems |
| |
Project 4b: A Nontrivial MapReduce PipelineDue: Wed May 14In this project you will exploit what you have learned in Project 4a to build a more substantial MapReduce computation, possibly involving the design of a nontrivial graph algorithm. Introduction to the ProblemImagine we have a collection of satellite images coverng a large wooded region, thousands of square miles in area (e.g. the forests of Canada). The data has been analyzed to produce a grid of Boolean values, with a resolution of one or two meters. At each grid point, a value of 1 indicates the presence of a "tree" (some foliage observed by the satellite), while a value of 0 indicates a clear spot without a tree. We are interested in forest fire propagation. We use a very simple model for the spread of a forest fire: flames can spread from a burning tree to an immediately adjacent tree in any direction (recilinearly or diagonally), but flames cannot jump across a clear spot. Formally, flames jump directly from point 〈x, y〉 to point 〈x', y'〉 if x-1 <= x' <= x+1 and y-1 <= y' <= y+1A fire spreads from one tree to the next until it encounters a "firebreak," a clear area the flames cannot jump across that separates the burning area from the unburned area. Clearly we can model this process with an undirected graph: the nodes of the graph correspond to the trees, each labeled with its location 〈x, y〉 and there is an (undirected) edge betwen trees t and t' iff t and t' are adjacent in the sense described above. Consider the questions If a careless hiker drops a match at location <x, y>, how many trees will burn?Both questions can be answered easily if you know the sizes of the connected components of this graph.andIf the locations of lightning strikes are distributed uniformly at random, and a tree always catches fire when struck by lightning, what is the expected number of trees that will burn as a result of a lightning strike? (Below we call this quantity the “average burn.”) This project demonstrates the use of MapReduce to construct index structures that make it possible to answer such questions very efficiently. A Straightforward ApproachGiven the grid of Boolean values described above, you could take the following approach:
You can use this approach successfully for this project. However, naive use of the transitive closure does not scale for reasons we discuss next.
Critical densityThe forest-file problem presented here is borrowed from an elementary discussion of Percolation Theory that can be found in this pricey book. You don't need to know any percolation theory do this project — I certainly don't know very much — but theoretical study of the problem reveals some interesting properties.
In a real forest the trees are not uniformly distributed, and the above property can fail. In fact, an unboundedly large “managed” forest can be designed with density arbitrarily close to 1 (i.e. a tree at every grid point) while keeping the expected size of a connected component bounded. But real forests tend to be “somewhat” random, and critical density behavior is observable in the wild. In our test data, the trees were generated uniformly at random, so critical density can be a real issue. Scale of the ComputationImagine we had real satellite image data at one or two meter resolution for a wilderness area of 1000 square miles. You can work it out -- that would be on the order of a billion (N = 10**9) points. If we convert this data into a graph and construct the adjacency matrix, the result will be an N by N matrix: 10**18 entries! Note this grows with the square of the number of grid points. Fortunately, the adjacency matrix is sparse, with at most 8 entries per grid point, so its representation requires only O(N) space and probably we would have plenty of disk space to store it. But consider the size of the transitive closure of the adjacency matrix: this matrix connects each tree to each other tree in its connected component; thus it has something like (N*Cbar) nonzero entries, where Cbar is the average (over all trees) of the size of a connected component. If the denstiy of the forest is well below the critical density d*, then Cbar will be a small constant, independent of N, and we can still assume the transitive closure matrix can be represented in O(N) space. What if the density increases, approaching d*? As discussed above, Cbar will grow to be proportional to N, and the size of the transitive closure will grow to Omega(N**2) — something like 10**18 entries — which is enough to stress the resources of Amazon or even the mighty Google! Of course, our test data and parameters have been set to avoid intractable growth of the transitive closure matrix. But you should appreciate that the naive approach of brute-force transitive closure computation would not scale to realistic instance sizes for this problem, even on a very large MapReduce cluster. Algorithm Design Hints(1) The problem parameters have been chosen so that the transitive closure of the adjacency matrix will not be unreasonably large. However, you may find that computing it requires an unacceptably large number of MapReduce iterations unless you use something like the "repeated squaring" technique discussed in lecture. (2) Because the graph is undirected, the adjacency matrix is symmetric. You can exploit this property to reduce the size of the intermediate files and the network traffic somewhat. (3) It is actually possible to find the connected components of an undirected graph without computing a full transitive closure. Recall we solved the "careless hiker" problem by computing the vector A* × uwhere A was the (symmetric) adjacenty matrix, A* was its transitive closure, and u was a column vector containing all 1's. Observe there is another matrix C defined by C[i,j] = 1 iff j is the least member of the component containing i,You can think of this as "naming" each connected component by the smallest tree index it contains; then C maps each tree i to the name j of its connected component. Clearly, C is sparse – it has exactly one nonzero entry per tree. Now let C† be the transpose of C, defined by C†[i,j] = C[j,i].You should convince yourself that A* = C C†so that ( A* u ) = ( C C† ) u = C ( C† u )This can be implemented efficiently by two multiplications of a sparse matrix C by a vector. The solution to the “lightning strike” question can likewise be computed efficiently using C rather than A*. So, you might want to think about an efficient MapReduce strategy for computing C rather than A*, with a reasonable upper bound on the number of MapReduce passes required for convergence (say, log(N), though in practice the algorithm will converge rather more quickly than this), and with a reasonable upper bound on the amount of data passed between the Map and Reduce phases (say, O(N)). We know of at least one strategy that works, and there are undoubtedly more. But don't worry too much if you can't find one – the construction is definitely not trivial. Data FilesWe have provided three sets of data files: a small one for testing, a medium one for use with a transitive closure based algorithm, and a large one to use if you believe you have a better algorithm. Each file consists of a sequence of text lines; each line contains three fields x, y, wwhere x, y are the integer coordinates of a grid point, using 0-origin, and w is a floating point number between 0 and 1. The use of w in the file format is a hack to allow you to experiment with forests of varying densities. Values for w in the data files have been chosen uniformly at random. To construct a forest of a given uniform density d between 0 and 1, just read the file and ignore any tuple whose w value is greater than d. The names and locations of the files in AWS are as follows: The test data can be found in the bucket edu-cornell-cs-cs530-proj4b-test while the medium-sized and the large-sized test data can be found in the buckets edu-cornell-cs-cs530-proj4b-production-medium and edu-cornell-cs-cs530-proj4b-large. Controlling the Pipeline
Your solution will consist of multiple MapReduce passes,
sometimes iterated until convergence.
Controlling the pipeline of MapReduce passes is not trivial.
Ideally, your pipeline will control itself,
using some combination of Java code and scripts running
on your client machine and/or on the Hadoop master.
Initially you can do it manually;
if that's as far as you get,
just document the manual procedure in your The Exact AssignmentPart b1The first assignment is to compute the average burn number (the answer to the “lightning strike” question) for one of our data files at a density of 1/3, using a 5 instance Hadoop cluster. If you are using a transitive closure based algorithm, use the medium-sized file. For extra credit, if you have a more efficient algorithm, use the large file. In either case, your pipeline should not take longer than an hour when run on 5 instances. If you can't manage to make it run that quickly, you can do run it at a density of 1/4 for a small grade penalty. The result should be a set of files:
Combine these into a single file Part b2 (optional)If think you have developed an efficient MapReduce pipeline for computing connected components, try estimating the critical density d* using our medium sized data set. The burn size may not be the best thing to measure for this purpose. Consider using the size of the largest connected component. See if you can find a pair of densities d0 and d1 that differ by a small amount (say .03 or .05) while the size of the largest connected component changes by an order of magnitude. As in part b1, do not consume more than an hour running on 5 instances. The result should be a set of files similar to those for part b1:
Combine these into a single file Yet Another Final ReminderRemember, please DO NOT LEAVE YOUR CLUSTER RUNNING OVERNIGHT!even if it is not doing anything! To shut down your cluster from the $HADOOP_ROOT directory on your local machine
type
reply to the prompt, and your Hadoop instances will be terminated,
and will stop consuming funds from our AWS account.
|