CS530 S08

TR 11:40-12:55

Olin 245

Architecture of Large-Scale Information Systems

Project 4b: A Nontrivial MapReduce Pipeline

Due: Wed May 14

In this project you will exploit what you have learned in Project 4a to build a more substantial MapReduce computation, possibly involving the design of a nontrivial graph algorithm.

Introduction to the Problem

Imagine we have a collection of satellite images covering a large wooded region, thousands of square miles in area (e.g. the forests of Canada). The data has been analyzed to produce a grid of Boolean values, with a resolution of one or two meters. At each grid point, a value of 1 indicates the presence of a "tree" (some foliage observed by the satellite), while a value of 0 indicates a clear spot without a tree.

We are interested in forest fire propagation. We use a very simple model for the spread of a forest fire: flames can spread from a burning tree to an immediately adjacent tree in any direction (rectilinearly or diagonally), but flames cannot jump across a clear spot. Formally, flames jump directly from a tree at point ⟨x, y⟩ to a tree at point ⟨x', y'⟩ if

x-1 <= x' <= x+1 and y-1 <= y' <= y+1

A fire spreads from one tree to the next until it encounters a "firebreak," a clear area the flames cannot jump across that separates the burning area from the unburned area.

Clearly we can model this process with an undirected graph: the nodes of the graph correspond to the trees, each labeled with its location ⟨x, y⟩, and there is an (undirected) edge between trees t and t' iff t and t' are adjacent in the sense described above. Consider the questions

If a careless hiker drops a match at location ⟨x, y⟩, how many trees will burn?
and
If the locations of lightning strikes are distributed uniformly at random, and a tree always catches fire when struck by lightning, what is the expected number of trees that will burn as a result of a lightning strike? (Below we call this quantity the “average burn.”)

Both questions can be answered easily if you know the sizes of the connected components of this graph.
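
To make the connection concrete, here is a minimal single-machine sketch (plain Python, not MapReduce) that computes component sizes by flood fill on a tiny grid and then answers both questions. The function names and the sample grid are assumptions made for illustration; the grid is indexed as grid[y][x].

```python
from collections import deque

def component_sizes(grid):
    """Map each tree location (x, y) to the size of its connected component."""
    height, width = len(grid), len(grid[0])
    size_of = {}
    for y in range(height):
        for x in range(width):
            if grid[y][x] != 1 or (x, y) in size_of:
                continue
            # Breadth-first flood fill over the 8-neighborhood of each tree.
            component, queue, seen = [(x, y)], deque([(x, y)]), {(x, y)}
            while queue:
                cx, cy = queue.popleft()
                for nx in range(cx - 1, cx + 2):
                    for ny in range(cy - 1, cy + 2):
                        if (0 <= nx < width and 0 <= ny < height
                                and grid[ny][nx] == 1 and (nx, ny) not in seen):
                            seen.add((nx, ny))
                            component.append((nx, ny))
                            queue.append((nx, ny))
            for tree in component:
                size_of[tree] = len(component)
    return size_of

def average_burn(grid):
    """Expected burn for a strike at a uniformly random grid point."""
    sizes = component_sizes(grid)
    points = len(grid) * len(grid[0])
    # Points without a tree contribute 0, so only trees appear in the sum.
    return sum(sizes.values()) / points

# Hypothetical 4-by-3 grid with three components of sizes 2, 2, and 1.
grid = [
    [1, 1, 0, 0],
    [0, 0, 0, 1],
    [1, 0, 0, 1],
]
sizes = component_sizes(grid)
hiker_burn = sizes[(0, 0)]   # match dropped on the tree at <0, 0>
```

The "careless hiker" answer is a single lookup in sizes, and the "average burn" is one pass over the same table. This approach only works when the whole grid fits in memory, which is exactly what the rest of the project is meant to avoid.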

This project demonstrates the use of MapReduce to construct index structures that make it possible to answer such questions very efficiently.

A Straightforward Approach

Given the grid of Boolean values described above, you could take the following approach:

Use MapReduce to construct the (sparse, symmetric) adjacency matrix A of the graph described above. This matrix has an edge from each tree t to each tree t' such that flames can jump directly from t to t'. Trees are represented by their locations; thus, an edge would be represented by a 4-tuple
⟨ ⟨ x, y ⟩, ⟨ x', y' ⟩ ⟩
A simple trick to ensure even isolated trees are represented is to include a self-loop
⟨ ⟨ x, y ⟩, ⟨ x, y ⟩ ⟩
for every tree.
Use MapReduce to compute the transitive closure A* of the adjacency matrix. This requires multiple MapReduce passes – as we have discussed in lecture, you need to iterate until the result converges. A* represents the connected components of the graph: it has an edge from tree t to tree t' if t and t' are connected, so that t' will eventually burn if t does.
Another MapReduce pass can compute the out-degree of each node in the relation A*, by counting the number of nonzero entries in each row of the matrix. The resulting vector of counts gives the sizes of the connected components of the original graph. This is equivalent to computing the product
A* × u
where u is a column vector containing all 1's, and we use ordinary (rather than Boolean) matrix multiplication. From this vector it would be straightforward in principle to build an index that could answer the "careless hiker" type of question in unit time. We won't ask you to actually build this index.
Finally, you can use MapReduce to compute the “average burn” answer to the “lightning strike” question by averaging over all the grid points:
(a) a point with a tree in it contributes the size of the connected component containing the tree, and
(b) a point with no tree in it contributes 0.
This average is a single real number which is the expected number of trees connected to a randomly chosen grid point.
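
The first step of this approach can be sketched as a single map/reduce pass, simulated here in memory. This is one possible formulation, not necessarily the one intended in lecture: each tree announces itself to its own cell and to the 8 surrounding cells, and the reducer for a cell emits edges only if the cell itself holds a tree. The names map_tree, reduce_cell, and adjacency_edges are hypothetical.

```python
from collections import defaultdict

def map_tree(tree):
    """Announce a tree to its own cell and to the 8 surrounding cells."""
    x, y = tree
    for nx in range(x - 1, x + 2):
        for ny in range(y - 1, y + 2):
            yield (nx, ny), tree

def reduce_cell(cell, announced):
    """Emit edges out of `cell`, but only if `cell` itself holds a tree."""
    if cell in announced:          # the self-announcement proves a tree is here
        for other in announced:
            yield (cell, other)    # includes the self-loop (cell, cell)

def adjacency_edges(trees):
    """Run the map, shuffle, and reduce phases in memory."""
    shuffled = defaultdict(set)
    for tree in trees:
        for key, value in map_tree(tree):
            shuffled[key].add(value)
    edges = set()
    for cell, announced in shuffled.items():
        edges.update(reduce_cell(cell, announced))
    return edges

# Two diagonal neighbors and one isolated tree (made-up coordinates).
edges = adjacency_edges([(0, 0), (1, 1), (5, 5)])
```

Each tree produces 9 intermediate records, so the data passed between the map and reduce phases stays linear in the number of trees. The isolated tree at ⟨5, 5⟩ survives only because of its self-loop.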

You can use this approach successfully for this project. However, naive use of the transitive closure does not scale for reasons we discuss next.

Critical density

The forest-fire problem presented here is borrowed from an elementary discussion of Percolation Theory that can be found in this pricey book. You don't need to know any percolation theory to do this project (I certainly don't know very much), but theoretical study of the problem reveals some interesting properties.

If the trees are uniformly distributed at relatively low density, and the forest is not extremely small, then the expected size of a connected component is independent of the size of the forest, and is the same even for an infinite forest -- it depends only on the density of the trees in the forest.
There is a critical density d* above which we begin to see "wild fires." If trees are uniformly distributed at a density well below d*, then the expected size of a connected component (hence the number of trees that will burn if I drop a match) is a small constant independent of the size of the forest, as described above, so a forest fire eventually consumes an entire connected component and “burns itself out.” As the density increases, the expected size of a component increases. Eventually it begins to increase very rapidly, exhibiting a singularity at d*. For densities at or above d*, the component size is limited only by the size of the forest: if I start a fire in an infinite forest with density at least d*, infinitely many trees will burn.

In a real forest the trees are not uniformly distributed, and the above property can fail. In fact, an unboundedly large “managed” forest can be designed with density arbitrarily close to 1 (i.e. a tree at every grid point) while keeping the expected size of a connected component bounded. But real forests tend to be “somewhat” random, and critical density behavior is observable in the wild.

In our test data, the trees were generated uniformly at random, so critical density can be a real issue.

Scale of the Computation

Imagine we had real satellite image data at one or two meter resolution for a wilderness area of 1000 square miles. You can work it out -- that would be on the order of a billion (N = 10**9) points. If we convert this data into a graph and construct the adjacency matrix, the result will be an N by N matrix: 10**18 entries! Note this grows with the square of the number of grid points. Fortunately, the adjacency matrix is sparse, with at most 8 entries per grid point (9 if we count the self-loop), so its representation requires only O(N) space and probably we would have plenty of disk space to store it.

But consider the size of the transitive closure of the adjacency matrix: this matrix connects each tree to each other tree in its connected component; thus it has something like (N*Cbar) nonzero entries, where Cbar is the average (over all trees) of the size of a connected component. If the density of the forest is well below the critical density d*, then Cbar will be a small constant, independent of N, and we can still assume the transitive closure matrix can be represented in O(N) space. What if the density increases, approaching d*? As discussed above, Cbar will grow to be proportional to N, and the size of the transitive closure will grow to Omega(N**2) (something like 10**18 entries), which is enough to stress the resources of Amazon or even the mighty Google!

Of course, our test data and parameters have been set to avoid intractable growth of the transitive closure matrix. But you should appreciate that the naive approach of brute-force transitive closure computation would not scale to realistic instance sizes for this problem, even on a very large MapReduce cluster.
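
The estimates in this section can be checked with a few lines of arithmetic. The sketch below assumes a statute mile of 1609.344 meters; the value Cbar = 10 in the sparse case is an arbitrary placeholder, not a measured quantity.

```python
METERS_PER_MILE = 1609.344

area_m2 = 1000 * METERS_PER_MILE ** 2    # 1000 square miles in square meters
points_1m = area_m2 / 1.0 ** 2           # grid points at 1 meter resolution
points_2m = area_m2 / 2.0 ** 2           # grid points at 2 meter resolution

dense_entries = points_1m ** 2           # full N by N adjacency matrix
sparse_entries_bound = 8 * points_1m     # at most 8 neighbors per grid point

def closure_entries(n, cbar):
    """Rough nonzero count of the transitive closure: about N * Cbar."""
    return n * cbar

below_critical = closure_entries(points_1m, 10)         # Cbar a small constant
near_critical = closure_entries(points_1m, points_1m)   # Cbar proportional to N
```

At 1 meter resolution this gives roughly 2.6 * 10**9 points, and at 2 meters roughly 6.5 * 10**8, so "on the order of a billion" holds either way. The closure grows from about 10**10 entries to more than 10**18 as Cbar goes from a constant to something proportional to N.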

Algorithm Design Hints

(1) The problem parameters have been chosen so that the transitive closure of the adjacency matrix will not be unreasonably large. However, you may find that computing it requires an unacceptably large number of MapReduce iterations unless you use something like the "repeated squaring" technique discussed in lecture.
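
As a reminder of why squaring helps, here is a minimal in-memory sketch of the standard repeated-squaring construction (the lecture version may differ in detail). The relation is a set of (i, j) pairs and must already contain every self-loop, so that each squaring keeps the existing pairs and doubles the path length covered. The function names are hypothetical.

```python
from collections import defaultdict

def boolean_square(edges):
    """Return R composed with R for a relation R given as a set of (i, j) pairs."""
    successors = defaultdict(set)
    for i, j in edges:
        successors[i].add(j)
    # In MapReduce terms this is a join of R with itself on the middle node j.
    return {(i, k) for i, j in edges for k in successors[j]}

def transitive_closure(edges):
    """Square repeatedly until the relation stops growing."""
    passes = 0
    while True:
        squared = boolean_square(edges)
        passes += 1
        if squared == edges:
            return edges, passes
        edges = squared

# A path of 9 trees in a row: 0 - 1 - 2 - ... - 8, with self-loops.
path = {(i, i) for i in range(9)}
path |= {(i, i + 1) for i in range(8)} | {(i + 1, i) for i in range(8)}
closure, passes = transitive_closure(path)
```

The two farthest trees are 8 steps apart, yet the closure is complete after 3 squarings, and a 4th pass confirms convergence. Extending the relation by one step per pass would need 8 passes on the same input.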

(2) Because the graph is undirected, the adjacency matrix is symmetric. You can exploit this property to reduce the size of the intermediate files and the network traffic somewhat.

(3) It is actually possible to find the connected components of an undirected graph without computing a full transitive closure. Recall we solved the "careless hiker" problem by computing the vector

A* × u

where A was the (symmetric) adjacency matrix, A* was its transitive closure, and u was a column vector containing all 1's. Observe there is another matrix C defined by

C[i,j] = 1 iff j is the least member of the component containing i,
C[i,j] = 0 otherwise.

You can think of this as "naming" each connected component by the smallest tree index it contains; then C maps each tree i to the name j of its connected component. Clearly, C is sparse – it has exactly one nonzero entry per tree. Now let C† be the transpose of C, defined by

C†[i,j] = C[j,i].

You should convince yourself that

A* = C C†

so that

( A* u ) = ( C C† ) u = C ( C† u )

This can be implemented efficiently by two multiplications of a sparse matrix C by a vector. The solution to the “lightning strike” question can likewise be computed efficiently using C rather than A*.

So, you might want to think about an efficient MapReduce strategy for computing C rather than A*, with a reasonable upper bound on the number of MapReduce passes required for convergence (say, log(N), though in practice the algorithm will converge rather more quickly than this), and with a reasonable upper bound on the amount of data passed between the Map and Reduce phases (say, O(N)).

We know of at least one strategy that works, and there are undoubtedly more. But don't worry too much if you can't find one – the construction is definitely not trivial.
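
Once C is available, the identity A* u = C ( C† u ) from hint (3) is short to implement. The sketch below stores C as a dictionary from each tree index to the name of its component; this representation, the function names, and the sample data are assumptions for illustration. It says nothing about how to compute C in the first place.

```python
from collections import Counter

def component_sizes_from_names(name_of):
    """Compute A* u as C (C† u), given C as a map tree -> component name."""
    # C† u: for each component name j, count the trees i with C[i, j] = 1.
    trees_per_name = Counter(name_of.values())
    # C (C† u): each tree i looks up the count for its own component name.
    return {tree: trees_per_name[name] for tree, name in name_of.items()}

def average_burn(sizes, grid_points):
    """Average component size over all grid points; empty points count as 0."""
    return sum(sizes.values()) / grid_points

# Made-up example: six trees in components named 0, 2, and 5.
name_of = {0: 0, 1: 0, 2: 2, 3: 0, 4: 2, 5: 5}
sizes = component_sizes_from_names(name_of)
burn = average_burn(sizes, grid_points=28)
```

Both multiplications touch each tree once, so the work and the intermediate data are O(N), in contrast to the N*Cbar entries of A*.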

Data Files

We have provided three sets of data files: a small one for testing, a medium one for use with a transitive closure based algorithm, and a large one to use if you believe you have a better algorithm.

Each file consists of a sequence of text lines; each line contains three fields

x, y, w

where x, y are the integer coordinates of a grid point, using 0-origin, and w is a floating point number between 0 and 1.

The use of w in the file format is a hack to allow you to experiment with forests of varying densities. Values for w in the data files have been chosen uniformly at random. To construct a forest of a given uniform density d between 0 and 1, just read the file and ignore any tuple whose w value is greater than d.
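
A minimal sketch of this filtering step follows. The field separator is an assumption: the code accepts commas, whitespace, or both, and skips lines that do not have exactly three fields. The function name read_forest and the sample lines are made up.

```python
import re

def read_forest(lines, density):
    """Return the (x, y) locations of trees in a forest of the given density."""
    trees = []
    for line in lines:
        fields = re.split(r"[,\s]+", line.strip())
        if len(fields) != 3:
            continue                      # skip blank or malformed lines
        x, y, w = int(fields[0]), int(fields[1]), float(fields[2])
        if w <= density:                  # ignore tuples with w greater than d
            trees.append((x, y))
    return trees

# Invented sample lines, not taken from the real data files.
sample = ["0, 0, 0.12", "0, 1, 0.80", "1 0 0.33", ""]
trees = read_forest(sample, density=1 / 3)
```

In a MapReduce pipeline the same test would sit at the top of the first mapper, so that discarded points never reach the shuffle.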

The names and locations of the files in AWS are as follows: The test data can be found in the bucket edu-cornell-cs-cs530-proj4b-test, while the medium-sized and the large-sized test data can be found in the buckets edu-cornell-cs-cs530-proj4b-production-medium and edu-cornell-cs-cs530-proj4b-large, respectively.

Controlling the Pipeline

Your solution will consist of multiple MapReduce passes, sometimes iterated until convergence. Controlling the pipeline of MapReduce passes is not trivial. Ideally, your pipeline will control itself, using some combination of Java code and scripts running on your client machine and/or on the Hadoop master. Initially you can do it manually; if that's as far as you get, just document the manual procedure in your README files and describe how you would approach automating the process.
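
The control logic itself is small. The sketch below shows the shape of a driver loop in Python; in a real pipeline run_pass would launch one MapReduce job and report how many records changed (for example by reading a job counter), and both of those details are assumptions here. The toy pass at the bottom exists only to exercise the loop.

```python
def run_until_converged(run_pass, state, max_passes=50):
    """Repeat run_pass until it reports zero changes; return (state, passes)."""
    for pass_number in range(1, max_passes + 1):
        state, changed = run_pass(state)
        if changed == 0:
            return state, pass_number
    raise RuntimeError("no convergence after %d passes" % max_passes)

def halve(n):
    """Toy stand-in for a MapReduce pass: halve n and report whether it moved."""
    result = n // 2
    return result, (1 if result != n else 0)

final_state, passes = run_until_converged(halve, 20)
```

The cap on the number of passes matters on a paid cluster: a pipeline that never converges should stop with an error rather than keep running.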

The Exact Assignment

Part b1

The first assignment is to compute the average burn number (the answer to the “lightning strike” question) for one of our data files at a density of 1/3, using a 5 instance Hadoop cluster. If you are using a transitive closure based algorithm, use the medium-sized file. For extra credit, if you have a more efficient algorithm, use the large file. In either case, your pipeline should not take longer than an hour when run on 5 instances. If you can't manage to make it run that quickly, you can run it at a density of 1/4 for a small grade penalty.

The result should be a set of files:

README_b1.txt (or .pdf or .doc)
in which you describe your MapReduce pipeline in detail, including input/output formats and what each step of the pipeline does, with an estimate of how many steps you expect it to take to converge (asymptotically as a function of N), and of the amount of data you expect to be passed between MapReduce stages (again, asymptotically as a function of N). Describe any clever tricks you use to speed up convergence and/or reduce the amount of intermediate data.
Note: A description of your algorithm, so we can understand how you solved the problem, is very important – much more important in this project than in previous ones.
script_b1.txt
If you used a script to control your MapReduce pipeline, test for convergence, etc., include it here. If you controlled the pipeline manually, include an ssh log here and make sure the manual procedure is adequately documented in README_b1.txt.
master_log_b1.txt
a copy of the Hadoop log file from the master machine in your Hadoop cluster (this file is named
/mnt/hadoop/logs/hadoop-root-jobtracker-XXXXXXX.log
on the master of your Hadoop cluster, where XXXXXXX is the temporary DNS name of the master) containing the results of running part (b1) of the assignment on the (medium or large) test data set. Please either make sure the log is empty when you begin the MapReduce run, or cut off the uninteresting prefix of the log so we see only the relevant parts of the log (the parts generated while running your job).
src_b1.zip
a .zip file containing your Java source code for part (b1).
answer_b1.txt
Your estimate for the burn number.

Note we are not asking for any representation of the connected components.

Combine these into a single file proj4a.zip and upload it to CMS.

Part b2 (optional)

If you think you have developed an efficient MapReduce pipeline for computing connected components, try estimating the critical density d* using our medium-sized data set. The burn size may not be the best thing to measure for this purpose. Consider using the size of the largest connected component. See if you can find a pair of densities d0 and d1 that differ by a small amount (say .03 or .05) while the size of the largest connected component changes by an order of magnitude.
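
Given the largest component size at each density you tried, finding such a pair is a simple scan. The sketch below is one way to do it; the function name is hypothetical, and the numbers in the example are invented placeholders that exercise the code. They are not measurements and say nothing about where d* actually lies.

```python
def bracket_critical_density(largest_by_density, factor=10.0):
    """Return the first adjacent pair (d0, d1) where the size jumps by `factor`."""
    densities = sorted(largest_by_density)
    for d0, d1 in zip(densities, densities[1:]):
        if largest_by_density[d1] >= factor * largest_by_density[d0]:
            return d0, d1
    return None   # no order-of-magnitude jump among the densities tried

# Placeholder values only: density -> size of the largest component.
placeholder = {0.10: 4, 0.15: 6, 0.20: 90, 0.25: 400}
bracket = bracket_critical_density(placeholder)
```

If the scan returns None, the densities tried are either too far apart or all on the same side of d*, and the sweep needs to be refined.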

As in part b1, do not consume more than an hour running on 5 instances.

The result should be a set of files similar to those for part b1:

README_b2.txt (or .pdf or .doc)
script_b2.txt
master_log_b2.txt
answer_b2.txt
Your estimated lower and upper bounds for d*

Combine these into a single file proj4a.zip and upload it to CMS.

Yet Another Final Reminder

Remember, please

DO NOT LEAVE YOUR CLUSTER RUNNING OVERNIGHT!

even if it is not doing anything! To shut down your cluster, run the following from the $HADOOP_ROOT directory on your local machine:

src/contrib/ec2/bin/hadoop-ec2 terminate-cluster

Reply to the prompt, and your Hadoop instances will be terminated and will stop consuming funds from our AWS account.