Main

Assignment 6

In this OPTIONAL assignment you will practice some more of the C++ concepts we have learned so far.

In this assignment, you will work with C++ containers (maps, multimaps, and vectors), iterators, and some algorithms. It also serves as an introduction to the popular Map Reduce pattern for parallel computations (but you will not have to write any parallel code yourself!).

The initial code for this assignment: assignment6.zip or assignment6.tar.

Introduction

MapReduce is a parallel programming pattern for processing very large amounts of data. The main idea behind MapReduce is mapping your dataset into a collection of <key,value> pairs, then reducing over all pairs with the same key. You can find a lot of material on MapReduce online, a couple of good introductions are available here and here. (For a much more comprehensive introduction, which you shouldn't have to read, but can if you want: http://lintool.github.io/MapReduceAlgorithms/ed1n/MapReduce-algorithms.pdf)

The following abstract interface defines the map and reduce methods that any class derived from MapReduce can implement (overload) to implement a new MapReduce application. You will implement new classes deriving from this abstract class in each of the problems below.

template<typename Key, typename Val>
class MapReduce {
public:
	/*
	 *  Map: extract something you care about form the input
	 *  Map(input) -> out_values
	 */
	virtual
	void MRmap(const std::map<std::string,std::string> &input,
			  std::multimap<Key,Val> &out_values);

	/*
	 * Reduce: aggregate, summarize, filter, or transform
	 * Reduce(intermediate) -> out_values
	 */
	virtual
	void MRreduce(const std::multimap<Key,Val> &intermediate_values,
				  std::map<Key,Val> &out_values);
};

The implementation size (in terms of lines of code) for this assignment is relatively small, this description is significantly longer than the amount of C++ code you are expected to write.

Problem 1

You are already given a sample MapReduce application that counts the number of occurrences of each word in a collection of text files. To try it out, run make, then run it with

 ./test-wordCount wordCount-input.txt

The input file wordCount-input.txt contains a list of filenames (in the input/ subdirectory) that are used for testing. Feel free to add more files or test with different ones, just remember to list them in wordCount-input.txt.

The output from the above is the list of words and the corresponding number of occurrences, here is an excerpt:

...
wreck.: 1
write,: 1
writer,: 10
writer.: 10
writer...: 1
writing: 1
...

Note that the word "writer" occurs multiple times because the WordCount's implementation of MRmap is not aware of punctuation. As your warm-up task, fix the mr::WordCount::MRmap method to remove punctuation when it creates the multimap of intermediate results. Be careful with contractions (e.g., "I'm", "you're") -- you can consider a contraction to be a single word and not remove apostrophes. Your updated WordCount example should result in this type of output (corresponding the the above original output) when you run ./test-wordCount wordCount-input.txt.

...
wreck: 1
write: 1
writer: 21
writing: 1
...

Required files: wordCount.hpp, wordCount.cpp

Problem 2

In this problem you will create a new class, SentenceStats derived from mr::MapReduce that implements the MRmap and MRreduce methods to compute the following quantities:

Maximum sentence length (number of words).
Minimum sentence length (number of words).
Average sentence length (number of words).

Required files: sentenceStats.hpp, sentenceStats.cpp

Problem 3

Create a test program for your SentenceStats class, use test-WordCount.cpp as an example. The following is a sample output you can produce assuming input.txt is a list of text files to process (as in Problem 1).

 ./test-sentenceStats input.txt
 Maximum sentence length: 12 words
 Minimum sentence length: 3 words
 Average sentence length: 5 words

(Note, these are sample values, not corresponding to a particular input; you can reuse the input file from Problem 1 or create a new one for testing this problem.)

Required files: test-sentenceStats.cpp, Makefile (add a test-sentenceStats target)

Problem 4

In this problem you will create a new class, Phrases derived from mr::MapReduce that implements the MRmap and MRreduce methods, which computes the frequency of two-word phrases in a set of text files.

Required files: phrases.hpp, phrases.cpp

Problem 5

Create a test program for your Phrases class, use test-WordCount.cpp as an example. You should take as input a file containing a list of text files to process (as in Problem 1). The output is similar to that of Problem 1, but instead of single words, you should output all 2-word phrases and the number of times they occur in the provided input.

Required files: test-phrases.cpp, Makefile (add test-phrases target)

Problem 6. Extra credit (optional)

The same as problems 4 and 5, but create a class AllPhrases derived from mr::MapReduce that computes frequency of phrases of any length, not just two-word phrases (the maximum phrase size is the whole sentence). Sort the output from most to least frequent. Create a test program test-allPhrases.cpp to test your class.

Required files: allPhrases.hpp, allPhrases.cpp, test-allPhrases.cpp, Makefile (add test-allPhrases target)

CIS 330 Wiki Pages

edit SideBar

cis330