Assignment 6
In this OPTIONAL assignment you will practice some more of the C++ concepts we have learned so far.
In this assignment, you will work with C++ containers (maps, multimaps, and vectors), iterators, and some algorithms. It also serves as an introduction to the popular Map Reduce pattern for parallel computations (but you will not have to write any parallel code yourself!).
The initial code for this assignment: assignment6.zip or assignment6.tar.
Introduction
MapReduce is a parallel programming pattern for processing very large amounts of data. The main idea behind MapReduce is mapping your dataset into a collection of <key,value> pairs, then reducing over all pairs with the same key. You can find a lot of material on MapReduce online, a couple of good introductions are available here and here. (For a much more comprehensive introduction, which you shouldn't have to read, but can if you want: http://lintool.github.io/MapReduceAlgorithms/ed1n/MapReduce-algorithms.pdf)
The following abstract interface defines the map and reduce methods that any class derived from MapReduce
can implement (overload) to implement a new MapReduce application. You will implement new classes deriving from this abstract class in each of the problems below.
template<typename Key, typename Val> class MapReduce { public: /* * Map: extract something you care about form the input * Map(input) -> out_values */ virtual void MRmap(const std::map<std::string,std::string> &input, std::multimap<Key,Val> &out_values); /* * Reduce: aggregate, summarize, filter, or transform * Reduce(intermediate) -> out_values */ virtual void MRreduce(const std::multimap<Key,Val> &intermediate_values, std::map<Key,Val> &out_values); };
The implementation size (in terms of lines of code) for this assignment is relatively small, this description is significantly longer than the amount of C++ code you are expected to write.
Problem 1
You are already given a sample MapReduce application that counts the number of occurrences of each word in a collection of text files. To try it out, run make
, then run it with
./test-wordCount wordCount-input.txt
The input file wordCount-input.txt
contains a list of filenames (in the input/ subdirectory) that are used for testing. Feel free to add more files or test with different ones, just remember to list them in wordCount-input.txt.
The output from the above is the list of words and the corresponding number of occurrences, here is an excerpt:
... wreck.: 1 write,: 1 writer,: 10 writer.: 10 writer...: 1 writing: 1 ...
Note that the word "writer" occurs multiple times because the WordCount
's implementation of MRmap
is not aware of punctuation. As your warm-up task, fix the mr::WordCount::MRmap
method to remove punctuation when it creates the multimap of intermediate results. Be careful with contractions (e.g., "I'm", "you're") -- you can consider a contraction to be a single word and not remove apostrophes. Your updated WordCount example should result in this type of output (corresponding the the above original output) when you run ./test-wordCount wordCount-input.txt
.
... wreck: 1 write: 1 writer: 21 writing: 1 ...
Required files: wordCount.hpp, wordCount.cpp
Problem 2
In this problem you will create a new class, SentenceStats
derived from mr::MapReduce
that implements the MRmap
and MRreduce
methods to compute the following quantities:
- Maximum sentence length (number of words).
- Minimum sentence length (number of words).
- Average sentence length (number of words).
Required files: sentenceStats.hpp, sentenceStats.cpp
Problem 3
Create a test program for your SentenceStats
class, use test-WordCount.cpp as an example. The following is a sample output you can produce assuming input.txt is a list of text files to process (as in Problem 1).
./test-sentenceStats input.txt Maximum sentence length: 12 words Minimum sentence length: 3 words Average sentence length: 5 words
(Note, these are sample values, not corresponding to a particular input; you can reuse the input file from Problem 1 or create a new one for testing this problem.)
Required files: test-sentenceStats.cpp, Makefile (add a test-sentenceStats target)
Problem 4
In this problem you will create a new class, Phrases
derived from mr::MapReduce
that implements the MRmap
and MRreduce
methods, which computes the frequency of two-word phrases in a set of text files.
Required files: phrases.hpp, phrases.cpp
Problem 5
Create a test program for your Phrases
class, use test-WordCount.cpp as an example. You should take as input a file containing a list of text files to process (as in Problem 1). The output is similar to that of Problem 1, but instead of single words, you should output all 2-word phrases and the number of times they occur in the provided input.
Required files: test-phrases.cpp, Makefile (add test-phrases target)
Problem 6. Extra credit (optional)
The same as problems 4 and 5, but create a class AllPhrases
derived from mr::MapReduce
that computes frequency of phrases of any length, not just two-word phrases (the maximum phrase size is the whole sentence). Sort the output from most to least frequent. Create a test program test-allPhrases.cpp
to test your class.