MapReduce Architecture Implementation
The article below and the code were written by following the lab requirements of MIT 6.824's Lab 1. You can find the original MapReduce paper here. A helpful lecture video from the same MIT course is on YouTube. A helpful Computerphile video on the same topic is also on YouTube.
A crude and simple implementation using Go channels is here; the original problem set, however, goes about implementing it in an IPC fashion, using RPC to communicate between workers.
Overview of MapReduce
MapReduce is a programming model meant for processing and generating large datasets on a cluster in a parallel and distributed fashion. Several implementations of MapReduce are available, with Apache Hadoop being one of the most popular.
MapReduce consists of the following programming model:
- A `Map()` function that emits intermediate key/value pairs to be picked up by the reduce phase later.
- A `Reduce()` function that collects and aggregates the information passed from the `Map()`, grouped by the key.
- An implicit `Shuffle()` procedure that takes care of grouping the intermediate emitted values and routing them to the correct reducer.
- You can also specify a `Combine()` function that transforms the data similarly to a `Reduce()`, but is executed on the mapper before the intermediate data is sent to the actual reduce workers. Oftentimes, this will end up being the same function as `Reduce()`.
With this programming model, several tasks that deal with distributed and large scale data mapping and processing can be expressed in very simple terms.
Word Count
We can model this as reading the contents of the document and emitting a `1` each time a word is encountered.
On the reducer side, we can group the values by the word from the document (the key) and sum up the number of occurrences.
Pseudocode
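A Go-flavoured sketch of that pseudocode (the `emitIntermediate` callback here simply stands in for the framework handing each pair over to the reduce phase):

```go
// Map emits an intermediate pair (word, 1) for every word in the document.
func Map(document string, contents string, emitIntermediate func(key string, value int)) {
	for _, word := range strings.Fields(contents) {
		emitIntermediate(word, 1)
	}
}

// Reduce receives every value emitted for a single word and sums them up.
func Reduce(word string, values []int) int {
	total := 0
	for _, v := range values {
		total += v
	}
	return total
}
```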
Distributed Grep
We can model the `map()` function as emitting a line if the line contains the word we are trying to match against.
The `reduce()` function can be an identity function that just outputs whatever its input was.
This way we can get the lines where the search term occurs.
Pseudocode
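Again, a Go-flavoured sketch of the same idea (the `emit` callback and the `pattern` parameter are stand-ins for however the framework would pass these in):

```go
// Map emits every line of the input that contains the search pattern.
func Map(filename string, contents string, pattern string, emit func(key, value string)) {
	for _, line := range strings.Split(contents, "\n") {
		if strings.Contains(line, pattern) {
			emit(filename, line)
		}
	}
}

// Reduce is the identity function: it passes the matched lines straight through.
func Reduce(key string, values []string) []string {
	return values
}
```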
More examples, like count of URL access frequency, reverse web-link graph, distributed sort, etc., are given in the original paper.
My MapReduce Implementation in Go
I took some time to write this simple model in Go. MapReduce is supposed to be a simple model to program, and it should feel the same way while writing the code. I remember deleting the code I initially had as it was getting a bit complex and felt unnecessary. Sometimes, simplicity is the key.
For this crude and simple map-reduce implementation, I followed roughly the setup provided in Lab 1 of the MIT 6.824 problem set.
They provide some starter code and some files which you can use for the lab.
Using that as a reference, I wrote my own simple implementation that works the same way architecturally.
They provide some text under the `data/` directory to test your program against. In this case, there are a bunch of texts of classical stories which we can test against.
In a real scenario (and in the MIT lab's code), RPC calls are used for transferring data between workers; I simply use Go's channels.
Details
I create a type to handle and pass around the intermediate values:
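A minimal sketch of what that type can look like (the exact field names and types here are my illustration):

```go
// KeyValue is the intermediate pair a mapper emits:
// the word as the key and a count of 1 as the value.
type KeyValue struct {
	Key   string
	Value int
}
```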
I use the main function to accept files as arguments to the program, and we will run the word count against them. Each file will be processed by a separate `map` worker before passing the data on to the `reduce` worker.
Here, `main` is the coordinator, or the master as in the MapReduce paper.
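A rough sketch of how that entry point can start out (the argument handling shown here is an assumption on my part):

```go
func main() {
	// Every command-line argument is a file to run the word count over;
	// one map worker per file, all feeding a single reduce worker.
	files := os.Args[1:]
	if len(files) == 0 {
		log.Fatalln("usage: mapreduce <file> [<file> ...]")
	}

	// The channel setup, the workers, and the final output are
	// filled in piece by piece in the snippets below.
}
```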
We make two channels: `mapChan` to send intermediate values to the reducer, and `mapperDoneChan` to signal that we are done with all the mapping, meaning there will be no more data and we are just waiting for the reducers to finish.
We can spin off the map workers for all the files in parallel using the `go` keyword.
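Inside `main`, that setup can look roughly like this (a sketch, not the exact lab code):

```go
	// mapChan carries intermediate KeyValue pairs to the reducer;
	// mapperDoneChan receives one signal per finished map worker.
	mapChan := make(chan KeyValue)
	mapperDoneChan := make(chan bool)

	// One map worker per input file, all running concurrently.
	for _, file := range files {
		go mapWorker(file, mapChan, mapperDoneChan)
	}
```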
We expect to get back some data about the words and their occurrence counts. This can be modelled using a map. We can pass the result back when we are done reducing over all the data.
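One way to do that is to hand the reducer a dedicated channel to send its final map back on (`resultChan` and `reduceWorker` are names I use here for illustration):

```go
	// The reducer sends back a single map of word -> total count once
	// all the intermediate data has been consumed.
	resultChan := make(chan map[string]int)
	go reduceWorker(mapChan, resultChan)
```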
The next part is crucial. We keep `mapChan` open until all the mappers are done. We can then signal that there is no more mapping needed and that we should now wait for the reduce to finish.
We can then wait for and receive the result from our reducer on the result channel.
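Sketched out, the coordinator counts the done signals, closes the intermediate channel so the reducer knows no more data is coming, and then blocks on the result:

```go
	// Wait for every map worker to report in, then close mapChan so the
	// reducer's range loop terminates and it can send its result back.
	for i := 0; i < len(files); i++ {
		<-mapperDoneChan
	}
	close(mapChan)

	// Block until the reducer hands back the final word counts.
	counts := <-resultChan
```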
I write the output to a file for persistence and testing later.
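Something along these lines (the output file name and the exact line format here are assumptions on my part):

```go
	// Persist the final counts so they can be inspected and tested later.
	out, err := os.Create("output.txt")
	if err != nil {
		log.Fatalln(err)
	}
	defer out.Close()

	for word, count := range counts {
		// The count lands in the third whitespace-separated column,
		// which the awk check further below relies on.
		fmt.Fprintf(out, "%s => %d\n", word, count)
	}
```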
So far, this was all plumbing code. A framework like Hadoop takes care of all this file handling and message passing for us. All we need to do is supply the `map` and the `reduce` functions.
Let's define them next.
In the `mapWorker`, I open the file for reading and set the `bufio.Scanner` to scan one word at a time.
While reading in words, I pass a struct of `{key: word, value: 1}` into the channel to the reduce worker. This corresponds to the `emitIntermediate(word, 1)` call in our pseudocode.
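A sketch of what that map worker can look like (error handling kept minimal):

```go
// mapWorker reads a file word by word and emits a {word, 1} pair into mapChan
// for every word it sees. It signals on done once the whole file is scanned.
func mapWorker(filename string, mapChan chan<- KeyValue, done chan<- bool) {
	f, err := os.Open(filename)
	if err != nil {
		log.Fatalln(err)
	}
	defer f.Close()

	scanner := bufio.NewScanner(f)
	scanner.Split(bufio.ScanWords) // scan one word at a time
	for scanner.Scan() {
		// The equivalent of emitIntermediate(word, 1) from the pseudocode.
		mapChan <- KeyValue{Key: scanner.Text(), Value: 1}
	}
	done <- true
}
```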
For the reduce worker, I keep track of the counts in a map as the pairs come in from the various mappers over the channel, incrementing each word's count by 1.
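And a sketch of the reduce worker (again, `reduceWorker` and `resultChan` are illustrative names):

```go
// reduceWorker aggregates counts coming in from all the mappers. The range
// loop ends once the coordinator closes mapChan, at which point the final
// map of word counts is sent back over resultChan.
func reduceWorker(mapChan <-chan KeyValue, resultChan chan<- map[string]int) {
	counts := make(map[string]int)
	for kv := range mapChan {
		counts[kv.Key] += kv.Value
	}
	resultChan <- counts
}
```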
How it all works
Map workers are spun off for each file provided; they scan the document word by word and, for each word, emit a struct containing the word as the key and `1` as the value.
The intermediate values are collected over a channel in a Reduce worker which is keeping track of the counts from the different map workers.
In the end, we pass this final map of count of words back to the main goroutine to print the result.
Final results
Running the code, the output produced is of the following format.
To test it out, we can compare our program's output with the Unix `wc` program. We need a sum of all the counts first, however. We can do so by using `awk` to pull the third column from the above output. We can then join the numbers back with `+` using `tr` to form an addition expression. Finally, we can pass the expression to `bc` for evaluation.
There is a slight problem: we have a trailing newline, so the `tr` command will leave an extra `+` at the end, which will cause `bc` to yell with a syntax error. We can suffix the entire expression with a `0` to fix our addition.
We can compare this with the total word count output from `wc`.
Looks like our modelling is correct and our simple implementation works.