Introduction & Background
MapReduce is a programming paradigm for processing data that is spread across hundreds of computers. The paradigm is powerful and works particularly well on Big Data problems that can be broken into independent pieces.
MapReduce is more of a framework than a tool. The solution to a particular problem has to fit into the framework of map and reduce, which can be challenging in some situations and then becomes more of a constraint than a feature.
Next, the basic MapReduce model is discussed in the context of Apache Hadoop, with the help of an example.
The blog closes by presenting some advanced use cases of MapReduce and its great potential as a framework.
MapReduce in Hadoop
The underlying idea of Apache Hadoop is that it allows the programmer to store and process data in a distributed fashion across clusters of machines. It essentially focuses on increasing data storage capability by horizontal scaling (adding more machines to the existing pool of resources), as opposed to vertical scaling (adding more sophisticated resources to a single machine).
To take advantage of the parallel processing that Hadoop provides, every programming task needs to be expressed as a MapReduce job. A job runs in four phases:
1. Input split: Split the data into independent chunks, which are processed in parallel, and send each chunk to a mapper.
2. Map: Transform each record in a chunk into intermediate key-value pairs and output them.
3. Shuffle and sort: Take the key-value pairs from the map phase, sort and group them by key, then send them to the reducers.
4. Reduce: For each key, combine all the values related to that key and output the result.
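The four phases above can be imitated in a single Python process. This is only a minimal sketch to make the data flow visible: it is not the Hadoop API, it runs sequentially rather than in parallel, and the names run_mapreduce, bytes_mapper, bytes_reducer and the sample log records are all made up for illustration.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer, num_splits=2):
    """Single-process imitation of the four MapReduce phases (not Hadoop)."""
    # 1. Input split: cut the input into independent chunks.
    records = list(records)
    splits = [records[i::num_splits] for i in range(num_splits)]

    # 2. Map: every chunk is handled on its own. Here the chunks are
    #    processed one after another; a cluster would run them in parallel.
    intermediate = []
    for split in splits:
        for record in split:
            intermediate.extend(mapper(record))

    # 3. Shuffle and sort: collect the values of each key, then order the keys.
    grouped = defaultdict(list)
    for key, value in intermediate:
        grouped[key].append(value)

    # 4. Reduce: one call per key, with all values belonging to that key.
    return [reducer(key, grouped[key]) for key in sorted(grouped)]


# Hypothetical job: total bytes transferred per user.
def bytes_mapper(record):
    user, size = record
    yield user, size


def bytes_reducer(user, sizes):
    return user, sum(sizes)


logs = [("ann", 120), ("bob", 40), ("ann", 30), ("bob", 10)]
totals = run_mapreduce(logs, bytes_mapper, bytes_reducer)
print(totals)  # [('ann', 150), ('bob', 50)]
```

The only problem-specific parts are the mapper and the reducer. Everything else stays the same from job to job, which is why a task has to be phrased in terms of those two functions.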
For better understanding, consider a basic example:
WordCount is a simple application that counts the number of occurrences of each word in a given input set.
Fig. 1: WordCount using MapReduce
The figure shows all four phases:
1. Split: The input data is split by line, i.e., every line is treated as a separate chunk and sent to a mapper.
2. Map: The mapper tokenizes every line on the space character (" "), thereby obtaining the words (keys). It then maps every word to a count (value) of 1 and outputs these key-value pairs.
3. Shuffle and Sort: This intermediate phase groups identical keys together with their values. Essentially, all occurrences of the same word are collected together, each mapped to a count of 1.
4. Reduce: Each reduce call receives one key along with its list of values. To count the words, it simply sums up all the values and outputs the result.
The final result is a count of every word in the input file.
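The WordCount walk-through can be written out phase by phase in plain Python. This is a sketch under stated assumptions, not Hadoop code: the function names and the three-line sample input are invented here and are not taken from Fig. 1. One small deviation from the description above: the mapper uses str.split() with no argument, which splits on any run of whitespace, so repeated spaces do not produce empty words.

```python
from itertools import groupby
from operator import itemgetter


def split_input(text):
    # Split: every non-empty line becomes its own chunk.
    return [line for line in text.splitlines() if line.strip()]


def map_line(line):
    # Map: tokenize the line and emit a (word, 1) pair for every word.
    return [(word, 1) for word in line.split()]


def shuffle_and_sort(pairs):
    # Shuffle and sort: order the pairs by key, then collect the values
    # of identical keys into one list per word.
    ordered = sorted(pairs, key=itemgetter(0))
    return [
        (word, [count for _, count in group])
        for word, group in groupby(ordered, key=itemgetter(0))
    ]


def reduce_counts(word, counts):
    # Reduce: sum the 1s collected for a single word.
    return word, sum(counts)


def word_count(text):
    pairs = [pair for line in split_input(text) for pair in map_line(line)]
    return dict(reduce_counts(word, counts) for word, counts in shuffle_and_sort(pairs))


# Hypothetical input, one chunk per line.
sample = "deer bear river\ncar car river\ndeer car bear"
print(word_count(sample))  # {'bear': 2, 'car': 3, 'deer': 2, 'river': 2}
```

After the shuffle and sort step, the word "car" arrives at the reducer as ("car", [1, 1, 1]), so the reducer never needs to see any other word to produce its count.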
Potential of MapReduce
Presented below are some of the analytics operations that can be performed, and are being performed, by MapReduce in conjunction with Apache Hadoop:
1. Summarization: Grouping similar data together and then performing an operation such as calculating a statistic, building an index, or simply counting.
2. Filtering: Extracting a smaller piece of data for closer study, such as all records generated by a particular user. It is equivalent to applying a microscope to data.
3. Data Organization: Reorganizing data for better storage and/or processing.
4. Joins: Data is often spread across several data sets; interesting relationships can be discovered when these sets are analyzed together.
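Each of the four operations above can be phrased in terms of keys and values. The following single-process Python sketch shows one possible shape for each; the record layout, the sample data, and every function name are assumptions made for illustration and do not come from Hadoop or any library. The join shown is the common "tag each record with its source, then group by key" approach, which is only one of several ways to join data in MapReduce.

```python
from collections import defaultdict


def group_by_key(pairs):
    # Stand-in for shuffle and sort: collect values per key, keys in order.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(sorted(grouped.items()))


# Hypothetical data sets.
orders = [
    {"user": "ann", "amount": 30.0},
    {"user": "bob", "amount": 12.5},
    {"user": "ann", "amount": 10.0},
]
users = [{"user": "ann", "city": "Oslo"}, {"user": "bob", "city": "Lima"}]


def summarize(orders):
    # Summarization: group by user, then compute a statistic (the mean).
    grouped = group_by_key((o["user"], o["amount"]) for o in orders)
    return {user: sum(a) / len(a) for user, a in grouped.items()}


def filter_user(orders, user):
    # Filtering: a map-only job; records either pass through or are dropped.
    return [o for o in orders if o["user"] == user]


def organize_by_user(orders):
    # Data organization: regroup the records so each user's data sits together.
    return group_by_key((o["user"], o) for o in orders)


def join_users_orders(users, orders):
    # Join: tag every record with its source, group by the shared key,
    # then pair up the two sides inside each group.
    tagged = [(u["user"], ("U", u["city"])) for u in users]
    tagged += [(o["user"], ("O", o["amount"])) for o in orders]
    joined = []
    for user, values in group_by_key(tagged).items():
        cities = [v for tag, v in values if tag == "U"]
        amounts = [v for tag, v in values if tag == "O"]
        joined.extend((user, city, amount) for city in cities for amount in amounts)
    return joined


print(summarize(orders))                 # {'ann': 20.0, 'bob': 12.5}
print(filter_user(orders, "bob"))        # [{'user': 'bob', 'amount': 12.5}]
print(join_users_orders(users, orders))
```

Filtering needs no reduce step at all, while summarization and joins depend on the grouping done between map and reduce.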
The true potential of MapReduce lies in its parallelism and in deriving output from key-value pair analysis. For real-world Big Data problems that fit the map and reduce model, this makes MapReduce a strong choice for processing large volumes of data.