Welcome to MapReduce algorithm example. Before writing MapReduce programs in CloudEra Environment, first we will discuss how MapReduce algorithm works in theory with some simple MapReduce example in this post. In my next posts, we will discuss about How to develop a MapReduce Program to perform WordCounting and some more useful and simple examples.
MapReduce is a Distributed Data Processing Algorithm, introduced by Google in it’s MapReduce Tech Paper.
MapReduce Algorithm is mainly inspired by Functional Programming model. ( Please read this post “Functional Programming Basics” to get some understanding about Functional Programming , how it works and it’s major advantages).
MapReduce algorithm is mainly useful to process huge amount of data in parallel, reliable and efficient way in cluster environments.
I hope, everyone is familiar with “Divide and Conquer” algorithm. It uses Divide and Conquer technique to process large amount of data.
It divides input task into smaller and manageable sub-tasks (They should be executable independently) to execute them in-parallel.
MapReduce Algorithm Steps
MapReduce Algorithm uses the following three main steps:
- Map Function
- Shuffle Function
- Reduce Function
Here we are going to discuss each function role and responsibility in MapReduce algorithm. If you don’t understand it well in this section, don’t get panic. Please read next section, where we use one simple word counting example to explain them in-detail. Once you read next section again come back to this section re-read it again. I bet you will definitely understand these 3 steps or functions very well.
Map Function is the first step in MapReduce Algorithm. It takes input tasks (say DataSets. I have given only one DataSet in below diagram.) and divides them into smaller sub-tasks. Then perform required computation on each sub-task in parallel.
This step performs the following two sub-steps:
- Splitting step takes input DataSet from Source and divide into smaller Sub-DataSets.
- Mapping step takes those smaller Sub-DataSets and perform required action or computation on each Sub-DataSet.
The output of this Map Function is a set of key and value pairs as <Key, Value> as shown in the below diagram.
MapReduce First Step Output:
It is the second step in MapReduce Algorithm. Shuffle Function is also know as “Combine Function”.
It performs the following two sub-steps:
It takes a list of outputs coming from “Map Function” and perform these two sub-steps on each and every key-value pair.
- Merging step combines all key-value pairs which have same keys (that is grouping key-value pairs by comparing “Key”). This step returns <Key, List<Value>>.
- Sorting step takes input from Merging step and sort all key-value pairs by using Keys. This step also returns <Key, List<Value>> output but with sorted key-value pairs.
Finally, Shuffle Function returns a list of <Key, List<Value>> sorted pairs to next step.
MapReduce Second Step Output:
It is the final step in MapReduce Algorithm. It performs only one step : Reduce step.
It takes list of <Key, List<Value>> sorted pairs from Shuffle Function and perform reduce operation as shown below.
MapReduce Final Step Output:
Final step output looks like first step output. However final step <Key, Value> pairs are different than first step <Key, Value> pairs. Final step <Key, Value> pairs are computed and sorted pairs.
We can observe the difference between first step output and final step output with some simple example. We will discuss same steps with one simple example in next section.
That’s it all three steps of MapReduce Algorithm.
MapReduce Example – Word Count
In this section, we are going to discuss about “How MapReduce Algorithm solves WordCount Problem” theoretically. We will implement a Hadoop MapReduce Program and test it in my coming post.
Count the number of occurrences of each word available in a DataSet.
Please find our example Input DataSet file in below diagram. Just for simplicity, we are going to use simple small DataSet. However, Real-time applications use very huge amount of Data.
Client Required Final Result
MapReduce – Map Function (Split Step)
MapReduce – Map Function (Mapping Step)
MapReduce – Shuffle Function (Merge Step)
MapReduce – Shuffle Function (Sorting Step)
MapReduce – Reduce Function (Reduce Step)
MapReduce 3 Step Process With WordCount Example
That’s it all about MapReduce Algorithm and map reduce example step by step. It’s time to start Developing and testing MapReduce Programs. First, We are going to develop same “WordCounting” example in my coming post.
Please drop me a comment if you like my post or have any issues/suggestions.