Welcome to MapReduce algorithm example. Before writing MapReduce programs in CloudEra Environment, first we will discuss how MapReduce algorithm works in theory with some simple MapReduce example in this post. In my next posts, we will discuss about How to develop a MapReduce Program to perform WordCounting and some more useful and simple examples.
Table of Contents
MapReduce Algorithm
MapReduce is a Distributed Data Processing Algorithm, introduced by Google in it’s MapReduce Tech Paper.
MapReduce Algorithm is mainly inspired by Functional Programming model. ( Please read this post “Functional Programming Basics” to get some understanding about Functional Programming , how it works and it’s major advantages).
MapReduce algorithm is mainly useful to process huge amount of data in parallel, reliable and efficient way in cluster environments.
I hope, everyone is familiar with “Divide and Conquer” algorithm. It uses Divide and Conquer technique to process large amount of data.
It divides input task into smaller and manageable sub-tasks (They should be executable independently) to execute them in-parallel.
MapReduce Algorithm Steps
MapReduce Algorithm uses the following three main steps:
- Map Function
- Shuffle Function
- Reduce Function
Here we are going to discuss each function role and responsibility in MapReduce algorithm. If you don’t understand it well in this section, don’t get panic. Please read next section, where we use one simple word counting example to explain them in-detail. Once you read next section again come back to this section re-read it again. I bet you will definitely understand these 3 steps or functions very well.
Map Function
Map Function is the first step in MapReduce Algorithm. It takes input tasks (say DataSets. I have given only one DataSet in below diagram.) and divides them into smaller sub-tasks. Then perform required computation on each sub-task in parallel.
This step performs the following two sub-steps:
- Splitting
- Mapping
- Splitting step takes input DataSet from Source and divide into smaller Sub-DataSets.
- Mapping step takes those smaller Sub-DataSets and perform required action or computation on each Sub-DataSet.
The output of this Map Function is a set of key and value pairs as <Key, Value> as shown in the below diagram.
MapReduce First Step Output:
Shuffle Function
It is the second step in MapReduce Algorithm. Shuffle Function is also know as “Combine Function”.
It performs the following two sub-steps:
- Merging
- Sorting
It takes a list of outputs coming from “Map Function” and perform these two sub-steps on each and every key-value pair.
- Merging step combines all key-value pairs which have same keys (that is grouping key-value pairs by comparing “Key”). This step returns <Key, List<Value>>.
- Sorting step takes input from Merging step and sort all key-value pairs by using Keys. This step also returns <Key, List<Value>> output but with sorted key-value pairs.
Finally, Shuffle Function returns a list of <Key, List<Value>> sorted pairs to next step.
MapReduce Second Step Output:
Reduce Function
It is the final step in MapReduce Algorithm. It performs only one step : Reduce step.
It takes list of <Key, List<Value>> sorted pairs from Shuffle Function and perform reduce operation as shown below.
MapReduce Final Step Output:
Final step output looks like first step output. However final step <Key, Value> pairs are different than first step <Key, Value> pairs. Final step <Key, Value> pairs are computed and sorted pairs.
We can observe the difference between first step output and final step output with some simple example. We will discuss same steps with one simple example in next section.
That’s it all three steps of MapReduce Algorithm.
MapReduce Example – Word Count
In this section, we are going to discuss about “How MapReduce Algorithm solves WordCount Problem” theoretically. We will implement a Hadoop MapReduce Program and test it in my coming post.
Problem Statement:
Count the number of occurrences of each word available in a DataSet.
Input DataSet
Please find our example Input DataSet file in below diagram. Just for simplicity, we are going to use simple small DataSet. However, Real-time applications use very huge amount of Data.
Client Required Final Result
MapReduce – Map Function (Split Step)
MapReduce – Map Function (Mapping Step)
MapReduce – Shuffle Function (Merge Step)
MapReduce – Shuffle Function (Sorting Step)
MapReduce – Reduce Function (Reduce Step)
MapReduce 3 Step Process With WordCount Example
That’s it all about MapReduce Algorithm and map reduce example step by step. It’s time to start Developing and testing MapReduce Programs. First, We are going to develop same “WordCounting” example in my coming post.
Please drop me a comment if you like my post or have any issues/suggestions.
Reference: Wikipedia, Google Research Paper
Can you please explain this with a different example other than word count.
Thanks for sharing well articulated map-reduce function. it was easy to understand.
It’s was simple and nice test as I did in my work also. to reduce the features
I need help in completing my assignment of map only algorithm
This article helped a lot to understand map reduce
great article..it was of help.thnx
In Hadoop, MapReduce is a calculation that decomposes large manipulation jobs into individual tasks that can be executed in parallel cross a cluster of servers. The results of tasks can be joined together to compute final results.
Perfect, thanks a lot
EXCELLENT AND VERY USEFUL
hello,
do you have a pseudo algorithme for shuffle step in mapreduce function for an example ??
thanks
I have a question in mind that instead of following such a long process, word count can be done easily by taking buckets of each word and placing the words in its own bucket. when all the words have been placed in the bucket, apply counter (counter will count the number of words in the bucket). Our question (count the no of words) gets solved by using buckets also which is more easy and saves time, Please reply for the same.
the process you mentioned will be used for small sets of data,,,,but with large data sets point of view buckets may not be advisable
Excellent !
very very useful article….
Its an excellent tutorial Thank You so much
very good example.
Good article with nice explanation. It seems there’s small mistake. I think merging step will be in reduce process just after sorting step.
sorry, my mistake
Very well written and displayed. Very helpful for the beginners to understand the basic concepts.
Excellent
very very good
thank you very much
Very nice to start leanring Hadoop
Hello Pankaj,
I am learning hadoop ritenow, trying to do basic wordcount example on eclipse but I am getting error,
Exception in thread “main” java.lang.NoClassDefFoundError: org/apache/commons/logging/LogFactory
at org.apache.hadoop.conf.Configuration.(Configuration.java:173)
at WordCount.main
can you please help how to solve that, I tried everything I can but I cannot solve the problem
Charmy
nice one
This artical is very easy for understanding. I want some mathematical solved example .
plz help me..
Excellent!This is the best article for introduce Mapredure Algorithm I have ever read.
Usful
It is good but I need some algorithm or program to understand I mean I get your use-case or example. But it takes lot of time to implement this .So, I need ur hrlp…