Hadoop 2.x MapReduce (MR V1) WordCounting Example

Filed Under: Big Data

Before reading this post, please go through my previous post, “How MapReduce Algorithm Works”, to get some idea about the MapReduce algorithm. That post already explains, in theory, how MapReduce performs word counting.

And if you are not familiar with basic HDFS commands, please go through my post “Hadoop HDFS Basic Developer Commands” to learn how to execute HDFS commands in the Cloudera environment.

In this post, we are going to develop the same WordCounting program using the Hadoop 2 MapReduce API and test it in the Cloudera environment.

MapReduce WordCounting Example

We need to write the following three programs to develop and test the MapReduce WordCount example:

  1. Mapper Program
  2. Reducer Program
  3. Client Program

NOTE:-
To develop MapReduce Programs, there are two versions of MR API:

  1. One from Hadoop 1.x (MapReduce Old API)
  2. Another from Hadoop 2.x (MapReduce New API)

In Hadoop 2.x, the old MapReduce API is deprecated (though still available), so we are going to concentrate on the new MapReduce API to develop this WordCount example.

The Cloudera environment already provides an Eclipse IDE set up with the Hadoop 2.x API, so it is very easy to develop and test MapReduce programs there.

To develop WordCount MapReduce Application, please use the following steps:

  • Open the default Eclipse IDE provided by the Cloudera environment.
  • We can use an already created project or create a new Java project.
  • For simplicity, I’m going to use the existing “training” Java project. All required Hadoop 2.x JARs are already on its classpath, so it is a ready-to-use Eclipse Java project.
  • Create WordCount Mapper Program
  • Create WordCount Reducer Program
  • Create WordCount Client Program to test this application

Let us start developing these three programs in the next sections.

Mapper Program

Create a “WordCountMapper” Java Class which extends Mapper class as shown below:


package com.journaldev.hadoop.mrv1.wordcount;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

	private static final IntWritable ONE = new IntWritable(1);
	private final Text word = new Text();

	@Override
	public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
		// Split the line into words and emit a (word, 1) pair for each one.
		StringTokenizer tokenizer = new StringTokenizer(value.toString());
		while (tokenizer.hasMoreTokens()) {
			word.set(tokenizer.nextToken());
			context.write(word, ONE);
		}
	}

}

Code Explanation:

  • Our WordCountMapper class extends the Hadoop 2 MapReduce API class “Mapper”.
  • The Mapper class is parameterized with generic types: Mapper<LongWritable, Text, Text, IntWritable>
  • Here <LongWritable, Text, Text, IntWritable>

    1. The first two, <LongWritable, Text>, represent the input key/value types of our WordCount mapper.
    2. For example: we give the job a file (a large amount of text data). The framework reads this file line by line and passes each line to the mapper, keyed by its byte offset in the file, as shown below

      
      <Byte_Offset, Line_Read_From_Input_File>
      

      In Hadoop MapReduce API, it is equal to <LongWritable, Text>.

    3. The last two, <Text, IntWritable>, represent the output key/value types of our WordCount mapper.
    4. For example: our WordCount mapper emits output of the following form

      
      <Unique_Word_From_Input_File, Word_Count>
      

      In Hadoop MapReduce API, it is equal to <Text, IntWritable>.

  • We override the Mapper’s map() method and provide our mapping logic there: split each line into words and emit a (word, 1) pair for every word.
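To see what the map step produces in isolation, here is a small plain-Java sketch that mimics the mapping logic above. It has no Hadoop dependencies, and the class name MapStepDemo is just for illustration:

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map.Entry;
import java.util.StringTokenizer;

public class MapStepDemo {

    // Mimics WordCountMapper.map(): one (word, 1) pair per token in the line.
    static List<Entry<String, Integer>> map(String line) {
        List<Entry<String, Integer>> pairs = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            pairs.add(new SimpleEntry<>(tokenizer.nextToken(), 1));
        }
        return pairs;
    }

    public static void main(String[] args) {
        // Each token becomes one (word, 1) pair; duplicates are NOT merged
        // here, because merging counts is the reducer's job.
        for (Entry<String, Integer> e : map("hadoop is a framework hadoop")) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```

Note that “hadoop” appears twice in the output with a count of 1 each time; the shuffle and reduce phases are what turn those into a single (hadoop, 2) entry.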

Reducer Program

Create a “WordCountReducer” Java Class which extends Reducer class as shown below:


package com.journaldev.hadoop.mrv1.wordcount;

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

	@Override
	public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
		// Sum all the counts emitted for this word.
		int sum = 0;
		for (IntWritable val : values) {
			sum += val.get();
		}
		context.write(key, new IntWritable(sum));
	}

}

Code Explanation:

  • Our WordCountReducer class extends the Hadoop 2 MapReduce API class “Reducer”.
  • The Reducer class is parameterized with generic types: Reducer<Text, IntWritable, Text, IntWritable>
  • Here <Text, IntWritable, Text, IntWritable>

    1. The first two, <Text, IntWritable>, represent the input key/value types of our WordCount reducer.
    2. For example: our mapper program produces <Text, IntWritable> output, which becomes the input of the reducer program.

      
      <Unique_Word_From_Input_File, Word_Count>
      

      In Hadoop MapReduce API, it is equal to <Text, IntWritable>.

    3. The last two, <Text, IntWritable>, represent the output key/value types of our WordCount reducer.
    4. For example: our WordCount reducer emits output of the following form

      
      <Unique_Word_From_Input_File, Total_Word_Count>
      

      In Hadoop MapReduce API, it is equal to <Text, IntWritable>.

  • We override the Reducer’s reduce() method and provide our reduce logic there: sum the counts for each word.
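Between map and reduce, the shuffle phase groups the mapper’s (word, 1) pairs by key. Here is a plain-Java sketch of what the reduce step then does with each group. It has no Hadoop dependencies, and ReduceStepDemo is an illustrative name:

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ReduceStepDemo {

    // Mimics WordCountReducer.reduce(): for each key, sum all of its values.
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> totals = new LinkedHashMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int sum = 0;
            for (int v : e.getValue()) {
                sum += v;
            }
            totals.put(e.getKey(), sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        // Simulated shuffle output: "hadoop" was emitted twice by the mapper.
        Map<String, List<Integer>> grouped = new LinkedHashMap<>();
        grouped.put("hadoop", Arrays.asList(1, 1));
        grouped.put("is", Arrays.asList(1));
        System.out.println(reduce(grouped)); // prints {hadoop=2, is=1}
    }
}
```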

Client Program

Create a “WordCountClient” Java Class with main() method as shown below:


package com.journaldev.hadoop.mrv1.wordcount;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;


public class WordCountClient {

	public static void main(String[] args) throws Exception {
		Job job = Job.getInstance(new Configuration());
		job.setJarByClass(WordCountClient.class);
		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(IntWritable.class);
		job.setMapperClass(WordCountMapper.class);
		job.setReducerClass(WordCountReducer.class);
		job.setInputFormatClass(TextInputFormat.class);
		job.setOutputFormatClass(TextOutputFormat.class);
		FileInputFormat.setInputPaths(job, new Path(args[0]));
		FileOutputFormat.setOutputPath(job, new Path(args[1]));
		boolean status = job.waitForCompletion(true);
		if (status) {
			System.exit(0);
		} 
		else {
			System.exit(1);
		}
	}

}

Code Explanation:

  • The Hadoop 2 MapReduce API provides the “Job” class in the “org.apache.hadoop.mapreduce” package.
  • The Job class is used to configure and submit a MapReduce job to perform our word-counting task.
  • The client program uses the Job object’s setter methods to wire up all the MapReduce components: the mapper, the reducer, the input/output formats, the output key/value types, and the input/output paths.

NOTE:-

  • As we discussed in my previous post, the MapReduce algorithm can use three functions: a Map function, a Combine function, and a Reduce function.
  • By observing these three programs, we can see that we have developed only two functions: Map and Reduce. So what about the Combine function?
  • We have not configured a combiner here, so Hadoop skips the combine step and the reducer performs all of the aggregation. Because our reduce logic is an associative and commutative sum, the reducer class could also be registered as the combiner.
  • We will discuss how to develop a Combine function in my coming posts.
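Conceptually, a combiner runs reduce-style aggregation locally on each mapper’s output before the shuffle, cutting the data sent over the network. Since our reduce logic is a simple sum, the reducer itself could serve as the combiner (registered in the client with job.setCombinerClass(WordCountReducer.class)). This plain-Java sketch (CombineStepDemo is an illustrative name, not Hadoop code) shows what that local pre-aggregation looks like:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class CombineStepDemo {

    // Mimics a combiner: collapse one mapper's stream of (word, 1) pairs into
    // (word, localCount) pairs before anything is sent over the network.
    static Map<String, Integer> combine(String[] mapperOutputWords) {
        Map<String, Integer> local = new LinkedHashMap<>();
        for (String word : mapperOutputWords) {
            local.merge(word, 1, Integer::sum);
        }
        return local;
    }

    public static void main(String[] args) {
        // One mapper emitted four pairs; the combiner shrinks them to three.
        String[] oneMappersWords = {"hadoop", "is", "hadoop", "fast"};
        System.out.println(combine(oneMappersWords)); // prints {hadoop=2, is=1, fast=1}
    }
}
```

The reducer then only has to merge these partial counts, which is why the combiner must be associative and commutative for the final totals to come out the same.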

Now we have developed all required components (programs). It’s time to test it.

Test MapReduce WordCounting Example

Our WordCounting project final structure looks like this:

[Screenshot: WordCount project structure in Eclipse]

Please use the following steps to test our MapReduce Application.

  • Create our WordCount application JAR file using Eclipse IDE’s Export feature.

    [Screenshots: exporting the WordCount JAR from Eclipse]

  • Execute the following “hadoop” command to run our WordCounting Application
  • Syntax:-

    
    hadoop jar <our-Jar-file-path> <Client-program>  <Input-Path> <Output-Path>
    

    Let us assume that the input file has already been uploaded to the “/ram/mrv1” directory in the Hadoop HDFS file system. If you have not done that, please go through my previous post, “Hadoop HDFS Basic Developer Commands”. Note that the output directory (“/ram/mrv1/output” here) must not already exist; the job creates it and fails if it is present.

    Example:-

    
    hadoop jar /home/cloudera/JDWordCountMapReduceApp.jar  
           com.journaldev.hadoop.mrv1.wordcount.WordCountClient 
           /ram/mrv1/NASDAQ_daily_prices_C.csv
           /ram/mrv1/output
    

    NOTE:-
    Just for readability, I’ve split the command across multiple lines above. Please type it on a single line, as shown below:

    [Screenshot: WordCount job console output]

    By going through this log, we can observe how the map and reduce tasks work together to solve our word-counting problem.

  • Execute the following “hadoop” command to view the output directory content
  • 
    hadoop fs -ls /ram/mrv1/output/
    

    It shows the content of “/ram/mrv1/output/” directory as shown below:

    [Screenshot: listing of the “/ram/mrv1/output” directory]

  • Execute the following “hadoop” command to view our WordCounting Application output
  • 
    hadoop fs -cat /ram/mrv1/output/part-r-00000
    

    This command displays the WordCounting application output. As my output file is too big, I cannot show its full contents here.

NOTE:-
Here we have used some Hadoop HDFS commands to run and test our WordCounting Application. If you are not familiar with HDFS commands, please go through my “Hadoop HDFS Basic Developer Commands” post.

That’s all about the Hadoop 2.x MapReduce WordCounting example. We will develop some more useful MapReduce programs in my coming posts.

Please drop me a comment if you like my post or have any issues/suggestions.

Comments

  1. shaharyar khan says:

    Mapper given in above example is not correct. The logic is counting lines instead of words. Tarun is right about this.

  2. Hareesh says:

    Hi Rambabu,

    I believe the old API (org.apache.hadoop.mapred.) is not deprecated in Hadoop 2.x.
    In Hadoop 2.x we can write Mapreduce programs using new API (org.apache.hadoop.mapreduce) as well old API(org.apache.hadoop.mapred).

    And also i tested wordcout program using Old API in Hadoop 2.6.0-cdh5.7.1. working fine.

    1. Rambabu says:

      Hi Hareesh,

      Thanks. Yes, we can still write MapReduce Old API programs in Hadoop 2.x as they are not removed from it. They are just deprecated now and may be removed in future releases.

      Thanks,
      Ram

      1. shyam says:

        Thanks a lot of clear explanation,
        But when I execute the mr test program i am getting below error , how to resolve it?

        java.lang.AssertionError: 4 Error(s): (Missing expected output (bigdata, 1) at position 0., Missing expected output (emerging, 1) at position 1., Missing expected output (hadoop, 2) at position 2., Matched expected output (is, 2) but at incorrect position 0 (expected position 3))

        Even i change the order also i am getting same error.

  3. Tarun Sharma says:

    There is issue in the logic of Mapper,It count lines instead of word,Correct code is given below:
    String word;
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while(tokenizer.hasMoreTokens())
    {
    word = tokenizer.nextToken();
    context.write(new Text(word), new IntWritable(1));
    }
