Hadoop 1.x: Architecture, Major Components and How HDFS and MapReduce Work


Before reading this post, please go through my previous post, “Introduction to Hadoop”, to learn the basics of Apache Hadoop.

In this post, we are going to discuss the Apache Hadoop 1.x architecture and how its components work in detail.

Post’s Brief Table of Contents

  • Hadoop 1.x Architecture
  • Hadoop 1.x Major Components
  • How Hadoop 1.x Major Components Work
  • How Store and Compute Operations Work in Hadoop

Hadoop 1.x Architecture

Apache Hadoop 1.x and earlier versions use the following architecture. This is the high-level Hadoop 1.x architecture; we will discuss the detailed low-level architecture in the coming sections.

If you don’t understand this architecture at this stage, there is no need to worry. Read the next sections of this post, and the coming posts, to understand it well.

hadoop1.x-components

  • Hadoop Common Module is the Hadoop base API (a JAR file) for all Hadoop components. All other components work on top of this module.
  • HDFS stands for Hadoop Distributed File System. It is also known as HDFS V1 as it is part of Hadoop 1.x. It is used as the distributed storage system in the Hadoop architecture.
  • MapReduce is a batch processing or distributed data processing module. It is built by following the programming model described in Google’s MapReduce paper. It is also known as “MR V1” or “Classic MapReduce” as it is part of Hadoop 1.x.
  • All the remaining Hadoop Ecosystem components work on top of these two major components: HDFS and MapReduce. We will discuss all Hadoop Ecosystem components in detail in my coming posts.

NOTE:-
Hadoop 1.x MapReduce was developed by following Google’s MapReduce technical paper. It is also known as “Classic MapReduce”, a name that distinguishes it from the YARN-based MapReduce of Hadoop 2.x.

Hadoop 1.x Major Components

The major components of Hadoop 1.x are HDFS and MapReduce. They are also known as the “Two Pillars” of Hadoop 1.x.

HDFS:
HDFS is the Hadoop Distributed File System, where our Big Data is stored on commodity hardware. It is designed to work with large data sets, with a default block size of 64MB (we can change it as per our project requirements).
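To make the block size concrete, here is a small sketch of the arithmetic. The function name `block_layout` is my own, used only for illustration; it is not part of any Hadoop API. It shows how a file is cut into 64MB blocks, with the last block holding only the leftover data.

```python
BLOCK_SIZE_MB = 64  # Hadoop 1.x default block size mentioned above


def block_layout(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return the sizes (in MB) of the blocks needed for a file.

    Hypothetical helper for illustration only; sizes are whole megabytes.
    """
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    sizes = [block_size_mb] * full_blocks
    if remainder:
        sizes.append(remainder)  # the last block holds only the leftover data
    return sizes


layout = block_layout(200)
print(len(layout), layout)  # prints: 4 [64, 64, 64, 8]
```

So a 200MB file needs four blocks: three full 64MB blocks and one 8MB block.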

The HDFS component is again divided into two sub-components:

  1. Name Node
     The Name Node is placed on the Master Node. It stores metadata about the data held on the Data Nodes, such as how many blocks are stored, which Data Nodes hold which blocks, Slave Node details, Data Node locations, timestamps etc.

  2. Data Node
     Data Nodes are placed on the Slave Nodes. They store our application’s actual data, in blocks of 64MB by default.
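The split between metadata and actual data can be sketched with a toy in-memory model. The classes and method names below (`NameNode`, `DataNode`, `register_block`, `read_file`) are my own simplifications for illustration and do not mirror the real Hadoop classes. The point is only that the Name Node knows where the blocks are, while the Data Nodes hold the bytes.

```python
class DataNode:
    """Toy Data Node: holds the actual block contents."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}  # block id -> bytes

    def store(self, block_id, data):
        self.blocks[block_id] = data


class NameNode:
    """Toy Name Node: holds only metadata, never the data itself."""

    def __init__(self):
        self.file_blocks = {}      # file path -> ordered list of block ids
        self.block_locations = {}  # block id -> names of Data Nodes holding it

    def register_block(self, path, block_id, node_name):
        blocks = self.file_blocks.setdefault(path, [])
        if block_id not in blocks:
            blocks.append(block_id)
        self.block_locations.setdefault(block_id, []).append(node_name)


name_node = NameNode()
data_nodes = {name: DataNode(name) for name in ("dn1", "dn2")}

# Store a two-block file: the bytes go to Data Nodes, the bookkeeping to the Name Node.
for block_id, node, payload in [("blk_1", "dn1", b"hello "), ("blk_2", "dn2", b"world")]:
    data_nodes[node].store(block_id, payload)
    name_node.register_block("/logs/a.txt", block_id, node)


def read_file(path):
    """Ask the Name Node where the blocks are, then fetch them from Data Nodes."""
    parts = []
    for block_id in name_node.file_blocks[path]:
        node = name_node.block_locations[block_id][0]
        parts.append(data_nodes[node].blocks[block_id])
    return b"".join(parts)


print(read_file("/logs/a.txt"))  # prints: b'hello world'
```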

MapReduce:
MapReduce is a distributed data processing or batch processing programming model. Like HDFS, the MapReduce component also uses commodity hardware to process a high volume and variety of data at high velocity in a reliable and fault-tolerant manner.

The MapReduce component is again divided into two sub-components:

  1. Job Tracker
     The Job Tracker assigns MapReduce tasks to Task Trackers in the cluster of nodes. When a Task Tracker fails or is shut down, the Job Tracker reassigns that Task Tracker’s tasks to other Task Trackers.

     The Job Tracker also maintains the status of every Task Tracker, such as up/running, failed, recovered etc.

  2. Task Tracker
     A Task Tracker executes the tasks assigned to it by the Job Tracker and sends the status of those tasks back to the Job Tracker.
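The assign-and-reassign behaviour described above can be sketched as follows. This is a simplified model under my own assumptions: the class `JobTracker` and its methods are hypothetical names, the least-loaded selection rule is a stand-in I chose for illustration, and a real cluster detects failures through status reports rather than an explicit `mark_failed` call.

```python
class JobTracker:
    """Toy Job Tracker: tracks Task Tracker status and task assignments."""

    def __init__(self, trackers):
        self.status = {t: "running" for t in trackers}
        self.assignments = {}  # task id -> Task Tracker name

    def _pick_tracker(self):
        running = [t for t, s in self.status.items() if s == "running"]
        if not running:
            raise RuntimeError("no running Task Tracker available")
        # Assumed policy for this sketch: choose the least-loaded running tracker.
        load = lambda t: sum(1 for a in self.assignments.values() if a == t)
        return min(running, key=load)

    def assign(self, task):
        tracker = self._pick_tracker()
        self.assignments[task] = tracker
        return tracker

    def mark_failed(self, tracker):
        """Record the failure and move the failed tracker's tasks elsewhere."""
        self.status[tracker] = "failed"
        orphaned = [task for task, t in self.assignments.items() if t == tracker]
        for task in orphaned:
            self.assign(task)
        return orphaned


job_tracker = JobTracker(["tt1", "tt2", "tt3"])
for task in ["map-0", "map-1", "map-2", "map-3"]:
    job_tracker.assign(task)

moved = job_tracker.mark_failed("tt1")
print(moved)                    # tasks that had to be reassigned
print(job_tracker.assignments)  # no task is left on tt1
```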

hadoop1.x-hdfs-mr-components

We will discuss the responsibilities of these four sub-components, and how they interact with each other to perform a client application’s tasks, in detail in the next section.

How Hadoop 1.x Major Components Work

Hadoop 1.x components follow this architecture to interact with each other and to work in parallel in a reliable and fault-tolerant manner.

Hadoop 1.x Components High-Level Architecture

hadoop1.x-components-architecture

  • Both Master Node and Slave Nodes contain two Hadoop Components:
    1. HDFS Component
    2. MapReduce Component
  • Master Node’s HDFS component is also known as “Name Node”.
  • Slave Node’s HDFS component is also known as “Data Node”.
  • Master Node’s “Name Node” component is used to store Meta Data.
  • Slave Node’s “Data Node” component is used to store our application’s actual Big Data.
  • HDFS stores data in “Data Blocks” of 64MB by default.
  • Master Node’s MapReduce component is also known as “Job Tracker”.
  • Slave Node’s MapReduce component is also known as “Task Tracker”.
  • Master Node’s “Job Tracker” takes care of assigning tasks to the “Task Trackers” and receiving status updates from them.
  • Slave Node’s MapReduce component “Task Tracker” runs two kinds of MapReduce tasks:
    1. Map Task
    2. Reduce Task

    We will discuss MapReduce tasks (Mapper and Reducer) in detail in my coming post with some simple end-to-end examples.

  • Slave Node’s “Task Tracker” actually performs the Client’s tasks by using the MapReduce batch processing model.
  • Master Node is the primary node that takes care of all the remaining Slave Nodes (secondary nodes).

Hadoop 1.x Components In-detail Architecture

hadoop2.x-components-architecture

Hadoop 1.x Architecture Description

  • Clients (one or more) submit their work to the Hadoop System.
  • When the Hadoop System receives a Client request, it is first received by the Master Node.
  • Master Node’s MapReduce component “Job Tracker” is responsible for receiving the Client’s work, dividing it into manageable independent tasks and assigning them to Task Trackers.
  • Slave Node’s MapReduce component “Task Tracker” receives those tasks from the “Job Tracker” and performs them by running Map and Reduce tasks.
  • Once all Task Trackers have finished their tasks, the outputs of the Reduce tasks together form the final result, and the Job Tracker marks the job as complete.
  • Finally, the Hadoop System makes that final result available to the Client.
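The flow above can be sketched in a single process with a word count job. This is only an illustration of the MapReduce model under my own assumptions: the function names (`map_task`, `shuffle`, `reduce_task`, `run_job`) are hypothetical, and everything runs sequentially here, whereas on a real cluster the tasks run in parallel on different Task Trackers.

```python
from collections import defaultdict


def map_task(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(mapped_outputs):
    """Group all values emitted by the map tasks by their key."""
    grouped = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped


def reduce_task(key, values):
    """Reduce: combine all values for one key into a single count."""
    return key, sum(values)


def run_job(lines):
    # Each input line stands in for one independent map task.
    mapped = [map_task(line) for line in lines]
    grouped = shuffle(mapped)
    return dict(reduce_task(key, values) for key, values in grouped.items())


result = run_job(["Hadoop stores data", "Hadoop processes data"])
print(result)  # prints: {'hadoop': 2, 'stores': 1, 'data': 2, 'processes': 1}
```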

How Store and Compute Operations Work in Hadoop

All these Master and Slave Nodes are organized into a networked cluster. Each cluster is again divided into racks. Each rack contains a set of nodes (commodity computers).

When the Hadoop system receives a “Store” operation, such as storing large data sets into HDFS, it stores that data on 3 different nodes (the Replication Factor is 3 by default). The complete data is not stored on one single node: a large data file is divided into manageable fixed-size blocks, and the blocks are distributed across different nodes with 3 copies of each.
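Here is a minimal sketch of how blocks could be spread over nodes with 3 copies each. The function `place_blocks` and the round-robin rule are my own simplification for illustration; real HDFS placement also takes racks into account, which this sketch ignores.

```python
REPLICATION_FACTOR = 3  # default replication factor mentioned above


def place_blocks(num_blocks, nodes, replication=REPLICATION_FACTOR):
    """Assign each block to `replication` distinct nodes, round-robin.

    Hypothetical helper: a simplified stand-in for the real placement policy.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for block in range(num_blocks):
        placement[block] = [
            nodes[(block + offset) % len(nodes)] for offset in range(replication)
        ]
    return placement


nodes = ["node1", "node2", "node3", "node4", "node5"]
placement = place_blocks(4, nodes)
for block, holders in placement.items():
    print(block, holders)
```

With five nodes and four blocks, every block ends up on three different nodes, and no single node holds the complete file.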

If the Hadoop system receives a “Compute” operation, it talks to nearby nodes to retrieve those blocks of data. If one or more nodes fail while reading data or computing, it automatically carries on with those tasks by using another nearby, available node.

That’s why the Hadoop system provides highly available and fault-tolerant Big Data solutions.
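The fallback to another available node can be sketched like this. The function `pick_replica` is a hypothetical name of mine, and I assume the replica list for each block is already ordered from nearest to farthest; how that ordering is computed is outside this sketch.

```python
def pick_replica(block_id, replicas, live_nodes):
    """Return the first live node holding the block.

    Assumes `replicas[block_id]` is ordered nearest-first (an assumption of
    this sketch, not something computed here).
    """
    for node in replicas[block_id]:
        if node in live_nodes:
            return node
    raise IOError(f"no live replica available for {block_id}")


replicas = {"blk_1": ["node1", "node2", "node3"]}

# All nodes healthy: the nearest replica is used.
print(pick_replica("blk_1", replicas, {"node1", "node2", "node3"}))  # prints: node1

# node1 has failed: the read carries on from the next available replica.
print(pick_replica("blk_1", replicas, {"node2", "node3"}))  # prints: node2
```

Because each block has 3 copies, the read only fails when all three nodes holding that block are down at the same time.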

NOTE:-

  • Hadoop 1.x Architecture has a lot of limitations and drawbacks, so the Hadoop Community evaluated and redesigned it into the Hadoop 2.x Architecture.
  • Hadoop 2.x Architecture is substantially different and was designed to address the limitations and drawbacks of the Hadoop 1.x Architecture.

That’s all about the Hadoop 1.x Architecture, its major components and how those components work together to fulfill Client requirements. We will discuss “Hadoop 2.x Architecture, Major Components and How those components work” in my coming post.

I hope you now understand the Hadoop 1.x Architecture and how it works.

Please drop me a comment if you like my post or have any issues/suggestions.

Comments

  1. Ramya says:

    Explained very well.it is easy to understand everyone.

  2. Srinivasa says:

    Very good explanation I have understood clearly. And can you please provide 2.x also

  3. Anbarasu says:

    explained well . great place to obtain the work flow of the hadoop

  4. Subbareddy N says:

    Hello Rambabu,

    Your post will help us to understand the basic knowledge of HDFS & Map Reduce. It is very simple and helpful.

    Cheers,
    N. P S Reddy

  5. Subhash Yadav says:

    Very good explanation. Thanks for sharing such a great knowledge about Hadoop 1.x architecture

  6. SIMANCHALA PATTANAYAK says:

    Thanks Rambabu….Your way of explanation of is just awesome….

    In “Hadoop 1.x Components In-detail Architecture” part we didn’t get any explanation about Namenode & Datanode. Could you please explain how data flow occurs in NN & DN.

    Thank you…

  7. Uday says:

    Thanks a lot for all of your efforts and such a clear explanation. Your intention of explaining from a layman prospective is really helpful for beginners.

  8. Sayak says:

    Great explanation..

  9. Deepak says:

    Explained in layman’s words. Thanks and keep going!
    Expecting more from you….:)

  10. VA says:

    Good article,simple n easy to understand.
    thanks

  11. Sunil Pandey says:

    Really well explained. thanks a lot.

    1. Pankaj says:

      Thanks Sunil, please subscribe to our Newsletter where we share exclusive tips and free eBooks.

  12. gixhub says:

    good work

  13. Sreenivas says:

    Can you please provide how the map reduce works internally?

  14. Vijay says:

    Can you please provide Hadoop 1.x limitations and drawbacks in detail.

    1. Rambabu says:

      Hi Vijay
      Yes, I have a plan to deliver a post on “Differences between Hadoop 1.x & 2.x, Limitations/Drawbacks of Hadoop 1.x and Advantages of Hadoop 2.x” soon. As most of users are asking to provide some practical posts before going into in-depth discussions about Hadoop 1.x and 2.x, I’m concentrating on some basics examples now. Will answer your question soon. Please read my future post.

  15. Atchaiah says:

    Thanks Rambabu and i am waiting for u r up coming posts
