Introduction to Hadoop, BigData Life-Cycle Management

Filed Under: Big Data

Before reading this post, please go through my previous post, “Introduction to BigData”, to learn some BigData basics. In this post, we will discuss Hadoop basics.

Post’s Brief Table Of Contents:

  • Introduction to Hadoop
  • What is Apache Hadoop?
  • Why Apache Hadoop to Solve BigData Problems?
  • Hadoop Advantages
  • Hadoop is Suitable For
  • Hadoop is NOT Suitable For
  • Hadoop Deployment Modes
  • Hadoop 2.x Components
  • Hadoop 2.x Components Responsibilities
  • BigData Life-Cycle Management

Introduction to Hadoop

We are living in the “BigData” era. Most organizations are facing BigData problems.

Hadoop is an Open Source framework from the Apache Software Foundation for solving BigData problems. It is written entirely in the Java programming language.

Google published two tech papers: one on the Google File System (GFS) in October 2003 and another on the MapReduce algorithm in December 2004. GFS is Google’s proprietary distributed file system for storing and managing data efficiently and reliably on commodity hardware. MapReduce is a parallel, distributed programming model for processing and generating large datasets.

Google solved its BigData problems using these two components: GFS and the MapReduce algorithm.

Hadoop was initially inspired, designed and developed by following Google’s papers on the MapReduce algorithm and the Google File System (GFS).
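To make the MapReduce model concrete, here is a minimal word-count sketch in plain Java with no Hadoop dependencies (the class and method names are my own, for illustration only). It shows the two phases the Google paper describes: a map phase emitting (word, 1) pairs, and a reduce phase grouping by word and summing.

```java
import java.util.*;
import java.util.stream.*;

// A plain-Java sketch of the MapReduce model: map lines to (word, 1)
// pairs, then group by word and sum. In real Hadoop, the map and reduce
// tasks run in parallel across many nodes of the cluster.
public class WordCountSketch {

    // Map phase: split each input line into (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        return Arrays.stream(line.toLowerCase().split("\\s+"))
                .filter(w -> !w.isEmpty())
                .map(w -> Map.entry(w, 1))
                .collect(Collectors.toList());
    }

    // Shuffle + reduce phase: group pairs by word and sum the counts.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        return pairs.stream().collect(Collectors.groupingBy(
                Map.Entry::getKey,
                Collectors.summingInt(Map.Entry::getValue)));
    }

    static Map<String, Integer> mapReduce(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line)); // each map call is independent, so it parallelizes
        }
        return reduce(pairs);
    }

    public static void main(String[] args) {
        System.out.println(
            mapReduce(List.of("big data big problems", "hadoop solves big problems")));
    }
}
```

Real Hadoop MapReduce jobs use the framework’s Mapper and Reducer classes instead, but the data flow is the same idea scaled across a cluster.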

All Apache Hadoop core modules are developed in Java. The latest Hadoop version is 2.x.

hadoop-logo

The image above is the logo of the Apache Hadoop software.

What is Apache Hadoop?

Apache Hadoop is an Open-Source BigData solution framework for Distributed Storage, Distributed Computing and Cloud Computing using commodity hardware.

Apache Hadoop Official Website: https://hadoop.apache.org/

NOTE:- What is Commodity Hardware?
Commodity hardware is very inexpensive, ordinary hardware, built from standard components for general-purpose computing. It is cheap, non-enterprise hardware.

It is a Data Management software framework with Scale-out storage and Distributed Processing.

It uses commodity hardware and provides a very cost-effective BigData solution through distributed computing. Some vendors also support BigData Hadoop solutions in the Cloud, for example AWS (Amazon Web Services).

Any BigData Hadoop Solution mainly provides two kinds of services:

  1. Storage Service
  2. Computation Service

Why Apache Hadoop to Solve BigData Problems?

Apache Hadoop is open-source BigData solution software. We should use it for the following reasons:

  • Open Source
  • Very Reliable
  • Highly Scalable
  • Uses Commodity Hardware

As existing tools cannot handle such huge volumes and varieties of data, we can use the Apache Hadoop BigData solution to solve these problems.

Hadoop Advantages

Apache Hadoop provides the following benefits in solving BigData Problems:

  • Open Source
  • Apache Hadoop is an Open Source BigData solution with a free license from the Apache Software Foundation.

  • High Availability
  • The Hadoop solution uses a replication technique, with a default replication factor of 3. If required, we can change this value.

    If one node goes down for some reason, Hadoop automatically picks up the data from another nearby, available node. The Hadoop system detects the failed node automatically and does what is necessary to bring it back up and running, so the system is highly available.

    So Apache Hadoop provides no-downtime BigData solutions.

  • Highly Scalable
  • Hadoop is highly scalable because it can store and distribute very huge amounts of data across hundreds or thousands of commodity machines operating in parallel. We can scale it horizontally or vertically based on our project requirements.

  • Better Performance
  • Even though Hadoop uses commodity hardware, it distributes work across different nodes and performs those tasks in parallel, so it can process petabytes (PB) of data or more and delivers better performance.

    NOTE:- A Node means any commodity computer in a Hadoop cluster.

  • Handles Huge and Varied types of Data
  • Hadoop handles very huge amounts of varied data using parallel computing techniques.

  • Cost-Effective BigData Solutions
  • Unlike traditional relational databases and tools, Hadoop uses very inexpensive, non-enterprise commodity hardware to set up Hadoop clusters. We don’t need to buy very expensive, high-capacity, high-performance hardware to solve our BigData problems. Hadoop uses cheap hardware and delivers very effective solutions.

  • Increases Profits
  • By using very cheap commodity hardware to build our BigData network, Hadoop increases profits. If we use Cloud technology to solve BigData problems, we can improve our profits even more.

  • Very Flexible
  • Hadoop can accept any kind of data format from different data sources. We can integrate new data sources with the Hadoop system and use them very easily.

  • Fault Tolerance Architecture
  • Hadoop provides a very fault-tolerant architecture for solving BigData problems. We will discuss how it is resilient to failures in my coming posts.

  • Solves Complex Problems
  • As Hadoop follows distributed/parallel processing, it solves complex problems very easily.
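The replication factor mentioned under “High Availability” above is just a configuration value. A minimal sketch of changing it cluster-wide in hdfs-site.xml (dfs.replication is the standard Hadoop property name; the value shown is the default):

```xml
<!-- hdfs-site.xml: default block replication factor -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value> <!-- lower it for small test clusters, raise it for critical data -->
  </property>
</configuration>
```

This sets the default for newly written files; the factor can also be changed per file later with the `hdfs dfs -setrep` command.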

Hadoop is Suitable For

Apache Hadoop is suitable to solve the following kinds of BigData problems:

  • Recommendation systems
  • Processing Very Big DataSets
  • Processing Diversity of Data
  • Log Processing
  • Processing data at rest (batch processing)

Hadoop is NOT Suitable For

Hadoop is not suitable for all BigData Solutions. The following are the few scenarios where BigData Hadoop Solution is not suitable:

  • Processing small DataSets
  • Executing Complex Queries
  • Processing data in motion (streaming), which is a bit tough for Hadoop

Hadoop BigData Solutions

We have many BigData Hadoop Solutions in the current market. Most popular solutions are:

  • Apache Hadoop
  • Cloudera BigData Hadoop Solution
  • Hortonworks Data Platform (HDP)
  • Google Cloud BigData Solution
  • AWS EMR(Amazon Web Services – Elastic MapReduce)
  • MapR

NOTE:-
All the above BigData Hadoop solutions are built on Apache Hadoop software.

Hadoop Deployment Modes

Apache Hadoop software can be installed and operated in the following three modes:

  • Standalone Mode
  • It is used for simple analysis or debugging purposes. It is not a distributed or clustered architecture; Hadoop just runs on a single node (in a single Java process, using the local filesystem) for testing purposes.

  • Pseudo Distributed Mode
  • It is installed on a single node, but the Hadoop daemons run as if on multiple servers. It creates a simulated Hadoop cluster of nodes, but is not really distributed. It is mainly useful for preparing a POC (Proof of Concept) to test a multi-node, clustered Hadoop system.

  • Fully Distributed Mode
  • It is a real, fully distributed Hadoop clustered architecture. It is used in live, production BigData systems.
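The three modes above differ mainly in configuration, not in code. For example, pseudo-distributed mode is typically enabled by pointing the default filesystem at a local HDFS daemon in core-site.xml (a minimal sketch; fs.defaultFS is the standard property, and standalone mode simply leaves it at its local-filesystem default):

```xml
<!-- core-site.xml: pseudo-distributed mode talks to HDFS daemons on localhost -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
```

In fully distributed mode, the same property points at the real NameNode host of the cluster instead of localhost.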

Hadoop 2.x Components

Apache Hadoop Version 2.x has the following three major Components:

  • HDFS
  • YARN
  • MapReduce

We will discuss the major differences between Hadoop 1.x and 2.x, and how these components work together in the Hadoop environment to solve BigData problems, in my coming posts.

Hadoop 2.x Components Responsibilities

The main responsibilities of Apache Hadoop 2.x Components are:

  • Data Storage
  • Resource Management
  • Data Integration
  • Data Governance
  • Data and Batch Processing
  • Data Analysis
  • Real-time Computing

We will discuss which individual Hadoop component is responsible for each of these tasks in detail in my coming posts.

BigData Life-Cycle Management

Generally, Hadoop systems use the following life-cycle to manage their BigData:

bigdata-lifecycle

First, data sources create BigData. Data sources can be anything: social media, the Internet, mobile devices, computers, documents, audio and video, cameras, sensors, etc. Once BigData is created by these systems, it is captured and processed into some format for storage in the Hadoop storage system.

After storing BigData in Hadoop storage, it is transformed and stored in a NoSQL or Hadoop database.

Then we use some Hadoop tools to analyse the BigData and prepare reports.

Businesses and organizations go through those reports and visualizations to understand their needs and take the necessary actions to improve business value.

NOTE:-
If you don’t understand some terminology at this stage, don’t worry. Read the next posts and practice some programs. Once you are clear on the Hadoop architecture and how the Hadoop components work, come back to this page and read it again. I bet you will get a clear idea of these concepts.

That’s all about the Hadoop introduction and BigData Life-Cycle Management. We will discuss the Hadoop architecture and how the Hadoop components work in my coming posts.

Please drop me a comment if you like my post or have any issues/suggestions.

Comments

  1. jitu yadav says:

    Now your site looks too cool, sir.
    jitu yadav from Indore

  2. moni says:

    Hello,
    can you please tell me End to end process of a Hadoop developer in SDLC?

  3. Shailna Patidar says:

    Hey Rambabu,

    Very nice article on Hadoop overview. Keep writing more such articles for freshers.

    Cheers.

  4. Victor Rojas says:

    I liked the article; I will start on the next one.

  5. Hareesh says:

    Hi Rambabu,

    Very Nice overview of Bigdata and Hadoop.

    Can you explain more about the BigData lifecycle picture? What does curation mean?

    1. Rambabu says:

      Sure will update that post soon.

      Many thanks,
      Ram

  6. Balaji K says:

    Rambabu, I would appreciate further details on BigData Life-Cycle Management; use cases would be great.

  7. RajendraBabu says:

    Thanks, Posa, for your time in sharing knowledge on Hadoop. Please explain how Hadoop helps Java developers, with sample programs and applications.

    1. Rambabu says:

      We write all MapReduce Jobs in Java only, please see them in my coming posts. We should have minimum Core Java knowledge to work as a Hadoop Developer.

      1. Jasim Saeed says:

        Hi Rambabu,

        I really appreciate it, if you can advise on Hadoop’s security features & configuration, and best practice from InfoSecurity perspective.

        Many Thanks
