Before reading this post, please go through my previous post, “Introduction to BigData”, to get some BigData basics. In this post, we will discuss Hadoop basics.
Post’s Brief Table Of Contents:
- Introduction to Hadoop
- What is Apache Hadoop?
- Why Apache Hadoop to Solve BigData Problems?
- Hadoop Advantages
- Hadoop is Suitable For
- Hadoop is NOT Suitable For
- Hadoop Deployment Modes
- Hadoop 2.x Components
- Hadoop 2.x Components Responsibilities
- BigData Life-Cycle Management
Introduction to Hadoop
We are living in the “BigData” era. Most organizations are facing BigData problems.
Hadoop is an open-source framework from the Apache Software Foundation for solving BigData problems. It is written mainly in the Java programming language.
Google published two technical papers: one on the Google File System (GFS) in October 2003 and another on the MapReduce algorithm in December 2004. GFS is Google’s proprietary distributed file system, which stores and manages data efficiently and reliably on commodity hardware. MapReduce is a parallel, distributed programming model for processing and generating large datasets.
Google solved its BigData problems using these two components: GFS and the MapReduce algorithm.
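To make the MapReduce programming model a little more concrete, here is a minimal single-process sketch in Python. It is not Hadoop code and not Google's implementation; the function names (map_phase, shuffle_phase, reduce_phase, word_count) are made up for illustration, and everything runs on one machine instead of a cluster.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce one after another on a list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data big", "data hadoop"]))
```

In a real cluster the map and reduce steps would run on many nodes at the same time; the three phases stay the same.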
Hadoop was initially inspired by Google’s papers on the MapReduce algorithm and the Google File System (GFS), and it was designed and developed by following them.
All Apache Hadoop core modules are developed in Java. At the time of writing, the latest Hadoop version is 2.x.
What is Apache Hadoop?
Apache Hadoop is an open-source BigData solution framework for distributed storage, distributed computing and cloud computing using commodity hardware.
Apache Hadoop Official Website: https://hadoop.apache.org/
NOTE:- What is Commodity Hardware?
Commodity hardware means inexpensive, ordinary hardware built from standard components for general-purpose computing. It is cheap, non-enterprise hardware.
Hadoop is a data management software framework with scale-out storage and distributed processing.
It uses commodity hardware and delivers a cost-effective BigData solution through distributed computing. Some vendors also support Hadoop BigData solutions in the cloud, for example AWS (Amazon Web Services).
Any BigData Hadoop Solution mainly provides two kinds of services:
- Storage Service
- Computation Service
Why Apache Hadoop to Solve BigData Problems?
Apache Hadoop is open-source BigData solution software. We should use it for the following reasons:
- Open Source
- Very Reliable
- Highly Scalable
- Uses Commodity Hardware
Because existing tools cannot handle such a huge volume and variety of data, we can use the Apache Hadoop BigData solution to solve these problems.
Hadoop Advantages
Apache Hadoop provides the following benefits in solving BigData Problems:
- Open Source
- High Availability
- Highly Scalable
- Better Performance
- Handles Huge and Varied types of Data
- Cost-Effective BigData Solutions
- Increases Profits
- Very Flexible
- Fault-Tolerant Architecture
- Solves Complex Problems
Apache Hadoop is an open-source BigData solution with a free license from the Apache Software Foundation.
Hadoop uses a replication technique: every piece of data is stored on more than one node. By default the replication factor is 3. If required, we can change this value.
If one node goes down for some reason, Hadoop automatically reads the data from another nearby, available node that holds a copy. Hadoop detects the failed node automatically and re-creates the lost copies on healthy nodes, so the data remains highly available.
As a result, Apache Hadoop is designed to deliver BigData solutions with little or no downtime.
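The replication idea can be sketched with a small Python simulation. This is only an illustration under simple assumptions: the node names, block names and the round-robin placement are made up here, and real Hadoop uses its own, smarter placement policy.

```python
REPLICATION_FACTOR = 3  # Hadoop's default replication factor, as stated above


def place_replicas(block_ids, nodes, replication=REPLICATION_FACTOR):
    """Assign every block to `replication` distinct nodes (plain round-robin)."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        block: [nodes[(i + r) % len(nodes)] for r in range(replication)]
        for i, block in enumerate(block_ids)
    }


def readable_blocks(placement, failed_nodes):
    """Blocks that still have at least one copy on a node that is up."""
    return {
        block
        for block, holders in placement.items()
        if any(node not in failed_nodes for node in holders)
    }


def re_replicate(placement, failed_nodes, nodes, replication=REPLICATION_FACTOR):
    """Restore the replica count of each surviving block using live nodes."""
    live = [n for n in nodes if n not in failed_nodes]
    target = min(replication, len(live))
    repaired = {}
    for block, holders in placement.items():
        survivors = [n for n in holders if n not in failed_nodes]
        if not survivors:
            repaired[block] = []  # every copy was lost; nothing to copy from
            continue
        spares = [n for n in live if n not in survivors]
        repaired[block] = survivors + spares[: target - len(survivors)]
    return repaired


if __name__ == "__main__":
    nodes = ["n1", "n2", "n3", "n4", "n5"]      # hypothetical node names
    placement = place_replicas(["b0", "b1", "b2", "b3"], nodes)
    print(readable_blocks(placement, {"n3"}))    # all blocks survive one failure
    print(re_replicate(placement, {"n3"}, nodes))
```

With a replication factor of 3, a block becomes unreadable only when all three nodes holding it are down at the same time.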
Hadoop is highly scalable, because it can store and distribute very large amounts of data across hundreds or thousands of commodity machines that operate in parallel. We can scale it horizontally (by adding more nodes) or vertically (by adding resources to existing nodes), based on our project requirements.
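As a rough, back-of-the-envelope sketch of horizontal scaling, the following Python function estimates how many nodes are needed to store a given amount of data. The function name nodes_needed and the 25% headroom reserved on each disk are assumptions made for this example, not Hadoop settings; only the replication factor of 3 comes from this post.

```python
import math


def nodes_needed(data_tb, disk_per_node_tb, replication=3, headroom=0.25):
    """Estimate the number of nodes needed to store `data_tb` terabytes.

    replication: copies kept of every piece of data (3 by default).
    headroom:    assumed fraction of each disk kept free for other uses.
    """
    if not 0 <= headroom < 1:
        raise ValueError("headroom must be in the range [0, 1)")
    raw_tb = data_tb * replication                        # space used by all copies
    usable_per_node_tb = disk_per_node_tb * (1 - headroom)
    return math.ceil(raw_tb / usable_per_node_tb)


if __name__ == "__main__":
    print(nodes_needed(100, 8))   # 100 TB of data on 8 TB nodes
    print(nodes_needed(200, 8))   # doubling the data doubles the node count
```

When the data grows, we simply add more nodes of the same cheap kind instead of buying a bigger machine.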
Even though Hadoop uses commodity hardware, it distributes the work across different nodes and performs those tasks in parallel. As a result, it can process very large datasets, up to petabytes (PB) of data, much faster than a single machine could, and so gives better performance.
NOTE:- A node is any commodity computer in a Hadoop cluster.
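The "split the work and run it in parallel" idea can be sketched in Python as follows. This is only an analogy: worker threads stand in for Hadoop nodes, the function names are made up for this example, and Python threads will not actually speed up CPU-heavy work, so the sketch shows the structure rather than the performance gain.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(records, num_chunks):
    """Split records into contiguous chunks of roughly equal size."""
    if not records:
        return []
    size = -(-len(records) // num_chunks)  # ceiling division
    return [records[i:i + size] for i in range(0, len(records), size)]


def process_chunk(chunk):
    """The work one 'node' does on its own chunk (here: a partial sum)."""
    return sum(chunk)


def parallel_total(records, workers=4):
    """Process every chunk on its own worker, then combine the partial results."""
    chunks = split_into_chunks(records, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_results = list(pool.map(process_chunk, chunks))
    return sum(partial_results)


if __name__ == "__main__":
    print(parallel_total(list(range(1, 101))))
```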
Hadoop handles huge amounts of varied data by using parallel computing techniques.
Unlike traditional relational databases and tools, Hadoop uses inexpensive, non-enterprise commodity hardware to set up Hadoop clusters. We don’t need to buy expensive, high-capacity, high-performance hardware to solve our BigData problems. Hadoop uses cheap hardware and delivers very effective solutions.
Because we build our BigData network from cheap commodity hardware, costs go down and profits go up. If we use cloud technology to solve BigData problems, we can improve our profits further.
Hadoop can accept data in any format from different data sources. We can integrate new data sources with a Hadoop system and use them easily.
Hadoop provides a fault-tolerant architecture for solving BigData problems. We will discuss how it is resilient to failures in my coming posts.
As Hadoop follows distributed/parallel processing, it can solve complex problems that would be impractical on a single machine.
Hadoop is Suitable For
Apache Hadoop is suitable to solve the following kinds of BigData problems:
- Recommendation systems
- Processing Very Big DataSets
- Processing Diverse Data
- Log Processing
- Processing Data at rest
Hadoop is NOT Suitable For
Hadoop is not suitable for all BigData solutions. The following are a few scenarios where a BigData Hadoop solution is not suitable:
- Processing small DataSets
- Executing Complex Queries
- Processing Data in motion (streaming data)
Hadoop BigData Solutions
There are many BigData Hadoop solutions in the current market. The most popular solutions are:
- Apache Hadoop
- Cloudera BigData Hadoop Solution
- Hortonworks Data Platform (HDP)
- Google Cloud BigData Solution
- AWS EMR (Amazon Web Services – Elastic MapReduce)
All of the above BigData Hadoop solutions are built on Apache Hadoop software.
Hadoop Deployment Modes
Apache Hadoop can be installed and run in the following three modes:
- Standalone Mode
- Pseudo Distributed Mode
- Fully Distributed Mode
Standalone Mode is used for simple analysis or debugging. It is not a distributed or clustered architecture; Hadoop is installed on a single node and runs as a single Java process, for testing purposes.
Pseudo Distributed Mode is also installed on a single node, but it simulates an installation on multiple servers: each Hadoop daemon runs as a separate process on the same machine. It creates a simulated Hadoop cluster of nodes that is not really distributed. It is mainly useful for preparing a POC (Proof Of Concept) and for testing a multi-node, clustered Hadoop system.
Fully Distributed Mode is a real, fully distributed Hadoop cluster architecture. It is used in live (production) BigData solutions.
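To give a feel for how the three modes differ in practice, the sketch below uses Python to render the kind of settings that usually distinguish them. Treat it as an assumption-laden illustration: fs.defaultFS and dfs.replication are real Hadoop property names, and the pseudo-distributed values follow the official single-node setup guide, but the host name namenode.example.com and the port used for the fully distributed case are hypothetical, and a real cluster needs many more settings than these.

```python
import xml.etree.ElementTree as ET

# Typical settings per mode, grouped by the config file they belong to.
MODE_SETTINGS = {
    "standalone": {
        "core-site.xml": {"fs.defaultFS": "file:///"},   # local file system
        "hdfs-site.xml": {},
    },
    "pseudo-distributed": {
        "core-site.xml": {"fs.defaultFS": "hdfs://localhost:9000"},
        "hdfs-site.xml": {"dfs.replication": "1"},       # only one node exists
    },
    "fully-distributed": {
        # Hypothetical NameNode address, used only for illustration.
        "core-site.xml": {"fs.defaultFS": "hdfs://namenode.example.com:9000"},
        "hdfs-site.xml": {"dfs.replication": "3"},
    },
}


def render_configuration(properties):
    """Render a dict of properties in Hadoop's <configuration> XML layout."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    for filename, props in MODE_SETTINGS["pseudo-distributed"].items():
        print(filename, render_configuration(props))
```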
Hadoop 2.x Components
Apache Hadoop Version 2.x has the following three major Components:
- HDFS (Hadoop Distributed File System)
- YARN (Yet Another Resource Negotiator)
- MapReduce
We will discuss the major differences between Hadoop V1.x and V2.x, and also how these components work together in a Hadoop environment to solve BigData problems, in my coming posts.
Hadoop 2.x Components Responsibilities
The main responsibilities of Apache Hadoop 2.x Components are:
- Data Storage
- Resource Management
- Data Integration
- Data Governance
- Data and Batch Processing
- Data Analysis
- Real-time Computing
We will discuss in detail which individual Hadoop component is responsible for each of these tasks in my coming posts.
BigData Life-Cycle Management
Generally, Hadoop systems use the following life-cycle to manage their BigData:
First, data sources create BigData. A data source can be anything: social media, the Internet, mobile devices, computers, documents, audio and video, cameras, sensors, etc. Once BigData is created, it is captured and processed into a suitable format to be stored in the Hadoop storage system.
After the BigData is stored in Hadoop storage, it is transformed and stored in a NoSQL or Hadoop database.
Then we use Hadoop tools to analyse the BigData and prepare reports.
Businesses or organizations go through those reports or visualizations to understand their needs and take the necessary actions to improve business value.
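The life-cycle stages described above can be sketched as a tiny Python pipeline. Everything here is a toy stand-in: the stage functions, the "source: payload" event format and the in-memory list used as storage are assumptions for illustration, not real Hadoop tools.

```python
from collections import Counter


def capture(raw_events):
    """Capture: turn raw 'source: payload' strings into uniform records."""
    records = []
    for event in raw_events:
        source, _, payload = event.partition(":")
        records.append({"source": source.strip().lower(), "payload": payload.strip()})
    return records


def store(records, storage):
    """Store: append the captured records to the storage layer (a list here)."""
    storage.extend(records)
    return storage


def transform(storage):
    """Transform: keep only the records that actually carry a payload."""
    return [record for record in storage if record["payload"]]


def analyse(table):
    """Analyse: count how many records each data source produced."""
    return Counter(record["source"] for record in table)


def report(summary):
    """Report: format the analysis result for business users."""
    return [f"{source}: {count} record(s)" for source, count in sorted(summary.items())]


def run_life_cycle(raw_events):
    """Run all stages in order: capture, store, transform, analyse, report."""
    storage = store(capture(raw_events), [])
    return report(analyse(transform(storage)))


if __name__ == "__main__":
    events = ["Sensor: 21.5", "social: hello", "sensor: 22.0", "mobile:"]
    for line in run_life_cycle(events):
        print(line)
```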
If you don’t understand some of the terminology at this stage, don’t worry. Read the next posts and practice some programs. Once you are clear about the Hadoop architecture and how the Hadoop components work, come back to this page and read it again. I bet you will get a clear idea of these concepts.
That’s all about the Hadoop introduction and BigData life-cycle management. We will discuss the Hadoop architecture and how Hadoop components work in my coming posts.
Please drop me a comment if you like my post or have any issues/suggestions.