In this lesson, we will get started with Apache Hive by installing it on our Ubuntu machine and verifying the installation by running some Hive DDL commands. Installing and running Apache Hive can be tricky, so we'll try to keep this lesson as simple and informative as possible.
In this installation guide, we will make use of an Ubuntu 17.10 (GNU/Linux 4.13.0-37-generic x86_64) machine.
Prerequisites for Hive Installation
Before we can proceed to Hive Installation on our machine, we need to have some other things installed as well:
- Java must be installed
- Hadoop must be installed and the cluster must be configured
Java Setup
Before we can start installing Hive, we need to update Ubuntu with the latest software patches available:
sudo apt-get update && sudo apt-get -y dist-upgrade
Next, we need to install Java on the machine, as Java is the main prerequisite for running Hive and Hadoop. Hive supports Java 6 and above. Let's install Java 8 for this lesson:
sudo apt-get -y install openjdk-8-jdk-headless
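To quickly confirm that the JDK was installed correctly, check the reported version:

java -version

This should print an OpenJDK 1.8 version string.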
Getting Started with Hive Installation
Once you have installed Java and Hadoop based on the instructions presented above, we are ready to download Hive.
All Hive releases can be found in the Apache Hive archives. Now, run the following set of commands to make a new directory and download the latest available Hive installation archive (2.3.3 at the time of writing) from the mirror site:
mkdir hive
cd hive
wget https://www-eu.apache.org/dist/hive/hive-2.3.3/apache-hive-2.3.3-bin.tar.gz
With this, a new file apache-hive-2.3.3-bin.tar.gz will be downloaded to the system.
Let us uncompress this file now:
tar -xvf apache-hive-2.3.3-bin.tar.gz
Now, the periods and hyphens in the directory name make it clumsy to reference in path variables and scripts. To avoid such issues, rename the unarchived directory:
mv apache-hive-2.3.3-bin apache_hive
Once this is done, we need to add the Hive home directory to the PATH. Run the following commands to edit the .bashrc file:
cd
vi .bashrc
Add the following lines to the .bashrc file and save it:
export HIVE_HOME=$HOME/hive/apache_hive
export PATH=$PATH:$HIVE_HOME/bin
Now, to make the environment variables take effect, source the .bashrc file:
source .bashrc
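To confirm that the new variables took effect and that the Hive binaries are now on the PATH, you can run:

echo $HIVE_HOME
which hive

The first command should print the Hive home directory we just configured, and the second should resolve to $HIVE_HOME/bin/hive.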
Note that the path to Hadoop is already set in our file, so the overall configuration looks like this:
# Configure Hadoop and Java Home
export HADOOP_HOME=/usr/local/hadoop
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export PATH=$PATH:$HADOOP_HOME/bin
export HIVE_HOME=$HOME/hive/apache_hive
export PATH=$PATH:$HIVE_HOME/bin
If you want to confirm that Hadoop is working correctly, just check its version:
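hadoop version

This prints the installed Hadoop version along with its build information.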
Now, we need to configure the directory where Hive will store its data in the Hadoop Distributed File System (HDFS). For this, we will make a new directory:
hdfs dfs -mkdir -p /root/hive/warehouse
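The official Hive getting-started guide also recommends creating a writable /tmp directory on HDFS and making both directories group-writable. If you later hit permission errors, these extra commands (a commonly suggested setup step, not part of the original walkthrough) should help:

hdfs dfs -mkdir -p /tmp
hdfs dfs -chmod g+w /tmp
hdfs dfs -chmod g+w /root/hive/warehouse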
Once this is done, we have one last configuration step before we can launch the Hive shell: we need to tell Hive which database it should use for its schema definitions. Here we use the embedded Derby database, so we execute the following line to initialize the metastore schema:
$HIVE_HOME/bin/schematool -initSchema -dbType derby
When we execute the command, schematool reports that the schema initialization completed successfully.
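If you want to double-check the result, schematool can also report the current schema version (a quick optional verification, assuming the embedded Derby metastore initialized above):

$HIVE_HOME/bin/schematool -info -dbType derby

Note that the embedded Derby metastore lives in a metastore_db directory created in the current working directory, so run this from the same directory where you ran -initSchema.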
Starting the Hive Shell
After all this configuration is done, Hive can be launched with a single, simple command:
hive
If everything worked correctly, you should see the hive> shell prompt appear.
Using the Hive Shell
Now that we have a Hive shell running, we will put it to use with some basic Hive DDL commands written in the Hive Query Language (HQL).
HQL: Creating a Database
As with any other database system, we can start using Hive only after we create a database. Let's do this now:
CREATE DATABASE journaldev;
Hive will respond with an OK message and the time taken.
A better way to create a database is to first check that it doesn't already exist:
CREATE DATABASE IF NOT EXISTS journaldev;
We will see the same output here as well.
Now we can show databases which exist in Hive:
show databases;
This will list the default database along with the journaldev database we just created.
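Hive can also filter the database list with a wildcard pattern; for example (an optional extra, using standard HQL pattern matching):

show databases like 'journal*';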
HQL: Creating Tables
We now have an active database in which we can create some tables. To do this, first switch to the database you want to use:
use journaldev;
Now, create a new table inside this DB with some fields:
create table blogs(blog_id INT, blog_title STRING, blog_link STRING);
Once this table is created, we can show its schema as:
describe blogs;
We will see the three columns we defined along with their data types: blog_id int, blog_title string, and blog_link string.
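If you plan to load plain text files into a table, a common variant of the DDL (a sketch using standard HQL, not part of the original walkthrough) declares the field delimiter explicitly:

create table blogs_csv(blog_id INT, blog_title STRING, blog_link STRING)
row format delimited
fields terminated by ','
stored as textfile;

With this definition, Hive will split each line of the underlying files on commas when reading the table.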
HQL: Inserting Data into Tables
As a final set of commands, let us insert a record into the table we just created:
INSERT INTO TABLE blogs VALUES (1, 'Introduction to Hive', 'https://www.journaldev.com/20353/installing-apache-hive-on-ubuntu-and-sample-queries');
We will see a long output as Hive, with the help of Hadoop, starts MapReduce jobs to carry out the data insertion into the warehouse we created.
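For bulk data, row-by-row INSERT statements are slow since each one launches its own job. A common alternative (standard HQL, assuming a hypothetical local file blogs.csv matching the comma-delimited table sketched earlier) is to load a file directly:

LOAD DATA LOCAL INPATH '/home/ubuntu/blogs.csv' INTO TABLE blogs_csv;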
Finally, we can see the data in Hive as:
select * from blogs;
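We will see the record we just inserted. Standard HQL filtering works as expected too; for example (a trivial extra query, not from the original walkthrough):

select blog_title from blogs where blog_id = 1;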
Conclusion
In this lesson, we saw how to install Apache Hive on an Ubuntu server and start executing sample HQL queries on it. Read more Big Data posts to gain deeper knowledge of available Big Data tools and processing frameworks.