Installing Apache Hive on Ubuntu and Running HQL Queries

Filed Under: Big Data

In this lesson, we will see how to get started with Apache Hive by installing it on an Ubuntu machine and verifying the installation by running some Hive DDL commands. Installing and running Apache Hive can be tricky, so we'll keep this lesson as simple and informative as possible.

In this installation guide, we will make use of Ubuntu 17.10 (GNU/Linux 4.13.0-37-generic x86_64) machine:

Ubuntu Version

Prerequisites for Hive Installation

Before we can proceed to Hive Installation on our machine, we need to have some other things installed as well:

Java Setup

Before we can start installing Hive, we need to update Ubuntu with the latest software patches available:

sudo apt-get update && sudo apt-get -y dist-upgrade

Next, we need to install Java on the machine, as it is the main prerequisite for running Hive and Hadoop. Note that recent Hive releases (1.2 onward) require Java 7 or newer; let's install Java 8 for this lesson:

sudo apt-get -y install openjdk-8-jdk-headless
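To confirm the JDK is in place and locate its home directory (we'll need that path later when setting JAVA_HOME), a quick hedged check can help. The `java-8-openjdk-amd64` path below is the usual location for this package on Ubuntu, but verify it on your own machine:

```shell
# Print the installed Java version and resolve the real path of the
# java binary; its enclosing JVM directory is the JDK home.
if command -v java >/dev/null 2>&1; then
    java -version
    readlink -f "$(command -v java)"
fi
# Usual JDK home for the openjdk-8-jdk-headless package on Ubuntu:
JAVA_HOME_CANDIDATE=/usr/lib/jvm/java-8-openjdk-amd64
```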

Getting Started with Hive Installation

Once Java is installed and a working Hadoop setup is in place (this guide assumes Hadoop is already installed under /usr/local/hadoop), we are ready to download Hive.

All Hive releases can be found in the Apache Hive archives. Now, run the following set of commands to make a new directory and download the Hive 2.3.3 installation archive from the mirror site:


mkdir hive
cd hive
wget http://www-eu.apache.org/dist/hive/hive-2.3.3/apache-hive-2.3.3-bin.tar.gz

With this, a new file apache-hive-2.3.3-bin.tar.gz will be downloaded on the system:

Downloading Hive

Let us uncompress this file now:

tar -xvf apache-hive-2.3.3-bin.tar.gz

Now, the periods in the directory name can cause trouble when the path is used in environment variables. To avoid such issues, rename the unarchived directory:

mv apache-hive-2.3.3-bin apache_hive
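The download, extract, and rename steps can also be parameterized with the release version in a single variable, which makes it easy to adapt this guide to a newer Hive release. This is a sketch that only prints the commands; the mirror URL layout is assumed from the wget command above:

```shell
# Release used in this guide; change this to pick another version.
HIVE_VERSION=2.3.3
ARCHIVE="apache-hive-${HIVE_VERSION}-bin.tar.gz"
MIRROR="http://www-eu.apache.org/dist/hive/hive-${HIVE_VERSION}"

# The three steps from above, parameterized (run them on your machine):
echo "wget ${MIRROR}/${ARCHIVE}"
echo "tar -xvf ${ARCHIVE}"
# The rename avoids periods in the directory name used in PATH entries.
echo "mv apache-hive-${HIVE_VERSION}-bin apache_hive"
```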

Once this is done, we need to add the Hive home directory to the PATH. Run the following set of commands to edit the .bashrc file:


cd
vi .bashrc

Add the following lines in the .bashrc file and save it:


export HIVE_HOME=$HOME/hive/apache_hive
export PATH=$PATH:$HIVE_HOME/bin
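If you prefer not to edit the file by hand, the same two lines can be appended non-interactively. This is a sketch: it assumes the hive/apache_hive layout created above and does not check whether the lines are already present:

```shell
# Append the Hive environment variables to ~/.bashrc without an editor.
# The quoted 'EOF' keeps $HOME and $PATH literal in the written file.
cat >> "$HOME/.bashrc" <<'EOF'
export HIVE_HOME=$HOME/hive/apache_hive
export PATH=$PATH:$HIVE_HOME/bin
EOF
```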

Now, to make the environment variables take effect, source the .bashrc file:

source .bashrc

Note that the path to Hadoop is already set in our file, so the overall configuration looks like this:


# Configure Hadoop and Java Home
export HADOOP_HOME=/usr/local/hadoop
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64

export PATH=$PATH:$HADOOP_HOME/bin

export HIVE_HOME=$HOME/hive/apache_hive
export PATH=$PATH:$HIVE_HOME/bin

If you want to confirm that Hadoop is correctly working, just check its version:

hadoop version

Check Hadoop version

Now, we need to create the directory in the Hadoop Distributed File System (HDFS) where Hive will store its data. For this, we will make a new directory:

hdfs dfs -mkdir -p /root/hive/warehouse
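The Hive documentation also suggests creating /tmp in HDFS and making both directories group-writable before you create tables. A hedged sketch, guarded so it only runs where the hdfs client exists; note that /user/hive/warehouse is the conventional warehouse location, while this guide uses /root/hive/warehouse:

```shell
# Warehouse directory used in this guide.
WAREHOUSE_DIR=/root/hive/warehouse

# Create /tmp and the warehouse directory in HDFS and make them
# group-writable, as the Hive getting-started docs recommend.
if command -v hdfs >/dev/null 2>&1; then
    hdfs dfs -mkdir -p /tmp "$WAREHOUSE_DIR"
    hdfs dfs -chmod g+w /tmp "$WAREHOUSE_DIR"
fi
```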

Once this is done, there is one last configuration step before we can launch the Hive shell: we need to tell Hive which database to use for its schema definitions. Execute the following line so that Hive can initialize the metastore schema:

$HIVE_HOME/bin/schematool -initSchema -dbType derby

When we execute the command, we will see the following success output:

Hive metastore schema initialization
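If initialization fails with a message that the schema already exists, a previous run probably left a metastore behind. With the embedded Derby database, the metastore lives in a metastore_db directory relative to where the command was launched; removing it resets the metastore, but be aware that all Hive metadata is lost. A hedged sketch:

```shell
# Remove a stale embedded Derby metastore, if one exists in the
# current directory (this deletes all existing Hive metadata).
if [ -d metastore_db ]; then
    rm -rf metastore_db
fi

# Re-run schema initialization (skipped when Hive is not installed).
if [ -x "${HIVE_HOME:-/nonexistent}/bin/schematool" ]; then
    "$HIVE_HOME/bin/schematool" -initSchema -dbType derby
fi
```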

Starting the Hive Shell

After all this configuration is done, Hive can be launched with a single and simple command:

hive

If everything worked correctly, you should see the hive shell appearing magically:

Starting Hive shell

Using the Hive Shell

Now that we have a Hive shell running, we will put it to use with some basic Hive DDL commands written in the Hive Query Language (HQL).

HQL: Creating a Database

As with any other database system, we can start using Hive only after we create a database. Let's do this now:

CREATE DATABASE journaldev;

We will see the following output:

Create Database in Hive

A better way to create a database is to create it only if it doesn't already exist:

CREATE DATABASE IF NOT EXISTS journaldev;

We will see the same output here as well:

Create Database in Hive, if not exists

Now we can list the databases that exist in Hive:

show databases;

This will result in the following:

Show Databases using HQL

HQL: Creating Tables

We have an active database present where we can create some tables as well. To do this, first switch to the DB you want to use:

use journaldev;

Now, create a new table inside this DB with some fields:

create table blogs(blog_id INT, blog_title STRING, blog_link STRING);

Once this table is created, we can show its schema as:

describe blogs;

We will see the following output:

Table metadata

HQL: Inserting Data into Tables

Finally, let us insert a record into the table we just created:

INSERT INTO TABLE blogs VALUES (1, 'Introduction to Hive', 'https://www.journaldev.com/20353/installing-apache-hive-on-ubuntu-and-sample-queries');

We will see a long output, as Hive (with the help of Hadoop) starts MapReduce jobs to perform the data insertion into the warehouse we created. The output will look like this:

Insert Data into Hive

Finally, we can see the data in Hive as:

select * from blogs;

Show all data in Hive
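The same queries can also be run non-interactively from the command line, which is handy for scripting. The Hive CLI accepts a query string with -e or a script file with -f; a sketch, assuming hive is on the PATH as configured above and the blogs_query.hql filename is our own choice:

```shell
# Write the HQL to a script file...
cat > blogs_query.hql <<'EOF'
USE journaldev;
SELECT * FROM blogs;
EOF

# ...and hand it to the Hive CLI with -f (or pass a one-liner with -e).
if command -v hive >/dev/null 2>&1; then
    hive -f blogs_query.hql
fi
```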

Conclusion

In this lesson, we saw how to install Apache Hive on an Ubuntu server and start executing sample HQL queries on it. Read more Big Data posts to gain deeper knowledge of available Big Data tools and processing frameworks.
