Hadoop Hive Introduction




Hadoop Hive Overview

Apache Hive is similar in purpose to Apache Pig: it lets you create tables and load external files into them using a SQL-like language, then compiles those queries into MapReduce jobs written in Java. Because Java is a verbose language, using Pig or Hive is much simpler than writing MapReduce jobs by hand.

Hive is often described as a data warehouse tool. (Bluntly put, that means the kind of RDBMS used to do analytics before Hadoop was invented.) In fact, you can use Apache Sqoop to load data from a relational database into Hive or Hadoop.

Hadoop Hive Installation

You need to install Apache Hadoop first. Hive also needs a database to store its metadata; it can use embedded Derby, but here we use MySQL. The table data itself is stored in Hadoop (HDFS).

So first install Hadoop. Then execute these steps:


apt-get install mysql-server
apt-get install libmysql-java

Download and extract Hive to /usr/local/hive. Then in .bashrc set:


export HIVE_HOME=/usr/local/hive/apache-hive-2.0.1-bin
export PATH=$PATH:$HIVE_HOME/bin

Make this softlink:


ln -s /usr/share/java/mysql-connector-java.jar $HIVE_HOME/lib/mysql-connector-java.jar

Then open MySQL to create the metastore schema and user:


mysql -u root -p
mysql> CREATE DATABASE metastore;
mysql> USE metastore;

Use the schema file that most nearly matches the version you are running. (In this tutorial we are using Hive 2.0.1, but there is no 2.0.1 schema, so we use the 2.0.0 one.)


mysql> SOURCE /usr/local/hive/apache-hive-2.0.1-bin/scripts/metastore/upgrade/mysql/hive-schema-2.0.0.mysql.sql

mysql> CREATE USER 'hiveuser'@'%' IDENTIFIED BY 'password';

mysql> GRANT ALL ON *.* TO 'hiveuser'@'localhost' IDENTIFIED BY 'password';
mysql> FLUSH PRIVILEGES;

Now edit $HIVE_HOME/conf/hive-site.xml:

<configuration>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://localhost/metastore?createDatabaseIfNotExist=true</value>
<description>metadata is stored in a MySQL server</description>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
<description>MySQL JDBC driver class</description>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>hiveuser</value>
<description>user name for connecting to mysql server</description>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>password</value>
<description>password for connecting to mysql server</description>
</property>
</configuration>

If you have set up Hadoop as a cluster, you might need to take HDFS out of safe mode with the next command. You will know that is the case if Hive fails with a safe-mode error.


$HADOOP_HOME/bin/hdfs dfsadmin -safemode leave

Also, if you have installed Spark, you can run the following to get rid of a warning message:


unset SPARK_HOME

Now run the Hive shell:


hive

Much of what you do with Hive is load external files stored in Hadoop so that you can query them with SQL. Here is an example.

Download this file (an SSH log), then replace the spaces with commas so that it is comma-delimited. You can do that in Google Sheets or Excel, or with sed or vi.
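The sed route can be sketched like this (the file names ssh.log and ssh.csv are assumptions; adjust them to where you saved the download):

```shell
# Create a tiny sample line standing in for the downloaded log,
# then replace every space with a comma to make it comma-delimited.
printf '1331901012 CTHcOo3BARDOPDjYue 192.168.202.68\n' > ssh.log
sed 's/ /,/g' ssh.log > ssh.csv
cat ssh.csv
```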

Then the top of the file looks like this:


1331901012,CTHcOo3BARDOPDjYue,192.168.202.68,53633,192.168.28.254,22,failure,INBOUND,SSH-2.0-OpenSSH_5.0,SSH-1.99-Cisco-1.25,-,-,-,-,-
1331901030,CBHpSz2Zi3rdKbAvwd,192.168.202.68,35820,192.168.23.254,22,failure,INBOUND,SSH-2.0-OpenSSH_5.0,SSH-1.99-Cisco-1.25,-,-,-,-,-

Now copy the file to Hadoop:


hadoop fs -put /home/walker/Downloads/ssh.csv /data/ssh/

Notice that you give the directory name, and not the name of the file, as the target in Hadoop. Then we load the file and create the schema all in one step:


CREATE EXTERNAL TABLE IF NOT EXISTS sshlog(
sshtime STRING,
sshkey STRING,
sourceIP STRING,
socket STRING,
targetIP STRING,
port STRING,
status STRING,
direction STRING,
software STRING,
device STRING,
junk1 STRING,
junk2 STRING,
junk3 STRING,
junk4 STRING,
junk5 STRING )
    COMMENT 'ssh log'
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
    LOCATION '/data/ssh/';



Python User Defined Function (UDF)

Before we show some SQL commands (technically they are called HQL, Hive Query Language), we will show how to write a Python UDF:

Create the Python program ssh.py. Below we explain how it works.


import sys
import re

def bi(s):
    # Convert a dotted-quad IP address to a string of binary octets.
    # (Note: this uses only the first three octets.)
    g = re.search(r'(\d+)\.(\d+)\.(\d+)\.(\d+)', s)
    if g:
        return bin(int(g.group(1))) + bin(int(g.group(2))) + bin(int(g.group(3)))

for line in sys.stdin:
    x, y, z = line.strip().split('\t')
    print x, bi(y), bi(z)

Make the script available to Hive by executing this command in the Hive shell:


add file /home/walker/Downloads/ssh.py

Then run this transform operation calling the Python UDF:


SELECT TRANSFORM (sshtime, sourceip, targetip) USING 'python /home/walker/Downloads/ssh.py' AS (sshtime, sourceip, targetip) FROM sshlog;

The first rows of output look like this:

1332017778 0b110000000b101010000b11001010 0b110000000b101010000b10101	NULL	NULL
1332017793 0b110000000b101010000b11001010 0b110000000b101010000b10101	NULL	NULL

The Python script reads from stdin and writes to stdout. Hive passes each row to it as a tab-delimited line, which we split into a time field and two IP address fields. We then run a function to convert each IP address to binary format. The TRANSFORM clause tells Hive to pass those three columns to the UDF.
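To see what the conversion produces, here is a standalone Python 3 sketch of the same idea (the name ip_to_bits is hypothetical; it mirrors the bi() function in ssh.py):

```python
import re

def ip_to_bits(s):
    # Binary form of the first three octets of a dotted-quad IP address,
    # mirroring the UDF's bi() function.
    g = re.search(r'(\d+)\.(\d+)\.(\d+)\.(\d+)', s)
    if g:
        return bin(int(g.group(1))) + bin(int(g.group(2))) + bin(int(g.group(3)))

print(ip_to_bits('192.168.202.68'))  # 0b110000000b101010000b11001010
```

This matches the second field of the output shown above: 192, 168, and 202 in binary, concatenated.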

Exercise: why does the query return two NULL columns?

Create a table and insert data

Now we go back and do something simpler: create a table and insert some data into it to illustrate HQL functions.

First create a table. People who know SQL will see that it is almost the same syntax.


create table students (student string, age int);

Then add some data into it:


insert into table students values('Walker', 33);
insert into table students values('Sam', 33);
insert into table students values('Sally', 33);
insert into table students values('Sue', 72);
insert into table students values('George', 56);
insert into table students values('William', 64);
insert into table students values('Ellen', 24);
insert into table students values('Jose', 72);
insert into table students values('Li', 56);
insert into table students values('Chris', 64);
insert into table students values('Ellen', 24);
insert into table students values('Sue', 72);
insert into table students values('Ricardo', 56);
insert into table students values('Wolfgang', 64);
insert into table students values('Melanie', 24);
insert into table students values('Monica', 36);

Select

Now we can run regular SQL commands over that table. Remember, this is not a relational database: Hadoop, as you will recall, does not allow updating files in place, only adding and deleting them, so Hive creates new files for every insert operation.


select count(*) from students;

Here we can find all students whose age is a multiple of 3 by using pmod, the positive modulo (remainder) function. (Hive has many math functions.)


select age,pmod(age,3) from students where pmod(age,3) = 0;
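To check what that query should return, here is the same filter in plain Python, using the ages inserted above (for these positive values, Python's % operator behaves like Hive's pmod):

```python
# Ages from the students table inserted above, in insert order.
ages = [33, 33, 33, 72, 56, 64, 24, 72, 56, 64, 24, 72, 56, 64, 24, 36]

# Keep only the ages whose remainder mod 3 is zero, i.e. multiples of 3.
multiples = [a for a in ages if a % 3 == 0]
print(multiples)  # [33, 33, 33, 72, 24, 72, 24, 72, 24, 36]
```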

Now, without discriminating against older people, we create two new tables: one for people older than 45 and one for people aged 45 or younger:


create table old as select * from students where age > 45;

create table young as select * from students where age <= 45;

Join

Now we can join the two tables, pairing students from the young and old tables whose names are 3 letters long. (Note that there is no equality key here, so this is effectively a cross join with a filter.)


select young.student, old.student from young join old on (length(young.student) = 3) and (length(old.student) = 3);

It responds:


OK
Sam	Sue
Sam	Sue
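Because there is no join key, every 3-letter name in young is paired with every 3-letter name in old. A quick Python sketch of the same logic, using the names inserted above, reproduces the two rows:

```python
from itertools import product

# Names from the students table, split by the age > 45 rule above.
young = ['Walker', 'Sam', 'Sally', 'Ellen', 'Ellen', 'Melanie', 'Monica']          # age <= 45
old = ['Sue', 'George', 'William', 'Jose', 'Li', 'Chris', 'Sue', 'Ricardo', 'Wolfgang']  # age > 45

# Cross join, filtered to 3-letter names on both sides.
pairs = [(y, o) for y, o in product(young, old)
         if len(y) == 3 and len(o) == 3]
print(pairs)  # [('Sam', 'Sue'), ('Sam', 'Sue')]
```

Sue appears twice in the old table, which is why Sam is paired with her twice.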

Maps

Hive supports the complex types map, struct, and array. But it does not yet support inserting data into those directly with SQL. So we show the somewhat awkward way of doing that below.

First, create a table with a map column, meaning a (key->value) column:

create table prices(product map<string,int>);

Now we use the students table we created above to stand in as a proxy for this insert operation. Any table with at least one row will do.

insert into prices select map("abc", 1) from students limit 1;

Now you can see the data we just inserted:

select * from prices;
OK
{"abc":1}



http://www.bmc.com/guides/hadoop-hive.html