Getting data into Hive using Hadoop on Azure

Once you’ve managed to run a MapReduce job in Hadoop on Azure, the output needs to end up somewhere friendlier than a text file dumped on HDFS. Hadoop on Azure comes equipped with Hive, a database-like means of accessing your data (which you can then connect to using the Hive ODBC driver).

There are two mechanisms for doing this: first, create a Hive table that points directly at the output directory; or second, load the output into a Hive table. I’ll tackle both mechanisms in this post.

1: Point your Hive table directly at the data

In the Interactive Console, choose Hive to be presented with this rather ugly screen:

Fig 1: Hadoop on Azure Hive Interactive Console

Here we can create an “external table”, i.e. a Hive table that points at files in an existing directory, by entering the table DDL into the lowermost text box and hitting “Evaluate” – a bit of a misnomer, as it also executes the statement if the DDL is valid. An example is below:

CREATE EXTERNAL TABLE output_file_table (
clientid STRING,
evtdata STRING,
amount STRING
)
COMMENT 'This is my output'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '32'
STORED AS TEXTFILE
LOCATION '/example/data/StreamingOutput/mytj/';
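A quick note on that delimiter: Hive interprets a purely numeric delimiter string as a byte value, so '32' here means ASCII 32, i.e. a space. If your streaming job had emitted tab-separated output instead (an assumption – check what your reducer actually writes), the clause would be the more familiar:

ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'

The octal escape '\040' is another way of writing the space delimiter.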

See the Hive DDL documentation for the full language reference.

This allows us to run Hive queries directly against the text file, such as:

SELECT * FROM output_file_table LIMIT 10

This will show us the first ten rows of data. Alternatively, we can perform SQL-like operations to aggregate:

SELECT clientid, SUM(amount)
FROM output_file_table
GROUP BY clientid

It’s interesting to note that this aggregation kicks off a set of MapReduce jobs to parse and aggregate the data, whereas the simple SELECT above returns almost immediately.
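Since amount was declared as STRING, Hive is implicitly casting it to a number before summing. To make that explicit (a sketch, using the column names from the table above):

SELECT clientid, SUM(CAST(amount AS DOUBLE)) AS total
FROM output_file_table
GROUP BY clientid

This also sidesteps surprises if a row fails to parse – a failed CAST yields NULL, which SUM simply ignores.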

2: Load your data into a Hive table

The other option is to create an empty table and load the data into it. So as before, we create a table, but this time it’s a standalone entity:

CREATE TABLE output_table (
clientid STRING,
evtdate STRING,
amount STRING
)
COMMENT 'This is my output in a table'
STORED AS TEXTFILE;
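One thing to watch: this DDL omits the ROW FORMAT clause, so the table falls back to Hive’s default field delimiter (Ctrl-A, i.e. '\001'). LOAD DATA moves files into the table’s directory as-is, without reformatting them, so a space-delimited file loaded into this table won’t split into columns correctly – quite possibly a cause of the NULLs I describe below. A variant that declares a delimiter matching the source data (a sketch, assuming the same space-delimited output as before):

CREATE TABLE output_table (
clientid STRING,
evtdate STRING,
amount STRING
)
COMMENT 'This is my output in a table'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '32'
STORED AS TEXTFILE;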

Then load it using the LOAD DATA command:

LOAD DATA INPATH '/example/data/StreamingOutput/mytj/part-00000'
OVERWRITE INTO TABLE output_table;

Interestingly, this deletes the source file… in fact LOAD DATA INPATH moves the file into the table’s warehouse directory rather than copying it. In my case it also added some NULLs at the tail end of the amount column. I also couldn’t load the data into columns with more specific types: unless I declared them as STRING, the values just loaded as NULLs.
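If you’d rather keep the source file intact, an alternative (a sketch, reusing the external table from the first approach) is to populate the managed table with an INSERT instead of a LOAD. This runs a MapReduce job that writes fresh files into output_table’s directory and leaves the originals where they are:

INSERT OVERWRITE TABLE output_table
SELECT * FROM output_file_table;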

Summary

Here I’ve shown two paths for getting the output of your MapReduce jobs into Hive on Hadoop on Azure. I need to do some more research into the subtleties, but hopefully this gives you the options at a high level.
