Once you’ve managed to run a MapReduce job in Hadoop on Azure, the output needs to end up somewhere friendlier than a dumped text file on HDFS. Hadoop on Azure comes equipped with Hive, a database-like means of accessing your data (which you can then connect to using the Hive ODBC driver).
There are two mechanisms for doing this: first, create a Hive table that points directly at a directory; or second, load the output into a Hive table. I’ll tackle both mechanisms in this post.
1: Point your Hive table directly at the data
In the Interactive Console, choose Hive to be presented with this rather ugly screen:
Here we can create an “External Table”, i.e. a Hive table that points at a file location, by entering the table DDL into the lowermost text box and hitting “Evaluate” – a bit of a misnomer, as it also executes the statement if the DDL is valid. An example is below:
CREATE EXTERNAL TABLE output_file_table (
  clientid STRING,   -- column names and types assumed from the queries below
  amount INT
)
COMMENT 'This is my output'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '32'   -- 32 = ASCII code for space
STORED AS TEXTFILE
LOCATION '/example/data/StreamingOutput/mytj';   -- directory containing the job output
This allows us to run Hive queries directly against the text file, such as:
SELECT * FROM output_file_table LIMIT 10
Which will show us the first ten rows of data. Alternatively, we can perform SQL like operations to aggregate:
SELECT clientid, SUM(amount)
FROM output_file_table
GROUP BY clientid
It’s interesting to note that this kicks off a set of MapReduce jobs to parse and understand the data.
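If you’re curious where those jobs come from, Hive’s EXPLAIN statement will show the plan it generates for a query – the map/reduce stages it intends to run – without actually executing anything (the query below reuses the assumed clientid/amount columns from the earlier example):

```sql
EXPLAIN
SELECT clientid, SUM(amount)
FROM output_file_table
GROUP BY clientid;
```

The output lists the stages of the plan, which for an aggregation like this includes a map phase to read and group the rows and a reduce phase to compute the sums.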
2: Load your data into a Hive table
The other option is to create an empty structure and load data into it. So as before, we create a table, but this time it’s a standalone entity:
CREATE TABLE output_table (
  clientid STRING,   -- column names assumed; declared as STRING, since other types loaded as NULLs for me
  amount STRING
)
COMMENT 'This is my output in a table'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '32'
STORED AS TEXTFILE;
Then load it using the LOAD DATA command:
LOAD DATA INPATH '/example/data/StreamingOutput/mytj/part-00000'
OVERWRITE INTO TABLE output_table;
Interestingly, this deletes the source file – LOAD DATA INPATH moves the file into Hive’s warehouse directory rather than copying it – and it also added some NULLs at the tail end of the amount column. I also couldn’t load the table using the same data types as in the first example: unless I declared the columns as STRING, the data just loaded as NULLs.
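One workaround, assuming the columns were declared as STRING as above, is to cast back to a numeric type at query time and filter out the stray NULL rows:

```sql
-- Sketch: cast the STRING amount column to INT for aggregation,
-- skipping the trailing NULL rows noted above.
SELECT clientid, SUM(CAST(amount AS INT))
FROM output_table
WHERE amount IS NOT NULL
GROUP BY clientid;
```

This keeps the load working with STRING columns while still letting you aggregate numerically.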
Here I’ve shown two paths for getting the output of your MapReduce jobs into Hive on Hadoop on Azure. I need to do some more research into the subtleties, but hopefully you can understand the options at a high level.