Extract data from Hive using SSIS

Now that the Hive ODBC driver is installed, the next thing to do is use SSIS to extract data from Hive into a SQL Server instance for… well, I’m sure we’ll find a reason for it.

Setting up the DSN

The first thing to do is set up a System DSN (Data Source Name) to reference in the ODBC connection. For SSIS, that means we need a 32-bit driver to reference, which means we need to find the 32-bit ODBC Data Source Administrator. If you’re on a 32-bit OS, just go to the Control Panel and search for it. If you’re on a 64-bit OS like me, you need to hunt it out. On Windows 7, it can be found at “C:\Windows\SysWOW64\odbcad32.exe”. Note that you need to run it as Administrator to make changes.

Go to the System DSN tab:

Fig 1: ODBC Data Source Administrator

Click “Add…”

Fig 2: ODBC Data Source Administrator

Scroll down the list until you find the “HIVE” driver, then click “Finish”, which brings up the ODBC Hive Setup dialog:

Fig 3: ODBC Data Source Administrator

Give your DSN a sensible name and description. For the host, enter the cluster URL without the http:// prefix – i.e. “[your cluster name].cloudapp.net”. Leave the port as 10000. Under Authentication select “Username/Password” and enter your username. Then click “OK” and we are ready to move on.
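For reference, a DSN plus credentials boils down to an ODBC connection string along these lines (the DSN name, username and password below are placeholders, not values from a real cluster):

```
DSN=MyHiveDsn;UID=myusername;PWD=mypassword
```

This is the string the ADO.NET ODBC provider ends up using under the covers once the DSN and credentials are supplied.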

Connect in SSIS

To hook this into SSIS we need to create a Data Flow and add an ADO.NET Connection Manager. Not – as I initially thought – an ODBC Connection Manager.

Under Provider, expand “.Net Providers” and select the “Odbc Data Provider” option.

Fig 4: ADO.NET Connection Manager

Once that’s done you can choose the Data Source Name you just created using the dropdown under “Data source specification”. Add your username and password to complete the setup, then click “OK”.

Fig 5: ADO.NET Connection Manager

Now the Connection Manager is set up, you can use it in a Data Flow. Add an ADO.NET Source and select your Connection Manager. Then you can – as per a normal database connection – select tables or write queries. In this example I’ve just chosen the HiveSampleTable that comes with every cluster.
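If you’d rather write a query than pick a table, something along these lines works in the source’s SQL command mode (a sketch against the sample table – I’m assuming the clientid, market and devicemodel columns from the standard HiveSampleTable schema):

```
SELECT clientid, market, devicemodel
FROM hivesampletable
LIMIT 100;
```

The LIMIT is worth keeping while testing, as every query kicks off MapReduce jobs on the cluster.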

Fig 6: ADO.NET Source

Then we route the data somewhere, in this case just pumping it to a Row Count.

Fig 7: ADO.NET Data Flow

I’ve put on a Data Viewer just to show it works.

Fig 8: ADO.NET Data Flow

And there we have it. Data from Hive on a Hadoop on Azure cluster via SSIS.

 


Getting data into Hive using Hadoop on Azure

Once you’ve managed to run a MapReduce job in Hadoop on Azure, the output needs to get somewhere friendlier than a dumped text file on HDFS. Hadoop on Azure comes equipped with Hive, a database-like means of accessing your data (which you can then connect to using the Hive ODBC driver).

There are two mechanisms for doing this – first, create a Hive table that points directly at a directory, or second, load the output into a Hive table. I’ll tackle both mechanisms in this post.

1: Point your Hive table directly at the data

In the Interactive Console, choose Hive to be presented with this rather ugly screen:

Fig 1: Hadoop on Azure Hive Interactive Console

Here we can create an “External Table”, i.e. a Hive table that points at a file, by entering the table DDL into the lowermost text box, and hitting “Evaluate”, which is a bit of a misnomer as it also executes if the DDL is valid. An example is below:

CREATE EXTERNAL TABLE output_file_table (
clientid STRING,
evtdata STRING,
amount STRING
)
COMMENT 'This is my output'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '32'
STORED AS TEXTFILE
LOCATION '/example/data/StreamingOutput/mytj/';

See here for the full Hive DDL reference.
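A note on the delimiter: when Hive is given a multi-character number as the field terminator, it treats it as an ASCII code, so “32” above means the space character. The same table pointed at a tab-delimited file would look something like this (a sketch, assuming the same columns and location):

```
CREATE EXTERNAL TABLE output_file_table_tabs (
clientid STRING,
evtdata STRING,
amount STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/example/data/StreamingOutput/mytj/';
```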

This allows us to run Hive queries directly against the text file, such as:

SELECT * FROM output_file_table LIMIT 10

This will show us the first ten rows of data. Alternatively, we can perform SQL-like operations to aggregate:

SELECT clientid, SUM(amount)
FROM output_file_table
GROUP BY clientid;

It’s interesting to note that this kicks off a set of MapReduce jobs to parse and understand the data.
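Since amount was declared as STRING, it can be worth casting explicitly rather than relying on Hive’s implicit string-to-number conversion – a sketch of the same aggregate with a cast:

```
SELECT clientid, SUM(CAST(amount AS DOUBLE)) AS total_amount
FROM output_file_table
GROUP BY clientid;
```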

2: Load your data into a Hive table

The other option is to create an empty structure and load data into it. So as before, we create a table, but this time it’s a standalone entity:

CREATE TABLE output_table (
clientid STRING,
evtdate STRING,
amount STRING
)
COMMENT 'This is my output in a table'
STORED AS TEXTFILE;

Then load it using the LOAD DATA command:

LOAD DATA INPATH '/example/data/StreamingOutput/mytj/part-00000'
OVERWRITE INTO TABLE output_table;

Interestingly, this deletes the source file – LOAD DATA INPATH moves the file into Hive’s warehouse directory rather than copying it – and it also added some nulls at the tail end of the amount column. I also couldn’t load the table using the same data types as in the first example: unless I declared the columns as STRING, the data just loaded as nulls.
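If you want to keep the source file intact, one alternative (a sketch, reusing the external table created earlier) is to populate the table with a query instead of LOAD DATA – INSERT…SELECT reads the source rather than moving it:

```
INSERT OVERWRITE TABLE output_table
SELECT clientid, evtdata, amount
FROM output_file_table;
```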

Summary

Here I’ve shown two routes for getting the output of your MapReduce jobs into Hive on Hadoop on Azure. I need to do some more research into the subtleties, but hopefully you can understand the options at a high level.


Issuing a Hive query against Hadoop on Azure using Excel

…no, seriously. I can query an Azure cloud-based Hive instance from Excel. It’s simple stuff.

First step is to install the Hive drivers and Excel add-in. This then gives you a new button on your Excel ribbon:

The Hive Excel Add-In

Clicking this opens up the query pane, which is pretty simple stuff:

Hive Excel Add-in Query Pane

Pick your Hive instance, choose the table (in this case just the standard sample table), pick your columns… then scroll down a bit, because the pane gets quite long when you expand out the options.

Hive Excel Add-in Query Pane

I passed on providing any Criteria (i.e. the WHERE clause), added in some Aggregations, skipped over ordering as I can do that easily enough in Excel, and added a 2k row limit.

This spat out some HiveQL (Hive Query Language), which I modified slightly to include a count(*). Then click Execute Query and wait a bit for the Big Data Bees to make my Insights.
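The query ended up along these lines (a reconstruction rather than the exact generated text, assuming the sample table’s market column, with the count(*) added by hand):

```
SELECT market, count(*)
FROM hivesampletable
GROUP BY market
LIMIT 2000;
```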

Then:

Hive Data in Excel

I have Hive Data in Excel. I could have put it in PowerPivot if I really wanted to show off…
