Extract data from Hive using SSIS
So now the Hive ODBC driver exists, the next thing to do is use SSIS to extract data from Hive into a SQL instance for… well, I’m sure we’ll find a reason for it.
Setting up the DSN
The first thing to do is set up a System DSN (Data Source Name) to reference in the ODBC connection. For SSIS, that means we need a 32 bit driver to reference, which means we need to find the 32 Bit ODBC Data Source Administrator. If you’re on a 32 Bit OS, just go to the Control Panel and search for it. If you are on a 64 Bit OS like me, you need to hunt it out. On Windows 7, it can be found at “C:\Windows\SysWOW64\odbcad32.exe”. Note you need to run as Administrator to make changes.
Go to the System DSN:
Scroll down the list until you find the “HIVE” driver, then click “Finish”, which brings up the ODBC Hive Setup dialog:
Give your DSN a sensible name and description. For your host enter the cluster URL (without http://) – i.e. “[your cluster name].cloudapp.net”. Leave the port as 10000. Under Authentication select “Username/Password” and enter your username. Then click “OK” and we are ready to move on.
Connect in SSIS
To hook this into SSIS we need to create a Data Flow and add an ADO.NET Connection Manager. Not – as I initially thought – an ODBC Connection Manager.
Under the Provider, select under “.Net Providers” the “Odbc Data Provider” option.
Once that’s done you can choose your just created Data Source Name using the dropdown under “Data source specification”. Add your username and password to complete setup, then click “OK”.
Now the Connection Manager is set up, you can use it in a Data Flow. Add a ADO.NET Data Source, and select your Connection Manager. Then you can – as per a normal database connection – select tables or write queries. In this example I’ve just chosen the HiveSampleTable that comes with every cluster.
Then we route the data somewhere, in this case just pumping it to a Row Count.
I’ve put on a Data Viewer just to show it works.
And there we have it. Data from Hive on a Hadoop on Azure cluster via SSIS.