In this session I’ll be looking at HDInsight – Microsoft’s implementation of Hadoop – and how to leverage it to perform some simple Sentiment Analysis, then link that up with structured data to perform some Data Visualisation using the Microsoft BI stack, especially PowerView.
Hopefully this will also tie in with the release of a White Paper on the subject so anyone with deep technical interest can get hands on with the experience.
I’m excited to get a chance to present again – look forward to seeing you there!
The thing about Big Data is, well… it’s big. Which has impacts in terms of how long it takes you to move your data about and the space it needs to be stored in. Now as a novice, I had assumed that you had to decompress your data to process it, and that I had to tolerate the huge volumes of output my (admittedly not very efficient) code produced.
As it turns out, you can not only process input in a compressed format, you can also compress the output – as detailed in the Hadoop Streaming documentation. So now my jobs start smaller and end smaller, and without a massive performance overhead.
So how does it work? Well, to read compressed data you have to configure absolutely nothing – it just works, as long as Hadoop recognises the compression algorithm. To compress the output, you need to tell the job to do so. Using the “-D” option you can set some generic command options to configure the job. A sample job – formatted for HDInsight – is below, with the key options being the two “-D” settings:
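A minimal sketch of such a streaming job – the jar location, input/output paths, and mapper/reducer executables are placeholders, but the two “-D” properties are the ones from the Hadoop Streaming documentation:

```shell
hadoop jar lib/hadoop-streaming.jar \
    -D mapred.output.compress=true \
    -D mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
    -input /user/hadoop/input/logs.gz \
    -output /user/hadoop/output \
    -mapper mapper.exe \
    -reducer reducer.exe \
    -file mapper.exe \
    -file reducer.exe
```

Note the input is a .gz file that Hadoop decompresses transparently, and the generic “-D” options must come before the streaming-specific ones.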
One thing Hadoop doesn’t do that effectively (right now, anyway) is clean up after itself. Like most file systems it has a trash bin (see “Space Reclamation” in the HDFS Architecture guide) which is supposed to clean itself up after “a configurable amount of time” – which appears to be 360 minutes (6 hours) according to core-site.xml in the HDInsight default setup.
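The relevant setting in core-site.xml is a standard Hadoop property – the value below is what the 360 minutes corresponds to in the HDInsight default setup:

```xml
<!-- core-site.xml: how long (in minutes) deleted files stay in the Trash -->
<property>
  <name>fs.trash.interval</name>
  <value>360</value>
</property>
```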
However I’ve found this doesn’t always happen at the speed I’d like, and some processes (which ones, I haven’t yet confirmed) also leave stuff lying around in the /tmp folder, which has to be cleaned up manually – as long as there’s nothing running, it seems to be safe to kill whatever is stored in /tmp. However, don’t blame me if it all goes wrong for you.
HDFS Commands to help free up space
So there are a few things you can do to get out of this. First, avoid trash disk space usage altogether by adding the -skipTrash option to your deletes:
hadoop fs -rmr -skipTrash /user/hadoop/data
This avoids the problem of using the Trash altogether. Of course, this also means you avoid being able to retrieve stuff from the Trash bin, so use wisely.
The next thing you can do is reach for the expunge command, which forces an empty of the Trash:
hadoop fs -expunge
However this didn’t always seem to work for me, so it’s worth checking it has had the desired effect.
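One way to check is to size the Trash before and after – a sketch, assuming the default Trash location for the hadoop user:

```shell
hadoop fs -dus /user/hadoop/.Trash   # how much the Trash is holding
hadoop fs -expunge                   # force an empty of the Trash
hadoop fs -dus /user/hadoop/.Trash   # confirm the space was actually reclaimed
```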
HDFS Commands to find what is using disk space
Sometimes the key thing is to find out where that disk space is being eaten up. Say hello to du (disk usage):
hadoop fs -dus /
Which will then give you the size of that data on your datanodes. Then dig deeper with ls:
hadoop fs -ls /
Which gives you the directories in root. Use du to size them, find unexpected space, and delete using rm or rmr as required.
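Putting that together, a typical drill-down might look like this (the directory names are just examples):

```shell
hadoop fs -dus /            # total size of everything in HDFS
hadoop fs -ls /             # list the directories in root
hadoop fs -du /user         # size each subdirectory of a suspect
hadoop fs -rmr -skipTrash /user/hadoop/old-output   # reclaim the space
```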
So now the Hive ODBC driver exists, the next thing to do is use SSIS to extract data from Hive into a SQL instance for… well, I’m sure we’ll find a reason for it.
Setting up the DSN
The first thing to do is set up a System DSN (Data Source Name) to reference in the ODBC connection. For SSIS, that means we need a 32 bit driver to reference, which means we need to find the 32 Bit ODBC Data Source Administrator. If you’re on a 32 Bit OS, just go to the Control Panel and search for it. If you are on a 64 Bit OS like me, you need to hunt it out. On Windows 7, it can be found at “C:\Windows\SysWOW64\odbcad32.exe”. Note you need to run as Administrator to make changes.
Go to the System DSN:
Scroll down the list until you find the “HIVE” driver, then click “Finish”, which brings up the ODBC Hive Setup dialog:
Give your DSN a sensible name and description. For your host enter the cluster URL (without http://) – i.e. “[your cluster name].cloudapp.net”. Leave the port as 10000. Under Authentication select “Username/Password” and enter your username. Then click “OK” and we are ready to move on.
Connect in SSIS
To hook this into SSIS we need to create a Data Flow and add an ADO.NET Connection Manager. Not – as I initially thought – an ODBC Connection Manager.
Under Provider, select the “Odbc Data Provider” option beneath “.Net Providers”.
Once that’s done you can choose your just created Data Source Name using the dropdown under “Data source specification”. Add your username and password to complete setup, then click “OK”.
Now the Connection Manager is set up, you can use it in a Data Flow. Add an ADO.NET Data Source, and select your Connection Manager. Then you can – as with a normal database connection – select tables or write queries. In this example I’ve just chosen the HiveSampleTable that comes with every cluster.
Then we route the data somewhere, in this case just pumping it to a Row Count.
I’ve put on a Data Viewer just to show it works.
And there we have it. Data from Hive on a Hadoop on Azure cluster via SSIS.
As described in my Upload data post, you need to use curl, and the command syntax is:
curl -k ftps://[cluster user name]:[password md5 hash]@[cluster name].cloudapp.net:2226/[path to data or specific file on HDFS] -o [local path name on your machine]
I found this out as I was trying to run some things through Pig, trying to manage some Excel CSV files that had fields with line feeds in them. I discovered there was a Pig load/store function CSVExcelStorage that would handle them, but when I tried to use it… ah. Not there. Turns out it was a Piggybank function – Piggybank being a set of user-contributed functions that you have to include in your Pig build. The source code is freely available (being open source and all) but I haven’t worked out how you can build and use them in a Hadoop on Azure environment.
I can understand why Microsoft have opted not to include these – it’s not part of the core build, it’s user contributed, etc. – things you want to avoid when running a massively reproducible, on-demand cloud environment. If I can work out how to include them, I’ll provide a follow-up post.
This is probably going to appear to be brain dead to some readers, but I have been frustrated by not being able to configure the RDP connection to my Hadoop on Azure account. Fooled by the slick Metro UI, I had wrongly assumed that the only option was to click on the “Remote Desktop” button to get access, as per the lovely menu below:
However it was pointed out to me today that you can right-click, save as… and then you have your RDP connection file to configure to share local resources, etc. Doh.
One of the things I’ve learned from the Microsoft team behind Hadoop on Azure is that the Hadoop clusters’ short lifespan is in fact intentional – the clusters are intended to be disposable and exist for the lifetime of the analysis only.
So what happens if you want your raw data to live up in the cloud for longer? The answer is Azure Blob Storage. This gives up to 100TB of storage per account, so should be adequate for most cases. Hadoop on Azure can reference Azure Blob Storage (or Amazon S3 blob storage, if you want a non-Microsoft solution) directly as a location for input data.
Firstly, you need to have an Azure account and set up storage, instructions for which can be found here. Then you need to upload some data to it, which is most easily done using a nice tool called Cloudberry Explorer, which operates pretty much like an FTP tool. Then you need to configure your Hadoop on Azure instance to point at your Azure Blob Storage. Then, as per this guide, you can point your jobs at the Azure Blob Storage using the asv:// notation, like in the example below:
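For example, a streaming job can read its input straight from blob storage – the container and path below are hypothetical, and asv:// resolves against the storage account you configured on the cluster:

```shell
hadoop jar lib/hadoop-streaming.jar \
    -input asv://mycontainer/input/data.txt \
    -output /user/hadoop/output \
    -mapper mapper.exe \
    -reducer reducer.exe
```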
Once you’ve managed to run a MapReduce job in Hadoop on Azure, the output needs to get somewhere friendlier than a dumped text file on HDFS. Hadoop on Azure comes equipped with Hive, a database-like means of accessing your data (which you can then connect to using the Hive ODBC driver).
There are two mechanisms for doing this – first, create a Hive table that points directly at a directory, or second, load the output into a Hive table. I’ll tackle both mechanisms in this post.
1: Point your Hive table directly at the data
In the Interactive Console, choose Hive to be presented with this rather ugly screen:
Here we can create an “External Table”, i.e. a Hive table that points at a file, by entering the table DDL into the lowermost text box and hitting “Evaluate” – which is a bit of a misnomer, as it also executes the DDL if it is valid. An example is below:
CREATE EXTERNAL TABLE output_file_table (
clientid STRING,
amount INT
)
COMMENT 'This is my output'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '32'
STORED AS TEXTFILE
LOCATION '/example/data/StreamingOutput/mytj/'
This allows us to run Hive queries directly against the text file, such as:
SELECT * FROM output_file_table LIMIT 10
Which will show us the first ten rows of data. Alternatively, we can perform SQL like operations to aggregate:
SELECT clientid, SUM(amount)
FROM output_file_table
GROUP BY clientid
It’s interesting to note that this kicks off a set of MapReduce jobs to parse and understand the data.
2: Load your data into a Hive table
The other option is to create an empty structure and load data into it. So as before, we create a table, but this time it’s a standalone entity:
CREATE TABLE output_table (
clientid STRING,
amount STRING
)
COMMENT 'This is my output in a table'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '32'
STORED AS TEXTFILE
Then load it using the LOAD DATA command:
LOAD DATA INPATH '/example/data/StreamingOutput/mytj/part-00000'
OVERWRITE INTO TABLE output_table
Interestingly, this deletes the source file… and also adds some nulls at the tail end of the amount column. I also couldn’t load the table using the same data types as in the first example – unless I declared the columns as STRING, they just loaded as nulls.
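If you do end up with the columns as STRING, you can still aggregate by casting at query time – a sketch, assuming amount holds integer values:

```sql
SELECT clientid, SUM(CAST(amount AS INT)) AS total_amount
FROM output_table
GROUP BY clientid
```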
Here I’ve shown two paths to getting your output from MapReduce jobs into Hive on Hadoop on Azure. I need to do some more research into the subtleties, but hopefully you can understand the options at a high level.