TechEd 2013: I’ll be presenting!

I’ll be presenting at TechEd Australia 2013 on “Big Data, Small Data and Data Visualisation via Sentiment Analysis with HDInsight”.

In the session I’ll be looking at HDInsight – Microsoft’s implementation of Hadoop – and how to leverage it to perform some simple Sentiment Analysis, then link that up with structured data to perform some Data Visualisation using the Microsoft BI stack, especially Power View.

Hopefully this will also tie in with the release of a White Paper on the subject, so anyone with a deep technical interest can get hands-on experience.

I’m excited to get a chance to present again – look forward to seeing you there!


Compression in Hadoop Streaming Jobs

The thing about Big Data is, well…  it’s big. Which has impacts in terms of how long it takes you to move your data about and the space it needs to be stored in. Now as a novice, I had assumed that you had to decompress your data to process it, and I also had to tolerate the huge volumes of output my (admittedly not very efficient) code produced.

As it turns out, you can not only process input in a compressed format, you can also compress the output – as detailed in the Hadoop Streaming documentation. So now my jobs start smaller and end smaller, and without a massive performance overhead.

So how does it work? Well, to read compressed data you have to configure absolutely nothing. It just works, as long as Hadoop recognises the compression algorithm. To compress the output, you need to tell the job to do so. Using the “-D” option you can set some generic command options to configure the job. A sample job – formatted for HDInsight – is below; the key options are the two quoted “-D” settings:

c:\hadoop\hadoop-1.1.0-SNAPSHOT\bin\hadoop.cmd jar hadoop-streaming.jar

"-D mapred.output.compress=true"

"-D mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec"

-files "hdfs://localhost:8020/user/hadoop/code/Sentiment_v2.exe"

-numReduceTasks 0

-mapper "Sentiment_v2.exe"

-input "/user/hadoop/data/"

-output "/user/hadoop/output/Sentiment"

This tells the job to compress the output, and to use GZip as the compression technique.
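Once you pull a compressed part file down locally, any gzip-aware tool can read it. As a quick Python sketch (the path and file name are hypothetical – substitute your own part file):

```python
import gzip

# Read a gzip-compressed streaming output part file (e.g. a
# part-00000.gz pulled down from HDFS) line by line.
def read_compressed_output(path):
    with gzip.open(path, mode="rt", encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]
```

The same code works on uncompressed part files if you swap `gzip.open` for plain `open`.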

And now, my jobs are still inefficient but at least take up less disk space!


Taking out the trash in HDInsight

One thing Hadoop doesn’t do that effectively (right now, anyway) is clean up after itself. Like most file systems it has a trash bin (see “Space Reclamation” in the HDFS Architecture guide) which is supposed to clean itself up after “a configurable amount of time” – which appears to be 360 minutes (6 hours) according to core-site.xml in the HDInsight default setup.

However I’ve found this doesn’t always happen at the speed I’d like, and some processes (which ones, I haven’t yet confirmed) also leave stuff lying around in the /tmp folder, which has to be manually cleaned up – as long as there’s nothing running it seems to be safe to kill whatever is stored in /tmp. However, don’t blame me if it all goes wrong for you :)

HDFS Commands to help free up space

So there are a few things you can do to get out of this. The first is avoiding Trash disk usage altogether by adding the -skipTrash option to your deletes:

hadoop fs -rmr -skipTrash /user/hadoop/data

This avoids the problem of using the Trash altogether. Of course, this also means you avoid being able to retrieve stuff from the Trash bin, so use wisely.

The next thing you can do is reach for the expunge command, which forces an empty of the Trash:

hadoop fs -expunge

However this didn’t always seem to work for me, so it’s worth checking it has had the desired effect.

HDFS Commands to find what is using disk space

Sometimes the key thing is to find out where that disk space is being eaten up. Say hello to du (disk usage) and its summary variant dus:

hadoop fs -dus /

Which will then give you the size of that data on your datanodes. Then dig deeper with ls:

hadoop fs -ls /

Which gives you the directories in root. Use du to size them, find unexpected space, and delete using rm or rmr as required.
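If you capture the du output to a file, a few lines of Python will rank the space hogs for you. This is a sketch, assuming the two-column `<size in bytes> <path>` format this generation of Hadoop emits:

```python
# Rank directories by size from the captured output of `hadoop fs -du /`.
# Assumes the two-column format: <size in bytes> <path>.
def biggest_dirs(du_output, top=5):
    rows = []
    for line in du_output.strip().splitlines():
        size, path = line.split(None, 1)
        rows.append((int(size), path.strip()))
    # Largest first, trimmed to the top N offenders
    return sorted(rows, reverse=True)[:top]
```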

The full file system shell commands are listed here


Reference Environment Variables in C# Mappers for HDInsight

Within your Mappers and Reducers there may be a need to reference the environment variables being fed to the task, such as the file name. Understanding how to do so took a little digging on my part, with a little help from Matt Winkler in the HDInsight MDSN forum.

Using this snippet of code:

// Add these references at the top of the file
using System;
using System.Collections;

// Write every environment variable as the mapper output
foreach (DictionaryEntry entry in Environment.GetEnvironmentVariables())
{
    Console.WriteLine("{0}|{1}", entry.Key, entry.Value);
}

// Some junk code so the mapper doesn't fail: consume stdin
string line; // Variable to hold current line
while ((line = Console.ReadLine()) != null)
{
    // do nothing
}


It was possible to output all the Environment Variables as the Mapper output and work out their format from the resultant text file it created.

Then, to reference individual Environment Variables in the Mapper, you can simply use variations on:


string FileName = System.Environment.GetEnvironmentVariable("map_input_file");

string FileChunk = System.Environment.GetEnvironmentVariable("map_input_start");
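The same trick works in any streaming language, since Hadoop rewrites the dots in configuration names to underscores before exporting them to the task’s environment. A Python sketch of the equivalent:

```python
import os

# Hadoop Streaming exports job configuration to the task's environment,
# with dots replaced by underscores (map.input.file -> map_input_file).
def input_file_name():
    return os.environ.get("map_input_file", "")

def input_chunk_start():
    return os.environ.get("map_input_start", "")
```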


Extract data from Hive using SSIS

So now the Hive ODBC driver exists, the next thing to do is use SSIS to extract data from Hive into a SQL instance for… well, I’m sure we’ll find a reason for it.

Setting up the DSN

The first thing to do is set up a System DSN (Data Source Name) to reference in the ODBC connection. For SSIS, that means we need a 32 bit driver to reference, which means we need to find the 32 Bit ODBC Data Source Administrator. If you’re on a 32 Bit OS, just go to the Control Panel and search for it. If you are on a 64 Bit OS like me, you need to hunt it out. On Windows 7, it can be found at “C:\Windows\SysWOW64\odbcad32.exe”. Note you need to run as Administrator to make changes.

Go to the System DSN:

Fig 1: ODBC Data Source Administrator

Click “Add…”

Fig 2: ODBC Data Source Administrator

Scroll down the list until you find the “HIVE” driver, then click “Finish”, which brings up the ODBC Hive Setup dialog:

Fig 3: ODBC Data Source Administrator

Give your DSN a sensible name and description. For your host enter the cluster URL (without http://) – i.e. “[your cluster name]”. Leave the port as 10000. Under Authentication select “Username/Password” and enter your username. Then click “OK” and we are ready to move on.

Connect in SSIS

To hook this into SSIS we need to create a Data Flow and add an ADO.NET Connection Manager. Not – as I initially thought – an ODBC Connection Manager.

Under Provider, select the “Odbc Data Provider” option beneath “.Net Providers”.

Fig 4: ADO.NET Connection Manager

Once that’s done you can choose the Data Source Name you just created using the dropdown under “Data source specification”. Add your username and password to complete setup, then click “OK”.

Fig 5: ADO.NET Connection Manager

Now the Connection Manager is set up, you can use it in a Data Flow. Add an ADO.NET Source, and select your Connection Manager. Then you can – as per a normal database connection – select tables or write queries. In this example I’ve just chosen the HiveSampleTable that comes with every cluster.

Fig 6: ADO.NET Source

Then we route the data somewhere, in this case just pumping it to a Row Count.

Fig 7: ADO.NET Data Flow

I’ve put on a Data Viewer just to show it works.

Fig 8: ADO.NET Data Flow

And there we have it. Data from Hive on a Hadoop on Azure cluster via SSIS.



Download data from a Hadoop on Azure cluster

So you’ve run a job on Hadoop on Azure, and now you want that data somewhere more useful, like in your Data Warehouse for some analytics. If the Hive ODBC Driver isn’t an option (perhaps because you used Pig), then FTP is the way – there isn’t a Javascript console fs.get() command available.

As described in my Upload data post, you need to use curl, and the command syntax is:

curl -k ftps://[cluster user name]:[password md5 hash]@[cluster name][path to data or specific file on HDFS] -o [local path name on your machine]
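If you need to generate that MD5 hash of your password, Python’s hashlib will produce it. A sketch – I’m assuming the cluster expects the hash of the UTF-8 encoded password:

```python
import hashlib

# The FTPS URL wants the MD5 hash of the cluster password,
# not the password itself.
def password_md5(password):
    return hashlib.md5(password.encode("utf-8")).hexdigest()
```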

Happy downloading!

UPDATE: This functionality has now been disabled in HDInsight, see this thread from the MSDN Forum.


No Piggybank for Pig on Hadoop on Azure

A quick note – Pig functions from the piggybank are not available in Hadoop on Azure.

I found this out as I was trying to run some things through Pig, trying to manage some Excel CSV files that had fields with line feeds in them. I discovered there was a Pig load/store function CSVExcelStorage that would handle them, but when I tried to use it… ah. Not there. Turns out it was a piggybank function – the piggybank being a set of user-contributed functions that you have to include in your Pig build. The source code is freely available (being open source and all), but I haven’t worked out how you can build and use them in an HOA environment.

I can understand why Microsoft have opted not to include these – it’s not part of the core build, it’s user contributed, etc. – things you want to avoid when delivering a massively reproducible, on-demand cloud environment. If I can work out how to include them, I’ll provide a follow-up post.



Save your RDP connection to Hadoop on Azure

This is probably going to appear to be brain dead to some readers, but I have been frustrated by not being able to configure the RDP connection to my Hadoop on Azure account. Fooled by the slick Metro UI, I had wrongly assumed that the only option was to click on the “Remote Desktop” button to get access, as per the lovely menu below:

Fig 1: Hadoop on Azure – Your Cluster menu

However it was pointed out to me today that you can right-click, Save As… and then you have your RDP connection file to configure to share local resources, etc. Doh.

Fig 2: Hadoop on Azure – Your Cluster menu – now with Right Click

Handy tip. Boy, do I feel silly….


Using Azure Blob Storage as a Data Source for Hadoop on Azure

One of the things I’ve learned from the Microsoft team behind Hadoop on Azure is that the Hadoop clusters’ short lifespan is in fact intentional – the clusters are intended to be disposable and exist for the lifetime of the analysis only.

So what happens if you want your raw data to live up in the cloud for longer? The answer is Azure Blob Storage. This gives up to 100TB of storage per account, so it should be adequate for most cases. Hadoop on Azure can reference Azure Blob Storage (or Amazon S3 blob storage, if you want a non-Microsoft solution) directly as a location for input data.

Firstly, you need to have an Azure account and set up storage, instructions for which can be found here. Then you need to upload some data to it, which can most easily be done using a nice tool called Cloudberry Explorer, which operates pretty much like an FTP tool. Then you need to configure your Hadoop on Azure instance to point at your Azure Blob Storage. Finally, as per this guide, you can point your jobs at the Azure Blob Storage using the asv:// notation, like in the example below:

Hadoop jar hadoop-streaming.jar -files "hdfs://10.NN.NN.NN:9000/example/apps/mappertwo.exe,hdfs://10.NN.NN.NN:9000/example/apps/reducertwo.exe" -mapper "mappertwo.exe" -reducer "reducertwo.exe" -input "asv://hadoop-test/" -output "/example/data/StreamingOutput/abtj"

This of course is slightly confusing, because when setting up your job the parameters can be marked as asv://, like below:

Fig 1: Job Parameters

However I couldn’t work out how to mark something as an input parameter as well as have it as ASV on the dropdown, so I left it as plain text and entered the command as:

-input "asv://hadoop-test/"

And it all worked – though I did find that it didn’t handle direct file references; it would only accept storage references at the folder level.

So there you go – using Azure Blob Storage as a data source for Hadoop on Azure. Nice and easy.


Getting data into Hive using Hadoop on Azure

Once you’ve managed to run a MapReduce job in Hadoop on Azure, the output needs to get somewhere friendlier than a dumped text file on HDFS. Hadoop on Azure comes equipped with Hive, a database like means of accessing your data (which you can then connect to using the Hive ODBC driver).

There are two mechanisms for doing this – first, create a Hive table that points directly at a directory, or second, load the output into a Hive table. I’ll tackle both mechanisms in this post.

1: Point your Hive table directly at the data

In the Interactive Console, choose Hive to be presented with this rather ugly screen:

Fig 1: Hadoop on Azure Hive Interactive Console

Here we can create an “External Table”, i.e. a Hive table that points at a file, by entering the table DDL into the lowermost text box, and hitting “Evaluate”, which is a bit of a misnomer as it also executes if the DDL is valid. An example is below:

CREATE EXTERNAL TABLE output_file_table (
clientid STRING,
evtdate STRING,
amount STRING
)
COMMENT 'This is my output'
LOCATION '/example/data/StreamingOutput/mytj/';

See here for the full Hive DDL reference.

This allows us to run Hive queries directly against the text file, such as:

SELECT * FROM output_file_table LIMIT 10

Which will show us the first ten rows of data. Alternatively, we can perform SQL like operations to aggregate:

SELECT clientid, SUM(amount)
FROM output_file_table
GROUP BY clientid

It’s interesting to note that this kicks off a set of MapReduce jobs to parse and understand the data.
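Conceptually, that GROUP BY compiles down to a map/reduce aggregation over the rows. In plain Python terms (a sketch, treating amount as numeric) it’s just:

```python
from collections import defaultdict

# The equivalent of SELECT clientid, SUM(amount) ... GROUP BY clientid:
# accumulate a running total per client across all rows.
def sum_by_client(rows):
    totals = defaultdict(float)
    for clientid, evtdate, amount in rows:
        totals[clientid] += float(amount)
    return dict(totals)
```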

2: Load your data into a Hive table

The other option is to create an empty structure and load data into it. So as before, we create a table, but this time it’s a standalone entity:

CREATE TABLE output_table (
clientid STRING,
evtdate STRING,
amount STRING
)
COMMENT 'This is my output in a table';

Then load it using the LOAD DATA command:

LOAD DATA INPATH '/example/data/StreamingOutput/mytj/part-00000' INTO TABLE output_table;

Interestingly, this deletes the source file…  and also adds some nulls at the tail end of the amount column. I also couldn’t load the table using other data types – unless I had the columns as STRING, it just loaded nulls.


Here I’ve shown two paths for getting your MapReduce job output into Hive on Hadoop on Azure. I need to do some more research into the subtleties, but hopefully you can understand the options at a high level.
