My eBook “HDInsight Succinctly” has been published by Syncfusion!

Thanks to the lovely people over at Syncfusion, I’ve been able to share my experiences with HDInsight in a short eBook for the “Succinctly” series, which was released this weekend. It is, unsurprisingly, called “HDInsight Succinctly” and is free to download.

For a quick précis, this is the summary description from the website:

Master the higher-level languages and other features necessary to process data with HDInsight. Learn how to set up and manage HDInsight clusters on Azure, how to use Azure Blob Storage to store input and output data, connect with Microsoft BI, and much more. With the guidance of author James Beresford, HDInsight Succinctly will reveal a new avenue of data management.

You can also read my guest blog on Syncfusion’s site entitled “Being on the cutting edge is fun!” where I spend a bit of time extolling the virtues of HDInsight and Microsoft BI and getting all excited about analytics and its place in the future.

Download it for free here.

TechEd 2013: I’ll be presenting!

I’ll be presenting at TechEd Australia 2013 on “Big Data, Small Data and Data Visualisation via Sentiment Analysis with HDInsight”.

In the session I’ll be looking at HDInsight – Microsoft’s implementation of Hadoop – and how to leverage it to perform some simple Sentiment Analysis, then link that up with structured data to perform some Data Visualisation using the Microsoft BI stack, especially Power View.

Hopefully this will also tie in with the release of a White Paper on the subject, so anyone with a deep technical interest can get hands-on with the experience.

I’m excited to get a chance to present again – look forward to seeing you there!


Compression in Hadoop Streaming Jobs

The thing about Big Data is, well… it’s big. That has an impact on how long it takes to move your data about and on the space needed to store it. As a novice, I had assumed that you had to decompress your data to process it, and that I had to tolerate the huge volumes of output my (admittedly not very efficient) code produced.

As it turns out, you can not only process input in a compressed format, but also compress the output – as detailed in the Hadoop Streaming documentation. So now my jobs start smaller and end smaller, without a massive performance overhead.

So how does it work? Well, to read compressed data you have to configure absolutely nothing. It just works, as long as Hadoop recognises the compression algorithm. To compress the output, however, you need to tell the job to do so explicitly. Using the “-D” option you can set generic command options to configure the job. A sample job – formatted for HDInsight – is below; the key options are the two “-D” settings:

c:\hadoop\hadoop-1.1.0-SNAPSHOT\bin\hadoop.cmd jar
C:\Hadoop\websites\HadoopDashboard\Models\Samples\hadoop-streaming.jar
"-D mapred.output.compress=true"
"-D mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec"
-files "hdfs://localhost:8020/user/hadoop/code/Sentiment_v2.exe"
-numReduceTasks 0
-mapper "Sentiment_v2.exe"
-input "/user/hadoop/data/"
-output "/user/hadoop/output/Sentiment"

This tells the job to compress the output, and to use GZip as the compression technique.
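
One thing worth noting is that none of this requires any change to the mapper itself: Hadoop Streaming decompresses the input and compresses the output outside your process, so the mapper just reads plain lines from stdin and writes to stdout as usual. As a minimal sketch of that (the class name here is my own invention, not part of the job above):

using System;

// Minimal pass-through streaming mapper: reads lines from stdin and
// echoes them to stdout. Whether the job's input was Gzip-compressed,
// or its output is being compressed, is invisible at this level;
// the framework handles both before and after this process runs.
class PassThroughMapper
{
    static void Main()
    {
        string line;
        while ((line = Console.ReadLine()) != null)
        {
            Console.WriteLine(line);
        }
    }
}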

And now, my jobs are still inefficient but at least take up less disk space!


Reference Environment Variables in C# Mappers for HDInsight

Within your Mappers and Reducers there may be a need to reference the environment variables being fed to the task, such as the file name. Understanding how to do so took a little digging on my part, with a little help from Matt Winkler in the HDInsight MSDN forum.

Using this snippet of code:

// Adding these references at the start of the code
using System;
using System.Collections;

// Write out every environment variable passed to the task as "key|value"
foreach (DictionaryEntry entry in Environment.GetEnvironmentVariables())
{
    Console.WriteLine("{0}|{1}", entry.Key, entry.Value);
}

// Some junk code so the mapper doesn't fail
string line; // Variable to hold the current line
while ((line = Console.ReadLine()) != null)
{
    // do nothing
}

Running this outputs all the Environment Variables as the Mapper output, making it possible to work out their names and format from the resultant text file.

Then, to reference individual Environment Variables in the Mapper, you can simply use variations on:

string FileName = System.Environment.GetEnvironmentVariable("map_input_file");
string FileChunk = System.Environment.GetEnvironmentVariable("map_input_start");
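
To tie that together, here is a minimal sketch of a mapper that tags each line of output with the file it came from. The class name is mine, and it assumes the code is running as a Hadoop Streaming task so that “map_input_file” is actually set:

using System;

// Sketch of a streaming mapper that prefixes every output line with
// the name of the input split's file, read from the "map_input_file"
// environment variable that Hadoop Streaming sets for each task.
class FileTaggingMapper
{
    static void Main()
    {
        // The variable will be null outside a streaming job (e.g. when
        // testing from the command line), so fall back to a placeholder
        string fileName =
            Environment.GetEnvironmentVariable("map_input_file") ?? "(unknown)";

        string line;
        while ((line = Console.ReadLine()) != null)
        {
            Console.WriteLine("{0}|{1}", fileName, line);
        }
    }
}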
