My eBook “HDInsight Succinctly” has been published by Syncfusion!

Thanks to the lovely people over at Syncfusion I’ve been able to share my experiences with HDInsight in a short eBook for the “Succinctly” series which was released this weekend. It is unsurprisingly called “HDInsight Succinctly” and is free to download.

For a quick précis, this is the summary description from the website:

Master the higher-level languages and other features necessary to process data with HDInsight. Learn how to set up and manage HDInsight clusters on Azure, how to use Azure Blob Storage to store input and output data, connect with Microsoft BI, and much more. With the guidance of author James Beresford, HDInsight Succinctly will reveal a new avenue of data management.

You can also read my guest blog on Syncfusion’s site entitled “Being on the cutting edge is fun!” where I spend a bit of time extolling the virtues of HDInsight and Microsoft BI and getting all excited about analytics and its place in the future.

Download it for free here.

 


What’s new in SQL Server 2014 – Full Day Workshop in Sydney

Shameless plug time – I’ll be demoing HDInsight at the event below in Sydney on May 28th – grab a ticket here. The text below is pretty much a straight copy from the event website, so hop over there for up-to-date details. There are some great and knowledgeable speakers… and me :)

SQL Server 2014 Jump Start

Mission Critical Performance, BI, Big Data and Cloud

When: 28th of May, 9:00 AM to 5:00 PM

Where: Saxons, Level 12, 10 Barrack Street, Sydney NSW 2000

  • Discover the impact of cloud technologies on Business Intelligence
  • Accelerate your solutions with in-memory databases
  • Enable analytics for your users with the latest wave of self-service solutions

Register today and join our special event featuring SQL Server experts and guest speakers from Microsoft.

In this full-day event, you will learn about the new features and enhancements in Microsoft SQL Server 2014 and Power BI, as well as how to create business value through features such as in-memory databases and Power BI reporting.

 

Who Should Attend:

  • IT professionals or DBAs using SQL Server Enterprise, with multi-terabyte OLTP or OLAP databases, interested in increased scaling, high availability, and performance
  • BI developers or administrators interested in an overview of the SQL Server big data analytics platform
  • Data professionals interested in the options for SQL Server hybrid solutions both on-premises and in the cloud

 

Speakers:

  • Victor Isakov (MVP, MCT, Microsoft Certified Master, Microsoft Certified Architect: SQL Server)
  • James Beresford (www.bimonkey.com)
  • Iman Eftekhari (MCSE, MCITP: Business Intelligence)
  • Dean Corcoran (Microsoft)
  • Shashank Pawar (Microsoft)

 

Agenda

  • 09:00 am: Opening by Microsoft
  • 09:30 am: Session 1: SQL Server 2014 editions and engine enhancements
  • 10:30 am: Morning Tea
  • 10:45 am: Session 2: Data analytics and BI
  • 11:45 am: Session 3: Cloud and Big Data
  • 12:15 pm: Lunch
  • 01:15 pm: Demo 1: Database enhancements
  • 02:45 pm: Demo 2: Self-service BI in the Cloud
  • 03:30 pm: Afternoon tea
  • 04:00 pm: Demo 3: Big Data


TechEd 2013: I’ll be presenting!

I’ll be presenting at TechEd Australia 2013 on “Big Data, Small Data and Data Visualisation via Sentiment Analysis with HDInsight”.

In the session I’ll be looking at HDInsight (Microsoft’s implementation of Hadoop) and how to leverage it to perform some simple Sentiment Analysis, then link that up with structured data to perform some Data Visualisation using the Microsoft BI stack, especially Power View.

Hopefully this will also tie in with the release of a White Paper on the subject, so anyone with a deep technical interest can get hands-on with the material.

I’m excited to get the chance to present again and look forward to seeing you there!


Compression in Hadoop Streaming Jobs

The thing about Big Data is, well… it’s big. That has an impact on how long it takes to move your data about and the space it needs to be stored in. As a novice, I had assumed that you had to decompress your data to process it, and that I had to tolerate the huge volumes of output my (admittedly not very efficient) code produced.

As it turns out, you can not only process input in a compressed format, you can also compress the output – as detailed in the Hadoop Streaming documentation. So now my jobs start smaller and end smaller, and without a massive performance overhead.

So how does it work? Well, to read compressed data you have to configure absolutely nothing. It just works, as long as Hadoop recognises the compression algorithm. To compress the output, you need to tell the job to do so. Using the -D option you can set generic command options to configure the job. A sample job, formatted for HDInsight, is below; the key options are the two -D settings:

c:\hadoop\hadoop-1.1.0-SNAPSHOT\bin\hadoop.cmd jar
C:\Hadoop\websites\HadoopDashboard\Models\Samples\hadoop-streaming.jar
"-D mapred.output.compress=true"
"-D mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec"
-files "hdfs://localhost:8020/user/hadoop/code/Sentiment_v2.exe"
-numReduceTasks 0
-mapper "Sentiment_v2.exe"
-input "/user/hadoop/data/"
-output "/user/hadoop/output/Sentiment"

This tells the job to compress the output, and to use GZip as the compression technique.
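
If you want to check the result, the files in the output directory should now carry a .gz extension. A quick way to inspect one is with -text, which (unlike -cat) should decompress recognised formats before printing; the part-00000 file name below is only illustrative, as the actual names depend on how many tasks the job ran:

hadoop fs -ls /user/hadoop/output/Sentiment

hadoop fs -text /user/hadoop/output/Sentiment/part-00000.gz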

And now, my jobs are still inefficient but at least take up less disk space!


Taking out the trash in HDInsight

One thing Hadoop doesn’t do that effectively (right now, anyway) is clean up after itself. Like most file systems it has a trash bin (see “Space Reclamation” in the HDFS Architecture guide) which is supposed to clean itself up after “a configurable amount of time” – which appears to be 360 minutes (6 hours) according to core-site.xml in the HDInsight default setup.
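
For reference, that retention period is controlled by the fs.trash.interval property, with the value given in minutes, so the relevant core-site.xml entry for the six-hour default mentioned above would look something like this:

<property>
  <name>fs.trash.interval</name>
  <value>360</value>
</property>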

However, I’ve found this doesn’t always happen at the speed I’d like, and some processes (which ones, I haven’t yet confirmed) leave stuff lying around in the /tmp folder, which has to be cleaned up manually. As long as there’s nothing running, it seems to be safe to delete whatever is stored in /tmp. However, don’t blame me if it all goes wrong for you :)

HDFS Commands to help free up space

So there are a few things you can do to get out of this. The first is to avoid using Trash disk space in the first place by adding the -skipTrash option to your deletes:

hadoop fs -rmr -skipTrash /user/hadoop/data

This sidesteps the problem of the Trash filling up altogether. Of course, it also means you can’t retrieve anything from the Trash bin afterwards, so use it wisely.

The next thing you can do is reach for the expunge command, which forces the Trash to be emptied:

hadoop fs -expunge

However, this didn’t always seem to work for me, so it’s worth checking it has had the desired effect.
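
One way to check is to size your user’s Trash folder before and after running it; assuming the hadoop user from the examples above, that would be something along the lines of:

hadoop fs -dus /user/hadoop/.Trash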

HDFS Commands to find what is using disk space

Sometimes the key thing is to find out where that disk space is being eaten up. Say hello to du (disk usage):

hadoop fs -dus /

This will give you the size of that data on your DataNodes. Then dig deeper with ls:

hadoop fs -ls /

This gives you the directories in the root. Use du to size them, find anything taking up unexpected amounts of space, and delete it using rm or rmr as required.
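
Putting that together, a typical clean-up pass might look like the below. The /tmp path is only an illustration of the sort of leftovers mentioned earlier, so check what is actually in there before deleting anything:

hadoop fs -dus /tmp

hadoop fs -ls /tmp

hadoop fs -rmr -skipTrash /tmp/old-job-output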

The full set of file system shell commands is listed here.


Reference Environment Variables in C# Mappers for HDInsight

Within your Mappers and Reducers there may be a need to reference the environment variables being fed to the task, such as the name of the file being processed. Understanding how to do so took a little digging on my part, with a little help from Matt Winkler in the HDInsight MSDN forum.

Using this snippet of code:

// Adding these references at the start of the code
using System;
using System.Collections;

// Write out every environment variable as the Mapper output
foreach (DictionaryEntry entry in Environment.GetEnvironmentVariables())
{
    Console.WriteLine("{0}|{1}", entry.Key, entry.Value);
}

// Some junk code so the mapper doesn't fail
string line; // Variable to hold the current line
while ((line = Console.ReadLine()) != null)
{
    // do nothing
}

 

It was possible to output all the Environment Variables as the Mapper output and work out their format from the resultant text file it created.

Then, to reference individual environment variables in the Mapper, you can simply use variations on the lines below. Note that streaming exposes job configuration properties as environment variables with the dots replaced by underscores, which is why map.input.file appears as map_input_file:

 

string FileName = System.Environment.GetEnvironmentVariable("map_input_file");
string FileChunk = System.Environment.GetEnvironmentVariable("map_input_start");
