MapReduce in C# for Hadoop on Azure

There is a bewildering array of language options available for writing Mappers and Reducers (aka MapReduce) – Java and Python feature heavily, and for the non-programmer the entire exercise is borderline incomprehensible.

However, a kind soul by the name of Sreedhar Pelluru has posted a simple walkthrough for building a Mapper and Reducer using C# & VS2010 for us Microsoft oriented souls, with an intended Hadoop on Azure target. The walkthrough is here: Walkthrough: Creating and Using C# Mapper and Reducer (Hadoop Streaming)
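For anyone wondering what a streaming Mapper and Reducer actually do, here is a rough sketch of the idea. It is in Python rather than C# and is not the walkthrough's code. The word-count task and the function names (`run_mapper`, `run_reducer`, `simulate`) are my own assumptions for illustration. The point is the contract Hadoop Streaming relies on: the mapper writes tab-separated key/value lines, Hadoop sorts them by key, and the reducer reads the sorted lines and aggregates.

```python
from itertools import groupby


def run_mapper(lines):
    """Emit one 'word<TAB>1' record per word, as a streaming mapper would write to stdout."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def run_reducer(sorted_records):
    """Sum the counts for each key. Input must already be sorted by key,
    which is what Hadoop does between the map and reduce phases."""
    pairs = (record.rstrip("\n").split("\t", 1) for record in sorted_records)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


def simulate(lines):
    """Local stand-in for the cluster (an assumption for testing): map, sort, reduce."""
    return list(run_reducer(sorted(run_mapper(lines))))
```

On a real cluster the two functions would live in separate executables reading standard input and writing standard output. Chaining them through `sorted` locally is just a cheap way to check the logic before uploading anything.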

There are a few holes in the script, so here are the things to look out for:

  • In the section “Create and run a Map/Reduce job on HadoopOnAzure portal”, the first item suggests you run a JavaScript command to get the IP address, but doesn’t provide it until a few lines later – the command is: “#cat file:///apps/dist/conf/core-site.xml”. You can also find out the IP by remoting into the cluster and running IPConfig at the command line.
  • Step 7 in the same section asks you to open hadoop-streaming.jar, and it took me a while to realise this meant on the HadoopOnAzure portal, not on your local machine (so I spent quite a bit of time in misadventures trying to manipulate the file on my local machine).
  • Error messages for job failure aren’t terribly helpful, and there’s no validation on job parameter input, so make really, really sure that your command looks exactly like the one in step 11. Miss a double quote or mistype a path and you will get no hint that this was the source of the error.
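Since the portal won't tell you about a missing quote, one option is to sanity-check the parameter string locally before pasting it in. The sketch below is an assumption on my part, not something from the walkthrough. It uses Python's standard `shlex` module to catch unbalanced quotes, then checks for four Hadoop Streaming options (`-input`, `-output`, `-mapper`, `-reducer`) that I am assuming a job like this would pass. The function name and the example paths are hypothetical. Note that `shlex` follows POSIX shell quoting rules, which may not match exactly how the portal parses its input.

```python
import shlex

# Options assumed to be expected for this kind of job; adjust to match your own command.
EXPECTED_OPTIONS = ("-input", "-output", "-mapper", "-reducer")


def check_streaming_args(command):
    """Return a list of problems found in the parameter string (empty list = none found)."""
    try:
        tokens = shlex.split(command)
    except ValueError as err:
        # shlex raises ValueError when a quote is opened but never closed.
        return [f"quoting problem: {err}"]
    return [f"missing option: {flag}" for flag in EXPECTED_OPTIONS if flag not in tokens]
```

An empty list only means the quoting is balanced and the expected options are present. It can't tell you whether a path actually exists on the cluster, so a mistyped path will still slip through.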

Eventually I beat the above and achieved victory – a successful job run on HadoopOnAzure with the expected results. Next challenge – build my own data, mapper & reducer and repeat. Then get it into Hive….

 


Issuing a Hive query against Hadoop on Azure using Excel

..no, seriously. I can query an Azure cloud-based Hive instance of Hadoop from Excel. It’s simple stuff.

First step is to install the Hive drivers and Excel add-in. This then gives you a new button on your Excel ribbon:

The Hive Excel Add-In

Clicking this opens up the query pane, which is pretty simple stuff:

Hive Excel Add-in Query Pane

Pick your Hive instance, choose the table (in this case just the standard sample table), pick your columns…  then scroll down a bit because the pane is a bit long when you expand out the options.

Hive Excel Add-in Query Pane

I passed on providing any Criteria (i.e. the WHERE clause), added in some Aggregations, skipped over ordering as I can do that easily enough in Excel, and added a 2k row limit.

This spat out some HiveQL (Hive Query Language), which I modified slightly to include a count(*). Then I clicked Execute Query, and waited a bit for the Big Data Bees to make my Insights.
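To give a feel for the sort of HiveQL involved, here is a small Python sketch that assembles a query of that shape: some grouped columns, a count(*), and a row limit. This is not the add-in's actual output. The function name is hypothetical, and the table and column names in the usage note are assumptions standing in for the sample table.

```python
def build_hive_query(table, group_by, limit):
    """Build a HiveQL string: group by the chosen columns, count rows per group, cap the result."""
    columns = ", ".join(group_by)
    return (
        f"SELECT {columns}, count(*) AS row_count "
        f"FROM {table} "
        f"GROUP BY {columns} "
        f"LIMIT {int(limit)}"
    )
```

For example, `build_hive_query("hivesampletable", ["country"], 2000)` gives a query that counts rows per country and returns at most 2000 rows. There's no WHERE clause and no ORDER BY, matching the choices described above.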

Then:

Hive Data in Excel

I have Hive Data in Excel. I could have put it in PowerPivot if I really wanted to show off…


Hadoop on Azure

For those who may be unaware, Microsoft are offering Hadoop capability hosted in Azure (Microsoft’s cloud hosting ecosystem). There is scant information at the official site but this blog by Avkash Chauhan on MSDN is one good source of detail.

Anyway, I’m in on the CTP and I’m going to figure out what it’s all about and bring you on the journey. As a taster, here’s what it looks like:

Hadoop on Azure

I also needed to set up a SQL Azure instance, which was incredibly easy, so I’m going to have to start exploring that too. Some busy blogging and tweeting days ahead…


An Introduction to Big Data for the C-Level

I had a play with xtranormal last night – a cute site that allows you to make simple animated videos of two characters talking – you’ve probably seen one or two in your time.

Anyway, I made one to Introduce Big Data to the C-Level – not terribly serious, but I’m hoping that “Big Data Bees” will become part of the Hadoop lexicon :)

 

Enjoy!


Microsoft BI in the Cloud

This post by Jamie Thompson is about discovering that someone at Microsoft holds a manager role for “Cloudscale Predictive Analytics (SQL Azure Cloud Data Services Platform)”, which led him to speculate about the existence of an analytics platform on Azure. That caused my speculative brain to tick over in turn.

What is the possible roadmap for Microsoft BI? At a very high level, I think we will see Cloud BI play catch-up with Server BI and eventually converge to the point where there is no difference between the two – though that will take a few years. I won’t be surprised if in ten years’ time there is no distinction between the two streams of delivery.

In the short term I think we will see components of the current BI stack pop up in a loose order. Reporting is, I suspect, the first thing we will see – after all, flat reporting is where BI started many years ago (waaay back when it was Decision Support) and technically it is the easiest to deliver; just run a query and format the results. Then I’d expect to see Cubes appear – they are a very cloud friendly concept – big bursts of processing followed by lots of relative idleness mean you can leverage the cloud’s ability to deliver on-demand computing power. Cubes are also the sort of app that sits well in the cloud – if the data is in the cloud too, there’s no big network data transfer going on, it all just happens in the datacenter. Hot on the heels of Cubes I’d expect PerformancePoint to rear its head to leverage those capabilities. Integration Services will be last off the rank – I think it’s the least cloud friendly app – it often means moving huge amounts of data across global networks and may not be terribly efficient. Maybe a simpler form will appear earlier, but I wouldn’t bet on it.

There are a few other bits which may crop up unexpectedly along this path – StreamInsight is another cloud friendly app, as is Master Data Services. Much further down the line is the stuff Jamie has read the tea leaves on – predictive analytics, grid computing and so forth. There are a few interesting things coming in the stack which I have got wind of that belong in the cloud quite comfortably, but are too deep under NDA to even think of too loudly :)

I will of course reiterate that this is pure speculation, based on my understanding of the technical complexity of the solutions and how well they fit in the cloud. But expect BI to shift cloudwards, and be ready for it!
