Download data from a Hadoop on Azure cluster

So you’ve run a job on Hadoop on Azure, and now you want that data somewhere more useful, like in your Data Warehouse for some analytics. If the Hive ODBC Driver isn’t an option (perhaps because you used Pig), then FTP is the way to go – there isn’t a Javascript console fs.get() command available.

As described in my Upload data post, you need to use curl, and the command syntax is:

curl -k ftps://[cluster user name]:[password md5 hash]@[cluster name].cloudapp.net:2226/[path to data or specific file on HDFS] -o [local path name on your machine]
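For example, pulling down a single output file from a streaming job might look something like this (all of the values below are placeholders; swap in your own cluster name, user name, password hash and paths):

curl -k ftps://admin:[password md5 hash]@myhadoopcluster.cloudapp.net:2226/example/data/StreamingOutput/abtj/part-00000 -o C:\MapReduce\results.txt

The -k switch tells cURL to accept the cluster’s certificate without validation, and -o writes the download to the local file you specify.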

Happy downloading!

UPDATE: This functionality has now been disabled in HDInsight, see this thread from the MSDN Forum.


Using Azure Blob Storage as a Data Source for Hadoop on Azure

One of the things I’ve learned from the Microsoft team behind Hadoop on Azure is that the Hadoop clusters’ short lifespan is in fact intentional – the clusters are intended to be disposable and exist for the lifetime of the analysis only.

So what happens if you want your raw data to live up in the cloud for longer? The answer is Azure Blob Storage. This gives you up to 100TB of storage per account, so it should be adequate for most cases. Hadoop on Azure can reference Azure Blob Storage (or Amazon S3 blob storage, if you want a non-Microsoft solution) directly as a location for input data.

Firstly, you need to have an Azure account and set up storage, instructions for which can be found here. Then you need to upload some data to it, which can most easily be done using a nice tool called Cloudberry Explorer, which operates pretty much like an FTP tool. Next you need to configure your Hadoop on Azure instance to point at your Azure Blob Storage. Finally, as per this guide, you can point your jobs at the Azure Blob Storage using the asv:// notation, like in the example below:

hadoop jar hadoop-streaming.jar -files "hdfs://10.NN.NN.NN:9000/example/apps/mappertwo.exe,hdfs://10.NN.NN.NN:9000/example/apps/reducertwo.exe" -mapper "mappertwo.exe" -reducer "reducertwo.exe" -input "asv://hadoop-test/" -output "/example/data/StreamingOutput/abtj"

This is of course slightly confusing, because when setting up your job the parameters can be marked as asv://, like below:

Fig 1: Job Parameters

However, I couldn’t work out how to mark something as an input parameter as well as have it as ASV on the dropdown, so I left it as plain text and entered the parameter as:

-input "asv://hadoop-test/"

And it all worked – though I did find that it didn’t handle direct file references; it would only accept storage references at the folder level.
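To illustrate (the file name here is purely hypothetical), a direct reference like the one below wouldn’t work for me; only the folder-level form shown above would:

-input "asv://hadoop-test/transactions.txt"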

So there you go – using Azure Blob Storage as a data source for Hadoop on Azure. Nice and easy.


Upload data to a Hadoop on Azure cluster

Once you have a shiny Hadoop on Azure cluster, getting some data on it so you can get your MapReduce on is the next challenge. There are two options:

  1. Via the Javascript console
  2. Via FTP

Javascript Console

This has the advantage of being the simplest approach, as all you need to do is a command line input and then pick your file. The downside is that it’s slow, unstable and not really suited to uploading large numbers of files or large volumes of data.

However, so you know how to do it for little test files, here’s what to do. First, click on the interactive console:

Fig 1: Hadoop on Azure interactive console

Then at the command line type "fs.put()":

Fig 2: Hadoop on Azure Interactive Javascript

This will launch the file upload dialog – browse for the file on your local machine, enter the destination path on your Cluster and click upload. The file should upload… though as I’ve mentioned, it gets shakier as the files get bigger. If it works the dialog will close and below the fs.put() you typed in, a “File uploaded” confirmation line will appear.

FTP

UPDATE: This functionality has now been disabled in HDInsight, see this thread from the MSDN Forum.

This is the more industrial approach. Most of the detail below is sourced from this TechNet article: How to FTP data to Hadoop-based services on Windows Azure. However, as noted in a comment at the bottom of that article, the FTPS service is also a bit unstable – sometimes it just doesn’t work. Presumably this is a CTP stability issue.

Follow that guide’s steps 1 through 11 to open your ports and get the MD5 hash of your password. After that it assumes you have a few things at your disposal, so I’m filling in the gaps here.

For the life of me I could not get my personal favourite FTP client, FileZilla, to connect properly, though I’m still poking at it, and if I get it to work I’ll put a post up explaining how. So I ended up using what the article suggested, which is cURL. cURL can be downloaded here. The version you want will be Windows SSL SSPI enabled – either 32 or 64 bit depending on your own OS. Download and unzip the package, and you’re ready to go – cURL is just a command line executable, so there’s no install or GUI.

From the command line, navigate to the same folder as your cURL executable and put a script in a batch file that looks like this:

curl -k -T C:\MapReduce\Transactions.txt ftps://your_cluster_user_name:your_password_md5_hash@your_cluster_name.cloudapp.net:2226/example/data/transactions.txt

Replace "C:\MapReduce\Transactions.txt" with your source file and "/example/data/transactions.txt" with your target path on your HDFS cluster. Of course, also update your cluster user name, password MD5 hash and cluster name.

The command switches -k & -T are required and explained here in the cURL command line switch documentation.
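If you’re going to run this more than once it may be worth parameterising the batch file. A minimal sketch, using the same placeholder values as above:

@echo off
rem All values below are placeholders - substitute your own cluster details
set CLUSTER_USER=your_cluster_user_name
set PASSWORD_HASH=your_password_md5_hash
set CLUSTER_NAME=your_cluster_name
set SOURCE_FILE=C:\MapReduce\Transactions.txt
set TARGET_PATH=/example/data/transactions.txt

rem -k accepts the cluster certificate without validation, -T specifies the file to upload
curl -k -T %SOURCE_FILE% ftps://%CLUSTER_USER%:%PASSWORD_HASH%@%CLUSTER_NAME%.cloudapp.net:2226%TARGET_PATH%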

Run the batch file and watch the progress of your file transfer. You can validate the upload worked on the target from the Javascript interactive console as described in step 14 of the TechNet guide.
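If memory serves, the console command for that check is along these lines (assuming the same target path as above):

#ls /example/data

which should list transactions.txt if the upload worked.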

 


Setting up a Hadoop on Azure Cluster in 15 minutes

One thing I’d like to share about the Microsoft Hadoop on Azure offering is how ridiculously easy it is to set up.

This is the setup screen for registering your cluster (click to see full size):

Fig 1: Hadoop on Azure Cluster setup screen

It’s all shiny and Metro interfaced, but the important bit is that to set up a cluster you need to choose the following:

  • A DNS Name
  • How big a Cluster you want (4 Node 2TB, 8 Node 4TB, 16 Node 8TB or 32 Node 16TB)
  • Username and Password for the Cluster
  • Optional Azure details to store Hive content
  • …. and that’s it.

Enter those fields, click request cluster and you get your wait screen:

Fig 2: Hadoop on Azure Cluster Allocation wait screen

It then allocates your nodes:

Fig 3: Hadoop on Azure Node Allocation Wait Screen

Gets down to the business of Creating and Starting them:

Fig 4: Hadoop on Azure Node Creation Wait Screen

The services start on the Nodes:

Fig 5: Hadoop on Azure Nodes starting

And then it’s cooked:

Fig 6: One Hadoop on Azure Cluster at your service

This took 15 minutes.

Let me repeat that – end to end – the process took 15 minutes.

Some days my laptop takes that long to become usable….


MapReduce in C# for Hadoop on Azure

There is a bewildering array of language options available for writing Mappers and Reducers (aka MapReduce) – Java and Python feature heavily, and for the non-programmer the entire exercise is borderline incomprehensible.

However, a kind soul by the name of Sreedhar Pelluru has posted a simple walkthrough for building a Mapper and Reducer using C# & VS2010 for us Microsoft oriented souls, with an intended Hadoop on Azure target. The walkthrough is here: Walkthrough: Creating and Using C# Mapper and Reducer (Hadoop Streaming)
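If you just want a feel for what a streaming Mapper looks like before diving in: it is simply a console application that reads input records from stdin and writes tab-separated key/value pairs to stdout (the Reducer then reads the sorted pairs back from stdin and aggregates the values for each key). Here’s a minimal word-count style sketch of my own, illustrative only and not the code from the walkthrough:

// Minimal sketch of a Hadoop Streaming mapper in C# (word-count style) - illustrative only
using System;

class WordCountMapper
{
    static void Main()
    {
        string line;
        // Hadoop Streaming passes each input record to the mapper on stdin, one per line
        while ((line = Console.ReadLine()) != null)
        {
            foreach (var word in line.Split(' ', '\t'))
            {
                if (word.Length > 0)
                {
                    // Emit key<TAB>value pairs on stdout for the reducer to aggregate
                    Console.WriteLine("{0}\t1", word);
                }
            }
        }
    }
}

Compile that as a console application and you end up with the kind of .exe that gets referenced in the -mapper argument of the streaming job.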

There are a few holes in the script, so here are the things to look out for:

  • In the section “Create and run a Map/Reduce job on HadoopOnAzure portal”, the first item suggests you run a Javascript command to get the IP address, but doesn’t provide it until a few lines later – the command is: "#cat file:///apps/dist/conf/core-site.xml". You can also find out the IP by remoting into the cluster and running IPConfig at the command line.
  • Step 7 in the same section asks you to open hadoop-streaming.jar, and it took me a while to realise this meant on the HadoopOnAzure portal, not on your local machine (so I spent quite a bit of time in misadventures trying to manipulate the file on my local machine).
  • Error messages for job failure aren’t terribly helpful, and there’s no validation on job parameter input, so really, really make sure that your command looks exactly like the one in step 11. Miss a double quote or mistype a path and you will get no hint as to that being the source of the error.

Eventually I beat the above and achieved victory – a successful job run on HadoopOnAzure with the expected results. Next challenge – build my own data, mapper & reducer and repeat. Then get it into Hive….

 


Issuing a Hive query against Hadoop on Azure using Excel

…no, seriously. I can query an Azure cloud-based Hive instance of Hadoop from Excel. It’s simple stuff.

First step is to install the Hive drivers and Excel add-in. This then gives you a new button on your Excel ribbon:

The Hive Excel Add-In

Clicking this opens up the query pane, which is pretty simple stuff:

Hive Excel Add-in Query Pane

Pick your Hive instance, choose the table (in this case just the standard sample table), pick your columns…  then scroll down a bit because the pane is a bit long when you expand out the options.

Hive Excel Add-in Query Pane

I passed on providing any Criteria (i.e. the WHERE clause), added in some Aggregations, skipped over ordering as I can do that easily enough in Excel, and added a 2k row limit.

This spat out some HiveQL (Hive Query Language), which I modified slightly to include a count(*). Then I clicked Execute Query and waited a bit for the Big Data Bees to make my Insights.
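For the curious, the query was of roughly this shape (exact table and column names here are illustrative; if memory serves, the standard sample table is called hivesampletable):

SELECT devicemake, count(*) FROM hivesampletable GROUP BY devicemake LIMIT 2000;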

Then:

Hive Data in Excel

I have Hive Data in Excel. I could have put it in PowerPivot if I really wanted to show off…


Hadoop on Azure

For those who may be unaware, Microsoft are offering Hadoop capability hosted in Azure (Microsoft’s cloud hosting ecosystem). There is scant information at the official site, but this blog by Avkash Chauhan on MSDN is one good source of detail.

Anyway, I’m in on the CTP and I’m going to figure out what it’s all about and bring you on the journey. As a taster, here’s what it looks like:

Hadoop on Azure

I also needed to set up a SQL Azure instance as well, which was incredibly easy, so I’m going to have to start exploring that as well. Some busy blogging and tweeting days ahead…
