One of the things I’ve learned from the Microsoft team behind Hadoop on Azure is that the Hadoop clusters’ short lifespan is in fact intentional – the clusters are intended to be disposable and exist for the lifetime of the analysis only.
So what happens if you want your raw data to live up in the cloud for longer? The answer is Azure Blob Storage. This gives you up to 100TB of storage per account, so it should be adequate for most cases. Hadoop on Azure can reference Azure Blob Storage (or Amazon S3 blob storage, if you want a non-Microsoft option) directly as a location for input data.
First, you need an Azure account with storage set up; instructions for that can be found here. Next, upload some data to it, which is most easily done with a nice tool called CloudBerry Explorer, which works much like an FTP client. Then configure your Hadoop on Azure instance to point at your Azure Blob Storage, and, as per this guide, you can point your jobs at it using the asv:// notation, as in the example below:
hadoop jar hadoop-streaming.jar -files "hdfs://10.NN.NN.NN:9000/example/apps/mappertwo.exe,hdfs://10.NN.NN.NN:9000/example/apps/reducertwo.exe" -mapper "mappertwo.exe" -reducer "reducertwo.exe" -input "asv://hadoop-test/" -output "/example/data/StreamingOutput/abtj"
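To make the asv:// part of that command a little clearer, here is a minimal sketch of how the URI is composed. The container and folder names below are hypothetical, not from my actual setup; the path resolves against whichever storage account you configured on the cluster.

```shell
# Hypothetical names for illustration only.
CONTAINER="hadoop-test"      # an Azure Blob Storage container
FOLDER="logs/2012-06"        # a folder (blob name prefix) inside it

# asv:// URIs take the form asv://<container>/<path>, resolved
# against the storage account the cluster is configured with.
INPUT="asv://${CONTAINER}/${FOLDER}/"
echo "$INPUT"
```

You would then pass that value straight to the streaming job's -input parameter.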
This is of course slightly confusing, because when setting up your job the parameters can be marked as asv://, as below:
However, I couldn't work out how to mark something as an input parameter while also selecting ASV on the dropdown, so I left it as plain text and entered the command as:
And it all worked. One thing I did find is that it didn't handle direct file references: it would only accept storage references at the folder level.
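Since that folder-level restriction tripped me up, here is a hypothetical pre-flight check you could run on an input path before submitting the job. It is just a sketch of the rule I observed (a trailing slash means folder-level), not part of Hadoop on Azure itself.

```shell
# Hypothetical helper: the preview only accepted folder-level asv://
# references in my tests, so warn if an input looks like a direct file.
check_asv_input() {
  case "$1" in
    asv://*/) echo "ok: folder-level reference" ;;
    asv://*)  echo "warning: direct file reference, rejected in my tests" ;;
    *)        echo "not an asv:// reference" ;;
  esac
}

check_asv_input "asv://hadoop-test/"
check_asv_input "asv://hadoop-test/part-00000"
```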
So there you go – using Azure Blob Storage as a data source for Hadoop on Azure. Nice and easy.