Once you have a shiny Hadoop on Azure cluster, getting some data onto it so you can get your MapReduce on is the next challenge. There are two options:
- Via the Interactive Console
- Via FTPS

Via the Interactive Console
This has the advantage of being the simplest approach: all you need to do is type a single command and then pick your file. The downside is that it's slow, unstable and not really suited to uploading files in large numbers or large volumes.
However, so you know how it's done for little test files, here's what to do. First, click on the interactive console:
Then at the command line type "fs.put()".
This will launch the file upload dialog – browse for the file on your local machine, enter the destination path on your cluster and click upload. The file should upload… though as I've mentioned, it gets shakier as files get bigger. If it works, the dialog will close and a "File uploaded" confirmation line will appear below the fs.put() you typed.
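For reference, and working from memory of the CTP console (so the exact prompt and output formatting may differ), the whole exchange looks roughly like this – the #ls command is handy for confirming the file landed where you expected:

```
js> fs.put()
File uploaded.
js> #ls /example/data
Found 1 items
-rw-r--r--   ...   /example/data/transactions.txt
```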
UPDATE: This functionality has now been disabled in HDInsight, see this thread from the MSDN Forum.
Via FTPS
This is the more industrial approach. Most of the detail below is sourced from this TechNet article: How to FTP data to Hadoop-based services on Windows Azure. However, as noted in a comment at the bottom of that article, FTPS is also a bit unstable – sometimes it just doesn't work. Presumably this is a CTP stability issue.
Follow that guide's steps 1 through 11 to open your ports and get the MD5 hash of your password. After that the article assumes you have a few things at your disposal, so I'm filling in the gaps here.
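If you want to double-check the MD5 hash you got from the guide's steps, and you have a Unix-style shell handy (Git Bash on Windows works), you can compute it yourself. The password below is a made-up example – substitute your own:

```shell
# Compute the MD5 hash of a password string. printf is used instead of echo
# so no trailing newline sneaks into the hash and changes the result.
PASSWORD='MyClusterPassword1!'
HASH=$(printf '%s' "$PASSWORD" | md5sum | cut -d ' ' -f 1)
echo "$HASH"
```

The output should be a 32-character hex string; if yours doesn't match what the guide's method produced, a stray newline or trailing space in the input is the usual culprit.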
For the life of me I could not get my personal favourite FTP client, FileZilla, to connect properly, though I'm still poking at it – if I get it to work I'll put up a post explaining how. So I ended up using what the article suggested: cURL, which can be downloaded here. The version you want is the Windows SSL SSPI-enabled build – either 32 or 64 bit, depending on your OS. Download and unzip the package and you're ready to go – cURL is just a command line executable, so there's no install or GUI.
From the command line, navigate to the folder containing the cURL executable and create a batch file with a command that looks like this:
curl -k -T C:\MapReduce\Transactions.txt ftps://your_cluster_user_name:your_password_md5_hash@your_cluster_name.cloudapp.net:2226/example/data/transactions.txt
Replace "C:\MapReduce\Transactions.txt" with your source file and "/example/data/transactions.txt" with your target path on HDFS, and update your cluster user name, password MD5 hash and cluster name.
The -k switch (allow "insecure" connections, i.e. skip server certificate verification) and the -T switch (transfer the given local file) are both required; they're explained here in the cURL command line documentation.
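Given how fiddly the URL is to assemble, a dry run helps: the sketch below builds the command from its parts and just echoes it, so you can eyeball the result before running it for real. All values are placeholders – substitute your own cluster details:

```shell
# Dry-run helper: assemble the cURL upload command and print it for inspection.
# Every value here is a placeholder, not a real cluster.
CLUSTER_USER='your_cluster_user_name'
PASSWORD_HASH='your_password_md5_hash'
CLUSTER_NAME='your_cluster_name'
SOURCE_FILE='C:\MapReduce\Transactions.txt'
TARGET_PATH='/example/data/transactions.txt'

UPLOAD_CMD="curl -k -T $SOURCE_FILE ftps://$CLUSTER_USER:$PASSWORD_HASH@$CLUSTER_NAME.cloudapp.net:2226$TARGET_PATH"
echo "$UPLOAD_CMD"
# When the printed command looks right, run it with: eval "$UPLOAD_CMD"
```

This also makes it obvious if, say, a stray space in the password hash has mangled the URL before you burn time on a failed transfer.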