As described in my Upload data post, you need to use cURL, and the command syntax is:
curl -k ftps://[cluster user name]:[password MD5 hash]@[cluster name].cloudapp.net:2226/[path to data or specific file on HDFS] -o [local path name on your machine]
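For example, with a hypothetical cluster called myhadoopcluster, user Admin, and the sample transactions file used later in this post, the filled-in command would look something like this (the MD5 hash shown is a placeholder, not a real credential):

```shell
REM Download /example/data/transactions.txt from HDFS to the local machine.
REM -k skips certificate validation, which the cluster's self-signed cert requires.
REM Cluster name, user and hash below are made-up placeholders.
curl -k ftps://Admin:5f4dcc3b5aa765d61d8327deb882cf99@myhadoopcluster.cloudapp.net:2226/example/data/transactions.txt -o C:\MapReduce\transactions.txt
```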
As the title says, FTP to Hadoop on Azure with FileZilla – doesn’t work – which is possibly due to an FTP server configuration issue / FileZilla version compatibility problem called out here: http://trac.filezilla-project.org/ticket/7873. The proposed solution requires an FTP server config change, which can’t be made on the Hadoop on Azure clusters because the user doesn’t have administrative permissions.
However, for reference should this get fixed, these are the settings you need for FileZilla in Site Manager:
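Working back from the cURL connection string above, the Site Manager entry would presumably look like the following – untested, of course, since the connection itself currently fails:

```
Host:       [cluster name].cloudapp.net
Port:       2226
Protocol:   FTP - File Transfer Protocol
Encryption: Require explicit FTP over TLS
Logon Type: Normal
User:       [cluster user name]
Password:   [password MD5 hash]
```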
Once you have a shiny Hadoop on Azure cluster, getting some data on it so you can get your MapReduce on is the next challenge. There are two options:
This has the advantage of being the simplest approach, as all you need to do is enter a command and then pick your file. The downside is that it’s slow, unstable, and not really suited to uploading large numbers of files or large volumes of data.
However, so you know how to do it for little test files, here’s what to do. First, click on the interactive console:
Then at the command line, type “fs.put()”.
This will launch the file upload dialog – browse for the file on your local machine, enter the destination path on your Cluster and click upload. The file should upload… though as I’ve mentioned, it gets shakier as the files get bigger. If it works the dialog will close and below the fs.put() you typed in, a “File uploaded” confirmation line will appear.
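Putting those steps together, the console interaction amounts to something like this (the prompt and confirmation wording are from memory, so treat them as approximate):

```
js> fs.put()
File uploaded.
```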
This is the more industrial approach. Most of the detail below is sourced from this TechNet article: How to FTP data to Hadoop-based services on Windows Azure. However, as noted in a comment at the bottom of that article, the FTPS service is also a bit unstable – sometimes it just doesn’t work. Presumably this is a CTP stability issue.
Follow that guide’s steps 1 through 11 to open your ports and get the MD5 hash of your password. After that, the article assumes you have a few things at your disposal, so I’m filling in the gaps here.
For the life of me I could not get my personal favourite FTP client, FileZilla, to connect properly, though I’m still poking at it, and if I get it to work I’ll put up a post explaining how. So I ended up using what the article suggested, which is cURL. cURL can be downloaded here. The version you want is the Windows SSL SSPI-enabled one – either 32- or 64-bit, depending on your OS. Download and unzip the package and you’re ready to go – cURL is just a command line executable, so there’s no install or GUI.
From the command line, navigate to the folder containing your cURL executable and put a script in a batch file that looks like this:
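Based on the download syntax shown earlier and the file names referenced below, the batch file would look roughly like this – the -T flag tells cURL to upload the named local file to the remote path:

```shell
REM Upload a local file to HDFS over FTPS.
REM -k skips certificate validation; -T specifies the local file to upload.
curl -k -T C:\MapReduce\Transactions.txt ftps://[cluster user name]:[password MD5 hash]@[cluster name].cloudapp.net:2226/example/data/transactions.txt
```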
Replace “C:\MapReduce\Transactions.txt” with your source file and “/example/data/transactions.txt” with your target path on your HDFS cluster. Also update your cluster user name, password MD5 hash and cluster name.