Download data from a Hadoop on Azure cluster

So you’ve run a job on Hadoop on Azure, and now you want that data somewhere more useful, like in your Data Warehouse for some analytics. If the Hive ODBC Driver isn’t an option (perhaps because you used Pig), then FTP is the way – there isn’t a Javascript console fs.get() command available.

As described in my Upload data post, you need to use curl, and the command syntax is:

curl -k ftps://[cluster user name]:[password md5 hash]@[cluster name].cloudapp.net:2226/[path to data or specific file on HDFS] -o [local path name on your machine]
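For example, to pull a single Pig output file down to a local folder, a filled-in version of that command might look like the line below. The user name "admin", the cluster name "mycluster" and the /example/data/pigout/part-r-00000 path are placeholders of my own (Pig and MapReduce jobs typically write their results as part-* files inside the output directory), so swap in your own values:

curl -k ftps://admin:your_password_md5_hash@mycluster.cloudapp.net:2226/example/data/pigout/part-r-00000 -o C:\MapReduce\pigout.txt

The -k switch tells cURL to accept the cluster's SSL certificate without verification, and -o writes the downloaded file to the local path you give it.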

Happy downloading!

UPDATE: This functionality has now been disabled in HDInsight; see this thread from the MSDN Forum.

FTP to Hadoop on Azure with Filezilla – doesn’t work :(

As the title says, FTP to Hadoop on Azure with Filezilla doesn't work. This is possibly due to the FTP server configuration / Filezilla version compatibility problem called out here: http://trac.filezilla-project.org/ticket/7873. The proposed solution requires an FTP server config change, which can't be made on the Hadoop on Azure clusters as the user doesn't have administrative permissions.

However, for reference should this get fixed, these are the settings you need in Filezilla's Site Manager:

  • Host: yourclustername.cloudapp.net
  • Port: 2226 (or 2227, 2228 now)
  • Protocol: FTP
  • Encryption: Require Implicit FTP over TLS
  • Logon Type: Normal
  • User: yourusername
  • Password: md5 hash of password (See step 11 here)

Also under Transfer Settings, opt for Passive transfer mode.
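For what it's worth, the same implicit FTPS settings do work (intermittently) from cURL, as covered in my other posts. As a quick connectivity test, a command like the one below should list the contents of the HDFS root if the connection succeeds. It uses the same placeholder credentials as the list above, and cURL defaults to passive mode for FTP:

curl -k ftps://yourusername:your_password_md5_hash@yourclustername.cloudapp.net:2226/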

If anyone out there can get this working, please let me know so I can share the solution.

UPDATE: This functionality has now been disabled in HDInsight; see this thread from the MSDN Forum.

Upload data to a Hadoop on Azure cluster

Once you have a shiny Hadoop on Azure cluster, getting some data on it so you can get your MapReduce on is the next challenge. There are two options:

  1. Via the Javascript console
  2. Via FTP

Javascript Console

This has the advantage of being the simplest approach, as all you need to do is enter a single command and then pick your file. The downside is that it's slow, unstable and not really suited to uploading large numbers of files or large volumes of data.

However, it's handy to know for little test files, so here's what to do. First, click on the interactive console:

Fig 1: Hadoop on Azure interactive console

Then, at the command line, type "fs.put()":

Fig 2: Hadoop on Azure Interactive Javascript

This will launch the file upload dialog – browse for the file on your local machine, enter the destination path on your cluster and click Upload. The file should upload… though, as I've mentioned, it gets shakier as the files get bigger. If it works, the dialog will close and a "File uploaded" confirmation line will appear below the fs.put() you typed in.
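For a small test file, the whole interaction is only a couple of lines. As a rough illustration – and treating the js> prompt and the #ls listing command as assumptions about how the console behaves, rather than gospel – a session might look like this:

js> fs.put()
File uploaded
js> #ls /example/data

The #ls line is just there to confirm the file landed where you expected; swap /example/data for whatever destination path you entered in the dialog.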

FTP

UPDATE: This functionality has now been disabled in HDInsight; see this thread from the MSDN Forum.

This is the more industrial approach. Most of the detail below is sourced from this TechNet article: How to FTP data to Hadoop-based services on Windows Azure. However, as noted in a comment at the bottom of that article, the FTPS route is also a bit unstable – sometimes it just doesn't work. Presumably this is a CTP stability issue.

Follow that guide's steps 1 through 11 to open your ports and get the MD5 hash of your password. After that, the guide assumes you have a few things at your disposal, so I'm filling in the gaps here.

For the life of me I could not get my personal favourite FTP client, FileZilla, to connect properly, though I'm still poking at it; if I get it to work I'll put a post up explaining how. So I ended up using what the article suggested, which is cURL. cURL can be downloaded here. The version you want is the Windows SSL SSPI-enabled build – either 32 or 64 bit, depending on your OS. Download and unzip the package and you're ready to go – cURL is just a command line executable, so there's no install or GUI.

From the command line, navigate to the same folder as your cURL executable and put a command like this in a batch file:

curl -k -T C:\MapReduce\Transactions.txt ftps://your_cluster_user_name:your_password_md5_hash@your_cluster_name.cloudapp.net:2226/example/data/transactions.txt

Replace "C:\MapReduce\Transactions.txt" with your source file and "/example/data/transactions.txt" with your target path on HDFS. Of course, also update your cluster user name, password MD5 hash and cluster name.

The command switches -k (accept the server's SSL certificate without verification) and -T (upload the specified local file) are required, and are explained here in the cURL command line switch documentation.
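If you'd rather not hard-code your details into the command, the batch file can be parameterised. Here's a rough sketch, assuming curl.exe sits in the same folder as the batch file – the variable names are my own invention, not anything the TechNet guide prescribes:

@echo off
rem Cluster connection details - replace these placeholders with your own values
set CLUSTER_USER=your_cluster_user_name
set PASSWORD_HASH=your_password_md5_hash
set CLUSTER_NAME=your_cluster_name

rem Upload the local file to the target HDFS path over implicit FTPS (port 2226)
curl -k -T C:\MapReduce\Transactions.txt ftps://%CLUSTER_USER%:%PASSWORD_HASH%@%CLUSTER_NAME%.cloudapp.net:2226/example/data/transactions.txt

rem Repeat the curl line for any other files you want to push up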

Run the batch file and watch the progress of your file transfer. You can validate that the upload worked from the Javascript interactive console on the cluster, as described in step 14 of the TechNet guide.

 
