Compression in Hadoop Streaming Jobs

The thing about Big Data is, well… it’s big. That has real consequences for how long it takes to move your data around and how much space you need to store it. As a novice, I had assumed that you had to decompress your data before processing it, and that I simply had to tolerate the huge volumes of output my (admittedly not very efficient) code produced.

As it turns out, you can not only process input in a compressed format, you can also compress the output, as detailed in the Hadoop Streaming documentation. So now my jobs start smaller and end smaller, without a massive performance overhead.
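Reading compressed input really is transparent to your code: Hadoop decompresses recognised formats before a single byte reaches the mapper. As an illustration, here is a minimal Python streaming mapper sketch (a hypothetical stand-in for the C# executable used below, not its actual code); notice that nothing in it mentions compression, yet it works unchanged on gzipped input:

#!/usr/bin/env python
# Minimal Hadoop Streaming mapper sketch. This is a hypothetical
# stand-in for Sentiment_v2.exe, whose logic isn't shown in the post.
# Nothing here is compression-specific: when the input files are
# gzipped, Hadoop decompresses them before the lines arrive on stdin.
import sys

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    # Emit a tab-separated key/value pair per record, the Streaming
    # convention; a real mapper would do its per-record work here.
    print("%s\t%d" % (line.split()[0], 1))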

So how does it work? On the input side, as the sketch above shows, you have to configure absolutely nothing. It just works, as long as Hadoop recognises the compression format, which it infers from the file extension (.gz, .bz2 and so on). To compress the output, you need to tell the job to do so. Using the -D option you can set generic command options that configure the job. A sample job, formatted for HDInsight, is below; the key options are the two -D settings immediately after the jar name:

c:\hadoop\hadoop-1.1.0-SNAPSHOT\bin\hadoop.cmd jar ^
  C:\Hadoop\websites\HadoopDashboard\Models\Samples\hadoop-streaming.jar ^
  "-D mapred.output.compress=true" ^
  "-D mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec" ^
  -files "hdfs://localhost:8020/user/hadoop/code/Sentiment_v2.exe" ^
  -numReduceTasks 0 ^
  -mapper "Sentiment_v2.exe" ^
  -input "/user/hadoop/data/" ^
  -output "/user/hadoop/output/Sentiment"

This tells the job to compress its output, using gzip as the codec.
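To confirm it worked, list the output directory and read one of the part files back. Each part file now carries a .gz extension, and hadoop fs -text decompresses it on the fly (the part-00000 name below is just the typical default for the first map task, not taken from the job above):

c:\hadoop\hadoop-1.1.0-SNAPSHOT\bin\hadoop.cmd fs -ls /user/hadoop/output/Sentiment
c:\hadoop\hadoop-1.1.0-SNAPSHOT\bin\hadoop.cmd fs -text /user/hadoop/output/Sentiment/part-00000.gz

And if a job does have a reduce phase (this one runs with -numReduceTasks 0), the same -D mechanism can compress the intermediate map output too, via mapred.compress.map.output=true and mapred.map.output.compression.codec, which shrinks the data shuffled between mappers and reducers.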

And now, my jobs are still inefficient but at least take up less disk space!
