What is MapReduce anyway?

I have been wandering into the Hadoop world lately, and doing my best to understand and build MapReduce jobs. Undoubtedly some of you are wondering what MapReduce is… My novice’s understanding so far is that it is to some extent equivalent to ELT in the traditional DW world.

MapReduce is actually two processes that run in sequence – the Mappers and then the Reducers – and sometimes the Reducers are optional. Now that we have some more jargon, let’s break it down.

Mappers

Mappers make sense of what is called “unstructured data” – which to me is a bit of a misnomer, as it implies the data has no structure. What it really means is that it hasn’t been put into a formal structure yet. An example might be a transaction record from an ATM – it may be stored as a line in a file with a number of columns in a known sequence. However, because those columns aren’t called out in a structured format such as a database, where you can reference a column directly by name, it gets called “unstructured”. I’ll do another small post on this soon, but the picture below gives a rough idea of what’s happening:

Fig 1: A Mapper in action

So what Mappers do is create the structure on the fly. From your transaction record with 200 possible fields per line, the Mappers suck out the relevant components of data for your analysis and output it ready for the Reducers to consume. This roughly aligns with the “EL” in ELT.
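To make that a bit more concrete, here’s a minimal sketch of what a Streaming-style Mapper can look like as a C# console app (C# being the language the walkthrough later in this post uses). The pipe-delimited record layout and the field positions for the card number and amount are entirely made up for illustration – a real ATM feed will look different.

    // A Streaming mapper is just a console app: read lines from stdin,
    // emit key<TAB>value lines on stdout for the Reducers to consume.
    using System;

    class AtmMapper
    {
        static void Main()
        {
            string line;
            while ((line = Console.ReadLine()) != null)
            {
                string[] fields = line.Split('|');   // hypothetical pipe-delimited record
                if (fields.Length < 8)
                    continue;                        // skip malformed lines

                string cardNumber = fields[3];       // column referenced by position, not by name
                string amount = fields[7];           // withdrawal amount (illustrative position)

                // Hadoop sorts by key before the Reducers run, so everything
                // for one card arrives together on the Reducer side.
                Console.WriteLine(cardNumber + "\t" + amount);
            }
        }
    }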

Reducers

Reducers are the value-add processes that take the output from the Mappers and perform operations on it such as aggregations, data mining – whatever you want to cook up. In my ATM transaction example you may be looking for unusually large transactions or strange withdrawal patterns that may indicate fraud. This is roughly the “T” in ELT.

Reducers can be optional – you can just use your Mappers as one big staging job to get the elements you need from your masses of data for loading elsewhere.
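If you do want a Reducer, here’s a matching sketch that consumes the sorted key/value lines the Mapper above emits and totals withdrawals per card. The card and amount fields and the “suspicious” threshold are my own illustrative inventions, not anything from a real fraud rule.

    // A Streaming reducer also reads stdin, but the input is already sorted
    // by key, so all of a card's values arrive as a contiguous run of lines.
    using System;
    using System.Globalization;

    class AtmReducer
    {
        static void Main()
        {
            string currentCard = null;
            decimal total = 0;

            string line;
            while ((line = Console.ReadLine()) != null)
            {
                string[] parts = line.Split('\t');
                if (parts.Length < 2)
                    continue;

                decimal amount;
                if (!decimal.TryParse(parts[1], NumberStyles.Any,
                                      CultureInfo.InvariantCulture, out amount))
                    continue;

                if (parts[0] != currentCard)
                {
                    Flush(currentCard, total);       // key changed: emit the previous card's total
                    currentCard = parts[0];
                    total = 0;
                }
                total += amount;
            }
            Flush(currentCard, total);               // don't forget the final card
        }

        static void Flush(string card, decimal total)
        {
            if (card == null)
                return;
            string flag = total > 10000m ? "SUSPECT" : "OK";   // illustrative threshold only
            Console.WriteLine(card + "\t" + total + "\t" + flag);
        }
    }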

Distributed Computing

The clever part in all this is how the work gets distributed across many compute nodes so it can be done cheaply and relatively quickly. However, that’s another big subject in itself, which requires a bit of understanding of how HDFS – the Hadoop Distributed File System – works under the hood. So I’ll park that for now and leave you with some other MapReduce explanations that I found helpful.


MapReduce in C# for Hadoop on Azure

There is a bewildering array of language options available for writing Mappers and Reducers (aka MapReduce) – Java and Python feature heavily, and for the non-programmer the entire exercise is borderline incomprehensible.

However, a kind soul by the name of Sreedhar Pelluru has posted a simple walkthrough for building a Mapper and Reducer using C# & VS2010 for us Microsoft-oriented souls, with Hadoop on Azure as the intended target. The walkthrough is here: Walkthrough: Creating and Using C# Mapper and Reducer (Hadoop Streaming)

There are a few holes in the script, so here are the things to look out for:

  • In the section “Create and run a Map/Reduce job on HadoopOnAzure portal”, the first item suggests you run a JavaScript command to get the IP address, but doesn’t provide it until a few lines later – the command is “#cat file:///apps/dist/conf/core-site.xml”. You can also find the IP by remoting into the cluster and running ipconfig at the command line.
  • Step 7 in the same section asks you to open hadoop-streaming.jar, and it took me a while to realise this meant on the HadoopOnAzure portal, not on your local machine (so I spent quite a bit of time in misadventures trying to manipulate the file locally).
  • Error messages for job failures aren’t terribly helpful, and there’s no validation of the job parameter input, so make absolutely sure your command looks exactly like the one in step 11. Miss a double quote or mistype a path and you will get no hint that this is the source of the error.

Eventually I beat the above and achieved victory – a successful job run on HadoopOnAzure with the expected results. Next challenge – build my own data, mapper & reducer and repeat. Then get it into Hive….

 
