I have been wandering into the Hadoop world lately, doing my best to understand and build MapReduce jobs. Undoubtedly some of you are wondering what MapReduce is… My novice’s understanding so far is that it is to some extent equivalent to ELT in the traditional DW world.
MapReduce is actually two processes that run in sequence – the Mappers and then the Reducers – and sometimes the Reducers are optional. Now that we have some more jargon, let’s break it down.
Mappers make sense of what is called “unstructured data” – which to me is a bit of a misnomer, as it implies the data has no structure. What it really means is that the data hasn’t been put into a formal structure yet. An example might be a transaction record from an ATM – it may be stored as a line in a file with a number of columns in a known sequence. However, because those columns aren’t called out in a structured format such as a database, where you can reference a column directly, it gets called “unstructured”. I’ll do another small post on this soon, but the picture below gives a rough idea of what’s happening:
So what Mappers do is create the structure on the fly. From your transaction record with 200 possible fields per line, the mappers suck out the relevant components of data for your analysis, and output it ready for the Reducers to consume. This roughly aligns to the “EL” in ELT.
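To make that concrete, here is a minimal sketch of a Mapper in Python. It takes one raw delimited transaction line and emits just the key/value pair we care about. The field positions (card ID in column 0, amount in column 5) and the record layout are made up for illustration – a real ATM record would look different:

```python
# A sketch of a Mapper: pull the relevant fields out of one raw line
# and emit a structured key/value pair for the Reducers to consume.
# Field positions here are hypothetical, purely for illustration.
def map_transaction(line):
    fields = line.strip().split(",")
    card_id = fields[0]        # hypothetical position of the card ID
    amount = float(fields[5])  # hypothetical position of the amount
    return (card_id, amount)   # key/value pair handed to the Reducers

# One raw "unstructured" line in, one structured pair out.
pair = map_transaction("C123,2024-01-05,ATM77,GBP,withdrawal,250.00")
# pair is ("C123", 250.0)
```

In a real Hadoop Streaming job this function would read lines from stdin and print tab-separated key/value pairs, but the idea is the same: structure gets created on the fly, one record at a time.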
Reducers are the value-add processes that take the output from the Mappers. They perform operations on it such as Aggregations, Data Mining – whatever you want to cook up. In my ATM transaction example you may be looking for unusually large transactions or strange withdrawal patterns that may indicate fraud. This is roughly the “T” in ELT.
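Continuing the ATM example, a Reducer sketch might look like the following. The framework delivers all the values for one key (here, one card) together, so the Reducer only has to aggregate them. The 200.0 threshold is an arbitrary value I’ve picked for illustration, not a real fraud rule:

```python
# A sketch of a Reducer: given one key (a card ID) and all the values
# the Mappers emitted for it, aggregate and flag anything suspicious.
# The threshold is an arbitrary illustrative number.
def reduce_card(card_id, amounts, threshold=200.0):
    total = sum(amounts)
    suspicious = any(a > threshold for a in amounts)
    return (card_id, total, suspicious)

# All of card C123's withdrawals arrive together; one exceeds the limit.
result = reduce_card("C123", [50.0, 250.0, 20.0])
# result is ("C123", 320.0, True)
```

Swap the body of the function for whatever “T” you need – sums, counts, pattern detection – and that is essentially what the Reduce phase does for each key.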
Reducers can be optional – you can just use your Mappers as one big staging job to get the elements you need from your masses of data for loading elsewhere.
The clever part in all this is how it gets distributed across many compute nodes so it can run cheaply and relatively quickly. However, that’s another big subject in itself, which requires a bit of understanding about how HDFS – the Hadoop Distributed File System – works under the hood. So I’ll park that for now and leave you with some other MapReduce explanations that I found helpful:
- MapReduce explained through Chutney – great post from Shekhar Gulati
- A more technical view from Ayende @ Rahien that explains from a Document database type perspective
- A viewpoint from Google via Philipp Lenssen