The third and final day at SQL PASS started with me at the bloggers’ table (though only able to tweet manically) watching Dr DeWitt’s keynote, and I can see why his keynotes are so highly regarded. His subject was Big Data, and given the potential for this to be a dull and impenetrable subject area, he gave a great and illuminating talk on the topic.
Topics that he covered included:
- ACID vs BASE (i.e. the trade-off between consistency of data and availability of data)
- NoSQL is a means of querying raw data with no cleansing / structure / ETL
- His expectation is that Structured (SQL) and Unstructured (Hadoop) data will coexist in organisations
- Hadoop consists of Storage (HDFS) and Process (MapReduce)
- MapReduce is too complex to work with directly, so languages such as Hive and Pig sit on top of it
- Sqoop is the tool to make Unstructured and Structured data talk – but performance is not good
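To give a feel for what the MapReduce layer actually does, here is a minimal sketch of the map / shuffle / reduce pattern as a word count in plain Python. It is only an illustration and not something from the talk: it runs in a single process rather than on Hadoop, and the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for the example.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one line of raw text."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group emitted values by key, as a framework would between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Collapse all the values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence over an iterable of text lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
```

Even for something this trivial you have to think in terms of key/value pairs and separate phases, whereas a SQL-like language such as Hive lets you express the same thing as roughly a single `GROUP BY` query, which is presumably the appeal of those higher-level layers.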
I can’t really do his talk justice, but now I understand Hadoop a whole lot better – essentially it’s just a read-only store of unstructured data, a very different beast to a relational database and addressing totally different needs.