An in joke for one of my fellow leaders in the BI industry…
I’ll be presenting at TechEd Australia 2013 on “Big Data, Small Data and Data Visualisation via Sentiment Analysis with HDInsight”
In the session I’ll be looking at HDInsight – Microsoft’s implementation of Hadoop – and how to leverage that to perform some simple Sentiment Analysis, then link that up with structured data to perform some Data Visualisation using the Microsoft BI stack, especially PowerView.
Hopefully this will also tie in with the release of a White Paper on the subject so anyone with deep technical interest can get hands on with the experience.
I’m excited to get a chance to present again – look forward to seeing you there!
This Thursday 22nd August Sydney BI Social presents “Rapid Fire Mini Sessions” – presented by a range of experienced BI professionals giving a quick overview of a topic they are experts in. The sessions are:
Session 1: Power BI in action, the exciting new BI functionality in Excel 2013 and Office 365
Session 2: 5 tips for better data visualisation
Session 3: A day in the life of an SQL DBA
Also – for those with an eye further on the horizon, on Weds Sep 18th we have Stephen Samild presenting on “The Data to Decision Landscape”.
This Wednesday 17th June Sydney BI Social presents “BI & NoSQL” – presented by Stephen Young, CEO of GraphBase and architect of the GraphBase DBMS. Steve will give an overview of the various classes of NoSQL database, their advantages and disadvantages, with an emphasis on Graph Databases and the novel ways that they can be used for Business Intelligence purposes.
The thing about Big Data is, well… it’s big. Which has impacts in terms of how long it takes you to move your data about and the space it needs to be stored in. Now as a novice, I had assumed that you had to decompress your data to process it, and that I also had to tolerate the huge volumes of output my (admittedly not very efficient) code produced.
As it turns out, you can not only process input in a compressed format, you can also compress the output – as detailed in the Hadoop Streaming documentation. So now my jobs start smaller and end smaller, and without a massive performance overhead.
So how does it work? Well, to read compressed data you have to configure absolutely nothing. It just works, as long as Hadoop recognises the compression algorithm. To compress the output, you need to tell the job to do so. Using the “-D” option you can set some generic command options to configure the job. A sample job – formatted for HDInsight – is below, with the key options highlighted in blue:
This tells the job to compress the output, and to use GZip as the compression technique.
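As a rough sketch of what such a job might look like, the Python snippet below assembles a Hadoop Streaming command line with the two compression settings described in the Hadoop Streaming documentation. The jar name, the input and output paths and the mapper and reducer executables are all placeholders I have made up for illustration, and newer Hadoop releases may prefer different property names, so check against your own cluster.

```python
import shlex


def build_streaming_command(streaming_jar, input_path, output_path, mapper, reducer):
    """Assemble a Hadoop Streaming command line that gzips the job output."""
    return [
        "hadoop", "jar", streaming_jar,
        # Generic -D options go before the streaming-specific options.
        "-D", "mapred.output.compress=true",
        "-D", "mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec",
        "-input", input_path,
        "-output", output_path,
        "-mapper", mapper,
        "-reducer", reducer,
    ]


# Every value below is a hypothetical placeholder, not a real HDInsight path.
command = build_streaming_command(
    streaming_jar="hadoop-streaming.jar",
    input_path="/example/input",
    output_path="/example/output",
    mapper="mapper.exe",
    reducer="reducer.exe",
)

# Print the command so it can be pasted into a Hadoop command prompt.
print(shlex.join(command))
```

The first `-D` switches output compression on and the second picks the codec. Nothing is needed on the input side.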
And now, my jobs are still inefficient but at least take up less disk space!
Reminder: the Sydney BI Social group is TOMORROW!! http://lnkd.in/Gf8j9y – the topic: “When Projects Go Bad” – 29 attending, 21 slots left – for a chance to learn from others (and get some Pizza from our sponsors Citi Recruitment). We have Mikkel Kristiansen of Crossjoin.net and Paul Fuller providing some of their stories – both experienced professionals who will have a lot to share!
Look forward to seeing you there tomorrow
One theme that constantly pops up in the BI / Analytics / Big Data world is why – given we have all these amazing tools and models, etc. – is the adoption of Analytics so low? From a Microsoft perspective, Data Mining has been baked into SQL Server since 2005 – and due to negligible uptake has hardly changed since. Now I know from my colleagues in Analytics – and the fact that R continues to grow at a great rate – that it’s not a dead field. Far from it. But it’s not quite at the front of everyone’s minds either.
I think the challenges are human rather than technical. Understanding Analytics often means pushing the mind to the limits of what our poor grey lumps of brain were designed to do. We are rigged to make snap decisions with limited information to aid our survival, not contemplate the likelihood of that wolf being hungry through careful modelling, deep thought and … ouch, why is there a wolf biting my leg?
A great example of this showed up in my Facebook feed recently:
Source: these guys, who I totally don’t endorse as they might be hippies
OH MY GOD POUR ALL THE SODA DOWN THE SINK!!!
Well, er – let’s not rush. As with all internet circulated health information, the facts are dubiously presented with no link to source. So first of all, let’s remedy that – this is the study in question:
Cancer Epidemiology, Biomarkers & Prevention, February 2010
Hurrah for open access journals. Reading through the study, the kernel of truth is there – it found a statistically valid effect indicating that drinking more than 2 sodas a week increased the relative risk of pancreatic cancer by 85%. I’m not going to scoff at that, 85% is a big uptick in risk. But that is Relative Risk, and this is where the above image is misleading.
At face value I would take the 85% figure to mean that if I drink 2 or more cans of soda a week, I have an 85% chance of getting pancreatic cancer, i.e. the Absolute Risk. If this were the case I would ban soda from my house immediately.
However, dig into the maths and, for the population in the study, the actual Absolute Risk of developing pancreatic cancer if you drink no soda is about 1/4500. This makes it a pretty unusual cause of death compared to the big killers like Diabetes, which is a more likely consequence of drinking excess soda. For those in the study who did drink more than 2 sodas a week, the risk jumped to 1/2500, which is still pretty remote. It also makes for a lousy headline. Much better to say the risk has increased by 85% without stating that the number refers to Relative Risk and that the Absolute Risk is small. Not to mention that the study admits that its findings are far from conclusive.
So let’s revisit our risk types
Absolute Risk and Relative Risk are two very different things.
Absolute Risk is the chance of something happening to you if all other factors are equal. So for example, crossing a city street with your eyes closed may have an Absolute Risk of 10% in terms of being hit by a vehicle.
Relative Risk is the ratio between two Absolute Risks when conditions alter. If it’s a highway, that risk of being hit by a car may jump to 70%. So the Relative Risk of crossing a highway instead of a city street is 7: the risk is 700% of what it was, or 600% higher. It doesn’t mean you have a 700% chance of getting hit by a vehicle, because a 700% chance of something happening makes no sense.
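To make the difference concrete, here is a small Python sketch that runs the arithmetic for both the street example and the rough soda figures quoted in this post. The function name is my own, and the soda numbers are the back-of-envelope estimates from this post rather than values lifted from the study’s tables, so treat the results as illustrative only.

```python
from fractions import Fraction


def relative_risk(baseline, exposed):
    """Ratio of the exposed group's absolute risk to the baseline group's."""
    return exposed / baseline


# Street vs highway, using the made-up percentages from the example.
street = Fraction(10, 100)
highway = Fraction(70, 100)
street_rr = relative_risk(street, highway)
print(f"Highway vs street: relative risk {float(street_rr):.1f}")

# Soda figures as roughly estimated in this post (not the study's own tables).
no_soda = Fraction(1, 4500)
soda = Fraction(1, 2500)
soda_rr = relative_risk(no_soda, soda)

# The absolute difference, scaled to a population of 100,000 people.
extra_cases = (soda - no_soda) * 100_000
print(f"Soda vs no soda: relative risk {float(soda_rr):.2f}")
print(f"Extra cases per 100,000 people: {float(extra_cases):.1f}")
```

With these rough inputs the soda ratio comes out at 1.8, in the same ballpark as the 85% headline figure, while the absolute difference is fewer than 20 extra cases per 100,000 people.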
What does this have to do with how our brains are wired for Analytics?
It explains why the above image is simultaneously accurate and misleading. The snap decision we make is Soda – Cancer – Big Risk number – Soda Bad. The deeper analysis takes a bit longer, by which point most of us have lost interest.
Analytics is hard to embed in the human way of working because it doesn’t appeal to our way of thinking, and it takes work to understand. So the message from here is: if you are in Analytics and not being successful, it may not be because your models aren’t brilliant (I’m sure they are) – but because you cannot communicate how they work – and their value – in a way most people’s grey lumpy bits can grasp.
Disclaimer: I may have got some of the maths a bit wrong, particularly around the Absolute Risk of getting Pancreatic cancer, as I only spent 5 minutes trying to work it all out. This post does not constitute medical advice. If you take medical advice from Facebook, Twitter, Blogs or any other form of social media that has never been to Medical School, see a Doctor.