Sydney BI Social group is TOMORROW – Wednesday 19th June

Reminder: the Sydney BI Social group is TOMORROW!! http://lnkd.in/Gf8j9y – the topic: “When Projects Go Bad” – 29 attending, 21 slots left – for a chance to learn from others (and get some pizza from our sponsors Citi Recruitment). We have Mikkel Kristiansen of Crossjoin.net and Paul Fuller providing some of their stories – both experienced professionals who will have a lot to share!

Look forward to seeing you there tomorrow :)

Statistics is hard #479: Absolute and Relative Risk

One theme that constantly pops up in the BI / Analytics / Big Data world is why – given we have all these amazing tools and models, etc. – the adoption of Analytics is so low. From a Microsoft perspective, Data Mining has been baked into SQL Server since 2005 – and due to negligible uptake has hardly changed since. Now I know from my colleagues in Analytics – and the fact that R continues to grow at a great rate – that it’s not a dead field. Far from it. But it’s not quite at the front of everyone’s minds either.

I think the challenges are human rather than technical. Understanding Analytics often means pushing the mind to the limits of what our poor grey lumps of brain were designed to do. We are rigged to make snap decisions with limited information to aid our survival, not contemplate the likelihood of that wolf being hungry through careful modelling, deep thought and … ouch, why is there a wolf biting my leg?

A great example of this showed up in my Facebook feed recently:

Relative Risk – Soda will not kill you, read below for details

Source: these guys, who I totally don’t endorse as they might be hippies

OH MY GOD POUR ALL THE SODA DOWN THE SINK!!!

Well, er – let’s not rush. As with all internet circulated health information, the facts are dubiously presented with no link to source. So first of all, let’s remedy that – this is the study in question:

Soft Drink and Juice Consumption and Risk of Pancreatic Cancer: The Singapore Chinese Health Study

Cancer Epidemiology, Biomarkers & Prevention, February 2010

Hurrah for open access journals. Reading through the study, the kernel of truth is there: it found a statistically significant effect, with those consuming more than 2 sodas a week showing an 85% increase in the relative risk of pancreatic cancer. I’m not going to scoff at that, 85% is a big uptick in risk. But it is Relative Risk, and this is where the above image is misleading.

At face value I would take the 85% figure to mean that if I drink 2 or more cans of soda a week, I have an 85% chance of getting pancreatic cancer, i.e. the Absolute Risk. If this were the case I would ban soda from my house immediately.

However, dig into the maths and for the study population the actual Absolute Risk of developing pancreatic cancer if you drink no soda is about 1/4500. This makes it a pretty unusual cause of death compared to the big killers like diabetes, which is a more likely consequence of drinking excess soda. For the population studied who did drink more than 2 sodas a week, the risk jumped to about 1/2500, which is still pretty remote. It also makes for a lousy headline. Much better to say the risk has increased by 85% without stating that the number refers to Relative Risk and that the Absolute Risk is small. Not to mention that the study admits that its findings are far from conclusive.

So let’s revisit our risk types

Absolute Risk and Relative Risk are two very different things.

Absolute Risk is the chance of something happening to you if all other factors are equal. So for example, crossing a city street with your eyes closed may have an Absolute Risk of 10% in terms of being hit by a vehicle.

Relative Risk compares Absolute Risks when conditions alter. If it’s a highway, that risk of being hit by a vehicle may jump to 70%. So the Relative Risk of crossing a highway instead of a city street is 7: the risk is seven times as high, or 700% of the city street figure (a 600% increase). It doesn’t mean you have a 700% chance of getting hit by a vehicle, because – well, a 700% chance of something happening makes no sense.
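To make the distinction concrete, here is a minimal Python sketch using the approximate figures quoted in this post (about 1/4500 without soda and about 1/2500 with it). The function names are purely illustrative, and the inputs are the post's rough estimates rather than values lifted from the study's tables.

```python
def relative_risk(baseline_risk, exposed_risk):
    """Ratio of the exposed group's absolute risk to the baseline group's."""
    return exposed_risk / baseline_risk


def percent_increase(baseline_risk, exposed_risk):
    """Relative increase in risk, expressed as a percentage of the baseline."""
    return (relative_risk(baseline_risk, exposed_risk) - 1) * 100


def absolute_increase(baseline_risk, exposed_risk):
    """Extra absolute risk carried by the exposed group."""
    return exposed_risk - baseline_risk


# Approximate absolute risks quoted in the post (assumed, not exact study values)
no_soda = 1 / 4500
soda = 1 / 2500

if __name__ == "__main__":
    print(f"Relative risk: {relative_risk(no_soda, soda):.2f}")
    print(f"Relative increase: {percent_increase(no_soda, soda):.0f}%")
    print(f"Absolute increase: {absolute_increase(no_soda, soda):.6f}")
```

With these rounded inputs the relative increase comes out at roughly 80%, in the same ballpark as the headline 85%, while the absolute increase is less than 0.02 percentage points. Same data, very different impression.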

What does this have to do with how our brains are wired for Analytics?

It explains why the above image is simultaneously accurate and misleading. The snap decision we make is Soda – Cancer – Big Risk number – Soda Bad. The deeper analysis takes a bit longer, by which point most of us have lost interest.

Analytics struggles to penetrate the human way of working because it doesn’t appeal to our way of thinking, and it takes work to understand. So the message from here is if you are in Analytics and not being successful, it may not be because your models aren’t brilliant (I’m sure they are) – but because you cannot communicate how they work – and their value – in a way most people’s grey lumpy bits can grasp.

 

Disclaimer: I may have got some of the maths a bit wrong, particularly around the Absolute Risk of getting Pancreatic cancer, as I only spent 5 minutes trying to work it all out. This post does not constitute medical advice. If you take medical advice from Facebook, Twitter, Blogs or any other form of social media that has never been to Medical School, see a Doctor.

Sydney BI Social Group (SYBIS) – Wednesday May 15th

I’m pleased to say that we now have a second session of SYBIS scheduled, with the topic of “Agile BI”. As a reminder, the group is focused more on networking than on technical knowledge sharing, so if you like to talk BI (among other things) – it’s a great opportunity to meet some new faces and reconnect with others.

I’ll be there at the next event – from 5.30pm, Wednesday May 15th at the City Hotel (349 Kent Street) - and this time listening to Iman Eftekhari talk about Implementing BI solutions using Agile/SCRUM methodology. Some expert panel members will help liven up the discussion, names TBC.

More details can be found in this LinkedIn group and please RSVP using this MeetUp - 10 attendees already confirmed so don’t be shy!

Sydney BI Social Group (SYBIS)

The first SYBIS (Sydney BI Social Group) took place a couple of weeks ago now and we had a great session with a good turnout, and a lively panel featuring Viktor Isakov and myself speaking about Big Data.

Please join the LinkedIn group and MeetUp and keep an eye out for the next session in May – date to be announced soon!

 

Sydney BI Social Group (SYBIS) – April 3rd

Fellow Microsoft BI blogger Iman Eftekhari is setting up a BI Social Group in Sydney – called SYBIS. The group is focused more on networking than on technical knowledge sharing, so if you like to talk BI (among other things) – it’s a great opportunity to meet some new faces and reconnect with others.

I’ll be there at the first event – from 5.30pm, April 3rd at the Shark Hotel - and sharing my opinions on one of the two themes put forward – “Big Data & Microsoft”, or “Does tool choice really matter?” – winning topic to be decided on the night. Attendees are welcome to come along and disagree with me :)

More details can be found in this LinkedIn group and please RSVP using this MeetUp - 8 attendees already confirmed so don’t be shy!

Taking out the trash in HDInsight

One thing Hadoop doesn’t do that effectively (right now, anyway) is clean up after itself. Like most file systems it has a trash bin (see “Space Reclamation” in the HDFS Architecture guide) which is supposed to clean itself up after “a configurable amount of time” – which appears to be 360 minutes (6 hours) according to core-site.xml in the HDInsight default setup.

However, I’ve found this doesn’t always happen at the speed I’d like, and some processes (which ones, I haven’t yet confirmed) also leave stuff lying around in the /tmp folder, which has to be manually cleaned up. As long as there’s nothing running, it seems to be safe to delete whatever is stored in /tmp. However, don’t blame me if it all goes wrong for you :)
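If you want to check the trash setting on your own cluster rather than take the 360 minutes on trust, a small script can read it out of the config. The following is a rough Python sketch: the XML fragment is a cut-down, made-up stand-in for a real core-site.xml, and the function name is illustrative. The property name fs.trash.interval is the standard Hadoop setting, measured in minutes.

```python
import xml.etree.ElementTree as ET

# Illustrative fragment only: a cut-down stand-in for core-site.xml
SAMPLE_CORE_SITE = """<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.trash.interval</name>
    <value>360</value>
  </property>
</configuration>
"""


def trash_interval_minutes(core_site_xml):
    """Return fs.trash.interval in minutes, or None if the property is absent."""
    root = ET.fromstring(core_site_xml)
    for prop in root.iter("property"):
        if prop.findtext("name") == "fs.trash.interval":
            return int(prop.findtext("value"))
    return None


if __name__ == "__main__":
    minutes = trash_interval_minutes(SAMPLE_CORE_SITE)
    print(f"Trash is kept for {minutes} minutes ({minutes / 60:.0f} hours)")
```

To use it for real you would pass in the contents of the actual core-site.xml from your cluster; where that file lives depends on your installation.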

HDFS Commands to help free up space

So there are a few things you can do to get out of this. The first is to avoid using trash disk space at all by adding a -skipTrash option to your deletes:

hadoop fs -rmr -skipTrash /user/hadoop/data

This avoids the problem of using the Trash altogether. Of course, this also means you avoid being able to retrieve stuff from the Trash bin, so use wisely.

The next thing you can do is reach for the expunge command, which forces an empty of the Trash:

hadoop fs -expunge

However this didn’t always seem to work for me, so it’s worth checking it has had the desired effect.

HDFS Commands to find what is using disk space

Sometimes the key thing is to find out where that disk space is being eaten up. Say hello to du (disk usage), used here in its summary form, dus:

hadoop fs -dus /

Which will then give you the size of that data on your datanodes. Then dig deeper with ls:

hadoop fs -ls /

Which gives you the directories in root. Use du to size them, find unexpected space, and delete using rm or rmr as required.
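Eyeballing du output gets tedious once there are more than a handful of directories, so it can help to sort it. Below is a hedged Python sketch that ranks directories by size. It assumes each data line starts with a byte count and ends with the path, which may not hold for every Hadoop version, and the sample output is invented for illustration.

```python
def parse_du_output(text):
    """Turn `hadoop fs -du` style output into (size_in_bytes, path) pairs.

    Assumes each data line starts with a byte count and ends with the path.
    Header lines such as "Found 3 items" are skipped, and paths containing
    spaces are not handled.
    """
    entries = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[0].isdigit():
            entries.append((int(parts[0]), parts[-1]))
    return entries


def largest_first(entries):
    """Sort entries so the biggest space consumers come first."""
    return sorted(entries, key=lambda entry: entry[0], reverse=True)


# Hypothetical output, made up for illustration
SAMPLE_DU = """Found 3 items
1024        hdfs://namenode/example
52428800    hdfs://namenode/tmp
2048        hdfs://namenode/user
"""

if __name__ == "__main__":
    for size, path in largest_first(parse_du_output(SAMPLE_DU)):
        print(f"{size / (1024 * 1024):10.2f} MB  {path}")
```

In practice you would capture the output of the du command to a file or pipe and feed that text to parse_du_output instead of the sample string.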

The full file system shell commands are listed here

Do you know what motivates you?

Do you know what motivates you at work? Is it the glory, the cash, the dramatic road warrior lifestyle? Or do you blindly “do stuff” and enjoy some of it, and other bits not so much?

Well, occasionally in the mass of Management reading I do, I come across something that helps me realise how I operate and improves how I perform through higher self-awareness. I recently read “Drive” by Daniel Pink, and suggest you do too – as it will help you get to grips with how you are motivated at work.

Welcome to Motivation 3.0

Central to the book is the theory of Motivation 3.0. To understand how we got there, we need to know about 1.0 & 2.0. Motivation 1.0 was pretty simple – eat, find shelter, or die. Good cavemen grade stuff. Moving to 2.0 we enter the industrial age where performance is rewarded and disobedience punished.

Daniel Pink’s theory is that we have now moved to 3.0, as 2.0 only works for jobs with a fixed path to completion and no room for creativity, such as data entry or widget making. Work is increasingly creative – BI is definitely short on routine, easily defined work – and he proposes that you cannot give rewards for being creative because that turns creativity into work, which then demotivates you from being creative… a bit of a fatal blow in the modern workplace.

So Motivation 3.0 gives the worker the inner drive to solve creative problems through 3 things:

  • Mastery – striving to be a master of your trade
  • Autonomy – freedom to pursue your own path to your objectives
  • Purpose – being part of something bigger than making money

These all lead to the employer having to have faith in employees to do the right thing and work for the goals of the company without the traditional constraints of Motivation 2.0 – i.e. punishment and reward. It ultimately leads to the Results-Only Work Environment – where hours are less important than what you deliver in the time you spend. Imagine a world without the 9-5 obligation, where half your day is wasted because you just aren’t in the zone (or “in Flow” as it is referred to by some researchers) and you may as well have been at the beach?

It’s a short and interesting read, backed up with research, examples and stories that will prove thought-provoking, and may change the way you go about your job.

Update 15/01/2013 – thanks to one of my colleagues, here’s a great TED Talk from the author on some of the key themes:

Reference Environment Variables in C# Mappers for HDInsight

Within your Mappers and Reducers there may be a need to reference the environment variables being fed to the task, such as the file name. Understanding how to do so took a little digging on my part, with a little help from Matt Winkler in the HDInsight MSDN forum.

Using this snippet of code:

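// Adding these references at the start of the code
using System;
using System.Collections; // needed for DictionaryEntry

class EnvDumpMapper // class name is arbitrary
{
    static void Main()
    {
        // Write every environment variable to the mapper output as key|value
        foreach (DictionaryEntry entry in Environment.GetEnvironmentVariables())
            Console.WriteLine("{0}|{1}", entry.Key, entry.Value);

        // Some junk code so the mapper doesn't fail: read and discard the input
        string line; // Variable to hold current line
        while ((line = Console.ReadLine()) != null) { /* do nothing */ }
    }
}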

 

It was possible to output all the Environment Variables as the Mapper output and work out their format from the resultant text file it created.

Then, to reference individual Environment Variables in the Mapper, you can simply use variations on:

 

string FileName = System.Environment.GetEnvironmentVariable("map_input_file");

string FileChunk = System.Environment.GetEnvironmentVariable("map_input_start");
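The variable names look odd until you spot the pattern: as far as I understand Hadoop Streaming, job configuration keys such as map.input.file are exposed to the task with every non-alphanumeric character swapped for an underscore. Here is a small Python sketch of that mapping, in Python rather than C# purely for brevity. The helper names are hypothetical, and the exact keys available will vary by Hadoop version.

```python
import os
import re


def conf_key_to_env_name(conf_key):
    """Map a job configuration key to the environment variable name a
    streaming task sees: non-alphanumeric characters become underscores."""
    return re.sub(r"[^0-9A-Za-z]", "_", conf_key)


def get_job_setting(conf_key, default=None):
    """Look up a job configuration value from the task's environment."""
    return os.environ.get(conf_key_to_env_name(conf_key), default)


if __name__ == "__main__":
    # Outside a running task these are not set, so the default is printed.
    print(get_job_setting("map.input.file", default="(not set)"))
    print(get_job_setting("map.input.start", default="(not set)"))
```

The same trick works in reverse: if you know the configuration key you want, you can guess the environment variable name before resorting to dumping the whole lot.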

Extract data from Hive using SSIS

Now that the Hive ODBC driver exists, the next thing to do is use SSIS to extract data from Hive into a SQL instance for… well, I’m sure we’ll find a reason for it.

Setting up the DSN

The first thing to do is set up a System DSN (Data Source Name) to reference in the ODBC connection. For SSIS, that means we need a 32 bit driver to reference, which means we need to find the 32 Bit ODBC Data Source Administrator. If you’re on a 32 Bit OS, just go to the Control Panel and search for it. If you are on a 64 Bit OS like me, you need to hunt it out. On Windows 7, it can be found at “C:\Windows\SysWOW64\odbcad32.exe”. Note you need to run as Administrator to make changes.

Go to the System DSN:

Fig 1: ODBC Data Source Administrator


Click “Add…”

Fig 2: ODBC Data Source Administrator


Scroll down the list until you find the “HIVE” driver, then click “Finish”, which brings up the ODBC Hive Setup dialog:

Fig 3: ODBC Data Source Administrator


Give your DSN a sensible name and description. For your host enter the cluster URL (without http://) – i.e. “[your cluster name].cloudapp.net”. Leave the port as 10000. Under Authentication select “Username/Password” and enter your username. Then click “OK” and we are ready to move on.

Connect in SSIS

To hook this into SSIS we need to create a Data Flow and add an ADO.NET Connection Manager. Not – as I initially thought – an ODBC Connection Manager.

For the Provider, select the “Odbc Data Provider” option under “.Net Providers”.

Fig 4: ADO.NET Connection Manager


Once that’s done you can choose the Data Source Name you just created using the dropdown under “Data source specification”. Add your username and password to complete setup, then click “OK”.

Fig 5: ADO.NET Connection Manager


Now the Connection Manager is set up, you can use it in a Data Flow. Add an ADO.NET Data Source, and select your Connection Manager. Then you can – as per a normal database connection – select tables or write queries. In this example I’ve just chosen the HiveSampleTable that comes with every cluster.

Fig 6: ADO.NET Source


Then we route the data somewhere, in this case just pumping it to a Row Count.

Fig 7: ADO.NET Data Flow


I’ve put on a Data Viewer just to show it works.

Fig 8: ADO.NET Data Flow


And there we have it. Data from Hive on a Hadoop on Azure cluster via SSIS.

 

Download data from a Hadoop on Azure cluster

So you’ve run a job on Hadoop on Azure, and now you want that data somewhere more useful, like in your Data Warehouse for some analytics. If the Hive ODBC Driver isn’t an option (perhaps because you used Pig), then FTP is the way – there isn’t a JavaScript console fs.get() command available.

As described in my Upload data post, you need to use curl, and the command syntax is:

curl -k ftps://[cluster user name]:[password md5 hash]@[cluster name].cloudapp.net:2226/[path to data or specific file on HDFS] -o [local path name on your machine]
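The fiddly part of that command is the MD5 hash of the password, so here is a hedged Python sketch that assembles the whole thing. The function names and example values are placeholders, and two details are assumptions rather than anything documented here: that the password is hashed as UTF-8 text, and that the HDFS path goes into the URL without its leading slash. Bear in mind the update below, which says this route has since been disabled.

```python
import hashlib


def md5_hex(password):
    """Hex MD5 digest of the password (UTF-8 encoding is an assumption)."""
    return hashlib.md5(password.encode("utf-8")).hexdigest()


def build_download_command(user, password, cluster, hdfs_path, local_path):
    """Assemble the curl argument list for the FTPS download shown above."""
    url = (
        f"ftps://{user}:{md5_hex(password)}"
        f"@{cluster}.cloudapp.net:2226/{hdfs_path.lstrip('/')}"
    )
    return ["curl", "-k", url, "-o", local_path]


if __name__ == "__main__":
    # All values below are placeholders, not real credentials or paths.
    command = build_download_command(
        user="exampleuser",
        password="example-password",
        cluster="examplecluster",
        hdfs_path="/example/output/part-00000",
        local_path="part-00000.txt",
    )
    print(" ".join(command))
```

Returning a list rather than one long string means the result can be handed straight to something like subprocess.run without worrying about shell quoting. User names containing special characters would still need URL-encoding, which this sketch does not attempt.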

Happy downloading!

UPDATE: This functionality has now been disabled in HDInsight, see this thread from the MSDN Forum.
