Statistics is hard #479: Absolute and Relative Risk
One theme that constantly pops up in the BI / Analytics / Big Data world is why – given we have all these amazing tools and models, etc. – is the adoption of Analytics so low? From a Microsoft perspective, Data Mining was baked into SQL Server since 2005 – and due to negligible uptake has hardly changed since. Now I know from my colleagues in Analytics – and the fact that R continues to grow at a great rate – that it’s not a dead field. Far from it. But it’s not quite at the front of everyone’s minds either.
I think the challenges are human rather than technical. Understanding Analytics often means pushing the mind to the limits of what our poor grey lumps of brain were designed to do. We are rigged to make snap decisions with limited information to aid our survival, not contemplate the likelihood of that wolf being hungry through careful modelling deep thought and … ouch, why is there a wolf biting my leg?
A great example of this is showed up in my Facebook feed recently:
Source: these guys, who I totally don’t endorse as they might be hippies
OH MY GOD POUR ALL THE SODA DOWN THE SINK!!!
Well, er – lets not rush. As with all internet circulated health information, the facts are dubiously presented with no link to source. So first of all, let’s remedy that – this is the study in question:
Soft Drink and Juice Consumption and Risk of Pancreatic Cancer: The Singapore Chinese Health Study
Cancer Epidemiology, Biomarkers & Prevention, February 2010
Hurrah for open access journals. Reading through the study, the kernel of truth is there – a statistically valid effect found that indicated that those with a soda consumption of greater than 2 a week increased the relative risk of cancer by 85%. I’m not going to scoff at that, 85% is a big uptick in risk. Relative Risk - and this is where the above image is misleading.
At face value I would take the 85% figure to mean that if I drink 2 or more cans of soda a week, I have an 85% chance of getting pancreatic cancer, i.e. the Absolute Risk. If this was the case I would ban soda from my house immediately.
However dig into the maths and for the population study group the actual Absolute Risk of developing Pancreatic cancer if you drink no soda is about 1/4500. This makes it a pretty unusual cause of death compared to the big killers like Diabetes, which is a more likely consequence of drinking excess soda. For the population studied who did drink more than 2 sodas a week, the risk jumped to 1/2500. Which is still pretty remote. It also makes for a lousy headline. Much better to say the risk has increased by 85% without stating that the number refers to Relative risk and the Absolute risk is small. Not to mention that the study admits that its findings are far from conclusive.
So let’s revisit our risk types
Absolute Risk and Relative Risk are two very different things.
Absolute Risk is the chance of something happening to you if all other factors are equal. So for example, crossing a city street with your eyes closed may have a Absolute Risk of 10% in terms of being hit by a vehicle.
Relative risk is the adjustment to Absolute Risk when conditions alter. If it’s a highway, that risk of being hit by a car may jump to 70%. So the Relative Risk of crossing a highway instead of a city street is 700% higher. It doesn’t mean you have a 700% chance of getting hit by a vehicle, because – well, that makes no sense to have a 700% chance of something happening.
What does this have to do with how our brains are wired for Analytics?
It explains why the above image is simultaneously accurate and misleading. The snap decision we make is Soda – Cancer – Big Risk number – Soda Bad. The deeper analysis took a bit longer, and by which point most of us have lost interest.
Analytics is hard to get penetrated in the human way of working because it doesn’t appeal to our way of thinking, and it takes work to understand. So the message from here is if you are in Analytics and not being successful, it may not be because your models aren’t brilliant (I’m sure they are) – but because you cannot communicate how they work – and their value – in a way most peoples grey lumpy bits can grasp.
Disclaimer: I may have got some of the maths a bit wrong, particularly around the Absolute Risk of getting Pancreatic cancer, as I only spent 5 minutes trying to work it all out. This post does not constitute medical advice. If you take medical advice from Facebook, Twitter, Blogs or any other form of social media that has never been to Medical School, see a Doctor.
Taking out the trash in HDInsight
One thing Hadoop doesn’t do that effectively (right now, anyway) is clean up after itself. Like most file systems it has a trash bin (see “Space Reclamation” in the HDFS Architecture guide) which is supposed to clean itself up after “a configurable amount of time” – which appears to be 360 minutes (6 hours) according to core-site.xml in the HDInsight default setup.
However I’ve found this is doesn’t always happen at the speed I’d like, and also some processes (which ones, I haven’t yet confirmed) also leave stuff lying around in the /tmp folder, which has to be manually cleaned up – as long as there’s nothing running it seems to be safe to kill whatever is stored in /tmp. However, don’t blame me if it all goes wrong for you
HDFS Commands to help free up space
So there’s a few things you can do to get out of this. First, is avoiding trash disk space usage by adding a -skipTrash option to your deletes:
hadoop fs -rmr -skipTrash /user/hadoop/data
This avoids the problem of using the Trash altogether. Of course, this also means you avoid being able to retrieve stuff from the Trash bin, so use wisely.
The next thing you can do is reach for the expunge command, which forces an empty of the Trash:
hadoop fs -expunge
However this didn’t always seem to work for me, so it’s worth checking it has had the desired effect.
HDFS Commands to find what is using disk space
Sometimes the key thing is to find out where that disk space is being eaten up. Say hello to du (disk usage)
hadoop fs -dus /
Which will then give you the size of that data on your datanodes. Then dig deeper with ls:
hadoop fs -ls /
Which gives you the directories in root. Use du to size them, find unexpected space, and delete using rm or rmr as required.
The full file system shell commands are listed here
Extract data from Hive using SSIS
So now the Hive ODBC driver exists, the next thing to do is use SSIS to extract data from Hive into a SQL instance for… well, I’m sure we’ll find a reason for it.
Setting up the DSN
The first thing to do is set up a System DSN (Data Source Name) to reference in the ODBC connection. For SSIS, that means we need a 32 bit driver to reference, which means we need to find the 32 Bit ODBC Data Source Administrator. If you’re on a 32 Bit OS, just go to the Control Panel and search for it. If you are on a 64 Bit OS like me, you need to hunt it out. On Windows 7, it can be found at “C:\Windows\SysWOW64\odbcad32.exe”. Note you need to run as Administrator to make changes.
Go to the System DSN:
Click “Add…”
Scroll down the list until you find the “HIVE” driver, then click “Finish”, which brings up the ODBC Hive Setup dialog:
Give your DSN a sensible name and description. For your host enter the cluster URL (without http://) – i.e. “[your cluster name].cloudapp.net”. Leave the port as 10000. Under Authentication select “Username/Password” and enter your username. Then click “OK” and we are ready to move on.
Connect in SSIS
To hook this into SSIS we need to create a Data Flow and add an ADO.NET Connection Manager. Not – as I initially thought – an ODBC Connection Manager.
Under the Provider, select under “.Net Providers” the “Odbc Data Provider” option.
Once that’s done you can choose your just created Data Source Name using the dropdown under “Data source specification”. Add your username and password to complete setup, then click “OK”.
Now the Connection Manager is set up, you can use it in a Data Flow. Add a ADO.NET Data Source, and select your Connection Manager. Then you can – as per a normal database connection – select tables or write queries. In this example I’ve just chosen the HiveSampleTable that comes with every cluster.
Then we route the data somewhere, in this case just pumping it to a Row Count.
I’ve put on a Data Viewer just to show it works.
And there we have it. Data from Hive on a Hadoop on Azure cluster via SSIS.
Download data from a Hadoop on Azure cluster
So you’ve run a job on Hadoop on Azure, and now you want that data somewhere more useful, like in your Data Warehouse for some analytics. If the Hive ODBC Driver isn’t an option (perhaps because you used Pig), then FTP is the way – there isn’t a Javascript console fs.get() command available.
As described in my Upload data post, you need to use curl, and the command syntax is:
curl -k ftps://[cluster user name]:[password md5 hash]@[cluster name].cloudapp.net:2226/[path to data or specific file on HDFS] -o [local path name on your machine]
Happy downloading!
UPDATE: This functionality has now been disabled in HDInsight, see this thread from the MSDN Forum.
No Piggybank for Pig on Hadoop on Azure
A quick note – Pig functions from the piggybank are not available in Hadoop on Azure.
I found this out as I was trying to run some things through Pig, trying to manage some Excel CSV files that had fields with line feeds in them. I discovered there was a Pig load/store function CSVExcelStorage that would handle them, but when I tried to use it… ah. Not there. Turns out it was a piggybank function, which are a set of user contributed functions that you have to include in your pig build. The source code is freely available (being open source and all) but I haven’t worked out how in an HOA environment you can build them and use them.
I can understand why Microsoft have opted not to include these – it’s not part of the core build, it’s user contributed, etc. – things you want to avoid if doing a massively reproducible on demand cloud environment. If I can work out how to include them, I’ll provide a followup post.
Save your RDP connection to Hadoop on Azure
This is probably going to appear to be brain dead to some readers, but I have been frustrated by not being able to configure the RDP connection to my Hadoop on Azure account. Fooled by the slick Metro UI, I had wrongly assumed that the only option was to click on the “Remote Desktop” button to get access, as per the lovely menu below:
However it was pointed out to me today that you can right click, save as…. and then you have your RDP connection file to configure to share local resources, etc. Doh.
Handy tip. Boy, do I feel silly….
Using Azure Blob Storage as a Data Source for Hadoop on Azure
One of the things I’ve learned from the Microsoft team behind Hadoop on Azure is that the Hadoop clusters’ short lifespan is in fact intentional – the clusters are intended to be disposable and exist for the lifetime of the analysis only.
So what happens if you want your raw data to live up in the cloud for longer? The answer is Azure Blog Storage. This gives up to 100TB of storage per account so should be adequate for most cases. Hadoop on Azure can reference Azure Blob Storage (or Amazon S3 blob storage, if you want a non-Microsoft solution) directly as a location for input data.
Firstly, you need to have an Azure account, and set up storage, instructions for which can be found here. Then you need to upload some data to it, which can most easily be done using a nice tool called Cloudberry Explorer, which operates pretty much like an FTP tool. Then you need to configure your Hadoop on Azure instance to point at your Azure Blob Storage. Then as per this guide you can then point your jobs at the Azure Blob Storage using the asv:// notation, like in the example below:
Hadoop jar hadoop-streaming.jar -files “hdfs://10.NN.NN.NN:9000/example/apps/mappertwo.exe,hdfs://10.NN.NN.NN:9000/example/apps/reducertwo.exe” -mapper “mappertwo.exe” -reducer “reducertwo.exe” -input “asv://hadoop-test/” -output “/example/data/StreamingOutput/abtj”
This of course is slightly confused when setting up your job the parameters can be marked as asv://, like below:
However I couldn’t work out how to mark something as an input parameter as well as have it as ASV on the dropdown, so I left as plain text and entered the command as:
-input “asv://hadoop-test/”
And it all worked – I did find that it didn’t handle direct file references – it only would accept storage references at the folder level.
So there you go – using Azure Blob Storage as a data source for Hadoop on Azure. Nice and easy.
Getting data into Hive using Hadoop on Azure
Once you’ve managed to run a MapReduce job in Hadoop on Azure, the output needs to get somewhere friendlier than a dumped text file on HDFS. Hadoop on Azure comes equipped with Hive, a database like means of accessing your data (which you can then connect to using the Hive ODBC driver).
There are two mechanisms for doing this – first, create a Hive table that points directly at a directory, or second, load the output into a Hive table. I’ll tackle both mechanisms in this post.
1: Point your Hive table directly at the data
In the Interactive Console, choose Hive to be presented with this rather ugly screen:
Here we can create an “External Table”, i.e. a Hive table that points at a file, by entering the table DDL into the lowermost text box, and hitting “Evaluate”, which is a bit of a misnomer as it also executes if the DDL is valid. An example is below:
CREATE EXTERNAL TABLE output_file_table (
clientid STRING,
evtdata STRING,
amount STRING
)
COMMENT “This is my output”
ROW FORMAT DELIMITED FIELDS TERMINATED by “32″
STORED AS TEXTFILE
LOCATION “/example/data/StreamingOutput/mytj/”;
See here for the full Hive DDL reference.
This allows us to run Hive queries directly against the text file, such as:
SELECT * FROM output_file_table LIMIT 10
Which will show us the first ten rows of data. Alternatively, we can perform SQL like operations to aggregate:
SELECT clientid, SUM(amount)
FROM output_file_table
GROUP BY clientid
It’s interesting to note that this kicks off a set of MapReduce jobs to parse and understand the data.
2: Load your data into a Hive table
The other option is to create an empty structure and load data into it. So as before, we create a table but this time its a stand alone entity:
CREATE TABLE output_table (
clientid STRING,
evtdate STRING,
amount STRING
)
COMMENT “This is my output in a table”
STORED AS TEXTFILE
Then load it using the LOAD DATA command:
LOAD DATA INPATH ‘/example/data/StreamingOutput/mytj/part-00000′
OVERWRITE INTO TABLE output_table
Interestingly, this deletes the source file… and also adds some nulls at the tail end of the amount column. I also couldn’t load the table using the same data types as in the first example – unless I had them as STRING it just loaded as nulls.
Summary
Here I’ve displayed two paths to getting your output from MapReduce jobs into Hadoop on Azure. I need to do some more research into the subtleties but hopefully you can understand the options at a high level.
FTP to Hadoop on Azure with Filezilla – doesn’t work :(
As the title says, FTP to Hadoop on Azure with Filezilla – doesn’t work – which is possibly due to an FTP server configuration issue / Filezilla version compatibility problem called out here: http://trac.filezilla-project.org/ticket/7873. The solution proposed requires a FTP server config change which can’t be done on the Hadoop on Azure clusters as the user doesn’t have Administrative permissions.
However for reference should this get fixed, these are the settings you need for Filezilla in Site Manager:
- Host: yourclustername.cloudapp.net
- Port: 2226 (or 2227, 2228 now)
- Protocol: FTP
- Encryption: Require Implicit FTP over TLS
- Logon Type: Normal
- User: yourusername
- Password: md5 hash of password (See step 11 here)
Also under Transfer Settings, opt for Passive transfer mode.
If anyone out there can get this working, please let me know so I can share the solution.
UPDATE: This functionality has now been disabled in HDInsight, see this thread from the MSDN Forum.
Upload data to a Hadoop on Azure cluster
Once you have a shiny Hadoop on Azure cluster, getting some data on it so you can get your MapReduce on is the next challenge. There are two options:
- Via the Javascript console
- Via FTP
Javascript Console
This has the advantage of being the simplest approach, as all you need to do is a command line input and then pick your file. The downside is that its slow, unstable and not really suited for uploading large numbers or volumes of data.
However, so you know how for little test files, here’s what to do. First, click on the interactive console:
Then at the command line type “fs.put()”
This will launch the file upload dialog – browse for the file on your local machine, enter the destination path on your Cluster and click upload. The file should upload… though as I’ve mentioned, it gets shakier as the files get bigger. If it works the dialog will close and below the fs.put() you typed in, a “File uploaded” confirmation line will appear.
FTP
UPDATE: This functionality has now been disabled in HDInsight, see this thread from the MSDN Forum.
This is the more industrial approach. Most of the detail below is sourced from this TechNet article: How to FTP data to Hadoop-based services on Windows Azure. However as noted in a comment at the bottom of that article, the FTPS is also a bit unstable – sometimes it just doesn’t work. Presumably this is a CTP stability issue.
Follow that guides steps 1 through 11 to open your ports and get the MD5 hash of your password. After that it assumes you have a few things at your disposal so I’m filling in the gaps here.
For the life of me I could not get my personal favourite FTP client, FileZilla to connect properly, though I’m still poking at it, and if I get it to work I’ll put a post up explaining how. So I ended up using what the article suggested, which is cURL. Curl can be downloaded here. The version you want will be Windows SSL SSPI enabled – either 32 or 64 bit depending on your own OS. Download and unzip the package, and you’re ready to go – cURL is just a command line executable so there’s no install or GUI.
From the command line, navigate to the same folder as your cURL executable and put a script in a batch file that looks like this:
curl -k -T C:\MapReduce\Transactions.txt ftps://your_cluster_user_name:your_password_md5_hash@your_cluster_name.cloudapp.net:2226/example/data/transactions.txt
Replace “C:\MapReduce\Transactions.txt” with your source file and “/example/data/transactions.txt” with your target on your HDFS cluster. Of course also update your cluster user name, password md5 hash and cluster name as well.
The command switches -k & -T are required and explained here in the cURL command line switch documentation.
Run the batch file and watch the progress of your file transfer. You can validate the upload worked on the target from the Javascript interactive console as described in step 14 of the TechNet guide.














