Sydney BI Social Group (SYBIS) – Wednesday May 15th
I’m pleased to say that we now have a second session of SYBIS scheduled, with the topic of “Agile BI”. As a reminder, the group is focused more on networking than on technical knowledge sharing, so if you like to talk BI (among other things) – it’s a great opportunity to meet some new faces and reconnect with others.
I’ll be there at the next event – from 5.30pm, Wednesday May 15th at the City Hotel (349 Kent Street) - and this time listening to Iman Eftekhari talk about Implementing BI solutions using Agile/SCRUM methodology. Some expert panel members will help liven up the discussion, names TBC.
More details can be found in this LinkedIn group and please RSVP using this MeetUp - 10 attendees already confirmed so don’t be shy!
Sydney BI Social Group (SYBIS)
The first SYBIS (Sydney BI Social Group) took place a couple of weeks ago now and we had a great session with a good turnout, and a lively panel featuring Viktor Isakov and myself speaking about Big Data.
Please join the LinkedIn group and MeetUp and keep an eye out for the next session in May – date to be announced soon!
Sydney BI Social Group (SYBIS) – April 3rd
Fellow Microsoft BI blogger Iman Eftekhari is setting up a BI Social Group in Sydney – called SYBIS. The group is focused more on networking than on technical knowledge sharing, so if you like to talk BI (among other things) – it’s a great opportunity to meet some new faces and reconnect with others.
I’ll be there at the first event – from 5.30pm, April 3rd at the Shark Hotel - and sharing my opinions on one of the two themes put forward – “Big Data & Microsoft”, or “Does tool choice really matter?” – winning topic to be decided on the night. Attendees are welcome to come along and disagree with me.
More details can be found in this LinkedIn group and please RSVP using this MeetUp - 8 attendees already confirmed so don’t be shy!
Taking out the trash in HDInsight
One thing Hadoop doesn’t do that effectively (right now, anyway) is clean up after itself. Like most file systems it has a trash bin (see “Space Reclamation” in the HDFS Architecture guide) which is supposed to clean itself up after “a configurable amount of time” – which appears to be 360 minutes (6 hours) according to core-site.xml in the HDInsight default setup.
However, I’ve found this doesn’t always happen at the speed I’d like, and some processes (which ones, I haven’t yet confirmed) also leave stuff lying around in the /tmp folder, which has to be manually cleaned up. As long as there’s nothing running, it seems to be safe to kill whatever is stored in /tmp. Just don’t blame me if it all goes wrong for you.
HDFS Commands to help free up space
There are a few things you can do to get out of this. The first is to avoid using trash disk space at all, by adding the -skipTrash option to your deletes:
hadoop fs -rmr -skipTrash /user/hadoop/data
This avoids the problem of using the Trash altogether. Of course, this also means you avoid being able to retrieve stuff from the Trash bin, so use wisely.
The next thing you can do is reach for the expunge command, which forces the Trash to be emptied:
hadoop fs -expunge
However this didn’t always seem to work for me, so it’s worth checking it has had the desired effect.
HDFS Commands to find what is using disk space
Sometimes the key thing is to find out where that disk space is being eaten up. Say hello to du (disk usage):
hadoop fs -dus /
Which will then give you the size of that data on your datanodes. Then dig deeper with ls:
hadoop fs -ls /
Which gives you the directories in root. Use du to size them, find unexpected space, and delete using rm or rmr as required.
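To speed up the hunt, the du output can be sorted so the biggest entries float to the top. This is only a rough sketch: largest_first is a made-up helper name, and it assumes each data line of the du output starts with a size in bytes followed by a path. Column layout varies between Hadoop versions, so check what your cluster prints first.

```shell
# Hypothetical helper: sort "size path" lines from stdin, largest first.
# Lines that do not start with a number (for example a "Found N items"
# header) are dropped before sorting.
largest_first() {
    grep -E '^[0-9]+[[:space:]]' | sort -k1,1nr
}

# Usage on a cluster (not run here):
#   hadoop fs -du / | largest_first | head -n 10
```

Sorting on the first column only keeps the path text out of the comparison, so the order reflects size alone.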
The full file system shell commands are listed here
Do you know what motivates you?
Do you know what motivates you at work? Is it the glory, the cash, the dramatic road warrior lifestyle? Or do you blindly “do stuff” and enjoy some of it, and other bits not so much?
Well, occasionally in the mass of Management reading I do, I come across something that helps me realise how I operate and improves how I perform through higher self-awareness. I recently read “Drive” by Daniel Pink, and suggest you do too – as it will help you get to grips with how you are motivated at work.
Welcome to Motivation 3.0
Central to the book is the theory of Motivation 3.0. To understand how we got there, we need to know about 1.0 & 2.0. Motivation 1.0 was pretty simple – eat, find shelter, or die. Good cavemen grade stuff. Moving to 2.0 we enter the industrial age where performance is rewarded and disobedience punished.
Daniel Pink’s theory is that we have moved now to 3.0, as 2.0 only works for jobs with a fixed path to completion with no room for creativity, such as data entry or widget making. Work is increasingly creative – BI is definitely short on routine, easily defined work – and he proposes that you cannot give rewards for being creative because that makes creativity work, and then demotivates you to be creative…. a bit of a fatal blow in the modern workplace.
So Motivation 3.0 gives the worker the inner drive to solve creative problems through 3 things:
- Mastery – striving to be a master of your trade
- Autonomy – freedom to pursue your own path to your objectives
- Purpose – being part of something bigger than making money
These all lead to the employer having to have faith in employees to do the right thing and work for the goals of the company without the traditional constraints of Motivation 2.0 – i.e. punishment and reward. It ultimately drives to the Result Oriented Work Environment – where hours are less important than what you deliver in the time you spend. Imagine a world without the 9-5 obligation where half your day is wasted because you just aren’t in the zone (or “in Flow” as it is referred to by some researchers), and you may as well have been at the beach?
It’s a short and interesting read, backed up with research, examples and stories that will prove thought provoking, and may change the way you go about your job.
Update 15/01/2013 – thanks to one of my colleagues, here’s a great TED Talk from the author on some of the key themes:
Reference Environment Variables in C# Mappers for HDInsight
Within your Mappers and Reducers there may be a need to reference the environment variables being fed to the task, such as the file name. Understanding how to do so took a little digging on my part, with a little help from Matt Winkler in the HDInsight MSDN forum.
Using this snippet of code:
// Add these references at the start of the code
using System;
using System.Collections;
// Write every environment variable to the mapper output as key|value
foreach (DictionaryEntry entry in Environment.GetEnvironmentVariables())
    Console.WriteLine("{0}|{1}", entry.Key, entry.Value);
// Some junk code so the mapper doesn't fail: read and discard the input
string line; // Variable to hold current line
while ((line = Console.ReadLine()) != null)
{
    // do nothing
}
It was possible to output all the Environment Variables as the Mapper output and work out their format from the resultant text file it created.
Then, to reference individual Environment Variables in the Mapper, you can simply use variations on:
string FileName = System.Environment.GetEnvironmentVariable("map_input_file");
string FileChunk = System.Environment.GetEnvironmentVariable("map_input_start");
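Since these are plain environment variables, the same lookup should work from any mapper that reads lines from standard input, not only a C# one. As a rough sketch in shell, assuming the mapper is run the same way as the console mapper above: tag_lines_with_file is a made-up name, and the "unknown" fallback is my own addition for when the variable is not set.

```shell
# Hypothetical mapper body: prefix every input line with the name of the
# file it came from, taken from the map_input_file environment variable.
# Falls back to "unknown" when the variable is not set (an assumption).
tag_lines_with_file() {
    file="${map_input_file:-unknown}"
    while IFS= read -r line; do
        printf '%s\t%s\n' "$file" "$line"
    done
}

# In a real mapper script the function would simply be called at the end:
#   tag_lines_with_file
```

Tagging each record with its source file is handy when several input files feed one job and you need to tell the rows apart later.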
Extract data from Hive using SSIS
So now the Hive ODBC driver exists, the next thing to do is use SSIS to extract data from Hive into a SQL instance for… well, I’m sure we’ll find a reason for it.
Setting up the DSN
The first thing to do is set up a System DSN (Data Source Name) to reference in the ODBC connection. For SSIS, that means we need a 32 bit driver to reference, which means we need to find the 32 Bit ODBC Data Source Administrator. If you’re on a 32 Bit OS, just go to the Control Panel and search for it. If you are on a 64 Bit OS like me, you need to hunt it out. On Windows 7, it can be found at “C:\Windows\SysWOW64\odbcad32.exe”. Note you need to run as Administrator to make changes.
Go to the System DSN:
Click “Add…”
Scroll down the list until you find the “HIVE” driver, then click “Finish”, which brings up the ODBC Hive Setup dialog:
Give your DSN a sensible name and description. For your host enter the cluster URL (without http://) – i.e. “[your cluster name].cloudapp.net”. Leave the port as 10000. Under Authentication select “Username/Password” and enter your username. Then click “OK” and we are ready to move on.
Connect in SSIS
To hook this into SSIS we need to create a Data Flow and add an ADO.NET Connection Manager. Not – as I initially thought – an ODBC Connection Manager.
For the Provider, select the “Odbc Data Provider” option under “.Net Providers”.
Once that’s done you can choose your just created Data Source Name using the dropdown under “Data source specification”. Add your username and password to complete setup, then click “OK”.
Now the Connection Manager is set up, you can use it in a Data Flow. Add an ADO.NET Data Source, and select your Connection Manager. Then you can – as per a normal database connection – select tables or write queries. In this example I’ve just chosen the HiveSampleTable that comes with every cluster.
Then we route the data somewhere, in this case just pumping it to a Row Count.
I’ve put on a Data Viewer just to show it works.
And there we have it. Data from Hive on a Hadoop on Azure cluster via SSIS.
Download data from a Hadoop on Azure cluster
So you’ve run a job on Hadoop on Azure, and now you want that data somewhere more useful, like in your Data Warehouse for some analytics. If the Hive ODBC Driver isn’t an option (perhaps because you used Pig), then FTP is the way – there isn’t a JavaScript console fs.get() command available.
As described in my Upload data post, you need to use curl, and the command syntax is:
curl -k ftps://[cluster user name]:[password md5 hash]@[cluster name].cloudapp.net:2226/[path to data or specific file on HDFS] -o [local path name on your machine]
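The fiddly part is the password hash, so here is a small sketch that builds the URL for you. It rests on a few assumptions of mine: build_ftps_url is a made-up helper name, the hash is taken to be the lowercase hex MD5 of the password, md5sum is available on your machine, and the path argument is whatever goes after the slash following the port.

```shell
# Hypothetical helper: build the FTPS URL from the template above.
# Assumes the expected hash is the lowercase hex MD5 of the password.
build_ftps_url() {
    user=$1
    password=$2
    cluster=$3
    path=$4   # text placed after the slash that follows the port
    hash=$(printf '%s' "$password" | md5sum | cut -d' ' -f1)
    printf 'ftps://%s:%s@%s.cloudapp.net:2226/%s\n' "$user" "$hash" "$cluster" "$path"
}

# Usage (not run here; every argument value is a placeholder):
#   curl -k "$(build_ftps_url myuser 'mypassword' mycluster user/hadoop/output.txt)" -o ./output.txt
```

Using printf '%s' rather than echo keeps a trailing newline out of the hashed text, which would otherwise change the hash.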
Happy downloading!
UPDATE: This functionality has now been disabled in HDInsight, see this thread from the MSDN Forum.
No Piggybank for Pig on Hadoop on Azure
A quick note – Pig functions from the piggybank are not available in Hadoop on Azure.
I found this out as I was trying to run some things through Pig, trying to manage some Excel CSV files that had fields with line feeds in them. I discovered there was a Pig load/store function CSVExcelStorage that would handle them, but when I tried to use it… ah. Not there. Turns out it is a piggybank function, one of a set of user contributed functions that you have to include in your Pig build. The source code is freely available (being open source and all) but I haven’t worked out how to build and use them in an HOA environment.
I can understand why Microsoft have opted not to include these – they’re not part of the core build, they’re user contributed, etc. – things you want to avoid if you are running a massively reproducible, on-demand cloud environment. If I can work out how to include them, I’ll provide a followup post.
Save your RDP connection to Hadoop on Azure
This is probably going to appear to be brain dead to some readers, but I have been frustrated by not being able to configure the RDP connection to my Hadoop on Azure account. Fooled by the slick Metro UI, I had wrongly assumed that the only option was to click on the “Remote Desktop” button to get access, as per the lovely menu below:
However it was pointed out to me today that you can right click, save as…. and then you have your RDP connection file to configure to share local resources, etc. Doh.
Handy tip. Boy, do I feel silly….









