October Sydney training roundup – MS BI, Cloud, Analytics

The end of the year is closing in fast, but there are still plenty of chances to learn from specialist providers Agile BI, Presciient and, of course, me!

Topics cover the full spread of DW, BI and Analytics, so there’s something for every role in the data-focused organisation.

Build your Data Warehouse in SQL Server & SSIS with the BI Monkey

Nov 24/25 – Are you about to build your Data Warehouse with Microsoft tools and want to do it right first time?

This course is designed to help a novice understand what is involved in building a Data Warehouse both from a technical architecture and project delivery perspective. It also delivers you basic skills in the tools the Microsoft Business Intelligence suite offers you to do that with.

Get more detail here

Agile BI workshops

Power BI specialist Agile BI brings you product updates on this key new self-service BI technology:

Oct 15 – Power BI workshop – Excel new features for reporting and data analysis – more detail here

Oct 30 – What Every Manager Should Know About Microsoft Cloud, Power BI for Office 365 and SQL Server 2014 – more detail here

Presciient Training

Dr Eugene Dubossarsky shares his deep business and technical expertise across a range of advanced and basic analytics. Full details here, but the key list is:

Dec 9/10 – Predictive analytics and data science for big data

Dec 11/12 – Introduction to R and data visualisation

Dec 16/17 – Data analytics for fraud and anomaly detection, security and forensics

Dec 18/19 – Business analytics and data for beginners

 

Read More

What will we do when White Collar Automation takes our jobs?

While I’m in a bit of a groove about the future of the workplace, I may as well talk about how there may not be a future for the workplace.

Automation destroyed the working class

The Industrial Revolution was so long ago now that it qualifies as history. The replacement of skilled labour with machines wiped out a whole class of skilled workers, but simultaneously expanded opportunities for unskilled workers to such an extent that overall standards of living rose and most people saw this as a Good Thing(tm). However, since the seventies, robotics and computing have steadily stripped humans from the factory, to the point that the modern factory-floor workforce is only a tiny proportion of what it used to be. Similar effects can be found in farming, where vast farms are now run by just a handful of people.

Any repetitive physical task can be completed by a robot – and nobody has questioned this too hard. Factory conditions are harsh, and most people don’t want to perform the exact same task hundreds of times a day, given the physical and mental toll that can take.

However a clear upshot of this is that unskilled labour has little place in a modern economy. You could perhaps be a driver (a career with probably less than 20 years left before it becomes automated), work in retail (currently being seriously eroded by ecommerce), or work in construction (safe for now) – but the options are limited and shrinking. If a job doesn’t require physical presence (as bricklaying does) or face-to-face interaction (as most sales roles do), then it is potentially at risk.

A friend I’ve been debating this with recently thinks that office workers are more immune… but I think she’s being rather optimistic.

Analytics will destroy the middle class

Famed economist John Maynard Keynes once predicted widespread unemployment “due to our discovery of means of economising the use of labour outrunning the pace at which we can find new uses for labour” – i.e. we will make the economy so efficient that we don’t need all available working people to run it any more.

Now, this future has long been foreseen by science fiction writers and falls across a wide spectrum of possibilities. There’s the wildly optimistic future presented by the late Iain M. Banks in “The Culture”, where machines effectively take care of humanity in a benign manner and give people a life of luxury and freedom. Then there is the darker end, such as the dystopian Mega-Cities of Judge Dredd (from UK comic 2000AD), where wealth is concentrated in the hands of the few, 99% of the population is unemployed and lives off far-from-generous state handouts, and life for most people is pretty dismal.

According to a study by Oxford University, nearly 47% of US jobs are at high risk of being replaced by automation within the next 20 years. So this may be a reality we need to work out sooner rather than later. If your job involves decision making and has routine, repeatable elements to it, then it is at risk of a pattern-detecting engine being applied to it and that decision-making process being delegated to a machine. This could be as simple as approving a loan – something that is largely automated anyway – or as complex as diagnosing cancer.

Now, many people may resist this and argue that a machine could never replicate the subtlety of human thinking. To some extent that is true, but the quality of human decision making is poor, and it is arguable that handing over things such as medical diagnoses to systems that can absorb a volume of data far beyond our poor human brains’ capacity – and assess it rationally and fairly – may well improve the decisions that get made.

So, perhaps it is time to hail our new AI overlords, and let us pray they are kind to their creators…

Read More

TechEd 2013: I’ll be presenting!

I’ll be presenting at TechEd Australia 2013 on “Big Data, Small Data and Data Visualisation via Sentiment Analysis with HDInsight”.

In the session I’ll be looking at HDInsight – Microsoft’s implementation of Hadoop – and how to leverage it to perform some simple Sentiment Analysis, then link that up with structured data to perform some Data Visualisation using the Microsoft BI stack, especially Power View.

Hopefully this will also tie in with the release of a White Paper on the subject so anyone with deep technical interest can get hands on with the experience.

I’m excited to get a chance to present again – look forward to seeing you there!

Read More

Sydney BI Social is Wednesday 17th – “BI & NoSQL”

This Wednesday 17th June Sydney BI Social presents “BI & NoSQL” – presented by Stephen Young, CEO of GraphBase and architect of the GraphBase DBMS. Steve will give an overview of the various classes of NoSQL database, their advantages and disadvantages, with an emphasis on Graph Databases and the novel ways that they can be used for Business Intelligence purposes.

The venue as usual is City Hotel – pizza is provided courtesy of our sponsors Citi Recruitment – come and join the 33 BI professionals who have already signed up!

Read More

Taking out the trash in HDInsight

One thing Hadoop doesn’t do that effectively (right now, anyway) is clean up after itself. Like most file systems it has a trash bin (see “Space Reclamation” in the HDFS Architecture guide) which is supposed to clean itself up after “a configurable amount of time” – which appears to be 360 minutes (6 hours) according to core-site.xml in the HDInsight default setup.
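
For reference, the setting that controls this is the fs.trash.interval property in core-site.xml, whose value is in minutes. A sketch of what the entry looks like (check your own cluster’s file for the actual value):

<property>
  <name>fs.trash.interval</name>
  <!-- minutes to retain trash checkpoints; 360 = 6 hours -->
  <value>360</value>
</property>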

However I’ve found this doesn’t always happen at the speed I’d like, and some processes (which ones, I haven’t yet confirmed) also leave stuff lying around in the /tmp folder, which has to be manually cleaned up – as long as there’s nothing running, it seems to be safe to kill whatever is stored in /tmp. However, don’t blame me if it all goes wrong for you :)
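
As a rough sketch of that manual cleanup (the path in the second command is a placeholder, so list /tmp first and substitute whatever is actually taking up space):

hadoop fs -ls /tmp

hadoop fs -rmr /tmp/[directory to remove]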

HDFS Commands to help free up space

So there are a few things you can do to get out of this. The first is to avoid Trash disk space usage by adding the -skipTrash option to your deletes:

hadoop fs -rmr -skipTrash /user/hadoop/data

This sidesteps the Trash altogether. Of course, it also means you can’t retrieve anything from the Trash bin afterwards, so use it wisely.

The next thing you can do is reach for the expunge command, which forces an empty of the Trash:

hadoop fs -expunge

However this didn’t always seem to work for me, so it’s worth checking it has had the desired effect.
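
One way to check is to size the Trash directory before and after, assuming the default HDFS layout where each user’s Trash sits under their home directory (swap in your own user name):

hadoop fs -dus /user/hadoop/.Trash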

HDFS Commands to find what is using disk space

Sometimes the key thing is to find out where that disk space is being eaten up. Say hello to du (disk usage):

hadoop fs -dus /

This then gives you the size of the data on your datanodes. Then dig deeper with ls:

hadoop fs -ls /

This gives you the directories in the root. Use du to size them, find unexpected space, and delete using rm or rmr as required.
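
As a quick worked example of that loop (the paths below are placeholders only, so substitute whatever du shows is unexpectedly large):

hadoop fs -du /

hadoop fs -du /[suspect directory]

hadoop fs -rmr -skipTrash /[suspect directory]/[unwanted data]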

The full file system shell commands are listed here

Read More

Extract data from Hive using SSIS

So now that the Hive ODBC driver exists, the next thing to do is use SSIS to extract data from Hive into a SQL instance for… well, I’m sure we’ll find a reason for it.

Setting up the DSN

The first thing to do is set up a System DSN (Data Source Name) to reference in the ODBC connection. For SSIS, that means we need a 32 bit driver to reference, which in turn means finding the 32 bit ODBC Data Source Administrator. If you’re on a 32 bit OS, just go to the Control Panel and search for it. If you are on a 64 bit OS like me, you need to hunt it out – on Windows 7, it can be found at “C:\Windows\SysWOW64\odbcad32.exe”. Note you need to run it as Administrator to make changes.
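
For reference, on 64 bit Windows the two versions usually sit side by side at the paths below, with the 32 bit one (slightly counterintuitively) living under SysWOW64. Worth double-checking on your own machine:

C:\Windows\SysWOW64\odbcad32.exe (32 bit ODBC Data Source Administrator)

C:\Windows\System32\odbcad32.exe (64 bit ODBC Data Source Administrator)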

Go to the System DSN tab:

Fig 1: ODBC Data Source Administrator

Click “Add…”

Fig 2: ODBC Data Source Administrator

Scroll down the list until you find the “HIVE” driver, then click “Finish”, which brings up the ODBC Hive Setup dialog:

Fig 3: ODBC Data Source Administrator

Give your DSN a sensible name and description. For your host enter the cluster URL (without http://) – i.e. “[your cluster name].cloudapp.net”. Leave the port as 10000. Under Authentication select “Username/Password” and enter your username. Then click “OK” and we are ready to move on.
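
As an aside, an ODBC connection string referencing a System DSN like this generally boils down to something along the following lines. The values below are placeholders, so use whatever you just entered:

DSN=[your DSN name];UID=[your cluster user name];PWD=[your password]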

Connect in SSIS

To hook this into SSIS we need to create a Data Flow and add an ADO.NET Connection Manager. Not – as I initially thought – an ODBC Connection Manager.

Under Provider, select the “Odbc Data Provider” option listed under “.Net Providers”.

Fig 4: ADO.NET Connection Manager

Once that’s done, you can choose your just-created Data Source Name from the dropdown under “Data source specification”. Add your username and password to complete the setup, then click “OK”.

Fig 5: ADO.NET Connection Manager

Now the Connection Manager is set up, you can use it in a Data Flow. Add an ADO.NET Source and select your Connection Manager. Then you can – as with a normal database connection – select tables or write queries. In this example I’ve just chosen the HiveSampleTable that comes with every cluster.
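
If you’d rather write a query than pick a table from the list, a simple HiveQL statement along these lines should work through the driver (assuming the standard sample table is present on your cluster):

SELECT * FROM hivesampletable LIMIT 100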

Fig 6: ADO.NET Source

Then we route the data somewhere, in this case just pumping it to a Row Count.

Fig 7: ADO.NET Data Flow

I’ve put on a Data Viewer just to show it works.

Fig 8: ADO.NET Data Flow

And there we have it. Data from Hive on a Hadoop on Azure cluster via SSIS.

 

Read More

Download data from a Hadoop on Azure cluster

So you’ve run a job on Hadoop on Azure, and now you want that data somewhere more useful, like in your Data Warehouse for some analytics. If the Hive ODBC Driver isn’t an option (perhaps because you used Pig), then FTP is the way – there isn’t a Javascript console fs.get() command available.

As described in my Upload data post, you need to use curl, and the command syntax is:

curl -k ftps://[cluster user name]:[password md5 hash]@[cluster name].cloudapp.net:2226/[path to data or specific file on HDFS] -o [local path name on your machine]
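
As a purely hypothetical worked example (the cluster name, user and paths below are made up, and the hash shown is just the MD5 of the word “password”):

curl -k ftps://admin:5f4dcc3b5aa765d61d8327deb882cf99@mycluster.cloudapp.net:2226/example/data/StreamingOutput/abtj/part-00000 -o C:\hadoopdata\part-00000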

Happy downloading!

UPDATE: This functionality has now been disabled in HDInsight, see this thread from the MSDN Forum.

Read More

No Piggybank for Pig on Hadoop on Azure

A quick note – Pig functions from the piggybank are not available in Hadoop on Azure.

I found this out as I was trying to run some things through Pig, trying to manage some Excel CSV files that had fields with line feeds in them. I discovered there was a Pig load/store function, CSVExcelStorage, that would handle them, but when I tried to use it… ah. Not there. It turns out it is a piggybank function – the piggybank being a set of user-contributed functions that you have to include in your Pig build. The source code is freely available (being open source and all) but I haven’t worked out how you can build and use them in an HOA environment.

I can understand why Microsoft have opted not to include these – they’re not part of the core build, they’re user contributed, etc. – exactly the things you want to avoid when providing a massively reproducible, on-demand cloud environment. If I can work out how to include them, I’ll provide a followup post.

 

Read More

Save your RDP connection to Hadoop on Azure

This is probably going to appear to be brain dead to some readers, but I have been frustrated by not being able to configure the RDP connection to my Hadoop on Azure account. Fooled by the slick Metro UI, I had wrongly assumed that the only option was to click on the “Remote Desktop” button to get access, as per the lovely menu below:

Fig 1: Hadoop on Azure – Your Cluster menu

However it was pointed out to me today that you can right click, Save As… and then you have your RDP connection file, which you can configure to share local resources, etc. Doh.

Fig 2: Hadoop on Azure – Your Cluster menu – now with Right Click

Handy tip. Boy, do I feel silly….

Read More

Using Azure Blob Storage as a Data Source for Hadoop on Azure

One of the things I’ve learned from the Microsoft team behind Hadoop on Azure is that the Hadoop clusters’ short lifespan is in fact intentional – the clusters are intended to be disposable and exist for the lifetime of the analysis only.

So what happens if you want your raw data to live up in the cloud for longer? The answer is Azure Blob Storage. This gives you up to 100TB of storage per account, so it should be adequate for most cases. Hadoop on Azure can reference Azure Blob Storage (or Amazon S3 blob storage, if you want a non-Microsoft solution) directly as a location for input data.

Firstly, you need to have an Azure account and set up storage, instructions for which can be found here. Then you need to upload some data to it, which is most easily done using a nice tool called CloudBerry Explorer, which operates pretty much like an FTP tool. Then you need to configure your Hadoop on Azure instance to point at your Azure Blob Storage. Then, as per this guide, you can point your jobs at the Azure Blob Storage using the asv:// notation, like in the example below:

hadoop jar hadoop-streaming.jar -files "hdfs://10.NN.NN.NN:9000/example/apps/mappertwo.exe,hdfs://10.NN.NN.NN:9000/example/apps/reducertwo.exe" -mapper "mappertwo.exe" -reducer "reducertwo.exe" -input "asv://hadoop-test/" -output "/example/data/StreamingOutput/abtj"

This is of course slightly confused by the fact that, when setting up your job, the parameters can be marked as asv://, like below:

Fig 1: Job Parameters

However I couldn’t work out how to mark something as an input parameter as well as have it as ASV on the dropdown, so I left it as plain text and entered the command as:

-input “asv://hadoop-test/”

And it all worked – though I did find it didn’t handle direct file references; it would only accept storage references at the folder level.
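
So, to illustrate (the container and folder names here are hypothetical), a folder-level reference like the first line below worked, while a direct file reference like the second did not:

-input "asv://hadoop-test/logs/"

-input "asv://hadoop-test/logs/part-00000"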

So there you go – using Azure Blob Storage as a data source for Hadoop on Azure. Nice and easy.

Read More