Adventures in context with MDX

One of the key concepts in MDX is context – the idea that a calculation’s results are affected by the slicers applied to the portion of the cube the calculation works with. Today I had some fun getting the context right in a query through subcubes and WHERE clauses, so here are some useful lessons. Note the below examples run against the “Internet Operation” cube of the AdventureWorks tabular model.

Subcubes do not implicitly give context to created members

If you use a subcube to constrain your scope, as below, it doesn’t affect the scope of the calculated members. So the below query, which creates a subcube by using a SELECT FROM a secondary SELECT at the bottom:

WITH

MEMBER Measures.Context AS [Geography].[Geography].CurrentMember.UNIQUE_NAME

SELECT
{
Measures.Context
}
ON COLUMNS,

{
[Date].[Calendar].[Year]
*    [Date].[Month Name].Members
}
ON ROWS

FROM

// Subcube

(
SELECT
(
[Geography].[Geography].[Country Region].&[Australia]
,    [Date].[Calendar].[Year].&[2008]
)    ON COLUMNS

FROM [Internet Operation]
)

Gives a set of results like this:

MDX Subcube results

So note that despite the Subcube filtering to Australia, the calculated member is not aware of this scope limitation, and so it believes the CurrentMember of the Geography dimension is the default “All” member.

So, to give it the intended context, you need to include the Geography dimension on the rows, like below:

WITH

MEMBER Measures.Context AS [Geography].[Geography].CurrentMember.UNIQUE_NAME

SELECT
{
Measures.Context
}
ON COLUMNS,

{
[Geography].[Geography].[Country Region]
* [Date].[Calendar].[Year]
* [Date].[Month Name].Members
}
ON ROWS

FROM

// Subcube

(
SELECT
(
[Geography].[Geography].[Country Region].&[Australia]
, [Date].[Calendar].[Year].&[2008]
) ON COLUMNS

FROM [Internet Operation]
)

The addition of the Geography Dimension gives the calculated member the context to operate in, and the results are filtered to Australia by the filter in the subcube:

MDX Subcube results

Using a WHERE does give context to calculated members

Alternatively, using a WHERE clause does give context to calculated members:

WITH

MEMBER Measures.Context AS [Geography].[Geography].CurrentMember.UNIQUE_NAME

SELECT
{
Measures.Context
}
ON COLUMNS,

{
[Date].[Month Name].Members
}
ON ROWS

FROM [Internet Operation]

WHERE
(
[Geography].[Geography].[Country Region].&[Australia]
, [Date].[Calendar].[Year].&[2008]
)

As we can see from the results – we haven’t specified geography on the rows, just filtered in the WHERE – but the calculated member still has the context for the Geography dimension:

MDX WHERE results

Why is WHERE different from a Subcube?

The WHERE clause has a different effect from the Subcube because it sets the slicer context for the Geography dimension before the calculated members are evaluated, so CurrentMember picks up the filter. The Subcube only restricts the space the query operates over without changing the CurrentMember – hence the different results depending on whether the context is provided by the slicer (as in the WHERE example) or not at all (as per the first example).

Passing Parameters to MDX Shared Datasets in SSRS

So part of my ongoing punishment work involved making some changes to an SSRS report that was running off some MDX, querying a Tabular Model. It needed Parameters to feed Shared Datasets… and the two had to talk nicely. This was not as easy as it seems…

First of all, a big warning to anyone wanting to mix DAX and MDX on the same report: the two won’t talk well to each other. So in the below examples, whilst the DAX queries to get the parameters would have been relatively easy, they ultimately had to feed an MDX query (as that was what was pre-existing). This is a problem because Member Names are not in the same syntax for DAX and MDX, so you can’t easily feed a selected Member from a DAX resultset into an MDX operator like STRTOMEMBER.
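To make that concrete – a minimal, hypothetical sketch against the Adventure Works cube used later in this post – an operator like STRTOMEMBER wants the full MDX unique name, whereas a DAX resultset typically hands back just the raw column value, which it can’t resolve:

// Works: the string is an MDX unique name
SELECT [Measures].[Order Count] ON COLUMNS
FROM [Adventure Works]
WHERE STRTOMEMBER("[Date].[Calendar].[Calendar Year].&[2008]")

// Fails: a raw value out of a DAX resultset isn't a member reference
// WHERE STRTOMEMBER("2008")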

So, what we will do in this post is set up a simple report that has a user selectable parameter in the report that affects a shared dataset. Note the below examples run off AdventureWorks 2012.

Setting up a Parameter from MDX

Setting up an MDX Parameter to be usable is a little adventure in itself.  In order to be a Parameter that feeds into an MDX query, you need to pick up some additional Member properties along the way. So while this may get the members you want on the rows:

WITH
MEMBER    Measures.[Dummy] AS NULL

SELECT
{
[Measures].[Dummy]
}
ON COLUMNS,
{
[Date].[Calendar].[Calendar Year].Members
}
ON ROWS

FROM [Adventure Works]

It won’t play nicely when you try to use the member name in a function such as STRTOMEMBER, as it doesn’t return the Unique Member Name, which that function needs to work properly. So you need to write something like this:

WITH
MEMBER    Measures.ParameterCaption AS [Date].[Calendar].CURRENTMEMBER.MEMBER_CAPTION
MEMBER    Measures.ParameterUniqueName AS [Date].[Calendar].CURRENTMEMBER.UNIQUENAME
MEMBER    Measures.ParameterLevel AS [Date].[Calendar].CURRENTMEMBER.LEVEL.ORDINAL

SELECT
{
[Measures].[ParameterCaption]
,    [Measures].[ParameterUniqueName]
,    [Measures].[ParameterLevel]
}
ON COLUMNS,
{
[Date].[Calendar].[Calendar Year].Members
}
ON ROWS 

FROM [Adventure Works]

Returning these member properties gives you values that can be used more effectively in SSRS parameters. MEMBER_CAPTION gives the member name for display, UNIQUENAME gives the unique member name in the hierarchy so it can be used reliably in downstream MDX calculations, and finally the level ORDINAL number can be used if depth in the hierarchy is important.
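To give a feel for what comes back – illustrative values only, based on the Adventure Works Date dimension – the three columns for the 2008 member look something like the below, and it’s the unique name in the middle that gets passed around between datasets:

ParameterCaption:     CY 2008
ParameterUniqueName:  [Date].[Calendar].[Calendar Year].&[2008]
ParameterLevel:       1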

So now, to set it up in SSRS. I won’t go into too much detail – you can get most of this from the sample, but essentially create a shared dataset with the MDX for the parameter, add a report, then within the report create a dataset that uses the shared dataset. Then, we add a parameter and set it up as below (click on the images to see the full screen – the screen grabs were too big).

First, give it a name, a user-friendly label and allow multiple values (if you want).

SSRS MDX parameter setup properties

Then, we point the parameter to get values from the shared dataset we created for the parameter. Here we use the additional properties we added in the MDX query.  We use the member UNIQUENAME as the Value field and the CAPTION as the Label field. The Value field is the actual data that will be passed by the parameter to other datasets.

SSRS MDX parameter setup available values

Finally, to keep things clean, we use the same dataset to provide the default values for the parameter.

SSRS MDX parameter setup default values

.. and we’re done. Now to connect it up with the Shared Dataset properly.

Parameterise a Shared Dataset

First, we need to create a parameterised shared data set. So we will add a Shared Data Set with a parameter in the text of the query, like below:

SELECT NON EMPTY
{ [Measures].[Order Count] }
ON COLUMNS,
NON EMPTY
{ ([Ship Date].[Calendar Year].[Calendar Year].ALLMEMBERS ) }
ON ROWS
FROM [Adventure Works]

WHERE STRTOSET(@CalendarYears)

Note the parameter for the query is called out by preceding it with the “@” symbol.
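As a side note, because the parameter value ultimately comes from user input, it’s worth knowing that STRTOSET (and STRTOMEMBER) accept a CONSTRAINED flag, which restricts the string to actual member names and so acts as a basic guard against MDX injection – i.e. the last line of the query above would become:

WHERE STRTOSET(@CalendarYears, CONSTRAINED)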

Now it is important to add the query in the right place to get the parameterisation to work. Don’t use the function button and add the query directly – instead, go to the “Query Designer” option:

Define a Shared Data Set Query

Within the Query Designer, click the “Design Mode” option (the triangle on the far right):

Shared Data Set Query Designer

And then paste the query in there:

Shared Data Set Query Designer with the query pasted in

Then you need to click on the “Query Parameters” option – 5th from the right (the toolbar icons are minuscule at 1920×1080) – and set up the parameter. This time, specify the name without the @ symbol:

Shared Dataset Parameter setup

This involves naming the parameter and tying it to a Dimension, Hierarchy and optionally default value(s). After that, click OK, run the query to test, and if all is well, click OK to close the designer.

Adding the parameter to the query in this way automatically adds content to the Parameters section of the Shared Dataset – note  the parameter name here is not preceded by an “@” symbol:

Shared Dataset Parameter section

And you’re done.

Side warning about trying to set up parameters directly in the Shared Dataset properties:

One thing to be careful of: setting up the parameter directly in the Shared Dataset properties, instead of via the Query Designer parameter setup, adds a default value to the parameter which may throw this error when you run the report:

An error occurred during local report processing.
The definition of the report ‘[Report Name]’ is invalid.

An unexpected error occurred while compiling expressions. Native compiler return value: ‘[BC40000] ‘RequestMinimum’ is obsolete: ‘Assembly level declarative security is obsolete and is no longer enforced by the CLR by default. See http://go.microsoft.com/fwlink/?LinklD=155570 for more information.’.’.

I haven’t tested any other context, but this was using SSDT as part of VS2012. The reason is that the default parameter value it creates looks like this:

=new Object() {[Date].[Calendar Year].&[2008], [Date].[Calendar Year].&[2009], [Date].[Calendar Year].&[2007]}

Which causes errors at runtime. It seems to be impossible to correct as well – you can edit it but it always reverts back to the created value. So make sure you follow the steps above – as a reminder:

  1. Create a shared dataset
  2. Create the query in Query Designer mode
  3. Set up the parameter via the Query Designer “Query Parameters” window

Parameterise the Report

Next we need to set all this up in a report. Add a new dataset to the report that refers to the shared dataset we just created, and put a Tablix on the report that uses the Shared Dataset as a source.

Because the source Shared Dataset is parameterised, the report dataset will carry that parameter across:

Report Dataset Parameters

It will map the Parameter Name from the query (without an “@”) to the Parameter Value available from the report (with the “@”). Now, because of planning, I named the parameter created in the first section the same as the one for the Shared Dataset, so it has auto-magically mapped correctly.

Now, if you execute the report and set some values for the parameter, the parameter values will successfully pass via the report to the Shared Data Set and affect the results it passes back to the report.

There you go. Not the most obvious thing in the world, and there’s lots of ways to get it wrong – especially by putting @ symbols in the wrong place – but if you follow the pattern laid out in this post, you should avoid too much trouble.

The sample project can be downloaded here: Shared Dataset Parameters Example Project

Adventures in DAX, a mild grumble

Being back in the field is proving to be interesting as I get back into the daily frustrations and challenges of development (as opposed to those of PowerPoint). Anyway, one of the new things I’m hands on with is Tabular Models and DAX table queries, and I’m feeling a little underwhelmed.

Syntax

I’m now torn over whether to hate DAX more than MDX. It seems that in the design of DAX table queries, any opportunity to make it accessible was largely ignored. What do I mean? Well… here’s how to get all the rows from a given table in a Tabular Model:

EVALUATE
(
'A Fact Table'
)

Ugh…   “EVALUATE”? I’m sure it makes sense to someone, but even MDX managed to have “SELECT”.
But then hiding in there are some weird syntactic quirks at the simplest level. Say I want to SUM a given column, I do this:

EVALUATE
(
SUMMARIZE
(
'A Fact Table',
'A Fact Table'[My Measure]
)
)

So setting aside the Americanisation of SUMMARIZE, the above only actually works if the Measure is in the base table. If it’s a calculated column, you have to give it an alias to be able to access it, i.e.:

EVALUATE
(
SUMMARIZE
(
'A Fact Table',
"My Alias", 'A Fact Table'[My Measure]
)
)

So that’s just annoying, as the error message if you don’t use an alias simply implies the measure isn’t there.
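For what it’s worth, the alias pattern is also what you use once you start grouping – a quick sketch below, with the table and column names being hypothetical stand-ins for whatever is in your model:

EVALUATE
(
SUMMARIZE
(
'A Fact Table',
'A Date Table'[Calendar Year],
"Total Amount", SUM ( 'A Fact Table'[An Amount Column] )
)
)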

Usability

That’s of course just the small stuff. There’s the bigger challenges.

SSMS has no DAX query editor, so you need to use an MDX query template. And then in that there’s no IntelliSense, and the Metadata panel is useless, as it’s geared to querying your Tabular Model in MDX mode.

SSRS can only cope with DAX queries if you pretend you are writing a DMX query.

I’ve yet to interact with it via SSIS but I doubt the experience is much better.

The Upside

Anyway, it’s not all bad. To get started, take a look at these posts by Chris Webb (don’t fall over yourself looking at the official documentation though, it’s a bit sparse).

The language looks pretty powerful and getting sets of data out of it is simpler than MDX. However it’s not as friendly as it could be and it’s lacking some basic support as part of day to day usability. Hopefully this will improve by SQL2016…

SQL PASS DW/BI VC on Youtube!

Enough acronyms for you?

SQL PASS (the Professional Association for SQL Server) runs what they call “Virtual Chapters”, which provide online training and activity – I presented to one last year as a practice run for my TechEd session. The BI/DW chapter is run by local SQL MVP Julie Koesmarno (perhaps some of you know her as Ms SQL Girl).

Anyway – the VC has now caught up with this wacky YouTube fad thing and got themselves a YouTube Channel. There’s a couple of videos I’m going to get stuck into sharpish – the Data Vault (subject for a future blog post) and BIML (same) ones. There’s lots more and plenty to come, so bookmark it as a new and useful resource.

So thanks Julie – I’ve added a link to her blog on the blogroll & to the YouTube channel under training resources.

Is ETL Development doomed?

A slightly dramatic title, but over the last few months I’ve been exposed to a number of tools that will provide a strong layer of automation to ETL development, eliminating a lot of the basic need for ETL developers to shift data from System A to Database B and beyond.

I’ve been hands on with a couple:

And also heard about a couple more:

… and I’m certain this list is not comprehensive. The significant takeaway is that software build automation in the BI world is starting to catch up with where other software languages have already been (coded a website lately? Many IDEs do most of the hard work for you now). Much as IDE-driven tools such as DTS, SSIS and so on moved us away from hand-coding SQL and wrapping up those scripts, the latest round of tools is moving us away from the IDEs where we drag and drop our flows.

How will ETL careers be killed?

There seem to be a couple of tracks for this. First is the pure development automation tools, such as Varigence MIST. If you are technically minded, take a look at this product demo video – though I suggest skipping to about 25 minutes in to see the real meat, as it does go on a bit. It looks mindbogglingly powerful but is clearly shooting at the ETL pro who wants to churn stuff out faster, more consistently and with less fiddling about. MIST is limited to SSIS/AS (for now) and I’m not sure how far it will go, as it’s clearly aimed at the developer pro market, which is not always the big buyers. I expect to be playing with it more over the next few weeks on a live project so should be able to get a better view.

The second path appears to be more targeted at eliminating ETL developers in their entirety. AnalytixDS wraps up metadata import (i.e. you suck in your source and target metadata from the systems or ERWIN), field mapping and rule application, then “push button, make code”. Obviously there’s a bit more to it than that, but the less you care about your back end and the quality of your ETL code (cough Wherescape cough), the more likely this product will appeal to you. Say hello, business users, who are the big buyers (though I look forward to troubleshooting your non-scalable disasters in the near future).

What’s the diagnosis, doctor?

Long term, the demand for ETL skills will decline on the back of these tools. Simple ETL work will simply go away, but the hard stuff will remain and it will become an even more niche skill that will pay handsomely – though you may spend more time configuring and troubleshooting products than working with raw code. Which class of tool dominates is uncertain, but I’m leaning towards the business oriented mapping tools that completely abstract away from ETL development altogether.

If you’ve had any experience with these tools, I’d love to hear about them in the comments.

Gartner 2014 Magic Quadrant for Business Intelligence

The Gartner Magic Quadrant for Business Intelligence and Analytics Platforms is now available.

Good news for Microsoft again – it remains in the Leaders quadrant, though in line with all the other MegaVendors it has slipped a bit due to a weak story around data discovery. It remains a well loved platform among users, developers and architects alike, and is increasingly becoming the standard enterprise product. For those of us working in the field it remains a safe bet from a career point of view for a good few years yet.

On the downside there are the same bugbears we are still complaining about – no credible mobile story, metadata management is non-existent (hello Project Barcelona – no news for 2 years now?)  and PowerView, while shiny, is no match for the likes of QlikView or Tableau (regardless of how ugly they are behind the shiny screens, that’s what the users see and judge you on).

Anyway, not too shabby a report card, a decent score but the usual caveat of “could try harder”. But the other big kids (IBM Cognos, SAS, Oracle) are doing pretty much the same so not much to worry about.

TechEd 2013: I’ll be presenting!

I’ll be presenting at TechEd Australia 2013 on “Big Data, Small Data and Data Visualisation via Sentiment Analysis with HDInsight”.

In the session I’ll be looking at HDInsight – Microsoft’s implementation of Hadoop – and how to leverage that to perform some simple Sentiment Analysis, then link that up with structured data to perform some Data Visualisation using the Microsoft BI stack, especially PowerView.

Hopefully this will also tie in with the release of a White Paper on the subject so anyone with deep technical interest can get hands on with the experience.

I’m excited to get a chance to present again – look forward to seeing you there!

Compression in Hadoop Streaming Jobs

The thing about Big Data is, well… it’s big. Which has impacts in terms of how long it takes you to move your data about and the space it needs to be stored in. Now, as a novice, I had assumed that you had to decompress your data to process it, and I also had to tolerate the huge volumes of output my (admittedly not very efficient) code produced.

As it turns out, you can not only process input in a compressed format, you can also compress the output – as detailed in the Hadoop Streaming documentation. So now my jobs start smaller and end smaller, and without a massive performance overhead.

So how does it work? Well, to read compressed data you have to configure absolutely nothing. It just works, as long as Hadoop recognises the compression algorithm. To compress the output, you need to tell the job to do so. Using the “-D” option you can set some generic command options to configure the job. A sample job – formatted for HDInsight – is below, with the key options being the two “-D” settings:

c:\hadoop\hadoop-1.1.0-SNAPSHOT\bin\hadoop.cmd jar
C:\Hadoop\websites\HadoopDashboard\Models\Samples\hadoop-streaming.jar
"-D mapred.output.compress=true"
"-D mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec"
-files "hdfs://localhost:8020/user/hadoop/code/Sentiment_v2.exe"
-numReduceTasks 0
-mapper "Sentiment_v2.exe"
-input "/user/hadoop/data/"
-output "/user/hadoop/output/Sentiment"
This tells the job to compress the output, and to use GZip as the compression technique.
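As an aside – this is just from reading the same streaming documentation for this Hadoop 1.x build, and I haven’t measured the benefit myself – you can also compress the intermediate map output (more useful when the job actually has reducers, unlike this one) with the matching pair of settings:

"-D mapred.compress.map.output=true"
"-D mapred.map.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec"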

And now, my jobs are still inefficient but at least take up less disk space!

Reference Environment Variables in C# Mappers for HDInsight

Within your Mappers and Reducers there may be a need to reference the environment variables being fed to the task, such as the file name. Understanding how to do so took a little digging on my part, with a little help from Matt Winkler in the HDInsight MSDN forum.

Using this snippet of code:

// Adding this reference at the start of the code
using System.Collections;

// Write out every environment variable as "key|value" in the mapper output
foreach (DictionaryEntry entry in Environment.GetEnvironmentVariables())
{
    Console.WriteLine("{0}", entry.Key + "|" + entry.Value);
}

// Some junk code so the mapper doesn't fail
string line; // Variable to hold current line
while ((line = Console.ReadLine()) != null)
{
    // do nothing
}

It was possible to output all the Environment Variables as the Mapper output and work out their format from the resultant text file it created.

Then, to reference individual Environment Variables in the Mapper, you can simply use variations on:

string FileName = System.Environment.GetEnvironmentVariable("map_input_file");
string FileChunk = System.Environment.GetEnvironmentVariable("map_input_start");
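If all you’re after is the file name itself rather than the full HDFS path, a small sketch – assuming a standard .NET console mapper, so System.IO is available – is to trim it down with Path.GetFileName:

// map_input_file holds the full HDFS path of the split being processed
string FilePath = System.Environment.GetEnvironmentVariable("map_input_file");

// Path.GetFileName strips it back to just the file name (it copes with forward slashes)
string FileNameOnly = System.IO.Path.GetFileName(FilePath);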

Extract data from Hive using SSIS

So now the Hive ODBC driver exists, the next thing to do is use SSIS to extract data from Hive into a SQL instance for… well, I’m sure we’ll find a reason for it.

Setting up the DSN

The first thing to do is set up a System DSN (Data Source Name) to reference in the ODBC connection. For SSIS, that means we need a 32 bit driver to reference, which means we need to find the 32 Bit ODBC Data Source Administrator. If you’re on a 32 Bit OS, just go to the Control Panel and search for it. If you are on a 64 Bit OS like me, you need to hunt it out. On Windows 7, it can be found at “C:\Windows\SysWOW64\odbcad32.exe”. Note you need to run as Administrator to make changes.

Go to the System DSN:

Fig 1: ODBC Data Source Administrator

Click “Add…”

Fig 2: ODBC Data Source Administrator

Scroll down the list until you find the “HIVE” driver, then click “Finish”, which brings up the ODBC Hive Setup dialog:

Fig 3: ODBC Data Source Administrator

Give your DSN a sensible name and description. For your host enter the cluster URL (without http://) – i.e. “[your cluster name].cloudapp.net”. Leave the port as 10000. Under Authentication select “Username/Password” and enter your username. Then click “OK” and we are ready to move on.

Connect in SSIS

To hook this into SSIS we need to create a Data Flow and add an ADO.NET Connection Manager. Not – as I initially thought – an ODBC Connection Manager.

Under Provider, select the “Odbc Data Provider” option under “.Net Providers”.

Fig 4: ADO.NET Connection Manager

Once that’s done, you can choose the Data Source Name you just created using the dropdown under “Data source specification”. Add your username and password to complete setup, then click “OK”.

Fig 5: ADO.NET Connection Manager

Now the Connection Manager is set up, you can use it in a Data Flow. Add an ADO.NET Source, and select your Connection Manager. Then you can – as per a normal database connection – select tables or write queries. In this example I’ve just chosen the HiveSampleTable that comes with every cluster.
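If you’d rather write a query than pick the whole table, plain HiveQL works through the same connection – a minimal example below (the LIMIT is just there to keep the test quick, as each Hive query kicks off MapReduce jobs on the cluster):

SELECT * FROM HiveSampleTable LIMIT 100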

Fig 6: ADO.NET Source

Then we route the data somewhere, in this case just pumping it to a Row Count.

Fig 7: ADO.NET Data Flow

I’ve put on a Data Viewer just to show it works.

Fig 8: ADO.NET Data Flow

And there we have it. Data from Hive on a Hadoop on Azure cluster via SSIS.