SSAS Tabular at Scale

The cube on my project has been hitting some apparent concurrency issues, so I’ve been hunting for advice on how to tune the hardware (model tuning has already gone a long way). Unfortunately Microsoft don’t have any reference architectures – and their only other advice was to try using an appliance in Direct Query mode, which was not practical in our circumstances anyway.

As usual, the gents at SQLBI had something useful to say on the subject based on a customer case study, which is detailed in this white paper. While well worth a read, I’ll summarise the key findings:

  • Standard server CPUs don’t perform well enough, and you will need to look at faster CPUs with a large cache
  • Faster CPUs are better than more CPUs in terms of return on investment for performance
  • Fast RAM is a must
  • For NUMA-aware servers you need to set the Node Affinity to a single node, preferably using a Hyper-V host for your tabular server

Setting aside the last point, which is a bit deep in server config and requires more explanation, the key thing is to look for fast CPUs. They found that Workstation Blades were generally better than Server Blades, and some of the best performance they got was out of one of their devs’ gaming rigs!

We’ll be trying some of this out and hopefully I can keep you posted with results. I have more stuff on monitoring tabular in the pipeline now I’ve finished my PowerPivot book (to be published soon).

Also, don’t forget my upcoming DW Starter Training on Nov 24/25 in Sydney

October Sydney training roundup – MS BI, Cloud, Analytics

The end of the year is closing in fast but there are still plenty of chances to learn from specialist providers Agile BI, Presciient and, of course, me!

Topics cover the full spread of DW, BI and Analytics so there’s something for every role in the data focused organisation.

Build your Data Warehouse in SQL Server & SSIS with the BI Monkey

Nov 24/25 – Are you about to build your Data Warehouse with Microsoft tools and want to do it right first time?

This course is designed to help a novice understand what is involved in building a Data Warehouse, both from a technical architecture and a project delivery perspective. It also gives you basic skills in the tools the Microsoft Business Intelligence suite offers for doing just that.

Get more detail here

Agile BI workshops

Power BI specialist Agile BI brings you product updates on this key new self-service BI technology:

Oct 15 – Power BI workshop – Excel new features for reporting and data analysis – more detail here

Oct 30 – What Every Manager Should Know About Microsoft Cloud, Power BI for Office 365 and SQL Server 2014 – more detail here

Presciient Training

Dr Eugene Dubossarsky shares his deep business and technical expertise across a range of advanced and basic analytics. Full details here but the key list is:

Dec 9/10 – Predictive analytics and data science for big data

Dec 11/12 – Introduction to R and data visualisation

Dec 16/17 – Data analytics for fraud and anomaly detection, security and forensics

Dec 18/19 – Business analytics and data for beginners

 

PowerPivot calculated columns are not dynamic

A quick and dirty one – in attempting some clever dynamic security modelling in DAX I was warned about a gotcha in my design: calculated columns are only evaluated when the model is processed, so any approach based on calculated columns was doomed to failure. I didn’t quite believe it, so I decided to do a road test in PowerPivot using a simple data set of one Dimension and one Fact, as below:

Simple Data Model

I hooked them up in PowerPivot with a relationship between “Column” on both tables. Then, using the ISFILTERED() function I created a couple of calculations. One, at Row Level, that would return a value of 1 if I filtered on the Attribute column:

=IF(ISFILTERED(Dim[Attribute]),1,0)

Which I then put a SUM on top of. I also added one at measure level, performing a similar function:

FilterCheckAsMeasure:=IF(ISFILTERED(Dim[Attribute]),1,0)
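
For completeness, the SUM sitting on top of the calculated column is nothing more than a basic aggregate along these lines – I’m assuming here the calculated column was named FilterCheckAsColumn, so adjust for whatever yours is called:

FilterCheckColumnSum:=SUM(Dim[FilterCheckAsColumn])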

Then I created a Pivot Table checking the results, and got this:

Results

The takeaway being that filtering on the Attribute was picked up by the table level measure, but the calculated column did not change value.

You can have a check of this yourself in my sample workbook: DAX Calculated Columns Evaluation Order.

What does the documentation say?

Well, nothing terribly clear, but on this page there is this small paragraph:

When a column contains a formula, the value is computed for each row. The results are calculated for the column as soon as you create the formula. Column values are then recalculated as necessary, such as when the underlying data is refreshed.

It doesn’t say exactly when “recalculated as necessary” is necessary, but the implication is that it’s a model level change, rather than the switching of context, or in my case the changing of the user viewing the model.

So in summary, we have to assume that our calculated column values are fixed upon initial calculation, formula changes or data load (or processing in a Tabular model) and there isn’t a way to make the value in a given cell change.

BISM Normaliser is a cocktail for the Tabular Model

Well, that title makes much more sense in the context of this post, in which I mused about the difficulty of developing against tabular models in a multi-developer environment, given there is only one .bim file to work against. I even raised a connect to give the SSAS team something else to mark as “Won’t Fix” for the next release (cynical, me?).

Now, to stretch an analogy: if the problem is two drinks and only one mouth, then the solution clearly is cocktails*!

Mix me up a BISM Normaliser, barman!

A chap called Christian Wade has kindly built a nifty Visual Studio plug-in called BISM Normaliser which handily merges two tabular models together, giving you an option for handling development in a multi-developer environment. You put two models side by side and get a comparison screen like this:

bism normaliser

You can then merge in tables, columns, relationships, measures – all the good stuff. It’s like a diff but considerably more usable than doing a raw XML comparison. This means if you start from the same base model – advisable as tables are grouped by connections so if your connections don’t match you can’t merge – the dev team can work on separate areas and then merge it back together at some point.

It’s not a substitute for a proper multi-author environment, but at least it makes it possible. There are risks, of course – it’s a no-warranty CodePlex plug-in, you won’t get the benefits of TFS-managed components (source control, change tracking, etc.), and the code is currently set to expire in Dec 2014, so if Christian sells the code you’ll need to buy it off someone.

Anyway – there is a partial solution. On our project we’ve given it a first pass and it seems to do what it claims, and since we have no alternative it’s going to get used. So, big thanks to Christian!

 

 

*Or Jagerbombs, but let’s not go there:

Jagerbombs are not the answer (in this case)

Multiple developers against a single Tabular model is a drinking problem

The project I’m currently working on has at its heart a whopping great tabular model with dreams of eating more tables and being the biggest, fattest tabular model there ever was. To achieve its ambition of eating all this lovely data and becoming so vast it needs an army of chefs… er, developers… to feed it.

Except it has one big problem:

Two hands one mouth drinking problem

 

It only has one “mouth” – the .bim file. Only one person at a time can work on it. If you want to do simultaneous development, realistically all you can do is maintain separate versions and merge your changes together with a diff tool, manually, or potentially via BIML script if that eventuates.

So I’ve raised a connect: Componentise Tabular models to allow for multi user development to request that the .bim file can get broken up into chunks. The text of the connect is below:

At present the Tabular model is a single code object (the .bim) file. This means it is not possible to break up the development of a model across multiple developers – a problem in large projects. It also makes source control less effective as you cannot track which specific objects within the model have been changed.

The model needs to be componentised so that individual elements can be worked on separately and then merged back into a main model, in a manner more similar to OLAP cubes.

Elements that could be broken out are: Perspectives, Connection Managers, Individual tables, Calculated Columns, Measures and relationships

Anyway…  if you think this is a problem you’d like to see solved – vote it up!

BIML and MIST – a first encounter

The MIST developers – Varigence – have been waving their product at me for a wee while now and I’ve finally had a chance to get into the IDE and get a better feel for it.

Before I get too carried away, here’s a quick 101. There’s this thing called BIML – an XML dialect for describing BI assets (for now, only in the Microsoft world). This opens the door to scripting and therefore simpler dynamic generation of BI objects. BIML can be used by BIDS Helper (a thing you should have if you are an active BI developer) or the more focused BIML IDE, MIST.

Now, I’ve seen the shiny video that promised the world, but nothing quite beats hands on experience. So I’ve started following the online user guide and got as far as building a Dimension table.

My feelings so far? I’m a bit “meh”. Now I know there’s a lot more capability to the product which I haven’t got to yet – so this is far from final commentary – but there are a few clear things that I think need to be looked at to give it the sense of really being a development accelerator.

First up, it’s pretty clunky. It suffers heavily from “kitchen sinkism” – i.e. because it can do something, there’s a dialog box / screen / tab for it, all displayed at once. Take for example this table development screen:

MIST Table Development

 

There’s a lot going on, and that’s on a 1920×1080 screen… some better screen space organisation is needed.

Next up is the fact that the designers don’t add a lot over basic SSMS capability. The table designer is effectively the same blank grid you get in the SSMS table designer, but without even the courtesy of being able to add another row without going back to the ribbon. At this point I’d be more inclined to develop in SSMS and import into MIST.

Then my next concern is over value add / accessibility. For example, when setting up a project there’s some basic stuff that needs to be done – setting up connections, importing from databases – that should just be a wizard when starting a fresh project. When creating a dimension, a bundle of default attributes should be added (preferably from a checklist or custom template).

So my first impression is that it needs a user experience workover. However, this is a criticism far from unique to MIST among developer tools, so I won’t go too hard on them. I’ll press on with it and see how the functionality unveils itself.

My eBook “HDInsight Succinctly” has been published by SyncFusion!

Thanks to the lovely people over at Syncfusion I’ve been able to share my experiences with HDInsight in a short eBook for the “Succinctly” series which was released this weekend. It is unsurprisingly called “HDInsight Succinctly” and is free to download.

For a quick precis this is the summary description from the website:

Master the higher-level languages and other features necessary to process data with HDInsight. Learn how to set up and manage HDInsight clusters on Azure, how to use Azure Blob Storage to store input and output data, connect with Microsoft BI, and much more. With the guidance of author James Beresford, HDInsight Succinctly will reveal a new avenue of data management.

You can also read my guest blog on Syncfusion’s site entitled “Being on the cutting edge is fun!” where I spend a bit of time extolling the virtues of HDInsight and Microsoft BI and getting all excited about analytics and its place in the future.

Download it for free here.

 

SQL Server Data Warehouse Training in Sydney – July 28

The eagle-eyed among you may have noticed I’ve added a big countdown banner to the right of the page announcing a course, “Build your Data Warehouse in SQL Server & SSIS with the BI Monkey”, running for 2 days from July 28.

The course is aimed at giving an overview of how to build and manage a Data Warehouse using the SQL Server platform as a base. It aims to be a solid mix of Data Warehouse theory and hands-on development to get the concepts locked in. For full details see my new “Training” page, but the key outcomes are:

  • Learn how to get the Architecture right and keep your solution flexible
  • Understand the whole lifecycle of a DW project
  • Get hands on with the tools you need to move data

 

The high level agenda is:

  • The Three Main Data Warehouse architectures – Kimball, Inmon & Data Vault
  • SSIS Basics – Projects, Control Flows and Data Flows
  • Moving Data with SSIS
  • ETL Control Frameworks
  • Database considerations
  • How HDInsight (Hadoop) fits in
  • Code Management with Team Foundation Server
  • Testing
  • Build Automation
  • Deployment
  • Management
  • MDS, DQS and other useful Microsoft Toys

 
Click the button to register now – I look forward to seeing you!

Eventbrite - Build your Data Warehouse in SQL Server & SSIS with the BI Monkey

SSAS Tabular Compression of Binary and Flag fields

I’m currently doing some Tabular Cube design, and part of the design has to include several “Y/N” flag fields. My inner geek then wondered what, if any, difference the data type made to the compression rates of the cube.

So, to start with I created a SQL Server table with three columns, using the three usual types for this kind of data:

  • char(1)
  • tinyint
  • bit

… and then populated it with a mix of Y/N’s for char and 1/0’s for tinyint and bit. I thought about 40 million rows should do it, as that gave me about half a gig of on-disk data.

Next I sucked it into a Tabular model, which gave me the following data types:

  • char(1) -> Text
  • tinyint -> Whole Number
  • bit -> TRUE/FALSE

Then I processed the model three times, each time using only one of the columns in the data. Then, using the SSAS DMV $system.discover_object_memory_usage, I looked at how much memory each column ended up using.
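
If you want to repeat the exercise, that DMV can be queried with a plain SELECT from an MDX query window in SSMS connected to the Tabular instance – something along these lines (no filtering applied, so expect plenty of rows back):

SELECT * FROM $system.discover_object_memory_usage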

The results were:

  • char(1) = 323,754 bytes
  • tinyint = 304,256 bytes
  • bit = 303,840 bytes

… and yes, that is bytes. Assuming that the data is stored evenly on disk, that means approx 180 million bytes were compressed to 300 thousand – a compression ratio of 600:1 – yoinks!

Anyway, the first pass of the results seems to show that char is mildly more inefficient than tinyint and bit. However, if you take it a level deeper and just look at the memory assigned to the In-Memory Table.Columns element – the most significant source of difference in memory usage – you see this:

  • char(1) = 17,944 bytes
  • tinyint = 2,344 bytes
  • bit = 2,344 bytes

So it appears that char(1) is much less efficient than tinyint or bit for storage, which end up being the same at this level of detail. For those of you who care deeply, you can view my results spreadsheet.

My conclusion? To be honest the amount of memory used by these columns is going to be pretty trivial unless you have a gigantic dataset, but if you have a flag field on your fact tables, I would choose bit in the database and TRUE/FALSE in the tabular model if you can.

Dynamic Parent Child Security using DAX

I was recently posed a challenge to implement the following requirements in a Tabular model:

  1. Secure a Parent Child dimension at a Parent level, allowing all children to be visible
  2. Allow multiple Parents to be securable to a given user
  3. Keep it as easy to administer as possible

The first two requirements are technical and internal to the tabular model, but the last inherently implied that the use of Roles and AD Groups was out, as this is not a very user-friendly approach.

Note that in this exercise I will be using the DimEmployee Dimension from the AdventureWorks 2012 DW sample database.

Row Level Security in Tabular Models

Tabular Models apply security at the row level through row filters that restrict what a given Role can see. You can make the effect of a Role dynamic through the use of lookup tables and the USERNAME() function. There is a detailed white paper on securing Tabular Models on MSDN: Securing the Tabular BI Semantic Model

In a non-parent child situation this is simple to manage as it is easy to establish a relationship between viewable members (i.e. rows) and users via a bridging table – the paper mentioned above has a clear example of this.
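
As a rough sketch of that simple case – and note the UserSecurity table and its [EmployeeKey] column here are my own illustration rather than anything from the paper – a role’s row filter on the dimension can simply check whether the current user has a matching row in the bridging table:

=CONTAINS(UserSecurity, UserSecurity[User], USERNAME(), UserSecurity[EmployeeKey], DimEmployee[EmployeeKey])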

How Parent Child Dimensions complicate things

Parent Child Dimensions add complications because there isn’t a straightforward relationship to manage. You can secure the parent by filtering on the parent’s unique key – as below – but any child row, while aware of the relationship to its immediate parent via the ParentEmployeeKey, cannot be filtered on the same column.

DimEmployee in AdventureWorks

A consequence of this is that if you want to secure using row filters, you would need a table that secures each parent and child explicitly for each user. This is obviously not easy to administer and if the hierarchy changes you need to rebuild your security all over again.

How can I make Security Dynamic?

DAX, of course! This is a variation on the scenario described in the white paper – except in this case it is complicated by the fact that the table to be secured contains no data that can be used to apply security directly.

How do I apply this to a Parent-Child Dimension?

Fortunately there are ways to make rows aware of the hierarchy that applies to them. Of particular relevance in our case are the PATH and PATHCONTAINS functions.

The PATH function returns a text delimited list of hierarchy members from the current member to the topmost parent. The PATHCONTAINS function scans that list for a specific member and returns a boolean TRUE/FALSE result as to whether that specific member is there. So if we have the following functions in our model:

PATH:=PATH([EmployeeKey],[ParentEmployeeKey])

PATHCONTAINS:=PATHCONTAINS(PATH([EmployeeKey],[ParentEmployeeKey]),189)

We get a set of results like this:

Parent Child Dimension with PATH and PATHCONTAINS DAX functions

So the PATH for Employee Key 12 is “112 | 23 | 189 | 12” – key 12 is the leaf, and the three preceding members are its ancestors in the hierarchy. The PATHCONTAINS test for member 189 returns TRUE as it is present in the PATH.

So, now we have a means of evaluating, in a single column, what that row’s place in the hierarchy is, and more importantly what its parents are.

The next challenge is to map this to the user data. Now, as I mentioned, there isn’t a meaningful relationship that can be established between the data and the security table, which I’ll show below:

PCSecurity

This means any relationship needs to be generated on the fly. Also note it’s not a single higher parent per user – a user may need to be able to view different paths in the hierarchy. An example of this may be a manager with responsibility for managing employee performance across multiple units but without a direct reporting line in the hierarchy.

GENERATE a relationship dynamically

The data items can be linked using the GENERATE function in conjunction with some FILTERs to determine what applies. GENERATE creates a cartesian product of two tables, so by using GENERATE between the Dimension and the Security table we get a view of all possible combinations of user and dimension rows:

GENERATE(UserSecurity,DimEmployee)

Now this is only a starting point, because we need to return a single value within our equation, otherwise DAX will throw an error. Also, we don’t need everything from UserSecurity, just the items that apply to the current user. So first, let’s FILTER our UserSecurity table:

GENERATE(FILTER(UserSecurity,UserSecurity[User]="U4"),DimEmployee)

Note I’ve hardcoded the User here to “U4” but in a real life context you would use the USERNAME() function to get that name. Now we’ve cut down one side of the cartesian product, but we also need to cut down the other side to get the members we are allowed to see. So here we filter using PATHCONTAINS to pick up the items that are the permitted members or children of those members:

GENERATE(FILTER(UserSecurity,UserSecurity[User]="U4"),FILTER('DimEmployee',PATHCONTAINS(PATH([EmployeeKey],[ParentEmployeeKey]),UserSecurity[Highest Parent])))

So now in the generated table we have all the members of the Dimension that the current user is permitted to see. It’s worth taking a quick diversion from DAX to show how this works in practice:

PATHCONTAINS Example

Because the PATH contains the keys for all members in the hierarchy from the current member to the topmost parent, any PATH that contains the highest allowable parent will pass the PATHCONTAINS test. So, looking for member 189: rows 1 & 2 fail because, although they are part of the overall path, they are above the highest permissible member. Rows 3 & 4 pass because the highest permissible member features in their path to the topmost parent. Row 5 fails because the highest permissible parent doesn’t feature in its PATH at all.

Note that because the result at this point is still a table, as opposed to a single value, it handles users with multiple permissions correctly. The next part is to check, for the specific row of the Dimension we are evaluating, whether that row is in the permitted list, which we can do by putting an outer FILTER on the GENERATE:

FILTER(GENERATE(FILTER(UserSecurity,UserSecurity[User]="U4"),FILTER('DimEmployee',PATHCONTAINS(PATH([EmployeeKey],[ParentEmployeeKey]),UserSecurity[Highest Parent]))),DimEmployee[EmployeeKey]=EARLIER(DimEmployee[EmployeeKey]))

We use the EARLIER function to get the current value of the EmployeeKey from the outer evaluation pass and give us the context specific to this row. Our final step is to convert this into a single value, which I do using a COUNTROWS function:

COUNTROWS(FILTER(GENERATE(FILTER(UserSecurity,UserSecurity[User]="U4"),FILTER('DimEmployee',PATHCONTAINS(PATH([EmployeeKey],[ParentEmployeeKey]),UserSecurity[Highest Parent]))),DimEmployee[EmployeeKey]=EARLIER(DimEmployee[EmployeeKey])))

Because of the way the data works in this example, COUNTROWS will return either 1 if Authorised or 0 if unauthorised.
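
Pulled together as the [Authorised] calculated column on DimEmployee – and with the hardcoded “U4” swapped for USERNAME() as flagged earlier – my working assumption is a formula along these lines (indentation added purely for readability):

=COUNTROWS(
    FILTER(
        GENERATE(
            FILTER(UserSecurity, UserSecurity[User] = USERNAME()),
            FILTER('DimEmployee',
                PATHCONTAINS(
                    PATH([EmployeeKey], [ParentEmployeeKey]),
                    UserSecurity[Highest Parent]
                )
            )
        ),
        DimEmployee[EmployeeKey] = EARLIER(DimEmployee[EmployeeKey])
    )
)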

The final step is then to use this Authorised calculation in the Measures to ensure we can only see permitted values for that user, which is no more complex than:

AuthorisedSum:=CALCULATE(SUM(Data[Value]),'DimEmployee'[Authorised]=1)

… and it’s done. Totals created by this measure will be sensitive to what the user is entitled to see and be fully aware of the Parent / Child relationship even though we only explicitly secure a parent.

Seeing this in action, we see that user U4 is authorised to see the two nodes from keys 290 (Alberts) & 294 (Abbas) and downwards:

Security Demo Pivot

As I mentioned previously this doesn’t secure the Hierarchy members from being visible, so the path up the tree is still visible – but there will be no values coming through that may be tied to those members.

Quick wrap

So this post has been pretty dense. If you want to get your hands on the result I have provided a sample PowerPivot workbook which has the completed example.

As a reminder, we set out to:

  1. Secure a Parent Child dimension at a Parent level, allowing all children to be visible
  2. Allow multiple Parents to be securable to a given user
  3. Keep it as easy to administer as possible

We did this by using the PATH function to make each row aware of where it was in the Parent / Child Hierarchy. Then using the GENERATE function and some FILTERs to dynamically mimic a relationship we worked out what security could apply. Finally we used the PATHCONTAINS function to work out what security applied.

So we hit point 1 on functionality, point 2 just through the nature of the solution – and by using a simple table rather than Roles / AD type solutions hit point 3.

Finally, a big hat tip to a post from Alberto Ferrari which helped me make sense of Parent Child in Tabular in the first place.
