Dynamic Time Variant calculations in DAX

A common requirement in any set of calculations is to create a range of time variants on any measure – Prior Period, Year to Date, Prior Year to Date, Prior Quarter… you think of a time slice and someone will find it useful.

However, the downside is that you end up maintaining lots of calculations in the model that all do largely the same thing. Any good coder likes to parameterise and make code reusable. So how could we do this in Tabular? There is a way, and it is a very specific variant of the idea of Parameter Tables.

Disconnect your Dimensions!

Step one is to unhook your Date Dimension from your fact table. This may seem counter-intuitive, but what it frees you to do is to use the Date dimension as a source of reference data that doesn’t filter your data when you select a date – this simplifies the subsequent calculations significantly. You also need to add to the date dimension all the dates you will need to perform your calculations – Year starts, Prior Year starts, Period starts etc. – this isn’t compulsory but you’ll be grateful later on when you need these dates and don’t have to calculate them on the fly, trust me. Your Date table (I’m going to cease calling it a Dimension, it isn’t any more) will end up looking something like this:

Date Table

In practice you would hide all the columns apart from the Date as this is the only one that actually gets used by users.

Time for the Variants

Next, we need to create a simple filter table to apply the Time Variant calculations. All it needs is a numeric identifier per variant and a variant name, like so:

Variants Table

This – quite clearly – isn’t the clever bit. The thing to observe with all of these variants is that they create a date range. So what we need to do is calculate the applicable Start and End dates of that range. This is the bit where we are grateful we pre-calculated all those in our Date table. We add two Measures to the table, StartDate and EndDate, which detect which Time Variant is being calculated and then work out the appropriate date, based on the currently selected date. The DAX for StartDate looks like this:

StartDate:=
SWITCH(MIN([VariantID]),
1,MIN(Dates[PeriodStart]),
2,MIN(Dates[PriorPeriodStart]),
3,MIN(Dates[YearStart]),
4,MIN(Dates[SamePeriodPriorYearStart]),
5,MIN(Dates[PriorYearStart])
)

We use a SWITCH statement against the VariantID to detect which Variant we are trying to get the date range start for, then pick the right date from the Date Table. Pre-calculating these in the Date table keeps this part simple.

Add it all up

The final part is to pull these dates into the measure:

TotalTransactionAmount:=SUMX(CALCULATETABLE(Transactions,DATESBETWEEN(Transactions[TransactionDate],[StartDate],[EndDate])),Transactions[TransactionAmount])

This works by using the DATESBETWEEN function to apply a custom date range filter to the Transactions table – a range we create dynamically through our StartDate and EndDate calculations.
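To make the mechanics concrete outside of DAX, here is a minimal Python sketch of the same idea: look up the range start by VariantID from the pre-calculated Date table columns, then sum only the transactions inside the range. The sample dates and amounts are made up, and the sketch assumes that variant 3 is a year-to-date range ending on the selected date (the EndDate measure is not shown above, so that part is a guess for illustration only).

```python
from datetime import date

# Hypothetical row of the pre-calculated Date table for one selected date.
date_row = {
    "Date": date(2014, 3, 15),
    "PeriodStart": date(2014, 3, 1),
    "PriorPeriodStart": date(2014, 2, 1),
    "YearStart": date(2014, 1, 1),
    "SamePeriodPriorYearStart": date(2013, 3, 1),
    "PriorYearStart": date(2013, 1, 1),
}

# VariantID -> Date table column, mirroring the SWITCH in StartDate.
START_COLUMN = {
    1: "PeriodStart",
    2: "PriorPeriodStart",
    3: "YearStart",
    4: "SamePeriodPriorYearStart",
    5: "PriorYearStart",
}


def start_date(variant_id, row):
    """Pick the range start for a variant from the pre-calculated columns."""
    return row[START_COLUMN[variant_id]]


def total_amount(transactions, start, end):
    """Sum amounts whose date falls in [start, end], both ends inclusive."""
    return sum(amount for tran_date, amount in transactions
               if start <= tran_date <= end)


# Made-up (TransactionDate, TransactionAmount) pairs.
transactions = [
    (date(2013, 12, 31), 10.0),
    (date(2014, 1, 10), 20.0),
    (date(2014, 3, 5), 5.0),
    (date(2014, 3, 20), 7.0),
]

# Assumption: variant 3 is year-to-date, ending on the selected date.
ytd_total = total_amount(transactions, start_date(3, date_row), date_row["Date"])
print(ytd_total)
```

Only the two transactions between 1 January 2014 and 15 March 2014 are counted; changing the variant ID changes the range without touching the summing logic, which is the whole point of the pattern.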

Our end result:

Time Variant Results

We can see above that, for a single selected date, we can generate a range of Start and End dates and apply those to our single summarising function to create multiple Time Variants.

The sample workbook is here: DAX Time Variants


Great PASS BIDW VC Video on how Vertipaq Compression works in SSAS Tabular / PowerPivot

Embedded below is a great video from Marco Russo on how the compression engine works in SSAS Tabular / PowerPivot:

This is from the SQL PASS BI Virtual Chapter YouTube channel – well worth nosing around now they post all their sessions on there (handy for us in Oz where the timings are usually not great).


Shrink Tabular column space used by over 50% using a simple trick

I’ve recently wrapped up writing the draft of a PowerPivot book (news on that once it’s published) and as part of having to make sure I “knew my onions” I spent a bit of time working my way around understanding the compression engine. I came across this post – Optimizing High Cardinality Columns in VertiPaq – by Marco Russo, and it sparked my interest in seeing how it could be applied to a couple of common data types – financial amounts and date / times. This first led to me getting distracted building a tabular model to see how much memory columns (and other objects) used. Now I’m getting back to what took me down that path in the first place: seeing how different data type constructions affect memory usage.

How PowerPivot compresses Data

As an introduction, it really helps to understand how PowerPivot compresses data in the first place*. The key tool it uses is a Dictionary, which assigns an integer key to each distinct data value. When the data is stored, it actually stores the key rather than the value. When presenting the data, it retrieves the keys and shows the user the matching values from the dictionary.

To illustrate, in this list of Names and Values:

Names and Values

We have several repetitions of Name. These get stored in the dictionary as follows:

Names Dictionary

Then, internally PowerPivot stores the data of Names/Values like this:

PowerPivot Stored Data

This results in high compression because a text value takes up much more space than an integer value in the database. This effect multiplies the more repetitive (i.e. lower cardinality) the data is. High cardinality data – typically numeric values and timestamps – does not compress as well, as the number of dictionary entries is often not much less than the number of actual values.
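As a rough illustration of the dictionary idea (and only the idea, since the real engine is proprietary), here is a small Python sketch. The names are made-up sample data and the function is hypothetical, not part of any PowerPivot API.

```python
def dictionary_encode(values):
    """Replace each value with an integer key; return (dictionary, keys)."""
    dictionary = {}
    keys = []
    for value in values:
        if value not in dictionary:
            # First time we see a value it gets the next free key.
            dictionary[value] = len(dictionary)
        keys.append(dictionary[value])
    return dictionary, keys


def dictionary_decode(dictionary, keys):
    """Turn stored keys back into the original values."""
    reverse = {key: value for value, key in dictionary.items()}
    return [reverse[key] for key in keys]


# Made-up sample column with repeated names (low cardinality).
names = ["Alice", "Bob", "Alice", "Carol", "Bob", "Alice"]
dictionary, keys = dictionary_encode(names)

print(dictionary)  # three dictionary entries for six rows
print(keys)        # the column is stored as small integers
```

Six text values collapse to three dictionary entries plus six small integers. If every value were unique, the dictionary would be as long as the column itself, which is why high cardinality columns compress poorly.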

* Quick caveat: this is the theory, not necessarily the practice. The actual compression algorithms used are proprietary to Microsoft so they may not always follow this pattern.

Splitting Data – the theory

The key to Marco’s approach is to split data down into forms with lower cardinality. So what does that mean?

For a financial amount, the data will be in the form nnnnnnn.dd – i.e. integer and fraction, dollars and cents, pounds and pence, etc. But the key thing is that the cents / pence / “dd” portion is very low cardinality – there are only one hundred variations. Also, stripping out the “dd” portion will probably end up reducing the cardinality of the number overall. For example, consider these four unique numbers:

  • 4.95
  • 4.50
  • 7.95
  • 7.50

That is four distinct numbers… but two integer parts and two fraction parts. At this small scale it makes no difference, but for thousands of values it can make a big impact on cardinality.
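The same four numbers can be checked with a few lines of Python. This is a minimal sketch that assumes non-negative amounts with exactly two decimal places; the helper name is hypothetical.

```python
from decimal import Decimal


def split_amount(amount):
    """Split a non-negative two-decimal amount into (integer part, cents)."""
    integer_part = int(amount)                          # truncates the fraction
    fraction_part = int((amount - integer_part) * 100)  # cents as 0..99
    return integer_part, fraction_part


amounts = [Decimal("4.95"), Decimal("4.50"), Decimal("7.95"), Decimal("7.50")]
parts = [split_amount(a) for a in amounts]

distinct_amounts = len(set(amounts))
distinct_integers = len({integer for integer, _ in parts})
distinct_fractions = len({fraction for _, fraction in parts})

print(distinct_amounts, distinct_integers, distinct_fractions)
```

Decimal is used rather than float so that 4.95 splits cleanly into 4 and 95 without binary rounding surprises.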

For a DateTime the data will be in the form dd/mm/yy : hh:mm:ss.sss. You can separate out the time component or round it down to reduce cardinality. Your use case will determine what makes sense, and we will look at both below.

Splitting Data – the practice

Any good theory needs a test, so I created a one million row data set with the following fields:

  • TranCode: A 3 character Alpha transaction code
  • TranAmount: A random number roughly between 0.00 and 20,000.00
  • TranAmountInteger: The Integer part of TranAmount
  • TranAmountFraction: The Fraction part of TranAmount
  • TranDateTime: A random date in 2014 down to the millisecond
  • TranDate: The date part of TranDateTime
  • TranTime_s: The time part of TranDateTime rounded to the second expressed as a time datatype
  • TranTime_ms: The time part of TranDateTime rounded to the millisecond expressed as a time datatype
  • TranTime_num_s: The time part of TranDateTime rounded to the second expressed as an integer datatype
  • TranTime_num_ms: The time part of TranDateTime rounded to the millisecond expressed as an integer datatype
  • TranTime_s_DateBaseLined: The time part of TranDateTime rounded to the second expressed as a datetime datatype, baselined to the date 01/10/1900
  • TranTime_ms_DateBaseLined: The time part of TranDateTime rounded to the millisecond expressed as a datetime datatype, baselined to the date 01/10/1900

The generating code is available here. I’ve used some T-SQL Non Uniform Random Number functions to get more “realistic” data as early drafts of this test were delivering odd results because the data was too uniformly distributed so VertiPaq couldn’t compress it effectively.

You may be wondering why I’ve produced TranTime as both time and datetime datatypes – the short answer is that Tabular models treat SQL Server time datatypes as text, so I wanted to check if that made a difference as I was getting some strange results for split time.

I then imported the table into a tabular model and processed it, then used the DISCOVER_OBJECT_MEMORY_USAGE rowset to work out the space consumed by each column. The results were this:

Split Column Memory Usage

There was a clear saving for splitting the financial amounts into integer and fractions – the split column saved around 50% of the space.

DateTime behaved very oddly. Rounding down the precision from milliseconds to seconds brought big savings – which makes sense as the cardinality of the column went from 1,000,000 to 60,000. However splitting it out to just the time component actually increased space used.

I tried fixing this by baselining the time component to a specific date – so all millisecond/second components were added to the same date (01/01/1900) – this basically made no difference.

A more effective variation was to just capture the number of milliseconds / seconds since the start of the date as an integer, which saved about 89% and 92% of space respectively.
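To show what "seconds since the start of the date as an integer" means in practice, here is a minimal Python sketch of the split and the recombination. The function names are hypothetical, and note that this sketch truncates sub-second precision rather than rounding it.

```python
from datetime import datetime, timedelta


def split_datetime(ts):
    """Split a datetime into (date, whole seconds since midnight)."""
    midnight = datetime.combine(ts.date(), datetime.min.time())
    # int() drops the sub-second part (truncation, not rounding).
    seconds = int((ts - midnight).total_seconds())
    return ts.date(), seconds


def recombine(day, seconds):
    """Rebuild a datetime (to the second) from its two stored parts."""
    return datetime.combine(day, datetime.min.time()) + timedelta(seconds=seconds)


# Made-up sample timestamp with millisecond precision.
tran_datetime = datetime(2014, 6, 1, 13, 45, 30, 250000)
tran_date, tran_seconds = split_datetime(tran_datetime)

print(tran_date, tran_seconds)
print(recombine(tran_date, tran_seconds))
```

The integer column can never have more than 86,400 distinct values, however many rows there are, and the date column has at most one value per day.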

Splitting Data – the advice

There are certain costs associated with doing so, such as the loss of the ability to do DISTINCTCOUNT on values. But if your model is pushing memory limits then splitting decimal numbers (especially currency fields) into their integer and fraction parts can make a big difference – my experiments showed 50% and that was using fairly random data – real life tends to be a bit more ordered so you can hope for more savings.

Fundamentally it looks like DateTime values compress poorly, and Time values even more so. A better solution – at least from a compression standpoint – is to store the date value as a Date datatype in the model, and have any time component stored as integers. How this impacts performance when bringing these together at runtime using the DATEADD function is a matter for you to test!


Exploring Memory Usage in Tabular Models

Trying to understand what is going on under the hood with a Tabular model is possible using the SSAS DISCOVER_OBJECT_MEMORY_USAGE Rowset – but the results aren’t going to win any prizes for accessibility. The mighty Kasper De Jonge had a crack at making it accessible back in 2012 using a PowerPivot model. However it didn’t help me filter down the way I wanted so I decided to up it a notch with a Tabular model on my Tabular model’s memory usage.

A Tabular Model on DISCOVER_OBJECT_MEMORY_USAGE

The main features of the Tabular model are:

  • A Measure for Memory usage (kind of important)
  • A Hierarchy for exploring the structure of the memory use
  • An Attribute for the Model (so you can filter on just the model you want)
  • An Attribute for the Model Object (e.g. Hierarchy, Column Storage, Data Sources, etc.)
  • An Attribute to identify Server objects (such as Server Assemblies) vs Model objects

Before we get into the gnarly details, here’s a look at what comes out the other side:

DISCOVER_OBJECT_MEMORY_USAGE model output

What you get is the capacity to browse down the hierarchy and apply a few useful filters:

  • Filter to the Model(s) you are interested in
  • Filter for the type of Model Object (e.g. Column, Hierarchy) you want to focus on
  • Filter for Server / Model level objects (largely useful for just getting rid of server level noise)

Things that work well, and not so well.

Actually, it mostly works pretty well. It cleans up most of the GUIDs that make navigation tricky, categorises objects usefully (for me, anyway) and the logic baked into the view that does most of the work is not too hard to follow.

The biggest problem is that the hierarchy of objects doesn’t always make sense – there seem to be Model level objects at the Server level with no attached model. This is probably more to do with my understanding of how the server handles certain objects.

However, I’m always happy to get some feedback on this and any suggestions – especially on how to categorise things properly – will be greatly appreciated.

How to get this in your environment

The solution comes in a few parts:

  • SQL Table to hold the contents of DISCOVER_OBJECT_MEMORY_USAGE
  • SSIS Package to extract the results from DISCOVER_OBJECT_MEMORY_USAGE into the table
  • SQL View to translate, clean and categorise the output from DISCOVER_OBJECT_MEMORY_USAGE
  • A Tabular model to help structure exploring the output
  • An Excel spreadsheet to show the results

If you want to get this up and running, the pack here has everything you need. In order to install it do the following:

  1. Run the SQL script dmv_SSAS_Tabular_DISCOVER_OBJECT_MEMORY_USAGE.sql to create the destination table
  2. Run the SQL script vw_dmv_SSAS_Tabular_DISCOVER_OBJECT_MEMORY_USAGE.sql to create the translating view
  3. Run the SSIS package in the SSAS_DMV project to load the table
  4. Deploy the SSAS project TabularObjectMemoryUsage to create the tabular model
  5. Open the spreadsheet ObjectMemoryUsage.xlsx to explore your results

Along the way in steps 1-5 you’ll have to set connections/configurations to ones that work for your environment.

Have fun playing!


SSAS Tabular at Scale

The cube on my project has been hitting some apparent concurrency issues, so I’ve been hunting for advice on how to tune the hardware (model tuning has already gone a long way). Unfortunately Microsoft don’t have any reference architectures – and their only other advice was to try an appliance in DirectQuery mode – which was not practical in our circumstances anyway.

As usual, the gents at SQLBI had something useful to say on the subject based on a customer case study, which is detailed in this white paper. While well worth a read, I’ll summarise the key findings:

  • Standard Server CPUs don’t perform well enough, and you will need to look at faster CPUs with a large cache
  • Faster CPUs are better than more CPUs in terms of return on investment for performance
  • Fast RAM is a must
  • For NUMA aware servers you need to set the Node Affinity to a single node, preferably using a Hyper-V host for your tabular server

Setting aside the last point, which is a bit deep in server config and requires more explanation, the key thing is to look for fast CPU. They found that Workstation Blades were generally better than Server Blades, and some of the best performance they got was out of one of their Dev’s gaming rigs!

We’ll be trying some of this out and hopefully I can keep you posted with results. I have more stuff on monitoring tabular in the pipeline now I’ve finished my PowerPivot book (to be published soon).

Also, don’t forget my upcoming DW Starter Training on Nov 24/25 in Sydney


October Sydney training roundup – MS BI, Cloud, Analytics

The end of the year is closing in fast but there’s still plenty of chances to learn from specialist providers Agile BI, Presciient and of course, me!

Topics cover the full spread of DW, BI and Analytics so there’s something for every role in the data focused organisation.

Build your Data Warehouse in SQL Server & SSIS with the BI Monkey

Nov 24/25 – Are you about to build your Data Warehouse with Microsoft tools and want to do it right first time?

This course is designed to help a novice understand what is involved in building a Data Warehouse both from a technical architecture and project delivery perspective. It also delivers you basic skills in the tools the Microsoft Business Intelligence suite offers you to do that with.

Get more detail here

Agile BI workshops

Power BI specialist Agile BI brings you product updates on this key new self service BI technology:

Oct 15 – Power BI workshop – Excel new features for reporting and data analysis – more detail here

Oct 30 – What Every Manager Should Know About Microsoft Cloud, Power BI for Office 365 and SQL Server 2014 – more detail here

Presciient Training

Dr Eugene Dubossarsky shares his deep business and technical expertise across a range of advanced and basic analytics. Full details here but the key list is:

Dec 9/10 – Predictive analytics and data science for big data

Dec 11/12 – Introduction to R and data visualisation

Dec 16/17 – Data analytics for fraud and anomaly detection, security and forensics

Dec 18/19 – Business analytics and data for beginners


PowerPivot calculated columns are not dynamic

A quick and dirty one – in attempting some clever dynamic security modelling in DAX I was warned about a gotcha in my design – that calculated columns were only evaluated when the model processed, so any approach based on calculated columns was doomed to failure. I didn’t quite believe it so I decided to do a road test in PowerPivot. Using a simple data set of one Dimension and one Fact, like below:

Simple Data Model

I hooked them up in PowerPivot with a relationship between “Column” on both tables. Then, using the ISFILTERED() function I created a couple of calculations. One, at Row Level, that would return a value of 1 if I filtered on the Attribute column:

=IF(ISFILTERED(Dim[Attribute]),1,0)

I then put a SUM on top of that column. I also added one at measure level, performing a similar function:

FilterCheckAsMeasure:=IF(ISFILTERED(Dim[Attribute]),1,0)

Then I created a Pivot Table checking the results, and got this:

Results

The takeaway being that filtering on the Attribute was picked up by the table level measure, but the calculated column did not change value.

You can have a check of this yourself in my sample workbook: DAX Calculated Columns Evaluation Order.

What does the documentation say?

Well, nothing terribly clear, but in this page there is this small paragraph:

When a column contains a formula, the value is computed for each row. The results are calculated for the column as soon as you create the formula. Column values are then recalculated as necessary, such as when the underlying data is refreshed.

It doesn’t say exactly when “recalculated as necessary” is necessary, but the implication is that it’s a model level change, rather than the switching of context, or in my case the changing of the user viewing the model.

So in summary, we have to assume that our calculated column values are fixed upon initial calculation, formula changes or data load (or processing in a Tabular model) and there isn’t a way to make the value in a given cell change.
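A toy Python model of the difference may help fix the idea. This is purely a conceptual sketch with hypothetical names, not how the engine is implemented: the "calculated column" is evaluated once when the model is processed (with no user filter in scope), while the "measure" is evaluated at query time against whatever filter is active.

```python
def is_filtered(active_filter):
    """Stand-in for ISFILTERED(): is a filter on Attribute currently active?"""
    return active_filter is not None


class TinyModel:
    def __init__(self, attribute_rows):
        self.attribute_rows = attribute_rows
        self.filter_check_column = []
        self.process()

    def process(self):
        # Calculated column: evaluated per row at processing time,
        # when no user filter exists, so every row stores 0.
        self.filter_check_column = [
            1 if is_filtered(None) else 0 for _ in self.attribute_rows
        ]

    def sum_of_column(self, active_filter=None):
        # Query time: the filter chooses rows, but the stored values are fixed.
        return sum(
            stored
            for attribute, stored in zip(self.attribute_rows, self.filter_check_column)
            if active_filter is None or attribute == active_filter
        )

    def filter_check_measure(self, active_filter=None):
        # Measure: evaluated at query time, so it sees the active filter.
        return 1 if is_filtered(active_filter) else 0


model = TinyModel(["A", "B", "A"])
print(model.sum_of_column("A"), model.filter_check_measure("A"))
```

Filtering on "A" changes the measure but not the sum of the stored column, which matches what the pivot table showed.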


BISM Normaliser is a cocktail for the Tabular Model

Well, that title makes much more sense in the context of this post in which I mused about the difficulty of developing against tabular models in a multi developer environment, given there is only one .bim file to work against. I even raised a Connect item to give the SSAS team something else to mark as “Won’t Fix” for the next release (cynical, me?).

Now, to stretch an analogy: if the problem is two drinks and only one mouth, then the solution clearly is cocktails*!

Mix me up a BISM Normaliser, barman!

A chap called Christian Wade has kindly built up a nifty Visual Studio plug in called BISM Normaliser which handily merges two tabular models together giving you an option to handle development in a multi user environment. You put 2 models side by side and get a comparison screen like this:

bism normaliser

You can then merge in tables, columns, relationships, measures – all the good stuff. It’s like a diff but considerably more usable than doing a raw XML comparison. This means if you start from the same base model – advisable as tables are grouped by connections so if your connections don’t match you can’t merge – the dev team can work on separate areas and then merge it back together at some point.

It’s not a substitute for a proper multi-author environment, but at least it makes it possible. There are risks of course – it’s a no warranty codeplex plug in – and you won’t get the benefits of TFS managed components (source control, changes, etc) – and the code currently is set to expire in Dec 2014 so if Christian sells the code you’ll need to buy it off someone.

Anyway, it is a partial solution. On our project we’ve given it a first pass and it seems to do what it claims, and since we have no alternative it’s going to get used. So, big thanks to Christian!

*Or Jagerbombs, but let’s not go there:

Jagerbombs are not the answer (in this case)


Multiple developers against a single Tabular model is a drinking problem

The project I’m currently working on has at its heart a whopping great tabular model with dreams of eating more tables and being the biggest, fattest tabular model there ever was. To achieve its ambition of eating all this lovely data and becoming so vast, it needs an army of chefs … er, developers… to feed it.

Except it has one big problem:

Two hands one mouth drinking problem

It only has one “mouth” – the .bim file. Only one person at a time can work on it. If you want to do simultaneous development, realistically all you can do is maintain separate versions and merge your changes together, either with a diff tool or manually (or potentially via BIML script if that eventuates).

So I’ve raised a connect: Componentise Tabular models to allow for multi user development to request that the .bim file can get broken up into chunks. The text of the connect is below:

At present the Tabular model is a single code object (the .bim file). This means it is not possible to break up the development of a model across multiple developers – a problem in large projects. It also makes source control less effective as you cannot track which specific objects within the model have been changed.

The model needs to be componentised so that individual elements can be worked on separately and then merged back into a main model, in a manner more similar to OLAP cubes.

Elements that could be broken out are: Perspectives, Connection Managers, Individual tables, Calculated Columns, Measures and relationships

Anyway… if you think this is a problem you’d like to see solved – vote it up!


BIML and MIST – a first encounter

The MIST developers – Varigence – have been waving their product at me for a wee while now and I’ve finally had a chance to get into the IDE and get a better feel for it.

Before I get too carried away, here’s a quick 101. There’s this thing called BIML – an XML dialect for describing BI Assets (for now, only in the Microsoft world). This opens the door to scripting and therefore simpler dynamic generation of BI objects. BIML can be used by BIDS Helper (a thing you should have if you are an active BI developer) or the more focused BIML IDE MIST.

Now, I’ve seen the shiny video that promised the world, but nothing quite beats hands on experience. So I’ve started following the online user guide and got as far as building a Dimension table.

My feelings so far? I’m a bit “meh”. Now I know there’s a lot more capability to the product which I haven’t got to yet – so this is far from final commentary – but there are a few clear things that I think need to be looked at in the product to give it the sense of really being a development accelerator.

First up, it’s pretty clunky. It suffers heavily from “kitchen sinkism” – i.e. because it can do something, there’s a dialog box / screen / tab for it displayed all at once. Take for example this table development screen:

MIST Table Development

There’s a lot going on and that’s on a 1920×1080 screen… some better screen space organisation is needed.

Next up is the fact that the designers don’t add a lot over the basic SSMS capability. The table designer there is effectively the same blank grid that you get in the SSMS table designer, but without even the courtesy of being able to add another row without going back to the ribbon. At this point I’d be more inclined to develop in SSMS and import to MIST.

Then my next concern is over value add / accessibility. For example, when setting up a project there’s some basic stuff that needs to be done – setting up connections, importing from databases – that should just be a wizard when starting up a fresh project. When creating a dimension, a bundle of default attributes should be added (preferably from a checklist or custom template).

So my first impression is that it needs a user experience workover. However, this is far from a unique criticism of developer tools so I won’t go too hard on them. I’ll press on with it and see how the functionality unveils itself.
