I’ve recently wrapped up writing the draft of a PowerPivot book (news on that once it’s published) and, as part of making sure I “knew my onions”, I spent a bit of time working my way around the compression engine. I came across this post – Optimizing High Cardinality Columns in VertiPaq – by Marco Russo, and it sparked my interest in seeing how it could be applied to a couple of common data types – financial amounts and date / times. This first led to me getting distracted building a tabular model to see how much memory columns (and other objects) used. Now I’m getting back to what took me down that path in the first place: seeing how different data type constructions affect memory usage.
How PowerPivot compresses Data
As an introduction, it really helps to understand how PowerPivot compresses data in the first place*. The key tool it uses is a dictionary, which assigns an integer key to each data value. When the data is stored, PowerPivot actually stores the key rather than the value. When presenting the data, it retrieves the keys and shows the user the corresponding values from the dictionary.
To illustrate, in this list of Names and Values:
We have several repetitions of Name. These get stored in the dictionary as follows:
Then, internally PowerPivot stores the data of Names/Values like this:
This results in high compression because a text value takes up much more space than an integer key in the database. The effect multiplies the more repetitive (i.e. lower cardinality) the data is. High cardinality data – typically numeric values and timestamps – does not compress as well, because the number of dictionary entries is often not much less than the number of actual values.
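The scheme described above can be sketched in a few lines of Python. The names and keys here are purely illustrative – as the caveat below notes, VertiPaq’s actual algorithms are proprietary – but it shows the textbook idea of storing small integer keys in place of repeated values:

```python
def dictionary_encode(values):
    """Assign each distinct value an integer key and store keys, not values."""
    dictionary = {}  # value -> integer key
    encoded = []     # the column as stored: one small key per row
    for v in values:
        if v not in dictionary:
            dictionary[v] = len(dictionary)
        encoded.append(dictionary[v])
    return dictionary, encoded

names = ["Alice", "Bob", "Alice", "Alice", "Bob", "Carol"]
dictionary, encoded = dictionary_encode(names)

print(dictionary)  # {'Alice': 0, 'Bob': 1, 'Carol': 2}
print(encoded)     # [0, 1, 0, 0, 1, 2]
```

Six text values collapse to a three-entry dictionary plus six small integers – and the more repetitive the column, the bigger that win gets.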
* Quick caveat: this is the theory, not necessarily the practice. The actual compression algorithms used are proprietary to Microsoft so they may not always follow this pattern.
Splitting Data – the theory
The key to Marco’s approach is to split data down into forms with lower cardinality. So what does that mean?
For a financial amount, the data will be in the form nnnnnnn.dd – i.e. integer and fraction; dollars and cents, pounds and pence, etc. But the key thing is that the cents / pence / “dd” portion is very low cardinality – there are only one hundred variations. Also, stripping out the “dd” portion will probably end up reducing the cardinality of the number overall. For example, consider these four unique numbers:
That is four distinct numbers… but two integer parts and two fraction parts. At this small scale it makes no difference, but for thousands of values it can make a big impact on cardinality.
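To make that concrete with some hypothetical amounts (the original example values aren’t reproduced here, so these four are made up), splitting reduces the number of distinct values each column’s dictionary has to hold:

```python
# Four distinct amounts overall, but only two distinct integer parts
# and two distinct fraction parts once split.
amounts = ["1234.56", "1234.99", "5678.56", "5678.99"]

integer_parts = [a.split(".")[0] for a in amounts]   # ['1234', '1234', '5678', '5678']
fraction_parts = [a.split(".")[1] for a in amounts]  # ['56', '99', '56', '99']

print(len(set(amounts)))         # 4 distinct full values
print(len(set(integer_parts)))   # 2 distinct integer parts
print(len(set(fraction_parts)))  # 2 distinct fraction parts
```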
For a DateTime the data will be in the form dd/mm/yy : hh:mm:ss.sss. You can separate out the time component or round it down to reduce cardinality. Your use case will determine what makes sense, and we will look at both below.
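The cardinality arithmetic behind the split is worth spelling out. Using a hypothetical timestamp, a quick Python sketch shows why separating date from time shrinks the dictionaries so dramatically:

```python
from datetime import datetime

ts = datetime(2014, 3, 7, 14, 25, 36, 123000)  # hypothetical transaction timestamp

date_part = ts.date()                              # at most 365 distinct values per year
time_to_second = ts.time().replace(microsecond=0)  # at most 86,400 distinct values

print(date_part, time_to_second)

# Upper bounds on distinct values for a year of second-precision data:
print(365 * 86_400)  # combined datetimes: 31,536,000 possible values
print(365 + 86_400)  # split into two columns: 86,765 dictionary entries
```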
Splitting Data – the practice
Any good theory needs a test, so I created a one million row data set with the following fields:
TranCode: A 3 character Alpha transaction code
TranAmount: A random number roughly between 0.00 and 20,000.00
TranAmountInteger: The Integer part of TranAmount
TranAmountFraction: The Fraction part of TranAmount
TranDateTime: A random date in 2014 down to the millisecond
TranDate: The date part of TranDateTime
TranTime_s: The time part of TranDateTime rounded to the second expressed as a time datatype
TranTime_ms: The time part of TranDateTime rounded to the millisecond expressed as a time datatype
TranTime_num_s: The time part of TranDateTime rounded to the second expressed as an integer datatype
TranTime_num_ms: The time part of TranDateTime rounded to the millisecond expressed as an integer datatype
TranTime_s_DateBaseLined: The time part of TranDateTime rounded to the second expressed as a datetime datatype, baselined to the date 01/10/1900
TranTime_ms_DateBaseLined: The time part of TranDateTime rounded to the millisecond expressed as a datetime datatype, baselined to the date 01/10/1900
The generating code is available here. I’ve used some T-SQL Non Uniform Random Number functions to get more “realistic” data as early drafts of this test were delivering odd results because the data was too uniformly distributed so VertiPaq couldn’t compress it effectively.
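The linked T-SQL does the real generation; as a rough Python sketch of the same idea (values drawn non-uniformly so they cluster rather than spreading evenly – the function and bounds here are my own, not from the original script):

```python
import random

random.seed(42)  # reproducible sample

def non_uniform_amount(low=0.0, high=20_000.0):
    """Skewed random amount: averaging several uniform draws clusters
    values around the middle, giving the compressor repetition to exploit."""
    u = sum(random.random() for _ in range(3)) / 3
    return round(low + u * (high - low), 2)

sample = [non_uniform_amount() for _ in range(5)]
print(sample)
```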
You may be wondering why I’ve produced TranTime as both time and datetime datatypes – the short answer is that Tabular Models treat SQL Server time datatypes as text datatypes, so I wanted to check whether that made a difference, as I was getting some strange results for split time.
I then imported the table into a tabular model, processed it, and used discover_object_memory_usage to work out the space consumed by each column. The results were this:
There was a clear saving for splitting the financial amounts into integer and fractions – the split column saved around 50% of the space.
DateTime behaved very oddly. Rounding the precision down from milliseconds to seconds brought big savings – which makes sense, as the cardinality of the column went from 1,000,000 to 60,000. However, splitting out just the time component actually increased the space used.
I tried fixing this by baselining the time component to a specific date – so all millisecond/second components were added to the same date (01/01/1900) – this basically made no difference.
A more effective variation was to just capture the number of milliseconds / seconds since the start of the date as an integer, which saved about 89% and 92% of space respectively.
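In Python terms, those integer time columns would be derived along these lines (the timestamp is hypothetical; the original derivation is in the linked T-SQL):

```python
from datetime import datetime

ts = datetime(2014, 3, 7, 14, 25, 36, 123000)  # hypothetical transaction timestamp

# Integer count of seconds since the start of the day (TranTime_num_s)
seconds_since_midnight = ts.hour * 3600 + ts.minute * 60 + ts.second

# Integer count of milliseconds since the start of the day (TranTime_num_ms)
ms_since_midnight = seconds_since_midnight * 1000 + ts.microsecond // 1000

print(seconds_since_midnight)  # 51936
print(ms_since_midnight)       # 51936123
```

Plain integers like these dictionary-encode far better than a time or datetime value, which matches the 89% / 92% savings above.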
UPDATE: Read this post to get an explanation for the behaviour of DateTime compression
Splitting Data – the advice
There are certain costs associated with doing so, such as losing the ability to do a DISTINCTCOUNT on the original values. But if your model is pushing memory limits, then splitting decimal numbers into their integer and fraction parts (especially currency fields) can make a big difference – my experiments showed 50%, and that was using fairly random data; real life tends to be a bit more ordered, so you can hope for greater savings.
Fundamentally it looks like DateTime values compress poorly, and Time values even more so. A better solution – at least from a compression standpoint – is to store the date value as a Date datatype in the model, and have any time component stored as integers. How this impacts performance when bringing these together at runtime using the DATEADD function is a matter for you to test!
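As a sketch of the recombination the model would have to do at query time (shown in Python rather than DAX, with hypothetical values), the stored date and the integer seconds column stitch back together like this:

```python
from datetime import date, datetime, timedelta

tran_date = date(2014, 3, 7)  # stored as a Date datatype in the model
tran_time_num_s = 51936       # stored as integer seconds since midnight

# Rebuild the full timestamp from the two compressed columns
recombined = datetime.combine(tran_date, datetime.min.time()) + timedelta(
    seconds=tran_time_num_s
)
print(recombined)  # 2014-03-07 14:25:36
```

Whatever DAX construct you use for this, it runs per query – hence the advice to test the performance cost against the memory saving.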
A quick and dirty one – in attempting some clever dynamic security modelling in DAX I was warned about a gotcha in my design – that calculated columns were only evaluated when the model processed, so any approach based on calculated columns was doomed to failure. I didn’t quite believe it so I decided to do a road test in PowerPivot. Using a simple data set of one Dimension and one Fact, like below:
I hooked them up in PowerPivot with a relationship between “Column” on both tables. Then, using the ISFILTERED() function I created a couple of calculations. One, at Row Level, that would return a value of 1 if I filtered on the Attribute column:
Which I then put a SUM on top of. I also added one at measure level, performing a similar function:
Well, nothing terribly clear, but on this page there is this small paragraph:
When a column contains a formula, the value is computed for each row. The results are calculated for the column as soon as you create the formula. Column values are then recalculated as necessary, such as when the underlying data is refreshed.
It doesn’t say exactly when “recalculated as necessary” is necessary, but the implication is that it’s a model level change, rather than the switching of context, or in my case the changing of the user viewing the model.
So in summary, we have to assume that our calculated column values are fixed upon initial calculation, formula changes, or data load (or processing in a Tabular model), and there isn’t a way to make the value in a given cell change.
I’ve recently been involved in a couple of cases for a client that revolved heavily around PowerPivot, and it’s prompted me to highlight some of the things which make me simultaneously love and hate it (though on balance it’s clearly a thing to love).
I hate PowerPivot because…
So let’s start with the negative. It’s a v1 product, which means of course it will get better. But because it’s v1, and Microsoft clearly felt a compelling need to give QlikView (and to some extent TM1) something to worry about, it’s been shoved into the limelight in a starring role it’s not entirely ready for.
So, my gripes are:
No drillthrough. You can slice and dice superfast, sure… but unlike SSAS (which has flawed drillthrough), you can’t drillthrough at all in a Pivot Table pointing at a PowerPivot workbook
No API. So there’s no way to programmatically build a PowerPivot workbook, or interact with the PowerPivot data model from Excel (even for a simple data refresh)
No Data Mining Support. There are some half-hearted workarounds documented here, but you simply cannot point the Excel Data Mining Add-In at a PowerPivot data model. If you want to do that, you need to lean on Predixion Software (fortunately Jamie MacLennan, part of the original MS Data Mining team, is heavily involved, so it’s a good bet this is a good option, but the cloud aspect may deter some). To add insult to injury, there is no Data Mining Add-In for 64-bit Excel, and the Data Mining and PowerPivot Add-Ins often conflict.
Another common gripe is the lack of Hierarchy support (this doesn’t really bother me but drives SSAS guys up the wall). This, I think, is due to be resolved by BISM in Denali anyway. So, at the end of round 1, it’s taken a few punches, but it’s the new kid and it’s far from out for the count…
I love PowerPivot because…
Microsoft have been pushing “BI For the Masses” as their vision since SQL2005. With SharePoint 2010 and Office 2010 I think they are finally getting that vision translated into reality. To start with, I wasn’t entirely sure what the benefit of loading PowerPivot workbooks as data sources into SharePoint actually was, but I had my lightbulb moment last week. If you have a Data Warehouse, or a set of data sources that, as a business user, you don’t really have any control over, you can be a bit stuck. Building cubes can be a bit of a challenge, and you may not be able to get the access / skills / project approval needed to build such a data model. But if you can do it in PowerPivot… then you can model your data to your need, upload to SharePoint – and suddenly you have your data mart for your department. The amount of power this gives to the users who are stuck in the “Import to Access, do some dodgy manipulation, then surface in Excel” rut is phenomenal. I’m not going to get started on what can be done with the REST API with Excel and SharePoint.
Of course another easy thing to love about it is that it all lives in the Excel environment that so many BI users know and love, so user adoption is rarely a challenge.
I think however, there’s one thing that simply cannot be ignored – it’s mindbogglingly fast. In a POC I did recently with my colleague Boyan Penev (to be fair, he did all the work) we were tossing around an 80 million row fact table with some hairy relationships in it and it still responded in subsecond times. Because of this, it’s going to be very, very hard to ignore as the underlying technologies (SharePoint and Office 2010) gain traction.