I’ve recently wrapped up writing the draft of a PowerPivot book (news on that once it’s published) and as part of having to make sure I “knew my onions” I spent a bit of time working my way around understanding the compression engine. I came across this post – Optimizing High Cardinality Columns in VertiPaq – by Marco Russo, and it sparked my interest in seeing how it could be applied to a couple of common data types – financial amounts and date / times. This first lead to me getting distracted building a tabular model to see how much memory columns (and other objects) used. Now i’m getting back to what took me down that path in the first place: seeing how different data type constructions affect memory usage.
How PowerPivot compresses Data
As an introduction, it really helps to understand how PowerPivot compresses data in the first place*. The key tool it uses is a Dictionary which assigns an integer key to a data value. Then when the data is stored it actually stores the key, rather than the data. When presenting the data, it retrieves the keys and shows the user the values in the dictionary.
To illustrate, in this list of Names and Values:
We have several repetitions of Name. These get stored in the dictionary as follows:
Then, internally PowerPivot stores the data of Names/Values like this:
This results in high compression because a text value takes up much more space than an integer value in the database. This effect multiples the more repetitive (i.e. lower cardinality) the data is. High cardinality data, typically numeric values and timestamps – do not compress as well as the number of dictionary entries is often not much less than the number of actual values.
* Quick caveat: this is the theory, not necessarily the practice. The actual compression algorithms used are proprietary to Microsoft so they may not always follow this pattern.
Splitting Data – the theory
The key to Marco’s approach is to split data down into forms with lower cardinality. So what does that mean?
For a financial amount, the data will be in the form nnnnnnn.dd – i.e. integer and fraction, dollars and cents, pounds and pence, etc. But the key thing is that the cents / pence / “dd’ portion is very low cardinality – there are only one hundred variations. Also, stripping out the “dd” potion will probably end up reducing the cardinality of the number overall. For example, consider these unique 4 numbers:
That is four distinct numbers… but two integer parts and two fraction parts. At this small scale it makes no difference, but for thousands of values it can make a big impact on cardinality.
For a DateTime the data will be in the form dd/mm/yy : hh:mm:ss.sss. You can separate out the time component or round it down to reduce cardinality. Your use case will determine what makes sense, and we will look at both below.
Splitting Data – the practice
Any good theory needs a test, so I created a one million row data set with the following fields:
- TranCode: A 3 character Alpha transaction code
- TranAmount: A random number roughly between 0.00 and 20,000.00
- TranAmountInteger: The Integer part of TranAmount
- TranAmountFraction: The Fraction part of TranAmount
- TranDateTime: A random date in 2014 down to the millisecond
- TranDate: The date part of TranDateTime
- TranTime_s: The time part of TranDateTime rounded to the second expressed as a time datatype
- TranTime_ms: The time part of TranDateTime rounded to the millisecond expressed as a time datatype
- TranTime_num_s: The time part of TranDateTime rounded to the second expressed as an integer datatype
- TranTime_num_ms: The time part of TranDateTime rounded to the millisecond expressed as an integer datatype
- TranTime_s_DateBaseLined: The time part of TranDateTime rounded to the second expressed as a datetime datatype, baselined to the date 01/10/1900
- TranTime_ms_DateBaseLined: The time part of TranDateTime rounded to the millisecond expressed as a datetime datatype, baselined to the date 01/10/1900
The generating code is available here. I’ve used some T-SQL Non Uniform Random Number functions to get more “realistic” data as early drafts of this test were delivering odd results because the data was too uniformly distributed so VertiPaq couldn’t compress it effectively.
You may be wondering why I’ve produced TranTime as time and datetime datatypes – the short answer is Tabular Models treat sql server time datatypes as text datatypes in the tabular model, so I wanted to check if that made a difference as I was getting some strange results for split time.
I then imported the table into a tabular model and processed it, then used the discover_memory_object_usage to work out space consumed by column. The results were this:
There was a clear saving for splitting the financial amounts into integer and fractions – the split column saved around 50% of the space.
DateTime behaved very oddly. Rounding down the precision from milliseconds to seconds brought big savings – which makes sense as the cardinality of the column went from 1,000,000 to 60,000. However splitting it out to just the time component actually increased space used.
I tried fixing this by baselining the time component to a specific date – so all millisecond/second components were added to the same date (01/01/1900) – this basically made no difference.
A more effective variation was to just capture the number of milliseconds / seconds since the start of the date as an integer, which saved about 89% and 92% of space respectively.
UPDATE: Read this post to get an explanation for the behaviour of DateTime compression
Splitting Data – the advice
Though there are certain costs associated with doing so, such as the loss of the ability to do DISTINCTCOUNT on values, but if your model is pushing memory limits then splitting decimal numbers into their integer and fraction (especially currency fields) can make a big difference – my experiments showed 50% and that was using fairly random data – real life tends to be a bit more ordered so you can hope for more savings.
Fundamentally it looks like DateTime values compress poorly, and Time values even more so. A better solution – at least from a compression standpoint – is to store the date value as a Date datatype in the model, and have any time component stored as integers. How this impacts performance when bringing these together at runtime using the DATEADD function is a matter for you to test!