I’ve recently wrapped up writing the draft of a PowerPivot book (news on that once it’s published) and, as part of making sure I “knew my onions”, I spent a bit of time working my way around understanding the compression engine. I came across this post – Optimizing High Cardinality Columns in VertiPaq – by Marco Russo, and it sparked my interest in seeing how it could be applied to a couple of common data types – financial amounts and date/times. This first led to me getting distracted building a tabular model to see how much memory columns (and other objects) used. Now I’m getting back to what took me down that path in the first place: seeing how different data type constructions affect memory usage.
How PowerPivot compresses Data
As an introduction, it really helps to understand how PowerPivot compresses data in the first place*. The key tool it uses is a dictionary, which assigns an integer key to each data value. When the data is stored, it actually stores the key rather than the value. When presenting the data, it retrieves the keys and shows the user the corresponding values from the dictionary.
To illustrate, in this list of Names and Values:
We have several repetitions of Name. These get stored in the dictionary as follows:
Then, internally PowerPivot stores the data of Names/Values like this:
This results in high compression because a text value takes up much more space than an integer key in the database. The effect multiplies the more repetitive (i.e. lower cardinality) the data is. High cardinality data – typically numeric values and timestamps – does not compress as well, since the number of dictionary entries is often not much less than the number of actual values.
* Quick caveat: this is the theory, not necessarily the practice. The actual compression algorithms used are proprietary to Microsoft so they may not always follow this pattern.
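To make the idea concrete, here is a minimal sketch of dictionary encoding in Python. This is purely illustrative – the function name and structure are mine, and (per the caveat above) VertiPaq’s real implementation is proprietary:

```python
# Illustrative sketch of dictionary encoding, NOT VertiPaq's actual
# (proprietary) implementation.
def dictionary_encode(values):
    dictionary = {}   # data value -> integer key
    encoded = []      # the stored column: keys, not values
    for v in values:
        if v not in dictionary:
            dictionary[v] = len(dictionary)  # next free integer key
        encoded.append(dictionary[v])
    return dictionary, encoded

names = ["Alice", "Bob", "Alice", "Alice", "Bob"]
dictionary, encoded = dictionary_encode(names)
print(dictionary)  # {'Alice': 0, 'Bob': 1}
print(encoded)     # [0, 1, 0, 0, 1]
```

The repeated strings are stored once in the dictionary, and the column itself becomes a list of small integers – which is why low cardinality data compresses so well.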
Splitting Data – the theory
The key to Marco’s approach is to split data down into forms with lower cardinality. So what does that mean?
For a financial amount, the data will be in the form nnnnnnn.dd – i.e. integer and fraction, dollars and cents, pounds and pence, etc. But the key thing is that the cents / pence / “dd” portion is very low cardinality – there are only one hundred variations. Also, stripping out the “dd” portion will probably end up reducing the cardinality of the number overall. For example, consider these four unique numbers:
That is four distinct numbers… but only two distinct integer parts and two distinct fraction parts. At this small scale it makes no difference, but across thousands of values it can make a big impact on cardinality.
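The effect can be simulated in a few lines of Python (the sample amounts here are my own made-up illustration, not the figures from the original example):

```python
# Sketch: splitting amounts into integer and fraction parts reduces
# the cardinality the engine has to dictionary-encode.
# Sample data invented for illustration.
amounts = [199.99, 199.50, 305.99, 305.50]

integers  = [int(a) for a in amounts]                # dollar part
fractions = [round(a - int(a), 2) for a in amounts]  # cents part

print(len(set(amounts)))    # 4 distinct full values
print(len(set(integers)))   # 2 distinct integer parts
print(len(set(fractions)))  # 2 distinct fraction parts
```

Four distinct values become two columns of cardinality two each – and the fraction column can never exceed one hundred distinct values.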
For a DateTime, the data will be in the form dd/mm/yy hh:mm:ss.sss. You can separate out the time component or round it down to reduce cardinality. Your use case will determine what makes sense, and we will look at both below.
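A sketch of the date/time split in Python (my own illustration of the approach – the sample timestamp is invented):

```python
from datetime import datetime, timedelta

# Sketch: split a datetime into a low-cardinality date part plus an
# integer count of seconds (or milliseconds) since midnight, then
# reconstruct the full value only when it is actually needed.
ts = datetime(2014, 6, 15, 14, 30, 45, 123000)  # made-up sample value

tran_date = ts.date()  # at most 365 distinct values per year
secs_since_midnight = ts.hour * 3600 + ts.minute * 60 + ts.second
ms_since_midnight = secs_since_midnight * 1000 + ts.microsecond // 1000

# Reconstruct to millisecond precision at "query time":
rebuilt = datetime.combine(tran_date, datetime.min.time()) \
          + timedelta(milliseconds=ms_since_midnight)
print(rebuilt)  # 2014-06-15 14:30:45.123000
```

The seconds variant caps the time column at 86,400 distinct values regardless of row count, which is the cardinality reduction the test below is chasing.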
Splitting Data – the practice
Any good theory needs a test, so I created a one million row data set with the following fields:
TranCode: A 3 character Alpha transaction code
TranAmount: A random number roughly between 0.00 and 20,000.00
TranAmountInteger: The Integer part of TranAmount
TranAmountFraction: The Fraction part of TranAmount
TranDateTime: A random date in 2014 down to the millisecond
TranDate: The date part of TranDateTime
TranTime_s: The time part of TranDateTime rounded to the second expressed as a time datatype
TranTime_ms: The time part of TranDateTime rounded to the millisecond expressed as a time datatype
TranTime_num_s: The time part of TranDateTime rounded to the second expressed as an integer datatype
TranTime_num_ms: The time part of TranDateTime rounded to the millisecond expressed as an integer datatype
TranTime_s_DateBaseLined: The time part of TranDateTime rounded to the second expressed as a datetime datatype, baselined to the date 01/10/1900
TranTime_ms_DateBaseLined: The time part of TranDateTime rounded to the millisecond expressed as a datetime datatype, baselined to the date 01/10/1900
The generating code is available here. I’ve used some T-SQL Non Uniform Random Number functions to get more “realistic” data as early drafts of this test were delivering odd results because the data was too uniformly distributed so VertiPaq couldn’t compress it effectively.
You may be wondering why I’ve produced TranTime as both time and datetime datatypes – the short answer is that Tabular Models treat SQL Server time datatypes as text in the tabular model, so I wanted to check whether that made a difference, as I was getting some strange results for the split times.
I then imported the table into a tabular model and processed it, then used DISCOVER_OBJECT_MEMORY_USAGE to work out the space consumed by each column. The results were as follows:
There was a clear saving for splitting the financial amounts into integer and fractions – the split column saved around 50% of the space.
DateTime behaved very oddly. Rounding the precision down from milliseconds to seconds brought big savings – which makes sense, as the cardinality of the column went from 1,000,000 to 60,000. However, splitting out just the time component actually increased the space used.
I tried fixing this by baselining the time component to a specific date – so all millisecond/second components were added to the same date (01/01/1900) – this basically made no difference.
A more effective variation was to just capture the number of milliseconds / seconds since the start of the date as an integer, which saved about 89% and 92% of space respectively.
UPDATE: Read this post to get an explanation for the behaviour of DateTime compression
Splitting Data – the advice
There are certain costs associated with doing so, such as losing the ability to do a DISTINCTCOUNT on the original values, but if your model is pushing memory limits then splitting decimal numbers (especially currency fields) into their integer and fraction parts can make a big difference – my experiments showed around 50%, and that was using fairly random data. Real life tends to be a bit more ordered, so you can hope for greater savings.
Fundamentally it looks like DateTime values compress poorly, and Time values even more so. A better solution – at least from a compression standpoint – is to store the date value as a Date datatype in the model, and have any time component stored as integers. How this impacts performance when bringing these together at runtime using the DATEADD function is a matter for you to test!
Even if you use an adblocker, I bet you’ve all seen this:
Well, the good news is if you have a chubby Enterprise Edition SQL Server, you too can get it on a fast weight loss program.
One Weird Old Tip
Okay, okay, I’ll pack in the click-baiting. SQL Server packs a great feature called Data Compression. What this does is compress the data on disk using a low-CPU-intensity algorithm. You can apply it to data tables (and their associated clustered indexes) and to non-clustered indexes.
Now, there’s no such thing as a free lunch, but this isn’t far off paying a dollar for a really, really good sandwich. The cost of applying compression is that you burn CPU decompressing the data on read and compressing it on write. However, reported costs of this on a server are around a 2-5% increase in baseline CPU usage. Most data warehouse applications barely tickle their CPU (unless someone has written some shocking code), so this cost is to all intents and purposes negligible.
Now for the taste of the really, really good sandwich. Disk usage savings in my experience are around the 50% mark (+/- 5%). As a practical example, one of my clients had a problem with a 680GB database – they needed more space for more data and indexes, and were scrabbling for funding before they tipped over the 1TB mark. A quick evaluation showed they could cut their disk consumption to 280GB! No more need for funding scrabbles…!
So, how do I lose this weight?
It’s so easy. (Well, as long as you have Enterprise Edition).
ALTER TABLE <table_name>
REBUILD PARTITION = ALL WITH (DATA_COMPRESSION = <option>)
Just run a quick ALTER TABLE and compression is applied. It may take a while to compress the data, but after that it’s just … on. You will most likely never notice a difference, and you don’t have to start writing queries any differently.
I’m not going to run through the subtleties of whether to choose ROW or PAGE compression (or COLUMNSTORE, but that’s your only option for COLUMNSTORE tables in 2014+) but when I’ve run my evaluations on various tables I’ve generally seen PAGE showing a slight advantage.
Will my shadow be fat still?
Hmm. I did promise I’d stop. There is a bonus compression feature worth talking about which applies to all editions (i.e. Standard, BI, Enterprise) from SQL 2008 R2 onwards: backup compression. It’s also available in 2008, but Enterprise only. There are some minor limitations, but in most circumstances you can simply compress your backups as you take them. This is handy, as it will again significantly reduce the amount of archival storage you use, at again a negligible performance cost.
Hey, BI Monkey, can you help me with a script?
Why, yes I can – and here it is: Size and compress. This will size every table in your database, recommend a compression option and generate output that looks a little like this:
It was written for SQL2012 so doesn’t take into account Columnstore compressed tables in 2014, and isn’t smart enough to check the edition, but otherwise it’s a good starting point.
Now, go get your database being as fit as it should be!
I’m currently doing some Tabular Cube design, and part of the design has to include several “Y/N” flag fields. My inner geek then wondered what, if any, difference the data type made to the compression rates of the cube.
So, to start with I created a SQL Server table with three columns with the three usual types for this type of data
… and then populated it with a mix of Y/N’s for char and 1/0’s for tinyint and bit. I thought about 40 million rows should do it, as that gave me about half a gig of on-disk data.
Next I sucked it into a Tabular model, which gave me the following data types:
… and yes, that is bytes. Assuming that the data is stored evenly on disk that means that approx 180 million bytes were compressed to 300 thousand – a compression ratio of 600:1 – yoinks!
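The arithmetic behind that ratio, as a quick back-of-the-envelope check (assuming, as above, that the roughly half a gig on disk is split evenly across the three one-byte columns):

```python
# Back-of-the-envelope check on the quoted compression ratio.
# Assumptions: ~540 MB on disk, split evenly across three one-byte
# columns, each compressing to ~300 KB in the model.
on_disk_total = 540_000_000      # bytes, "about half a gig"
per_column = on_disk_total // 3  # ~180 million bytes per column
in_memory = 300_000              # ~300 thousand bytes in the model

print(per_column // in_memory)   # 600 -> roughly a 600:1 ratio
```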
Anyway, the first pass of the results seems to show that char is mildly less efficient than tinyint and bit. However, if you take it a level deeper and just look at the memory assigned to the In-Memory Table.Columns element – the most significant source of difference in memory usage – you see this:
char(1) = 17,944 bytes
tinyint = 2,344 bytes
bit = 2,344 bytes
So it appears that char(1) is much less efficient than tinyint or bit for storage, which end up being the same at this level of detail. For those of you who care deeply, you can view my results spreadsheet.
My conclusion? To be honest the amount of memory used by these columns is going to be pretty trivial unless you have a gigantic dataset, but if you have a flag field on your fact tables, I would choose to use bit in your database and TRUE/FALSE in your tabular model if you can.
The thing about Big Data is, well… it’s big. Which has impacts in terms of how long it takes you to move your data about and the space it needs to be stored in. Now, as a novice, I had assumed that you had to decompress your data to process it, and that I had to tolerate the huge volumes of output my (admittedly not very efficient) code produced.
As it turns out, you can not only process input in a compressed format, you can also compress the output – as detailed in the Hadoop Streaming documentation. So now my jobs start smaller and end smaller, and without a massive performance overhead.
So how does it work? Well, to read compressed data you have to configure absolutely nothing. It just works, as long as Hadoop recognises the compression algorithm. To compress the output, you need to tell the job to do so. Using the “-D” option you can set some generic command options to configure the job. A sample job – formatted for HDInsight – is below, with the key options highlighted in blue: