A couple of times recently I have come up against requirements which have required some fairly complex logic to apply security. One involved some fairly gnarly relationships coming from multiple directions, the other involved grinding through Hierarchies from parent nodes down to permitted viewable children.
The problem with both cases is that though the logic can sometimes be written (albeit usually in an ugly as hell manner) – the functions needed to do so perform atrociously. For complex relationships you are obligated to take in context after context, changing filters and doing all sorts of DAX voodoo. As we know by now, avoiding relationships is good for performance. Hierarchies can be managed through the PATH function, but it’s a text operation that is far from speedy.
Let’s give a quick example of some complex security – consider the below data model:
Here the security controls who can see what has been spent on a Task in the FactTable object. How can see what depends on their Role and/or the Unit they are in. There is also a 1:many relationship between a person and the login they can use.
So for dynamic security you need to navigate from the User Id to the Person and assess what Unit they are in for the Unit based permissions. You also need to assess what Role they are in to get the Role based permissions.
I took one look at this and shuddered at the messy DAX I was going to have to write, plus how terribly it would perform.
Do it in the Cube? Yeah Nah.
So I thought “Yeah nah” and decided the cube was the wrong place to be doing this. Ultimately all I was trying to get towards was to pair a given login with a set of tasks that login would have permissions against. This is something that could easily be pushed back into the ETL layer. The logic to work it out would still be complex, but at the point of data consumption – the bit that really matters – there would be only minimal thinking by the cube engine.
So my solution enforces security through a role scanning a two column table which contains all valid pairings of login and permitted tasks to view. Very fast to execute when browsing data and a lot easier to code for. The hard work is done in loading that table, but the cube application of security is fast and easy to follow. The hierarchy equivalent is a pairing of User Id with all the nodes in the Hierarchy that are permitted to be seen.
As a final note, for those non-Aussie readers the expression “Yeah nah” is a colloquialism that implies that the speaker can’t be bothered with the option in front of them. For example: “Do you want a pie from the Servo, Dave?” “Yeah nah.”
I’ve recently wrapped up writing the draft of a PowerPivot book (news on that once it’s published) and as part of having to make sure I “knew my onions” I spent a bit of time working my way around understanding the compression engine. I came across this post – Optimizing High Cardinality Columns in VertiPaq – by Marco Russo, and it sparked my interest in seeing how it could be applied to a couple of common data types – financial amounts and date / times. This first lead to me getting distracted building a tabular model to see how much memory columns (and other objects) used. Now i’m getting back to what took me down that path in the first place: seeing how different data type constructions affect memory usage.
How PowerPivot compresses Data
As an introduction, it really helps to understand how PowerPivot compresses data in the first place*. The key tool it uses is a Dictionary which assigns an integer key to a data value. Then when the data is stored it actually stores the key, rather than the data. When presenting the data, it retrieves the keys and shows the user the values in the dictionary.
To illustrate, in this list of Names and Values:
We have several repetitions of Name. These get stored in the dictionary as follows:
Then, internally PowerPivot stores the data of Names/Values like this:
This results in high compression because a text value takes up much more space than an integer value in the database. This effect multiples the more repetitive (i.e. lower cardinality) the data is. High cardinality data, typically numeric values and timestamps – do not compress as well as the number of dictionary entries is often not much less than the number of actual values.
* Quick caveat: this is the theory, not necessarily the practice. The actual compression algorithms used are proprietary to Microsoft so they may not always follow this pattern.
Splitting Data – the theory
The key to Marco’s approach is to split data down into forms with lower cardinality. So what does that mean?
For a financial amount, the data will be in the form nnnnnnn.dd – i.e. integer and fraction, dollars and cents, pounds and pence, etc. But the key thing is that the cents / pence / “dd’ portion is very low cardinality – there are only one hundred variations. Also, stripping out the “dd” potion will probably end up reducing the cardinality of the number overall. For example, consider these unique 4 numbers:
That is four distinct numbers… but two integer parts and two fraction parts. At this small scale it makes no difference, but for thousands of values it can make a big impact on cardinality.
For a DateTime the data will be in the form dd/mm/yy : hh:mm:ss.sss. You can separate out the time component or round it down to reduce cardinality. Your use case will determine what makes sense, and we will look at both below.
Splitting Data – the practice
Any good theory needs a test, so I created a one million row data set with the following fields:
TranCode: A 3 character Alpha transaction code
TranAmount: A random number roughly between 0.00 and 20,000.00
TranAmountInteger: The Integer part of TranAmount
TranAmountFraction: The Fraction part of TranAmount
TranDateTime: A random date in 2014 down to the millisecond
TranDate: The date part of TranDateTime
TranTime_s: The time part of TranDateTime rounded to the second expressed as a time datatype
TranTime_ms: The time part of TranDateTime rounded to the millisecond expressed as a time datatype
TranTime_num_s: The time part of TranDateTime rounded to the second expressed as an integer datatype
TranTime_num_ms: The time part of TranDateTime rounded to the millisecond expressed as an integer datatype
TranTime_s_DateBaseLined: The time part of TranDateTime rounded to the second expressed as a datetime datatype, baselined to the date 01/10/1900
TranTime_ms_DateBaseLined: The time part of TranDateTime rounded to the millisecond expressed as a datetime datatype, baselined to the date 01/10/1900
The generating code is available here. I’ve used some T-SQL Non Uniform Random Number functions to get more “realistic” data as early drafts of this test were delivering odd results because the data was too uniformly distributed so VertiPaq couldn’t compress it effectively.
You may be wondering why I’ve produced TranTime as time and datetime datatypes – the short answer is Tabular Models treat sql server time datatypes as text datatypes in the tabular model, so I wanted to check if that made a difference as I was getting some strange results for split time.
I then imported the table into a tabular model and processed it, then used the discover_memory_object_usage to work out space consumed by column. The results were this:
There was a clear saving for splitting the financial amounts into integer and fractions – the split column saved around 50% of the space.
DateTime behaved very oddly. Rounding down the precision from milliseconds to seconds brought big savings – which makes sense as the cardinality of the column went from 1,000,000 to 60,000. However splitting it out to just the time component actually increased space used.
I tried fixing this by baselining the time component to a specific date – so all millisecond/second components were added to the same date (01/01/1900) – this basically made no difference.
A more effective variation was to just capture the number of milliseconds / seconds since the start of the date as an integer, which saved about 89% and 92% of space respectively.
UPDATE: Read this post to get an explanation for the behaviour of DateTime compression
Splitting Data – the advice
Though there are certain costs associated with doing so, such as the loss of the ability to do DISTINCTCOUNT on values, but if your model is pushing memory limits then splitting decimal numbers into their integer and fraction (especially currency fields) can make a big difference – my experiments showed 50% and that was using fairly random data – real life tends to be a bit more ordered so you can hope for more savings.
Fundamentally it looks like DateTime values compress poorly, and Time values even more so. A better solution – at least from a compression standpoint – is to store the date value as a Date datatype in the model, and have any time component stored as integers. How this impacts performance when bringing these together at runtime using the DATEADD function is a matter for you to test!
A Hierarchy for exploring the structure of the memory use
An Attribute for the Model (so you can filter on just the model you want)
An Attribute for the Model Object (e.g. Hierarchy, Column Storage, Data Sources, etc.)
An Attribute to identify Server objects (such as Server Assemblies) vs Model objects
Before we get into the gnarly details, here’s a look at what comes out the other side:
What you get is the capacity to browse down the hierarchy and apply a few useful filters:
Filter to the Model(s) you are interested in
Filter for the type of Model Object (e.g. Column, Hierarchy) you want to focus on
Filter for Server / Model level objects (largely useful for just getting rid of server level noise)
Things that work well, and not so well.
Actually, it mostly works pretty well. It cleans up most of the GUIDs that make navigation tricky, categorises objects usefully (for me, anyway) and the logic baked into the view that does most of the work is not too hard to follow.
The biggest problem is that the hierarchy of objects doesn’t always make sense – there seem to be Model level objects at the Server level with no attached model. This is probably more to do with my understanding of how the server handles certain objects.
However, I’m always happy to get some feedback on this and any suggestions – especially on how to categorise things properly – will be greatly appreciated.
How to get this in your environment
The solution comes in a few parts:
SQL Table to hold the contents of DISCOVER_OBJECT_MEMORY_USAGE
SSIS Package to extract the results from DISCOVER_OBJECT_MEMORY_USAGE into the table
SQL View to translate, clean and categorise the output from DISCOVER_OBJECT_MEMORY_USAGE
A Tabular model to help structure exploring the output
The cube on my project has been hitting some apparent concurrency issues, so I’ve been hunting for advice on how to tune the hardware (model tuning has already gone a long way). Unfortunately Microsoft don’t have any reference architectures – and their only other advice was to try and use an appliance in Direct Query mode – which was not practical in our circumstances any way.
As usual, the gents at SQLBI had something useful to say on the subject based on a customer case study, which is detailed in this white paper. While well worth a read, I’ll summarise the key findings:
Standard Server CPU’s don’t perform well enough, and you will need to look at faster CPU’s with a large cache
Faster CPU’s are better than more CPU’s in terms of return on investment for perfromance
Fast RAM is a must
For NUMA aware servers you need to set the Node Affinity to a single node, preferably using a Hyper-V host for your tabular server
Setting aside the last point, which is a bit deep in server config and requires more explanation, the key thing is to look for fast CPU. They found that Workstation Blades were generally better than Server Blades, and some of the best performance they got was out of one of their Dev’s gaming rigs!
We’ll be trying some of this out and hopefully I can keep you posted with results. I have more stuff on monitoring tabular in the pipeline now I’ve finished my PowerPivot book (to be published soon).
Well, that title makes much more sense in the context of this post in which I mused about the difficulty of developing against tabular models in a multi developer environment, given there is only one .bim file to work against. I even raised a connect to give the SSAS team to have something else to mark as “Won’t Fix” for the next release (cynical, me?).
Now to stretch an analogy if the problem is two drinks and only one mouth, then the solution clearly is cocktails*!
Mix me up a BISM Normaliser, barman!
A chap called Christian Wade has kindly built up a nifty Visual Studio plug in called BISM Normaliser which handily merges two tabular models together giving you an option to handle development in a multi user environment. You put 2 models side by side and get a comparison screen like this:
You can then merge in tables, columns, relationships, measures – all the good stuff. It’s like a diff but considerably more usable than doing a raw XML comparison. This means if you start from the same base model – advisable as tables are grouped by connections so if your connections don’t match you can’t merge – the dev team can work on separate areas and then merge it back together at some point.
It’s not a substitute for a proper multi-author environment, but at least it makes it possible. There are risks of course – it’s a no warranty codeplex plug in – and you won’t get the benefits of TFS managed components (source control, changes, etc) – and the code currently is set to expire in Dec 2014 so if Christian sells the code you’ll need to buy it off someone.
Anyway – there is a partial solution – on our project we’ve given it a first pass and it seems to do what it claims and since we have no alternative it’s going to get used. So, big thanks to Christian!
I’m currently doing some Tabular Cube design, and part of the design has to include several “Y/N” flag fields. My inner geek then wondered what, if any, difference the data type made to the compression rates of the cube.
So, to start with I created a SQL Server table with three columns with the three usual types for this type of data
… and then populated it with a mix of Y/N’s for char and 1/0’s for tinyint and bit. I thought about 40 billion rows should do it, as that gave me about half a gig of on disk data.
Next I sucked it into a Tabular model, which gave me the following data types:
… and yes, that is bytes. Assuming that the data is stored evenly on disk that means that approx 180 million bytes were compressed to 300 thousand – a compression ratio of 600:1 – yoinks!
Anyway, the first pass of the results seem to show that char is mildly more inefficient that tinyint and bit. However, if you take it a level deeper, and just look at the Memory assigned to the In-Memory Table.Columns element, the most significant source of difference between the memory usage, you see this:
char(1) = 17,944 bytes
tinyint = 2,344 bytes
bit = 2,344 bytes
So it appears that char(1) is much less efficient that tinyint or bit for storage, which end up being the same at this level of detail. For those of you who care deeply, you can view my results spreadsheet.
My conclusion? To be honest the amount of memory used by these columns is going to be pretty trivial unless you have a gigantic dataset, but if you have a flag field on your fact tables, I would choose to use bit in your database and TRUE/FALSE in your tabular model if you can.