SSAS Tabular at Scale

The cube on my project has been hitting some apparent concurrency issues, so I’ve been hunting for advice on how to tune the hardware (model tuning has already gone a long way). Unfortunately Microsoft doesn’t have any reference architectures, and their only other advice was to try using an appliance in DirectQuery mode, which wasn’t practical in our circumstances anyway.

As usual, the gents at SQLBI had something useful to say on the subject, based on a customer case study detailed in this white paper. It’s well worth a read in full, but I’ll summarise the key findings:

  • Standard server CPUs don’t perform well enough; you will need to look at faster CPUs with a large cache
  • Faster CPUs are a better return on investment for performance than more CPUs
  • Fast RAM is a must
  • For NUMA-aware servers you need to set the node affinity to a single node, preferably by using a Hyper-V host for your tabular server

Setting aside the last point, which is a bit deep in server config and requires more explanation, the key thing is to look for fast CPU. They found that workstation blades were generally better than server blades, and some of the best performance they got was out of one of their devs’ gaming rigs!

We’ll be trying some of this out and hopefully I can keep you posted with results. I have more stuff on monitoring tabular in the pipeline now I’ve finished my PowerPivot book (to be published soon).

Also, don’t forget my upcoming DW Starter Training on Nov 24/25 in Sydney

October Sydney training roundup – MS BI, Cloud, Analytics

The end of the year is closing in fast but there’s still plenty of chances to learn from specialist providers Agile BI, Presciient and of course, me!

Topics cover the full spread of DW, BI and Analytics so there’s something for every role in the data focused organisation.

Build your Data Warehouse in SQL Server & SSIS with the BI Monkey

Nov 24/25 – Are you about to build your Data Warehouse with Microsoft tools and want to do it right first time?

This course is designed to help a novice understand what is involved in building a Data Warehouse, both from a technical architecture and a project delivery perspective. It also gives you basic skills in the tools the Microsoft Business Intelligence suite offers for doing just that.

Get more detail here

Agile BI workshops

Power BI specialist Agile BI brings you product updates on this key new self-service BI technology:

Oct 15 – Power BI workshop – Excel new features for reporting and data analysis – more detail here

Oct 30 – What Every Manager Should Know About Microsoft Cloud, Power BI for Office 365 and SQL Server 2014 – more detail here

Presciient Training

Dr Eugene Dubossarsky shares his deep business and technical expertise across a range of advanced and basic analytics. Full details here, but the key list is:

Dec 9/10 – Predictive analytics and data science for big data

Dec 11/12 – Introduction to R and data visualisation

Dec 16/17 – Data analytics for fraud and anomaly detection, security and forensics

Dec 18/19 – Business analytics and data for beginners


PowerPivot calculated columns are not dynamic

A quick and dirty one – in attempting some clever dynamic security modelling in DAX, I was warned about a gotcha in my design: calculated columns are only evaluated when the model is processed, so any approach based on calculated columns was doomed to failure. I didn’t quite believe it, so I decided to do a road test in PowerPivot using a simple data set of one Dimension and one Fact, like below:

Simple Data Model

I hooked them up in PowerPivot with a relationship between “Column” on both tables. Then, using the ISFILTERED() function I created a couple of calculations. One, at Row Level, that would return a value of 1 if I filtered on the Attribute column:

=IF(ISFILTERED(Dim[Attribute]),1,0)

Which I then put a SUM on top of. I also added one at measure level, performing a similar function:

FilterCheckAsMeasure:=IF(ISFILTERED(Dim[Attribute]),1,0)

Then I created a Pivot Table checking the results, and got this:

Results

The takeaway being that filtering on the Attribute was picked up by the table level measure, but the calculated column did not change value.

You can have a check of this yourself in my sample workbook: DAX Calculated Columns Evaluation Order.

What does the documentation say?

Well, nothing terribly clear, but in this page there is this small paragraph:

When a column contains a formula, the value is computed for each row. The results are calculated for the column as soon as you create the formula. Column values are then recalculated as necessary, such as when the underlying data is refreshed.

It doesn’t say exactly when “recalculated as necessary” is necessary, but the implication is that it’s a model level change, rather than the switching of context, or in my case the changing of the user viewing the model.

So in summary, we have to assume that our calculated column values are fixed upon initial calculation, formula changes or data load (or processing in a Tabular model) and there isn’t a way to make the value in a given cell change.

Creating effective date ranges from multiple sources using Window Functions

Sometimes dimension data is managed across multiple tables. Just to complicate things, sometimes this data has independent effective date ranges in each of these sources. So when we try to tie our data together, picking which item of data is effective when is a bit of a challenge.

A picture speaks a thousand lines of blog post, so the picture below spells it out:

Date Ranges from multiple sources

Table A has a set of data with different effective periods. Table B has a set of data for the same attribute with a completely independent set of effective periods. In data terms, it looks like this:

Date ranges from multiple sources sample data

The challenge is to join them together so we get the right combination of attributes effective at the right time, as per the first picture. Now, there is a way to do it through a join with careful selection of start/end dates in a CASE statement and filtering out of records using a WHERE clause. However, that has the downfall that it cannot cope with records where there is no crossover of data – so records “1-”, “4-” and “5-” have to be added in later through a separate process.

The alternative is to get the window functions voodoo doll out, and stretch the brain a little so you can do it all in one query.

Step one in this exercise is realising that each table’s end dates (plus a day) could also be start dates in the other table, and each table’s start dates (less a day) could also be end dates. So we need to UNION the start dates from both tables with the end dates (plus a day) from both tables, like so:

SELECT ID, [A Start Date] AS [Start Date]
FROM Table_A
UNION
SELECT ID, [B Start Date]
FROM Table_B
UNION
-- All end dates + 1 day are also start dates
SELECT ID, ISNULL(DATEADD(d, 1, [A End Date]), '31 Dec 2099')
FROM Table_A
UNION
SELECT ID, ISNULL(DATEADD(d, 1, [B End Date]), '31 Dec 2099')
FROM Table_B

Now, this gives us a full set of every possible start date, which is a starting point. The end result looks like this:

Union Results

We can repeat the same trick for end dates, and then do a Cartesian join on the two sets to get a combination of every possible start and end date. Now we need some criteria by which to select the right date pair. If we add a DATEDIFF to the result set, it becomes obvious that we want to pick the smallest date range:

Crossjoin results with DATEDIFF

A window function gives us the intelligence to pick the right row. If we apply a ROW_NUMBER() over a partition of Start Date, ordering by End Date, then we just have to select the first row of each partition (a sketch of the full query follows the result below):

The final result

Now we have a complete set of effective date ranges on which to join our attribute tables!
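Before grabbing the full script below, here’s a minimal sketch of how the whole pattern might hang together as a single query. It’s an illustration rather than the downloadable script itself: it assumes the Table_A / Table_B column names used in the UNION above, treats a NULL end date as open-ended (31 Dec 2099), and covers a single set of attribute rows.

-- Sketch: build every candidate start and end date, cross join them,
-- then keep the tightest valid range for each start date
WITH StartDates AS (
    SELECT [A Start Date] AS [Start Date] FROM Table_A
    UNION
    SELECT [B Start Date] FROM Table_B
    UNION
    -- all end dates + 1 day are also candidate start dates
    SELECT ISNULL(DATEADD(d, 1, [A End Date]), '31 Dec 2099') FROM Table_A
    UNION
    SELECT ISNULL(DATEADD(d, 1, [B End Date]), '31 Dec 2099') FROM Table_B
),
EndDates AS (
    SELECT ISNULL([A End Date], '31 Dec 2099') AS [End Date] FROM Table_A
    UNION
    SELECT ISNULL([B End Date], '31 Dec 2099') FROM Table_B
    UNION
    -- all start dates - 1 day are also candidate end dates
    SELECT DATEADD(d, -1, [A Start Date]) FROM Table_A
    UNION
    SELECT DATEADD(d, -1, [B Start Date]) FROM Table_B
),
Candidates AS (
    SELECT  s.[Start Date],
            e.[End Date],
            ROW_NUMBER() OVER (PARTITION BY s.[Start Date]
                               ORDER BY DATEDIFF(d, s.[Start Date], e.[End Date])) AS rn
    FROM StartDates s
    CROSS JOIN EndDates e
    WHERE e.[End Date] >= s.[Start Date]  -- discard pairs that run backwards
)
SELECT [Start Date], [End Date]
FROM Candidates
WHERE rn = 1                              -- the smallest valid range per start date
  AND [Start Date] < '31 Dec 2099'        -- drop the spurious range after the final end date
ORDER BY [Start Date];

These ranges can then be joined back to Table_A and Table_B to pick up whichever attribute values are effective in each window.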

Grab a copy of the query, including sample data scripts here: DateRanges

Don’t forget to check out my upcoming training “Build your Data Warehouse in SQL Server & SSIS with the BI Monkey” – a  primer course for anyone wanting to build a first DW using Microsoft tools.


July Training roundup

OK, so there are a lot of things happening in July and later in the year which you should be aware of, led either by me or by people who I know deliver great training across Microsoft BI, SQL Server, R, Data Science, Predictive modelling… and more.

Build your Data Warehouse in SQL Server & SSIS with the BI Monkey

There’s a big advert to the right – this workshop, taking place 28/29 July, will help you get to grips with the core concepts of data warehousing, give you hands-on experience implementing those concepts using the Microsoft SQL Server BI toolset, then round it all out by making you aware of the peripheral details that can enhance your solution and success.

Register Now

AgileBI Workshops on SQL Server 2014 & PowerBI

AgileBI – which includes Iman Eftekhari, Viktor Isakov and myself – are delivering the following workshops:

  • What Every Manager Should Know About Microsoft Cloud, Power BI for Office 365 and SQL Server 2014 – 11/2 day session on July 24th
  • SQL Server 2014 & Power BI Jump Start – Full Day Workshop on July 31st
  • Power BI workshop – Excel new features for reporting and data analysis – lunchtime July 10th

Register Now

Presciient Training on R, Data Science & Predictive Modelling

Training led by the ever-interesting Dr Eugene Dubossarsky in September across Canberra, Sydney & Melbourne – Sydney dates below:
  • Introduction to R and data visualisation – 9/10 Sep
  • Predictive modelling, data science and big data – 11/12 Sep
  • Data analytics for fraud and anomaly detection, security and forensics – 16/17 Sep
  • Advanced R – 18/19 Sep

Register Now

BISM Normaliser is a cocktail for the Tabular Model

Well, that title makes much more sense in the context of this post, in which I mused about the difficulty of developing against tabular models in a multi-developer environment, given there is only one .bim file to work against. I even raised a Connect item to give the SSAS team something else to mark as “Won’t Fix” for the next release (cynical, me?).

Now, to stretch an analogy: if the problem is two drinks and only one mouth, then the solution clearly is cocktails*!

Mix me up a BISM Normaliser, barman!

A chap called Christian Wade has kindly built a nifty Visual Studio plug-in called BISM Normaliser, which handily merges two tabular models together, giving you a way to handle development in a multi-user environment. You put two models side by side and get a comparison screen like this:

bism normaliser

You can then merge in tables, columns, relationships, measures – all the good stuff. It’s like a diff but considerably more usable than doing a raw XML comparison. This means if you start from the same base model – advisable as tables are grouped by connections so if your connections don’t match you can’t merge – the dev team can work on separate areas and then merge it back together at some point.

It’s not a substitute for a proper multi-author environment, but at least it makes it possible. There are risks of course – it’s a no-warranty CodePlex plug-in, you won’t get the benefits of TFS-managed components (source control, change tracking, etc.), and the code is currently set to expire in Dec 2014, so if Christian sells the code you’ll need to buy it off someone.

Anyway – there is a partial solution. On our project we’ve given it a first pass and it seems to do what it claims, and since we have no alternative it’s going to get used. So, big thanks to Christian!


*Or Jagerbombs, but let’s not go there:

Jagerbombs are not the answer (in this case)

Trick of a tiny database: cut down 1GB of your database every day by simply using this 1 weird old tip

Even if you use an adblocker, I bet you’ve all  seen this:

Trick of a tiny database

Well, the good news is if you have a chubby Enterprise Edition SQL Server, you too can get it on a fast weight loss program.

One Weird Old Tip

Okay, okay, I’ll pack in the click-baiting. SQL Server packs a great feature called Data Compression. What this does is compress the data on disk using a low CPU intensity algorithm. You can apply it to the data tables (and associated clustered indexes) and non-clustered indexes.

Now, there’s no such thing as a free lunch, but this isn’t far off paying a dollar for a really, really good sandwich. The cost for applying compression is that you burn CPU decompressing the data on read and compressing it on write. However reported costs of this on a server are about a 2-5% increase in baseline CPU usage. Most Data Warehouse applications barely tickle their CPU (unless someone has written some shocking code) so this cost is to all intents and purposes negligible.

Now for the taste of the really, really good sandwich. Disk usage savings in my experience are around the 50% mark (+/- 5%). As a practical example of that, one of my clients had a problem with a 680GB database – they needed more space for more data and indexes and were scrabbling for funding before they tipped over the 1TB mark. A quick evaluation showed they could cut their disk consumption to 280GB! No more need for funding scrabbles…!

So, how do I lose this weight?

It’s so easy. (Well, as long as you have Enterprise Edition).

ALTER TABLE <table_name> 
REBUILD PARTITION = ALL WITH (DATA_COMPRESSION =  <option>)

Just run a quick ALTER TABLE and compression is applied. It may take a while to compress the data, but after that it’s just … on. You will most likely never notice a difference, and you don’t have to start writing queries any differently.
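The same trick applies to the non-clustered indexes mentioned earlier – they are compressed separately from the table. A minimal sketch, with made-up table and index names:

-- Compress a non-clustered index (or use ALL to rebuild every index on the table)
ALTER INDEX IX_FactSales_DateKey ON dbo.FactSales
REBUILD WITH (DATA_COMPRESSION = PAGE);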

I’m not going to run through the subtleties of whether to choose ROW or PAGE compression (or COLUMNSTORE, but that’s your only option for COLUMNSTORE tables in 2014+) but when I’ve run my evaluations on various tables I’ve generally seen PAGE showing a slight advantage.
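If you want to run that kind of evaluation yourself, SQL Server ships a stored procedure that estimates the savings for a given setting – run it once with ROW and once with PAGE and compare. The schema and table names below are just placeholders:

-- Estimate how much space PAGE compression would save; a NULL index_id covers all indexes
EXEC sp_estimate_data_compression_savings
    @schema_name      = 'dbo',
    @object_name      = 'FactSales',
    @index_id         = NULL,
    @partition_number = NULL,
    @data_compression = 'PAGE';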

Will my shadow be fat still?

Hmm. I did promise I’d stop. So there is a bonus compression feature worth talking about which applies to all the main editions (i.e. Standard, BI, Enterprise) from SQL 2008 R2 onwards: backup compression. It’s also available in 2008, but Enterprise only there. There are some minor limitations, but in most circumstances you can simply compress your backups as you take them. Which is handy, as it will significantly reduce the amount of archival storage you use, again at a negligible performance cost.
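Turning it on is just another option on the backup statement, or you can make it the server-wide default. The database name and path here are made up for the example:

-- Compress an individual backup as it is taken
BACKUP DATABASE MyDataWarehouse
TO DISK = 'D:\Backups\MyDataWarehouse.bak'
WITH COMPRESSION;

-- Or make compression the default for all backups on this server
EXEC sp_configure 'backup compression default', 1;
RECONFIGURE;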

Hey, BI Monkey, can you help me with a script?

Why, yes I can – and here it is: Size and compress. This will size every table in your database, recommend a compression option and generate output that looks a little like this:

Compression script output

It was written for SQL2012 so doesn’t take into account Columnstore compressed tables in 2014, and isn’t smart enough to check the edition, but otherwise it’s a good starting point.

Now, go get your database being as fit as it should be!

Before and after compression

Multiple developers against a single Tabular model is a drinking problem

The project I’m currently working on has at its heart a whopping great tabular model with dreams of eating more tables and being the biggest, fattest tabular model there ever was. To achieve its ambition of eating all this lovely data and becoming so vast, it needs an army of chefs… er, developers… to feed it.

Except it has one big problem:

Two hands one mouth drinking problem

It only has one “mouth” – the .bim file. Only one person at a time can work on it. If you want to do simultaneous development, realistically all you can do is maintain separate versions and merge your changes together, either with a diff tool or manually (or potentially via BIML script, if that eventuates).

So I’ve raised a Connect item: Componentise Tabular models to allow for multi user development, requesting that the .bim file can be broken up into chunks. The text of the Connect item is below:

At present the Tabular model is a single code object (the .bim) file. This means it is not possible to break up the development of a model across multiple developers – a problem in large projects. It also makes source control less effective as you cannot track which specific objects within the model have been changed.

The model needs to be componentised so that individual elements can be worked on separately and then merged back into a main model, in a manner more similar to OLAP cubes.

Elements that could be broken out are: Perspectives, Connection Managers, Individual tables, Calculated Columns, Measures and relationships

Anyway…  if you think this is a problem you’d like to see solved – vote it up!

BIML and MIST – a first encounter

The MIST developers – Varigence – have been waving their product at me for a wee while now and I’ve finally had a chance to get into the IDE and get a better feel for it.

Before I get too carried away, here’s a quick 101. There’s this thing called BIML – an XML dialect for describing BI Assets (for now, only in the Microsoft world). This opens the door to scripting and therefore simpler dynamic generation of BI objects. BIML can be used by BIDS Helper (a thing you should have if you are an active BI developer) or the more focused BIML IDE MIST.

Now, I’ve seen the shiny video that promised the world, but nothing quite beats hands on experience. So I’ve started following the online user guide and got as far as building a Dimension table.

My feelings so far? I’m a bit “meh”. Now I know there’s a lot more capability to the product which I haven’t got to yet – so this is far from final commentary – but there are a few clear things that I think need to be looked at in the product to give it the sense of really being a development accelerator.

First up, it’s pretty clunky. It suffers heavily from “kitchen sinkism” – i.e. because it can do something, there’s a dialog box / screen / tab for it displayed all at once. Take for example this table development screen:

MIST Table Development

There’s a lot going on and that’s on a 1920×1080 screen….   some better screen space organisation is needed.

Next up is the fact that the designers don’t add a lot over basic SSMS capability. The table designer is effectively the same blank grid that you get in SSMS, but without even the courtesy of being able to add another row without going back to the ribbon. At this point I’d be more inclined to develop in SSMS and import into MIST.

Then my next concern is over value add / accessibility.  For example, when setting up a project there’s some basic stuff that needs to be done – setting up connections, importing from databases – that should just be a wizard when starting up  a fresh project.  When creating a dimension, a bundle of default attributes should be added (preferably from a checklist or custom template).

So my first impression is that it needs a user experience workover. However, that is far from a unique criticism of developer tools, so I won’t go too hard on them. I’ll press on with it and see how the functionality unveils itself.

Gimme a table, but hold the indexes there buddy!

I’ve just wrapped up a quick bit of work with a client who was just getting moving on their DW journey, and I found some fun things that the rookies had done.

I was also partly inspired to talk about this particular issue by this timely DBA Reactions post: When the developers hear my warning about 27 indexes on a single table

When the developers hear my warning about 27 indexes on a single table

Part of what I was looking at was applying compression to their tables to reduce the disk space consumption. So I ran my handy query to size the databases and found an 8GB table with 12GB of indexes! Yoinks – something was clearly astray…  it turned out that one table – with about 30 columns – had 8 indexes on it.
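If you want a quick way to get that kind of table-versus-index size breakdown, a rough sketch using the partition stats DMV might look like the query below (the “Size and compress” script in the compression post above does a more complete job):

-- Rough data vs index size per table, using reserved pages (8KB each)
SELECT  s.name + '.' + t.name AS TableName,
        SUM(CASE WHEN i.index_id IN (0, 1) THEN ps.reserved_page_count ELSE 0 END) * 8 / 1024 AS DataMB,
        SUM(CASE WHEN i.index_id > 1 THEN ps.reserved_page_count ELSE 0 END) * 8 / 1024 AS IndexMB
FROM sys.tables t
JOIN sys.schemas s ON s.schema_id = t.schema_id
JOIN sys.indexes i ON i.object_id = t.object_id
JOIN sys.dm_db_partition_stats ps ON ps.object_id = i.object_id AND ps.index_id = i.index_id
GROUP BY s.name, t.name
ORDER BY DataMB + IndexMB DESC;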

Why not have 27 indexes?

So there’s a few good reasons:

  • That many indexes take a long time to build / update, slowing load performance (in this case it took about 5 minutes to load the table and a further 29 minutes to build the indexes….)
  • The Query Optimiser is going to struggle to pick the right index if any of them overlap
  • If you are on 2012+ you can just stick the entire table in a columnstore index, massively boost performance and consume less disk space (applying a columnstore on the whole table took a whopping 1 minute to build) – see the sketch just after this list
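As a rough illustration of that last point, here’s what swapping a pile of non-clustered indexes for a whole-table columnstore might look like. The table and index names are made up, and the clustered form is 2014+ only:

-- SQL 2012: a non-clustered columnstore covering every column (read-only once built)
CREATE NONCLUSTERED COLUMNSTORE INDEX NCCI_BigFactTable
ON dbo.BigFactTable (Col1, Col2, Col3); -- list all the table's columns here

-- SQL 2014+: a clustered columnstore replaces the row store entirely
-- (drop or convert any existing clustered rowstore index first)
CREATE CLUSTERED COLUMNSTORE INDEX CCI_BigFactTable ON dbo.BigFactTable;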

As a broad rule, I’d consider the following rules of thumb when looking at a table’s indexes:

  • If I have more than 2 indexes, I’m probably not indexing well, my table design is wrong or my queries need re-evaluating
  • If my indexes are more than 20% of the size of the table, I should probably look at replacing them with a whole table columnstore
  • Are my indexes making use of included columns to increase their coverage? (a quick example follows this list)
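On that last point, a minimal sketch of an included-column index, again with illustrative names:

-- A narrow key for seeking, with commonly selected columns carried at the leaf level
-- so typical queries are covered without creating yet another index
CREATE NONCLUSTERED INDEX IX_FactSales_DateKey
ON dbo.FactSales (DateKey)
INCLUDE (SalesAmount, Quantity);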

If any readers have some good index management tips, please share them in the comments!
