SSAS Tabular at Scale

The cube on my project has been hitting some apparent concurrency issues, so I’ve been hunting for advice on how to tune the hardware (model tuning has already gone a long way). Unfortunately Microsoft don’t have any reference architectures, and their only other advice was to try an appliance in DirectQuery mode, which was not practical in our circumstances anyway.

As usual, the gents at SQLBI had something useful to say on the subject based on a customer case study, which is detailed in this white paper. While well worth a read, I’ll summarise the key findings:

  • Standard server CPUs don’t perform well enough; you will need to look at faster CPUs with a large cache
  • Faster CPUs are better than more CPUs in terms of return on investment for performance
  • Fast RAM is a must
  • For NUMA-aware servers you need to set the node affinity to a single node, preferably by using a Hyper-V host for your tabular server

Setting aside the last point, which is a bit deep in server config and requires more explanation, the key thing is to look for fast CPUs. They found that workstation blades were generally better than server blades, and some of the best performance they got was out of one of their developers’ gaming rigs!

We’ll be trying some of this out and hopefully I can keep you posted with results. I have more stuff on monitoring tabular in the pipeline now I’ve finished my PowerPivot book (to be published soon).

Also, don’t forget my upcoming DW Starter Training on Nov 24/25 in Sydney


BISM Normaliser is a cocktail for the Tabular Model

Well, that title makes much more sense in the context of this post, in which I mused about the difficulty of developing against tabular models in a multi-developer environment, given there is only one .bim file to work against. I even raised a Connect item to give the SSAS team something else to mark as “Won’t Fix” for the next release (cynical, me?).

Now, to stretch an analogy: if the problem is two drinks and only one mouth, then the solution clearly is cocktails*!

Mix me up a BISM Normaliser, barman!

A chap called Christian Wade has kindly built a nifty Visual Studio plug-in called BISM Normaliser, which merges two tabular models together, giving you an option for handling development in a multi-developer environment. You put two models side by side and get a comparison screen like this:

bism normaliser

You can then merge in tables, columns, relationships, measures – all the good stuff. It’s like a diff, but considerably more usable than a raw XML comparison. This means that if you start from the same base model (advisable, as tables are grouped by connection, so if your connections don’t match you can’t merge) the dev team can work on separate areas and then merge them back together at some point.

It’s not a substitute for a proper multi-author environment, but at least it makes it possible. There are risks, of course: it’s a no-warranty CodePlex plug-in, you won’t get the benefits of TFS-managed components (source control, change tracking, etc.), and the code is currently set to expire in Dec 2014, so if Christian sells the code you’ll need to buy it off someone.

Anyway, there is now a partial solution. On our project we’ve given it a first pass, it seems to do what it claims, and since we have no alternative it’s going to get used. So, big thanks to Christian!

 

 

*Or Jagerbombs, but let’s not go there:

Jagerbombs are not the answer (in this case)


Multiple developers against a single Tabular model is a drinking problem

The project I’m currently working on has at its heart a whopping great tabular model with dreams of eating more tables and being the biggest, fattest tabular model there ever was. To achieve its ambition of eating all this lovely data and becoming so vast, it needs an army of chefs… er, developers… to feed it.

Except it has one big problem:

Two hands one mouth drinking problem

 

It only has one “mouth”: the .bim file. Only one person at a time can work on it. If you want to do simultaneous development, realistically all you can do is maintain separate versions and merge your changes together, either with a diff tool, manually, or potentially via BIML script if that eventuates.

So I’ve raised a Connect item, Componentise Tabular models to allow for multi user development, to request that the .bim file be broken up into chunks. The text of the Connect item is below:

At present the Tabular model is a single code object (the .bim file). This means it is not possible to break up the development of a model across multiple developers, which is a problem in large projects. It also makes source control less effective, as you cannot track which specific objects within the model have been changed.

The model needs to be componentised so that individual elements can be worked on separately and then merged back into a main model, in a manner more similar to OLAP cubes.

Elements that could be broken out are: Perspectives, Connection Managers, Individual Tables, Calculated Columns, Measures and Relationships

Anyway…  if you think this is a problem you’d like to see solved – vote it up!


SSAS Tabular Compression of Binary and Flag fields

I’m currently doing some Tabular Cube design, and part of the design has to include several “Y/N” flag fields. My inner geek then wondered what, if any, difference the data type made to the compression rates of the cube.

So, to start with, I created a SQL Server table with three columns, using the three usual types for this kind of data:

  • char(1)
  • tinyint
  • bit

… and then populated it with a mix of Y/N’s for char and 1/0’s for tinyint and bit. I thought about 40 million rows should do it, as that gave me about half a gig of on-disk data.
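
For anyone who wants to reproduce the test, here’s a minimal T-SQL sketch of the setup. The table and column names are my own invention (the original object names aren’t given above), and the row count is scaled right down; wind the TOP clause up, or run the insert repeatedly, to get to the half-gig mark.

-- Hypothetical test table; names and row count are illustrative only
CREATE TABLE dbo.FlagCompressionTest
(
    FlagChar    char(1) NOT NULL,  -- 'Y' / 'N'
    FlagTinyint tinyint NOT NULL,  -- 1 / 0
    FlagBit     bit     NOT NULL   -- 1 / 0
);

-- Populate with a random mix of flag values
INSERT INTO dbo.FlagCompressionTest (FlagChar, FlagTinyint, FlagBit)
SELECT TOP (1000000)
    CASE WHEN ABS(CHECKSUM(NEWID())) % 2 = 0 THEN 'Y' ELSE 'N' END,
    ABS(CHECKSUM(NEWID())) % 2,
    ABS(CHECKSUM(NEWID())) % 2
FROM sys.all_objects AS a
CROSS JOIN sys.all_objects AS b;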

Next I sucked it into a Tabular model, which gave me the following data types:

  • char(1) -> Text
  • tinyint -> Whole Number
  • bit -> TRUE/FALSE

Then I processed the model three times, each time using only one column in the data. Then, using the SSAS DMV $system.discover_object_memory_usage, I looked at how much memory each column ended up using.
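
If you want to run the same check yourself, the DMV can be queried from an MDX query window in SSMS. SSAS DMVs only accept a restricted SQL syntax, so the simplest approach (and it’s just a sketch of the approach, not necessarily the exact query I used) is to pull everything back and do the filtering client-side, for example in Excel:

-- One row per object, with its shrinkable and non-shrinkable memory;
-- filter and aggregate the results client-side to find your columns
SELECT * FROM $system.discover_object_memory_usage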

The results were:

  • char(1) = 323,754 bytes
  • tinyint = 304,256 bytes
  • bit = 303,840 bytes

… and yes, that is bytes. Assuming the data is spread evenly on disk, that means approximately 180 million bytes were compressed down to about 300 thousand, a compression ratio of roughly 600:1 – yoinks!

Anyway, the first pass of the results seems to show that char is mildly less efficient than tinyint and bit. However, if you take it a level deeper and just look at the memory assigned to the In-Memory Table.Columns element, which is the most significant source of the difference in memory usage, you see this:

  • char(1) = 17,944 bytes
  • tinyint = 2,344 bytes
  • bit = 2,344 bytes

So it appears that char(1) is much less efficient than tinyint or bit for storage, which end up being the same at this level of detail. For those of you who care deeply, you can view my results spreadsheet.

My conclusion? To be honest, the amount of memory used by these columns is going to be pretty trivial unless you have a gigantic dataset, but if you have a flag field on your fact tables, I would choose bit in your database and TRUE/FALSE in your tabular model if you can.


Adventures in context with MDX

One of the key concepts in MDX is context: a calculation’s results are affected by the slicers applied to the portion of the cube the calculation works with. Today I had some fun getting the context right in a query through subcubes and WHERE clauses, so here are some useful lessons. Note that the examples below run against the “Internet Operation” cube of the AdventureWorks tabular model.

Subcubes do not implicitly give context to calculated members

If you use a subcube to constrain your scope, as below, it doesn’t affect the scope of the calculated members. So take the query below, which creates a subcube by selecting FROM a secondary SELECT at the bottom:

WITH

MEMBER Measures.Context AS [Geography].[Geography].CurrentMember.UNIQUE_NAME

SELECT
{
Measures.Context
}
ON COLUMNS,

{
[Date].[Calendar].[Year]
* [Date].[Month Name].Members
}
ON ROWS

FROM

// Subcube

(
SELECT
(
[Geography].[Geography].[Country Region].&[Australia]
, [Date].[Calendar].[Year].&[2008]
) ON COLUMNS

FROM [Internet Operation]
)

Gives a set of results like this:

MDX Subcube results

So note that despite the subcube filtering to Australia, the calculated member is not aware of this scope limitation, and so it believes the CurrentMember of the Geography dimension is the default “All” level.

So, to give it the intended context, you need to include the Geography dimension on the rows, like below:

WITH

MEMBER Measures.Context AS [Geography].[Geography].CurrentMember.UNIQUE_NAME

SELECT
{
Measures.Context
}
ON COLUMNS,

{
[Geography].[Geography].[Country Region]
* [Date].[Calendar].[Year]
* [Date].[Month Name].Members
}
ON ROWS

FROM

// Subcube

(
SELECT
(
[Geography].[Geography].[Country Region].&[Australia]
, [Date].[Calendar].[Year].&[2008]
) ON COLUMNS

FROM [Internet Operation]
)

The addition of the Geography Dimension gives the calculated member the context to operate in, and the results are filtered to Australia by the filter in the subcube:

MDX Subcube results

Using a WHERE does give context to calculated members

Alternatively, using a WHERE clause does give context to calculated members:

WITH

MEMBER Measures.Context AS [Geography].[Geography].CurrentMember.UNIQUE_NAME

SELECT
{
Measures.Context
}
ON COLUMNS,

{
[Date].[Month Name].Members
}
ON ROWS

FROM [Internet Operation]

WHERE
(
[Geography].[Geography].[Country Region].&[Australia]
, [Date].[Calendar].[Year].&[2008]
)

As we can see from the results – we haven’t specified geography on the rows, just filtered in the WHERE – but the calculated member still has the context for the Geography dimension:

MDX WHERE results

Why is WHERE different from a Subcube?

The WHERE clause has a different effect to the subcube because it sets the slicer context for the Geography dimension before the calculated member is evaluated. The subcube only restricts the space being queried; it does not change the CurrentMember, so the calculated member only picks up the Geography context if it is supplied at evaluation time by the rows axis (as in the second example) or not at all (as in the first example).


Managing your history data

This post is, to an extent, a small rant about some design decisions I have been constrained by on my current project. These decisions were made predominantly for one fairly bad reason: they made the architect’s life easier (apologies to the architects if they are reading, but these were bad choices!)

The design choices in question are around the management of history data. In one component of the system it relates to database storage design; in the other it relates to cube storage design. In both cases the history data is stored in a separate location from the “current” data.

Databases: Why separate history tables are a bad idea

The first, and most compelling, reason for not storing your history data in separate tables from your current tables is that it increases complexity for users. Instead of having one location to look for data, your users now have to use two.

The second compelling reason is that there is no point in doing this from a storage point of view. SQL Server 2005 and 2008 (Enterprise editions only, admittedly) provide partitioning, which enables the contents of an individual table to be stored in separate locations on different filegroups. This means you can store the current day’s data in one place and your history in another. The reason for doing this is the same as splitting it into separate tables: querying the current section will be faster than querying the historic section. In theory, queries against partitioned tables should in fact be faster, as the current data is no longer in the same filegroup as the history data.
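
To make that concrete, here is a minimal sketch of the idea. The table, filegroup and boundary names are all hypothetical, and a real implementation would need the filegroups created up front plus a sliding boundary maintained with SPLIT/MERGE:

-- Rows before the boundary date land in the history filegroup,
-- rows on or after it land in the current filegroup
CREATE PARTITION FUNCTION pfHistorySplit (datetime)
AS RANGE RIGHT FOR VALUES ('20090101');

-- Assumes filegroups FG_History and FG_Current already exist
CREATE PARTITION SCHEME psHistorySplit
AS PARTITION pfHistorySplit TO (FG_History, FG_Current);

-- One logical table for users, physically split across the two filegroups
CREATE TABLE dbo.FactSales
(
    SaleDate   datetime      NOT NULL,
    CustomerId int           NOT NULL,
    Amount     decimal(18,2) NOT NULL
)
ON psHistorySplit (SaleDate);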

Now, there is an overhead associated with designing and maintaining partitions, but I don’t see that it is significantly larger than the overhead of the process required to archive data into separate tables on a daily basis. Additionally, when maintaining separate history tables you need to separate out every single table, whether it gets 10 rows a day or 10 million. With partitioning you can target just the large tables that need that focus.

There are other downsides to maintaining separate tables. If you make a change to a table design, you need to make it in two places. You also need to remember to update your history processes. If your history process fails, you can end up with users getting unexpected query results, or ETL failures when the system loads the next day’s data into the current tables, and untangling it all becomes a real mess. If your partition maintenance fails to run, you just have too much data in one filegroup for a while, which is unlikely to be fatal.

So if you have large tables you need to split out for performance purposes, do it at the back end, using the power of the database, which is designed to store data efficiently. Keep it away from the users; they neither need to know nor care about your need to keep the data separate. If you want to give them a single object to query for the current day’s data, just use views.
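
The view layer can be as simple as the sketch below, which reuses the hypothetical table from the partitioning example above:

-- Users query one object and never need to know about partitions or filegroups
CREATE VIEW dbo.vwFactSalesCurrentDay
AS
SELECT SaleDate, CustomerId, Amount
FROM dbo.FactSales
-- midnight today, calculated in a way that works on SQL Server 2005
WHERE SaleDate >= DATEADD(day, DATEDIFF(day, 0, GETDATE()), 0);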

Cubes: Why a separate history Cube is a bad idea

Much of the above applies here. SSAS also has partitions, so you can again store your historic and current data in separate physical locations with users being totally unaware of it. Again there is a maintenance overhead, but this also balances out against the maintenance and risks associated with keeping two identical cubes that differ only in their data source.

Use your storage options!

So without banging on about the same things any further, please consider the following two points whenever considering managing your history data:

  1. How does what I’m planning affect my users?
  2. How does what I’m planning leverage the platform’s capabilities?


Cannot View Data Mining Model in BIDS – function does not exist

I’d been running some Naive Bayes data mining models without problems as part of initiating a data mining exercise, so it was time to move on and cut the data some different ways. I set up a Decision Tree model and it processed fine, but when I tried to view it, a message box appeared telling me it wasn’t going to co-operate:

The tree graph cannot be created because of the following error:

‘Query (1,6) The ‘[System].[Microsoft].[AnalysisServices].[System].[DataMining].[DecisionTrees].[GetTreeScores]’ function does not exist.’

Fortunately someone had hit this before, as the solution is rather obscure. The install I am working against is non-standard, being split across two drives. What had happened is that the path to the data mining DLLs set up by the install process didn’t actually match where they were placed.

So when I looked at the assembly location (SSMS > AS Server > Assemblies > System > Properties), the Source path referenced a DLL that didn’t actually exist; it appears this incorrect path does not raise an error when starting the server. To fix it, I located where the DLL really was, then updated the config files where this path is stored (System.0.asm.xml and VBAMDX.0.asm.xml) to point to that path.

After a restart of the server, the models reprocessed and I could happily view the output!


Kilimanjaro, Projects Gemini and Madison Webcast

For those who haven’t seen much of Project Gemini but have heard the buzz, this TechNet webcast, An Early look at SQL Server ‘Kilimanjaro’ and project ‘Madison’, will give you a good insight. It also covers reusable Reporting Services components, which look very impressive, and Project Madison, which provides scalability features. Registration as usual is a pain, and forget trying to use the site with any browser other than IE; I wish Microsoft would make their content easier to access.

Anyway, onto the webcast. About the first quarter of it is the usual generic roadmap blurb, but then the presenter gets into the real meat of Gemini: an Excel-based in-memory analysis tool that allows joining between entities without having to know about such things, super-fast analytics (pivoting, calculation, charts and so on), and the ability to publish to SharePoint. From an OLAP analysis point of view, the Pivot Tables also have slicers (effectively table-wide filters) displayed in the spreadsheet, and it would be good if those made it into Pivot Tables generally in the next release of Office. It looks like an incredible tool and very easy to use, and may be a powerful step towards realising the ‘BI for the masses’ vision. The presenter did let slip one weakness though: much has been made of the 100 million row demo, but that data still has to be loaded into memory first, which will take significant time. I also suspect that how successful Gemini will be is going to depend on how much it relies on good data structures being in place in an organisation to support it. The Data Warehouse is going to remain the core part of any successful BI delivery.

The next component of interest was the reusable Reporting Services components. There is the concept of a library of components that can be built (e.g. standard charts, logos, gauges) and then dropped into any report, either by a developer or by a user in Report Builder 2.0. What really grabbed my attention is that these components are version aware, i.e. if the library version of a component is updated, then when you reopen the report in design mode it will let you know and give you the option to update. Again this points to ‘BI for the masses’, as you can have developers create some great components which any user can then drop into their home-grown reports. Plus, as any developer knows, there’s a lot of repetition, and any option for code re-use is always appreciated.

Finally, Project Madison was covered – and seems more about scalability up to multi-terabyte warehouses. It was a bit infrastructure focused for me so most of it passed me by, but clearly Microsoft are stepping up to try and address the market perception that they can’t scale.

Anyway, this all will be dropped in late 2010 as Kilimanjaro – an interim release of SQL Server.


Microsoft’s secret forecasting tool – the Office Suite

Last night I attended an IAPA presentation on basic forecasting concepts and the tools used, presented by the ever-interesting Eugene Dubossarsky (of Presciient, an analytics consultancy). I will skip over the forecasting content, as for the Microsoft BI community the interesting part is which tool he used for most basic forecasting activities. It was Excel. Then, when he needed to do more advanced work, he used… Excel. Only when he needed to do trickier stuff with larger amounts of data did he pull in a more heavyweight tool: Access.

That’s right: the Office suite covers the majority of a forecaster’s needs. SQL Server and Analysis Services didn’t get a look in until the really heavyweight analytics processes began. For his own purposes, however, Eugene much prefers R, an open source stats program that is free, very powerful and now a serious competitor to SAS, much to their annoyance. Microsoft are rumoured to be talking to the people behind R, and an acquisition would make sense for both sides: R is not user friendly, which Microsoft could help with, and adding the capabilities of R would allow Microsoft to take a slug at SAS’s BI market.

So, this shows that most users still aren’t fully aware of, let alone using, Excel’s capabilities; otherwise they wouldn’t be paying analytics consultants to use it for them. Microsoft are always pushing Excel further, so I’ll cover two features of Excel that even power users may not be aware of. It’s easy to forget sometimes that the 2007 Office suite wasn’t just a new, pretty interface: it also added huge BI capabilities.

The Data Mining Add-In for Excel (download for SQL Server 2008 or 2005)

This Add-In allows you to leverage the Data Mining capabilities of Analysis Services through Excel. It allows you to use Excel as the front end for creating and working with Data Mining models that exist on your server. However what really makes it interesting for Excel users is that it allows you to perform Data Mining on your spreadsheet data.

There is a Virtual Lab here explaining and demonstrating their use.

Project Gemini

This feature is slated for the next release of Excel, and is an in-memory tool for analysing large amounts of data in an OLAP style, but without all the fiddly data modelling normally required. It is a clear slug at other players in the in-memory market, such as QlikTech. The models created will also be able to be ported back to SSAS with minimal effort. For more details, read this commentary from the OLAP Report.

Microsoft has one of the most powerful BI tools in the world in Excel; users just need to be made aware!


SSAS Training Resource

I have added a link to Craig Utley’s excellent SSAS training video site, LearnMicrosoftBI.com, which contains training videos on a variety of SSAS subjects: dimensional modelling, Actions, and the one I found most useful, which explains the tricky but critical subject of Attribute Relationships (video SSAS 109). Recommended for anyone starting out in SSAS or needing concepts clarified.

Registration is required to download the videos (not sure why), but it seems to generate no spam, so it’s not a big issue, and the quality is very high for free content.
