Ok, so there’s a lot happening in July and later in the year which you should be aware of, led either by me or by people I know who deliver great training across Microsoft BI, SQL Server, R, Data Science, Predictive modelling… and more.
There’s a big advert to the right – this workshop, taking place 28/29 July, will help you get to grips with the core concepts of data warehousing, give you hands-on experience implementing those concepts using the Microsoft SQL Server BI toolset, then round it out by making you aware of all the peripheral details that can enhance your solution and success.
AgileBI – which includes Iman Eftekhari, Viktor Isakov and myself – are delivering the following workshops:
- What Every Manager Should Know About Microsoft Cloud, Power BI for Office 365 and SQL Server 2014 – half-day session on July 24th
- SQL Server 2014 & Power BI Jump Start – Full Day Workshop on July 31st
- Power BI workshop – Excel new features for reporting and data analysis – lunchtime July 10th
- Training led by the ever interesting Dr Eugene Dubossarsky in September across Canberra, Sydney & Melbourne – Sydney dates below:
- Introduction to R and data visualisation – 9/10 Sep
- Predictive modelling, data science and big data – 11/12 Sep
- Data analytics for fraud and anomaly detection, security and forensics – 16/17 Sep
- Advanced R – 18/19 Sep
Well, that title makes much more sense in the context of this post, in which I mused about the difficulty of developing against tabular models in a multi-developer environment, given there is only one .bim file to work against. I even raised a Connect item to give the SSAS team something else to mark as “Won’t Fix” for the next release (cynical, me?).
Now to stretch an analogy if the problem is two drinks and only one mouth, then the solution clearly is cocktails*!
Mix me up a BISM Normaliser, barman!
A chap called Christian Wade has kindly built a nifty Visual Studio plug-in called BISM Normaliser which handily merges two tabular models together, giving you a way to handle development in a multi-developer environment. You put two models side by side and get a comparison screen like this:
You can then merge in tables, columns, relationships, measures – all the good stuff. It’s like a diff but considerably more usable than doing a raw XML comparison. This means if you start from the same base model – advisable as tables are grouped by connections so if your connections don’t match you can’t merge – the dev team can work on separate areas and then merge it back together at some point.
It’s not a substitute for a proper multi-author environment, but at least it makes it possible. There are risks of course – it’s a no-warranty CodePlex plug-in – and you won’t get the benefits of TFS-managed components (source control, change tracking, etc.) – and the code is currently set to expire in Dec 2014, so if Christian sells the code you’ll need to buy it off someone.
Anyway – there is a partial solution – on our project we’ve given it a first pass and it seems to do what it claims and since we have no alternative it’s going to get used. So, big thanks to Christian!
*Or Jagerbombs, but let’s not go there:
Trick of a tiny database: cut down 1GB of your database every day by simply using this 1 weird old tip
Even if you use an adblocker, I bet you’ve all seen this:
Well, the good news is if you have a chubby Enterprise Edition SQL Server, you too can get it on a fast weight loss program.
One Weird Old Tip
Okay, okay, I’ll pack in the click-baiting. SQL Server packs a great feature called Data Compression. What this does is compress the data on disk using a low CPU intensity algorithm. You can apply it to the data tables (and associated clustered indexes) and non-clustered indexes.
Now, there’s no such thing as a free lunch, but this isn’t far off paying a dollar for a really, really good sandwich. The cost for applying compression is that you burn CPU decompressing the data on read and compressing it on write. However reported costs of this on a server are about a 2-5% increase in baseline CPU usage. Most Data Warehouse applications barely tickle their CPU (unless someone has written some shocking code) so this cost is to all intents and purposes negligible.
Now for the taste of the really, really good sandwich. Disk usage savings in my experience are around the 50% mark (+/- 5%). As a practical example, one of my clients had a problem with a 680GB database – they needed more space for more data and indexes and were scrabbling for funding before they tipped over the 1TB mark. A quick evaluation showed they could cut their disk consumption to 280GB! No more need for funding scrabbles…!
So, how do I lose this weight?
It’s so easy. (Well, as long as you have Enterprise Edition).
ALTER TABLE <table_name> REBUILD PARTITION = ALL WITH (DATA_COMPRESSION = <option>)
Just run a quick ALTER TABLE and compression is applied. It may take a while to compress the data, but after that it’s just … on. You will most likely never notice a difference, and you don’t have to start writing queries any differently.
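To make that concrete, here’s a minimal sketch with hypothetical table and index names – note that non-clustered indexes are compressed separately via ALTER INDEX:

-- Hypothetical names, for illustration only
ALTER TABLE dbo.FactSales REBUILD PARTITION = ALL
WITH (DATA_COMPRESSION = PAGE)

-- Non-clustered indexes need their own rebuild
ALTER INDEX IX_FactSales_OrderDateKey ON dbo.FactSales REBUILD
WITH (DATA_COMPRESSION = PAGE)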
I’m not going to run through the subtleties of whether to choose ROW or PAGE compression (or COLUMNSTORE, but that’s your only option for COLUMNSTORE tables in 2014+) but when I’ve run my evaluations on various tables I’ve generally seen PAGE showing a slight advantage.
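If you want to run that kind of evaluation yourself, SQL Server ships sp_estimate_data_compression_savings, which estimates the savings for each option before you commit to a rebuild. A quick sketch, again with a hypothetical table name:

-- Estimate PAGE savings across all indexes and partitions of a table
EXEC sp_estimate_data_compression_savings
    @schema_name      = 'dbo',
    @object_name      = 'FactSales',  -- hypothetical table
    @index_id         = NULL,         -- NULL = all indexes
    @partition_number = NULL,         -- NULL = all partitions
    @data_compression = 'PAGE'        -- run again with 'ROW' and compare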
Will my shadow be fat still?
Hmm. I did promise I’d stop. There is a bonus compression feature worth talking about which applies to all editions (i.e. Standard, BI, Enterprise) from SQL 2008 R2 onwards: backup compression. It’s also available in 2008, but Enterprise only. There are some minor limitations, but in most circumstances you can simply compress your backups as you take them. This is handy, as again it will significantly reduce the amount of archival storage you use, at again a negligible performance cost.
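Turning it on is a single WITH option on the backup command – a minimal sketch, with an illustrative database name and path:

-- Compressed backup; name and path are illustrative
BACKUP DATABASE AdventureWorksDW2012
TO DISK = 'D:\Backups\AdventureWorksDW2012.bak'
WITH COMPRESSION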
Hey, BI Monkey, can you help me with a script?
Why, yes I can – and here it is: Size and compress. This will size every table in your database, recommend a compression option and generate output that looks a little like this:
It was written for SQL2012 so doesn’t take into account Columnstore compressed tables in 2014, and isn’t smart enough to check the edition, but otherwise it’s a good starting point.
Now, go get your database as fit as it should be!
The project I’m currently working on has at its heart a whopping great tabular model with dreams of eating more tables and being the biggest, fattest tabular model there ever was. To achieve its ambition of eating all this lovely data and becoming so vast it needs an army of chefs … er, developers… to feed it.
Except it has one big problem:
It only has one “mouth” – the .bim file. Only one person at a time can work on it. If you want to do simultaneous development, realistically all you can do is maintain separate versions and merge your changes together either with a diff tool, manually, or potentially via BIML script if that eventuates.
So I’ve raised a Connect item – Componentise Tabular models to allow for multi user development – to request that the .bim file be broken up into chunks. The text of the connect is below:
At present the Tabular model is a single code object (the .bim file). This means it is not possible to break up the development of a model across multiple developers – a problem in large projects. It also makes source control less effective as you cannot track which specific objects within the model have been changed.
The model needs to be componentised so that individual elements can be worked on separately and then merged back into a main model, in a manner more similar to OLAP cubes.
Elements that could be broken out are: Perspectives, Connection Managers, Individual Tables, Calculated Columns, Measures and Relationships.
Anyway… if you think this is a problem you’d like to see solved – vote it up!
The MIST developers – Varigence – have been waving their product at me for a wee while now and I’ve finally had a chance to get into the IDE and get a better feel for it.
Before I get too carried away, here’s a quick 101. There’s this thing called BIML – an XML dialect for describing BI Assets (for now, only in the Microsoft world). This opens the door to scripting and therefore simpler dynamic generation of BI objects. BIML can be used by BIDS Helper (a thing you should have if you are an active BI developer) or the more focused BIML IDE MIST.
My feelings so far? I’m a bit “meh”. Now I know there’s a lot more capability to the product which I haven’t got to yet – so this is far from final commentary – but there are a few clear things that I think need to be looked at in the product to give it the sense of really being a development accelerator.
First up, it’s pretty clunky. It suffers heavily from “kitchen sinkism” – i.e. because it can do something, there’s a dialog box / screen / tab for it, all displayed at once. Take for example this table development screen:
There’s a lot going on and that’s on a 1920×1080 screen… some better screen space organisation is needed.
Next up is the fact that the designers don’t add a lot over the basic SSMS capability. The table designer there is effectively the same blank grid that you get in the SSMS table designer, but without even the courtesy of being able to add another row without going back to the ribbon. At this point I’d be more inclined to develop in SSMS and import to MIST.
Then my next concern is over value add / accessibility. For example, when setting up a project there’s some basic stuff that needs to be done – setting up connections, importing from databases – that should just be a wizard when starting up a fresh project. When creating a dimension, a bundle of default attributes should be added (preferably from a checklist or custom template).
So my first impression is that it needs a user experience workover. However this is a far from unique criticism of many developer tools so I won’t go too hard on them. I’ll press on with it and see how the functionality unveils itself.
I’ve just wrapped up a quick bit of work with a client who was just getting moving on their DW journey, and I found some fun things that the rookies had done.
I was also partly inspired to talk about this particular issue by this timely DBA Reactions post: When the developers hear my warning about 27 indexes on a single table
Part of what I was looking at was applying compression to their tables to reduce disk space consumption. So I ran my handy query to size the database and found an 8GB table with 12GB of indexes! Yoinks – something was clearly awry… it turned out that one table – with about 30 columns – had 8 indexes on it.
Why not have 27 indexes?
So there are a few good reasons:
- That many indexes take a long time to build / update, slowing load performance (in this case it took about 5 minutes to load the table and a further 29 minutes to build the indexes…)
- The Query Optimiser is going to struggle to pick the right index if any of them overlap
- If you are on 2012+ you can just stick the entire table in a columnstore index, massively boost performance and consume less disk space (applying a columnstore on the whole table took a whopping 1 minute to build) – see the sketch below
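As a sketch of that last point – on 2014+ a clustered columnstore replaces the table’s storage outright, while on 2012 you’re limited to a read-only non-clustered columnstore (table and index names are hypothetical):

-- SQL Server 2014+: clustered columnstore replaces the heap/clustered index
CREATE CLUSTERED COLUMNSTORE INDEX CCI_FactSales ON dbo.FactSales

-- SQL Server 2012: non-clustered columnstore over all columns (makes the table read-only)
-- CREATE NONCLUSTERED COLUMNSTORE INDEX NCCI_FactSales
--     ON dbo.FactSales (OrderDateKey, ProductKey, SalesAmount /* ...remaining columns */)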
As a broad rule I’d consider the following rules of thumb when looking at a table’s indexes:
- If I have more than 2 indexes, I’m probably not indexing well, my table design is wrong or my queries need re-evaluating
- If my indexes are more than 20% of the size of the table, I should probably look at replacing them with a whole table columnstore
- Are my indexes making use of included columns to increase their coverage? (example below)
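On that last point, a covering index with INCLUDE keeps the key narrow while still answering a query from the index alone – a quick sketch with hypothetical names:

-- Key column stays narrow; included columns extend coverage without bloating the key
CREATE NONCLUSTERED INDEX IX_FactSales_OrderDateKey
ON dbo.FactSales (OrderDateKey)
INCLUDE (ProductKey, SalesAmount)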
If any readers have some good index management tips, please share them in the comments!
A looooong time ago I had a post on how to “Count the number of rows in every Table in a Database in no time!”, which did an approximate row count and sizing of every table in a database using DMVs.
Well, since then I have found a far better way to quickly get a row count plus table & index size: the handy stored proc sp_spaceused. It’s worth running DBCC UPDATEUSAGE first just to ensure the stats it works with are current.
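For a single table the pattern is just two statements – for example (table name illustrative):

-- Refresh page/row counts for the current database, then size one table
DBCC UPDATEUSAGE (0)
EXEC sp_spaceused 'dbo.FactInternetSales'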
I’ve wrapped a little cursor around it so you can – extremely quickly – get these stats for a whole database (sorted by descending data size but that can be easily tweaked in that final select):
DECLARE @TableSize TABLE
(
  [Name] nvarchar(255)
, [RowCount] int
, [Reserved] nvarchar(255)
, [Data] nvarchar(255)
, [Index Size] nvarchar(255)
, [Unused] nvarchar(255)
)

DECLARE @CurrentTable nvarchar(255)

DECLARE TableCursor CURSOR FOR
SELECT [Name] FROM sys.tables

OPEN TableCursor
FETCH NEXT FROM TableCursor INTO @CurrentTable

WHILE @@FETCH_STATUS = 0
BEGIN
    -- sp_spaceused returns name, rows, reserved, data, index size and unused space
    INSERT INTO @TableSize
    EXEC sp_spaceused @CurrentTable

    FETCH NEXT FROM TableCursor INTO @CurrentTable
END

CLOSE TableCursor
DEALLOCATE TableCursor

-- Sorted by descending data size; tweak this final select as needed
SELECT * FROM @TableSize
ORDER BY CAST(REPLACE([Data], ' KB', '') AS int) DESC
… and this is what the results look like (Against good old AdventureWorks):
Thanks to the lovely people over at Syncfusion I’ve been able to share my experiences with HDInsight in a short eBook for the “Succinctly” series which was released this weekend. It is unsurprisingly called “HDInsight Succinctly” and is free to download.
For a quick precis this is the summary description from the website:
Master the higher-level languages and other features necessary to process data with HDInsight. Learn how to set up and manage HDInsight clusters on Azure, how to use Azure Blob Storage to store input and output data, connect with Microsoft BI, and much more. With the guidance of author James Beresford, HDInsight Succinctly will reveal a new avenue of data management.
You can also read my guest blog on Syncfusion’s site entitled “Being on the cutting edge is fun!” where I spend a bit of time extolling the virtues of HDInsight and Microsoft BI and getting all excited about analytics and its place in the future.
The BI Monkey has a confession – his partner is also somewhat of a data geek – though more buried in doing real analytics – but I have persuaded them of the benefits of using SQL instead of SAS sometimes. Right tool for the job etc. etc.
Now, I was providing a bit of email support and was reminded of a valuable lesson: a written requirement is not always a very good way of expressing what you want. I spent most of the day bouncing emails advising how to write cursors (yes, I know, sorry) – and when we actually sat in front of a laptop with the data, I found out what we were actually trying to achieve and wrote a 5-line query that was literally thousands of times faster.
Are we talking about Window functions yet?
Sure, let’s get to the point. Actually I have blogged about them before, but a refresher and a different look is always helpful. Window functions allow you to define a “window” of data over which you can perform an operation against a set of rows. Note this is distinct from a GROUP BY, which aggregates the data – window functions operate at the row level and leave the row data intact. This enables things like adding a row sequence number within a group, cumulative totals, moving averages and other such shenanigans.
Now the key bit of code is the OVER clause. This allows you to specify a “window” of data over which to perform an operation. The window is defined by a PARTITION BY clause, which specifies which column’s values to use to subdivide the data. Then an ORDER BY clause specifies what order – if any – the data in the partition should be processed in. Finally a ROWS or RANGE clause can influence the start and end points of the window within the partition.
However, that’s a lot of fine words and SQL speaks clearer:
SELECT SalesOrderNumber
, OrderDateKey
, SalesAmount
, SUM(SalesAmount) OVER (PARTITION BY OrderDateKey ORDER BY SalesOrderNumber ROWS UNBOUNDED PRECEDING) AS CumulativeAggregation
, SUM(SalesAmount) OVER (PARTITION BY OrderDateKey ORDER BY SalesOrderNumber ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS SumOverWindow
, COUNT(SalesOrderNumber) OVER (PARTITION BY OrderDateKey ORDER BY SalesOrderNumber ROWS UNBOUNDED PRECEDING) AS RowNumberIncrement
, COUNT(SalesOrderNumber) OVER (PARTITION BY OrderDateKey ORDER BY SalesOrderNumber ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS RowNumberCountdown
FROM dbo.FactInternetSales
ORDER BY SalesOrderNumber
This will run against the AdventureWorksDW2012 sample database.
So there are four columns I’m calculating across a window of “OrderDateKey”
- CumulativeAggregation – which accumulates values across the window
- SumOverWindow – which totals values across the window
- RowNumberIncrement – which provides an increasing row count across the window
- RowNumberCountdown – which provides a decreasing row count across the window
Here’s the results:
So to break these down a bit:
CumulativeAggregation
Here we PARTITION by OrderDateKey, so the window is based on values in the “OrderDateKey” column. In the results above, we get a window for “20050701”, “20050702”, etc.
To get the accumulation to carry out in the order we want we apply an ORDER BY on “SalesOrderNumber”.
Now for the clever bit. The ROWS UNBOUNDED PRECEDING tells the SUM function to evaluate from the start of the window. Hence you get the cumulative sum appearing – each row evaluates the total for all values for SalesAmount from the start of the window to the current row.
SumOverWindow
We follow much of what we did in CumulativeAggregation – same PARTITION, and same ORDER BY. The ORDER BY has no impact on the result here but is needed for the statement to be syntactically valid.
This time we extend the ROWS UNBOUNDED PRECEDING by adding AND UNBOUNDED FOLLOWING – this now tells the SUM function to evaluate the SUM from the start of the window to the end of the window. So this time each row has the total SalesAmount for the window. Useful for calculating things such as percentage in period.
RowNumberIncrement
This time we switch the SUM to a COUNT. Here our ROWS UNBOUNDED PRECEDING counts the rows from the start of the window to the current row – effectively giving us a row number.
RowNumberCountdown
In our last example we change to a stated frame of ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING. Now the COUNT evaluates from the current row to the end of the window – which effectively gives us a countdown to the end row of the window.
Using window functions via the OVER clause we can efficiently perform complex aggregations in place of the awkward cursors, sub-selects and other workarounds we may have had to lean on in the past. I’ve covered only a small proportion of the options available, but this functionality opens up some powerful analytic possibilities.
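As one example of the options not covered above, a bounded ROWS frame gives you a moving average – a quick sketch against the same AdventureWorksDW2012 data as the main example:

-- 3-row moving average within each order date
SELECT SalesOrderNumber
, SalesAmount
, AVG(SalesAmount) OVER (PARTITION BY OrderDateKey ORDER BY SalesOrderNumber ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS MovingAvg3Row
FROM dbo.FactInternetSales
ORDER BY SalesOrderNumber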
Also, talking to a user – even if it is the person you are supposed to know best in the world – is vastly more effective if you want to understand their problem and solve it efficiently!
The eagle eyed among you may have noticed I’ve added a big countdown banner to the right of the page announcing a course “Build your Data Warehouse in SQL Server & SSIS with the BI Monkey”, running for 2 days from July 28.
The course is aimed at giving an overview of how to build and manage a Data Warehouse using the SQL Server platform as a base. It’s aimed to be a solid mix of Data Warehouse theory and hands on development to get the concepts locked in. For full details see my new “Training” page, but the key outcomes are:
- Learn how to get the Architecture right and keep your solution flexible
- Understand the whole lifecycle of a DW project
- Get hands on with the tools you need to move data
The high level agenda is:
- The Three Main Data Warehouse architectures – Kimball, Inmon & Data Vault
- SSIS Basics – Projects, Control Flows and Data Flows
- Moving Data with SSIS
- ETL Control Frameworks
- Database considerations
- How HDInsight (Hadoop) fits in
- Code Management with Team Foundation Server
- Build Automation
- MDS, DQS and other useful Microsoft Toys
Click the button to register now – look forward to seeing you!