The end of the year is closing in fast, but there are still plenty of chances to learn from specialist providers Agile BI, Presciient and, of course, me!
Topics cover the full spread of DW, BI and Analytics, so there’s something for every role in the data-focused organisation.
Build your Data Warehouse in SQL Server & SSIS with the BI Monkey
Nov 24/25 – Are you about to build your Data Warehouse with Microsoft tools and want to do it right the first time?
This course is designed to help a novice understand what is involved in building a Data Warehouse both from a technical architecture and project delivery perspective. It also delivers you basic skills in the tools the Microsoft Business Intelligence suite offers you to do that with.
Ok, so there are a lot of things happening in July or later in the year which you should be aware of, led either by me or by people I know who deliver great training across Microsoft BI, SQL Server, R, Data Science, Predictive Modelling… and more.
There’s a big advert to the right – this workshop, taking place 28/29 July, will help you get to grips with the core concepts of data warehousing, give you hands-on experience implementing those concepts using the Microsoft SQL Server BI toolset, and round it out by making you aware of all the peripheral details that can enhance your solution and success.
Even if you use an adblocker, I bet you’ve all seen this:
Well, the good news is if you have a chubby Enterprise Edition SQL Server, you too can get it on a fast weight loss program.
One Weird Old Tip
Okay, okay, I’ll pack in the click-baiting. SQL Server packs a great feature called Data Compression, which compresses the data on disk using a low-CPU-intensity algorithm. You can apply it to data tables (and their associated clustered indexes) and to non-clustered indexes.
Now, there’s no such thing as a free lunch, but this isn’t far off paying a dollar for a really, really good sandwich. The cost of applying compression is that you burn CPU decompressing the data on read and compressing it on write. However, reported costs of this on a server are about a 2-5% increase in baseline CPU usage. Most Data Warehouse applications barely tickle their CPU (unless someone has written some shocking code), so this cost is to all intents and purposes negligible.
Now for the taste of the really, really good sandwich. Disk usage savings in my experience are around the 50% mark (+/- 5%). As a practical example of that, one of my clients had a problem with a 680GB database – they needed more space for more data and indexes and were scrabbling for funding before they tipped over the 1TB mark. A quick evaluation showed they could cut their disk consumption to 280GB! No more need for funding scrabbles…!
So, how do I lose this weight?
It’s so easy. (Well, as long as you have Enterprise Edition).
ALTER TABLE <table_name>
REBUILD PARTITION = ALL WITH (DATA_COMPRESSION = <option>)
Just run a quick ALTER TABLE and compression is applied. It may take a while to compress the data, but after that it’s just … on. You will most likely never notice a difference, and you don’t have to start writing queries any differently.
I’m not going to run through the subtleties of choosing ROW or PAGE compression (or COLUMNSTORE, which is your only option for columnstore tables in 2014+), but when I’ve run my evaluations on various tables I’ve generally seen PAGE showing a slight advantage.
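If you want to see what ROW vs PAGE would actually save on your tables before committing, SQL Server ships an estimator. A minimal sketch – dbo.FactSales is a placeholder table name, swap in your own:

```sql
-- Estimate what PAGE compression would save on a table before applying it
-- (dbo.FactSales is a placeholder - use your own table)
EXEC sp_estimate_data_compression_savings
    @schema_name      = 'dbo',
    @object_name      = 'FactSales',
    @index_id         = NULL,   -- all indexes
    @partition_number = NULL,   -- all partitions
    @data_compression = 'PAGE'; -- run again with 'ROW' and compare

-- If the numbers look good, apply it:
ALTER TABLE dbo.FactSales
REBUILD PARTITION = ALL WITH (DATA_COMPRESSION = PAGE);
```

Run it for both options and pick whichever gives the better ratio for that table.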
Will my shadow be fat still?
Hmm. I did promise I’d stop. There is a bonus compression feature worth talking about which applies to all the major editions (i.e. Standard, BI, Enterprise) from SQL 2008 R2 onwards: backup compression. It’s also available in 2008, but Enterprise only. There are some minor limitations, but in most circumstances you can simply compress your backups as you take them – handy, as it will significantly reduce the amount of archival storage you use, again at a negligible performance cost.
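Turning it on is a one-word change to your backup statement. A sketch – the database name and path here are placeholders:

```sql
-- Take a compressed backup (MyDW and the disk path are placeholders)
BACKUP DATABASE MyDW
TO DISK = 'D:\Backups\MyDW.bak'
WITH COMPRESSION, STATS = 10;

-- Or make compression the server-wide default so every backup gets it:
EXEC sp_configure 'backup compression default', 1;
RECONFIGURE;
```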
Hey, BI Monkey, can you help me with a script?
Why, yes I can – and here it is: Size and compress. This will size every table in your database, recommend a compression option and generate output that looks a little like this:
It was written for SQL2012 so doesn’t take into account Columnstore compressed tables in 2014, and isn’t smart enough to check the edition, but otherwise it’s a good starting point.
Now, go get your database being as fit as it should be!
Part of what I was looking at was applying compression to their tables to reduce the disk space consumption. So I ran my handy query to size the databases and found an 8GB table with 12GB of indexes! Yoinks – something was clearly astray… it turned out that one table – with about 30 columns – had 8 indexes on it.
Why not have 27 indexes?
So there’s a few good reasons:
That many indexes take a long time to build / update, slowing load performance (in this case it took about 5 minutes to load the table and a further 29 minutes to build the indexes…)
The Query Optimiser is going to struggle to pick the right index if any of them overlap
If you are on 2012+ you can just stick the entire table in a columnstore index, massively boost performance and consume less disk space (applying a columnstore on the whole table took a whopping 1 minute to build)
As a broad rule I’d consider the following rules of thumb when looking at a table’s indexes:
If I have more than 2 indexes, I’m probably not indexing well, my table design is wrong or my queries need re-evaluating
If my indexes are more than 20% of the size of the table, I should probably look at replacing them with a whole table columnstore
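Replacing the index sprawl with a whole-table columnstore is a one-liner. A sketch, assuming a fact table called dbo.FactSales (a placeholder name) – note the 2012 and 2014 flavours differ:

```sql
-- SQL Server 2014+: replace the table's storage (and all that index
-- bloat) with a single clustered columnstore index
CREATE CLUSTERED COLUMNSTORE INDEX CCI_FactSales
ON dbo.FactSales;

-- SQL Server 2012: only the nonclustered flavour exists (it makes the
-- table read-only) and you must list the columns to include:
-- CREATE NONCLUSTERED COLUMNSTORE INDEX NCCI_FactSales
-- ON dbo.FactSales (OrderDateKey, ProductKey, SalesAmount);
```

On 2014+ you may need to drop the existing clustered index and nonclustered indexes first, since the clustered columnstore replaces them all.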
Well, since then I have found a far better way to quickly get a row count, table & index size from the handy stored proc sp_spaceused. It’s worth running DBCC UPDATEUSAGE first just to ensure the stats it works with are current.
I’ve wrapped a little cursor around it so you can – extremely quickly – get these stats for a whole database (sorted by descending data size but that can be easily tweaked in that final select):
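A sketch of that cursor wrapper – the temp table column list matches what sp_spaceused returns for a table, but names and the final sort are illustrative, so treat this as a starting point rather than the exact original script:

```sql
-- Run sp_spaceused for every user table and sort by data size descending
DECLARE @table sysname;
CREATE TABLE #sizes (
    name       sysname,
    rows       varchar(20),
    reserved   varchar(20),
    data       varchar(20),
    index_size varchar(20),
    unused     varchar(20));

DECLARE table_cursor CURSOR FOR
    SELECT QUOTENAME(s.name) + '.' + QUOTENAME(t.name)
    FROM sys.tables t
    JOIN sys.schemas s ON t.schema_id = s.schema_id;

OPEN table_cursor;
FETCH NEXT FROM table_cursor INTO @table;
WHILE @@FETCH_STATUS = 0
BEGIN
    INSERT #sizes EXEC sp_spaceused @table;
    FETCH NEXT FROM table_cursor INTO @table;
END
CLOSE table_cursor;
DEALLOCATE table_cursor;

-- sp_spaceused returns sizes as strings like '123 KB', so strip the
-- suffix to sort numerically (tweak this final select to taste)
SELECT * FROM #sizes
ORDER BY CAST(REPLACE(data, ' KB', '') AS bigint) DESC;

DROP TABLE #sizes;
```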
The BI Monkey has a confession – his partner is also something of a data geek, though more buried in doing real analytics – but I have persuaded them of the benefits of using SQL instead of SAS sometimes. Right tool for the job, etc.
Now, I was providing a bit of email support and was reminded of a valuable lesson: a written requirement is not always a very good way of expressing what you want. I spent most of the day bouncing emails advising on how to write cursors (yes, I know, sorry) – and when we actually sat in front of a laptop with the data, I found out what we were actually trying to achieve and wrote a 5-line query that was literally thousands of times faster.
Are we talking about Window functions yet?
Sure, let’s get to the point. Actually I have blogged about them before, but a refresher and a different look is always helpful. Window functions allow you to define a “window” of data over which you can perform an operation against a set of rows. Note this is distinct from a GROUP BY, which aggregates the data – window functions operate at the row level and leave the row data intact. This enables things like adding a row sequence number within a group, cumulative totals, moving averages and other such shenanigans.
Now the key bit of code is the OVER clause. This allows you to specify a “window” of data over which to perform an operation. The window is defined by a PARTITION BY clause, which specifies which column’s values to use to subdivide the data. Then an ORDER BY clause specifies what order – if any – the data in the partition should be processed in. Finally, a ROWS or RANGE clause can influence the start and end points of the window.
However, that’s a lot of fine words and SQL speaks clearer:
, SUM(SalesAmount) OVER (PARTITION BY OrderDateKey ORDER BY SalesOrderNumber ROWS UNBOUNDED PRECEDING) AS CumulativeAggregation
, SUM(SalesAmount) OVER (PARTITION BY OrderDateKey ORDER BY SalesOrderNumber ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS SumOverWindow
, COUNT(SalesOrderNumber) OVER (PARTITION BY OrderDateKey ORDER BY SalesOrderNumber ROWS UNBOUNDED PRECEDING) AS RowNumberIncrement
, COUNT(SalesOrderNumber) OVER (PARTITION BY OrderDateKey ORDER BY SalesOrderNumber ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS RowNumberCountdown
So there are four columns I’m calculating across a window of “OrderDateKey”:
CumulativeAggregation – which accumulates values across the window
SumOverWindow – which totals values across the window
RowNumberIncrement – which provides an increasing row count across the window
RowNumberCountdown – which provides a decreasing row count across the window
Here are the results:
So to break these down a bit:
Here we PARTITION by OrderDateKey, so the window is based on values in the “OrderDateKey” column. In the results above, we get a window for “20050701”, “20050702”, etc.
To get the accumulation to carry out in the order we want we apply an ORDER BY on “SalesOrderNumber”.
Now for the clever bit. The ROWS UNBOUNDED PRECEDING tells the SUM function to evaluate from the start of the window. Hence you get the cumulative sum appearing – each row evaluates the total for all values for SalesAmount from the start of the window to the current row.
We follow much of what we did in CumulativeAggregation – same PARTITION, and same ORDER BY. The ORDER BY has no impact on the result but is needed for the statement to be syntactically valid.
This time we extend the ROWS UNBOUNDED PRECEDING by adding AND UNBOUNDED FOLLOWING – this now tells the SUM function to evaluate the SUM from the start of the window to the end of the window. So this time each row has the total SalesAmount for the window. Useful for calculating things such as percentage in period.
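That percentage-in-period calculation is worth spelling out. A sketch using the same columns as the snippet above – the source table is assumed to be an AdventureWorks-style sales fact table, so adjust names to suit:

```sql
-- Each order's share of its day's total sales: no ORDER BY in the OVER
-- clause, so the SUM covers the whole partition (the whole day)
SELECT OrderDateKey,
       SalesOrderNumber,
       SalesAmount,
       SalesAmount * 100.0
           / SUM(SalesAmount) OVER (PARTITION BY OrderDateKey) AS PctOfDay
FROM dbo.FactInternetSales;
```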
This time we switch the SUM to a COUNT. Here our ROWS UNBOUNDED PRECEDING counts the rows from the start of the window to the current row – effectively giving us a row number.
In our last example we change to a stated set of ROWS between CURRENT ROW AND UNBOUNDED FOLLOWING. So now the COUNT evaluates from the current row to the end of the window – which effectively gives us a countdown to the end row of the window.
Using Window Functions via the OVER clause we can efficiently perform complex aggregations in place of the awkward cursors, sub-selects and other such workarounds we may have had to lean on in the past. I’ve covered only a small proportion of the options available, but this functionality opens up some powerful analytic possibilities.
Also, talking to a user – even if it is the person you are supposed to know best in the world – is vastly more effective if you want to understand their problem and solve it efficiently!
The course gives an overview of how to build and manage a Data Warehouse using the SQL Server platform as a base, with a solid mix of Data Warehouse theory and hands-on development to get the concepts locked in. For full details see my new “Training” page, but the key outcomes are:
Learn how to get the Architecture right and keep your solution flexible
Understand the whole lifecycle of a DW project
Get hands on with the tools you need to move data
The high level agenda is:
The Three Main Data Warehouse architectures – Kimball, Inmon & Data Vault
SSIS Basics – Projects, Control Flows and Data Flows
Moving Data with SSIS
ETL Control Frameworks
How HDInsight (Hadoop) fits in
Code Management with Team Foundation Server
MDS, DQS and other useful Microsoft Toys
Click the button to register now – look forward to seeing you!
The stored procedures are pretty vanilla – see step 4 in this link for an example – they take the XML of the message body and process it within the confines of the queue.
However, the procedures can’t be executed in the query editor, so they only log their output to the error log. To debug, you need to look in the main SQL Server log – found somewhere like “C:\Program Files\Microsoft SQL Server\MSSQL11.SQLSERVER\MSSQL\Log\ERRORLOG” – openable with a text editor of your choosing.
Then hunt for lines that look like this:
2013-11-08 10:56:26.10 spid34s The activated proc ‘[dbo].[PayloadReceive]’ running on queue ‘ServiceBrokerDatabase.dbo.InboundQueue’ output the following: ‘Some useful info’
To break that down, you get timestamp, process id, and then a message that says effectively “stored procedure” on “queue” output “some stuff”. So the “some stuff” could be an error message output by SQL, or a PRINT statement in your proc as you try to test.
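Rather than opening the file by hand, you can also search the log from a query window with xp_readerrorlog – undocumented, but long-standing. A sketch:

```sql
-- Search the current SQL Server error log for activated-procedure output
-- Parameters: log number (0 = current), log type (1 = SQL Server),
-- and a search string to filter on
EXEC master.dbo.xp_readerrorlog 0, 1, N'The activated proc';
```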
The final part is to install Office – specifically Excel – and the key Add-Ins that you will need for demo purposes.
Important: We want to install 64-bit Office to leverage PowerPivot in Excel properly. So don’t use the default Setup option, but navigate to the x64 folder on the installation media and run that version of setup.
1. Run 64 Bit Office setup and get the Product Key screen:
2. Enter your key
3. Click Continue for the Licence Terms dialog. Accept the licence terms
4. Click Continue. This will bring you to the Installation Type screen:
We only want certain components so choose “Customize”, which gives the Installation Options dialog:
5. Follow the settings shown above to remove the following components:
You can be flexible with the above – for example I’ve left in PowerPoint as it’s often easier to demo the slide decks from within the VM during demos – but the lighter the install the better. Enter User Information if you want.
Important Check: If you see an additional tab in the dialog in Fig 6.3 called “Platform” you are installing 32-bit – cancel and start again, installing the 64 bit version as described at the start.
6. Once configured, click “Install Now”. It’s time for another progress bar and a meal – this one takes a while:
Eventually you will get the completion screen:
7. Click close.
8. Restart the VM, run Windows Update, Install any updates and Restart again for good measure.
9. Install the PowerPivot add-in. Run the x64 Installer:
10. Click Next for the Licence Terms screen. Accept the licence terms.
12. Click Next then enter your details on the Registration Information screen
13. Click Next for the confirmation screen
14. Click Install. Enjoy the progress bar:
15. Once complete, click Finish.
16. Install the Data Mining Add In:
18. Click Next for the Licence terms screen. Accept the licence terms.
19. Click Next to get to the Feature Selection:
In a change from the default, select the Data Mining Client for Excel.
20. Click Next for the Confirmation screen.
21. Click Install at the next screen.
22. Click Finish once completed.
And that’s it! You have a complete demo machine. Optionally you can do some further tweaks for performance or just get stuck in. If you haven’t been religiously taking snapshots, now is a very good time to take one and baseline your VM.