SQL Pass Day #3
The third and final day at SQL Pass was presaged by me at the bloggers table (though only able to manically tweet) watching Dr DeWitt’s keynote – and I can see why his keynotes are so highly regarded. His subject was Big Data – and given the potential for this to be a dull and impenetrable subject area – he gave a great and illuminating talk on the topic.
Topics that he covered included:
- ACID vs BASE (i.e the battle between consistency of data vs. availability of data)
- NoSQL is a means of querying raw data with no cleansing / structure / ETL
- His expectation is that Structured (SQL) and Unstructured (Hadoop) data will coexist in organisations
- Hadoop consists of Storage (HDFS) and Process (MapReduce)
- MapReduce is too complex to work with so languages such as Hive and Pig sit on top of it
- Sqoop is the tool to make Unstructured and Structured data talk – but performance is not good
I can’t really do his talk justice but now I understand Hadoop a whole lot better – essentially it’s just a read only store of unstructured data, a very different beast to a relational database and addressing totally different needs.
SQL Pass Day #2
So, on to day 2 of SQL Pass, and a very SSIS focused one – mainly because I attended Matt Masson’s SSIS session and learned about a whole new bunch of nice features that have made it in to the next version.
I took away the following interesting points from that session:
- CDC (Change Data Capture) is supported more effectively through some new components – a CDC Control, CDC Source and CDC Splitter
- ODBC improvements mean improved performance for non SQL Server databases
- Connection Managers get a few new features – Offline, Expression and Project indicators. Also Offline Connection Managers are picked out through a timeout and importantly now halt validation of any related components (so you no longer get those drags as SSIS tries to validate components and flows hooked to dead connections)
- File Connection Manager can now handle variable numbers of columns (i.e. it won’t crash)
- Pivot gets a UI – hurrah! (Note: SCD still sucks)
- Project Parameters can be configured at design time though Visual Studio Configurations
- Breakpoints are now in the Script Component so we can see what data is causing or components to blow up
- Data Taps – data viewers for live execution that dump out to .csv files
- Package Execution via PowerShell or even T-SQL!
Matt also gave a preview of Barcelona in action, and it looks pretty neat.
I also attended a DQS session that showed a few new features on the UI. Elad Ziklik highlighted that the CTP3 release should be viewed more as a capability preview rather than a test drive of a functional product – so looking forward to a new CTP.
One more day down – one to go…!
SQL Pass Day #1
So the BI Monkey was at PASS and intended blogging on his phone or at one of the many internet pods provided, but sadly WordPress and IE / Android aren’t friends so this is going to be a retrospective.
So, here are my takeaways from day 1:
From the keynote:
Denali becomes SQL2012 and is slated for release in the first half of next year (way to allow yourself some leeway on the final release date!)
Crescent becomes Power View – and will work on multiple mobile platforms – Apple & Android via Browser, WP7 via a richer app
HADOOP is going to be supported in Windows server – the first CTP is due next year. If you have access to the PASS DVD’s / Sessions – see Dr Dewitt’s presentation on HADOOP – very enlightening
From wandering around the summit:
As per Elad Ziklik of the DQS team, performance is one of their focuses for the next release
I got to finally meet Ivan Peev of CozyRoc after many years of email and phone conversations, and he told me about their new SAS Connector – which allows reading and writing to SAS datasets without an actual SAS install.
I also got to meet Matt Masson, guru of the SSIS team who told me about Project Barcelona – a tool that will do data lineage, metadata management and impact analysis via a crawler as opposed to a manually maintained set.
I also got to sit in on a customer feedback session about BI in the Cloud – unfortunately all under NDA – but it was a great forum to discuss and help direct Microsoft’s Cloud BI ambitions.
I also had a chat with fellow Aussie BI guy Roger Noble who told me about a use for the Term Extraction transformation in SSIS – using it to scan through documents and auto-tag them as the were uploaded to SharePoint – which is pretty cool.
So, that was Day 1… Day 2 to follow!
Columnstore Indexes revisited
Having now researched Columnstore Indexes further, I thought I’d share the key learning I’ve picked up on this feature – which now sounds even more powerful than I’d originally thought.
The most important thing to take away is that a Columnstore Index should actually cover the entire table. Its name is a little misleading – the feature is less of an index, and more of a shadow copy of the table’s data, compressed with the Vertipaq voodoo. I suspect they have used the term index because the Columnstore doesn’t cover all data types – the important ones are there, but some extreme decimals and blobs are excluded – for a full list see the MSDN documentation. So for any big table, whack a Columnstore index across the entire table.
Next up is to understand how to use them and how to detect when they are or are not being used. The key thing is to only use them in isolation (e.g. summary queries) or for Inner Joins. Outer Joins don’t work right now, though there are cunning workarounds that apply if you are Outer Joining to summary data – see Eric Hanson’s video referenced below somewhere around the 50 minute mark.
You can detect when they are being used by the Execution Mode described in the Query Plan. This is new in Denali and is either Row or Batch. Row means traditional SQL Server execution and Batch means the Columnstore is being used.
So, the key takeaways:
- For any large table put a Columnstore index across the entire table
- Only join using Inner Joins
- Spot the use of the Columnstore in Query plans via the Execution Mode of Batch
Useful reference material:
- Eric Hanson’s Tech-Ed video (a bit dull, but informative – if you know your Data Warehousing theory, skip the first 15 minutes)
- The TechNet Columnstore Index FAQ
- Columnstore Indexes for Fast Data Warehouse Query Processing in SQL Server 11.0 – Whitepaper by Eric Hanson
- MSDN documentation
OLE DB is dead – Long live ODBC!
Something I picked up in passing at the MSDN SSIS forum – SQL Native Client OLE DB is being deprecated post Denali. See this post on the SQL Native Client Team Blog.
What does this mean?
- First up this only applies to SQL Server connections – other OLE DB connectivity such as Teradata and Oracle will be unaffected.
- In the short term, nothing – if you are using OLE DB it will carry on working for a fair few years yet – but if you are planning for a systems longevity – look to using ODBC for all your SQL Server connectivity needs
It’s a slightly confusing move given that ODBC underperforms OLE DB – so it’s bad news for us SSIS people, unless part of the change also includes some significant performance and capability improvements.
Columnstore indexes in Denali (aka: “Apollo”)
James Serra has a great post on a new feature in SQL Server Denali – Columnstore indexes.
The tl/dr version is this:
Columnstore indexes use the Vertipaq compression engine (that’s the shiny compression engine in PowerPivot) to further compact indexes to make querying them between 10-100 times faster.
The most significant limitation is that tables become read-only when they have a columnstore index (no Inserts / Updates / Deletes etc) – though James notes you can work around this by using Partitions if you are dealing with tables that are just additive. Otherwise indexes will need to be dropped and recreated as data changes.
So – a powerful new indexing feature which, with careful management – can have a serious positive impact on the performance of your Data Warehouse.
Ranking and Numbering rows – and subsets of rows – in T-SQL
I recently had to deal with a scenario where I needed to pivot out some rows after ordering (ranking) them according to specific rules so I could present some rows of data as columns, but in a specific order (don’t ask why, it’ll make me grind my teeth about data analysts that don’t understand how to analyse data…). The ordering in itself was only part of the solution, as to Pivot the data, the keys need to be specified in the query, so the natural keys can’t be used. The scenario is set out below:

Fig 1: Rank and Pivot. The Rank column needed to be added
My first thought was that I’d have to solve this with a cursor, which wasn’t a practical option as there were 1.5m rows of data to process, and if my solution involves a cursor I instantly think it’s a lousy solution. However I was pleased to discover the T-SQL function ROW_NUMBER() which allows you to add row numbering to ordered data and even subgroups of that data. (The below samples use the AdventureWorks2008 database.)
First up, basic row numbering:
SELECT ROW_NUMBER() OVER (ORDER BY ProductId) AS ID_Key
, [ProductID]
, [LocationID]
, [Shelf]
, [Bin]
, [Quantity]FROM [Production].[ProductInventory]
WHERE [ProductID] IN (1,2,3,4)
The above query adds an ID key to the data based on ordering by the ProductID field. The ROW_NUMBER() function requires an OVER clause to know on what basis it should assign the key, and this has to be an ORDER BY statement. The end result looks like this:

Fig 2: Simple row numbering
You can extend this to order within a subgroup, by specifying a PARTITION BY clause so ROW_NUMBER() operates with that subgroup. In the example below I partition by ProductId:
SELECT ROW_NUMBER() OVER (PARTITION BY ProductId ORDER BY Quantity DESC) AS Subset_ID_Key
, [ProductID]
, [LocationID]
, [Shelf]
, [Bin]
, [Quantity]FROM [Production].[ProductInventory]
WHERE [ProductID] IN (1,2,3,4)
Which yields this result, with the ranking now only applying within a Product Id:

Fig 3: Row numbering within a Subgroup
Which can then be pivoted on the rank, as the key of the rank is now known:
SELECT ProductID
, [1] AS Bin_1
, [2] AS Bin_3
, [3] AS Bin_3FROM
(
SELECT ROW_NUMBER() OVER (PARTITION BY ProductId ORDER BY Quantity DESC) AS Subset_ID_Key
, [ProductID]
, [Bin]FROM [Production].[ProductInventory]
WHERE [ProductID] IN (1,2,3,4)
) AS Pivot_Source
PIVOT
(
MAX(Bin)
FOR Subset_ID_Key IN ([1],[2],[3])
) AS Pivot_Output
Which yields this final output:

Fig 4: Ranked and Pivoted
All done within a single query, and not a cursor in sight. ROW_NUMBER() was a great function to discover!
MSDN Documentation is here for:
- ROW_NUMBER() – the key function
- OVER – ordering and subgrouping the results of ROW_NUMBER
- PIVOT – for pivoting out the results
Use the Index, Luke
While I’m on a T-SQL bent, I stumbled upon (literally, beware of StumbleUpon as a way to waste heroic amounts of time) a free eBook called “Use The Index, Luke”. Its by a chap called Markus Winand, who I freely admit to having never heard of (and Google isn’t particularly enlightening).
However it’s a work in progress explaining in relatively clear terms what database indexes are, how they work and how they can impact performance. It’s a pretty important topic for anyone who writes SQL – so I’d recommend reading what’s there and tracking it’s progress.
Why to avoid DISTINCT and GROUP BY to get unique records
This is a quick and dirty post on the use of DISTINCT or GROUP BY to get unique records, based on something I helped a developer with over the last couple of weeks.
Their thought process was that because they were getting duplicate records, the easiest way to get rid of them was to slap a DISTINCT at the start of the query to get unique results. Which, in a sense, is OK – because it worked (sort of).However there’s two very good reasons why this is not always a good approach.
#1: Your query is wrong
If you are getting back duplicate records, what it probably means is that you are really doing your query wrong. The below example is an admittedly imperfect example of this – as the first query returns far more than intended – but was close to what I was dealing with:
USE AdventureWorks
/* Query 1: Using DISTINCT to try to eliminate duplicates */
select DISTINCT
s.Name,
CASE
WHEN sc.ContactTypeID = 11 THEN ‘Y’
ELSE ‘N’
END AS ’OwnerContact’
from Sales.Store s
left join Sales.StoreContact sc
ON s.CustomerID = sc.CustomerID/* Query 2: Using a properly formed WHERE clause */
select s.Name,
‘Y’ AS ’OwnerContact’
from Sales.Store s
left join Sales.StoreContact sc
ON s.CustomerID = sc.CustomerID
WHERE sc.ContactTypeID = 11
What I’m trying to illustrate with the example above is that if you consider more carefully what records you are bringing back in your joins, you are less likely to end up with duplicates. By making sure you are only joining to tables in such a way as to bring back the data you need is going to reduce the risk of other errors creeping in.
#2: Performance
If you are getting duplicate records, you are bringing back more data than you need. On top of this the DISTINCT or GROUP BY operations are having to go over the whole returned data set to identify unique records. This can get pretty expensive pretty quicky in terms of database operations.
From my perspective badly performing queries are a lesser sin than incorrect ones. I doubt many business users will be making decisions based on query length, but they will on the data you serve up to them.
Summing up
All I want to do in this post is make you pause and think before doing a DISTINCT or GROUP BY purely to eliminate duplicates – especially if you don’t really understand why you are getting them. A better designed and more accurate query can often get rid of the dupes and cut off the risk of bad data in the future.
Update 25 Jul 2011: Mark Caldwell articulates it better than me @ SQL Team Blog: Why I Hate DISTINCT
SQL Server BI / DW Scalability
One of the common FUD* techniques deployed against SQL Server is to raise questions about its ability to scale. Rather than blather on about technical reasons why this is bunkum, i’ll just hit you up with some numbers from this springs SQL PASS:
| Category | Metric |
| Largest single database (DW) | 80 TB |
| Largest table | 20 TB |
| Biggest total data 1 customer | 2.5 PB |
| Highest transactions per second 1 db – OLTP | 36,000 |
| Fastest I/O subsystem in production | 18 GB/sec |
| Data load for 1TB | 20 minutes |
| Largest cube | 4.2 TB |
So, seriously, unless you have volumes of data on a par with MySpace… Scalability is not an issue.