SQL Pass Day #1

So the BI Monkey was at PASS and intended blogging on his phone or at one of the many internet pods provided, but sadly WordPress and IE / Android aren’t friends so this is going to be a retrospective.

So, here are my takeaways from day 1:

From the keynote:

Denali becomes SQL2012 and is slated for release in the first half of next year (way to allow yourself some leeway on the final release date!)

Crescent becomes Power View – and will work on multiple mobile platforms – Apple & Android via Browser, WP7 via a richer app

HADOOP is going to be supported in Windows server – the first CTP is due next year. If you have access to the PASS DVD’s / Sessions – see Dr Dewitt’s presentation on HADOOP – very enlightening

From wandering around the summit:

As per Elad Ziklik of the DQS team, performance is one of their focuses for the next release

I got to finally meet Ivan Peev of CozyRoc after many years of email and phone conversations, and he told me about their new SAS Connector – which allows reading and writing to SAS datasets without an actual SAS install.

I also got to meet Matt Masson, guru of the SSIS team who told me about Project Barcelona – a tool that will do data lineage, metadata management and impact analysis via a crawler as opposed to a manually maintained set.

I also got to sit in on a customer feedback session about BI in the Cloud – unfortunately all under NDA – but it was a great forum to discuss and help direct Microsoft’s Cloud BI ambitions.

I also had a chat with fellow Aussie BI guy Roger Noble who told me about a use for the Term Extraction transformation in SSIS – using it to scan through documents and auto-tag them as the were uploaded to SharePoint – which is pretty cool.

So, that was Day 1… Day 2 to follow!

Columnstore Indexes revisited

Having now researched Columnstore Indexes further, I thought I’d share the key learning I’ve picked up on this feature – which now sounds even more powerful than I’d originally thought.

The most important thing to take away is that a Columnstore Index should actually cover the entire table. Its name is a little misleading – the feature is less of an index, and more of a shadow copy of the table’s data, compressed with the Vertipaq voodoo. I suspect they have used the term index because the Columnstore doesn’t cover all data types – the important ones are there, but some extreme decimals and blobs are excluded – for a full list see the MSDN documentation. So for any big table, whack a Columnstore index across the entire table.

Next up is to understand how to use them and how to detect when they are or are not being used. The key thing is to only use them in isolation (e.g. summary queries) or for Inner Joins. Outer Joins don’t work right now, though there are cunning workarounds that apply if you are Outer Joining to summary data – see Eric Hanson’s video referenced below somewhere around the 50 minute mark.

You can detect when they are being used by the Execution Mode described in the Query Plan. This is new in Denali and is either Row or Batch. Row means traditional SQL Server execution and Batch means the Columnstore is being used.

So, the key takeaways:

  • For any large table put a Columnstore index across the entire table
  • Only join using Inner Joins
  • Spot the use of the Columnstore in Query plans via the Execution Mode of Batch

Useful reference material:

OLE DB is dead – Long live ODBC!

Something I picked up in passing at the MSDN SSIS forum – SQL Native Client OLE DB is being deprecated post Denali. See this post on the SQL Native Client Team Blog.

What does this mean?

  1. First up this only applies to SQL Server connections – other OLE DB connectivity such as Teradata and Oracle will be unaffected.
  2. In the short term, nothing – if you are using OLE DB it will carry on working for a fair few years yet – but if you are planning for a systems longevity – look to using ODBC for all your SQL Server connectivity needs

It’s a slightly confusing move given that ODBC underperforms OLE DB – so it’s bad news for us SSIS people, unless part of the change also includes some significant performance and capability improvements.

Columnstore indexes in Denali (aka: “Apollo”)

James Serra has a great post on a new feature in SQL Server Denali – Columnstore indexes.

The tl/dr version is this:

Columnstore indexes use the Vertipaq compression engine (that’s the shiny compression engine in PowerPivot) to further compact indexes to make querying them between 10-100 times faster.

The most significant limitation is that tables become read-only when they have a columnstore index (no Inserts / Updates / Deletes etc) – though James notes you can work around this by using Partitions if you are dealing with tables that are just additive. Otherwise indexes will need to be dropped and recreated as data changes.

So – a powerful new indexing feature which, with careful management – can have a serious positive impact on the performance of your Data Warehouse.

Ranking and Numbering rows – and subsets of rows – in T-SQL

I recently had to deal with a scenario where I needed to pivot out some rows after ordering (ranking) them according to specific rules so I could present some rows of data as columns, but in a specific order (don’t ask why, it’ll make me grind my teeth about data analysts that don’t understand how to analyse data…). The ordering in itself was only part of the solution, as to Pivot the data, the keys need to be specified in the query, so the natural keys can’t be used. The scenario is set out below:

Fig 1: Rank and Pivot. The Rank column needed to be added

Fig 1: Rank and Pivot. The Rank column needed to be added

My first thought was that I’d have to solve this with a cursor, which wasn’t a practical option as there were 1.5m rows of data to process, and if my solution involves a cursor I instantly think it’s a lousy solution. However I was pleased to discover the T-SQL function ROW_NUMBER() which allows you to add row numbering to ordered data and even subgroups of that data. (The below samples use the AdventureWorks2008 database.)

First up, basic row numbering:

SELECT ROW_NUMBER() OVER (ORDER BY ProductId) AS ID_Key
,        [ProductID]
,        [LocationID]
,        [Shelf]
,        [Bin]
,        [Quantity]

FROM [Production].[ProductInventory]

WHERE [ProductID] IN (1,2,3,4)

The above query adds an ID key to the data based on ordering by the ProductID field. The ROW_NUMBER() function requires an OVER clause to know on what basis it should assign the key, and this has to be an ORDER BY statement. The end result looks like this:

Fig 2: Simple row numbering

Fig 2: Simple row numbering

You can extend this to order within a subgroup, by specifying a PARTITION BY clause so ROW_NUMBER() operates with that subgroup. In the example below I partition by ProductId:

SELECT ROW_NUMBER() OVER (PARTITION BY ProductId ORDER BY Quantity DESC) AS Subset_ID_Key
,        [ProductID]
,        [LocationID]
,        [Shelf]
,        [Bin]
,        [Quantity]

FROM [Production].[ProductInventory]

WHERE [ProductID] IN (1,2,3,4)

Which yields this result, with the ranking now only applying within a Product Id:

Fig 3: Row numbering within a Subgroup

Fig 3: Row numbering within a Subgroup

Which can then be pivoted on the rank, as the key of the rank is now known:

SELECT ProductID
,        [1] AS Bin_1
,        [2] AS Bin_3
,        [3] AS Bin_3

FROM

(

SELECT ROW_NUMBER() OVER (PARTITION BY ProductId ORDER BY Quantity DESC) AS Subset_ID_Key
,        [ProductID]
,        [Bin]

FROM [Production].[ProductInventory]

WHERE [ProductID] IN (1,2,3,4)

) AS Pivot_Source

PIVOT

(
MAX(Bin)
FOR Subset_ID_Key IN ([1],[2],[3])
) AS Pivot_Output

Which yields this final output:

Fig 4: Ranked and Pivoted

Fig 4: Ranked and Pivoted

All done within a single query, and not a cursor in sight. ROW_NUMBER() was a great function to discover!

MSDN Documentation is here for:

  • ROW_NUMBER() – the key function
  • OVER – ordering and subgrouping the results of ROW_NUMBER
  • PIVOT – for pivoting out the results

Use the Index, Luke

While I’m on a T-SQL bent, I stumbled upon (literally, beware of StumbleUpon as a way to waste heroic amounts of time) a free eBook called “Use The Index, Luke”. Its by a chap called Markus Winand, who I freely admit to having never heard of  (and Google isn’t particularly enlightening).

However it’s a work in progress explaining in relatively clear terms what database indexes are, how they work and how they can impact performance. It’s a pretty important topic for anyone who writes SQL – so I’d recommend reading what’s there and tracking it’s progress.

Why to avoid DISTINCT and GROUP BY to get unique records

This is a quick and dirty post on the use of DISTINCT or GROUP BY to get unique records, based on something I helped a developer with over the last couple of weeks.

Their thought process was that because they were getting duplicate records, the easiest way to get rid of them was to slap a DISTINCT at the start of the query to get unique results. Which, in a sense, is OK – because it worked (sort of).However there’s two very good reasons why this is not always a good approach.

#1: Your query is wrong

If you are getting back duplicate records, what it probably means is that you are really doing your query wrong. The below example is an admittedly imperfect example of this – as the first query returns far more than intended – but was close to what I was dealing with:

USE AdventureWorks

/* Query 1: Using DISTINCT to try to eliminate duplicates */

select    DISTINCT
s.Name,
CASE
WHEN sc.ContactTypeID = 11 THEN ‘Y’
ELSE ‘N’
END    AS    ’OwnerContact’
from    Sales.Store s
left join Sales.StoreContact sc
ON s.CustomerID = sc.CustomerID

/* Query 2: Using a properly formed WHERE clause */

select    s.Name,
‘Y’ AS    ’OwnerContact’
from    Sales.Store s
left join Sales.StoreContact sc
ON s.CustomerID = sc.CustomerID
WHERE sc.ContactTypeID = 11

What I’m trying to illustrate with the example above is that if you consider more carefully what records you are bringing back in your joins, you are less likely to end up with duplicates. By making sure you are only joining to tables in such a way as to bring back the data you need is going to reduce the risk of other errors creeping in.

#2: Performance

If you are getting duplicate records, you are bringing back more data than you need. On top of this the DISTINCT or GROUP BY operations are having to go over the whole returned data set to identify unique records. This can get pretty expensive pretty quicky in terms of database operations.

From my perspective badly performing queries are a lesser sin than incorrect ones. I doubt many business users will be making decisions based on query length, but they will on the data you serve up to them.

Summing up

All I want to do in this post is make you pause and think before doing a DISTINCT or GROUP BY purely to eliminate duplicates – especially if you don’t really understand why you are getting them. A better designed and more accurate query can often get rid of the dupes and cut off the risk of bad data in the future.

Update 25 Jul 2011: Mark Caldwell articulates it better than me @ SQL Team Blog: Why I Hate DISTINCT

SQL Server BI / DW Scalability

One of the common FUD* techniques deployed against SQL Server is to raise questions about its ability to scale. Rather than blather on about technical reasons why this is bunkum, i’ll just hit you up with some numbers from this springs SQL PASS:

Category Metric
Largest single database (DW) 80 TB
Largest table 20 TB
Biggest total data 1 customer 2.5 PB
Highest transactions per second 1 db – OLTP 36,000
Fastest I/O subsystem in production 18 GB/sec
Data load for 1TB 20 minutes
Largest cube 4.2 TB



So, seriously, unless you have volumes of data on a par with MySpace… Scalability is not an issue.

An SQL alternative to the SCD

In SQL 2008 a new T-SQL construct was added - the MERGE operation. (Ok, pedants will know this wasn’t new to Oracle,  but it was new to SQL Server).

This operation allows for the merging of a dataset into a reference dataset – which can be remarkably similar to Insert / Update operations effected by the Slowly Changing Dimension transformation. However the way it operates is very different. Instead of the SCD’s row by row evaluation approach, the MERGE operation is a set based operation. What this means is it compares the whole of the source dataset to the reference dataset in a single pass. This has significant implications for performance – on a site where I implemented this the operation which took 1,200 seconds in the SCD cut down to 51 seconds using a Merge.

There are limitations and differences to be aware of:

  • You cannot directly return row counts for Insert / Update / Ignore operations in the Merge
  • As it is a bulk operation a single row will cause failure of the whole batch
  • There’s no GUI – just hand crafted SQL
  • Less error trapping / logging options
  • More flexibility in terms of actions when matches / non matches are found

The main reason why you would consider the SQL Merge – it handles Type 1, and with a little cunning, Type 2 dimensions – in a fraction of the time it takes the SCD to plod through. It’s still not as fast as a proper in memory comparison using something such as TableDifference – but it’s always good to know you have something else available in your toolkit.

Further information:

SQL Server Aliases

As part of a meeting on setting up DR systems, one of the Server Techs mentioned using SQL Server Aliases to allow the cubes to be identical in structure – right down to connections – yet live on separate environments and point at different SQL Servers. This was news to me, but a quick google turned this up: How to setup and use a SQL Server alias.

An Alias is an Operating System level setting that allows you give a friendly name to your Server – and you can then connect to the server using that friendly name. What this means is that as you migrate your code from environment to environment, if you use the same friendly name for your SQL Server in each tier of the deployment, you don’t have to change your connection strings.

It’s a neat trick (though i’m not sure how the names resolve if you use the same name in environments that can see each other) – it certainly requires some proper planning from an infrastructure point of view.

« Previous PageNext Page »