Creating effective date ranges from multiple sources using Window Functions

Sometimes dimension data is managed across multiple tables. To complicate things further, each source often carries its own independent effective date ranges. So when we try to tie the data together, working out which item of data is effective at any given time becomes a bit of a challenge.

A picture speaks a thousand lines of blog post, so the picture below spells it out:

Date Ranges from multiple sources

Table A has a set of data with different effective periods. Table B has a set of data for the same attribute with a completely independent set of effective periods. In data terms, it looks like this:

Date ranges from multiple sources sample data

The challenge is to join them together so we get the right combination of attributes effective at the right time, as per the first picture. Now there is a way to do it through a join, with careful selection of start / end dates in a CASE statement and filtering out of records using a WHERE clause. However that has the downfall that it cannot cope with records where there is no crossover of data – so records "1-", "4-" and "5-" have to be added in later through a separate process.

The alternative is to get the window functions voodoo doll out, and stretch the brain a little so you can do it all in one query.

Step one in this exercise is realising that each table's start dates could also be end dates in the other table, and each table's end dates (plus a day) could also be start dates in the other table. So we need to UNION the start dates from both tables with the end dates (plus a day) from both tables, like so:

SELECT    ID,    [A Start Date] AS [Start Date]
FROM    Table_A
UNION
SELECT    ID,    [B Start Date]
FROM    Table_B
UNION
-- All end dates are start dates + 1 too
SELECT    ID,    ISNULL(DATEADD(d,1,[A End Date]),'31 Dec 2099')
FROM    Table_A
UNION
SELECT    ID,    ISNULL(DATEADD(d,1,[B End Date]),'31 Dec 2099')
FROM    Table_B

Now, this gives us a full set of every possible start date – which is a starting point. The end result looks like this:

Union Results

We can repeat the same trick for end dates, then do a cartesian join on the two sets to get a combination of every possible start and end date. Now we need some criteria by which to select the right date pair. If we add a DATEDIFF to the result set it becomes obvious we want to pick the smallest date range:

Crossjoin results with DATEDIFF

A WINDOW function gives us the intelligence to pick the right row. So if we apply a ROW_NUMBER() over a PARTITION of Start Date, ordering by End Date, then we just have to select the first row of each partition:

The final result
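Putting the pieces together, a rough sketch of the whole query is below. It assumes the column names from the UNION above, wraps the start and end date candidates in CTEs, and partitions by ID as well as Start Date; the exact query (plus sample data) is in the download linked below.

-- Sketch only: build every candidate start and end date, pair them up per ID,
-- and keep the tightest valid pair for each start date
;WITH StartDates AS
(
    SELECT ID, [A Start Date] AS [Start Date] FROM Table_A
    UNION SELECT ID, [B Start Date] FROM Table_B
    UNION SELECT ID, ISNULL(DATEADD(d,1,[A End Date]),'31 Dec 2099') FROM Table_A
    UNION SELECT ID, ISNULL(DATEADD(d,1,[B End Date]),'31 Dec 2099') FROM Table_B
),
EndDates AS
(
    SELECT ID, ISNULL([A End Date],'31 Dec 2099') AS [End Date] FROM Table_A
    UNION SELECT ID, ISNULL([B End Date],'31 Dec 2099') FROM Table_B
    UNION SELECT ID, DATEADD(d,-1,[A Start Date]) FROM Table_A
    UNION SELECT ID, DATEADD(d,-1,[B Start Date]) FROM Table_B
),
DatePairs AS
(
    SELECT  s.ID
    ,       s.[Start Date]
    ,       e.[End Date]
    ,       DATEDIFF(d, s.[Start Date], e.[End Date]) AS [Range In Days]
    ,       ROW_NUMBER() OVER (PARTITION BY s.ID, s.[Start Date]
                               ORDER BY e.[End Date]) AS DateRank
    FROM    StartDates s
    JOIN    EndDates   e ON e.ID = s.ID
    WHERE   e.[End Date] >= s.[Start Date]   -- ignore pairs that run backwards
)
SELECT  ID, [Start Date], [End Date]
FROM    DatePairs
WHERE   DateRank = 1
AND     [Start Date] < '31 Dec 2099'         -- drop the open-ended sentinel date
ORDER BY ID, [Start Date]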

Now we have a complete set of effective date ranges on which to join our attribute tables!

Grab a copy of the query, including sample data scripts here: DateRanges

Don’t forget to check out my upcoming training “Build your Data Warehouse in SQL Server & SSIS with the BI Monkey” – a  primer course for anyone wanting to build a first DW using Microsoft tools.

 

Read More

Gimme a table, but hold the indexes there buddy!

I’ve just wrapped up a quick bit of work with a client who was just getting moving on their DW journey, and I found some fun things that the rookies had done.

I was also partly inspired to talk about this particular issue by this timely DBA Reactions post: When the developers hear my warning about 27 indexes on a single table


Part of what I was looking at was applying compression to their tables to reduce disk space consumption. So I ran my handy query to size the databases and found an 8GB table with 12GB of indexes! Yoinks – something was clearly astray… it turned out that one table – with about 30 columns – had 8 indexes on it.

Why not have 27 indexes?

So there are a few good reasons:

  • That many indexes take a long time to build / update, slowing load performance (in this case it took about 5 minutes to load the table and a further 29 minutes to build the indexes….)
  • The Query Optimiser is going to struggle to pick the right index if any of them overlap
  • If you are on 2012+ you can just stick the entire table in a columnstore index, massively boost performance and consume less disk space (applying a columnstore on the whole table took a whopping 1 minute to build)
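On that last point, putting the whole table into a columnstore is quick to do. A minimal sketch, assuming a 2012-era nonclustered columnstore and made-up table / column names:

-- SQL Server 2012: a nonclustered columnstore covering every column in the table
CREATE NONCLUSTERED COLUMNSTORE INDEX NCCI_BigFactTable
ON dbo.BigFactTable ([Column1], [Column2], [Column3]) -- list all of the table's columns here

-- SQL Server 2014+ can use a clustered columnstore instead, no column list required:
-- CREATE CLUSTERED COLUMNSTORE INDEX CCI_BigFactTable ON dbo.BigFactTable

Bear in mind that on 2012 a nonclustered columnstore makes the table read-only, so it needs to be dropped or disabled and rebuilt around data loads.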

As a broad rule I'd consider the following rules of thumb when looking at a table's indexes:

  • If I have more than 2 indexes, I’m probably not indexing well, my table design is wrong or my queries need re-evaluating
  • If my indexes are more than 20% of the size of the table, I should probably look at replacing them with a whole table columnstore
  • Are my indexes making use of included columns to increase their coverage?
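For that last question, included columns let a narrow index key cover a wider query. A quick sketch, with made-up object names:

-- The key stays narrow (the columns we seek / join on); the INCLUDEd columns just come
-- along for the ride so the query never has to visit the base table
CREATE NONCLUSTERED INDEX IX_FactSales_Customer_Date
ON dbo.FactSales (CustomerKey, OrderDateKey)
INCLUDE (SalesAmount, OrderQuantity)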

If any readers have some good index management tips, please share them in the comments!

Read More

Size every table in your database in a flash – Part 2

A looooong time ago I had a post on how to “Count the number of rows in every Table in a Database in no time!” which did an approximate row count and sizing of every table in a database using DMVs.

Well, since then I have found a far better way to quickly get a row count, table and index size from the handy stored proc sp_spaceused. It's worth running DBCC UPDATEUSAGE first, just to ensure the stats it works with are current.
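For example, to refresh the counts for the current database before running the sizing script:

-- Correct any stale page and row counts in the current database
DBCC UPDATEUSAGE(0)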

I’ve wrapped a little cursor around it so you can – extremely quickly – get these stats for a whole database (sorted by descending data size but that can be easily tweaked in that final select):

DECLARE @TableSize TABLE
(
[TableName] nvarchar(255)
,    [RowCount] int
,    [Reserved] nvarchar(255)
,    [Data] nvarchar(255)
,    [Index Size] nvarchar(255)
,    [Unused] nvarchar(255)
)

DECLARE @CurrentTable nvarchar(255)

DECLARE TableCursor CURSOR FOR
-- Schema-qualify the table name so sp_spaceused also works for tables outside dbo
SELECT QUOTENAME(SCHEMA_NAME(schema_id)) + '.' + QUOTENAME([name]) FROM sys.tables

OPEN TableCursor

FETCH NEXT FROM TableCursor INTO  @CurrentTable

WHILE @@FETCH_STATUS = 0

BEGIN

INSERT INTO @TableSize
exec sp_spaceused @CurrentTable

FETCH NEXT FROM TableCursor INTO  @CurrentTable

END

CLOSE TableCursor
DEALLOCATE TableCursor

SELECT * FROM @TableSize
ORDER BY CAST(REPLACE([Data],' KB','') AS int) DESC

… and this is what the results look like (Against good old AdventureWorks):

Table Sizing Results

Read More

How to do complex aggregations through Window functions in SQL Server

The BI Monkey has a confession – his partner is also somewhat of a data geek – though more buried in doing real analytics – but I have persuaded them of the benefits of using SQL instead of SAS sometimes. Right tool for the job etc. etc.

Now, I was providing a bit of email support and was reminded of a valuable lesson – a written requirement is not always a very good way of expressing what you want. I spent most of the day bouncing emails advising on how to write cursors (yes, I know, sorry), and when we actually sat in front of a laptop with the data I found out what we were actually trying to achieve and wrote a five line query that was literally thousands of times faster.

Are we talking about Window functions yet?

Sure, let’s get to the point. Actually I have blogged about them before, but a refresher and a different look is always helpful. Window functions allow you to define a “window” of data over which you can perform an operation against a set of rows. Note this is distinct from a GROUP BY, which aggregates the data away; window functions operate at the row level and leave the row data intact. This enables things like adding a row sequence number within a group, cumulative totals, moving averages and other such shenanigans.

Now the key bit of code is the OVER clause. This allows you to specify a “window” of data over which to perform an operation. The window is defined by a PARTITION BY clause, which specifies which column’s values to use to subdivide the data. Then an ORDER BY clause specifies what order – if any – the data in the partition should be processed in. Finally, a ROWS or RANGE clause can set the start and end points of the frame within the partition.

However, that’s a lot of fine words and SQL speaks clearer:

SELECT    OrderDateKey
,        SalesOrderNumber
,        SalesAmount
,        SUM(SalesAmount) OVER (PARTITION BY OrderDateKey ORDER BY SalesOrderNumber ROWS UNBOUNDED PRECEDING) AS CumulativeAggregation
,        SUM(SalesAmount) OVER (PARTITION BY OrderDateKey ORDER BY SalesOrderNumber ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS SumOverWindow
,        COUNT(SalesOrderNumber) OVER (PARTITION BY OrderDateKey ORDER BY SalesOrderNumber ROWS UNBOUNDED PRECEDING) AS RowNumberIncrement
,        COUNT(SalesOrderNumber) OVER (PARTITION BY OrderDateKey ORDER BY SalesOrderNumber ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS RowNumberCountdown

FROM [dbo].[FactInternetSales]

ORDER BY SalesOrderNumber

This will run against the AdventureWorksDW2012 sample database.

So there are four columns I’m calculating across a window of “OrderDateKey”:

  1. CumulativeAggregation – which accumulates values across the window
  2. SumOverWindow – which totals values across the window
  3. RowNumberIncrement – which provides an increasing row count across the window
  4. RowNumberCountdown – which provides a decreasing row count across the window

Here’s the results:

Window Function Results

So to break these down a bit:

CumulativeAggregation

Here we PARTITION by OrderDateKey. So the window is based on values in the “OrderDateKey” column. So in the results above, we get a window for “20050701”, “20050702”, etc.

To get the  accumulation to carry out in the order we want we apply an ORDER BY on “SalesOrderNumber”.

Now for the clever bit. The ROWS UNBOUNDED PRECEDING tells the SUM function to evaluate from the start of the window. Hence you get the cumulative sum appearing – each row evaluates the total for all values for SalesAmount from the start of the window to the current row.

SumOverWindow

We follow much of what we did in CumulativeAggregation – same PARTITION, and same ORDER BY. The ORDER BY has no impact on the result but is needed for the statement to be syntactically valid.

This time we extend the ROWS UNBOUNDED PRECEDING by adding AND UNBOUNDED FOLLOWING – this now tells the SUM function to evaluate the SUM from the start of the window to the end of the window. So this time each row has the total SalesAmount for the window. Useful for calculating things such as percentage in period.
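As a quick illustration of that last point (not from the original query), dividing each row by its window total gives its percentage of the period:

SELECT    OrderDateKey
,        SalesOrderNumber
,        SalesAmount
,        SalesAmount / NULLIF(SUM(SalesAmount) OVER (PARTITION BY OrderDateKey), 0) AS PercentOfDay

FROM [dbo].[FactInternetSales]

ORDER BY SalesOrderNumber

With no ORDER BY inside the OVER clause the default frame is the whole partition, so the denominator is simply the total for the day.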

RowNumberIncrement

This time we switch the SUM to a COUNT. Here our ROWS UNBOUNDED PRECEDING counts the rows from the start of the window to the current row – effectively giving us a row number.

RowNumberCountdown

In our last example we change the frame to ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING. So now the COUNT evaluates from the current row to the end of the window – which effectively gives us a countdown to the end row of the window.

Summary

Using window functions via the OVER clause we can efficiently perform complex aggregations in place of the awkward cursors, sub-selects and other such workarounds we may have had to lean on in the past. I’ve covered only a small proportion of the options available, but this functionality opens up some powerful analytic possibilities.

Also, talking to a user – even if it is the person you are supposed to know best in the world – is vastly more effective if you want to understand their problem and solve it efficiently!

Read More

Troubleshooting Service Broker Stored Procedures

A quick post on troubleshooting the Stored Procedures that can get called to handle the messages in a Service Broker queue.

The stored procedures are pretty vanilla – see step 4 in this link for an example – they take the XML of the message body and process it within the confines of the queue.

However the procedures can’t be executed in the query editor, so they only log their output to the error log. To debug you need to look in the main SQL Server log – found somewhere like “C:\Program Files\Microsoft SQL Server\MSSQL11.SQLSERVER\MSSQL\Log\ERRORLOG” – which you can open with a text editor of your choosing.
Then hunt for lines that look like this:

2013-11-08 10:56:26.10 spid34s     The activated proc '[dbo].[PayloadReceive]' running on queue 'ServiceBrokerDatabase.dbo.InboundQueue' output the following:  'Some useful info'

To break that down, you get timestamp, process id, and then a message that says effectively “stored procedure” on “queue” output “some stuff”. So the “some stuff” could be an error message output by SQL, or a PRINT statement in your proc as you try to test.
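If digging through the file by hand gets old, the undocumented but long-standing sp_readerrorlog proc can filter the current log from a query window. A quick sketch:

-- Search the current error log (file 0, type 1 = SQL Server log) for activation output
EXEC sp_readerrorlog 0, 1, 'activated proc'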

Read More

ETL vs SQL

I’ve recently been asked to explain why I would use an ETL tool such as SSIS over good old fashioned SQL code. Here’s a quick summary, plus some links to others discussing the same topic:

  • Easier to maintain the code – through atomicity and simpler interfaces
  • Many reusable components – a lot of code (e.g. Lookups) has already been constructed
  • More flexible than SPs – they have more functionality (e.g. Cached Lookups, FTP)
  • Can deliver better performance than SPs – use of In Memory techniques can boost performance
  • Features such as logging/audit and metadata are built in
  • Support is more broadly available – it’s easier to find out why a prebuilt, widely used component is throwing errors

I appreciate not everyone will agree, and here’s some people who do, and don’t:

 

Read More

Insert an Image into a SQL Server table

Need to store an image into a SQL Server table? Easy. First make sure you have a varbinary(max) column to store it. Then do something along these lines:

UPDATE YourTable

SET ImageColumn = ( SELECT * FROM OPENROWSET (BULK 'C:\Temp\arms4.jpg', SINGLE_BLOB) AS a)

WHERE FilterColumn = 'Filter Value'
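If the table doesn't have a suitable column yet, adding one is a one-liner (the names match the snippet above):

-- Add a varbinary(max) column to hold the image data
ALTER TABLE YourTable ADD ImageColumn varbinary(max) NULL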

Read More

Ranking and Numbering rows – and subsets of rows – in T-SQL

I recently had to deal with a scenario where I needed to pivot out some rows after ordering (ranking) them according to specific rules so I could present some rows of data as columns, but in a specific order (don’t ask why, it’ll make me grind my teeth about data analysts that don’t understand how to analyse data…). The ordering in itself was only part of the solution, as to Pivot the data, the keys need to be specified in the query, so the natural keys can’t be used. The scenario is set out below:

Fig 1: Rank and Pivot. The Rank column needed to be added

My first thought was that I’d have to solve this with a cursor, which wasn’t a practical option as there were 1.5m rows of data to process, and if my solution involves a cursor I instantly think it’s a lousy solution. However I was pleased to discover the T-SQL function ROW_NUMBER() which allows you to add row numbering to ordered data and even subgroups of that data. (The below samples use the AdventureWorks2008 database.)

First up, basic row numbering:

SELECT ROW_NUMBER() OVER (ORDER BY ProductId) AS ID_Key
,        [ProductID]
,        [LocationID]
,        [Shelf]
,        [Bin]
,        [Quantity]

FROM [Production].[ProductInventory]

WHERE [ProductID] IN (1,2,3,4)

The above query adds an ID key to the data based on ordering by the ProductID field. The ROW_NUMBER() function requires an OVER clause to know on what basis it should assign the key, and this has to be an ORDER BY statement. The end result looks like this:

Fig 2: Simple row numbering

You can extend this to order within a subgroup by specifying a PARTITION BY clause, so ROW_NUMBER() operates within that subgroup. In the example below I partition by ProductId:

SELECT ROW_NUMBER() OVER (PARTITION BY ProductId ORDER BY Quantity DESC) AS Subset_ID_Key
,        [ProductID]
,        [LocationID]
,        [Shelf]
,        [Bin]
,        [Quantity]

FROM [Production].[ProductInventory]

WHERE [ProductID] IN (1,2,3,4)

Which yields this result, with the ranking now only applying within a Product Id:

Fig 3: Row numbering within a Subgroup

Which can then be pivoted on the rank, as the key of the rank is now known:

SELECT ProductID
,        [1] AS Bin_1
,        [2] AS Bin_2
,        [3] AS Bin_3

FROM

(

SELECT ROW_NUMBER() OVER (PARTITION BY ProductId ORDER BY Quantity DESC) AS Subset_ID_Key
,        [ProductID]
,        [Bin]

FROM [Production].[ProductInventory]

WHERE [ProductID] IN (1,2,3,4)

) AS Pivot_Source

PIVOT

(
MAX(Bin)
FOR Subset_ID_Key IN ([1],[2],[3])
) AS Pivot_Output

Which yields this final output:

Fig 4: Ranked and Pivoted

All done within a single query, and not a cursor in sight. ROW_NUMBER() was a great function to discover!

MSDN Documentation is here for:

  • ROW_NUMBER() – the key function
  • OVER – ordering and subgrouping the results of ROW_NUMBER
  • PIVOT – for pivoting out the results

Read More

Use the Index, Luke

While I’m on a T-SQL bent, I stumbled upon (literally, beware of StumbleUpon as a way to waste heroic amounts of time) a free eBook called “Use The Index, Luke”. It’s by a chap called Markus Winand, who I freely admit to having never heard of (and Google isn’t particularly enlightening).

However it’s a work in progress explaining in relatively clear terms what database indexes are, how they work and how they can impact performance. It’s a pretty important topic for anyone who writes SQL – so I’d recommend reading what’s there and tracking its progress.

Read More

Why to avoid DISTINCT and GROUP BY to get unique records

This is a quick and dirty post on the use of DISTINCT or GROUP BY to get unique records, based on something I helped a developer with over the last couple of weeks.

Their thought process was that because they were getting duplicate records, the easiest way to get rid of them was to slap a DISTINCT at the start of the query to get unique results. Which, in a sense, is OK – because it worked (sort of). However there are two very good reasons why this is not always a good approach.

#1: Your query is wrong

If you are getting back duplicate records, it probably means you are doing your query wrong. The below is an admittedly imperfect example of this – the first query returns far more than intended – but it is close to what I was dealing with:

USE AdventureWorks

/* Query 1: Using DISTINCT to try to eliminate duplicates */

select    DISTINCT
s.Name,
CASE
WHEN sc.ContactTypeID = 11 THEN 'Y'
ELSE 'N'
END    AS    'OwnerContact'
from    Sales.Store s
left join Sales.StoreContact sc
ON s.CustomerID = sc.CustomerID

/* Query 2: Using a properly formed WHERE clause */

select    s.Name,
'Y' AS    'OwnerContact'
from    Sales.Store s
left join Sales.StoreContact sc
ON s.CustomerID = sc.CustomerID
WHERE sc.ContactTypeID = 11

What I’m trying to illustrate with the example above is that if you consider more carefully what records you are bringing back in your joins, you are less likely to end up with duplicates. Making sure you only join to tables in a way that brings back the data you need also reduces the risk of other errors creeping in.

#2: Performance

If you are getting duplicate records, you are bringing back more data than you need. On top of this, the DISTINCT or GROUP BY operations have to go over the whole returned data set to identify unique records. This can get pretty expensive pretty quickly in terms of database operations.

From my perspective badly performing queries are a lesser sin than incorrect ones. I doubt many business users will be making decisions based on how long a query takes to run, but they will on the data you serve up to them.

Summing up

All I want to do in this post is make you pause and think before doing a DISTINCT or GROUP BY purely to eliminate duplicates – especially if you don’t really understand why you are getting them. A better designed and more accurate query can often get rid of the dupes and cut off the risk of bad data in the future.

Update 25 Jul 2011: Mark Caldwell articulates it better than me @ SQL Team Blog: Why I Hate DISTINCT

Read More