Creating effective date ranges from multiple sources using Window Functions

Sometimes dimension data is managed across multiple tables. Just to complicate things sometimes this data has independent effective date ranges on these sources. So when we try to tie our data together, trying to pick which item of data is effective when is a bit of a challenge.

A picture speaks a thousand lines of blog post, so the picture below spells it out:

Table A has a set of data with different effective periods. Table B has a set of data for the same attribute with a completely independent set of effective periods. In data terms, it looks like this:

The challenge is to join them together so we get the right combination of attributes effective at the right time, as per the first picture. Now there is a way to do it through a join with careful selection of start / end  dates in a CASE statement and filtering out of records using  WHERE clause. However that has the downfall that it cannot cope with records where there is no cross over of data – so records  “1-“,”4-” and  “5-” have to be added in later through a separate process.

The alternative is to get the window functions voodoo doll out, and stretch the brain a little so you can do it all in one query.

Step one in this exercise is realising that each tables start dates could also be end dates in the other table, and each tables end dates could also be start dates (less a day) in the other table. So we need to UNION End Dates from Table A with Start Dates from Table B, like so:

SELECT    ID,    [A Start Date] AS [Start Date]
FROM    Table_A
UNION
SELECT    ID,    [B Start Date]
FROM    Table_B
UNION
— All end dates are start dates + 1 too
SELECT    ID,    ISNULL(DATEADD(d,1,[A End Date]),’31 Dec 2099′)
FROM    Table_A
UNION
SELECT    ID,    ISNULL(DATEADD(d,1,[B End Date]),’31 Dec 2099′)
FROM    Table_B

Now, this gives us a full set of every possible start date – which is a starting point.The end result looks like this:

We can repeat the same trick for end dates and then do a cartesian join on the two sets and then we get a combination of every possible start and end date. No we need some criteria by which to select the right date pair. If we add a DATEDIFF to the resultset it becomes obvious we want to pick the smallest date range:

A WINDOW function gives us the intelligence to pick the right row. So if we apply a ROW_NUMBER() over a PARTITION of Start Date, ordering by End Date, then we just have to select the first row of each partition:

Now we have a complete set of effective date ranges on which to join our attribute tables!

Grab a copy of the query, including sample data scripts here: DateRanges

Don’t forget to check out my upcoming training “Build your Data Warehouse in SQL Server & SSIS with the BI Monkey” – a  primer course for anyone wanting to build a first DW using Microsoft tools.

Trick of a tiny database: cut down 1GB of your database every day by simply using this 1 weird old tip

Even if you use an adblocker, I bet you’ve all  seen this:

Well, the good news is if you have a chubby Enterprise Edition SQL Server, you too can get it on a fast weight loss program.

One Weird Old Tip

Okay, okay, I’ll pack in the click-baiting. SQL Server packs a great feature called Data Compression. What this does is compress the data on disk using a low CPU intensity algorithm. You can apply it to the data tables (and associated clustered indexes) and non-clustered indexes.

Now, there’s no such thing as a free lunch, but this isn’t far off paying a dollar for a really, really good sandwich. The cost for applying compression is that you burn CPU decompressing the data on read and compressing it on write. However reported costs of this on a server are about a 2-5% increase in baseline CPU usage. Most Data Warehouse applications barely tickle their CPU (unless someone has written some shocking code) so this cost is to all intents and purposes negligible.

Now for the taste of the really, really good sandwich. Disk usage savings in my experience are around the 50% mark (+/- 5%). As a practical example of that, one of my clients had a problem with a 680GB database – they needed more space for more data and indexes and were scrabbling for funding before they tipped over the 1TB mark. A quick evaluation showed they could cut their disk consumption to  to 280GB! No more need for funding scrabbles…!

So, how do I lose this weight?

It’s so easy. (Well, as long as you have Enterprise Edition).

```ALTER TABLE <table_name>
REBUILD PARTITION = ALL WITH (DATA_COMPRESSION =  <option>)
```

Just run a quick ALTER TABLE and compression is applied. It may take a while to compress the data, but after that it’s just … on. You will most likely never notice a difference, and you don’t have to start writing queries any differently.

I’m not going to run through the subtleties of whether to choose ROW or PAGE compression (or COLUMNSTORE, but that’s your only option for COLUMNSTORE tables in 2014+) but when I’ve run my evaluations on various tables I’ve generally seen PAGE showing a slight advantage.

Will my shadow be fat still?

Hmm. I did promise I’d stop. So there is a bonus compression feature which applies to all editions worth talking about (i.e. Standard, BI, Enterprise) from SQL2008R2 which is backup compression. It’s also available in 2008 but Enterprise only. There are some minor limitations, but in most circumstances you can simply compress your backups as you take them. Which is handy as again it will significantly reduce the amount of archival storage you use at again a negligible performance cost.

Hey, BI Monkey, can you help me with a script?

Why, yes I can – and here it is: Size and compress. This will size every table in your database, recommend a compression option and generate output that looks a little like this:

It was written for SQL2012 so doesn’t take into account Columnstore compressed tables in 2014, and isn’t smart enough to check the edition, but otherwise it’s a good starting point.

Now, go get your database being as fit as it should be!

Gimme a table, but hold the indexes there buddy!

I’ve just wrapped up a quick bit of work with a client who was just getting moving on their DW journey, and I found some fun things that the rookies had done.

I was also partly inspired to talk about this particular issue by this timely DBA Reactions post: When the developers hear my warning about 27 indexes on a single table

Part of what I was looking at was applying compression to their tables to reduce the disk space consumption. So I ran my handy query to size the databases and found an 8GB table with 12GB of indexes! Yoinks – something was clearly astray…  it turned out that one table – with about 30 columns – had 8 indexes on it.

Why not have 27 indexes?

So there’s a few good reasons:

• That many indexes take a long time to build / update, slowing load performance (in this case it took about 5 minutes to load the table and a further 29 minutes to build the indexes….)
• The Query Optimiser is going to struggle to pick the right index if any of them overlap
• If you are on 2012+ you can just stick the entire table in a columnstore index, massively boost performance and consume less disk space (applying a columnstore on the whole table took a whopping 1 minute to build)

As a broad rule I’d consider these rules of thumb following when looking at a tables indexes:

• If I have more than 2 indexes, I’m probably not indexing well, my table design is wrong or my queries need re-evaluating
• If my indexes are more than 20% of the size of the table, I should probably look at replacing them with a whole table columnstore
• Are my indexes making use of included columns to increase their coverage?

Size every table in your database in a flash – Part 2

A looooong time a go I had a post on how to “Count the number of rows in every Table in a Database in no time!” which did an approximate row count and sizing of every table in a database using DMV’s.

Well, since then then I have found  a far better way to quickly get a row count, table & index size from the handy stored proc sp_spaceused. It’s worth running DBCC UPDATEUSAGE first just to ensure the stats it works with are current.

I’ve wrapped a little cursor around it so you can – extremely quickly – get these stats for a whole database (sorted by descending data size but that can be easily tweaked in that final select):

DECLARE @TableSize TABLE
(
[TableName] nvarchar(255)
,    [RowCount] int
,    [Reserved] nvarchar(255)
,    [Data] nvarchar(255)
,    [Index Size] nvarchar(255)
,    [Unused] nvarchar(255)
)

DECLARE @CurrentTable nvarchar(255)

DECLARE TableCursor CURSOR FOR
SELECT [Name] FROM SYS.tables

OPEN TableCursor

FETCH NEXT FROM TableCursor INTO  @CurrentTable

WHILE @@FETCH_STATUS = 0

BEGIN

INSERT INTO @TableSize
exec sp_spaceused @CurrentTable

FETCH NEXT FROM TableCursor INTO  @CurrentTable

END

CLOSE TableCursor
DEALLOCATE TableCursor

SELECT * FROM @TableSize
order by CAST(REPLACE([Data],‘ KB’,) as int) desc

… and this is what the results look like (Against good old AdventureWorks):

How to do complex aggregations through Window functions in SQL Server

The BI Monkey has a confession – his partner is also somewhat of a data geek – though more buried in doing real analytics – but I have persuaded them of the benefits of using SQL instead of SAS sometimes. Right tool for the job etc. etc.

Now I was providing a bit of email support and was reminded of a valuable lesson – a written requirement is not always a very good way of expressing what you want – I spent most of the day bouncing emails advising about how to write cursors (yes, I know, sorry) – and when we actually sat in front of a laptop with the data I found out what it was we actually were trying to achieve and wrote a 5 line query that was literally thousands of times faster.

Are we talking about Window functions yet?

Sure, let’s get to the point. Actually I have blogged about them before but a refresher and different look is always helpful. Window functions allow you to define a “window” of data over which you can perform an operation against a set of rows. Note this is distinct from a GROUP BY which will aggregate the data. Window functions operate at the row level and leave the row data intact. The kind of capabilities this enables is things like adding a row sequence number within a group, cumulative totals, moving averages and other such shenanigans.

Now the key bit of code is the OVER clause. This allows you to specify a “window” of data over which to perform an operation. The window is defined by a PARTITION clause which specifies which column’s values to use to subdivide the data. Then an ORDER BY clause specifies what order – if any – the data in the PARTITION should be processed in. Finally a ROW or RANGE statement can influence the start and end points of the PARTITION.

However, that’s a lot of fine words and SQL speaks clearer:

SELECT    OrderDateKey
,        SalesOrderNumber
,        SalesAmount
,        SUM(SalesAmount) OVER (PARTITION BY OrderDateKey ORDER BY SalesOrderNumber ROWS UNBOUNDED PRECEDINGAS CumulativeAggregation
,        SUM(SalesAmount) OVER (PARTITION BY OrderDateKey ORDER BY SalesOrderNumber ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWINGAS SumOverWindow
,        COUNT(SalesOrderNumber) OVER (PARTITION BY OrderDateKey ORDER BY SalesOrderNumber ROWS UNBOUNDED PRECEDINGAS RowNumberIncrement
,        COUNT(SalesOrderNumber) OVER (PARTITION BY OrderDateKey ORDER BY SalesOrderNumber ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWINGAS RowNumberCountdown

FROM [dbo].[FactInternetSales]

ORDER BY SalesOrderNumber

This will run against the AdventureWorksDW2012 sample database.

So there are four columns I’m calculating across a window of “OrderDateKey”

1. CumulativeAggregation – which accumulates values across the window
2. SumOverWindow – which totals values across the window
3. RowNumberIncrement – which provides an increasing row count across the window
4. RowNumberCountdown – which provides a decreasing row count across the window

Here’s the results:

So to break these down a bit:

CumulativeAggregation

Here we PARTITION by OrderDateKey. So the window is based on values in the “OrderDateKey” column. So in the results above, we get a window for “20050701”, “20050702”, etc.

To get the  accumulation to carry out in the order we want we apply an ORDER BY on “SalesOrderNumber”.

Now for the clever bit. The ROWS UNBOUNDED PRECEDING tells the SUM function to evaluate from the start of the window. Hence you get the cumulative sum appearing – each row evaluates the total for all values for SalesAmount from the start of the window to the current row.

SumOverWindow

We follow much of what we did in CumulativeAggregation – same PARTITION, and same ORDER BY. The ORDER BY has no impact on the result but is needed for the statement to be syntactically valid.

This time we extend the ROWS UNBOUNDED PRECEDING by adding AND UNBOUNDED FOLLOWING – this now tells the SUM function to evaluate the SUM from the start of the window to the end of the window. So this time each row has the total SalesAmount for the window. Useful for calculating things such as percentage in period.

RowNumberIncrement

This time we switch the SUM to a COUNT. Here our ROWS UNBOUNDED preceding counts the rows from the start of the window to the current row – effectively giving us a row number.

RowNumberCountdown

In our last example we change to a stated set of ROWS between CURRENT ROW AND UNBOUNDED FOLLOWING. So now the COUNT evaluates from the current row to the end of the window – which effectively gives us a countdown to the end row of the window.

Summary

Using Window Functions via the OVER clause we can efficiently perform complex aggregations in place of awkward cursors, sub-selects and other such workarounds we may have had to lean on in the past. I’ve covered a small proportion of the options available but this functionality opens up so powerful analytic possibilities.

Also, talking to a user – even if it the person you are supposed to know best in the world – is vastly more effective if you want to understand their problem and solve it efficiently!

SQL Server Data Warehouse Training in Sydney – July 28

The eagle eyed among you may have noticed I’ve added a big countdown banner to the right of the page announcing a course “Build your Data Warehouse in SQL Server & SSIS with the BI Monkey“, running for 2 days from July 28.

The course is aimed at giving an overview of how to build and manage a Data Warehouse using the SQL Server platform as a base. It’s aimed to be a solid mix of Data Warehouse theory and hands on development to get the concepts locked in. For full details see my new “Training” page, but the key outcomes are:

• Learn how to get the Architecture right and keep your solution flexible
• Understand the whole lifecycle of a DW project
• Get hands on with the tools you need to move data

The high level agenda is:

• The Three Main Data Warehouse architectures – Kimball, Inmon & Data Vault
• SSIS Basics – Projects, Control Flows and Data Flows
• Moving Data with SSIS
• ETL Control Frameworks
• Database considerations
• How HDInsight (Hadoop) fits in
• Code Management with Team Foundation Server
• Testing
• Build Automation
• Deployment
• Management
• MDS, DQS and other useful Microsoft Toys

Click the button to register now – look forward to seeing you!

Troubleshooting Service Broker Stored Procedures

A quick post on troubleshooting the Stored Procedures that can get called to handle the messages in a Service Broker queue.

The stored procedures are pretty vanilla – see step 4 in this link for an example – take the XML of the message body and process it within the confines of the queue.

However the procedures cant be executed in the query editor so they only log their output to the error log – so to debug you need to look in the main SQL Server log – found at somewhere like “C:\Program Files\Microsoft SQL Server\MSSQL11.SQLSERVER\MSSQL\Log\ERRORLOG” – openable with a text editor of your choosing.
Then hunt for lines that look like this:

2013-11-08 10:56:26.10 spid34s     The activated proc ‘[dbo].[PayloadReceive]’ running on queue ‘ServiceBrokerDatabase.dbo.InboundQueue’ output the following:  ‘Some useful info’

To break that down, you get timestamp, process id, and then a message that says effectively “stored procedure” on “queue” output “some stuff”. So the “some stuff” could be an error message output by SQL, or a PRINT statement in your proc as you try to test.

The Geometry Data Type and SSRS Floorplan Reports

I was looking for a solution on how to build dynamic floorplans in SSRS, expecting to have to battle Visio to do such a thing. It turned out that a new data type, the Geometry spatial data type was introduced way back in SQL2008. This allows the creation and storage of polygon shapes in SQL Server. So now I have a database based mechanism to store objects on the floorplan, the floor itself and all the objects positions on the floor. Reporting Services can then surface all this using standard reporting capabilities.

How the Geometry Data Type works

The Geometry data type works on a simple x,y coordinate system to describe a polygon (official MSDN docco here). It can get more complicated than that, but this is the basic idea:

On a simple zero based set of x,y coordinates, you describe the path you take around the polygon – importantly ending back where you started – using the coordinates of each point of the polygon. In the case above, I’ve used a square, so that needs 5 points:

1. 1,1 (start)
2. 3,1
3. 3,3
4. 1,3
5. 1,1 (finish back at start)

We can describe the Polygon construct as follows in T-SQL:

‘POLYGON((1 1, 3 1, 3 3, 1 3, 1 1))’

So, to use this data first we need to construct a table:

CREATE TABLE [dbo].[GeometryTest](
[ID] [int] NOT NULL,
[Shape] [geometry] NOT NULL,
[Name] [nvarchar](50) NOT NULL,
[Value] [int] NULL,
CONSTRAINT [PK_GeometryTest] PRIMARY KEY CLUSTERED
(
[ID] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY]

GO

Note there’s no configuration of the Geometry data type, or anything that needs to be enabled on the server to support this. I’ve included a Value column for the report, and an ID & Name column for metadata purposes.

Next I’ll load the table with a floorspace, and a few shapes – a couple of triangles and squares:

INSERT INTO [GeometryTest]
VALUES (1,’POLYGON((0 0, 7 0, 7 7, 0 7, 0 0))’,’Container’, NULL);

INSERT INTO [GeometryTest]
VALUES (2,’POLYGON((1 1, 3 1, 3 3, 1 3, 1 1))’,’Box One’,1);

INSERT INTO [GeometryTest]
VALUES (3,’POLYGON((4 1, 6 1, 6 3, 4 3, 4 1))’,’Box Two’,5);

INSERT INTO [GeometryTest]
VALUES (4,’POLYGON((1 4, 3 4, 3 6, 1 4))’,’Triangle One’,2);

INSERT INTO [GeometryTest]
VALUES (5,’POLYGON((4 4, 6 4, 4 6, 4 4))’,’Triangle Two’,3);

We can check this works with the handy spatial previewer in SSMS, running a select all against the table:

Using SSRS to generate a Spatial report

Next, in SSDT (or BIDS, if not running SQL2012), create a new report with this table we just created and loaded as a data source. Drag a Map report item from the toolbox onto the report canvas, and choose to use a SQL Server Spatial Query as your data source. You should have a data set available – if not, just create one as a select * from the our test table. Click next to get to the previewer – it will autodetect your geometry column, so just click next again.

When choosing a report type, I’ve opted to take a Color Analytical Map, then clicked next. For the Analytical dataset, I’m reusing the same table as I stored the values with the shapes. At the Data Visualisation screen, we need to change the “Field to Visualise” to the [Sum(Value)] option as there’s no autodetect here. Then click finish.

To make it display in pretty colours, select the MapPolygonLayer object, and change the Polygon Rules > Colour Rules > Distribution Type property to “Equal Distribution”. Then run the report:

Simple, but proves a point and shows how you can create floorplan style reports from a dynamic data source.

Insert an Image into a SQL Server table

Need to store an image into a SQL Server table? Easy. First make sure you have a varbinary(max) column to store it. Then do something along these lines:

UPDATE YourTable

SET ImageColumn= ( SELECT * FROM OPENROWSET (BULK ‘C:\Temp\arms4.jpg’, SINGLE_BLOB) AS a)

WHERE FilterColumn = ‘Filter Value’

Build your own SQL 2012 Demo Machine – Part 7 – Performance Tweaks

Step 8: Tweak Windows Server 2008R2

In the final  (optional) part we’ll apply some performance tweaks from BlackViper, just to lighten the VM and make it run a bit more smoothly.

TAKE A BACKUP NOW! You can break your VM doing this so snapshot, backup – whatever you prefer – but do it. Now.

Disable the following services:

• Diagnostic Policy Service
• Diagnostic Policy Host
• Diagnostic System Host
• IP Helper
• Power
• Print Spooler
• Problem Reports and Solutions Control Panel Support
• Remote Procedure Call (RPC) Locator
• Remote Registry
• Smart Card
• Smart Card Removal Policy
• Windows Font Cache Service
• Windows Remote Management (WS-Management)

Then Restart your machine. You are now done and can start adding content to your Demo machine and exploring SQL2012 features!