Ranking and Numbering rows – and subsets of rows – in T-SQL

I recently had to deal with a scenario where I needed to pivot out some rows after ordering (ranking) them according to specific rules so I could present some rows of data as columns, but in a specific order (don’t ask why, it’ll make me grind my teeth about data analysts that don’t understand how to analyse data…). The ordering in itself was only part of the solution, as to Pivot the data, the keys need to be specified in the query, so the natural keys can’t be used. The scenario is set out below:

Fig 1: Rank and Pivot. The Rank column needed to be added
Fig 1: Rank and Pivot. The Rank column needed to be added

My first thought was that I’d have to solve this with a cursor, which wasn’t a practical option as there were 1.5m rows of data to process, and if my solution involves a cursor I instantly think it’s a lousy solution. However I was pleased to discover the T-SQL function ROW_NUMBER() which allows you to add row numbering to ordered data and even subgroups of that data. (The below samples use the AdventureWorks2008 database.)

First up, basic row numbering:

SELECT ROW_NUMBER() OVER (ORDER BY ProductId) AS ID_Key
,        [ProductID]
,        [LocationID]
,        [Shelf]
,        [Bin]
,        [Quantity]

FROM [Production].[ProductInventory]

WHERE [ProductID] IN (1,2,3,4)

The above query adds an ID key to the data based on ordering by the ProductID field. The ROW_NUMBER() function requires an OVER clause to know on what basis it should assign the key, and this has to be an ORDER BY statement. The end result looks like this:

Fig 2: Simple row numbering
Fig 2: Simple row numbering

You can extend this to order within a subgroup, by specifying a PARTITION BY clause so ROW_NUMBER() operates with that subgroup. In the example below I partition by ProductId:

SELECT ROW_NUMBER() OVER (PARTITION BY ProductId ORDER BY Quantity DESC) AS Subset_ID_Key
,        [ProductID]
,        [LocationID]
,        [Shelf]
,        [Bin]
,        [Quantity]

FROM [Production].[ProductInventory]

WHERE [ProductID] IN (1,2,3,4)

Which yields this result, with the ranking now only applying within a Product Id:

Fig 3: Row numbering within a Subgroup
Fig 3: Row numbering within a Subgroup

Which can then be pivoted on the rank, as the key of the rank is now known:

SELECT ProductID
,        [1] AS Bin_1
,        [2] AS Bin_3
,        [3] AS Bin_3

FROM

(

SELECT ROW_NUMBER() OVER (PARTITION BY ProductId ORDER BY Quantity DESC) AS Subset_ID_Key
,        [ProductID]
,        [Bin]

FROM [Production].[ProductInventory]

WHERE [ProductID] IN (1,2,3,4)

) AS Pivot_Source

PIVOT

(
MAX(Bin)
FOR Subset_ID_Key IN ([1],[2],[3])
) AS Pivot_Output

Which yields this final output:

Fig 4: Ranked and Pivoted
Fig 4: Ranked and Pivoted

All done within a single query, and not a cursor in sight. ROW_NUMBER() was a great function to discover!

MSDN Documentation is here for:

  • ROW_NUMBER() – the key function
  • OVER – ordering and subgrouping the results of ROW_NUMBER
  • PIVOT – for pivoting out the results

Read More

Connect Improvements on Pivot and SCD denied…

It appears that our attempts to fix a couple of the less usable components in SSIS has been canned, once again. See the following links for the generic “Won’t Fix” response on putting a UI on the Pivot and making the SCD perform better than say, a dead possum.

Whilst this is slightly disappointing, fortunately for those who like the Dimension Merge (formerly Kimball SCD) component, Pragmatic Works have come to an arrangement with Todd McDermid to include a supported version of it available in a future release of their TaskFactory bundle of SSIS components. This means for those who live in risk-averse enterprises, you can now use the component as it comes with a support package.

Read More

SSIS Components in Denali

I’ve finally had a chance to boot up SSIS in Denali and play with it myself. In this post i’m going to look at some of the things that have – and surprisingly – have not appeared… 

The Components

I’ve had a scan through the Tasks in the Control Flow and Components in the Data Flow to see what’s there. First observation is that all the components have got funky new icons. I don’t see anything new function-wise on the control flow, but it’s interesting to note the Execute DTS 2000 Package task remains, implying this piece of backwards compatibility will still remain. 

SSIS Data Correction Component
SSIS Data Correction Component

 

There’s only one completely new item available on the Data Flow, which is the Data Correction component. However as per this blog post by Hennie de Nooijer on his BI Future blog, it appears to be unusable. It looks like this is where SSIS will touch upon the Data Quality Services component of SQL Server (SSDQS), but since it has no Advanced Editor available and it can’t be configured by the normal editor there’s not much I can find out about how this component works. One for review in a future CTP. 

There’s a sort of new, sort of not, item in the form of the Source and Destination assistants, which are effectively Wizards for building Source and Destination  adapters. It tidies up the toolbox a bit but other than that doesn’t seem to add much for experienced users. 

Fix these problems!

However, there are things that desperately needed to change that haven’t. The Slowly Changing Dimension component has had no visible changes, leaving us with the poorly performing (usually to the point of unusability) RBAR toy we have already (a previous suggestion on Connect was killed in 07). The Pivot Transformation still has no usable interface (a previous suggestion on Connect was killed in 06) and remains confusing and difficult to configure. The Derived Column transform still locks you into Unicode strings if you have a text field, and overrides your setting if you change it, then alter the expression, and there also appears to be no new functionality in terms of added functions (my request to make this extensible on connect has been killed). 

There generally seems to have been a lack of updates that address some of the existing issues with individual components – given that some of these – especially the SCD and Pivot – have been poor since day 1 – now 5 years ago – it’s a shame to have seen no changes at all. 

To at least get the Pivot and SCD fixed I’m raising new connect items for them – please vote for them by following the links below and we may get them fixed in a future CTP: 

Thanks in advance!

Read More

The Pivot Transformation

In this post I will be covering the Pivot Transformation. Due to the lack of a GUI or wizard for this component, it’s difficult to configure properly and painful to use, so I will try to walk through it as thoroughly as possible. I found the MSDN documentation on this component a bit confusing as well, so hopefully this will prove to be clearer.

The sample package can be found here for 2005 and guidelines on use are here.

What does the Pivot Transformation do?

This bit is fairly straightforward – it takes row values and converts them to columns, moving – but not aggregating –  a value as it does so. So if your data looked like this:

b
Fig 1: Pre Pivot Data

Pivoting on “Food Type” using “Cost” as a value would turn it into this:

b
Fig 2: Post Pivot Data

How to configure the Pivot Transformation

In the sample package Data Flow 1 I start with a query which delivers Line Totals and Order Quantities summarised by Product and Order Year. What I want to do is Pivot so I have Order Year on the rows, the values from Product Category as the Columns and the sum of Line Totals as the Value, as laid out below:

b
Fig 3: The Desired Pivot Results

Step 1: Configure the Inputs

The first step is to open the Pivot Transformation and on the Input Columns tab select the columns that are going to be used in the Pivot. Once that is done, switch to the Input and Output Properties tab. Here we are going to set how each column is used in the Pivot operation. This is done by setting the PivotUsage property of the Input Columns. This can be found by expanding Pivot Default Input > Input Columns and selecting the relevant Input Column. Down at the bottom of the list under the Custom header is the property PivotUsage. This can be set to one of four values:

  • 0 – the column is passed through unaffected
  • 1 – the column values become the rows of the pivot (aka the Set Key)
  • 2 – the column values become the column names of the pivot (aka the Pivot Column)
  • 3 – the column values that are pivoted in the pivot

So, in my example, I set OrderYear to 1, Product to 2 and Line Totals to 3.

b
Fig 4: Configuring the Pivot Transformation Inputs

Now something you have to configure upstream from this is that whatever column will be the rows of the pivot (PivotKeyUsage = 1) needs to be sorted for the component to work properly. The Pivot component iteslf doesn’t force this (as it does in a Merge Join) so you need to go back and make sure that your column is sorted by the time it hits the pivot. Setting the IsSorted and SortKeyPosition properties of the Set Key column upstream will make no difference as the Pivot Transformation won’t actually pay any attention to these settings, but it may cause unexpected results if the data itself isn’t sorted. You can see how this affects the output by running Data Flow 1a, which is the same as Data Flow 1 but not sorted properly.

Step 2: Configure the Outputs

Despite everything you’ve done so far, this component will have exactly zero outputs – each has to be manually defined. So you will need an output for your Rows (In my example, OrderYear) and an output for each of your Pivoted column’s values that you want to output. So in my example I will need to create an output for each Product Category that is in the source data – you cannot ignore a value as the component will return an error. For example, if I forgot to add a column for Components, I would get the error “The pivot key value “Components” is not valid” when I tried to execute.

To create an output, on the Inputs and Outputs tab expand the Pivot Default Output > OutputColumns and start clicking the Add Column button until you have enough columns. Then for each column you need to set the following properties:

  • Name – the name for the output column – so if you were pivoting a value of Bike, you could rename it to Bicycle in the output
  • PivotKeyValue – the value in the pivoted column that will go into this output
  • SourceColumn – the Lineage ID of the column that is to be used in the pivot. What this means is you need to go back to your input columns, and get the value of the LineageID property of the column that you set the PivotUsage of to 3 for pivoted columns, and the LineageID of the Column you set to PivotUsage of 1 for the Row Column.
b
Fig 5: Configuring the Pivot Transformation Outputs

Having fun yet? Unlikely, but fortunately you are done.

Multiple Pivots and the T-SQL Alternative

In the sample package Data Flow 2, I demonstrate a configuration of the Pivot Transformation that actually pivots two data values. This is achieved by adding a duplicate set of columns for the Pivot Column to receive the pivoted values in, and can be achieved mostly by duplicating the actions in Step 2 to create a new set of Pivot Columns.

I also demonstrate the Pivot T-SQL command in Data Flow 3 – this generates the same output as in Data Flow 1 but the aggregation and pivoting is done in a single operation with none of the awkward configuration required by the Pivot Transformation in SSIS. I won’t be going into the syntax for these queries here but lkinks to the MSDN documentation are at the end of this post.

Where to use the Pivot Transformation?

Because of the pain involved in configuring it, I would suggest you only use it when you absolutely have to and can’t push the operation back to the source. This is not least because the pivot – unlike Excel Pivot Tables or a T-SQL pivot – cannot aggregate, it needs aggregation first on the pivoted column. In my example, if I didn’t have an aggregate operation in the query, I would get the error “Duplicate pivot key value “Bikes”” as it tried to pivot another row which had a Product Category of “Bike”. The T-SQL Pivot query in SQL Server is far more powerful with its variety of aggregation options – even though it will initially be a bit awkward to write those queries to a non-expert. Additionally using T-SQL will allow you to lock down the number of columns – if you get a new value in the Pivot Column it won’t kill your package.

All in all the Pivot Transformation is badly implemented from a UI perspective, and weak from a data processing viewpoint due to it’s lack of aggregation capability.

MSDN Documentation for the Pivot Transformation can be found here for 2008 and here for 2005.

MSDN Documentation for the Pivot T-SQL command can be found here for 2008 and here for 2005.

If you’re still struggling, try looking at these other posts:

Read More