An SQL alternative to the SCD

In SQL 2008 a new T-SQL construct was added – the MERGE operation. (Ok, pedants will know this wasn’t new to Oracle,  but it was new to SQL Server).

This operation allows for the merging of a dataset into a reference dataset – which can be remarkably similar to Insert / Update operations effected by the Slowly Changing Dimension transformation. However the way it operates is very different. Instead of the SCD’s row by row evaluation approach, the MERGE operation is a set based operation. What this means is it compares the whole of the source dataset to the reference dataset in a single pass. This has significant implications for performance – on a site where I implemented this the operation which took 1,200 seconds in the SCD cut down to 51 seconds using a Merge.

There are limitations and differences to be aware of:

  • You cannot directly return row counts for Insert / Update / Ignore operations in the Merge
  • As it is a bulk operation a single row will cause failure of the whole batch
  • There’s no GUI – just hand crafted SQL
  • Less error trapping / logging options
  • More flexibility in terms of actions when matches / non matches are found

The main reason why you would consider the SQL Merge – it handles Type 1, and with a little cunning, Type 2 dimensions – in a fraction of the time it takes the SCD to plod through. It’s still not as fast as a proper in memory comparison using something such as TableDifference – but it’s always good to know you have something else available in your toolkit.

Further information:

Read More

The Merge Transformation

b
Fig 1: The Merge Transformation

In this post I will be covering the Merge Transformation. The sample package can be found here for 2005 and guidelines on use are here.

What does the Merge Transformation do?

The Merge Transformation merges two sorted data sets and outputs a sorted data set – and by Merge it really means a “Sorted Union All”. You provide it with the two sorted sets, specify the keys on which they are sorted, and the “merge” then takes place, ordering the output on those keys. In my example package, I merge two datasets – each with 290 rows – and get a 580 row dataset as output.

b
Fig 2: Configuring the Merge Transformation

You can see above you have to specify the sort keys (you can have multiple keys) and then the mappings for the columns.The sort keys must have the property IsSorted set to True – this will be done automatically by the Sort component, but will need to be set manually in the source if the data is sorted outside of SSIS – this is available in the advanced properties of the output columns of a component.

For simplicity in the example I have simply merged the same dataset, though in one copy I have left some columns off to show that you don’t have to match the columns, just the sort keys. All that happens in unmapped columns is that a null is entered for that columns value in the output dataset.

When would you use the Merge Transformation?

Technically, it should be used to merge two sorted datasets to get a sorted merged output data set. In practice, I fail to see much advantage in using this component -The only scenario I can see this being genuinely useful is when you have to Union two very large sorted datasets, need the output to keep that sort, and cannot afford the memory overhead of a blocking Sort component after the Union which will have to recieve all the data before it can do that sort.

The Union All component can accept more than 2 inputs, doesn’t require inputs to be sorted, and if you really need the output sorting, then you can sort the output using a Sort component. I demonstrate this in the sample package in data Flow 2, which returns the same results as the Merge Data Flow but with one less component.

MSDN Documentation for the Merge Transformation can be found here for 2008 and here for 2005.

Read More