In this post I will be covering the basics of the Slowly Changing Dimension (SCD) Transformation. The sample package can be found here for 2005 and guidelines on use are here. This is going to be an introductory post that will not cover all aspects of the SCD Transformation as it is one of the more involved components to configure – and the post will still manage to be quite long.
What does the SCD do?
The simplest explanation is that it compares the attributes (column values) of rows of incoming data against a reference table, using a unique key – called the Business Key – to identify the record to compare against. What can make it complex is the range of comparison options and possible outputs for the component. The component checks attributes for three scenarios:
- New record – no record with that business key exists in the reference table
- Changed attributes – a record with that business key exists and compared attributes have changed
- Unchanged attributes – a record with that business key exists and compared attributes have not changed
Now, within those scenarios are a subset of possibilities to allow for – that the changed attribute shouldn’t change (Fixed attributes), or that the history of changes needs to be tracked (Historic attributes), or allowing for Inferred members, which I will explain in a future post.
For the sake of simplicity, here I will only be covering a basic example where new records are added and changed records are updated – a.k.a. a Type 1 Dimension – see this Slowly Changing Dimension article on Wikipedia for more explanation of various Dimension Types.
Configuring the SCD for a Type 1 Dimension
The SCD Transformation is configured using a wizard which launches when you double click on the component. It first prompts you to choose the reference table – in the most common Data Warehouse scenarios this would be a Dimension or Fact table. In the example package, it is the dimension table SCD_Nuts.
Once you have selected the reference table, the component will automatically map columns from the Input Data Flow to columns in the reference (Dimension) table by name. It will not allow mappings to columns where data types do not match, so if you have columns that have the same name in the source data flow and reference table that do not map, check the data types – the wizard will not allow you to force mappings of mismatched data types to repair later. Once you have set up your mappings, using the drop-down you have to identify the Business Key column(s) – it can be a compound key of multiple columns, or a single column. In our example, it is the unique key column Nut_Key.
On the next page of the wizard, you determine the Change Type of the non-key columns you want to compare. If a column value change doesn’t matter and you don’t want it to trigger any action, simply leave it off the list on this page. If a change in the column value does trigger an action, then add it to the list of Dimension Columns and set its Change Type. For the purposes of this example, the value we will be selecting is Changing Attribute, and in the sample package all attribute columns are being assessed for changes.
The next page – Fixed and Changing Attribute Options – is for handling Fixed attributes and Type 2 dimension changes, so we will leave any options here unselected and move on. The next page – Inferred Dimension Members – also does not apply here, so uncheck the “Enable inferred member support” and click next.
The final page displays a list of outputs the component will generate. In this case it lists only the New Record Output, but it will also generate an output for Changing records. Click finish and the component will add an OLEDB Destination for the new records, and an OLEDB Command to handle updating changed records. Each added transformation will be fully mapped and in the case of the Changed records, the update query is written.
Problems with the SCD
The SCD has issues with ease of use and performance. I’ll start with the most important one: Performance. How the SCD works is for each incoming row of data it issues a SQL command to check against the reference (or Dimension) table in the database to compare the incoming row against its corresponding row in the reference (you can watch this happening in SQL profiler). This isn’t a problem for small reference tables, but once you start processing thousands of incoming rows against tables with thousands of reference rows, performance starts to drag, because it is doing these row by row checks. The only performance tuning option you have at your disposal is to index the Business Key in the reference table. It would be much better if it was possible to cache the reference table in memory so lookups could be done in memory instead of row by row against the database – according to this Connect article it may be on the list for the next release.
In terms of ease of use, there’s a couple of annoying things that can trip you up with this component. First of all, every time you complete the wizard, it creates new output transformations, deleting the old ones. If you have customised these in anyway – e.g. adding an update date column – it gets annoying fast. Fortunately the workaround is easy – create your own transformations to receive the outputs independently of the SCD, and when the wizard completes, just delete the outputs it creates and re-map the output data flows to your own transformations.
Secondly, the Wizard is actually disconnected from what is stored in the package for the data flow. The wizard’s data is stored in a separate chunk of XML within the package definition. What this means in practice is if you use the Advanced Editor to make any changes, these will not be picked up by the wizard if you run it again. So its quite easy to make tweaks that get lost if you re-use the wizard.
Where would you use the SCD?
As per its name, you will most likely use this in data warehousing scenarios for maintaining slowly changing dimension and fact tables, or any table where you want to update data to reflect current values. Note that the reference / dimension table can only reside in SQL Server.
There are some alternatives to the SCD available – notably Table Difference from SQLBI.com (now at version 2.0) which I have used and is very quick, and the Kimball Method SCD from Codeplex, which I haven’t used but will certainly be looking at and may cover in one of the following posts on the SCD.