Connect Improvements on Pivot and SCD denied…

It appears that our attempts to fix a couple of the less usable components in SSIS has been canned, once again. See the following links for the generic “Won’t Fix” response on putting a UI on the Pivot and making the SCD perform better than say, a dead possum.

Whilst this is slightly disappointing, fortunately for those who like the Dimension Merge (formerly Kimball SCD) component, Pragmatic Works have come to an arrangement with Todd McDermid to include a supported version of it available in a future release of their TaskFactory bundle of SSIS components. This means for those who live in risk-averse enterprises, you can now use the component as it comes with a support package.

Read More

SSIS Components in Denali

I’ve finally had a chance to boot up SSIS in Denali and play with it myself. In this post i’m going to look at some of the things that have – and surprisingly – have not appeared… 

The Components

I’ve had a scan through the Tasks in the Control Flow and Components in the Data Flow to see what’s there. First observation is that all the components have got funky new icons. I don’t see anything new function-wise on the control flow, but it’s interesting to note the Execute DTS 2000 Package task remains, implying this piece of backwards compatibility will still remain. 

SSIS Data Correction Component
SSIS Data Correction Component

 

There’s only one completely new item available on the Data Flow, which is the Data Correction component. However as per this blog post by Hennie de Nooijer on his BI Future blog, it appears to be unusable. It looks like this is where SSIS will touch upon the Data Quality Services component of SQL Server (SSDQS), but since it has no Advanced Editor available and it can’t be configured by the normal editor there’s not much I can find out about how this component works. One for review in a future CTP. 

There’s a sort of new, sort of not, item in the form of the Source and Destination assistants, which are effectively Wizards for building Source and Destination  adapters. It tidies up the toolbox a bit but other than that doesn’t seem to add much for experienced users. 

Fix these problems!

However, there are things that desperately needed to change that haven’t. The Slowly Changing Dimension component has had no visible changes, leaving us with the poorly performing (usually to the point of unusability) RBAR toy we have already (a previous suggestion on Connect was killed in 07). The Pivot Transformation still has no usable interface (a previous suggestion on Connect was killed in 06) and remains confusing and difficult to configure. The Derived Column transform still locks you into Unicode strings if you have a text field, and overrides your setting if you change it, then alter the expression, and there also appears to be no new functionality in terms of added functions (my request to make this extensible on connect has been killed). 

There generally seems to have been a lack of updates that address some of the existing issues with individual components – given that some of these – especially the SCD and Pivot – have been poor since day 1 – now 5 years ago – it’s a shame to have seen no changes at all. 

To at least get the Pivot and SCD fixed I’m raising new connect items for them – please vote for them by following the links below and we may get them fixed in a future CTP: 

Thanks in advance!

Read More

An SQL alternative to the SCD

In SQL 2008 a new T-SQL construct was added – the MERGE operation. (Ok, pedants will know this wasn’t new to Oracle,  but it was new to SQL Server).

This operation allows for the merging of a dataset into a reference dataset – which can be remarkably similar to Insert / Update operations effected by the Slowly Changing Dimension transformation. However the way it operates is very different. Instead of the SCD’s row by row evaluation approach, the MERGE operation is a set based operation. What this means is it compares the whole of the source dataset to the reference dataset in a single pass. This has significant implications for performance – on a site where I implemented this the operation which took 1,200 seconds in the SCD cut down to 51 seconds using a Merge.

There are limitations and differences to be aware of:

  • You cannot directly return row counts for Insert / Update / Ignore operations in the Merge
  • As it is a bulk operation a single row will cause failure of the whole batch
  • There’s no GUI – just hand crafted SQL
  • Less error trapping / logging options
  • More flexibility in terms of actions when matches / non matches are found

The main reason why you would consider the SQL Merge – it handles Type 1, and with a little cunning, Type 2 dimensions – in a fraction of the time it takes the SCD to plod through. It’s still not as fast as a proper in memory comparison using something such as TableDifference – but it’s always good to know you have something else available in your toolkit.

Further information:

Read More

The Slowly Changing Dimension Transformation, part 2 – Type 2 Dimensions

b
Fig 1: The Slowly Changing Dimension Transformation

In this post I will be covering how to use the Slowly Changing Dimension (SCD) Transformation to update a Type 2 Dimension, that is, one that tracks changes in values over time. The sample package can be found here for 2005 and guidelines on use are here. This is the second post in the series looking at some slightly more advanced behaviour – for the basics of the Slowly Changing Dimension (SCD) Transformation please read my post: The Slowly Changing Dimension Transformation, part 1 – Type 1 Dimensions.

Configuring the SCD for a Type 2 Dimension

The first work you need to do for a Type 2 dimension actually resides in your dimension table design – you need to decide whether you are going to track changes in your table using either a simple indicator to identify current and expired records, or if you want to use effective dates – the component doesn’t natively allow you to use both, though you can customise the output to do so. The Current / Expired indicator actually uses a small text string which can either be set to the string value pairs “True” / “False” or “Current” / “Expired” – no customisation of these is allowed in the component (again, you can customise the output to change this, but the wizard will only allow mapping of the column to one that will accept text strings). The Effective dates option requires a start and end date datetime column, and in the wizard you use a variable to set the time used. The sample package demonstrates a few possibilities but below I will describe using effective dates.

b
Fig 2: Select a Dimension Table and Keys

First of all, note when on the first page, Select a Dimension Table and Keys the Effective dates (and Current indicator) are not mapped. Because I have named the columns in line with what the SCD expects for such indicators, it ignores them completely in the mapping – they cannot even be selected as Input Columns. If you name them differently in your design, simply map them as “Not a Key Column”.

b
Fig 3: Slowly Changing Dimension Columns

In the Slowly Changing Dimension Columns page set the change type of each column to “Historic” so the component will track history of changes.

b
Fig 4: Historical Attribute Options

The wizard will present a page that is only displayed when you have selected Historic change type columns.  Here the start and end date columns are specified, and the component needs a datetime variable to use to set the expiry of old records and the start date of new records. Here I have just used the Package Start time variable – in practice you may well want to specify a variable populated with something else, such as the extract date of the data.

b
Fig 5: Finishing the SCD Wizard

When go to finish the wizard, you will note the additional “Historical Attribute Output”  will be generated. In practice this means a setof components will be output to manage the changes, which are illustrated below (click to zoom in).  The derived columns add the effective start and end date columns and the OLE DB Command expires old records. Please review the sample component to see how this works in practice.

b
Fig 6: Historical Change Outputs

SCD Considerations for Type 2 Dimensions

One of the most important things to bear in mind is that the component is not intelligent in terms of knowing which data is new – so if you had two records for a given key in the sample file, you would have to sort it so it would feed it the most recent item last so that item would be the current one. It also provides no support for data which has its own change dates – for example if a record had an update date and you wished to use that to form the effective date.

The SCD component is only really suitable for tracking Type 2 changes in sources where there will be one record per key per extract and the source itself has no change tracking capabilities. Given this weakness and the difficulties with using this component generally (in terms of configuration and performance) – you may well want to look at the alternatives I mentioned in my original post about the SCD. This is a component that definitely needs an overhaul for the next release.

MSDN Documentation for the Slowly Changing Dimension Transformation can be found here for 2008 and here for 2005.

Read More

The Slowly Changing Dimension Transformation, part 1 – Type 1 Dimensions

b
Fig 1: The Slowly Changing Dimension Transformation

In this post I will be covering the basics of the Slowly Changing Dimension (SCD) Transformation. The sample package can be found here for 2005 and guidelines on use are here. This is going to be an introductory post that will not cover all aspects of the SCD Transformation as it is one of the more involved components to configure – and the post will still manage to be quite long.

What does the SCD do?

The simplest explanation is that it compares the attributes (column values) of rows of incoming data against a reference table, using a unique key – called the Business Key – to identify the record to compare against. What can make it complex is the range of comparison options and possible outputs for the component. The component checks attributes for three scenarios:

  1. New record – no record with that business key exists in the reference table
  2. Changed attributes – a record with that business key exists and compared attributes have changed
  3. Unchanged attributes – a record with that business key exists and compared attributes have not changed

Now, within those scenarios are a subset of possibilities to allow for – that the changed attribute shouldn’t change (Fixed attributes), or that the history of changes needs to be tracked (Historic attributes), or allowing for Inferred members, which I will explain in a future post.

For the sake of simplicity, here I will only be covering a basic example where new records are added and changed records are updated – a.k.a. a Type 1 Dimension – see this Slowly Changing Dimension article on Wikipedia for more explanation of various Dimension Types.

Configuring the SCD for a Type 1 Dimension

The SCD Transformation is configured using a wizard which launches when you double click on the component. It first prompts you to choose the reference table – in the most common Data Warehouse scenarios this would be a Dimension or Fact table. In the example package, it is the dimension table SCD_Nuts.

b
Fig 2: Selecting a Dimension Table and Keys

Once you have selected the reference table, the component will automatically map columns from the Input Data Flow to columns in the reference (Dimension) table by name. It will not allow mappings to columns where data types do not match, so if you have columns that have the same name in the source data flow and reference table that do not map, check the data types – the wizard will not allow you to force mappings of mismatched data types to repair later. Once you have set up your mappings, using the drop-down you have to identify the Business Key column(s) – it can be a compound key of multiple columns, or a single column. In our example, it is the unique key column Nut_Key.

b
Fig 3: Slowly Changing Dimension Columns

On the next page of the wizard, you determine the Change Type of the non-key columns you want to compare. If a column value change doesn’t matter and you don’t want it to trigger any action, simply leave it off the list on this page. If a change in the column value does trigger an action, then add it to the list of Dimension Columns and set its Change Type. For the purposes of this example, the value we will be selecting is Changing Attribute, and in the sample package all attribute columns are being assessed for changes.

The next page – Fixed and Changing Attribute Options – is for handling Fixed attributes and Type 2 dimension changes, so we will leave any options here unselected and move on. The next page – Inferred Dimension Members – also does not apply here, so uncheck the “Enable inferred member support” and click next.

b
Fig 4: Finishing the SCD Wizard

The final page displays a list of outputs the component will generate. In this case it lists only the New Record Output, but it will also generate an output for Changing records. Click  finish and the component will add an OLEDB Destination for the new records, and an OLEDB Command to handle updating changed records. Each added transformation will be fully mapped and in the case of the Changed records, the update query is written.

Problems with the SCD

The SCD has issues with ease of use and performance. I’ll start with the most important one: Performance. How the SCD works is for each incoming row of data it issues a SQL command to check against the reference (or Dimension) table in the database to compare the incoming row against its corresponding row in the reference (you can watch this happening in SQL profiler). This isn’t a problem for small reference tables, but once you start processing thousands of incoming rows against tables with thousands of reference rows, performance starts to drag, because it is doing these row by row checks. The only performance tuning option you have at your disposal is to index the Business Key in the reference table. It would be much better if it was possible to cache the reference table in memory so lookups could be done in memory instead of row by row against the database – according to this Connect article it may be on the list for the next release.

In terms of ease of use, there’s a couple of annoying things that can trip you up with this component. First of all, every time you complete the wizard, it creates new output transformations, deleting the old ones. If you have customised these in anyway – e.g. adding an update date column – it gets annoying fast. Fortunately the workaround is easy – create your own transformations to receive the outputs independently of the SCD, and when the wizard completes, just delete the outputs it creates and re-map the output data flows to your own transformations.

Secondly, the Wizard is actually disconnected from what is stored in the package for the data flow. The wizard’s data is stored in a separate chunk of XML within the package definition. What this means in practice is if you use the Advanced Editor to make any changes, these will not be picked up by the wizard if you run it again. So its quite easy to make tweaks that get lost if you re-use the wizard.

Where would you use the SCD?

As per its name, you will most likely use this in data warehousing scenarios for maintaining slowly changing dimension and fact tables, or any table where you want to update data to reflect current values. Note that the reference / dimension table can only reside in SQL Server.

There are some alternatives to the SCD available – notably Table Difference from SQLBI.com (now at version 2.0) which I have used and is very quick, and the Kimball Method SCD from Codeplex, which I haven’t used but will certainly be looking at and may cover in one of the following posts on the SCD.

MSDN Documentation for the Slowly Changing Dimension Transformation can be found here for 2008 and here for 2005.

Read More