SQL Server Data Quality Services & SSIS – Performance

This is a snippet of a post on the performance of the DQS engine when called from SSIS. I’ve created a simple number based Domain rule and replicated it 5 times in my knowledge base. My package then feeds copies of the same set of data into the DQS component (5000 rows) and runs it through 1 – 5 domains.

The performance profile is as below:

SSIS DQS Component Performance

SSIS DQS Component Performance

There seems to be a fairly linear relationship between the number of domains being processed and execution time. Note that I’ve created a dummy value for “0″ to indicate what the start-up time of the DQS component might be, as it’s impossible to have a DQS Cleansing Component in the flow with no columns mapped.

I’d ignore the actual numbers – this is on a development VM which is definitely not configured for performance – and I’m aware the DQS Team are working on performance issues (though by the looks of it, better be working hard).

SQL Server Data Quality Services & SSIS

So far in my posts on SSDQS we’ve looked at the Data Quality Services Client and building SSDQS Knowledge Bases. Now in practice when handling bulk data a need to reference this in routine loads is needed, and to nobody’s surprise, SSIS is the tool for the job.

The DQS Cleansing Component

So, in our (shiny, new) SSIS Toolbox we have a new component to connect to DQS – the DQS Cleansing Component:

SSIS DQS Cleansing Component

SSIS DQS Cleansing Component

The DQS cleansing component pushes a data flow to the DQS Engine for validation. This requires a special Connection Manager, the DQS Cleansing Connection Manager, which as we can see below is a simple creature:

SSIS DQS Cleansing Connection Manager

SSIS DQS Cleansing Connection Manager

The sole option at this point is to choose which DQS Server to point at. So, lets look at what we get in the SSIS Component once we use the Connection Manager:

SSIS DQS Cleansing Component Connection Manager options

SSIS DQS Cleansing Component Connection Manager options

Once again – still nice and simple – choosing your Connection Manager allows you to then pick from a list of Published Knowledge Bases. Once a KB is selected, a list of the available Domains is populated, though there is nothing you can do with this list other than review it. So next we move to the Mapping tab:

SSIS DQS Cleansing Component Mapping Tab

SSIS DQS Cleansing Component Mapping Tab

The usual suspects are there – pick your input columns in the top half of the tab and they become available for mapping in the lower half. Each input column can be mapped to a single Domain (I can’t quite see how Composite Domains work in this context). You then get three output streams – the Output, Corrected Output and Status Output. The Output is just the column passed through, Corrected is the column value corrected by the DQS Engine and the Status is the record status (which comes out as Correct, Corrected or Unknown which corresponds to the DQS Data Quality Project statuses. In the Advanced Editor you can also switch on Confidence and and Reason Outputs, which relate to matching projects.

Note that there is only a single output for the DQS Cleansing Component – if you want to send OK, Error and Invalid records to different locations, you will need to do so with a downstream Conditional Split component.

Summary

So we’ve had a quick look at the basics of automating DQS activities using SSIS, and how SSIS plugs in to the DQS Server. Subsequent posts will start digging into some practical implementation including performance.

Some further reading can be found here:

Connect Improvements on Pivot and SCD denied…

It appears that our attempts to fix a couple of the less usable components in SSIS has been canned, once again. See the following links for the generic “Won’t Fix” response on putting a UI on the Pivot and making the SCD perform better than say, a dead possum.

Whilst this is slightly disappointing, fortunately for those who like the Dimension Merge (formerly Kimball SCD) component, Pragmatic Works have come to an arrangement with Todd McDermid to include a supported version of it available in a future release of their TaskFactory bundle of SSIS components. This means for those who live in risk-averse enterprises, you can now use the component as it comes with a support package.

SSIS Components in Denali

I’ve finally had a chance to boot up SSIS in Denali and play with it myself. In this post i’m going to look at some of the things that have - and surprisingly – have not appeared… 

The Components

I’ve had a scan through the Tasks in the Control Flow and Components in the Data Flow to see what’s there. First observation is that all the components have got funky new icons. I don’t see anything new function-wise on the control flow, but it’s interesting to note the Execute DTS 2000 Package task remains, implying this piece of backwards compatibility will still remain. 

SSIS Data Correction Component

SSIS Data Correction Component

 

There’s only one completely new item available on the Data Flow, which is the Data Correction component. However as per this blog post by Hennie de Nooijer on his BI Future blog, it appears to be unusable. It looks like this is where SSIS will touch upon the Data Quality Services component of SQL Server (SSDQS), but since it has no Advanced Editor available and it can’t be configured by the normal editor there’s not much I can find out about how this component works. One for review in a future CTP. 

There’s a sort of new, sort of not, item in the form of the Source and Destination assistants, which are effectively Wizards for building Source and Destination  adapters. It tidies up the toolbox a bit but other than that doesn’t seem to add much for experienced users. 

Fix these problems!

However, there are things that desperately needed to change that haven’t. The Slowly Changing Dimension component has had no visible changes, leaving us with the poorly performing (usually to the point of unusability) RBAR toy we have already (a previous suggestion on Connect was killed in 07). The Pivot Transformation still has no usable interface (a previous suggestion on Connect was killed in 06) and remains confusing and difficult to configure. The Derived Column transform still locks you into Unicode strings if you have a text field, and overrides your setting if you change it, then alter the expression, and there also appears to be no new functionality in terms of added functions (my request to make this extensible on connect has been killed). 

There generally seems to have been a lack of updates that address some of the existing issues with individual components – given that some of these – especially the SCD and Pivot – have been poor since day 1 – now 5 years ago – it’s a shame to have seen no changes at all. 

To at least get the Pivot and SCD fixed I’m raising new connect items for them – please vote for them by following the links below and we may get them fixed in a future CTP: 

Thanks in advance!

Denali (SQL11) New SSIS Features

Jamie Thompson has highlighted a set of TechNet articles on the new features coming in SSIS in Denali / SQL 11, which are all followable from this starting link.

SSIS Projects

The features highlighted fall under the concept of “Project Deployment”. This is a change in the way package deployment works (hopefully it will work better than the last mechanism), but importantly adds a new concept – Parameters. These are effectively a project wide set of variables that are available to every package – which puts an end to endless chaining of Parent Package variables or collecting the same settings in every package for an execution scenario. These Parameters can reference another new concept, an Environment Variable which can drive different values into the parameter according to which  Environment you are operating in.

Another feature that is only mentioned in passing is that when a project is deployed, previous versions of the project are retained so rollback is a simple server operation.

Design Improvements

A few design-time improvements are on the way too. Undo / Redo appears at last, the Toolbox has better organisation options and components can now be edited when disconnected. Data Viewers are easier to use and column mapping has become easier when sources / targets change.

Overall

This is hopefully not all we are getting in SQL11, but that will probably depend on the interlinking with some of the other features I suspect are coming and cannot discuss here. The Projects feature is nice, and the Design Time improvements will address some of the frustrations of working with SSIS. I will have to see if I can get my hands on this CTP…

The Merge Join Transformation

Fig 1: The Merge Join Transformation

Fig 1: The Merge Join Transformation

In this post I will be covering the Merge Join Transformation. The sample package can be found here for 2005 and guidelines on use are here.

What does the Merge Join Transformation do?

The Merge Join Transformation joins two data sets on a common key (or keys). It’s the same as performing an SQL Join – the options you have are Inner Join, Full Outer Join and Left Outer Join. However the difference is that SSIS rather than SQL is doing the join, so it’s an in-memory activity. The Merge Join is a Partially Blocking transformation – it needs to receive all data per join key value pair from all inputs before it can proceed, so this can have a significant negative effect on performance.

Configuring the Merge Join Transformation

A key thing to note for the Merge Join Transformation is that the inputs must be sorted on the keys on which you are joining the datasets, for both sets. This means you either have to set the IsSorted property on your source if the data is coming in sorted, or push it through a Sort transformation before you can attempt to do the join. In the example package, I have used 2 OLE DB sources. One of these is unsorted and I sort the output using a Sort column – note this is another blocking transform and another chance to drag performance down. The other one is sorted, and I tell SSIS by setting the IsSorted property of the OLE DB Source Output to true using the advanced Editor.

In terms of configuration of the Merge Join, all you need to do is map the join keys using a drag and drop and select the columns you want on the output, as shown below:

Configuring the Merge Join Transformation

Fig 2: Configuring the Merge Join Transformation

The checkbox to the left of the column name in the join section indicates whether that column is output from the transformation. In the box below the join section, you can rename the output columns if you want. The Order column is the Sort Order, and is not configurable. The Join Key checkbox is ticked when you create a relationship.

There is no advanced editor, but if you access the properties from the design surface there are a few extra properties to access which are worth knowing about:

  • MaxBuffersPerInput – which sets how many buffers to be active for one input at a time – see here for more detail
  • TreatNullsAsEqual – which decides whether to join Null values or not – by default this is true.

Where should you use the Merge Join Transformation?

The answer to this is simply that it should be used wherever it is not possible to perform the join using native SQL functionality – e.g. joining data from text file sources or other non database sources. If your sources are coming from the same database, use a query and get the database to do the work. It’ll be faster and more efficient in the vast majority of cases, and the fact that it’s a blocking transformation will slow down your package overall.

MSDN Documentation for the Merge Join Transformation can be found here for 2008 and here for 2005.

If you are still struggling, try these additional resources:

Also on advanced use of the component:

If you need specific help or advice, or have suggestions on the post, please leave a comment and I will do my best to help you.

Understanding the SSIS Object Model – part 1 – Executables

I have recently been buried deep in the process of creating SSIS packages through code – a deep and interesting challenge, not least because I have had to teach myself C# and get to grips with Object Oriented Programming – a long way from my grubby VBA I was writing many years ago.

Key to the success of understanding how to build packages through code is to understand the SSIS Object Model. Over a series of posts I will try and explain how the Object Model looks and works. So, on with Part 1 – Executables. As I walk through this section, I will demonstrate adding an Execute SQL task to a package.

Executables

The top level object in the Object Model is a Package. However a Package is also an Executable object – along with all the SSIS Containers. These Containers are the Sequence Container, ForEach & For Loops and also TaskHosts (which contain Tasks such as the ExecuteSQL Task and Data Flow – more about TaskHosts in a bit). So at the top of the tree we have the Package, but below that we have Control Flow tasks.

In broad (and possibly not quite right) terms, an Executable is a part of the Object Model that can issue a command to start an activity. It is important to note that Data Flow sources, transformations and destination are not Executables. The Data Flow Task is – because it can start the Data Flow – but you cannot in isolation call a Derived Column Transformation, for example. In this context it’s a bit like an engine – the Data Flow Task is the ignition that starts things happening, but on its own any Transformation is like a piston – vitally important but useless until the engine starts.

So, to begin the code snippets, here I will add an Executable to a Package.

            // Declare and create a package object
            PackageTestPkg = new Package();

            // Create an executable object
            ExecutableNewExe = TestPkg.Executables.Add(“STOCK:SQLTask”);

What happened above is firstly I created a new Package Object. Then, to that object I used the Executables.Add method to add an Executable to the Packages’ Executables collection, with the moniker “STOCK:SQLTask” to define what type of Executable I was adding.

TaskHosts

TaskHostsare the Containers for the tasks you are familiar with from the Control Flow such as the ExecuteSQL Task and Data Flow. You will never see a TaskHost through BIDS – it is a programming concept rather than anything visual. A TaskHost has an InnerObject (and more on these in the next section) which describes the details for type of task it is (i.e. ExecuteSQL Task, Data Flow, etc.).

A good way to think of a TaskHost is like an empty glass jam jar. In itself it doesn’t do much, but if you fill it with some ExecuteSQLTask Jam, you can see it’s an Execute SQL Task.

Below we cast this added Executable as a TaskHost so we can start configuring it.

            // Cast as TaskHost
            TaskHostNewTaskHost = (TaskHost)NewExe;

            // Give it a name
            NewTaskHost.Name = “I am a new Execute SQL Task”;

There’s a limited range of properties you can set at the TaskHost level which are detailed here, such as Name – these are the properties that are common to all TaskHosts. To set properties that are specific to a type of Task, you need to access its InnerObject.

InnerObjects

The InnerObject is the jam in the TaskHost jar from my analogy above – it contains the specific stuff for a given task. So, in order to configure a task fully, you need to access it’s InnerObject to set the properties that are specific to that type of task.

So, below we cast this TaskHost as an ExecuteSQLTask, thus filling the TaskHost jar with our tasty ExecuteSQLTask jam and set its SQLStatementSourceProperty:

            // Cast as ExecuteSQLTask
            ExecuteSQLTaskNewSQLTask = (ExecuteSQLTask)NewTaskHost.InnerObject;

            // Set a ExecuteSQLTask specific property
            NewSQLTask.SqlStatementSource = “And here is my SQL Statement”;

Wrap-Up

So, now you should be a little wiser as to what is required to add and configure a Task on the Control Flow – understanding how Executables, TaskHosts and InnerObjects interact.

Convert Text Stream to String

One of the ongoing challenges with SSIS is its difficulty in handling complex or damaged text files. One approach to dealing with such files is to bring them all in as one wide text column and then split them using code. Sometimes, the file is too wide for that approach, so below is an extension of that method where you import the column as a text stream (DT_TEXT, or Unicode DT_NTEXT) and then split the text stream in a script transformation:

        ‘ Declare variables
        Dim TextStream As Byte()            ‘ To hold Text Stream
        Dim TextStreamAsString As String    ‘ To Hold Text Stream converted to String
        Dim StringArray() As String         ‘ To contain split Text Stream

        ‘ Load Text Stream into variable
        TextStream = Row.TextStreamColumn.GetBlobData(0, CInt(Row.Column0.Length))

        ‘ Convert Text Stream to string
        TextStreamAsString = System.Text.Encoding.ASCII.GetString(TextStream)

        ‘ Split string into array and output
        StringArray = TextStreamAsString.Split(“#”)        

        Row.Column1 = StringArray(1).ToString
        Row.Column2 = StringArray(2).ToString
        Row.Column3 = StringArray(3).ToString  

An important thing to note is that in the step where the Text Stream is converted to a string, the Encoding will depend on the type of text stream you are bringing in – Unicode files will need ”Unicode” instead of “ASCII”. Also I have used a hash (“#”) as the column delimiter but that value will vary depending on what type of file you are bringing in.

Modifying an SSIS Package through code

Part of any SSIS development experience inevitably results in you discovering a minor mistake or something that was missed a long way into the development cycle – or even after, in testing (you do test your code, right?). Then you are faced with the tedious job of opening every single package, making a change in every one… and getting some serious mouse finger. Much like I once did when I learned about BufferTempStoragePath.

Fortunately, there is a way to automate these fixes. The SSIS Object model is (relatively) easily manipulated through .NET languages – so it’s not too difficult to write a small program that will change your package. Below is a sample I knocked up that will add a variable to an existing package and save the change:

using System;
usingMicrosoft.SqlServer.Server;
usingMicrosoft.SqlServer.Dts.Runtime;

namespace Package_Modifier
{
    class Program
    {
        static void Main(string[] args)
        {
            // Initialize an Application and Package object
            Application app = new Application();
            Package package = null;

            // Set a package path
            StringpkgPath = “C:\\BI Monkey\\SamplePackage.dtsx”;

            // Load the package in package object
            package = app.LoadPackage(pkgPath, null);

            // Add the new variable
            package.Variables.Add(“NewVar”, false, “User”, 0);

            // Save the package
            app.SaveToXml(pkgPath, package, null);
          }
    }
}

You can essentially make any change you like to a package – I’ve chosen adding a variable because it’s an easy manipulation of the package object and I’ve got a long way to go before I work out how to do anything much harder :)

The Row Sampling Transformation

Fig 1: The Row Sampling Transformation

Fig 1: The Row Sampling Transformation

It’s been a long time since I did one of these! In this post I will be covering the Row Sampling Transformation. The sample package can be found here for 2005 and guidelines on use are here.

What does the Row Sampling Transformation do?

The Row Sampling Transformation takes a fixed number of rows from a source data set – in a similar manner to the Percentage Sampling Transformation, except that instead of a proportion of your data, it takes a fixed number of rows. It splits your data set into two sets, the Sampled and Unsampled outputs, as below where 10 rows of a 100 row data set have been sampled:

The Row Sampling Transformation outputs

Fig 2: The Row Sampling Transformation outputs

The assigning of rows to an output is nominally random, but given the same data set and random seed (explained below), the same rows will always be selected each time you run the package.

Configuring the Row Sampling Transformation

There are two important properties to configure on the transformation. First is the Number of rows, which determines how many rows will fall into the Sample output. Second is the random seed. This seed tells the random selection algorithm which rows to choose. If you fix the seed, you will get consistent results – if you understand a little about randomisation in computing, you will understand randomness is a bit of a relative concept to a computer. If you leave the checkbox unselected, the package will pick a random seed based on the OSes’ tick count, so results will appear to change.

You can also name your Sample and Unselected outputs, should you wish. It’s worth noting that you aren’t obliged to actually use either output downstream of the component, so you can use this component to select a fixed number of rows from your source – or ignore a fixed number of rows from your source, by only using the Unselected output.

Configuring the Row Sampling Transformation

Fig 3: Configuring the Row Sampling Transformation

Where should you use the Row Sampling Transformation?

The main use for this would be to select a fixed size subset of data. This subset could be used for Data Mining test sets, or for limiting your data set size when testing packages – e.g. if you are running against a multimillion row data source, you could just run the package with 100 rows to see if your processes worked.

MSDN Documentation for the Row Sampling Transformation can be found here for 2008 and here for 2005.

If you need specific help or advice, or have suggestions on the post, please leave a comment and I will do my best to help you.

« Previous PageNext Page »