Modifying an SSIS Package through code

Part of any SSIS development experience inevitably results in you discovering a minor mistake or something that was missed a long way into the development cycle – or even after, in testing (you do test your code, right?). Then you are faced with the tedious job of opening every single package, making a change in every one… and getting some serious mouse finger. Much like I once did when I learned about BufferTempStoragePath.

Fortunately, there is a way to automate these fixes. The SSIS Object model is (relatively) easily manipulated through .NET languages – so it’s not too difficult to write a small program that will change your package. Below is a sample I knocked up that will add a variable to an existing package and save the change:

using System;
usingMicrosoft.SqlServer.Server;
usingMicrosoft.SqlServer.Dts.Runtime;

namespace Package_Modifier
{
    class Program
    {
        static void Main(string[] args)
        {
            // Initialize an Application and Package object
            Application app = new Application();
            Package package = null;

            // Set a package path
            StringpkgPath = “C:\\BI Monkey\\SamplePackage.dtsx”;

            // Load the package in package object
            package = app.LoadPackage(pkgPath, null);

            // Add the new variable
            package.Variables.Add(“NewVar”, false, “User”, 0);

            // Save the package
            app.SaveToXml(pkgPath, package, null);
          }
    }
}

You can essentially make any change you like to a package – I’ve chosen adding a variable because it’s an easy manipulation of the package object and I’ve got a long way to go before I work out how to do anything much harder :)

SSIS ETL Framework v1 goes Beta!

A quick update on the status of the BI Monkey SSIS ETL Framework (on Codeplex @ http://ssisetlframework.codeplex.com/)

The Framework v1 has gone into Beta – i’ve completed all the testing, and now just need to tidy up the reports and add some extra logging capability – but it is now fully functional. I’ve left it in Beta as I want to get some feedback on it before I move it live, plus fix those small details I mentioned.

So now I will press on with updating the documentation (yes, really!) and start laying the foundations for the more Enterprise level v2 framework.

I look forward to your feedback – please take advantage of Codeplex’s issue logging functionality to help me manage bugs and improvements.

The Row Sampling Transformation

Fig 1: The Row Sampling Transformation

Fig 1: The Row Sampling Transformation

It’s been a long time since I did one of these! In this post I will be covering the Row Sampling Transformation. The sample package can be found here for 2005 and guidelines on use are here.

What does the Row Sampling Transformation do?

The Row Sampling Transformation takes a fixed number of rows from a source data set – in a similar manner to the Percentage Sampling Transformation, except that instead of a proportion of your data, it takes a fixed number of rows. It splits your data set into two sets, the Sampled and Unsampled outputs, as below where 10 rows of a 100 row data set have been sampled:

The Row Sampling Transformation outputs

Fig 2: The Row Sampling Transformation outputs

The assigning of rows to an output is nominally random, but given the same data set and random seed (explained below), the same rows will always be selected each time you run the package.

Configuring the Row Sampling Transformation

There are two important properties to configure on the transformation. First is the Number of rows, which determines how many rows will fall into the Sample output. Second is the random seed. This seed tells the random selection algorithm which rows to choose. If you fix the seed, you will get consistent results – if you understand a little about randomisation in computing, you will understand randomness is a bit of a relative concept to a computer. If you leave the checkbox unselected, the package will pick a random seed based on the OSes’ tick count, so results will appear to change.

You can also name your Sample and Unselected outputs, should you wish. It’s worth noting that you aren’t obliged to actually use either output downstream of the component, so you can use this component to select a fixed number of rows from your source – or ignore a fixed number of rows from your source, by only using the Unselected output.

Configuring the Row Sampling Transformation

Fig 3: Configuring the Row Sampling Transformation

Where should you use the Row Sampling Transformation?

The main use for this would be to select a fixed size subset of data. This subset could be used for Data Mining test sets, or for limiting your data set size when testing packages – e.g. if you are running against a multimillion row data source, you could just run the package with 100 rows to see if your processes worked.

MSDN Documentation for the Row Sampling Transformation can be found here for 2008 and here for 2005.

If you need specific help or advice, or have suggestions on the post, please leave a comment and I will do my best to help you.

An SQL alternative to the SCD

In SQL 2008 a new T-SQL construct was added - the MERGE operation. (Ok, pedants will know this wasn’t new to Oracle,  but it was new to SQL Server).

This operation allows for the merging of a dataset into a reference dataset – which can be remarkably similar to Insert / Update operations effected by the Slowly Changing Dimension transformation. However the way it operates is very different. Instead of the SCD’s row by row evaluation approach, the MERGE operation is a set based operation. What this means is it compares the whole of the source dataset to the reference dataset in a single pass. This has significant implications for performance – on a site where I implemented this the operation which took 1,200 seconds in the SCD cut down to 51 seconds using a Merge.

There are limitations and differences to be aware of:

  • You cannot directly return row counts for Insert / Update / Ignore operations in the Merge
  • As it is a bulk operation a single row will cause failure of the whole batch
  • There’s no GUI – just hand crafted SQL
  • Less error trapping / logging options
  • More flexibility in terms of actions when matches / non matches are found

The main reason why you would consider the SQL Merge – it handles Type 1, and with a little cunning, Type 2 dimensions – in a fraction of the time it takes the SCD to plod through. It’s still not as fast as a proper in memory comparison using something such as TableDifference – but it’s always good to know you have something else available in your toolkit.

Further information:

Managing your history data

This post is to an extent a small rant about some design decisions I have been constrained by on my current project. These decisions were made predominantly for one fairly bad reason: it made the architect’s life easier (apologies to the architects if they are reading – but these were bad choices!)

The design choices in question are around the managing of history data. In one component of the system it relates to Database storage design, the other relates to Cube storage design. In both cases the history data is stored in a separate location to the “current” data.

Databases: Why separate history tables are a bad idea

The first – and most compelling – reason for not storing your history data in separate tables to your current tables is that it increases complexity for users. Instead of having one location to look for data, your users now have to use two.

The second compelling reason is that there is no point to doing this from a storage point of view. SQL 2005 & 2008 (Enterprise editions only, admittedly) provide partitioning. This enables the contents of an individual table to be stored in separate locations on different filegroups. This means that you can store your current days data in one location and your history in a different one. The reason for doing this is the same as splitting it into separate tables – that querying the current section will be faster than the historic section.  In theory queries against partitioned tables should in fact be faster as the current data is now no longer in the same filegroup as the history data.

Now, there is an overhead associated with designing and maintaining partitions but I don’t see that it is significantly larger than that required to deal with the process required to archive data into separate tables on a daily basis. Additionally when maintaining separate history tables, you need to separate out every single table, whether it gets 10 rows a day or 10 million. With partitioning you can just target the large tables that need that focus.

There are other downsides to maintaining separate tables. If you make a change to a table design, you need to do it in 2 places.  You also need to remember to update your history processes. If your history process fails, you can end up with users getting unexpected query results or ETL process failures when the system loads the next day’s data into the current tables –  and untangling it becomes a real mess. If your partition processes fail to run, you just have too much data in one filegroup for a while – unlikely to be fatal.

So if you have large tables you need to split out for performance purposes – do it at the back end, using the power of the database – which is designed to store data efficiently. Keep it away from the users – they neither need to know or care about your need to keep the data separate. If you want to give them a single object to query with the current day’s data, just use views.

Cubes: Why a separate history Cube is a bad idea

Much of the above applies here – SSAS also has partitions – so you can again store your historic and current data in separate physical locations with the users being totally unaware of this. Again there is overhead in maintenance, but this will also balance out with the maintenance and risks associated with maintaining two identical cubes that only differ in terms of data source.

Use your storage options!

So without banging on about the same things any further, please consider the following two points whenever considering managing your history data:

  1. How does what i’m planning affect my users?
  2. How does what i’m planning leverage the platforms capabilities?

AUSSUG Followup

After the great session at AUSSUG on Tuesday, I thought i’d share the presentation materials, so here’s my powerpoint slides. I enjoyed the shorter, more interactive format.

I’ve also been busy updating the documentation for the BI Monkey SSIS ETL Framework on the codeplex site, and have uploaded a fresh alpha release – which is significantly more tested and reliable – it’s already had quite a few downloads so i’m looking forward to some feedback soon.

The BI Monkey SSIS ETL Framework

I have, after at least 3 incarnations of building SSIS based ETL control frameworks, decided to do a fourth and (possibly) final one – for all to share. The alpha release of this is available now on CodePlex @ http://ssisetlframework.codeplex.com. I’ve put an aplha up, rather than a better tested Beta because the youngest BI Monkey in my household appears to be unhappy being a baby right now and is taking me away from this side project.

It’s fairly standard stuff from my point of view – it’s metadata driven framework that consists of a single control package and a template execution package. I’ve aimed to include the following features, all of which can be turned on or off in the metadata without altering any code.

  • Recoverability
  • Extraction constraints
  • Execution order
  • Dependencies
  • Failure handling

The SSIS packages are pretty much a simple interface for where the meat of the decision making is occurring – in a bundle of stored procedures where it’s a lot easier to write and maintain complex decisions about processing than in SSIS.

I’m adhering to my own self-set principles on usability as well, namely to hit these targets:

  • High visibility of operation via reports
  • Fully commented code
  • No acronynms in table and field names
  • Any codes used exposed in a metadata table

I will be doing a demo of it at the AUSSUG session in Sydney this coming Tuesday as part of the SSIS round table which is taking place:

James Beresford:  ETL Frameworks

Design and Implementation considerations when building ETL Control Frameworks, Including the debut demo of the BI Monkey ETL Framework

Kevin Wong:  Practical ADO. NET

Three practical showcases using ADO.NET in SSIS including re-usable in memory resultsets

Glyn Llewellyn:  Putting the T back into ETL

Looking at alternative ways in SSIS to perform data transformations and correction and discussing the advantages and disadvantages of each method

(Full details at the AUSSUG website)

I hope to see you there!

Loops and Lookups – a performance problem

A bit of a performance tuning nugget around loops and lookups which I faced recently.

We have a scenario where we are looping a few hundred times to execute a series of changing SQL statements, then passing the data through lookups before writing to target. We are doing this with a mix of the standard lookup and Cozyroc’s Lookup Plus task. However we were finding that while executing each SQL query took less than a second, the SSIS package was taking a long time to run each loop.

The loop was taking the query and then pausing for a while before moving any data. Eventually we realised this pause was SSIS building up the caches for the lookups. The standard lookups were running off a pre-built cache but the CozyRoc task doesn’t support caching and had to rebuild each lookup cache every time the loop was instantiated – not a big job but repeated a few hundred times those few seconds soon mount up.

The lessons here are:

  1. If you are reusing a lookup multiple times, use the Cache Transform to preload the cache into memory (2008 onwards)
  2. Be aware of the performance implication of caching lookups and the fact that they will be rebuilt each time the Data Flow is started

SSIS Command Line Utilities part 2: dtexec

In my previous post on dtutil, I looked at the command line utility to move and alter SSIS packages. In this post I will be looking at dtexec, the command line utility that executes packages. It allows for significant alteration of run-time options, including configured values, connections, logging and so on.

dtexec: Basic Package Execution

Like dtutil, dtexec uses a similar set of basic options to identify where the package you are operating on is located: /File, /DTS & /SQL. The eagle eyed among you will note that while the full option name is the same the abbreviated is slightly different, though their operation is the same. There is also a slight difference in the /SQL additional options – because no movement of the package is taking place, you only need to specify the /Server, /Password and /User for the package being executed (rather than separate identification of the Source and Destination package). Similarly there is a /Decrypt option for providing the password to encrypted packages. For details on this, check either my previous post or MSDN documentation.

There are two important options around basic execution. First is the /X86 option. This forces SQL Agent to to execute the package in 32-bit mode on a 64-bit server, which can be important in scenarios where 64-bit compatibility isn’t available. This commonly is an issue with drivers – however this is ignored when running from the command line. When running scenarios where you have SSIS servers in your environment running in 32 and 64-bit you probably want to read this MSDN article first: 64-bit Considerations for Integration Services.

The second is the the option to /Validate the package instead of executing it. What this does is halt the package after the validation phase – this can be useful for testing in deployment scenarios. How I sometimes use it is following deployment with a dtutil script, I run a dtexec script to validate the deployed packages, which helps catch any issues with missing / incorrect configurations etc. before you do a test run. This can be used in conjunction with the /W[arnAsError] option, which will treat a warning as an Error and cause validation to fail.

As a side note, you can place comments in the dtexec command line using the /Rem option.

dtexec: Configuring Execution

Execution can be configured using a number of options. First up is the ability to get dtexec to reference a file with all your dtexec options set in it, using the /Com[mandFile] {filespec} option. This can be handy if you are in an environment where you call dtexec via a remote tool such as Control M that has a character limit in terms of the batch command it can send.

You can provide or override the configurations used in execution using the /Conf[igFile] {filespec} to point at an XML configuration file. Similarly you can override specified connection settings using the /Conn[ection] id_or_name;{connection_string} option – multiple instances of this option are allowed and you can either point at a connection name or the GUID of the connection. You can associate an SSIS logging type with package execution using the /L[ogger] classid_orprogid;{configstring} option – this could be useful in troubleshooting if you don’t want logging enabled during normal execution but want to be able to turn it on for debug.

Individual Package properties such as variable values can be /Set – and you can set as many as is required. Finally you can set the Maximum number of executables permitted using the /M[axConcurrent] {concurrent_executables} option.

dtexec: Using Checkpoints

dtexec can make a package override its defined checkpoint usage settings – though as i’ve alluded to before, be very careful when using checkpointing as you won’t always get the expected results. But the options available are to turn /CheckPointing {on\off}, specify the actual file using /CheckFile and specify the usage of Checkpoints at package start using /Restart {deny | force | ifPossible}. These options match the settings in the package of SaveCheckpoints, CheckpointFile and CheckpointUsage.

dtexec: Execution Logging and Reporting

When running packages from dtexec, you get some additional logging options which can be useful when you are trying to do a more detailed debug on a package. Some are useful for immediate visualisation in the console window as you execute while others create file based logs.

The first console option is /ConsoleLog, which allows you to write to the console window the same entries that would normally be captured by SSIS execution logging. You can select which of these columns you want to capture, the most important of these likely to be Message. Further fine tuning in the settings for this option allow you to tighten the logging to a specific task or event – again as per standard logging. Next is /Reporting  level, which tunes what type of event to capture – Errors, Information, Warnings and so on. Usually when you want this level of reporting i’d advise using the Verbose switch to just capture everything. There is also the capability to create an event exclusion list for this option. Finally, /Sum will put a count of the number of rows the next component will receive in the log.

If you want to push all this logging out to a file (and the logs can get pretty big so they can blow out the console window pretty quickly) you can write them out using the /VLog option. If you need to venture further into debugging you can create the dump files .mdmp and .tmp using either the /DumpOnError option to dump on any error or /Dump with a specified error to only generate the files under certain circumstances. These are advanced options and to understand the output i’d advise reading the MSDN article Working with Debug Dump Files first.

dtexec: Package Verification

The dtexec utility offers a few options to verify the package you are pointing at is the one you intend to execute. First up is the option to verify the build versions using the /VerifyBuild option which allows you to tell dtexec to only run the package if it has the specified Major (and optionally Minor) build version. These are the values set at the control flow level in the SSIS package you are calling. You can check the package has a valid digital signature using /VerifySigned option. Finally are the options to match the GUID of the package you are executing using /VerifyPackageId and the version GUID of the package using the /VerifyVersionID option.

All of the options above are interesting additional protections to ensure your system is running the code that it should be, but realistically they are poor and fiddly substitutes for a proper release management process.

Why use dtexec?

Most of the options above are of course covered in SQL Agent execution options, but there are scenarios where dtexec can prove useful – most commonly where SQL Agent isn’t available or isn’t permitted to be used by IT policy. It also has value in deployment and testing scenarios. Most of the time I would advocate using SQL Agent to schedule and automate your jobs, but it’s good to have such a flexible tool available.

There is also a GUI option, dtexecui.exe which allows for the configuration of almost all the above options using a simple GUI. It also has the advantage of being able to generate the dtexec command line prompt as well, making syntax errors much more avoidable.

Official MSDN documentation can be found here for 2008 and here for 2005.

SSIS Command Line Utilities part 1: dtutil

I am back on the study wagon, getting ready for Exam 70-448: TS: Microsoft SQL Server 2008, Business Intelligence Development and Maintenance. One topic that is key for is understanding the command line utilities that allow you to carry out various functions across all the BI technologies. Possibly in excessive depth, but no-one can accuse the MS BI certifications of being easy.

There are two that impact SSIS – dtutil and dtexec. In simple terms, dtutil moves packages, and dtexec executes them. In this post I will be focussing on dtutil. dtexec will be covered in a later post.

dtutil: deploying packages

The main use of dtutil is to script the deployment of packages. For file operations it is equivalent to a simple copy operation, though you can add SSIS specific operations, such as changing the package GUID. However it also allows for movement to and from the SQL Server msdb store, which can’t be done via copy / paste.

Deploying a package requires you specify its source location using one of the following options:

  • /Fi (or /File) + File Path + Package Name.dtsx – for packages in the File System
  • /DT (or /DTS) + Package Name – for packages in the SSIS Package Store
  • /SQ (or /SQL) + Package Name – for packages in msdb

There are a few important things to note about the above options. Firstly for File System operations, you need to specify the .dtsx extension. Secondly, for SSIS Package store operations you don’t need to, even though it is still strictly speaking a file system operation. Finally, you don’t need to know where the Package Store is – dtutil works that out for you, which is handy if you have custom setups of the package store.

For packages located in msdb, you may also need to provide connection details using the following options:

  • /SourceS (or /SourceServer) – the source server name. If not used, it assumes localhost
  • /SourceU (or /SourceUser) – the SQL Authentication user name
  • /SourceP (or /SourcePassword) – the SQL Authentication Password

Note that the username and password options are only required if SQL Authentication is being used. If left out, dtutil will try and authenticate to msdb using Windows Authentication for the current user. 

Next up, you need to tell dtutil to copy the package and specify the destination. The option to move it is /C or (/Copy). This is then followed by the same options as specifying the source - though without the preceding backslash (i.e. Fi, DT or SQ) – then a semicolon, and finally the destination path. If you are using a SQL Server destination you may also need to provide connection details with the /DestS, /DestU and /DestP options.

As usual, an example is worth its weight in gold, so heres some to help you understand. In each case I will give the Full and Abbreviated versions – both of which do exactly the same job.

Moving a package from File to SSIS Package Store:

dtutil /File C:\Package.dtsx /Copy DTS;Package

dtutil /Fi C:\Package.dtsx /C DT;Package

Moving a package from msdb on a Named Instance of SQL Server using SQL Server Authentication to the file system

dtutil /SQL Folder\Package /SourceServer SQLSERVER.INSTANCE /SourceUser Monkey_User /SourcePassword P@$$word /Copy File;C:\Package.dtsx

dtutil /SQ Folder\Package /SourceS SQLSERVER.INSTANCE /SourceU Monkey_User /SourceP P@$$word /C FI;C:\Package.dtsx

Using a little batch scripting these can easily be converted into jobs which will deploy a bundle of packages quickly and easily.

dtutil: altering packages

dtutil is more than a deployment tool – it can also alter packages using the following options:

  • /Encrypt – copies and encrypts the package with a specified password and protection level
  • /IDRegenerate – Generates a new GUID for the package
  • /Sign  – Applies a digital signature to a package

The last two can be useful as options within the deployment process, where the Encrypt can replace the Copy operation if you want the deployed environments packages to be more secure.

dtutil: file and folder operations

Finally, dtutil can work components of the file system or MSDB folder structures, creating and deleting files and folders

  • /Delete – Delete a package
  • /Exists – Test if a package exists
  • /FCreate – Create a folder
  • /FDelete – Delete a folder
  • /FDirectory – List the contents of a folder
  • /FExists – Test if a folder exists
  • /FRename – Rename a folder
  • /Move – Move a package from one location to another

Most of these would be useful in an initial deployment or for tidying up.

dtutil: other things

There are a few leftover functions which I don’t have a categorisation, so here they are.

  • /Quiet – suppress prompts
  • /Remark – add a comment
  • /Decrypt – decrypt a package as an operation is performed

Why use dtutil?

The main reason for using dtutil is simply that the BIDS supplied deployment options, and all that mucking about with Deployment Manifests, is limited, not as scriptable, and also has been buggy, having problems with configuration files (though I believe this is now fixed in 2008, but since I stopped using it long ago, I can’t be sure). dtutil also offers additional functionality, such as being able to script package operations such as encryption.

Official MSDN documentation can be found here for 2008 and here for here for 2005.

Next Page »