Modifying an SSIS Package through code
Part of any SSIS development experience inevitably results in you discovering a minor mistake or something that was missed a long way into the development cycle – or even after, in testing (you do test your code, right?). Then you are faced with the tedious job of opening every single package, making a change in every one… and getting some serious mouse finger. Much like I once did when I learned about BufferTempStoragePath.
Fortunately, there is a way to automate these fixes. The SSIS Object model is (relatively) easily manipulated through .NET languages – so it’s not too difficult to write a small program that will change your package. Below is a sample I knocked up that will add a variable to an existing package and save the change:
using System;
usingMicrosoft.SqlServer.Server;
usingMicrosoft.SqlServer.Dts.Runtime;namespace Package_Modifier
{
class Program
{
static void Main(string[] args)
{
// Initialize an Application and Package object
Application app = new Application();
Package package = null;// Set a package path
StringpkgPath = “C:\\BI Monkey\\SamplePackage.dtsx”;// Load the package in package object
package = app.LoadPackage(pkgPath, null);// Add the new variable
package.Variables.Add(“NewVar”, false, “User”, 0);// Save the package
app.SaveToXml(pkgPath, package, null);
}
}
}
You can essentially make any change you like to a package – I’ve chosen adding a variable because it’s an easy manipulation of the package object and I’ve got a long way to go before I work out how to do anything much harder
The Row Sampling Transformation

Fig 1: The Row Sampling Transformation
It’s been a long time since I did one of these! In this post I will be covering the Row Sampling Transformation. The sample package can be found here for 2005 and guidelines on use are here.
What does the Row Sampling Transformation do?
The Row Sampling Transformation takes a fixed number of rows from a source data set – in a similar manner to the Percentage Sampling Transformation, except that instead of a proportion of your data, it takes a fixed number of rows. It splits your data set into two sets, the Sampled and Unsampled outputs, as below where 10 rows of a 100 row data set have been sampled:

Fig 2: The Row Sampling Transformation outputs
The assigning of rows to an output is nominally random, but given the same data set and random seed (explained below), the same rows will always be selected each time you run the package.
Configuring the Row Sampling Transformation
There are two important properties to configure on the transformation. First is the Number of rows, which determines how many rows will fall into the Sample output. Second is the random seed. This seed tells the random selection algorithm which rows to choose. If you fix the seed, you will get consistent results – if you understand a little about randomisation in computing, you will understand randomness is a bit of a relative concept to a computer. If you leave the checkbox unselected, the package will pick a random seed based on the OSes’ tick count, so results will appear to change.
You can also name your Sample and Unselected outputs, should you wish. It’s worth noting that you aren’t obliged to actually use either output downstream of the component, so you can use this component to select a fixed number of rows from your source – or ignore a fixed number of rows from your source, by only using the Unselected output.

Fig 3: Configuring the Row Sampling Transformation
Where should you use the Row Sampling Transformation?
The main use for this would be to select a fixed size subset of data. This subset could be used for Data Mining test sets, or for limiting your data set size when testing packages – e.g. if you are running against a multimillion row data source, you could just run the package with 100 rows to see if your processes worked.
MSDN Documentation for the Row Sampling Transformation can be found here for 2008 and here for 2005.
If you need specific help or advice, or have suggestions on the post, please leave a comment and I will do my best to help you.
Other Blog roundup
Shockingly, it appears it’s not just me on the Internet blogging about Microsoft BI
I’ve added a couple of new links to by Relevant blogs. First up is Boyan Penev’s blog- a good blog that I’ve been referencing in my recent excursions into MDX. He’s also just joined the consultancy that I work for, which is a happy coincidence.
Next is Duane Douglas’s SSISBI.com- particularly for his posts on programmatically building SSIS packages which is a 12 part series starting here. It is comparatively easy to build packages from code. One of the extensions of my ETL Framework project that I’m very keen on doing is getting a utility built that generates packages that conform to the framework, rather than using templates which are easier to break. Hence my current fervent studies in C# which will no doubt please Todd McDermid.
On the subject of Todd, he recently presented at a SQL Saturday, and put up a few presentations which I think were pretty on the nose about their subjects. First of those was a Business Intelligence and Data Warehouse Primer – which for newcomers to DW/BI laid out the landscape nicely. The second was on the options available for Dimension Processing in SSIS. Finally there was one on the guts of SSIS which is in the same territory as my own ABC of ETL with SSIS. Follow this link to get to the post with the presentations.
The BI Monkey SSIS ETL Framework
I have, after at least 3 incarnations of building SSIS based ETL control frameworks, decided to do a fourth and (possibly) final one – for all to share. The alpha release of this is available now on CodePlex @ http://ssisetlframework.codeplex.com. I’ve put an aplha up, rather than a better tested Beta because the youngest BI Monkey in my household appears to be unhappy being a baby right now and is taking me away from this side project.
It’s fairly standard stuff from my point of view – it’s metadata driven framework that consists of a single control package and a template execution package. I’ve aimed to include the following features, all of which can be turned on or off in the metadata without altering any code.
- Recoverability
- Extraction constraints
- Execution order
- Dependencies
- Failure handling
The SSIS packages are pretty much a simple interface for where the meat of the decision making is occurring – in a bundle of stored procedures where it’s a lot easier to write and maintain complex decisions about processing than in SSIS.
I’m adhering to my own self-set principles on usability as well, namely to hit these targets:
- High visibility of operation via reports
- Fully commented code
- No acronynms in table and field names
- Any codes used exposed in a metadata table
I will be doing a demo of it at the AUSSUG session in Sydney this coming Tuesday as part of the SSIS round table which is taking place:
James Beresford: ETL Frameworks
Design and Implementation considerations when building ETL Control Frameworks, Including the debut demo of the BI Monkey ETL Framework
Kevin Wong: Practical ADO. NET
Three practical showcases using ADO.NET in SSIS including re-usable in memory resultsets
Glyn Llewellyn: Putting the T back into ETL
Looking at alternative ways in SSIS to perform data transformations and correction and discussing the advantages and disadvantages of each method
(Full details at the AUSSUG website)
I hope to see you there!
Loops and Lookups – a performance problem
A bit of a performance tuning nugget around loops and lookups which I faced recently.
We have a scenario where we are looping a few hundred times to execute a series of changing SQL statements, then passing the data through lookups before writing to target. We are doing this with a mix of the standard lookup and Cozyroc’s Lookup Plus task. However we were finding that while executing each SQL query took less than a second, the SSIS package was taking a long time to run each loop.
The loop was taking the query and then pausing for a while before moving any data. Eventually we realised this pause was SSIS building up the caches for the lookups. The standard lookups were running off a pre-built cache but the CozyRoc task doesn’t support caching and had to rebuild each lookup cache every time the loop was instantiated – not a big job but repeated a few hundred times those few seconds soon mount up.
The lessons here are:
- If you are reusing a lookup multiple times, use the Cache Transform to preload the cache into memory (2008 onwards)
- Be aware of the performance implication of caching lookups and the fact that they will be rebuilt each time the Data Flow is started
SSIS Command Line Utilities part 2: dtexec
In my previous post on dtutil, I looked at the command line utility to move and alter SSIS packages. In this post I will be looking at dtexec, the command line utility that executes packages. It allows for significant alteration of run-time options, including configured values, connections, logging and so on.
dtexec: Basic Package Execution
Like dtutil, dtexec uses a similar set of basic options to identify where the package you are operating on is located: /File, /DTS & /SQL. The eagle eyed among you will note that while the full option name is the same the abbreviated is slightly different, though their operation is the same. There is also a slight difference in the /SQL additional options – because no movement of the package is taking place, you only need to specify the /Server, /Password and /User for the package being executed (rather than separate identification of the Source and Destination package). Similarly there is a /Decrypt option for providing the password to encrypted packages. For details on this, check either my previous post or MSDN documentation.
There are two important options around basic execution. First is the /X86 option. This forces SQL Agent to to execute the package in 32-bit mode on a 64-bit server, which can be important in scenarios where 64-bit compatibility isn’t available. This commonly is an issue with drivers – however this is ignored when running from the command line. When running scenarios where you have SSIS servers in your environment running in 32 and 64-bit you probably want to read this MSDN article first: 64-bit Considerations for Integration Services.
The second is the the option to /Validate the package instead of executing it. What this does is halt the package after the validation phase – this can be useful for testing in deployment scenarios. How I sometimes use it is following deployment with a dtutil script, I run a dtexec script to validate the deployed packages, which helps catch any issues with missing / incorrect configurations etc. before you do a test run. This can be used in conjunction with the /W[arnAsError] option, which will treat a warning as an Error and cause validation to fail.
As a side note, you can place comments in the dtexec command line using the /Rem option.
dtexec: Configuring Execution
Execution can be configured using a number of options. First up is the ability to get dtexec to reference a file with all your dtexec options set in it, using the /Com[mandFile] {filespec} option. This can be handy if you are in an environment where you call dtexec via a remote tool such as Control M that has a character limit in terms of the batch command it can send.
You can provide or override the configurations used in execution using the /Conf[igFile] {filespec} to point at an XML configuration file. Similarly you can override specified connection settings using the /Conn[ection] id_or_name;{connection_string} option – multiple instances of this option are allowed and you can either point at a connection name or the GUID of the connection. You can associate an SSIS logging type with package execution using the /L[ogger] classid_orprogid;{configstring} option – this could be useful in troubleshooting if you don’t want logging enabled during normal execution but want to be able to turn it on for debug.
Individual Package properties such as variable values can be /Set – and you can set as many as is required. Finally you can set the Maximum number of executables permitted using the /M[axConcurrent] {concurrent_executables} option.
dtexec: Using Checkpoints
dtexec can make a package override its defined checkpoint usage settings – though as i’ve alluded to before, be very careful when using checkpointing as you won’t always get the expected results. But the options available are to turn /CheckPointing {on\off}, specify the actual file using /CheckFile and specify the usage of Checkpoints at package start using /Restart {deny | force | ifPossible}. These options match the settings in the package of SaveCheckpoints, CheckpointFile and CheckpointUsage.
dtexec: Execution Logging and Reporting
When running packages from dtexec, you get some additional logging options which can be useful when you are trying to do a more detailed debug on a package. Some are useful for immediate visualisation in the console window as you execute while others create file based logs.
The first console option is /ConsoleLog, which allows you to write to the console window the same entries that would normally be captured by SSIS execution logging. You can select which of these columns you want to capture, the most important of these likely to be Message. Further fine tuning in the settings for this option allow you to tighten the logging to a specific task or event – again as per standard logging. Next is /Reporting level, which tunes what type of event to capture – Errors, Information, Warnings and so on. Usually when you want this level of reporting i’d advise using the Verbose switch to just capture everything. There is also the capability to create an event exclusion list for this option. Finally, /Sum will put a count of the number of rows the next component will receive in the log.
If you want to push all this logging out to a file (and the logs can get pretty big so they can blow out the console window pretty quickly) you can write them out using the /VLog option. If you need to venture further into debugging you can create the dump files .mdmp and .tmp using either the /DumpOnError option to dump on any error or /Dump with a specified error to only generate the files under certain circumstances. These are advanced options and to understand the output i’d advise reading the MSDN article Working with Debug Dump Files first.
dtexec: Package Verification
The dtexec utility offers a few options to verify the package you are pointing at is the one you intend to execute. First up is the option to verify the build versions using the /VerifyBuild option which allows you to tell dtexec to only run the package if it has the specified Major (and optionally Minor) build version. These are the values set at the control flow level in the SSIS package you are calling. You can check the package has a valid digital signature using /VerifySigned option. Finally are the options to match the GUID of the package you are executing using /VerifyPackageId and the version GUID of the package using the /VerifyVersionID option.
All of the options above are interesting additional protections to ensure your system is running the code that it should be, but realistically they are poor and fiddly substitutes for a proper release management process.
Why use dtexec?
Most of the options above are of course covered in SQL Agent execution options, but there are scenarios where dtexec can prove useful – most commonly where SQL Agent isn’t available or isn’t permitted to be used by IT policy. It also has value in deployment and testing scenarios. Most of the time I would advocate using SQL Agent to schedule and automate your jobs, but it’s good to have such a flexible tool available.
There is also a GUI option, dtexecui.exe which allows for the configuration of almost all the above options using a simple GUI. It also has the advantage of being able to generate the dtexec command line prompt as well, making syntax errors much more avoidable.
Official MSDN documentation can be found here for 2008 and here for 2005.
AUSSUG Upcoming Sessions
In case you aren’t in the Australian SQL Server User Group, AUSSUG, there are a few upcoming sessions in Sydney which will be pretty useful – check out the official site to register. Sessions are free and always useful.
Lunchtime Wed 3rd March, 2010 – Ensuring Optimal Performance in SQL Server 2008 Based Applications with Viktor Isakov
Evening * Thu * 4th Mar 2010 – What’s new in Reporting Services 2008 R2 and PerformancePoint Services 2010 with Peter Myers, presenting what’s new in the upcoming release of Reporting Services 2008 R2 and PerformancePoint Services 2010
And TBA date in April 2010 – Knights of the SSIS Round Table – Kevin Wong, Glyn Llewelyn and myself will be presenting a series of mini demos followed by Q&A, so a chance to pick some expert brains
Hope to catch you at one of the sessions!
SSIS Command Line Utilities part 1: dtutil
I am back on the study wagon, getting ready for Exam 70-448: TS: Microsoft SQL Server 2008, Business Intelligence Development and Maintenance. One topic that is key for is understanding the command line utilities that allow you to carry out various functions across all the BI technologies. Possibly in excessive depth, but no-one can accuse the MS BI certifications of being easy.
There are two that impact SSIS – dtutil and dtexec. In simple terms, dtutil moves packages, and dtexec executes them. In this post I will be focussing on dtutil. dtexec will be covered in a later post.
dtutil: deploying packages
The main use of dtutil is to script the deployment of packages. For file operations it is equivalent to a simple copy operation, though you can add SSIS specific operations, such as changing the package GUID. However it also allows for movement to and from the SQL Server msdb store, which can’t be done via copy / paste.
Deploying a package requires you specify its source location using one of the following options:
- /Fi (or /File) + File Path + Package Name.dtsx – for packages in the File System
- /DT (or /DTS) + Package Name – for packages in the SSIS Package Store
- /SQ (or /SQL) + Package Name – for packages in msdb
There are a few important things to note about the above options. Firstly for File System operations, you need to specify the .dtsx extension. Secondly, for SSIS Package store operations you don’t need to, even though it is still strictly speaking a file system operation. Finally, you don’t need to know where the Package Store is – dtutil works that out for you, which is handy if you have custom setups of the package store.
For packages located in msdb, you may also need to provide connection details using the following options:
- /SourceS (or /SourceServer) – the source server name. If not used, it assumes localhost
- /SourceU (or /SourceUser) – the SQL Authentication user name
- /SourceP (or /SourcePassword) – the SQL Authentication Password
Note that the username and password options are only required if SQL Authentication is being used. If left out, dtutil will try and authenticate to msdb using Windows Authentication for the current user.
Next up, you need to tell dtutil to copy the package and specify the destination. The option to move it is /C or (/Copy). This is then followed by the same options as specifying the source - though without the preceding backslash (i.e. Fi, DT or SQ) – then a semicolon, and finally the destination path. If you are using a SQL Server destination you may also need to provide connection details with the /DestS, /DestU and /DestP options.
As usual, an example is worth its weight in gold, so heres some to help you understand. In each case I will give the Full and Abbreviated versions – both of which do exactly the same job.
Moving a package from File to SSIS Package Store:
dtutil /File C:\Package.dtsx /Copy DTS;Package
dtutil /Fi C:\Package.dtsx /C DT;Package
Moving a package from msdb on a Named Instance of SQL Server using SQL Server Authentication to the file system
dtutil /SQL Folder\Package /SourceServer SQLSERVER.INSTANCE /SourceUser Monkey_User /SourcePassword P@$$word /Copy File;C:\Package.dtsx
dtutil /SQ Folder\Package /SourceS SQLSERVER.INSTANCE /SourceU Monkey_User /SourceP P@$$word /C FI;C:\Package.dtsx
Using a little batch scripting these can easily be converted into jobs which will deploy a bundle of packages quickly and easily.
dtutil: altering packages
dtutil is more than a deployment tool – it can also alter packages using the following options:
- /Encrypt – copies and encrypts the package with a specified password and protection level
- /IDRegenerate – Generates a new GUID for the package
- /Sign – Applies a digital signature to a package
The last two can be useful as options within the deployment process, where the Encrypt can replace the Copy operation if you want the deployed environments packages to be more secure.
dtutil: file and folder operations
Finally, dtutil can work components of the file system or MSDB folder structures, creating and deleting files and folders
- /Delete – Delete a package
- /Exists – Test if a package exists
- /FCreate – Create a folder
- /FDelete – Delete a folder
- /FDirectory – List the contents of a folder
- /FExists – Test if a folder exists
- /FRename – Rename a folder
- /Move – Move a package from one location to another
Most of these would be useful in an initial deployment or for tidying up.
dtutil: other things
There are a few leftover functions which I don’t have a categorisation, so here they are.
- /Quiet – suppress prompts
- /Remark – add a comment
- /Decrypt – decrypt a package as an operation is performed
Why use dtutil?
The main reason for using dtutil is simply that the BIDS supplied deployment options, and all that mucking about with Deployment Manifests, is limited, not as scriptable, and also has been buggy, having problems with configuration files (though I believe this is now fixed in 2008, but since I stopped using it long ago, I can’t be sure). dtutil also offers additional functionality, such as being able to script package operations such as encryption.
Official MSDN documentation can be found here for 2008 and here for here for 2005.
BIDS Helper
When you are a BI Monkey like me, you probably spend a fair bit of your time buried deep in Business Intelligence Development Studio (BIDS) and probably wishing it was a little less frustrating to work with.
Enter… BIDS Helper
Over at codeplex there is a project called BIDS Helper, which adds some useful additional functionality to the standard BIDS environment. It improves the development environment for integration, analysis and reporting services. In this post I will focus more on the SSIS capabilities that are added. The additional functionality is fully documented on the codeplex site, and I will highlight some of the more useful functions below – the links are to the documentation which is nice and visual, making it easy to understand.
Expression and Configuration Highlighter - this adds a highlight to components, connection managers and variables that are controlled by an expression and / or configuration, which allows you to rapidly see what is subject to change at runtime.
Expression List – this lists every property that is set by a variable (though currently not expressions that set the value of variables) – great for finding out what is actually in the package, as these things can be easily overlooked otherwise
Non-Default Properties Report – this lists any property in the package that is set to a non-default value. Very handy for debugging and seeing what is non standard.
SSIS Performance Visualization- this executes the package and presents a gantt chart of the execution time, and also can present performance over time – useful when performance tuning.
Variables Window Extensions - this adds the functionality of being able to change a variables scope – very handy for when you create a variable at the wrong scope (something I do all the time).
BIDS Helper is a handy little tool that adds a few nice functions for SSIS developers without intefering with the existing UI or design environment – I thoroughly recommend installing it, especially as it’s free!
Fuzzy Thinking
I’ve covered off the Fuzzy Lookup and Fuzzy Grouping transformations in SSIS and noticed in my research that these capabilities aren’t particularly coherently talked about on the web. So below I thought i’d collect some of the better articles for your late night reading. There isn’t all that much out there, unfortunately.
So, how does it all work?
Here are a few articles covering theory, mostly from Microsoft:
- Fuzzy Lookup and Fuzzy Grouping in SQL Server Integration Services 2005 – a detailed article on the use of the components, with some coverage of under the hood behaviour and performance considerations
- Fuzzy Lookups and Groupings Provide Powerful Data Cleansing Capabilities – explains at a conceptual level using the Fuzzy Lookup components and considerations in understanding the results
- Robust and Efficient Fuzzy Match for Online Data Cleaning – a research paper by the creators of the Fuzzy Lookup algorithms which goes into the guts of how fuzzy matching works, with details on Q-Grams, Error tolerant indexes and so forth
It is probably worth reiterating that because of the way the algorithms and their Q-Grams work, when longer strings are being analysed for fuzzy matches, the better the chances of a good match. When I first started using the algorithms I was doing some client matching and matched first and last names separately. Once I had a deeper understanding of the components, I started matching on a full name and the quality and reliability of matches improved significantly.
Ok, so how do I make it work?
Now, some articles covering practical implementation of the tasks:
- Fuzzy Lookup and Fuzzy Grouping transformations – guides to the practical use of the transformations from me, the BI Monkey
- Using Fuzzy Lookup Transformations in SQL Server Integration Services – a practical example of using a fuzzy lookup
- Adventures with Fuzzy Matching – from Jamie Thompson, which provides some cautions on the reliability of the results
- Fuzzy Lookup and Fuzzy Grouping transformations – MSDN documentation
The best thing you can do is get some sample data and play with the components to understand what it is they do. The results are impressive – if not bulletproof – and can make a great contribution to de-duplicating client data, etc.
And what does the BI Monkey have to say about it?
Fuzzy Matching is a powerful and easy to use tool which is great for approximate grouping of data for analysis where a margin of error is tolerable. It is also a great helper in data cleansing exercises. Having too much faith in the results where exact matches are required will cause you to fall over at some point, so be careful. If you are engaged in such an exercise and want some experienced support, please get in touch.
And in other news, fresh from Jamie Thompson – Fuzzy Lookup and Regex are going to become available in SQL2008R2.
If you have come across any articles that you think really contribute something to the understanding of fuzzy matching technologies in SQL Server / SSIS, please let me know or post a link in the comments so I can improve this article.