SSIS Command Line Utilities part 2: dtexec

In my previous post on dtutil, I looked at the command line utility to move and alter SSIS packages. In this post I will be looking at dtexec, the command line utility that executes packages. It allows for significant alteration of run-time options, including configured values, connections, logging and so on.

dtexec: Basic Package Execution

Like dtutil, dtexec uses a similar set of basic options to identify where the package you are operating on is located: /File, /DTS & /SQL. The eagle eyed among you will note that while the full option name is the same the abbreviated is slightly different, though their operation is the same. There is also a slight difference in the /SQL additional options – because no movement of the package is taking place, you only need to specify the /Server, /Password and /User for the package being executed (rather than separate identification of the Source and Destination package). Similarly there is a /Decrypt option for providing the password to encrypted packages. For details on this, check either my previous post or MSDN documentation.

There are two important options around basic execution. First is the /X86 option. This forces SQL Agent to to execute the package in 32-bit mode on a 64-bit server, which can be important in scenarios where 64-bit compatibility isn’t available. This commonly is an issue with drivers – however this is ignored when running from the command line. When running scenarios where you have SSIS servers in your environment running in 32 and 64-bit you probably want to read this MSDN article first: 64-bit Considerations for Integration Services.

The second is the the option to /Validate the package instead of executing it. What this does is halt the package after the validation phase – this can be useful for testing in deployment scenarios. How I sometimes use it is following deployment with a dtutil script, I run a dtexec script to validate the deployed packages, which helps catch any issues with missing / incorrect configurations etc. before you do a test run. This can be used in conjunction with the /W[arnAsError] option, which will treat a warning as an Error and cause validation to fail.

As a side note, you can place comments in the dtexec command line using the /Rem option.

dtexec: Configuring Execution

Execution can be configured using a number of options. First up is the ability to get dtexec to reference a file with all your dtexec options set in it, using the /Com[mandFile] {filespec} option. This can be handy if you are in an environment where you call dtexec via a remote tool such as Control M that has a character limit in terms of the batch command it can send.

You can provide or override the configurations used in execution using the /Conf[igFile] {filespec} to point at an XML configuration file. Similarly you can override specified connection settings using the /Conn[ection] id_or_name;{connection_string} option – multiple instances of this option are allowed and you can either point at a connection name or the GUID of the connection. You can associate an SSIS logging type with package execution using the /L[ogger] classid_orprogid;{configstring} option – this could be useful in troubleshooting if you don’t want logging enabled during normal execution but want to be able to turn it on for debug.

Individual Package properties such as variable values can be /Set – and you can set as many as is required. Finally you can set the Maximum number of executables permitted using the /M[axConcurrent] {concurrent_executables} option.

dtexec: Using Checkpoints

dtexec can make a package override its defined checkpoint usage settings – though as i’ve alluded to before, be very careful when using checkpointing as you won’t always get the expected results. But the options available are to turn /CheckPointing {on\off}, specify the actual file using /CheckFile and specify the usage of Checkpoints at package start using /Restart {deny | force | ifPossible}. These options match the settings in the package of SaveCheckpoints, CheckpointFile and CheckpointUsage.

dtexec: Execution Logging and Reporting

When running packages from dtexec, you get some additional logging options which can be useful when you are trying to do a more detailed debug on a package. Some are useful for immediate visualisation in the console window as you execute while others create file based logs.

The first console option is /ConsoleLog, which allows you to write to the console window the same entries that would normally be captured by SSIS execution logging. You can select which of these columns you want to capture, the most important of these likely to be Message. Further fine tuning in the settings for this option allow you to tighten the logging to a specific task or event – again as per standard logging. Next is /Reporting  level, which tunes what type of event to capture – Errors, Information, Warnings and so on. Usually when you want this level of reporting i’d advise using the Verbose switch to just capture everything. There is also the capability to create an event exclusion list for this option. Finally, /Sum will put a count of the number of rows the next component will receive in the log.

If you want to push all this logging out to a file (and the logs can get pretty big so they can blow out the console window pretty quickly) you can write them out using the /VLog option. If you need to venture further into debugging you can create the dump files .mdmp and .tmp using either the /DumpOnError option to dump on any error or /Dump with a specified error to only generate the files under certain circumstances. These are advanced options and to understand the output i’d advise reading the MSDN article Working with Debug Dump Files first.

dtexec: Package Verification

The dtexec utility offers a few options to verify the package you are pointing at is the one you intend to execute. First up is the option to verify the build versions using the /VerifyBuild option which allows you to tell dtexec to only run the package if it has the specified Major (and optionally Minor) build version. These are the values set at the control flow level in the SSIS package you are calling. You can check the package has a valid digital signature using /VerifySigned option. Finally are the options to match the GUID of the package you are executing using /VerifyPackageId and the version GUID of the package using the /VerifyVersionID option.

All of the options above are interesting additional protections to ensure your system is running the code that it should be, but realistically they are poor and fiddly substitutes for a proper release management process.

Why use dtexec?

Most of the options above are of course covered in SQL Agent execution options, but there are scenarios where dtexec can prove useful – most commonly where SQL Agent isn’t available or isn’t permitted to be used by IT policy. It also has value in deployment and testing scenarios. Most of the time I would advocate using SQL Agent to schedule and automate your jobs, but it’s good to have such a flexible tool available.

There is also a GUI option, dtexecui.exe which allows for the configuration of almost all the above options using a simple GUI. It also has the advantage of being able to generate the dtexec command line prompt as well, making syntax errors much more avoidable.

Official MSDN documentation can be found here for 2008 and here for 2005.

AUSSUG Upcoming Sessions

In case you aren’t in the Australian SQL Server User Group, AUSSUG, there are a few upcoming sessions in Sydney which will be pretty useful – check out the official site to register. Sessions are free and always useful.

Lunchtime Wed 3rd March, 2010 – Ensuring Optimal Performance in SQL Server 2008 Based Applications with Viktor Isakov

Evening * Thu * 4th Mar 2010 – What’s new in Reporting Services 2008 R2 and PerformancePoint Services 2010 with Peter Myers, presenting what’s new in the upcoming release of Reporting Services 2008 R2 and PerformancePoint Services 2010

And TBA date in April 2010 – Knights of the SSIS Round Table – Kevin Wong, Glyn Llewelyn and myself will be presenting a series of mini demos followed by Q&A, so a chance to pick some expert brains

Hope to catch you at one of the sessions!

Migrating from Cognos to Microsoft BI

If you are considering moving from Cognos BI to Microsoft BI (and as a Microsoft BI guy, I’ll usually recommend it), it’s worth knowing what’s what in each stack and where the similarities lie. It’s also important to understand how much effort is going to be involved to see if there’s a genuine business benefit. I’ll frame the discussion from a Cognos to Microsoft Equivalent perspective, though some matches are more approximate than others. My reference for the anatomy of the Cognos stack is here: http://www.cognos-bi.info/cognos8.html

Why Migrate from Cognos to Microsoft?

There are myriad reasons why anyone may want to move from Cognos to Microsoft, but i’ll probably surprise you by saying functionality isn’t one of the bigger drivers. Each suite has its own strengths and weaknesses and on balance there’s probably not enough strength on either side to move for that alone, unless addressing specific business needs. I’d view the main reasons as:

  1. Licensing- Cognos is a suite of separate applications that need to be licenced separately, as opposed to SQL Server which is a single product. There can be significant cost and administrative savings from moving to the Microsoft suite. How much this works out to be will depend on your licensing agreements, but reality is Microsoft is the cheaper option.
  2. Existing SQL Assets – Many times companies have SQL Server licences for their databases and want to extract more value from those licences
  3. Cognos 8 upgrade- Your business may be looking at the cost of moving to Cognos 8 from Cognos 7 and not seeing enough ROI potential.
  4. Performance- many Cognos users report unsatisfactory performance from deployments
  5. Uncertainty- With the IBM buyout of Cognos, the number of products likely to be in the mix has gone up – what’s going to stop being supported? What other tools will I need? With Microsoft you have the certainty of a single integrated platform with a long and clear roadmap
  6. Skills availablity – there are simply more Microsoft BI developers out there, and generally they cost less to employ and are cheaper to train.
  7. Application Integration- Microsoft applications are quite open in terms of connectivity, customisation and integration with other applications. Cognos is a closed box . For example anyone can talk to an Analysis Services cube, but no non-Cognos application can talk to a Cognos cube. Reporting services can easily be extended with custom controls.

Decision Stream / Data Manager = SSIS (Integration Services)

Decision Stream (Cognos 7) and Data Manager (Cognos 8) – though there is minimal difference between the two – are Cognos’ ETL tools. They are showing their age badly, and are very “black box”. I personally loathe the clunky interface and untunable performance. But – they work. The SQL Server equivalent tool is Integration Services – a much more up to date, faster and configurable ETL tool. SSIS is a more generalist tool compared to Data Manager, and so it requires a bit more thought (though not a lot) for building Data Warehouses. However if you have large volumes of data to shift, Cognos simply cannot compete.

The reality is ETL is one of the hardest components to migrate. It often has a lot of business rules (and in Data Manager it’s possible to hide them in a million different places) and if you get the migration wrong, you stand to devalue your existing DW. In any migration scenario this would likely be the last on the list, unless addressing a specific need. I would generally leave this as a legacy system that gets superseded over time.

PowerPlay = SSAS (Analysis Services)

Cognos cubes are built in PowerPlay in both Cognos 7 and 8 (in fact there was no upgrade of Powerplay in Cognos 8). Whilst adequate, there are many stories of business migrating to SSAS for significant performance benefits – I am aware of one client whose cubes processing time went from 24 hours to 30 minutes by migrating. SSAS has significant market penetration – getting hard figures is difficult (without paying a lot of money – last public figures are to 2006) but Microsoft is the leading vendor by a significant margin – so obtaining expertise is less difficult (though as with ETL, true OLAP wizards are rare and expensive creatures).

Cubes are one of the easier components to migrate (unless you have very complex ones), as any well designed cube is usually sat on top of a robust ETL which handles the really complex business rules. In the best case scenario you just have to build your dimensions and measures and the job is done.

Report Studio = SSRS (Reporting Services)

Technically speaking, Visual Studio is the direct equivalent as that is the design environment, but Reporting Services is SQL Servers’ Reporting engine. It provides fixed parameterised reporting with the capacity to provide automated deliveries through easily managed dynamic lists. SQL2005 was functional but a little lacklustre, but in SQL2008 has come on leaps and bounds. It’s not the most stellar part of the Microsoft BI stack, but flat reports don’t usually need stellar functionality.

Migration of reports is usually pretty straightforward – you have the layout and formats already – and once the underlying cubes are set up it’s simply a matter of rebuilding each report.

Query Studio = SSRS Report Builder or Excel

Query Studio is a user tool for building simple reports, functionality which is carried out in the SQL world using a tool called Report Builder, or even just though Excel (see the section below on Analysis Studio). Personally, I don’t like Report Builder, though the next edition (3.0) is supposed to be an improvement – but then so was 2.0, so I’m a little sceptical. However if you have good cubes, or have access to the forthcoming PowerPivot, most analysts who want ad hoc data will migrate rapidly to Excel.

This is unlikely to be a significantly used application so the change in functionality probably won’t be an issue, but here Cognos has a better offering.

Analysis Studio = Excel

For working with cubes from an analytical context, Excel is Microsoft’s tool for the job. Using functionality based on Pivot Tables you get a capable (though admittedly not perfect) tool for drilling down, slicing and performing analysis on OLAP data. Unless you have very demanding users, Excel will meet most needs – especially if you have upgraded to 2007 (and if you haven’t, you really really should). If you do have demanding users, there’s a host of 3rd party applications which plug in to excel and can dress specific needs.

Further to this basic functionality, workbooks with connections to SSAS can then be uploaded to SharePoint and distributed via a set of technologies called Excel Services, allowing for fast dissemination of results.

Metric Studio = SharePoint

 Metric Studio is aimed at Dashboarding and Scorecarding. Microsoft at one point had an offering called PerformancePoint to address this, but it suffered a premature demise, but the relevant components migrated into SharePoint, and anyone with an Enterprise licence can install them. This isn’t a particularly exciting tool for either suite so I’ll move on. The migration path is similar to that to SSRS.

Cognos Connection = SharePoint

Cognos Connection is the BI portal for the Cognos suite. Sharepoint is the same for the Microsoft stack… and  an Enterprise CMS to boot, plus a whole heap of other things I have only limited awareness of. Cognos Connection will have better integration with the BI Metadata, Sharepoint will have better integration with your Enterprise generally. It depends what matters more to your business.

Migration here depends on the extent to which you have SharePoint deployed in the organisation already. If it’s there, it becomes a manual exercise in migrating content. If its not, then it’s a major enterprise change.

Bits that don’t match on either side

Of course there are areas where the two suites don’t quite tie up. Cognos has Planning, which Microsoft did, but now doesn’t. Microsoft has PowerPivot, an in memory analysis tool which Cognos hasn’t got a counter offer for. Framework Manager has an approximate equivalent in the Data Source Views that underpin SSAS cubes, but the implementation is quite different and more database oriented. Content Store, the Cognos metadata store has no match in the Microsoft stack and can be perceived as one of the MS BI stacks bigger weaknesses. Event Studio is approximately matched by SQL Server Notification Services. Of course Microsoft’s whole underpinning is based on SQL Server, a robust enterprise database system with scaling for massive warehouses and Master Data Management which Cognos has no direct offering for.

Final Comments

Undoubtedly i’ve got a few things wrong here – my knowledge of the Cognos stack is not as detailed as that of the Microsoft Stack – and I’ll welcome any corrections from the Cognos Community. This post is intended to help those with Cognos installations understand what Microsoft has to offer, the applications involved and what they translate to in Cognos speak. Some of the features I mention are due in SQL2008R2 which will be released in a few months.

If you are looking for an answer to the question “Should I Migrate?” then the answer will always be that it depends on your circumstances. As I said at the outset, each stack has its own strengths and weaknesses and each one may address your business needs better than the other. This is something that needs careful analysis from experienced BI professionals. I look forward to the debate this is likely to promote!

SSIS Command Line Utilities part 1: dtutil

I am back on the study wagon, getting ready for Exam 70-448: TS: Microsoft SQL Server 2008, Business Intelligence Development and Maintenance. One topic that is key for is understanding the command line utilities that allow you to carry out various functions across all the BI technologies. Possibly in excessive depth, but no-one can accuse the MS BI certifications of being easy.

There are two that impact SSIS – dtutil and dtexec. In simple terms, dtutil moves packages, and dtexec executes them. In this post I will be focussing on dtutil. dtexec will be covered in a later post.

dtutil: deploying packages

The main use of dtutil is to script the deployment of packages. For file operations it is equivalent to a simple copy operation, though you can add SSIS specific operations, such as changing the package GUID. However it also allows for movement to and from the SQL Server msdb store, which can’t be done via copy / paste.

Deploying a package requires you specify its source location using one of the following options:

  • /Fi (or /File) + File Path + Package Name.dtsx – for packages in the File System
  • /DT (or /DTS) + Package Name – for packages in the SSIS Package Store
  • /SQ (or /SQL) + Package Name – for packages in msdb

There are a few important things to note about the above options. Firstly for File System operations, you need to specify the .dtsx extension. Secondly, for SSIS Package store operations you don’t need to, even though it is still strictly speaking a file system operation. Finally, you don’t need to know where the Package Store is – dtutil works that out for you, which is handy if you have custom setups of the package store.

For packages located in msdb, you may also need to provide connection details using the following options:

  • /SourceS (or /SourceServer) – the source server name. If not used, it assumes localhost
  • /SourceU (or /SourceUser) – the SQL Authentication user name
  • /SourceP (or /SourcePassword) – the SQL Authentication Password

Note that the username and password options are only required if SQL Authentication is being used. If left out, dtutil will try and authenticate to msdb using Windows Authentication for the current user. 

Next up, you need to tell dtutil to copy the package and specify the destination. The option to move it is /C or (/Copy). This is then followed by the same options as specifying the source - though without the preceding backslash (i.e. Fi, DT or SQ) – then a semicolon, and finally the destination path. If you are using a SQL Server destination you may also need to provide connection details with the /DestS, /DestU and /DestP options.

As usual, an example is worth its weight in gold, so heres some to help you understand. In each case I will give the Full and Abbreviated versions – both of which do exactly the same job.

Moving a package from File to SSIS Package Store:

dtutil /File C:\Package.dtsx /Copy DTS;Package

dtutil /Fi C:\Package.dtsx /C DT;Package

Moving a package from msdb on a Named Instance of SQL Server using SQL Server Authentication to the file system

dtutil /SQL Folder\Package /SourceServer SQLSERVER.INSTANCE /SourceUser Monkey_User /SourcePassword P@$$word /Copy File;C:\Package.dtsx

dtutil /SQ Folder\Package /SourceS SQLSERVER.INSTANCE /SourceU Monkey_User /SourceP P@$$word /C FI;C:\Package.dtsx

Using a little batch scripting these can easily be converted into jobs which will deploy a bundle of packages quickly and easily.

dtutil: altering packages

dtutil is more than a deployment tool – it can also alter packages using the following options:

  • /Encrypt – copies and encrypts the package with a specified password and protection level
  • /IDRegenerate – Generates a new GUID for the package
  • /Sign  – Applies a digital signature to a package

The last two can be useful as options within the deployment process, where the Encrypt can replace the Copy operation if you want the deployed environments packages to be more secure.

dtutil: file and folder operations

Finally, dtutil can work components of the file system or MSDB folder structures, creating and deleting files and folders

  • /Delete – Delete a package
  • /Exists – Test if a package exists
  • /FCreate – Create a folder
  • /FDelete – Delete a folder
  • /FDirectory – List the contents of a folder
  • /FExists – Test if a folder exists
  • /FRename – Rename a folder
  • /Move – Move a package from one location to another

Most of these would be useful in an initial deployment or for tidying up.

dtutil: other things

There are a few leftover functions which I don’t have a categorisation, so here they are.

  • /Quiet – suppress prompts
  • /Remark – add a comment
  • /Decrypt – decrypt a package as an operation is performed

Why use dtutil?

The main reason for using dtutil is simply that the BIDS supplied deployment options, and all that mucking about with Deployment Manifests, is limited, not as scriptable, and also has been buggy, having problems with configuration files (though I believe this is now fixed in 2008, but since I stopped using it long ago, I can’t be sure). dtutil also offers additional functionality, such as being able to script package operations such as encryption.

Official MSDN documentation can be found here for 2008 and here for here for 2005.

BIDS Helper

When you are a BI Monkey like me, you probably spend a fair bit of your time buried deep in Business Intelligence Development Studio (BIDS) and probably wishing it was a little less frustrating to work with.

Enter… BIDS Helper

Over at codeplex there is a project called BIDS Helper, which adds some useful additional functionality to the standard BIDS environment.  It improves the development environment for integration, analysis and reporting services.  In this post I will focus more on the SSIS capabilities that are added.  The additional functionality is fully documented on the codeplex site, and I will highlight some of the more useful functions below – the links are to the documentation which is nice and visual, making it easy to understand.

Expression and Configuration Highlighter - this adds a highlight to components, connection managers and variables that are controlled by an expression and / or configuration, which allows you to rapidly see what is subject to change at runtime.

Expression List – this lists every property that is set by a variable (though currently not expressions that set the value of variables) – great for finding out what is actually in the package, as these things can be easily overlooked otherwise

Non-Default Properties Report – this lists any property in the package that is set to a non-default value. Very handy for debugging and seeing what is non standard.

SSIS Performance Visualization- this executes the package and presents a gantt chart of the execution time, and also can present performance over time – useful when performance tuning.

Variables Window Extensions - this adds the functionality of being able to change a variables scope – very handy for when you create a variable at the wrong scope (something I do all the time).

BIDS Helper is a handy little tool that adds a few nice functions for SSIS developers without intefering with the existing UI or design environment – I thoroughly recommend installing it, especially as it’s free!

Windows 7 Security & Bitlocker Drive Encryption

Having got my hands on Windows 7 at last, one of the important things I wanted to do was secure my data using Bitlocker Drive Encryption. After losing 2 laptops to a burglar with large amounts of personal data on them, I am being a little more careful from here on, so that the physical loss of the laptop will be the worst possible outcome, and ID theft or misuse of my data (e.g. raiding my pitiful bank accounts!) won’t be a concern. Bitlocker is only available on Ultimate and Enterprise editions – this I think is an uncalled for feature restriction – to enhance its reputation for security, Microsoft would be better off opening this up to all editions of Windows 7. Users of other editions will have to use Folder Level encryption, which I’ll cover towards the end.

Bitlocker Drive Encryption encrypts the entire disk transparently. What “Transparent” really means to me and you is that if you copy a file from a Bitlocker Encrypted disk to an unencrypted disk, that file is readable by any machine. However should someone try to read the encrypted disk directly they wouldn’t see anything except encrypted garbage. So if your disks are Bitlocker Encrypted, unless someone has the Recovery Key, your drives and data are unreadable.

How do I enable Bitlocker?

So how does this work in practice? Well, a lot depends on whether your computer has a TPM 1.2 chip on the motherboard. If you do, you can simply open the  Bitlocker control panel (just type Bitlocker in the search bar) and enable Bitlocker. The TPM chip manages the storage of the Bitlocker keys – and Windows 7 will stop the process if you don’t have a key. You set a PIN and after that your computer won’t boot without a PIN, and if someone tries to remove the drives to read them – well, they can get the drives out but reading theier content is impossible.

If you don’t have a TPM chip, you can put the Bitlocker key on a USB Key. To do this requires a bit of a workaround as by default Bitlocker will only work if you have a TPM chip installed, and the wizard will block progress if you don’t. Fortunately the nice folks over at sevenforums.com have posted a guide on How to Turn BitLocker On or Off without a TPM for Windows 7. It involves going into the Group Policy Editor (some sort of security thing) but is incredibly simple to do. Again, not 100% sure why Microsoft decided to make this a bit awkward to set up, so hopefully this will change in the future.

You can then apply Bitlocker and store the key on a USB drive. From now on your computer will only boot if you have the USB Key plugged in (you can – and should - remove it after the computer has booted). This is a minor inconvenience and you have to remember to keep the USB key seperate from the machine – otherwise if someone steals your computer and you’ve left the USB key in, your encryption efforts are wasted.

How do I prevent me locking myself out with Bitlocker?

This applies to those of us using USB Keys, but the advice is simple:

  • Keep multiple copies of the Bitlocker recovery keys somewhere accessible and safe – such as on a external HDD, Live Mesh, a webmail account…
  • Create multiple USB keys – from the Bitlocker control panel you can load the startup key on as many USB keys as you like. I have 3 copies so if a USB drive fails, or goes walkies, I don’t have to worry about not being able to use my computer. You can also store multiple computers bitlocker keys on one USB key.

I don’t have Ultimate / Enterprise – what are my options?

Well, option 1 is to upgrade using Windows Anytime. But assuming you don’t want to, the other option is Folder Level encryption. All you need to do is pick a folder, right click, select Properties and under the Advanced options choose to Encrypt the folder. If this is the first time you are encrypting something on your machine it will prompt you to back up the security certificate – as with the Bitlocker recovery keys I strongly recommend making multiple copies in multiple safe locations . Your folder will then encrypt and turn green so they are easy to spot. They aren’t angry, like the Hulk :)

Now this isn’t transparent – if you move a file from your encrypted folder to another location it remains encrypted and unreadable without the security certificate. But if someone stole your PC, unless you have autologin set, those files are unreadable, which for the purposes of this discussion is what we intended

So, am I safe and secure now?

Well, maybe, as long as you have done a few other basic things:

  1. Require a password on startup
  2. Have a screensaver that requires a password to unlock
  3. Backed up recovery keys
  4. Backed up your data
  5. Backed up your data
  6. Backed up your data

Good luck, and keep your data safe!

Fuzzy Thinking

I’ve covered off the Fuzzy Lookup and Fuzzy Grouping transformations in SSIS and noticed in my research that these capabilities aren’t particularly coherently talked about on the web. So below I thought i’d collect some of the better articles for your late night reading. There isn’t all that much out there, unfortunately.

So, how does it all work?

Here are a few articles covering theory, mostly from Microsoft:

It is probably worth reiterating that because of the way the algorithms and their Q-Grams work, when longer strings are being analysed for fuzzy matches, the better the chances of a good match. When I first started using the algorithms I was doing some client matching and matched first and last names separately. Once I had a deeper understanding of the components, I started matching on a full name and the quality and reliability of matches improved significantly.

Ok, so how do I make it work?

Now, some articles covering practical implementation of the tasks:

The best thing you can do is get some sample data and play with the components to understand what it is they do. The results are impressive – if not bulletproof – and can make a great contribution to de-duplicating client data, etc.

And what does the BI Monkey have to say about it?

Fuzzy Matching is a powerful and easy to use tool which is great for approximate grouping of data for analysis where a margin of error is tolerable. It is also a great helper in data cleansing exercises. Having too much faith in the results where exact matches are required will cause you to fall over at some point, so be careful. If you are engaged in such an exercise and want some experienced support, please get in touch.

And in other news, fresh from Jamie Thompson – Fuzzy Lookup and Regex are going to become available in SQL2008R2.

If you have come across any articles that you think really contribute something to the understanding of fuzzy matching technologies in SQL Server / SSIS, please let me know or post a link in the comments so I can improve this article.

AUSSUG: Integration Services 2008 new features – Tues 10 Nov 2009

One for those Sydney resident BI Monkeys – Kasim Wirama, SQL MVP is presenting to the Sydney Chapter of the Australian SQL Server User Group on new features in SSIS for SQL 2008.

Please register at the AUSSUG website if you plan to attend – sessions are free. I expect to be there, finally able to attend for the first time in months, having missed some very interesting sounding sessions. The presentations are always useful and a great chance to meet with some of the SQL Server Gurus in Sydney.

Full details below:

Venue: Mechanics’ School of Arts, Level 1, 280 Pitt St, Sydney

Duration
5:30PM Networking (Pizza & Drinks provided)
6:00PM – 8:00PM Session

On: Tuesday, 10 November 2009

Session Details
====================

SQL Server Integration Services (SSIS) is prime ETL engine of SQL Server 2005/2008.
At this session, I would like to show new feature and improvement in SSIS 2008 and
presents practical SSIS solutions, i.e.:

* Referenced resultset
* Branching in control flow and data flow
* Centralized SSIS package configuration and logging
* Multi column output with Oledbcommand component
* Streaming (high performance) sequence number
* Dynamic source file connection with much less code than former DTS does
* Lookup improvement in SSIS 2008

The Fuzzy Grouping Transformation

Fig 1: The Fuzzy Grouping Transformation

Fig 1: The Fuzzy Grouping Transformation

In this post I will be covering the Fuzzy Grouping Transformation. The sample package can be found here for 2005 and guidelines on use are here.

What does the Fuzzy Grouping Transformation do?

The Fuzzy Grouping Transformation allows you to identify similar items within a dataset. It doesn’t – as its name perhaps suggests – actually perform any group by operations. However it does provide you with the information you need to group rows by, as what it does is help identify similar rows within a dataset. It uses the same algorithms as the Fuzzy Lookup transformation, and operates a bit like a Fuzzy Lookup except that it effectively creates its own runtime lookup table, based on the incoming data, which it then references to identify similar rows.

If this is all getting a bit confusing, let’s take a look at a example of some output:

Fuzzy Grouping Output

Fig 2: Fuzzy Grouping Output

This shows three input rows with the field FullName in which the Fuzzy Grouping transformation has been told to look for similar values. These three rows have all been decided to be similar to the FullName “Gregory Alderson”. Note that – as with the Fuzzy Lookup – the match is assigned a similarity score (in the column _Similarity_FullName). Input rows are also assigned a unique key (in the column _key_in) and when a row is matched to another, the key of that row is also stored (in the column _key_out). Note the Fuzzy Lookup component is a synchronous transformation – i.e. you will get one output row for each input row.

So what has actually happened here?

  • First, the component has loaded all the data into temporary tables.
  • Second, it scanned through using fuzzy algorithms to look for similar items.
  • Third, for each input row it output either the best match above the threshold similarity, or otherwise decided it had no matches.

You can then use this output to perform an aggregate group by operation to sum up your data by similar items, using the Aggregate Transformation.

Configuring the Fuzzy Grouping Transformation

The Fuzzy Grouping Transformation is configured over three tabs when you open the component. The first thing you need to configure is the database connection it will use to create the temporary tables it needs to perform the fuzzy matching on the Connection Manager tab. These will be built in TempDB, so ensure the user specified in your connection manager has appropriate permissions on TempDB.

Configuring the Connection Manager for the Fuzzy Grouping Transformation

Fig 3: Configuring the Connection Manager for the Fuzzy Grouping Transformation

Next, on the Columns tab, select the columns that are going to be used to identify similar rows, and those which will simply pass through the component in the ‘Available Input Columns’ box. If a column is to be grouped, check on the left hand side. If it is simply to pass through just check on the right hand side. Any column selected for grouping will be passed through automatically.

Configuring the Columns used in the Fuzzy Grouping Transformation

Fig 4: Configuring the Columns used in the Fuzzy Grouping Transformation

For each column to be analysed for similarity there are a range of settings that influence how the matching occurs and how some output is named. These work as follows:

  • Input Column – the column selected with the checkbox for grouping
  • Output Alias – the name of the Input column in the Output data flow
  • Group Output Alias – the column which will hold the best match value from the grouping
  • Match Type – Exact or Fuzzy matching
  • Minimum Similarity – The minimum similarity for matches for that column
  • Similarity Output Alias – the name of the output column containing the similarity score for that column
  • Numerals – tunes how the matches handle numbers in the data
  • Comparison Flags – fine tuning the string handling

Full details for how these settings work – especially the last two – can be found here on MSDN. I would expect the most common ones you would play with would be Match Type – when grouping on multiple columns and are confident some of them have good quality data, setting to exact match will improve the overall result for matching a row of data. Similarly with Minimum Similarity, you can tune for the quality of match you will accept on individual columns.

Note you are not limited to a single column for grouping – you can select several which will contribute to the overall similarity score for the row. In the sample package I demonstrate this in data flow 2.

Configuring the Advanced options for the Fuzzy Grouping Transformation

Fig 5: Configuring the Advanced options for the Fuzzy Grouping Transformation

Finally, on the Advanced tab we set the global values that will affect the result you get. The key one is the Similarity Threshold – this is what will set the minimum threshold of match that will be accepted overall for a row to be regarded as similar to another row. The right setting for this value will depend on the data you have and what quality of match you will accept. The Similarity score this evaluates against is a function of the similarity scores of all the columns that are being evaluated for the match.

Where should you use the Fuzzy Grouping Transformation?

The scenario I can see this being most useful in is in an analytical context. From a data cleansing point of view it can be a quick way of assessing how effective a data deduplicating exercise may be, and what columns and parameters to use. From a purely analytical approach it would be useful for doing rough groupings on data where exact results aren’t important and a degree of error can be tolerated.

Be aware that this component is a blocking transformation – i.e. it requires all input data to be consumed and processed before it will output anything, which can result in poor performance.

MSDN Documentation for the Fuzzy Grouping Transformation can be found here for 2008 and here for 2005.

If you are still struggling, try these additional resources:

If you need specific help or advice, or have suggestions on the post, please leave a comment and I will do my best to help you.

First experiences with Project Gemini

What are my Gemini First Impressions?

I’m one of the chosen many who have access to the Gemini CTP2 – and had my first hands on experience with Gemini yesterday. I was up and running with it very quickly and grinning very shortly afterwards. The time it took me from connecting to the AdventureworksDW to producing a nice interactive “super” pivot table with slicers and very rapid response was very short.

As someone with experience of being an analyst, Gemini looks like a great tool for rapidly accessing and manipulating data. You connect directly to your data source (which doesn’t have to be SQL Server, btw), create some relationships and start playing. There are some new formulas to understand, but if you can work with a relational database this stuff will be a breeze. The response times are impressive, as are the connectivity options once you have finished. A Gemini workbook can become a data source for reports and deployed to MOSS without losing its connectivity as it falls into Excel Services.

Is this a Data Warehouse Killer?

One of my contemporaries said he was very concerned about Gemini leading to the next generation of spreadmarts, and that it could damage the market for Data Warehouses. I half agree, half disagree.

Firstly, the concern that Gemini will create the next spreadmart nightmare. Yes, it will create a new mess of user created, uncentralised data. However – as with spreadmarts – we will as an industry learn from this and find a way to manage it. Only this time we have prior experience of the phenomenon. A further mitigation factor for this risk is that I see the number of people using Gemini will be smaller than those using Excel. Those people will also be more likely to be analysts and hopefully more conscious of the risks of a Gemini-mart.

Will it kill the Data Warehouse? Probably not – analysts – the market for Gemini – don’t want to spend their time preparing and cleaning data. The DW will remain a good source of clean data for analysts to work with, and will remain a valuable part of the enterprise BI suite. It may remove the need for some warehouses – especially those based on clean data sources. I can also see it opening the door for new warehouses – as Gemini will make more data sources available to analysts. They will see it, play with it – declare it as important and bring it into the warehouse. So overall I expect the impact to be neutral.

So how does the BI Monkey feel about Gemini?

Easy – very, very excited!

Next Page »