Modifying an SSIS Package through code

Part of any SSIS development experience inevitably results in you discovering a minor mistake or something that was missed a long way into the development cycle – or even after, in testing (you do test your code, right?). Then you are faced with the tedious job of opening every single package, making a change in every one… and getting some serious mouse finger. Much like I once did when I learned about BufferTempStoragePath.

Fortunately, there is a way to automate these fixes. The SSIS Object model is (relatively) easily manipulated through .NET languages – so it’s not too difficult to write a small program that will change your package. Below is a sample I knocked up that will add a variable to an existing package and save the change:

using System;
usingMicrosoft.SqlServer.Server;
usingMicrosoft.SqlServer.Dts.Runtime;

namespace Package_Modifier
{
    class Program
    {
        static void Main(string[] args)
        {
            // Initialize an Application and Package object
            Application app = new Application();
            Package package = null;

            // Set a package path
            StringpkgPath = “C:\\BI Monkey\\SamplePackage.dtsx”;

            // Load the package in package object
            package = app.LoadPackage(pkgPath, null);

            // Add the new variable
            package.Variables.Add(“NewVar”, false, “User”, 0);

            // Save the package
            app.SaveToXml(pkgPath, package, null);
          }
    }
}

You can essentially make any change you like to a package – I’ve chosen adding a variable because it’s an easy manipulation of the package object and I’ve got a long way to go before I work out how to do anything much harder :)

SSIS ETL Framework v1 goes Beta!

A quick update on the status of the BI Monkey SSIS ETL Framework (on Codeplex @ http://ssisetlframework.codeplex.com/)

The Framework v1 has gone into Beta – i’ve completed all the testing, and now just need to tidy up the reports and add some extra logging capability – but it is now fully functional. I’ve left it in Beta as I want to get some feedback on it before I move it live, plus fix those small details I mentioned.

So now I will press on with updating the documentation (yes, really!) and start laying the foundations for the more Enterprise level v2 framework.

I look forward to your feedback – please take advantage of Codeplex’s issue logging functionality to help me manage bugs and improvements.

BI Documenter from Pragmatic Works

I recently had to demo SSIS to an enterprise as part of an ETL tool evaluation. One of the Microsoft BI stacks weaknesses is the lack of Data Lineage tracking. What this means is there is nothing embedded in the toolset that allows you to identify clearly the source of a data item in a package / report / cube without digging into the development environment. Rumours are this will be fixed in the next release, however nothing has yet been confirmed.

However, where the Microsoft BI stack has a competitor beating edge is the 3rd Party ecosystem – so where there is a gap in the toolset, often another company will step in and fill it. In this case, Data Lineage issues are addressed by a tool created by Pragmatic Works (which is run by Brian Knight, an SSIS heavyweight) called BI Documenter.

BI Documenter Review

So what does the tool do? It has 3 key functions:

Documentation Generation: The tool auto-generates documentation for Databases, SSIS packages, SSAS Cubes and SSRS reports. It’s quick, and the output is pretty – and it’s really a bit useless. Its benefit is if you have to say you’ve produced some documentation and need to do so with minimal effort. The reason I say this is because it provides no context for why things have been done, what the purpose of the component is, where it fits in to the framework etc. My usual gripe with documentation I come across is that it only tells me the what, not the why, and the why helps me solve a problem. This tool can’t do anything to address this.

Data Lineage: Now this is where the main value lies. Through simple navigation you can select any object (table / view / package etc) in the solution and see what objects depend on it and what objects it depends on in turn. This is great in a complex system where if you need to make a change and need to find out what that impacts.

Now its not perfect – it seems to skip documenting some sources, such as flat files, so they get missed in the impact analysis. And the level of granularity is at the object level – for example you can’t see the impact of an individual column change, just at the table level, but its still a great start and a useful tool.

Snapshot comparison: A final piece of value which can be useful in troubleshooting. BI Documenter takes snapshots of your solution to document – and you can compare these to identify changes in the solution. The detail level is pretty good and will be a great place to start tracing changes in your system when your source control systems fail.

Conclusions

Is the tool worth it? At a maximum cost of US$500 a seat, it’s definitely got a place somewhere in your organisation. The documentation tool is of limited use but the Data Lineage and Snapshot comparisons are worth the cost of the product. Full details here: Pragmatic Works – BI Documenter

Passed 70-448: SQL 2008 BI Developer

Hurrah! At long last I finally got round to taking 70-448: SQL 2008 BI Developer and passed!

Test Exams

As well as blowing my own trumpet, there is also a point to this post about studying for these exams – and specifically using practice exams. This is a bit controversial – as evidenced by this recent thread in the MSDN SSIS Forums. Some exams are a brain dump – like Test King – and if you just want to learn the questions and answers, then that will serve you well, but your certification will be worthless once people find that you don’t actually know what you are talking about.

I’ve been using Transcender’s exams to get me through 70-448 (and it’s 2005 predecessor) and found in useful because it helped me understand what kind of questions I’d be facing, but also where I needed to study. So rather than a rote learning exercise, I used them as a springboard to guide where I needed to study.

I’m not going to endorse Transcender particularly – I found myself having to correct far too manyof their questions where either the question was misleading, or the answer was simply wrong. To their credit they did actually respond to the issues I raised and corrected them, but it’s frustrating when you review your answers from a practice run and come to the realisation that the exam itself was wrong.

I’ve also found their customer service unacceptably slow – being a consultant I migrate laptops and o/s a lot and have burnt through my allotted product activations – I’ve had to request a manual activation code. I did this 4 days ago and am still waiting…

Also, the Exam Sucks – a bit

Some of the critiques levelled above at Transcender’s practice exam also apply to the real thing. Some of the questions and answers are vaguely worded or use terms that don’t match the product itself.

However my biggest bugbear is the scope of the exam. There’s quite a few questions on things that bear no relevance to the day to day experience of a BI Developer. I point heavily to the questions on SSRS Configuration files which in practice are rarely touched, or the emphasis on the use of Checkpoints in SSIS which most people avoid due to their poor implementation. But I guess this is true of many exams regardless of subject so is a bit forgivable.

Regardless, I still think certifications have their value – they force you to study areas you are less familiar with. It won’t make you proficient, but it will at least make you aware of capabilities. Experience, however, still trumps a bit of paper :)

SSRS & Stored Procedures

As an ETL Monkey, my experience with reports has been a bit incidental. One thing that puzzled me though is why report developers always wrote stored procedures to generate data for their reports, instead of using the SQL capabilities within the report. They said “Best Practice” and I was happy to leave them to it!

In this detailed post Adam Haines explains why. It’s very detailed so here are the key points if you have a short attention span:

  • The query can be maintained independently of SSRS, allowing tuning the query without accessing or modifying the reports
  • The execution plans can be cached if you use an Stored Procedure, but not if you use SSRS
  • Stored procedures allow the use of certain objects that cannot be used in embedded T-SQL in the report such as temp tables and indexes specific to those temp tables as well as table variables
  • Stored Procedures provide a layer of abstraction between the report and the business logic
  • Stored Procedures allow re-use of similar logic

Credit for the above list to my colleague John Simon who authored most of the above points in an internal discussion.

Infrastructure pains

One of the headaches that has plagued various projects I have been part of has been Infrastructure. From the provisioning of environments, to dodgy release practices, to environments that were deemed “unnecessary” – sometimes the problems have not been the code, but where and how it gets into the wild.

Dev -> Test -> Prod

At the very least, any software release should go through these basic code promotion steps. When you’re doing a BI project, just because its a bit odd in software terms, doesn’t mean you can skip the standard code promotion activities. Development should be done in the Dev environment, where any damage done by bad code is minimal. Once it “works” it needs to be promoted into a Test environment to ensure that it actually does work, and it’s not a fluke of the right test data having been constructed. Once its been tested, it can then go into Production.

This means on a project you need these three environments up and running from the outset. It can sometimes be a hard sell to smaller, less experienced IT departments who haven’t experienced the pain of an overly keen developer trashing production IT infrastructure. It does increase cost, but prevention is far better than cure.

As far as how the environments look, Dev can be whatever you like – the databases can be a mess, the code can be spaghetti – who cares? This is the developers playground and they can do what they like in here. However Test and Prod should be exactly the same. This way you can spot the “but it worked in Development” problems that somehow drag down production.

Dev = Test = Prod

Now, the next important thing is to ensure each of these environments are physically the same. So all the software is the same, it’s patched to the same version, it has the same network cards and drive mappings. If you have a seperate database and SSAS box, don’t just use one machine in Dev and Test because “in theory” it’s the same as production.

This - like the code promotion cycle above – is about prevention being better than cure. I’ve wasted many hours of my development life debugging issues that eventually turned out to be due to inconsistent patching, drive letters not being the same in different environments and so on. One of the issues with inconsistent environments is that after a while you accommodate for them – and forget you are doing it – then a new developer comes on board and blows things up because you’ve become so accustomed to the workarounds you’ve almost forgotten they’re there.

The key here is to have a good infrastructure build guide that explains how each environment is constructed, so the Infrastructure team have no excuse for building inconsistent environments. There will of course be some differences – IP addresses, Server Names, etc. – but these will be documented and should be legitimate.

Dev -> Test = Dev -> Prod

Finally, code promotion should be the same regardless of environment. If you find yourself making allowances for a quirk in one environment… well, see my comments above about inconsistent environments. Code promotion should be a boring routine that can be done by anyone who can follow simple instructions. Because in theory, your developers should have no access to production environments and the promotion from Dev to Test should be considered a dry run for the promotion to Prod.

Surely this is a bit too much?

Yes, yes it is. If we all coded perfectly and when we deployed we never made a mistake it’s totally unneccessary :)

The Row Sampling Transformation

Fig 1: The Row Sampling Transformation

Fig 1: The Row Sampling Transformation

It’s been a long time since I did one of these! In this post I will be covering the Row Sampling Transformation. The sample package can be found here for 2005 and guidelines on use are here.

What does the Row Sampling Transformation do?

The Row Sampling Transformation takes a fixed number of rows from a source data set – in a similar manner to the Percentage Sampling Transformation, except that instead of a proportion of your data, it takes a fixed number of rows. It splits your data set into two sets, the Sampled and Unsampled outputs, as below where 10 rows of a 100 row data set have been sampled:

The Row Sampling Transformation outputs

Fig 2: The Row Sampling Transformation outputs

The assigning of rows to an output is nominally random, but given the same data set and random seed (explained below), the same rows will always be selected each time you run the package.

Configuring the Row Sampling Transformation

There are two important properties to configure on the transformation. First is the Number of rows, which determines how many rows will fall into the Sample output. Second is the random seed. This seed tells the random selection algorithm which rows to choose. If you fix the seed, you will get consistent results – if you understand a little about randomisation in computing, you will understand randomness is a bit of a relative concept to a computer. If you leave the checkbox unselected, the package will pick a random seed based on the OSes’ tick count, so results will appear to change.

You can also name your Sample and Unselected outputs, should you wish. It’s worth noting that you aren’t obliged to actually use either output downstream of the component, so you can use this component to select a fixed number of rows from your source – or ignore a fixed number of rows from your source, by only using the Unselected output.

Configuring the Row Sampling Transformation

Fig 3: Configuring the Row Sampling Transformation

Where should you use the Row Sampling Transformation?

The main use for this would be to select a fixed size subset of data. This subset could be used for Data Mining test sets, or for limiting your data set size when testing packages – e.g. if you are running against a multimillion row data source, you could just run the package with 100 rows to see if your processes worked.

MSDN Documentation for the Row Sampling Transformation can be found here for 2008 and here for 2005.

If you need specific help or advice, or have suggestions on the post, please leave a comment and I will do my best to help you.

Other Blog roundup

Shockingly, it appears it’s not just me on the Internet blogging about Microsoft BI :)

I’ve added a couple of new links to by Relevant blogs. First up is Boyan Penev’s blog- a good blog that I’ve been referencing in my recent excursions into MDX. He’s also just joined the consultancy that I work for, which is a happy coincidence.

Next is Duane Douglas’s SSISBI.com- particularly for his posts on programmatically building SSIS packages which is a 12 part series starting here. It is comparatively easy to build packages from code. One of the extensions of my ETL Framework project that I’m very keen on doing is getting a utility built that generates packages that conform to the framework, rather than using templates which are easier to break. Hence my current fervent studies in C# which will no doubt please Todd McDermid.

On the subject of Todd, he recently presented at a SQL Saturday, and put up a few presentations which I think were pretty on the nose about their subjects. First of those was a Business Intelligence and Data Warehouse Primer – which for newcomers to DW/BI laid out the landscape nicely. The second was on the options available for Dimension Processing in SSIS. Finally there was one on the guts of SSIS which is in the same territory as my own ABC of ETL with SSIS. Follow this link to get to the post with the presentations.

An SQL alternative to the SCD

In SQL 2008 a new T-SQL construct was added - the MERGE operation. (Ok, pedants will know this wasn’t new to Oracle,  but it was new to SQL Server).

This operation allows for the merging of a dataset into a reference dataset – which can be remarkably similar to Insert / Update operations effected by the Slowly Changing Dimension transformation. However the way it operates is very different. Instead of the SCD’s row by row evaluation approach, the MERGE operation is a set based operation. What this means is it compares the whole of the source dataset to the reference dataset in a single pass. This has significant implications for performance – on a site where I implemented this the operation which took 1,200 seconds in the SCD cut down to 51 seconds using a Merge.

There are limitations and differences to be aware of:

  • You cannot directly return row counts for Insert / Update / Ignore operations in the Merge
  • As it is a bulk operation a single row will cause failure of the whole batch
  • There’s no GUI – just hand crafted SQL
  • Less error trapping / logging options
  • More flexibility in terms of actions when matches / non matches are found

The main reason why you would consider the SQL Merge – it handles Type 1, and with a little cunning, Type 2 dimensions – in a fraction of the time it takes the SCD to plod through. It’s still not as fast as a proper in memory comparison using something such as TableDifference – but it’s always good to know you have something else available in your toolkit.

Further information:

Managing your history data

This post is to an extent a small rant about some design decisions I have been constrained by on my current project. These decisions were made predominantly for one fairly bad reason: it made the architect’s life easier (apologies to the architects if they are reading – but these were bad choices!)

The design choices in question are around the managing of history data. In one component of the system it relates to Database storage design, the other relates to Cube storage design. In both cases the history data is stored in a separate location to the “current” data.

Databases: Why separate history tables are a bad idea

The first – and most compelling – reason for not storing your history data in separate tables to your current tables is that it increases complexity for users. Instead of having one location to look for data, your users now have to use two.

The second compelling reason is that there is no point to doing this from a storage point of view. SQL 2005 & 2008 (Enterprise editions only, admittedly) provide partitioning. This enables the contents of an individual table to be stored in separate locations on different filegroups. This means that you can store your current days data in one location and your history in a different one. The reason for doing this is the same as splitting it into separate tables – that querying the current section will be faster than the historic section.  In theory queries against partitioned tables should in fact be faster as the current data is now no longer in the same filegroup as the history data.

Now, there is an overhead associated with designing and maintaining partitions but I don’t see that it is significantly larger than that required to deal with the process required to archive data into separate tables on a daily basis. Additionally when maintaining separate history tables, you need to separate out every single table, whether it gets 10 rows a day or 10 million. With partitioning you can just target the large tables that need that focus.

There are other downsides to maintaining separate tables. If you make a change to a table design, you need to do it in 2 places.  You also need to remember to update your history processes. If your history process fails, you can end up with users getting unexpected query results or ETL process failures when the system loads the next day’s data into the current tables –  and untangling it becomes a real mess. If your partition processes fail to run, you just have too much data in one filegroup for a while – unlikely to be fatal.

So if you have large tables you need to split out for performance purposes – do it at the back end, using the power of the database – which is designed to store data efficiently. Keep it away from the users – they neither need to know or care about your need to keep the data separate. If you want to give them a single object to query with the current day’s data, just use views.

Cubes: Why a separate history Cube is a bad idea

Much of the above applies here – SSAS also has partitions – so you can again store your historic and current data in separate physical locations with the users being totally unaware of this. Again there is overhead in maintenance, but this will also balance out with the maintenance and risks associated with maintaining two identical cubes that only differ in terms of data source.

Use your storage options!

So without banging on about the same things any further, please consider the following two points whenever considering managing your history data:

  1. How does what i’m planning affect my users?
  2. How does what i’m planning leverage the platforms capabilities?

Next Page »