The Row Sampling Transformation

Fig 1: The Row Sampling Transformation
Fig 1: The Row Sampling Transformation

It’s been a long time since I did one of these! In this post I will be covering the Row Sampling Transformation. The sample package can be found here for 2005 and guidelines on use are here.

What does the Row Sampling Transformation do?

The Row Sampling Transformation takes a fixed number of rows from a source data set – in a similar manner to the Percentage Sampling Transformation, except that instead of a proportion of your data, it takes a fixed number of rows. It splits your data set into two sets, the Sampled and Unsampled outputs, as below where 10 rows of a 100 row data set have been sampled:

The Row Sampling Transformation outputs
Fig 2: The Row Sampling Transformation outputs

The assigning of rows to an output is nominally random, but given the same data set and random seed (explained below), the same rows will always be selected each time you run the package.

Configuring the Row Sampling Transformation

There are two important properties to configure on the transformation. First is the Number of rows, which determines how many rows will fall into the Sample output. Second is the random seed. This seed tells the random selection algorithm which rows to choose. If you fix the seed, you will get consistent results – if you understand a little about randomisation in computing, you will understand randomness is a bit of a relative concept to a computer. If you leave the checkbox unselected, the package will pick a random seed based on the OSes’ tick count, so results will appear to change.

You can also name your Sample and Unselected outputs, should you wish. It’s worth noting that you aren’t obliged to actually use either output downstream of the component, so you can use this component to select a fixed number of rows from your source – or ignore a fixed number of rows from your source, by only using the Unselected output.

Configuring the Row Sampling Transformation
Fig 3: Configuring the Row Sampling Transformation

Where should you use the Row Sampling Transformation?

The main use for this would be to select a fixed size subset of data. This subset could be used for Data Mining test sets, or for limiting your data set size when testing packages – e.g. if you are running against a multimillion row data source, you could just run the package with 100 rows to see if your processes worked.

MSDN Documentation for the Row Sampling Transformation can be found here for 2008 and here for 2005.

If you need specific help or advice, or have suggestions on the post, please leave a comment and I will do my best to help you.

One thought on “The Row Sampling Transformation

  1. The data sampling task is also useful for data profiling. When you need to profile a very large table, the SSIS data profiling task chokes easily. The samling task can be used to take a random subset of the table. This is far superior to using a select top n input for sampling, since the top n, in a table which accumulates data over time, may retain only old records and eliminate more recent records with a potentially different data profile.

    Likwise for your case of extracting/creating test data, it is far better to have a random sample than just scraping potentially stale records only.

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <s> <strike> <strong>