The Sort Transformation

Fig 1: The Sort Transformation

In this post I will be covering the Sort Transformation. The sample package can be found here for 2005 and guidelines on use are here.

What does the Sort Transformation do?

You will be unsurprised to know that it sorts data based on a specified key, with the additional option to remove duplicates. The Sort Transformation is a blocking component within the Data Flow – what this means is that it needs to consume all data fed to it before it will output any data – this can lead to performance issues if you have large amounts of data to sort, or many Sort components in your data flow.

Fig 2: Configuring the Sort Transformation

To configure the Sort Transformation, you simply select the keys that you want to sort the data set on, set the priority (Sort Order) in which you want the sort applied to the keys and the direction you want to sort them in (Sort Type). If needed you can change the name of the sort keys in the output (Output Alias). You can optionally drop columns from the input set (just uncheck the Pass Through checkbox) – though if you’re doing this, you should check if that column even needed to be passed there in the first place. Finally you can remove duplicates by checking the “Remove rows with duplicate sort values”. Be aware this will only check for duplicates on the keys, so if you have duplicate sort keys with different associated data you will only get one of those rows in the output. If sorting text data you may also need to set the Comparison Flags to manage how the component deals with certain special cases.

When would you use the Sort Transformation?

You would use it when you want to sort or deduplicate data! However, it’s always worth questioning if you really need to sort data – it makes it easier for developers to review but from a machine point of view, it often makes little difference. If you are using the Sort to deduplicate data, have a look at voting on Jamie Thompson’s suggestion for a DISTINCT component on Connect. However Sort and Distinct operations are best pushed back to a database if at all possible. If you are sorting large amounts of data and hitting performance problems, you may want to take a look at the custom NSort component – i’ve not used it but it may be worth a look if the default component is proving inadequate.

MSDN Documentation for the Sort Transformation can be found here for 2008 and here for 2005.

Read More