In SQL 2005, the Derived Column Transformation used to drive me nuts because any time you changed a string value, the component automatically changed the datatype to Unicode, even if it was originally string. This meant any time you changed an Expression that returned a non-unicode string, you had to change it back.
In SQL2008 it has got worse! Now the same change of data type is forced on you, but now the only way to change the data type back is to go into the Advanced Editor and amend the data type of the Output Column there, because – apparently by design – the data type is now locked in the normal UI.
Why by design? No idea, but according to this Connect article, that’s the way it’s going to stay – please vote this up and maybe Microsoft will revisit this incredibly annoying “feature”.
Update: I’ve also discovered that even if you drag a non-unicode column from the data flow into the Expression (nothing else, just the column) it still converts it to Unicode!
In this post I will be covering the Derived Column Transformation. The sample package can be found here for 2005 and guidelines on use are here.
What does the Derived Column Transformation do?
The Derived Column Transformation provides a means to change column data as it passes through the data flow. It uses the SSIS Expression Language to transform the data and allows you to replace an existing column’s value or create a new column which can use values from other columns or variables to create a new data item.
In the Derived Column editor, there are three panes to work with. The top left pane has two folders, one for Variables and one for Columns. The top right pane contains folders for the various functions available. The bottom pane is where you define the Derived Columns. Both the top panes have drag and drop functionality, so you can click and drag a function, column or variable into the Expression area of the Derived Column configuration pane.
Setting up a Derived column is straightforward – if adding a new column, provide a name for it in the Derived Column Name column. If replacing the content of an existing column, select its name in the Derived Column dropdown, and the Derived Column Name will lock down to the name of the column being replaced – in the example above see the ListPrice column for an example.
The next part is the Expression itself. For details of the language for the expression, see my posts on the SSIS Expression Language. The options you have are basic expressions, such as ROUNDing numbers, YEAR functions for dates and SUBSTRING for strings. Two things that I will raise are NULL Functions and Type Casts. Because SSIS is so strict about data typing, if you evaluate or return a Null in your expression, you have to pull a Null from the Null functions list – for example if assessing if a Date value is null, you would need to use the Null function NULL (DT_DATE). Similarly Type Casts convert data to a specific type, so for example if you wanted to put a string source type value into a float column, you would have to convert it using the Type Cast (DT_R8) [Sting Column].
The Data Type, Length, Precision, Scale and Code pages are determined automatically from the type of data you are working with – note that you cannot change the data type of an existing column. The only room you really have to move here is to change string lengths and code pages, or precision and scale for numerics when you create a new column.
In the sample package I show a few simple examples of Column manipulation, using column on column, variable on column, pure variable and pure column operations.
What are the Derived Column Transformation’s limitations?
My biggest problem with the Derived Column Transformation is that the function list is small, and worst of all, fixed. In a rare example of Cognos Data Manager being better than SSIS, Data Manager allows for the creation of custom functions that can then be re-used. SSIS offers no such flexibility, which means if you have complex operations that need to be done repeatedly (e.g. Trimming & Nulling incoming strings to clean input data) you can’t create a custom function to simplify the operation and make the operation reuseable. I’ve added a Feature Suggestion on Connect to request an package level extensible function library – please vote for it if you agree this is a big hole in the component. (Update 18 Aug 2010: Microsoft have responded, closing it and saying that “it does not meet the bar to be included in an upcoming product release”, which is a bit disappointing)
One thing which trips up a few people is that you cannot use the result of one derived column in another derived column within the same transformation. The logic behind this is pretty simple – each column is treated as a separate independent item within the data flow and can only consume columns that are input to the transformation. A derived column is effectively an output and so cannot be referenced within the component it was created in.
Lesser gripes are that the Editor is too small. There is no call out box like when setting component properties using expressions, so complex expressions quickly become difficult to read and debug. Add to that the only way to get the syntax error messages is to hover over the function that is invalid and try to read the note that appears for about 4 seconds, meaning the only real way to read long error messages is to hover over it with the mouse and do a screen capture.
Where should you use the Derived Column Transformation?
It is best used when you have to perform simple operations to change data values – for example TRIMming strings, simple IF statements and SUBSTRINGs. It also reproduces all of the functionality in the Audit and Data Conversion tasks, so if you are using those anywhere you may want to consider replacing them with a Derived Column. If your expressions are getting complicated or you repeat alot of operations, you may want to move these operations to the uglier but more powerful Script Component, which I will be covering soon.
MSDN Documentation for the Derived Column Transformation can be found here for 2008 and here for 2005.
If you are still struggling, try these additional resources:
Today as part of some ongoing painful text file imports, I had to clean up some phone numbers that had been entered into a free text field. Consequently the data contained this kind of garbage data:
(03) 8765 4321
03-87654321 – Bob’s mums number
(03)87654321 – or use 04 87654321 during day
Which didn’t lend itself to being fixed in a Derived Column, as every possible replace character would have to be hardcoded, like so:
REPLACE(REPLACE([PhoneNumberText],”(“,””),”(“,””) – and that’s just to clear the brackets. Doing this for evey possible stray character was out of the question.
What I wanted was just to get rid of any character which wasn’t a numeric. My mind immediately leapt to Regular Expressions, but couldn’t see how to do these easily in T-SQL against the source or at all in a standard SSIS component.
Which worked beautifully and saved me writing some truly awful code. However it does lead me to gripe about an absent feature of SSIS that bugs me. The limited number of functions and inability to create custom functions to use in the Derived Column transformation means it is easy to fall back on to the Script Component to do your dirty work. This annoys me because a) i’m not a VB programmer and b) it reduces the transparency of what is going on in the data flow. Fingers crossed this will improve in 2011 – i’ve added feedback on Connect to request this – please vote it up if you consider it important.