Today as part of some ongoing painful text file imports, I had to clean up some phone numbers that had been entered into a free text field. Consequently the data contained this kind of garbage data:
(03) 8765 4321
03-87654321 – Bob’s mums number
(03)87654321 – or use 04 87654321 during day
Which didn’t lend itself to being fixed in a Derived Column, as every possible replace character would have to be hardcoded, like so:
REPLACE(REPLACE([PhoneNumberText],”(“,””),”(“,””) – and that’s just to clear the brackets. Doing this for evey possible stray character was out of the question.
What I wanted was just to get rid of any character which wasn’t a numeric. My mind immediately leapt to Regular Expressions, but couldn’t see how to do these easily in T-SQL against the source or at all in a standard SSIS component.
Fortunately Brian Knight has a solution – using Regular Expressions within a script component. So, quoting him almost verbatim:
In the Imports area, add the following line of code:
Next, replace the Input0_ProcessInputRow subroutine with the following code:
Public Overrides Sub Input0_ProcessInputRow(ByVal Row As Input0Buffer)
If Row.ProductNumber_IsNull = False Or Len(Row.ProductNumber.ToString) <> 0 Then
Dim pattern As String = String.Empty
Dim r As Regex = Nothing
pattern = “[^0-9]”
r = New Regex(pattern, RegexOptions.Compiled)
Row.ProductNumber = Regex.Replace(Row.ProductNumber, pattern, “”)
Which worked beautifully and saved me writing some truly awful code. However it does lead me to gripe about an absent feature of SSIS that bugs me. The limited number of functions and inability to create custom functions to use in the Derived Column transformation means it is easy to fall back on to the Script Component to do your dirty work. This annoys me because a) i’m not a VB programmer and b) it reduces the transparency of what is going on in the data flow. Fingers crossed this will improve in 2011 – i’ve added feedback on Connect to request this – please vote it up if you consider it important.