Replacing multiple characters in a Text String in SSIS

Today as part of some ongoing painful text file imports, I had to clean up some phone numbers that had been entered into a free text field. Consequently the data contained this kind of garbage data:

(03) 8765 4321

0387654321

03-87654321 – Bob’s mums number

(03)87654321 – or use 04 87654321 during day

Which didn’t lend itself to being fixed in a Derived Column, as every possible replace character would have to be hardcoded, like so:

REPLACE(REPLACE([PhoneNumberText],”(”,”"),”(”,”") – and that’s just to clear the brackets. Doing this for evey possible stray character was out of the question.

What I wanted was just to get rid of any character which wasn’t a numeric. My mind immediately leapt to Regular Expressions, but couldn’t see how to do these easily in T-SQL against the source or at all in a standard SSIS component.

Fortunately Brian Knight has a solution – using Regular Expressions within a script component. So, quoting him almost verbatim:

In the Imports area, add the following line of code:

Imports System.Text.RegularExpressions

Next, replace the Input0_ProcessInputRow subroutine with the following code:

Public Overrides Sub Input0_ProcessInputRow(ByVal Row As Input0Buffer)

If Row.ProductNumber_IsNull = False Or Len(Row.ProductNumber.ToString) <> 0 Then

Dim pattern As String = String.Empty

Dim r As Regex = Nothing

pattern = “[^0-9]”
r = New Regex(pattern, RegexOptions.Compiled)

Row.ProductNumber = Regex.Replace(Row.ProductNumber, pattern, “”)

End If

End Sub

Which worked beautifully and saved me writing some truly awful code. However it does lead me to gripe about an absent feature of SSIS that bugs me. The limited number of functions and inability to create custom functions to use in the Derived Column transformation means it is easy to fall back on to the Script Component to do your dirty work. This annoys me because a) i’m not a VB programmer and b) it reduces the transparency of what is going on in the data flow. Fingers crossed this will improve in 2011 – i’ve added feedback on Connect to request this – please vote it up if you consider it important.

Comments

6 Responses to “Replacing multiple characters in a Text String in SSIS”
  1. Jon says:

    Not to be picky but the proposed solution is for a ’script component’ not script task, one takes place in control flow and the other in data flow. Thus, causing an issue with the actual VB code.

    Jon

  2. BI Monkey says:

    You’re right – a common slip on my part i’m afraid – and now corrected, thanks.

  3. Jon B says:

    In my case, the ‘Row.ProductNumber’ is a read-only property which cannot be modified

  4. BI Monkey says:

    I haven’t but it sounds like a good alternative! I love SSIS extensibility :)

Trackbacks

Check out what others are saying about this post...
  1. [...] when the Derived Column editor doesn’t give me what I need – such as when I want to use Regular Expressions, or reuse a function across many [...]



Comments

Please leave a comment or ask the BI Monkey for more info.
To prevent spamming, comments have to be approved so they will not appear instantly.
If you want a pic to show with your comment, go get a gravatar