One of the ongoing challenges with SSIS is its difficulty in handling complex or damaged text files. One approach to dealing with such files is to bring them all in as one wide text column and then split them using code. Sometimes, the file is too wide for that approach, so below is an extension of that method where you import the column as a text stream (DT_TEXT, or Unicode DT_NTEXT) and then split the text stream in a script transformation:
' Declare variables
Dim TextStream As Byte()          ' To hold Text Stream
Dim TextStreamAsString As String  ' To hold Text Stream converted to String
Dim StringArray() As String       ' To contain split Text Stream
‘ Load Text Stream into variable
TextStream = Row.TextStreamColumn.GetBlobData(0, CInt(Row.TextStreamColumn.Length))
‘ Convert Text Stream to string
TextStreamAsString = System.Text.Encoding.ASCII.GetString(TextStream)
‘ Split string into array and output
StringArray = TextStreamAsString.Split("#"c)
An important thing to note is that in the step where the Text Stream is converted to a string, the Encoding will depend on the type of text stream you are bringing in – Unicode files will need “Unicode” instead of “ASCII”. Also I have used a hash (“#”) as the column delimiter but that value will vary depending on what type of file you are bringing in.
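Outside of SSIS, the same decode-then-split logic can be sketched in a few lines of Python (a standalone illustration, not SSIS script code – the byte string here is just a stand-in for the DT_TEXT blob):

```python
# Stand-in for the DT_TEXT blob you would pull out with GetBlobData()
raw = "col1#col2#col3".encode("ascii")

# Decode with the encoding that matches the source file:
# "ascii" for DT_TEXT, "utf-16" for DT_NTEXT / Unicode files
text = raw.decode("ascii")

# Split on the column delimiter - "#" here, as in the script above
columns = text.split("#")
print(columns)  # ['col1', 'col2', 'col3']
```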
It is not uncommon for flat file sources to have delimiters where they shouldn't be – extra commas in address fields being a particularly common problem, for example. These don't necessarily cause the import to fail – especially if the data is not being imported in a strongly typed manner.
To understand what happens, you need to understand how SSIS Flat File sources handle delimited files. Unlike Access or Excel, it doesn’t carve up the rows based on it finding a delimiter and creating a column. In an SSIS source the number of columns is predefined. So imagine I had a Flat File Connection manager configured as below:
How it reads in a row in a file is as follows:
Scan until reaches 1st comma – Put found data in “RowID”
Continue scan until reaches 2nd comma – Put found data in “MonkeyFood”
Continue scan until reaches 3rd comma – Put found data in “Motivation Factor”
Continue scan until reaches Carriage Return / Line Feed – Put found data in “Cost”
So the outcome is that if you happen to have more than 3 commas in your row, anything after the 3rd comma but before the Carriage Return / Line Feed ends up in the 4th column. So if you look at my example below, where the 2nd row of the text file has gone a little comma crazy and has 6 commas in it:
The end result of pumping it through the connection manager as defined in Fig 1 is this:
Note how everything after the 3rd comma has ended up in the final column, because after reaching the 3rd column it has scanned across for the Carriage return / Line Feed to define the final column ignoring any further commas.
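This scanning behaviour can be mimicked outside SSIS with a capped split (a Python sketch only – SSIS does not actually expose its parser like this): the parser makes at most one cut per expected delimiter, and everything after the last cut lands in the final column.

```python
def parse_row(line: str, column_count: int, delimiter: str = ",") -> list[str]:
    # Make at most column_count - 1 cuts; any surplus delimiters
    # end up inside the final column, just as the Flat File Source does
    return line.split(delimiter, column_count - 1)

# A made-up row that has gone a little comma crazy, against a 4-column definition
print(parse_row("2,Bananas,High,3.50,extra,junk", 4))
# ['2', 'Bananas', 'High', '3.50,extra,junk']
```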
Question: What Data Type should you use for importing a column in Flat File containing Decimal data?
Answer: numeric [DT_NUMERIC]
You cannot use the decimal [DT_DECIMAL] type, because in the Advanced Editor of the Flat File connection, the decimal type for some reason only allows you to set the Scale (the number of digits after the decimal point) – the Precision is greyed out (precision is the total number of digits). The numeric data type allows the setting of both values.
Fortunately the SSIS numeric type maps to SQL Server decimal columns without complaint, so you don’t have to add a Data Conversion to change numeric to decimal before using it. I have raised a bug on Connect – please vote it up if you consider this worth fixing.
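As a quick illustration of the two terms (plain Python, nothing SSIS-specific – the helper name is mine):

```python
from decimal import Decimal

def precision_and_scale(d: Decimal) -> tuple:
    """Precision = total number of digits; scale = digits after the decimal point."""
    sign, digits, exponent = d.as_tuple()
    scale = -exponent if exponent < 0 else 0
    return (len(digits), scale)

print(precision_and_scale(Decimal("1234.56")))  # (6, 2)
```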
Today when trying to import an Oracle sourced delimited file, this error cropped up:
[Flat File Source ] Error: The column delimiter for column “ColumnName” was not found.
[Flat File Source ] Error: An error occurred while processing file “MyTextFile” on data row [nnnn].
Error: SSIS Error Code DTS_E_PRIMEOUTPUTFAILED. The PrimeOutput method on component “Flat File Source” (1) returned error code 0xC0202092. The component returned a failure code when etc etc…
Now this was despite the Error Output Disposition for this column being set to “Redirect Row”. So regardless of how I configured the Error Handling of the source, a single bad row would kill the load of the file.
Courtesy of a method described by jwelch at AgileBI.com I approached the issue by importing the entire contents of the file into a single column, then parsing it in a script task. So the first step was to import the entire text file into a database as a single column. This can be done by configuring the connection manager as below, setting the single column as delimited by the row delimiter:
Then I processed the data from the database table in a script task acting as a transformation on the Data Flow, with a script that read pretty much like jwelch’s. It splits the data via a delimiter using the VB Split() function, with the advantage that you can handle cases where there is an incorrect number of columns, and fixing a row in a database for reload is a lot easier than trying to find and fix a row in a massive text file.
Public Class ScriptMain

    Private columnDelimiter() As Char = CType(",", Char())

    Public Overrides Sub Input0_ProcessInputRow(ByVal Row As Input0Buffer)

        Dim rowValues As String()

        rowValues = Row.Line.Split(columnDelimiter)

        ' Test for the correct number of columns
        If rowValues.GetUpperBound(0) <> 2 Then
            ' Row is not complete - handle the error
            Row.Column001_IsNull = True
            Row.Column002_IsNull = True
            Row.Column003_IsNull = True
        Else
            ' Row is OK, output values
            Row.Column001 = rowValues.GetValue(0).ToString()
            Row.Column002 = rowValues.GetValue(1).ToString()
            Row.Column003 = rowValues.GetValue(2).ToString()
        End If

    End Sub

End Class
Except mine had 180 columns(!), which caused me a fair amount of mouse clicking to set up all the output columns on the Script Component. The file came through, however, successfully bypassing the flaw in the Flat File Source.
I was importing some flat files recently and by virtue of forgetting my own best practices managed to create headaches for myself. So here’s a couple of BI Monkey aspirins to ensure your imports go well:
Tip 1 : Always import the columns of your file into varchars big enough to take the contents of the column
Flat files have the habit of containing bad data – strings instead of numerics being the worst culprit, with oversize fields and unexpected delimiters being the close runners up in the trouble stakes. So if you want to avoid errors importing the file, pull everything you can into suitably big varchars. That way you avoid data conversion and truncation errors. It’s much easier to deal with these issues once the fields are in the database, as you can fix a single broken row far more easily in a database than in a massive text file.
Tip 2: Always TRIM the contents of the columns you are pulling in
Unless you desperately need those trailing spaces, TRIM them in a Derived Column Transform as part of the import, particularly if you are pulling in from a Fixed Width file. This prevents database bloat – if you have a 100 character wide column which is just a 'Y' with 99 trailing spaces, the database will still store 100 characters – 99 bytes of junk – and over a million rows that's nearly 100MB. If you trim it you pass 1 byte to the database, and if it goes into a varchar, you will only store one byte of data. Otherwise it's very easy to have Gigabyte sized import tables which actually contain only a few hundred MB of useful data.
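The arithmetic is easy to check (a back-of-envelope Python sketch, assuming a single-byte character set):

```python
rows = 1_000_000
column_width = 100   # fixed-width 100-character column
useful = len("Y")    # the real payload per row

# Bytes stored with and without trimming, converted to MB
untrimmed_mb = rows * column_width / 1024 / 1024
trimmed_mb = rows * useful / 1024 / 1024
print(f"untrimmed: {untrimmed_mb:.0f} MB, trimmed: {trimmed_mb:.0f} MB")
# untrimmed: 95 MB, trimmed: 1 MB
```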
Please feel free to add any other hints and tips in the comments.
The Flat File Source is one of the simplest sources to configure, but conversely the Connection Manager it uses is the one that requires the most manual configuration. As the two are inseparable I will cover them both in this (somewhat lengthy) post.
This Connection Manager is the means provided for accessing file based data sources. In terms of configuration it is probably the component that requires the most manual input because so little of the metadata can be obtained from the source itself, unlike a relational source for example.
The General Tab
The first thing you are required to do when creating a Flat File Connection Manager is to select a file. This means that if the file does not exist yet, you will have to create a placeholder file, otherwise you won't be able to proceed. Once a file has been selected, the first page of options on the General tab becomes available. Note that absolutely no information is derived from the source file – as an example, it will not autodetect a CSV file and prefill anything for you, even if you select a CSV file type.
The first set of options are to set the Locale, whether the file is in Unicode format and what Code Page to use. In practice you are unlikely to change these options very often, unless data is coming from a country other than your own.
Below the dividing line come the basic file information settings. The most important of these is the Format. Your options are:
Delimited – most common, where each column is separated by a character such as a comma (CSV), Tab or Space
Fixed Width – common from mainframes – a file where each column width is a fixed number of characters
Ragged Right – as per Fixed Width, except the last column is terminated by the row delimiter rather than having a fixed width
Text Qualifier – a symbol that encloses text strings – commonly a double quote (“) – needed when your text strings may contain the delimiter character – only for Delimited Format files
Header row delimiter – The end of line delimiter in the header rows
Header rows to skip – Number of rows to skip before attempting to read in data. Useful if the text file has rows of header data which is of no relevance to the data flow.
Column names in the first data row – Checkbox to indicate if the first row has the column names in it. If present the connection manager will map these names in the next tab
The Columns Tab
This tab behaves differently depending on whether you selected Delimited or Fixed Width / Ragged Right in the General Tab.
If you are loading a delimited file, the tab gives you the option to specify the Row Delimiter (i.e. the end of row character) and the Column Delimiter (the character that separates data items). If you have these set correctly, a preview will be displayed showing a few rows of sample data, as below:
If you are loading a Fixed Width or Ragged Right file, you will be presented with a screen where you manually set the columns. For Fixed Width, you first need to set the overall row width – this can be done either by dragging the red line which marks the end of a row in the GUI, or entering the Row width in the text box below the GUI. For Ragged Right the end of line is identified by the Row Delimiter character and does not need to be manually set in the GUI. The data should then start looking like rows of data. To then set each column click and place a black line (which is draggable) to mark where each column starts / ends. For a big file this can be a pain, so you can skip this and set it in the Advanced tab if you have the file definition to hand, which I will cover in the next section.
The Advanced Tab
Again, this tab has different options depending on the file type selected.
If a Delimited file type is selected, you will able to set the following properties:
Name – the column name
Column Delimiter – the character that terminates the column. This can be manually entered or selected from a drop-down. An interesting feature is that because this can vary by column, it is possible to have different delimiters splitting columns within the same row.
DataType – by default this will be a 50-character-wide string [DT_STR] – but if you click the Suggest Types button below the properties pane, you can fire a wizard which will sample the data and try to work out the smallest applicable data type to use for the columns. Beware that this will apply the results of the wizard to all columns.
TextQualified – a boolean to say whether the Text Qualifier specified in the General tab applies to this column.
If a Fixed Width file type is selected, you get these options instead:
InputColumnWidth – the width of the Incoming column. If you bypassed the GUI in the Columns tab, here is where you would manually set each column width.
DataType – by default this will be a string [DT_STR]. The Suggest Types button is still clickable but seems to have no effect.
OutputColumnWidth – this is the width of the column the component will output – by default as wide as the incoming column. You may want to shrink this if for example the column is 6 wide but you know it will only ever have a single character in it.
If you are manually configuring a file definition, the New and Delete buttons will insert and delete new columns as required.
The Preview Tab
At this point, the preview tab will simply display your correctly configured file in all its glory.
Configuring The Flat File Source, and its ‘hidden’ options
In the main component editor, all that needs to be done is to select the Flat File connection manager already constructed, choose how to treat nulls in the source and choose your columns.
However the flat file source has a few extra properties hidden in the Advanced Editor. First, on the Component Properties tab, under Custom Properties, is a property called FileNameColumnName – which will add a column to the component that contains the file name. The second, on the Input and Output Properties tab, under the Custom Properties of the Output Columns, is the FastParse option. For specific numeric and date types and formats, this offers a different parsing engine (i.e. something that tries to convert the text into the data type you expect) which is allegedly faster, as per the MSDN documentation here. However I did some loose testing and found performance actually degraded in some cases, and when it did change, it wasn't by a significant margin – especially as reading flat files is so blisteringly fast in the first place. I would suggest ignoring this feature unless you have a specific performance problem to address. Finally, in the same place there is the UseBinaryFormat option, for when you want to pass through raw binary data.
The Flat File Source is the simplest data source to configure – however the Flat File Connection Manager it needs requires significant manual input to set up correctly. The Connection Manager does offer a very high degree of flexibility in how you handle data coming from flat file sources. Reading flat files is very, very fast, so if you experience performance issues it's more likely to be tied to network or disk problems than the package itself.
MSDN Documentation for the Flat File Source can be found here for 2008 and here for 2005.
MSDN Documentation for the Flat File Connection Manager can be found here for 2008 and here for 2005.
For reasons best kept to myself, I have been trying to move some data from SQL Server 2008 into Oracle on two unconnected machines. The best way of doing this – given the tools to hand* – appeared to be export from SQL2008 to flat files and suck it into Oracle via Data Manager (DM). Because apparently a simple “import text file” feature in Oracle is too much to ask for from most client tools. All hail SSMS, all curse SQL Developer. Due to issues with the data containing common delimiter characters, I decided to output to a Fixed Width flat file.
I subsequently had a barrel of laughs trying to get things into Data Manager because of its expectations of what constitutes a flat file versus what SSIS thinks is a flat file. For once I'm going to blame both parties equally for this mess, as both tools do odd and annoying things in a very non-transparent manner.
DM didn’t take the default flat file as it didn’t acknowledge what passed in SSIS for a row delimiter (the character at the end of the line that signifies a new row) so I decided to switch to Ragged Right as that has a specifiable row delimiter. Data Manager liked this a bit more, but complained that the last column of each data file terminated unexpectedly early.
It turns out that Ragged Right format trims trailing spaces from the last column – so if you have a column 8 characters wide, but only 4 characters in it, the last column will show the row delimiter after the 4th character. So Ragged Right isn’t truly “Fixed Width” – more – “Fixed Width, apart from the last column”. It’s counterintuitive, but if you read the documentation, it is behaving as intended, and fortunately there is an alternative approach.
The next step was to choose Fixed Width with an added column for row delimiter (an option when building your Flat File connector). This ensures every column is output at the full width of the column, even for the last one.
Well, almost. What I discovered next is that Numeric columns were – regardless of size – all output with a width of 260. So your single digit integer gets output as 259 spaces and a single numeral. This odd behaviour is controlled by the InputColumnWidth property (found on the advanced page for each column), rather than the expected OutputColumnWidth property, which for some reason is not configurable for numeric fields. To output a column just as wide as the numeric field, you need to set the InputColumnWidth to the Precision of the field, plus one for the decimal point, plus one for a minus symbol if the field is signed. Hence a field of precision 4, scale 2 – such as 21.13 – needs an InputColumnWidth of 5 to take all the output characters. If the numeric field is signed and carrying negative numbers, you need another point on there, so -21.13 would need an InputColumnWidth of 6 to export successfully.
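The width calculation can be captured in a small helper (a Python sketch of the rule above; skipping the decimal point when scale is zero is my assumption, since an integer numeric has no point to allow for):

```python
def input_column_width(precision: int, scale: int, signed: bool = False) -> int:
    """Width needed to export a numeric at full width: all the digits,
    plus one for the decimal point (if any), plus one for a minus sign."""
    width = precision
    if scale > 0:
        width += 1  # decimal point
    if signed:
        width += 1  # minus sign
    return width

print(input_column_width(4, 2))               # 5 -> fits 21.13
print(input_column_width(4, 2, signed=True))  # 6 -> fits -21.13
```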
Of course, Data Manager expects a precision 4 numeric to only be 4 characters wide, so when setting the precision in the SQLText Designer, you need to set the precision to match the InputColumnWidth used to export the field as well.
And that, my fellow BI monkeys, is how I wasted a full working day trying to get 14 text files from one database to another. It has done nothing to endear me to Oracle, that’s for sure.
* which includes that trusty ETL tool, a USB memory stick