Flat File Import tips

I was importing some flat files recently and by virtue of forgetting my own best practices managed to create headaches for myself. So here’s a couple of BI Monkey aspirins to ensure your imports go well:

Tip 1: Always import the columns of your file into varchars big enough to take the contents of the column

Flat files have a habit of containing bad data – strings instead of numerics being the worst culprit, with oversize fields and unexpected delimiters the close runners-up in the trouble stakes. So if you want to avoid errors importing the file, pull everything you can into suitably big varchars. That way you avoid data conversion and truncation errors. It’s much easier to deal with these issues once the fields are in the database, as you can fix a single broken row far more easily in a database than in a massive text file.
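As a rough sketch, the staging table might look like this – all names and sizes here are illustrative, and the widths should be sized to your actual file:

```sql
-- Staging table: every column lands as a generous varchar, so bad rows
-- arrive intact instead of failing the import with conversion errors.
CREATE TABLE stg_CustomerImport (
    CustomerId   varchar(50)  NULL,  -- really an int, but imported as text
    CustomerName varchar(500) NULL,
    OrderAmount  varchar(50)  NULL,  -- really a decimal
    OrderDate    varchar(50)  NULL   -- really a date
);
```

Once the data is in, you fix the handful of broken rows with ordinary UPDATE/DELETE statements before converting into the properly typed destination table.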

Tip 2: Always TRIM the contents of the columns you are pulling in

Unless you desperately need those trailing spaces, TRIM them in a Derived Column Transform as part of the import, particularly if you are pulling in from a Fixed Width file. This prevents database bloat – if you have a 100-character-wide column which is just a ‘Y’ with 99 trailing spaces, the database will still store 100 characters – 99 bytes of junk – and over a million rows that’s nearly 100MB. If you trim it you pass 1 byte to the database, and if it goes into a varchar, that is all you store. Otherwise it’s very easy to have Gigabyte-sized import tables which actually contain a few hundred MB of useful data.
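In the Derived Column Transform the expression itself is trivial – the column name below is just a placeholder, and since the junk in fixed width files is trailing spaces, RTRIM alone is usually enough:

```sql
-- SSIS Derived Column expression (column name illustrative):
-- turns "Y" plus 99 trailing spaces into just "Y" before it hits the database
RTRIM(FlagColumn)

-- or strip both ends if leading spaces are also possible:
TRIM(FlagColumn)
```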

Please feel free to add any other hints and tips in the comments.

4 thoughts on “Flat File Import tips”

  1. Especially important when you’re importing dates in different formats (YYYYMMDD, DDMMYYYY) is to validate against a well defined Time Dimension table… dates in flat files almost never come all well formatted… and once you transform them to validate through a lookup, why not get a date_id (typically a smallint) and store that instead of a varchar or datetime data type? You could go from 8 bytes per date (datetime) to 2 bytes per date (signed smallint)… (MS BI stack at least)…

    Just my experience – here you trade a longer ETL execution for a smaller table and more consistent data…

  2. Agreed, dates are often one of the biggest headaches when coming from a non-typed source. If coming from a flat file, I don’t deal with them until they are in the database as a chunk of text. I then try to get the value into a datetime field before doing further validation, using SSIS and SQL to wrangle the date into a valid format.
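    One way to sketch that SQL-side wrangling – table and column names here are assumptions, and TRY_CONVERT needs SQL Server 2012 or later:

    ```sql
    -- Dates landed as text in the staging table; attempt safe conversions.
    -- Style 112 = YYYYMMDD, style 103 = DD/MM/YYYY; TRY_CONVERT returns
    -- NULL instead of erroring on rows that do not match the style.
    UPDATE stg_Import
    SET    OrderDateParsed = COALESCE(
               TRY_CONVERT(date, RTRIM(OrderDateRaw), 112),
               TRY_CONVERT(date, RTRIM(OrderDateRaw), 103));

    -- Anything still NULL failed every format and can be inspected by hand:
    SELECT OrderDateRaw FROM stg_Import WHERE OrderDateParsed IS NULL;
    ```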
