Loops and Lookups – a performance problem

A bit of a performance tuning nugget around loops and lookups which I faced recently.

We have a scenario where we are looping a few hundred times to execute a series of changing SQL statements, then passing the data through lookups before writing to target. We are doing this with a mix of the standard lookup and Cozyroc’s Lookup Plus task. However we were finding that while executing each SQL query took less than a second, the SSIS package was taking a long time to run each loop.

The loop was taking the query and then pausing for a while before moving any data. Eventually we realised this pause was SSIS building up the caches for the lookups. The standard lookups were running off a pre-built cache but the CozyRoc task doesn’t support caching and had to rebuild each lookup cache every time the loop was instantiated – not a big job but repeated a few hundred times those few seconds soon mount up.

The lessons here are:

  1. If you are reusing a lookup multiple times, use the Cache Transform to preload the cache into memory (2008 onwards)
  2. Be aware of the performance implication of caching lookups and the fact that they will be rebuilt each time the Data Flow is started

Read More

The Fuzzy Lookup Transformation

b
Fig 1: The Fuzzy Lookup Transformation

In this slightly long post I will be covering the Fuzzy Lookup Transformation, one of my favourite components in SSIS. The sample package can be found here for 2005 and guidelines on use are here.

How can a Lookup be Fuzzy?

A lookup becomes Fuzzy when it can match to records that are similar, but not identical to, the lookup key. For example, it can match “Jon Smith” to “John Smith” – this component is very useful for helping consolide client data such as names and addresses from multiple systems. For those of you who want to know more, I thoroughly recommend this article – Fuzzy Lookup and Fuzzy Grouping in SQL Server  Integration Services 2005 – which explains (unusually well for tech doco on BOL) how the Fuzzy Lookup process actually works.

An important thing to get to grips with is that when the Fuzzy Lookup outputs its matches, it also outputs a Similarity and Confidence score for the match. The Similarity score is easy to understand – it’s a measure on a 0-1 scale of how similar the matched item is to the lookup key. So for example, “John Smith” and “Jon Smith” rate a 0.89, whereas “John Smith” and “James Smith” rate a lower 0.62. The Confidence score is on the same scale but not fixed for a given match – essentially the higher it is, the lower the number of alternative matches it has found. So from my example, “Johnathan Fruitbat” has just two matches, and the highest Confidence is 0.38. In contrast “John James Walliams” has 15 matches and the highest Confidence is just 0.08.

In practice I have tended to pay most attention to the Similarity score. In the example i’ve prepared, I have created a list of names of varying similarity to help you get a feel for how the matching functions in practice, and what the scores look like.

What are Fuzzy Lookup Indexes?

Like a normal lookup, you need an input table and a lookup reference table. However in this case the lookup reference is a called an Index (this is a bit misleading as it is not an Index in the normal database sense of the word). These indexes store fragments of the lookup items for fuzzy searching. When you use a Fuzzy lookup you have 4 options  on working with these indexes.

b
Fig 2: Configuring the Reference Table and Indexes for the Fuzzy Lookup
  1. Create a new index every time (“Generate New Index”)
  2. Create a new index and store it (as above, plus check “Store New Index” and provide a name for it)
  3. Create a new index with maintenance features (as above, plus also check “Maintain Stored Index”)
  4. Reuse an existing index (“Use Existing Index”, and select it)

Option 1 is fairly easy to understand – it creates the Index in memory from the reference table, uses it in that data flow, then discards it. This would be used when the Fuzzy Lookup you are doing is a one-off. Option 2 is similar except it stores the Index as a static object in the database. Option 3 is slightly more complex – it creates a trigger on your reference table to ensure that if it is updated, the Indexes for the Fuzzy Lookup also get updated. Finally, Option 4 takes an index that already exists as a database object and uses this to do the fuzzy matching.

So in what scenarios would you use the different options? The answers below match the option numbers above:

  1. A one off Lookup never to be repeated
  2. The first run, to be followed by repeated Lookups against a fixed reference (Data Flow 1 in the example package)
  3. The first run, to be followed by repeated Lookups against a changing reference
  4. Subsequent runs of a repeated Lookup (Data Flow 2 in the example package)

You only have to create the Lookup index once – this is important as this part of the process can be fairly intensive and time consuming – not something you want to repeat every run. So if you know you will be doing repeated lookups, when building your ETL processes you will have to factor in a one off load to initiate the Index.

Advanced settings for the Fuzzy Lookup

I will skip over the columns tab, as is simply requires you to match the column in your source to the lookup key in your reference table, and specify any additional columns you would like returned.

Under the Advanced Tab, you effectively have 3 options:

b
Fig 3: The Advanced Tab of the Fuzzy Lookup

First is “Maximum number of matches to output per lookup”. This allows you to constrain the number of matches the fuzzy lookup will return per input key. So again referring to my example, if I set this to 10, even though “John James Walliams” has 15 possible matches, only 10 would be returned. “Johnathan Fruitbat” would still send back just 2 matches, as there are only two to return. An important takeaway here is that the Fuzzy Lookup does not return one row per input row – it can return many, or even zero.

Second is “Similarity Threshold” – this is a sliding scale running from 0 to 1 which sets a cut-off for the quality of returns from  Fuzzy Lookup. In practice you rarely want to set this much below 0.75 – but you will need to do a few test runs to see which level is appropriate for your data.

Third is “Token Delimiters” – this is one of the means the Fuzzy Lookup uses to break your lookup keys into smaller pieces for fuzzy matching. You would only add to or remove from these if your data has special characters delimiting (or not delimiting) items within the data.

Finally, a couple of properties only available from the Properties pane are WarmCaches and Exhaustive. WarmCaches specifies whether the Indexes and Lookup Table are partly preloaded into memory – by default this is set to True and I can’t really see a scenario where this should be changed. Exhaustive increases the intensity of the matching process – making the checking more thorough – but at a performance cost. I would mark this as an option to play with depending on how well your matching is doing, and set it to True where possible.

Where would you use the Fuzzy Lookup?

The main uses for the Fuzzy Lookup that I have encountered has been matching supposedly common data from different data sources such as client lists and deduplicating reference data where items have been entered many times with slight variations. Essentially any time you need to tie up data that should be – but isn’t – the same, this is a very useful tool.

MSDN Documentation for the Fuzzy Lookup Transformation can be found here for 2008 and here for 2005.

Read More