How can a Lookup be Fuzzy?
A lookup becomes Fuzzy when it can match to records that are similar, but not identical to, the lookup key. For example, it can match “Jon Smith” to “John Smith” – this component is very useful for helping consolide client data such as names and addresses from multiple systems. For those of you who want to know more, I thoroughly recommend this article – Fuzzy Lookup and Fuzzy Grouping in SQL Server Integration Services 2005 – which explains (unusually well for tech doco on BOL) how the Fuzzy Lookup process actually works.
An important thing to get to grips with is that when the Fuzzy Lookup outputs its matches, it also outputs a Similarity and Confidence score for the match. The Similarity score is easy to understand – it’s a measure on a 0-1 scale of how similar the matched item is to the lookup key. So for example, “John Smith” and “Jon Smith” rate a 0.89, whereas “John Smith” and “James Smith” rate a lower 0.62. The Confidence score is on the same scale but not fixed for a given match – essentially the higher it is, the lower the number of alternative matches it has found. So from my example, “Johnathan Fruitbat” has just two matches, and the highest Confidence is 0.38. In contrast “John James Walliams” has 15 matches and the highest Confidence is just 0.08.
In practice I have tended to pay most attention to the Similarity score. In the example i’ve prepared, I have created a list of names of varying similarity to help you get a feel for how the matching functions in practice, and what the scores look like.
What are Fuzzy Lookup Indexes?
Like a normal lookup, you need an input table and a lookup reference table. However in this case the lookup reference is a called an Index (this is a bit misleading as it is not an Index in the normal database sense of the word). These indexes store fragments of the lookup items for fuzzy searching. When you use a Fuzzy lookup you have 4 options on working with these indexes.
- Create a new index every time (“Generate New Index”)
- Create a new index and store it (as above, plus check “Store New Index” and provide a name for it)
- Create a new index with maintenance features (as above, plus also check “Maintain Stored Index”)
- Reuse an existing index (“Use Existing Index”, and select it)
Option 1 is fairly easy to understand – it creates the Index in memory from the reference table, uses it in that data flow, then discards it. This would be used when the Fuzzy Lookup you are doing is a one-off. Option 2 is similar except it stores the Index as a static object in the database. Option 3 is slightly more complex – it creates a trigger on your reference table to ensure that if it is updated, the Indexes for the Fuzzy Lookup also get updated. Finally, Option 4 takes an index that already exists as a database object and uses this to do the fuzzy matching.
So in what scenarios would you use the different options? The answers below match the option numbers above:
- A one off Lookup never to be repeated
- The first run, to be followed by repeated Lookups against a fixed reference (Data Flow 1 in the example package)
- The first run, to be followed by repeated Lookups against a changing reference
- Subsequent runs of a repeated Lookup (Data Flow 2 in the example package)
You only have to create the Lookup index once – this is important as this part of the process can be fairly intensive and time consuming – not something you want to repeat every run. So if you know you will be doing repeated lookups, when building your ETL processes you will have to factor in a one off load to initiate the Index.
Advanced settings for the Fuzzy Lookup
I will skip over the columns tab, as is simply requires you to match the column in your source to the lookup key in your reference table, and specify any additional columns you would like returned.
Under the Advanced Tab, you effectively have 3 options:
First is “Maximum number of matches to output per lookup”. This allows you to constrain the number of matches the fuzzy lookup will return per input key. So again referring to my example, if I set this to 10, even though “John James Walliams” has 15 possible matches, only 10 would be returned. “Johnathan Fruitbat” would still send back just 2 matches, as there are only two to return. An important takeaway here is that the Fuzzy Lookup does not return one row per input row – it can return many, or even zero.
Second is “Similarity Threshold” – this is a sliding scale running from 0 to 1 which sets a cut-off for the quality of returns from Fuzzy Lookup. In practice you rarely want to set this much below 0.75 – but you will need to do a few test runs to see which level is appropriate for your data.
Third is “Token Delimiters” – this is one of the means the Fuzzy Lookup uses to break your lookup keys into smaller pieces for fuzzy matching. You would only add to or remove from these if your data has special characters delimiting (or not delimiting) items within the data.
Finally, a couple of properties only available from the Properties pane are WarmCaches and Exhaustive. WarmCaches specifies whether the Indexes and Lookup Table are partly preloaded into memory – by default this is set to True and I can’t really see a scenario where this should be changed. Exhaustive increases the intensity of the matching process – making the checking more thorough – but at a performance cost. I would mark this as an option to play with depending on how well your matching is doing, and set it to True where possible.
Where would you use the Fuzzy Lookup?
The main uses for the Fuzzy Lookup that I have encountered has been matching supposedly common data from different data sources such as client lists and deduplicating reference data where items have been entered many times with slight variations. Essentially any time you need to tie up data that should be – but isn’t – the same, this is a very useful tool.