The Fuzzy Lookup Transformation

b
Fig 1: The Fuzzy Lookup Transformation

In this slightly long post I will be covering the Fuzzy Lookup Transformation, one of my favourite components in SSIS. The sample package can be found here for 2005 and guidelines on use are here.

How can a Lookup be Fuzzy?

A lookup becomes Fuzzy when it can match to records that are similar, but not identical to, the lookup key. For example, it can match “Jon Smith” to “John Smith” – this component is very useful for helping consolide client data such as names and addresses from multiple systems. For those of you who want to know more, I thoroughly recommend this article – Fuzzy Lookup and Fuzzy Grouping in SQL Server  Integration Services 2005 – which explains (unusually well for tech doco on BOL) how the Fuzzy Lookup process actually works.

An important thing to get to grips with is that when the Fuzzy Lookup outputs its matches, it also outputs a Similarity and Confidence score for the match. The Similarity score is easy to understand – it’s a measure on a 0-1 scale of how similar the matched item is to the lookup key. So for example, “John Smith” and “Jon Smith” rate a 0.89, whereas “John Smith” and “James Smith” rate a lower 0.62. The Confidence score is on the same scale but not fixed for a given match – essentially the higher it is, the lower the number of alternative matches it has found. So from my example, “Johnathan Fruitbat” has just two matches, and the highest Confidence is 0.38. In contrast “John James Walliams” has 15 matches and the highest Confidence is just 0.08.

In practice I have tended to pay most attention to the Similarity score. In the example i’ve prepared, I have created a list of names of varying similarity to help you get a feel for how the matching functions in practice, and what the scores look like.

What are Fuzzy Lookup Indexes?

Like a normal lookup, you need an input table and a lookup reference table. However in this case the lookup reference is a called an Index (this is a bit misleading as it is not an Index in the normal database sense of the word). These indexes store fragments of the lookup items for fuzzy searching. When you use a Fuzzy lookup you have 4 options  on working with these indexes.

b
Fig 2: Configuring the Reference Table and Indexes for the Fuzzy Lookup
  1. Create a new index every time (“Generate New Index”)
  2. Create a new index and store it (as above, plus check “Store New Index” and provide a name for it)
  3. Create a new index with maintenance features (as above, plus also check “Maintain Stored Index”)
  4. Reuse an existing index (“Use Existing Index”, and select it)

Option 1 is fairly easy to understand – it creates the Index in memory from the reference table, uses it in that data flow, then discards it. This would be used when the Fuzzy Lookup you are doing is a one-off. Option 2 is similar except it stores the Index as a static object in the database. Option 3 is slightly more complex – it creates a trigger on your reference table to ensure that if it is updated, the Indexes for the Fuzzy Lookup also get updated. Finally, Option 4 takes an index that already exists as a database object and uses this to do the fuzzy matching.

So in what scenarios would you use the different options? The answers below match the option numbers above:

  1. A one off Lookup never to be repeated
  2. The first run, to be followed by repeated Lookups against a fixed reference (Data Flow 1 in the example package)
  3. The first run, to be followed by repeated Lookups against a changing reference
  4. Subsequent runs of a repeated Lookup (Data Flow 2 in the example package)

You only have to create the Lookup index once – this is important as this part of the process can be fairly intensive and time consuming – not something you want to repeat every run. So if you know you will be doing repeated lookups, when building your ETL processes you will have to factor in a one off load to initiate the Index.

Advanced settings for the Fuzzy Lookup

I will skip over the columns tab, as is simply requires you to match the column in your source to the lookup key in your reference table, and specify any additional columns you would like returned.

Under the Advanced Tab, you effectively have 3 options:

b
Fig 3: The Advanced Tab of the Fuzzy Lookup

First is “Maximum number of matches to output per lookup”. This allows you to constrain the number of matches the fuzzy lookup will return per input key. So again referring to my example, if I set this to 10, even though “John James Walliams” has 15 possible matches, only 10 would be returned. “Johnathan Fruitbat” would still send back just 2 matches, as there are only two to return. An important takeaway here is that the Fuzzy Lookup does not return one row per input row – it can return many, or even zero.

Second is “Similarity Threshold” – this is a sliding scale running from 0 to 1 which sets a cut-off for the quality of returns from  Fuzzy Lookup. In practice you rarely want to set this much below 0.75 – but you will need to do a few test runs to see which level is appropriate for your data.

Third is “Token Delimiters” – this is one of the means the Fuzzy Lookup uses to break your lookup keys into smaller pieces for fuzzy matching. You would only add to or remove from these if your data has special characters delimiting (or not delimiting) items within the data.

Finally, a couple of properties only available from the Properties pane are WarmCaches and Exhaustive. WarmCaches specifies whether the Indexes and Lookup Table are partly preloaded into memory – by default this is set to True and I can’t really see a scenario where this should be changed. Exhaustive increases the intensity of the matching process – making the checking more thorough – but at a performance cost. I would mark this as an option to play with depending on how well your matching is doing, and set it to True where possible.

Where would you use the Fuzzy Lookup?

The main uses for the Fuzzy Lookup that I have encountered has been matching supposedly common data from different data sources such as client lists and deduplicating reference data where items have been entered many times with slight variations. Essentially any time you need to tie up data that should be – but isn’t – the same, this is a very useful tool.

MSDN Documentation for the Fuzzy Lookup Transformation can be found here for 2008 and here for 2005.

12 thoughts on “The Fuzzy Lookup Transformation

  1. Have you determined a rule of thumb for how big a data set can be before performance becomes to big of an issue when exhaustive is set to true?

  2. Hi Kevin. I haven’t tested the Exhaustive property thoroughly but I found one case where a data set of about 100,000 rows jumped from about a 5 minute to 8 hour processing time when I switched it to True. My gut feeling is the threshold is pretty low, but because of the number of variables in any such process around hardware and match complexity I would just test in your environment and see how much of a drag it adds.

    However what I did find is that for some matching scenarios, the additional matches it brought in weren’t all that valuable. How that would apply to your situation will depend on how well the matching algorithm will work with your data. Your best bet is to do a test on a sample of your data and see if the performance overhead is worth it in terms of additional matches obtained.

  3. James, Setting WarmCaches to false is certainly worth trying when you have a large index and only a few rows to try and match. I’ve gained significant benefit on occasions by turning it off. Consider the ratio between the fuzzy index rows (the reference data) and the rows to match (the input rows), the larger this ratio the more likely it is that you’ll get a performance gain.

    It is a bit like the traditional Lookup and the default cache option, it can sometimes spend more time populating the cache than it takes to just run the query direct.

  4. Just wanted to mention apparently there is a bug in the Fuzzy Lookup component, which appears on 64bit systems. I have lost a couple of hours banging my head with this. Please vote it is important for MS to fix this issue here:

    https://connect.microsoft.com/SQLServer/feedback/details/488387/fuzzy-lookup-triggers-sqldumper-even-with-very-small-dataset-with-run64bitruntime-set-to-false

    The problem is not new because I was able to reproduce it even under SQL 2005 SP3.

  5. hi..
    i need to do same thing
    plz help i have created one ssis package.. i want to insert price into table checking based on company name column …..

  6. Please can anyone tell me how i update records in the refrence table depend on their matches on the source table ?

  7. hi..

    I faced the problem that I spend whole day to find the solution but not found.
    that is, I create a new index with maintenance features and then existing the same index.

    After insert new record into reference table it works fine, but unfortunately, when use update/delete it turns failed with the message below.

    I really don’t know what happened on it?
    couldn’t the reference table be dynamic? or there exists some alternatives?

    It’s really appreciate for your help!!
    thanks.

    Jack

    ==============================================================================
    Msg 6549, Level 16, State 1, Procedure sp_FuzzyLookupTableMaintenanceInvoke, Line 0
    A .NET Framework error occurred during execution of user defined routine or aggregate ‘sp_FuzzyLookupTableMaintenanceInvoke':
    System.Data.SqlClient.SqlException: User defined routine, trigger or aggregate tried to rollback a transaction that is not started in that CLR level. An exception will be thrown to prevent execution of rest of the user defined routine, trigger or aggregate.
    System.Data.SqlClient.SqlException:
    at System.Data.SqlClient.SqlConnection.OnError(SqlException exception, Boolean breakConnection)
    at System.Data.SqlClient.SqlInternalConnectionSmi.ExecuteTransaction(TransactionRequest transactionRequest, String transactionName, IsolationLevel iso, SqlInternalTransaction internalTransaction, Boolean isDelegateControlRequest)
    at System.Data.SqlClient.SqlInternalTransaction.Rollback()
    at System.Data.SqlClient.SqlTransaction.Rollback()
    at Microsoft.SqlServer.Dts.TxBestMatch.TableMaintenance.TranWrap(DataCleaningOperation c)
    . User transaction, if any, will be rolled back.
    ==============================================================================

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <s> <strike> <strong>