I’ve covered off the Fuzzy Lookup and Fuzzy Grouping transformations in SSIS and noticed in my research that these capabilities aren’t discussed very coherently on the web. So below I thought I’d collect some of the better articles for your late night reading. There isn’t all that much out there, unfortunately.
So, how does it all work?
Here are a few articles covering theory, mostly from Microsoft:
It is probably worth reiterating that, because of the way the algorithms and their Q-Grams work, the longer the strings being analysed for fuzzy matches, the better the chances of a good match. When I first started using the algorithms I was doing some client matching and matched first and last names separately. Once I had a deeper understanding of the components, I started matching on the full name instead, and the quality and reliability of matches improved significantly.
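To see why string length matters, here is a rough Python sketch of q-gram based matching. SSIS’s actual algorithms are proprietary and more sophisticated – this trigram Jaccard measure is only an illustrative stand-in:

```python
def qgrams(text, q=3):
    """Split a string into overlapping character q-grams (here trigrams),
    roughly the kind of fragments the SSIS fuzzy algorithms index."""
    text = text.lower()
    return {text[i:i + q] for i in range(len(text) - q + 1)}

def similarity(a, b, q=3):
    """Jaccard similarity over q-grams: shared fragments / total fragments."""
    ga, gb = qgrams(a, q), qgrams(b, q)
    if not ga or not gb:
        return 0.0
    return len(ga & gb) / len(ga | gb)

# A one-letter difference wipes out a short string's score...
short = similarity("Jon", "Joan")              # no shared trigrams at all
# ...but barely dents a full name, where the surname still overlaps:
full = similarity("Jon Smith", "Joan Smith")   # roughly half the trigrams match
```

The longer string simply has more fragments to agree on, so a single variation does proportionally less damage to the score.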
Ok, so how do I make it work?
Now, some articles covering practical implementation of the tasks:
The best thing you can do is get some sample data and play with the components to understand what it is they do. The results are impressive – if not bulletproof – and can make a great contribution to de-duplicating client data, etc.
And what does the BI Monkey have to say about it?
Fuzzy Matching is a powerful and easy to use tool which is great for approximate grouping of data for analysis where a margin of error is tolerable. It is also a great helper in data cleansing exercises. Having too much faith in the results where exact matches are required will cause you to fall over at some point, so be careful. If you are engaged in such an exercise and want some experienced support, please get in touch.
If you have come across any articles that you think really contribute something to the understanding of fuzzy matching technologies in SQL Server / SSIS, please let me know or post a link in the comments so I can improve this article.
In this post I will be covering the Fuzzy Grouping Transformation. The sample package can be found here for 2005 and guidelines on use are here.
What does the Fuzzy Grouping Transformation do?
The Fuzzy Grouping Transformation allows you to identify similar items within a dataset. It doesn’t – as its name perhaps suggests – actually perform any group by operations. However, by identifying similar rows it does provide you with the information you need to group rows by. It uses the same algorithms as the Fuzzy Lookup transformation, and operates a bit like a Fuzzy Lookup except that it effectively creates its own runtime lookup table, based on the incoming data, which it then references to identify similar rows.
If this is all getting a bit confusing, let’s take a look at an example of some output:
This shows three input rows with the field FullName in which the Fuzzy Grouping transformation has been told to look for similar values. These three rows have all been judged similar to the FullName “Gregory Alderson”. Note that – as with the Fuzzy Lookup – each match is assigned a similarity score (in the column _Similarity_FullName). Input rows are also assigned a unique key (in the column _key_in) and when a row is matched to another, the key of that row is also stored (in the column _key_out). Note that the Fuzzy Grouping transformation outputs exactly one row for each input row.
So what has actually happened here?
First, the component has loaded all the data into temporary tables.
Second, it scanned through using fuzzy algorithms to look for similar items.
Third, for each input row it output either the best match above the threshold similarity, or otherwise decided it had no matches.
You can then use this output to perform an aggregate group by operation to sum up your data by similar items, using the Aggregate Transformation.
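The three steps above can be mimicked in a few lines of Python. This is only a conceptual sketch – the trigram similarity measure and the single matching pass are stand-ins for SSIS’s internal (and undocumented) algorithms, and the temporary-table machinery is skipped entirely:

```python
def qgrams(s, q=3):
    """Overlapping character fragments, standing in for SSIS's q-grams."""
    s = s.lower()
    return {s[i:i + q] for i in range(len(s) - q + 1)}

def sim(a, b):
    """Jaccard similarity over trigrams - an illustrative stand-in."""
    ga, gb = qgrams(a), qgrams(b)
    return len(ga & gb) / len(ga | gb) if ga and gb else 0.0

def fuzzy_group(names, threshold=0.5):
    """Steps 2 and 3: scan each row against earlier rows and emit either
    the best match above the threshold, or the row keyed to itself."""
    rows = []
    for key_in, name in enumerate(names):
        best_key, best_sim = key_in, 0.0  # default: no match, own group
        for prior in rows:
            s = sim(name, prior["name"])
            if s >= threshold and s > best_sim:
                # inherit the earlier row's group key, like _key_out
                best_key, best_sim = prior["key_out"], s
        rows.append({"key_in": key_in, "key_out": best_key,
                     "name": name, "similarity": best_sim})
    return rows

groups = fuzzy_group(["Gregory Alderson", "Gregory Aldersen",
                      "Greg Alderson", "Jane Doe"])
# The first three rows share key_out 0; "Jane Doe" keys to itself.
```

The subsequent aggregation is then just a group by on key_out – for example `collections.Counter(r["key_out"] for r in groups)` to count rows per similar item.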
Configuring the Fuzzy Grouping Transformation
The Fuzzy Grouping Transformation is configured over three tabs when you open the component. The first thing you need to configure is the database connection it will use to create the temporary tables it needs to perform the fuzzy matching on the Connection Manager tab. These will be built in TempDB, so ensure the user specified in your connection manager has appropriate permissions on TempDB.
Next, on the Columns tab, select – in the ‘Available Input Columns’ box – the columns that are going to be used to identify similar rows, and those which will simply pass through the component. If a column is to be grouped, check the box on its left hand side; if it is simply to pass through, check the box on its right hand side. Any column selected for grouping will be passed through automatically.
For each column to be analysed for similarity there are a range of settings that influence how the matching occurs and how some output is named. These work as follows:
Input Column – the column selected with the checkbox for grouping
Output Alias – the name of the Input column in the Output data flow
Group Output Alias – the column which will hold the best match value from the grouping
Match Type – Exact or Fuzzy matching
Minimum Similarity – The minimum similarity for matches for that column
Similarity Output Alias – the name of the output column containing the similarity score for that column
Numerals – tunes how the matches handle numbers in the data
Comparison Flags – fine tuning the string handling
Full details of how these settings work – especially the last two – can be found here on MSDN. The one I would expect you to play with most often is Match Type – when you are grouping on multiple columns and are confident some of them have good quality data, setting those columns to exact match will improve the overall result for matching a row of data. Similarly, with Minimum Similarity you can tune the quality of match you will accept on individual columns.
Note you are not limited to a single column for grouping – you can select several which will contribute to the overall similarity score for the row. In the sample package I demonstrate this in data flow 2.
Finally, on the Advanced tab we set the global values that will affect the result you get. The key one is the Similarity Threshold – this is what will set the minimum threshold of match that will be accepted overall for a row to be regarded as similar to another row. The right setting for this value will depend on the data you have and what quality of match you will accept. The Similarity score this evaluates against is a function of the similarity scores of all the columns that are being evaluated for the match.
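How SSIS combines the per-column scores into the row-level Similarity isn’t published in detail, but conceptually it behaves like a weighted combination – something along these lines (the function and weights here are assumptions for illustration, not the real calculation):

```python
def overall_similarity(column_scores, weights=None):
    """Hypothetical row-level score: a weighted mean of the per-column
    similarity scores. SSIS's actual combining function is internal."""
    if weights is None:
        weights = [1.0] * len(column_scores)  # equal weighting assumed
    return sum(s * w for s, w in zip(column_scores, weights)) / sum(weights)

# Name matched well, suburb matched poorly:
row_score = overall_similarity([0.92, 0.60])
accept = row_score >= 0.75  # the Advanced-tab Similarity Threshold
```

The point to take away is that one strongly matching column can drag a row over the threshold even when another column matches weakly – which is why tuning the per-column Minimum Similarity settings matters.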
Where should you use the Fuzzy Grouping Transformation?
The scenario I can see this being most useful in is in an analytical context. From a data cleansing point of view it can be a quick way of assessing how effective a data deduplicating exercise may be, and what columns and parameters to use. From a purely analytical approach it would be useful for doing rough groupings on data where exact results aren’t important and a degree of error can be tolerated.
Be aware that this component is a blocking transformation – i.e. it requires all input data to be consumed and processed before it will output anything, which can result in poor performance.
MSDN Documentation for the Fuzzy Grouping Transformation can be found here for 2008 and here for 2005.
If you are still struggling, try these additional resources:
In this slightly long post I will be covering the Fuzzy Lookup Transformation, one of my favourite components in SSIS. The sample package can be found here for 2005 and guidelines on use are here.
How can a Lookup be Fuzzy?
A lookup becomes Fuzzy when it can match to records that are similar, but not identical to, the lookup key. For example, it can match “Jon Smith” to “John Smith” – this component is very useful for helping consolidate client data such as names and addresses from multiple systems. For those of you who want to know more, I thoroughly recommend this article – Fuzzy Lookup and Fuzzy Grouping in SQL Server Integration Services 2005 – which explains (unusually well for tech doco on BOL) how the Fuzzy Lookup process actually works.
An important thing to get to grips with is that when the Fuzzy Lookup outputs its matches, it also outputs a Similarity and Confidence score for the match. The Similarity score is easy to understand – it’s a measure on a 0-1 scale of how similar the matched item is to the lookup key. So for example, “John Smith” and “Jon Smith” rate a 0.89, whereas “John Smith” and “James Smith” rate a lower 0.62. The Confidence score is on the same scale but not fixed for a given match – essentially the higher it is, the lower the number of alternative matches it has found. So from my example, “Johnathan Fruitbat” has just two matches, and the highest Confidence is 0.38. In contrast “John James Walliams” has 15 matches and the highest Confidence is just 0.08.
In practice I have tended to pay most attention to the Similarity score. In the example I’ve prepared, I have created a list of names of varying similarity to help you get a feel for how the matching functions in practice, and what the scores look like.
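The exact formula SSIS uses for Confidence isn’t documented, but the behaviour described above – more alternative matches, lower confidence – can be illustrated by treating each candidate’s confidence as its share of the total similarity across all candidates. This is a hypothetical model for intuition only, not the real calculation:

```python
def confidences(similarities):
    """Hypothetical confidence model: each candidate match's share of the
    total similarity mass, so confidence falls as alternatives multiply."""
    total = sum(similarities)
    return [s / total for s in similarities] if total else []

# Two candidates (like "Johnathan Fruitbat"): the best carries real weight.
few = confidences([0.8, 0.5])
# Fifteen candidates (like "John James Walliams"): confidence collapses.
many = confidences([0.8] + [0.7] * 14)
```

Under this model the best of two candidates scores far higher confidence than the best of fifteen, even at identical similarity – which matches the behaviour you see from the component.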
What are Fuzzy Lookup Indexes?
Like a normal lookup, you need an input table and a lookup reference table. However in this case the lookup reference is called an Index (this is a bit misleading as it is not an Index in the normal database sense of the word). These indexes store fragments of the lookup items for fuzzy searching. When you use a Fuzzy Lookup you have 4 options for working with these indexes.
Create a new index every time (“Generate New Index”)
Create a new index and store it (as above, plus check “Store New Index” and provide a name for it)
Create a new index with maintenance features (as above, plus also check “Maintain Stored Index”)
Reuse an existing index (“Use Existing Index”, and select it)
Option 1 is fairly easy to understand – it creates the Index in memory from the reference table, uses it in that data flow, then discards it. This would be used when the Fuzzy Lookup you are doing is a one-off. Option 2 is similar except it stores the Index as a static object in the database. Option 3 is slightly more complex – it creates a trigger on your reference table to ensure that if it is updated, the Indexes for the Fuzzy Lookup also get updated. Finally, Option 4 takes an index that already exists as a database object and uses this to do the fuzzy matching.
So in what scenarios would you use the different options? The answers below match the option numbers above:
A one off Lookup never to be repeated
The first run, to be followed by repeated Lookups against a fixed reference (Data Flow 1 in the example package)
The first run, to be followed by repeated Lookups against a changing reference
Subsequent runs of a repeated Lookup (Data Flow 2 in the example package)
You only have to create the Lookup index once – this is important as this part of the process can be fairly intensive and time consuming – not something you want to repeat every run. So if you know you will be doing repeated lookups, when building your ETL processes you will have to factor in a one off load to initiate the Index.
Advanced settings for the Fuzzy Lookup
I will skip over the Columns tab, as it simply requires you to match the column in your source to the lookup key in your reference table, and specify any additional columns you would like returned.
Under the Advanced Tab, you effectively have 3 options:
First is “Maximum number of matches to output per lookup”. This allows you to constrain the number of matches the fuzzy lookup will return per input key. So again referring to my example, if I set this to 10, even though “John James Walliams” has 15 possible matches, only 10 would be returned. “Johnathan Fruitbat” would still send back just 2 matches, as there are only two to return. An important takeaway here is that the Fuzzy Lookup does not return one row per input row – it can return many, or even zero.
Second is “Similarity Threshold” – this is a sliding scale running from 0 to 1 which sets a cut-off for the quality of returns from Fuzzy Lookup. In practice you rarely want to set this much below 0.75 – but you will need to do a few test runs to see which level is appropriate for your data.
Third is “Token Delimiters” – this is one of the means the Fuzzy Lookup uses to break your lookup keys into smaller pieces for fuzzy matching. You would only add to or remove from these if your data has special characters delimiting (or not delimiting) items within the data.
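The interplay of the first two settings can be sketched as a simple top-k filter. The trigram similarity below is an illustrative stand-in for SSIS’s internal scoring (which rates pairs like “John Smith”/“Jon Smith” considerably higher), so the threshold in this sketch is lower than you would use in practice:

```python
def qgrams(s, q=3):
    """Overlapping character fragments, standing in for SSIS's q-grams."""
    s = s.lower()
    return {s[i:i + q] for i in range(len(s) - q + 1)}

def sim(a, b):
    """Trigram Jaccard similarity - a stand-in for SSIS's scoring."""
    ga, gb = qgrams(a), qgrams(b)
    return len(ga & gb) / len(ga | gb) if ga and gb else 0.0

def fuzzy_lookup(key, reference, threshold=0.4, max_matches=10):
    """Return up to max_matches reference rows at or above the threshold,
    best first. Note it can return many rows per input key, or none."""
    matches = [(ref, sim(key, ref)) for ref in reference]
    matches = [m for m in matches if m[1] >= threshold]
    matches.sort(key=lambda m: m[1], reverse=True)
    return matches[:max_matches]

hits = fuzzy_lookup("John Smith", ["Jon Smith", "James Smith", "Zed Zebra"])
# Only "Jon Smith" clears the 0.4 threshold under this crude measure.
```

Raising the threshold trims weaker candidates before the max-matches cap is ever applied, which is why the two settings need to be tuned together against your own data.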
Finally, a couple of properties only available from the Properties pane are WarmCaches and Exhaustive. WarmCaches specifies whether the Indexes and Lookup Table are partly preloaded into memory – by default this is set to True and I can’t really see a scenario where this should be changed. Exhaustive increases the intensity of the matching process – making the checking more thorough – but at a performance cost. I would mark this as an option to play with depending on how well your matching is doing, and set it to True where possible.
Where would you use the Fuzzy Lookup?
The main uses for the Fuzzy Lookup that I have encountered have been matching supposedly common data from different data sources, such as client lists, and deduplicating reference data where items have been entered many times with slight variations. Essentially any time you need to tie up data that should be – but isn’t – the same, this is a very useful tool.
MSDN Documentation for the Fuzzy Lookup Transformation can be found here for 2008 and here for 2005.