I’ve covered off the Fuzzy Lookup and Fuzzy Grouping transformations in SSIS and noticed in my research that these capabilities aren’t particularly coherently talked about on the web. So below I thought i’d collect some of the better articles for your late night reading. There isn’t all that much out there, unfortunately.
So, how does it all work?
Here are a few articles covering theory, mostly from Microsoft:
- Fuzzy Lookup and Fuzzy Grouping in SQL Server Integration Services 2005 – a detailed article on the use of the components, with some coverage of under the hood behaviour and performance considerations
- Fuzzy Lookups and Groupings Provide Powerful Data Cleansing Capabilities – explains at a conceptual level using the Fuzzy Lookup components and considerations in understanding the results
- Robust and Efficient Fuzzy Match for Online Data Cleaning – a research paper by the creators of the Fuzzy Lookup algorithms which goes into the guts of how fuzzy matching works, with details on Q-Grams, Error tolerant indexes and so forth
It is probably worth reiterating that because of the way the algorithms and their Q-Grams work, when longer strings are being analysed for fuzzy matches, the better the chances of a good match. When I first started using the algorithms I was doing some client matching and matched first and last names separately. Once I had a deeper understanding of the components, I started matching on a full name and the quality and reliability of matches improved significantly.
Ok, so how do I make it work?
Now, some articles covering practical implementation of the tasks:
- Fuzzy Lookup and Fuzzy Grouping transformations – guides to the practical use of the transformations from me, the BI Monkey
- Using Fuzzy Lookup Transformations in SQL Server Integration Services – a practical example of using a fuzzy lookup
- Adventures with Fuzzy Matching – from Jamie Thompson, which provides some cautions on the reliability of the results
- Fuzzy Lookup and Fuzzy Grouping transformations – MSDN documentation
The best thing you can do is get some sample data and play with the components to understand what it is they do. The results are impressive – if not bulletproof – and can make a great contribution to de-duplicating client data, etc.
And what does the BI Monkey have to say about it?
Fuzzy Matching is a powerful and easy to use tool which is great for approximate grouping of data for analysis where a margin of error is tolerable. It is also a great helper in data cleansing exercises. Having too much faith in the results where exact matches are required will cause you to fall over at some point, so be careful. If you are engaged in such an exercise and want some experienced support, please get in touch.
If you have come across any articles that you think really contribute something to the understanding of fuzzy matching technologies in SQL Server / SSIS, please let me know or post a link in the comments so I can improve this article.