What does the Fuzzy Grouping Transformation do?
The Fuzzy Grouping Transformation allows you to identify similar items within a dataset. It doesn’t – as its name perhaps suggests – actually perform any group by operations. However it does provide you with the information you need to group rows by, as what it does is help identify similar rows within a dataset. It uses the same algorithms as the Fuzzy Lookup transformation, and operates a bit like a Fuzzy Lookup except that it effectively creates its own runtime lookup table, based on the incoming data, which it then references to identify similar rows.
If this is all getting a bit confusing, let’s take a look at a example of some output:
This shows three input rows with the field FullName in which the Fuzzy Grouping transformation has been told to look for similar values. These three rows have all been decided to be similar to the FullName “Gregory Alderson”. Note that – as with the Fuzzy Lookup – the match is assigned a similarity score (in the column _Similarity_FullName). Input rows are also assigned a unique key (in the column _key_in) and when a row is matched to another, the key of that row is also stored (in the column _key_out). Note the Fuzzy Lookup component is a synchronous transformation – i.e. you will get one output row for each input row.
So what has actually happened here?
- First, the component has loaded all the data into temporary tables.
- Second, it scanned through using fuzzy algorithms to look for similar items.
- Third, for each input row it output either the best match above the threshold similarity, or otherwise decided it had no matches.
You can then use this output to perform an aggregate group by operation to sum up your data by similar items, using the Aggregate Transformation.
Configuring the Fuzzy Grouping Transformation
The Fuzzy Grouping Transformation is configured over three tabs when you open the component. The first thing you need to configure is the database connection it will use to create the temporary tables it needs to perform the fuzzy matching on the Connection Manager tab. These will be built in TempDB, so ensure the user specified in your connection manager has appropriate permissions on TempDB.
Next, on the Columns tab, select the columns that are going to be used to identify similar rows, and those which will simply pass through the component in the ‘Available Input Columns’ box. If a column is to be grouped, check on the left hand side. If it is simply to pass through just check on the right hand side. Any column selected for grouping will be passed through automatically.
For each column to be analysed for similarity there are a range of settings that influence how the matching occurs and how some output is named. These work as follows:
- Input Column – the column selected with the checkbox for grouping
- Output Alias – the name of the Input column in the Output data flow
- Group Output Alias – the column which will hold the best match value from the grouping
- Match Type – Exact or Fuzzy matching
- Minimum Similarity – The minimum similarity for matches for that column
- Similarity Output Alias – the name of the output column containing the similarity score for that column
- Numerals – tunes how the matches handle numbers in the data
- Comparison Flags – fine tuning the string handling
Full details for how these settings work – especially the last two – can be found here on MSDN. I would expect the most common ones you would play with would be Match Type – when grouping on multiple columns and are confident some of them have good quality data, setting to exact match will improve the overall result for matching a row of data. Similarly with Minimum Similarity, you can tune for the quality of match you will accept on individual columns.
Note you are not limited to a single column for grouping – you can select several which will contribute to the overall similarity score for the row. In the sample package I demonstrate this in data flow 2.
Finally, on the Advanced tab we set the global values that will affect the result you get. The key one is the Similarity Threshold – this is what will set the minimum threshold of match that will be accepted overall for a row to be regarded as similar to another row. The right setting for this value will depend on the data you have and what quality of match you will accept. The Similarity score this evaluates against is a function of the similarity scores of all the columns that are being evaluated for the match.
Where should you use the Fuzzy Grouping Transformation?
The scenario I can see this being most useful in is in an analytical context. From a data cleansing point of view it can be a quick way of assessing how effective a data deduplicating exercise may be, and what columns and parameters to use. From a purely analytical approach it would be useful for doing rough groupings on data where exact results aren’t important and a degree of error can be tolerated.
Be aware that this component is a blocking transformation – i.e. it requires all input data to be consumed and processed before it will output anything, which can result in poor performance.
If you are still struggling, try these additional resources:
If you need specific help or advice, or have suggestions on the post, please leave a comment and I will do my best to help you.