The Fuzzy Grouping Transformation

Fig 1: The Fuzzy Grouping Transformation
Fig 1: The Fuzzy Grouping Transformation

In this post I will be covering the Fuzzy Grouping Transformation. The sample package can be found here for 2005 and guidelines on use are here.

What does the Fuzzy Grouping Transformation do?

The Fuzzy Grouping Transformation allows you to identify similar items within a dataset. It doesn’t – as its name perhaps suggests – actually perform any group by operations. However it does provide you with the information you need to group rows by, as what it does is help identify similar rows within a dataset. It uses the same algorithms as the Fuzzy Lookup transformation, and operates a bit like a Fuzzy Lookup except that it effectively creates its own runtime lookup table, based on the incoming data, which it then references to identify similar rows.

If this is all getting a bit confusing, let’s take a look at a example of some output:

Fuzzy Grouping Output
Fig 2: Fuzzy Grouping Output

This shows three input rows with the field FullName in which the Fuzzy Grouping transformation has been told to look for similar values. These three rows have all been decided to be similar to the FullName “Gregory Alderson”. Note that – as with the Fuzzy Lookup – the match is assigned a similarity score (in the column _Similarity_FullName). Input rows are also assigned a unique key (in the column _key_in) and when a row is matched to another, the key of that row is also stored (in the column _key_out). Note the Fuzzy Lookup component is a synchronous transformation – i.e. you will get one output row for each input row.

So what has actually happened here?

  • First, the component has loaded all the data into temporary tables.
  • Second, it scanned through using fuzzy algorithms to look for similar items.
  • Third, for each input row it output either the best match above the threshold similarity, or otherwise decided it had no matches.

You can then use this output to perform an aggregate group by operation to sum up your data by similar items, using the Aggregate Transformation.

Configuring the Fuzzy Grouping Transformation

The Fuzzy Grouping Transformation is configured over three tabs when you open the component. The first thing you need to configure is the database connection it will use to create the temporary tables it needs to perform the fuzzy matching on the Connection Manager tab. These will be built in TempDB, so ensure the user specified in your connection manager has appropriate permissions on TempDB.

Configuring the Connection Manager for the Fuzzy Grouping Transformation
Fig 3: Configuring the Connection Manager for the Fuzzy Grouping Transformation

Next, on the Columns tab, select the columns that are going to be used to identify similar rows, and those which will simply pass through the component in the ‘Available Input Columns’ box. If a column is to be grouped, check on the left hand side. If it is simply to pass through just check on the right hand side. Any column selected for grouping will be passed through automatically.

Configuring the Columns used in the Fuzzy Grouping Transformation
Fig 4: Configuring the Columns used in the Fuzzy Grouping Transformation

For each column to be analysed for similarity there are a range of settings that influence how the matching occurs and how some output is named. These work as follows:

  • Input Column – the column selected with the checkbox for grouping
  • Output Alias – the name of the Input column in the Output data flow
  • Group Output Alias – the column which will hold the best match value from the grouping
  • Match Type – Exact or Fuzzy matching
  • Minimum Similarity – The minimum similarity for matches for that column
  • Similarity Output Alias – the name of the output column containing the similarity score for that column
  • Numerals – tunes how the matches handle numbers in the data
  • Comparison Flags – fine tuning the string handling

Full details for how these settings work – especially the last two – can be found here on MSDN. I would expect the most common ones you would play with would be Match Type – when grouping on multiple columns and are confident some of them have good quality data, setting to exact match will improve the overall result for matching a row of data. Similarly with Minimum Similarity, you can tune for the quality of match you will accept on individual columns.

Note you are not limited to a single column for grouping – you can select several which will contribute to the overall similarity score for the row. In the sample package I demonstrate this in data flow 2.

Configuring the Advanced options for the Fuzzy Grouping Transformation
Fig 5: Configuring the Advanced options for the Fuzzy Grouping Transformation

Finally, on the Advanced tab we set the global values that will affect the result you get. The key one is the Similarity Threshold – this is what will set the minimum threshold of match that will be accepted overall for a row to be regarded as similar to another row. The right setting for this value will depend on the data you have and what quality of match you will accept. The Similarity score this evaluates against is a function of the similarity scores of all the columns that are being evaluated for the match.

Where should you use the Fuzzy Grouping Transformation?

The scenario I can see this being most useful in is in an analytical context. From a data cleansing point of view it can be a quick way of assessing how effective a data deduplicating exercise may be, and what columns and parameters to use. From a purely analytical approach it would be useful for doing rough groupings on data where exact results aren’t important and a degree of error can be tolerated.

Be aware that this component is a blocking transformation – i.e. it requires all input data to be consumed and processed before it will output anything, which can result in poor performance.

MSDN Documentation for the Fuzzy Grouping Transformation can be found here for 2008 and here for 2005.

If you are still struggling, try these additional resources:

If you need specific help or advice, or have suggestions on the post, please leave a comment and I will do my best to help you.

10 thoughts on “The Fuzzy Grouping Transformation

  1. the i/p rows with score 1 that are being used to compare other rows, how r they being selected?? On what basis is a row being given a score of 1.

  2. Hey.. I have a huge Table say having 5 crores of records in the table. Applying fuzzy grouping on this table would be a bit costly… Please can you suggest an alternative logic that I can use to cleanup my data.
    Also, ways in which I can use my fuzzy grouping on such a huge data.

  3. You could use the Fuzzy Lookup and store the results. Or if SQL2012 is an option, use DQS to de-dup, or T-SQL Fuzzy functions in 2008R2. Depends what you are trying to achieve…

  4. In this example, you have used “Gregory Alderson” as the matching criteria for other values in FullName. I tried this, but in my project each record was compared with it’s own value. Like, in Fig 2, 2nd row, “Gregory Anderson” is compared with “Gregory Alderson”, but for me it is getting compared with “Gregory Anderson” itself, and thus for each row ” _Similarity_FullName”, value is 1. So, can you guide me, where I have to update “FullName_clean” column with Gregory Anderson for all the rows.

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <s> <strike> <strong>