A common requirement in Data Warehousing is to apply a Data Quality “score” to records as they come in. The score is then used to identify bad incoming data and, depending on its assigned quality, filter or fix it.
A practical example: in a Customer Address record, a missing Postcode might attract a high score, as it is a very important field, whereas a badly formatted work, home or mobile telephone number may attract a lower score because any single number may not be as important to the business. Cumulatively, though, if all three numbers are badly formatted it may be necessary to give the record a combined high score so that it gets examined.
An example of this is below. A failed Postcode gets a score of 3, and a failed telephone number gets a score of 1. Anything with a score of 3 or above therefore has either a failed Postcode or three failed telephone numbers, and can be subject to special handling.
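The scoring arithmetic above can be sketched in a few lines. A minimal Python illustration, assuming the weights from the example (3 for a failed Postcode, 1 per badly formatted telephone number):

```python
def address_score(postcode_ok: bool, phones_ok: list[bool]) -> int:
    # Hypothetical weights from the example above: 3 for a failed
    # Postcode, 1 for each badly formatted telephone number.
    score = 0 if postcode_ok else 3
    score += sum(1 for ok in phones_ok if not ok)
    return score

# A missing Postcode alone, or all three phone numbers failing,
# both reach the threshold of 3.
```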
From the example above we can see this is a fairly arbitrary process in terms of how scores are calculated and used. SSDQS doesn’t natively support assigning a score or weight to a failed data item, but it does provide a flexible engine for deciding what counts as a failed data item. SSIS can then react to this pass/fail behaviour and apply a score.
Setting up an SSDQS Knowledge Base for Scoring
Given that the basis for scoring is pretty binary in nature, I set up a simple KB that had domains that would either pass or fail a piece of data. I first created a data set with three data fields:
- Year – Values ranging from 1970 to 2025
- Value – Values ranging from 0 to 100
- Code – Values A,B,C,D,E
I then set up a KB to evaluate the fields as follows:
- Year – Valid from 1975 to 2020
- Value – Valid from 10 to 95
- Code – Valid values A,C,E
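The three domain rules above amount to simple range and membership checks. A rough Python equivalent of the pass/fail logic (the function name and return shape are illustrative, not part of DQS):

```python
def domain_checks(year: int, value: int, code: str) -> dict[str, bool]:
    # Rough Python equivalents of the three DQS domain rules;
    # True means the value passes its rule.
    return {
        "Year":  1975 <= year <= 2020,
        "Value": 10 <= value <= 95,
        "Code":  code in {"A", "C", "E"},
    }
```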
Note that I did not set up any Domain Values or do any training – I just set up the KB, Domains and Domain Rules. All I want to use DQS for is to identify records that are invalid for SSIS to use in scoring.
Using SSIS to Score SSDQS Output
Next I hooked up my SSIS Data Quality Cleansing Component to push the source data through the Knowledge Base and get the status of each column after it passes through. As there are no preloaded valid values in the Domains, the status comes back as either “Invalid” (it failed the Domain rule) or “Unknown” (which, in this configuration, translates to a correct value).
The Data Quality Cleansing Component doesn’t support scoring in itself, so this has to be added using a Derived Column transform on an item-by-item basis. Using a simple IF / THEN / ELSE expression, I assign a score of 1 to each failed column based on the status of the record, as below:
Because of the pipeline nature of SSIS, I then need to add a second Derived Column transform downstream to weight and add the scores together to create a final, record-level score:
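Taken together, the two Derived Column steps amount to mapping each column’s status to 0 or 1 and then taking a weighted sum. A minimal Python sketch of that logic (the column names and weights are hypothetical; the real package does this with SSIS expressions):

```python
def column_score(status: str) -> int:
    # "Invalid" means the value failed its domain rule;
    # "Unknown" counts as a correct value in this configuration.
    return 1 if status == "Invalid" else 0

def record_score(statuses: dict[str, str], weights: dict[str, int]) -> int:
    # Weighted sum of the per-column scores (the second Derived Column).
    return sum(column_score(s) * weights[col] for col, s in statuses.items())
```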
This results in a final Data Quality “Score” assigned to each record:
What you then do with these scores is up to you. In my example package, I used a Conditional Split to send records with a score over a certain threshold to a different destination:
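The Conditional Split itself is just a threshold test on the record-level score. A minimal Python sketch, assuming a hypothetical redirection threshold of 3:

```python
THRESHOLD = 3  # hypothetical redirection threshold

def route(records: list[dict]) -> tuple[list[dict], list[dict]]:
    # Send records at or above the threshold to a "dirty" destination,
    # everything else to the "clean" one.
    clean, dirty = [], []
    for rec in records:
        (dirty if rec["score"] >= THRESHOLD else clean).append(rec)
    return clean, dirty
```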
Improving the Scoring process
The example I’ve created is quite simplistic – it has hard-coded weightings and redirection thresholds, and can only react to two (of a possible three) record statuses. The process could be made more flexible using metadata-driven weightings and thresholds (provided as package inputs).
Beyond that, you have the option to handle the clean and dirty data more appropriately – by pushing dirty data into a cleanup process, halting ETL processes, and so on.
The key takeaway here is that DQS enables you to create a scoring process that is independent of the actual Data Quality rules that pass or fail a piece of data. The DQS Knowledge Base is your flexible input of what qualifies as a good or bad record, instead of having to hard-code rules in SQL or Derived Columns, which could get messy very quickly.