In the previous post we looked at creating a Knowledge Base through the Knowledge Discovery process. This gave us a first glimpse of what can be done in terms of managing incoming values and providing correction.In this post we will look at the more advanced capabilities of managing incoming data quality issues such as format rules, reference data, etc.
In a quick update to yesterdays post, the helpful team @ the MSDN DQS Forum have answered my query about the difference between Invalid and Error values:
An Invalid value is a value that does not belong to the domain and doesn’t have a correction. For example, the value 12345 in a City domain. An Error value is a value that belongs to the domain but is a syntax error. For example Shicago instead of Chicago in a City domain.
The functional difference: when the system sets statuses for values, it does so according to the above semantics. If a value failed a domain rule, its status will be Invalid. If the system detects a syntax error and an associated correction, the erroneous value status will be Error.
However DQS does do not enforce this semantics on manual operations. So you can enter a correction for an Invalid value without changing its status, and you can remove a correction to an Error value without changing the status as well.
So the upshot is Invalid = Unusable, Error = Correctable.
Domain Management: Domain Rules
If we open our previously created Knowledge Base in Domain management mode, we get a list of Domains, as below:
And, to the right hand side we see our options for setting the Data Quality rules that apply to the domain in question:
The full list of tabs is:
- Domain Properties
- Reference Data
- Domain Rules
- Domain Values
- Term-Based Relations
Domain properties are the same as those available when creating a Domain – with the restriction that you cannot change the data type of the domain. The values available are:
- Domain Name
- Domain Description
- Data Type (fixed after creation)
- Use Leading values – a checkbox
- Format output to – which gives a dropdown of formats suitable to the data type
Nothing terribly exciting here, so moving on…
This was covered in my previous post, but as a lightning recap here is where you can:
- Flag data values as Correct, Invalid or Error
- Provide corrected values for Errors
- Add / Delete Values manually
So, lets move on to new features.
Now we are getting into the syntax based validation that DQS can apply.
In the picture above I have generated a simple rule with a couple of elements – that Country must be at least 3 characters and does not contain any full stop, in an attempt to filter out any abbreviations. Now its worth pointing out at this stage that anything that fails a Domain Rule is considered Invalid – i.e. it is an unusable value that cannot be corrected. So by applying this rule I will render any abbreviations in my Domain Values list Invalid – and the DQS client is kind enough to warn me of this if I apply the rule:
The warning tells me I will increase my Invalid value count and consequently decrease my Valid value count.
It’s important to note that the rules will NOT apply to any cases where there is already a correction defined in the Domain Values. For example, the value “UK” which I corrected to “United Kingdom” is unaffected by this rule, but “HK” is rendered invalid as I have not corrected HK to “Hong Kong”.
So what rule types are available? Broadly it covers:
- Begins with
- Ends with
- Pattern Matching
- Regular Expression matching
As I showed in my example you can combine rule types with AND / OR operators so they become quite flexible. You can also have multiple rules per domain if you want to simplify managing multiple non dependent conditions.
Handily you can also test a rule during development – just above the rules list is a couple of buttons fro test runs – one allows you to test the rule against exiting Domain Values, and the other allows you to test it against a sample data set.
Term Based Relations
This is not the most obviously named component, but is effectively a Find and Replace engine. So you could repair common typos – e.g. change “teh” the “the” or expand common abbreviations – e.g change “Inc.” to “Incorporated”.
This is only available for String Domain types, and doesn’t appear as an option for other types. MSDN Documentation here.
Data Quality Services also allows you to use external sources to validate data against – so far this only covers Azure Datamarket – but the team have hinted that other options will become available. As this feature seems to be half baked at this point I’m not going to dig much further.
This post has skimmed over some of the features of DQS once you have an active Knowledge Base and Domains set up. The headlines of what we can see is:
- Domain Values: Validation and correction of values against a known list
- Domain Rules: Validation against formatting, patterns and values using a set of rules
- Term Based Relations: Effectively “Find and Replace”
- Reference Data: Validation against external data sources
There’s some things I’ve skipped over for future posts, such as composite domains and reference data in depth. But so far, still looking good!