SQL Server Data Quality Services – Composite Domains

One of the things I skimmed over in previous posts was the concept of Composite Domains. This is a combination of domains that are assessed with interdependencies on content.

At a very simple level, a Composite Domain addresses these kind of problems:

  • If City = “London”, Country must equal “England”
  • If Wealth Category = “Millionaire”, Bank Balance must be greater than 1 million

They allow us to validate separate data items in combination, thus allowing the writing of more complex rules beyond the already capable single field ones.

Implementing Composite Domains

Below is a couple of screenshots around setting up composite domains. First we have the definition of what fields need to be included:

DQS Client - Composite Domain Properties

DQS Client - Composite Domain Properties

This is fairly straightforward – just pick at least two fields that are interrelated from the available Domain list. The other screen of interest is under the Rules tab:

DQS Client - Composite Domain Rules

DQS Client - Composite Domain Rules

Here it can be seen that a composite domain rule has the capability to evaluate two components at a time. This is a deliberate limitation, so if you wanted to validate 3 fields in combination, you would have to do it via a set of rules that cross over. If you were looking at validating the rule:

If City = “London”, Country = “UK” & Region = “Europe”

You would have to do it through the following rules

If City = “London”, Country = “UK”
If City = “London”, Region = “Europe”

The rules allow for AND / OR at the field level, so you could have rules that read

If City = “London” OR “Birmingham”, Country = “UK”

If Bank Balance >= 2 million and < 1 billion, Wealth Category = “Multi Millionaire”

There seems to be scope for improvement here – the rule capability is a little simplistic but I imagine will meet most scenarios and does make cross field validation possible.

Other features and summary

There are two features I skipped over – Reference Data (for a bigger future post on the whole concept) and the Value Relations tab which, at this point in time seems not to be working and is just a statistical summary of the values found in the data.

There’s not much to close out on otherwise – Composite Domains allow fields to influence other fields from a data quality perspective. The documentation on this feature is sparse at this point so hopefully we’ll get more information soon.

SQL Server Data Quality Services – Domain Management

In the previous post we looked at creating a Knowledge Base through the Knowledge Discovery process. This gave us a first glimpse of what can be done in terms of managing incoming values and providing correction.In this post we will look at the more advanced capabilities of managing incoming data quality issues such as format rules, reference data, etc.

In a quick update to yesterdays post, the helpful team @ the MSDN DQS Forum have answered my query about the difference between Invalid and Error values:

An Invalid value is a value that does not belong to the domain and doesn’t have a correction. For example, the value 12345 in a City domain. An Error value is a value that belongs to the domain  but is a syntax error. For example Shicago instead of Chicago in a City domain.

The functional difference: when the system sets statuses for values, it does so according to the above semantics. If a value failed a domain rule, its status will be Invalid. If the system detects a syntax error and an associated correction, the erroneous value status will be Error.

However DQS does do not enforce this semantics on manual operations. So you can enter a correction for an Invalid value without changing its status, and you can remove a correction to an Error value without changing the status as well.

So the upshot is Invalid = Unusable, Error = Correctable.

Domain Management: Domain Rules

If we open our previously created Knowledge Base in Domain management mode, we get a list of Domains, as below:

Domain Management - Domain List

Domain Management - Domain List

And, to the right hand side we see our options for setting the Data Quality rules that apply to the domain in question:

Domain Management - Domain Rules

Domain Management - Domain Rules

The full list of tabs is:

  • Domain Properties
  • Reference Data
  • Domain Rules
  • Domain Values
  • Term-Based Relations

Domain Properties

Domain properties are the same as those available when creating a Domain – with the restriction that you cannot change the data type of the domain. The values available are:

  • Domain Name
  • Domain Description
  • Data Type (fixed after creation)
  • Use Leading values – a checkbox
  • Format output to – which gives a dropdown of formats suitable to the data type

Nothing terribly exciting here, so moving on…

Domain Values

This was covered in my previous post, but as a lightning recap here is where you can:

  • Flag data values as Correct, Invalid or Error
  • Provide corrected values for Errors
  • Add / Delete Values manually

So, lets move on to new features.

Domain Rules

Now we are getting into the syntax based validation that DQS can apply.

Data Quality Client - Domain Rules

Data Quality Client - Domain Rules

In the picture above I have generated a simple rule with a couple of elements – that Country must be at least 3 characters and does not contain any full stop, in an attempt to filter out any abbreviations. Now its worth pointing out at this stage that anything that fails a Domain Rule is considered Invalid – i.e. it is an unusable value that cannot be corrected. So by applying this rule I will render any abbreviations in my Domain Values list Invalid – and the DQS client is kind enough to warn me of this if I apply the rule:

Data Quality Client - Domain Rule Warning

Data Quality Client - Domain Rule Warning

The warning tells me I will increase my Invalid value count and consequently decrease my Valid value count.

It’s important to note that the rules will NOT apply to any cases where there is already a correction defined in the Domain Values. For example, the value “UK” which I corrected to “United Kingdom” is unaffected by this rule, but “HK” is rendered invalid as I have not corrected HK to “Hong Kong”.

So what rule types are available? Broadly it covers:

  • Length
  • Value
  • Contains
  • Begins with
  • Ends with
  • Pattern Matching
  • Regular Expression matching

As I showed in my example you can combine rule types with AND / OR operators so they become quite flexible. You can also have multiple rules per domain if you want to simplify managing multiple non dependent conditions.

Handily you can also test a rule during development – just above the rules list is a couple of buttons fro test runs – one allows you to test the rule against exiting Domain Values, and the other allows you to test it against a sample data set.

Term Based Relations

This is not the most obviously named component, but is effectively a Find and Replace engine. So you could repair common typos – e.g. change “teh” the “the” or expand common abbreviations – e.g change “Inc.” to “Incorporated”.

Domain Management - Term Based Relations

Domain Management - Term Based Relations

This is only available for String Domain types, and doesn’t appear as an option for other types. MSDN Documentation here.

Reference Data

Data Quality Services also allows you to use external sources to validate data against – so far this only covers Azure Datamarket – but the team have hinted that other options will become available. As this feature seems to be half baked at this point I’m not going to dig much further.

Summary

This post has skimmed over some of the features of DQS once you have an active Knowledge Base and Domains set up. The headlines of what we can see is:

  • Domain Values: Validation and correction of values against a known list
  • Domain Rules: Validation against formatting, patterns and values using a set of rules
  • Term Based Relations: Effectively “Find and Replace”
  • Reference Data: Validation against external data sources

There’s some things I’ve skipped over for future posts, such as composite domains and reference data in depth. But so far, still looking good!

SQL Server Data Quality Services – Creating a Knowledge Base

So far we’ve opened up the client and taken a look at the areas for working with. In this post I’ll look at setting up a Knowledge Base from scratch using some sample data I’ve mocked up based on some Netflix catalogue data.

Note all of the images are of near full screen so I’ve shrunk them in the post, just click on the image to see the full screen.

Creating a Knowledge Base

To create a new KB just select “New Knowledge Base” from the client front screen, and give it a name and description. You can either clone an existing one on already the server or import, but in this case I’m just going to start from scratch. There’s three possible modes in which you can create a new KB:

  • Domain Management – Creating from scratch with no guidance
  • Knowledge Discovery – Using a sample data set to guide building your KB with a view to using it for data cleansing rules
  • Matching Policy – Using a sample data set to guide building your KB with a view to using its record matching capabilities
SSDQS Client: Create new Knowledge Base

SSDQS Client: Create new Knowledge Base

As I’m not attempting to do any matching, I’m using the Knowledge Discovery approach. (Note: because I’m using an Excel source I need Excel installed on the machine).

Mapping Columns to Domains

Once I’ve picked my data source, I need to map columns:

Data Quality Client - Field Mapping

Data Quality Client - Field Mapping

At this stage, there is no facility to automatically create domains, so before I map the fields to a Domain, I need to create each of them:

Data Quality Client - New Domain

Data Quality Client - New Domain

Now a (known) defect appears to be fields from Excel are automatically treated as text – so if they contain numbers you cannot map them to a Numeric domain. So I’ll skip over evaluating Year for now.

Analysing and Managing the results

Once all the mapping is done, click next and you can upload the file for analysis. Now my sample file is 5,000 records, and it took DQS a few minutes just to upload this – so big sets may take a while. Anyway, the output of this is an analysis of the data, displayed in a profiler screen:

Data Quality Client - Discovery Output

Data Quality Client - Discovery Output

This breaks down the records by the following criteria:

  • New records – in this case, all of them as it’s a first pass
  • Unique - how many unique records
  • Valid - how many valid records – again, as it’s a first pass, everything is valid

So, we skip on to the next stage – to manage the results of the findings. This is the first time we start to see what DQS can offer us in terms of cleansing:

Data Quality Client - Managing Domain Values

Data Quality Client - Managing Domain Values

Here we are reviewing the values found for Country, and can manage the values that come through, flagging them as Correct, Error or Invalid – and assign a corrected value to incorrect ones.

I’m trying to see if there is a functional difference between Error and Invalid. As per documentation:

The status of the value, as determined by the discovery process. You can change the type by clicking the down arrow and selecting a different type. A green check indicates that the value is correct or corrected; a red cross indicates that the value is in error; and an orange triangle with an exclamation point indicates that the value is not valid.
A value that is not valid does not conform to the data requirements for the domain. A value that is in error can be valid, but is not the correct value for data reasons.

Update: the difference has been clarified here in the DQS Forum.

Skipping over the semantics issues, what we see here is a list of the values that the DQS Client has found in the Knowledge Discovery analysis of the data. We can then flag these values to Invalid or Error as we see fit – or leave them at their default value of Correct. Once we have flagged them as not correct, it is then possible to enter the Correct value in the “Correct To” column. Handily, the client then groups your corrected values under the correct value in the list.

The final thing is to click Next and Publish the Knowledge Base (i.e. store the results back on the DQS Server)

Summary

So in this post we have quickly reviewed the creation of  Knowledge Base through the Knowledge Discovery Mode. This has allowed us to create a set of values in our Knowledge Base using some sample data and then apply some corrections to those values, using a simple GUI to manage the results.

In the next post I will look at working in more depth with this created Knowledge Base using the “Domain Management” mode.

SQL Server Data Quality Services – First Look

So I have managed to get SSDQS up and running, and here’s what I get when I open the SSDQS Client:

SQL Server Data Quality Services Client front page

SQL Server Data Quality Services Client front page

For those without a microscope (you can click on the image to see the full screen), the client has 3 areas of activity:

  • Knowledge Base Management
  • Data Quality Projects
  • Administration

I’ll dive into the 3 areas briefly below, then head into more depth on each one in subsequent posts

Knowledge Base Management

Knowledge Bases are at the heart of SSDQS functionality. Here you define the rules that SSDQS will apply when validating data, and the action taken when those rules are breached. The idea behind calling them “Knowledge Bases” is that they capture knowledge about the data in your organisation – effectively they are catalogues of rules. (I rather hope this name changes as Microsoft already heavily uses Knowledge Base as a term in its support activities, so it’s a bit of a confusing choice).

Anyway, nomenclature moans aside, the Knowledge Base is the most important thing to grasp here. It’s the container for what the SSDQS engine, either through the client or SSIS will use when validating data. The Knowledge Base contains Domains, which represents an area of Data Quality – so for example in an Address style Knowledge Base this could be a single field – such as City – or a compound selection of fields such as Full Address, made up of Address, City and Postal Code. Hence the term Domain – its important to note SSDQS is not a single field only validation engine, but can deal with collections of fields, and also create different Domains relating to the same fields, allowing the application of Data Quality with context. Again, sticking with the Address model, this means a Knowledge Base can subject Addresses from different countries to different rules by creating two Domains to address the same set of fields and calling the Domain as appropriate. There’s also what appears to be fuzzy matching capabilities as well.

Data Quality Projects

Projects are actual applications of Knowledge Bases across sets of Data. From what I can see so far, there are two modes of operation: Cleansing & Matching.

Cleansing is the application of the Knowledge Base against a set of data to test and refine the rules you have in place. Matching is applying the Knowledge Base against a set of data and applying what appears to be Fuzzy Matching algorithms over it to match it to existing records.

I’ll be able to flesh this out more once I have some data to play with (I’m going to have to create some as there don’t appear to be any samples to work with) – but the thing I like about this is that these projects are not an isolated process of an end result, but feed back into the Knowledge Base so they become part of an ongoing process – which is what dealing with Data Quality is.

Administration

Well, there’s always some Admin – it this case the console covers Activity Monitoring and Configuration. Activity monitoring covers usage and activities against Knowledge Bases. Configuration manages some matching default parameters and connections to 3rd party Reference databases. Thrilling stuff.

Summary & Links

So, so far SSDQS looks like a powerful and user-friendly solution to Data Quality problems. In subsequent posts I’ll dig into the capabilities and start picking it apart.

In the meantime, here are some useful resources to get you started:

Installing SQL Server Data Quality Services (Denali CTP3)

I’ve been getting Denali up and running today, as I have a specific interest in Data Quality Services. It’s not quite next/click/next/click so here’s a few useful pointers:

The Denali CTP3 link

Once you have it installed, to get SSDQS up and running you also need to follow this: Installing Data Quality Services (SQL Server “Denali”)

Main pointers are, post SQL Installation you need to:

  • Run DQSInstaller.exe as admin (I got a heap of error messages when registering assemblies but they can be ignored)
  • Grant your user the dqs_administrator role on the DQS_Main database (if it’s not sa already)

Over the next few days I’ll start digging in to capabilities and provide updates on this interesting new feature…

« Previous Page