SQL Server Data Quality Services – First Look
So I have managed to get SSDQS up and running, and here’s what I get when I open the SSDQS Client:
For those without a microscope (you can click on the image to see the full screen), the client has 3 areas of activity:
- Knowledge Base Management
- Data Quality Projects
- Administration
I’ll dive into the 3 areas briefly below, then head into more depth on each one in subsequent posts.
Knowledge Base Management
Knowledge Bases are at the heart of SSDQS functionality. Here you define the rules that SSDQS will apply when validating data, and the action taken when those rules are breached. The idea behind calling them “Knowledge Bases” is that they capture knowledge about the data in your organisation – effectively they are catalogues of rules. (I rather hope this name changes as Microsoft already heavily uses Knowledge Base as a term in its support activities, so it’s a bit of a confusing choice).
Anyway, nomenclature moans aside, the Knowledge Base is the most important thing to grasp here. It’s the container for the rules that the SSDQS engine, whether driven through the client or through SSIS, will use when validating data. The Knowledge Base contains Domains, each of which represents an area of Data Quality – so for example, in an Address-style Knowledge Base this could be a single field, such as City, or a compound selection of fields, such as Full Address, made up of Address, City and Postal Code. Hence the term Domain – it’s important to note that SSDQS is not a single-field-only validation engine: it can deal with collections of fields, and can also create different Domains relating to the same fields, allowing Data Quality to be applied with context. Again, sticking with the Address model, this means a Knowledge Base can subject Addresses from different countries to different rules by creating two Domains over the same set of fields and calling the appropriate Domain. There also appear to be fuzzy matching capabilities as well.
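To make the Domain idea concrete, here’s a rough conceptual sketch in Python. To be clear, this is not how SSDQS works internally or any real API – Domains are configured through the client, not code – and all the function names, the sample city list and the postcode pattern below are invented purely for illustration of single-field versus composite domains:

```python
import re

# Invented for illustration only: a single-field "domain" is modelled here
# as a function that says whether one value is valid.

def make_city_domain(valid_cities):
    """Single-field domain: a value is valid if it appears in a known list."""
    valid = {c.lower() for c in valid_cities}
    return lambda value: value.lower() in valid

def make_uk_postcode_domain():
    """Single-field domain: a crude shape check for UK-style postcodes."""
    pattern = re.compile(r"^[A-Z]{1,2}\d[A-Z\d]? \d[A-Z]{2}$")
    return lambda value: bool(pattern.match(value))

def make_composite_domain(field_domains):
    """Composite domain: valid only if every member field passes its own check."""
    def validate(record):
        return all(check(record[field]) for field, check in field_domains.items())
    return validate

# A "Full Address"-style composite built from single-field domains
full_address = make_composite_domain({
    "City": make_city_domain(["London", "Leeds"]),
    "PostalCode": make_uk_postcode_domain(),
})

print(full_address({"City": "London", "PostalCode": "SW1A 1AA"}))  # True
print(full_address({"City": "Paris", "PostalCode": "SW1A 1AA"}))   # False
```

The point of the sketch is the context part: you could build a second composite over the same two fields with different rules (say, for non-UK addresses) and pick whichever domain fits the record.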
Data Quality Projects
Projects are actual applications of Knowledge Bases across sets of Data. From what I can see so far, there are two modes of operation: Cleansing & Matching.
Cleansing is the application of the Knowledge Base against a set of data to test and refine the rules you have in place. Matching is applying the Knowledge Base against a set of data, using what appear to be fuzzy matching algorithms to match records against existing ones.
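For anyone unfamiliar with the general idea of fuzzy matching, here’s a minimal sketch using Python’s standard-library `difflib.SequenceMatcher`. This is emphatically not the algorithm SSDQS uses (that isn’t visible from the client); it just shows the basic shape – score each candidate against known records and keep those above a similarity threshold, where the threshold value and record strings are made up for the example:

```python
from difflib import SequenceMatcher

def fuzzy_match(candidate, existing_records, threshold=0.8):
    """Return (record, score) pairs whose similarity to candidate meets threshold."""
    matches = []
    for record in existing_records:
        # ratio() gives a similarity score between 0.0 and 1.0
        score = SequenceMatcher(None, candidate.lower(), record.lower()).ratio()
        if score >= threshold:
            matches.append((record, round(score, 2)))
    return matches

known = ["10 Downing Street, London", "221B Baker Street, London"]
# A slightly mistyped address still matches its intended record
print(fuzzy_match("10 Downing St, London", known))
```

The interesting design question in any matching setup is where to set that threshold: too low and you merge records that shouldn’t be merged, too high and duplicates slip through – which is presumably part of what the “matching default parameters” in the Administration console are for.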
I’ll be able to flesh this out more once I have some data to play with (I’m going to have to create some, as there don’t appear to be any samples to work with) – but the thing I like about this is that these projects are not an isolated process with a one-off end result; they feed back into the Knowledge Base, so they become part of an ongoing process – which is exactly what dealing with Data Quality should be.
Administration
Well, there’s always some Admin – in this case the console covers Activity Monitoring and Configuration. Activity Monitoring covers usage and activities against Knowledge Bases. Configuration manages some matching default parameters and connections to 3rd party Reference databases. Thrilling stuff.
Summary & Links
So far, SSDQS looks like a powerful and user-friendly solution to Data Quality problems. In subsequent posts I’ll dig into the capabilities and start picking it apart.
In the meantime, here are some useful resources to get you started: