I’m presenting at SQL Server User Group on Feb 14th
If you’re in Sydney on Valentine’s day and don’t like your wife / partner all that much, you can come and spend time with something you really love – SQL Server! I’m presenting at the SQL Server User Group on Feb 14th – the session title is “Introducing SQL Server Data Quality Services, plus what’s new in SQL2012 SSIS”. Please register at: http://www.sqlserver.org.au/Events/ViewEvent.aspx?EventId=577
SQL Server Data Quality Services in SQL2012 RC0 – Part 2
Since my last post on SSDQS I’ve been in touch with the development team who have raised some suggestions and workarounds to improve performance. This post will focus on that feedback and how effective it is in reducing execution times.
SSIS vs DQS Client Cleansing
The first bit of feedback was that interactive cleansing through the DQS Client was known to be faster than SSIS interaction – so my first instinct was to test just how much faster it was – and I was surprised – the speedup was around fivefold. The below chart shows my results for processing 10,000 rows with 1-5 Domains:
If this scaled, then my 3.9 hour estimate for a 1m row / 10 Domain process would shrink to under an hour. Still not ideal, but getting closer to a production viable speed.
Now, the reason behind this – as explained by the DQS team – is that the component sends discrete chunks through to validate (1000 rows at a time as far as I can tell) which the DQS Server then passes back – which adds overhead and is inefficient for the DQS Server. This is done so that the DQS Cleansing component is not a blocking component. However at this point it’s not possible as far as I can tell – to have any control over the size of these chunks.
Speeding up SSIS processing
The next bit of feedback was to suggest breaking up the work to improve throughput. There’s two ways of doing this – first is to split up the domain processing and second is to break up the data into chunks and process in parallel. So I tried this by splitting it up the following ways:
- 5 Domains through a single DQS Cleansing Task – 10,000 rows
- Each domain though a dedicated DQS Cleansing Task – 10,000 rows
- 5 Domains through 5 dedicated DQS Cleansing Tasks – 2,000 rows each – 10,000 total
To be honest, the results weren’t overwhelming:
- Untuned: 94s
- Separated Domains: 86s
- Separated Data: 78s
Given that the Separating of Domains means the data would in a real situation have to be split up and recombined, there’s probably not enough saving there to make that approach worthwhile. Splitting up the data yielded a 20% processing time saving – nice, not enough to be really useful given how long it takes normally.
Practical suggestions for the DQS Team
A direct quote from the DQS team’s mail to me was “DQS is designed to best perform on large chunks.”. Looking at the SSIS logs, it’s only sending 1,000 rows at a time – which is clearly sub optimal for DQS + SSIS to interact effectively. So there are two options available for a fix based on my understanding:
- Make the component configurable to send larger chunks – with a more SSIS like 10,000 rows default
- Make the component optionally blocking
The first just makes sense and I doubt would be a massive job to make “Rows To DQS Server” a configurable property. The second may be harder – and can probably be duplicated just by setting the new “Rows to DQS Server” property to zero or a very high number.
In practice it’s still a bit slow for very heavy DW workloads, but hopefully the above suggestion would give it a real boost in performance and make it viable for mid sized ones.
SQL Server Data Quality Services in SQL2012 RC0 – Part 1
So the key news – in case you missed it – is that SQL2012 RC0 has been made available for download. After a few battles with the Installer – first the known issue with the Distributed Replay users – then some things requiring manual installs of KB’s to get the installer to run through – I have a VM set up with it.
The DQS team have posted about the improvements made in the DQS blog – and the one I really wanted to focus on was performance via SSIS as the CTP3 offering was not viable for large data sets. So this Part 1 post is all about the performance of DQS via SSIS in RC0.
So, I set up a Knowledge Base in the same way as I did for testing CTP3, with 5 duplicate domains – just evaluating an Integer with a single rule saying that integer had to be greater than a value to be valid. Then I ran two sets of values (5k & 10k rows) through the KB via SSIS, evaluating 1,2,3,4 and 5 fields.
So how does DQS Perform?
Here’s the results- the value in the grid is Seconds taken to process.
So – have we moved on from CTP3? A bit. But not much, and enough to be accounted for by a different VM setup (as a reminder CTP3 processing 5k rows took from 20 to 45 seconds for 1-5 columns). I accept a VM may be slower than a properly configured server, but even if it was twice as quick it would still not be a viable option for industrial use.
Looking at execution time changes by number of columns / rows processed, the time taken seems to be pretty linear as rows and columns increase, so it appears DQS performance can be evaluated pretty much as:
DQS Execution Time = Spin Up Time + (Columns * (Rows * Row Process Time))
Where:
Spin Up Time = time taken for DQS engine to start (Constant)
Columns = number of columns being evaluated (Variable)
Rows = number of rows being processed (Variable)
Row Process Time = time taken to process a single row (Constant)
On my VM, Spin Up Time seems to be 7 seconds, and Rows Process Time = 0.0014 seconds.
So, if we had to validate 10 columns on 1,000,000 rows of data (not too crazy) -
DQS Execution Time = Spin Up Time + (Columns * (Rows * Row Process Time))
DQS Execution Time = 7 + (10 * (1,000,000 * 0.0014))
DQS Execution Time = 7 + (10 * (1,000,000 * 0.0014))
DQS Execution Time = 14007 seconds = 233 minutes = 3.9 hours
Which effectively rules it out as a viable production process. Note of course that my formula doesn’t make any allowance for rule complexity.
Is DQS Production ready?
As per anything, the answer is – It depends. For validating small data sets it’s in the realms of slow, but probably acceptable. For big data sets, I’d have to say no – I couldn’t use it in a production environment to validate large sets of data. I’ve added a Connect suggestion to get this on the teams radar.
Simple Data Quality Scoring with SSDQS & SSIS
A common requirement in Data Warehousing is to apply a Data Quality “score” to records as they come in. The score is then used to identify and filter or fix bad data coming in depending on its assigned quality.
A practical example of this might be that in a Customer Address record, a missing Postcode might attract a high score as it’s a very important field. However a badly formatted work, home or mobile telephone number may attract a lower score as it may not be as important to the business. Though, cumulatively, if all three numbers are badly formatted that may be necessary to give a combined high score so the record gets examined.
An example of this is below. A failed Postcode gets a score of 3, and a failed telephone number gets a score of 1. Thus, anything with a score of 3 or above either has a failed Postcode, or 3 failed telephone numbers, and can thus be subject to special handling.
From the example above we can see this is a fairly arbitrary process in terms of how scores are calculated and used. SSDQS itself doesn’t natively support assigning a score or weight to a failed data item, but what it does do is provide us with a flexible engine to help us decide what is a failed data item. SSIS can then react to this pass / fail behaviour and apply a scoring.
Setting up an SSDQS Knowledge Base for Scoring
Given that the basis for scoring is pretty binary in nature, I set up a simple KB that had domains that would either pass or fail a piece of data. I first created a data set with three data fields:
- Year – Values ranging from 1970 to 2025
- Value – Values ranging from 0 to 100
- Code – Values A,B,C,D,E
I then set up a KB to evaluate the fields as follows:
- Year – Valid from 1975 to 2020
- Value – Valid from 10 to 95
- Code – Valid values A,C,E
Note that I did not set up any Domain Values or do any training – I just set up the KB, Domains and Domain Rules. All I want to use DQS for is to identify records that are invalid for SSIS to use in scoring.
Using SSIS to Score SSDQS output
Next I hookup up my SSIS Data Quality Cleansing Component to push the source data through the Knowledge Base, and get the status of each of the columns after they pass through. As there are no preloaded valid values in the Domains, the status comes back as either “Invalid” (it failed the Domain rule) or “Unknown” (in this configuration, this translates to a correct value).
The Data Quality Cleansing Component doesn’t support scoring in itself. This has to be added using a Derived Column on an item by item basis. Using a simple IF / THEN / ELSE expression, I assign a score of 1 to each failed column based on the status of the record, as below:
Because of the Pipeline nature of SSIS, I then need to add a second Derived Column transform downstream to weight and add the scores together to create a final, record level score:
This results in a final Data Quality “Score” assigned to each record:
What you then do with these scores is up to you. In my example package, I used a Conditional Split to send records with a score over a certain threshold to a different destination:
Improving the Scoring process
The example I’ve created is quite simplistic – it has hard coded weightings and redirection thresholds, and can only react to two (of a possible three) record statuses. The process could be made more flexible using metadata driven weightings and thresholds (provided as package inputs).
Beyond that you have the option to handle the clean and dirty data more appropriately – by pushing dirty data into a cleanup process, halting ETL processes etc, etc.
The key takeaway here is that DQS enables you to create a scoring process that is independent of the actual Data Quality rules that pass or fail a piece of data. The DQS Knowledge Base is your flexible input of what qualifies as a good or bad record, instead of having to hard code using SQL or Derived Columns, which could get messy very quickly.
SQL Server Data Quality Services & SSIS – Performance
This is a snippet of a post on the performance of the DQS engine when called from SSIS. I’ve created a simple number based Domain rule and replicated it 5 times in my knowledge base. My package then feeds copies of the same set of data into the DQS component (5000 rows) and runs it through 1 – 5 domains.
The performance profile is as below:
There seems to be a fairly linear relationship between the number of domains being processed and execution time. Note that I’ve created a dummy value for “0″ to indicate what the start-up time of the DQS component might be, as it’s impossible to have a DQS Cleansing Component in the flow with no columns mapped.
I’d ignore the actual numbers – this is on a development VM which is definitely not configured for performance – and I’m aware the DQS Team are working on performance issues (though by the looks of it, better be working hard).
SQL Server Data Quality Services & SSIS
So far in my posts on SSDQS we’ve looked at the Data Quality Services Client and building SSDQS Knowledge Bases. Now in practice when handling bulk data a need to reference this in routine loads is needed, and to nobody’s surprise, SSIS is the tool for the job.
The DQS Cleansing Component
So, in our (shiny, new) SSIS Toolbox we have a new component to connect to DQS – the DQS Cleansing Component:
The DQS cleansing component pushes a data flow to the DQS Engine for validation. This requires a special Connection Manager, the DQS Cleansing Connection Manager, which as we can see below is a simple creature:
The sole option at this point is to choose which DQS Server to point at. So, lets look at what we get in the SSIS Component once we use the Connection Manager:
Once again – still nice and simple – choosing your Connection Manager allows you to then pick from a list of Published Knowledge Bases. Once a KB is selected, a list of the available Domains is populated, though there is nothing you can do with this list other than review it. So next we move to the Mapping tab:
The usual suspects are there – pick your input columns in the top half of the tab and they become available for mapping in the lower half. Each input column can be mapped to a single Domain (I can’t quite see how Composite Domains work in this context). You then get three output streams – the Output, Corrected Output and Status Output. The Output is just the column passed through, Corrected is the column value corrected by the DQS Engine and the Status is the record status (which comes out as Correct, Corrected or Unknown which corresponds to the DQS Data Quality Project statuses. In the Advanced Editor you can also switch on Confidence and and Reason Outputs, which relate to matching projects.
Note that there is only a single output for the DQS Cleansing Component – if you want to send OK, Error and Invalid records to different locations, you will need to do so with a downstream Conditional Split component.
Summary
So we’ve had a quick look at the basics of automating DQS activities using SSIS, and how SSIS plugs in to the DQS Server. Subsequent posts will start digging into some practical implementation including performance.
Some further reading can be found here:
- Using the SSIS DQS Cleansing Component – from the DQS Blog
SQL Server Data Quality Services – Composite Domains
One of the things I skimmed over in previous posts was the concept of Composite Domains. This is a combination of domains that are assessed with interdependencies on content.
At a very simple level, a Composite Domain addresses these kind of problems:
- If City = “London”, Country must equal “England”
- If Wealth Category = “Millionaire”, Bank Balance must be greater than 1 million
They allow us to validate separate data items in combination, thus allowing the writing of more complex rules beyond the already capable single field ones.
Implementing Composite Domains
Below is a couple of screenshots around setting up composite domains. First we have the definition of what fields need to be included:
This is fairly straightforward – just pick at least two fields that are interrelated from the available Domain list. The other screen of interest is under the Rules tab:
Here it can be seen that a composite domain rule has the capability to evaluate two components at a time. This is a deliberate limitation, so if you wanted to validate 3 fields in combination, you would have to do it via a set of rules that cross over. If you were looking at validating the rule:
If City = “London”, Country = “UK” & Region = “Europe”
You would have to do it through the following rules
If City = “London”, Country = “UK”
If City = “London”, Region = “Europe”
The rules allow for AND / OR at the field level, so you could have rules that read
If City = “London” OR “Birmingham”, Country = “UK”
If Bank Balance >= 2 million and < 1 billion, Wealth Category = “Multi Millionaire”
There seems to be scope for improvement here – the rule capability is a little simplistic but I imagine will meet most scenarios and does make cross field validation possible.
Other features and summary
There are two features I skipped over – Reference Data (for a bigger future post on the whole concept) and the Value Relations tab which, at this point in time seems not to be working and is just a statistical summary of the values found in the data.
There’s not much to close out on otherwise – Composite Domains allow fields to influence other fields from a data quality perspective. The documentation on this feature is sparse at this point so hopefully we’ll get more information soon.
SQL Server Data Quality Services – Domain Management
In the previous post we looked at creating a Knowledge Base through the Knowledge Discovery process. This gave us a first glimpse of what can be done in terms of managing incoming values and providing correction.In this post we will look at the more advanced capabilities of managing incoming data quality issues such as format rules, reference data, etc.
In a quick update to yesterdays post, the helpful team @ the MSDN DQS Forum have answered my query about the difference between Invalid and Error values:
An Invalid value is a value that does not belong to the domain and doesn’t have a correction. For example, the value 12345 in a City domain. An Error value is a value that belongs to the domain but is a syntax error. For example Shicago instead of Chicago in a City domain.
The functional difference: when the system sets statuses for values, it does so according to the above semantics. If a value failed a domain rule, its status will be Invalid. If the system detects a syntax error and an associated correction, the erroneous value status will be Error.
However DQS does do not enforce this semantics on manual operations. So you can enter a correction for an Invalid value without changing its status, and you can remove a correction to an Error value without changing the status as well.
So the upshot is Invalid = Unusable, Error = Correctable.
Domain Management: Domain Rules
If we open our previously created Knowledge Base in Domain management mode, we get a list of Domains, as below:
And, to the right hand side we see our options for setting the Data Quality rules that apply to the domain in question:
The full list of tabs is:
- Domain Properties
- Reference Data
- Domain Rules
- Domain Values
- Term-Based Relations
Domain Properties
Domain properties are the same as those available when creating a Domain – with the restriction that you cannot change the data type of the domain. The values available are:
- Domain Name
- Domain Description
- Data Type (fixed after creation)
- Use Leading values – a checkbox
- Format output to – which gives a dropdown of formats suitable to the data type
Nothing terribly exciting here, so moving on…
Domain Values
This was covered in my previous post, but as a lightning recap here is where you can:
- Flag data values as Correct, Invalid or Error
- Provide corrected values for Errors
- Add / Delete Values manually
So, lets move on to new features.
Domain Rules
Now we are getting into the syntax based validation that DQS can apply.
In the picture above I have generated a simple rule with a couple of elements – that Country must be at least 3 characters and does not contain any full stop, in an attempt to filter out any abbreviations. Now its worth pointing out at this stage that anything that fails a Domain Rule is considered Invalid – i.e. it is an unusable value that cannot be corrected. So by applying this rule I will render any abbreviations in my Domain Values list Invalid – and the DQS client is kind enough to warn me of this if I apply the rule:
The warning tells me I will increase my Invalid value count and consequently decrease my Valid value count.
It’s important to note that the rules will NOT apply to any cases where there is already a correction defined in the Domain Values. For example, the value “UK” which I corrected to “United Kingdom” is unaffected by this rule, but “HK” is rendered invalid as I have not corrected HK to “Hong Kong”.
So what rule types are available? Broadly it covers:
- Length
- Value
- Contains
- Begins with
- Ends with
- Pattern Matching
- Regular Expression matching
As I showed in my example you can combine rule types with AND / OR operators so they become quite flexible. You can also have multiple rules per domain if you want to simplify managing multiple non dependent conditions.
Handily you can also test a rule during development – just above the rules list is a couple of buttons fro test runs – one allows you to test the rule against exiting Domain Values, and the other allows you to test it against a sample data set.
Term Based Relations
This is not the most obviously named component, but is effectively a Find and Replace engine. So you could repair common typos – e.g. change “teh” the “the” or expand common abbreviations – e.g change “Inc.” to “Incorporated”.
This is only available for String Domain types, and doesn’t appear as an option for other types. MSDN Documentation here.
Reference Data
Data Quality Services also allows you to use external sources to validate data against – so far this only covers Azure Datamarket – but the team have hinted that other options will become available. As this feature seems to be half baked at this point I’m not going to dig much further.
Summary
This post has skimmed over some of the features of DQS once you have an active Knowledge Base and Domains set up. The headlines of what we can see is:
- Domain Values: Validation and correction of values against a known list
- Domain Rules: Validation against formatting, patterns and values using a set of rules
- Term Based Relations: Effectively “Find and Replace”
- Reference Data: Validation against external data sources
There’s some things I’ve skipped over for future posts, such as composite domains and reference data in depth. But so far, still looking good!
SQL Server Data Quality Services – Creating a Knowledge Base
So far we’ve opened up the client and taken a look at the areas for working with. In this post I’ll look at setting up a Knowledge Base from scratch using some sample data I’ve mocked up based on some Netflix catalogue data.
Note all of the images are of near full screen so I’ve shrunk them in the post, just click on the image to see the full screen.
Creating a Knowledge Base
To create a new KB just select “New Knowledge Base” from the client front screen, and give it a name and description. You can either clone an existing one on already the server or import, but in this case I’m just going to start from scratch. There’s three possible modes in which you can create a new KB:
- Domain Management – Creating from scratch with no guidance
- Knowledge Discovery – Using a sample data set to guide building your KB with a view to using it for data cleansing rules
- Matching Policy – Using a sample data set to guide building your KB with a view to using its record matching capabilities
As I’m not attempting to do any matching, I’m using the Knowledge Discovery approach. (Note: because I’m using an Excel source I need Excel installed on the machine).
Mapping Columns to Domains
Once I’ve picked my data source, I need to map columns:
At this stage, there is no facility to automatically create domains, so before I map the fields to a Domain, I need to create each of them:
Now a (known) defect appears to be fields from Excel are automatically treated as text – so if they contain numbers you cannot map them to a Numeric domain. So I’ll skip over evaluating Year for now.
Analysing and Managing the results
Once all the mapping is done, click next and you can upload the file for analysis. Now my sample file is 5,000 records, and it took DQS a few minutes just to upload this – so big sets may take a while. Anyway, the output of this is an analysis of the data, displayed in a profiler screen:
This breaks down the records by the following criteria:
- New records – in this case, all of them as it’s a first pass
- Unique - how many unique records
- Valid - how many valid records – again, as it’s a first pass, everything is valid
So, we skip on to the next stage – to manage the results of the findings. This is the first time we start to see what DQS can offer us in terms of cleansing:
Here we are reviewing the values found for Country, and can manage the values that come through, flagging them as Correct, Error or Invalid – and assign a corrected value to incorrect ones.
I’m trying to see if there is a functional difference between Error and Invalid. As per documentation:
The status of the value, as determined by the discovery process. You can change the type by clicking the down arrow and selecting a different type. A green check indicates that the value is correct or corrected; a red cross indicates that the value is in error; and an orange triangle with an exclamation point indicates that the value is not valid.
A value that is not valid does not conform to the data requirements for the domain. A value that is in error can be valid, but is not the correct value for data reasons.
Update: the difference has been clarified here in the DQS Forum.
Skipping over the semantics issues, what we see here is a list of the values that the DQS Client has found in the Knowledge Discovery analysis of the data. We can then flag these values to Invalid or Error as we see fit – or leave them at their default value of Correct. Once we have flagged them as not correct, it is then possible to enter the Correct value in the “Correct To” column. Handily, the client then groups your corrected values under the correct value in the list.
The final thing is to click Next and Publish the Knowledge Base (i.e. store the results back on the DQS Server)
Summary
So in this post we have quickly reviewed the creation of Knowledge Base through the Knowledge Discovery Mode. This has allowed us to create a set of values in our Knowledge Base using some sample data and then apply some corrections to those values, using a simple GUI to manage the results.
In the next post I will look at working in more depth with this created Knowledge Base using the “Domain Management” mode.
SQL Server Data Quality Services – First Look
So I have managed to get SSDQS up and running, and here’s what I get when I open the SSDQS Client:
For those without a microscope (you can click on the image to see the full screen), the client has 3 areas of activity:
- Knowledge Base Management
- Data Quality Projects
- Administration
I’ll dive into the 3 areas briefly below, then head into more depth on each one in subsequent posts
Knowledge Base Management
Knowledge Bases are at the heart of SSDQS functionality. Here you define the rules that SSDQS will apply when validating data, and the action taken when those rules are breached. The idea behind calling them “Knowledge Bases” is that they capture knowledge about the data in your organisation – effectively they are catalogues of rules. (I rather hope this name changes as Microsoft already heavily uses Knowledge Base as a term in its support activities, so it’s a bit of a confusing choice).
Anyway, nomenclature moans aside, the Knowledge Base is the most important thing to grasp here. It’s the container for what the SSDQS engine, either through the client or SSIS will use when validating data. The Knowledge Base contains Domains, which represents an area of Data Quality – so for example in an Address style Knowledge Base this could be a single field – such as City – or a compound selection of fields such as Full Address, made up of Address, City and Postal Code. Hence the term Domain – its important to note SSDQS is not a single field only validation engine, but can deal with collections of fields, and also create different Domains relating to the same fields, allowing the application of Data Quality with context. Again, sticking with the Address model, this means a Knowledge Base can subject Addresses from different countries to different rules by creating two Domains to address the same set of fields and calling the Domain as appropriate. There’s also what appears to be fuzzy matching capabilities as well.
Data Quality Projects
Projects are actual applications of Knowledge Bases across sets of Data. From what I can see so far, there are two modes of operation: Cleansing & Matching.
Cleansing is the application of the Knowledge Base against a set of data to test and refine the rules you have in place. Matching is applying the Knowledge Base against a set of data and applying what appears to be Fuzzy Matching algorithms over it to match it to existing records.
I’ll be able to flesh this out more once I have some data to play with (I’m going to have to create some as there don’t appear to be any samples to work with) – but the thing I like about this is that these projects are not an isolated process of an end result, but feed back into the Knowledge Base so they become part of an ongoing process – which is what dealing with Data Quality is.
Administration
Well, there’s always some Admin – it this case the console covers Activity Monitoring and Configuration. Activity monitoring covers usage and activities against Knowledge Bases. Configuration manages some matching default parameters and connections to 3rd party Reference databases. Thrilling stuff.
Summary & Links
So, so far SSDQS looks like a powerful and user-friendly solution to Data Quality problems. In subsequent posts I’ll dig into the capabilities and start picking it apart.
In the meantime, here are some useful resources to get you started:

























