SQL Server Data Quality Services in SQL2012 RC0 – Part 1

So the key news – in case you missed it – is that SQL2012 RC0 has been made available for download. After a few battles with the Installer – first the known issue with the Distributed Replay users – then some things requiring manual installs of KB’s to get the installer to run through – I have a VM set up with it.

The DQS team have posted about the improvements made in the DQS blog – and the one I really wanted to focus on was performance via SSIS as the CTP3 offering was not viable for large data sets. So this Part 1 post is all about the performance of DQS via SSIS in RC0.

So, I set up a Knowledge Base in the same way as I did for testing CTP3, with 5 duplicate domains – just evaluating an Integer with a single rule saying that integer had to be greater than a value to be valid. Then I ran two sets of values (5k & 10k rows) through the KB via SSIS, evaluating 1,2,3,4 and 5 fields.

So how does DQS Perform?

Here’s the results- the value in the grid is Seconds taken to process.

DQS Performance in SSIS

DQS Performance in SSIS

 

So – have we moved on from CTP3? A bit. But not much, and enough to be accounted for by a different VM setup (as a reminder CTP3 processing 5k rows took from 20 to 45 seconds for 1-5 columns). I accept a VM may be slower than a properly configured server, but even if it was twice as quick it would still not be a viable option for industrial use.

Looking at execution time changes by number of columns / rows processed, the time taken seems to be pretty linear as rows and columns increase, so it appears DQS performance can be evaluated pretty much as:

DQS Execution Time = Spin Up Time + (Columns * (Rows * Row Process Time))

Where:

Spin Up Time = time taken for DQS engine to start (Constant)

Columns = number of columns being evaluated (Variable)

Rows = number of rows being processed (Variable)

Row Process Time = time taken to process a single row (Constant)

On my VM, Spin Up Time seems to be 7 seconds, and Rows Process Time = 0.0014 seconds.

So, if we had to validate 10 columns on 1,000,000 rows of data (not too crazy) -

DQS Execution Time = Spin Up Time + (Columns * (Rows * Row Process Time))

DQS Execution Time = 7 + (10 * (1,000,000 * 0.0014))

DQS Execution Time = 7 + (10 * (1,000,000 * 0.0014))

DQS Execution Time = 14007 seconds = 233 minutes = 3.9 hours

Which effectively rules it out as a viable production process. Note of course that my formula doesn’t make any allowance for rule complexity.

Is DQS Production ready?

As per anything, the answer is – It depends. For validating small data sets it’s in the realms of slow, but probably acceptable. For big data sets, I’d have to say no – I couldn’t use it in a production environment to validate large sets of data. I’ve added a Connect suggestion to get this on the teams radar.

About BI Monkey

Comments

2 Responses to “SQL Server Data Quality Services in SQL2012 RC0 – Part 1”
  1. abdul khadir says:

    how to install the DQS component. where it is available and how to download all the details needed for me.

  2. BI Monkey says:

    Just download SQL2012 RC0 (first link in the post) and it’s all contained in there.

Speak Your Mind

Tell us what you're thinking...
and oh, if you want a pic to show with your comment, go get a gravatar!