SQL Server Data Quality Services in SQL2012 RC0 – Part 2

Since my last post on SSDQS I’ve been in touch with the development team who have raised some suggestions and workarounds to improve performance. This post will focus on that feedback and how effective it is in reducing execution times.

SSIS vs DQS Client Cleansing

The first bit of feedback was that interactive cleansing through the DQS Client was known to be faster than SSIS interaction – so my first instinct was to test just how much faster it was – and I was surprised – the speedup was around fivefold. The below chart shows my results for processing 10,000 rows with 1-5 Domains:

SSDQS SSIS vs Client performance
SSDQS SSIS vs Client performance

If this scaled, then my 3.9 hour estimate for a 1m row / 10 Domain process would shrink to under an hour. Still not ideal, but getting closer to a production viable speed.

Now, the reason behind this – as explained by the DQS team – is that the component sends discrete chunks through to validate (1000 rows at a time as far as I can tell) which the DQS Server then passes back – which adds overhead and is inefficient for the DQS Server. This is done so that the DQS Cleansing component is not a blocking component. However at this point it’s not possible as far as I can tell – to have any control over the size of these chunks.

Speeding up SSIS processing

The next bit of feedback was to suggest breaking up the work to improve throughput. There’s two ways of doing this – first is to split up the domain processing and second is to break up the data into  chunks and process in parallel. So I tried this by splitting it up the following ways:

  1. 5 Domains through a single DQS Cleansing Task – 10,000 rows
  2. Each domain though a dedicated DQS Cleansing Task – 10,000 rows
  3. 5 Domains through 5 dedicated DQS Cleansing Tasks – 2,000 rows each – 10,000 total

To be honest, the results weren’t overwhelming:

  1. Untuned: 94s
  2. Separated Domains: 86s
  3. Separated Data: 78s

Given that the Separating of Domains means the data would in a real situation have to be split up and recombined, there’s probably not enough saving there to make that approach worthwhile. Splitting up the data yielded a 20% processing time saving – nice, not enough to be really useful given how long it takes normally.

Practical suggestions for the DQS Team

A direct quote from the DQS team’s mail to me was “DQS is designed to best perform on large chunks.”. Looking at the SSIS logs, it’s only sending 1,000 rows at a time – which is clearly sub optimal for DQS + SSIS to interact effectively. So there are two options available for a fix based on my understanding:

  1. Make the component configurable to send larger chunks – with a more SSIS like 10,000 rows default
  2. Make the component optionally blocking

The first just makes sense and I doubt would be a massive job to make “Rows To DQS Server” a configurable property. The second may be harder – and can probably be duplicated just by setting the new “Rows to DQS Server”  property to zero or a very high number.

In practice it’s still a bit slow for very heavy DW workloads, but hopefully the above suggestion would give it a real boost in performance and make it viable for mid sized ones.




3 thoughts on “SQL Server Data Quality Services in SQL2012 RC0 – Part 2

  1. I sent the details from the above to the project team, who replied saying they will take my suggestions into account. So the answer is “possibly”. I think it’s critical to adoption that DQS / SSIS interact effectively, but they may have a different view (or priorities for meeting launch date).

  2. Hi,
    Yes, we are working on improving the performance of the DQS SSIS component, and also other areas of the DQS product.
    We will provide additional information soon.

    Thanks for your comments and support to making DQS an even greater product :)

    Elad Ziklik
    Group Program Manager
    The DQS Team

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <s> <strike> <strong>