Since my last post on SQL Server Data Quality Services (DQS) I’ve been in touch with the development team, who raised some suggestions and workarounds to improve performance. This post focuses on that feedback and how effective it is at reducing execution times.
SSIS vs DQS Client Cleansing
The first bit of feedback was that interactive cleansing through the DQS Client was known to be faster than cleansing through SSIS, so my first instinct was to test just how much faster it was. The result surprised me: the speedup was around fivefold. The chart below shows my results for processing 10,000 rows with 1-5 Domains:
If this scaled, my 3.9 hour estimate for a 1 million row / 10 Domain process would shrink to under an hour. Still not ideal, but getting closer to a production-viable speed.
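As a quick sanity check on that estimate, assuming the DQS Client's fivefold speedup held at scale:

```python
# 3.9 hours for 1m rows / 10 Domains via SSIS, divided by the
# observed fivefold DQS Client speedup.
minutes = 3.9 * 60 / 5
print(round(minutes, 1))  # 46.8 minutes, i.e. just under an hour
```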
Now, the reason behind this, as explained by the DQS team, is that the SSIS component sends discrete chunks (1,000 rows at a time as far as I can tell) to the DQS Server for validation, which adds round-trip overhead and is inefficient for the DQS Server. This is done so that the DQS Cleansing component is not a blocking component. However, as far as I can tell, there is currently no way to control the size of these chunks.
Speeding up SSIS processing
The next bit of feedback was to suggest breaking up the work to improve throughput. There are two ways of doing this: the first is to split up the domain processing, and the second is to break the data into chunks and process them in parallel. I tested this by splitting the work in the following ways:
- 5 Domains through a single DQS Cleansing Task – 10,000 rows
- Each domain through a dedicated DQS Cleansing Task – 10,000 rows
- 5 Domains through 5 dedicated DQS Cleansing Tasks – 2,000 rows each – 10,000 total
To be honest, the results weren’t overwhelming:
- Untuned: 94s
- Separated Domains: 86s
- Separated Data: 78s
Given that separating the Domains would, in a real scenario, require the data to be split up and recombined afterwards, there’s probably not enough saving there to make that approach worthwhile. Splitting up the data yielded a roughly 17% saving in processing time (94s down to 78s): nice, but not enough to be really useful given how long processing normally takes.
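For illustration, the split-and-recombine pattern can be sketched outside SSIS. Here `cleanse_chunk` is a hypothetical stand-in for the per-chunk DQS Cleansing call, not a real DQS API:

```python
from concurrent.futures import ThreadPoolExecutor

def cleanse_chunk(rows):
    # Hypothetical stand-in for a DQS Cleansing call on one chunk.
    return [row.strip().title() for row in rows]

def parallel_cleanse(rows, n_paths=5):
    # Split the data into roughly equal chunks, one per parallel path.
    size = -(-len(rows) // n_paths)  # ceiling division
    chunks = [rows[i:i + size] for i in range(0, len(rows), size)]
    with ThreadPoolExecutor(max_workers=n_paths) as pool:
        results = pool.map(cleanse_chunk, chunks)
    # Recombine the cleansed chunks in their original order.
    return [row for chunk in results for row in chunk]

print(parallel_cleanse(["  new york ", " LONDON ", "paris  ", " sydney"]))
# ['New York', 'London', 'Paris', 'Sydney']
```

In SSIS the equivalent would be a conditional split (or separate sources) feeding several DQS Cleansing components, then a Union All to recombine.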
Practical suggestions for the DQS Team
A direct quote from the DQS team’s mail to me was “DQS is designed to best perform on large chunks.” Looking at the SSIS logs, the component is only sending 1,000 rows at a time, which is clearly suboptimal for DQS and SSIS to interact effectively. Based on my understanding, there are two options available for a fix:
- Make the component configurable to send larger chunks, with a more SSIS-like default of 10,000 rows
- Make the component optionally blocking
The first just makes sense, and I doubt making “Rows To DQS Server” a configurable property would be a massive job. The second may be harder, though it could probably be duplicated simply by setting the new “Rows To DQS Server” property to zero or a very high number.
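A toy cost model (with made-up overhead and per-row figures, purely illustrative) shows why larger chunks should help: the fixed per-chunk round-trip cost gets amortised over more rows.

```python
import math

def estimated_seconds(rows, chunk_size, per_chunk_overhead=0.5, per_row=0.001):
    # Assumed fixed round-trip cost per chunk plus a per-row cleansing cost.
    n_chunks = math.ceil(rows / chunk_size)
    return n_chunks * per_chunk_overhead + rows * per_row

current = estimated_seconds(1_000_000, 1_000)    # 1,000 chunks of 1,000 rows
proposed = estimated_seconds(1_000_000, 10_000)  # 100 chunks of 10,000 rows
print(current, proposed)  # 1500.0 1050.0
```

Under these assumed figures, tenfold larger chunks cut the overhead term tenfold; the per-row cleansing cost is unchanged, which is why larger chunks help but can’t fix everything.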
In practice it’s still a bit slow for very heavy DW workloads, but hopefully the suggestions above would give it a real performance boost and make it viable for mid-sized ones.