Some of you may have heard of the SSIS Balanced Data Distributor data flow component. It’s a neat gizmo that acts (sort of) like a mix of a Multicast and a Conditional Split, sending different buffers to different destinations. The idea of the component is to increase parallelism in data flows and thus speed up processing. I suggest you read this post by Matt Masson on the SSIS Team Blog to get an idea of how it works.
I thought I’d give it a road test in conjunction with the DQS Cleansing component to see if it offered any performance benefits. This was using DQS CU1 to get the delivered performance benefits, and also changing the “DQS CHUNK SIZE” parameter to 10,000 (see here) so it didn’t run like a total dog from the outset.
I pumped 150,000 rows through 5 simple domains and here are the results:
Round 1: the DQS Baseline
Pushing all the rows through a single DQS Cleansing component took a whopping 21 minutes. Ouch.
Round 2: DQS with the BDD (Balanced Data Distributor)
Splitting the flow using the BDD to four identical targets (to match the number of cores I have available) took 10 minutes. Still not great, but it is a 50% performance improvement. My VM was pretty much maxed out by this so there may have been some hardware limitations kicking in there.
So, in conclusion the Balanced Data Distributor is a good tool to help with those DQS performance issues.
The handling of Composite Domains in Data Quality Services is not intuitive. You can map a column to a composite domain in the DQS Cleansing Component, as below:
However there is no under-the-hood cleverness to automap the Sides column into the Domain as well. This makes some sort of sense, as there is no reliable way for SSIS to guess which columns should be mapped into the other parts of the composite domain.
What would make sense is that you should be able to map multiple columns into a Composite Domain along with a column sequence so you can manage it through the UI. Something like this:
The effect of the mapping as in Fig 1 is that a naive user would expect the DQS Cleansing component to be clever, when actually it is doing exactly what it is told – passing the single column's value to the Composite Domain in DQS. This results in records being treated as if only the first part of the CD is being passed in – in this case Shape. The result is that records are by default Invalid (or, as in my KB, sometimes Corrected).
How to correctly manage Composite Domains
The correct way to pass data into a Composite Domain in SSIS for evaluation is to build it into a concatenated string using a Derived Column, and map that string into the Composite Domain – thanks to Matt Masson of the SSIS team for explaining this; see the comments at the bottom of this post.
My example string would be built as below, space delimited:
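A rough sketch of the sort of Derived Column expression involved (the original figure showed the real one) – assuming the source columns are Shape and Sides, with Sides being numeric; the (DT_WSTR, 10) cast and its length are purely illustrative:

    Shape + " " + (DT_WSTR, 10)Sides

The only requirement is that the parts of the composite domain end up concatenated into a single space-delimited string, in the domain's column order.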
Which would then be mapped into the Composite Domain in the DQS Cleansing Component as:
This will then start churning out the expected results.
We'll have to run through SQL Server setup three times to install the different instances required – two warm-up installs first, before we get to SharePoint and PowerPivot for SharePoint. First of all we need to install everything for the default instance.
1. Run setup, choose Installation and select New SQL Server stand-alone installation.
2. This will run through Setup Support Rules, after which you should see this:
3. Click on OK. Next you get the Product Key screen:
4. Enter your Product Key (or choose Evaluation if appropriate) and click Next. The License Terms dialog comes up:
5. Accept the license terms, and opt in to feature usage reporting if you want to share. Then click Next. Some chugging will occur and then the Setup Support Rules dialog will pop up with a couple of warnings:
These warnings are
It doesn’t like the fact you are installing SQL on a Domain Controller – nothing we can do about that
It doesn’t like the fact Windows Firewall is on – as we don’t plan remote access, we don’t care
6. Click Next. Here we choose what we are going to install.
7. Choose a SQL Server Feature Installation and click Next. This brings up the Feature Selection.
Click on Select All, leaving the feature directory paths untouched.
8. Click Next. This will lead to the Installation Rules dialog
Everything should have passed.
9. Click Next. This brings us to Instance Configuration:
We are going to install the Default instance, and leave all settings as is.
10. Click Next to bring up the Disk Space requirements dialog.
Everything should be OK.
11. Click Next. The Server Configuration dialog comes up:
This will advise (if you try and move on) that the SSAS Account is invalid.
12. Click the dropdown in Account Name to browse for a new user.
Type “admin” and click the “Check Names” button. This should validate the name and change it to “Administrator”. Then click OK. Enter the password in the dialog from Fig 5.11.
13. Click Next – we won’t be changing the default collation.
We will only be using Windows Authentication mode, so leave that as per the default. Click the “Add Current User” button to make the Administrator the SQL Administrator. Change the Data Directories if you want to customise them – though there isn’t much need for a demo box – and leave Filestream turned off.
14. Click Next to bring up the Analysis Services configuration dialog.
We are going to install Multidimensional SSAS (i.e. old school OLAP). As before, click Add Current user to make the Administrator the SSAS Administrator. Configure the Data Directories if you want.
15. Click Next to get to Reporting Services configuration.
Here we will make one change: under Reporting Services Native Mode, switch to Install Only.
16. Click Next for Error Reporting.
17. Click Next for the installation configuration rules. Expand by clicking on “Show Details”.
Everything should have passed.
18. Click Next to get to the Installation Confirmation screen
And we’re good to go!
19. Click Install and watch the Installation Progress Bar. At this point I’d advise doing something else – this can take an hour or two.
And, all going to plan, everything is done.
20. Click OK, Close the Setup dialog, then reboot the machine. If necessary install updates and restart again.
21. Now for our second install of SQL Server – this time to get the Tabular Mode Analysis Server in. Run Setup again and repeat all of the steps above (on the way you will go through the Product Updates check – just click Next through it) until you get to the Installation Type screen:
Here we are performing a new installation.
22. Click Next, Enter your product key again, accept licence terms, and then perform a SQL Server Feature installation (as in Fig 5.6).
23. Click Next to get to Feature Selection.
Check Analysis Services only.
24. Click Next. You’ll pass on to Instance Configuration:
Now we Install a Named Instance. I’ve opted to call it TABULAR. Call it anything you like except POWERPIVOT – that Instance will get created later for SharePoint integration. Then, click Next to get to Disk Space Requirements.
25. Click Next again to Server Configuration:
As in Figures 5.11 & 5.12 we need to set the Service Account to Administrator and enter the password.
26. Click Next to get to Analysis Services configuration:
This time we are installing the Server in Tabular Mode, so change the default Server Configuration. As before, make the Administrator the SSAS Administrator by clicking the “Add Current User” button.
27. Click Next. Run through all subsequent screens until you get to the Ready to Install dialog, then click Install, and go make another cup of tea. When Installation is complete, restart and get ready for the SharePoint install.
Configuring Data Quality Services (This section was added on 01 May)
28: Next we need to configure Data Quality Services by performing the post-installation tasks. As per the MSDN article, locate DQSInstaller.exe under Start > All Programs > Microsoft SQL Server 2012 > Data Quality Services > Data Quality Server Installer:
29: Click it to start execution. You will be presented with a command prompt window asking for a Database Master Key.
30. Enter a suitable password (note it down!) and hit Enter. Re-enter it to confirm and hit Enter again. The process will run for a while until you get the completion message in the command prompt:
31. Hit enter. The command window will close.
32. Launch SQL Server Management Studio and connect to the default instance of SQL Server. Locate your Administrator login under Security:
33. Double click to launch the Login Properties.
34. As indicated above, select the “User Mapping” tab.
35. Map the DQS_MAIN database
36: Check the dqs_administrator role.
37: Click OK, then close SQL Server Management Studio. Data Quality Services is now configured.
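If you prefer to script steps 32 to 36 rather than click through SSMS, the T-SQL below is a rough equivalent – a sketch only, assuming your login is [YOURDOMAIN\Administrator] (substitute your own) and the DQS databases were created with their default names:

    USE [DQS_MAIN];
    -- create a database user for the login if one doesn't already exist
    CREATE USER [YOURDOMAIN\Administrator] FOR LOGIN [YOURDOMAIN\Administrator];
    -- the same effect as ticking dqs_administrator on the User Mapping tab
    ALTER ROLE [dqs_administrator] ADD MEMBER [YOURDOMAIN\Administrator];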
Configuring Master Data Services (This section was added on 22 May)
38: Run the Master Data Services Configuration Manager, located at Start > All Programs > Microsoft SQL Server 2012 > Master Data Services:
This will launch the Master Data Services Configuration Manager:
If, as shown above, you get the IIS error about .svc handler mappings, follow steps 39-42 (as per this TechNet article):
39: Open up a command prompt and navigate to %windir%\Microsoft.NET\Framework64\v4.0.30319
40: Run the command aspnet_regiis -i to install ASP.Net
You should get the result above.
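For convenience, steps 39 and 40 boil down to these two lines at a command prompt (ideally one run as administrator):

    cd /d %windir%\Microsoft.NET\Framework64\v4.0.30319
    aspnet_regiis.exe -i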
41. Click Exit on the Master Data Services Configuration Manager
42. Relaunch the Master Data Services Configuration Manager:
There should now be no warning for IIS.
43. In the Master Data Services Configuration Manager, click on Database Configuration to get the Database Configuration screen:
44. Click on “Create Database” to launch the Create database Wizard
45: Click Next to select the Database Server
Accept the default settings.
46: Click Next to create the Database
Name your database something suitable, such as MDS_Demo. Leave the database collation as default.
47: Click Next to set the Administrator account
Accept the default (i.e. your Administrator account).
48: Click Next for the Summary screen
Review the summary and go back if you need to make changes.
49: Click Next to progress the Configuration
Everything should complete successfully.
50: Click Finish. The System Settings will now be editable. We won’t be making any changes.
51. Click on the Web Configuration option on the Master Data Services Configuration Manager.
52. In the drop down select “Default Web Site”
53. Click "Create Application…". This will launch the Create Web Application dialog.
Accept all the default settings, and enter the Administrator username and password.
54. Click OK. This will close the dialog.
55. In the “Associate Application with Database” section, click on the “Select…” button to choose a database.
This will launch the dialog to choose your MDS database.
56: Click Connect (otherwise the Master Data Services database dropdown doesn’t populate)
57: Choose your newly created MDS database in the dropdown.
58. Click OK.
59. Click Apply in the Master Data Services Configuration Manager – Web Configuration dialog.
The Configuration complete dialog will appear.
60: Uncheck the “Launch web application in browser” checkbox.
61: Click OK to close the popup.
62. Next, click “Enable integration with Data Quality Services” in the Master Data Services Configuration Manager – Web Configuration dialog.
This will probably fail with the error: “Error while trying to enable integration with Data Quality Services. SQL Server returned the following error: Windows NT User or group ‘[Localmachine]\MDS_ServiceAccounts’ not found. Check the name again.”
Configure the DQS Server to accept bigger chunks of data from SSIS (see the Connect item for details of how to do this)
Increase the SSIS properties DefaultBufferMaxRows and DefaultBufferSize so bigger chunks can be sent by SSIS to the DQS server
I revised my original tests slightly, changing from 5 domains to 4 domains so that parallel processing would be counted properly (i.e. when splitting things up it would process the 4 domains in parallel, giving a fair comparison with the single process – as opposed to doing 4+1, since 4 is the maximum number of parallel tasks my setup will allow).
The tweaks I made were setting the DQS Server chunk size to 100,000, DefaultBufferMaxRows to 100,000 (10 times the default) and DefaultBufferSize to 104,857,600 (also 10 times the default). This meant that in my sample data set of 10,000 rows everything was passed through in one big chunk.
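For reference, DefaultBufferMaxRows and DefaultBufferSize are properties of each Data Flow Task, so you can either change them in the designer or override them for a single run with dtexec. The sketch below assumes a package called CleanseCustomers.dtsx with a Data Flow Task named DFT Cleanse (both names made up for illustration); the exact property paths and quoting will depend on your package:

    rem override the data flow buffer settings for one execution
    dtexec /F "CleanseCustomers.dtsx" ^
      /Set "\Package\DFT Cleanse.Properties[DefaultBufferMaxRows];100000" ^
      /Set "\Package\DFT Cleanse.Properties[DefaultBufferSize];104857600"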
So, to recap my test approach, I did it 3 ways (to allow for the previous performance tweaks mooted by the DQS team):
All in one pass (4 domains in a single component)
Separate domains (process each domain in a separate path)
Split data (process the domains in 4 x 2,500-row chunks)
The results of testing
Well… it helps. But there’s still a big gap between what can be achieved in the DQS client and what you can get via SSIS. The results are below, comparing the Tuned approach to the Untuned approach, and also the DQS client directly:
I've deliberately left out the actual numbers because I'm running on a clunky VM, and since reading that Stephen Few book I've learned a few things about conveying messages via data visualisation.
So, here are the takeaways:
The new tuning approach makes a difference – it's about 3 times faster
The old tuning approaches still make a negligible difference
Even tuned, SSIS is 5 times slower than the client
I'll be feeding all this back to the DQS team… and we'll see what they come up with next. It would be nice if we could reach a point where performance is comparable with the client.
So, a little late – it’s been a busy couple of weeks – but the deck from my presentation “Introducing SQL Server Data Quality Services, plus what’s new in SQL2012 SSIS” is available here: Download
It was a great and well attended session – especially given it was Valentine’s day – so I owe some thanks to attendees’ forgiving spouses as well as mine! Remember to keep an eye on http://sqlserver.org.au/ for upcoming events.
If you’re in Sydney on Valentine’s day and don’t like your wife / partner all that much, you can come and spend time with something you really love – SQL Server! I’m presenting at the SQL Server User Group on Feb 14th – the session title is “Introducing SQL Server Data Quality Services, plus what’s new in SQL2012 SSIS”. Please register at: http://www.sqlserver.org.au/Events/ViewEvent.aspx?EventId=577
Since my last post on SSDQS I’ve been in touch with the development team who have raised some suggestions and workarounds to improve performance. This post will focus on that feedback and how effective it is in reducing execution times.
SSIS vs DQS Client Cleansing
The first bit of feedback was that interactive cleansing through the DQS Client was known to be faster than SSIS interaction, so my first instinct was to test just how much faster it was – and I was surprised: the speedup was around fivefold. The chart below shows my results for processing 10,000 rows with 1-5 domains:
If this scaled, then my 3.9 hour estimate for a 1m row / 10 Domain process would shrink to under an hour. Still not ideal, but getting closer to a production viable speed.
Now, the reason behind this – as explained by the DQS team – is that the component sends discrete chunks through to validate (1,000 rows at a time as far as I can tell), which the DQS Server then passes back. This adds overhead and is inefficient for the DQS Server, but is done so that the DQS Cleansing component is not a blocking component. However, at this point it's not possible – as far as I can tell – to have any control over the size of these chunks.
Speeding up SSIS processing
The next bit of feedback was to suggest breaking up the work to improve throughput. There are two ways of doing this – the first is to split up the domain processing and the second is to break the data into chunks and process them in parallel. So I tried splitting it up the following ways:
5 Domains through a single DQS Cleansing Task – 10,000 rows
Each domain through a dedicated DQS Cleansing Task – 10,000 rows
5 Domains through 5 dedicated DQS Cleansing Tasks – 2,000 rows each – 10,000 total
To be honest, the results weren’t overwhelming:
Separated Domains: 86s
Separated Data: 78s
Given that separating the domains means the data would, in a real situation, have to be split up and recombined, there's probably not enough saving there to make that approach worthwhile. Splitting up the data yielded a 20% processing time saving – nice, but not enough to be really useful given how long it takes normally.
Practical suggestions for the DQS Team
A direct quote from the DQS team's mail to me was "DQS is designed to best perform on large chunks." Looking at the SSIS logs, it's only sending 1,000 rows at a time – which is clearly sub-optimal for DQS and SSIS to interact effectively. So, based on my understanding, there are two options available for a fix:
Make the component configurable to send larger chunks – with a more SSIS like 10,000 rows default
Make the component optionally blocking
The first just makes sense, and I doubt it would be a massive job to make "Rows To DQS Server" a configurable property. The second may be harder – and could probably be duplicated just by setting the new "Rows To DQS Server" property to zero or a very high number.
In practice it's still a bit slow for very heavy DW workloads, but hopefully the above suggestions would give it a real boost in performance and make it viable for mid-sized ones.
The DQS team have posted about the improvements made on the DQS blog – and the one I really wanted to focus on was performance via SSIS, as the CTP3 offering was not viable for large data sets. So this Part 1 post is all about the performance of DQS via SSIS in RC0.
So, I set up a Knowledge Base in the same way as I did for testing CTP3, with 5 duplicate domains – each just evaluating an integer with a single rule saying that the integer had to be greater than a value to be valid. Then I ran two sets of values (5k and 10k rows) through the KB via SSIS, evaluating 1, 2, 3, 4 and 5 fields.
So how does DQS Perform?
Here are the results – the value in the grid is the number of seconds taken to process.
So – have we moved on from CTP3? A bit, but not much – and little enough that it could be accounted for by a different VM setup (as a reminder, CTP3 processing of 5k rows took from 20 to 45 seconds for 1-5 columns). I accept a VM may be slower than a properly configured server, but even if it was twice as quick it would still not be a viable option for industrial use.
Looking at execution time changes by number of columns / rows processed, the time taken seems to be pretty linear as rows and columns increase, so it appears DQS performance can be evaluated pretty much as:
DQS Execution Time = Spin Up Time + (Columns * (Rows * Row Process Time))
Spin Up Time = time taken for DQS engine to start (Constant)
Columns = number of columns being evaluated (Variable)
Rows = number of rows being processed (Variable)
Row Process Time = time taken to process a single row (Constant)
On my VM, Spin Up Time seems to be 7 seconds, and Row Process Time = 0.0014 seconds.
So, if we had to validate 10 columns on 1,000,000 rows of data (not too crazy) –
DQS Execution Time = Spin Up Time + (Columns * (Rows * Row Process Time)) = 7 + (10 * (1,000,000 * 0.0014)) = 14,007 seconds, or roughly 3.9 hours
Which effectively rules it out as a viable production process. Note of course that my formula doesn’t make any allowance for rule complexity.
Is DQS Production ready?
As per anything, the answer is: it depends. For validating small data sets it's in the realms of slow, but probably acceptable. For big data sets I'd have to say no – I couldn't use it in a production environment to validate large sets of data. I've added a Connect suggestion to get this on the team's radar.