New DQS features in SQL 2016 CTP2

In this quick post I’m going to share the new features introduced in Data Quality Services as part of SQL Server 2016 CTP2

Well, that didn’t take long. As far as I can tell, nothing has changed. So still hanging out for:

  • A decent UI
  • Ability to reference data services other than Azure Marketplace (e.g. MDS!)

Noises from MSFT are not exactly encouraging, so I’m not expecting much in future CTP’s at this point.

Read More

SSIS, DQS Cleansing & the Balanced Data Distributor

Some of you may have heard of the SSIS Balanced Data Distributor data flow component. It’s a neat gizmo that acts (sort of) like a mix of a Multicast and a Conditional Split, sending different buffers to different destinations. The idea of the component is to increase parallelism in data flows and thus speed up processing. I suggest you read this post by Matt Masson on the SSIS Team Blog to get an idea of how it works.

I thought I’d give it a road test in conjunction with the DQS Cleansing component to see if it offered any performance benefits. This was using DQS CU1 to get the delivered performance benefits, and also changing the “DQS CHUNK SIZE” parameter to 10,000 (see here) so it didn’t run like a total dog from the outset.

I pumped 150,000 rows through 5 simple domains and here’s the results:

Round 1: the DQS Baseline

Fig 1: DQS Cleansing Baseline
Fig 1: DQS Cleansing Baseline

 

 

 

 

 

 

 

 

 

 

 

 

Pushing all the rows through a single DQS Cleansing component took a whopping 21 minutes. Ouch.

Round 2: DQS with the BDD (Balanced Data Distributor)

Fig 2: DQS with BDD
Fig 2: DQS with BDD

 

 

 

 

 

 

 

Splitting the flow using the BDD to four identical targets (to match the number of cores I have available) took 10 minutes. Still not great, but it is a 50% performance improvement. My VM was pretty much maxed out by this so there may have been some hardware limitations kicking in there.

So, in conclusion the Balanced Data Distributor is a good tool to help with those DQS performance issues.

Read More

I’ll be speaking at Tech-Ed Australia 2012

I'll be speaking at Tech Ed Australia 2012
I’ll be speaking at Tech Ed Australia 2012

 

 

 

 

 

 

What it says on the tin: I’ll be presenting a session entitled “Introducing Data Quality Services and its role in an Enterprise Information Management (EIM) Process” at Tech Ed Australia 2012.

Look forward to seeing you there!

Read More

Data Quality Services Composite Domains and SSIS

The handling of Composite Domains in Data Quality Services is not intuitive. You can map a column to a composite domain in the DQS Cleansing Component, as below:

Fig 1: Mapping Composite Domains in SSIS DQS Cleansing Component
Fig 1: Mapping Composite Domains in SSIS DQS Cleansing Component

 

 

 

 

 

However there is no under the hood cleverness to automap the Sides column into the Domain as well. This makes some sort of sense as there is no reliable way for SSIS to guess what columns should be able to be mapped into the other parts of the composite domain.

What would make sense is that you should be able to map multiple columns into a Composite Domain along with a column sequence so you can manage it through the UI. Something like this:

Fig 2: Possible Composite Domain Mapping UI
Fig 2: Possible Composite Domain Mapping UI

 

 

 

 

 

The effect of the mapping as in Fig 1 is that a naive user would expect the DQS Cleansing component is being clever, when actually it is doing what it is told – passing the single column’s value to the Composite Domain in DQS. This results in records being treated as if only the first part of the CD is being passed in – in this case Shape. The result being that records are by default Invalid (or, as in my KB, sometimes Corrected).

How to correctly manage Composite Domains

The correct way to pass in data to a Composite Domain in SSIS for evaluation is to build it into a concatenated string using a Derived Column, and map that string into the Composite Domain – thanks to Matt Masson of the SSIS team for explaining this – see the comments at the bottom of this post.

My example string would be built as below, space delimited:

Fig 3: Creating the Concatenated String for the Composite Domain in a Derived Column
Fig 3: Creating the Concatenated String for the Composite Domain in a Derived Column

 

 

 

 

Which would then be mapped into the Composite Domain in the DQS Cleansing Component as:

Fig 4: Mapping the Concatenated String into the Composite Domain
Fig 4: Mapping the Concatenated String into the Composite Domain

 

 

 

 

 

This will then start churning out the expected results.

The database, DQS Knowledge Base & SSIS Package can be found here.

I’ve also put up a Connect to cover this issue, please vote it up if you think it important.

Read More

Build your own SQL 2012 Demo Machine – Part 4 – Installing SQL Server twice

Step 5: Install SQL Server 2012 twice

We’ll have to run through this three times to install the different instances required, but two warm up ones are required before we get to SharePoint and PowerPivot for SharePoint. First of all we need to install everything for the default instance.

1. Run setup, choose installation and select New SQL Server stand alone installation.

Fig 5.1: New SQL Server Installation
Fig 5.1: New SQL Server Installation

2. This will run through Setup Support Rules, after which you should see this:

Fig 5.2: Setup Support Rules
Fig 5.2: Setup Support Rules

3. Click on OK. Next you get the Product Key screen:

Fig 5.3: Product Key Screen
Fig 5.3: Product Key Screen

4. Enter your Product Key (or choose Evaluation if appropriate) and click Next. The License Terms dialog comes up:

Fig 5.4: License Terms
Fig 5.4: License Terms

5. Accept the license terms, and Feature usage if you want to share. Then click Next. Some chugging will occur and you will then get the Setup Support Rules dialog pop up with a couple of warnings:

Fig 5.5: Setup Support Rules Warnings
Fig 5.5: Setup Support Rules Warnings

These warnings are

  • It doesn’t like the fact you are installing SQL on a Domain Controller – nothing we can do about that
  • It doesn’t like the fact Windows Firewall is on – as we don’t plan remote access, we don’t care

6. Click Next. Here we choose what we are going to install.

Fig 5.6: Setup Role
Fig 5.6: Setup Role

7. Choose a SQL Server Feature Installation and click Next. This brings up the Feature Selection.

Fig 5.7: Feature Selection
Fig 5.7: Feature Selection

Click on select all, leaving the feature directory paths untouched.

8. Click Next. This will lead to the Installation Rules dialog

Fig 5.8: Installation Rules
Fig 5.8: Installation Rules

Everything should have passed.

9. Click Next. This brings us to Instance Configuration:

Fig 5.9: Instance Configuration
Fig 5.9: Instance Configuration

We are going to install the Default instance, and leave all settings as is.

10. Click Next to bring up the Disk Space requirements dialog.

Fig 5.10: Disk Space Requirements
Fig 5.10: Disk Space Requirements

Everything should be OK.

11. Click Next. The Server Configuration dialog comes up:

Fig 5.11: Server Configuration
Fig 5.11: Server Configuration

This will advise (if you try and move on) that the SSAS Account is invalid.

12. Click the dropdown in Account Name to browse for a new user.

Fig 5.12: Select User
Fig 5.12: Select User

Type “admin” and click the “Check Names” button. This should validate the name and change it to “Administrator”. Then click OK. Enter the password in the dialog from Fig 5.11.

13. Click Next – we won’t be changing the default collation.

Fig 5.13: Database Engine Configuration
Fig 5.13: Database Engine Configuration

We will only be using Windows Authentication mode, so leave that as per the default. Click the “Add Current User” button to make the Administrator the SQL Administrator. Change the Data Directories if you want to customise them – though there isn’t much need for a demo box – and leave Filestream turned off.

14. Click Next to bring up the Analysis Services configuration dialog.

Fig 5.14: Analysis Services Configuration
Fig 5.14: Analysis Services Configuration

We are going to install Multidimensional SSAS (i.e. old school OLAP). As before, click Add Current user to make the Administrator the SSAS Administrator. Configure the Data Directories if you want.

15. Click Next to get to Reporting Services configuration.

Fig 5.15: Reporting Services Configuration
Fig 5.15: Reporting Services Configuration

Here we will make a change to Reporting Services Native Mode, and switch it to Install Only.

16. Click Next for Error Reporting.

Fig 5.16: Error Reporting
Fig 5.16: Error Reporting

17. Click Next for the installation configuration rules. Expand by clicking on “Show Details”.

Fig 5.17: Instance Configuration Rules
Fig 5.17: Instance Configuration Rules

Everything should have passed.

18. Click Next to get to the Installation Confirmation screen

Fig 5.18: Ready to Install
Fig 5.18: Ready to Install

And we’re good to go!

19. Click Install and watch the Installation Progress Bar. At this point I’d advise doing something else – this can take an hour or two.

Fig 5.19: Installation Complete
Fig 5.19: Installation Complete

And, all going to plan, everything is done.

20. Click OK, Close the Setup dialog, then reboot the machine. If necessary install updates and restart again.

21. Now, for our 2nd Install of SQL Server – this time to get in the Tabular Mode Analysis Server. Run Setup again, and repeat all of the steps above (on the way you will go through the Product Updates check, just go through this and click Next) until you get to the Installation Type Screen:

Fig 5.20: Installation Type
Fig 5.20: Installation Type

Here we are performing a new installation.

22. Click Next, Enter your product key again, accept licence terms, and then perform a SQL Server Feature installation (as in Fig 5.6).

23. Click Next to get to Feature Selection.

Fig 5.21: Feature Selection
Fig 5.21: Feature Selection

Check Analysis Services only.

24. Click Next. You’ll pass on to Instance Configuration:

Fig 5.22: Instance Configuration
Fig 5.22: Instance Configuration

Now we Install a Named Instance. I’ve opted to call it TABULAR. Call it anything you like except POWERPIVOT – that Instance will get created later for SharePoint integration. Then, click Next to get to Disk Space Requirements.

25. Click Next again to Server Configuration:

Fig 5.23: Server Configuration
Fig 5.23: Server Configuration

As in Figures 5.11 & 5.12 we need to set the Service Account to Administrator and enter the password.

26. Click Next to get to Analysis Services configuration:

Fig 5.24: Analysis Services Configuration
Fig 5.24: Analysis Services Configuration

This time we are installing the Server in Tabular Mode, so change the default Server Configuration.  As before, make the Administrator the SSAS Administrator by clicking the “Add Current User” button.

27. Click Next. Run through all subsequent screens until you get to the Ready to Install dialog, then click Install, and go make another cup of tea. When Installation is complete, restart and get ready for the SharePoint install.

Configuring Data Quality Services (This section was added on 01 May)

28: Next we need to configure Data Quality Services by doing the Post Installation tasks. As per the MSDN Article, locate DQSInstaller.exe under Start > All Programs > Microsoft SQL Server 2012 > Data Quality Services > Data Quality Server Installer:

Fig 5.25: Data Quality Server Installer
Fig 5.25: Data Quality Server Installer

29: Click it to start execution. You will be presented with a command Prompt window asking for a Database Master Key.

Fig 5.26: Data Quality Server Installer - Database Master Key
Fig 5.26: Data Quality Server Installer - Database Master Key

30. Enter a suitable password (note it down!) and hit enter. Re-enter it to confirm and hit enter again. The process will run for a while. Until you get the completion message in the command prompt:

Fig 5.27: Data Quality Server Installer - Completion
Fig 5.27: Data Quality Server Installer - Completion

31. Hit enter. The command window will close.

32. Launch SQL Server Management Studio and connect to the default Instance of SQL Server. Locate your Administrator Role under Security:

Fig 5.28: Server Security for Default Instance
Fig 5.28: Server Security for Default Instance

33. Double click to launch  the Login Properties.

Fig 5.29: Login Properties
Fig 5.29: Login Properties

34. As indicated above, select the “User Mapping” tab.

35. Map the DQS_MAIN database

36: Check the dqs_administrator role.

37: Click OK, then close SQL Server Management Studio. Data Quality Services is now configured.

Configuring Master Data Services (This section was added on 22 May)

This is as per this MSDN Arcticle

38: Run the Master Data Services Configuration Manager, located at Start > All Programs > Microsoft SQL Server 2012 > Master Data Services:

Fig 5.30: Master Data Services Configuration Manager
Fig 5.30: Master Data Services Configuration Manager

This will launch the Master Data Services Configuration Manager:

Fig 5.31: Master Data Services Configuration Manager
Fig 5.31: Master Data Services Configuration Manager

If, as shown above, you get the IIS error shown here about .svc handler mappings, follow steps 39-42 (as per this TechNet article):

39: Open up a command prompt and navigate to %windir%\Microsoft.NET\Framework64\v4.0.30319

40: Run the command aspnet_regiis -i to install ASP.Net

Fig 5.32: Install ASP.NET
Fig 5.32: Install ASP.NET

You should get the result above.

41. Click Exit on the Master Data Services Configuration Manager

42. Relaunch the Master Data Services Configuration Manager:

Fig 5.33: Master Data Services Configuration Manager
Fig 5.33: Master Data Services Configuration Manager

There should now be no warning for IIS.

43. In the Master Data Services Configuration Manager, click on Database Configuration to get the Database Configuration screen:

Fig 5.34: Master Data Services Configuration Manager - Database Configuration
Fig 5.34: Master Data Services Configuration Manager - Database Configuration

44. Click on “Create Database” to launch the Create database Wizard

Fig 5.35: Master Data Services Configuration Manager - Create Database Wizard
Fig 5.35: Master Data Services Configuration Manager - Create Database Wizard

45: Click  Next to select the Database Server

Fig 5.36: Master Data Services Configuration Manager - Create Database Wizard - Select Server
Fig 5.36: Master Data Services Configuration Manager - Create Database Wizard - Select Server

Accept the default settings.

46: Click Next to create the Database

Fig 5.37: Master Data Services Configuration Manager - Create Database Wizard - Create Database
Fig 5.37: Master Data Services Configuration Manager - Create Database Wizard - Create Database

Name your data base something suitable, such as MDS_Demo. Leave the Database collation as default.

47: Click Next to set the Administrator account

Fig 5.38: Master Data Services Configuration Manager - Create Database Wizard - Set Administrator Account
Fig 5.38: Master Data Services Configuration Manager - Create Database Wizard - Set Administrator Account

Accept the default (i.e. your Administrator account).

48: Click Next for the Summary screen

Fig 5.39: Master Data Services Configuration Manager - Create Database Wizard - Summary
Fig 5.39: Master Data Services Configuration Manager - Create Database Wizard - Summary

Review the summary and go back if you need to make changes.

49: Click Next to progress the Configuration

Fig 5.40: Master Data Services Configuration Manager - Create Database Wizard - Progress
Fig 5.40: Master Data Services Configuration Manager - Create Database Wizard - Progress

Everything should complete successfully.

50: Click Finish. The System Settings will now be editable. We won’t be making any changes.

Fig 5.41: Master Data Services Configuration Manager - Database Configuration
Fig 5.41: Master Data Services Configuration Manager - Database Configuration

51. Click on the Web Configuration option on the Master Data Services Configuration Manager.

Fig 5.42: Master Data Services Configuration Manager - Web Configuration
Fig 5.42: Master Data Services Configuration Manager - Web Configuration

52. In the drop down select “Default Web Site”

53. Click “Create Application…”. This will launch the Create Web Application dialig.

Fig 5.43: Master Data Services Configuration Manager - Create Web Application
Fig 5.43: Master Data Services Configuration Manager - Create Web Application

Accept all the default settings, and enter the Administrator username and passwords.

54. Click OK. This will close the dialog.

55. In the “Associate Application with Database” section, click on the “Select…” button to choose a database.

Fig 5.44: Master Data Services Configuration Manager - Associate Application with Database
Fig 5.44: Master Data Services Configuration Manager - Associate Application with Database

This will launch the dialog to choose your MDS database.

Fig 5.45: Master Data Services Configuration Manager - Connect Application to Database
Fig 5.45: Master Data Services Configuration Manager - Connect Application to Database

56: Click Connect (otherwise the Master Data Services database dropdown doesn’t populate)

57: Choose your newly created MDS database in the dropdown.

58. Click OK.

59. Click Apply in the Master Data Services Configuration Manager – Web Configuration dialog.

The Configuration complete dialog will appear.

Fig 5.46: Master Data Services Configuration Manager - Configuration Complete
Fig 5.46: Master Data Services Configuration Manager - Configuration Complete

60: Uncheck the “Launch web application in browser” checkbox.

61: Click OK to close the popup.

62. Next, click “Enable integration with Data Quality Services” in the Master Data Services Configuration Manager – Web Configuration dialog.

Fig 5.47: Master Data Services Configuration Manager - Enable Integration with Data Quality Services
Fig 5.47: Master Data Services Configuration Manager - Enable Integration with Data Quality Services

This will probably fail with the error: “Error while trying to enable integration with Data Quality Services. SQL Server returned the following error: Windows NT User or group ‘[Localmachine]\MDS_ServiceAccounts’ not found. Check the name again.”

Fig 5.48: Master Data Services Configuration Manager - Enable Integration with Data Quality Services Error
Fig 5.48: Master Data Services Configuration Manager - Enable Integration with Data Quality Services Error

This is a known problem with installing MDS on a Domain Controller, and the solution is as described here.

63. Click OK to close the error popup.

64. Launch SQL Server Management studio

65. Connect to the SQL Instance hosting DQS (should be your default instance).

66. Navigate to Security > Logins, right click and choose “New Login…”

Fig 5.49: Create a new Login for MDS_ServiceAccounts
Fig 5.49: Create a new Login for MDS_ServiceAccounts

67. In the dialog that launches, create a new login for [YourDomain]\MDS_ServiceAccounts. Leave everything else as default.

Fig 5.50: Create a new Login for MDS_ServiceAccounts
Fig 5.50: Create a new Login for MDS_ServiceAccounts

68: Next,navigate to Databases > DQS_MAIN > Security > Users, right click and select “New User…”

Fig 5.51: Create a new User for MDS_ServiceAccounts
Fig 5.51: Create a new User for MDS_ServiceAccounts

69: Create a new user for MDS_ServiceAccounts

Fig 5.52: Create a new User for MDS_ServiceAccounts
Fig 5.52: Create a new User for MDS_ServiceAccounts

70. Click OK.

71. In Master Data Services Configuration Manager – Web Configuration, click on Enable Integration with Data Quality Services as in Fig 5.47 again.

Fig 5.53: Master Data Services Configuration Manager - Enable Integration with Data Quality Services
Fig 5.53: Master Data Services Configuration Manager - Enable Integration with Data Quality Services

Master Data Services is now successfully configured!

72. Move on to Build your own SQL 2012 Demo Machine – Part 5 – SharePoint & PowerPivot

Read More

Data Quality Services Performance Best Practices Guide released

Via @mattmasson – the DQS team have released a Data Quality Services Performance Best Practices Guide. It’s pretty thorough, and covers most aspects of performance that I’ve been talking about so far, plus heaps about matching and underlying architecture. A really good resource.

However there’s one table which I think I’ll let speak for itself, which is expected performance:

DQS Performance
DQS Performance

You won’t be using this in a big DW anytime soon…

Read More

DQS / SSIS Performance in SQL2012 RC0 – following some tweaks

As per the connect I raised around DQS / SSIS performance, the DQS / SSIS performance was pretty sluggish. The DQS team have come back with some more possible tweaks, and this is my test of those.

The key suggestions were:

  • Configure the DQS Server to accept bigger chunks of data from SSIS (see the connect for details of how to)
  • Increase the SSIS properties DefaultBufferMaxRows and DefaultBufferSize so bigger chunks can be sent by SSIS to the DQS server

I revised my original tests slightly, to change from 5 domains to 4 domains so that parallel processing would be counted properly (i.e. when splitting things up it would process the 4 domains in parallel to compare fairly to the single process – as opposed to doing 4+1, as the maximum parallel tasks my processing will allow is 4).

The tweaks I made was setting the DQS Server Chunk size to 100,000, DefaultBufferMaxRows to 104857600 (10 times the default) and DefaultBufferSize to 100,000 (also 10 times the default). This meant in my sample data set of 10,000 rows everything was passed through in one big chunk.

So to remind of my test approach, I did it 3 ways (to allow for previous performance tweaks mooted by the DQS team):

  • All in one pass (4 domains in a single component)
  • Separate domains (process each domain in a separate path)
  • Split data (process the domains in 4 x 2,500k chunks)

The results of testing

Well… it helps. But there’s still a big gap between what can be achieved in the DQS client and what you can get via SSIS. The results are below, comparing the Tuned approach to the Untuned approach, and also the DQS client directly:

DQS Performance with SSIS
DQS Performance with SSIS

I’ve deliberately skipped out the actual numbers because I’m running on a clunky VM, and since I’ve read that Stephen Few book, I’ve learned a few things about conveying messages via data visualisation.

So, here’s the takeaways:

  • The new tuning approaches makes a difference – it’s about 3 times faster
  • The old tuning approaches still make negligible difference
  • Even tuned, SSIS is 5 times slower than the client

I’ll be feeding this all into the DQS team… and see what they come up with next. It would be nice if we can reach a point where performance is comparable with the client.

Read More

SQL Server user group presentation deck – DQS & SSIS in SQL2012

So, a little late – it’s been a busy couple of weeks – but the deck from my presentation “Introducing SQL Server Data Quality Services, plus what’s new in SQL2012 SSIS” is available here: Download

It was a great and well attended session – especially given it was Valentine’s day – so I owe some thanks to attendees’ forgiving spouses as well as mine! Remember to keep an eye on http://sqlserver.org.au/ for upcoming events.

Read More

I’m presenting at SQL Server User Group on Feb 14th

If you’re in Sydney on Valentine’s day and don’t like your wife / partner all that much, you can come and spend time with something you really love – SQL Server! I’m presenting at the SQL Server User Group on Feb 14th – the session title is “Introducing SQL Server Data Quality Services, plus what’s new in SQL2012 SSIS”. Please register at: http://www.sqlserver.org.au/Events/ViewEvent.aspx?EventId=577

 

Read More

SQL Server Data Quality Services in SQL2012 RC0 – Part 2

Since my last post on SSDQS I’ve been in touch with the development team who have raised some suggestions and workarounds to improve performance. This post will focus on that feedback and how effective it is in reducing execution times.

SSIS vs DQS Client Cleansing

The first bit of feedback was that interactive cleansing through the DQS Client was known to be faster than SSIS interaction – so my first instinct was to test just how much faster it was – and I was surprised – the speedup was around fivefold. The below chart shows my results for processing 10,000 rows with 1-5 Domains:

SSDQS SSIS vs Client performance
SSDQS SSIS vs Client performance

If this scaled, then my 3.9 hour estimate for a 1m row / 10 Domain process would shrink to under an hour. Still not ideal, but getting closer to a production viable speed.

Now, the reason behind this – as explained by the DQS team – is that the component sends discrete chunks through to validate (1000 rows at a time as far as I can tell) which the DQS Server then passes back – which adds overhead and is inefficient for the DQS Server. This is done so that the DQS Cleansing component is not a blocking component. However at this point it’s not possible as far as I can tell – to have any control over the size of these chunks.

Speeding up SSIS processing

The next bit of feedback was to suggest breaking up the work to improve throughput. There’s two ways of doing this – first is to split up the domain processing and second is to break up the data into  chunks and process in parallel. So I tried this by splitting it up the following ways:

  1. 5 Domains through a single DQS Cleansing Task – 10,000 rows
  2. Each domain though a dedicated DQS Cleansing Task – 10,000 rows
  3. 5 Domains through 5 dedicated DQS Cleansing Tasks – 2,000 rows each – 10,000 total

To be honest, the results weren’t overwhelming:

  1. Untuned: 94s
  2. Separated Domains: 86s
  3. Separated Data: 78s

Given that the Separating of Domains means the data would in a real situation have to be split up and recombined, there’s probably not enough saving there to make that approach worthwhile. Splitting up the data yielded a 20% processing time saving – nice, not enough to be really useful given how long it takes normally.

Practical suggestions for the DQS Team

A direct quote from the DQS team’s mail to me was “DQS is designed to best perform on large chunks.”. Looking at the SSIS logs, it’s only sending 1,000 rows at a time – which is clearly sub optimal for DQS + SSIS to interact effectively. So there are two options available for a fix based on my understanding:

  1. Make the component configurable to send larger chunks – with a more SSIS like 10,000 rows default
  2. Make the component optionally blocking

The first just makes sense and I doubt would be a massive job to make “Rows To DQS Server” a configurable property. The second may be harder – and can probably be duplicated just by setting the new “Rows to DQS Server”  property to zero or a very high number.

In practice it’s still a bit slow for very heavy DW workloads, but hopefully the above suggestion would give it a real boost in performance and make it viable for mid sized ones.

 

 

 

Read More