Migrating Package Connections to Project Connections in SSIS

One of the best changes in SSIS 2012 was the introduction of the Project Connection – a connection manager that can be used across the whole project instead of being limited to package scope, which previously meant recreating and configuring effectively the same connection in every single package. This feature is great… when you are starting a new project.

However, a recent task I was handed was to migrate a 2008 project to the 2012 project deployment model. All very sensible – to ease maintenance, eliminate XML configurations and generally bolster security. Then I got to work…

Converting a Package Connection to a Project Connection

Ah, the easy part. Pick your connection, right click, choose Convert to Project Connection and… ta-da! You have a project connection!

SSIS Project Connection

Now… what about all the other packages?

Pointing every other Package connection to the Project connection

This is a little harder. The good bit is that your project connection will appear in your list of available connection managers. The bad bit is that there is no way to tell the SSIS designer to use it instead of your old one. You can either manually repoint every Data Flow, Execute SQL, Script and whatever other task happens to be using the package connection to the project connection – easy if your package is small and simple – or get X(ML)treme! Fortunately, thanks to this post by Jeff Garretson, I was reminded that SSIS packages are just XML, and XML can be edited much faster than a package in the designer. Jeff’s post only covered how to fix up the Data Flow – I had a pile of Control Flow tasks to fix up too – so here’s how to get it done without hours of coding.

Step 1: In design view, get the name and GUID of all connections to be replaced, and what to replace them with

You can get this from the properties window when you have a connection manager selected in the SSIS designer:

Connection Manager Properties

Step 2: Switch to code view and replace all Data Flow connections

You can find where a package connection is being used in a data flow by looking for the following in the XML:

connectionManagerID="Package.ConnectionManagers[{Package Connection Name}]"

connectionManagerRefId="Package.ConnectionManagers[{Package Connection Name}]"

Replace this text with:

connectionManagerID="{Project Connection GUID}:external"

connectionManagerRefId="Project.ConnectionManagers[{Project Connection Name}]"

Do this for all connections, then return to design view and delete the package connections.

Step 3: Refresh the connections

This shouldn’t strictly be necessary, but it seems to refresh all the metadata:

  1. Close the package
  2. Open the package in code view

Step 4: Replace all Control Flow connections

  1. Find and replace {Package Connection GUID} with {Project Connection GUID}

Step 5: Refresh the connections again

  1. Close the package, open in design view
  2. Check for errors
  3. Done!

This almost certainly could be scripted by someone with the right chops, time and motivation – a rough sketch of the idea is below.
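
For what it’s worth, here is a minimal sketch of the idea in Python – nothing more than a text find-and-replace over the .dtsx files, applying the same substitutions as Steps 2 and 4. The connection names, GUIDs and folder path below are placeholders for your own values gathered in Step 1, and you would want the packages backed up or in source control before running anything like this.

import glob

# Hypothetical values - replace with the names and GUIDs from the Properties window
PACKAGE_CONN_NAME = "MyPackageConnection"
PACKAGE_CONN_GUID = "{11111111-1111-1111-1111-111111111111}"
PROJECT_CONN_NAME = "MyProjectConnection"
PROJECT_CONN_GUID = "{22222222-2222-2222-2222-222222222222}"

replacements = [
    # Step 2: Data Flow references to the package connection
    ('connectionManagerID="Package.ConnectionManagers[%s]"' % PACKAGE_CONN_NAME,
     'connectionManagerID="%s:external"' % PROJECT_CONN_GUID),
    ('connectionManagerRefId="Package.ConnectionManagers[%s]"' % PACKAGE_CONN_NAME,
     'connectionManagerRefId="Project.ConnectionManagers[%s]"' % PROJECT_CONN_NAME),
    # Step 4: Control Flow tasks reference the connection by GUID
    (PACKAGE_CONN_GUID, PROJECT_CONN_GUID),
]

for path in glob.glob(r"C:\MySSISProject\*.dtsx"):   # folder path is hypothetical
    with open(path, "r", encoding="utf-8") as f:
        package_xml = f.read()
    for old, new in replacements:
        package_xml = package_xml.replace(old, new)
    with open(path, "w", encoding="utf-8") as f:
        f.write(package_xml)
    print("Updated", path)

The manual steps – deleting the old package connections and reopening the packages to refresh the metadata – still apply afterwards.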

A useful side note – if you have a Project Connection and a Package Connection with the same name (get link) then your project will start throwing errors.

Read More

October Sydney training roundup – MS BI, Cloud, Analytics

The end of the year is closing in fast, but there are still plenty of chances to learn from specialist providers Agile BI, Presciient and, of course, me!

Topics cover the full spread of DW, BI and Analytics, so there’s something for every role in the data-focused organisation.

Build your Data Warehouse in SQL Server & SSIS with the BI Monkey

Nov 24/25 – Are you about to build your Data Warehouse with Microsoft tools and want to do it right first time?

This course is designed to help a novice understand what is involved in building a Data Warehouse both from a technical architecture and project delivery perspective. It also delivers you basic skills in the tools the Microsoft Business Intelligence suite offers you to do that with.

Get more detail here

Agile BI workshops

Power BI specialist Agile BI brings you product updates on this key new self-service BI technology:

Oct 15 – Power BI workshop – Excel new features for reporting and data analysis – more detail here

Oct 30 – What Every Manager Should Know About Microsoft Cloud, Power BI for Office 365 and SQL Server 2014 – more detail here

Presciient Training

Dr Eugene Dubossarsky shares his deep business and technical expertise across a range of advanced and basic analytics. Full details here, but the key list is:

Dec 9/10 – Predictive analytics and data science for big data

Dec 11/12 – Introduction to R and data visualisation

Dec 16/17 – Data analytics for fraud and anomaly detection, security and forensics

Dec 18/19 – Business analytics and data for beginners

Read More

SQL Server Data Warehouse Training in Sydney – July 28

The eagle-eyed among you may have noticed I’ve added a big countdown banner to the right of the page announcing a course, “Build your Data Warehouse in SQL Server & SSIS with the BI Monkey”, running for 2 days from July 28.

The course is aimed at giving an overview of how to build and manage a Data Warehouse using the SQL Server platform as a base. It aims to be a solid mix of Data Warehouse theory and hands-on development to get the concepts locked in. For full details see my new “Training” page, but the key outcomes are:

  • Learn how to get the Architecture right and keep your solution flexible
  • Understand the whole lifecycle of a DW project
  • Get hands on with the tools you need to move data

The high level agenda is:

  • The Three Main Data Warehouse architectures – Kimball, Inmon & Data Vault
  • SSIS Basics – Projects, Control Flows and Data Flows
  • Moving Data with SSIS
  • ETL Control Frameworks
  • Database considerations
  • How HDInsight (Hadoop) fits in
  • Code Management with Team Foundation Server
  • Testing
  • Build Automation
  • Deployment
  • Management
  • MDS, DQS and other useful Microsoft Toys

Click the button to register now – I look forward to seeing you!

Eventbrite - Build your Data Warehouse in SQL Server & SSIS with the BI Monkey

Read More

Is ETL Development doomed?

A slightly dramatic title, but over the last few months I’ve been exposed to a number of tools that will provide a strong layer of automation to ETL development, eliminating a lot of the basic need for ETL developers to shift data from System A to Database B and beyond.

I’ve been hands on with a couple:

And also heard about a couple more:

… and I’m certain this list is not comprehensive. The significant takeaway is that software build automation in the BI world is starting to catch up with where other software languages have already been (coded a website lately? Many IDEs do most of the hard work for you now). Much as IDE-driven tools such as DTS, SSIS and so on moved us away from hand-coding SQL and wrapping up those scripts, the latest round of tools is moving us away from the IDEs where we drag and drop our flows.

How will ETL careers be killed?

There seem to be a couple of tracks for this. The first is pure development automation tools, such as Varigence MIST. If you are technically minded, take a look at this product demo video – though I suggest skipping to about 25 minutes in to see the real meat, as it does go on a bit. It looks mind-bogglingly powerful but is clearly shooting at the ETL pro who wants to churn stuff out faster, more consistently and with less fiddling about. MIST is limited to SSIS/AS (for now) and I’m not sure how far it will go, as it’s clearly aimed at the pro developer market, which is not always where the big buyers are. I expect to be playing with it more over the next few weeks on a live project, so I should be able to get a better view.

The second path appears to be more targeted at eliminating ETL developers entirely. AnalytixDS wraps up metadata import (i.e. you suck in your source and target metadata from the systems or ERWIN), lets you map fields and apply rules, then “push button, make code”. Obviously there’s a bit more to it than that, but the less you care about your back end and the quality of your ETL code (cough Wherescape cough), the more likely this product will appeal to you. Say hello, business users, who are the big buyers (though I look forward to troubleshooting your non-scalable disasters in the near future).

What’s the diagnosis, doctor?

Long term, the demand for ETL skills will decline on the back of these tools. Simple ETL work will simply go away, but the hard stuff will remain and become an even more niche skill that pays handsomely – though you may spend more time configuring and troubleshooting products than working with raw code. Which class of tool dominates is uncertain, but I’m leaning towards the business-oriented mapping tools that abstract away from ETL development altogether.

If you’ve had any experience with these tools, I’d love to hear about them in the comments.

Read More

Extract data from Hive using SSIS

So now that the Hive ODBC driver exists, the next thing to do is use SSIS to extract data from Hive into a SQL instance for… well, I’m sure we’ll find a reason for it.

Setting up the DSN

The first thing to do is set up a System DSN (Data Source Name) to reference in the ODBC connection. For SSIS, that means we need a 32-bit driver, which means we need to find the 32-bit ODBC Data Source Administrator. If you’re on a 32-bit OS, just go to the Control Panel and search for it. If you are on a 64-bit OS like me, you need to hunt it out. On Windows 7, it can be found at “C:\Windows\SysWOW64\odbcad32.exe”. Note that you need to run it as Administrator to make changes.

Go to the System DSN:

Fig 1: ODBC Data Source Administrator

Click “Add…”

Fig 2: ODBC Data Source Administrator

Scroll down the list until you find the “HIVE” driver, then click “Finish”, which brings up the ODBC Hive Setup dialog:

Fig 3: ODBC Data Source Administrator

Give your DSN a sensible name and description. For your host, enter the cluster URL (without http://) – e.g. “[your cluster name].cloudapp.net”. Leave the port as 10000. Under Authentication, select “Username/Password” and enter your username. Then click “OK” and we are ready to move on.
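
As an aside, you can sanity-check the DSN outside SSIS before going any further. Here is a minimal sketch in Python, assuming the pyodbc package is installed (use a 32-bit Python to match the 32-bit DSN) and that the DSN was named HiveDSN – the DSN name and credentials are placeholders:

import pyodbc

# Connect via the System DSN created above (DSN name and credentials are placeholders)
conn = pyodbc.connect("DSN=HiveDSN;UID=your_username;PWD=your_password", autocommit=True)
cursor = conn.cursor()

# HiveSampleTable ships with every cluster, so it makes a handy smoke test
cursor.execute("SELECT * FROM HiveSampleTable LIMIT 10")
for row in cursor.fetchall():
    print(row)

conn.close()

If that returns rows, the driver and DSN are working and any further problems are on the SSIS side.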

Connect in SSIS

To hook this into SSIS we need to create a Data Flow and add an ADO.NET Connection Manager. Not – as I initially thought – an ODBC Connection Manager.

Under Provider, select the “Odbc Data Provider” option under “.Net Providers”.

Fig 4: ADO.NET Connection Manager

Once that’s done, you can choose the Data Source Name you just created from the dropdown under “Data source specification”. Add your username and password to complete the setup, then click “OK”.

Fig 5: ADO.NET Connection Manager

Now the Connection Manager is set up, you can use it in a Data Flow. Add an ADO.NET Source and select your Connection Manager. Then you can – as with a normal database connection – select tables or write queries. In this example I’ve just chosen the HiveSampleTable that comes with every cluster.

Fig 6: ADO.NET Source

Then we route the data somewhere, in this case just pumping it to a Row Count.

Fig 7: ADO.NET Data Flow

I’ve put on a Data Viewer just to show it works.

Fig 8: ADO.NET Data Flow

And there we have it. Data from Hive on a Hadoop on Azure cluster via SSIS.

Read More

SSIS, DQS Cleansing & the Balanced Data Distributor

Some of you may have heard of the SSIS Balanced Data Distributor data flow component. It’s a neat gizmo that acts (sort of) like a mix of a Multicast and a Conditional Split, sending different buffers to different destinations. The idea of the component is to increase parallelism in data flows and thus speed up processing. I suggest you read this post by Matt Masson on the SSIS Team Blog to get an idea of how it works.

I thought I’d give it a road test in conjunction with the DQS Cleansing component to see if it offered any performance benefits. This was using DQS CU1 to get the delivered performance benefits, and also changing the “DQS CHUNK SIZE” parameter to 10,000 (see here) so it didn’t run like a total dog from the outset.

I pumped 150,000 rows through 5 simple domains and here are the results:

Round 1: the DQS Baseline

Fig 1: DQS Cleansing Baseline

Pushing all the rows through a single DQS Cleansing component took a whopping 21 minutes. Ouch.

Round 2: DQS with the BDD (Balanced Data Distributor)

Fig 2: DQS with BDD

Splitting the flow using the BDD to four identical targets (to match the number of cores I have available) took 10 minutes. Still not great, but it is a 50% performance improvement. My VM was pretty much maxed out by this so there may have been some hardware limitations kicking in there.

So, in conclusion, the Balanced Data Distributor is a good tool to help with those DQS performance issues.

Read More

Data Quality Services Composite Domains and SSIS

The handling of Composite Domains in Data Quality Services is not intuitive. You can map a column to a composite domain in the DQS Cleansing Component, as below:

Fig 1: Mapping Composite Domains in SSIS DQS Cleansing Component

However, there is no under-the-hood cleverness to auto-map the Sides column into the domain as well. This makes some sort of sense, as there is no reliable way for SSIS to guess which columns should be mapped into the other parts of the composite domain.

What would make sense is that you should be able to map multiple columns into a Composite Domain along with a column sequence so you can manage it through the UI. Something like this:

Fig 2: Possible Composite Domain Mapping UI

The effect of the mapping in Fig 1 is that a naive user would expect the DQS Cleansing component to be doing something clever, when actually it is doing exactly what it is told – passing a single column’s value to the Composite Domain in DQS. This results in records being treated as if only the first part of the Composite Domain is being passed in – in this case, Shape. The result is that records are by default Invalid (or, as in my Knowledge Base, sometimes Corrected).

How to correctly manage Composite Domains

The correct way to pass in data to a Composite Domain in SSIS for evaluation is to build it into a concatenated string using a Derived Column, and map that string into the Composite Domain – thanks to Matt Masson of the SSIS team for explaining this – see the comments at the bottom of this post.

My example string would be built as below, space delimited – conceptually, the Shape value followed by a space followed by the Sides value:

Fig 3: Creating the Concatenated String for the Composite Domain in a Derived Column

Which would then be mapped into the Composite Domain in the DQS Cleansing Component as:

Fig 4: Mapping the Concatenated String into the Composite Domain

This will then start churning out the expected results.

The database, DQS Knowledge Base & SSIS Package can be found here.

I’ve also raised a Connect item to cover this issue – please vote it up if you think it’s important.

Read More

ETL vs SQL

I’ve recently been asked to explain why I would use an ETL tool such as SSIS over good old fashioned SQL code. Here’s a quick summary, plus some links to others discussing the same topic:

  • Easier to maintain the code – through atomicity and simpler interfaces
  • Many reusable components – a lot of code (e.g. Lookups) has already been constructed
  • More flexible than SPs – they have more functionality (e.g. Cached Lookups, FTP)
  • Can deliver better performance than SPs – use of In Memory techniques can boost performance
  • Features such as logging/audit and metadata are built in
  • Support is more broadly available – it’s easier to find out why a prebuilt, widely used component is throwing errors

I appreciate not everyone will agree – and here are some people who do, and some who don’t:

Read More

Data Quality Services Performance Best Practices Guide released

Via @mattmasson – the DQS team have released a Data Quality Services Performance Best Practices Guide. It’s pretty thorough, and covers most aspects of performance that I’ve been talking about so far, plus heaps about matching and underlying architecture. A really good resource.

However, there’s one table which I think I’ll let speak for itself – expected performance:

DQS Performance

You won’t be using this in a big DW anytime soon…

Read More

DQS / SSIS Performance in SQL2012 RC0 – following some tweaks

As per the Connect item I raised around DQS / SSIS performance, it was pretty sluggish. The DQS team have come back with some more possible tweaks, and this is my test of those.

The key suggestions were:

  • Configure the DQS Server to accept bigger chunks of data from SSIS (see the connect for details of how to)
  • Increase the SSIS properties DefaultBufferMaxRows and DefaultBufferSize so bigger chunks can be sent by SSIS to the DQS server

I revised my original tests slightly, changing from 5 domains to 4 so that parallel processing would be counted properly (i.e. when splitting things up, it would process the 4 domains in parallel for a fair comparison with the single process – as opposed to doing 4+1, since the maximum number of parallel tasks my setup will allow is 4).

The tweaks I made were setting the DQS Server chunk size to 100,000, DefaultBufferMaxRows to 100,000 (10 times the default) and DefaultBufferSize to 104,857,600 (also 10 times the default). This meant that in my sample data set of 10,000 rows, everything was passed through in one big chunk.

So, to recap my test approach, I did it 3 ways (to allow for the previous performance tweaks mooted by the DQS team):

  • All in one pass (4 domains in a single component)
  • Separate domains (process each domain in a separate path)
  • Split data (process the domains in 4 x 2,500 row chunks)

The results of testing

Well… it helps. But there’s still a big gap between what can be achieved in the DQS client and what you can get via SSIS. The results are below, comparing the Tuned approach to the Untuned approach, and also the DQS client directly:

DQS Performance with SSIS

I’ve deliberately left out the actual numbers because I’m running on a clunky VM, and since I’ve read that Stephen Few book, I’ve learned a few things about conveying messages via data visualisation.

So, here are the takeaways:

  • The new tuning approaches make a difference – it’s about 3 times faster
  • The old tuning approaches still make negligible difference
  • Even tuned, SSIS is 5 times slower than the client

I’ll be feeding all this back to the DQS team… and we’ll see what they come up with next. It would be nice if we can reach a point where performance is comparable with the client.

Read More