An SQL alternative to the SCD
In SQL 2008 a new T-SQL construct was added - the MERGE operation. (Ok, pedants will know this wasn’t new to Oracle, but it was new to SQL Server).
This operation allows for the merging of a dataset into a reference dataset – which can be remarkably similar to Insert / Update operations effected by the Slowly Changing Dimension transformation. However the way it operates is very different. Instead of the SCD’s row by row evaluation approach, the MERGE operation is a set based operation. What this means is it compares the whole of the source dataset to the reference dataset in a single pass. This has significant implications for performance – on a site where I implemented this the operation which took 1,200 seconds in the SCD cut down to 51 seconds using a Merge.
There are limitations and differences to be aware of:
- You cannot directly return row counts for Insert / Update / Ignore operations in the Merge
- As it is a bulk operation a single row will cause failure of the whole batch
- There’s no GUI – just hand crafted SQL
- Less error trapping / logging options
- More flexibility in terms of actions when matches / non matches are found
The main reason why you would consider the SQL Merge – it handles Type 1, and with a little cunning, Type 2 dimensions – in a fraction of the time it takes the SCD to plod through. It’s still not as fast as a proper in memory comparison using something such as TableDifference – but it’s always good to know you have something else available in your toolkit.
Further information:
- Using the SQL MERGE Statement for Slowly Changing Dimension Processing – from the Kimball Group
- How to create type 1 & 2 SCD’s using standard SSIS components (other than the SCD) (at the bottom of the post) – Benny Austin
SQL Server Aliases
As part of a meeting on setting up DR systems, one of the Server Techs mentioned using SQL Server Aliases to allow the cubes to be identical in structure – right down to connections – yet live on separate environments and point at different SQL Servers. This was news to me, but a quick google turned this up: How to setup and use a SQL Server alias.
An Alias is an Operating System level setting that allows you give a friendly name to your Server – and you can then connect to the server using that friendly name. What this means is that as you migrate your code from environment to environment, if you use the same friendly name for your SQL Server in each tier of the deployment, you don’t have to change your connection strings.
It’s a neat trick (though i’m not sure how the names resolve if you use the same name in environments that can see each other) – it certainly requires some proper planning from an infrastructure point of view.
Multiple LIKE clauses in a single WHERE statement
I recently came up against a scenario where, in amongst a number of other filters, I had to deal with a couple of wildcard criteria. As there is no option to have multiple LIKE clauses in native SQL, such as WHERE Field LIKE IN (’%Option1%’,'%Option2%’), I feared I was stuck with having to duplicate my WHERE clause in its entirety for each LIKE operation – until I found this cunning bit of code (from the thread here):
/* Apply Multiple LIKE clauses in a single WHERE clause */
SELECT *
FROM [AdventureWorks].[Person].[Contact]
WHERE CASE
WHEN FirstName LIKE ‘Gusta%’ THEN 1
WHEN FirstName LIKE ‘Cath%’ THEN 1
END = 1
An elegant solution!
SQL Server 2008 Auditing
I recently got presented with this challenge: How do you monitor people disabling your SQL Agent Jobs? Not prevent, warn or notify… just be able to monitor and find out who did what, when?
Enter SQL Server Auditing, introduced in 2008.
What is SQL 2008 Auditing?
Well, you could read the official documentation on MSDN, but it’s a little overwhelming, so here’s the view from 500 ft. SQL Server Audit provides a secure means of tracking access and changes made to the database schema or its data. What this means is you can trace who tried to access what and when. If any command was issued to change the data, you can track this as well. You cannot track the changes made to the data. So for example if someone made an insert into a table, you could track that someone made an insert, who did it, when that insert was made and what the SQL of the insert statement looked like – except for what data was inserted. That task falls to Change Data capture (CDC) which isn’t directly tied to Auditing (currently).
How do I implement Auditing?
A SQL Server Audit is made up of two components, the first of which is the SQL Server Audit. This is a server level object (created in the master database) that defines where logs will be written. Logs can be written to either a file, the Application Log or Security Log. These must be created before you create any actual Audits.
The second component is either a Server Audit or Database Audit. Server Audits again exist at the server level, and track activities that occur at the server level, such as login attempts or permission changes. Database Audits exist within an individual database and track activities at a database level such as schema changes or data operations.
You can have multiple SQL Server Audits defining multiple log target locations. Multiple Server and Database audits can be created to use a given SQL Server Audit. This probably makes more sense when explained in pictures:
So how do I audit my SQL Agent Jobs?
Setting up Audits can either be done through SQL Server Management Studio (SSMS) or through SQL scripts. As usual any action carried out through the Wizard, so I will do the wizard first, then show the generated scripts.
Step 1 is setting up the SQL Server Audit. To create this in SSMS, under the Security Node of the Server is a folder called “Audits”. To create a new Audit, simply right click on the folder and choose “New Audit..”, as set out below:

Fig 2: Creating a SQL Server Audit
Then in the Wizard just enter the Audit name and in the dropdown select “Application Log”, as shown below. The alternatives are to the Security Log or to a File, but I won’t be covering those in this simple example. Click on OK and you’re done.

Fig 3: Creating a SQL Server Audit
The equivalent SQL script is below:
USE [master]
GO
CREATE SERVER AUDIT [My Sample Audit]
TO APPLICATION_LOG
WITH
( QUEUE_DELAY = 1000
,ON_FAILURE = CONTINUE
)GO
Now you have somewhere to write your log. Note that by default SQL Server Audits are disabled when they are created, which means nothing will be written to your log until until it is enabled. This can be done just by right clicking on it in SSMS and choosing “Enable”.
The next bit is to set up the Database Audit on the SQL Agent Jobs table, which is in the system table [msdb].[dbo].[sysjobs]. Under the msdb database, open up the “Security” folder and within there is the “Database Audit Specifications” folder. Right click here and choose “New Database Audit Specification…” as shown below.

Fig 4: Creating a Database Audit Specification
This will open the Create Database Audit Specification dialog. Because we are operating on msdb, unless you are logged in as sa, you probably won’t be able to create objects and the process will fail. So ensure the user you are logged in as has appropriate permissions on msdb. Because I’m playing on a Dev environment, I just allowed my own user access to msdb and gave them db_owner rights – this probably won’t fly in a production environment, of course!
The dialog is shown below. First task is to choose which SQL Server Audit to write to – so we use the one we created above called ”My Sample Audit”. You choose the Audit Action Type – i.e. what you want to log. In our case we want to see when changes are made to our jobs in SQL Agent, so we choose the “UPDATE” Action Type. Next we have the Object Class, which for this case is a Database Object, so “Object” is selected, and then the Object Name. Finally the Principal defines who is Audited – in the example I have selected “public” because every user is in the public role, so I will capture any users UPDATE commands issued on my selected table.

Fig 5: Creating a Database Audit Specification
Now here we hit a minor hiccup – you will note in the example above I have selected the object sysdtslog90 – it’s the only non-system view in the database, and the only one I can get to come up in the browser. The solution here is just to hit the “Script” button, and modify the generated code to point at the right table, as shown below:
USE[msdb]
GO
CREATE DATABASE AUDIT SPECIFICATION [SQL Agent Audit]
FOR SERVER AUDIT [My Sample Audit]
ADD (UPDATE ON OBJECT::[dbo].[sysjobs] BY [public])GO
This code runs just fine, and as you will see has the desired effect. As with SQL Server Audits, Database Audit Specifications are created in a disabled state. As before, just right click to enable and the Auditing will begin.
Finally to test, just change the Enabled state of any SQL Agent job on your server, then check the Application Log, where you will find an Event detailing who just changed its status!
Fuzzy Thinking
I’ve covered off the Fuzzy Lookup and Fuzzy Grouping transformations in SSIS and noticed in my research that these capabilities aren’t particularly coherently talked about on the web. So below I thought i’d collect some of the better articles for your late night reading. There isn’t all that much out there, unfortunately.
So, how does it all work?
Here are a few articles covering theory, mostly from Microsoft:
- Fuzzy Lookup and Fuzzy Grouping in SQL Server Integration Services 2005 – a detailed article on the use of the components, with some coverage of under the hood behaviour and performance considerations
- Fuzzy Lookups and Groupings Provide Powerful Data Cleansing Capabilities – explains at a conceptual level using the Fuzzy Lookup components and considerations in understanding the results
- Robust and Efficient Fuzzy Match for Online Data Cleaning – a research paper by the creators of the Fuzzy Lookup algorithms which goes into the guts of how fuzzy matching works, with details on Q-Grams, Error tolerant indexes and so forth
It is probably worth reiterating that because of the way the algorithms and their Q-Grams work, when longer strings are being analysed for fuzzy matches, the better the chances of a good match. When I first started using the algorithms I was doing some client matching and matched first and last names separately. Once I had a deeper understanding of the components, I started matching on a full name and the quality and reliability of matches improved significantly.
Ok, so how do I make it work?
Now, some articles covering practical implementation of the tasks:
- Fuzzy Lookup and Fuzzy Grouping transformations – guides to the practical use of the transformations from me, the BI Monkey
- Using Fuzzy Lookup Transformations in SQL Server Integration Services – a practical example of using a fuzzy lookup
- Adventures with Fuzzy Matching – from Jamie Thompson, which provides some cautions on the reliability of the results
- Fuzzy Lookup and Fuzzy Grouping transformations – MSDN documentation
The best thing you can do is get some sample data and play with the components to understand what it is they do. The results are impressive – if not bulletproof – and can make a great contribution to de-duplicating client data, etc.
And what does the BI Monkey have to say about it?
Fuzzy Matching is a powerful and easy to use tool which is great for approximate grouping of data for analysis where a margin of error is tolerable. It is also a great helper in data cleansing exercises. Having too much faith in the results where exact matches are required will cause you to fall over at some point, so be careful. If you are engaged in such an exercise and want some experienced support, please get in touch.
And in other news, fresh from Jamie Thompson – Fuzzy Lookup and Regex are going to become available in SQL2008R2.
If you have come across any articles that you think really contribute something to the understanding of fuzzy matching technologies in SQL Server / SSIS, please let me know or post a link in the comments so I can improve this article.
Handling Recursive Hierarchies in SQL Server
I was recently posed a question on handling recursive hierarchies which left me completely stumped, so I had to find a good solution to it. This post will cover all that, but first -
…what is a Recursive Hierarchy?
A recursive hierarchy is a one where children of parent members can be parents themselves, such as an Organisation chart, or chart of accounts. A simple example is shown below, where the node A2 is both a child of A1 and a parent of A4 & A5:

Fig 1: A Simple Recursive Hierarchy
Handling these within a database environment can be difficult because the number of parent / child relationships (aka “Depth”) can vary, so it is impossible to create a fixed width table to accomodate them for their future growth. Consequently, most operational systems store these in a simple two level tabel which records the parent / child relationships only, as shown below.

Fig 2: The Parent Child Table
From the operation systems point of view this is usually all it needs to function as they usually only need to know the relationship one step in either direction, which this table satisfies. Determining the parent of A2 or the children of A2 can be determined with a simple SELECT query.
How do they cause difficulties for reporting?
The difficulty faced in a SQL based reporting situation is that you cannot easily determine relationships between parents and grandchildren, great-grandchildren, etc. without nesting queries. The specific problem with this is that you cannot know from a parent exactly how many levels of children lie below it. Vice versa, you cannot know how many levels of parents a child record has. Consequently you cannot know in advance how deep to nest your queries. From a practical standpoint as well, these relationships could be hundreds or thousands deep, and writing that query can pose problems in itself, even if you know there are exactly 1,567 levels in it!
The problem I was posed was superficially simple – if every level of such a hierarchy can have values, such as laid out below, how do you determine the aggregate value for a given parent and all its children?

Fig 3: The Values Table
For a tiny example such as this, nested queries is an option – your sub selects would only have to go two deep at most. But what if the depth changed, or ran into the tens or hundreds? Also, how do you create the generic query for any node in the tree? And for added complexity, what if there are multiple trees in the hierarchy? The problems are not trivial to resolve, but I have located two good solutions.
Solution 1: The LR Method
The first, and in my view most elegant, comes from Michael J. Kamfonas in his article “Recursive Hierarchies: The Relational Taboo!“, which I strongly recommend reading to fully grasp the solution. His solution is to pre-number each node in the hierarchy with a value that on the Left forms a lower bound and on the Right an upper bound that allows you to select all nodes under it using a BETWEEN clause. A picture explains this concept clearly, with the L Value in Pink and the R Value in Green:

Fig 4: The LR Method applied to a Recursive Hierachy
So as you can see for node A2, the range between its L Value of 2 and R Value of 7 encompasses the L Values of all its children (A4 with 3, and A5 with 5). So to select all the nodes below A2 there is no need to resolve any relationships, you can simply select all children nodes based on their L values.
In database terms, the output looks like this:

Fig 5: The L R Table
So, if I wanted to know the sum of all the values below any given node, I would use this pseudo SQL:
SELECT SUM(Value)
FROM ValuesTable vLEFT JOIN LR_Table lr
ON v.Node = lr.NodeWHERE lr.L_Value
BETWEEN (SELECT L_Value FROM LRTable WHERE Node = ‘ParentNode’)
AND (SELECT R_Value FROM LRTable WHERE Node = ‘ParentNode’)
To demonstrate this in practice, I have created some SQL which creates a recursive hierarchy table as in Fig 2, a values table as in Fig 3 and an LR table as in Fig 5. The LR table is then populated with L/R values by a simple cursor which logs its activity to the message window so you can see what it is doing. I don’t think it will win any prizes for efficiency, but it works! The code sample then closes out with a T-SQL sample which calculates the sum of all it and its childrens values. Download the sample code here.
Solution 2: Kimball Helper Table
Unsurprisingly, Ralph Kimball also has an approach. His involves the use of a “Helper Table” which he describes in his article “Helper tables handle dimensions with complex hierarchies“. As for solution 1 I advise reading the article as I will only go into the practical implications of his approach.
The Kimball approach requires creating a table that stores every path from each node in the tree to itself and to every node below it. Looking at the example below, this means creating one row per node, plus one row for each path down the tree for each parent node to all of its children.

Fig 6: Paths to capture in a Helper Table
The helper table also includes a few extra flags – the Level Depth from the top parent, and flags for the top level parent nodes (Topmost) and lowest level children nodes (Lowest). The output table ends up looking like this:

Fig 7: The Kimball Helper Table
This makes navigating relative positions in the hierarchy simpler, and allows the answering of the original question easy as all that is required is to join on the Values table to the Helper table for the given parent node, as below:
SELECT SUM(Value)
FROM HelperTable rLEFT JOIN ValuesTable v
ON r.ChildNode = v.NodeWHERE ParentNode = ‘ParentNode’
However – as you will see from my sample code – populating these tables is much more complex. The sample code creates and populates the recursive hierarchy table as in Fig 2, a values table as in Fig 3 and an Helper table as in Fig 7. The process of populating the table is a mix of cursors, inserts and updates and is much trickier to get right than the LR method, because of the higher demands for information, such as Depth from Parent.
Recursive Hierarchies – no problem!
Above are two solid approaches to the Recursive Hierarchy problem. The code samples provided should help you get started on understanding how to handle these scenarios when trying to report against them in Database based reporting scenarios such as in SSRS. SSAS and other OLAP based reporting handles all of this in its stride however, as described here, for example.
If you find issues with my code samples or see improvements, i’d love to know about them, so please keep me posted about your experiences with them.
Finally, I will close out with the old joke on the subject:
To truly understand recursion,
you must first understand recursion
10 Essential SQL Tips for Developers
10 Essential SQL Tips for Developers courtesy of Eric Shafer – covers some important basics – if you scan through that list and don’t understand all 10 points, make an effort to ensure you do – they aren’t mind boggling obscurities but important basics.
ti7mfkj6sr
Bulk loading XML with SQLXMLBulkload
I was recently asked to help out with loading some awkward XML into SQL Server using the SQLXMLBulkLoad feature of SQL Server (it’s been there since 2000 but is one of SQL Server’s lesser known features). It’s a more graceful approach than using the XML Source Component in SSIS but has its own limitations and frustrations.
The specific problem I was trying to resolve was loading a file that had an Element that only had sub-Elements and contained no data, which resulted in me getting the error “Schema: the parent/child table of the relationship does not match”
What is SQLXMLBulkload?
Apart from a mouthful with a vowel shortage, it is a standard component of SQL Server that provides a means to bulk load XML into standard relational tables. All you need is the SQL Server target tables, a script to invoke the SQLXMLBulkLoad, the XML file itself and an appropriately formed XSD schema to describe the XML and how to load it into the relational schema. The biggest limitation is that you cannot manipulate the data as it is imported – it is after all a bulk loader. The frustration part comes in forming the XSD schema properly, which can be a bit of a black art.
How to Invoke SQLXMLBulkload
The script to call the bulk loader is pretty simple – my example below was simply saved in notepad with a .vbs extension, which is then executed with a double click (or could be called from SSIS using an Execute Process task).
Set objBL = CreateObject(”SQLXMLBulkLoad.SQLXMLBulkLoad”)
objBL.ConnectionString = “Data Source=SERVER\INSTANCE_NAME;Initial Catalog=TARGET_DATABASE;Provider=SQLNCLI10.1;Integrated Security=SSPI;”
objBL.ErrorLogFile = “C:\error.log”
objBL.Execute “C:\sample.xsd”, “C:\sample.xml”
Set objBL = Nothing
Handling Elements with sub-Elements only with sql:is-constant
Now, to the problem. My XML source contained a nested Element which had no content itself, but had further multiple occurences of a nested Element within it, as below:
<Nuts>
<Nut Id = “1″ Name = “Peanut”>
<Photos>
<Photo>nutty1.jpg</Photo>
<Photo>nutty2.jpg</Photo>
</Photos>
</Nut>
<Nut Id = “2″ Name = “Cashew”>
<Photos>
<Photo>nutty3.jpg</Photo>
<Photo>nutty4.jpg</Photo>
</Photos>
</Nut>
</Nuts>
For ages I was going in circles trying to get the relationship set between the Nut and Photo elements (using the sql:relationship annotation) but had endless trouble because each nesting demanded a relationship – so it looked like it needed a Nut > Photos > Photo relationship, but because the Element <Photos> had no data items there was nothing on which I could establish a relationship. Eventually I stumbled upon the sql:is-constant annotation which tells the XSD schema that the element doesn’t map to a database table or column – and one of its documented uses is specifically to create a container element, as my situation required. So my final XSD looked like this:
<?xml version=”1.0″?>
<xsd:schema xmlns:xsi=”http://www.w3.org/2001/XMLSchema-instance” xmlns:xs=”http://www.w3.org/2001/XMLSchema” xmlns:xsd=”http://www.w3.org/2001/XMLSchema” xmlns:sql=”urn:schemas-microsoft-com:mapping-schema” attributeFormDefault=”unqualified” elementFormDefault=”qualified”>
<xsd:annotation>
<xsd:appinfo>
<sql:relationship name = “Link” parent = “Nuts” parent-key = “Id” child = “NutPhotos” child-key = “Id” />
</xsd:appinfo>
</xsd:annotation>
<xsd:element name=”Nut” sql:relation=”Nuts”>
<xsd:complexType>
<xsd:sequence>
<xsd:element name=”Photos” sql:is-constant = “1″>
<xsd:complexType>
<xsd:sequence minOccurs=”0″>
<xsd:element name=”Photo” sql:field = “Photo” sql:relation=”NutPhotos” sql:relationship = “Link” />
</xsd:sequence>
</xsd:complexType>
</xsd:element>
</xsd:sequence>
<xsd:attribute name=”Name” type=”xsd:string” sql:field=”Name” />
</xsd:complexType>
</xsd:element>
</xsd:schema>
One tiny bit of annotation solved my headaches – understanding the flow of XML and how XSDs interpret it is definitely tricky, but it can be very powerful. Of course, the fact that this took me four hours to solve using SQLXMLBulkload, but 5 minutes in SSIS took the colour off it a bit, but SSIS can’t always be used, as was the case for the person I helped out.
The SQLXMLBulkLoad documentation on MSDN is actually a good reference and well worth spending a little time going over so you can better understand this feature.
Count the number of rows in every Table in a Database in no time!
Here is a piece of T-SQL code that uses DMV’s (Dynamic Management Views) to give an approximate row count of every table in your database in virtually no time at all. Bear in mind it is running off collected statistics so won’t always be 100% accurate – but it’s far quicker than doing a Count(*) on a table by table basis when you just need a rough idea when doing sizing. In case you don’t know what DMV’s there are, go poke around in SSMS – [Database] > Views > System Views and see what’s there. Anyway, the code:
SELECT s.[Name] as [Schema] , t.[name] as [Table] , SUM(p.rows) as [RowCount] FROM sys.schemas s LEFT JOIN sys.tables t ON s.schema_id = t.schema_id LEFT JOIN sys.partitions p ON t.object_id = p.object_id LEFT JOIN sys.allocation_units a ON p.partition_id = a.container_id WHERE p.index_id in(0,1) -- 0 heap table , 1 table with clustered index AND p.rows is not null AND a.type = 1 -- row-data only , not LOB GROUP BY s.[Name], t.[name] ORDER BY 1,2
For a very swift explanation – sys.schemas and sys.tables list the schemas and tables in the database, so joining these together on schema_id gives a list of all tables by schema in the database. Adding on sys.partitions then pulls in the partitions associated with each table, and finally sys.allocation_units pulls in the allocation units, which i’m not quite sure what they are – the guts of this query were pulled from another blog which I embarrasingly can’t trace back to now.
I’m no expert on DMV’s so if you have any views on the quality of this query – please leave a comment with your thoughts.
Microsoft Master Data Mangement Release announced
Microsoft’s Master Data Management tool, previously codenamed ‘Bulldog’ and slated to be included in the Office 14 release is now set to be included with Kilimanjaro – which seems to now be officially called SQL Server 2008 R2.
More details can be found at the SQL 2008 Official site. The rumour mill seems to imply that SQL 2008 R2 will drop at roughy the same time as Windows 7.
Just in case you have no idea what Master Data Management entails, it covers managing and centralising metadata within an organisation that may span many systems. The most common example is central management of customer records. For more insight and a more detailed explanation, I suggest you read this white paper on the subject by Roger Wolter and Kirk Haselden.
