The Merge Join Transformation

Fig 1: The Merge Join Transformation
Fig 1: The Merge Join Transformation

In this post I will be covering the Merge Join Transformation. The sample package can be found here for 2005 and guidelines on use are here.

What does the Merge Join Transformation do?

The Merge Join Transformation joins two data sets on a common key (or keys). It’s the same as performing an SQL Join – the options you have are Inner Join, Full Outer Join and Left Outer Join. However the difference is that SSIS rather than SQL is doing the join, so it’s an in-memory activity. The Merge Join is a Partially Blocking transformation – it needs to receive all data per join key value pair from all inputs before it can proceed, so this can have a significant negative effect on performance.

Configuring the Merge Join Transformation

A key thing to note for the Merge Join Transformation is that the inputs must be sorted on the keys on which you are joining the datasets, for both sets. This means you either have to set the IsSorted property on your source if the data is coming in sorted, or push it through a Sort transformation before you can attempt to do the join. In the example package, I have used 2 OLE DB sources. One of these is unsorted and I sort the output using a Sort column – note this is another blocking transform and another chance to drag performance down. The other one is sorted, and I tell SSIS by setting the IsSorted property of the OLE DB Source Output to true using the advanced Editor.

In terms of configuration of the Merge Join, all you need to do is map the join keys using a drag and drop and select the columns you want on the output, as shown below:

Configuring the Merge Join Transformation
Fig 2: Configuring the Merge Join Transformation

The checkbox to the left of the column name in the join section indicates whether that column is output from the transformation. In the box below the join section, you can rename the output columns if you want. The Order column is the Sort Order, and is not configurable. The Join Key checkbox is ticked when you create a relationship.

There is no advanced editor, but if you access the properties from the design surface there are a few extra properties to access which are worth knowing about:

  • MaxBuffersPerInput – which sets how many buffers to be active for one input at a time – see here for more detail
  • TreatNullsAsEqual – which decides whether to join Null values or not – by default this is true.

Where should you use the Merge Join Transformation?

The answer to this is simply that it should be used wherever it is not possible to perform the join using native SQL functionality – e.g. joining data from text file sources or other non database sources. If your sources are coming from the same database, use a query and get the database to do the work. It’ll be faster and more efficient in the vast majority of cases, and the fact that it’s a blocking transformation will slow down your package overall.

MSDN Documentation for the Merge Join Transformation can be found here for 2008 and here for 2005.

If you are still struggling, try these additional resources:

Also on advanced use of the component:

If you need specific help or advice, or have suggestions on the post, please leave a comment and I will do my best to help you.

4 thoughts on “The Merge Join Transformation

  1. In the past, I have had some problems with using a combination of SSIS sort and SQL sorted inputs for the merge join. I get some strange output where the data does not actually join correctly for some rows. I sorted the issue by putting in a SSIS Sort after each input, perhaps SSIS sort != SQL sort?

  2. If it’s not joining correctly, it probably means the source data isn’t sorted correctly or on the outside chance, you need to add more buffers into the join (using MaxBuffersPerInput ). But it’s most likely a sorting issue.

  3. SSIS SQL Sort? that is correct. SSIS uses windows collations and SQL uses “SQL Server” collation. I think the documentation says to change any varchar to nvarchar which suppose to fix this problem which means cast in order by; not sure if that won’t have impact on performance if there are joins. I’m having issues with sort myself so I will try to publish my comment with my final findings.

    BDW I don’t understand why merge join is a blocking transformation; if input is sorted than that probably should be partially blocking transformation?

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <s> <strike> <strong>