This post is going to cover the XML source component, which allows you to read data from XML sources into SSIS data flows. It is largely a rehash of a post by Matt Mason on the SSIS Team Blog, though this one has some samples to play with. A zip file containing the samples can be found here, read guidelines on use here.
Referencing the XML file, and its associated XSD
The XML Source is unusual in that it does not require a Connection Manager, but instead references a XML data source either as a direct reference to a file,a variable containing a file reference, or a variable containing XML data. These options can be found in the Data Access Mode of the Connection Manager Tab, shown below.
Apart from the XML file location, the XML Source also requires you to specify the location of the the XML Schema Definition (shortened to XSD) which specifies the locations, names and formats (text, numeric etc.) of the data items found in the XML file. This schema definition can exist as part the source XML file, in which case you can check the “Use inline schema” box, and the “XSD location:” box will grey out. See example 4 in the sample package for a demonstration of this. However beware that the schema must be within the data rows tag – see the connect article here for a full explanation, but the inline schema the XML source works with is not compliant to standards.
Otherwise you need to specify the location of the XSD file. If you don’t have one already available, SSIS can attempt to generate one for you by clicking the “Generate XSD…” button, which will prompt you for a file location to generate the XSD in. If successful the XSD generation process will create a .xsd file for you and reference the location in the Connection manager tab.
How XML data is handled by the XML Source
XML Data items can be formed in two ways, as Elements within XML tags, or Attributes of XML tags, expressed simply below:
<datarow dataitemname = “dataitemvalue” />
In each case, the XML Source will treat each identifiable row, as indicated by the <datarow> tag as a incoming row of data, and each data item as a column. Examples of both scenarios are included in the sample package in case 2 and 3.
A more complex scenario is where there are nested XML elements, as in the sample below:
In this case there will be multiple outputs from the XML source. In the sample above there would be one output for datarow, and one for nestedelement. These can then be tied together by a row id column generated by the component. The Columns tab from the package is shown below, detailing the two outputs due to the presence of the nested tags. You can also see the autogenerated row_id column:
The sample in the package then uses a Merge Join to then pull these back together as a flattened line of data, joining on the row id.
The XML source is a flexible component used to access XML files. It is capapable of generating XSD schema files if required. If you have nested elements, the component generates multiple outputs that can be recombined on an id generated by the components. It is worth noting that the SSIS team blog says it handles XML “to various degrees of success“, which implies that as the XML gets more complicated, you may experience undesirable results.