Reference Environment Variables in C# Mappers for HDInsight

Within your Mappers and Reducers there may be a need to reference the environment variables being fed to the task, such as the file name. Understanding how to do so took a little digging on my part, with a little help from Matt Winkler in the HDInsight MDSN forum.

Using this snippet of code:

// Adding this reference at the start of the code

using System.Collections;

foreach (DictionaryEntry var in Environment.GetEnvironmentVariables())

Console.WriteLine(“{0}”, var.Key + “|” + var.Value);

// Some junk code so the mapper doesn’t fail

string line; // Variable to hold current line

while ((line = Console.ReadLine()) != null)

{             // do nothing            }

 

It was possible to output all the Environment Variables as the Mapper output and work out their format from the resultant text file it created.

Then, to reference individual Environment Variables in the Mapper, you can simply use variations on:

 

string FileName = System.Environment.GetEnvironmentVariable(“map_input_file”);

string FileChunk = System.Environment.GetEnvironmentVariable(“map_input_start”);

Read More

MapReduce in C# for Hadoop on Azure

There are a bewildering array of language options available to write Mappers and Reducers (aka MapReduce) – Java and Python feature heavily, and for the non programmer the entire exercise is borderline incomprehensible.

However, a kind soul by the name of Sreedhar Pelluru has posted a simple walkthrough for building a Mapper and Reducer using C# & VS2010 for us Microsoft oriented souls, with an intended Hadoop on Azure target. The walkthrough is here: Walkthrough: Creating and Using C# Mapper and Reducer (Hadoop Streaming)

There are a few holes in the script so here’s the things to look out for:

  • In the section “Create and run a Map/Reduce job on HadoopOnAzure portal”, the first item suggests you run a Javascript command to get the IP address, but doesn’t provide it until a few lines later – the command is: “#cat file:///apps/dist/conf/core-site.xml “. You can also find out the IP by remoting into the cluster and running IPConfig at the command line.
  • Step 7 in the same section asks you to open hadoop-streaming.jar, and it took me a while to realise this mean on the HadoopOnAzure portal, not on your local machine (so I spent quite a bit of time in misadventures trying to manipulate the file on my local machine)
  • Error messages for job failure aren’t terribly helpful, and there’s no validation on job parameter input, so really really make sure that your command really does look exactly like the one in step 11. Miss a double quote or mistype a path and you will get no hint as to that being the source of the error.

Eventually I beat the above and achieved victory – a successful job run on HadoopOnAzure with the expected results. Next challenge – build my own data, mapper & reducer and repeat. Then get it into Hive….

 

Read More

Using .NET Framework 3.5 in SSIS Scripts

Thanks to Valentino Vranken who provides this useful post: Using A .Net 3.5 Assembly In SSIS 2008

By default SSIS Script Tasks / Components reference .NET Framework 2.0, which was confusing me as I tried to implement a solution around managing times in different Time Zones which required the TimeZoneInfo Class which only exists in 3.5 and higher. (This was what I was trying to implement: Daylight Savings in Script Task). Because the script referenced .NET Framework 2.0 the TimeZoneInfo class wouldn’t work. Switch the target framework to 3.5 and bingo – all good.

The tl/dr version: In your script component – while editing the script – open the Properties and change the Target Framework to 3.5 to use functions available in 3.5.

Read More

Understanding the SSIS Object Model – part 1 – Executables

I have recently been buried deep in the process of creating SSIS packages through code – a deep and interesting challenge, not least because I have had to teach myself C# and get to grips with Object Oriented Programming – a long way from my grubby VBA I was writing many years ago.

Key to the success of understanding how to build packages through code is to understand the SSIS Object Model. Over a series of posts I will try and explain how the Object Model looks and works. So, on with Part 1 – Executables. As I walk through this section, I will demonstrate adding an Execute SQL task to a package.

Executables

The top level object in the Object Model is a Package. However a Package is also an Executable object – along with all the SSIS Containers. These Containers are the Sequence Container, ForEach & For Loops and also TaskHosts (which contain Tasks such as the ExecuteSQL Task and Data Flow – more about TaskHosts in a bit). So at the top of the tree we have the Package, but below that we have Control Flow tasks.

In broad (and possibly not quite right) terms, an Executable is a part of the Object Model that can issue a command to start an activity. It is important to note that Data Flow sources, transformations and destination are not Executables. The Data Flow Task is – because it can start the Data Flow – but you cannot in isolation call a Derived Column Transformation, for example. In this context it’s a bit like an engine – the Data Flow Task is the ignition that starts things happening, but on its own any Transformation is like a piston – vitally important but useless until the engine starts.

So, to begin the code snippets, here I will add an Executable to a Package.

            // Declare and create a package object
            PackageTestPkg = new Package();

            // Create an executable object
            ExecutableNewExe = TestPkg.Executables.Add(“STOCK:SQLTask”);

What happened above is firstly I created a new Package Object. Then, to that object I used the Executables.Add method to add an Executable to the Packages’ Executables collection, with the moniker “STOCK:SQLTask” to define what type of Executable I was adding.

TaskHosts

TaskHostsare the Containers for the tasks you are familiar with from the Control Flow such as the ExecuteSQL Task and Data Flow. You will never see a TaskHost through BIDS – it is a programming concept rather than anything visual. A TaskHost has an InnerObject (and more on these in the next section) which describes the details for type of task it is (i.e. ExecuteSQL Task, Data Flow, etc.).

A good way to think of a TaskHost is like an empty glass jam jar. In itself it doesn’t do much, but if you fill it with some ExecuteSQLTask Jam, you can see it’s an Execute SQL Task.

Below we cast this added Executable as a TaskHost so we can start configuring it.

            // Cast as TaskHost
            TaskHostNewTaskHost = (TaskHost)NewExe;

            // Give it a name
            NewTaskHost.Name = “I am a new Execute SQL Task”;

There’s a limited range of properties you can set at the TaskHost level which are detailed here, such as Name – these are the properties that are common to all TaskHosts. To set properties that are specific to a type of Task, you need to access its InnerObject.

InnerObjects

The InnerObject is the jam in the TaskHost jar from my analogy above – it contains the specific stuff for a given task. So, in order to configure a task fully, you need to access it’s InnerObject to set the properties that are specific to that type of task.

So, below we cast this TaskHost as an ExecuteSQLTask, thus filling the TaskHost jar with our tasty ExecuteSQLTask jam and set its SQLStatementSourceProperty:

            // Cast as ExecuteSQLTask
            ExecuteSQLTaskNewSQLTask = (ExecuteSQLTask)NewTaskHost.InnerObject;

            // Set a ExecuteSQLTask specific property
            NewSQLTask.SqlStatementSource = “And here is my SQL Statement”;

Wrap-Up

So, now you should be a little wiser as to what is required to add and configure a Task on the Control Flow – understanding how Executables, TaskHosts and InnerObjects interact.

Read More

Modifying an SSIS Package through code

Part of any SSIS development experience inevitably results in you discovering a minor mistake or something that was missed a long way into the development cycle – or even after, in testing (you do test your code, right?). Then you are faced with the tedious job of opening every single package, making a change in every one… and getting some serious mouse finger. Much like I once did when I learned about BufferTempStoragePath.

Fortunately, there is a way to automate these fixes. The SSIS Object model is (relatively) easily manipulated through .NET languages – so it’s not too difficult to write a small program that will change your package. Below is a sample I knocked up that will add a variable to an existing package and save the change:

using System;
usingMicrosoft.SqlServer.Server;
usingMicrosoft.SqlServer.Dts.Runtime;

namespace Package_Modifier
{
    class Program
    {
        static void Main(string[] args)
        {
            // Initialize an Application and Package object
            Application app = new Application();
            Package package = null;

            // Set a package path
            StringpkgPath = “C:\\BI Monkey\\SamplePackage.dtsx”;

            // Load the package in package object
            package = app.LoadPackage(pkgPath, null);

            // Add the new variable
            package.Variables.Add(“NewVar”, false, “User”, 0);

            // Save the package
            app.SaveToXml(pkgPath, package, null);
          }
    }
}

You can essentially make any change you like to a package – I’ve chosen adding a variable because it’s an easy manipulation of the package object and I’ve got a long way to go before I work out how to do anything much harder :)

Read More