One of the best changes in SSIS 2012 was to create the concept of a Project Connection – a connection manager that can be used across the whole project instead of being limited to the package scope, meaning you had to recreate and configure effectively the same connection in every single package you have. This feature is great… when you are starting a new project.
However a recent task I got handed was to migrate a 2008 project to the 2012 project model. All very sensible, to ease maintenance, eliminate XML configurations and generally bolster security. Then I got to work….
Converting a Package Connection to a Project Connection
Ah, the easy part. Pick your connection, right click, convert to project connection and … ta daa! You have a project connection!
Now… what about all the other packages?
Pointing every other Package connection to the Project connection
This is a little harder. The good bit is your project connection will appear in your available connection managers. The bad bit is there is no way to tell the SSIS designer to use this one instead of your old one. You can either manually repoint every data flow, Execute SQL, script and whatever other task happens to be using the package connection to the project connection – easy if your package is small and simple – or get X(ML)treme! Fortunately thanks to this post by Jeff Garretson I was reminded that SSIS packages are just XML, and XML can be edited much faster than a package using the designer. Jeff’s post only resolved how to fix up the Data Flow – I had a pile of Control Flow tasks to fix up too – so here’s how to get it done without hours of coding.
Step 1: In designer mode Get the Name & GUID of all Connections to be replaced and what to replace them with.
You can get this from the properties window when you have a connection manager selected in the SSIS designer:
Step 2: Switch to code view and replace all Data Flow connections
You can find where a package connection is being used in a data flow by looking for the following in the XML:
Agile methodologies have a patchy track record in BI/DW projects. A lot of this is to do with adopting the methodologies themselves – as I’ve alluded to in prior posts there are a heap of obstacles in the way that are cultural, process and ability based.
I was discussing agile adoption with a client who readily admitted that their last attempt had failed completely. The conversation turned to the concept of the Zero sprint and he admitted part of the reasons for failure was that they had allowed Zero time for their Zero sprint.
What is this Zero sprint anyway?
The reality of any technical project is that there are always certain fundamental decisions and planning processes that need to be gone through before any meaningful work can be done. Data Warehouses are particularly vulnerable to this – you need servers, an agreed design approach, a set of ETL standards – before any valuable work can be done – or at least without incurring so much technical debt that your project gets sunk after the first iteration cleaning up after itself.
So the Zero Sprint is all that groundwork that needs to be done before you get started. It feels counter agile as you can easily spend a couple of months producing nothing of any direct value to the business/customer. The business will of course wonder where the productivity nirvana is – and particularly galling is you need your brightest and best on it to make sure you get a solid foundation put in place so it’s not a particularly cheap phase either.
How to structure and sell the Zero sprint
The structure part is actually pretty easy. There’s a set of things you need to establish which will form a fairly stable product backlog. Working out how long they will take isn’t that hard either as experienced team members will be able to tell you how long it takes to do pieces like the conceptual architecture. It just needs to be run like a long sprint.
Selling it as part of an Agile project is a bit harder. Because you end up not delivering any business consumable value you need to be very clear about what you will deliver, when you will deliver it and what value it adds to the project. It starts smelling a lot like Waterfall at this point, so if the business is skeptical that anything has changed, you have to manage their expectations well. Be clear that once the initial hump is passed, the value will flow – but if you don’t do it the value will flow earlier to their expectations, but then quickly after the pipes will clog with technical debt (though you may want to use a different terminology!).
A couple of times recently I have come up against requirements which have required some fairly complex logic to apply security. One involved some fairly gnarly relationships coming from multiple directions, the other involved grinding through Hierarchies from parent nodes down to permitted viewable children.
The problem with both cases is that though the logic can sometimes be written (albeit usually in an ugly as hell manner) – the functions needed to do so perform atrociously. For complex relationships you are obligated to take in context after context, changing filters and doing all sorts of DAX voodoo. As we know by now, avoiding relationships is good for performance. Hierarchies can be managed through the PATH function, but it’s a text operation that is far from speedy.
Let’s give a quick example of some complex security – consider the below data model:
Here the security controls who can see what has been spent on a Task in the FactTable object. How can see what depends on their Role and/or the Unit they are in. There is also a 1:many relationship between a person and the login they can use.
So for dynamic security you need to navigate from the User Id to the Person and assess what Unit they are in for the Unit based permissions. You also need to assess what Role they are in to get the Role based permissions.
I took one look at this and shuddered at the messy DAX I was going to have to write, plus how terribly it would perform.
Do it in the Cube? Yeah Nah.
So I thought “Yeah nah” and decided the cube was the wrong place to be doing this. Ultimately all I was trying to get towards was to pair a given login with a set of tasks that login would have permissions against. This is something that could easily be pushed back into the ETL layer. The logic to work it out would still be complex, but at the point of data consumption – the bit that really matters – there would be only minimal thinking by the cube engine.
So my solution enforces security through a role scanning a two column table which contains all valid pairings of login and permitted tasks to view. Very fast to execute when browsing data and a lot easier to code for. The hard work is done in loading that table, but the cube application of security is fast and easy to follow. The hierarchy equivalent is a pairing of User Id with all the nodes in the Hierarchy that are permitted to be seen.
As a final note, for those non-Aussie readers the expression “Yeah nah” is a colloquialism that implies that the speaker can’t be bothered with the option in front of them. For example: “Do you want a pie from the Servo, Dave?” “Yeah nah.”
A common requirement in any set of calculations is to create a range of time variants on any measure – Prior Period, Year to Date, Prior Year to Date, Prior Quarter… you think of a time slice and someone will find it useful.
However the downside to this is that in the model you end up maintaining lots of calculations that are all largely doing the same thing. Any good coder likes to parameterise and make code reusable. So how could we do this in Tabular? There is a way that is a very specific variant of the idea of Parameter Tables
Disconnect your Dimensions!
Step one is to unhook your Date Dimension from your fact table. This may seem counter-intuitive, but what it frees you to do is to use the Date dimension as a source of reference data that doesn’t filter your data when you select a date – this simplifies the subsequent calculations significantly. You also need to add to the date dimension all the dates you will need to perform your calculations – Year starts, Prior Year starts, Period starts etc. – this isn’t compulsory but you’ll be grateful later on when you need these dates and don’t have to calculate them on the fly, trust me. Your Date table (I’m going to cease calling it a Dimension, it isn’t any more) will end up looking something like this:
In practice you would hide all the columns apart from the Date as this is the only one that actually gets used by users.
Time for the Variants
Next, we need to create a simple filter table to apply the Time Variant calculations. All it needs is a numeric identifier per variant and a variant name, like so:
This – quite clearly – isn’t the clever bit. The thing to observe with all of these variants is that they create a date range. So what we need to do is calculate the applicable Start and End dates of that range. This is the bit where we are grateful we pre-calculated all those in our Date table. We add two Measures to the table, StartDate and EndDate, which detect which Time Variant is being calculated and then work out the appropriate date, based on the currently selected date. The DAX for StartDate looks like this:
We use a SWITCH statement against the VariantID to detect which Variant we are trying to get the date range start for, then pick the right date from the Date Table. Pre-calculating these in the Date table keeps this part simple.
Add it all up
The final part is to pull these dates into the measure:
I’ve recently wrapped up writing the draft of a PowerPivot book (news on that once it’s published) and as part of having to make sure I “knew my onions” I spent a bit of time working my way around understanding the compression engine. I came across this post – Optimizing High Cardinality Columns in VertiPaq – by Marco Russo, and it sparked my interest in seeing how it could be applied to a couple of common data types – financial amounts and date / times. This first lead to me getting distracted building a tabular model to see how much memory columns (and other objects) used. Now i’m getting back to what took me down that path in the first place: seeing how different data type constructions affect memory usage.
How PowerPivot compresses Data
As an introduction, it really helps to understand how PowerPivot compresses data in the first place*. The key tool it uses is a Dictionary which assigns an integer key to a data value. Then when the data is stored it actually stores the key, rather than the data. When presenting the data, it retrieves the keys and shows the user the values in the dictionary.
To illustrate, in this list of Names and Values:
We have several repetitions of Name. These get stored in the dictionary as follows:
Then, internally PowerPivot stores the data of Names/Values like this:
This results in high compression because a text value takes up much more space than an integer value in the database. This effect multiples the more repetitive (i.e. lower cardinality) the data is. High cardinality data, typically numeric values and timestamps – do not compress as well as the number of dictionary entries is often not much less than the number of actual values.
* Quick caveat: this is the theory, not necessarily the practice. The actual compression algorithms used are proprietary to Microsoft so they may not always follow this pattern.
Splitting Data – the theory
The key to Marco’s approach is to split data down into forms with lower cardinality. So what does that mean?
For a financial amount, the data will be in the form nnnnnnn.dd – i.e. integer and fraction, dollars and cents, pounds and pence, etc. But the key thing is that the cents / pence / “dd’ portion is very low cardinality – there are only one hundred variations. Also, stripping out the “dd” potion will probably end up reducing the cardinality of the number overall. For example, consider these unique 4 numbers:
That is four distinct numbers… but two integer parts and two fraction parts. At this small scale it makes no difference, but for thousands of values it can make a big impact on cardinality.
For a DateTime the data will be in the form dd/mm/yy : hh:mm:ss.sss. You can separate out the time component or round it down to reduce cardinality. Your use case will determine what makes sense, and we will look at both below.
Splitting Data – the practice
Any good theory needs a test, so I created a one million row data set with the following fields:
TranCode: A 3 character Alpha transaction code
TranAmount: A random number roughly between 0.00 and 20,000.00
TranAmountInteger: The Integer part of TranAmount
TranAmountFraction: The Fraction part of TranAmount
TranDateTime: A random date in 2014 down to the millisecond
TranDate: The date part of TranDateTime
TranTime_s: The time part of TranDateTime rounded to the second expressed as a time datatype
TranTime_ms: The time part of TranDateTime rounded to the millisecond expressed as a time datatype
TranTime_num_s: The time part of TranDateTime rounded to the second expressed as an integer datatype
TranTime_num_ms: The time part of TranDateTime rounded to the millisecond expressed as an integer datatype
TranTime_s_DateBaseLined: The time part of TranDateTime rounded to the second expressed as a datetime datatype, baselined to the date 01/10/1900
TranTime_ms_DateBaseLined: The time part of TranDateTime rounded to the millisecond expressed as a datetime datatype, baselined to the date 01/10/1900
The generating code is available here. I’ve used some T-SQL Non Uniform Random Number functions to get more “realistic” data as early drafts of this test were delivering odd results because the data was too uniformly distributed so VertiPaq couldn’t compress it effectively.
You may be wondering why I’ve produced TranTime as time and datetime datatypes – the short answer is Tabular Models treat sql server time datatypes as text datatypes in the tabular model, so I wanted to check if that made a difference as I was getting some strange results for split time.
I then imported the table into a tabular model and processed it, then used the discover_memory_object_usage to work out space consumed by column. The results were this:
There was a clear saving for splitting the financial amounts into integer and fractions – the split column saved around 50% of the space.
DateTime behaved very oddly. Rounding down the precision from milliseconds to seconds brought big savings – which makes sense as the cardinality of the column went from 1,000,000 to 60,000. However splitting it out to just the time component actually increased space used.
I tried fixing this by baselining the time component to a specific date – so all millisecond/second components were added to the same date (01/01/1900) – this basically made no difference.
A more effective variation was to just capture the number of milliseconds / seconds since the start of the date as an integer, which saved about 89% and 92% of space respectively.
Splitting Data – the advice
Though there are certain costs associated with doing so, such as the loss of the ability to do DISTINCTCOUNT on values, but if your model is pushing memory limits then splitting decimal numbers into their integer and fraction (especially currency fields) can make a big difference – my experiments showed 50% and that was using fairly random data – real life tends to be a bit more ordered so you can hope for more savings.
Fundamentally it looks like DateTime values compress poorly, and Time values even more so. A better solution – at least from a compression standpoint – is to store the date value as a Date datatype in the model, and have any time component stored as integers. How this impacts performance when bringing these together at runtime using the DATEADD function is a matter for you to test!
Agile in a BI/DW environment faces a unique set of challenges that make becoming productive more difficult. These issues fall into a couple of categories. First are the difficulties in getting the team to the productivity nirvana promised, which I covered in this post. Second are the difficulties posed by technology and process, which I’ll talk about today.
Some obstructions cannot be moved by thought alone.
Agility in traditional coding environments runs at a very high level like this: User states requirements, Coder develops an application that meets those requirements, test, showcase, done.
In BI/DW environments there process is less contained and has a lot of external dependencies. A user requesting a metric on a report is not a matter of coding to meet that requirement – we need to find the data source, find the data owner, get access to the data, process it, clean it, conform it and then finally put it on the report. Depending on the size and complexity of the organisation this can take anywhere between days and months to resolve.
Agile development as it is traditionally understood, with short sprints and close user engagement works well for reporting and BI when the data has already been loaded into the Warehouse. If you are starting from scratch, your user will often have become bored and wandered off long before you give them any reporting.
(Yes, once again, nobody cares about the back end because it’s boring and complicated)
Rather than move the mountain to Mohammed…
There are some steps you can take to mitigate this. The product backlog is your friend here. Often with some relatively light work on the backlog you can identify which systems you are going to hit and broadly what data you will need from those systems.
On a large scale project you may find that you have multiple systems to target, all of which will vary in terms of time from discovery to availability in the DW. Here I generally advocate switching to a Kanban type approach (i.e. task by task rather than sprint based) where you try and move your tasks forward as best you can, and once you are blocked getting at one system, while you wait for it to unblock move on to another.
As systems get delivered into the EDW you can start moving to delivering BI in a more interactive, sprint based fashion. I generally advocate decoupling the BI team from the DW team for this reason. The DW team work on a different dynamic and timescale to the BI team (though note I count building Data Marts as a BI function, not a DW function). You do run the risk of building Data Warehouse components that are not needed, but knowing you will discarding some effort is part of Agile thinking so shouldn’t be a big concern.
Once again its about people
You may notice that none of the issues I’ve raised here are set in stone technical issues. It’s still about people – the ability of external people to react to or accommodate your needs – the capacity of users to be engaged in protracted development processes – the flexibility of project sponsors not to have a rigid scope.
Good people who can be flexible and accommodate change are the keystone to agile success. No tool or process with ever trump these factors.
Agile in a BI/DW environment faces a unique set of challenges that make becoming productive more difficult. These issues fall into a couple of categories. First are the difficulties in getting the team to the productivity nirvana promised. Second are the difficulties in simply being productive. Today I’ll focus on the first case.
Productivity nirvana is hard to find.
A core principle of Agile is the cross functionality of teams – so if there is slack in demand for one type of resource in a sprint, that resource can help out where there is stress on another. So a coder may pick up some test work, a web developer may help with some database design or a tester may help with some documentation and so on. The end result being the team can pretty much jump in each others shoes for basic tasks and only lean on the specialists for the tricky bits.
In BI/DW this cross-skilling is harder to pull off. The technical specialisation is more extreme – people tend to sit in the ETL, Cube or Report developer buckets and its taken them quite a while to get there. There is occasional crossover between a couple of technologies (usually at the BI end between Cube & report) but true polymaths are very rare. Plus the skills required to be good at any of these technologies tends to need very different mindsets – ETL developers tend to need to be methodical, logical thinkers with a strong eye for details and a love of databases – whereas report developers are often more creative and engage more with people (the business). This makes hopping into other team members shoes quite hard.
Meditations on the path
These things can be overcome to an extent by limiting the domains where cross-skilling is expected. This can be done in smaller teams by focusing the areas where the team can support each other away from the technical – for example testing or documentation can be pretty process driven and an ETL developer can easily test a report. Expectations around cross-skilling need to be reined in and the sprint planned with that in mind. This isn’t to say that cross-skilling can’t arise – but the time to get there is going to be a lot longer.
In larger teams you can look at dividing up the teams into areas where cross-skilling is more practical. Typically I like to Partition the DW and BI teams, though I take the perspective that your data mart ETL developer is part of the BI team which means you do need a bit of a flexible player in that BI ETL role though.
Once again its about people
A topic I like to hammer home is that most of your project concerns are not technical or process driven – it’s all about people, specifically people’s ability and willingness to adapt and learn. Picking team members who can adapt, are willing to adapt and can see the value to themselves in doing so are going to get you to the productivity nirvana that much faster.
As always, thoughts, comments and war stories welcome!