Productivity issues for Agile in BI/DW – Part 2: Technology

Agile in a BI/DW environment faces a unique set of challenges that make becoming productive more difficult. These issues fall into a couple of categories. First are the difficulties in getting the team to the promised productivity nirvana, which I covered in this post. Second are the difficulties posed by technology and process, which I’ll talk about today.

Some obstructions cannot be moved by thought alone.

Agility in traditional coding environments runs, at a very high level, like this: the user states requirements, the coder develops an application that meets those requirements, then test, showcase, done.

In BI/DW environments the process is less contained and has a lot of external dependencies. A user requesting a metric on a report is not simply a matter of coding to meet that requirement – we need to find the data source, find the data owner, get access to the data, process it, clean it, conform it and then finally put it on the report. Depending on the size and complexity of the organisation this can take anywhere between days and months to resolve.

Agile development as it is traditionally understood, with short sprints and close user engagement, works well for reporting and BI when the data has already been loaded into the Warehouse. If you are starting from scratch, your user will often have become bored and wandered off long before you give them any reporting.

(Yes, once again, nobody cares about the back end because it’s boring and complicated)

Rather than move the mountain to Mohammed…

There are some steps you can take to mitigate this. The product backlog is your friend here. Often with some relatively light work on the backlog you can identify which systems you are going to hit and broadly what data you will need from those systems.

On a large scale project you may find that you have multiple systems to target, all of which will vary in terms of time from discovery to availability in the DW. Here I generally advocate switching to a Kanban type approach (i.e. task by task rather than sprint based) where you try to move your tasks forward as best you can; once you are blocked on one system, move on to another while you wait for it to unblock.

As systems get delivered into the EDW you can start moving to delivering BI in a more interactive, sprint based fashion. I generally advocate decoupling the BI team from the DW team for this reason. The DW team work on a different dynamic and timescale to the BI team (though note I count building Data Marts as a BI function, not a DW function). You do run the risk of building Data Warehouse components that are not needed, but knowing you will discard some effort is part of Agile thinking, so it shouldn’t be a big concern.

Once again it’s about people

You may notice that none of the issues I’ve raised here are set-in-stone technical issues. It’s still about people – the ability of external people to react to or accommodate your needs, the capacity of users to stay engaged in protracted development processes, the flexibility of project sponsors not to insist on a rigid scope.

Good people who can be flexible and accommodate change are the keystone to agile success. No tool or process will ever trump these factors.

Productivity issues for Agile in BI/DW – Part 1: People

Agile in a BI/DW environment faces a unique set of challenges that make becoming productive more difficult. These issues fall into a couple of categories. First are the difficulties in getting the team to the productivity nirvana promised. Second are the difficulties in simply being productive. Today I’ll focus on the first case.

Productivity nirvana is hard to find.

A core principle of Agile is the cross functionality of teams – so if there is slack in demand for one type of resource in a sprint, that resource can help out where there is stress on another. So a coder may pick up some test work, a web developer may help with some database design or a tester may help with some documentation, and so on. The end result is that the team can pretty much jump into each other’s shoes for basic tasks and only lean on the specialists for the tricky bits.

In BI/DW this cross-skilling is harder to pull off. The technical specialisation is more extreme – people tend to sit in the ETL, Cube or Report developer buckets and it’s taken them quite a while to get there. There is occasional crossover between a couple of technologies (usually at the BI end, between Cube and Report) but true polymaths are very rare. Plus the skills required to be good at any of these technologies tend to need very different mindsets – ETL developers tend to be methodical, logical thinkers with a strong eye for detail and a love of databases – whereas report developers are often more creative and engage more with people (the business). This makes hopping into other team members’ shoes quite hard.

Meditations on the path

These things can be overcome to an extent by limiting the domains where cross-skilling is expected. In smaller teams this can be done by steering the areas where the team supports each other away from the deeply technical – for example testing or documentation can be pretty process driven, and an ETL developer can easily test a report. Expectations around cross-skilling need to be reined in and the sprint planned with that in mind. This isn’t to say that cross-skilling can’t arise – but the time to get there is going to be a lot longer.

In larger teams you can look at dividing up the teams into areas where cross-skilling is more practical. Typically I like to partition the DW and BI teams, though I take the perspective that your data mart ETL developer is part of the BI team – which means you do need a fairly flexible player in that BI ETL role.

Once again it’s about people

A topic I like to hammer home is that most of your project concerns are not technical or process driven – it’s all about people, specifically people’s ability and willingness to adapt and learn. Picking team members who can adapt, are willing to adapt and can see the value to themselves in doing so is going to get you to the productivity nirvana that much faster.

As always, thoughts, comments and war stories welcome!

IAPA 2014 Salary Survey

The IAPA salary survey came out a couple of months back, and though it is Analytics focused it has some interesting results for those of us in the BI world. My key takeaways follow.

From a purely self-interested point of view, Analytics is a well paid profession and it’s getting more so. Further, recruiters are reporting that finding people is getting harder, which indicates the talent pool is not all that deep and has been sucked fairly dry already – something I experience regularly when trying to find BI talent.

If you want a job in the field, you’re best off being in Sydney or Melbourne. There also appears to be a minimum education level of a bachelor’s degree, with most professionals holding a master’s or higher. Marketing is one of the biggest employers of analysts.

For those in the field there seems to be a mid career slump in satisfaction (around the ten year mark). Fresh starters are all excited and lifers seem happy too, but somewhere in the middle the enthusiasm fades.

Despite all the market enthusiasm, a significant proportion of respondents reported an ongoing challenge in getting their organisation to value or act on analytics findings – supporting Eugene Dubossarsky’s claims that businesses invest heavily in vanity analytics so they can claim “me too” rather than to derive real value.

Technical takeaways – for all the noise, Big Data is still a Small Concern and regular sized analytical problems are prevalent. Excel is the #1 tool used to work with data, and if you are more of an integrator, good SQL skills are king.

Last of all, there still seems to be a heavy focus on social media analytics – despite its dubious value – but it pays better. Something which underscores the vanity analytics claims further.

Agility in BI

Over the last couple of years I have been increasingly exposed to Agile methodologies for delivering BI projects. I even once accidentally got myself trained as a Scrum Master. The topic itself isn’t exactly new (I blogged about it at the end of 2009) but adoption is starting to pick up as recognition grows that it can significantly improve the results of BI projects. What follows are some early stage ramblings on my experience so far, as I try to pull together a more coherent story – one I think needs to be told and sold.

Agile Principles

For those of you unfamiliar with Agile, it stems from the Agile Manifesto drawn up by some experienced developers who figured there was a better way than waterfall. The aim is to deliver faster, earlier and closer to user expectations. This is achieved by focusing on high value deliverables, delivering them piecemeal in descending order of value and engaging very closely with the business. All sounds pretty sensible so far.

Agile People

There is a large variety of Agile methodologies in play – Scrum, Extreme Programming, Lean Startup… and none of them matter. Or at least, which one you pick doesn’t really matter. What matters more is whether your team does – or doesn’t – have the ability to work in an agile fashion. What I mean by this is that team members are self starting, self managing individuals with a low level of ego – for example, they are happy to pitch in for the team and test some other developer’s work, rather than go “No, I code C++ and that’s all I will do.” Good Agile teams are made up of members who understand they are in a team, and that a team is not a set of discrete Lego bricks of skills but a fuzzy stew of ingredients where contributions blend into each other.

Now “the team” may just be your little pod of developers going out on a limb and trying something new. Or, in a case I worked with last year, it may be a colossal financial services organisation where “the team” also encompassed very senior executives. Agility is a fragile thing and can just as easily be broken by a bottom-rung developer not pulling their weight as by a senior manager not understanding what the team is doing and trying to influence them to behave in non-agile ways.

Adoption Issues

The take-up of Agile is often fraught with frustration and disappointment that a nirvana of productivity doesn’t arise as soon as the Agile team forms. The reasons behind this are manifold – not least because it takes a while for the team to start working together with the new approach. Most methodologies acknowledge that initially productivity will be worse, rather than improved, until the team has formed, stormed and normed its way to a higher productivity level. Many businesses struggle with the idea that they must let the patient get worse before they can get better.

Something else I have seen frequently is that the business engagement is poor – and not for want of trying. Before Agile is attempted, the business will often complain that IT doesn’t involve them enough. Once it is attempted, they then grumble that it involves them too much. This is incredibly frustrating to see, as it underscores the business perception that IT is the servant and they are the master, rather than a valued partner trying to help them achieve their goals.

The Road to Success

I’m still gathering my thoughts on this, but I suspect part of the road to success in doing BI in an Agile way is going to involve raising red flags for the classic modes of failure (e.g. no business sponsor), in the same way you would for any BI project. Too often I’ve seen Agile fail because it is driven out of a development team wanting to do better but unable to get the support needed to make it happen. The other killer is the “WAgile” trap, where the team tries to operate as an Agile unit under a Waterfall project management approach.

I’m also keen to hear any readers’ war stories in the comments.

Top Six Worst Practices in BI – Vendor Nonsense

A while ago I was pointed at this TDWI white paper – “Top Six Worst Practices in Business Intelligence” – which turned out to be a classic TDWI vendor-driven pile of steaming misinformation. Verbatim, here are their claims:

  1. Buying what analysts want without considering the needs of other users
  2. Deploying new BI tools without changing the Excel mindset into a BI platform mindset
  3. Making a BI purchasing decision based on one hot feature, or buying a feature list rather than laying the foundation for a comprehensive BI strategy
  4. Lack of a concrete data quality strategy or treating data quality as an afterthought
  5. Not taking a “mobile-first” approach, or not considering the needs of mobile BI users
  6. Ignoring new data, new sources, and new compliance requirements

Just in case anyone is in danger of believing this, I thought I’d give a rebuttal as I have a headache and am feeling grumpy.

1. Buying what analysts want without considering the needs of other users

The paper makes the somewhat bizarre (and unsupported) claim that:

Most companies make business intelligence (BI) purchasing decisions based on input from just one group of users – the business analysts. This single perspective, however, creates many problems.

I can safely say that in any purchasing decision I’ve ever come across, the BAs’ input has been somewhere between nada and zip. The reality is that it’s driven by corporate policies and what legacy systems are already in place, licensed and supported – and once in a blue moon some idiot in management sees Tableau or Qlikview and buys that without considering any of the underlying problems it will cause.

There is a grain of truth in this point – any purchasing decision that doesn’t consider the end users’ preferred way to use information is doomed to failure. The back end is irrelevant – end users do not care about the Data Warehouse platform, the ETL tool or even the Cube platform you use. Just the front end. And most of the time, that front end is Excel.

2. Deploying new BI tools without changing the Excel mindset into a BI platform mindset

This is a vendor problem, not a user problem – users prefer Excel to the vendor’s tool. Sorry about that to anyone who isn’t Microsoft. It’s kind of odd, because this point contradicts their first point.

3. Making a BI purchasing decision based on one hot feature

Yes. This explains Tableau and Qlikview’s popularity. However the solution is not – as the erstwhile vendor claims – their product. In fact, I’m not even sure why this is on the list. Technical issues are rarely the source of BI project failure, so it doesn’t really matter what product you choose – and I’m sure every vendor in the world will recoil in horror at this uncomfortable truth. What matters are people and data. The tool connecting the two is often inconsequential.

4. Lack of a concrete data quality strategy or treating data quality as an afterthought

This I agree with. Data Quality is a huge pain point.

5. Not taking a “mobile-first” approach

If it’s relevant to your organisation, sure. In my current organisation it is utterly irrelevant. In many projects I’ve worked on it’s been a nice-to-have that got discarded quickly due to its poor value. If mobile will affect adoption, then it’s relevant and must be considered. If it won’t, it doesn’t need to be thought about.

6. Ignoring new data, new sources, and new compliance requirements

I’m ambivalent about this one. The implication is less about ignoring and more about being unable to adapt. Rigid, locked-down BI systems rapidly become irrelevant because BI must change as the business changes. However, this is as much a function of people and process as it is of technology.

How about ONE worst practice in Business Intelligence?

Try this instead: “Believing your BI implementation’s success or failure will be impacted by technology more than by the people using it”.

…as a cheeky second practice – “Believing anything vendors tell you”.

Merry Christmas and a Happy New Year!

Exploring Memory Usage in Tabular Models

Trying to understand what is going on under the hood with a Tabular model is possible using the SSAS DISCOVER_OBJECT_MEMORY_USAGE rowset – but the results aren’t going to win any prizes for accessibility. The mighty Kasper de Jonge had a crack at making it accessible back in 2012 using a PowerPivot model. However, it didn’t help me filter down the way I wanted, so I decided to up it a notch with a Tabular model on my Tabular model’s memory usage.
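If you just want to eyeball the raw rowset first, you can query the DMV directly against your Tabular instance from an MDX query window in SSMS – a minimal sketch, with the key columns as I recall them:

-- Run against the SSAS Tabular instance from an MDX/DMX query window in SSMS.
-- Returns one row per object the instance is tracking memory for:
--   OBJECT_PARENT_PATH = dotted path of the object's parents (the hierarchy)
--   OBJECT_ID          = the object itself, frequently an unfriendly GUID
--   OBJECT_MEMORY_SHRINKABLE / NONSHRINKABLE = memory the engine can / can't release
SELECT  OBJECT_PARENT_PATH,
        OBJECT_ID,
        OBJECT_MEMORY_SHRINKABLE,
        OBJECT_MEMORY_NONSHRINKABLE
FROM    $SYSTEM.DISCOVER_OBJECT_MEMORY_USAGE

It’s that parent path / object id pair that the model below reshapes into something browsable.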

A Tabular Model on DISCOVER_OBJECT_MEMORY_USAGE

The main features of the Tabular model are:

  • A Measure for Memory usage (kind of important)
  • A Hierarchy for exploring the structure of the memory use
  • An Attribute for the Model (so you can filter on just the model you want)
  • An Attribute for the Model Object (e.g. Hierarchy, Column Storage, Data Sources, etc.)
  • An Attribute to identify Server objects (such as Server Assemblies) vs Model objects

Before we get into the gnarly details, here’s a look at what comes out the other side:

DISCOVER_OBJECT_MEMORY_USAGE model output

What you get is the capacity to browse down the hierarchy and apply a few useful filters:

  • Filter to the Model(s) you are interested in
  • Filter for the type of Model Object (e.g. Column, Hierarchy) you want to focus on
  • Filter for Server / Model level objects (largely useful for just getting rid of server level noise)

Things that work well, and not so well.

Actually, it mostly works pretty well. It cleans up most of the GUIDs that make navigation tricky, categorises objects usefully (for me, anyway) and the logic baked into the view that does most of the work is not too hard to follow.

The biggest problem is that the hierarchy of objects doesn’t always make sense – there seem to be Model level objects sitting at the Server level with no attached model. This is probably more a gap in my understanding of how the server handles certain objects than a genuine fault.

However, I’m always happy to get some feedback on this and any suggestions – especially on how to categorise things properly – will be greatly appreciated.

How to get this in your environment

The solution comes in a few parts:

  • SQL Table to hold the contents of DISCOVER_OBJECT_MEMORY_USAGE (a rough sketch of its shape follows this list)
  • SSIS Package to extract the results from DISCOVER_OBJECT_MEMORY_USAGE into the table
  • SQL View to translate, clean and categorise the output from DISCOVER_OBJECT_MEMORY_USAGE
  • A Tabular model to help structure exploring the output
  • An Excel spreadsheet to show the results
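
As a rough idea of the shape of that landing table – the real definition is in the pack’s dmv_SSAS_Tabular_DISCOVER_OBJECT_MEMORY_USAGE.sql script, so treat this as an illustrative guess only (the LoadDate audit column in particular is hypothetical):

-- Illustrative sketch only – use the script in the download pack for the real table.
CREATE TABLE dbo.dmv_SSAS_Tabular_DISCOVER_OBJECT_MEMORY_USAGE
(
    OBJECT_PARENT_PATH          nvarchar(4000) NULL,  -- dotted path to the parent object
    OBJECT_ID                   nvarchar(1000) NULL,  -- the object itself
    OBJECT_MEMORY_SHRINKABLE    bigint         NULL,  -- bytes the engine can give back
    OBJECT_MEMORY_NONSHRINKABLE bigint         NULL,  -- bytes it can't
    LoadDate                    datetime       NOT NULL DEFAULT (GETDATE())  -- hypothetical: when the snapshot was taken
);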

If you want to get this up and running, the pack here has everything you need. In order to install it do the following:

  1. Run the SQL script dmv_SSAS_Tabular_DISCOVER_OBJECT_MEMORY_USAGE.sql to create the destination table
  2. Run the SQL script vw_dmv_SSAS_Tabular_DISCOVER_OBJECT_MEMORY_USAGE.sql to create the translating view
  3. Run the SSIS package in the SSAS_DMV project to load the table
  4. Deploy the SSAS project TabularObjectMemoryUsage to create the tabular model
  5. Open the spreadsheet ObjectMemoryUsage.xlsx to explore your results

Along the way in steps 1-5 you’ll have to set connections/configurations to ones that work for your environment.

Have fun playing!

SSAS Tabular at Scale

The cube on my project has been hitting some apparent concurrency issues, so I’ve been hunting for advice on how to tune the hardware (model tuning has already gone a long way). Unfortunately Microsoft don’t have any reference architectures, and their only other advice was to try an appliance in Direct Query mode – which was not practical in our circumstances anyway.

As usual, the gents at SQLBI had something useful to say on the subject, based on a customer case study detailed in this white paper. While it’s well worth a read in full, I’ll summarise the key findings:

  • Standard server CPUs don’t perform well enough – you will need to look at faster CPUs with a large cache
  • Faster CPUs are better than more CPUs in terms of return on investment for performance
  • Fast RAM is a must
  • For NUMA aware servers you need to set the Node Affinity to a single node, preferably using a Hyper-V host for your Tabular server

Setting aside the last point, which is a bit deep in server config and requires more explanation, the key thing is to look for fast CPUs. They found that workstation blades were generally better than server blades, and some of the best performance they got was out of one of their devs’ gaming rigs!

We’ll be trying some of this out and hopefully I can keep you posted with the results. I have more material on monitoring Tabular in the pipeline now that I’ve finished my PowerPivot book (to be published soon).

Also, don’t forget my upcoming DW Starter Training on Nov 24/25 in Sydney

October Sydney training roundup – MS BI, Cloud, Analytics

The end of the year is closing in fast, but there are still plenty of chances to learn from specialist providers Agile BI, Presciient and, of course, me!

Topics cover the full spread of DW, BI and Analytics so there’s something for every role in the data focused organisation.

Build your Data Warehouse in SQL Server & SSIS with the BI Monkey

Nov 24/25 – Are you about to build your Data Warehouse with Microsoft tools and want to do it right first time?

This course is designed to help a novice understand what is involved in building a Data Warehouse, both from a technical architecture and a project delivery perspective. It also gives you basic skills in the tools the Microsoft Business Intelligence suite offers to help you do that.

Get more detail here

Agile BI workshops

Power BI specialist Agile BI brings you product updates on this key new self service BI technology:

Oct 15 – Power BI workshop – Excel new features for reporting and data analysis – more detail here

Oct 30 – What Every Manager Should Know About Microsoft Cloud, Power BI for Office 365 and SQL Server 2014 – more detail here

Presciient Training

Dr Eugene Dubossarsky shares his deep business and technical expertise across a range of advanced and basic analytics. Full details here, but the key list is:

Dec 9/10 – Predictive analytics and data science for big data

Dec 11/12 – Introduction to R and data visualisation

Dec 16/17 – Data analytics for fraud and anomaly detection, security and forensics

Dec 18/19 – Business analytics and data for beginners

 

PowerPivot calculated columns are not dynamic

A quick and dirty one – in attempting some clever dynamic security modelling in DAX, I was warned about a gotcha in my design: calculated columns are only evaluated when the model is processed, so any approach based on calculated columns was doomed to failure. I didn’t quite believe it, so I decided to do a road test in PowerPivot using a simple data set of one Dimension and one Fact, like below:

Simple Data Model

I hooked them up in PowerPivot with a relationship between “Column” on both tables. Then, using the ISFILTERED() function, I created a couple of calculations. One, at row level, would return a value of 1 if I filtered on the Attribute column:

=IF(ISFILTERED(Dim[Attribute]),1,0)

I then put a SUM on top of that. I also added one at measure level, performing a similar function:

FilterCheckAsMeasure:=IF(ISFILTERED(Dim[Attribute]),1,0)

Then I created a Pivot Table to check the results, and got this:

Results

The takeaway is that filtering on the Attribute was picked up by the measure, but the calculated column did not change value.

You can check this yourself in my sample workbook: DAX Calculated Columns Evaluation Order.

What does the documentation say?

Well, nothing terribly clear, but on this page there is this small paragraph:

When a column contains a formula, the value is computed for each row. The results are calculated for the column as soon as you create the formula. Column values are then recalculated as necessary, such as when the underlying data is refreshed.

It doesn’t say exactly when “recalculated as necessary” is necessary, but the implication is that it’s a model level change that triggers it, rather than the switching of filter context or, in my case, a change in the user viewing the model.

So in summary, we have to assume that our calculated column values are fixed at initial calculation, formula change or data load (or processing, in a Tabular model), and there isn’t a way to make the value in a given cell change after that.

Creating effective date ranges from multiple sources using Window Functions

Sometimes dimension data is managed across multiple tables. Just to complicate things, sometimes this data has independent effective date ranges on each of those sources. So when we try to tie our data together, working out which item of data is effective when is a bit of a challenge.

A picture speaks a thousand lines of blog post, so the picture below spells it out:

Date Ranges from multiple sources

Table A has a set of data with different effective periods. Table B has a set of data for the same attribute with a completely independent set of effective periods. In data terms, it looks like this:

Date ranges from multiple sources sample data

The challenge is to join them together so we get the right combination of attributes effective at the right time, as per the first picture. Now, there is a way to do it through a join with careful selection of start / end dates in a CASE statement and filtering out of records using a WHERE clause. However, that has the downfall that it cannot cope with records where there is no crossover of data – so records “1-“, ”4-” and “5-” have to be added in later through a separate process.

The alternative is to get the window functions voodoo doll out, and stretch the brain a little so you can do it all in one query.

Step one in this exercise is realising that each table’s start dates are candidate start dates for the combined ranges, and each table’s end dates (plus a day) are candidate start dates too. So we need to UNION the start dates from both tables with the end dates (plus a day) from both tables, like so:

SELECT    ID,    [A Start Date] AS [Start Date]
FROM      Table_A
UNION
SELECT    ID,    [B Start Date]
FROM      Table_B
UNION
-- All end dates are start dates + 1 too
SELECT    ID,    ISNULL(DATEADD(d,1,[A End Date]),'31 Dec 2099')
FROM      Table_A
UNION
SELECT    ID,    ISNULL(DATEADD(d,1,[B End Date]),'31 Dec 2099')
FROM      Table_B

Now, this gives us a full set of every possible start date – which is a starting point. The end result looks like this:

Union Results

We can repeat the same trick for end dates, and then do a cartesian join between the two sets to get a combination of every possible start and end date. Now we need some criteria by which to select the right date pair. If we add a DATEDIFF to the resultset, it becomes obvious we want to pick the smallest date range:

Crossjoin results with DATEDIFF
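
If you’re following along without the sample scripts, here’s a sketch of that step – the start date UNION from above, its mirror image for end dates, both wrapped in CTEs and then paired up per ID (the DateRanges download at the end of the post is the authoritative version):

-- Candidate start dates: the UNION query above, wrapped in a CTE
WITH StartDates AS
(
    SELECT ID, [A Start Date] AS [Start Date] FROM Table_A
    UNION
    SELECT ID, [B Start Date] FROM Table_B
    UNION
    SELECT ID, ISNULL(DATEADD(d,1,[A End Date]),'31 Dec 2099') FROM Table_A
    UNION
    SELECT ID, ISNULL(DATEADD(d,1,[B End Date]),'31 Dec 2099') FROM Table_B
),
-- Candidate end dates: the same trick in reverse (every start date, less a day, is a candidate end date)
EndDates AS
(
    SELECT ID, ISNULL([A End Date],'31 Dec 2099') AS [End Date] FROM Table_A
    UNION
    SELECT ID, ISNULL([B End Date],'31 Dec 2099') FROM Table_B
    UNION
    SELECT ID, DATEADD(d,-1,[A Start Date]) FROM Table_A
    UNION
    SELECT ID, DATEADD(d,-1,[B Start Date]) FROM Table_B
)
-- Pair every candidate start with every candidate end for the same ID and measure the span
SELECT      s.ID,
            s.[Start Date],
            e.[End Date],
            DATEDIFF(d, s.[Start Date], e.[End Date]) AS [Days In Range]
FROM        StartDates s
INNER JOIN  EndDates   e
        ON  e.ID = s.ID
       AND  e.[End Date] >= s.[Start Date]   -- discard pairs that end before they start
ORDER BY    s.ID, s.[Start Date], e.[End Date]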

A window function gives us the intelligence to pick the right row. If we apply a ROW_NUMBER() over a partition of Start Date, ordering by End Date, we then just have to select the first row of each partition:

The final result
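
As a sketch, the final query just wraps that pairing in the ROW_NUMBER() and keeps row 1 of each partition – I’ve partitioned by ID as well as Start Date so different dimension members don’t get mixed together, and repeated the two date CTEs (in compressed form) so the statement stands on its own:

WITH StartDates AS
(
    SELECT ID, [A Start Date] AS [Start Date] FROM Table_A UNION
    SELECT ID, [B Start Date] FROM Table_B UNION
    SELECT ID, ISNULL(DATEADD(d,1,[A End Date]),'31 Dec 2099') FROM Table_A UNION
    SELECT ID, ISNULL(DATEADD(d,1,[B End Date]),'31 Dec 2099') FROM Table_B
),
EndDates AS
(
    SELECT ID, ISNULL([A End Date],'31 Dec 2099') AS [End Date] FROM Table_A UNION
    SELECT ID, ISNULL([B End Date],'31 Dec 2099') FROM Table_B UNION
    SELECT ID, DATEADD(d,-1,[A Start Date]) FROM Table_A UNION
    SELECT ID, DATEADD(d,-1,[B Start Date]) FROM Table_B
),
CandidateRanges AS
(
    SELECT  s.ID,
            s.[Start Date],
            e.[End Date],
            ROW_NUMBER() OVER
            (
                PARTITION BY s.ID, s.[Start Date]   -- one partition per candidate start date
                ORDER BY     e.[End Date]           -- earliest end date = smallest range
            ) AS RangeRank
    FROM        StartDates s
    INNER JOIN  EndDates   e
            ON  e.ID = s.ID
           AND  e.[End Date] >= s.[Start Date]
)
SELECT    ID, [Start Date], [End Date]
FROM      CandidateRanges
WHERE     RangeRank = 1               -- first row of each partition = the tightest valid range
ORDER BY  ID, [Start Date]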

Now we have a complete set of effective date ranges on which to join our attribute tables!

Grab a copy of the query, including sample data scripts here: DateRanges

Don’t forget to check out my upcoming training “Build your Data Warehouse in SQL Server & SSIS with the BI Monkey” – a primer course for anyone wanting to build a first DW using Microsoft tools.

 
