Using SUMMARIZE to get a Maximum Per Day in a Period

A quick snippet on the following scenario. I had daily data of user sessions in a system and wanted to know, for a given period, the maximum number of unique users on a single day. My data looked like this:

Data sample

A user could connect multiple times per day, so I needed a DISTINCTCOUNT to get unique users. However for a given period I needed to know this per day. So for the period I needed to calculate the number of unique users per day in that period – which meant I needed to create an interim table using SUMMARIZE.

This resulted in this construction:

SUMMARIZE(Logs,[Session Date],"UsersPerDay",DISTINCTCOUNT([User]))

“Logs” is my source data table. “Session Date” is what I am grouping my table by to get the results per day. The context of the period I am looking at (be it year, month, quarter, whatever) is managed by the date filters I apply to the table. “UsersPerDay” is just the name I assign to my measure, which is the DISTINCTCOUNT of the User field.

What I will end up with is an interim table which has – per day – the number of distinct users. Though it will not be materialised, in memory it would look like this:


Then, to get the maximum in a day for a period, we just need the MAX of the UsersPerDay in this table. As it's an expression, we lean on MAXX:

MaxUsersPerDay:=MAXX(SUMMARIZE(Logs,[Session Date],"UsersPerDay",DISTINCTCOUNT([User])),[UsersPerDay])

And there we have it! Note that in MAXX the expression we take the maximum of is our custom "UsersPerDay" column created in the SUMMARIZE function. IntelliSense won't pick this up as it's not part of the model, but the formula works just fine.

Read More

Option Syntax for OData.Feed in PowerQuery

Technical post – for those who need to add options / parameters to their OData feeds in PowerQuery, syntax help is a little thin in the official documentation.

So for those moving beyond the basic connection of
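That basic connection is just the service URL on its own – sketched here with a placeholder URL:

```m
let
    Source = OData.Feed("https://example.com/service.svc/")
in
    Source
```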


If you want to add headers, then do this:
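Headers go in a record as the second argument. A sketch (the URL and header values are placeholders; the Accept value shows the double-double-quote escape):

```m
let
    Source = OData.Feed(
        "https://example.com/service.svc/",
        [Authorization = "Bearer abc123",
         Accept = "application/json;odata=""verbose"""]
    )
in
    Source
```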


Noting that the escape for a double quote is another double quote in your headers string.

If you want to use a boolean option, such as EnableBatch, then do this:
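A sketch, passing null for the headers argument so the options record lands in the third slot:

```m
let
    Source = OData.Feed(
        "https://example.com/service.svc/",
        null,
        [EnableBatch = true]
    )
in
    Source
```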


Note the use of the plain null with no surrounding quotes if there’s no 2nd parameter value being provided.

Similarly for numeric parameters:
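For example, using the MaxUriLength option (the value here is arbitrary):

```m
let
    Source = OData.Feed(
        "https://example.com/service.svc/",
        null,
        [MaxUriLength = 2048]
    )
in
    Source
```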


However, Timeout needs to be expressed as a Duration type – so you need to do this manipulation:
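A sketch using #duration(days, hours, minutes, seconds) – here a ten-minute timeout:

```m
let
    Source = OData.Feed(
        "https://example.com/service.svc/",
        null,
        [Timeout = #duration(0, 0, 10, 0)]
    )
in
    Source
```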


I have done more fun things than try to work out this syntax – I hope this saves you some time!

Read More

Datazen first client test!

I got to use Datazen for the first time in anger with a client this week, and my experience was a bit of a mixed bag. There are elements of it which are neat but a fair few niggles.

Things to love

It’s pretty. The simple visualisations mean it is easy to create a nice looking, uncluttered dashboard with minimal fuss and tweaking. The responsive design means cross device functionality is pretty good and looks nice. It’s also quick to build and change content.

Quick learning tips

First up, let’s get some practical learnings shared:

  1. Use views or stored procs behind your data if hitting a SQL source. Datazen has no query editor (just a big text box) and doesn't always handle parameters gracefully. Plus debugging is a pain as error messages are often less than helpful (e.g. "Data Preview Failed").
  2. Set report-friendly field names in your query as you can't always manage them in the designer – sometimes you can, sometimes you can't.
  3. Selecting the 'All' option on a Selection List Navigator sends back '' (an empty string) as the parameter value to the query, so handle that rather than NULL.
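Tying tips 1 and 3 together, a T-SQL sketch of a proc you might put behind a data view (the table, column and parameter names are made up for illustration):

```sql
CREATE PROCEDURE dbo.GetSalesByRegion
    @Region NVARCHAR(50)
AS
SELECT Region, SUM(Amount) AS TotalAmount
FROM dbo.Sales
WHERE @Region = ''          -- 'All' selected: empty string, not NULL
   OR Region = @Region      -- a specific region selected
GROUP BY Region;
```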

Now, some major drawbacks:

  1. Consistency of cross-platform behaviour is not great. I found some drillthroughs didn't work on iOS or Android. Windows 8 seems to be the go-to client. It's not fatal, but for fussy clients it's a hard sell that this cross-platform tool doesn't work as expected.
  2. The Win 7 Publisher app is unstable, buggy and seems to have features missing – such as proper configuration for report drillthrough. It’s only been around a few weeks so it’s forgivable but if you have to use it seriously, make sure you have a Win 8 client somewhere to do your development work on.
  3. The charting is actually quite limited. There’s no stacked bar, for example. Line charts can only be by time. Labeling and colours are pretty hard to control, often rendering components useless. A great example is the category chart (bar chart) – the renderer will sometimes decide not to display category labels – which then means you just have a nice picture of some bars with no context as to what each one is, like this:
Could be cat population by state for all I know

Finally some irritations:

These are some of the things that got annoying as I used the product – not an exhaustive list – and small enough I’d expect them to be fixed relatively soon.

  1. You cannot update columns on a grid component if the underlying column names change – you have to rebuild the component (a small task but annoying during development)
  2. You cannot set the Low/Neutral/High ranges for gauge columns on indicator grids so they match settings for other gauges
  3. You cannot align numbers – and they are aligned left which is not good dataviz practice
  4. There is no handling for outliers on heatmaps so one extreme value will screw your shading
  5. You can’t cascade drillthrough to a level below
  6. The data preview after creating a data view has no Scroll bar so if there’s a lot of fields you can’t see them
  7. There are maps provided but you have to work out how they are keyed yourself so you can connect your data (to be addressed in a future post)
  8. You can’t “oversize” the canvas so phone users can scroll down.
  9. Nobody’s using it – or at least talking about it – so help is hard to find.

A lot of irritation boils down to “I want to set that but I can’t”. This I’m sure is partly design, partly product maturity.

My takeaway.

After a week with the product I get a real sense that it’s not even a v1 product yet. Maybe v0.9. There’s lots of niggles in the product – and not just at the back end where users can’t see them. I could tolerate the back end weaknesses if the end user experience was great, but there’s holes there. Still, there’s a lot of good that can be done. It’ll be interesting to see how it fares given PowerBI has such a huge feature overlap.

Read More

New DQS features in SQL 2016 CTP2

In this quick post I'm going to share the new features introduced in Data Quality Services as part of SQL Server 2016 CTP2.

Well, that didn’t take long. As far as I can tell, nothing has changed. So still hanging out for:

  • A decent UI
  • Ability to reference data services other than Azure Marketplace (e.g. MDS!)

Noises from MSFT are not exactly encouraging, so I'm not expecting much in future CTPs at this point.

Read More

Migrating Package Connections to Project Connections in SSIS

One of the best changes in SSIS 2012 was the concept of a Project Connection – a connection manager that can be used across the whole project instead of being limited to package scope, which previously meant you had to recreate and configure effectively the same connection in every single package you have. This feature is great… when you are starting a new project.

However, a recent task I got handed was to migrate a 2008 project to the 2012 project model. All very sensible, to ease maintenance, eliminate XML configurations and generally bolster security. Then I got to work…

Converting a Package Connection to a Project Connection

Ah, the easy part. Pick your connection, right click, convert to project connection and… ta daa! You have a project connection!

SSIS Project Connection

Now… what about all the other packages?

Pointing every other Package connection to the Project connection

This is a little harder. The good bit is your project connection will appear in your available connection managers. The bad bit is there is no way to tell the SSIS designer to use this one instead of your old one. You can either manually repoint every data flow, Execute SQL, script and whatever other task happens to be using the package connection to the project connection – easy if your package is small and simple – or get X(ML)treme! Fortunately thanks to this post by Jeff Garretson I was reminded that SSIS packages are just XML, and XML can be edited much faster than a package using the designer. Jeff’s post only resolved how to fix up the Data Flow – I had a pile of Control Flow tasks to fix up too – so here’s how to get it done without hours of coding.

Step 1: In designer mode, get the Name & GUID of all connections to be replaced and what to replace them with

You can get this from the properties window when you have a connection manager selected in the SSIS designer:

Connection Manager Properties

Step 2: Switch to code view and replace all Data Flow connections

You can find where a package connection is being used in a data flow by looking for the following in the XML:

connectionManagerID="Package.ConnectionManagers[{Package Connection Name}]"

connectionManagerRefId="Package.ConnectionManagers[{Package Connection Name}]"

Replace this text with:

connectionManagerID="{Project Connection GUID}:external"

connectionManagerRefId="Project.ConnectionManagers[{Project Connection Name}]"

Do this for all connections, then return to design view and delete the package connections.

Step 3: Refresh the connections

This shouldn't strictly be necessary, but it seems to refresh all the metadata.

  1. Close the package
  2. Open the package in code view

Step 4: Replace all Control Flow connections

  1. Find and replace {Package Connection GUID} with {Project Connection GUID}

Step 5: Refresh the connections again

  1. Close the package, open in design view
  2. Check for errors
  3. Done!

This almost certainly could be scripted by someone with the right chops, time and motivation.

A useful side note – if you have a Project Connection and a Package Connection with the same name (get link) then your project will start throwing errors.

Read More

PowerPivot Succinctly – new eBook by me!

Want to learn more about PowerPivot? Syncfusion have published a new eBook “PowerPivot Succinctly” written by my good self.

A quick overview of the contents:

  1. PowerPivot Model Basics
  2. Using your PowerPivot Model
  3. Sharing your PowerPivot Model
  4. A Note on Instability
  5. Deep Dive: The xVelocity Engine

It’s a free download – so go help yourself!

Read More

Managing complex security in SSAS Tabular – Yeah Nah edition

A couple of times recently I have come up against requirements which have required some fairly complex logic to apply security. One involved some fairly gnarly relationships coming from multiple directions, the other involved grinding through Hierarchies from parent nodes down to permitted viewable children.

The problem with both cases is that though the logic can sometimes be written (albeit usually in an ugly as hell manner) – the functions needed to do so perform atrociously. For complex relationships you are obligated to take in context after context, changing filters and doing all sorts of DAX voodoo. As we know by now, avoiding relationships is good for performance. Hierarchies can be managed through the PATH function, but it’s a text operation that is far from speedy.

Let’s give a quick example of some complex security – consider the below data model:

Complex Model

Here the security controls who can see what has been spent on a Task in the FactTable object. Who can see what depends on their Role and/or the Unit they are in. There is also a 1:many relationship between a person and the logins they can use.

So for dynamic security you need to navigate from the User Id to the Person and assess what Unit they are in for the Unit based permissions. You also need to assess what Role they are in to get the Role based permissions.

I took one look at this and shuddered at the messy DAX I was going to have to write, plus how terribly it would perform.

Do it in the Cube? Yeah Nah.

So I thought “Yeah nah” and decided the cube was the wrong place to be doing this. Ultimately all I was trying to get towards was to pair a given login with a set of tasks that login would have permissions against. This is something that could easily be pushed back into the ETL layer. The logic to work it out would still be complex, but at the point of data consumption – the bit that really matters – there would be only minimal thinking by the cube engine.

So my solution enforces security through a role scanning a two column table which contains all valid pairings of login and permitted tasks to view. Very fast to execute when browsing data and a lot easier to code for. The hard work is done in loading that table, but the cube application of security is fast and easy to follow. The hierarchy equivalent is a pairing of User Id with all the nodes in the Hierarchy that are permitted to be seen.
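As a sketch, the role's row filter ends up as nothing more than a lookup against that pairing table. The table and column names below are made up for illustration – SecurityPairs is the two-column login/task table loaded by the ETL:

```dax
=CONTAINS(
    SecurityPairs,
    SecurityPairs[Login], USERNAME(),
    SecurityPairs[TaskKey], FactTable[TaskKey]
)
```

All the relationship-walking and hierarchy-grinding happens once, at load time, rather than on every query.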

As a final note, for those non-Aussie readers the expression “Yeah nah” is a colloquialism that implies that the speaker can’t be bothered with the option in front of them. For example: “Do you want a pie from the Servo, Dave?” “Yeah nah.”

Read More

Dynamic Time Variant calculations in DAX

A common requirement in any set of calculations is to create a range of time variants on any measure – Prior Period, Year to Date, Prior Year to Date, Prior Quarter…   you think of a time slice and someone will find it useful.

However, the downside is that you end up maintaining lots of calculations in the model that are all largely doing the same thing. Any good coder likes to parameterise and make code reusable. So how could we do this in Tabular? There is a way: a very specific variant of the idea of Parameter Tables.

Disconnect your Dimensions!

Step one is to unhook your Date Dimension from your fact table. This may seem counter-intuitive, but what it frees you to do is to use the Date dimension as a source of reference data that doesn’t filter your data when you select a date – this simplifies the subsequent calculations significantly. You also need to add to the date dimension all the dates you will need to perform your calculations – Year starts, Prior Year starts, Period starts etc. – this isn’t compulsory but you’ll be grateful later on when you need these dates and don’t have to calculate them on the fly, trust me. Your Date table (I’m going to cease calling it a Dimension, it isn’t any more) will end up looking something like this:

Date Table

In practice you would hide all the columns apart from the Date as this is the only one that actually gets used by users.

Time for the Variants

Next, we need to create a simple filter table to apply the Time Variant calculations. All it needs is a numeric identifier per variant and a variant name, like so:

Variants Table

This – quite clearly – isn’t the clever bit. The thing to observe with all of these variants is that they create a date range. So what we need to do is calculate the applicable Start and End dates of that range. This is the bit where we are grateful we pre-calculated all those in our Date table. We add two Measures to the table, StartDate and EndDate, which detect which Time Variant is being calculated and then work out the appropriate date, based on the currently selected date. The DAX for StartDate looks like this:
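As a hedged sketch – assuming VariantIDs 1 to 3 and pre-calculated Date table columns named YearStart and PriorYearStart – StartDate would look something like:

```dax
StartDate :=
SWITCH(
    MIN(Variants[VariantID]),
    1, MIN(DateTable[Date]),           -- Current Period
    2, MIN(DateTable[YearStart]),      -- Year to Date
    3, MIN(DateTable[PriorYearStart]), -- Prior Year to Date
    MIN(DateTable[Date])
)
```

EndDate follows the same pattern against the appropriate end-of-range columns.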


We use a SWITCH statement against the VariantID to detect which Variant we are trying to get the date range start for, then pick the right date from the Date Table. Pre-calculating these in the Date table keeps this part simple.

Add it all up

The final part is to pull these dates into the measure:
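Again as a sketch under the same assumed names (a Transactions fact table with its own Date column and an Amount to sum):

```dax
TotalAmount :=
CALCULATE(
    SUM(Transactions[Amount]),
    DATESBETWEEN(Transactions[Date], [StartDate], [EndDate])
)
```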


This works by using the DATESBETWEEN function to apply a custom date range filter to the Transactions table – a filter we create dynamically through our StartDate and EndDate calculations.

Our end result:

Time Variant Results

We can see above that, for a single selected date, we can generate a range of Start and End dates and apply those to our single summarising function to create multiple Time Variants.

The sample workbook is here: DAX Time Variants

Read More

Great PASS BIDW VC Video on how Vertipaq Compression works in SSAS Tabular / PowerPivot

Embedded below is a great video from Marco Russo on how the compression engine works in SSAS Tabular / PowerPivot:

This is from the SQL PASS BI Virtual Chapter YouTube channel – well worth nosing around now that they post all their sessions there (handy for us in Oz, where the timings are usually not great).

Read More

Shrink Tabular column space used by over 50% using a simple trick

I've recently wrapped up writing the draft of a PowerPivot book (news on that once it's published) and, as part of making sure I "knew my onions", I spent a bit of time working my way around the compression engine. I came across this post – Optimizing High Cardinality Columns in VertiPaq – by Marco Russo, and it sparked my interest in seeing how it could be applied to a couple of common data types: financial amounts and date/times. This first led to me getting distracted building a tabular model to see how much memory columns (and other objects) used. Now I'm getting back to what took me down that path in the first place: seeing how different data type constructions affect memory usage.

How PowerPivot compresses Data

As an introduction, it really helps to understand how PowerPivot compresses data in the first place*. The key tool it uses is a dictionary which assigns an integer key to each data value. When the data is stored it actually stores the key, rather than the value. When presenting the data, it retrieves the keys and shows the user the values from the dictionary.

To illustrate, in this list of Names and Values:

Names and Values

We have several repetitions of Name. These get stored in the dictionary as follows:

Names Dictionary

Then, internally PowerPivot stores the data of Names/Values like this:

PowerPivot Stored Data

This results in high compression because a text value takes up much more space than an integer value in the database. The effect multiplies the more repetitive (i.e. lower cardinality) the data is. High cardinality data – typically numeric values and timestamps – does not compress as well, as the number of dictionary entries is often not much less than the number of actual values.

* Quick caveat: this is the theory, not necessarily the practice. The actual compression algorithms used are proprietary to Microsoft so they may not always follow this pattern.

Splitting Data – the theory

The key to Marco’s approach is to split data down into forms with lower cardinality. So what does that mean?

For a financial amount, the data will be in the form nnnnnnn.dd – i.e. integer and fraction, dollars and cents, pounds and pence, etc. The key thing is that the cents / pence / "dd" portion is very low cardinality – there are only one hundred variations. Also, stripping out the "dd" portion will probably reduce the cardinality of the number overall. For example, consider these four unique numbers:

  • 4.95
  • 4.50
  • 7.95
  • 7.50

That is four distinct numbers… but two integer parts and two fraction parts. At this small scale it makes no difference, but for thousands of values it can make a big impact on cardinality.

For a DateTime the data will be in the form dd/mm/yy : hh:mm:ss.sss. You can separate out the time component or round it down to reduce cardinality. Your use case will determine what makes sense, and we will look at both below.

Splitting Data – the practice

Any good theory needs a test, so I created a one million row data set with the following fields:

  • TranCode: A 3 character Alpha transaction code
  • TranAmount: A random number roughly between 0.00 and 20,000.00
  • TranAmountInteger: The Integer part of TranAmount
  • TranAmountFraction: The Fraction part of TranAmount
  • TranDateTime: A random date in 2014 down to the millisecond
  • TranDate: The date part of TranDateTime
  • TranTime_s: The time part of TranDateTime rounded to the second expressed as a time datatype
  • TranTime_ms: The time part of TranDateTime rounded to the millisecond expressed as a time datatype
  • TranTime_num_s: The time part of TranDateTime rounded to the second expressed as an integer datatype
  • TranTime_num_ms: The time part of TranDateTime rounded to the millisecond expressed as an integer datatype
  • TranTime_s_DateBaseLined: The time part of TranDateTime rounded to the second expressed as a datetime datatype, baselined to the date 01/10/1900
  • TranTime_ms_DateBaseLined: The time part of TranDateTime rounded to the millisecond expressed as a datetime datatype, baselined to the date 01/10/1900

The generating code is available here. I've used some T-SQL Non Uniform Random Number functions to get more "realistic" data: early drafts of this test were delivering odd results because the data was too uniformly distributed, so VertiPaq couldn't compress it effectively.
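For a flavour of the approach, the split columns can be derived with T-SQL expressions along these lines (a sketch, not the actual generating script; source table name is made up):

```sql
SELECT
    TranAmount,
    FLOOR(TranAmount)                                   AS TranAmountInteger,
    CAST((TranAmount - FLOOR(TranAmount)) * 100 AS INT) AS TranAmountFraction,
    CAST(TranDateTime AS DATE)                          AS TranDate,
    -- seconds / milliseconds since midnight, as plain integers
    DATEDIFF(SECOND,
             CAST(CAST(TranDateTime AS DATE) AS DATETIME),
             TranDateTime)                              AS TranTime_num_s,
    DATEDIFF(MILLISECOND,
             CAST(CAST(TranDateTime AS DATE) AS DATETIME),
             TranDateTime)                              AS TranTime_num_ms
FROM dbo.SourceData;
```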

You may be wondering why I've produced TranTime as both time and datetime datatypes – the short answer is that Tabular Models treat SQL Server time datatypes as text, so I wanted to check whether that made a difference, as I was getting some strange results for split time.

I then imported the table into a tabular model, processed it, and used the DISCOVER_OBJECT_MEMORY_USAGE DMV to work out the space consumed by each column. The results were this:

Split Column Memory Usage

There was a clear saving for splitting the financial amounts into integer and fractions – the split column saved around 50% of the space.

DateTime behaved very oddly. Rounding down the precision from milliseconds to seconds brought big savings – which makes sense as the cardinality of the column went from 1,000,000 to 60,000. However splitting it out to just the time component actually increased space used.

I tried fixing this by baselining the time component to a specific date – so all millisecond/second components were added to the same date (01/01/1900) – this basically made no difference.

A more effective variation was to just capture the number of milliseconds / seconds since the start of the date as an integer, which saved about 89% and 92% of space respectively.

Splitting Data – the advice

There are certain costs associated with doing so, such as losing the ability to do a DISTINCTCOUNT over the original values, but if your model is pushing memory limits then splitting decimal numbers into their integer and fraction parts (especially currency fields) can make a big difference – my experiments showed 50%, and that was using fairly random data. Real life tends to be a bit more ordered, so you can hope for more savings.

Fundamentally it looks like DateTime values compress poorly, and Time values even more so. A better solution – at least from a compression standpoint – is to store the date value as a Date datatype in the model, and have any time component stored as integers. How this impacts performance when bringing these together at runtime using the DATEADD function is a matter for you to test!

Read More