Managing your history data

This post is, to an extent, a small rant about some design decisions I have been constrained by on my current project. These decisions were made predominantly for one fairly bad reason: they made the architect’s life easier (apologies to the architects if they are reading – but these were bad choices!).

The design choices in question concern the management of history data. In one component of the system this relates to database storage design; in the other, to cube storage design. In both cases the history data is stored in a separate location from the “current” data.

Databases: Why separate history tables are a bad idea

The first – and most compelling – reason for not storing your history data in separate tables from your current tables is that it increases complexity for users. Instead of having one location to look for data, your users now have two.

The second compelling reason is that there is no point to doing this from a storage point of view. SQL Server 2005 and 2008 (Enterprise editions only, admittedly) provide partitioning. This enables the contents of an individual table to be stored in separate locations on different filegroups, so you can store your current day’s data in one location and your history in another. The motivation for doing this is the same as for splitting the data into separate tables – querying the current section will be faster than querying the historic section. In theory, queries against partitioned tables should in fact be faster, as the current data no longer shares a filegroup with the history data.
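
As a minimal sketch of what this looks like, here is a partitioned table split across two filegroups. All the names (the `Sales` table, the `FG_Current` and `FG_History` filegroups, the boundary date) are hypothetical illustrations, not taken from any real system:

```sql
-- Partition on date: rows on or after the boundary value are "current".
-- With RANGE RIGHT, the boundary value itself belongs to the second range.
CREATE PARTITION FUNCTION pfSalesHistory (datetime)
AS RANGE RIGHT FOR VALUES ('2009-01-01');

-- Map the two ranges to different filegroups:
-- everything before the boundary goes to FG_History, the rest to FG_Current.
CREATE PARTITION SCHEME psSalesHistory
AS PARTITION pfSalesHistory TO (FG_History, FG_Current);

-- The table is created on the scheme; users still see one object.
CREATE TABLE dbo.Sales
(
    SaleID   int      NOT NULL,
    SaleDate datetime NOT NULL,
    Amount   money    NOT NULL
) ON psSalesHistory (SaleDate);
```

Queries against `dbo.Sales` are written exactly as before; the storage split is invisible to anyone reading or writing the data.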

Now, there is an overhead associated with designing and maintaining partitions, but I don’t see that it is significantly larger than the overhead of a daily process that archives data into separate tables. Additionally, when maintaining separate history tables you need to split out every single table, whether it receives 10 rows a day or 10 million. With partitioning you can target just the large tables that need the attention.
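
To give a feel for the scale of that overhead, the daily maintenance is a handful of metadata statements rather than a per-table copy job. A sketch, assuming a hypothetical date-based partition function `pfSalesHistory` and scheme `psSalesHistory` already exist:

```sql
-- Nightly "sliding window" maintenance (hypothetical names and dates).
-- Tell the scheme which filegroup the next new partition should use...
ALTER PARTITION SCHEME psSalesHistory NEXT USED FG_Current;

-- ...then split off a new "current" range at today's boundary.
ALTER PARTITION FUNCTION pfSalesHistory () SPLIT RANGE ('2009-06-02');

-- Optionally merge yesterday's boundary so aged data collapses into history.
ALTER PARTITION FUNCTION pfSalesHistory () MERGE RANGE ('2009-06-01');
```

Note that this runs once per partition function, not once per table, which is where the saving over table-by-table archive jobs comes from.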

There are other downsides to maintaining separate tables. If you make a change to a table design, you need to make it in two places. You also need to remember to update your history processes. If your history process fails, you can end up with users getting unexpected query results, or ETL failures when the system loads the next day’s data into the current tables – and untangling it becomes a real mess. If your partition processes fail to run, you just have too much data in one filegroup for a while – unlikely to be fatal.

So if you have large tables you need to split out for performance purposes – do it at the back end, using the power of the database, which is designed to store data efficiently. Keep it away from the users – they neither need to know nor care about your need to keep the data separate. If you want to give them a single object to query for the current day’s data, just use views.
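
For example, a view over a hypothetical `dbo.Sales` table can expose only today’s rows, keeping the (possibly partitioned) base table hidden from users – all names here are illustrative:

```sql
-- Expose only the current day's rows through a view; the user never
-- needs to know how or where the underlying data is stored.
CREATE VIEW dbo.CurrentSales
AS
SELECT SaleID, SaleDate, Amount
FROM   dbo.Sales
-- DATEADD/DATEDIFF trick: strips the time portion, i.e. midnight today.
WHERE  SaleDate >= DATEADD(day, DATEDIFF(day, 0, GETDATE()), 0);
```

Users query `dbo.CurrentSales` for today’s data and `dbo.Sales` for everything, and the storage layout can change underneath without touching either.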

Cubes: Why a separate history Cube is a bad idea

Much of the above applies here – SSAS also has partitions – so you can again store your historic and current data in separate physical locations with the users totally unaware of this. Again there is maintenance overhead, but it balances out against the maintenance effort and risk of keeping two otherwise identical cubes that differ only in their data source.

Use your storage options!

So without banging on about the same things any further, please consider the following two points whenever managing your history data:

  1. How does what I’m planning affect my users?
  2. How does what I’m planning leverage the platform’s capabilities?

5 thoughts on “Managing your history data”

  1. It is particularly bad when you have your history in a different cube…

    Firstly, partitioning your cube will yield the same performance for MDX queries.
    Secondly, it is terrible from a maintenance point of view – we have to replicate all calculations in both cubes, and when something changes, we have to replicate the changes in both places.
    And what about changing cube structures? What about security, etc.?

    You are absolutely right to be unhappy about being forced to implement bad designs. I hate it as well. Wouldn’t it be possible to consolidate the two fact tables before you build a cube on top of them? I would probably attempt to push a change like that through – plenty of reasons to do so. I guess sometimes we are unlucky enough to get asked to do something as “experts” and then not get heard when expressing concerns…

  2. I don’t completely agree about history in the database.
    (from an OLTP perspective, not a DW filled by an ETL process)

    When you keep history in one table, there is an overhead in getting the current version of a record. You either have to search for the latest record (a select max(dateref) or max(id) every time), or you have to search for the “current = Y” record (but in that case you have to insert the new record and update the previous one, so multiple write actions).

    And reconstructing the history can require complicated queries: if a dimension-like table has been updated and a new record created there, the related tables are not created or updated, so when you build your history you have to work out which row of the parent table is correct for the date of the current child row.

    So having a single history table can be a good solution when there is one read for many writes. Otherwise this method helps neither with querying the current record nor with getting the right history records.

    Creating a real history database results in an explosion in the size of the database: each time a parent record is updated (i.e. an insert is done), you also have to duplicate all the child records to link them to the new record (and it is a cascading insert).

    Separating the current and history tables at least solves the “current” access (which is the most common access) and ensures excellent performance there, while the history table provides the capability of tracking what has changed over time.

    Where I’m working now they suffer from exactly this history issue because they chose what you describe, and in fact it adds complexity to the access instead of simplifying it, because all the time they have to undo what the history does (the current view matters more than the history view).

    The mistake: don’t focus on the storage, focus on the usage!
    If you have 1 write, 1000 reads of the current record and 10 reads of the history records, the single-table model slows down all 1010 accesses; separating into two tables slows down only the 10 history accesses.
    If you have 1000 writes, 1 read of the current record and 1 of the history, then the single-table model is good.

    For a DW the problem is different, because the ETL process ensures the integrity of the history.

    There are other factors to consider which can affect the solution; each system is different. So make sure you always validate the usage and the needs before committing to a solution which may ultimately cause problems.

  3. Jerome, you make a lot of valid points, and I agree usage is the key thing – after all, data is being provided as a service and you want to make that service as fast as possible.

    The key point of my post is that storage technologies such as partitioning mean that some of the access concerns you raise go away – so you don’t actually slow the access to current records by maintaining historic ones in the same data structure.

    It’s perhaps not as black and white as I indicated once you get into updating history and the read/write balance changes.

    Thanks for your input!

  4. This is wrong on so many levels I do not know where to start.

    I gather you are neither a storage nor a database expert.

    If I have time later I will illustrate why.

    Dan.
