Data Quality – the Business Problem

Steve Bennett over at Oz Analytics has just done a couple of good posts on data quality from the perspective of how to “sell” the issue of poor data quality to the business and make them realise it’s not just a technical problem, but can also cost them money. The relevant posts are here and here.

The flipside to his approach of quantifying the cost to the business is that we data monkeys, when faced with a data quality problem, should ask whether it's actually worth solving. It may grate to have crappy data sitting in our lovely warehouse, but if the cost of fixing it exceeds the benefit realised, we may sometimes just have to let it be.*

* Of course this thought makes me feel a little dirty, and I think I need to take a shower in some nice strongly typed data with enforced referential integrity :)

4 thoughts on “Data Quality – the Business Problem”

  1. BI Monkey, I know how you feel. I have recently been working on a solution which brings in survey data from spreadsheets: a whole lot of free-text fields, with many answers that are very similar but not quite the same (“management” vs “mgt”). As much as I would love to have the fields boxed up into nice little sets of data, it is a costly exercise to fix. The solution: if it bothers them, fix it later and bring it in then. What I found was that the free-text fields were rarely looked at – they were purely what I would class as filler data, data that is collected because it could be useful someday. I guess when that day comes they will clean up all 7 years’ worth of survey data; until then I can only shudder at the messy data.

  2. Interesting problem. From a business perspective I would be asking them why they are going to the trouble of collecting data they have no intention of looking at – is it because a) they think it isn’t useful information, or b) they have no idea how to analyse it?

    If a) then they shouldn’t expend effort collecting this data!

    If b) there are text mining options available – the Term Extraction component in SSIS may help them get started down that path by breaking the free text into a manageable, analysable form. It may be worth you spending some time doing that, just to demonstrate that free text need not be an inscrutable form of data (there’s a rough sketch of the idea after this comment).

    Cheers, James
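    To make the “term extraction” idea a bit more concrete, here is a minimal Python sketch of what the technique boils down to – it is not the SSIS component itself, and the example answers are invented in the spirit of the “management or mgt” problem above:

```python
from collections import Counter
import re

# Toy stand-in for the idea behind term extraction: pull candidate terms
# out of free-text survey answers and count how often each one appears.
STOPWORDS = {"the", "and", "of", "a", "an", "in", "to", "for", "with", "at"}

def extract_terms(answers):
    """Return a Counter of normalised single-word terms across all answers."""
    counts = Counter()
    for answer in answers:
        # Lower-case and keep only runs of letters
        words = re.findall(r"[a-z]+", answer.lower())
        counts.update(w for w in words if w not in STOPWORDS and len(w) > 2)
    return counts

# Invented example answers (hypothetical data, not from the real survey)
answers = ["Management", "Mgt", "Business Management",
           "Human Resource Management", "management and HR"]
print(extract_terms(answers).most_common(3))
# -> [('management', 4), ('mgt', 1), ('business', 1)] or similar
```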

  3. Haven’t looked at the Term Extraction component before; I will head to your example and explanation after this comment, but I have to say I am not quite sure I understand how best to use it – it appears to almost be a full-text index component. In my case I take a survey of what students do after they leave school, and the free-text data is the course they undertake if they go to University. So the data could read Business, Commerce, Mgt & HRM, Human Resource Management, Business Management – and that is all for the one course at the one University. I can’t quite get my head around how I would approach this, from both a data model perspective and a dimensional usability perspective.

    All I can say is that it is very apparent that most 18-19 year olds are not exactly sure what their course is actually called!!

  4. Well, what you would have to do after extracting the terms is manually categorise them to form a dimension – professional text miners may have better ideas, of course. Either that, or just use the most common terms to paint a picture – for example, across 1,000 students the stats could be: 400 mentioned Business, 300 mentioned Science, 400 mentioned computers. It’s not neat dimensional categorisation, but it provides a picture that analysts can work with (there’s a rough sketch of the manual categorisation idea below).

    Of course, with teenagers I’m surprised it’s not all in “txt spk” :)
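    For what the manual categorisation might look like in practice, here is a minimal Python sketch – the categories and keywords are invented for illustration, not a real course dimension, and unmatched answers are deliberately left as “Unknown” rather than guessed:

```python
# Hand-built term-to-category lookup; categories and keywords are
# hypothetical examples, not a real dimension.
CATEGORY_KEYWORDS = {
    "Business": {"business", "commerce", "management", "mgt", "hrm"},
    "Science": {"science", "biology", "chemistry", "physics"},
    "Computing": {"computer", "computers", "computing"},
}

def categorise(answer):
    """Map one free-text course name to the first matching category."""
    words = set(answer.lower().replace("&", " ").split())
    for category, keywords in CATEGORY_KEYWORDS.items():
        if words & keywords:
            return category
    return "Unknown"  # leave the mess visible rather than guessing

courses = ["Business", "Commerce", "Mgt & HRM",
           "Human Resource Management", "Business Management"]
for course in courses:
    print(course, "->", categorise(course))
```

    Counting the results of that mapping gives you the “400 mentioned Business” style picture described above, and the “Unknown” bucket tells you how much of the free text still needs a human eye.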
