Why to avoid DISTINCT and GROUP BY to get unique records

This is a quick and dirty post on the use of DISTINCT or GROUP BY to get unique records, based on something I helped a developer with over the last couple of weeks.

Their thought process was that because they were getting duplicate records, the easiest way to get rid of them was to slap a DISTINCT at the start of the query to get unique results. Which, in a sense, is OK – because it worked (sort of).However there’s two very good reasons why this is not always a good approach.

#1: Your query is wrong

If you are getting back duplicate records, what it probably means is that you are really doing your query wrong. The below example is an admittedly imperfect example of this – as the first query returns far more than intended – but was close to what I was dealing with:

USE AdventureWorks

/* Query 1: Using DISTINCT to try to eliminate duplicates */

select    DISTINCT
WHEN sc.ContactTypeID = 11 THEN ‘Y’
END    AS    ‘OwnerContact’
from    Sales.Store s
left join Sales.StoreContact sc
ON s.CustomerID = sc.CustomerID

/* Query 2: Using a properly formed WHERE clause */

select    s.Name,
‘Y’ AS    ‘OwnerContact’
from    Sales.Store s
left join Sales.StoreContact sc
ON s.CustomerID = sc.CustomerID
WHERE sc.ContactTypeID = 11

What I’m trying to illustrate with the example above is that if you consider more carefully what records you are bringing back in your joins, you are less likely to end up with duplicates. By making sure you are only joining to tables in such a way as to bring back the data you need is going to reduce the risk of other errors creeping in.

#2: Performance

If you are getting duplicate records, you are bringing back more data than you need. On top of this the DISTINCT or GROUP BY operations are having to go over the whole returned data set to identify unique records. This can get pretty expensive pretty quicky in terms of database operations.

From my perspective badly performing queries are a lesser sin than incorrect ones. I doubt many business users will be making decisions based on query length, but they will on the data you serve up to them.

Summing up

All I want to do in this post is make you pause and think before doing a DISTINCT or GROUP BY purely to eliminate duplicates – especially if you don’t really understand why you are getting them. A better designed and more accurate query can often get rid of the dupes and cut off the risk of bad data in the future.

Update 25 Jul 2011: Mark Caldwell articulates it better than me @ SQL Team Blog: Why I Hate DISTINCT

3 thoughts on “Why to avoid DISTINCT and GROUP BY to get unique records

  1. Not a very good example at all. You’re comparing apples to oranges in your example. In your first select assuming Joe Schmoe is the only one with a ContactID = 11 you will get in your result set…

    Name OwnerContact
    Joe Schmoe Y
    John Schmoe N
    Peter Schmoe N

    In your second query you will only get in your result set…

    Name OwnerContact
    Joe Schmoe Y

    Two totally different results depending on what you need. You may need John and Peter Schmoe in your result set and you have just filter them out with your WHERE clause.

  2. I will continue to use GROUP BY where I need to. But yes, I avoid DISTINCT like the plague

    On point #1 Query 2, you have a left join with a condition on the right table in the where clause; this results an implicit inner join because only ContactTypeID = 11 can match as the SQL engine eliminates all NULLs. So I guess I have to agree with Carl. It doesn’t highlight the problem

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <s> <strike> <strong>