Data Warehousing and Business Intelligence

20 April 2011

T-SQL: Exists versus In

Filed under: SQL Server — Vincent Rainardi @ 6:31 am
Tags:

I keep forgetting the syntax for “Not Exists” so I thought I’d write a post for it.

select * from Table1 A where not exists
( select * from Table2 B where A.Col1 = B.Col1 ) 

The above query is known as “correlation” query.

You need to have the “A” and “B” aliases in the code above. Otherwise you will have to type in the whole table name:

select * from Table1 where not exists
( select * from Table2 where Table2.col1 = Table1.col1 ) 

Well, not really. You can get away with not specifying the Table2 alias. But you have to specify Table1’s:

select * from Table1 where not exists
( select * from Table2 where col1 = Table1.col1 ) 

If you take away the Table1, it will return nothing, because there is no correlation between the two tables:

select * from Table1 where not exists
( select * from Table2 where col1 = col1 ) 

Exists versus In

And while I’m at it I might as well explain the benefit of “Exists” compared to “In”. And the disadvantage.

Here’s “In”: (it’s always easier to explain by example)

select * from Table1 where col1 not in
(select col1 from Table2) 

If we want to compare 2 columns:

select * from Table1
where col1 not in (select col1 from Table2)
  and col2 not in (select col2 from table2) 

Whereas using “Exists” to compare 2 columns:

select * from Table1 A where not exists
( select * from Table2 B where A.Col1 = B.Col1 and A.Col2 = B.Col2 ) 

So comparing 2 columns using “In” is not the same as using “Exists”. The “In” is not really comparing 2 columns.

To make the “In” really comparing 2 columns we need to do this:

select * from Table1
where convert(varchar,col1) + '|' + convert(varchar,col2) not in
(select convert(varchar,col1) + '|' + convert(varchar,col2) from Table2)

Above, we concatenate the two columns and compare the concatenated string.

Query Performance

Due to time limitation I thought I’d write the performance in another post. But considering that performance is the most important factor for some people, I will write the principles here, and leave the proofing of it for another time.

It is my believe that if the result set to compare to is small, then In is faster. But if the result set to compare to is large, then join is faster.

For example, compare these 2 SQL Statements, one for Exists, one for In:

 select * from Table1 A where not exists
( select * from Table2 B where A.Col1 = B.Col1 ) -- Query1

select * from Table1 where col1 not in
( select col1 from Table2 ) -- Query2 

If Table2 is small (say 10 rows), then the subquery in Query 2 will run fast, SQL will pin the result in memory and the lookup happens in memory, hence fast. But if Table2 is large (say 4 million rows) then SQL will put the result on disk, and the lookup will be searching the disk, hence slow.

On the other hand, the correlation is implemented as join and is very efficient dealing with 2 large set. Say both Table1 and Table2 contains 4 million rows. If (and that’s a big if) Col1 is indexed on both tables, Query 1 will run fast because Col1 is the join criteria.

The question I had for years is the efficiency of Query 2 to deal with small data set. Say Table 1 is 4 million rows and Table 2 is 10 rows. Can Query 2 beat Query 1? In SQL 7 and 2000 I believe this was the case. But in 2008 I don’t believe that is the case. It depends on the index. And the logic for hash join in SQL Server 2008 has improved a lot. We have 3 types of hash join in SQL Server: In-Memory hash join, Grace hash join and Recursive hash join (see here). Quote: “SQL Server starts by using an in-memory hash join and gradually transitions to grace hash join, and recursive hash join, depending on the size of the build input.” Note that SQL Server doesn’t understand the size of the table if the statistics are out of date, hence it is important to keep the statistics up-to-date.

Advertisements

3 Comments »

  1. Vincent,
    only as a small addition: in SQL Server and T-SQL “NOT IN” and “NOT EXISTS” return the same result, but in Oracle “NOT IN” will show an empty result if the subquery contains NULL values (because Oracle thinks NULL is not equal to NULL):


    -- Oracle:
    SQL> select * from t1;

    A
    ----------
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10

    10 Zeilen ausgewählt.

    SQL> select * from t2;

    A
    ----------
    1
    2
    3
    4
    5

    SQL> select * from t1 where t1.a not in (select t2.a from t2);

    A
    ----------
    8
    10
    6
    7
    9

    SQL> insert into t2 values (null);

    1 Zeile wurde erstellt.
    --> one row created with a NULL-value

    SQL> select * from t1 where t1.a not in (select t2.a from t2);

    Es wurden keine Zeilen ausgewählt
    --> no rows selected

    Comment by Martin Preiss — 20 April 2011 @ 9:23 am | Reply

  2. and to complete the example: NOT EXISTS shows the same result in Oracle as in SQL Server

    select *
    from t1
    where not exists (select 1
    from t2
    where t2.a = t1.a)

    A
    -----
    8
    10
    6
    7
    9

    Comment by Martin Preiss — 20 April 2011 @ 9:25 am | Reply


RSS feed for comments on this post. TrackBack URI

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Blog at WordPress.com.

%d bloggers like this: