I keep forgetting the syntax for “Not Exists” so I thought I’d write a post for it.
select * from Table1 A where not exists ( select * from Table2 B where A.Col1 = B.Col1 )
The above query is known as “correlation” query.
You need to have the “A” and “B” aliases in the code above. Otherwise you will have to type in the whole table name:
select * from Table1 where not exists ( select * from Table2 where Table2.col1 = Table1.col1 )
Well, not really. You can get away with not specifying the Table2 alias. But you have to specify Table1’s:
select * from Table1 where not exists ( select * from Table2 where col1 = Table1.col1 )
If you take away the Table1, it will return nothing, because there is no correlation between the two tables:
select * from Table1 where not exists ( select * from Table2 where col1 = col1 )
Exists versus In
And while I’m at it I might as well explain the benefit of “Exists” compared to “In”. And the disadvantage.
Here’s “In”: (it’s always easier to explain by example)
select * from Table1 where col1 not in (select col1 from Table2)
If we want to compare 2 columns:
select * from Table1 where col1 not in (select col1 from Table2) and col2 not in (select col2 from table2)
Whereas using “Exists” to compare 2 columns:
select * from Table1 A where not exists ( select * from Table2 B where A.Col1 = B.Col1 and A.Col2 = B.Col2 )
So comparing 2 columns using “In” is not the same as using “Exists”. The “In” is not really comparing 2 columns.
To make the “In” really comparing 2 columns we need to do this:
select * from Table1 where convert(varchar,col1) + '|' + convert(varchar,col2) not in (select convert(varchar,col1) + '|' + convert(varchar,col2) from Table2)
Above, we concatenate the two columns and compare the concatenated string.
Due to time limitation I thought I’d write the performance in another post. But considering that performance is the most important factor for some people, I will write the principles here, and leave the proofing of it for another time.
It is my believe that if the result set to compare to is small, then In is faster. But if the result set to compare to is large, then join is faster.
For example, compare these 2 SQL Statements, one for Exists, one for In:
select * from Table1 A where not exists ( select * from Table2 B where A.Col1 = B.Col1 ) -- Query1 select * from Table1 where col1 not in ( select col1 from Table2 ) -- Query2
If Table2 is small (say 10 rows), then the subquery in Query 2 will run fast, SQL will pin the result in memory and the lookup happens in memory, hence fast. But if Table2 is large (say 4 million rows) then SQL will put the result on disk, and the lookup will be searching the disk, hence slow.
On the other hand, the correlation is implemented as join and is very efficient dealing with 2 large set. Say both Table1 and Table2 contains 4 million rows. If (and that’s a big if) Col1 is indexed on both tables, Query 1 will run fast because Col1 is the join criteria.
The question I had for years is the efficiency of Query 2 to deal with small data set. Say Table 1 is 4 million rows and Table 2 is 10 rows. Can Query 2 beat Query 1? In SQL 7 and 2000 I believe this was the case. But in 2008 I don’t believe that is the case. It depends on the index. And the logic for hash join in SQL Server 2008 has improved a lot. We have 3 types of hash join in SQL Server: In-Memory hash join, Grace hash join and Recursive hash join (see here). Quote: “SQL Server starts by using an in-memory hash join and gradually transitions to grace hash join, and recursive hash join, depending on the size of the build input.” Note that SQL Server doesn’t understand the size of the table if the statistics are out of date, hence it is important to keep the statistics up-to-date.