Data Warehousing and Business Intelligence

22 April 2011

T-SQL: Exists versus In – Performance Comparison

Filed under: SQL Server — Vincent Rainardi @ 4:25 pm
Tags:

In this test I’m going to query which numbers exist in Table1 but not Table2, using “Exist” and “In”.

Exist Query:

select * from Table1 A where not exists
( select * from Table2 B where B.col1 = A.col1 )

In Query:

select * from Table1 where col1 not in
( select col1 from Table2 )

There will be 2 tests:

  • Test 1: Table2 is large
  • Test 2: Table2 is small

I believe that in Test 1 “exist” is a lot faster, but in Test 2 “in” is slightly faster than “exist”

Test 1: Table 2 is large

Data Setup:
Table1 contains numbers from 1 to 100,000.
Table2 contains numbers from 10 to 100,000.

Expected results: Exist is faster than In.

Script to populate both tables with data:

truncate table table1;

declare @i int
set @i = 1
while @i <= 100000
begin
insert into Table1 values (@i, @i)
set @i = @i + 1
end;

truncate table table2;

declare @i int
set @i = 10
while @i <= 100000
begin
insert into Table2 values (@i, @i)
set @i = @i + 1
end;

Results:

  • Exist Query: average of 167 ms.
  • In Query: average of 1892 ms.

This is as I predicted.

Test 2: Table 2 is small

Data Setup:
Table1 contains 100,000 rows. Col 1 contains numbers from 0 to 50.
Table2 contains 49 rows. Col 1 contains numbers from 2 to 50.

Expected results: In is slightly faster than Exist.

Script to populate both tables with data:

truncate table table1;

declare @i int
set @i = 1
while @i <= 100000
begin
insert into Table1 values (@i%50, @i)
set @i = @i + 1
end;

select * from Table1

truncate table table2;

declare @i int
set @i = 2
while @i <= 50
begin
insert into Table2 values (@i, @i)
set @i = @i + 1
end;

select * from Table2

Results:

  • Exist Query: average of 219 ms.
  • In Query: average of 253 ms.

So I was wrong. Even if Table 2 is small, “Exist” is still faster than “in” (slightly). But if Table 2 is large, “Exist” is a lot faster than “in”. So it is better to use Exist. Always.

2 Comments »

  1. Vincent,
    that’s very interesting – when I repeat your example “Test1” on SQL Server 2008 I see in the execution plan that NOT EXISTS uses a simple Hash Match (Right Anti Semi Join) Operation with two child nodes and with table scans on both tables. For the NOT IN case I see a strange combination of Nested Loops (Left Anti Semi Join) with a Top- and a Row Count Spool- element to gather the data for Table1, which is then again joined to Table1 by Hash Match (Right Anti Semi Join).

    When I repeat your example in Oracle 11.2.0.1 I see for both queries a plan similar to the NOT EXISTS case in MS SQL (there is a small difference in the plans because the NOT IN case has to be NULL aware in Oracle; but it’s always an HASH JOIN RIGHT ANTI with full scans on the tables) – and there is not difference in performance.

    Obviously it’s still very hard for query optimization software to identify queries with identical semantic – and to find the best plan: http://iggyfernandez.wordpress.com/2010/12/07/which-query-is-better%E2%80%94part-iii-no-explain-plan-to-rule-them-all/.

    Comment by Martin Preiss — 22 April 2011 @ 6:33 pm | Reply

  2. The issue is NOT IN vs NOT EXISTS. IN and EXISTS are equivalent. Check http://sqlinthewild.co.za/index.php/2010/02/18/not-exists-vs-not-in/

    Comment by Jon — 20 August 2013 @ 10:27 am | Reply


RSS feed for comments on this post. TrackBack URI

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Blog at WordPress.com.

%d bloggers like this: