In this test I’m going to query which numbers exist in Table1 but not Table2, using “Exist” and “In”.
Exist Query:
select * from Table1 A where not exists
( select * from Table2 B where B.col1 = A.col1 )
In Query:
select * from Table1 where col1 not in
( select col1 from Table2 )
There will be 2 tests:
- Test 1: Table2 is large
- Test 2: Table2 is small
I believe that in Test 1 “exist” is a lot faster, but in Test 2 “in” is slightly faster than “exist”
Test 1: Table 2 is large
Data Setup:
Table1 contains numbers from 1 to 100,000.
Table2 contains numbers from 10 to 100,000.
Expected results: Exist is faster than In.
Script to populate both tables with data:
truncate table table1;
declare @i int
set @i = 1
while @i <= 100000
begin
insert into Table1 values (@i, @i)
set @i = @i + 1
end;
truncate table table2;
declare @i int
set @i = 10
while @i <= 100000
begin
insert into Table2 values (@i, @i)
set @i = @i + 1
end;
Results:
- Exist Query: average of 167 ms.
- In Query: average of 1892 ms.
This is as I predicted.
Test 2: Table 2 is small
Data Setup:
Table1 contains 100,000 rows. Col 1 contains numbers from 0 to 50.
Table2 contains 49 rows. Col 1 contains numbers from 2 to 50.
Expected results: In is slightly faster than Exist.
Script to populate both tables with data:
truncate table table1;
declare @i int
set @i = 1
while @i <= 100000
begin
insert into Table1 values (@i%50, @i)
set @i = @i + 1
end;
select * from Table1
truncate table table2;
declare @i int
set @i = 2
while @i <= 50
begin
insert into Table2 values (@i, @i)
set @i = @i + 1
end;
select * from Table2
Results:
- Exist Query: average of 219 ms.
- In Query: average of 253 ms.
So I was wrong. Even if Table 2 is small, “Exist” is still faster than “in” (slightly). But if Table 2 is large, “Exist” is a lot faster than “in”. So it is better to use Exist. Always.
Vincent,
that’s very interesting – when I repeat your example “Test1” on SQL Server 2008 I see in the execution plan that NOT EXISTS uses a simple Hash Match (Right Anti Semi Join) Operation with two child nodes and with table scans on both tables. For the NOT IN case I see a strange combination of Nested Loops (Left Anti Semi Join) with a Top- and a Row Count Spool- element to gather the data for Table1, which is then again joined to Table1 by Hash Match (Right Anti Semi Join).
When I repeat your example in Oracle 11.2.0.1 I see for both queries a plan similar to the NOT EXISTS case in MS SQL (there is a small difference in the plans because the NOT IN case has to be NULL aware in Oracle; but it’s always an HASH JOIN RIGHT ANTI with full scans on the tables) – and there is not difference in performance.
Obviously it’s still very hard for query optimization software to identify queries with identical semantic – and to find the best plan: http://iggyfernandez.wordpress.com/2010/12/07/which-query-is-better%E2%80%94part-iii-no-explain-plan-to-rule-them-all/.
Comment by Martin Preiss — 22 April 2011 @ 6:33 pm |
The issue is NOT IN vs NOT EXISTS. IN and EXISTS are equivalent. Check http://sqlinthewild.co.za/index.php/2010/02/18/not-exists-vs-not-in/
Comment by Jon — 20 August 2013 @ 10:27 am |