Data Warehousing and Business Intelligence

28 February 2010

Impact of Clustering the Fact Key to Query Performance

Filed under: Data Warehousing,SQL Server — Vincent Rainardi @ 5:23 pm
Tags: ,

If your fact table has a single column, integer, identity primary key (a.k.a. Fact Key), does it make a difference to the query performance whether the Fact Key is clustered or not? The sole purpose of this article is to answer that question, by testing it.

In previous article I have established that the insert performance is better if the fact key is clustered. But I have not touched the query performance yet, hence this article. Here I will test the query performance by using a typical star join query. Star join is a query where the fact table is joined to several dimensions, and the ‘where clause’ contains some dimension attributes. The measures are usually summed up, with the dimension attributes being on the ‘group by’ clause.

The following is an example of a star join query commonly used in data warehousing:

select D1.Attr1, D2.Attr1, D3.Attr1, sum(measure1), sum(measure2)
from FactPKClus F
inner join dim1 D1 on D1.dimkey1 = F.dimkey1
inner join dim2 D2 on D2.dimkey2 = F.dimkey2
inner join dim3 D3 on D3.dimkey3 = F.dimkey3
where D1.Attr2 between ‘Value1’ and ‘Value2’
and D2.Attr2 between ‘Value1’ and ‘Value2’
and D3.Attr2 between ‘Value1’ and ‘Value2’
group by D1.Attr1, D2.Attr1, D3.Attr1

For this test I will create 3 Fact Tables:

  1. With Fact Key Clustered
  2. With Fact Key Not Clustered
  3. Without Fact Key

I will use 3 dimensions, each ‘connected’ to the fact table using surrogate keys. Each dimension contains 2 attributes and 1000 rows (members). Then I will query the star schema using a star join query. I am using SQL Server 2008 R2, so the star join query will utilise the enhancement in the bitmap filter (which doesn’t require FK to work).

— Create the 3 fact tables
create table FactPKClus
( factkey int identity(1,1),
dimkey1 int, dimkey2 int, dimkey3 int,
measure1 decimal(18,5), measure2 decimal(18,5),
primary key clustered (factkey)
)

create table FactPKNonClus
( factkey int identity(1,1),
dimkey1 int, dimkey2 int, dimkey3 int,
measure1 decimal(18,5), measure2 decimal(18,5),
primary key nonclustered (factkey)
)

create table FactNoPK
( factkey int,
dimkey1 int, dimkey2 int, dimkey3 int,
measure1 decimal(18,5), measure2 decimal(18,5)
)
— Create initial data for fact table
create table data ( col1 int , col2 decimal(20,5))
declare @a int
set @a = 1
while @a <= 1000000
begin
insert into data values (convert(int,rand()*1000), 10.1*@a)
set @a = @a + 1
end –10mins
— Populate the fact tables with initial data
insert into FactPKClus (dimkey1, dimkey2, dimkey3, measure1, measure2)
select col1, col1+1, col1+2, col2, col2+7 from data

insert into FactPKNonClus (dimkey1, dimkey2, dimkey3, measure1, measure2)
select col1, col1+1, col1+2, col2, col2+7 from data

insert into FactNoPK (dimkey1, dimkey2, dimkey3, measure1, measure2)
select col1, col1+1, col1+2, col2, col2+7 from data
— Create 3 dims, each with 1000 members and 2 attributes
create table dim1 (dimkey1 int identity(1,1) primary key clustered, attr1 varchar(20), attr2 varchar(20))
create table dim2 (dimkey2 int identity(1,1) primary key clustered, attr1 varchar(20), attr2 varchar(20))
create table dim3 (dimkey3 int identity(1,1) primary key clustered, attr1 varchar(20), attr2 varchar(20))

declare @a int
set @a = 1
while @a <= 1000
begin
insert into dim1 (attr1, attr2) values (‘Value’+convert(varchar,@a), ‘Value’+convert(varchar,@a))
insert into dim2 (attr1, attr2) values (‘Value’+convert(varchar,@a), ‘Value’+convert(varchar,@a))
insert into dim3 (attr1, attr2) values (‘Value’+convert(varchar,@a), ‘Value’+convert(varchar,@a))
set @a = @a + 1
end –1s
— create non clustered indexes on the dim tables
create nonclustered index dim1attr1 on dim1(attr1)
create nonclustered index dim1attr2 on dim1(attr2)
create nonclustered index dim2attr1 on dim1(attr1)
create nonclustered index dim2attr2 on dim1(attr2)
create nonclustered index dim3attr1 on dim1(attr1)
create nonclustered index dim3attr2 on dim1(attr2)
— query the star schema
select D1.Attr1, D2.Attr1, D3.Attr1, sum(measure1), sum(measure2)
from FactPKClus F
inner join dim1 D1 on D1.dimkey1 = F.dimkey1
inner join dim2 D2 on D2.dimkey2 = F.dimkey2
inner join dim3 D3 on D3.dimkey3 = F.dimkey3
where D1.Attr2 between ‘Value1’ and ‘Value2’
and D2.Attr2 between ‘Value1’ and ‘Value2’
and D3.Attr2 between ‘Value1’ and ‘Value2’
group by D1.Attr1, D2.Attr1, D3.Attr1 –5.87s
select D1.Attr1, D2.Attr1, D3.Attr1, sum(measure1), sum(measure2)
from FactPKNonClus F
inner join dim1 D1 on D1.dimkey1 = F.dimkey1
inner join dim2 D2 on D2.dimkey2 = F.dimkey2
inner join dim3 D3 on D3.dimkey3 = F.dimkey3
where D1.Attr2 between ‘Value1’ and ‘Value2’
and D2.Attr2 between ‘Value1’ and ‘Value2’
and D3.Attr2 between ‘Value1’ and ‘Value2’
group by D1.Attr1, D2.Attr1, D3.Attr1 –5.87s
select D1.Attr1, D2.Attr1, D3.Attr1, sum(measure1), sum(measure2)
from FactNoPK F
inner join dim1 D1 on D1.dimkey1 = F.dimkey1
inner join dim2 D2 on D2.dimkey2 = F.dimkey2
inner join dim3 D3 on D3.dimkey3 = F.dimkey3
where D1.Attr2 between ‘Value1’ and ‘Value2’
and D2.Attr2 between ‘Value1’ and ‘Value2’
and D3.Attr2 between ‘Value1’ and ‘Value2’
group by D1.Attr1, D2.Attr1, D3.Attr1 –5.14s
 
/* System Spec: Acer Aspire 5051 AWXMi, AMD Turion 64 Mobile MK-36 799 MHz,
512 MB RAM, Win XP SP3, SQL Server 2008 R2 Nov CTP */
 
Conclusion: whether the identity PK is clustered or not, the star join query performance is the same. The query performance is better if the fact table has no fact key.

Measurement method: see here.

Advertisements

2 Comments »

  1. Why “The query performance is better if the fact table has no fact key”?
    As I know, cluster index on fact key should improve performance.

    Anyone can explain it for me?
    thanks.

    Comment by ab c — 6 September 2013 @ 11:05 am | Reply

    • I am not sure why the query on the fact table with no fact key completes in 5.14s whereas on the one with clustered index on the fact key the query took 5.87s. The rows in these two fact tables are in the same physical order because they were loaded from the same source table. Both fact tables are not indexed on the dim key columns.
      One possible explanation is: the clustered index on fact key causes additional bytes on the row length, which prolong the sorting on the dim keys for the join operations.

      Comment by Vincent Rainardi — 8 September 2013 @ 7:02 am | Reply


RSS feed for comments on this post. TrackBack URI

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Create a free website or blog at WordPress.com.

%d bloggers like this: