Data Warehousing and Data Science

28 February 2010

Impact of Clustering the Fact Key to Query Performance

Filed under: Data Warehousing,SQL Server — Vincent Rainardi @ 5:23 pm

If your fact table has a single column, integer, identity primary key (a.k.a. Fact Key), does it make a difference to the query performance whether the Fact Key is clustered or not? The sole purpose of this article is to answer that question, by testing it.

In a previous article I established that insert performance is better if the fact key is clustered. But I have not touched on query performance yet, hence this article. Here I will test query performance using a typical star join query. A star join is a query where the fact table is joined to several dimensions and the where clause contains some dimension attributes. The measures are usually summed up, with the dimension attributes in the group by clause.

The following is an example of a star join query commonly used in data warehousing:

select D1.Attr1, D2.Attr1, D3.Attr1, sum(measure1), sum(measure2)
from FactPKClus F
inner join dim1 D1 on D1.dimkey1 = F.dimkey1
inner join dim2 D2 on D2.dimkey2 = F.dimkey2
inner join dim3 D3 on D3.dimkey3 = F.dimkey3
where D1.Attr2 between 'Value1' and 'Value2'
and D2.Attr2 between 'Value1' and 'Value2'
and D3.Attr2 between 'Value1' and 'Value2'
group by D1.Attr1, D2.Attr1, D3.Attr1

For this test I will create 3 Fact Tables:

  1. With Fact Key Clustered
  2. With Fact Key Not Clustered
  3. Without Fact Key

I will use 3 dimensions, each connected to the fact table using surrogate keys. Each dimension contains 2 attributes and 1000 rows (members). Then I will query the star schema using a star join query. I am using SQL Server 2008 R2, so the star join query can utilise the bitmap filter enhancement (which does not require foreign keys to work).

-- Create the 3 fact tables
create table FactPKClus
( factkey int identity(1,1),
dimkey1 int, dimkey2 int, dimkey3 int,
measure1 decimal(18,5), measure2 decimal(18,5),
primary key clustered (factkey)
)

create table FactPKNonClus
( factkey int identity(1,1),
dimkey1 int, dimkey2 int, dimkey3 int,
measure1 decimal(18,5), measure2 decimal(18,5),
primary key nonclustered (factkey)
)

create table FactNoPK
( factkey int,
dimkey1 int, dimkey2 int, dimkey3 int,
measure1 decimal(18,5), measure2 decimal(18,5)
)
-- Create initial data for the fact tables
create table data ( col1 int , col2 decimal(20,5))
declare @a int
set @a = 1
while @a <= 1000000
begin
insert into data values (convert(int,rand()*1000), 10.1*@a)
set @a = @a + 1
end -- about 10 mins
-- Populate the fact tables with initial data
insert into FactPKClus (dimkey1, dimkey2, dimkey3, measure1, measure2)
select col1, col1+1, col1+2, col2, col2+7 from data

insert into FactPKNonClus (dimkey1, dimkey2, dimkey3, measure1, measure2)
select col1, col1+1, col1+2, col2, col2+7 from data

insert into FactNoPK (dimkey1, dimkey2, dimkey3, measure1, measure2)
select col1, col1+1, col1+2, col2, col2+7 from data
-- Create 3 dims, each with 1000 members and 2 attributes
create table dim1 (dimkey1 int identity(1,1) primary key clustered, attr1 varchar(20), attr2 varchar(20))
create table dim2 (dimkey2 int identity(1,1) primary key clustered, attr1 varchar(20), attr2 varchar(20))
create table dim3 (dimkey3 int identity(1,1) primary key clustered, attr1 varchar(20), attr2 varchar(20))

declare @a int
set @a = 1
while @a <= 1000
begin
insert into dim1 (attr1, attr2) values ('Value'+convert(varchar,@a), 'Value'+convert(varchar,@a))
insert into dim2 (attr1, attr2) values ('Value'+convert(varchar,@a), 'Value'+convert(varchar,@a))
insert into dim3 (attr1, attr2) values ('Value'+convert(varchar,@a), 'Value'+convert(varchar,@a))
set @a = @a + 1
end -- about 1 s
-- create nonclustered indexes on the dim tables
create nonclustered index dim1attr1 on dim1(attr1)
create nonclustered index dim1attr2 on dim1(attr2)
create nonclustered index dim2attr1 on dim2(attr1)
create nonclustered index dim2attr2 on dim2(attr2)
create nonclustered index dim3attr1 on dim3(attr1)
create nonclustered index dim3attr2 on dim3(attr2)
-- query the star schema
select D1.Attr1, D2.Attr1, D3.Attr1, sum(measure1), sum(measure2)
from FactPKClus F
inner join dim1 D1 on D1.dimkey1 = F.dimkey1
inner join dim2 D2 on D2.dimkey2 = F.dimkey2
inner join dim3 D3 on D3.dimkey3 = F.dimkey3
where D1.Attr2 between 'Value1' and 'Value2'
and D2.Attr2 between 'Value1' and 'Value2'
and D3.Attr2 between 'Value1' and 'Value2'
group by D1.Attr1, D2.Attr1, D3.Attr1 -- 5.87 s
select D1.Attr1, D2.Attr1, D3.Attr1, sum(measure1), sum(measure2)
from FactPKNonClus F
inner join dim1 D1 on D1.dimkey1 = F.dimkey1
inner join dim2 D2 on D2.dimkey2 = F.dimkey2
inner join dim3 D3 on D3.dimkey3 = F.dimkey3
where D1.Attr2 between 'Value1' and 'Value2'
and D2.Attr2 between 'Value1' and 'Value2'
and D3.Attr2 between 'Value1' and 'Value2'
group by D1.Attr1, D2.Attr1, D3.Attr1 -- 5.87 s
select D1.Attr1, D2.Attr1, D3.Attr1, sum(measure1), sum(measure2)
from FactNoPK F
inner join dim1 D1 on D1.dimkey1 = F.dimkey1
inner join dim2 D2 on D2.dimkey2 = F.dimkey2
inner join dim3 D3 on D3.dimkey3 = F.dimkey3
where D1.Attr2 between 'Value1' and 'Value2'
and D2.Attr2 between 'Value1' and 'Value2'
and D3.Attr2 between 'Value1' and 'Value2'
group by D1.Attr1, D2.Attr1, D3.Attr1 -- 5.14 s
 
/* System Spec: Acer Aspire 5051 AWXMi, AMD Turion 64 Mobile MK-36 799 MHz,
512 MB RAM, Win XP SP3, SQL Server 2008 R2 Nov CTP */
 
Conclusion: whether the identity PK is clustered or not, the star join query performance is the same (5.87 s in both cases). The query is faster (5.14 s) if the fact table has no fact key at all.

Measurement method: see here.

24 February 2010

Primary Key and Clustered Index on the Fact Table

Filed under: Data Warehousing,SQL Server — Vincent Rainardi @ 6:42 pm

Some SQL Server data warehousing practitioners question whether we should have a single column primary key on the fact table. And, if we do, whether we should index it. And, if we index it, whether the index should be clustered or not. This article is to answer those 3 questions.

Before we begin, I would like to point out that this article is SQL Server specific: ideally 2008, but at least 2005. It is not applicable if your data warehouse is on another RDBMS such as Oracle, DB2 or Informix, nor to parallel database platforms such as Teradata, Exadata and PDW. I would like to write about other platforms, but that's for another post.

OK, let’s answer the 3 questions one by one:

  1. Should we have a single column PK in the fact table?
  2. Should we index that PK?
  3. Should we cluster that index?

Should we have a single column primary key on the fact table?

This question actually consists of 2 questions:
a)  Should we have a Primary Key?
b)  Should the PK be a single column?

I'll begin with b) first, then move on to a).

b) Single Column PK or Multi Column PK?

A fact table typically consists of several dimension key columns and several measures columns. Probably plus a few system columns like load timestamp and data quality or audit key. A few years ago, my view was that ideally we need to create an identity column on the fact table and make it a primary key. I usually call this column fact_key. This column enables us to uniquely identify each fact row so we can a) link between fact rows, and b) update or delete a single row in the fact table in the case of duplicate fact rows.

The alternative to a single column PK is to have multi-column PK. We create the PK from a combination of dimension key columns. Usually not all dimension key columns, but only several of them. Only those whose combination enables us to uniquely identify each fact row. We do not include the dim key column that is dependent on another dim key column.

The single column PK is better than a multi-column PK for two reasons:

  1. The multi-column PK may not be unique. Even if the combination of those columns is unique now, it does not guarantee that we won’t have a duplicate primary key in the future. The single-column approach guarantees that the PK is unique, because it is an identity column.
  2. The single-column PK is slimmer than a multi-column PK, hence better query performance. To do a self join on the fact table (to link the current fact row to the previous fact row, for example, in the case of chained events), we join on a single integer column. This gives better performance than joining on a combination of multiple columns.

So if we need to create a PK on the fact table, it is better to be a single column PK. But do we need to have a primary key on the fact table?

a) Should We Have a Primary Key on the Fact Table?

Fundamental database design theory states that we need to have a primary key on every table. When we design a transaction system, this is indeed the principle. Forget the 3rd normal form; even the first normal form requires that:

  1. There must not be any repeating groups of columns in any table
  2. Each table must have a primary key that uniquely identifies each row

And to achieve the second and third normal forms we need to break some tables into smaller ones and create foreign keys between them. This PK and FK thinking has been in our minds for years, so when we design a fact table we almost automatically feel the need to create a primary key, to be able to identify each row in the table.

But those are the database design principles for transaction systems. In dimensional data warehousing we do not normalise; instead, we de-normalise: in a star schema the tables are de-normalised.

We need to remember that data warehousing is not just dimensional data warehousing. Dimensional data warehousing is when we have fact and dimensional tables, i.e. Kimball style. There are other styles of data warehousing, where the data warehouse is not in the form of fact and dim tables. Instead, the DW is normalised, just like a transaction system database. For example in Inmon’s concept, the enterprise data warehouse is normalised. In many Teradata DWs, the data warehouses are also normalised.

So, should we have a primary key in the fact table? The answer is: it depends on whether we need to be able to identify each fact row or not. There are 3 reasons why we might need to be able to identify each fact row:
1)  If we need to refer to a fact row from another fact row, for example in the case of a chain of events
2)  If there are several identical fact rows and we need to update (or delete) only some of them
3)  To link the fact table to another fact table

There is another reason why we might want a PK in the fact table: to prevent duplicate rows. I'll explain the 3 reasons for identifying each fact row first, then the prevention of duplicate rows after that.

1. Referring to a fact row from another fact row

In some data warehouses we need to refer to a fact row from another fact row to show a chain of events/transactions. So on the ‘from’ fact row, we put the fact_key value of the ‘target’ fact row. This requirement is rare, but it does happen.
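A rough sketch of that design (FactEvent and prev_fact_key are illustrative names, not objects used elsewhere in this post): the fact table carries its own identity fact_key plus a nullable column holding the fact_key of the related row, and the chain is followed with a self join on a single integer column.

-- each fact row can point at the fact_key of the previous event in the chain
create table FactEvent
( fact_key int identity(1,1) primary key,
prev_fact_key int null, -- fact_key of the 'target' fact row; null for the first event in a chain
dimkey1 int, measure1 decimal(18,5)
)

-- follow the chain one step back
select Curr.fact_key, Prev.fact_key as prev_key, Prev.measure1 as prev_measure1
from FactEvent Curr
left join FactEvent Prev on Prev.fact_key = Curr.prev_fact_key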

The alternative designs to show the chain of events are:

a) Create a transaction dimension and show the chain of events in this dimension. The grain of a transaction dimension is 1 row per business event or business transaction. There are arguments both for and against the idea of a transaction dimension but that's for another article. The main argument from the 'against' side is that the transaction dimension attributes should be put as degenerate dimension columns in the fact table.

b) In many cases it is not a chain of events/transactions, but only a chain of master data, for example: a chain of customers (related customers), related policies or related accounts. In this case we should do the linking on the customer dimension, policy dimension or account dimension; not on the fact table or transaction dimension.

2. To deal with identical fact rows

If we have several identical fact rows and we need to update or delete only some of them, we can use rowcount. In SQL Server, using “rowcount” we can define how many rows we want our SQL statement to be applied to. For example:
create table t1 ( col1 int, col2 varchar(30) )
insert into t1 values (1,'A')
insert into t1 values (1,'A')
set rowcount 1
update t1 set col2 = 'B' where col1 = 1

In the above example we only update 1 row because we specify “set rowcount 1”.

If we don't reset the rowcount, when we do "select * from t1" SQL Server will only return 1 row. To display both rows we need to reset the rowcount by setting it to 0:
set rowcount 0
select * from t1

So we don’t need to have a primary key on a table to be able to deal with duplicate rows. We can use rowcount.
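On SQL Server 2005 and later, the same thing can also be done with the TOP clause, which avoids having to remember to reset rowcount afterwards. A minimal sketch, using the same t1 table as above:

-- update only one of the two identical rows, without using set rowcount
update top (1) t1
set col2 = 'B'
where col1 = 1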

3. Linking to another fact table (drill across)

Some people put a primary key on the fact table (usually in the form of a single column, e.g. a fact key) so that they can refer to that row from another fact table. Consider a classic retail data warehouse scenario where we have 2 fact tables: shipment and order. To enable us to go from the order fact table to the shipment fact table, we create a fact key column on the order fact table, and then put this fact key on the shipment fact row. This method is quite costly, particularly when we have to update the fact key on the shipment fact table.

The very basic premise of Kimball-style dimensional modelling is to have conformed dimensions. One of the main purposes of having conformed dimensions is to enable us to drill across from one fact table to another. In other words, we should use the dim keys to navigate from one fact table to another fact table. Not conforming your dimensions and using a fact key to navigate between fact tables instead requires additional processing time in the warehouse load. I wouldn't recommend this approach because a) non-conformed dimensions create an issue of data trustworthiness; b) ETL time becomes longer because we have more dimension tables to maintain and a "related fact key" column to stamp.

Some people use a DD (degenerate dimension) to navigate between fact tables. A DD is a column in the source transaction table that is left over after we move all the attributes to the dimension tables. Order ID is an example of a DD: we can use Order ID to go from the order fact table to the shipment fact table. Note that we don't need to uniquely identify a row; an order could be delivered in many shipments and a shipment could contain many orders. That is another reason why a fact key (or a fact table PK in general) is not ideal for drilling across between fact tables: we don't need it to be unique. The best way to drill across is to have conformed dimensions and use the dim keys, as sketched below.
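As a hedged sketch of drilling across on conformed dimensions (FactOrder, FactShipment and the quantity columns are illustrative names, not tables created in this post): each fact table is aggregated to the common grain first, and the two result sets are then joined on the conformed dim keys.

-- drill across: aggregate each fact table to the conformed grain, then join the results
select o.dimkey1, o.dimkey2, o.OrderQty, s.ShippedQty
from ( select dimkey1, dimkey2, sum(order_qty) as OrderQty
from FactOrder group by dimkey1, dimkey2 ) o
inner join ( select dimkey1, dimkey2, sum(shipped_qty) as ShippedQty
from FactShipment group by dimkey1, dimkey2 ) s
on s.dimkey1 = o.dimkey1 and s.dimkey2 = o.dimkey2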

To Prevent Duplicate Rows in the Fact Table

Sometimes people put a primary key on the fact table to prevent duplicate rows, for example to prevent a transaction that was loaded into the fact table on the previous day's load from being loaded again the next day. But when designing the fact table grain, we need to consider whether there is a valid business scenario for duplicate rows in the fact table. Consider a classic retail data warehouse scenario: a fact table with 4 dim keys (customer, product, store and date) and 2 additive measures (quantity and sales amount). Can we put a primary key on the combination of the 4 dim keys to prevent duplicate rows? Is it a valid business scenario to have duplicate rows with the same 4 dim keys?

The first thing we need to do when creating a fact table is declaring the grain. This is a major step that some people forget, or deliberately skip. In the above scenario, the fact table grain could be:
a)  1 row for every transaction in the EPOS system (electronic point of sale), or
b)  1 row for every combination of customer-product-store-date.

When a customer buys 2 packets of crisps and the cashier at the checkout scans them twice, we have 2 rows in the EPOS system with different transaction/line IDs. In the case of a) above, we create 2 rows in the fact table with the same 4 dim keys. In the case of b) we group the 2 source rows and create only 1 row with quantity = 2.

In the case of a) above, having duplicate rows in the fact table is a valid business scenario, so we should not put a primary key on the fact table to prevent it. In the case of b) above, having duplicate rows is not a valid business scenario. Another commonly used example is the customer coming back one hour later and buying another packet of crisps: the same customer, going to the same store, buying the same product, on the same day. Because the grain is date (not time), there will be duplicate rows in the fact table.
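A minimal sketch of the load for grain b), assuming a hypothetical staging table StagingEposLine with one row per scanned line and a target fact table FactSalesDaily (both names are illustrative, not from the scripts in this post): the group by collapses the two packets of crisps into a single fact row with quantity = 2.

-- grain b): 1 row for every combination of customer-product-store-date
insert into FactSalesDaily (customer_key, product_key, store_key, date_key, quantity, sales_amount)
select customer_key, product_key, store_key, date_key, sum(quantity), sum(sales_amount)
from StagingEposLine
group by customer_key, product_key, store_key, date_key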

A common practice in retail data warehousing is to have a degenerate dimension (DD) on the EPOS transaction ID. In a transaction fact table, the grain of a DD column is usually (but not always) the same as the grain of the fact table. In the case of a) above, some people create the fact table PK on this transaction ID column. It is true that in most systems the EPOS transaction ID is unique, but in some cases it is possible to have duplicate transaction IDs, for example after a merger/acquisition where two international retailers who use the same EPOS system merge. Although the EPOS transaction IDs are unique within each retailer, they are not unique between the two companies. I'd also like to point out that if the DD column is Order ID we can't use it as the PK because it's not unique (the grain of the fact table is at line level, not order level).
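For illustration only (FactSalesLine and its columns are hypothetical names, not from the scripts in this post), a line-grain fact table carrying the EPOS transaction line ID as a degenerate dimension could look like this; whether that column can safely be made the PK depends on the uniqueness caveats above.

-- grain a): 1 row per EPOS transaction line, with the line ID kept as a degenerate dimension
create table FactSalesLine
( epos_transaction_line_id bigint, -- degenerate dimension; unique within one retailer's EPOS system, not necessarily across companies
customer_key int, product_key int, store_key int, date_key int,
quantity int, sales_amount decimal(18,5)
)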

So, before putting a PK on the fact table we need to check whether there is a business scenario that makes duplicate rows valid. The latest trend for the fact table primary key is to create a fact key column. It's an identity column, so it can't be duplicated; it is unique. But in that case we don't really prevent duplication, do we?

The disadvantage of having a primary key on the fact table

A primary key is a constraint. We create it to prevent duplication on the constrained columns. To enforce the constraint, SQL Server needs to check it every time a row is inserted into the table or the columns are updated.

In data warehousing, we insert lots of rows into fact tables every day, particularly snapshot fact tables. If the fact table only contains 100,000 rows, it does not make a difference whether we have a PK or not. But if the fact table contains 100 million rows, it does make a big difference. As an example, the following test inserts 100,000 rows into 2 fact tables (1 with PK and 1 without), both containing 1 million rows.

-- Create 2 fact tables, 1 with PK, 1 without PK
create table FactPK
( dimkey1 int, dimkey2 int, dimkey3 int,
measure1 decimal(18,5), measure2 decimal(18,5),
primary key (dimkey1, dimkey2, dimkey3)
)

create table FactNoPK
( dimkey1 int, dimkey2 int, dimkey3 int,
measure1 decimal(18,5), measure2 decimal(18,5)
)

-- Prepare data to be loaded initially (1 million each)
create table data ( col1 int , col2 decimal(20,5))
declare @a int
set @a = 1
while @a <= 1000000
begin
insert into data values (@a, 10.1*@a)
set @a = @a + 1
end
-- about 5-10 mins depending on the autogrow of the DB files

-- Do the initial load on both fact tables
insert into FactPK (dimkey1, dimkey2, dimkey3, measure1, measure2)
select col1, col1+1, col1+2, col2, col2+7 from data
insert into factNoPK (dimkey1, dimkey2, dimkey3, measure1, measure2)
select col1, col1+1, col1+2, col2, col2+7 from data

-- Prepare new data to be inserted into both fact tables
create table NewData ( col1 int , col2 decimal(20,5))
declare @a int
set @a = 1000001
while @a <= 1100000
begin
insert into NewData values (@a, 10.1*@a)
set @a = @a + 1
end

-- Insert new data into both fact tables and measure the duration
insert into factPK (dimkey1, dimkey2, dimkey3, measure1, measure2)
select col1, col1+1, col1+2, col2, col2+7 from NewData -- 100k rows in 3.013 s, see here for measurement methods

insert into factNoPK (dimkey1, dimkey2, dimkey3, measure1, measure2)
select col1, col1+1, col1+2, col2, col2+7 from NewData -- 100k rows in 2.027 s

In the above test the fact tables are quite slim: they only have 3 dim keys and 2 measures, and they only contain 1 million rows. In practice they could be a lot bigger than this. But even at that size we can see that the primary key makes a big difference: the insert into the table with the PK is about 50% slower than the one without.

So should we have a primary key in the fact table? Only if we have a genuine reason to identify each and every fact row. We shouldn’t put a PK in the fact table just for the sake of it. It must have a purpose. It has some disadvantages so we need to consider whether the benefits outweigh the disadvantages.

Should we index the primary key?

Well, this one is easy. In SQL Server we must have an index on the primary key columns: the way SQL Server implements a primary key is by using a unique index. We can't have a primary key without having an index.

So, the short answer is yes. We must index the primary key.
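If you want to see the index that SQL Server created behind the scenes for a primary key, the sys.indexes catalog view shows it. A quick check against the FactPK table created earlier:

-- the primary key is implemented as a unique index (clustered by default)
select name, type_desc, is_unique, is_primary_key
from sys.indexes
where object_id = object_id('FactPK')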

Should we cluster that index?

A clustered index dictates the physical location of each row. If we create a clustered index on a column, the table rows on disk are physically ordered by that column. This slows down inserts: when we insert a row into a table with a clustered index, SQL Server first needs to identify where to insert that row, based on the clustered column(s). No, the row does not automatically go to the end of the table; it goes into the middle of the table, depending on the value of the clustered column(s). In extreme cases it could take 1 hour to insert 200,000 rows into a 40-million-row fact table, because SQL Server spends a lot of time identifying where to insert each row. This is the main argument why, if we decide to have a PK on the fact table, the PK index should be clustered: with an ever-increasing identity fact key, new rows simply go at the end of the table, which improves insert performance.

In data warehousing, the primary consideration is query performance. The secondary consideration is load performance. This is because the purpose of a warehouse is to satisfy user queries. When a user queries the warehouse, we only have a few seconds to serve the request (a few minutes in some cases); whereas when loading we have hours.

That is also true for a clustered index on the fact table: the primary consideration for the clustered index on the fact table is query performance. If the table rows are physically ordered by column1, then a query with column1 in the where clause will enjoy good performance. In SQL Server 2000, the clustered index was the only tool we had with regard to the physical order of the table rows. But since SQL Server 2005 we have another tool at our disposal: partitioning.

The physical order of a fact table should support the where clause of data warehouse queries. If it is a periodic snapshot fact table, the physical order should be the snapshot date (or snapshot week/month). This is because the majority of the data warehouse queries specify the snapshot date/week/month in the where clause.

If it is a transaction fact table, generally it is better not to physically order the fact table rows using a clustered index. Our nonclustered indexes on the dim keys will then be based on the Row ID (a combination of file ID, page ID and per-page row ID). This way the order of the fact table does not favour a certain dimension, so whatever attributes are in the where clause, the query will have equal performance.
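For example, on a heap fact table like FactNoPK (created earlier in this post) the dim key indexes would simply be nonclustered; because there is no clustered index, their leaf level points to the heap row IDs rather than to a clustering key. The index names below are illustrative:

-- nonclustered indexes on the dim keys of a heap fact table
create nonclustered index IX_FactNoPK_dimkey1 on FactNoPK(dimkey1)
create nonclustered index IX_FactNoPK_dimkey2 on FactNoPK(dimkey2)
create nonclustered index IX_FactNoPK_dimkey3 on FactNoPK(dimkey3)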

We should only have a transaction fact table physically ordered if the majority of the warehouse queries specify one certain column. For example, in a CRM warehouse there are many cases where it is beneficial to have the transaction fact table physically ordered on the customer key. In this scenario, queries with customer attributes in the where clause will be faster. Yes, queries without customer attributes in the where clause will be a bit slower, but because it's a CRM warehouse, the majority of the warehouse queries have customer attributes in them. Hence overall it is beneficial (for query performance).

If for any reason you decide to put a clustered index on the fact table, be it for a PK or on another column to support query performance, you should really consider one thing: are you prepared for your fact table load to take twice as long? A clustered index on the fact table really slows inserts down, unless the clustered index is on an identity column.

We need to remember that we have partitioning at our disposal to physically arrange the fact rows. Traditionally, in SQL Server warehousing (2005 and above) we put the fact table on several spindles. We do this by partitioning the table and putting different partitions on different file groups, which in turn are located on different spindles. This way a query is served by several physical disks, hence the IO rate is higher, hence the query is faster.

But if you don't have different spindles, you can still benefit from partitioning by arranging the partition function so that the majority of the queries only hit 1 partition. This way the table looks a lot smaller to SQL Server. For example, in the case of a monthly periodic snapshot fact table, partition the fact table by snapshot month. The majority of the queries target the current or previous snapshot month; if you have 10 years (120 snapshot months) of data, those queries will be a lot faster because they only hit 1 partition, which is 1/120th of the whole table.

Even after using partitioning, we still have the clustered index at our disposal to further arrange the physical order of the fact rows within a partition. For example, in a CRM warehouse we can partition the monthly snapshot fact table by snapshot month, and then create a clustered index on the customer key, as sketched below. In a retail warehouse, a potentially beneficial column to cluster on is the transaction date key.
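As a hedged sketch of that CRM example (FactCrmSnapshot, snapshot_month_key and customer_key are illustrative names, and the month boundary values are arbitrary), assuming SQL Server 2005 or later:

-- partition the monthly snapshot fact table by snapshot month
create partition function pfSnapshotMonth (int)
as range right for values (201001, 201002, 201003) -- one boundary per snapshot month

create partition scheme psSnapshotMonth
as partition pfSnapshotMonth all to ([PRIMARY]) -- or map each partition to a different file group/spindle

create table FactCrmSnapshot
( snapshot_month_key int, customer_key int, measure1 decimal(18,5)
) on psSnapshotMonth (snapshot_month_key)

-- then physically order the rows within each partition on the customer key
-- (SQL Server adds the partitioning column to a non-unique clustered index automatically)
create clustered index IX_FactCrmSnapshot_customer
on FactCrmSnapshot (customer_key)
on psSnapshotMonth (snapshot_month_key)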

Sorry, I got carried away with physical ordering and partitioning; back to the question: should we cluster the PK index? The answer is: it depends.

  1. If your fact table is heavily queried by the front end reports, then you should use the clustered index to support query performance, as I explained above (use partitioning first, clustered index second). Because we can only have 1 clustered index per table, we need to specify any other indexes (including the PK index) as nonclustered.
  2. If your PK is a composite of the dim keys, you should not use that PK for clustered index. In SQL Server, all non-clustered indexes will be very wide/large (and therefore slower) if the clustered index is wide. This is because non-clustered indexes contain the clustered index columns.
  3. If your PK is a single column integer identity key, and loading performance is more important for you than query performance*, then you should cluster that PK index. Insert is faster if the identity PK is clustered than if it is not clustered (but a fact table with no PK is fastest), as shown in the test below. *For example because the fact table is only used to load SSAS cubes, which read the fact table in its entirety rather than selectively.

-- Create 3 fact tables, 2 with an identity PK (1 clustered, 1 nonclustered) and 1 without a PK

create table FactPKClus
( factkey int identity(1,1),
dimkey1 int, dimkey2 int, dimkey3 int,
measure1 decimal(18,5), measure2 decimal(18,5),
primary key clustered (factkey)
)
create table FactPKNonClus
( factkey int identity(1,1),
dimkey1 int, dimkey2 int, dimkey3 int,
measure1 decimal(18,5), measure2 decimal(18,5),
primary key nonclustered (factkey)
)
create table FactNoPK
( factkey int,
dimkey1 int, dimkey2 int, dimkey3 int,
measure1 decimal(18,5), measure2 decimal(18,5)
)
-- Create initial data for the fact tables
create table data ( col1 int , col2 decimal(20,5))
declare @a int
set @a = 1
while @a <= 1000000
begin
insert into data values (convert(int,rand()*1000), 10.1*@a)
set @a = @a + 1
end -- about 10 mins

-- Populate the fact tables with initial data
insert into FactPKClus (dimkey1, dimkey2, dimkey3, measure1, measure2)
select col1, col1+1, col1+2, col2, col2+7 from data

insert into FactPKNonClus (dimkey1, dimkey2, dimkey3, measure1, measure2)
select col1, col1+1, col1+2, col2, col2+7 from data

insert into FactNoPK (dimkey1, dimkey2, dimkey3, measure1, measure2)
select col1, col1+1, col1+2, col2, col2+7 from data

-- Prepare new data to be inserted into the fact tables
create table NewData ( col1 int , col2 decimal(20,5))
declare @a int
set @a = 1000001
while @a <= 1100000
begin
insert into NewData values (convert(int,rand()*1000), 10.1*@a)
set @a = @a + 1
end

-- Insert new data into the fact tables and measure the performance
insert into FactPKClus (dimkey1, dimkey2, dimkey3, measure1, measure2)
select col1, col1+1, col1+2, col2, col2+7 from NewData -- 3.96 s (25,252 rows/s)

insert into FactPKNonClus (dimkey1, dimkey2, dimkey3, measure1, measure2)
select col1, col1+1, col1+2, col2, col2+7 from NewData -- 4.71 s (21,231 rows/s)

insert into FactNoPK (dimkey1, dimkey2, dimkey3, measure1, measure2)
select col1, col1+1, col1+2, col2, col2+7 from NewData -- 3.31 s (30,221 rows/s)
/*Measurements are taken 10 times each and then averaged. See here for measurement methods.
System Spec: Acer Aspire 5051 AWXMi, AMD Turion 64 Mobile MK-36 799 MHz,
512 MB RAM, Win XP SP3, SQL Server 2008 R2 Nov CTP*/

In the test above there are 2 fact tables with a PK. The one whose PK is clustered is 19% faster on insert than the one with the nonclustered PK (25,252 rows/s compared to 21,231 rows/s). But the one without a PK is still faster (30,221 rows/s); that is 20% faster than the clustered PK.

Note that the concept of a ‘clustered index’ is SQL Server specific; it does not apply to other platforms. Teradata, for example, has a different data distribution mechanism which utilises a “primary index”.

Clustered index is the default for PK

In practice I have found many fact tables created with a clustered index on the primary key, not only on an identity PK but also on a non-identity PK. This is because, if we don't specify whether the primary key should be clustered or nonclustered, by default SQL Server creates the PK index as clustered. For example:
create table Fact1
( dimkey1 int,
measure1 decimal(18,5),
primary key (dimkey1)
)
-- or, equivalently (drop Fact1 first):
create table Fact1
( dimkey1 int primary key,
measure1 decimal(18,5)
)

In both examples above, SQL Server creates the primary key on dimkey1 as a clustered index, so the fact table is physically ordered on the dimkey1 column.

To create the primary key as non clustered index, the command is:
create table Fact1
( dimkey1 int,
measure1 decimal(18,5),
primary key nonclustered (dimkey1)
)

Conclusion

When you create a fact table,

a)  Consider whether you need to be able to identify each row on the fact table. If you don’t need to, don’t create a primary key.
b)  If you do need to create a primary key, create it as a single identity column.
c)  Cluster the PK if: the PK is an identity column and loading performance is more important than query performance
Don’t cluster the PK if: the PK is a composite, or when you need the cluster index to support query performance (together with partitioning)

PS. If the PK is a single identity column, the clustered index won’t give much overhead at insert because new rows are created at the end of the table. Thanks to Marco Russo for pointing this out.

1 February 2010

How long did it run?

Filed under: SQL Server — Vincent Rainardi @ 8:15 am

A common way to measure how long a SQL statement ran is using SQL Profiler. Alternatively we can also use the getdate() function. My preference is to use Set Statistics. In this post I will describe these 3 methods.

Using SQL Profiler

To use SQL Profiler, start a new trace and go to the event selection tab. Select ‘Show all events’ and ‘Show all columns’. Scroll from top to bottom and clear the check boxes on every event. On the ‘SQL Statement completed’ event, select these columns: Duration, EndTime, StartTime, TextData, and clear all other columns as shown below.

To display only the SQL statement(s) that you are running, click on Column Filters and edit the TextData filter, specifying a keyword from your statements, like this:

Click Run to run the trace, then run your SQL statement(s). The output is like this:

In the above trace I ran the same SQL statement 3 times. As you can see they took 17.558, 15.933 and 14.921 seconds respectively.

Using getdate()

Sometimes we can't use SQL Profiler, for example if we want to run the measurements programmatically. When measuring, I run the statement several times and take the average. I usually throw away the first 3 measurements to make sure that the system stabilises first, for example to eliminate the effect of caching, and also to eliminate ‘overshoot’ and ‘ringing’, as illustrated by the time response diagram below (from Wikipedia):

To record how long a SQL statement runs I just use getdate(), then datediff in milliseconds, before converting them to seconds, as follows:

declare @start datetime, @finish datetime, @counter int, @duration float
-- datetime rather than time, because getdate() returns a datetime

-- get the start time
set @start = getdate()

-- do something
set @counter = 0
while @counter <= 1000
begin
set @counter = @counter + 1
end

-- get the finish time
set @finish = getdate()

-- calculate the duration in milliseconds, then translate to seconds
set @duration = datediff(ms, @start, @finish)/1000.0

-- display the start and finish times, and the duration
print @start
print @finish
print @duration -- can also write to a table

The measurements from this getdate method are within about 0.12% (on average) of the SQL Profiler outputs, as shown in the following pairs (first number: getdate, second: Profiler): (17580, 17588), (15940, 15933), (14933, 14921), (16397, 16329), (29377, 29360), (33326, 33309).
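To run the measurements programmatically and average them (throwing away the warm-up runs), something along these lines could be used; the Timings table, the 10 runs and the 3 discarded runs are just illustrative choices:

-- record each run's duration in a table, then average, ignoring the first 3 warm-up runs
create table Timings (RunNo int, DurationSec float)

declare @run int, @start datetime, @finish datetime
set @run = 1
while @run <= 10
begin
set @start = getdate()

-- run the statement being measured here

set @finish = getdate()
insert into Timings values (@run, datediff(ms, @start, @finish)/1000.0)
set @run = @run + 1
end

select avg(DurationSec) as AvgDurationSec from Timings where RunNo > 3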

Using Set Statistics

The command is quite simple: "set statistics time on". Then we execute the SQL statement we want to measure. For example:

set statistics time on
select
D1.Attr1, D2.Attr1, D3.Attr1, sum(measure1), sum(measure2)
from FactPKNonClus F
inner join dim1 D1 on D1.dimkey1 = F.dimkey1
inner join dim2 D2 on D2.dimkey2 = F.dimkey2
inner join dim3 D3 on D3.dimkey3 = F.dimkey3
where D1.Attr2 between 'Value1' and 'Value2'
and D2.Attr2 between 'Value1' and 'Value2'
and D3.Attr2 between 'Value1' and 'Value2'
group by D1.Attr1, D2.Attr1, D3.Attr1

The result is not on the Results tab, but on the Messages tab:

SQL Server parse and compile time:
   CPU time = 250 ms, elapsed time = 649 ms.

(106 row(s) affected)

SQL Server Execution Times:
   CPU time = 3906 ms, elapsed time = 8180 ms.

It's the elapsed time under 'SQL Server Execution Times' that we are looking for (8180 ms). That is how long the SQL statement ran.

The parse and compile time reduces when you rerun the SQL statement, and eventually it becomes zero, because the statement has already been parsed and compiled. For example, when we execute the above SQL statement 2 more times (just the select statement; there is no need to re-execute the set statistics), we get:

SQL Server parse and compile time:
   CPU time = 0 ms, elapsed time = 0 ms.

(106 row(s) affected)

SQL Server Execution Times:
   CPU time = 3766 ms, elapsed time = 7468 ms.

Notice that the elapsed time also decreases (from 8180 to 7468 ms). When measuring, ideally we need to run the statement several times until the execution time is more or less stable. Stable means it does not go down any more, but goes up and down around a certain point. At that point we then measure several times (say 3 times) and take the average.

To switch off the statistics, the command is "set statistics time off".
