Some SQL Server data warehousing practitioners question whether we should have a single column primary key on the fact table. And, if we do, whether we should index it. And, if we index it, whether the index should be clustered or not. This article is to answer those 3 questions.
Before we begin, I would like to point out that this article is SQL Server specific. Ideally 2008 but at least 2005. This article is not applicable if your data warehouse is in other RDBMS like Oracle, DB/2 and Informix. And it is not applicable to parallel database platform such as Teradata, Exadata and PDW. I would like to write about other platforms, but that’s for another post.
OK, let’s answer the 3 questions one by one:
- Should we have a single column PK in the fact table?
- Should we index that PK?
- Should we cluster that index?
Should we have a single column primary key on the fact table?
This question actually consists of 2 questions:
a) Should we have a Primary Key?
b) Should the PK be a single column?
I’ll begin with b) first, then onto a).
a) Single Column PK or Multi Column PK?
A fact table typically consists of several dimension key columns and several measures columns. Probably plus a few system columns like load timestamp and data quality or audit key. A few years ago, my view was that ideally we need to create an identity column on the fact table and make it a primary key. I usually call this column fact_key. This column enables us to uniquely identify each fact row so we can a) link between fact rows, and b) update or delete a single row in the fact table in the case of duplicate fact rows.
The alternative to a single column PK is to have multi-column PK. We create the PK from a combination of dimension key columns. Usually not all dimension key columns, but only several of them. Only those whose combination enables us to uniquely identify each fact row. We do not include the dim key column that is dependent on another dim key column.
The single column PK is better than multi-column PK because of two reasons:
- The multi-column PK may not be unique. Even if the combination of those columns is unique now, it does not guarantee that in the future we won’t have duplicate primary key. The single-column approach guarantees that the PK is unique, because it is an identity column.
- The single-column PK is slimmer than multi-column PK, hence better query performance. To do a self join in the fact table (to link the current fact row to the previous fact row for example, in the case of chained events), we join on a single integer column. This gives a better performance than joining on a combination of multiple columns.
So if we need to create a PK on the fact table, it is better to be a single column PK. But do we need to have a primary key on the fact table?
b) Should We Have a Primary Key on the Fact Table?
Fundamental database design theory states that we need to have a primary key on every table. When we design a transaction system, this is indeed the principle. Forget the 3rd normal form; even the first normal form requires that:
- There must not be any repeating columns in every table
- Each table must have a primary key that uniquely identify each row
And to achieve the second and third normal form we need to break some tables into smaller ones and create foreign keys between them. This PK and FK thing have been in our mind for years so when we design a fact table, almost automatically we feel the need to create a primary key, to be able to identify to each row in the table.
But that is the database design principles for transaction systems. In dimensional data warehousing, we do not use normalisation. Instead, we de-normalise the tables. In a star schema, the tables are de-normalised.
We need to remember that data warehousing is not just dimensional data warehousing. Dimensional data warehousing is when we have fact and dimensional tables, i.e. Kimball style. There are other styles of data warehousing, where the data warehouse is not in the form of fact and dim tables. Instead, the DW is normalised, just like a transaction system database. For example in Inmon’s concept, the enterprise data warehouse is normalised. In many Teradata DWs, the data warehouses are also normalised.
So, should we have a primary key in the fact table? The answer is: it depends on whether we need to be able to identify each fact row or not. There are 3 reasons why we need to be able to identify each fact row:
1) If we need to refer to a fact row from another fact row, for example in the case of chain of events
2) In the case if there are several identical fact rows and we need to update (or delete) only some of them
3) To link the fact table to another fact table
There another reason why we want to have a PK in the fact table: to prevent duplicate rows. I’ll explain the 3 reasons for identifying each fact row first, then the prevention of duplicate rows after that.
1. Referring to a fact row from another fact row
In data warehouses, we need to refer to a fact row from another fact row to show the chain of events/transactions. So on the ‘from’ fact row, we put the fact_key value of the ‘target’ fact row. This requirement is rare, but it does happen.
The alternative designs to show the chain of events are:
a) Create a transaction dimension and show the chain of events in this dimension. The grain of a transaction dimension is 1 row per business event or business transaction. There are arguments both for and against the idea of transaction dimension but that’s for another article. The main argument from the ‘against’ side is that the transaction dimension attributes should be put as degenerate dimension columns in the fact table.
b) In many cases it is not the chain of events/transaction, but only the chain of master data, for example: chain of customers (related customers), related policies or related accounts. In this case we should do the linking on the customer dimension, policy dimension or account dimension; not on the fact table or transaction dimension.
2. To deal with identical fact rows
If we have several identical fact rows and we need to update or delete only some of them, we can use rowcount. In SQL Server, using “rowcount” we can define how many rows we want our SQL statement to be applied to. For example:
create table t1 ( col1 int, col2 varchar(30) )
insert into t1 values (1,’A’)
insert into t1 values (1,’A’)
set rowcount 1
update t1 set col2 = ‘B’ where col1 = 1
In the above example we only update 1 row because we specify “set rowcount 1”.
If you don’t reset the rowcount, when we do “select * from t1”, SQL Server will only return 1 row. To display both rows we need to reset the row count by setting it to 0:
set rowcount 0
select * from t1
So we don’t need to have a primary key on a table to be able to deal with duplicate rows. We can use rowcount.
3. To linking to other fact table (drill across)
Some people put a primary key on the fact table (usually in the form of a single column, e.g. fact key) so that they can refer to that row from other fact table. Consider a classic scenario of retail data warehouse where we have 2 fact tables: shipment and order. To enable us to go from the order fact table to the shipment fact table, we create a fact key column on the order fact table, and then put this fact key on the shipment fact row. This method is a quite costly, particularly when we have to update the fact key on the shipment fact table.
The very basic premise of Kimball’s style dimensional modelling is to have conformed dimensions. One of the main purposes of having conformed dimension is to enable us to drill across from one fact table to another. In other words, we should use the dim keys to navigate from one fact table to another fact table. Not making your dimension conformed, then use fact key to navigate between fact tables, would require additional processing time on the warehouse load. I wouldn’t recommend this approach because a) non conformed dimensions create an issue of data trustworthiness; b) ETL time becomes longer because we have more dimension tables to maintain and to stamp the “related fact key” column.
Some people use DD (degenerate dimension) to navigate between fact tables. A DD is a column in the source transaction table that is left over after we move all the attributes to the dimension tables. Order ID is an example of DD. We can use Order ID to go from the order fact table to the shipment fact table. Note that we don’t need to uniquely identify a row; an order could be delivered in many shipments and a shipment could contain many orders. That is another reason why fact key (or fact table PK in general) is not ideal for drilling across between fact tables: because we don’t need it to be unique. The best way to drill across is to have conformed dimension and use the dim keys.
To Prevent Duplicate Rows in the Fact Table
Sometimes people put a primary key on the fact table to prevent duplicate rows. For example, to prevent a transaction that has been loaded into the fact table on the previous day load to be loaded again the next day. But when designing the fact table grain, we need to consider whether there is a valid business scenario for duplicate rows in the fact table. Consider a classic scenario of retail data warehouse: a fact table with 4 dim keys: customer, product, store and date, and 2 additive measures: quantity and sales amount. Can we put a primary key using the combination of the 4 dim keys to prevent duplicate rows? Is it a valid business scenario to have duplicate rows with the same 4 dim keys?
The first thing we need to do when creating a fact table is declaring the grain. This is a major step that some people forget, or deliberately skip. In the above scenario, the fact table grain could be:
a) 1 row for every transaction in the EPOS system (electronic point of sale), or
b) 1 row for every combination of customer-product-store-date.
When a customer buys 2 packets of crisps and the cashier at the checkout scans it twice, we have 2 rows in the EPOS system with different transaction/line IDs. In the case of a) above, we create 2 rows in the fact table, with the same 4 dim keys. In the case of b) we group the 2 source columns and we create only 1 row with quantity = 2.
In the case of a) above, having duplicate rows in the fact table is a valid business scenario so we should not put a primary key in the fact table to prevent it. In the case of b) above, having duplicate rows is not a valid business scenario. Another commonly used example is if the customer comes back one hour later and buy another packet of crisps. That’s the same customer, going to the same store, buying the same product, on the same day. Because the grain is date (not time), there will be duplicate rows in the fact table.
A common practice in retail data warehousing is to have a degenerate dimension (DD) on the EPOS transaction ID. A DD is a column in the source transaction table that is left over after we move all the attributes to the dimension tables. In a transaction fact table, the grain of a DD column is usually (not always) the same with the grain of the fact table. In the case of a) above, some people create the fact table PK on this transaction ID column. True that in most system, the EPOS transaction ID is unique, but in some cases, it is possible to have duplicate transaction ID. For example, in the case of merger/acquistion: two international retailers who use the same EPOS system merged. Although within each retailer, the EPOC transaction IDs are unique globally, they are not unique between companies. I’d like to point out that if the DD column is Order ID we can’t use it as PK because it’s not unique (the grain of the fact table is at line level, not order level).
So, before putting a PK in the fact table we need to check whether there is a business scenario that makes duplicate rows valid. Now the latest trend of fact table primary key is to create the fact key column. It’s an identity column, so it can’t be duplicated. It is unique. But, in this case, we don’t really prevent duplication, do we?
The disadvantage of having a primary key on the fact table
A primary key is a constraint. We create it to prevent duplication on the contrained columns. To reinforce the contraint, SQL Server needs to check the constraint everytime a row is inserted into the table or the columns are updated.
In data warehousing, we insert lots of rows into fact tables every day, particularly snapshot fact tables. If the fact table only contains 100,000 rows, it does not make a difference whether we have a PK or not. But if the fact table contains 100 million rows, it does make a big difference. As an example, the following test inserts 100,000 rows into 2 fact tables (1 with PK and 1 without), both containing 1 million rows.
— Create 2 fact tables, 1 with PK, 1 without PK
create table FactPK
( dimkey1 int, dimkey2 int, dimkey3 int,
measure1 decimal(18,5), measure2 decimal(18,5)
primary key (dimkey1, dimkey2, dimkey3)
)
create table FactNoPK
( dimkey1 int, dimkey2 int, dimkey3 int,
measure1 decimal(18,5), measure2 decimal(18,5)
)
— Prepare data to be loaded initially (1 million each)
create table data ( col1 int , col2 decimal(20,5))
declare @a int
set @a = 1
while @a <= 1000000
begin
insert into data values (@a, 10.1*@a)
set @a = @a + 1
end–about 5-10mins depending on the autogrow of DB files
–Do initial load on both fact tables
insert into FactPK (dimkey1, dimkey2, dimkey3, measure1, measure2)
select col1, col1+1, col1+2, col2, col2+7 from data
insert into factNoPK (dimkey1, dimkey2, dimkey3, measure1, measure2)
select col1, col1+1, col1+2, col2, col2+7 from data
— Prepare new data to be inserted into both fact tables
create table NewData ( col1 int , col2 decimal(20,5))
declare @a int
set @a = 1000001
while @a <= 1100000
begin
insert into NewData values (@a, 10.1*@a)
set @a = @a + 1
end
— Insert new data into both fact tables and measure the duration
insert into factPK (dimkey1, dimkey2, dimkey3, measure1, measure2)
select col1, col1+1, col1+2, col2, col2+7 from NewData –100k rows in 3.013s, see here for measurement methods
insert into factNoPK (dimkey1, dimkey2, dimkey3, measure1, measure2)
select col1, col1+1, col1+2, col2, col2+7 from NewData –100k rows in 2.027s
In the above test the fact tables are quite slim. They only have 3 dim keys and 2 measures. And they only contain 1 million rows. In practice it could be a lot bigger than this. But, even on that size, we can see that the primary key make a big difference. The one with PK is about 50% slower than the one without.
So should we have a primary key in the fact table? Only if we have a genuine reason to identify each and every fact row. We shouldn’t put a PK in the fact table just for the sake of it. It must have a purpose. It has some disadvantages so we need to consider whether the benefits outweigh the disadvantages.
Should we index the primary key?
Well this one is easy. In SQL Server we must have an index on the primary key column. The way SQL Server implement primary key is by using a unique index. We can’t have a primary key without having an index.
So, the short answer is yes. We must index the primary key.
Should we cluster that index?
A clustered index dictates the physical location of each row. If we make a column a clustered index, the locations of the table rows on the disk are physically ordered based on that column. This slows down inserts. If we insert a row into a table with clustered index, SQL Server needs to identify first where to insert that row, based on the clustered column(s). No, that row does not automatically go to the end of the table; it goes in the middle of the table depending on the value of the clustered column(s). In extreme cases, on a 40-million-rows fact table could take 1 hour to insert 200,000 rows. This is because SQL Server spends a lot of time to identify where to insert each row. This the main argument why if we decided to have a PK on the fact table, the PK index must be clustered: to improve insert performance.
In data warehousing, the primary consideration is the query performance. Secondary consideration is the load performance. This is because the purpose of a warehouse is to satisfy user queries. When a user queries the warehouse, we only have a few seconds to serve the requests (a few minutes in some cases); where as when loading we have hours.
That is also true for a clustered index in the fact table: the primary consideration for the clustered index in the fact table is the query performance. If the table rows are physically ordered by column1, then a query with column1 in the where clause would enjoy a good performance. In SQL Server 2000, clustered index was the only tool that we have with regards to the physical order of the table rows. But, since SQL Server 2005 we have another tool at our disposal: partitioning.
The physical order of a fact table should support the where clause of data warehouse queries. If it is a periodic snapshot fact table, the physical order should be the snapshot date (or snapshot week/month). This is because the majority of the data warehouse queries specify the snapshot date/week/month in the where clause.
If it is a transaction fact table, generally it is better if we don’t get the fact table rows physically ordered using clustered index. This way, our non-clustered indexes on dim keys will be based on the Row ID (combination of file ID, page ID and per-page row ID). This way, the order of the fact table does not favour a certain dimension, so whatever attributes on the where clause, the query will have the equal performance.
We should only have a transaction fact table physically ordered if the majority of the warehouse queries specify one certain column. For example, in a CRM warehouse, there are many cases where it is beneficial to have the transaction fact table physically ordered on the customer key. In this scenario, queries with customer attributes on the where clause will be faster. Yes, queries without customer attributes on the where clause will be a bit slower, but because it’s a CRM warehouse, majority of the warehouse queries have customer attributes on them. Hence overall it is beneficial (for query performance).
If for any reason, you decided to put a clustered index on the fact table, be it for a PK or on other column to support query performance, you should really consider 1 thing: are you prepared if your fact table load takes twice as long? A clustered index on the fact table really slows the insert down, except if the clustered index is on an identity column.
We need to remember that we have partitioning at our disposal to physically order the fact rows. Traditionally, in SQL Server warehousing (2005 and above) we put the fact table in several spindles. We do this by partitioning the table and put different partitions on different file groups, which in turns are located on different spindles. This way the query is served by several physical disks, hence the IO rate is higher, hence the query is faster.
But, if you don’t have different spindles, you can still benefit from partitioning by arranging the partition function so the majority of the queries only hit 1 partition. This way the table looks a lot smaller to SQL Server. For example, in the case of monthly periodic snapshot fact table, partition the fact table by snapshot month. Majority of the queries are targetting the current or last snapshot month. If you have 10 years (120 snapshot months) of data, those queries will be a lot faster because it only hit 1 partition, which 1/120th portion of the whole table.
Even after using partitioning, we still have clustered index at our disposal to further arrange the physical order of the fact row, within a partition. For example, in a CRM warehouse, we can partition the monthly snapshot fact table by snapshot month, and then create a clustered index on the customer key. In a retail warehouse, a potentially beneficial column to cluster on is the transaction date key.
Sorry I’m too carried away with physical ordering and partitioning, back to the question: should we cluster the PK index? The answer is it depends.
- If your fact table is heavily queried by the front end reports, then you should use the clustered index to support query performance, as I explained above (use partitioning first, clustred index second). Because we can only have 1 clustered index per table, we need to specify any other indexes (including the PK index) as non clustered.
- If your PK is a composite of the dim keys, you should not use that PK for clustered index. In SQL Server, all non-clustered indexes will be very wide/large (and therefore slower) if the clustered index is wide. This is because non-clustered indexes contain the clustered index columns.
- If your PK is a single column integer identity key, and loading performance is more important for you than query performance*, then you should cluster that PK index. Insert is faster if the identity PK is clustered, compared to if it’s not clustered (but fact table with no PK is fastest), as shown in the test below. *for example because the fact table is only used to load SSAS cubes, which read the fact table in its entirity, rather than selective.
— Create 3 fact tables, 2 with identity PK (1clustered 1 non clustered) and 1 without PK
create table FactPKClus
( factkey int identity(1,1),
dimkey1 int, dimkey2 int, dimkey3 int,
measure1 decimal(18,5), measure2 decimal(18,5),
primary key clustered (factkey)
)
create table FactPKNonClus
( factkey int identity(1,1),
dimkey1 int, dimkey2 int, dimkey3 int,
measure1 decimal(18,5), measure2 decimal(18,5),
primary key nonclustered (factkey)
)
create table FactNoPK
( factkey int,
dimkey1 int, dimkey2 int, dimkey3 int,
measure1 decimal(18,5), measure2 decimal(18,5)
)
— Create initial data for fact table
create table data ( col1 int , col2 decimal(20,5))
declare @a int
set @a = 1
while @a <= 1000000
begin
insert into data values (convert(int,rand()*1000), 10.1*@a)
set @a = @a + 1
end –10mins
— Populate the fact tables with initial data
insert into FactPKClus (dimkey1, dimkey2, dimkey3, measure1, measure2)
select col1, col1+1, col1+2, col2, col2+7 from data
insert into FactPKNonClus (dimkey1, dimkey2, dimkey3, measure1, measure2)
select col1, col1+1, col1+2, col2, col2+7 from data
insert into FactNoPK (dimkey1, dimkey2, dimkey3, measure1, measure2)
select col1, col1+1, col1+2, col2, col2+7 from data
— Prepare new data to be inserted into both fact tables
create table NewData ( col1 int , col2 decimal(20,5))
declare @a int
set @a = 1000001
while @a <= 1100000
begin
insert into NewData values (convert(int,rand()*1000), 10.1*@a)
set @a = @a + 1
end
— Insert new data into fact tables and measure the performance
insert into FactPKClus (dimkey1, dimkey2, dimkey3, measure1, measure2)
select col1, col1+1, col1+2, col2, col2+7 from NewData –3.96s (25,252 rows/s)
insert into FactPKNonClus (dimkey1, dimkey2, dimkey3, measure1, measure2)
select col1, col1+1, col1+2, col2, col2+7 from NewData –-4.71s (21,231 rows/s)
insert into FactNoPK (dimkey1, dimkey2, dimkey3, measure1, measure2)
select col1, col1+1, col1+2, col2, col2+7 from NewData –3.31s (30,221 rows/s)
/*Measurements are taken 10 times each and then averaged. See here for measurement methods.
System Spec: Acer Aspire 5051 AWXMi, AMD Turion 64 Mobile MK-36 799 MHz,
512 MB RAM, Win XP SP3, SQL Server 2008 R2 Nov CTP*/
In the test above, there are 2 fact tables with PK. The one which the PK is clustered is 19% faster on insert than the one with nonclustered PK (25,252 rows/s compared to 21,231 rows/s). But, the one without a PK is still faster (30,221 rows/s), that’s 20% faster than the clustered PK.
Note that the concept of ‘clustered index’ is SQL Server specific. It does not apply on other platforms. Teradata for example, have a different data distribution mechanism which utilises “primary index”.
Clustered index is the default for PK
In practice I found there are many fact tables created with a clustered index on the primary key. Not only identity PK, but also non-identity PK. This is because, by default, SQL Server will create the PK index as clustered. If we don’t specify whether the primary key should be clustered or non clustered, by default it’s clustered. For example:
create table Fact1
( dimkey1 int,
measure1 decimal(18,5)
primary key (dimkey1)
)
create table Fact1
( dimkey1 int primary key,
measure1 decimal(18,5)
)
In both examples above, SQL Server creates the primary key on dimkey1 as clustered index. So the fact table is physically ordered on dimkey1 column.
To create the primary key as non clustered index, the command is:
create table Fact1
( dimkey1 int,
measure1 decimal(18,5)
primary key nonclustered (dimkey1)
)
Conclusion
When you create a fact table,
a) Consider whether you need to be able to identify each row on the fact table. If you don’t need to, don’t create a primary key.
b) If you do need to create a primary key, create it as single identity column.
c) Cluster the PK if: the PK is an identity column and loading performance is more important than query performance
Don’t cluster the PK if: the PK is a composite, or when you need the cluster index to support query performance (together with partitioning)
PS. If the PK is a single identity column, the clustered index won’t give much overhead at insert because new rows are created at the end of the table. Thanks to Marco Russo for pointing this out.