A bit on Primary Key
Yesterday a friend asked me why there was no primary key on a fact table. I explained that the table did have a fact table surrogate key (FSK), but that I had implemented it as a unique non-clustered index rather than as a PK. We needed the physical ordering (the clustered index) to be on the snapshot date, because it was a daily periodic snapshot fact table, queried mostly by snapshot date.
The purpose of having a primary key (PK) is to enforce uniqueness on one column (or a combination of columns) of the table. We can achieve the same thing by creating a unique key, hence we do not need a PK on that fact table.
We need to pause and think whether we need a unique identifier for the rows in the fact table at all. If we need to refer to a fact row from another fact row (self-referencing), then we do need a unique identifier, which is usually a single-column bigint FSK. But this single-column bigint FSK does not have to be a PK; it can be a unique non-clustered index.
The second purpose of having a PK is to enforce not null. Strictly speaking this is not a function of the PK but a requirement of it: a PK requires its columns to be non-nullable. So non-nullability is a property of the column itself, not a property of the PK, and we implement it when declaring the column in the table DDL.
We need to bear in mind that a PK is a logical constraint; it is not tied to either a clustered or a non-clustered index. Unless we specify otherwise, SQL Server will automatically implement a PK as either a unique clustered index (UCI) or a unique non-clustered index (UNCI), depending on whether a clustered index already exists.
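The arrangement described above could be sketched in T-SQL roughly as follows. This is a minimal illustration, not the actual table from the conversation: the table name, the column names and the index names are all hypothetical.

```sql
-- Hypothetical daily periodic snapshot fact table; all names are illustrative.
CREATE TABLE dbo.FactAccountSnapshot
(
    FactSK          bigint IDENTITY(1,1) NOT NULL,  -- fact table surrogate key (FSK)
    SnapshotDateKey int            NOT NULL,        -- NOT NULL is declared on the column
    AccountKey      int            NOT NULL,
    Balance         decimal(18, 2) NOT NULL
);

-- Physical ordering follows the snapshot date, which most queries filter on.
CREATE CLUSTERED INDEX CIX_FactAccountSnapshot_SnapshotDateKey
    ON dbo.FactAccountSnapshot (SnapshotDateKey);

-- Uniqueness of the FSK is enforced by a unique non-clustered index, without a PK.
CREATE UNIQUE NONCLUSTERED INDEX UX_FactAccountSnapshot_FactSK
    ON dbo.FactAccountSnapshot (FactSK);

-- Alternative, if a declared PK is wanted after all: drop the unique index above
-- and ask for the PK to be non-clustered explicitly.
-- ALTER TABLE dbo.FactAccountSnapshot
--     ADD CONSTRAINT PK_FactAccountSnapshot PRIMARY KEY NONCLUSTERED (FactSK);
```

Either way the FSK stays unique and non-nullable, and the clustered index remains free to follow the snapshot date.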
The Clustered Index
A clustered index does two things:
a) Sorts the table physically
b) Acts as the row locator in non-clustered indexes
Point a) is for the performance of the queries. If we don’t partition a periodic snapshot fact table on the snapshot date, the next best thing is to cluster the table on the snapshot date.
But point a) is also for the performance of updates and deletes. It is rare, but in some cases we need to update a periodic snapshot fact table (PSFT). So far I have found only 1 case where I needed to update a PSFT, out of about 150 PSFTs that I have created over the last 10 years. When updating a fact table, it is absolutely crucial that the partitioning key, or the clustered index if the table is not partitioned, is on the business date plus the columns used as the joining criteria between the fact staging table and the PSFT. For such a table the clustered index should not be on the query criteria columns; it is the job of the non-clustered indexes to support the queries.
Point b) means that the narrower the clustered key, the smaller the non-clustered indexes. Some people think that a narrow clustered key also means better non-clustered index performance, but in my opinion and observation this performance increase is negligible.
So that’s the clustered index. For an insert-only PSFT we put the clustered index on the snapshot date plus the query criteria column to support the query performance. For an insert-and-update PSFT we put the clustered index on the joining criteria of the update statement.
For example, if the joining criteria of the update is snapshot date key + order ID (a degenerate dimension, the PK in the source table), whereas the query criteria is snapshot date key + account key, then for insert-only PSFT the clustered index should be snapshot date key + account key whereas for update PSFT the clustered index should be on snapshot date key + order ID.
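As a rough sketch of the two cases in this example, the DDL might look like the following. The two table names and all column and index names are hypothetical and exist only to show the two clustering choices side by side.

```sql
-- Case 1: insert-only PSFT, clustered on the query criteria (hypothetical names).
CREATE TABLE dbo.FactOrderSnapshotInsertOnly
(
    SnapshotDateKey int            NOT NULL,
    OrderID         int            NOT NULL,  -- degenerate dimension
    AccountKey      int            NOT NULL,
    OrderAmount     decimal(18, 2) NOT NULL
);

CREATE CLUSTERED INDEX CIX_FactOrderSnapshotInsertOnly
    ON dbo.FactOrderSnapshotInsertOnly (SnapshotDateKey, AccountKey);

-- Case 2: insert-and-update PSFT, clustered on the join criteria of the update.
CREATE TABLE dbo.FactOrderSnapshotUpdatable
(
    SnapshotDateKey int            NOT NULL,
    OrderID         int            NOT NULL,
    AccountKey      int            NOT NULL,
    OrderAmount     decimal(18, 2) NOT NULL
);

CREATE CLUSTERED INDEX CIX_FactOrderSnapshotUpdatable
    ON dbo.FactOrderSnapshotUpdatable (SnapshotDateKey, OrderID);

-- In case 2 the query criteria are supported by a non-clustered index instead.
CREATE NONCLUSTERED INDEX IX_FactOrderSnapshotUpdatable_Account
    ON dbo.FactOrderSnapshotUpdatable (SnapshotDateKey, AccountKey);
```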
The type of join that the SQL engine chooses when updating the fact table depends not only on the clustered index of the PSFT, but also on the clustered index of the fact staging table (the source of the merge). If we do use the Merge command, we should convert it to update & insert commands. See my article here (link) about the Merge command’s inefficiency.
We can replace the physical ordering functionality above with partitioning. It is common and it makes sense to partition a PSFT, I agree. But I would recommend not partitioning the fact table when we create it, but later on, for these reasons:
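One possible shape for the separate update and insert commands is sketched below. It is only an illustration under assumptions: the staging and fact table names, their columns and the index names are hypothetical, and both tables are clustered on the same joining criteria (snapshot date key + order ID).

```sql
-- Hypothetical fact staging table and PSFT, clustered on the same join columns.
CREATE TABLE dbo.StgOrderSnapshot
(
    SnapshotDateKey int            NOT NULL,
    OrderID         int            NOT NULL,
    AccountKey      int            NOT NULL,
    OrderAmount     decimal(18, 2) NOT NULL
);
CREATE CLUSTERED INDEX CIX_StgOrderSnapshot
    ON dbo.StgOrderSnapshot (SnapshotDateKey, OrderID);

CREATE TABLE dbo.FactOrderSnapshot
(
    SnapshotDateKey int            NOT NULL,
    OrderID         int            NOT NULL,
    AccountKey      int            NOT NULL,
    OrderAmount     decimal(18, 2) NOT NULL
);
CREATE CLUSTERED INDEX CIX_FactOrderSnapshot
    ON dbo.FactOrderSnapshot (SnapshotDateKey, OrderID);

BEGIN TRANSACTION;

-- Step 1: update the fact rows that already exist, joining on the clustered key.
UPDATE f
SET    f.AccountKey  = s.AccountKey,
       f.OrderAmount = s.OrderAmount
FROM   dbo.FactOrderSnapshot AS f
JOIN   dbo.StgOrderSnapshot  AS s
       ON  s.SnapshotDateKey = f.SnapshotDateKey
       AND s.OrderID         = f.OrderID;

-- Step 2: insert the staging rows that have no matching fact row yet.
INSERT INTO dbo.FactOrderSnapshot (SnapshotDateKey, OrderID, AccountKey, OrderAmount)
SELECT s.SnapshotDateKey, s.OrderID, s.AccountKey, s.OrderAmount
FROM   dbo.StgOrderSnapshot AS s
WHERE  NOT EXISTS (SELECT 1
                   FROM   dbo.FactOrderSnapshot AS f
                   WHERE  f.SnapshotDateKey = s.SnapshotDateKey
                   AND    f.OrderID         = s.OrderID);

COMMIT TRANSACTION;
```

Running the update before the insert means the newly inserted rows are not touched a second time by the update.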
a) We should not spend the time unless it is required, and when we create the fact table we don’t yet know if the partitioning is required.
b) When the table is populated and queried we can learn about its condition and behaviour. Much, much better than imagining. For example, is the partitioning required to support query performance, or loading performance?
c) We may have purging on the fact table, limiting the volume so that we don’t need to partition it because the performance is OK.
d) We may need to create a new column for the partitioning key.
e) When we create the fact table, we don’t yet know how the data will be distributed and queried. Purging can change the data distribution. We don’t know the query pattern, for example, 90% of the queries might be on today’s data so we should put it into a separate partition.
f) Point e above dictates the partition maintenance strategy, i.e. whether we have partitioning functions splitting the table into 10,000 partitions or to 100 partitions with a “sliding window” maintenance. At the creation time, we have very limited knowledge of this.
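If we later decide that partitioning is warranted, one way to apply it to an existing clustered table is to rebuild the clustered index onto a partition scheme. The sketch below assumes a hypothetical table clustered on an integer snapshot date key; the table, function, scheme and index names, the monthly boundary values and the single-filegroup mapping are all illustrative assumptions, not a recommended layout.

```sql
-- Hypothetical PSFT that was created unpartitioned, clustered on the snapshot date.
CREATE TABLE dbo.FactBalanceSnapshot
(
    SnapshotDateKey int            NOT NULL,  -- e.g. 20240131
    AccountKey      int            NOT NULL,
    Balance         decimal(18, 2) NOT NULL
);
CREATE CLUSTERED INDEX CIX_FactBalanceSnapshot
    ON dbo.FactBalanceSnapshot (SnapshotDateKey);

-- Later, once the volume and query pattern are known: monthly boundaries (illustrative).
CREATE PARTITION FUNCTION pfSnapshotDate (int)
    AS RANGE RIGHT FOR VALUES (20240101, 20240201, 20240301);

-- Map every partition to the PRIMARY filegroup to keep the sketch simple.
CREATE PARTITION SCHEME psSnapshotDate
    AS PARTITION pfSnapshotDate ALL TO ([PRIMARY]);

-- Rebuild the existing clustered index on the partition scheme,
-- which moves the table's rows into the partitions.
CREATE CLUSTERED INDEX CIX_FactBalanceSnapshot
    ON dbo.FactBalanceSnapshot (SnapshotDateKey)
    WITH (DROP_EXISTING = ON)
    ON psSnapshotDate (SnapshotDateKey);
```

The number of boundary values and how they are maintained (for example a sliding window) would follow from the data distribution and query pattern observed after the table has been in use.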
Non Clustered Indexes
Each surrogate key is ideally indexed. Not combined as 1 index, but as separate indexes. All as non-clustered indexes (NCI). Say we have order date key, customer key, product key, store key, ship date key. Then we create 1 NCI on order date key, 1 NCI on customer key, 1 NCI on product key, 1 NCI on store key and 1 NCI on ship date key.
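We should not combine these NCIs into 1 NCI because the second, third, and fourth columns of the combined NCI will not be used. A combined index is ordered by its first column, so the values of the later columns are not stored in sequential order and a query that filters only on one of them cannot seek on the index.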
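For the five surrogate keys mentioned above, the separate indexes could be created along these lines. The table name, the measure column and the index names are hypothetical.

```sql
-- Hypothetical fact table with the five surrogate keys from the example.
CREATE TABLE dbo.FactSales
(
    OrderDateKey int            NOT NULL,
    CustomerKey  int            NOT NULL,
    ProductKey   int            NOT NULL,
    StoreKey     int            NOT NULL,
    ShipDateKey  int            NOT NULL,
    SalesAmount  decimal(18, 2) NOT NULL
);

-- One non-clustered index per surrogate key, not one combined index.
CREATE NONCLUSTERED INDEX IX_FactSales_OrderDateKey ON dbo.FactSales (OrderDateKey);
CREATE NONCLUSTERED INDEX IX_FactSales_CustomerKey  ON dbo.FactSales (CustomerKey);
CREATE NONCLUSTERED INDEX IX_FactSales_ProductKey   ON dbo.FactSales (ProductKey);
CREATE NONCLUSTERED INDEX IX_FactSales_StoreKey     ON dbo.FactSales (StoreKey);
CREATE NONCLUSTERED INDEX IX_FactSales_ShipDateKey  ON dbo.FactSales (ShipDateKey);
```

With this layout a query filtering on, say, the product key alone has an index whose leading column is the product key.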
See also two of my articles which are related: