Data Warehousing, BI and Data Science

5 June 2015

Indexing Fact Tables in SQL Server

Filed under: Analysis Services — Vincent Rainardi @ 5:12 pm

This article applies only to SQL Server as it is using clustered and nonclustered indices. It doesn’t apply to Oracle or Teradata or DB2.

Apologies I’m going straight to the matter, without explaining what is a clustered index key, what is a dimension key column, etc. For an introduction to indexing, here are Microsoft’s articles: Clustered and NonClustered Indexes Described, Using Clustered Indexes, Designing An Index.

A common way of indexing a fact table is by putting a clustered index on the fact key column, then put nonclustered index on each of the dimension key column. This will allow the each of the nonclustered index to be efficiently used, whilst ensuring that they are slim.

They are slim because in SQL Server, a nonclustered index uses the clustered index key as the row locator. The clustered index key, as I mentioned above, is the fact key which is a bigint.

They are efficiently used because they are built on each of the dimension key. Each of them as the first index key, not as a second or third index key. This ensures that they will be hit by the queries using that dimension key, because they are the first column in the index key.

But there is a major weekness. Most fact tables are periodic snapshot. Meaning that each query will hit the fact table on the snapshot date key column. Either directly hitting the fact table, or via the date dimension. On the latter case it specifies the snapshot date (2015-03-31), so it will hit the date dimension, and then hit the fact table on the snapshot date key column (20150331).

Because periodic snapshot fact tables are primarily queried on the snapshot date key, it would be a lot better to use the clustered index to hold the snapshot date key column, not the fact key column. The primary key of the fact table should still be the fact key column, but the clustered index is on the snapshot date key column.

A Primary Key doesn’t have to be a clustered index

It is a common misperception in SQL Server world, that a PK must be a clustered index. That is not so. On the create table statement, if we just do “create table X (column1 int primary key …” then SQL Server will put the PK as a clustered index. But we put the PK on the Constraint clause we can specify nonclustered.

Yes SQL Server need to create an index for a PK. This is to enforce that the PK is unique. But that index does not have to be a clustered index. It can very well be a nonclustered index. See this example below:

create table FactHoldings
( FactKey bigint,
  DateKey int,
  PortfolioKey int,
  SecurityKey int,
  CurrencyKey int,
  GrossValueBase decimal(21,4),
  constraint PK_FactKey primary key nonclustered (FactKey)
);

create clustered index CI_DateKey on Fact1 (DateKey);
create nonclustered index NI_PortfolioKey on Fact1 (PortfolioKey);
create nonclustered index NI_SecurityKey on Fact1 (SecurityKey);
create nonclustered index NI_CurrencyKey on Fact1 (CurrencyKey);

Yes, this will take more storage space for the index, because the row locator of NI_PortfolioKey and NI_CurrencyKey will be the DateKey column. But it will be more performant on almost every fact table query because the table is physically ordered on DateKey. So SQL Server will be able to locate the rows for the specified date quicker.

Which Dimension Key Column To Index

I don’t believe that we need to index every single dimension key column. Some dimensions are rarely used in the queries. We should only index the dimensions which are frequently used when that fact table is queried.

Multi-Column Clustered Index

Would it be a good idea to add additional clustered index key to the clustered index? i.e. in the example above, should we make the DateKey + PortfolioKey as the clustered index key? The argument for this is that almost every query uses PortfolioKey. So it will be quicker for the query to locate the row in the fact table.

No, I don’t believe this is a good idea. Because this will make the clustered index key much wider, and in turn affect the size of the row locator in every single nonclustered index. Which in turn will make all nonclustered index less performant.

Multi-Column NonClustered Index

Should we make the nonclustered index a multi-column? i.e. in the example above, should we add SecurityKey to the PortfolioKey?

Generally speaking this is not a good idea. Queries which only use security dimension without using portfolio dimension, will not be able to use the index because it is implemented as the second nonclustered index key. Generally speaking it is better to keep each dimension key column on a separate nonclustered index.

Indexing Accumulative Snapshot Fact Tables

In an accumulative snapshot fact table, there is one (and only one) dimension key which is the Primary Key of the table. This is usually the main dimension in that industry sector. For retail, this is CustomerKey, for Pharma it is the ProductKey, and for Investment it is the SecurityKey.

It is best to put a clustered index in this main dimension key column.

Indexing Transaction Fact Tables

A transaction fact table doesn’t have a Snapshot Date Key column. Instead it has a Transaction Date Key column. The other columns are more or less the same, i.e. for Retail industry: ProductKey, StoreKey, and optionally CustomerKey. For Retail Banking it’s CustomerAccountKey, BranchKey, TransactionTypeKey and optionally ProductLineKey.

In my opinion, it is best to put a clustered index on the FactKey column, with nonclustered index on each of the major dimension key column (major means: the ones which are heavily used, not the ones after the “optionally” above).

Partitioning the Fact Tables

Generally speaking, from performance point of view, partitioning is better than indexing. From maintenance point of view, it is much worse. The Sliding Window method to maintain the partitions requires a lot of work. To overcome this, define 10,000 static partitions enough 10 years (1 partition for each day). This way we don’t need to do any maintenance on any partition.

Partitioning particularly gives much better performance for Periodic Snapshot fact table (PS). For Accumulative Snapshot (AS) and Transaction fact tables the performance gain is not significant. This is because Transaction fact tables are not queried using a particular column. AS is queries (almost always) using the main dimension key, but the number of rows are not significant.

If a PS is partitioned on Snapshot Date Key, what should be the clustering key? The Fact Key column should be the clustering key. And the other dimension key columns should in nonclustering indexes.

Advertisements

2 Comments »

  1. What are your thoughts when dealing with a multi-tenancy key? For a Multi-column NonClustered Index I would think the Tenant Key should be added since it is used on all queries.

    Comment by Dylan Davis — 14 April 2016 @ 2:10 pm | Reply

    • Hi Dylan, I assume you are talking about shared-database, shared-schema, multi-tenant 3NF data architecture (rather than a property investment data warehouse). Yes for a multi-column nonclustered index we should add the tenant key. In a multi-tenant data warehouse (instead of 3NF/normalised data structure), on the dimension table, the PK and clustered index should still be on the SK column, and on the periodic snapshot the clustered index (or partitioning key) should be on the tenant key and snapshot month/date.

      Comment by Vincent Rainardi — 14 April 2016 @ 6:55 pm | Reply


RSS feed for comments on this post. TrackBack URI

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Create a free website or blog at WordPress.com.

%d bloggers like this: