Primary Key In Accumulating Snapshot Fact Tables

9 May 2010

Primary Key In Accumulating Snapshot Fact Tables

Filed under: Data Warehousing — Vincent Rainardi @ 11:29 pm
Tags: SQL Server

In data warehousing it is common to have a fact table with the grain the same as the grain of a dimension. The dimension is usually the same as the main dimension, i.e. in CRM: customer, in manufacturing: product, in health care: patient, in insurance: policy, and in banking: account. For example: in retail, we have Fact Customer Purchasing Behaviour. The grain of this fact table is: 1 row for each customer. The fact table contains the amount of money that each customer spent on each of the 7 major product classes, in the last 12 months.

Another example is a workflow of mortgage approval process in retail banking. Grain: 1 row for each account. Measures: number of working days on each of the 5 stages in mortgage approval. From the date the customer applied, until the date the agreement is signed: application days, questionnaire days, verification days, approval days, and signing days. Or until the customer gets the money.

These fact tables are usually accumulating snapshot. In maintaining this table, we do more SQL update statement than insert statement. Usually these fact tables also have other dimension keys, i.e. in the last example of retail banking: it’s mortgage officer dimension, branch dimension, and mortgage (type) dimension.

Right, that’s the background, now the core of this article: Primary Key. We have 2 questions:

In the fact table, should we make the main dimension key (like account key in the mortgage example above) as the Primary Key?
Should we make the account key in the fact table as a Foreign Key to dim account?

As usual, we should not focus on what’s right and what’s wrong. But to: in which case should I make it a PK, and in which case should I not make it a PK.

Should we make the main dimension key as the primary key of the fact table?

Before we start discussing this, I’d like to mention that if you need a background about “Primary Key on the Fact Table”, you can read this article: Primary Key on the Fact Table.

Whether we should make the main dimension key as the PK of the fact table or not, depends on two things:

Is the grain of the fact table really the same as the grain of the main dimension?
Whether we plan to have a surrogate fact key.

The first factor seems obvious, but we really should examine it carefully. I have found several cases where at the first sight the grain seems the same, but with closer examination I found that the grain of the fact table is more detail than the main dimension.

It’s easier to explain with an example. Take the mortgage status fact table example above. The grain of this fact table is (supposedly) 1 row for each mortgage account. We are not supposed to have an account with 2 rows (or more) in this fact table. But, we also have other dim keys in this fact table: branch key, mortgage type key and mortgage officer key. What if it is possible for a mortgage to have more than 1 officer? What if a mortgage was transferred from one branch to another, e.g. in 2008 it was in branch A, but in 2009 it was in branch B? What if a mortgage changed its type, say from “interest only” to “repayment”, or from “fixed rate” to “tracker”? In these cases, for 1 mortgage account we could have 2 rows in the fact table.

In these cases, we have three options:

Maintain the principle that a mortgage can only have 1 row in this fact table, or
Allow for a mortgage to have several rows in this fact table.
Take the offending dimension key out.

If we stick with the principle that a mortgage can only have 1 row in this fact table, then we restrict the other dimension keys. In the above example, the other dimension keys are branch key, mortgage type key and mortgage officer key. We modify them as follows:

Mortgage officer key: change to “main mortgage officer key”, which specifies the main* credit officer dealing with the customer for this mortgage. *see next paragraph.
Branch key: change to “current branch key”, which is the latest branch for this mortgage. If a mortgage was moved from branch A to branch B, then this column contains the key to branch B.
Mortgage type key: change to “current mortgage type key”, which is the latest mortgage type for this mortgage. If a mortgage was changed from “fixed rate” to “tracker”, then this column contains the key to tracker mortgage type.

I would like to comment further with regard to the word “main” that I mark with * above. To determine which one is current we need to have a business rule. As an example: “if there are two account officers assigned to a particular mortgage, then the main mortgage officer is the officer with the highest rank, at the time the mortgage was approved”. Indeed it is possible for an officer to outrank other officer over time (for example, because of promotion), hence the business rule specifies “at the time the mortgage was approved”. Indeed it is very possible that the 2 officers have the same rank, in this case the rule can further specify for example, “in the case that the officers have the same rank, the main mortgage officer is the first signatory on the approval document”.

I found that, over the years, the main problems with defining business rules are:

There isn’t an existing business practice for it and different part of the business have different view on the proposed rule. And we can’t get them to agree.
The rule is technically “not implementable”. As in, the data that is required to implement that rule is not captured anywhere in the system.

In these situations we need to be practical. If the missing data can be captured that usually the best alternative. If not, we need to create the rule with the data that we have. In the case of disagreeing departments or business units, it is often useful to go higher. At higher level, the business priority is usually more obvious, making it easier to decide.

The other two options are to allow for a mortgage to have several rows in this fact table, or remove the offending dimension key column. Allowing a mortgage to have several rows usually adds more issues with regards to the measures values. The issues are usually either a) how to split the measure across rows, or b) double counting.

Removing the offending dimension key column (and keep the grain of the fact table at mortgage account level) can be a good, practical approach. To get the branch key, mortgage type keys and mortgage officer key we will need to join to another fact table which has these keys. Unless you have an issue with the query performance, this is the “cleanest” approach as you keep those dimension keys where they should belong. And not duplicating them.

Whether we plan to have a surrogate fact key

The second factor affecting the decision about whether we should make the main dimension key as the primary key of the fact table is whether we plan to have a surrogate fact key.

I believe that saying “it’s a standard” is an unwise blanketing approach. To have a policy to create a surrogate fact key on each and every fact table without considering their structure and usage is disadvantageous. True that a surrogate (PK) fact key has its advantages, including better query performance, better indexing (specific to SQL Server platform, not applicable to Oracle), and enable us to uniquely identify a fact row. But there are cases where these advantages don’t occur.

One of those cases is accumulating snapshot fact table, with the grain equals to the main dimension. Consider the mortgage status fact table above. Which enables us to monitor closely the number of working days it took for a mortgage to progress from one stage to another. If the grain of this fact table is 1 row for each mortgage account, it make sense to make the mortgage account key as the primary key of the fact table, not by creating a surrogate fact key instead. Creating a surrogate fact key is not necessary because only one of the dimension key column in this fact table determines the granularity of the fact table. The need for a surrogate fact key arises when there are more than one dimension key columns determine the granularity.

Should we make the main dimension key in the fact table as a Foreign Key to the main dimension

In the above example, this mean: should we make account key in the fact table as a FK to the account dimension.

In my opinion, the answer is yes (identifying), because some accounts may not be in the fact table. Cancelled accounts for example. This also guarantees that all the account keys in the fact table are valid. In the fact table we should not have 0 (unknown) account key. The warehouse load (ETL) should check this, in addition to the FK constraint.

Accumulated snapshot fact table is usually not too big. Having an FK (only on 1 column) doesn’t have a negative impact on the load performance. It on the other hand, have an advantage on the query performance (when we join the fact to the account dimension). When the fact is being loaded, we do more update than insert, so this FK doesn’t have an impact on the load performance.

Leave a Comment »

No comments yet.

RSS feed for comments on this post. TrackBack URI

Data Platform and Data Science

9 May 2010