Data Warehousing and Machine Learning

29 December 2021

Foreign Keys in Fact Tables

Filed under: Data Warehousing — Vincent Rainardi @ 7:47 am

I have written about whether we need to create a physical primary key in the fact table or not (link), including whether that primary key should be a single column or a composite. But a couple of weeks ago someone asked me whether we need to physicalise the foreign keys in the fact tables. This is an interesting topic because we all know that in Kimball dimensional modelling we have surrogate keys in the fact tables, but it has long been debated in the data warehousing community whether these surrogate keys should be backed by physical foreign key constraints.

The thinking amongst many data warehousing practitioners is that the surrogate keys are fully controlled by the dimensional business key lookup process, so they don't need to be physicalised. Besides, physical constraints would slow down inserts into the fact tables, and the fact table insert is one of the data warehouse loading tasks that needs to be kept as efficient as possible, because fact tables can be very large. This is why we employ techniques such as partition switching, disabling indexes, and so on.

The point is, if the dimensional key lookup process ensures that all the surrogate keys (SK) in the fact tables are valid and correct, then why do we need to physicalise the SKs? For many DW practitioners, me included, referential integrity (RI) in the data warehouse should not take the form of physicalised foreign keys, but should be maintained by the dimensional key lookup process (link, link point 5). This lookup process ensures not only that the SKs in the fact tables are valid, but also that they are correct.
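To make the dimensional key lookup concrete, here is a minimal sketch of that ETL step in Python. All names, keys and the unknown-member convention are illustrative assumptions, not a real implementation:

```python
# Dimension lookup built when the dimension was loaded:
# business key -> surrogate key. Keys here are hypothetical.
customer_sk_lookup = {"CUST-001": 1, "CUST-002": 2, "CUST-003": 3}

UNKNOWN_MEMBER_SK = 0  # SK of the "unknown" dimension row

def resolve_surrogate_keys(staging_rows):
    """Replace the natural (business) key on each staged fact row with
    the dimension surrogate key; unmatched rows get the unknown member,
    so every SK written to the fact table is guaranteed to exist."""
    fact_rows = []
    for row in staging_rows:
        sk = customer_sk_lookup.get(row["customer_bk"], UNKNOWN_MEMBER_SK)
        fact_rows.append({"customer_sk": sk, "amount": row["amount"]})
    return fact_rows

staging = [
    {"customer_bk": "CUST-002", "amount": 150.0},
    {"customer_bk": "CUST-999", "amount": 75.0},  # no matching dimension row
]
print(resolve_surrogate_keys(staging))
# → [{'customer_sk': 2, 'amount': 150.0}, {'customer_sk': 0, 'amount': 75.0}]
```

Because every business key is resolved through the lookup (or mapped to the unknown member), the fact rows cannot carry an SK that is absent from the dimension, which is exactly the RI guarantee a physical constraint would otherwise provide.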

Practitioners Discussion

Nicholas Galemmo, author of the book "Mastering Data Warehouse Design", link:
In general, the process of assigning surrogate keys in the ETL process is more than sufficient to ensure FK RI. Enforcing constraints in the database is redundant and not necessary. Doing so slows down the load process. However, some BI tools make use of FK declarations to ease defining relationships between tables. Therefore it is common practice to declare FKs, but not implement constraints to enforce them.

Hang from Brisbane, Australia, link:
Since ETL is a tightly controlled process, any database level RI requirements in data warehouse should be superseded by ETL. In OLTP system, RI constraints are targeted at data entry on record level and hence can protect data integrity. In data warehouse, data are loaded in batches by ETL and RI validation is just a bear minimum part of the whole complex process. RI constraints has performance implication to ETL, and some ETL also has its own peg order about loading the data which could require undoing the RI constraints. In my experience, the only value having database RI is to automatically draw the connection lines between tables by some modelling tools.

Martin Rennhackkamp, link, point 5:
The approach we typically follow is to define the foreign key constraints in the model, and implement and activate them in the development, testing and acceptance environments. However, once you are sure that the foreign key constraints are properly adhered to by the ETL processes, you do not implement and enforce them in the database in the production environment.

What Ralph Kimball Says

One of the old sayings in data warehousing is, "when in doubt, look at what Ralph Kimball says". And in this case, Ralph Kimball and Joe Caserta say this in their 2004 book, The Data Warehouse ETL Toolkit (link), page 212:

——————————————————

There are three main places in the ETL pipeline where referential integrity can be enforced:

  1. Check the “load tables” before loading them into fact tables:
    Careful bookkeeping and data preparation just before loading the fact table records into the final tables, coupled with careful bookkeeping before deleting any dimension records.
  2. Check while loading into fact table using “DBMS RI Check”:
    Enforcement of referential integrity in the database itself at the moment of every fact table insertion and every dimension table deletion.
  3. Check the integrity of tables at later times:
    Discovery and correction of referential integrity violations after loading has occurred, by regularly scanning the fact table looking for bad foreign keys.

Practically speaking, the first option usually makes the most sense. One of the last steps just before the fact table load is looking up the natural keys in the fact table record and replacing them with the correct contemporary values of the dimension surrogate keys.

The second option of having the database enforce referential integrity continuously is elegant but often too slow for major bulk loads of thousands or millions of records. But this is only a matter of software technology.

The third option of checking for referential integrity after database changes have been made is theoretically capable of finding all violations but may be prohibitively slow. Perhaps the query can be restricted only to the data that has been loaded today. This is a sensible approach that probably should be used as a sanity check even if the first approach is the main processing technique.

——————————————————

Bear in mind that Ralph Kimball and Joe Caserta wrote this 17 years ago, in 2004, when database technologies were not as advanced as they are today. But it looks like they are in favour of option 1, i.e. maintaining RI in the ETL.
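Kimball's third option, the post-load sanity check restricted to today's batch, amounts to an anti-join between the fact table's SKs and the dimension's SKs. A minimal sketch in Python, with all names and data purely illustrative:

```python
# SKs currently present in the dimension table.
dimension_sks = {0, 1, 2, 3}

# Only the fact rows loaded today, per Kimball's suggestion of
# restricting the scan to the current batch.
todays_fact_rows = [
    {"fact_id": 101, "customer_sk": 2},
    {"fact_id": 102, "customer_sk": 9},  # orphan: no dimension row with SK 9
]

def find_ri_violations(fact_rows, valid_sks):
    """Return fact rows whose surrogate key has no matching dimension row."""
    return [r for r in fact_rows if r["customer_sk"] not in valid_sks]

orphans = find_ri_violations(todays_fact_rows, dimension_sks)
print(orphans)
# → [{'fact_id': 102, 'customer_sk': 9}]
```

In a real warehouse this would be a `NOT EXISTS` or left-join query against the dimension, filtered on the batch date, rather than an in-memory scan.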

Steve Flynn’s Advice

Some of the best advice I have found is from Steve Flynn (link), who says the following (link):

  • At design time, foreign key constraints should be used. They guarantee referential integrity when the facts are inserted and will throw an error that the developer/admin can catch and fix before the users report issues with the data.
  • Design the ETL as if the constraints are not there; i.e. follow Kimball’s recommendation to enforce RI during data preparation.
  • If ETL performance is an issue and removing the foreign key constraints will make a difference, then remove the constraints. But test the pre- and post-ETL performance to see what has been gained, and only do this if it makes a real difference and other performance optimisations haven't been tried first or have been ruled out for other reasons, such as cost.
  • This gives the assurance that RI cannot be compromised, even if modifications are made to a complex ETL process, while at the same time providing scope to optimise fact table load speed by disabling the foreign key constraints should it be required.
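Flynn's pattern of declaring the constraints but bypassing them during the bulk load, then verifying afterwards, can be illustrated with SQLite from Python's standard library, which allows FK enforcement to be switched off per connection and integrity to be checked after the fact. This is only a sketch; a production warehouse would use its own DBMS's equivalent (e.g. disabling or NOCHECK-ing the constraint), and the schema here is hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Schema with a declared FK from the fact table to the dimension.
cur.execute("CREATE TABLE dim_customer (customer_sk INTEGER PRIMARY KEY)")
cur.execute("""CREATE TABLE fact_sales (
    customer_sk INTEGER REFERENCES dim_customer(customer_sk),
    amount REAL)""")
cur.execute("INSERT INTO dim_customer VALUES (1), (2)")

# Bulk load with FK enforcement switched off (SQLite's per-connection
# equivalent of disabling the constraint for the load).
cur.execute("PRAGMA foreign_keys = OFF")
cur.executemany("INSERT INTO fact_sales VALUES (?, ?)",
                [(1, 100.0), (2, 50.0), (9, 25.0)])  # SK 9 is an orphan

# Post-load sanity check: report any rows violating a declared FK.
violations = cur.execute("PRAGMA foreign_key_check(fact_sales)").fetchall()
print(violations)  # one violation row reported, for the orphan insert
```

The declared constraint costs nothing during the load but remains available as a safety net, which is exactly the assurance Flynn describes.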

Conclusion

In my opinion, we do not need to implement RI in the fact tables because:

  1. The SK lookup process in the ETL is sufficient to ensure RI.
  2. In my experience, disabling and enabling FKs before and after the fact load takes a lot of time. And not disabling them makes the load take even longer.
  3. We don’t delete from dimension tables.
  4. The argument that "in SQL Server, FKs are used to optimise star queries" is not true. It is the "star join" optimisation that is used, not the constraint (link).
