A Data Warehouse with No Surrogate Keys

9 May 2012

A Data Warehouse with No Surrogate Keys

Filed under: Data Warehousing — Vincent Rainardi @ 11:07 pm
Tags: Data Warehousing

Some companies built their data warehouses without surrogate keys. They have a customer table in their business system, with the primary key of CustomerId. In the data warehouse, they also use CustomerId as the primary key. One would thought “oh it’s probably an Inmon style warehouse”, but no, it was fact and dim, Kimball style warehouse. In the DimCustomer, they don’t have a surrogate key. Instead, they use CustomerId as the PK!

Talking to them I tried to understand the reasons why they did it that way. “Because we just want to truncate the DimCustomer and reload it every day. That would make it simple. We don’t need to find out which rows have changed and which rows are new rows. We simply delete everything in the dim and reload the customer table from the OLTP!” Then I realised that was the reason they couldn’t have a surrogate key in their DimClient. Because if they did they could not truncate reload the dim because the surrogate key would have changed. If on day 1 they have 1000 rows with ClientKey of 1 to 1000. On day 2, after truncate reload, the ClientKey will become 1001 to 2000. And that makes their live even more difficult: they would have to change the ClientKey in the fact table.

So for that reason, they use ClientId as well in the fact table, i.e. the PK in the OLTP. This way, in the dim the PK is 1 to 1000, and in the fact table, the PK is also 1 to 1000, and they match!

Figure 1. A Data Warehouse with no Surrogate Keys

Well, if not having surrogate keys make life easier, why do we bother creating surrogate keys in the data warehouse?

To maintain the history of attributes in the dimension (type 2)
To be able to point the old fact table to the old row in the dimension

“But we don’t need the old version of the row. We always want to see the latest version! Why do we need to have SK in the DW?”

To integrate multiple source systems which may have the same natural keys, i.e. the same CustomerId or ProductId.

“Oh that easy, we add SourceId column in both the dimension and the fact and we can join OK!”

True, that is true, we can join multiple source systems by using SourceId column in both the fact and the dim.

So that we can have the Unknown Row. If we don’t know who the Customer is for a particular sales transaction, we point the fact row to this Unknown Row. Same with DimProduct and all other dimensions.

If the CustomerId changes from numeric to alphanumeric in the OLTP, we don’t need to change all the fact tables. (We do if we use CustomerId as the primary key of the dimension). This is actually the core reason why we use surrogate key: because the natural key can change. We want to be able to update the natural key easily, without updating all the fact tables that use that dimension.

Another advantage is a surrogate key is performance: it is 1 integer column. The natural key may consist of several columns, for example: the PK of a bank account table is: branch code + product type code + account number. It is possible that 2 banks merged and their accounts were merged too, causing duplicate hence they prefix it again with the bank code, so it becomes: bank code + branch code + product type code + account number. That is long. Only one integer column is much shorter and more performant. It is quicker to index and quicker to find.

Figure 2. A Data Warehouse with Surrogate Keys

So those are the reasons why we have surrogate keys in the data warehouse! I hope this article helps those who are convinced that they need to use star schema and surrogate keys, but never knew the reasons. I hope they now understand the reasons behind using surrogate keys in a star schema. Personally this case helped me understand the reasons why some people chose not to use surrogate key. Throughout the years I’ve heard many “pro arguments” about using SK, but very few “contra arguments” in using surrogate keys.

In this particular case, they didn’t even know how to maintain a surrogate key. They liked the idea of “maintaining the history of the attributes” after I explained it, but they didn’t know the technicalities to do it in SSIS. It seems very complex and would take a lot of their time. When I introduced SCD transform in SSIS workflow to them, their eyes were bright and cheerful. “Wow that would make things a lot easier,” they said. Over the years, some companies have tried to make this Surrogate Key build business easier. Kalido and Wherescape for example, enable us to build a DW without worrying about the complicated issue of Surrogate Keys.

Once again the main reason of using surrogate key is to detach the data warehouse from the source system. To make the data warehouse more independent from the source system, and therefore gives us a more flexible design as a result. And the main benefit is that SKs enable us to maintain the history of the attributes in the dimension using SCD type 2.

Vincent Rainardi, 9/5/2012

Comments (5)

5 Comments »

[…] A Data Warehouse with No Surrogate Keys Share this:EmailPrintFacebookShareDiggRedditStumbleUpon […]

Pingback by Surrogate Keys | James Serra's Blog — 17 May 2012 @ 12:53 am | Reply
You didn’t convince me. If they use CustomerId key from source system as a key in Data Warehouse that is fine under the condition that CustomerId is accually fake, surrogate key in source system. So it’s like when you use int for your CustomerId and it’s used as primary key, it probably (for 99% of the cases) won’t change to the different type. So I believe in this case it’s much better to use keys from source db.

Comment by Kuba — 2 August 2012 @ 9:39 am | Reply
- Thank you Kuba for your comments
  
  Comment by Vincent Rainardi — 2 August 2012 @ 11:04 pm | Reply
You did not convince me either – your statement about speed is in accurate ( with today’s computing power – the delta between and single int key and a multi – int/txt key is negilable). You did not address the daunting complexity’s surrogation demands,
Like:
DEV – QA -DR sites will all have same complexity,
Drill though where the Natural Keys are what is required to make intellegable sense of the macro numbers your ware housing supports has to be built for everything,
Error handling for the every present *no Surrogate found condition – most all of the edit checking in ETL is just to verify a good Surrogate — Imagine if they weren’t used,
In short – it is an obsolete philosophy born out of using 1978 technology.

Comment by Bill Fleisher — 29 August 2012 @ 10:06 pm | Reply
- Thank you Bill Fleisher for your comments. I learned a lot from your short comment as it contains deep things. You must be a very experienced person in data warehousing and I’m grateful that you spent the time visiting this page and commenting. You are right I was too one sided and should have highlighted the daunting complexity of SKs. Building a DW without SKs gives the sheer advantage of much simpler development. And you are right about computing power, the difference speed is negligable. Perhaps there is a case (cost benefit analysis) of building a DW without SKs.
  
  Comment by Vincent Rainardi — 30 August 2012 @ 6:57 am | Reply