Some companies built their data warehouses without surrogate keys. They have a customer table in their business system, with the primary key of CustomerId. In the data warehouse, they also use CustomerId as the primary key. One would thought “oh it’s probably an Inmon style warehouse”, but no, it was fact and dim, Kimball style warehouse. In the DimCustomer, they don’t have a surrogate key. Instead, they use CustomerId as the PK!
Talking to them I tried to understand the reasons why they did it that way. “Because we just want to truncate the DimCustomer and reload it every day. That would make it simple. We don’t need to find out which rows have changed and which rows are new rows. We simply delete everything in the dim and reload the customer table from the OLTP!” Then I realised that was the reason they couldn’t have a surrogate key in their DimClient. Because if they did they could not truncate reload the dim because the surrogate key would have changed. If on day 1 they have 1000 rows with ClientKey of 1 to 1000. On day 2, after truncate reload, the ClientKey will become 1001 to 2000. And that makes their live even more difficult: they would have to change the ClientKey in the fact table.
So for that reason, they use ClientId as well in the fact table, i.e. the PK in the OLTP. This way, in the dim the PK is 1 to 1000, and in the fact table, the PK is also 1 to 1000, and they match!
Figure 1. A Data Warehouse with no Surrogate Keys
Well, if not having surrogate keys make life easier, why do we bother creating surrogate keys in the data warehouse?
- To maintain the history of attributes in the dimension (type 2)
- To be able to point the old fact table to the old row in the dimension
“But we don’t need the old version of the row. We always want to see the latest version! Why do we need to have SK in the DW?”
To integrate multiple source systems which may have the same natural keys, i.e. the same CustomerId or ProductId.
“Oh that easy, we add SourceId column in both the dimension and the fact and we can join OK!”
True, that is true, we can join multiple source systems by using SourceId column in both the fact and the dim.
So that we can have the Unknown Row. If we don’t know who the Customer is for a particular sales transaction, we point the fact row to this Unknown Row. Same with DimProduct and all other dimensions.
If the CustomerId changes from numeric to alphanumeric in the OLTP, we don’t need to change all the fact tables. (We do if we use CustomerId as the primary key of the dimension). This is actually the core reason why we use surrogate key: because the natural key can change. We want to be able to update the natural key easily, without updating all the fact tables that use that dimension.
Another advantage is a surrogate key is performance: it is 1 integer column. The natural key may consist of several columns, for example: the PK of a bank account table is: branch code + product type code + account number. It is possible that 2 banks merged and their accounts were merged too, causing duplicate hence they prefix it again with the bank code, so it becomes: bank code + branch code + product type code + account number. That is long. Only one integer column is much shorter and more performant. It is quicker to index and quicker to find.
Figure 2. A Data Warehouse with Surrogate Keys
So those are the reasons why we have surrogate keys in the data warehouse! I hope this article helps those who are convinced that they need to use star schema and surrogate keys, but never knew the reasons. I hope they now understand the reasons behind using surrogate keys in a star schema. Personally this case helped me understand the reasons why some people chose not to use surrogate key. Throughout the years I’ve heard many “pro arguments” about using SK, but very few “contra arguments” in using surrogate keys.
In this particular case, they didn’t even know how to maintain a surrogate key. They liked the idea of “maintaining the history of the attributes” after I explained it, but they didn’t know the technicalities to do it in SSIS. It seems very complex and would take a lot of their time. When I introduced SCD transform in SSIS workflow to them, their eyes were bright and cheerful. “Wow that would make things a lot easier,” they said. Over the years, some companies have tried to make this Surrogate Key build business easier. Kalido and Wherescape for example, enable us to build a DW without worrying about the complicated issue of Surrogate Keys.
Once again the main reason of using surrogate key is to detach the data warehouse from the source system. To make the data warehouse more independent from the source system, and therefore gives us a more flexible design as a result. And the main benefit is that SKs enable us to maintain the history of the attributes in the dimension using SCD type 2.
Vincent Rainardi, 9/5/2012