What are the advantages of creating a data warehouse in a normalized format? Some people including myself have been exploring the answer to this question. I’m going to rewrite the answer that I posted in Kimball forum here. But before I do that, let’s clarify the terms: a) normalised, and b) normalised data warehouse.
Normalised: “Normalisation” is the process of removing data redundancy by implementing normalisation rules. I am referring to Codd’s normalisation rules and forms. A “normalised database” is a database which has gone through the process of normalisation.
A “normalised data warehouse” is a database consisting of entities (tables) with parent-child relationship between them, usually normalised to 3rd normal form (3NF). But it could be in other normal forms, such as BCNF, 5NF, etc. Some people call it “relational data warehouse” instead of normalised data warehouse. Some people call it “3NF data warehouse”.
In my book I wrote about 3 data warehouse architectures: Single DDS, NDS+DDS and ODS+DDS. The second one (NDS + DDS) has a normalized database as its master data store. But the presentation layer (DDS) is in dimensional format. That is not what I mean by “normalised data warehouse” in this article.
Advantages for using normalized model are:
1. To minimize disk storage
2. Normalized model is more suitable for some reports
3. To increase flexibility
4. To reduce data integration (ETL)
5. To eliminate data redundancy, making it quicker to update the DW.
My response about the 2 reasons that were posted in the Kimball forum (Teradata and real time):
Even though the terminology and the architecture are slightly different, the “Teradata” reason that you mentioned is generally also applicable to other MPP database system such as Netezza, Greenplum, Neoview or Madison.
In Teradata, the primary index is used to distribute the table rows. Primary index significantly affects the query performance that’s why it’s the main consideration when designing a table. So the main issue when implementing Kimball style fact table (with a composite primary key consisting of dimension surrogate keys) in Teradata is: what is the primary index? If we use a normalized model we don’t have fact tables and we don’t have this problem. In my opinion (Teradata folks, please correct me if I’m wrong) we can solve this problem by creating fact_key, which is a single column surrogate key on the fact table that acts as primary key. fact_key column is also used when implementing a fact table on SQL Server platform, i.e. as the clustered index.
That’s why I think the argument that “MPP database systems require normalized model” is incorrect. I believe that we can implement a DW with dimensional model on an MPP database. If we do that, we will also enjoy the performance benefit of an MPP (i.e. scalable, linear) compared to an SMP database.
Having said that, Tom Russell and Rob Amstrong from Teradata wrote the arguments for using normalized model on MPP, which I believe we can all learn from their expertise: http://www.teradata.com/t/page/115866/index.html
2. Real time DW
What I mean with real time DW here is: when the source system is updated, within 5 seconds that change is reflected in the DW. Generally speaking there are 3 ways to achieve this: 1) using a normalized data model, 2) using a dimensional data model and 3) keep the data in the source system.
A few years ago (we are talking 5-10 years ago), when we need a real time data in DW, we normally use a normalized model. A popular case at that time is class 1 ODS, which Bill Inmon, Claudia Imhoff and Greg Battas explained in chapter 3 of their ODS book (1996). The data structure of an ODS class 1 is almost exactly like the source system. It is normalized.
As Ralph Kimball and Joe Caserta explained in chapter 11 of their ETL toolkit book (originally in 2001, Kimball Design #13), that now we can do real time DW on a dimensional model. This is achieved by creating 2 fact tables, a “static fact table” and a “real time partition”. The “UDM” concept of SSAS 2005 (and 2008) is working on the same principle, the main partitions can be MOLAP where as the real time partition is set as ROLAP.
Over and over this question came up: How do we store history in a normalized DW. History of the master tables that is, not historical transaction records. One popular method is to use audit tables. Now, with change tracking and CDC features on 2008 SQL Server, it is becoming easier to create and populate the audit tables. Using 2008’s CDC, everything is built-in and ready for use out of the box. But the detail explanation as per how etc it’s for a separate article/post.
Update 8/12/2010: I wrote the 2nd version of Normalised Data Warehouse here.