In June I wrote about the Normalised Data Warehouse, see here. At that time I was very brief. I’d like to expand on it a little, particularly with regard to “what is it?”, as this seems to be asked by many people. Hence this 2nd article.
Before I start, I’d like to mention that the idea of a normalised data warehouse originated from William H. Inmon in his book “Building the Data Warehouse”. His architecture is known as the Corporate Information Factory (CIF). The idea was further developed by Claudia Imhoff, Nicholas Galemmo and Jonathan G. Geiger in their book “Mastering Data Warehouse Design” (the book doesn’t just cover the normalised DW, but the dimensional DW too). Many other authors have also written about this concept, and their books can be found here. Two books I’d like to particularly mention are #6, DW 2.0 by Inmon, which further developed the idea, and #5 (Barry Devlin’s), which is also relevant. Thanks to Alberto Ferrari for reminding me about this.
The idea of having a centralised data warehouse in normalised format can be confusing. Some people enter the world of data warehousing without prior experience of database design. They have never done the normalisation process. They have never designed a transactional database. They have designed a dimensional data mart, but they don’t know what Codd’s rules are. This week I came across a case like this, and it was the 3rd case I have encountered. So I am writing this article to help in similar situations, which I’m sure will happen again in the future.
I’ll explain the issues, then address them one by one.
- First, they don’t know what a normalised warehouse looks like. They still think in terms of fact and dimension tables.
- Second, they don’t know why a warehouse should be normalised. They have never heard of the concept of a transactional data warehouse.
- Third, they don’t know how to normalise a database. They don’t know the first, second and third normal forms. They don’t know where to put foreign keys.
A lot of people came into data warehousing straight into the Kimball model, into dimensional modelling. To them, a data warehouse is identical to fact and dimension tables. There can be no other form. They read Kimball’s Toolkit book, probably only the first 3 chapters, and then they feel that they now know what data warehousing is. “It is about creating a fact table in the middle and several dimension tables around it”, they say confidently.
This is understandable, because when you Google (or Bing these days) “data warehousing” or “data warehousing books”, the results contain far more dimensional ones than normalised ones. People’s opinions are also like that. For some BI/DW architects and managers, Kimball is the only model of data warehousing. They can explain a lot about the benefits of being dimensional. To them, a normalised warehouse is just a glorified staging area.
OK, that’s all very well, but when I asked whether they had seen a normalised warehouse, they said they hadn’t. Had they learned about a normalised warehouse? They hadn’t. Did they know why some companies create their DW in normalised fashion? No, they didn’t. So it seems one sided. We need to know both sides in order to make informed decisions. So in the rest of this article I’ll explain what a normalised warehouse looks like and why a warehouse should be normalised.
What a Normalised Data Warehouse Looks Like
A normalised data warehouse looks like the database of a transaction system. It does not have fact or dimension tables. It has master and transaction tables. It also has auxiliary tables.
Master tables contain reference data. They are the primary entities (objects) of a business process. They are the what, the who, the where, and the whom of a business process, such as product, customer, account, invoice, region, job, patient, policy, claim, shares, market, and mobile numbers. Transaction tables are entities containing business events, such as payments, orders, subscriptions, gas bills, phone calls, downloads, sales, purchases, and journals. Auxiliary tables are code-decode tables. The master tables contain only the codes; the descriptions are located in the auxiliary tables. For example, in the policy master table, the status of a policy is “W”. What “W” means is stored in an auxiliary table (it means “Written”, as in “underwritten”, a term in insurance).
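To make the three table types concrete, here is a minimal sketch using Python’s built-in sqlite3 module. The table and column names (policy, payment, policy_status and so on) are hypothetical, chosen to match the insurance example above; a real normalised warehouse would have many more columns and tables.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Auxiliary (code-decode) table: holds the descriptions for the codes.
cur.execute("CREATE TABLE policy_status (status_code TEXT PRIMARY KEY, description TEXT)")
cur.execute("INSERT INTO policy_status VALUES ('W', 'Written')")

# Master table: a primary business entity; it stores only the code, not the description.
cur.execute("CREATE TABLE policy (policy_id INTEGER PRIMARY KEY, holder TEXT, "
            "status_code TEXT REFERENCES policy_status)")
cur.execute("INSERT INTO policy VALUES (1, 'customer1', 'W')")

# Transaction table: a business event referencing the master entity.
cur.execute("CREATE TABLE payment (payment_id INTEGER PRIMARY KEY, "
            "policy_id INTEGER REFERENCES policy, amount REAL)")
cur.execute("INSERT INTO payment VALUES (1, 1, 300.0)")

# The code is resolved to its description at query time, by joining:
row = cur.execute("""
    SELECT p.policy_id, s.description, pay.amount
    FROM payment pay
    JOIN policy p ON p.policy_id = pay.policy_id
    JOIN policy_status s ON s.status_code = p.status_code
""").fetchone()
print(row)  # (1, 'Written', 300.0)
```

Note how the description “Written” lives in exactly one place; every policy row just carries the code “W”.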
A normalised data warehouse contains no data redundancy, meaning that data is not repeated. Each piece of data is defined in only 1 place. This is because the tables have been normalised. Normalisation is the process of eliminating redundant data.
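To illustrate what normalisation eliminates, here is a toy Python example (the data is made up): a flat order list repeats the customer’s city on every row, whereas the normalised version stores the city once, so a change only needs to be applied in one place.

```python
# Denormalised: the customer's city is repeated on every order row (redundant).
orders_flat = [
    {"order_id": 1, "customer": "customer1", "city": "London", "amount": 300},
    {"order_id": 2, "customer": "customer1", "city": "London", "amount": 200},
]

# Normalised: each piece of data is defined in only one place.
customers = {"customer1": {"city": "London"}}
orders = [
    {"order_id": 1, "customer": "customer1", "amount": 300},
    {"order_id": 2, "customer": "customer1", "amount": 200},
]

# If the customer moves, there is only one place to update:
customers["customer1"]["city"] = "Leeds"

# Every order reflects the change when joined back to the customer:
print([customers[o["customer"]]["city"] for o in orders])  # ['Leeds', 'Leeds']
```

In the flat version, the same move would require updating every order row, and missing one would leave the data inconsistent.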
A normalised data warehouse contains many tables, say 80 to 100 tables, and in some cases more than 100. It covers many areas of the organisation, not only 1 or 2 areas. For example, if it is a retail bank, it covers customer service, loans, withdrawals and deposits, account servicing, credit worthiness, settlement, marketing and promotions, taxes, contact and communication, cheque clearance, credit cards, finance, etc. No, it does not usually contain IT or HR data.
A normalised warehouse often contains data from many companies or subsidiaries. For example, if a group has 29 subsidiary companies in 17 countries, the normalised warehouse probably contains data from 19 of the 29 companies. No, it does not usually contain all companies, for practical and political reasons. Practical as in project management, budget and time constraints. Political as in the US, Europe and Asia having their own IT organisations & budgets, and hence their own DW projects.
A normalised data warehouse is often updated incrementally. In the source system there is usually a transaction identifier which is incremental (as in, going from a small number to a big one), such as invoice ID, order ID, transaction ID, and call ID. So we can update the warehouse transaction table incrementally (as in, the opposite of truncate-reload). Yes, sometimes this transaction identifier is recycled (it goes back to zero after 999999…), but there are techniques to deal with that, so we can still load the warehouse incrementally. Yes, in some cases old transaction rows are updated, but there are techniques to deal with that too (please refer to my article on SQL Server Central titled Data Extraction Methods). These days, with change tracking and CDC built into SQL 2008, and into practically all other top RDBMSes as well, it’s only a matter of time before all transaction systems implement some kind of mechanism to enable incremental extraction of data.
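The simplest form of incremental extraction based on an ever-increasing transaction ID can be sketched as follows. This is a toy illustration with hypothetical data, not a production loader: the loader remembers the highest ID it has loaded (the watermark) and on each run pulls only the rows above it.

```python
# Source transaction rows, keyed by an incremental transaction ID.
source_orders = [
    {"order_id": 1, "amount": 100},
    {"order_id": 2, "amount": 250},
]
warehouse_orders = []
last_loaded_id = 0  # the watermark, persisted between loads

def incremental_load():
    """Copy only the source rows above the watermark into the warehouse."""
    global last_loaded_id
    new_rows = [r for r in source_orders if r["order_id"] > last_loaded_id]
    warehouse_orders.extend(new_rows)
    if new_rows:
        last_loaded_id = max(r["order_id"] for r in new_rows)
    return len(new_rows)

print(incremental_load())  # 2  (initial load picks up both rows)
source_orders.append({"order_id": 3, "amount": 75})
print(incremental_load())  # 1  (only the new row is loaded)
```

The techniques mentioned above for recycled IDs and updated rows are refinements of this basic pattern (e.g. combining the ID with a timestamp, or using CDC instead of a watermark).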
How about master tables? Are they extracted incrementally? More often than not, for small master tables the approach is “upsert all rows”, because it’s quicker, more robust, and more practical. Small as in 10k rows and below. How about deletion? In those master tables, I mean, not in the transaction tables. Do we delete the rows in the normalised warehouse? No, they are only marked as deleted (soft delete). Why? a) Because they could be referenced by the transaction tables, and b) to keep the information, because it is still useful.
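A sketch of “upsert all rows” with soft deletes, again with hypothetical data: every source row is inserted or updated, and any warehouse row whose key has disappeared from the source is flagged rather than deleted.

```python
# Warehouse master table keyed by the business key; is_deleted marks soft deletes.
warehouse_customers = {}

def upsert_all_rows(source_rows):
    """'Upsert all rows': insert new keys, update existing ones,
    and soft-delete keys that have disappeared from the source."""
    source_keys = {r["customer_id"] for r in source_rows}
    for r in source_rows:
        warehouse_customers[r["customer_id"]] = {**r, "is_deleted": False}
    for key, row in warehouse_customers.items():
        if key not in source_keys:
            row["is_deleted"] = True  # keep the row; transactions may reference it

upsert_all_rows([{"customer_id": 1, "name": "customer1"},
                 {"customer_id": 2, "name": "customer2"}])

# customer2 vanishes from the source: mark it, don't delete it.
upsert_all_rows([{"customer_id": 1, "name": "customer1"}])
print(warehouse_customers[2]["is_deleted"])  # True
```

In SQL terms this is typically a MERGE (or an update-then-insert pair) plus an update setting the deleted flag for missing keys.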
A normalised DW is often an amalgamated (or integrated) version of several systems. Yes, in a way it is a glorified staging area. I would prefer to call it an “integration area” though, because that’s what happens in a normalised warehouse: data integration. In fact, data integration is where the biggest effort is spent in building a normalised warehouse. It is often one of the main reasons for creating a normalised warehouse. More on this later, when I explain the “why”. Right now I’m focusing on the “what”.
Sometimes a normalised DW contains snapshotted transaction history. This is basically a copy of a transaction table as it was some time ago, or a copy of several transaction tables joined together, as they were some time ago. Some time ago as in, every month end. Snapshotting transaction history is one way of storing the old version of transaction data.
Old Version of Transaction Data
Let’s first be clear about what “old version of transaction data” means, because people often confuse it with “transaction history”. Transaction data is, by its very nature, historical. The order table stores orders from many years ago until today. Older years (say 1998) are usually archived into another table, or offline (as in, not in the system, but can be restored if required), or simply deleted (gone forever, not even stored on offline storage). Only recent years are in the transaction table, say the last 2 years. So your transaction table (say the order table) contains 2009 and 2010 data, and the 2000 to 2008 orders are put into a table called “order history”. This is what we call “transaction history”. That is not what I mean by “old version of transaction data”.
Say payment1 was created on 1/1/2001 for customer1, amount = £300. 2 months later, on 1/3/2001, payment1 was modified. The amount was changed to £200, or the customer was changed to customer2. That’s what I mean by the old version of transaction data, i.e. £300 and customer1. The new version is £200 and customer2.
One way of keeping the old version of transaction data is the “snapshotting technique”, as in Kimball’s “snapshot fact table” technique. Another way is the same technique that SQL Server’s CDC employs: having a second table with the same structure, plus 3-4 extra columns at the end. Before a transaction row is updated, it is copied into this second table first. This was previously known as the “audit table technique”, as it was used for auditing purposes. A normalised data warehouse keeps the old version of transaction data using one of the two techniques above.
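The audit table technique can be sketched like this, using the payment1 example above (the structures and the audit column names are illustrative): before a row is updated, its current state is copied into a second table that mirrors the first, with a few extra audit columns appended.

```python
# Current transaction table and its audit ("before image") table.
payments = {1: {"payment_id": 1, "customer": "customer1", "amount": 300.0}}
payments_audit = []  # same structure, plus extra audit metadata columns

def update_payment(payment_id, **changes):
    """Before updating, copy the current row into the audit table."""
    old = dict(payments[payment_id])  # take the before image
    payments_audit.append({**old,
                           "audit_action": "U",
                           "audit_seq": len(payments_audit) + 1})
    payments[payment_id].update(changes)

# The example from the article: amount GBP 300 -> 200, customer1 -> customer2.
update_payment(1, amount=200.0, customer="customer2")
print(payments[1]["amount"], payments[1]["customer"])              # 200.0 customer2
print(payments_audit[0]["amount"], payments_audit[0]["customer"])  # 300.0 customer1
```

In a database this copy is usually done by an update trigger or by the CDC mechanism itself, so the application code never touches the audit table directly.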
Normalised data warehouses are often found on parallel database systems (MPP = Massively Parallel Processing) such as Teradata, Netezza, Oracle Exadata and Microsoft PDW. Querying a normalised data warehouse requires more computing power, because we have to join a lot of tables. If you are using an SMP database system, indexing and partitioning play a major part in the performance. If you are using an MPP database system, data distribution is the key.
A normalised DW sometimes contains the “old versions” of master data. It keeps the old versions using the “audit table technique” described above. No, it does not use SCD techniques like a dimensional warehouse does.
Reasons for Normalising a Data Warehouse
The main reason for normalising a data warehouse is so that there is no data redundancy in the warehouse. This has multiple benefits. The obvious one is that there is only 1 place to update when the data changes. This means that the ETL/data load is easier and more efficient (faster).
The second reason is to keep the data warehouse real time. Because there is only 1 place to update (or insert into), the update is quick and efficient.
Third, to manage master data. The idea is that rather than having MDM as a separate system, the master tables in the normalised warehouse become the master store. This is because the normalised warehouse is updated by all source systems. The normalised warehouse broker table (in the case of an insurance warehouse) becomes the master broker table. The normalised warehouse airport table (in the case of a travel data warehouse) becomes the master airport table. The customer table in the normalised warehouse becomes the master customer table: the source of all customer data in the enterprise. It is a popular choice to place the CDI (Customer Data Integration, i.e. MDM for customers) for a CRM project in a normalised DW instead of a dimensional DW.
Fourth, to enable the enterprise to maintain consistency between multiple dimensional data marts (please refer to chapter 2 of my book). Having a normalised DW, you can create a dimensional DM for any business area, for a specified time period, or for a particular range of products/customers/geography. For businesses where data is a sellable asset, this capability has real business appeal, e.g. you can take different cuts of your data and price each cut differently.
Fifth, to make data integration easier. If you have integrated multiple business systems you will appreciate that the core of the work is in the mapping, i.e. where each column from each system should be loaded in the target system. Because in a normalised DW each piece of data is located in only one place, it is easier to update the target. Secondly, because the source systems are usually normalised, it is easier to map them to a normalised DW: both the source and the target are normalised, so there are more similarities than with a dimensional DW. In a normalised DW the data are located in different tables, not combined into 1 big dimension table such as a customer dim or a product dim.
Sixth, to save development effort. This is an argument that MPP people usually make. Why have a 1-year DW project to create a dimensional model of your data, create the ETL to populate it, and update it every day? Why not just copy the structure of the source business system into the MPP database system, where we can query it 1000 times faster? Or, if we are not copying it exactly, at least we can have a very similar structure/data model so the data load is very simple, as the tables in the normalised DW map almost one-to-one to the source system.
So there you are. Those are the “what” and the “why” of a normalised DW. I hope this article is useful for those who have never seen a normalised DW, at least to give them an idea. As usual, I might have made mistakes, you may have different opinions, etc., so I welcome any suggestions, discussion and corrections at firstname.lastname@example.org.
Vincent Rainardi, 8/12/2010
PS. I wrote this about 2 months ago, but didn’t have a chance to complete it. I wrote the last 3 paragraphs today.