Data Warehousing and Business Intelligence

8 December 2010

Normalised Data Warehouse (2)

Filed under: Data Warehousing — Vincent Rainardi @ 11:59 pm

In June I wrote about the Normalised Data Warehouse, see here. At that time I was very brief. I’d like to expand on it a little bit, particularly with regard to “what is it?”, as this seems to be asked by many people. Hence this 2nd article.

Before I start, I’d like to mention that the idea of a normalised data warehouse originated from William H. Inmon in his book “Building the Data Warehouse”. His architecture is known as the Corporate Information Factory (CIF). This idea was further developed by Claudia Imhoff, Nicholas Galemmo and Jonathan G. Geiger in their book “Mastering Data Warehouse Design” (the book doesn’t just cover the normalised DW, but the dimensional DW too). Many other authors have also written about this concept, and their books can be found here. Two books I’d particularly like to mention are #6, DW 2.0 by Inmon, which further developed the idea, and #5 (Barry Devlin’s), which is also relevant. Thanks to Alberto Ferrari for reminding me about this.

The idea of having a centralised data warehouse in normalised format can be confusing. Some people enter the world of data warehousing without any prior experience of database design. They have never been through the normalisation process. They have never designed a transactional database. They may have designed a dimensional data mart, but they don’t know what Codd’s rules are. This week I came across a case like this, and it was the 3rd such case I have encountered. So I am writing this article to help in similar situations, which I’m sure will happen again in the future.

I’ll explain the issues, then address them one by one.

  • First, they don’t know what a normalised warehouse looks like. They still think in terms of fact and dimension tables.
  • Second, they don’t know why a warehouse should be normalised. They have never heard of the concept of a transactional data warehouse.
  • Third, they don’t know how to normalise a database. They don’t know the first, second and third normal forms. They don’t know where to put foreign keys.

A lot of people come into data warehousing straight into the Kimball model. Into dimensional modelling. To them, a data warehouse is identical to fact and dimension tables. There can be no other form. They read Kimball’s Toolkit book. Probably only the first 3 chapters. Then they feel that they now know what data warehousing is. “It is about creating a fact table in the middle and several dimension tables around it”, they say confidently.

This is understandable, because when you Google (or Bing these days) “data warehousing” or “data warehousing books”, the results contain far more dimensional ones than normalised ones. People’s opinions lean the same way. For some BI/DW architects and managers, Kimball is the only model of data warehousing. They can explain a lot about the benefits of being dimensional. To them, a normalised warehouse is just a glorified staging area.

OK, that’s all very well, but when I asked whether they had seen a normalised warehouse, they said they hadn’t. Had they learned about a normalised warehouse? They hadn’t. Did they know why some companies create their DW in normalised fashion? No, they didn’t. So it seems one-sided. We need to know both sides in order to make informed decisions. So in the rest of this article I’ll explain what a normalised warehouse looks like and why a warehouse should be normalised, i.e. the reasons.

What a Normalised Data Warehouse Looks Like

A normalised data warehouse looks like the database of a transaction system. It does not have fact or dimension tables. It has master and transaction tables. It also has auxiliary tables.

Master tables contain reference data. They are the primary entities (objects) of a business process. They are the what, the who, the where, and the whom of a business process, such as product, customer, account, invoice, region, job, patient, policy, claim, share, market, and mobile number. Transaction tables are entities containing business events, such as payments, orders, subscriptions, gas bills, phone calls, downloads, sales, purchases, and journals. Auxiliary tables are code-decode tables. The master tables contain only the codes; the descriptions are located in the auxiliary tables. For example, in the policy master table the status of a policy is “W”. What “W” means is stored in the auxiliary table (it means “Written”, as in “underwritten”, a term in insurance).
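
To make this concrete, below is a minimal sketch of the three kinds of tables, using made-up names from an insurance-style warehouse. It is not a prescribed schema, just an illustration of the shape:

    -- Auxiliary (code-decode) table: holds the descriptions for the codes.
    CREATE TABLE policy_status
    ( policy_status_code CHAR(1)     NOT NULL PRIMARY KEY,  -- e.g. 'W'
      policy_status_desc VARCHAR(50) NOT NULL               -- e.g. 'Written'
    );

    -- Master table: a primary business entity; it carries only the status code.
    CREATE TABLE policy
    ( policy_id          INT         NOT NULL PRIMARY KEY,
      policy_number      VARCHAR(20) NOT NULL,
      policy_status_code CHAR(1)     NOT NULL
          REFERENCES policy_status (policy_status_code),
      customer_id        INT         NOT NULL  -- points to the customer master table
    );

    -- Transaction table: a business event, referencing the master tables.
    CREATE TABLE premium_payment
    ( payment_id   INT           NOT NULL PRIMARY KEY,
      policy_id    INT           NOT NULL REFERENCES policy (policy_id),
      payment_date DATE          NOT NULL,
      amount       DECIMAL(18,2) NOT NULL
    );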

A normalised data warehouse contains no data redundancy, meaning that the data is not repeated: each piece of data is defined in only one place. This is because the tables have been normalised. Normalisation is the process of eliminating redundant data.

A normalised data warehouse contains many tables, like 80 to 100 tables, in some cases more than 100. It covers many areas in the organisation, not only 1 or 2 areas. For example, if it is a retail bank, it covers customer service, loans, withdrawals and deposits, account servicing, creditworthiness, settlement, marketing and promotions, taxes, contact and communication, cheque clearance, credit cards, finance, etc. No, it does not usually contain IT or HR data.

A normalised warehouse often contains data from many companies or subsidiaries. For example, if a group has 29 subsidiary companies in 17 countries, the normalised warehouse probably contains data from 19 of the 29 companies. No, it does not usually contain all companies, because of practical and political reasons. Practical as in project management, budget, and time constraints. Political as in the US, Europe and Asia each having their own IT organisation & budget, hence their own DW projects.

A normalised data warehouse is often updated incrementally. In the source system there is usually a transaction identifier which is incremental (as in, going from a small number to a big one), such as invoice ID, order ID, transaction ID, or call ID. So we can update the warehouse transaction table incrementally (as in, the opposite of truncate-reload). Yes, sometimes this transaction identifier is recycled (goes back to zero after 999999…) but there are techniques to deal with that, so we can still load the warehouse incrementally. Yes, in some cases some old transaction rows are updated, but there are techniques to deal with that too (please refer to my article on SQL Server Central titled Data Extraction Methods). These days, with change tracking and CDC built into SQL Server 2008, and practically all other top RDBMSes as well, it’s only a matter of time before all transaction systems implement some kind of mechanism to enable incremental extraction of data.
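
As an illustration, here is a minimal sketch of an incremental load driven by the transaction identifier, assuming the illustrative premium_payment structure above exists both in the source system (the source schema below) and in the warehouse (the dw schema). T-SQL style; the names are assumptions, not a fixed design:

    -- Incremental load: pull only the rows whose identifier is higher than
    -- the highest identifier already loaded into the warehouse.
    DECLARE @last_loaded_id INT;

    SELECT @last_loaded_id = COALESCE(MAX(payment_id), 0)
    FROM   dw.premium_payment;

    INSERT INTO dw.premium_payment (payment_id, policy_id, payment_date, amount)
    SELECT payment_id, policy_id, payment_date, amount
    FROM   source.premium_payment
    WHERE  payment_id > @last_loaded_id;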

How about master tables? Are they extracted incrementally? More often than not, for small master tables the approach is “upsert all rows”, because it’s quicker, more robust, and more practical. Small as in 10k rows and below. How about deletion? In those master tables I mean, not in the transaction tables. Do we delete the rows in the normalised warehouse? No, they are only marked (soft delete). Why? a) Because they could be referenced by the transaction tables, and b) to keep the information, because it is still useful.
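
A sketch of the “upsert all rows” approach for a small master table, using T-SQL MERGE (available from SQL Server 2008). The is_deleted flag and the table names are illustrative assumptions:

    -- Upsert the whole master table and soft-delete rows that have
    -- disappeared from the source, rather than physically deleting them.
    MERGE dw.policy AS tgt
    USING source.policy AS src
          ON tgt.policy_id = src.policy_id
    WHEN MATCHED THEN
        UPDATE SET tgt.policy_number      = src.policy_number,
                   tgt.policy_status_code = src.policy_status_code,
                   tgt.is_deleted         = 0
    WHEN NOT MATCHED BY TARGET THEN
        INSERT (policy_id, policy_number, policy_status_code, is_deleted)
        VALUES (src.policy_id, src.policy_number, src.policy_status_code, 0)
    WHEN NOT MATCHED BY SOURCE THEN
        UPDATE SET tgt.is_deleted = 1;  -- soft delete: mark, don't remove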

A normalised DW is often the amalgamated (or integrated) version of several systems. Yes, in a way it is a glorified staging area. I would prefer to call it an “integration area” though, because that’s what is happening in a normalised warehouse: data integration. In fact, data integration is where the biggest effort is spent in building a normalised warehouse. It is often one of the main reasons for creating a normalised warehouse. More on this later, when I explain the “why”. Right now I’m focusing on the “what”.

Sometimes a normalised DW contains snapshotted transaction history. This is basically a copy of a transaction table as it was some time ago, or a copy of several transaction tables joined together, as they were some time ago. Some time ago as in, every month end. Snapshotting transaction history is one way of storing the old version of transaction data.
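
A minimal sketch of a month-end snapshot, again with illustrative table names: the rows are copied as they stand and stamped with the snapshot date, so that older versions can still be queried later:

    -- Month-end snapshot of a transaction table.
    INSERT INTO dw.premium_payment_snapshot
           (snapshot_date, payment_id, policy_id, payment_date, amount)
    SELECT CAST(GETDATE() AS DATE),
           payment_id, policy_id, payment_date, amount
    FROM   dw.premium_payment;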

Old Version of Transaction Data

Let’s be clear first about what the “old version of transaction data” is, because a lot of the time people confuse it with “transaction history”. Transaction data is, by its very nature, historical. The order table stores orders from many years ago until today. Old years (say 1998) are usually archived into another table, or offline (as in, not in the system, but able to be restored if required), or simply deleted (gone forever, not even stored on offline storage). Only recent years are in the transaction table. Say, the last 2 years. So your transaction table (say the order table) contains 2009 and 2010 data, and 2000 to 2008 orders are put into a table called “order history”. This is what we call “transaction history”. That is not what I mean by “old version of transaction data”.

Say payment1 was created on 1/1/2001 for customer1, amount = £300. Two months later, on 1/3/2001, payment1 was modified: the amount was changed to £200, or the customer was changed to customer2. That’s what I mean by the old version of transaction data, i.e. £300 and customer1. The new version is £200 and customer2.

One way of keeping the old version of transaction data is the “snapshotting technique”, as in Kimball’s “snapshot fact table” technique. Another way is to use the same technique that SQL Server’s CDC employs: having a second table with the same structure, plus 3-4 extra columns at the end. Before a transaction row is updated, it is copied into this second table first. This was previously known as the “audit table technique”, as it was used for auditing purposes. A normalised data warehouse keeps the old version of transaction data using one of the two techniques above.
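
Here is a rough sketch of the audit table technique, continuing the illustrative payment example above: a second table with the same columns plus a few audit columns, into which the old version of a row is copied before the row is updated:

    -- Audit table: same structure as the transaction table plus audit columns.
    CREATE TABLE dw.premium_payment_audit
    ( payment_id   INT           NOT NULL,
      policy_id    INT           NOT NULL,
      payment_date DATE          NOT NULL,
      amount       DECIMAL(18,2) NOT NULL,
      audit_action CHAR(1)       NOT NULL,  -- 'U' = update, 'D' = delete
      audit_date   DATETIME      NOT NULL
    );

    -- Before applying an update (here: payment 1 changed from 300 to 200),
    -- preserve the old version of the row.
    INSERT INTO dw.premium_payment_audit
    SELECT payment_id, policy_id, payment_date, amount, 'U', GETDATE()
    FROM   dw.premium_payment
    WHERE  payment_id = 1;

    UPDATE dw.premium_payment
    SET    amount = 200.00
    WHERE  payment_id = 1;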

Normalised data warehouses are often found on parallel database systems (MPP = Massively Parallel Processing) such as Teradata, Netezza, Oracle Exadata and Microsoft PDW. Querying a normalised data warehouse requires more computing power, because we have to join a lot of tables. If you are using an SMP database system, indexing and partitioning play a major part in the performance. If you are using an MPP database system, data distribution is the key.
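
On an MPP platform the main physical design decision is the distribution key, i.e. the column that decides which node each row lands on. As a rough sketch (PDW-style syntax shown; Teradata would use PRIMARY INDEX instead, and the exact clause varies by product), the illustrative payment table from the earlier examples could be hash-distributed on the column it is most often joined on:

    -- Hash-distribute the big transaction table on its main joining key so
    -- that joins to the policy table can stay local to each node.
    CREATE TABLE dw.premium_payment
    ( payment_id   INT           NOT NULL,
      policy_id    INT           NOT NULL,
      payment_date DATE          NOT NULL,
      amount       DECIMAL(18,2) NOT NULL
    )
    WITH (DISTRIBUTION = HASH(policy_id));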

A normalised DW sometimes contains the “old versions” of master data. It keeps the old versions using the “audit table technique” described above. No, it does not use the SCD techniques used in a dimensional warehouse.

Reasons for Normalising a Data Warehouse

The main reason for normalising a data warehouse is so that there is no data redundancy in the data warehouse. This has multiple benefits. The obvious one is that there is only 1 place to update when the data changes. This means that the ETL/data load is easier and more efficient (faster).

The second reason is to keep the data warehouse real time. Because there is only 1 place to update (or insert), the update would be quick and efficient.

Third, to manage master data. The idea is that rather than having MDM as a separate system, the master tables in the normalised warehouse become the master store. This is because the normalised warehouse is updated by all source systems. The normalised warehouse broker table (in the case of an insurance warehouse) becomes the master broker table. The normalised warehouse airport table (in the case of a travel data warehouse) becomes the master airport table. The customer table in the normalised warehouse becomes the master customer table: the source of all customer data in the enterprise. It is a popular choice to place the CDI (Customer Data Integration, i.e. MDM for customers) for a CRM project in a normalised DW instead of a dimensional DW.

Fourth, to enable the enterprise to maintain consistency between multiple dimensional data marts (please refer to chapter 2 of my book). Having a normalised DW, you can create a dimensional DM for any business area, for a specified time period, or for a particular range of products/customers/geography. For businesses where data is a sellable asset, this capability has real business appeal, e.g. you can take different cuts of your data and price each cut differently.
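
As a small illustration of this point, a dimension for a data mart can be assembled from several normalised tables. The sketch below assumes hypothetical dw.customer, dw.customer_type and dw.region tables and builds a customer dimension as a view (in practice it could equally be a physical table populated by ETL, or a materialised view):

    -- A customer dimension carved out of the normalised warehouse by
    -- joining the customer master table to its code-decode tables.
    CREATE VIEW mart.dim_customer AS
    SELECT c.customer_id,
           c.customer_name,
           ct.customer_type_desc,
           r.region_name
    FROM   dw.customer      AS c
    JOIN   dw.customer_type AS ct ON ct.customer_type_code = c.customer_type_code
    JOIN   dw.region        AS r  ON r.region_id = c.region_id;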

Fifth, to make data integration easier. If you have integrated multiple business systems, you will appreciate that the core of the work is in the mapping, i.e. deciding where each column from each system should be loaded in the target system. Because in a normalised DW each piece of data is located in only one place, it is easier to update the target. Secondly, because the source systems are usually normalised, it is easier to map them to a normalised DW: both the source and the target are normalised, so there are more similarities than with a dimensional DW. In a normalised DW the data is located in different tables, not combined into 1 big dimension table such as a customer dim or product dim.

Sixth, to save development effort. This is an argument that MPP people usually make. Why have a 1-year DW project to create a dimensional model of your data, create the ETL to populate it, and update it every day? Why don’t we just copy the structure of the source business system into the MPP database system so we can query it 1000 times faster? Or, if we are not copying it, at least we can have a very similar structure/data model so the data load is very simple, as the tables in the normalised DW map almost one to one to the source system.

So there you are. Those are the “what” and the “why” of the normalised DW. I hope this article is useful for those who have never seen a normalised DW, at least giving them an idea. As usual I may have made mistakes, you may have different opinions, etc., so I welcome any suggestions, discussions and corrections at vrainardi@gmail.com.

Vincent Rainardi, 8/12/2010

PS. I wrote this about 2 months ago, but didn’t have a chance to complete it. I wrote the last 3 paragraphs today.

8 Comments »

  1. […] 8/12/2010: I wrote the 2nd version of Normalised Data Warehouse here. […]

    Pingback by Normalised Data Warehouse « Data Warehousing and Business Intelligence — 9 December 2010 @ 12:00 am | Reply

  2. […] Answer: if you are looking for a junior role e.g. developer, then the expected answer is: in Kimball we do dimensional modelling, i.e. fact and dim tables, whereas in Inmon we do CIF, i.e. EDW in normalised form and we then create a DM/DDS from the EDW. Junior candidates usually prefer Kimball, because of query performance and flexibility, or because that’s the only one they know; which is fine. But if you are interviewing for a senior role e.g. senior data architect then they need to say that the approach depends on the situation. Both Kimball & Inmon’s approaches have advantages and disadvantages. I explained some of the main reasons for having a normalised DW here. […]

    Pingback by Data Warehousing Interview Questions « Data Warehousing and Business Intelligence — 11 December 2010 @ 10:59 pm | Reply

  3. […] Answer: if you are looking for a junior role e.g. developer, then the expected answer is: in Kimball we do dimensional modelling, i.e. fact and dim tables, whereas in Inmon we do CIF, i.e. EDW in normalised form and we then create a DM/DDS from the EDW. Junior candidates usually prefer Kimball, because of query performance and flexibility, or because that’s the only one they know; which is fine. But if you are interviewing for a senior role e.g. senior data architect then they need to say that the approach depends on the situation. Both Kimball & Inmon’s approaches have advantages and disadvantages. Some of the main reasons for having a normalised DW can be found here. […]

    Pingback by Data Warehousing Interview Questions « Data Warehousing and Business Intelligence — 27 December 2010 @ 12:16 pm | Reply

  4. Dear Vincent,
    I found your article very interesting. As it happens, I’m a very experienced (30 years) database and data warehouse architect and I have a preference for the Kimball approach, so as you might expect I disagree with your position.
    Let me comment on some of your points (reasons for normalising a data warehouse):
    First point (no redundancy): in the Kimball approach the redundancy is only in the denormalized attributes of the dimensions, which doesn’t create any significant redundancy problems. And in a normalized data warehouse, you still need to load the data marts, which contain redundant data, so the loading process is not faster.

    Second (real time): same as above, you still need to update the second level (the data marts), which is where most queries are run. Also, real time means a lot more than that and is complicated to implement even on a normalized dw (i.e., how are you going to guarantee consistency of the data while loading in real time?). Finally, there are well-defined techniques to implement a real-time dw using the Kimball approach (described in detail in Kimball’s books).

    Third (master data): conformed dimensions in the Kimball model are in fact a very solid source of master data. If they are well implemented, they are probably more complete and robust than in a normalized approach.

    Fourth: “maintain consistency between multiple dimensional data marts”. Please excuse me, but this really raises some serious doubts about whether you have ever read (or understood) how Dimensional Bus Data Warehouses work. Consistency between data marts is always guaranteed if you follow the right design guidelines. What you are describing here is a stovepipe implementation, which has nothing to do with the Dimensional Bus DW. (By the way, I had the same impression reading several comments by B. Inmon on the dimensional approach.)

    Fifth: Data Integration. Again, if conformed dimensions are implemented the right way, they are a very powerful tool for data integration. Using a dimensional bus approach I have successfully fixed situations where a normalized approach had failed to deliver data integration.

    Sixth: to save development effort. This is also a good one! There is at least one serious market study that shows how the normalized approach takes significantly more time and effort compared to the dimensional approach. You didn’t mention the fact that a second “data mart” level is ALWAYS needed in order to be able to do some useful queries on the dw, and that a data warehouse usually takes data from multiple sources, so you can’t just “copy the structure of the source business system” as you suggest.

    I hope my comments didn’t upset you; I know there is a lot of passion and there are strong opinions about data warehouse architectures. My comments don’t have any ambition to be complete (they were written in less than 15 minutes), but of course I’ll be happy to write a more articulate reply if you want to discuss any particular aspect in detail.
    With regards,
    Andrea Vincenzi

    Comment by Andrea Vincenzi — 29 April 2011 @ 6:04 pm | Reply

    • Hi Andrea, totally agree with all your points. No, your comments didn’t upset me; on the contrary, I’m very grateful for them. Your vast experience enriches the discussion. Like you, I prefer the Kimball approach too. When I wrote that article I was trying to see it from the other side (at the moment I’m fascinated with the 3rd approach, Data Vault, and will write about it).
      Yes you are right, usually a dim model needs to be built on top of the normalised model (although we don’t always need to build a dim model).
      Yes you are right, multiple data marts in a Kimball warehouse are consistent with each other (conformed dims).
      Yes you are right, we can do real time DW in Kimball’s and it is documented in their ETL Toolkit book.
      Thank you again for your comments.
      Vincent

      Comment by Vincent Rainardi — 6 May 2011 @ 1:32 am | Reply

      • Hi.
        I would like to respond to Andrea’s comments.

        First Point:
        Star schema is a deformalized design approach and therefore redundancy must be there, and it is there. Sometimes there is a need for redundant stars, which are sometimes called type-specific stars.
        I have designed virtual data marts (if there is a need), which are a bunch of views or materialized views. It solves the data load problem. This is a preferred approach on MMP platforms.

        Second Point:
        There is no such thing as a real time DW. I think the main point here is “near” real time.

        Third point:
        Dimensions cannot be a reliable source for master data. First of all, that’s not the role of a star schema in general (it is for analysis and query performance only). Secondly, master data is not just a list of attributes. It includes hierarchies (which cannot be handled well in a dimensional design), relationships (sometimes a lot of them, which might not be of interest to the DW and will complicate the design if implemented) and the other management artifacts around the de-duplication and matching effort that MDM (master data) requires.
        MDM is not DW.

        Fourth:
        I agree with Andrea on this one.

        Fifth:
        It is a lot easier to integrate if the data is non-redundant and functional dependencies are not violated.

        Sixth:
        A normalized DW is not a copy of the structure of the source system, but a set of structures that enables you to integrate data across various data sources or subject areas. And this type of schema easily handles changes. It is not easy to handle structural changes with a dimensional design; one example would be frequent hierarchical changes. Try to do that with a Type 2 change in a dimensional design.
        I think the dimensional design should be implemented as a logical layer, because it is easy to understand and analyze the data this way. But a physical implementation of it would be my last choice.

        Comment by Joel Mamedov — 23 March 2012 @ 3:46 am

      • Thanks for your comments Joel. I think you meant “denormalised” and “MPP” on your first point?

        Comment by Vincent Rainardi — 23 March 2012 @ 7:17 am

  5. Hi Vincent.
    Thank you for the correction.

    Comment by Joel Mamedov — 23 March 2012 @ 2:45 pm | Reply

