Data Warehousing and Business Intelligence

26 February 2011

Normalising a Fact Table

Filed under: Data Warehousing — Vincent Rainardi @ 7:20 am
Tags:

To normalise a fact table means we make it slim and tall. We do that by introducing a new dimension called “measure type” into the fact table, and move all measures into just one measure column: amount. The meaning of this amount column depends on the value of the measure type key.

As always, it is easier to ”learn by example”.

Rather than creating a fact table like this: (1)
Dim1_Key, Dim2_Key, Dim3_Key, Measure1, Measure2, Measure3
e.g. Date_Key, Customer_Key, Product_Key, Revenue1, Revenue2, Revenue3

Why don’t we create it like this: (2)
Dim1_Key, Dim2_Key, Dim3_Key, Measure_Type_Key, Measure_Amount
e.g. Date_Key, Customer_Key, Product_Key, Revenue_Type_Key, Revenue_Amount

The usual argument is that “if we have a new measure (e.g. Revenue4), we don’t have to alter the table structure”.

And the usual situation is something like this: “we have 30 different measures/revenues in this company and it is very likely that in the next 3 months we will have another one”.

So, there we go. That’s the problem presented. The question now is, would you go for (1) or (2)? What do you think?

…… (your opinion here. I really like to hear it. Please let me know at vrainardi@gmail.com)

I would go for (1), even if I have to create a new measure every 3 months. Unless it is a GL mart.

Why?

  1. Additive. One of the sacred fundamentals in data warehousing (is there such a thing?) is that a measure is additive.  Because it is additive, we can sum it up. This sounds very simple, but it is very powerful. It is fundamental to the analytics ability of a data warehouse. Most of the BI tools are utilising this “additive characteristic”.
    If we go with (2) we will lose this analytic ability. If one measure is not compatible with the other measures, we can’t add them up. In the case of all of them being revenues, and all of them being daily, and all of them being applicable per product and per customer, and all of them being in the same currency, and all of them being in the same units (say in thousands) then yes the measures are additive to each other. But once we introduce other measure such as costs or tax or non monetary measure into the fact table, we may not be able to add them up together.
  2. One Pass Query.
    “Not really”, you might say. All you have to do is to add “where revenue_type = 1” and you can now sum up the Revenue1. If you want Revenue2, just put “where revenue_type = 2” in your query and voila! you got Revenue2. Yes, true that, but … if you want to get “revenue1 minus revenue2”, we will have to do “2 passes” on the table. Essentially you need to run 2 SQLs. The first one is to sum up Revenue1 and t he second is to sum up Revenue2. Then you substract one from the other. That’s what “2 passes” means. Why is it a problem if it is 2 passes, or 3 passes or 4 passes? Slow, that is the problem. The more number of passes our SQL need to run to produce a report, the slower the report.
  3. On the BI tool, if you don’t put the measure type filter, you are risking the user getting “rubbish information”. From usability point of view, this design is a rubbish. A data warehouse or a data mart is designed primarily to analyse information. We do not store data for the sake of storing it. When we store data we arrange the data in the format that is easiest to be analysed. The user will need to set the measure type dimension every single time. On every single BI query. When you see this in practice you will understand why the users would go mad. Particularly if they are analysing the data using OLAP cubes.
  4. Flexibility. One design principle of data warehousing is flexibility. We want to be able to add things later on. With regards to the fact table, Kimball group clearly say that we need to group the measures based on their grain. When we have a new measure, as long as the measure has the same grain, it can go to the same fact table. If we normalise the table, we may not be able to add another measure (e.g. count), even if it has the same grain. Because if we add a count measure, we lost the additivity. This way, we will end up with many more fact tables. Because the measures are not only grouped by their grain, but also by their data characteristic, e.g. all monetary values goes to fact table 1, all non monetary amount (such as quantity) goes to fact table 2, and all count measures goes to fact table 3. Not a good design. It is not flexible, and we end up with more number of fact tables.

Unless It Is a GL Mart

A General Ledger is a collection of accounts. In a large company there are thousands of accounts. Usually there are 5 main types of accounts: asset accounts, liabilities accounts, capital accounts, revenue accounts and expense accounts. Each account has a balance, at all times. The balances change every day. Transactions (or journal items) change these balances. Imagine that you are viewing this a list of accounts, with their balances. That’s more or less what is known as a “GL mart”. A large company has more than 1 ledgers (more than 1 lists), for instance, 1 for each subsidiary company.

The fact table is something like this:

Period_Key, Account_Type_Key, Account_Key, Ledger_Key, Balance

The period is usually a calendar month. But in some companies 1 period is 4 calendar weeks. The account type is usually the 5 main account types I mentioned earlier, broken down to a very fine grouping of accounts. Usually 3 to 4 levels. Account dimension is usually a parent child dimension. Meaning that for each account, there is a parent account.

Usually there are other fact tables in a GL mart. For example, transaction fact table. The transaction fact table contains the detail of every transaction. The idea is, when you are analysing data using a BI tool and select an account, you can drill down to the transaction level.

OK, enough with the intro. That’s GL mart. Now the key point: in GL mart, there is only 1 measure: Balance. The value of this measure depends on the account key. It is therefore natural, that in the GL fact table we have only 1 measure, and the meaning of that measure depends on the account key column.

That is different from “normalising a fact table” that we discussed in this post. In this post, the fact table has many measure columns. And we tried to make it 1 measure column. Which is unnatural.

As usual I welcome comments and discussion at vrainardi@gmail.com. Vincent 26/2/11.

Advertisements

1 Comment »

  1. I agree , from the reporting point of view, if you are going to use a reporting tool like http://www.widestage.com, you can´t use the second approach, to get all your measures in the same chart or table. Generally speaking the most of the reporting tools works on that way, if you have to perform more than one query to get several measures, then the reporting tool have to join the results in some way, and this is sometimes difficult to do and have an impact in the report performance

    Comment by Hermes Romero — 6 April 2016 @ 5:23 pm | Reply


RSS feed for comments on this post. TrackBack URI

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Blog at WordPress.com.

%d bloggers like this: