Data Warehousing and Business Intelligence

26 February 2011

Denormalising a Fact Table

Filed under: Data Warehousing — Vincent Rainardi @ 9:00 am
Tags:

To denormalise a fact table means we expand a dimensional key column, to include the attributes in the dimension. This practice usually happens in a small, simple data mart project, where one person is doing everything.

As usual, it is easier to “learn by example”:

The usual practice in dimensional modelling is to design a fact table like this:
Dim1Key, Dim2Key, Measure1, Measure2, Measure3 (1)
e.g. DateKey, EventKey, Revenue, Cost, Visitors (an exhibition organiser company)

Rather than doing that, should the fact table be like this?
Dim1Key, Dim2Attributes, Measure1, Measure2, Measure3 (2)
e.g. Date, EventName, EventType, EventLocation, Revenue, Cost, Visitors

The main consideration for (2) is to save the ETL effort. By not creating the dimension tables and have everything in the fact table, it is simpler and quicker to build.

There are several disadvantages to approach (2):

  1. There might be other event attributes introduced in the future. And those new attributes will need to be put into the fact table as well.
  2. We lost the date attributes such as year and month. They will need to be put into the fact table as well then, making the fact table very wide.
  3. It is difficult to “browse the events”. We would need to do a select distinct on the fact table to provide a list of events. The same with other dimensions.
  4. Some dimensions such as customer and product can have many attributes, making the fact table even wider.
  5. If the fact table is big, say 1 billion rows, aggregating a measure by an attribute is slower than if we put the attributes in a small dimension table.
  6. We would need to index the fact table on these attributes. The indices will be a lot bigger than if the attributes are in a dimension table. They will take more effort to maintain.

Because of these reasons better not to denormalise a fact table.

In practice, at high level, everybody agree with this principle. But not everybody agree with the details of this principle. I call this issue “one or two dimensions”.

Again it is easier to learn by example:
In the above example, should event location be a separate dimension?

Many people I talk to agree that location should be a separate dimension. It should have its own dimension key, and it would have its own attributes such as city, post code and nearest tube station.

Generally, if the attribute has many other attributes, it should be put into a separate dimension. But, there is a disadvantage: we lost the direct link between location and event. We can only link them via the fact table. Which is why some designer “link the 2 dimension via a back door”. Read here for details, case b.

Now a little bit more vague: should the event type be in a separate dimension?

This time the response I get is about half-half. Some people think that event type should be a separate dimension. But some people think it should stay in the event dimension. Opinions will always vary (naturally), but the most important thing for us is to understand the reasons why they think so.

The main reason for keeping event type in the event dimension is because it has no attributes. It only consist of 1 attribute, i.e. itself. It may have code and description, but from business point of view it is only 1 attribute.

Whereas the main reason for separating event type into its own dimension is because some people (especially in the exhibition industry) think that it has its own attributes, i.e. it consist of 3 levels. Level 1: show or exhibition. Level 2: if it’s a show: music or art, and if it’s an exhibition: commercial or art. Level 3: if it’s a commercial exhibition: wedding, home, or books. If it’s an art exhibition: painting, contemporary, or installation.

So the decision whether an attribute should be put into its own dimension or not depends whether it has its own attribute or not. Again, I must emphasise the main disadvantage of separating into its own dimension: we loose the direct link. We can only link via a fact table.

But that could be an advantage. It is possible, that we could link via 2 different fact table. And we end up with different results. If that is the nature of the data (there is no fixed relationship), then the attribute must be put into its own dimension. This is the second reason why we separate an attribute (such as event type) into its own dimension.

As usual I welcome any comments and discussion at vrainardi@gmail.com. Vincent 26/2/11.

Advertisements

Leave a Comment »

No comments yet.

RSS feed for comments on this post. TrackBack URI

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Create a free website or blog at WordPress.com.

%d bloggers like this: