Data Warehousing and Machine Learning

21 February 2021

One or Two Fact Tables

Filed under: Data Warehousing — Vincent Rainardi @ 7:34 pm

One of the data I receive from my data sources is like this:

In this case, should we make it into one fact table like this: (one row per company per year, with M1, M2, M3 repeated) – let’s call this option 1

Or should we make it into two fact tables like this: (the first one is 1 row per company and the second one is 1 row per company per year) – let’s call it option 2

Kimball theory says we should do option 2, because the grains are different. A fact table is determined by its grain. If the data is at different grain, then it does not belong in this fact table, it belongs to another fact table. In option 1, M1 to M3 do not belong there, because their grains are different. They belong to another fact table, hence option 2 is the correct way.

So that was what I did when I first came into this this situation. Stick to the book, I said to myself, and you’ll be safe.

But then the same situation came up again, on another data source. I should model it the same way I thought. But this one is a bit “rugged”, notice M4 and M5 below which are only applicable to some years:

And M1A is closely related to M1 (they are used together), M2A is closely related to M2, M3A is closely related to M3, so the two fact table almost always have to be joined together.

Also notice that there are 15 measures which are annual, and only 3 which are not.

So I thought rather than the users having to join those two fact tables every time because the first 3 measures are used together, I think the right thing to do is to put them in one fact table like this:

So I did. And there are two more data sources like that afterwards. I still feel it is the right thing to do despite it breaks the normalisation rules (who’s doing normalisation in data warehousing anyway?) and breaks a Kimball principle on fact tables.

It’s not because there are only 3 measures which are repeated (even one is too many) but because otherwise to use it users will have to join the two fact tables.

What’s wrong with joining two fact tables? Nothing. Joining fact tables was done since the dawn of data warehousing, with shipments and orders as the classic example in 1990s.

Oh well, both options have the good side and bad side, the positives and negatives, and I think we could present good arguments on both options. What do you think?

And that is the title of this article: One or Two Fact Tables.

Vincent Rainardi, 21st Feb 2021 (gosh it’s been a looong time since I last wrote about data warehousing. That’s because I’m currently doing a master degree on machine learning, which is a lot of fun! Can’t have enough of it! Doing Python and math every day 🙂 At first I thought I would be converting to ML. But now that I’ve done quite a lot of it, I’m sure I don’t want to let DWBI and Investment Management go. I will bring ML into my DWBI work in financial sector. And keep BA as my main work, not ML engineer 🙂 It’s a very good feeling to be able to know what we want to do in life!

Leave a Comment »

No comments yet.

RSS feed for comments on this post. TrackBack URI

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

Blog at WordPress.com.

%d bloggers like this: