Data Warehousing, BI and Data Science

16 July 2012

The Main Weakness of Snowflake Schemas

Filed under: Data Warehousing — Vincent Rainardi @ 1:48 am
Tags:

Let’s have a look at this dimensional model:

It’s a snowflake schema. It is the dimensional model for a book distribution company. The company buys books from publishers, stores them in their warehouses and sells them in their shops. “Line” means product line, i.e. book category such as academic books, fiction books, sport books, etc.

There are a few interesting points here.

  1. DimLine, DimAuthor and DimPublisher are the sub dimensions of DimProduct.
  2. DimCity is a sub dimension of DimCustomer.
  3. Each sub dimension has a surrogate key.

Because each sub dimension has a surrogate key, there is additional complexity. If the sub dims are type 2, whenever the author name or publisher name changes, their surrogate key changes, resulting in a new row in DimProduct. Let me demonstrate:

Below are DimProduct, DimLine, DimAuthor and DimPublisher:

Now let’s make 1 change in each sub dimension (SCD type 2):

Each time any of the sub dim changes, in the DimProduct we will have 3 new rows:

As you can see above, we now have 4 versions of Rainfall in DimProduct.

That is the issue of using Snowflake: we can’t do type 2 on sub dims. We can only do type 1, like this:

Whenever I asked a DW practitioner which DW methodology he uses Star or Snowflake, the answers almost always have been Star. But when I dig deeper why not snowflake, nobody brings up this fundamental issue: inability to store the history of attributes. I hope this article would make it clearer to DW practitioners.

Advertisements

8 Comments »

  1. Don’t understand. Did you sell 2 or 8 books? Why is there a new factrecord for every change in The prod dim?

    Comment by Hennie — 16 July 2012 @ 7:01 pm | Reply

    • 2 books Hennie.
      You are right. It doesn’t make sense for the fact table to have a new fact record for every change in DimProduct.
      If the changes in DimProduct happened on the same day, the fact table should only have 1 row with the latest ProductKey.
      So there is no issue with the fact table, it will only have 1 row. I have removed the fact table from the article. Thanks for pointing this out.

      Comment by Vincent Rainardi — 24 July 2012 @ 10:48 pm | Reply

  2. Hi Vincent,

    How come in the sub dimension tables there’s no effective dates (from and to) which normally used in SCD-2?
    I thought by having from/to date when we can only look at when we do a look up to get the dimension key
    Cheers,
    Uzzie

    Comment by Uzzie — 22 July 2012 @ 3:17 pm | Reply

    • Hi Uzzie, because the sub dims are type 2 they should have effective date and expiry date columns (or ValidFrom and ValidTo columns).
      I just didn’t show these date columns in the images as I wasn’t focusing on their values.

      Comment by Vincent Rainardi — 24 July 2012 @ 10:55 pm | Reply

  3. The problem shown simplifies use and and would undermine the snowflakes. Large Data Warehousing uses the same surrogate key to his historical and adds a date into the composite key. Thus the question becomes simpler and more useful than a Star schema.

    Comment by Alberto Costa — 20 February 2014 @ 12:14 pm | Reply

  4. I don’t get your statement: ” we can’t do type 2 on sub dims. ”

    From what I saw is that actually you can right? and that there are 2 options when working with sub-dims (snowflaking):

    *sub-dim scd type 2
    *sub-dim scd type 1

    first option will require more complexity and storage for both parent and child dimensions.

    second option will be simpler but will not keep history of the attributes changes.

    I think that it depends on the business requirement right? Now, my question is, if they want to keep history of their sub-dims, is there a better option/methodology?

    Thanks a lot Vincent!

    Comment by saulcruz87 — 16 July 2014 @ 6:08 pm | Reply

    • Hi saulcruz87, the issue with having SCD type 2 in the sub dim tables is the timing:
      If you look at the 3 new rows on DimProduct above, the values in these 3 columns: LineKey, AuthorKey, PublisherKey are:
      1 1 2
      1 2 2
      2 2 2
      Where is 2 1 1? Where is 2 1 2?
      What if the change on DimPublisher table happens first, before the change in DimLine and DimAuthor?

      The standard methodology to keep the history of sub dim attributes is to include Line, Author and Publisher in DimProduct and make it a star schema. That is my prefered option.

      Comment by Vincent Rainardi — 16 July 2014 @ 7:52 pm | Reply


RSS feed for comments on this post. TrackBack URI

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Blog at WordPress.com.

%d bloggers like this: