Data Warehousing, BI and Data Science

16 July 2012

The Main Weakness of Snowflake Schemas

Filed under: Data Warehousing — Vincent Rainardi @ 1:48 am
Tags:

Let’s have a look at this dimensional model:

It’s a snowflake schema. It is the dimensional model for a book distribution company. The company buys books from publishers, stores them in their warehouses and sells them in their shops. “Line” means product line, i.e. book category such as academic books, fiction books, sport books, etc.

There are a few interesting points here.

  1. DimLine, DimAuthor and DimPublisher are the sub dimensions of DimProduct.
  2. DimCity is a sub dimension of DimCustomer.
  3. Each sub dimension has a surrogate key.

Because each sub dimension has a surrogate key, there is additional complexity. If the sub dims are type 2, whenever the author name or publisher name changes, their surrogate key changes, resulting in a new row in DimProduct. Let me demonstrate:

Below are DimProduct, DimLine, DimAuthor and DimPublisher:

Now let’s make 1 change in each sub dimension (SCD type 2):

Each time any of the sub dim changes, in the DimProduct we will have 3 new rows:

As you can see above, we now have 4 versions of Rainfall in DimProduct.

That is the issue of using Snowflake: we can’t do type 2 on sub dims. We can only do type 1, like this:

Whenever I asked a DW practitioner which DW methodology he uses Star or Snowflake, the answers almost always have been Star. But when I dig deeper why not snowflake, nobody brings up this fundamental issue: inability to store the history of attributes. I hope this article would make it clearer to DW practitioners.

12 Comments »

  1. Don’t understand. Did you sell 2 or 8 books? Why is there a new factrecord for every change in The prod dim?

    Comment by Hennie — 16 July 2012 @ 7:01 pm | Reply

    • 2 books Hennie.
      You are right. It doesn’t make sense for the fact table to have a new fact record for every change in DimProduct.
      If the changes in DimProduct happened on the same day, the fact table should only have 1 row with the latest ProductKey.
      So there is no issue with the fact table, it will only have 1 row. I have removed the fact table from the article. Thanks for pointing this out.

      Comment by Vincent Rainardi — 24 July 2012 @ 10:48 pm | Reply

  2. Hi Vincent,

    How come in the sub dimension tables there’s no effective dates (from and to) which normally used in SCD-2?
    I thought by having from/to date when we can only look at when we do a look up to get the dimension key
    Cheers,
    Uzzie

    Comment by Uzzie — 22 July 2012 @ 3:17 pm | Reply

    • Hi Uzzie, because the sub dims are type 2 they should have effective date and expiry date columns (or ValidFrom and ValidTo columns).
      I just didn’t show these date columns in the images as I wasn’t focusing on their values.

      Comment by Vincent Rainardi — 24 July 2012 @ 10:55 pm | Reply

  3. The problem shown simplifies use and and would undermine the snowflakes. Large Data Warehousing uses the same surrogate key to his historical and adds a date into the composite key. Thus the question becomes simpler and more useful than a Star schema.

    Comment by Alberto Costa — 20 February 2014 @ 12:14 pm | Reply

  4. I don’t get your statement: ” we can’t do type 2 on sub dims. ”

    From what I saw is that actually you can right? and that there are 2 options when working with sub-dims (snowflaking):

    *sub-dim scd type 2
    *sub-dim scd type 1

    first option will require more complexity and storage for both parent and child dimensions.

    second option will be simpler but will not keep history of the attributes changes.

    I think that it depends on the business requirement right? Now, my question is, if they want to keep history of their sub-dims, is there a better option/methodology?

    Thanks a lot Vincent!

    Comment by saulcruz87 — 16 July 2014 @ 6:08 pm | Reply

    • Hi saulcruz87, the issue with having SCD type 2 in the sub dim tables is the timing:
      If you look at the 3 new rows on DimProduct above, the values in these 3 columns: LineKey, AuthorKey, PublisherKey are:
      1 1 2
      1 2 2
      2 2 2
      Where is 2 1 1? Where is 2 1 2?
      What if the change on DimPublisher table happens first, before the change in DimLine and DimAuthor?

      The standard methodology to keep the history of sub dim attributes is to include Line, Author and Publisher in DimProduct and make it a star schema. That is my prefered option.

      Comment by Vincent Rainardi — 16 July 2014 @ 7:52 pm | Reply

  5. it is funny that 5 years later I encountered this post and I’ve been wondering about a particular use case, while I agree with you that in this case including Line, Author and Publisher in DimProduct would make it simpler (star schema), what happens if for example DimAuthor has around 50 attributes, and there are other facts that are analyzed at the Author level, is building a flat DimProduct (including all of these new Author attributes) the best approach? Also, we would like to keep track of the author changes over time (Type 2), what are you thoughts Vincent?

    Comment by saul — 5 July 2019 @ 10:18 pm | Reply

    • Hi Saul, if Dim Author has 50 attributes and the facts need to be analysed on these attributes then I would connect Dim Author directly to the fact table. Not going through Dim Product. Please read my article “One or two dimensions” on this blog that discuss the advantages and disadvantages.

      Comment by Vincent Rainardi — 6 July 2019 @ 3:32 pm | Reply

      • I am experience this now but in my example, dim Author has a M:1 relationship with dim Product, therefore I can’t link dim author to the fact table. What to do in this case when I have 2 SCD

        Comment by Grant — 16 October 2019 @ 11:31 pm

      • Hi Grant, the standard Kimball approach is to remove the AuthorKey column from DimProduct, and create a bridge table connecting DimProduct and DimAuthor. This bridge table consists of 2 columns: ProductKey and AuthorKey.
        In my experience with this approach it is not easy to query the data. It is easier to put several author columns in DimProduct.

        The number of books with 1 author is a lot more than multiple authors. And with multiple authors, it is likely to be less than 5. It is something like: 1 author (90%), 2 authors (8%), 3 authors (1.3%), 4 authors (0.4%), 5 authors (0.2%), more than 5 authors (0.1%). This is because naturally it is difficult to collaborate with more than 1-2 person to co-write a book (being an author myself I know this first hand).

        And if it is 5 or more authors it is usually mentioned as “et al”. How many authors must we write before we can use “et al”? APA style guide mentioned 5 authors. So if it is 2 to 5 authors write them all, but if it is 6 or more, just write the first one and use “et al”. Because of this APA guide it means that in practice it is very unusual to find a book with more than 5 authors, because they will have been written as “et al”.

        So my recommendation is to create 3 author columns plus an “et al” column (Yes/No). If a book has more than 3 authors put a Yes on this “et al” column.

        There is a name for this issue. The author attribute is called a “multi valued attribute”. I’ve written about this issue here: https://dwbi1.wordpress.com/2011/03/13/dimensions-with-multi-valued-attributes/. My recommendation above is approach number 4 in this article. The Kimball’s standard approach of using a bridge table is approach number 3.

        Please read the paragraph begins with “An example of a Multi Valued Dimension is given by Ralph Kimball and Margy Ross…” to understand the difference between a “multi valued attribute” and a “multi valued dimension”. It is important to understand the difference between these 2 similar issues.

        I hope this helps, Vincent

        Comment by Vincent Rainardi — 17 October 2019 @ 5:26 am


RSS feed for comments on this post. TrackBack URI

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

Blog at WordPress.com.

%d bloggers like this: