Data Warehousing and Business Intelligence

11 March 2011

When To Snowflake

Filed under: Data Warehousing — Vincent Rainardi @ 7:22 pm
Tags:

One of the most frequently asked questions in dimensional modelling is “when to snowflake”. Everybody talks about “when not to snowflake”, but “when to snowflake” is very rarely discussed. It is a general consensus in the data warehousing world that we must always use star schema for the presentation layer, but as with everything else in the world, there are always two sides of it. Sometimes more than two.

In this article I will not go through (at all) about “star vs snowflake” arguments. There are many web pages and books explaining this. I’m going to jump straight to “when do we snowflake”.

I found that there are a few situations where we should consider snowflake:

  1. When the sub dimension is used by several dimensions
  2. When the sub dimension is used by both the main dimension and the fact table(s)
  3. To make “base dimension” and “detail dimensions”
  4. To enrich a date attribute

I’ll go through one by one. As usual it is easier to explain & learn by example.

When the sub dimension is used by several dimensions

Example: In insurance data warehouse, the City-Country-Region columns which exist in DimBroker, DimPolicy, DimOffice and DimInsured, could be replaced by LocationKey pointing to DimLocation. Some call it DimGeography. This gives us a consistent hierarchy, i.e. relationship between City, Country & Region.

Whilst the advantage of this approach is consistency, the weakness of this approach is that we would lose flexibility. For example, in DimOffice, we could have 3 hubs (EMEA, America and Asia Pacific), not 7 regions used by DimBroker, DimPolicy & DimInsured (North America, South America, Africa, Europe, Middle East, Asia, Australia). It is a common understanding that the relationship between City and Country are more or less fixed, but the grouping of countries might be different between dimensions. If we put City, Country & Region in each of the 4 dimensions, we have the flexibility of each dimension having different hierarchy. In the DimOffice case we above have 2 options: a) put the City, Country, Region in DimOffice, or b) have 2 different attributes in DimLocation: Region and Hub. Some people go for c) create DimOfficeLocation, but I don’t see the point if this dim is only used in DimOffice, might as well unite them with DimOffice as per approach a).

Other examples of this case are: a) DimManufacturer and DimPackaging, used by several product dimensions in a manufacturing data mart, b) DimEmployee in Project data mart, and c) DimCustomer in a CRM data warehouse, d) DimBranch (some call it DimOffice) in Retail Banking warehouse. I’ve heard a few discussion about DimAddress, which like DimLocation is used by DimOffice, DimCustomer, DimSupplier etc (Retail mart), but usually the conclusion was it’s better to put address attributes (street, post code, etc) directly in the main dimension.

Sometimes the designer use hybrid approach, i.e. they put the attributes both in the main dim and in the sub dim. Again this is for flexibility reason, particularly if the sub dim is used directly by several fact tables (see below).

When the sub dimension is used by both the main dimension and the fact table(s)

A classic example of this situation is DimCustomer, when an account can only belong to 1 customer. DimCustomer is used in DimAccount, and is also used by the fact tables. Other examples are: DimManufacturer, DimBranch, DimProductGroup. In the case of DimProductGroup, some fact tables are at product level, but some fact tables are at Product Group level. Hence we need both DimProduct and DimProductGroup. The option here is a) put the product group attribute in both DimProduct and DimProductGroup, or b) snowflake, i.e. DimProduct doesn’t have Product Group attributes; it only has ProductGroupKey.

To Make A Base Dimension and Detail Dimensions

The classic examples of this case are: Insurance classes or LOB (line of business), Retail Banking account types, Attributes for different Product Lines, and in Pharma we have Disease/Medicine categories.

Insurance policies from insurance classes or LOB (e.g. marine, aviation, motor, property) have different attributes. So we pull the common attributes into 1 dimension called DimBasePolicy and the class-specific attributes into DimMarinePolicy, DimMotorPolicy, etc. LOB is a US market term, whereas Class is a Lloyd’s term (London market). Similarly, in retail banking we have DimBaseAccount, DimSavingAccount, DimMortgageAccount, DimCurrentAccount, etc. In investment banking different asset classes have different attributes.

The alternatives of this design are: a) have one generic detail dimension, with 100 attributes from different categories, b) a normalised version with 4 columns or so.

Approach a) would be very wide and sparse because (for example) Marine rows only use Marine attributes, etc. But it is easier to use than the “base-detail” approach. Approach b) would be in-query-able (is that a word?) and difficult to use/join.

To Enrich a Date Attribute

A date attribute (e.g. 2011-03-11) is often analysed by Month, Quarter or Year so if we have MaturityDate, EffectiveDate and CancellationDate in the dimension and the business needs to analyse each of them by Week, Month, Quarter, Year then we would need to have 12 attributes in the dim for this.

By replacing the date attribute with the date key, we could analyse the date by any attributes in the date dim. In this case I’d recommend to use a smart integer date key rather than a surrogate date key, e.g. 20110311 rather than 12345. It is more user-friendly, it would still have meaning if you don’t link to DimDate. The smart date key is a generally accepted exception to the Kimball rule of surrogate key.

This will be one of the topics I’m presenting at SQLBits 8. As usual I welcome any comments and discussion at vrainardi@gmail.com. Vincent 11/3/11

——————————————————————————————————————————————————————————

Reply to Jacob’s 2nd comment, 25 September 2013 @ 4:39 pm (I put it here as I can’t put pictures in the comment section).

Hi Jacob, what I meant by “By putting geographical attributes in the individual dimension, we have more flexibility” is:

Scenario 1: Put geography attributes in the 3 dimensions

Scenario 1 Put geography attributes in the 3 dimensions

Scenario 2: Scenario 2: Put geography attributes in DimLocation

Scenario 2 Put geography attributes in DimLocation

In scenario 1, we have the flexibility of grouping the countries in 3 different ways, i.e. in DimInsured Thailand is in APAC ex Japan region, in DimBroker Thailand is in Asia Pacific region and in DimOffice Thailand is in Asia region.

In scenario 2, we don’t have that flexibility. Thailand is always in Asia Pacific region. This is good for consistency of regional grouping across the company.

If we want to create this flexibility, we can create 3 columns in DimLocation: InsuredRegion, BrokerRegion and OfficeRegion. Let’s call this scenario 3. In scenario 3, Insured, Region and Office can have different regions. The advantage of scenario 3 over scenario 1 is that the relationship between city and country is centralised in one place. Whereas in scenario 1 the relationship between city and country are located in 3 different places.

The disadvantage of scenario 3 is: if the relationship attributes is not fixed like city-country, but can change over time. In this example below, we have Territory, Region and Division. The DimLocation looks like this:

Territory

We have 3 types of Territory (Supply Territory, Sales Territory, Service Territory), 3 types of Region (Supply Region, Sales Region, Service Region) and 2 types of Division (Sales Division, Service Division).

In this scenario, it is better to put the City, Territory, Country, Region and Division in DimSupplier, DimSalesperson and DimService.

A hub is a grouping of region, i.e.

Hub

A SubDim (sub dimension) is a table attached to a main dimension. Please see my SQLBit Brighton presentation about dimensional modelling here, about “1 or 2 dimensions” (page 5 to 8) and “when to snowflake” (page 9 to 11). You are right that DimLocation can be a subdim (attached to DimBroker), or a main dim (attached to a fact table).

Hope this helps Jacob,
Vincent

Advertisements

14 Comments »

  1. […] 原文链接:https://dwbi1.wordpress.com/2011/03/11/when-to-snowflake/ […]

    Pingback by 什么时候该用雪花模型? | 数据仓库 — 15 March 2011 @ 6:10 am | Reply

  2. […] 之前翻译了Vincent的一篇《When To Snowflake》,后来通过email与他结识,有幸参与了讲稿的review,不过说实话我自己的维度建模经验都非常少,只是提了一些不痛不痒的问题和评论,少部分被采纳,我是很高兴的。与Vincent的交流很愉快,他是个很真诚且乐于分享的人,无论是他的blog还是这次演讲,我想我都能学到很多,我想并希望我们能继续保持联系:-) […]

    Pingback by SQLBits 8 Brighton: 高级维度建模 | 数据仓库 — 8 April 2011 @ 8:35 am | Reply

  3. Any low cardinality attributes in a monster dimension (>million rows) should be snowflaked for the purpose of storage saving and performance.

    Comment by hang Liu — 10 April 2011 @ 1:44 am | Reply

  4. […] Rainardi (blog) has written an excellent post on When to Snowflake your dimensions in the data warehouse side, which would be a good read prior to this post. If your […]

    Pingback by When and How to Snowflake Dimension Sources : SSAS Design Part 1 - Some Random Thoughts — 3 June 2011 @ 8:38 am | Reply

  5. I really liked this article. It is very valuable. But i found it a bit ambiguous. Some good tabular examples would help. I also don’t recall this being covered in your book.

    For example this was a very complex sentence: “Whilst the advantage of this approach is consistency, the weakness of this approach is that we would lose flexibility. For example, in DimOffice, we could have 3 hubs (EMEA, America and Asia Pacific), not 7 regions used by DimBroker, DimPolicy & DimInsured (North America, South America, Africa, Europe, Middle East, Asia, Australia).” First, i didn’t understand what you meant by 3 hubs. I’m clueless. Second, i didn’t understand what you meant in the not statement–“not 7 regions….” I have no clue.

    “In the DimOffice case we above have 2 options: a) put the City, Country, Region in DimOffice, or b) have 2 different attributes in DimLocation: Region and Hub.” Huh? Its not just the hub part that loses me. Its the whole point of the b option.

    “Some people go for c) create DimOfficeLocation, but I don’t see the point if this dim is only used in DimOffice, might as well unite them with DimOffice as per approach a).” Gain, I don’t follow you. And what is the difference between DimOffice and DimOfficeLocation that you prefer a) over DimOfficeLocation?

    I don’t understand what you mean by a “sub dim.” You don’t define the term. If DimBroker, DimPolicy, DimOffice and DimInsured share attributes City, Country, Region, then pulling them out would make this Location Dimension the parent dimension. (For each location there will be many offices, brokers, and policies.) So sub dim to me is confusing especially since you don’t define it. It would be helpful for you to explain what exists BEFORE and what exists AFTERWARDS so people can be sure they came to the right conclusion. The use of sub dim can also (due to a lack of definition) pre-suppose a snowflake schema. if you are talking about why a snowflake schema should exist by using an example of it already existing, that can be frustrating.

    “The alternatives of this design are: a) have one generic detail dimension, with 100 attributes from different categories, b) a normalised version with 4 columns or so.” Where did you get 4 columns from?

    “Approach a) would be very wide and sparse because (for example) Marine rows only use Marine attributes, etc. But it is easier to use than the “base-detail” approach. Approach b) would be in-query-able (is that a word?) and difficult to use/join.” I don’t know why it would be in-query-able. Also what is “base-detail”?

    Sorry to sound so critical. Its a great article. But it is also frustrating because it doesn’t ensure that the user can go from one point to the next. That at least causes me to make a lot of assumptions that I’m not clear about.

    I am able to make some sense of this article. I’m just not sure I’ve come to the right conclusions. And I know from experience what is extremely clear when it is not spelled out can be completely wrong.

    Comment by jacob — 23 September 2013 @ 7:42 pm | Reply

    • Hi Jacob, thanks for your reply. DimBroker is the intermediary party from whom we get the business. (we as in an general insurance company). In DimBroker, we have City, Country and Region columns, which is the location of the broker. DimOffice contains the location of our branches. DimOffice also have City, Country and Region columns. DimInsured is the company or person which we are covering in the policy, in other word, our clients. DimInsured also have City, Country and Region columns. In this example, the region columns contains these 7 values: North America, South America, Africa, Europe, Middle East, Asia, Australia. The flexibility of putting geographical attributes in each of the 3 dimensions (DimInsured, DimOffice, DimBroker) is that we can customise these geographical attribute for each of the 3 dimensions. For example, in DimOffice we could add a column called Hub, which contains 3 values: EMEA, Americas and APAC. We only add “Hub” column in the DimOffice, and not in DimBroker or DimInsured, because Hub (in this example) is only applicable to DimOffice.

      We also have the flexibility to make the values of the Region columns different between the 3 dimensions. For example, in DimInsured, the values of Region could be: North America, South America, Africa, Europe, Middle East, Asia, Australia; whereas in DimOffice the value of Region are Americas, Africa, Europe & Middle East, APAC ex Japan, Japan. This is because our branches/offices are organised differently to the client locations.

      So, my point was, by putting geographical attributes in the individual dimension, we have more flexibility.
      If, on the other hand, we centralised the geographical attributes in one dimension called DimLocation, and connect it in snowflake fashion to 3 dimensions (DimInsured, DimBroker and DimOffice). In this arrangement, the 3 dimensions will have to have the same geographical attributes and values. We don’t have the flexibility to differentiate the values of these geographical attributes between the 3 dimensions.

      By SubDim (sub dimension) I mean a dimension like DimLocation above, which is “connected” to other dimensions.

      In general insurance (aka Property and Casualty business), we have about 30 different classes (aka “book”), i.e. Marine (Hull and Cargo), Equine, Aviation, Space (Satellite), Professional Indemnity, Specie, Accident and Health, Motor, Renewable Energy, War, Terrorism, Financial Liability. In an insurance Data Warehouse, the core issue is DimRisk, i.e. storing the risk attributes for these different classes.

      By “the 4 columns”, what I meant was a table which has the following 4 columns: Key, Class, Attribute, Value. This way, if we want to add a new attribute, we don’t have to change the structure of the table.

      By “one generic dimension with 100 columns”, what I meant was: a dimension table which contains the attributes from all classes. In this dimension table (called DimSubRisk, or DimSubPolicy) each attribute is put as a column.

      Kind regards,
      Vincent

      VR, 29/9/2013: apologies I’m using a terminology which is not standard (SubDim). The standard Kimball term for SubDim is Outrigger. This word is probably more understandable by more people. That’s should be the word/term I used in the first place. For those who are not familiar with the concepts of snowflake, outriggers and bridges, I would recommend reading Margy Ross writing here: http://www.kimballgroup.com/2008/09/03/design-tip-105-snowflakes-outriggers-and-bridges/. And we shouldn’t talk about Outrigger without talking about Slowly Changing Dimension type 4, 5, 6 and 7! as Margy Ross wrote here: http://www.kimballgroup.com/2013/02/05/design-tip-152-slowly-changing-dimension-types-0-4-5-6-7/. The Kimball Group included this SCD type 4,5,6,7 in their third edition of Data Warehouse Toolkit.

      Comment by Vincent Rainardi — 24 September 2013 @ 6:14 pm | Reply

      • Thanks for the effort. I’m going to have to re-read the article in order to reply to this. There are something I understand and some things I’m not sure about yet. I still don’t get this point: “So, my point was, by putting geographical attributes in the individual dimension, we have more flexibility.
        If, on the other hand, we centralised the geographical attributes in one dimension called DimLocation, and connect it in snowflake fashion to 3 dimensions (DimInsured, DimBroker and DimOffice). In this arrangement, the 3 dimensions will have to have the same geographical attributes and values. We don’t have the flexibility to differentiate the values of these geographical attributes between the 3 dimensions.”

        Also i still don’t understand what the hub is and its relationship. Perhaps you forgot to explain that.

        And I still don’t understand subdim DimLocation. What makes it a subdim? In my mind DimLocation can be placed in 3 different spots in a data mart. DimLocation can be attached directly to the Fact table with no other relationships. It can be attached to the DimInsured, DimBroker, and DimOffice in a type of third normal form. It can also be a set of columns/attributes within DimInsured, DimBroker, and DimOffice. What is the before condition? And what is the after condition? WHen you say subdim, i really don’t know what you mean.

        thanks!

        Comment by jacob — 25 September 2013 @ 4:39 pm

      • Hi Jacob, my reply is above in the body of the article (as I can’t put pictures on the reply section)

        Comment by Vincent Rainardi — 25 September 2013 @ 8:16 pm

  6. Here is what i think I got
    When the sub dimension is used by several dimensions
    – This is a situation where you see several dimension repeat the same columns (City, Country, Region) You could pull that info out into a separate parent dimension.

    When the sub dimension is used by both the main dimension and the fact table(s)
    – This title itself makes no sense to me, but what helped me was the Product and Product Group example. This is a situation, as i understand it, where both levels of the hierarchy is used in fact tables. Like buildings and rooms. In some Fact tables Room is the necessary grain. But in others, a person, for example, is not concerned with the rooms but the overall structure or building.

    To Make A Base Dimension and Detail Dimensions
    – This seems like a Supertype/Subtype situation.

    To Enrich a Date Attribute
    – Couldn’t follow well enough

    Comment by jacob — 23 September 2013 @ 7:58 pm | Reply

    • Hi Jacob, thanks again for your comment. Your understanding about the first 3 points is correct.

      The last one (To Enrich a Date Attribute), consider the Policy dimension, which has many dates in it: the date the policy was underwritten, the date the policy started to become active, the date the policy is due for renewal (which could be different from the policy expiry date). Another example is Claim dimension, which also have many date columns in it, i.e. date the accident happened, date the accident was reported to us, date the police was notified (in practice they almost always happen on the same day, but not always), the date estimate was given, the date the repair was done, etc. Generally, any “process or flow dimensions” have many date columns. In the Policy dimension, rather than creating Underwriting Date, Underwriting Month, Underwriting Year, Activation Date, Activation Month, Activation Year, Renewal Date, Renewal Month, Renewal Year, Expiry Date, Expiry Month, and Expiry Year; we should only create Underwriting Date Key, Activation Date Key, Renewal Date Key, and Expiry Date Key, which point to the Date Dimension. Yes this is snowflaking, but for a good reason, because unlike the geographical attributes (city-country-region in the example above), the date attributes are always the same across any dimensions.

      Hope this helps, Vincent

      Comment by Vincent Rainardi — 24 September 2013 @ 7:03 pm | Reply

      • This was very helpful. Thank you also for the confirmation.

        Comment by jacob — 25 September 2013 @ 4:16 pm

  7. Thanks for a nicely written blog post. I do like your positive list approch to snowflaking, but would like to point out a missing condition to them.

    before you should put information in the same flake, the must be within the same information domain.

    Guldmann

    Comment by guldmann — 15 June 2015 @ 10:58 am | Reply

    • Apologies I don’t understand what you mean, Guldmann. Could you give us some examples?

      Comment by Vincent Rainardi — 15 June 2015 @ 6:20 pm | Reply

  8. […] Rainardi (blog) has written an excellent post on When to Snowflake your dimensions in the data warehouse side, which would be a good read prior to this post. If your […]

    Pingback by When and How to Snowflake Dimension Sources : SSAS Design Part 1 — 29 December 2015 @ 4:22 pm | Reply


RSS feed for comments on this post. TrackBack URI

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Create a free website or blog at WordPress.com.

%d bloggers like this: