One of the most frequently asked questions in dimensional modelling is “when to snowflake”. Everybody talks about “when not to snowflake”, but “when to snowflake” is very rarely discussed. It is a general consensus in the data warehousing world that we must always use star schema for the presentation layer, but as with everything else in the world, there are always two sides of it. Sometimes more than two.
In this article I will not go through (at all) about “star vs snowflake” arguments. There are many web pages and books explaining this. I’m going to jump straight to “when do we snowflake”.
I found that there are a few situations where we should consider snowflake:
- When the sub dimension is used by several dimensions
- When the sub dimension is used by both the main dimension and the fact table(s)
- To make “base dimension” and “detail dimensions”
- To enrich a date attribute
I’ll go through one by one. As usual it is easier to explain & learn by example.
When the sub dimension is used by several dimensions
Example: In insurance data warehouse, the City-Country-Region columns which exist in DimBroker, DimPolicy, DimOffice and DimInsured, could be replaced by LocationKey pointing to DimLocation. Some call it DimGeography. This gives us a consistent hierarchy, i.e. relationship between City, Country & Region.
Whilst the advantage of this approach is consistency, the weakness of this approach is that we would lose flexibility. For example, in DimOffice, we could have 3 hubs (EMEA, America and Asia Pacific), not 7 regions used by DimBroker, DimPolicy & DimInsured (North America, South America, Africa, Europe, Middle East, Asia, Australia). It is a common understanding that the relationship between City and Country are more or less fixed, but the grouping of countries might be different between dimensions. If we put City, Country & Region in each of the 4 dimensions, we have the flexibility of each dimension having different hierarchy. In the DimOffice case we above have 2 options: a) put the City, Country, Region in DimOffice, or b) have 2 different attributes in DimLocation: Region and Hub. Some people go for c) create DimOfficeLocation, but I don’t see the point if this dim is only used in DimOffice, might as well unite them with DimOffice as per approach a).
Other examples of this case are: a) DimManufacturer and DimPackaging, used by several product dimensions in a manufacturing data mart, b) DimEmployee in Project data mart, and c) DimCustomer in a CRM data warehouse, d) DimBranch (some call it DimOffice) in Retail Banking warehouse. I’ve heard a few discussion about DimAddress, which like DimLocation is used by DimOffice, DimCustomer, DimSupplier etc (Retail mart), but usually the conclusion was it’s better to put address attributes (street, post code, etc) directly in the main dimension.
Sometimes the designer use hybrid approach, i.e. they put the attributes both in the main dim and in the sub dim. Again this is for flexibility reason, particularly if the sub dim is used directly by several fact tables (see below).
When the sub dimension is used by both the main dimension and the fact table(s)
A classic example of this situation is DimCustomer, when an account can only belong to 1 customer. DimCustomer is used in DimAccount, and is also used by the fact tables. Other examples are: DimManufacturer, DimBranch, DimProductGroup. In the case of DimProductGroup, some fact tables are at product level, but some fact tables are at Product Group level. Hence we need both DimProduct and DimProductGroup. The option here is a) put the product group attribute in both DimProduct and DimProductGroup, or b) snowflake, i.e. DimProduct doesn’t have Product Group attributes; it only has ProductGroupKey.
To Make A Base Dimension and Detail Dimensions
The classic examples of this case are: Insurance classes or LOB (line of business), Retail Banking account types, Attributes for different Product Lines, and in Pharma we have Disease/Medicine categories.
Insurance policies from insurance classes or LOB (e.g. marine, aviation, motor, property) have different attributes. So we pull the common attributes into 1 dimension called DimBasePolicy and the class-specific attributes into DimMarinePolicy, DimMotorPolicy, etc. LOB is a US market term, whereas Class is a Lloyd’s term (London market). Similarly, in retail banking we have DimBaseAccount, DimSavingAccount, DimMortgageAccount, DimCurrentAccount, etc. In investment banking different asset classes have different attributes.
The alternatives of this design are: a) have one generic detail dimension, with 100 attributes from different categories, b) a normalised version with 4 columns or so.
Approach a) would be very wide and sparse because (for example) Marine rows only use Marine attributes, etc. But it is easier to use than the “base-detail” approach. Approach b) would be in-query-able (is that a word?) and difficult to use/join.
To Enrich a Date Attribute
A date attribute (e.g. 2011-03-11) is often analysed by Month, Quarter or Year so if we have MaturityDate, EffectiveDate and CancellationDate in the dimension and the business needs to analyse each of them by Week, Month, Quarter, Year then we would need to have 12 attributes in the dim for this.
By replacing the date attribute with the date key, we could analyse the date by any attributes in the date dim. In this case I’d recommend to use a smart integer date key rather than a surrogate date key, e.g. 20110311 rather than 12345. It is more user-friendly, it would still have meaning if you don’t link to DimDate. The smart date key is a generally accepted exception to the Kimball rule of surrogate key.
This will be one of the topics I’m presenting at SQLBits 8. As usual I welcome any comments and discussion at vrainardi@gmail.com. Vincent 11/3/11
[...] 原文链接:http://dwbi1.wordpress.com/2011/03/11/when-to-snowflake/ [...]
Pingback by 什么时候该用雪花模型? | 数据仓库 — 15 March 2011 @ 6:10 am |
[...] 之前翻译了Vincent的一篇《When To Snowflake》,后来通过email与他结识,有幸参与了讲稿的review,不过说实话我自己的维度建模经验都非常少,只是提了一些不痛不痒的问题和评论,少部分被采纳,我是很高兴的。与Vincent的交流很愉快,他是个很真诚且乐于分享的人,无论是他的blog还是这次演讲,我想我都能学到很多,我想并希望我们能继续保持联系:-) [...]
Pingback by SQLBits 8 Brighton: 高级维度建模 | 数据仓库 — 8 April 2011 @ 8:35 am |
Any low cardinality attributes in a monster dimension (>million rows) should be snowflaked for the purpose of storage saving and performance.
Comment by hang Liu — 10 April 2011 @ 1:44 am |
[...] Rainardi (blog) has written an excellent post on When to Snowflake your dimensions in the data warehouse side, which would be a good read prior to this post. If your [...]
Pingback by When and How to Snowflake Dimension Sources : SSAS Design Part 1 - Some Random Thoughts — 3 June 2011 @ 8:38 am |