One of the most frequently asked questions in dimensional modelling is “when to snowflake”. Everybody talks about “when not to snowflake”, but “when to snowflake” is very rarely discussed. It is a general consensus in the data warehousing world that we must always use star schema for the presentation layer, but as with everything else in the world, there are always two sides of it. Sometimes more than two.
In this article I will not go through (at all) about “star vs snowflake” arguments. There are many web pages and books explaining this. I’m going to jump straight to “when do we snowflake”.
I found that there are a few situations where we should consider snowflake:
- When the sub dimension is used by several dimensions
- When the sub dimension is used by both the main dimension and the fact table(s)
- To make “base dimension” and “detail dimensions”
- To enrich a date attribute
I’ll go through one by one. As usual it is easier to explain & learn by example.
When the sub dimension is used by several dimensions
Example: In insurance data warehouse, the City-Country-Region columns which exist in DimBroker, DimPolicy, DimOffice and DimInsured, could be replaced by LocationKey pointing to DimLocation. Some call it DimGeography. This gives us a consistent hierarchy, i.e. relationship between City, Country & Region.
Whilst the advantage of this approach is consistency, the weakness of this approach is that we would lose flexibility. For example, in DimOffice, we could have 3 hubs (EMEA, America and Asia Pacific), not 7 regions used by DimBroker, DimPolicy & DimInsured (North America, South America, Africa, Europe, Middle East, Asia, Australia). It is a common understanding that the relationship between City and Country are more or less fixed, but the grouping of countries might be different between dimensions. If we put City, Country & Region in each of the 4 dimensions, we have the flexibility of each dimension having different hierarchy. In the DimOffice case we above have 2 options: a) put the City, Country, Region in DimOffice, or b) have 2 different attributes in DimLocation: Region and Hub. Some people go for c) create DimOfficeLocation, but I don’t see the point if this dim is only used in DimOffice, might as well unite them with DimOffice as per approach a).
Other examples of this case are: a) DimManufacturer and DimPackaging, used by several product dimensions in a manufacturing data mart, b) DimEmployee in Project data mart, and c) DimCustomer in a CRM data warehouse, d) DimBranch (some call it DimOffice) in Retail Banking warehouse. I’ve heard a few discussion about DimAddress, which like DimLocation is used by DimOffice, DimCustomer, DimSupplier etc (Retail mart), but usually the conclusion was it’s better to put address attributes (street, post code, etc) directly in the main dimension.
Sometimes the designer use hybrid approach, i.e. they put the attributes both in the main dim and in the sub dim. Again this is for flexibility reason, particularly if the sub dim is used directly by several fact tables (see below).
When the sub dimension is used by both the main dimension and the fact table(s)
A classic example of this situation is DimCustomer, when an account can only belong to 1 customer. DimCustomer is used in DimAccount, and is also used by the fact tables. Other examples are: DimManufacturer, DimBranch, DimProductGroup. In the case of DimProductGroup, some fact tables are at product level, but some fact tables are at Product Group level. Hence we need both DimProduct and DimProductGroup. The option here is a) put the product group attribute in both DimProduct and DimProductGroup, or b) snowflake, i.e. DimProduct doesn’t have Product Group attributes; it only has ProductGroupKey.
To Make A Base Dimension and Detail Dimensions
The classic examples of this case are: Insurance classes or LOB (line of business), Retail Banking account types, Attributes for different Product Lines, and in Pharma we have Disease/Medicine categories.
Insurance policies from insurance classes or LOB (e.g. marine, aviation, motor, property) have different attributes. So we pull the common attributes into 1 dimension called DimBasePolicy and the class-specific attributes into DimMarinePolicy, DimMotorPolicy, etc. LOB is a US market term, whereas Class is a Lloyd’s term (London market). Similarly, in retail banking we have DimBaseAccount, DimSavingAccount, DimMortgageAccount, DimCurrentAccount, etc. In investment banking different asset classes have different attributes.
The alternatives of this design are: a) have one generic detail dimension, with 100 attributes from different categories, b) a normalised version with 4 columns or so.
Approach a) would be very wide and sparse because (for example) Marine rows only use Marine attributes, etc. But it is easier to use than the “base-detail” approach. Approach b) would be in-query-able (is that a word?) and difficult to use/join.
To Enrich a Date Attribute
A date attribute (e.g. 2011-03-11) is often analysed by Month, Quarter or Year so if we have MaturityDate, EffectiveDate and CancellationDate in the dimension and the business needs to analyse each of them by Week, Month, Quarter, Year then we would need to have 12 attributes in the dim for this.
By replacing the date attribute with the date key, we could analyse the date by any attributes in the date dim. In this case I’d recommend to use a smart integer date key rather than a surrogate date key, e.g. 20110311 rather than 12345. It is more user-friendly, it would still have meaning if you don’t link to DimDate. The smart date key is a generally accepted exception to the Kimball rule of surrogate key.
Reply to Jacob’s 2nd comment, 25 September 2013 @ 4:39 pm (I put it here as I can’t put pictures in the comment section).
Hi Jacob, what I meant by “By putting geographical attributes in the individual dimension, we have more flexibility” is:
Scenario 1: Put geography attributes in the 3 dimensions
Scenario 2: Scenario 2: Put geography attributes in DimLocation
In scenario 1, we have the flexibility of grouping the countries in 3 different ways, i.e. in DimInsured Thailand is in APAC ex Japan region, in DimBroker Thailand is in Asia Pacific region and in DimOffice Thailand is in Asia region.
In scenario 2, we don’t have that flexibility. Thailand is always in Asia Pacific region. This is good for consistency of regional grouping across the company.
If we want to create this flexibility, we can create 3 columns in DimLocation: InsuredRegion, BrokerRegion and OfficeRegion. Let’s call this scenario 3. In scenario 3, Insured, Region and Office can have different regions. The advantage of scenario 3 over scenario 1 is that the relationship between city and country is centralised in one place. Whereas in scenario 1 the relationship between city and country are located in 3 different places.
The disadvantage of scenario 3 is: if the relationship attributes is not fixed like city-country, but can change over time. In this example below, we have Territory, Region and Division. The DimLocation looks like this:
We have 3 types of Territory (Supply Territory, Sales Territory, Service Territory), 3 types of Region (Supply Region, Sales Region, Service Region) and 2 types of Division (Sales Division, Service Division).
In this scenario, it is better to put the City, Territory, Country, Region and Division in DimSupplier, DimSalesperson and DimService.
A hub is a grouping of region, i.e.
A SubDim (sub dimension) is a table attached to a main dimension. Please see my SQLBit Brighton presentation about dimensional modelling here, about “1 or 2 dimensions” (page 5 to 8) and “when to snowflake” (page 9 to 11). You are right that DimLocation can be a subdim (attached to DimBroker), or a main dim (attached to a fact table).
Hope this helps Jacob,