Data Platform and Data Science

17 June 2010

One or Two Dimensions

Filed under: Data Warehousing — Vincent Rainardi @ 5:41 am
Tags:

Each customer can have more than one account. Should we

a)  create one dimension: dim_account, with grain = account, PK = account_key, and customer details become attributes of this dimension. On the fact table we only have account_key FK.
b)  create two dimensions: dim_account (grain = account, PK = account_key) and dim_customer (grain: customer, PK: customer_key). On the fact table we have 2 FKs: account_key and customer_key.
c)  snowflake, i.e. create dim_account and dim_customer as per b), but on the fact table we only have 1 FK: account_key. On dim_account we have customer_key FK.

What are the advantages and disadvantages for each of the three approaches above? In what case/situation does each approach work best?

Before I start, there are 2 more approaches, derived from the above 3 main ones above:
d)  as per b) but on dim_account we have customer_key FK, which connects dim_account directly to dim_customer
e)  as per a) but on dim_account we have customer_key column, and on the fact table we have 2 FKs: account_key and customer_key

OK, now the advantages and disadvantages. For each of the 5 approaches above, I will elaborate the main advantage, in which situation is that approach best used, the main disadvantage and any issues when using that approach.

a) One Dimension

The main advantage is simpliciy, i.e. there is only 1 dimension and 1 fact table, and a 1-column link between them. This approach is best used when we don’t have other fact table(s) with grain = customer and we don’t need dim_customer to be created.

The main disadvantage is we don’t have dim_customer. So a fact table with the grain = customer can’t be built. An attempt to use dim_account for that fact table (using account_key to link) will result in difficulties in maintaining the account_key column on the fact table (to point to the right customer). The difficulties are especilly felt when the dim is SCD type 2.

b) Two Dimensions

The main advantage is: from the fact table we can the get customer details without knowing the account_key. In which Ideal to be used when we have a fact table with the grain = account and another fact table with grain = customer, but we don’t have the need to get the customer details for a particular account.

The main disadvantage is: we cannot ‘go’ from dim_account to dim_customer (and vice versa) without going through the fact table. All ‘traffic’ between dim_account and dim_customer must go via the fact table. How? Using select distinct, e.g.

select distinct account_key, customer_key from fact1

The fact table could be big. If it’s a 1 billion row fact table, this query would be slow. An unclustered index on both the account_key and customer_key helps the performance, but still a lot slower than approach c), d) or e). Unfortunately, we can’t materialize the above query using an indexed view. The select statement in an indexed view cannot contain DISTINCT keyword.

c) Snowflake

The main advantage is: it’s modular i.e. we have 2 separate dimension tables but can combine them easily to create a bigger dimension. If we have a fact table with the grain = account and another fact table with grain = customer, and we need to get the customer details for a particular account without going through the fact table, then this is the ideal approach to use.

The main disadvantage of snowflake approach is: in a fact table with the grain = account, to get the breakdown of a measure by customer details it’s a little bit more complicated than approach a).

In Figure a) : (One Dimension approach)

select a.customer_attribute1, sum(f.measure1)
from fact1 f
inner join dim_account a on f.account_key = a.account_key
group by a.customer_attribute1

In Figure c) : (Snowflake approach)

select c.customer_attribute1, sum(f.measure1)
from fact1 f
inner join dim_account a on f.account_key = a.account_key
inner join dim_customer c on a.customer_key = c.customer_key
group by c.customer_attribute1

But this disadvantage is relatively very small compared to the advantage of having a customer dimension which is linked to the account dimension.

d) Two Dimensions with Inter-Dimension Link

The main disadvantage of the Two Dimensions approach is that we can’t “go” from account dimension to customer dimension without going through the fact table. And the main disadvantage of the Snowflake approach is that we can’t access dim_customer directly from the fact table. This approach is trying to fix both problems. So there are 2 main advantages:

a)  we can “go” from account dimension to customer dimension, and
b)  we can access dim_customer directly from the fact table

When should we use this approach? When a) we have a fact table with the grain = customer so we need to create dim_customer, b) when we need to access customer details from the fact table directly, and c) we need to get the customer details for a particular account without going through the fact table.

What is the main disadvantage of this approach? The “duplicated effort” in maintaining the customer key. We maintain the customer key column in the fact table as well as in the dim_account.

e) One Dimension With Customer Key

This approach is trying to fix the disadvantage of the One Dimension approach, which is the inability to build a fact table with the grain = customer because we don’t have dim_customer with its own customer_key.

So in this approach we add a column to the dim_account: customer_key. This enable us to build a fact table with the grain = customer, but keeping the number of dimensions to one.

This approach is not as popular as c) and d) in solving dim_customer issue because it is “undecisive”. It is trying to create a dim_customer but it doesn’t want to create a dim_customer. In c) and d) we have dim_customer physically, but in e) the dim_customer is hidden inside dim_account. Making it a) more difficult to maintain (especially if it is a type 2), and b) less modular/flexible, which are the main disadvantages of this approach.

Please bear in mind that the above concept is applicable not only dim_customer & dim_account, but to all dimensions with hierarchical relationship. For example, dim_product and dim_product_group, dim_policy and dim_line.

13 Comments »

  1. […] Answer: Depends on how they are going to be used, as I explained in my article “One or two dimensions” here. […]

    Pingback by Data Warehousing Interview Questions « Data Warehousing and Business Intelligence — 11 December 2010 @ 10:59 pm | Reply

  2. […] via the fact table. Which is why some designer “link the 2 dimension via a back door”. Read here for details, case […]

    Pingback by Denormalising a Fact Table « Data Warehousing and Business Intelligence — 26 February 2011 @ 9:00 am | Reply

  3. In the section … you begin the second example with … Should it be ? You are joining dim_customer to dim_account because it is a snowflake.

    Comment by Aalam Rangi — 18 September 2014 @ 3:52 pm | Reply

    • Sorry Alaam, which section did you mean? And which second example did you mean?

      Comment by Vincent Rainardi — 18 September 2014 @ 4:28 pm | Reply

      • Wow, the comment box just gobbled my words 🙂 I think it cuts out parentheses or colons.

        Anyway, I was referring to section C-Snowflake which has two example code snippets in blue. The second snippet is preceded by the intro text IN B which refers to the figure b (Two Dimensions). Should it not be referring to figure c (Snowflake) because the snippet has an INNER JOIN between dim_customer and dim_account?

        Comment by Aalamjeet Rangi — 19 September 2014 @ 6:00 pm

      • Yes you are right Aalam, it is figure c) not figure b). I have corrected it. Thanks for the correction.

        Comment by Vincent Rainardi — 19 September 2014 @ 11:45 pm

  4. My previous comment lost some words so rewriting. In the section c) Snowflake … you begin the second example with In b): … Should it be written as In c): ? Because you are joining dim_customer to dim_account as a snowflake.

    Comment by Aalam Rangi — 18 September 2014 @ 3:54 pm | Reply

  5. You are a very skilled writer. best blog for data warehouse and business intelligence. I have also tried writing few articles on dwbi

    http//:www.dwbimaster.com

    Comment by sangita — 16 December 2015 @ 1:15 pm | Reply

  6. for approach d).. is ‘customer_key’ a natural key or surrogate key? and what is the customer dim and account dim is multi to multi relationship? thanks

    Comment by Howard Z — 21 January 2017 @ 4:45 am | Reply

    • Hi Howard, in approach d) customer_key is a surrogate key. If the customer dim and account dim is many-to-many relationship, we can’t use approach d). Approach d) can only be used if it is one-to-many relationship (1 account belongs to only 1 customer, 1 customer can have many accounts). If it is many-to-many relationship (1 account belong to many customers, 1 customer can have many accounts) then we need to create a new “factless” fact table with only 2 columns (customer_key and account_key) to store this relationship. The fact table in approach d) contains measures so it would be double counting if we create 2 rows for 1 account (to represent 2 customers), therefore it needs to be a factless fact table.

      Comment by Vincent Rainardi — 21 January 2017 @ 7:13 am | Reply

  7. interconnected dimension, what if I have 10 dimension tables, how will you interconnect them? and is it ok to take this approach.
    also can you please explain ripple effect in data warehouse.

    Comment by Punit — 8 July 2017 @ 4:03 pm | Reply

    • Hi Punit, ripple effect in data warehousing is when a parent entity (eg product family) gets a new version, then all its children and grand children (eg product group and product) will need to have a new version too. This is best explained by Dani Schnider et al in their book here: https://www.trivadis.com/sites/default/files/downloads/Comparison_DWH_Core_Modeling.pdf
      I don’t believe there can be 10 interconnected dimensions in real project, please let me know what dimensions they are.

      If it is just a theoretical case, the theoretical answer is: no, don’t interconnect them, but flatten them into 2-3 big dimensions instead.

      Thanks for your question. You are clearly an expert in data modelling and I would be grateful if you could share your own view on the above.

      Comment by Vincent Rainardi — 9 July 2017 @ 8:06 am | Reply


RSS feed for comments on this post. TrackBack URI

Leave a comment

Blog at WordPress.com.