Data Warehousing and Business Intelligence

11 February 2016

Six Dimension Types

Filed under: Analysis Services — Vincent Rainardi @ 7:29 pm

In data warehousing there are 6 types of dimension:

  1. Normal dimension
  2. Junk dimension
  3. Split dimension
  4. Text dimension
  5. Stacked dimension
  6. Distinct Attribute dimension

 

1. Normal Dimension

A normal dimension is when all attributes are related (they are all about 1 entity, e.g. Product), it has a business key (natural key) and all attributes are dependent on the surrogate key, like this:

1 Normal Dimension

In the above example, the business key is PRODUCT_CODE.

Sometimes we find a date attribute created as a date column (DATE_PRICE_UPDATED in the above example, and sometimes as a surrogate key (DATE_LAUNCHED_KEY in the above example). Another snowflake example is SUPPLIER_KEY, enabling us to access all the supplier attributes stored in the DIM_SUPPLIER.

The above example is type 2, but a normal dimension can be Type 0, 1, 2, 3, 4, or 2+1. Type 4 is a temporal table, type 0 is static, e.g. dim_date. Type 2+1 is where for the same attribute we have type 1 column (latest value) and type 2 column (historical values over many rows).

2. Junk Dimension

The classic scenario is: the source table in the transaction/source system has 4 to 8 columns, each with 2 to 6 values (some are Yes/No columns). And these columns are not related to each other, and they are not an attribute of a big dimension. Most of these columns explain the fact table itself.

As an example, let’s consider a payment table in a transaction system. The payment table has order ID column (which can be composite, i.e. a group of several orders), customer ID column, due due date, actual payment date, payment amount, payment method. But this table also has the following columns: processed_indicator, has_deposit, is_partial, and recurring_flag. Each of these columns is Yes/No column.

So we can create a junk dimension like this:

2 Junk Dimension

Note that a junk dimension is always type 0 (static). The name of the dimension contains the word “junk”, usually after “dim” rather than at the end. For the Y/N columns, the column names are standardised with “IS_…” or “…_FLAG”. The data type is consistent, i.e. for a Y/N column either it is bit or CHAR(1), but not INT nor VARCHAR(N) – my preference is CHAR(1) for clarity of values, and possibility of expanding into 3 or 4 values.

A junk dimension does not have a business key.

Expanding a junk dimension: if we add a column to the junk dimension, here’s what happens. The above junk dimension has 9 rows i.e. 8 for the combination of 4 Y/N columns, plus the unknown rows. If we add 1 more Y/N column (say IS_REFUNDABLE), the number of rows becomes 17, i.e. for the existing 8 rows (not the unknown row), we set IS_REFUNDABLE to N. Then we copy the existing 8 rows to row 10-17, with IS_REFUNDABLE as Y.

A junk dimension can also expand because there are more value in a column. For example: the IS_PROCESSED column was Y/N column but now it also has a third value, which is U (unknown).

In the fact table we have JUNK_PAYMENT_KEY column, containing value 0 to 17.

Note that in a junk dimension the attributes are not related to each other. For example, they are not customer attributes or product attributes. Because if they are, they should be moved to the customer dimension or product dimension. The attributes in the junk dimension are at transaction level (or whatever the fact table is about), not at the dimension level.

3. Split Dimension

When a dimension is predicted to be very big, say 20 million rows (usually because it has many type 2 attributes), the designer may take a conscious decision to split the dimension into two or three, for performance reason. A classic scenario is customer dimension for ecommerce companies with internet-based customer registration facility.

The split is always vertical, i.e. some columns go to dim 1 and some columns go to dim 2. For example, the customer contact details are put into dim_customer_contact, the customer attributes related to order processing are put into dim_customer_order, and the customer attributes related to marketing, promotions and loyalty programs are put into dim_customer_marketing.

3 Split Dimension

The related rows in each dimension have the same business key (natural key). It is not practical to put the surrogate key of the other dimension(s). It is much better to put the business key instead.

When a big type 2 dimension is split into two or three, in most cases all of the resulting dimensions are also type 2. This is because we split by function (business area). It would not be good to split by SCD type of the attributes, i.e. all type 1 columns go to dim 1 and all type 2 columns go to dim 2, because it would be impossible to construct the hierarchies.

Joining split dimensions: we join on the business key, taking the latest version. The best practice is not to join on-the-fly (using a view), but during the ETL, stored physically as DIM_CUSTOMER table, containing the latest version of customer records, having all 3 surrogate keys. Joining on-the-fly usually have performance issue when querying because of the data volume. During the ETL we can use a procedure which creates 3 temp tables containing the latest versions of all customers (without the system columns e.g. effective date columns), then join them. Then we update only the changed rows, which is best done using hashbytes or checksum depending on row length and null values, with DIM_CUSTOMER clustered index on the business key (CUSTOMER_ID).

4. Text Dimension

If the source transaction table has narrow text column, say 10 to 20 characters, and this column is not at dimension level, but is at the fact table level, then we leave it in the fact table. For example: order ID, transaction ID, payment ID.

But if the source transaction table has a wide text column, say varchar(255) or varchar(800), we have two design choices. The standard text book way is to put this varchar column in a dimension, which I called Text Dimension. The other design choice is to leave it in the fact table. An example is a free text comment column, a transaction description column, or a order note column.

4 Text Dimension

There are 2 considerations to choose which design: the length of the text column, and the repeatability of the values. The longer the text column, the more right it is to put it in a dimension, especially if the fact table is large. This is because of space consideration, i.e. if the fact table is more than 1 TB, DBAs might raise concern of its management (backup, indexing, stats) as it needs to be duplicated into Dev, UAT and DR environment.

If the value in this varchar column is highly repeatable, say a factor of 10 (10 million rows equal to 1m distinct rows), then we safe consideration of space. And performance too, as most of the queries accessing the fact table are not concerned with the text column.

Boundary: If the length of the text column is 20-70 characters, and the repeatability is very low (say a factor of 1.2) then it is reasonable to leave this text column in the fact table. But if the length is over 100 character, it is difficult to argue that the performance hindrance is outweighing the user-friendliness.

SCD: a text dimension is always type 1.

Business key: a text dimension does not have a business key.

Combining several text dimensions: if we have two commentary/notes column in the source table, it is best to create two text dimension, rather than combining them into one text dimension, because a) functional clarity – different surrogate keys in the fact table describe what they are, and b) query performance – the dim tables will have fewer rows, avoiding Cartesian explosion when they are combined

5. Stacked dimension

A stacked dimension is a dimension where two or more dimensions are combined into one dimensions, like this:

5 Stacked Dimension

A stacked dimension only has one or two attributes, and it is always SCD type 0 (no updates).

A stacked dimension is not recommended. It is wrong to do it like that. But they do exists. Usually because it was like that in the source system, so they simply copied it into the data warehouse.

We often come across a few type and status columns: customer status, product type, store type, security type, security class, broker type, etc. All of these columns should go to their respective dimensions, because they are truly the property of that dimension.

But there are type and status columns which are the property of the fact table, such as: transaction type or transaction status. To combine the transaction type and the transaction status into one dimension, we just a junk dimension, as in point #2 above. Not using a stacked dimension like this point.

6. Distinct Attribute Dimension

The last type is a distinct attributes dimension where all the attributes are the property of the fact table itself. For example, consider a holding table in fund management, where the grain is one row for each fund, security and date. Each security has many attributes, such as sector, rating, country, currency, asset class, etc. Examples of security are bonds, equities, CDS, IRS, options, futures and swaps. In theory these attributes are consistent across funds. But that is not the case, because there might be an overwrite.

In a normal situation, sector, rating, country, currency, and asset class are all properties of the security dimension. But because these “security attributes” can be different across funds, then they are truly the property of the holding fact table, and they becomes “position attribute”. And therefore, to store them properly, we should create a distinct attribute dimension like this:

6 Distinct Attribute Dimension

The above example is over simplified. In reality there are many different rating columns, e.g. S&P, Moody’s, Fitch, internal/house, and combinations between them. There are many different types of industry sectors such as Barclays, ICB, GICS (MSCI), Moody’s, etc. There are many different asset class categorisation. So in practice the number of columns in the above dimension is not 6, but closer to 50.

A “distinct attribute” dimension is always SCD type 0. We can’t change existing rows (no update). We can only insert new rows.

As the name implies, the above dimension contains a distinct list of attributes. Not all the possible values, but only values which actually exist in the fact table. So we don’t have a Cartesian explosion here.

A “distinct attribute” dimension does not have a business key. All the attributes are the property of the fact table, not the property of a particular dimension.

A “distinct attribute” dimension can be split into two or three physically. It is a vertical split, like in point #3 above (Split Dimension), based on logical grouping of the attributes. The purpose of splitting it is to reduce the number or rows, and thus improve the query performance.

In a way this becomes like a junk dimension, but covering a wider range of attributes, i.e. 50-60 columns instead of only 3-5 columns.

Advertisements

Leave a Comment »

No comments yet.

RSS feed for comments on this post. TrackBack URI

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Create a free website or blog at WordPress.com.

%d bloggers like this: