Data Warehousing and Data Science

28 August 2015

3 Things in Date Dimension

Filed under: Analysis Services — Vincent Rainardi @ 6:16 pm

In this article I would like to write about 3 things in Date Dimension, i.e. a) minimize the number of columns, b) minimize the number of rows, c) last business day / month end indicator.

Minimize the Number of Columns

This month I saw a date dimension with only about 15 columns. In the past I have seen many date dimensions, and they all have a lot of columns. Probably about 40-50 columns. Now that I’ve seen the slim version, I’ve been asking myself whether those 40-50 columns are really required. Fiscal week, fiscal period, fiscal quarter, fiscal year: do we use them? System date (char 10), SQL date (datetime), Julian date (int): do we use them? Day (1-7), day of the week (long and short), month name (long and short), do we use them? System columns such as date created, date updated, do we use them? Week number, week start date, week end date, day of the month, day of the quarter, day of the year: do we use them? Week day flag, week end flag, last day of the month flag, last day of the quarter flag, last day of the period flag, last working day of the month, bank holiday flag (which country?): do we use them?

Most of the those columns are created just because we can, not because we use them. If we are being minimalist, I think we could probably get away with just 4 columns: date SK, date (date type = date, not datetime), month (int, YYYYMM), year (int, YYYY).

Minimize the Number of Rows

Also this week I saw for the first time a date dimension which is up to year 2400. I think it is not right, and it is unnecessary. I think there are fixed income securities with expiry date of 2050, but expiry date is not being keyed and linked to date dimension. It is usually stored as an attribute of security/instrument dimension, in date format e.g. DD/MM/YYYY (or MM first in the US).

So what we need to store as rows in Date Dimension is the oldest transaction date (2005? 2010?) and further 5 years from now (2020). Span of 10-15 years should be enough.

Last Business Date in the Month (Month End Date Flag)

If you work for a local company (local being defined as 1 country), then the last business date in the month aka Month End Date flag is very useful. It’s not the last day of the month (e.g. 28/2, 31/3, 30/4, etc) but the last business day in the month. For example, this month is August 2015. In the UK, Monday 31st August is a bank holiday. So if you work for a UK company (local), then the month end date is 28th August (today). I found that this flag is useful (Month End Date flag). It enables us to retrieve month end data from the fact table.

21 August 2015

Distinct Attributes Rows in Dimension Tables

Filed under: Analysis Services — Vincent Rainardi @ 6:05 pm

There are two cases where it is a good idea to create rows in dimension tables containing distinct attributes values. The first one is prospect rows in the customer dimension. The second one is unsecuritised rows in the security dimension.

Prospects (prospective clients) can either be put in the customer dimension, or in a separate table (prospect dimension). Sales fact only uses customers, so there is no problem there. But Sales Pipeline fact uses both prospects and clients, so if we separate them into different dimensions, we have a problem.

Sales Pipeline contains information about the status of prospective clients, until they become clients (on-boarded). If we have 2 separate dimensions (DimProspect and DimCustomer) then on FactSalesPipeline we have to create 2 dim keys.

Fact Sales Pipeline

The top row is a client, and the bottom row is a prospect. Because prospects and customers share many similar attributes, it makes sense to put prospects in the customer dimensions. The problem is, in some companies, prospects have no business keys, unlike customers which have business keys (Customer ID). This is because prospects are recorded in Sales Pipeline book/system without registering them in the main transaction system.

But in their marketing analysis or pipeline, the business needs include both prospects and customers. Because of this, it makes sense to put them in one dimension tables. And because prospects have no business keys, the only way to check if a row already exists in the dimension, is by doing select distinct on all attributes, and combine them in one table (customer dimension):

DimCustomer

The second case, unsecuritised rows in security dimension is a little bit different (for background see here), but the potential solution is the same: distinct attribute rows. Unsecurised positions will have to handled using junk dimension, i.e. we put the security attributes such as country, rating, sector into a junk dimension, because we don’t have a business key to identify them in the security dimension.

Multi-legged positions such as IRS, currency swap and CDS are tricky to handle, because different legs may have different security attributes, e.g. the legs could be in different currencies, or different risk countries, or different asset types.

By putting a combination of these attributes into security dimension and giving it a surrogate key (just like in a junk dimension) the fact table could refer to those attributes. Of course for these rows there is no SCD; if the combination is not there then simply create a new row.

The advantage of doing this (combining both real securities and attributes for unsecuritised rows in the security dimension) are:

  1. The fact table only deal with 1 dimension, rather than two, or 20 (i.e. if we put individual attributes in their own dimension, each with SK directly on the fact table).
  2. We can create hierarchies, so we can “go up and down” the hierarchies, meaning that we can aggregate the measure at different hierarchy levels.

Flip Flopping in Dimension Tables

Filed under: Analysis Services — Vincent Rainardi @ 7:01 am

When a dimension is populated from two different sources, a phenomena known as Flip Flopping could occur. Source1 populates an attribute with Value1, whereas Source2 populates the same attribute with Value2. Everyday, the ETL creates a new row with the attribute = Value1. Then a few seconds later it closes that row and creates another new row with the attribute = Value2.

Flip Flop

Today is 21/8/2015, current time: 7:35am. ProductKey 3156 is yesterday’s rows. The ETL this morning closed this row at 4:12:27. It then created a new row (3157) because it thought that the size of the tumble dryer was 7 kg (sourced from Source1). 4 seconds later, the ETL read Source2 and the size there was 6 kg. So it closed 3157 and created 3158 with size = 6 kg, which is now the active row.

To solve this, either update the source system to agree on the same size (best approach), or customise the ETL to ignore the size data from one of the two sources.

Blog at WordPress.com.