Data Warehousing and Business Intelligence

9 August 2011

Expiry Date column in SCD Type 2 Dimension

Filed under: Data Warehousing — Vincent Rainardi @ 4:19 am
Tags:

There are 3 parts of this article, all about Expiry Date column in SCD type 2 dimension.

  1. Date only Expiry Date (no time component), set to Effective Date or a day before?
  2. Expiry Date: Should We Have the Millisecond?
  3. Expiry Date of the current row: NULL or max date?

A. Expiry Date without the Time Component

If the Effective Date and Expiry Date in our SCD type 2 dimension is date only (no time component), should we set the Expiry Date = Effective Date of the new row, or a day before?

As always it is easier to explain by example. This is a typical SCD type 2 dimension:

The question is: should the expiry date on the first row be 05/08/2011 instead?

Let’s see what happens if it’s 04/08/2011 like in the above table. If the transaction date in the source system has no time component, e.g. 03/08/2011 then it will fit nicely. The fact row will be mapped to exactly 1 row in the above type 2 dimension.

But if the transaction date in the source system has the time component e.g. 04/08/2011 11:23:09 then it will “fall between the cracks” e.g. fall between the expiry of the first row (04/08/2011) and the effective date of the 2nd row (05/08/2011), so it will neither be mapped to the 1st row nor the 2nd row of the dimension. This is because 04/08/2011 means 04/08/2011 00:00:00 and 05/08/2011 means 05/08/2011 00:00:00 so 04/08/2011 11:23:09 neither belong to the 1st row nor the 2nd row.

So in this case we will need to set the Expiry Date of the first row to 05/08/2011, so that 04/08/2011 11:23:09 will belong to the 1st row.

We will also need to make a “convention” in the ETL that “midnight” dates (e.g. with time component = 00:00:00) will fall to the 2nd row, e.g. if the transaction date is 05/08/2011 00:00:00 then the ETL process will map it to the 2nd row.

The best one is of course to set the data type of Effective Date and Expiry Date column to datetime instead of date data type, like below:

With datetime data type like above, we can be sure that a transaction date with time component will never “fall between the cracks”. It will always be mapped to a dim row.

B. Should We Have the Millisecond?

But what if the transaction time is 04/08/2011 23:59:59.100, i.e. 100 milliseconds after the expiry date? In a global data warehouse this is possible. Should we have the millisecond like below?

This way, 04/08/2011 23:59:59 100 will belong to the first row.

What if the transaction time is 04/08/2011 23:59:59.1234567? With the arrival of the datetime2 data type in 2008 this is now possible to have this time value in the source system. Datetime2 has a precision of 100 nanoseconds, i.e. 7 digits after the second.

Should we set the data type of the Effective and Expiry Dates to datetime2 instead of datetime? If you use 2008 yes you should, in my opinion. So that the SCD2 dim would look like this:

But if you use 2005, then on the ETL you need to have a convention. For example: anything that is less that 05/08/2011 (like 04/08/2011 23:59:59.1234567) will be mapped to the first row.

C. Expiry Date of the Current Row: NULL or max date

In all the above examples, I set the Expiry Date of the current row as a max date, e.g. 31/12/9999 23:59:59.9999999. In some data warehouses, the Expiry Date of the current row is set as NULL. Which one is better and why?

As always there are pluses and minuses to everything.

If we put a certain date, e.g. 31st December with the year = 2999 or 9999, we will have to hardcode this date in the ETL. And it’s not only 1 place, but in many places. In the insert script of the ETL of every table. And there might cause different perceptions/understanding. Developer1 thinks it’s year 2999 whereas Developer2 may think it’s year 3999. Tester1 thinks it’s 1st Jan 9999, Tester2 may think it’s 31st Dec 9999. Documentation1 says 2999, documentation2 says 9999, etc.

On the other hand if we put NULL it’s simpler. Because it’s only 1 value. But using NULL we can’t compare with a certain date. If we want to query rows which have ExpiryDate greater than today for example, the NULL ExpiryDate will fail. So we can’t do

select * from t1 where ExpiryDate >= getdate()

In order to get rows with ExpiryDate greater than today, we have to add “or is null” like this:

select * from t1 where ExpiryDate >= getdate() or ExpiryDate is null

Kind regards,

Vincent Rainardi, 9th August 2011

Advertisements

Leave a Comment »

No comments yet.

RSS feed for comments on this post. TrackBack URI

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Blog at WordPress.com.

%d bloggers like this: