Data Warehousing and Business Intelligence

18 September 2011

Slowly Changing Dimension Type 2 – Initialising a New Attribute

Filed under: Data Warehousing — Vincent Rainardi @ 8:26 pm

Intro: you have an existing type 2 dimension, which is already populated, and you would like to add a new type 2 attribute. How do you populate this new column? What value should you populate it with, particularly with regards to the different versions of the same dimension row. Thank you to Jacob for pointing this out.

Note: the date format in this article, and in this web site, is European (dd/mm/yyyy). Apologies to the American folks who use mm/dd/yyyy.

————————————————————————————————————————————————————————-

Before I begin let I would like to thank my colleague Simon Jensen who explained the concept to me.

Situation: we have a type 2 dimension. We added a new column. We want to initialise its value. As always it is easier to learn by example. This is a type 2 dimension:

On 1/8/2011 the customer changed tariff from A9 to B2. Then on 1/9/2011 the customer changed the meter from digital to dual dials.

Now we have a new attribute: Payment Type

Customer1’s payment type is “Quarterly” until 14th August, then Direct Debit from 15th August.

Issue: How we do we fill in the Payment Type column? Because we only have 3 rows and the dates of the rows does not correspond to the 15/8/2011.

Answer: we expire row 2 and insert a new row for 15/8/2011 to 31/8/2011:

Of course for this new row we need to use new surrogate key (4).

Now we can clearly see the 3 changes for Customer1:
a) On 1/8/2011 he changed tariff from A9 to B2
b) On 15/8/2011 he changed payment type from Quarterly to Direct Debit
c) On 1/9/2011 he changed meter type from Digital to Dual Dials.

Now we can extrapolate this logic to all customers, and write an ETL logic to initiate the new attribute (Payment Type) for all rows in the table.

It does not mean that we will always have to create a new row. If Payment Type for a customer never changes, then obviously we don’t need to create a new row in the table. In the above example, if the Payment Type for Customer1 is always Direct Debit, then we just populate the existing 3 rows with “Direct Debit”.

If the dimension has a lot of CDS type 2 attributes (like 50 of them), it could be quite a complex ETL logic to write:

For each customer:

  • If the new attribute doesn’t change value then update existing rows.
  • If the new attribute changes value then a) insert a new row for that date range, b) expire the old row, and c) update the Payment Type for existing rows.

The source table for Payment Type is like this:

To be performant, when writing the code we should not use a cursor to loop around a million customer, but to write a SQL for all customers in one go. For large data sets, “set based” operation is always faster than cursor based. First we split the customers into 2 groups:

a) The customers which the Payment Type changes value
b) The customers which the Payment Type doesn’t change value: it stay the same throughout

Step 1. For b), we perform a global update. Not customer by customer, looping in a cursor, but all rows in the table in 1 go (“update from”).

Step 2. For a), we perform a global insert. First we identify the rows which we need to expire. Before we expire them, we use their expiry date, and insert the new rows. All attributes are the same as the old row, except the SK and the Payment Type.

Step 3. For a), we also perform a global update: we update the value of the Payment Type. When joining the customer dimension table with the Payment Type source table, we join on the NK column, as well as the valid_from/to column.

This is essentially what we are joining: (left: target, right: source). All these 4 columns need indexing of course.

Remember that on the target (left) there are only 3 rows, not 4 rows, because 1 row already has a Payment Type (we inserted it in Step 2). The “date join” works on the basis that the target date range is entirely within the source date range, i.e. the join criteria is:
on target.validfrom >= source.validfrom and target.validto <= source.validto

Vincent Rainardi, 18/9/2011

9 Comments »

  1. I don’t think you’ve got step 2 right. The existing expiry date from the row you are about to expire becomes the expiry date of the new row (you don’t add 1 to it) and then the existing expiry date is set to be 1 day before the start of the new row.

    Also, the expiry date of row SK=4 in the third table should be 31/08/2011

    Comment by Dave — 21 September 2011 @ 3:45 pm | Reply

    • Thank you for your correction Dave. I have corrected the 30/8/2011 to 31/8/2011. Have also removed “add 1 to it” from the expiry date on step 2 near the end of the article.

      Comment by Vincent Rainardi — 21 September 2011 @ 6:54 pm | Reply

  2. Thanks Vincent for the informative post. Got to learn alot from you.

    Comment by Vikram — 16 November 2011 @ 9:57 am | Reply

  3. None of this was straight forward to me at first. i think i get it. You have a dimension that already exists and then you add a column that already has SCD Type 2 on it. So how to you populate the empty cells to reflect what is in the source of the new attribute or column. That is the question. I don’t think that is clear in the intro. The word initialize doesn’t mean a lot to me.

    Also i was tripped up by the dates. I was wondering why you switched from January 8 to using day/month. Eventually i realized what was going on. You are European and was doing the dates the wrong way. I don’t see why you guys do things the wrong way. Just do it the American way. That’s always the right way. Just like one should drive on the right side of the street. There is a right side and a wrong side. Keeps down confusion.

    Thanks for the example. It is quite informative once you figure it out.

    Comment by jacob — 11 September 2013 @ 5:22 am | Reply

    • I think the intro could be elaborated on a little more.

      Comment by jacob — 11 September 2013 @ 5:23 am | Reply

    • Thanks for your feedback Jacob. I have added an intro and explained the date format.

      Comment by Vincent Rainardi — 11 September 2013 @ 6:13 am | Reply

  4. Once we fix the dimension table with new historical records, we have a bigger problem to be resolved for the fact tables. Based on the example given in the article, we will have to update all the fact records created for the customer (NK 101) between 15/8 thru 30/8 to point the SK 4. Until this is not done, none of the fact records will be referencing the new dimension record, SK 4 and may not get the benefit of extra history row captured in the dimension table. ~JC

    Comment by J Cyriac — 12 September 2013 @ 3:22 am | Reply

    • That’s right JC. Good point. Well put.

      Comment by Vincent Rainardi — 12 September 2013 @ 5:18 am | Reply

      • This goo.gl/IoyYQB might give more about what i was trying explain in short. ~JC

        Comment by J Cyriac — 12 September 2013 @ 5:41 am


RSS feed for comments on this post. TrackBack URI

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Create a free website or blog at WordPress.com.

%d bloggers like this: