Intro: you have an existing type 2 dimension, which is already populated, and you would like to add a new type 2 attribute. How do you populate this new column? What value should you populate it with, particularly with regards to the different versions of the same dimension row. Thank you to Jacob for pointing this out.
Note: the date format in this article, and in this web site, is European (dd/mm/yyyy). Apologies to the American folks who use mm/dd/yyyy.
Before I begin let I would like to thank my colleague Simon Jensen who explained the concept to me.
Situation: we have a type 2 dimension. We added a new column. We want to initialise its value. As always it is easier to learn by example. This is a type 2 dimension:
On 1/8/2011 the customer changed tariff from A9 to B2. Then on 1/9/2011 the customer changed the meter from digital to dual dials.
Now we have a new attribute: Payment Type
Customer1’s payment type is “Quarterly” until 14th August, then Direct Debit from 15th August.
Issue: How we do we fill in the Payment Type column? Because we only have 3 rows and the dates of the rows does not correspond to the 15/8/2011.
Answer: we expire row 2 and insert a new row for 15/8/2011 to 31/8/2011:
Of course for this new row we need to use new surrogate key (4).
Now we can clearly see the 3 changes for Customer1:
a) On 1/8/2011 he changed tariff from A9 to B2
b) On 15/8/2011 he changed payment type from Quarterly to Direct Debit
c) On 1/9/2011 he changed meter type from Digital to Dual Dials.
Now we can extrapolate this logic to all customers, and write an ETL logic to initiate the new attribute (Payment Type) for all rows in the table.
It does not mean that we will always have to create a new row. If Payment Type for a customer never changes, then obviously we don’t need to create a new row in the table. In the above example, if the Payment Type for Customer1 is always Direct Debit, then we just populate the existing 3 rows with “Direct Debit”.
If the dimension has a lot of CDS type 2 attributes (like 50 of them), it could be quite a complex ETL logic to write:
For each customer:
- If the new attribute doesn’t change value then update existing rows.
- If the new attribute changes value then a) insert a new row for that date range, b) expire the old row, and c) update the Payment Type for existing rows.
The source table for Payment Type is like this:
To be performant, when writing the code we should not use a cursor to loop around a million customer, but to write a SQL for all customers in one go. For large data sets, “set based” operation is always faster than cursor based. First we split the customers into 2 groups:
a) The customers which the Payment Type changes value
b) The customers which the Payment Type doesn’t change value: it stay the same throughout
Step 1. For b), we perform a global update. Not customer by customer, looping in a cursor, but all rows in the table in 1 go (“update from”).
Step 2. For a), we perform a global insert. First we identify the rows which we need to expire. Before we expire them, we use their expiry date, and insert the new rows. All attributes are the same as the old row, except the SK and the Payment Type.
Step 3. For a), we also perform a global update: we update the value of the Payment Type. When joining the customer dimension table with the Payment Type source table, we join on the NK column, as well as the valid_from/to column.
This is essentially what we are joining: (left: target, right: source). All these 4 columns need indexing of course.
Remember that on the target (left) there are only 3 rows, not 4 rows, because 1 row already has a Payment Type (we inserted it in Step 2). The “date join” works on the basis that the target date range is entirely within the source date range, i.e. the join criteria is:
on target.validfrom >= source.validfrom and target.validto <= source.validto
Vincent Rainardi, 18/9/2011