Sometime we need to add a new attribute or a new measure into our data warehouse. The question here is, how about the historical data for that new attribute/measure?
If your dimension is type 2, then the new attribute will have a value starting from when it goes live (say next month, 21/3/2012). But how about its value before 21/3/2012? We need to create the rows for the past periods, i.e. say that the originally the dimension is like this:
And suppose that the last surrogate key si 20098. And we want the new attribute to go live on 21/3/2012. So we insert the new rows, like this:
Notice that the SK for the new rows starts with 20099, and the date for the new rows starts with 21/3/2012. In the example above we expire the original rows on 20/3/2012 and create the new rows starting from 21/3/2012.
What about the yellow boxes above (historical values)? What should we set them to? If we know the historical value, then we update them. If we don’t know the historical value, we set it to “Unknown”.
Say that the historical values of the new attribute are as follows:
In this case the dimension will become as follows (I’m only displaying Natural Key A):
But if we don’t know the historical value, we simply set value of the New Attribute for the past row(s) to “Unknown”, like this:
If the new attribute is type 1, then it’s simpler. We just need to update all rows with the new value (no history), i.e.
Adding a New Measure
For transaction fact tables, we need to find out the new measure values for each combination of dimensional surrogate keys, i.e.
And it’s not easy. In a transaction fact table, usually each row is a transaction. We need to find out the natural key of the transaction. This is the PK in the source table, for example: transaction ID. If we have historical data then we upload it. But if we don’t, we set the historical value to null or zero, depending on the additivity of the measure (see here). If we don’t have the historical data, the new measure will have new values starting from the date it goes live, but it will be zero/null before the go live date.
For periodic snapshot fact tables, we need to update the new measure value for every snapshot date in the past, i.e.
The task of updating past rows could be enormous. A period snapshot fact table could start 10 years ago, storing 1 million rows each day. Updating these existing rows could take days. Often, we don’t need to do that, because:
a) The OLTP (source system) doesn’t usually store snapshoted history that long. If they do, it would probably be the last few months.
b) In most cases, a year back of historical data would satisfy the business. Only very few cases went back more than 1 year.
The historical measure data in the OLTP may only be monthly summary, of which we need to apply to the whole month. For example, today is 22/2/2012 and from next month (22/3/2012) we will have a new measure. The OLTP has historical data for 31st Jan 2011, 28th Feb 2011, 1st March 2011, …. , 31st Dec 2011, 1st Jan 2012. So only the last day of every month they have the historical value for this measure, like this:
In this case we will need to apply the values for 31st Jan 2011 to the whole month of January, i.e. snapshot dates 1/1/2011, 2/1/2011, 3/1/2011, etc.
If we don’t have the historical data, then we’ll leave the new measure in the past rows as zero or null.
Hope this helps. As always I welcome questions and comments at vrainardi@gmail.com
Vincent Rainardi, 23/2/2012