Data Warehousing and Data Science

22 February 2012

Adding a New Attribute and Measure

Filed under: Data Warehousing — Vincent Rainardi @ 7:53 am
Tags:

Sometimes we need to add a new attribute or a new measure to our data warehouse. The question is: what do we do about the historical data for that new attribute or measure?

If your dimension is type 2, the new attribute will have a value from the date it goes live (say next month, 21/3/2012). But what about its value before 21/3/2012? We need to create rows for the past periods. Say the dimension originally looks like this:

Suppose the last surrogate key is 20098, and we want the new attribute to go live on 21/3/2012. So we insert the new rows, like this:

Notice that the SK for the new rows starts at 20099, and the effective date of the new rows is 21/3/2012. In the example above we expire the original rows on 20/3/2012 and create the new rows starting from 21/3/2012.

What about the yellow boxes above (the historical values)? What should we set them to? If we know the historical values, we update them. If we don’t, we set them to “Unknown”.

Say that the historical values of the new attribute are as follows:

In this case the dimension will become as follows (I’m only displaying Natural Key A):

But if we don’t know the historical value, we simply set the value of the New Attribute for the past rows to “Unknown”, like this:

If the new attribute is type 1, it is simpler: we just update all rows with the new value, as no history is kept.
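To make both cases concrete, here is a minimal SQL sketch. The table and column names (dim_customer, stg_customer, natural_key, customer_name, new_attribute, effective_date, expiry_date), the identity surrogate key and the 9999-12-31 open-ended expiry convention are assumptions for the example, not a prescribed schema; the new_attribute column is assumed to have already been added to the dimension.

    -- Type 2: expire the current rows on 20/3/2012 ...
    UPDATE dim_customer
    SET    expiry_date = '2012-03-20'
    WHERE  expiry_date = '9999-12-31';

    -- ... and insert new current rows effective 21/3/2012, defaulting the new
    -- attribute to 'Unknown' (the identity surrogate key continues from 20099).
    INSERT INTO dim_customer (natural_key, customer_name, new_attribute, effective_date, expiry_date)
    SELECT natural_key, customer_name, 'Unknown', '2012-03-21', '9999-12-31'
    FROM   dim_customer
    WHERE  expiry_date = '2012-03-20';

    -- Where the history is known, correct the past rows afterwards, e.g. for Natural Key A:
    UPDATE dim_customer
    SET    new_attribute = 'Value 1'        -- the known historical value
    WHERE  natural_key = 'A'
    AND    expiry_date <= '2012-03-20';

    -- Type 1: no history is kept, so simply overwrite every row of each natural
    -- key with the current value from the source/staging table.
    UPDATE d
    SET    d.new_attribute = s.new_attribute
    FROM   dim_customer d
    JOIN   stg_customer s ON s.natural_key = d.natural_key;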

Adding a New Measure

For transaction fact tables, we need to find out the new measure values for each combination of dimensional surrogate keys, i.e.

And it’s not easy. In a transaction fact table, each row is usually a transaction, so we need to find the natural key of the transaction. This is the PK in the source table, for example the transaction ID. If we have the historical data, we load it. If we don’t, we set the historical value to null or zero, depending on the additivity of the measure (see here). In that case the new measure will have values from the date it goes live, but will be zero/null before the go-live date.
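As a sketch of that backfill, assuming the fact table carries the transaction ID as a degenerate dimension and the historical values have been staged; fact_sales, stg_measure_history and the column names are illustrative:

    -- Backfill the new measure by joining on the natural key of the transaction.
    UPDATE f
    SET    f.new_measure = s.new_measure
    FROM   fact_sales f
    JOIN   stg_measure_history s ON s.transaction_id = f.transaction_id;

    -- Rows with no historical value stay at 0 or null, depending on the
    -- additivity of the measure.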

For periodic snapshot fact tables, we need to update the new measure value for every snapshot date in the past, i.e.

The task of updating the past rows could be enormous. A periodic snapshot fact table could start 10 years ago, storing 1 million rows each day. Updating these existing rows could take days. Often, we don’t need to do that, because:

a)    The OLTP (source system) doesn’t usually store snapshotted history that far back. If it does, it is probably only the last few months.

b)    In most cases, one year of historical data satisfies the business; only very rarely does it need to go back more than a year.

The historical measure data in the OLTP may only be a monthly summary, which we then need to apply to the whole month. For example, today is 22/2/2012 and from next month (22/3/2012) we will have a new measure. The OLTP has historical data for 31st Jan 2011, 28th Feb 2011, 31st March 2011, …, 31st Dec 2011, 31st Jan 2012. So they only have the historical value for this measure on the last day of each month, like this:

In this case we will need to apply the values for 31st Jan 2011 to the whole month of January, i.e. snapshot dates 1/1/2011, 2/1/2011, 3/1/2011, etc.
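A sketch of that step, assuming a daily snapshot table fact_account_snapshot keyed on account_key and snapshot_date, and a staging table stg_month_end holding the month-end values (all names are illustrative):

    -- Apply each month-end value to every snapshot date in that month.
    UPDATE f
    SET    f.new_measure = m.new_measure
    FROM   fact_account_snapshot f
    JOIN   stg_month_end m
      ON   m.account_key = f.account_key
     AND   YEAR(m.month_end_date)  = YEAR(f.snapshot_date)
     AND   MONTH(m.month_end_date) = MONTH(f.snapshot_date);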

If we don’t have the historical data, then we’ll leave the new measure in the past rows as zero or null.

Hope this helps. As always I welcome questions and comments at vrainardi@gmail.com

Vincent Rainardi, 23/2/2012

15 February 2012

Effective and Expiry Dates in Type 2 Dimension

Filed under: Data Warehousing — Vincent Rainardi @ 10:32 pm
Tags:

This is a short article about the time element in the effective/expiry date column in an SCD type 2 dimension. The format used in this article (and all my articles in this blog) is DD/MM/YYYY. Thanks to Jim Kennedy for his feedback about the date format. Apologies for the confusion.

In many data warehouses, I still see the effective date and expiry date columns in the type 2 dimension set to date only, i.e. 12/02/2012. This is incorrect. We should have the time element as well, i.e. 12/02/2012 08:56:17. Why? Because if an attribute is updated 3 times in the same day, we will be in trouble.

Here’s an example:

The customer type is updated 3 times today. In the beginning it was “UK Individual”. At 10 o’clock it changed to “International”, at 1 o’clock to “Business”, and finally at 3 o’clock to “Group”. We’ll then have something like this:

… resulting in a lot of confusion when the fact table asks “give me the customer type for 13/2/2012”, because 4 rows are returned. All 4 customer types above come back when you join the fact table with “…where effective_date <= 13/02/2012 and expiry_date >= 13/02/2012”.

If the effective and expiry dates have time elements in them, it would look like this:

Now when the fact table calls “give me the customer type for 13/2/2012”, the answer is UK Individual. Why? Because 13/2/2012 implies that it is 13/2/2012 00:00:00. Hence it will return only 1 row: the top row above.

But is that what you want? Usually not. Usually we want the last value entered that day, which in the above case is Group. How do we do that? Simple: we add 23:59:59, like this: “… where effective_date <= 13/02/2012 23:59:59 and expiry_date >= 13/02/2012 23:59:59”.
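As a full example (dim_customer, customer_type and natural_key are assumed names, and the customer key value is hypothetical; the date literals are written as YYYY-MM-DD here only to keep the SQL unambiguous):

    -- Returns 'Group', the last customer type entered on 13/2/2012,
    -- because the comparison point is the end of that day.
    SELECT customer_type
    FROM   dim_customer
    WHERE  natural_key    = 'C123'
    AND    effective_date <= '2012-02-13 23:59:59'
    AND    expiry_date    >= '2012-02-13 23:59:59';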

As always I welcome suggestions and questions at vrainardi@gmail.com

Vincent Rainardi 15/2/2012


City and Rank

Filed under: Data Warehousing — Vincent Rainardi @ 10:20 pm
Tags:

City is an attribute of the customer dimension. Why don’t we put city in its own dimension, with separate SK on the fact table? Because we need to analyse city in the customer context.

City is also an attribute of the supplier dimension. Why don’t we put city in its own dimension, with separate SK on the fact table? Because we need to analyse city in the supplier context.

Why don’t we snowflake and put city in its own table, linked to both the customer and supplier dimension? Because if we snowflake, the structure is more complex; we need to do more joins. And because we need to analyse city in two separate contexts: customer and supplier.

It is best to leave city on both the customer dim and the supplier dim (and the broker dim and the store dim, where applicable). City will form part of a different geography hierarchy on each dimension. On the customer dim it could be city-county-country-area-region; on the supplier dim it could be city-country-region.

The organisation of the data in the geography hierarchy on each dimension could also be different. The store dimension may have smaller areas than the customer dim. On the customer dim Luxembourg may belong to the Western Europe area, but on the store dim it may belong to the Benelux area.

This design gives flexibility, which is often required by large corporations. Contrary to the popular belief that corporations need a single golden standard (à la MDM), I have found that many corporations require more flexibility than a single standard allows. This is because corporations are essentially groups of companies, and each part of the organisation has its own way of doing business.

The case of industry sector is similar. It is a hierarchy which groups the customers, suppliers and partner organisations: telecom, retail, healthcare, government, manufacturing, transport, technology, etc. This is particularly important in B2B business, and especially for financial institutions. In the UK and US we have SIC, the Standard Industrial Classification. For example, 200 different SICs might be grouped into 80 subclasses, these subclasses into 40 classes, and the classes into 6 hyperclasses.
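As an illustration, such a classification can simply be carried as a set of attributes on each of those dimensions; the column names below are assumptions for the example, not a prescribed structure.

    -- Industry classification attributes on the customer dimension (illustrative).
    ALTER TABLE dim_customer ADD
      sic_code        varchar(10),   -- one of the ~200 SIC codes
      sic_subclass    varchar(50),   -- one of the ~80 subclasses
      sic_class       varchar(50),   -- one of the ~40 classes
      sic_hyperclass  varchar(50);   -- one of the ~6 hyperclasses

The same columns would be repeated on the supplier and partner dimensions, just as city is.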

Customer Rank

Customer rank is an attribute of a customer. We order all active customers based on their orders in the last 12 months, so the customer with rank = 700 is the 700th biggest purchaser in the company. The “valuation” is done on a monthly basis, so a customer who is #700 on 1st January could be #800 on 1st Feb. Should we put rank in the customer dimension as a normal attribute?

I believe the answer is no. We should put customer rank in its own fact table. Why? Because it changes every month, so it makes sense to create a fact table with 2 SKs: month key and customer key. The grain is 1 row per customer per month. The measures we can put in this fact table are: rank, order value (last 12 months), order value (last 6 months), number of orders (last 12 months), number of products purchased (last 12 months), number of distinct products purchased (last 12 months), average order value (last 12 months), etc.
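A sketch of what this monthly customer fact table could look like (the names and data types are illustrative):

    -- One row per customer per month (grain: month_key, customer_key).
    CREATE TABLE fact_customer_monthly
    ( month_key               int           NOT NULL,  -- e.g. 201202
      customer_key            int           NOT NULL,
      customer_rank           int           NULL,      -- 1 = biggest purchaser over the last 12 months
      order_value_12m         decimal(18,2) NULL,
      order_value_6m          decimal(18,2) NULL,
      order_count_12m         int           NULL,
      products_purchased_12m  int           NULL,
      distinct_products_12m   int           NULL,
      avg_order_value_12m     decimal(18,2) NULL,
      CONSTRAINT pk_fact_customer_monthly PRIMARY KEY (month_key, customer_key)
    );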

Unlike the other measures I just mentioned, rank is not really a measure, I know; it is an attribute. But because it changes every month, we put it in a separate fact table. What if we calculate the customer rank once a quarter rather than once a month? We still put rank in a separate fact table, not in the customer dimension. What if we calculate it once a year? Still a separate fact table. And if we calculate the customer rank every day? Then we definitely need to put it in a daily snapshot fact table.

The above monthly customer fact table could store a lot of other useful measures and attributes. Basically any monthly evaluation or calculation that concerns a customer can be stored in this fact table; various customer ratios, for example.

So you see, even though rank is (by entity definition, dependency, etc) an attribute/property of a customer, during dimensional modelling we could choose not to put it in the customer dimension.

As always I welcome suggestions and questions at vrainardi@gmail.com

Vincent Rainardi, 13/2/2012

8 February 2012

Parent SK in the Fact Table

Filed under: Data Warehousing — Vincent Rainardi @ 7:26 am
Tags:

This article is about the customer dimension. In B2B (business to business), each customer may have a parent company, and the parent in turn may have another parent. I encountered this for the first time when I was working for a utility company. We tracked up to 3 levels up and put the grandparent SK in the fact table. This link was used widely throughout the BI reports.

In some industries, like asset management and law, it is called the client dimension. In other industries it is called the party dimension (insurance) or the counterparty dimension (banking). In this article I’m going to call it the customer dimension, a term everybody is familiar with.

The idea is simple. We put the Parent SK in the fact table. This saves us from doing self-join on the customer dimension. Suppose we have a customer dimension like this:

And a fact table like this:

TLP means Top Level Parent. In practice, this is what users are mostly interested in: they want to analyse the measures by TLP and are less interested in parents at intermediate levels. They may organise the sales structure around the TLP, or calculate bonuses based on it. When I was designing the warehouse for a utility company, TLP was used a lot; a lot of BI analysis was done on it and many measures were analysed against it. When I worked in the investment banking industry, they also used TLP a lot. I believe any B2B organisation (not only companies but also government) has a common interest in TLP.
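To make the idea concrete, here is a minimal sketch; the table and column names are assumptions for the example, not an actual schema.

    -- The customer dimension holds the immediate parent as a self-reference;
    -- the fact table carries the Top Level Parent key directly, so analysing
    -- by TLP needs no recursive self-join on dim_customer.
    CREATE TABLE dim_customer
    ( customer_key   int           NOT NULL PRIMARY KEY,
      customer_name  varchar(100)  NOT NULL,
      parent_key     int           NULL       -- immediate parent company
    );

    CREATE TABLE fact_sales
    ( date_key      int           NOT NULL,
      customer_key  int           NOT NULL,   -- the transacting (leaf level) customer
      tlp_key       int           NOT NULL,   -- its Top Level Parent
      sales_amount  decimal(18,2) NOT NULL
    );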

There are 2 design patterns: we can put the TLP information in a separate dimension, or we can put it in the customer dimension. I prefer the first one. The reasons:

a)    Flexibility. There are attributes which are specific to TLPs and not applicable to leaf-level customers (from here on I will call them LLCs). For example, the number of companies is only applicable to TLPs, not to LLCs.

b)    Performance. The join to the TLP dim gives faster query response times: if the query only requires TLP information, we do not need to touch the LLC dimension. The LLC dim could be big whereas the TLP dim is a lot smaller; 10 times smaller is common. See the query sketch below.
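For example, a report grouped by TLP only needs the fact table and the small TLP dimension (dim_tlp, tlp_name and fact_sales are assumed names):

    -- Sales by Top Level Parent: the big LLC dimension is never touched.
    SELECT   t.tlp_name,
             SUM(f.sales_amount) AS total_sales
    FROM     fact_sales f
    JOIN     dim_tlp    t ON t.tlp_key = f.tlp_key
    GROUP BY t.tlp_name;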

We always have to look at both sides of the coin. The disadvantage of this approach is that we need to add a TLP_Key to many fact tables, in addition to the customer key. I would say this is a small price to pay for the benefit we gain.

As always I welcome comments and questions at vrainardi@gmail.com

Vincent Rainardi, 8/2/2012
