Data Warehousing, BI and Data Science

13 December 2017

Loading Late Arriving Dimension Rows

Filed under: Data Warehousing — Vincent Rainardi @ 8:19 am

One very common thing in "Kimball warehousing" (i.e. the dimensional model) is the late arriving dimension. And yet it is still not obvious to many people what to do when the expected dimension row finally arrives, particularly in the case of a transactional fact table. In this article I will try to explain what a late arriving dimension row is, and what to do when it comes.

Late Arriving Dimension Row

It is probably easier if I explain using an example. Suppose we have an internet shop. On Monday 11th Dec 2017 at 19:40 a new customer (customer A) registered on the shop, then purchased a product. Both the sales record (order 213) and the customer record were stored in the sales system. On Tuesday at 2am the warehouse load started, reading the sales data from the sales system, but reading the customer data from the customer master. Customer A was not sent from the sales system to the customer master until Tuesday at 8pm (because the customer master is loaded daily at 8pm). So when the data warehouse read the customer master at 2am on Tuesday, it did not get the data for customer A. The following day, on Wed 13th Dec at 2am, when the warehouse load started, it received the customer A data from the customer master.

Periodic Snapshot Fact Table

There is an issue here: at 19:40 on Monday the sales system should have pushed the customer A record to the customer master. But it didn't, because the customer master only loads from its source systems (one of them is the sales system) once a day, at 8pm. Because of this, when the sales fact table was loaded in the warehouse at 2am on Tuesday, there was no row for customer A in Dim Customer, and therefore the Customer SK column of the fact row for Sales Order 213 was 0. Let me repeat: the customer surrogate key column of the order 213 fact row for business date Mon 11th Dec is 0, meaning "unknown customer". See the yellow cell below.

That's what happened in the data warehouse load on Tuesday at 2am. What happened in the Wed 2am load? At that time the warehouse was populating the sales fact table for the 12th Dec business date (snapshot date), and this time the Customer SK column for order 213 was populated with the surrogate key of Customer A, see the green cell below:

1. Periodic Snapshot Fact Table
2. Customer Dimension

We can see above that Order 213 started to exist on 11th Dec (it didn’t exist on 10th Dec). The customer SK for Sales Order 213 was 0 on 11th Dec and 74 on 12th Dec.

Note: because there is a fact row on 11th Dec for Order 213, the total amount for that date is correct (£81), but the breakdown by City is incorrect: on 11th Dec Manchester has £0 of sales. Only on 12th Dec does Manchester have the correct amount of sales, i.e. £45.
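To make the surrogate key lookup concrete, here is a minimal Python sketch of what the snapshot load does with a missing dimension row. The table and column names (customer_dim, customer_id, customer_sk) and business keys like CUST-A are my own illustrations, not from any real system; the point is simply that a lookup miss resolves to the "unknown customer" surrogate key 0.

```python
# Minimal sketch of the surrogate key lookup during the snapshot fact load.
# All table, column and key names are illustrative only.

# Customer dimension as it stood at Tue 2am: customer A is not there yet.
customer_dim = {
    # customer business key -> surrogate key
    "CUST-B": 71,
    "CUST-C": 72,
}

UNKNOWN_SK = 0  # the "unknown customer" row in the dimension

def lookup_customer_sk(customer_id):
    """Resolve a business key to a surrogate key; default to 0 if not found."""
    return customer_dim.get(customer_id, UNKNOWN_SK)

# Sales rows read from the sales system for business date 11th Dec.
sales_rows = [
    {"order_id": 213, "customer_id": "CUST-A", "amount": 45.00},
    {"order_id": 214, "customer_id": "CUST-B", "amount": 36.00},
]

# Build the snapshot fact rows for 11th Dec.
fact_rows = [
    {
        "snapshot_date": "2017-12-11",
        "order_id": r["order_id"],
        "customer_sk": lookup_customer_sk(r["customer_id"]),  # 0 for customer A
        "amount": r["amount"],
    }
    for r in sales_rows
]

print(fact_rows)
# Order 213 gets customer_sk = 0 on 11th Dec; in the 12th Dec snapshot the
# same lookup returns customer A's new surrogate key (74 in the example).
```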

Transaction Fact Table

That was a Periodic Snapshot Fact Table. How about a Transaction Fact Table?

It is like this:
3. Transaction Fact Table

Order 213 started to exist in the transaction fact table on 11th Dec with Customer SK = 0. On 12th Dec the Customer SK was updated to 74.

How do we update this Customer SK from 0 to 74 on 12th Dec? More specifically, how do we know which fact row to update on each day?

The answer is: by using the staging table. As part of the customer dimension loading process, we update each transaction fact table for every new customer. In the above example, when we loaded sales order 213 on 11th Dec, we stored this row in the sales staging table:

4. Sales Staging Table

On 11th Dec the customer SK in this staging table was 0. On 12th Dec, when the data warehouse load ran, it checked this sales staging table for rows with Customer SK = 0 and found that Order 213 had Customer SK = 0. It then looked in the Sales Order fact table, found the row for order 213, and updated the Customer SK column on that fact row to 74. It then updated the customer SK column on the staging table to 74 as well.

It is often useful to put an "incomplete" indicator column (Y/N) on this staging table, so that when looking for rows with any SK column = 0 we first filter the rows by Incomplete = Y, which gives us far fewer rows to scan.
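Putting the last two paragraphs together, here is a rough Python sketch of that daily correction step, assuming an illustrative sales staging table with a hypothetical Incomplete flag. In a real warehouse this would normally be a set-based UPDATE in SQL or in the ETL tool; the sketch only shows the logic.

```python
# Sketch of the daily correction step on the transaction fact table,
# driven by the sales staging table. All names are illustrative.

# Sales staging row written when order 213 was first loaded (11th Dec).
sales_staging = [
    {"order_id": 213, "customer_id": "CUST-A", "customer_sk": 0, "incomplete": "Y"},
]

# Transaction fact table as it stands after the 11th Dec load.
transaction_fact = [
    {"order_id": 213, "customer_sk": 0, "amount": 45.00},
]

# Customer dimension after the 12th Dec load: customer A now exists with SK 74.
customer_dim = {"CUST-A": 74}

def correct_late_arriving_customers(sales_staging, transaction_fact, customer_dim):
    """Resolve staging rows that were originally loaded with customer_sk = 0."""
    # Filter on the incomplete flag first, so we scan far fewer rows.
    for stg in (r for r in sales_staging if r["incomplete"] == "Y"):
        new_sk = customer_dim.get(stg["customer_id"])
        if new_sk is None:
            continue  # dimension row still hasn't arrived; try again tomorrow
        # Update the fact row(s) for this order.
        for fact in transaction_fact:
            if fact["order_id"] == stg["order_id"]:
                fact["customer_sk"] = new_sk
        # Update the staging row as well and mark it complete.
        stg["customer_sk"] = new_sk
        stg["incomplete"] = "N"

correct_late_arriving_customers(sales_staging, transaction_fact, customer_dim)
print(transaction_fact)  # order 213 now has customer_sk = 74
```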

So let me repeat, as part of the dimension loading process, we update each Transaction fact table for every new customer. This is the customer staging table:

5. Customer Staging Table

We can see above that Customer A started to exist in the data warehouse on 12th Dec: on that day, Customer A was inserted as a new row into this customer staging table. As part of inserting the new customer, we corrected the Customer SK in every transaction fact table for that customer, as sketched below.
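Seen from the dimension-load side, the same correction might look like this: it is driven by the customers inserted today rather than by scanning the whole sales staging table for SK = 0. Again, every table and column name here is an assumption made for illustration.

```python
# Sketch: the same correction driven from the customer dimension load.
# For every new customer inserted today, fix the SK in each transaction
# fact table. All table and column names are illustrative.

# Rows inserted into the customer staging table on 12th Dec.
new_customers = [{"customer_id": "CUST-A", "customer_sk": 74}]

# Sales staging rows still flagged as incomplete from earlier loads.
sales_staging = [
    {"order_id": 213, "customer_id": "CUST-A", "customer_sk": 0, "incomplete": "Y"},
]

# The transaction fact tables that reference the customer dimension.
sales_order_fact = [{"order_id": 213, "customer_sk": 0, "amount": 45.00}]
fact_tables = [sales_order_fact]

for cust in new_customers:
    # Staging rows that were loaded before this customer existed in the dimension.
    pending = [
        r for r in sales_staging
        if r["incomplete"] == "Y" and r["customer_id"] == cust["customer_id"]
    ]
    for stg in pending:
        for fact_table in fact_tables:
            for fact in fact_table:
                if fact["order_id"] == stg["order_id"]:
                    fact["customer_sk"] = cust["customer_sk"]
        stg["customer_sk"] = cust["customer_sk"]
        stg["incomplete"] = "N"

print(sales_order_fact)  # order 213 now carries customer_sk = 74
```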

Note:

  1. If the Customer Dimension is SCD type 2, then we need to update the customer SK in every transaction fact table not only for new customers, but for updated customers too. Every time a customer is updated, the old row is expired and a new row is created. Right after that, the customer SK in all corresponding fact rows needs to be updated. For this reason it is not practical to have a type 2 dimension for a transaction fact table.
    You could argue that there's no need to update the customer SK (for a type 2 customer dimension) because that was the condition of the customer when the sale occurred three years ago. Yes, but in reality most businesses need to see the latest customer data (as of today), not the three-year-old version. Some solutions create two customer SK columns in the transaction fact table: one as it was when the sale occurred, and one as it is today.
  2. The most important feature of an MDM (Master Data Management) system is to create new customers and products the moment they are created in any source system. If this timing requirement can't be met, there is no point spending any money on the MDM system. Even if the MDM integrates customer/product data from 10 different systems, if the creation of a new customer/product is one day late the MDM is practically unusable. For an MDM to be real time like this, there are two techniques: a) use a push mechanism, not a pull mechanism, and b) create a "skeleton" customer/product, i.e. the row for a new customer contains only the customer IDs and the other columns are left blank. This "skeleton row" is then updated later on, so the row is built up gradually over time, not in one go (see the sketch below). If we don't do this we can't meet the fundamental criterion of an MDM system: the creation of a new customer/product must be real time. The updating does not have to be real time (it can be a day late), but the creation must be.
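To illustrate the "skeleton row" idea in point 2, here is a hedged Python sketch: the source system pushes just the customer ID in real time, the MDM creates a row containing only the IDs, and the remaining attributes are filled in later when the full record arrives. The class, function and field names are purely illustrative, not from any particular MDM product.

```python
# Illustrative sketch of "skeleton" customer rows in an MDM-style store.
# A new customer is created in real time with only its IDs; the remaining
# attributes are filled in later. All names are assumptions for illustration.

from dataclasses import dataclass
from typing import Optional

@dataclass
class CustomerMasterRow:
    source_system: str
    source_customer_id: str          # ID pushed by the source system
    name: Optional[str] = None       # filled in later
    city: Optional[str] = None       # filled in later
    email: Optional[str] = None      # filled in later

customer_master = {}  # key -> CustomerMasterRow

def on_customer_created(source_system, source_customer_id):
    """Real-time push: create a skeleton row the moment the customer exists."""
    key = f"{source_system}:{source_customer_id}"
    customer_master.setdefault(
        key, CustomerMasterRow(source_system, source_customer_id)
    )

def on_customer_details(source_system, source_customer_id, **attrs):
    """Later (possibly a day later): enrich the skeleton row with full attributes."""
    key = f"{source_system}:{source_customer_id}"
    row = customer_master[key]
    for name, value in attrs.items():
        setattr(row, name, value)

# Monday 19:40 - the sales system pushes the new customer immediately.
on_customer_created("sales", "CUST-A")
# Tuesday 20:00 - the full record arrives and the row is built up gradually.
on_customer_details("sales", "CUST-A", name="Customer A", city="Manchester")
print(customer_master)
```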