Data Warehousing, BI and Data Science

20 December 2017

Late Data Warehouse Loading

Filed under: Data Warehousing — Vincent Rainardi @ 8:26 am

If your source systems are running late, how do you ensure that your data warehouse is still loaded on time?

In a retail scenario, it is possible that the customer mastering or product mastering system got delayed. Say they usually finish loading at 2am, but today it’s 1pm and they haven’t run yet. What do you? You have the sales data and other fact data waiting to be loaded since 5am. The answer is: load the fact data using yesterday’s customer data and yesterday’s product data.

In a Kimball DW (dimensional model), the data warehouse designer often put ETL constraints that the fact table must be loaded after all the dimensions are loaded. If some dimensions are late, all the fact tables are not loaded.

There are two things we can do in this case:

  1. Make the dependency fact table specific. A fact table should only be dependent on the dimension that it uses. So fact table A which is only uses dimension 1 to 4, should be loaded even though dimension 5 is late.
  2. The second thing we can do is to have an “emergency ETL button” which enables us to load today’s facts using yesterday’s dimension data. Physically, this is what the button does: in the enterprise scheduler (such as Active Batch, Autosys or Control-M), the button removes the dependency on the fact table loads so that they don’t wait for the dimensions to be loaded.

For #2 above, the data warehouse load is happening as normal, but using whatever dimension data available at that time.

When the product dimension data (or whatever dimension which was late) is available, it automatically trigger the normal warehouse load, which will update the Product SK in those fact tables.

The fact row for a new product would have 0 Product SK. How do we know what product this fact row is for? The answer for this is to use double keys, i.e. we have 2 key columns. One for Surrogate Key (SK) and one for the Business Key (BK). That fact row for a new product would have the BK column populated. So when the real load happens, we can populate the SK column by looking up the BK column in the product dimension.

Let me repeat: to enable us to correct the SK when the real load starts, we need to have “double keys” in the fact table: both the SK column and the BK columns. Thanks and credit to Troy Stevenson for this “double key” idea.

Advertisements

Leave a Comment »

No comments yet.

RSS feed for comments on this post. TrackBack URI

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Blog at WordPress.com.

%d bloggers like this: