Data Warehousing and Business Intelligence

9 August 2011

Using Accumulating Snapshot Fact table to Monitor Status

Filed under: Data Warehousing — Vincent Rainardi @ 4:32 am
Tags:

Status is a specific stage in a workflow. This is an example of a workflow:

In the above workflow, we have 6 statuses. The workflow begins with Status 1, and ends with Status 5 or 6. The normal workflow is: Status 1, Status 2, Status 4, Status 6 (1,2,4,6). And we have 3 alternate flows:

  1. 1, 3, 5
  2. 1, 2, 3, 5
  3. 1, 2, 4, 3, 5

An example of a workflow is a mortgage application. A mortgage application needs to go through various stages such as: credit score check, collateral check, application approved, application declined, etc.

When the workflow is in action, we have a date of those various stages:

In the above example, the application got at Status 1 on date1, at Status 2 on date2, at Status 4 on date3 and it got at status 6 on date4.

In the transaction/source system (e.g. mortgage application), the questions are about individual applications, such as:
SS1. What is the current status of Application 1?
SS2. When did Application 2 reach Status 4?

But in the data warehouse, the questions are less about individual applications. The questions are more about the status of many or all applications, such as:
DW1. How many applications are currently on Status 4?
DW2. What is the average duration (in days) between Status 2 and Status 4?

In many data warehouse implementations, Status is created as an attribute on the main dimension, e.g. customer dimension, product dimension, or application dimension. The dimension is set as Slowly Changing Dimension type 2 (SCD2), like this:

This SCD2 structure is more appropriate to answer Transaction/Source System questions such as SS1 and SS2 above. Using this SCD2 structure, it is difficult to answer Data Warehouse questions such as DW1 and DW2 above. Using this structure the query to get “How many on Status 4” is:

select count(*) from DimApp where IsCurrent = 1 and Status = 'Status4'

Output: 2

And the query to answer “What is the average duration (in days) between Status 2 and Status 4?” is

with x as
(select DateDiff(d, EffectiveDate, ExpiryDate) as Days from DimApp where Status = 'Status2')
select avg(Days*1.0) from x

Output: 1.5

The “*1.0” is to convert days column from integer to decimal so we don’t get an integer output.

But, from Status 2 an application doesn’t always go to Status 4. It can go to Status 3 instead! So how do we find out which application goes from Status 2 to Status 4 (App2) and which app goes from Status 2 to Status 3 (App3)? We can’t!

Not able to identify which row is the previous state, we are forced to add “Previous App Key” column:

Now we can join the table to itself using PrevKey and AppKey column. So the query to answer “What is the average duration (in days) between Status 2 and Status 4?” becomes:

select avg(datediff(d, StatusFromDate, StatusToDate)*1.0) from
( select A.AppId, B.Status as StatusFrom, B.EffectiveDate as StatusFromDate,
  A.Status as StatusTo, A.EffectiveDate as StatusToDate
  from DimApp A
  join DimApp B on A.PrevKey = B.AppKey
) X where StatusFrom = 'Status2' and StatusTo = 'Status4'

Output: 2.5

 

Using Accumulating Snapshot Fact Table

To answer question DW1 and DW2 above, it is more appropriate to use Accumulating Snapshot Fact Table, like this:

The “Ind” columns are indicator columns. It shows which status the application is currently on.

For a background on Accumulating Snapshot you can read Kimball Toolkit book chapter 3 page 75 and his article here.

Using this structure the query to get “how many on status 4” becomes:

select sum(S4Ind) from FactApp

Output: 2

Whereas the query to answer “What is the average duration (in days) between Status 2 and Status 4?” is:

select avg(datediff(d, S2Date, S4Date)*1.0) from FactApp
where S2Date is not null and S4Date is not null

As you can see, it is easier to use Accumulating Snapshot Fact Table to answer Data Warehousing questions such as:
DW1. How many applications are currently on Status 4?
DW2. What is the average duration (in days) between Status 2 and Status 4?

Vincent Rainardi, 8th August 2011

5 Comments »

  1. This is nice, but in your workflow you could visit status 3 twice. So the questions I would like to answer are:
    1. Which apps where in Status X on a specific date range.
    2. Which apps where in Status X on a specific date range, and this was the last time they hit this status so far.
    3. Which apps where in Status X on a specific date range, and this was the first time they hit this status.

    I think this is not possible to solve with an Accumulating Snapshot Fact Table, and i have to stick to the status dimension table, right? maybe adding a flag “first” and “last” next to “current” to indicate whether the app passed this status the first or the last time or none of it as I can check for current at the moment.

    Do you have a better approach where I could use the benefits of an Accumulating Snapshot Fact Table?

    Comment by Philipp Kemmeter — 2 October 2014 @ 12:23 pm | Reply

    • Okay, i have to correct: In your workflow you cannot visit status 3 multiple times. But let’s say, you could :)

      Comment by Philipp Kemmeter — 2 October 2014 @ 12:59 pm | Reply

  2. This is nice article on DWH design, Do you have any better design idea to model a process which has following characteristics
    1. Some part of process may be cyclic means my process path could be S1 -> S2 -> S3 -> S2 -> S3 -> S2 -> S3 -> S5

    Its good if you give your insight to model this kind of process in DWH to get answer to DW1 and DW2 in your article.

    Comment by Arun Bhimani — 25 July 2015 @ 5:04 am | Reply

    • Hi Arun, for your case, I still believe that accumulated snapshot fact table structure (ASF) is better for answering questions involving many applications such as DW1 and DW2, compared to SCD type 2 dimension structure.
      For your case, in the ASF table I would allow multiple rows for 1 application, by adding an Interation Number column:
      Row 1: Iteration = 1, S1Date, S2Date and S3Date are populated
      Row 2: Iteration = 2, S2Date and S3Date are populated
      Row 3: Iteration = 3, S2Date, S3Date and S5Date are populated
      So whenever an application “goes back” to a previous state, we create a new row with a new iteration number.

      2 points Arun:

      Point 1: I found that in the end we have to frequently calculate the number of days between a status and the next status. Almost always we have to do this. So later on I developed an idea to it is better to store the number of days between statuses in the ASF table.

      Point 2: Also, I found that if we have 15 possible statuses it is annoying to have 15 indicator columns. It is better to have just 1 column called Current Status and populate it with S1 or S2 etc. reflecting the current status. The question “How many applications are in S4?” is just as easy to be answered between structure A which has 15 indicator column, and structure B which as only 1 Current Status column. But the question of “Give me the breakdown of status of all applications at the moment, e.g. how many in S1, how many in S2, etc?” is tedious to be answered using structure A but very easy to be answered using structure B.

      And usually this second question has a date condition, such as “Give me the breakdown of status of all applications last year, e.g. on 31st Dec 2014”. Now, whether it is “at the moment” (e.g. today), or “last year” (e.g. 31st Dec 2014), we have to examine S1Date, S2Date, S3Date, etc, depending on the Current Status column. That is quite annoying (read: slow). So I began to develop an idea of having a Date of Latest Status column (call it Latest Date) specifically to answer this kind of queries/questions.

      Comment by Vincent Rainardi — 25 July 2015 @ 5:52 am | Reply


RSS feed for comments on this post. TrackBack URI

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

The Rubric Theme. Blog at WordPress.com.

Follow

Get every new post delivered to your Inbox.

Join 271 other followers

%d bloggers like this: