Data Warehousing and Data Science

28 January 2012

When is a number not a measure?

Filed under: Data Warehousing — Vincent Rainardi @ 12:13 pm
Tags:

When it is a rank, a probability, or a frequency

a) Rank

In the source system we may find customer ranking, with value = 1,2,3, and so on. This is the customer rank based on their profitability. This rank is not a measure. It is an attribute. So accordingly it should be put in a dimension table, not in a fact table.

b) Probability

A source system may contain a number which is a probability. For example, in credit risk business, we have probability of default (PD). This is not a measure, but an attribute. It is not sumable. And it is not aggregatable. If we have a country with 10 issuers, what’s the PD of that country? You can’t average it. A measure is a measurement from a business event. An attribute is a permanent property of that entity.

c) Frequency

If you use Moody’s Analytics (http://www.moodysanalytics.com/), you will come across EDF (https://www.creditedge.com/). EDF is Expected Default Frequency, i.e. how many times we expect the issuer to default in 1 year. If we expect that an issuer might default in the next 5 years with 0.1% chance, then the EDF is 0.02%. An EDF is arguably an attribute. Though many people think it is a measure. Generally speaking a frequency is an attribute. If a machine is expected to breakdown 4 times a year, this frequency number (4) is a property of that machine, hence it’s an attribute. If on average London has a 30% chance of raining on any day, this 30% is the property of London city. It’s an attribute, not a measure. It does not originate from an event.

As usual I welcome your comments.

Vincent Rainardi, 28/1/2012

Processing Past Data

Filed under: Data Warehousing — Vincent Rainardi @ 10:31 am
Tags:

If your data warehouse is processing files from other systems, and one day you didn’t get a file from a source system, what would you do? Of course you will still run the data warehouse ETL, to process the available data files. The data warehouse ETL (or “the daily batch”) must run every day, so that the business gets the latest data.

How, here’s the question, if that 1 missing file was produced the next day, what would you do with it? Can you process it into your data warehouse?

If the file populates a snapshot fact table, it is straight forward because the new data will just occupy the missing snapshot date:

But for dimensions it’s a bit tricky, especially if it’s type 2.

Before:

After:

As you see above, in addition to inserting a row, we have to update the effective date of the next row. The question is, how do we know which one is the next row? Well, by sorting on the effective date. Yes it’s not simple.

How about type 3 dims?

We need to update the Prev Version, as well as the Current Version Date, i.e.

How about transaction fact table? Quite straight forward, as we only need to insert the missing date:

How about accumulating snapshot fact table? Not straight forward. In addition to updating status 2 columns, we need to update the duration of status 1:

In the above scenario, status 3 happened on 3rd Jan. Then on 4th Jan we receive the data file for 2nd Jan.

Vincent Rainardi, 28/1/2012

Blog at WordPress.com.