Data Warehousing and Data Science

30 March 2011

Ragged Hierarchy in SSAS

Filed under: Analysis Services — Vincent Rainardi @ 6:08 pm
Tags:

Someone asked me how to do a ragged hierarchy in SSAS. I thought it was “natural” in SSAS, i.e. using a “parent column” in the table then make that column a “parent attribute” in a “parent child” dimension. But then somebody else asked me the same question and I realised that the issue was that by default we would see the parent node as a child node like this:

Parent nodes are displayed as child nodes

To eliminate it we need to set the MembersWithData property to NonLeafDataHidden.

Here’s the complete scenario.

This is what we want to achieve:

This is the source table:

create table Company (CompanyKey int, ParentKey int, CompanyName varchar(50))
insert into Company (CompanyKey, ParentKey, CompanyName) values
(1, NULL, ‘Company1’), (2, 1, ‘Company2’), (3, 1, ‘Company3’), (4, 2, ‘Company4’), (5, 4, ‘Company5’),
(6, 1, ‘Company6’), (7, 6, ‘Company7’), (8, 7, ‘Company8’), (9, 7, ‘Company9’)
select * from Company

We add that table to the DSV (data source view) and create a self-relationship from ParentKey to CompanyKey:

Then we create a dimension:

Then on the Parent Key attribute we set 2 properties: on the Usage to Parent and the MembersWithData to NonLeafDataHidden:

And the result is as expected:

Let’s create a fact table, then see how it looks in Excel.

create table FactInvoice (FactInvoiceKey int, CompanyKey int, InvoiceAmount money)
insert into FactInvoice values (1,2,430), (2,2,120), (3,6,220), (4,6,250)
select * from FactInvoice

FactInvoice

Create a cube:

Cube Design

 

Browse in Excel:

Browse in Excel - Field ListBrowse in Excel - Columns and RowsBrowse In Excel

Company 1 is the parent of both Company 2 and 6.

 

Update A Table Based on Itself

Filed under: SQL Server — Vincent Rainardi @ 5:34 pm
Tags:

In the example below we want to update 29th March data based on 30th March.

Before Update:

After Update:

SQL:

update Table1
set Col1 = B.Col1, Col2 = B.Col2, Col3 = B.Col3
from Table1 A
join
( select ID, Col1, Col2, Col3
  from Table1 where Date = 20110330
) B on A.ID = B.ID and A.Date = 20110329
where A.Date = 20110329

28 March 2011

Storing History on 3rd Normal Form (3NF)

Filed under: Data Warehousing — Vincent Rainardi @ 10:50 pm
Tags:

In Dimensional Modelling, one way of storing history of the attributes is using SCD 2 or 3, or 6 (1+2, or 1+3). And to store the history of measures we use snapshotting.

One of the questions that people ask me in 3NF warehousing is: In 3NF how do we store history?

Let’s look at the case of Health Care industry: a surgery with 5 GPs and 100 patients. GP = General Practitioner = doctor. Some patients have been with the surgery for 45 years, some only 1 week. Some patients visited the surgery 20 times a year, some only once every 3 years. On each visit the GP made a record of the patient situation/symptoms (or not), a diagnosis (or not), and a prescription/treatment (or not).

So how do we store the patient historical data, such as previous address and telephone number? A common method is to store it on another table with identical structure, plus a few “system” columns. This table is known as “audit table”, as when it first came up (probably 20-30 years ago) it was used for audit purpose. So, before the record was changed, the whole record was copied to the audit table first. One of those “system columns” is the date when the record was copied. Another “system column” is an FK which links to the main patient table’s PK.

Patient Table:
PatientID Name Address Phone DOB
34 Name1 Address3 Phone2 23/04/1961
Audit Table:
AuditID PatientID Name Address Phone DOB DateCreated
12548 34 Name1 Address1 Phone1 23/04/1961 02/11/1988
16251 34 Name1 Address2 Phone1 23/04/1961 21/04/1999
23611 34 Name1 Address2 Phone2 23/04/1961 07/08/2001

This is usually created using a trigger or CDC software. SQL Server 2008 has CDC built-in.

But, in 3NF we always have multiple entities. And they are related to each other. And they all have historical data.

Imagine we have 3 tables, each linked to each other (FK-PK):

If we add history it becomes messy:

Why is it messy? Because we have to relate the 3 history tables to each other, like this:

Now, imagine if you have 20 tables. It will be indeed like spaghetti. And in reality we have 50 even 100 tables, each one with its history table. Each table is related to 1 to 5 other tables. Messy.

In order to keep things simple, there are 2 things that we usually do:

  1. Only add history table to the tables which really require it, not all of them.
  2. Not having the main table, but put the main table inside the history table.

The 2 points above really simplify things, making the ERD to look something like this:

As I said, not every table are historised. In the above diagram, table 3 is not historised. Only table 1 and table 2 are. Note that now we don’t need to link the history table back to the main table. The history tables themselves are linked, and we have no main tables.

This is how point 2 is done, 1 table which serve as both the main table and the history table:

Patient Table:
AuditID PatientID Name Address Phone DOB DateCreated Active
10839 34 Name1 Address1 Phone1 23/04/1961 17/10/1995 N
12548 34 Name1 Address2 Phone1 23/04/1961 02/11/1988 N
16251 34 Name1 Address2 Phone2 23/04/1961 21/04/1999 N
23611 34 Name1 Address3 Phone2 23/04/1961 07/08/2001 Y

Notice that the Audit ID and the Date Created are shifted 1 row down. To make this table acts like the main table, we use the Active column. Without the Active column we will have to find the max(Date Created) first for each Patient ID, then do a self-join back to the same table. This causes query performance issues.

This is very similar to Kimball’s concept of SCD 2.

QUERYING

And how do you join between these tables to query the data?

There are 2 main query operations:

  1. Current / today’s data.
  2. Past / history data.

Current Data

To get the current data, filter each history table with Active = Y, then join on the main PK (i.e. in the above example, Patient ID).

History Data

If we want to get past data like when it was 31/12/2010, we enter the world of painful join 🙂

This is how:

  • Step 1: for each table, for each patient get the 31/12/2010 version. In the case of Patient table above:

select PatientID, max(CreatedDate) where CreatedDate <= ’31/12/2010′ group by PatientID

Let the above query be ZZZ.

  • Step 2: Do a self-join to get the 31/12/2010 version for each patient:

select P1.* from Patient P1

inner join (ZZZ) as P2 on P1.PatientID = P2.PatientID and P1.CreatedDate = P2.CreatedDate

  • Step 3: Do the same self-join on each table that we want to query.
  • Step 4: Join all the tables on the PK-FK relationship.

UPDATING DATA

And how do you update the data?

Back to the patient table:

Before:
AuditID PatientID Name Address Phone DOB DateCreated Active
10839 34 Name1 Address1 Phone1 23/04/1961 17/10/1995 N
12548 34 Name1 Address2 Phone1 23/04/1961 02/11/1988 N
16251 34 Name1 Address2 Phone2 23/04/1961 21/04/1999 Y
After:
AuditID PatientID Name Address Phone DOB DateCreated Active
10839 34 Name1 Address1 Phone1 23/04/1961 17/10/1995 N
12548 34 Name1 Address2 Phone1 23/04/1961 02/11/1988 N
16251 34 Name1 Address2 Phone2 23/04/1961 21/04/1999 N
23611 34 Name1 Address3 Phone2 23/04/1961 07/08/2001 Y
  • Step 1: First, before you insert the new row, update the Active flag first to N:

update Patient set Active = ‘N’
where Patient ID = 34 and Active = ‘Y’

  • Step 2: Insert the new row, with Audit ID = identity (incremental sequential number), DateCreated = today and Active = Y.

insert into Patient (PatientID, Name, Address, Phone, DOB, DateCreated, Active)
values (34, ‘Name1’, ‘Address3’, ‘Phone2′, ’23/4/1961’, getdate(), ‘Y’)


22 March 2011

Linking 2 Fact Tables

Filed under: Data Warehousing — Vincent Rainardi @ 7:05 pm
Tags:

It’s a weather forecast. You have “measurement sets” consisting of area, temperature, humidity, wind directions, etc every 5 mins. Each area has different set. For example if A=Area, T=temp, H=humidity, WD=Wind Direction, WS=Wind Speed, the forecast for Area1 is like this.

Input:
13:05 A1, Tmin1, Tmax1, Hmin1, Hmax1, WD1, WS1
13:10 A1, Tmin2, Tmax2, Hmin2, Hmax2, WD2, WS2
13:15 A1, Tmin3, Tmax3, Hmin3, Hmax3, WD3, WS3
13:20 etc.

Then you take a group of measurements in the last N days over certain area (say area 1 to 100) and make a forecast for a certain day in the future. Example: take the last 7 days for area 1 to 100, and forecast for the next 7 days. The forecast is for a specific area (the grain is “day” or “hour”), and the input is from the “close by” areas. The forecast contain 3 variables: temp (min & max), cloudiness, humidity, wind direction, wind speed.

Output (Forecast):
Day1 A1, Tmin1, Tmax1, H1, WD1, WS1
Day2 A2, Tmin2, Tmax2, H2, WD2, WS2
Day3 A3, Tmin3, Tmax3, H3, WD3, WS3
Day4 etc.

So the FactMeasurement (input) is like this:

SetKey, DateKey, TimeKey, AreaKey, MinTemp, MaxTemp, MinHum, MaxHum, WinDir, WinSpd

And the FactForecast (output) is like this:

ForecastSetKey, DateKey, TimeKey, AreaKey, MinTemp, MaxTemp, Hum, WinDir, WinSpd

Now, the question is, where would you record the fact that Forecast Set 38 (which is say 200 rows) is predicted from row 3100 to 3800 in Measurement fact table?

This is where the FactKey comes in. If we have FactKey on both fact tables:

FactMeasurement: FactKey, DateKey, TimeKey, AreaKey, MinTemp, MaxTemp, MinHum, MaxHum, WinDir, WinSpd

FactForecast: FactKey, DateKey, TimeKey, AreaKey, MinTemp, MaxTemp, Hum, WinDir, WinSpd

Then we create a bridge table:

BridgeKey, MeasurementFactKey, ForecastFactKey

In some lucky cases, we don’t need to create a bridge table, because the relationship between the input and the output is 1 to 1, i.e. each output row is determined from 1 input row. In this case we can just include the Input fact key on the output fact table. But this is very rare. The majority of the case is many to many.

16 March 2011

How to Deal with Status

Filed under: Data Warehousing — Vincent Rainardi @ 11:19 pm
Tags:

There are 4 ways we can deal with Status attribute:

  1. Put it in a separate dimension, with a dim key in the fact table.
  2. Put it in a “status fact table”, which is an accumulated snapshot.
  3. Leave it in the dimension table as type 1.
  4. Leave it in the dimension table as type 2.

Once we explained what they are to the business, number 3 is usually rejected outright. The issue is that they need to record the date when the status changes and they want to analyse it. Whether it is insurance or banking, manufacturing or telco, it is usually a “pipeline process”, aka “workflow”. In a pipeline process, an object (document, project, application, etc) goes to different stages. The users often need to know how long the application/document stays at each stage. Either they want to find a “bottleneck”, or inefficiencies, e.g. “why on earth it takes 2 weeks to approve it?”

Apart from the dates they also want to know the indicator/flag, i.e. say you have 5 stages: stage1 then stage2 then it goes to either stage3 or stage4. If it goes to stage3 then finish, but if it goes to stage4 it goes to stage5. For example: Credit Card Application process:

Case 1: The application was received 7/3/11, credit scoring passed 9/3/11 (2 days), application approved 10/3/11 (1 day), credit limit determined 14/3/11 (2 working days), card issued 15/3/11 (1 working day).

Case 2: The application was received 7/3/11, credit scoring failed 9/3/11 (2 days), and application rejected 10/3/11 (1 day).

The queries are about how many days passes (on average) between status 4 (app approved) and status 6 (credit limit determined), etc. How many apps are currently at status 1 (app received but not scored yet)?

The best way to store the dates is to have accumulative fact table, as per option 2 above. As always there are pluses and minuses of each approach. The weakness of number 2 is that the number of statuses is fixed. Leaving it as type 2 (option 4 above) does gives us the history (what status and when) but the query is more horrible / difficult, in comparison to option 2. Putting it into separate dim does resolve the “rapidly changing attribute” issue, but doesn’t give us the management of dates and statuses.

When you are doing accumulative fact table, consider creating indicator/flag columns out of 1000 apps in the process (status = 1,2,3,4,5 not 5 or 7), how many of them are on status 1, how many are on status 2, etc? Also consider creating a column called current status. This is a must have in most situation. Useful to answer queries like, how many of status 6 do we have now? Who are they and what are they break down by … etc.

As usual I welcome any comments and discussion at vrainardi@gmail.com. Vincent 17/3/11.

Measures on a Dimension Table

Filed under: Data Warehousing — Vincent Rainardi @ 9:20 pm
Tags:

Recently I found a dimension table which had both attributes and measures. The reason for putting the measures in there was because they had the same grain as the dimension. I was thinking: was there a good reason why they should separate the measures into their own fact table? Several things that popped into my mind were:

  1. Was any “date” part of the grain? If so there could be an argument to break half of it (the measure columns) into a fact table, as this means there would be 2 dim key columns in the fact table: the main dim key and the date key (not only 1).
  2. There could be an advantage for ETL: we could truncate-reload the measures without affecting the dimension.
  3. And vice versa, we could treat the attribute as, for example, SCD 2 attributes, without affecting the measures.

As usual I welcome comments and discussion at vrainardi@gmail.com. Vincent 16/3/11.

13 March 2011

The Word “Fact” – Terminology Clarification

Filed under: Data Warehousing — Vincent Rainardi @ 10:31 am
Tags:

In dimensional data warehousing, the word “Fact” in means a measure, i.e. a single numeric column.

The word “Fact Table” means a database table or view that contains the Fact or Measure columns.

Above are what Ralph Kimball and Margy Ross used in their Toolkit book (link).

Lately, I heard people use the word “Fact” to refer to a fact table. For example “A fact has measures”. I’m not saying it’s wrong, as people can term it as they like. I’m saying that it is not standard. I put a post here so that the community is not surprised when you heard the word “Fact” used in that way.

Dimensions with Multi Valued Attributes

Filed under: Data Warehousing — Vincent Rainardi @ 1:19 am
Tags:

A multi valued attribute is an attribute which has more than 1 value per dimension row. A “Multi Valued Attribute” is different to A “Multi Valued Dimension”. A “Multi Value Attribute” occurs in a dimension, whereas a “Multi Valued Dimension” occurs in a fact table. A “Multi Valued Dimension” is a dimension with more than 1 value per fact row. As always it is best to explain by example:

DimCustomer

CustomerName|City|PhoneNumber
Anita Haynes|Edinburgh|0131 123456
Bobby Locard|Peterborough|01733 123456
Ken Barrera|Cambridge|01223 111111, 1223 222222

The Phone Number column above is an example of  a multi valued attribute. It can have more than 1 value.

Another example for DimCustomer is interest. In a travel company, one customer can have several interests, which could be captured in an online survey.

Examples of multivalued attributes in the Product dimension are colour and size, e.g.

ProductKey|ProductCode|Colour|Size
1|A|Red|Small
2|B|Blue|Small,Medium
3|C|Green,Yellow|Small

The Colour column above is an example of a multi valued attribute. It can have more than 1 value. Same with the Size column.

An example of a Multi Valued Dimension is given by Ralph Kimball and Margy Ross in the Health Care chapter of their Toolkit book (page 263 chapter 13 on the 2nd edition): Diagnosis dimension on the Billing Line Fact Table. For each fact row there could be more than 1 diagnosis. 99% of the time, when this happens, we remove that dim from the fact table, because the grain does not suit the fact table. However, as Kimball and Ross said, it is sometimes useful to keep them in the fact table.

But this article is about Multi Valued Attribute, not Multi Valued Dimension.

There are several approaches to deal with a dimension with a multi valued attribute.

  1. Lower the grain of the dimension
  2. Put the attribute in another dimension, link direct to the fact table
  3. Use a fact table (bridge table) to link the 2 dimensions
  4. Have several columns in the dim for that attribute
  5. Put the attribute in a snow-flaked sub dimension
  6. Keep in one column using commas or pipes

1. Lower the grain of the dimension

In this approach we change the grain of the dimension, taking into account the multivalued attribute(s). It is quicker to easier by example: in the case of the colour and size above, we change the DimProduct to a lower grain, e.g. the combination ProductCode-Colour-Size as follows:

ProductKey|ProductCode|Colour|Size
1|A|Red|Small
2|B|Blue|Small
3|B|Blue|Medium
4|C|Green|Small
5|C|Yellow|Small

But in many cases we are not able to do this, because the fact table requires that the product dimension is at Product Code grain. For example: we don’t have the sales information per colour and size, but only by product code. You could argue that if the sale is for product B, we could put ProductKey 2 or 3, doesn’t really matter which one. But this gives the wrong impression that we know the colour or size.

2. Move the attribute to another dimension, linked direct to the fact table

An example of this is DimProduct which has a “Manufacturer” attribute. A product is usually made by 1 manufacturer, but sometimes two or three. One solution is to move manufacturer to its own dimension (DimManufacturer) and in the fact table we have ManufacturerKey. This approach is only acceptable if the fact table is relevant to the new dim, for example FactSupplierPerformance. But if the fact table is nothing to do with DimManufacturer (say FactSales) we can’t put ManufacturerKey there. Note that to use this approach we need to know (or assume) how to allocate the measures for each manufacturer, e.g. equal proportion, one takes all, or weighted:

FactSupplierPerformance

Before:

ProductKey|SupplierKey |Measure1
1|1|10
2|2|9

After, if product1 has 2 manufacturers, equal proportion:

ProductKey|ManufacturerKey|SupplierKey |Measure1
1|1|1|5
1|2|1|5
2|2|2|9

“One takes all”: (this is the easiest, but the question is always: which one?)

ProductKey|ManufacturerKey|SupplierKey |Measure1
1|1|1|10
1|2|1|0
2|2|2|9

In most cases, we don’t have the allocation information (as in, if it is 50-50 or 30-70), we only know that product1 has 2 manufacturers.

3. Use a bridge table to link the 2 dimensions

This is the “standard” approach, when we need a multivalued sub dim to be linked to the main dim. For example, in the case of manufacturer above:

FactSupplierPerformance:
ProductKey|SupplierKey |Measure1
1|1|10
2|2|9

DimProduct:
ProductKey|ProductCode
1|A
2|B

FactProductManufacturerBridge:
ProductKey|ManufacturerKey|Weight
1|1|0.5
1|2|0.5
2|1|1
(Weight or percentage column, aka allocation column, is optional)

DimManufacturer:
ManufacturerKey|ManufacturerName
1|Manufacturer1
2|Manufacturer2

Some of the advantages of this approach are:

  • We keep the fact table as it is
  • We keep the main dim as it is
  • That “guessing of the percentage/weight” is limited only to the relationship between product and manufacturer

Whereas the disadvantage, in comparison to approach 1 & 2 is: we need “more join” as there are more tables.

4. Have several columns in the dim for that attribute

If the number of attributes is small and fixed, this is by far the most popular approach to handle Multi Valued Attribute, i.e. in the case of DimCustomer above, have Phone1 and Phone2 columns. I said “if the number of attributes is small and fixed”, i.e. 2 or 3. But if the number of attributes is large (e.g. >10) or if it’s variable (e.g. sometimes 2, sometimes 20), approach 2 and 3 above are more popular, and more appropriate.

5. Put the attribute in a snowflake sub dimension

We can’t really do this, as it is 1 to many (1 row in the main dim corresponds to many rows in the sub dim). So we need a bridge table, which brings us back to approach 3.

6. Keep in 1 column using commas or pipes

e.g. “01223 111111, 1223 222222”. This is a crazy idea, I know, but there have been talks about it in the DW practice. Not only talks, I’ve seen it implemented too! It is more flexible than having several columns (approach 4) and simpler than approach 3 or 2. So if the purpose of the attribute is “display only” or “information only” on a report (rather than analyse or slice & dice), there is an argument for using this approach, particularly if the number of attributes is small (e.g. 1 to 4).

As usual I welcome suggestions and discussion at vrainardi@gmail.com. Thank you for reading. Vincent 13/3/11. Revised for minor corrections 1/4/11.

11 March 2011

When To Snowflake

Filed under: Data Warehousing — Vincent Rainardi @ 7:22 pm
Tags:

One of the most frequently asked questions in dimensional modelling is “when to snowflake”. Everybody talks about “when not to snowflake”, but “when to snowflake” is very rarely discussed. It is a general consensus in the data warehousing world that we must always use star schema for the presentation layer, but as with everything else in the world, there are always two sides of it. Sometimes more than two.

In this article I will not go through (at all) about “star vs snowflake” arguments. There are many web pages and books explaining this. I’m going to jump straight to “when do we snowflake”.

I found that there are a few situations where we should consider snowflake:

  1. When the sub dimension is used by several dimensions
  2. When the sub dimension is used by both the main dimension and the fact table(s)
  3. To make “base dimension” and “detail dimensions”
  4. To enrich a date attribute

I’ll go through one by one. As usual it is easier to explain & learn by example.

When the sub dimension is used by several dimensions

Example: In insurance data warehouse, the City-Country-Region columns which exist in DimBroker, DimPolicy, DimOffice and DimInsured, could be replaced by LocationKey pointing to DimLocation. Some call it DimGeography. This gives us a consistent hierarchy, i.e. relationship between City, Country & Region.

Whilst the advantage of this approach is consistency, the weakness of this approach is that we would lose flexibility. For example, in DimOffice, we could have 3 hubs (EMEA, America and Asia Pacific), not 7 regions used by DimBroker, DimPolicy & DimInsured (North America, South America, Africa, Europe, Middle East, Asia, Australia). It is a common understanding that the relationship between City and Country are more or less fixed, but the grouping of countries might be different between dimensions. If we put City, Country & Region in each of the 4 dimensions, we have the flexibility of each dimension having different hierarchy. In the DimOffice case we above have 2 options: a) put the City, Country, Region in DimOffice, or b) have 2 different attributes in DimLocation: Region and Hub. Some people go for c) create DimOfficeLocation, but I don’t see the point if this dim is only used in DimOffice, might as well unite them with DimOffice as per approach a).

Other examples of this case are: a) DimManufacturer and DimPackaging, used by several product dimensions in a manufacturing data mart, b) DimEmployee in Project data mart, and c) DimCustomer in a CRM data warehouse, d) DimBranch (some call it DimOffice) in Retail Banking warehouse. I’ve heard a few discussion about DimAddress, which like DimLocation is used by DimOffice, DimCustomer, DimSupplier etc (Retail mart), but usually the conclusion was it’s better to put address attributes (street, post code, etc) directly in the main dimension.

Sometimes the designer use hybrid approach, i.e. they put the attributes both in the main dim and in the sub dim. Again this is for flexibility reason, particularly if the sub dim is used directly by several fact tables (see below).

When the sub dimension is used by both the main dimension and the fact table(s)

A classic example of this situation is DimCustomer, when an account can only belong to 1 customer. DimCustomer is used in DimAccount, and is also used by the fact tables. Other examples are: DimManufacturer, DimBranch, DimProductGroup. In the case of DimProductGroup, some fact tables are at product level, but some fact tables are at Product Group level. Hence we need both DimProduct and DimProductGroup. The option here is a) put the product group attribute in both DimProduct and DimProductGroup, or b) snowflake, i.e. DimProduct doesn’t have Product Group attributes; it only has ProductGroupKey.

To Make A Base Dimension and Detail Dimensions

The classic examples of this case are: Insurance classes or LOB (line of business), Retail Banking account types, Attributes for different Product Lines, and in Pharma we have Disease/Medicine categories.

Insurance policies from insurance classes or LOB (e.g. marine, aviation, motor, property) have different attributes. So we pull the common attributes into 1 dimension called DimBasePolicy and the class-specific attributes into DimMarinePolicy, DimMotorPolicy, etc. LOB is a US market term, whereas Class is a Lloyd’s term (London market). Similarly, in retail banking we have DimBaseAccount, DimSavingAccount, DimMortgageAccount, DimCurrentAccount, etc. In investment banking different asset classes have different attributes.

The alternatives of this design are: a) have one generic detail dimension, with 100 attributes from different categories, b) a normalised version with 4 columns or so.

Approach a) would be very wide and sparse because (for example) Marine rows only use Marine attributes, etc. But it is easier to use than the “base-detail” approach. Approach b) would be in-query-able (is that a word?) and difficult to use/join.

To Enrich a Date Attribute

A date attribute (e.g. 2011-03-11) is often analysed by Month, Quarter or Year so if we have MaturityDate, EffectiveDate and CancellationDate in the dimension and the business needs to analyse each of them by Week, Month, Quarter, Year then we would need to have 12 attributes in the dim for this.

By replacing the date attribute with the date key, we could analyse the date by any attributes in the date dim. In this case I’d recommend to use a smart integer date key rather than a surrogate date key, e.g. 20110311 rather than 12345. It is more user-friendly, it would still have meaning if you don’t link to DimDate. The smart date key is a generally accepted exception to the Kimball rule of surrogate key.

This will be one of the topics I’m presenting at SQLBits 8. As usual I welcome any comments and discussion at vrainardi@gmail.com. Vincent 11/3/11

——————————————————————————————————————————————————————————

Reply to Jacob’s 2nd comment, 25 September 2013 @ 4:39 pm (I put it here as I can’t put pictures in the comment section).

Hi Jacob, what I meant by “By putting geographical attributes in the individual dimension, we have more flexibility” is:

Scenario 1: Put geography attributes in the 3 dimensions

Scenario 1 Put geography attributes in the 3 dimensions

Scenario 2: Scenario 2: Put geography attributes in DimLocation

Scenario 2 Put geography attributes in DimLocation

In scenario 1, we have the flexibility of grouping the countries in 3 different ways, i.e. in DimInsured Thailand is in APAC ex Japan region, in DimBroker Thailand is in Asia Pacific region and in DimOffice Thailand is in Asia region.

In scenario 2, we don’t have that flexibility. Thailand is always in Asia Pacific region. This is good for consistency of regional grouping across the company.

If we want to create this flexibility, we can create 3 columns in DimLocation: InsuredRegion, BrokerRegion and OfficeRegion. Let’s call this scenario 3. In scenario 3, Insured, Region and Office can have different regions. The advantage of scenario 3 over scenario 1 is that the relationship between city and country is centralised in one place. Whereas in scenario 1 the relationship between city and country are located in 3 different places.

The disadvantage of scenario 3 is: if the relationship attributes is not fixed like city-country, but can change over time. In this example below, we have Territory, Region and Division. The DimLocation looks like this:

Territory

We have 3 types of Territory (Supply Territory, Sales Territory, Service Territory), 3 types of Region (Supply Region, Sales Region, Service Region) and 2 types of Division (Sales Division, Service Division).

In this scenario, it is better to put the City, Territory, Country, Region and Division in DimSupplier, DimSalesperson and DimService.

A hub is a grouping of region, i.e.

Hub

A SubDim (sub dimension) is a table attached to a main dimension. Please see my SQLBit Brighton presentation about dimensional modelling here, about “1 or 2 dimensions” (page 5 to 8) and “when to snowflake” (page 9 to 11). You are right that DimLocation can be a subdim (attached to DimBroker), or a main dim (attached to a fact table).

Hope this helps Jacob,
Vincent

8 March 2011

The Primary Key of a Fact Table with the Grain the same as a Dimension

Filed under: Data Warehousing — Vincent Rainardi @ 10:14 pm
Tags:

In data warehousing it is common to have a fact table with the grain the same as the grain of a dimension. The dimension is usually the same as the main dimension, i.e. in CRM: customer, in manufacturing: product, in health care: patient, in insurance: policy, and in banking: account. For example: in retail, we have Fact Customer Purchasing Behaviour. The grain of this fact table is: 1 row for each customer. The fact table contains the amount of money that each customer spent on each of the 7 major product classes, in the last 12 months.

Another example is a workflow of mortgage approval process in retail banking. Grain: 1 row for each account. Measures: number of working days on each of the 5 stages in mortgage approval. From the date the customer applied, until the date the agreement is signed: application days, questionnaire days, verification days, approval days, and signing days. Or until the customer gets the money.

These fact tables are usually accumulating snapshot. In maintaining this table, we do more SQL update statement than insert statement. Usually these fact tables also have other dimension keys, i.e. in the last example of retail banking: it’s mortgage officer dimension, branch dimension, and mortgage (type) dimension.

Right, that’s the background, now the core of this article: Primary Key.  We have 2 questions:

  1. In the fact table, should we make the main dimension key (like account key in the mortgage example above) as the Primary Key?
  2. Should we make the account key in the fact table as a Foreign Key to dim account?

As usual, we should not focus on what’s right and what’s wrong. But to: in which case should I make it a PK, and in which case should I not make it a PK.

Should we make the main dimension key as the primary key of the fact table?

Before we start discussing this, I’d like to mention that if you need a background about “Primary Key on the Fact Table”, you can read this article: Primary Key on the Fact Table.

Whether we should make the main dimension key as the PK of the fact table or not, depends on two things:

  1. Is the grain of the fact table really the same as the grain of the main dimension?
  2. Whether we plan to have a surrogate fact key.

The first factor seems obvious, but we really should examine it carefully. I have found several cases where at the first sight the grain seems the same, but with closer examination I found that the grain of the fact table is more detail than the main dimension.

It’s easier to explain with an example. Take the mortgage status fact table example above. The grain of this fact table is (supposedly) 1 row for each mortgage account. We are not supposed to have an account with 2 rows (or more) in this fact table. But, we also have other dim keys in this fact table: branch key, mortgage type key and mortgage officer key. What if it is possible for a mortgage to have more than 1 officer? What if a mortgage was transferred from one branch to another, e.g. in 2008 it was in branch A, but in 2009 it was in branch B? What if a mortgage changed its type, say from “interest only” to “repayment”, or from “fixed rate” to “tracker”? In these cases, for 1 mortgage account we could have 2 rows in the fact table.

In these cases, we have three options:

  1. Maintain the principle that a mortgage can only have 1 row in this fact table, or
  2. Allow for a mortgage to have several rows in this fact table.
  3. Take the offending dimension key out.

If we stick with the principle that a mortgage can only have 1 row in this fact table, then we restrict the other dimension keys. In the above example, the other dimension keys are branch key, mortgage type key and mortgage officer key. We modify them as follows:

  • Mortgage officer key: change to “main mortgage officer key”, which specifies the  main* credit officer dealing with the customer for this mortgage. *see next paragraph.
  • Branch key: change to “current branch key”, which is the latest branch for this mortgage. If a mortgage was moved from branch A to branch B, then this column contains the key to branch B.
  • Mortgage type key: change to “current mortgage type key”, which is the latest mortgage type for this mortgage. If a mortgage was changed from “fixed rate” to “tracker”, then this column contains the key to tracker mortgage type.

I would like to comment further with regard to the word “main” that I mark with * above. To determine which one is current we need to have a business rule. As an example: “if there are two account officers assigned to a particular mortgage, then the main mortgage officer is the officer with the highest rank, at the time the mortgage was approved”. Indeed it is possible for an officer to outrank other officer over time (for example, because of promotion), hence the business rule specifies “at the time the mortgage was approved”. Indeed it is very possible that the 2 officers have the same rank, in this case the rule can further specify for example, “in the case that the officers have the same rank, the main mortgage officer is the first signatory on the approval document”.

I found that, over the years, the main problems with defining business rules are:

  1. There isn’t an existing business practice for it and different part of the business have different view on the proposed rule. And we can’t get them to agree.
  2. The rule is technically “not implementable”. As in, the data that is required to implement that rule is not captured anywhere in the system.

In these situations we need to be practical. If the missing data can be captured that usually the best alternative. If not, we need to create the rule with the data that we have. In the case of disagreeing departments or business units, it is often useful to go higher. At higher level, the business priority is usually more obvious, making it easier to decide.

The other two options are to allow for a mortgage to have several rows in this fact table, or remove the offending dimension key column. Allowing a mortgage to have several rows usually adds more issues with regards to the measures values. The issues are usually either a) how to split the measure across rows, or b) double counting.

Removing the offending dimension key column (and keep the grain of the fact table at mortgage account level) can be a good, practical approach. To get the branch key, mortgage type keys and mortgage officer key we will need to join to another fact table which has these keys. Unless you have an issue with the query performance, this is the “cleanest” approach as you keep those dimension keys where they should belong. And not duplicating them.

Whether we plan to have a surrogate fact key

The second factor affecting the decision about whether we should make the main dimension key as the primary key of the fact table is whether we plan to have a surrogate fact key.

I believe that saying “it’s a standard” is an unwise blanketing approach. To have a policy to create a surrogate fact key on each and every fact table without considering their structure and usage is disadvantageous. True that a surrogate (PK) fact key has its advantages, including better query performance, better indexing (specific to SQL Server platform, not applicable to Oracle), and enable us to uniquely identify a fact row. But there are cases where these advantages don’t occur.

One of those cases is accumulating snapshot fact table, with the grain equals to the main dimension. Consider the mortgage status fact table above, which enables us to monitor closely the number of working days it took for a mortgage to progress from one stage to another. If the grain of this fact table is 1 row for each mortgage account, it make sense to make the mortgage account key as the primary key of the fact table, not by creating a surrogate fact key instead. Creating a surrogate fact key is not necessary because only one dimension key column in this fact table determines the granularity of the fact table. The need for a surrogate fact key arises when there are more than one dimension key columns determine the granularity.

Should we make the main dimension key in the fact table as a Foreign Key to the main dimension

In the above example, this mean: should we make account key in the fact table as a FK to the account dimension.

In my opinion, the answer is yes (identifying), because some accounts may not be in the fact table. Cancelled accounts, for example. This also guarantees that all the account keys in the fact table are valid. In the fact table we should not have 0 (unknown) account key. The warehouse load (ETL) should check this, in addition to the FK constraint.

Accumulated snapshot fact table is usually not too big. Having an FK (only on 1 column) doesn’t have a negative impact on the load performance. It on the other hand, have an advantage on the query performance (when we join the fact to the account dimension). When the fact is being loaded, we do more update than insert, so this FK doesn’t have an impact on the load performance.

Next Page »

Blog at WordPress.com.