Data Warehousing and Business Intelligence

28 March 2011

Storing History on 3rd Normal Form (3NF)

Filed under: Data Warehousing — Vincent Rainardi @ 10:50 pm
Tags:

In Dimensional Modelling, one way of storing history of the attributes is using SCD 2 or 3, or 6 (1+2, or 1+3). And to store the history of measures we use snapshotting.

One of the questions that people ask me in 3NF warehousing is: In 3NF how do we store history?

Let’s look at the case of Health Care industry: a surgery with 5 GPs and 100 patients. GP = General Practitioner = doctor. Some patients have been with the surgery for 45 years, some only 1 week. Some patients visited the surgery 20 times a year, some only once every 3 years. On each visit the GP made a record of the patient situation/symptoms (or not), a diagnosis (or not), and a prescription/treatment (or not).

So how do we store the patient historical data, such as previous address and telephone number? A common method is to store it on another table with identical structure, plus a few “system” columns. This table is known as “audit table”, as when it first came up (probably 20-30 years ago) it was used for audit purpose. So, before the record was changed, the whole record was copied to the audit table first. One of those “system columns” is the date when the record was copied. Another “system column” is an FK which links to the main patient table’s PK.

Patient Table:
PatientID Name Address Phone DOB
34 Name1 Address3 Phone2 23/04/1961
Audit Table:
AuditID PatientID Name Address Phone DOB DateCreated
12548 34 Name1 Address1 Phone1 23/04/1961 02/11/1988
16251 34 Name1 Address2 Phone1 23/04/1961 21/04/1999
23611 34 Name1 Address2 Phone2 23/04/1961 07/08/2001

This is usually created using a trigger or CDC software. SQL Server 2008 has CDC built-in.

But, in 3NF we always have multiple entities. And they are related to each other. And they all have historical data.

Imagine we have 3 tables, each linked to each other (FK-PK):

If we add history it becomes messy:

Why is it messy? Because we have to relate the 3 history tables to each other, like this:

Now, imagine if you have 20 tables. It will be indeed like spaghetti. And in reality we have 50 even 100 tables, each one with its history table. Each table is related to 1 to 5 other tables. Messy.

In order to keep things simple, there are 2 things that we usually do:

  1. Only add history table to the tables which really require it, not all of them.
  2. Not having the main table, but put the main table inside the history table.

The 2 points above really simplify things, making the ERD to look something like this:

As I said, not every table are historised. In the above diagram, table 3 is not historised. Only table 1 and table 2 are. Note that now we don’t need to link the history table back to the main table. The history tables themselves are linked, and we have no main tables.

This is how point 2 is done, 1 table which serve as both the main table and the history table:

Patient Table:
AuditID PatientID Name Address Phone DOB DateCreated Active
10839 34 Name1 Address1 Phone1 23/04/1961 17/10/1995 N
12548 34 Name1 Address2 Phone1 23/04/1961 02/11/1988 N
16251 34 Name1 Address2 Phone2 23/04/1961 21/04/1999 N
23611 34 Name1 Address3 Phone2 23/04/1961 07/08/2001 Y

Notice that the Audit ID and the Date Created are shifted 1 row down. To make this table acts like the main table, we use the Active column. Without the Active column we will have to find the max(Date Created) first for each Patient ID, then do a self-join back to the same table. This causes query performance issues.

This is very similar to Kimball’s concept of SCD 2.

QUERYING

And how do you join between these tables to query the data?

There are 2 main query operations:

  1. Current / today’s data.
  2. Past / history data.

Current Data

To get the current data, filter each history table with Active = Y, then join on the main PK (i.e. in the above example, Patient ID).

History Data

If we want to get past data like when it was 31/12/2010, we enter the world of painful join :)

This is how:

  • Step 1: for each table, for each patient get the 31/12/2010 version. In the case of Patient table above:

select PatientID, max(CreatedDate) where CreatedDate <= ’31/12/2010′ group by PatientID

Let the above query be ZZZ.

  • Step 2: Do a self-join to get the 31/12/2010 version for each patient:

select P1.* from Patient P1

inner join (ZZZ) as P2 on P1.PatientID = P2.PatientID and P1.CreatedDate = P2.CreatedDate

  • Step 3: Do the same self-join on each table that we want to query.
  • Step 4: Join all the tables on the PK-FK relationship.

UPDATING DATA

And how do you update the data?

Back to the patient table:

Before:
AuditID PatientID Name Address Phone DOB DateCreated Active
10839 34 Name1 Address1 Phone1 23/04/1961 17/10/1995 N
12548 34 Name1 Address2 Phone1 23/04/1961 02/11/1988 N
16251 34 Name1 Address2 Phone2 23/04/1961 21/04/1999 Y
After:
AuditID PatientID Name Address Phone DOB DateCreated Active
10839 34 Name1 Address1 Phone1 23/04/1961 17/10/1995 N
12548 34 Name1 Address2 Phone1 23/04/1961 02/11/1988 N
16251 34 Name1 Address2 Phone2 23/04/1961 21/04/1999 N
23611 34 Name1 Address3 Phone2 23/04/1961 07/08/2001 Y
  • Step 1: First, before you insert the new row, update the Active flag first to N:

update Patient set Active = ‘N’
where Patient ID = 34 and Active = ‘Y’

  • Step 2: Insert the new row, with Audit ID = identity (incremental sequential number), DateCreated = today and Active = Y.

insert into Patient (PatientID, Name, Address, Phone, DOB, DateCreated, Active)
values (34, ‘Name1′, ‘Address3′, ‘Phone2′, ’23/4/1961′, getdate(), ‘Y’)


About these ads

2 Comments »

  1. This doesn’t seem at all like 3NF. This seems like a hierarchal history of flat files. There’s no way DOB would be in any historical record and still be 3nf. It would be in the immutable table (patient_ID, DOB, Date_Created) and the Patient_Contact would have the Patient PK, and the history of address and phone, each broken out if truly desired. How is this 3NF if you don’t do that?

    Comment by Bob — 18 September 2012 @ 5:38 pm | Reply

    • You are right Bob once we add audit/history it is is not 3NF any more.

      Comment by Vincent Rainardi — 19 September 2012 @ 9:33 pm | Reply


RSS feed for comments on this post. TrackBack URI

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

The Rubric Theme. Blog at WordPress.com.

Follow

Get every new post delivered to your Inbox.

Join 204 other followers

%d bloggers like this: