In Dimensional Modelling, one way of storing history of the attributes is using SCD 2 or 3, or 6 (1+2, or 1+3). And to store the history of measures we use snapshotting.
One of the questions that people ask me in 3NF warehousing is: In 3NF how do we store history?
Let’s look at the case of Health Care industry: a surgery with 5 GPs and 100 patients. GP = General Practitioner = doctor. Some patients have been with the surgery for 45 years, some only 1 week. Some patients visited the surgery 20 times a year, some only once every 3 years. On each visit the GP made a record of the patient situation/symptoms (or not), a diagnosis (or not), and a prescription/treatment (or not).
So how do we store the patient historical data, such as previous address and telephone number? A common method is to store it on another table with identical structure, plus a few “system” columns. This table is known as “audit table”, as when it first came up (probably 20-30 years ago) it was used for audit purpose. So, before the record was changed, the whole record was copied to the audit table first. One of those “system columns” is the date when the record was copied. Another “system column” is an FK which links to the main patient table’s PK.
This is usually created using a trigger or CDC software. SQL Server 2008 has CDC built-in.
But, in 3NF we always have multiple entities. And they are related to each other. And they all have historical data.
Imagine we have 3 tables, each linked to each other (FK-PK):
If we add history it becomes messy:
Why is it messy? Because we have to relate the 3 history tables to each other, like this:
Now, imagine if you have 20 tables. It will be indeed like spaghetti. And in reality we have 50 even 100 tables, each one with its history table. Each table is related to 1 to 5 other tables. Messy.
In order to keep things simple, there are 2 things that we usually do:
- Only add history table to the tables which really require it, not all of them.
- Not having the main table, but put the main table inside the history table.
The 2 points above really simplify things, making the ERD to look something like this:
As I said, not every table are historised. In the above diagram, table 3 is not historised. Only table 1 and table 2 are. Note that now we don’t need to link the history table back to the main table. The history tables themselves are linked, and we have no main tables.
This is how point 2 is done, 1 table which serve as both the main table and the history table:
Notice that the Audit ID and the Date Created are shifted 1 row down. To make this table acts like the main table, we use the Active column. Without the Active column we will have to find the max(Date Created) first for each Patient ID, then do a self-join back to the same table. This causes query performance issues.
This is very similar to Kimball’s concept of SCD 2.
And how do you join between these tables to query the data?
There are 2 main query operations:
- Current / today’s data.
- Past / history data.
To get the current data, filter each history table with Active = Y, then join on the main PK (i.e. in the above example, Patient ID).
If we want to get past data like when it was 31/12/2010, we enter the world of painful join
This is how:
- Step 1: for each table, for each patient get the 31/12/2010 version. In the case of Patient table above:
select PatientID, max(CreatedDate) where CreatedDate <= ’31/12/2010′ group by PatientID
Let the above query be ZZZ.
- Step 2: Do a self-join to get the 31/12/2010 version for each patient:
select P1.* from Patient P1
inner join (ZZZ) as P2 on P1.PatientID = P2.PatientID and P1.CreatedDate = P2.CreatedDate
- Step 3: Do the same self-join on each table that we want to query.
- Step 4: Join all the tables on the PK-FK relationship.
And how do you update the data?
Back to the patient table:
- Step 1: First, before you insert the new row, update the Active flag first to N:
update Patient set Active = ‘N’
where Patient ID = 34 and Active = ‘Y’
- Step 2: Insert the new row, with Audit ID = identity (incremental sequential number), DateCreated = today and Active = Y.
insert into Patient (PatientID, Name, Address, Phone, DOB, DateCreated, Active)
values (34, ‘Name1’, ‘Address3’, ‘Phone2′, ’23/4/1961’, getdate(), ‘Y’)