Real Time Data Warehouse
The idea of a real time data warehouse has been around for a long time. I heard about it in 2006. It means that when the source system changes, the warehouse changes immediately.
To achieve this, with regards to the data flow, there are 2 things that need to happen:
- The changes from the source system must be detected immediately and sent to the warehouse.
- The dimension tables in the warehouse must be updated immediately, followed by the fact tables.
The trick for point 1 is: if the source systems is large or many, we need to choose which area we’d like to monitor.
To support point 2, the fact table needs to be transactional. If it is a periodic snapshot, we need to partition it so that today’s data is in its own partition*. This is so that we can update the warehouse within a few seconds.
*if your database is not Enterprise edition, put in a “Today” table joined to “History” table using a view for reading).
If we have a large dimension (more than 1m rows), remove 1-2 rapidly changing attributes put it into the fact table. If it is still large, split it into 2 or 3 dimensions, joined independently to the fact tables. A large dim with 5m rows can be split into 2 dims with 10,000-50,000 rows each, depending on the permutation of the data in each column.
Use in memory database if possible, i.e. put the whole data warehouse in memory. At least the dimensions and the real time portion of the fact tables should be put in memory.
To detect the change in the source system, the best one is to read the log files. The alternative is to use triggers. SQL Server has 2 mechanisms: CDC and Change Tracking. CDC reads the transaction log files, whereas Change Tracking uses triggers.
Log reading mechanism is better because it is asynchronous, so does not disturb the writing in any way. The source system (OLTP usually) can write at normal speed without anything blocking or delaying the process. Triggers on the other hand, delay the writing because it needs to fire the strigger first before the writing process is fully completed.
3 TYPES OF DATA CHANGES
If it is a new dimension row or a new fact row, then we just insert the row and it’s done in a few ms, no problem. The problem is if we need to update existing rows. That’s what I mean with “Changes”. There are 3 types of changes:
- A dimension row changes
- A fact row changes (transaction fact table)
- A fact row changes (periodic snapshot)
#1. For type 1, update the row, for type 2: expire the row as usual then create a new row. This new row is not used by any fact table. The expired row is used by the fact tables, but there is nothing to change in the fact table.
#2 . This is updating existing row in a fact table, which can be a measure column or a dimension key column. The measure column change: we update the measure column (we can find the fact row because none of the dim key column changes). The dim key column change: we look up the new key in the dim table then update the fact table. It is necessary that we get the old value of the natural key of the dim column (i.e. before the change). Sometimes we don’t get it and we are stuck, which means we can’t include this change in the real time. We may need to delete as well, if a transaction fact row is removed from the source table.
#3. Similar to #2 but the limited that today’s partition. If it’s a measure change we find the row based on the dim keys (lookup the natural keys in the dim tables). If it’s a dim key change we need the old value before the change.
So none of these types of changes are heavy, it is possible to find the row and update it within a few ms, if we have an in memory database, which came useful for the lookup when doing #2 and #3.