Data Warehousing, BI and Data Science

14 October 2014

Real Time Data Warehouse

Filed under: Data Warehousing — Vincent Rainardi @ 8:42 pm

The idea of a real time data warehouse has been around for a long time. I first heard about it in 2006. It means that when data in the source system changes, the warehouse is updated immediately.

DATA FLOW

To achieve this, with regard to the data flow, two things need to happen:

  1. The changes from the source system must be detected immediately and sent to the warehouse.
  2. The dimension tables in the warehouse must be updated immediately, followed by the fact tables.

The trick for point 1 is: if the source systems are large, or there are many of them, we need to choose which areas we want to monitor.

DATA MODELLING

To support point 2, the fact table needs to be transactional. If it is a periodic snapshot, we need to partition it so that today’s data is in its own partition*. This is so that we can update the warehouse within a few seconds.

*If your database is not Enterprise edition (so table partitioning is not available), use a separate “Today” table combined with a “History” table via a view for reading.
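
For illustration, here is a minimal T-SQL sketch of that non-Enterprise alternative, assuming a daily account balance snapshot fact (the table and column names are made up for the example):

    -- "Today" holds only the current day's snapshot rows; "History" holds older days.
    -- A view unions the two so reports still see a single logical fact table.
    CREATE VIEW dbo.FactAccountBalance AS
    SELECT DateKey, AccountKey, BalanceAmount
    FROM   dbo.FactAccountBalance_Today
    UNION ALL
    SELECT DateKey, AccountKey, BalanceAmount
    FROM   dbo.FactAccountBalance_History;

    -- Real-time updates touch only the small Today table; overnight,
    -- Today's rows are appended to History and Today is emptied.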

If we have a large dimension (more than 1 million rows), remove one or two rapidly changing attributes and put them into the fact table. If it is still large, split it into two or three dimensions, each joined independently to the fact tables. A large dim with 5 million rows can often be split into two dims with 10,000-50,000 rows each, depending on the number of distinct combinations of values in each group of columns.
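
As a sketch of this (the attribute names and sizes are only assumptions for illustration): the rapidly changing attribute moves onto the fact table, and the remaining attributes are split into two mini-dimensions, one row per distinct combination of values.

    -- The rapidly changing attribute (e.g. a credit score band) is carried on the fact table.
    -- The remaining customer attributes are split into two smaller dimensions;
    -- each row is one distinct combination of values, so they stay small.
    CREATE TABLE dbo.DimCustomerDemographic      -- e.g. tens of thousands of combinations
    ( CustomerDemographicKey int NOT NULL PRIMARY KEY
    , AgeBand       varchar(20)
    , Gender        char(1)
    , MaritalStatus varchar(20) );

    CREATE TABLE dbo.DimCustomerGeography        -- e.g. tens of thousands of combinations
    ( CustomerGeographyKey int NOT NULL PRIMARY KEY
    , City    varchar(50)
    , Region  varchar(50)
    , Country varchar(50) );

    -- The fact table then carries CustomerDemographicKey and CustomerGeographyKey
    -- as independent foreign keys, plus the rapidly changing attribute itself.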

Use an in-memory database if possible, i.e. put the whole data warehouse in memory. At the very least, the dimensions and the real-time portion of the fact tables should be in memory.

DETECTION MECHANISM

To detect changes in the source system, the best approach is to read the transaction log. The alternative is a synchronous mechanism such as triggers. SQL Server has two built-in features: Change Data Capture (CDC) and Change Tracking. CDC reads the transaction log asynchronously, whereas Change Tracking records changes synchronously, as part of the write transaction (comparable to a trigger-based approach in its effect on the writer).

A log-reading mechanism is better because it is asynchronous, so it does not disturb the writes in any way. The source system (usually an OLTP system) can write at normal speed without anything blocking or delaying the process. Triggers, and synchronous mechanisms in general, delay the write because the trigger has to fire before the writing transaction is fully completed.
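
For reference, enabling CDC on a source table and polling the captured changes in SQL Server looks roughly like this (the schema and table name are assumptions):

    -- Enable CDC on the source database and on the table we want to monitor.
    EXEC sys.sp_cdc_enable_db;
    EXEC sys.sp_cdc_enable_table
         @source_schema = N'dbo'
       , @source_name   = N'SalesTransaction'
       , @role_name     = NULL;

    -- The ETL polls the change table for rows changed since the last extract.
    -- (In practice, start from the LSN saved by the previous run rather than the minimum.)
    DECLARE @from_lsn binary(10) = sys.fn_cdc_get_min_lsn('dbo_SalesTransaction');
    DECLARE @to_lsn   binary(10) = sys.fn_cdc_get_max_lsn();

    SELECT *
    FROM   cdc.fn_cdc_get_all_changes_dbo_SalesTransaction(@from_lsn, @to_lsn, N'all');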

3 TYPES OF DATA CHANGES

If it is a new dimension row or a new fact row, then we just insert the row and it’s done in a few milliseconds, no problem. The problem is if we need to update existing rows. That’s what I mean by “changes”. There are 3 types of changes:

  1. A dimension row changes
  2. A fact row changes (transaction fact table)
  3. A fact row changes (periodic snapshot)

#1. For an SCD type 1 attribute, update the row in place. For an SCD type 2 attribute, expire the existing row as usual and then create a new row. This new row is not yet used by any fact row; the expired row is still used by the fact tables, but there is nothing to change in the fact table.
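
A minimal T-SQL sketch of the SCD type 2 handling (the customer dimension and its column names are assumptions for the example):

    -- Expire the current version of the customer, then insert the new version.
    UPDATE dbo.DimCustomer
    SET    IsCurrent = 'N', ValidTo = SYSDATETIME()
    WHERE  CustomerId = @CustomerId AND IsCurrent = 'Y';

    INSERT INTO dbo.DimCustomer (CustomerId, CustomerName, City, IsCurrent, ValidFrom, ValidTo)
    VALUES (@CustomerId, @NewName, @NewCity, 'Y', SYSDATETIME(), '9999-12-31');

    -- Existing fact rows keep pointing at the expired (old) surrogate key,
    -- so no fact table change is needed at this point.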

#2. This is updating an existing row in a fact table, which can be a measure column or a dimension key column. If a measure column changes, we update it (we can find the fact row because none of the dimension key columns have changed). If a dimension key column changes, we look up the new key in the dimension table and then update the fact table. For this we need the old value of the dimension’s natural key (i.e. the value before the change); sometimes we don’t get it and we are stuck, which means we can’t include that change in the real-time feed. We may also need to delete, if a transaction fact row is removed from the source table.
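
A sketch of the two cases for a transaction fact table (table, column and parameter names are assumptions; for a periodic snapshot, #3 below, the same statements simply gain an extra predicate restricting them to today’s partition):

    -- Case A: a measure changed. None of the dimension keys changed,
    -- so they still locate the fact row.
    UPDATE dbo.FactSales
    SET    SalesAmount = @NewAmount
    WHERE  CustomerKey = @CustomerKey
      AND  ProductKey  = @ProductKey
      AND  DateKey     = @DateKey;

    -- Case B: a dimension key changed. Look up the new surrogate key from the
    -- new natural key, then update the fact row, located via the OLD key.
    DECLARE @NewCustomerKey int =
      ( SELECT CustomerKey FROM dbo.DimCustomer
        WHERE  CustomerId = @NewCustomerId AND IsCurrent = 'Y' );

    UPDATE dbo.FactSales
    SET    CustomerKey = @NewCustomerKey
    WHERE  CustomerKey = @OldCustomerKey   -- requires knowing the value before the change
      AND  ProductKey  = @ProductKey
      AND  DateKey     = @DateKey;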

#3. Similar to #2, but limited to today’s partition. If it’s a measure change we find the row based on the dimension keys (looking up the natural keys in the dimension tables). If it’s a dimension key change we need the old value from before the change.

So none of these types of changes is heavy: it is possible to find the row and update it within a few milliseconds if we have an in-memory database, which is particularly useful for the lookups in #2 and #3.


8 Comments »

  1. Hi Vincent,
    Good work!!!

    I found the following article from 2014

    “Data Warehousing and Business Intelligence
    14 October 2014
    Real Time Data Warehouse
    Filed under: Data Warehousing — Vincent Rainardi @ 8:42 pm

    The article talks about updating the dimension and fact tables directly in real time.
    Let’s say we have an Oracle DB and would like to update facts and dimensions that are quite big, with zero latency.
    I believe there will be resource contention issues and the update will take a long time, which defeats the purpose of a real-time update.

    Please share your experience

    Thanks !!!
    Hal

    Comment by hallavendercgmailcom — 1 May 2016 @ 6:30 am | Reply

    • Hi Hal, we can achieve a latency of a few seconds if we do the following:
      1. Choose only 1 fact table
      2. Insert only in the fact table (no updates or deletes)
      3. Put change tracking mechanism in the source tables (source of dim tables and this 1 fact table)
      Contention issue: use NOLOCK (read uncommitted) on all queries, so we can insert into the fact table and update/insert into the dims without blocking (sketched below).
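
      A small T-SQL illustration of the NOLOCK point (table names assumed): the real-time load’s dimension lookups, and any reports, read with the NOLOCK hint so they never block the continuous inserts.

        -- Dimension lookup during the real-time load (and reporting queries)
        -- read uncommitted, so the stream of fact inserts is never blocked.
        SELECT CustomerKey
        FROM   dbo.DimCustomer WITH (NOLOCK)
        WHERE  CustomerId = @CustomerId AND IsCurrent = 'Y';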

      Comment by Vincent Rainardi — 1 May 2016 @ 2:01 pm | Reply

  2. Great Thanks
    Please clarify for my understanding
    1. Choose only 1 fact table: we should have one fact table for each subject area.
    2. Insert only in the fact table (no updates or deletes): sometimes we have late-arriving dimensions; how do we handle them if we do not update or delete on fact tables?
    3. Put change tracking mechanism in the source tables: do you mean CDC, replication or federation techniques to capture changes?

    4. How about updating the dimension tables: can it be done in real time to handle late-arriving dimensions and to handle updates/deletes (logical deletes)?

    I am new to the real-time concept and in learning mode.

    Really need your help in understanding.

    I also came across “hot partitions” from Kimball, where a view is created by joining the hot partition table (“current day” data) with the data warehouse to give the illusion of a real-time data warehouse. Is this technique old, or is it still being practised in industry?

    Thanks!!!

    Comment by hallavendercgmailcom — 1 May 2016 @ 4:07 pm | Reply

    • Hi Hal,
      1. We build the DW as normal (e.g. 10 fact tables), and then choose 1 fact table for real time update (the other 9 are overnight).
      2. We can’t handle late-arriving dims in real time, because deleting from or updating the fact table could take a long time (unless the fact table is small). We do fact table deletes/updates overnight.
      3. Yes, CDC or replication to capture changes, then a query retrieves the changed rows from the source table.
      4. Insert & update in the dimension table is real time (to give the real time fact table the correct SK).
      Hot partition is crucial if your fact table is a periodic snapshot. You need to update only today’s partition (otherwise it will take a long time).

      Comment by Vincent Rainardi — 2 May 2016 @ 5:49 am | Reply

  3. Hi,

    One more question

    3. Put change tracking mechanism in the source tables (source of dim tables and this 1 fact table) —
    The flow of data is
    Step 1: Source -> Staging -> Dimensions (for new records: insert, or perform a lookup for dim_ids)
    Step 2: Staging -> FACT table (inserts, performing Step 1 for dim_ids)

    Please comment

    Thanks

    Comment by hallavendercgmailcom — 1 May 2016 @ 4:18 pm | Reply

    • Hi Hal, we don’t have a staging table for the real-time fact table and dim tables.
      Step 1. A new row in the source tables (say Sales Transaction table and Customer table).
      Step 2. CDC captures the change. A query joins both tables and brings the new transaction row into the DW (with the new customer name embedded in the row).
      Step 3. Insert the new customer into customer dim
      Step 4. Insert the new sales trans into sales fact
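
      A minimal T-SQL sketch of steps 2-4 (all table and column names are assumptions for illustration):

        -- Step 2: the extract query, driven by CDC, joins the new transaction to its customer.
        SELECT t.TransactionId, t.TransactionDate, t.Amount,
               c.CustomerId, c.CustomerName
        FROM   src.SalesTransaction t
        JOIN   src.Customer c ON c.CustomerId = t.CustomerId
        WHERE  t.TransactionId = @NewTransactionId;

        -- Step 3: insert the customer into the dim if it is not there yet.
        INSERT INTO dbo.DimCustomer (CustomerId, CustomerName, IsCurrent, ValidFrom)
        SELECT @CustomerId, @CustomerName, 'Y', SYSDATETIME()
        WHERE  NOT EXISTS (SELECT 1 FROM dbo.DimCustomer WHERE CustomerId = @CustomerId);

        -- Step 4: insert the new transaction into the fact table using the customer's SK.
        INSERT INTO dbo.FactSales (DateKey, CustomerKey, TransactionId, Amount)
        SELECT @DateKey,
               (SELECT CustomerKey FROM dbo.DimCustomer
                WHERE  CustomerId = @CustomerId AND IsCurrent = 'Y'),
               @TransactionId, @Amount;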

      Comment by Vincent Rainardi — 2 May 2016 @ 6:01 am | Reply

  4. Thanks,
    Wondering how you are handling the updates in DIM tables.
    I hope you are maintaining history (SCD2).
    Let’s say your customer record has updated information.
    You create a new record in the DIM table and flag it as current_record = ‘Y’, and the old record gets updated to current_record = ‘N’.
    Are you following the same process for DIM updates in real time, with no latency?

    Thanks

    Comment by hallavendercgmailcom — 2 May 2016 @ 3:37 pm | Reply

    • Yes, same process for SCD2, i.e. expire the existing row and insert a new row, then use the new SK in the inserted fact row.

      Comment by Vincent Rainardi — 2 May 2016 @ 7:01 pm | Reply

