Data Warehousing and Data Science

9 February 2015

Populating Fact Tables

Filed under: Data Warehousing — Vincent Rainardi @ 12:48 am

Fact tables are normally loaded from transaction tables such as order tables or from transactional files, such as web logs. Hence the number of rows to update or insert in a load is much larger than in dimensions. The core of loading fact tables is to change the natural keys into surrogate keys. Let’s start with the basic steps, then continue with some practical experiences such as loading partitioned fact tables, slim lookup tables, deduplication, loading snapshot and delta fact tables, fact table indexes and reloading history data.

Basic Steps

When loading a transaction table from OLTP into a fact table in the data warehouse, the value columns on the transaction table become fact table measures, the primary keys on the transaction table such as order number become degenerate dimension columns on the fact table, and the alternate primary keys such as date, customer and product become dimensional key columns on the fact table.

As rows of fact tables are loaded, each of the key columns is converted from the natural key. This is why all the dimension tables must be populated first before we populate the fact tables: because we need the surrogate keys on the dimension tables to translate the source table’s natural keys. This is the very basic and the very heart of data warehouse loading, so it is important to understand it. Perhaps the best way to describe this concept is using an example. Consider a simplified order detail table with the following columns: order_id (primary key), order_date, product_id, quantity and price. In reality there should be line number column as well but in order to simplify it let’s not use line number in this example, i.e. let us assume that each order only contain one line.

Source table: order_detail

order_id order_date product_id quantity price last_update
352 15/07/2006 BGCKQ 1 12.99 30/10/2006
410 30/10/2006 KMSCG 2 7.99 30/10/2006

In the data warehouse let us assume that we have an over simplified star schema consisting of a date dimension, a product dimension and a sales fact table, as describe below.

Date dimension table: dim_date

date_key date day month
2388 15/07/2006 Saturday July
2485 30/10/2006 Monday October

Product dimension table: dim_product

product_key product_id product_name description
53076 BGCKQ Aike U2 IDE HD case USB 2
92184 KMSCG Sony BP71 VIAO CG7 battery

Sales fact table: fact_sales

fact_key date_key product_key order_id quantity price load_time
830923 2388 53076 352 1 11.99 15/07/2006


  1. Note that order_id 352 already exists in the data warehouse but the price has changed. When it was loaded on 15/07/2006 the price was 11.99 but now the price is 12.99.
  2. In reality date dimension would contains many more other attributes, such as day of the week (2 columns: name and number), ISO date (YYYY-MM-DD), date of the year (number of elapsed days since the beginning of the year), Julian date (number of elapsed days since 4713 BC), SQL date (in SQL Server datetime format), day of the month, calendar week / month / quarter / year (3 columns for month: full name, short name and number), fiscal week / period / quarter / year, weekday flag, statutory holiday flag, last day of the month flag, etc. Plus day and month names in other languages if your data warehouse is internationalised.
  3. In reality the dimension tables would have standard dimension table columns such as load_time and SCD attributes but for this simplified case these columns are not displayed here.
  4. It is a common practice not to have a fact_key column in the fact table, with the argument being the combination of all the dimension keys will make the record unique. In this case, the primary key of the fact table is a composite key. When loading this kind of fact table we will need to compare all dimensional key columns to identify the correct row when doing updating (see step 3 below).

The basic steps in loading data warehouse fact tables are described below. It is not that different from the steps in loading dimension tables (see my previous article). In this example, it is assumed that the source table, order_detail on the OLTP, has been loaded incrementally into a stage table called order_detail. The stage database name is stg. It is also assumed that we are using SQL Server 2000 or 2005, hence the SQL statements are in Transact SQL. So what we need to do now is to load the fact table records from the stage into the data warehouse.

Step 1. Create The Temp Table

SELECT * INTO #fact_sales FROM dw.dbo.fact_sales WHERE 1 = 0

The temp table after it is created:

fact_key date_key product_key order_id quantity price load_time


  1. By deliberately specifying a false condition, we create an empty temp table, taking the structure of the target dimension table.
  2. You may want / need to increase the size of the temp database depending on the size of your load.
  3. SELECT INTO is better than CREATE TABLE because it is more flexible to adapt to structural changes.

Step 2. Populate The Temp Table


INSERT INTO #fact_sales
(date_key, product_key, quantity, price, load_time)
ISNULL(f.fact_key, 0),
ISNULL(d.date_key, 0),
ISNULL(p.product_key, 0),
ISNULL(s.quantity, 0),
ISNULL(s.price, 0),
FROM stg.dbo.order_detail s
LEFT JOIN dw.dbo.dim_date d ON s.trans_date = d.sql_date
LEFT JOIN dw.dbo.dim_product p ON s.product_id = p.product_id
LEFT JOIN dw.dbo.sales_fact f ON d.date_key = f.date_key
AND p.product_key = f.product_key
WHERE s.load_time BETWEEN @last_run AND @current_run

The temp table after it is populated:

fact_key date_key product_key order_id quantity price load_time
830923 2388 53076 352 1 12.99 30/10/2006
0 2485 92184 410 2 7.99 30/10/2006

Source table: order_detail

order_id order_date product_id quantity price last_update
352 15/07/2006 BGCKQ 1 12.99 30/10/2006
410 30/10/2006 KMSCG 2 7.99 30/10/2006

Date dimension table: dim_date

date_key date day month
2388 15/07/2006 Saturday July
2485 30/10/2006 Monday October

Product dimension table: dim_product

product_key product_id product_name description
53076 BGCKQ Aike U2 IDE HD case USB 2
92184 KMSCG Sony BP71 VIAO CG7 battery


  1. Look how dimensional keys are looked up in a single step, by joining the fact table to the dimension tables on the natural keys.
  2. Because they are LEFT JOINS, the fact table rows without corresponding imensional
    rows will results in NULL dimensional keys in the fact table. The ISNULL then convert these NULLs to zeros.
  3. Notice that we don’t populate the load_time with getdate(), but with a variable named @current_load_time. This variable is populated with getdate() at the beginning of the loading batch and used by all processes in the batch. This is necessary so that in the event of failure, we know the point in time we have to restart the process from.
  4. Notice that we only load the rows between @last_run and @current_run. This is necessary if we keep the records on the stage table for a few days, i.e. if the stage contains more than 1 day data. If we clear the data from the stage table as soon as we load them into data warehouse, we don’t need to specify this where clause.
  5. The example above is using only 2 dimensions but in the real practice we would have to deal with more dimensions. 10 to 15 dimension key columns on the fact tables are common occurance.

Step 3. Update Existing Records

SET f.date_key = t.date_key,
f.product_key = t.product_key,
f.order_id = t.order_id,
f.quantity = t.quantity,
f.price = t.price,
f.load_time = t.load_time
FROM dw.fact_sales f
INNER JOIN #fact_sales t ON f.fact_key = t.fact_key
WHERE t.fact_key <> 0 AND
(f.date_key <> t.date_key OR
f.product_key <> t.product_key OR
f.order_id <> t.order_id OR
f.quantity <> t.quantity OR
f.price <> t.price)

Source table: order_detail

order_id order_date product_id quantity price last_update
352 15/07/2006 BGCKQ 1 12.99 30/10/2006
410 30/10/2006 KMSCG 2 7.99 30/10/2006

fact_sales after the update:

fact_key date_key product_key order_id quantity price load_time
830923 2388 53076 352 1 12.99 30/10/2006


  1. Here we update the fact table, based on the data on the temp table. In this case the price was updated from 11.99 to 12.99.
  2. We only update the rows where the tmp table’s fact_key is not 0, i.e. the rows already exist on the target fact table. For the rows where the fact_key is 0 (not exist on the fact table), we will insert them into the fact table later on.
  3. Notice that when updating rows we update the load time column as well. The last line is used to specify which changes we want to pickup. In most cases, we want to pick up changes on all columns, but sometimes there are legitimate reasons for business rules to specify that changes on certain columns are to be ignored.

Step 4. Insert New Records

INSERT INTO dw.dbo.fact_sales
(date_key, product_key, order_id, quantity, price)
SELECT date_key, product_key, order_id, quantity, price
FROM #fact_sales
WHERE fact_key = 0

Source table: order_detail

order_id order_date product_id quantity price last_update
352 15/07/2006 BGCKQ 1 12.99 30/10/2006
410 30/10/2006 KMSCG 2 7.99 30/10/2006

Date dimension table: dim_date

date_key date day month
2388 15/07/2006 Saturday July
2485 30/10/2006 Monday October

Product dimension table: dim_product

product_key product_id product_name description
53076 BGCKQ Aike U2 IDE HD case USB 2
92184 KMSCG Sony BP71 VIAO CG7 battery

fact_sales after the insert:

fact_key date_key product_key order_id quantity price load_time
830923 2388 53076 352 1 12.99 30/10/2006
916912 2485 92184 410 2 7.99 30/10/2006


  1. It is a good practice to always declare the column names. This is important for flexibility and maintenance.
  2. Let RDMBS maintains the fact_key. When setting up the data warehouse, set this column to be IDENTITY (1,1).

Logging and Closing

In every step above we need to do error handling and logging. Error handling is important because if the loading fails on any steps, we need to be able to recover from the failure. Logging is important to know what exactly happened on each steps, i.e. how many records are processed, etc.

At the end of the program, we should not forget to clean everything up, i.e. drop the temp table(s), follow control protocol e.g. set the process to complete state, etc.

The above code shows how to do upsert with native SQL codes, which is very fast and efficient. But it is worth to note here, that good dedicated ETL tools such as Informatica and Data Integrator have the facilities to do in-memory lookups which has very good performance too. Disk-base lookup is definitely not they way to go here, as they are slow. Mainly because there are a lot of SQL statements to execute, i.e. one for each lookup, and each of these statements ‘touches’ the disk, which is a costly operation.

Now that we have understood the basic steps in loading fact tables, let’s familiar ourselves with a few practical experiences such as loading partitioned fact tables, slim lookup tables, deduplication, loading snapshot and delta fact tables, and dealing with fact table indexes.

Slim Lookup Tables

When a dimension is very large, sometimes it would be a significant performance improvement if we provide a separate lookup table for key management. For example, if our customer dimension has 100 columns and 10 million rows, we could create a customer key lookup with only 3 columns: customer_id, customer_key, load_time, which would increase the performance of dimensional key lookup process on step 2 above. It is also more suitable for performing in-memory lookup. If you are using Oracle 8i and above, physical table structures such as IOT (Index Organised Tables) would useful in terms of performance. On SQL Server it’s a good practice to cluster index slim lookup tables on the natural keys. On Teradata, we could use Practical Primary Index (PPI) on these lookup tables.

Example of Slim Key Lookup table: lookup_customer

customer_id customer_key load_time
493238 30012 02/10/2006
493240 30013 03/10/2006
493241 30014 03/10/2006

The load_time column would be useful if we have very long dimension table, e.g. 25 million rows. This is not uncommon when the data warehouse is utilised to drive CRM systems (Customer Relationship Management), especially in the dot com companies dealing with online campaigns, be it subscription based or tactical campaigns. In this case we can specify a where clause to limit the selection on this slim key lookup, for example where load_time is within the last 2 years. This would cover, say, 98% of the lookup. For the remaining 2%, e.g. the one older than 3 years, we then go to the main customer dimension table to find the customer_key.

Altough it takes a bit of time to maintain the lookup table, overall we still save time as the time saved by querying a slim key lookup table is a lot greater than the time required to maintain the lookup table itself. This is especially true for a Master Data Management (MDM) dimensions such as customer, product and account, where they are used all over the place in the data warehouse and data marts. Time dimension is not that long – 10 years is only 3651 rows – hence we tend not to have a lookup for time dimension. For those of us who think to set the grain of time dimension to hours or minutes, the general advice is: don’t. Either put a time stamp column on the fact table or have a time of day dimension. But this discussion (time of day dimension) is for another article.

Natural Key Changes

Sometimes a natural key which we thought was a good solid natural key could change and therefore can no longer be a natural key. Perhaps the best way to explain it is using an example. In a OLTP source system the customer table has a composite primary keys as combination of branch code and customer code. Customer code is only unique within a branch. Occasionally a branch could be closed and all the customers in that branch are moved or assigned to another branch. For example, see the customer table below.

branch_code customer_code customer_name other_attributes
1 1 Andrew
1 2 John
2 1 Steve
2 2 Paul

When branch 1 is closed, and its customers are moved to branch 2, the customer
table becomes:

branch_code customer_code customer_name other_attributes
2 1 Steve
2 2 Paul
2 3 Andrew
2 4 John

If we use branch_code and customer_code as natural key in the data warehouse, we need to handle this branch closure event with care. In the previous project we utilise another column in the customer table which would help identify a unique record and we create a special table in the data warehouse to translate the change natural key to the correct data warehouse key.

Unknown Records

If a fact table record does not have a corresponding dimension record, basically we have 2 choices: either we don’t load that record into the fact table, or we load it but we put 0 as the dimension key, referring to the unknown record in the dimension table. An unknown record is a record in the dimension table, with a dimension key of 0 and all the attributes are populated with blank string, number 0 or low value date, depending on the data type. Name and description columns are usually populated with the word “Unknown”. The load_time column is populated with the date of the record was created. This date is normally equal to the date the data warehouse was setup, because the record was created by the data warehouse setup scripts. For example:

Product dimension table: dim_product

product_key product_id product_name description min_level valid_until load_time
0 0 Unknown Unknown 0 1/1/1900 12/10/2004

Here is an example of a row in the source table with a product_id that does not exist in the dimension table:

order_id order_date product_id quantity price last_update
358 19/08/2006 BGCKZ 3 2.99 31/10/2006

Below is how fact_sales looks after that record is loaded. Notice that the product key is 0, which refers to the unknown record in the product dimension.

fact_key date_key product_key order_id quantity price load_time
830937 2424 0 358 3 2.99 31/10/2006

If we load it into the warehouse, we need to flag it into the data quality system, so that it can be reported and corrected on the subsequent load. If we don’t load it it should still be set as a data firewall rule. If we don’t load it, the total of measure on fact table would not be accurate. In the example above, the total sales amount for October 2006 would be 3 x £2.99 less than what it should be. Because of this we tend to load fact table record that does not have a corresponding dimension record and set the dimension key to 0. This way the fact table measure total would be correct, it’s just that the sales could not be traced to a valid product record. But all other dimension keys would still be valid, e.g. it can be traced to a valid date dimension record, a valid customer dimension record, etc. And above all, the referential integrity between the fact tables and the dimension tables are still valid.


There are 2 main causes why we perform deletion on the fact tables: 1. because the data extraction approach is fixed period extraction, and 2. to accomodate reverse transactions such as cancellations. No 1 is usually physical deletion and no 2 is usually logical deletion. Let’s discuss them one by one, using examples to clarify.

No 1, loading approach. In one of the data warehousing projects I’ve been involved with, we had difficulties extracting a transaction table from the source system incrementally, because the date stamp was not very reliable. We tried with 3 weeks tolerance i.e. where last updated date or created date was within the last 3 weeks but we still find some leakage, i.e. a few records were updated without the date stamp columns were not updated. Please refer to this article for discussion about incremental extraction and leakage. Luckily, source system did not allow the user to update records that were more than 6 months old. When a user tried to update a transaction record that was created more than 6 months ago, the system displayed an error, something like “You can not modify this record.” So our approach of extracting this transaction table to the stage was to get 6 months data every time. And consequently, the approach of loading the fact table from stage was to delete the rows on the fact table that exist in stage, then reload all the stage records into the fact table. It was actually the same as updating those records, but we found that it was quicker to delete then reload. We identify the records on the fact table that exist in stage by comparing all the dimension key columns, i.e. the composite
natural keys of the source transaction table.

No 2, accomodating reverse transaction. Sometimes, in the source table we had a main transaction table containing normal transactions and a smaller secondary transaction table containing cancellation records. We have 2 options loading this kind of tables into the fact table in the data warehouse: either we load the cancellation as a new record with negative measures, or we load the cancellation as logical deletion. Reverse transaction such as refund and credit notes needs to be implemented as negative measures, but for cancellation we have 2 options.

Each approach has its own advantages and disadvantages. Which one is better depends on the situation, for example whether we will be loading from that fact table into OLAP cubes or not, whether we will be creating a summary fact table or not, whether we need the historical dates or not, and whether the secondary source table contains complete data or not. If we decided to go for logical deletion, then for each cancellation record exists on the secondary source table, we mark the logical delete column on the fact table record. All processes further down the stream such as loading into data marts, creating summary tables or loading into OLAP cubes need to be aware of this column and they need to handle it properly.


When loading records from stage to fact tables, sometimes we have duplicate records. For example: we declare that the grain of the fact table is 1 day for every product for each store. But we found this on the stage table of the fact table:

date product_id store_id quantity
19/8/2006 BGCKZ 309 30
19/8/2006 BGCKZ 309 1

Do we add them up, take the maximum, or take the minimum, or take the average?
It is very important to understand why it happened in the first place. In this case I always found it useful to go back to the business users or the source system expert. The second record could be an error, and in this case we take the earliest record. Or it could be a correction and in this case we sum them up. Or there can only be 1 measurement per day and in the case of 2 or more records found it must be a system error and therefore they want us to take an average. Whatever the business rule is, we need to document it, obtain sign off and then implement it in the code. This process is called deduplication. This normally happens if the source system allow duplication, i.e. it does not have the necessary contraints in the database to make it unique as per the fact table grain criteria.

Deduplication does not only happen in fact tables, but also in dimension tables, especially MDM dimensions such as customer and product. Deduplication can also occurs when the fact table is loaded from 2 or more different source system. Many data quality software such as TrilliumDataFlux, DQ Global, have facilities to deduplicate data.

Fact Table Indexes

Fact tables can be very large. It is not unusual for them to have millions of rows. To improve query performance, fact tables are normally indexed. The cluster index of a fact table is normally a composite of the dimensional keys. Or the fact key column, if there is one. For the non clustered indexes, deciding which column to index depends on how the fact table is used. Whether it is used for direct queries by end users, for populating OLAP cubes or by reports. SQL Profiler and Index Tuning Wizard are useful to understand what indexes would help improving the query performance.

Indexing in data warehousing is quite different from OLTP. In data warehousing it is normal for tables to be indexed heavily to improve query performance. In OLTP, because the records are frequently updated the indexes are not too heavy. It is normal for OLTP tables with 10 to 30 columns to have 1 to 3 indexes. In data warehousing, it is normal for tables with 10 to 30 columns to have 5 to 15 indexes. I’m talking about Kimball’s dimensional model data warehouse here, not Inmon’s 3rd normal form data warehouse in Corporate Information Factory.

Both models of data warehouse are indexed a lot more than OLTP, but generally speaking dimensional model data warehouses are indexed more than the 3NF ones. Because data warehousing tables are heavily indexed, in data warehouse the indexes takes much more space than OLTP. It is normal in a data warehouse for the indexes to take more space than the data. As an example, in a data warehouse implementation I’ve been involved recently, the 3 main fact tables occupies 100 GB for data, but the indexes for these fact tables are 150 GB. This would never happen in an OLTP system.

If our fact table contains 100,000 rows or less, we just load the fact table with the indexes on. There is no need to worry about dropping indexes when loading. If our fact table contains more than 1 million rows, it may be quicker to drop the indexes before loading the data, and recreate them afterwards. If the load is less than 10% of the fact table length, generally speaking we don’t need to worry about dropping indexes. Chances are we could decrease theloading performance by doing so. But if we are loading 20% or more (of the fact table length, i.e. number of rows in the fact table) we may want to consider dropping and recreating indexes. It is very difficult to generalise this, as the performance differs depending on what indexes we have and what loading operations we perform. Hence we always need to test it in dev and QA to prove that our load is significantly improved by dropping and recreating indexes, before implementing it in production.

If you use Oracle, bitmap indexes and Index Organised Table (IOT) are very useful in data warehousing. If implemented correctly, for example bitmap indexes on columns with very low cardinality can be 10 times faster than B-Tree index. IOT are normally used for slim lookup tables. These features don’t exist in SQL Server. As far as I’m aware SQL Server only support B-Tree indexes.

Fact Table Partitioning

Table partitioning is new in SQL Server 2005, but has been in Oracle since 1997 on Oracle 8 and improved in version 8i and 9i. In SQL Server 2000 we only have partitioned view, not partitioned table. This article provides a good overview on the partitioning on SQL Server 2005. Joy Mundy wrote an article about partitioning on SQL Server 2000. For Oracle, it’s on this document and this manual. In DB2, table partitioning was introduced in version 8 since 2002, and was greatly improved in version 9 which was release in July 2006. Paul McInerney describes DB2 partitioning features for data warehousing in this article.

Fact table partitioning is basically dividing the fact table into several physical parts. Each part is called a partition. Each partition is ideally located on a different physical data file and ideally each file is located on different disk. For example, we can divide the fact table so that rows for January 2006 are located on partition 1, rows for February are located on partition 2, and so on. Each partition can be loaded separately in parallel. Partitioning can improve query performance and increase availability. We can add new partition, drop existing partition, truncate (empty) a particular partition, move a partition, split a partition, merge several partitions and exchange/switch partitions. All these activities improve maintenance operations and greatly simplify administration tasks.

If our fact table contains more than 1 million rows, we definitely need to consider partitioning it. Even if the fact table contains hundreds of thousand rows, we still need to consider partitioning it. It can improve the loading performance significantly. We are talking 10-20 times quicker here. This is because we load new data to just 1 partition, which is say 12 times smaller than the size of the whole fact table, hence quicker. There are also techniques to load data into a new empty table with exactly the same structure as the fact table, then switch/exchange partition that new table with the main table. Queries can be a lot quicker too. Again we are talking 10-20 times quicker here. We can also partition indexes so that each part of the index serves only one table partition. This allow more processes to run in parallel. Partition can really
be a life saver for a data warehouse.

If you use Oracle, always use local partitioned indexes instead of global partitioned index. If you use SQL Server 2005, SQL Server automatically partitions the index by using the same partition scheme and partitioning column as the table, unless you specify a different partition scheme. For more information on partition indexes on SQL Server, please refer to these guidelines.

Internal Data Warehouse Entities

Some data such as performance target or budget does not exist in any source system but it needs to exist in data warehouse. Hence they are known as internal data warehouse entities. For example, for each store we can calculate out of stock percentage, i.e. how many times a required product is out of stock. There may be a concensus that it needs to be under 5%. This 5% is not written any where in any source system but will be required by the reports so they need to exist somewhere in the data warehouse .

We have 2 options on how to get this performance target into the warehouse.
1) we can build an application which will store or persist the data into its database, then from there we ETL into staging and into the warehouse, or 2) put it on a simple spreadsheet and we import it into the warehouse. In any case we should not allow the data to be directly entered into data warehouse, for example using direct SQL statement or through SQL Server enterprise manager. All data stored in the warehouse must go through the data quality / data firewall rules, so that any exceptions to the rules are recorded and reported.

Loading Snapsnot and Delta Fact Tables

A snapshot fact table is a fact table that contains condition of measures or status at specific point in time. For example:

  • a fact table that contains actual inventory level for each product at 9 am every day
  • a fact table that contains balances of every saving account in all branches on the first day of every month
  • a fact table that contains the details of all accounts that each customer has every day

There are 2 kinds of snapshot fact tables, periodic and accumulating. Periodic snapshot fact tables contain condition of status taken at regular interval, i.e. we take the snapshot of the source table at regular interval. All 3 examples above are periodic snapshot fact tables. Accumulating snapshot fact tables show the status at any given moment. It is useful to track items with certain life time, for example: status of order lines. Please refer to this Ralph Kimball article for more details about snapshot fact tables. An example of accumulating snapshot can be found here.

How do we load periodic snapshot fact tables? We extract all records that satisfy the criteria from the source table at certain period. For example, take all active rows from the account tables including the balance amount. Do this automatically once a month on the first day. Loading accumulating snapshot is rather different.
We still take all records from the source table that satisfy the criteria, then we update the fact table. For the example of purchasing accumulating snapshot above, everytime there is new piece of information about a particular purchase, we update the fact table record. We only insert a new record in the fact table when there is a new purchase requisition.

Delta fact table is a fact table that we produce as a result of comparing the condition of a source table on 2 different time point. For example: account table. This table in the source system contains all customer accounts. Say on Monday we have 100,000 active accounts and on Tuesday we have 100,001 active accounts, i.e. there were 2 new accounts opened, 1 account closed and 3 accounts changed. Out of the 3 accounts changed, 1 is changed interest rate (from 5.75% to 5.50%), 1 changed the credit limit (from £3000 to £3500), and 1 the interest payment frequency (from daily to monthly). On the delta fact table there will be 6 new rows today (assume today is 6th Nov 2006). Legend for change_type: NA = new account, CA = closed account, IR = Interest Rate,
CL = Credit Limit, IPF = Interest Payment Frequency.

account_key change_date change_type IR_before IR_after CL_before CL_after IPF_before IPF_after
49912 6/11/2006 NA
26077 6/11/2006 CA
32109 6/11/2006 IR 5.75 5.50
19387 6/11/2006 CL 3000 3500
29462 6/11/2006 IPF D M

To populate delta fact table, we download the source account table everyday and compare today’s copy with yesterday’s copy and entered the differences on the delta fact table. As an illustration of practical implementation of delta fact tables, in my previous data warehousing project we use them to tract customer retention or customer churns. It is useful for Customer Relationship Management and also to calculate financial impact of customer churns.

Purging Or Pruning A Fact Table

Some people call it purging, some call it pruning. Purging or pruning a fact table is an activity to remove certain rows from the fact that satisfy certain criteria. To give us an idea below are some examples of purging criteria:

  1. older than 2 years
  2. older than 2 years and status is not active
  3. keep daily records for the last 4 weeks then Monday only for the last 2 years

Example 1 and 2 is normally for transaction fact table and example 3 is normally applicable for periodic snapshot. Purging is important when our data warehouse is a few years old. Purging can improve query and load performance significantly.
If the fact table is partitioned, it is a common exercise to archive the oldest partition (say older than 5 years old, partitioned by month) then drop or truncate the partition. Archiving can be done by exporting the partition (using exp in Oracle) or by backing up the partition.

Reloading Fact Table History

Soon after the data warehouse is in production, user may want us to reload say last 2 years history from the OLTP source system. Say they want 2 years history of order tables. It’s not difficult to reload the history. Normally we just need to modify the time window on the ETL package. For example, if you use SSIS or DTS, we could set the time window on a table in the metadata database, where the Last Successful Extraction Time (LSET) and Current Extraction Time (CET) are kept. See this article for the concept of LSET and CET.

The problem with reloading fact table history is: we need to reload all related dimensions too. And most OLTP do not keep this history. All orders in the last 2 years, yes no problem they have it. But all customer and products and store details in the last 2 years? Often the customer and products history are not kept, their details are overwritten with new ones. SCD concept is not known and implemented in OLTP systems. This way they lost the history. The best way is probably to load the last condition of product, customer and store tables from the source system, and for all order table records which we can’t find a match in the product, customer and store tables, we reference them to the unknown records. This way the sum of measure in the fact table will still be valid and the referential integrity will still be intact.


Fact tables are loaded after the dimension tables are populated. As rows of fact tables are loaded, each of the key columns is converted from the natural key. When loading fact tables sometimes we need to deduplicate the rows. To load fact table record that does not have a corresponding dimension record we set the dimension key to 0, referring to the unknown record in the dimension table. Loading performance could be a lot higher if the fact table is partitioned. Consider removing fact table indexes before loading the data and recreate them afterwards. Purging or pruning fact tables can improve query and load performance significantly.

Vincent Rainardi
12 th November 2006


  1. Thanks for such a detailed explaination, but I did not understand why would someone update an old transaction, as in your example of price change (unless it is a correction, is that what you meant?)

    Comment by Saqib — 26 October 2015 @ 10:35 am | Reply

    • Hi Saqib,
      many books and websites recommend against updating fact tables (I’m talking about Periodic Snapshot Fact Tables here), and they are right. Because it is a pain to update a fact table. But in reality, occasionally, we need to do it. For example: correction. Before the accounting period is closed, the company may need to restate their transactions (retail industry) or positions (asset management industry). If the fact table has a large number of rows (say 25bn rows), we usually employ the technique of “swapping out the daily partition” to update a Periodic Snapshot Fact Table.
      A Accumulative Snapshot Fact Table is of course, perpetually updated.
      For a Transaction Fact table, one of the recommended approach to make a correction is by posting (inserting) a corrective row, particularly if the actual business process is supportive e.g. goods returns, payment refunds, etc.

      Comment by Vincent Rainardi — 26 October 2015 @ 6:26 pm | Reply

  2. Thank you.Helpful article.

    Comment by umasankar — 15 July 2016 @ 7:03 am | Reply

  3. I am new to DW and input is really appreciated.

    Let us say i have one fact table as mentioned in below link.The grain of the fact table 1 row for each account and we do not have fact_key in fact table.

    To implement step 3[Update] on balance field do i need to join only columns which are part of grain[Snapshot Date Key,Account Key] or all columns[Snapshot Date Key,Account Key,Customer Key and Product Type Key]

    Comment by Vamsi Krishna — 22 September 2016 @ 1:49 pm | Reply

  4. Can we load the data into fact table from 20-30 dimension tables? Is this possible?

    Comment by madhu — 6 December 2016 @ 4:54 pm | Reply

  5. Thank you very much for taking the time to put this example together, it has really answered a lot of questions for me…Do you by any chance have any similar articles on populating dimension tables?

    Comment by Niall — 7 December 2016 @ 5:12 pm | Reply

    • No I don’t Niall. Type 2 dimension tables are populated by using upsert, i.e. update if exists and insert if not. The matching criteria is the business key.

      Comment by Vincent Rainardi — 7 December 2016 @ 7:43 pm | Reply

  6. Thanks Vincent. A helpful post. Just letting you know that there are a few links that are missing. Perhaps this is due to the links having disappeared as this blog post is quite old. But it’s been very useful and educational to me.

    Comment by Scott — 13 October 2017 @ 9:28 am | Reply

  7. Vincent , I have a question with Delta Fact Load like mentioned above. So My Dims and facts are loaded everyday . While look up , the condition is natural id and source system (multiple sources) , My Fact and Dims are not in sync (records are different for the grain as fact receives update not the same day as Dims) so if tonight dimension got rec id 1 and updated the historical column say status ‘L’ to ‘M’. While fact with no update for that load will miss the dim changes . as on the third day Dim history column is again ‘L’ . Now fact on capturing this change is not capturing one dim change … how to handle this situation .? Looking forward to hear from you soon . Hope the question is clear.

    Also in such a situation should i using incremental on Fact or control append .

    Comment by Sudhir K Prasad — 20 March 2018 @ 3:41 pm | Reply

    • Hi Sudhir,
      in the fact table we should put the latest dimension key for that business date.
      For example, if for 19/3/2018 the value in the dim is:
      customer_sk customer_name status effective_from effective_to
      1 Blue Granite L 19/3/2018 08:00 19/3/2018 10:00
      2 Blue Granite M 19/3/2018 10:01 19/3/2018 14:00
      3 Blue Granite N 19/3/2018 14:00 20/3/2018 09:00
      Then we should put SK = 3 in the fact table (Status = N), because on 19/3/2018 at 23:59 the latest dimension row for Blue Granite is 3 (by looking at the effective date columns). So the fact table should be like this:
      business_date_sk customer_sk balance_amount
      20180319 3 12345

      Yes we are missing Status L (which is from 8 to 10am) and Status M (which is from 10am to 2pm), but the granularity of the fact table is once a day (1 row per business date) and we just need to take the latest condition for that business date (for all dimensions).

      If the business needs to know the changes of Blue Granite status on 19/3/2018, then we should query just the customer dimension, independently of the fact table.

      It is always faster to update the fact table incrementally. I’m not too sure what you meant by “Controlled Append” but I think you are comparing yesterday’s condition with today’s condition, which is quite slow.

      Comment by Vincent Rainardi — 20 March 2018 @ 7:02 pm | Reply

  8. Thanks Vincent, for this detailed resource. I would like to ask you about how to load bank account with some of the clients’ balance that change on a daily basis.

    Comment by Ramon Yisau — 8 May 2018 @ 7:39 am | Reply

    • Hi Ramon, the usual approach is to create a Transaction Fact Table which contains every transaction from every account during the day, each with their amount (this table is called fact_account_transaction, columns: fact_key, transaction_date_key, customer_account_key, transaction_description_key, transaction_amount, transaction_datetime, upload_datetime).
      Then we create a daily Snapshot Fact Table which contains only the end of day balance of every day or every account (this table is called fact_account_balance_snapshot, columns: fact_key, snapshot_date_key, customer_account_key, account_balance, upload_datetime).
      Hope this helps,

      Comment by Vincent Rainardi — 8 May 2018 @ 5:55 pm | Reply

      • Hi Vincent, thanks for this vital response. Actually what we extract from the OLTP system is the end of day balance, however, my concern is that if the bank has about 300,000 Account records does it mean that we’ll be inserting all the balances every day into the daily snapshot fact table? Keep in mind that some clients might not transact for some days which means that the daily balances of those accounts would be the same for that period.
        Will this not make the fact table grows quickly, e.g if 300,000 account balances are inserted for 5 days, it means a total of 1.5 million records in the daily snapshot fact table.


        Comment by Ramon Yisau — 11 May 2018 @ 7:06 am

      • Hi Ramon, yes you do need to insert 300k rows every day into the snapshot fact table, because if you only insert the changed accounts, the snapshot fact table will be difficult to query. And yes the fact table will get bigger quickly into billions of rows. Perhaps it would be helpsful to read this article about it:

        Comment by Vincent Rainardi — 11 May 2018 @ 7:50 pm

  9. Hello Vincent,

    Thanks for the article link provided. It really shed more light on our worries and showed us a way forward.
    Just like Oliver Twist, I have some more questions but this is based on Loan management in the data warehouse. We created a Dimension and Fact table for our OLTP Loans in the data warehouse but a number of articles online suggested that Loan should be treated as Dimension only due to a number of changes it undergoes such as a decrease in disbursed amount, top up (Increase in disbursed amount), partial disbursement, repayment, overdue processing, change in Interest rate etc.
    So, I need your advice on the better way to handle this.

    Many thanks.

    Comment by Ramon Yisau — 12 May 2018 @ 12:25 pm | Reply

    • Hi Ramon, a loan has many stages during its life cycle including application (initiation/origination), collateral assessment, credit decision/approval, disbursement, repayment, impairments and closure. Therefore “static” loan attributes such as loan type, amount, term, application date, approval date, collateral details, interest rates, limit, fees, penalties, agreement ID, maturity date, installment, settlement, conditions, etc needs to be put into a loan dimension. But loan does have a transaction fact table, which records the cash flow, including disbursements, repayments, interests and write off. Loan also have a daily snapshot fact table recording the account balances (this is usually combined with deposit accounts). There are various types of loan: fix term or revolver, fix interest or variable, mortgage, business loan, asset finance, buy to let, business overdraft, cash flow finance. They have different loan attributes and cash flow. The main hierarchies for loan data is customer hierarchy (group > client > agreement > account), product type hierarchy (see different types of loan above), financial hierarchy (balance sheet accounts) and date hierarchy (deal pipeline analysis, credit control, maturity analysis, impairment early warning). Impaired loans are usually handled by separate department, have their own life cycle and process, and therefore have their own fact table, and its own dimension. The link between client dimension and loan dimension is particularly crucial to enable the customer hierarchy above (both dimensions are SCD type 2, and both have “parent-child hierarchy”). Deal pipeline analysis usually requires an accumulated snapshot fact table. Hope this helps, otherwise please email me on

      Comment by Vincent Rainardi — 13 May 2018 @ 1:53 am | Reply

  10. I am new DW and about fact table i have a very rich database have 200 tables and i want to make a fact table according to a client which need infroamtion present in 15 tables now do i have to load fact table accordingly? i mean School Cambridge(from school table), district(e.g where its located), ….. and much more all related data should be in one row thats waht fact table is right.

    Comment by Zain Ul Abedin — 9 July 2018 @ 9:38 am | Reply

    • Hi Zain, thanks for your question. Not all related data should be in one row Zain. Only data related to 1 business process should be in 1 row. So for example if the business process is “student registration”, then you need to get all the measures relevant/used in this registration process plus all relevant dimension into 1 fact row. Another process like “examination result” should be in another fact table, not mixed with the registration fact table.

      Comment by Vincent Rainardi — 9 July 2018 @ 6:22 pm | Reply

      • Thanks Sir Vincent, i really appreciated that, and acted as you suggested it was all working fine till i hit the measures selection for the fact table, What confuses me is that Measures should be unit measure and calculated at report level or measure should be calculated and then add to the fact table for example, As i mentioned earlier about DB which has data about all the schools in a province, now i have to prepare a DW which has information about schools such as total each schools facilities, total working facilities, total not working facilities, total student, total male student, total female student, total usable commodities and etc etc now should i add such columns to fact table or should i just add unit data into columns and calculate them at report level. Thanks

        Comment by Zain Ul Abedin — 17 July 2018 @ 8:03 am

      • Thanks Zain. I’m not a Sir, please just call me Vincent. Generally, measures should be stored in the fact table at the lowest possible level. In your case, 1 row in the fact table is 1 school. The columns are:
        – school_key, which links to the school dimension table
        – year, which is when the data is obtained
        – students
        – male_students
        – female_students
        – school_facilities
        – working_school_facilities
        – and so on.
        The report then selects data from the fact table, and sums up the above measure columns, and report them per city, per school category (e.g. public school, private school, etc.), per school level (e.g. primary school, secondary school, etc) for that year.
        For example:
        sum(F.students) as students, sum(F.male_students) as male_students, sum(F.female_students) as female_students,
        sum(F.school_facilities) as school_facilities, sum(F.working_school_facilities) as working_school_facilities
        from fact_school_information F
        inner join dim_school DS on F.school_key = DS.school_key
        group by, DS.school_level
        order by, DS.school_level
        I hope this helps.

        Comment by Vincent Rainardi — 18 July 2018 @ 7:27 am

  11. Thank Vincent, it really cleared my concept about Fact tables. it cleared more concept then i have obtaining from books in last 12 days . Thank you

    Comment by Zain Ul Abedin — 19 July 2018 @ 7:14 am | Reply

  12. Can’t thank enough, Vincent

    Comment by Hem — 2 November 2018 @ 11:35 pm | Reply

  13. Hi dear,
    I have more than 7 dimensions tables. All the tables have concept called effective date and effective status(A/I). The each dimensions are joined with multiple transactions tables.

    I can write SQL query to get load data to fact table while joining all the transactions and dimensions with concept of max effective date.

    but I thinks that the SQL query is right approach when considering the dimension changes(SCD).

    could you please suggest appropriate approach load the fact with these dimensions joins. (preferred SSIS)


    Comment by Abdul — 1 January 2019 @ 10:42 am | Reply

    • small correction : I thinks that the SQL query is NOT right approach when considering the dimension changes(SCD).

      Comment by Abdul — 1 January 2019 @ 10:43 am | Reply

      • Hi Abdul, the SQL to load into the fact table while joining all the transactions and dimensions by considering the SCD changes is like this:
        SELECT DD.date_key as transaction_date_key, DC.customer_key, DP.product_key,
        T.sales_amount, T.cost_amount, T.profit_amount
        FROM Source.Transaction_Table T
        LEFT JOIN DWH.Dim_Date DD on T.transaction_date = DD.full_date
        LEFT JOIN DWH.Dim_Customer DC on T.customer_code = DC.customer_code
        and T.transaction_date between DC.effective_date and DC.expiry_date
        LEFT JOIN DWH.Dim_Product DP on T.product_id = DC.product_id
        and T.transaction_date between DP.effective_date and DP.expiry_date

        Comment by Vincent Rainardi — 4 January 2019 @ 8:39 am

RSS feed for comments on this post. TrackBack URI

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

Blog at

%d bloggers like this: