Deleting today’s rows from the fact table takes ages if the table is not indexed. Best to index it (clustered) on the snapshot date key column. Alternatively, partition the fact table so you can truncate the partition (both table and index).
My daughter brings my internet stick today so I have no access to the internet on the bus and on the train. It’s annoying as I can’t check email and learn things on the net but it has its own blessing, i.e. I can focus on writing what I experienced in the past week, without being distracted by the temptation of the internet.
Periodic Fact Table
I am helping a friend building a data warehouse for a company. One of the things he found yesterday was performance issue with delete at fact table loading. To recap, there are 3 types of fact table (on pure Kimball method): periodic snapshot, accumulative snapshot and transaction. On the periodic snapshot, we insert all the rows in the source table into the fact table every day. If in the source fact table we have 1000 rows and it grows by 10 rows a day (so 1st June it’s 1000, 2nd June it’s 1010, etc) then after 3 days the fact table will look like this:
So every day, about 1000 rows will be inserted into the fact table.
One of the core principle of ETL in data warehousing is Rerunability. Meaning that, if the ETL fails half way through the batch, we should be able to rerun it again.
In the above scenario, imagine that now is 4th June and we have 1030 rows to load. The ETL (SSIS or Informatica for example) managed to load 900 rows then it failed because of network error. We fixed the network issue and we rerun this fact table load. But there are already 900 rows in the fact table. So after this second run, we will have 900+1030 rows = 1930 rows.
We can’t make it a “transaction”. The concept of transaction is only practically applicable when dealing with row by row. If we insert thousands of rows and we make it a transaction (using row count to indicate if the transaction is complete), we will hit a performance issue when the ETL fails. Because the SQL Server or Oracle need to delete the inserted rows one by one.
Deleting Today’s Rows
So the traditional way of providing Rerunability, is to delete today’s rows from the fact table. In the example above, “delete from Fact1 where SnapshotDateKey = 20120604”. If you use SSIS, this delete task would be an Execute SQL Task, and this 20120604 is an input parameter to the task, taken from a variable called TodaysDate (package scope).
And this TodaysDate package variable is populated by an Execute SQL Task earlier in the package, which contains this SQL: (I’ll build it step by step so you can see the process)
select convert(varchar, getdate(), 20)
select replace(convert(varchar, getdate(), 20), ‘-‘, ‘’)
select convert(int,replace(convert(varchar, getdate(), 20), ‘-‘, ‘’))
The output is an integer like 20120604. And in the Execute SQL Task we put this output into the TodaysDate package variable.
Performance Issue with Delete
The problem is, deleting 1 million rows out of 10 billion is painfully slow. It could take hours. When the data warehouse was first built, there were only a few rows in the fact table, so the delete was performing well. Usually in minutes. But after a year, with 10 billion rows in the fact table, it now takes hours to delete.
Now this is the core of this article, as I explained in the summary above. There are two ways of solving this:
- Clustered index on snapshot date key
- Partition the fact table
1. Clustered Index on Snapshot Date Key
By adding a clustered index on the Snapshot Date Key column, we force the fact table to be physically ordered on Snapshot Date Key. So that the rows for todays date are located in 1 place, i.e. at the end of the table’s file (group). That is why the delete is so much faster. To give an illustration about the scale of performance improvement, before the clustered index, a fact table with 10 billion rows could take 1-2 hours to delete 1 million rows, but with clustered index, it takes only 2-5 minutes. A major performance improvement.
The syntax is:
create clustered index Fact1_SnapshotDateKey on Fact1(SnapshotDateKey)
For a fact table, as we have only 1 chance of ordering it physically, if it is a snapshot fact table, there is no argument that the clustered index has to be on the snapshot date key. This principle is NOT applicable for transaction fact table or accummulated snapshot. I repeat: NOT applicable.
2. Partition the Fact Table
If clustered index approach above is still too slow, or because you need the clustered index for something else, a better approach is to partition the fact table.
By partitioning it, we now can truncate only a portion of the table, i.e. we can truncate 1 partition. If we partition the fact table on the snapshot date key, we can truncate today’s rows in a zip. Unlike delete operation, truncate is not logged. So it is lightening fast. Whether it’s 1 million rows or 1 billion rows, the truncate only take a zip (a few seconds, i.e. between 0 and 5 seconds).
One thing to remember here is that the partitioned index (or indices) must be aligned with the table so that truncating the table partition will also truncate the index partition. Because if not we need to rebuild the index and dealing with 10 billion rows in the fact table, we don’t want to hear the word “Rebuild” in the same sentence as the word “Index”.
After deleting the today’s rows, the insert is fast. If the insert is slow, look at the constrains. Default and identity is fine but FKs is nightmare. If your fact table is in billions, I do not recommend having physical RI like FKs in the fact table. Deletion will be slow and more importantly insert will be slow. Remove the FKs. But how about querying it, it will be slower. Yes, but only a little bit slower, not as much slower as the insert. At ETL we have made sure the RI (Referential Integrity) any way, so why put FKs?
If your fact table is under 1 million rows then by all means you can put FKs on all Surrogate Key columns, but when the fact table are in billions, every little count.