In Microsoft BI, when we do a slowly changing dimension type 2, we instinctively do SCD Wizard. The problem with SCD Wizard is the performance. For each source row coming to the SCD box, SSIS does a lookup on the target dimension table to find out if the row already exists or not. That is a crazy idea. And SSIS also update the target dimension table on row by row basis. That is also a crazy idea.
If your dimension is 1000 rows and the source data is 1000 rows, the SCD takes a few minutes. But if your dimension is 500k and the source data is 500k, it’s 5-6 hours performance. The whole idea is fundamentally flawed. On the welcome screen of the SSIS box, there should be a warning: Warning, do not use SSIS SCD Wizard if your dimension is above 1000 rows.
So, if your dimension is 1m rows, what should you do?
- Do it on set basis, not row by row basis.
- Filter the incoming source rows to the changed rows.
- Don’t everything type 2, but only necessary columns.
- Index the dimension table on the seek columns.
- Group the updates to the target dimension table.
- Archive unused rows in the dimension table.
To do update an SCD type 2 dimension, in SSIS we should do this:
- Obtain Current Time (CT).
- Stage the source table. 1m rows (S).
- Identity changed rows on the staging table. 50k rows (C).
- Put the C rows on a new staging table.
- Identify which C rows already exist on the dimension table by comparing the business key.
Say 40k rows exist (E) and 10k rows don’t exist (D).
- Out of the 40k E rows, identify which rows has changed attributes. Say there are 12k rows (CA).
- Update dimension table: expire the 12k CA rows (set Expiry Date = CT, IsActive = N).
- Insert 22k row (10k D + 12k CA) into dimension table (in 1 go, not in 2 goes1). Expiry Date = 9999, IsActive = Y.
D rows: Effective Date = 0000. CA rows: Effective Date = CT.
- In step 2, enable fast load, do not put any constraint on the staging table.
- Step 3 is done by hashing or checksum.
- Staging table C on step 4 can be used by other work flow, especially the Fact Load.
- To support step 5 and 7, index the dimension table on the Natural Key column.
- To support step 6, index the dim table on the type 2 columns to be compared.
Do the compare in the order of the indexed columns.
- We can use the Merge command to do step 7 & 8, but we need to do it twice. So the old Upsert (update + insert) works as efficiently as Merge. Probably more efficient as we don’t do delete.
This is one major weakness in SSIS compared to Informatica PowerCenter so I have no doubt Microsoft must have been building a set based SSIS SCD Wizard for the next release 🙂