Someone recently asked my opinion on loading a dimension table using MS BI / SSIS, i.e. comparing the SCD Transformation, the Merge Join + Conditional Split transformation, the Change Data Capture and the Merge command. Below is my opinion. I would love to hear your comments, opinions, and experience on this topic.
SCD Transformation created by SCD Wizard is not efficient in terms of performance, as I explained here. This is because it is doing row-by-row operation. The Dimension Merge SCD Component (link) tried to address the issues in the SCD Transformation and SCD Wizard.
We can use Merge Join transformation where the souce rows are merged with the dimension rows on the business key column. Then we use a Conditional Split transformation to update or insert into the dimension table based on the surrogate key column (if SK is null then insert, if SK is not null then update). Masud Parvez provides a good example here. But this is also a row-by-row operation so I would not use this approach when the dimension table is large. In particular, I don’t like the fact that we have to read the dimension table first and make it as a source (to be merged with the incoming source rows). This is highly inefficient. I also feel that using Merge Join + Conditional Split is cumbersome, compared to using a separate Update and Insert command.
Change Data Capture in SQL Server provides an output table with the same structure as the source table, with five additional columns (Start Log Sequence Number, Sequence Value, Operation, Update Mask), see here. CDC is a mechanism to extract the data change from the source system. It is not a mechanism to insert or update a dimension table. To load a dimension table, we will need to read the CDC output table, and update or insert into the dimension table based on the _$Operation column (1 is a delete, 2 is an insert, 3 and 4 are update).
Using Merge SQL command to load a dimension table is explained quite comprehensively by Alex Whittles here. First of all, we need to be aware of the bugs with the Merge command as Aaron Bertrand explained here, some of which continue to exist even until version 2012 as they are “by design”. We also need to be aware of the concurrency/locking issues, resulting in PK Violation error, as Dan Guzman explained here. Then we need to prepare the index. Microsoft recommend a clustered index in the target table, but in the dimenson table the clustered index should be the surrogate key column, not the business key column.
Merge command is less efficient compared to separate Update and Insert commands, because when we use separate Update and Insert commands, the Insert command does the Insert for both the new rows and the existing rows in one go. The Merge command on the other hand must do the Insert twice, once for the new rows and another one for the existing rows.
My preferred approach to load a dimension table using SSIS is to bring the source rows into a staging table, then use Execute SQL command to do the following, which I think is the most efficient way:
- Identify which source rows already exist in the dimension table (A) and new rows which don’t exists in the dimension table (B) by comparing the business key.
- In A, identify rows with changed attributes changes (only the columns we are monitoring) – call this C.
- Update the dimension table to expire the C rows (not A).
- Insert both B and C rows into the dimension table in one go.