Data Warehousing and Business Intelligence

9 September 2015

Loading a Dimension Table using SSIS

Filed under: Analysis Services — Vincent Rainardi @ 7:46 am

Someone recently asked my opinion on loading a dimension table using MS BI / SSIS, i.e. comparing the SCD Transformation, the Merge Join + Conditional Split transformation, the Change Data Capture and the Merge command. Below is my opinion. I would love to hear your comments, opinions, and experience on this topic.

SCD Transformation created by SCD Wizard is not efficient in terms of performance, as I explained here. This is because it is doing row-by-row operation. The Dimension Merge SCD Component (link) tried to address the issues in the SCD Transformation and SCD Wizard.

We can use Merge Join transformation where the souce rows are merged with the dimension rows on the business key column. Then we use a Conditional Split transformation to update or insert into the dimension table based on the surrogate key column (if SK is null then insert, if SK is not null then update). Masud Parvez provides a good example here. But this is also a row-by-row operation so I would not use this approach when the dimension table is large. In particular, I don’t like the fact that we have to read the dimension table first and make it as a source (to be merged with the incoming source rows). This is highly inefficient. I also feel that using Merge Join + Conditional Split is cumbersome, compared to using a separate Update and Insert command.

Change Data Capture in SQL Server provides an output table with the same structure as the source table, with five additional columns (Start Log Sequence Number, Sequence Value, Operation, Update Mask), see here. CDC is a mechanism to extract the data change from the source system. It is not a mechanism to insert or update a dimension table. To load a dimension table, we will need to read the CDC output table, and update or insert into the dimension table based on the _$Operation column (1 is a delete, 2 is an insert, 3 and 4 are update).

Using Merge SQL command to load a dimension table is explained quite comprehensively by Alex Whittles here. First of all, we need to be aware of the bugs with the Merge command as Aaron Bertrand explained here, some of which continue to exist even until version 2012 as they are “by design”. We also need to be aware of the concurrency/locking issues, resulting in PK Violation error, as Dan Guzman explained here. Then we need to prepare the index. Microsoft recommend a clustered index in the target table, but in the dimenson table the clustered index should be the surrogate key column, not the business key column.

Merge command is less efficient compared to separate Update and Insert commands, because when we use separate Update and Insert commands, the Insert command does the Insert for both the new rows and the existing rows in one go. The Merge command on the other hand must do the Insert twice, once for the new rows and another one for the existing rows.

My preferred approach to load a dimension table using SSIS is to bring the source rows into a staging table, then use Execute SQL command to do the following, which I think is the most efficient way:

  1. Compare the stage table and the dimension table to find out which rows already exist in the dimension table, and those rows have changed.
  2. Update the dimension table, to set the expiry date of the rows which have changed.
  3. Insert the new rows and the changed rows into the dimension table.

For example: we have a dimension table called dim_country in the Data Warehouse (DW) with 5 columns: country_key, country_code, country_name, effective_date, expiry_date, is_current. In the staging table is called country, which has 2 columns: country_code and country_name.

update dw.dim_country set expiry_date = getdate(), is_current = ‘N’
from the stage.country s
inner join dw.dim_country d on d.country_code = s.country_code
where d.country_code = s.country_code and d.country_name <> s.country_name

insert into dw.dim_country ( country_code, country_name, effective_date,
expiry_date, is_current )
select s.country_code, s.country_name, getdate() as effective_date,
’31 Dec 9999′ as expiry_date, ‘Y’ as is_current
from the stage.country s
left join dw.dim_country d on s.country_code = d.country_code
where d.country_name is null –new rows
or (d.country_name <> s.country_name) –changed rows

Advertisements

6 Comments »

  1. If it’s a large dimension table you could also create a hash key on coloumns that triggers a change.

    The hash-method is in most cases the fastest way to identify changes.
    You can either use T-SQL to make your hash key or a script task in ssis

    Comment by Henrik Matz — 9 September 2015 @ 11:21 am | Reply

  2. I found your review very interesting, especially since I’m starting out in the BI world. As you’ve already created a lab for popular fact tables, why draw some material from how popular size tables, too?

    Comment by Rafael Lima - Brazil — 24 April 2017 @ 8:45 pm | Reply

    • One more question.
      You can explain your approach in a bit more detail, unfortunately for me it was not very clear that they are like tables A, B and C mentioned in the example.

      Comment by Rafael Lima — 25 April 2017 @ 1:32 am | Reply

      • Hi Rafael, I have given an example. Hopefully this makes it clearer for you.

        Comment by Vincent Rainardi — 25 April 2017 @ 4:23 am

  3. Hi Vincent, your example was very good and it became clear to me now.
    Again his knowledge helped to enrich my work.
    Thank you!!!

    Comment by Rafael Lima - Brazil — 25 April 2017 @ 11:08 am | Reply


RSS feed for comments on this post. TrackBack URI

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Blog at WordPress.com.

%d bloggers like this: