Data Warehousing and Business Intelligence

19 June 2012

SSIS SCD Wizard Performance Issue

Filed under: Data Warehousing,SSIS — Vincent Rainardi @ 7:23 pm
Tags: ,

In Microsoft BI, when we do a slowly changing dimension type 2, we instinctively do SCD Wizard. The problem with SCD Wizard is the performance. For each source row coming to the SCD box, SSIS does a lookup on the target dimension table to find out if the row already exists or not. That is a crazy idea. And SSIS also update the target dimension table on row by row basis. That is also a crazy idea.

If your dimension is 1000 rows and the source data is 1000 rows, the SCD takes a few minutes. But if your dimension is 500k and the source data is 500k, it’s 5-6 hours performance. The whole idea is fundamentally flawed. On the welcome screen of the SSIS box, there should be a warning: Warning, do not use SSIS SCD Wizard if your dimension is above 1000 rows.

So, if your dimension is 1m rows, what should you do?

  1. Do it on set basis, not row by row basis.
  2. Filter the incoming source rows to the changed rows.
  3. Don’t everything type 2, but only necessary columns.
  4. Index the dimension table on the seek columns.
  5. Group the updates to the target dimension table.
  6. Archive unused rows in the dimension table.

To do update an SCD type 2 dimension, in SSIS we should do this:

  1. Obtain Current Time (CT).
  2. Stage the source table. 1m rows (S).
  3. Identity changed rows on the staging table. 50k rows (C).
  4. Put the C rows on a new staging table.
  5. Identify which C rows already exist on the dimension table by comparing the business key.
    Say 40k rows exist (E) and 10k rows don’t exist (D).
  6. Out of the 40k E rows, identify which rows has changed attributes. Say there are 12k rows (CA).
  7. Update dimension table: expire the 12k CA rows (set Expiry Date = CT, IsActive = N).
  8. Insert 22k row (10k D + 12k CA) into dimension table (in 1 go, not in 2 goes1). Expiry Date = 9999, IsActive = Y.
    D rows: Effective Date = 0000. CA rows: Effective Date = CT.

Note:

  1. In step 2, enable fast load, do not put any constraint on the staging table.
  2. Step 3 is done by hashing or checksum.
  3. Staging table C on step 4 can be used by other work flow, especially the Fact Load.
  4. To support step 5 and 7, index the dimension table on the Natural Key column.
  5. To support step 6, index the dim table on the type 2 columns to be compared.
    Do the compare in the order of the indexed columns.
  6. We can use the Merge command to do step 7 & 8, but we need to do it twice. So the old Upsert (update + insert) works as efficiently as Merge. Probably more efficient as we don’t do delete.

This is one major weakness in SSIS compared to Informatica PowerCenter so I have no doubt Microsoft must have been building a set based SSIS SCD Wizard for the next release 🙂

Advertisements

5 Comments »

  1. Hi Vincent,
    Thanks a lot for this and the other great postings. They are really useful.
    I’m new in the BI, so forgive this naive doubt: when you mention in step 3 “Identity changed rows on the staging table … is done by hashing or checksum”.
    What do you mean by “hashing” ?
    Thanks in advance !

    Comment by Mário — 20 June 2012 @ 5:18 pm | Reply

    • Hi Mario,
      what I meant was to convert all the columns in the row to string then concatenate them, and then take the SHA or MD hash using Hashbytes function in SQL Server. In Oracle we use DBMS_CRYPTO library e.g. HASH_MD5 and HASH_SH1 functions.

      Comment by Vincent Rainardi — 21 June 2012 @ 4:53 am | Reply

  2. Hi Vincent,
    I’m a long time reader, first time commenter. Have you tried the SSIS Kimball SCD (changed to Dimension Merge SCD) that Todd McDermid wrote? I have used it on several projects abd was very pleased with the performance…though none of the dimensions were more than 20k members or so. I’m just curious if you have any thoughts on it, especially at higher data volumes.

    http://dimensionmergescd.codeplex.com/

    Thanks,
    Drew

    Comment by Drew — 21 June 2012 @ 3:34 pm | Reply

    • Hi Drew,
      I have come across Todd McDermid’s Dimension Merge SCD Component but I haven’t used it.
      I heard that the performance is far better than the built-in SCD Wizard.

      Comment by Vincent Rainardi — 22 June 2012 @ 6:42 am | Reply

  3. […] Transformation created by SCD Wizard is not efficient in terms of performance, as I explained here. This is because it is doing row-by-row operation. The Dimension Merge SCD Component (link) tried […]

    Pingback by Loading a Dimension Table using SSIS | Data Warehousing and Business Intelligence — 9 September 2015 @ 7:46 am | Reply


RSS feed for comments on this post. TrackBack URI

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Create a free website or blog at WordPress.com.

%d bloggers like this: