Data Warehousing, BI and Data Science

9 September 2015

Loading a Dimension Table using SSIS

Filed under: Analysis Services — Vincent Rainardi @ 7:46 am

Someone recently asked my opinion on loading a dimension table using MS BI / SSIS, i.e. comparing the SCD Transformation, the Merge Join + Conditional Split transformation, the Change Data Capture and the Merge command. Below is my opinion. I would love to hear your comments, opinions, and experience on this topic.

SCD Transformation created by SCD Wizard is not efficient in terms of performance, as I explained here. This is because it is doing row-by-row operation. The Dimension Merge SCD Component (link) tried to address the issues in the SCD Transformation and SCD Wizard.

We can use Merge Join transformation where the souce rows are merged with the dimension rows on the business key column. Then we use a Conditional Split transformation to update or insert into the dimension table based on the surrogate key column (if SK is null then insert, if SK is not null then update). Masud Parvez provides a good example here. But this is also a row-by-row operation so I would not use this approach when the dimension table is large. In particular, I don’t like the fact that we have to read the dimension table first and make it as a source (to be merged with the incoming source rows). This is highly inefficient. I also feel that using Merge Join + Conditional Split is cumbersome, compared to using a separate Update and Insert command.

Change Data Capture in SQL Server provides an output table with the same structure as the source table, with five additional columns (Start Log Sequence Number, Sequence Value, Operation, Update Mask), see here. CDC is a mechanism to extract the data change from the source system. It is not a mechanism to insert or update a dimension table. To load a dimension table, we will need to read the CDC output table, and update or insert into the dimension table based on the _$Operation column (1 is a delete, 2 is an insert, 3 and 4 are update).

Using Merge SQL command to load a dimension table is explained quite comprehensively by Alex Whittles here. First of all, we need to be aware of the bugs with the Merge command as Aaron Bertrand explained here, some of which continue to exist even until version 2012 as they are “by design”. We also need to be aware of the concurrency/locking issues, resulting in PK Violation error, as Dan Guzman explained here. Then we need to prepare the index. Microsoft recommend a clustered index in the target table, but in the dimenson table the clustered index should be the surrogate key column, not the business key column.

Merge command is less efficient compared to separate Update and Insert commands, because when we use separate Update and Insert commands, the Insert command does the Insert for both the new rows and the existing rows in one go. The Merge command on the other hand must do the Insert twice, once for the new rows and another one for the existing rows.

My preferred approach to load a dimension table using SSIS is to bring the source rows into a staging table, then use Execute SQL command to do the following, which I think is the most efficient way:

  1. Compare the stage table and the dimension table to find out which rows already exist in the dimension table, and those rows have changed.
  2. Update the dimension table, to set the expiry date of the rows which have changed.
  3. Insert the new rows and the changed rows into the dimension table.

For example: we have a dimension table called dim_country in the Data Warehouse (DW) with 5 columns: country_key, country_code, country_name, effective_date, expiry_date, is_current. In the staging table is called country, which has 2 columns: country_code and country_name.

update dw.dim_country set expiry_date = getdate(), is_current = ‘N’
from the stage.country s
inner join dw.dim_country d on d.country_code = s.country_code
where d.country_code = s.country_code and d.country_name <> s.country_name

insert into dw.dim_country ( country_code, country_name, effective_date,
expiry_date, is_current )
select s.country_code, s.country_name, getdate() as effective_date,
’31 Dec 9999′ as expiry_date, ‘Y’ as is_current
from the stage.country s
left join dw.dim_country d on s.country_code = d.country_code
where d.country_name is null –new rows
or (d.country_name <> s.country_name) –changed rows

Advertisements

11 Comments »

  1. If it’s a large dimension table you could also create a hash key on coloumns that triggers a change.

    The hash-method is in most cases the fastest way to identify changes.
    You can either use T-SQL to make your hash key or a script task in ssis

    Comment by Henrik Matz — 9 September 2015 @ 11:21 am | Reply

  2. I found your review very interesting, especially since I’m starting out in the BI world. As you’ve already created a lab for popular fact tables, why draw some material from how popular size tables, too?

    Comment by Rafael Lima - Brazil — 24 April 2017 @ 8:45 pm | Reply

    • One more question.
      You can explain your approach in a bit more detail, unfortunately for me it was not very clear that they are like tables A, B and C mentioned in the example.

      Comment by Rafael Lima — 25 April 2017 @ 1:32 am | Reply

      • Hi Rafael, I have given an example. Hopefully this makes it clearer for you.

        Comment by Vincent Rainardi — 25 April 2017 @ 4:23 am

  3. Hi Vincent, your example was very good and it became clear to me now.
    Again his knowledge helped to enrich my work.
    Thank you!!!

    Comment by Rafael Lima - Brazil — 25 April 2017 @ 11:08 am | Reply

  4. Hello! I’ve successfully used the example to update the new records in a DW I’m working on. But now a question arose when adding a new record, to update an old data, how do we update the key of this row in the fact table? Since the key that is in the fact table points to the old line.

    Comment by Rafael Lima - Brazil — 3 April 2018 @ 2:14 pm | Reply

    • Hi Rafael, we don’t update the dimension key column in the fact table.
      Let’s say that the London store changed its manager from Mr Smith (store key = 123) to Mrs Green (store key = 456) on 15th Feb 2018. Up to 14th Feb 2018 all the London rows in the Sales fact table have store keys = 123. On 16th Feb we do not change these fact table rows. Their store keys are still 123. We do not change them to 456. Only new fact rows inserted have store key = 456.
      This way, when we analyse the data in the fact table, the sales amount under Mr Smith is correct.

      Comment by Vincent Rainardi — 3 April 2018 @ 5:31 pm | Reply

  5. Thanks for the explanation, I understood the logic. So it depends on the business rule, right?

    Explaining my situation better, there is a table with the names of the clients, where commercial attendants often miss the clients ‘names and correct the words by changing the clients’ names to the correct form. Therefore, this new line still needs to maintain the history of the previous fact lines, before the correction.

    I implemented the dimension in DW, so that when there are these corrections in the names the system executes type 2 of Slowly Changing.

    So following what you’ve explained now, in my case it’s best to use the zero type of Slowly Changing, correct? In order to correct the data and also keep the history in the fact table.

    Comment by Rafael Lima — 3 April 2018 @ 11:45 pm | Reply

    • Yes it depends on the business rule Rafael. I doubt that you want the new fact rows to be associated with the original client name, because the original client name is incorrect. I think what you meant is: you want to the old fact rows to be associated with the new client name. So if a client called Mrs Joanna Harricord (client key = 123) purchased a product on 15th Feb 2018, and on 20th Feb 2018 her name was corrected to Mrs Joanna Harriford (client key = 456), then you want this fact row to point to 456. You want to this fact row to have client SK = 456, not 123.

      In this case I would suggest in your client dimension you have a column called ClientName which is a type 1 column, and another column called OriginalClientName, which is a type 0 column. Type 1 means we overwrite, Type 0 means we don’t change the value. So on 15th Feb 2018 this row in the client dimension, a row with client key = 123 has both ClientName column and OriginalClientName column set to Mrs Joanna Harricord. And on 20th Feb 2018 we change the ClientName column on this row to Mrs Joanna Harriford, but we do not change the OriginalClientName. We do not create a new row in the client dimension, because ClientName is a type 1 column. Because the client key in the fact table is still 123, that fact row will be associated with the new client name.

      I don’t see the reason why you want to create a new row in the client dimension if the client name changes, because as you say it is a correction. So you want to overwrite the incorrect client name with the right client name. So you want the client name column to be type 1 (overwrite), and not type 2 (create a new row). In the above case the original client name will be stored in the OriginalClientName column, which is a type 0 (do not change the value). I hope this helps, Vincent

      Comment by Vincent Rainardi — 4 April 2018 @ 4:49 am | Reply

  6. Hello Vincent, your tips have been very valuable to my project. You cleared the logic very well and that’s what I need, so I’ll follow your advice.

    Increasingly I learn not to leave things that are simple, complex.

    I greatly appreciate your help.

    Comment by Rafael Lima - Brazil — 4 April 2018 @ 11:26 am | Reply


RSS feed for comments on this post. TrackBack URI

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

w

Connecting to %s

Blog at WordPress.com.

%d bloggers like this: