Data Platform and Data Science

26 May 2012

How to Populate a Fact Table using SSIS (part 2)

Filed under: Data Warehousing, SSIS — Vincent Rainardi @ 9:47 pm

In part 1 (link) we populated a fact table (FactHire) by reading the source table (Hire), looking up three dimensions (DimCustomer, DimVan and DimDate) to get the dimension surrogate keys, and then inserting the rows into FactHire.

At the end of that article I said there were two things we still needed to do:

  1. Process the no match rows
    If the Hire table has a CustomerId which does not exist in the Customer dimension, we need to set the CustomerKey to 0.
  2. Delete fact rows, if they exist
    If today's rows already exist in the target fact table, delete them first.

So let’s get started.

Process the No Match Rows

In the Lookup transforms we try to convert the natural keys from the Hire table (Customer ID, Registration No and Hire Date) into the dimension keys (CustomerKey, VanKey and HireDateKey) by looking them up in the dimension tables. Consider a CustomerId of 123 in the Hire table: we search DimCustomer and cannot find CustomerId 123. In this case we need to set the CustomerKey to 0.
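
A quick way to see which rows will fall into this category is to query the source against the dimension. This is just a diagnostic sketch, assuming the Hire table and DimCustomer are reachable from a single query (for example via a staging database):

-- which Hire rows have a CustomerId that is missing from DimCustomer?
select h.HireId, h.CustomerId
from Hire h
where not exists
  (select 1 from DimCustomer d where d.CustomerId = h.CustomerId)

Any rows this returns are the ones that will need CustomerKey = 0.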

To do that, we need to process the "No Match Output" of the Lookup transform (the second green arrow), then merge it back into the main flow. Here's how it will look when finished:

So let's begin. The first thing to do is to handle the rows with no matching entries. Double click on [Get Customer Key] and change [Specify how to handle rows with no matching entries] from [Fail component] to [Redirect rows to no match output], like this:

Now, as you can see above, there are 2 green arrows coming out of the [Get Customer Key] lookup transformation. The [Lookup Match Output] green arrow now goes to [Sort on HireId1], and we send the [Lookup No Match Output] to a Derived Column transformation called [Mismatched CustomerId]. In this [Mismatched CustomerId] we simply set the CustomerKey column to 0, like this:

We would like to combine the Match Output and the No Match Output using a Merge transformation, like this:

But the Merge transformation requires both of its inputs to be sorted, so we have to pass each of them through a Sort transform, like this:

In [Sort on HireId2] we sort the data set on HireId and pass all the other columns through to the output, like this (and we do exactly the same in [Sort on HireId1]):

So why do we sort on HireId? Why not on another column? Because HireId is a unique identifier: it uniquely identifies a hire transaction. Imagine this scenario: there are 1000 rows in the Hire table in HireBase; 997 of them have a CustomerId that exists in DimCustomer, and 3 of them have a CustomerId that doesn't. So the Match Output will have 997 rows and the No Match Output will have 3 rows. When we merge them again, we want to use a column that was a unique ID before the flow split up, so that when the rows merge this unique ID identifies them properly.
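
The data flow above is the SSIS way of doing it. As a point of reference, here is a set-based sketch of the same lookup-with-default logic in plain SQL (table and column names as in this example):

-- a LEFT JOIN keeps the unmatched Hire rows, and ISNULL defaults
-- their CustomerKey to 0 (the unknown row in DimCustomer)
select h.HireId, isnull(d.CustomerKey, 0) as CustomerKey
from Hire h
left join DimCustomer d on d.CustomerId = h.CustomerId

This is not what the package executes, but it is the logic that the lookup, derived column and merge together implement.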

Now that we have merged the streams for the first lookup, we need to do the same for the 2nd and 3rd lookups (Van Key and Hire Date Key). But this time we don't need to sort the outputs of the derived column and the lookup, because they are already sorted:

Now our lookups are complete and proper. We have processed the No Match rows.

Delete Target Rows if Exist

One of the basic principles of data warehouse ETL is RERUNNABILITY, meaning that we need to be able to run the ETL several times a day without causing any issues in the data. In terms of loading a periodic snapshot fact table, this means that we need to delete today's rows in the target fact table before we load the new rows.

Let's say that this morning the ETL ran, pumping 1000 rows into the FactHire table, all with Snapshot Date = today. If we now (say at 2pm) ran it again, it would pump in another 1000 rows with Snapshot Date = today, duplicating the data. That is why we need to delete today's rows in the target fact table first.

Unfortunately we can't use truncate in this case, because we only want to delete today's rows, not yesterday's or any earlier date's. We need an index created JUST on Snapshot Date. I repeat: JUST on Snapshot Date. We don't want a covering index or any multi-key index; we want Snapshot Date alone. Why? Because if we put other columns in the index, the optimizer may not use it when we delete today's rows. Can we use table partitioning? Yes we can. We can (and sometimes we should, if the volume warrants it) partition a periodic snapshot fact table on Snapshot Date. Sometimes we can't, because the partitioning key is needed for something more selective, such as Product Id, but in the majority of cases we can.
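
For example, the index could be created like this (the index name is just illustrative):

-- a single-column index on the snapshot date,
-- so the daily delete can seek straight to today's rows
create nonclustered index IX_FactHire_SnapshotDateKey
on FactHire (SnapshotDateKey)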

Anyway, I digress. We need to delete today's rows. Should you want to discuss partitioning further, you can email me at vrainardi@gmail.com.

Deleting today's rows is done with a simple OLE DB Command transformation, like this:

And inside, the OLE DB Command looks like this:

As you can see, the SqlCommand is "delete from FactHire where SnapshotDateKey = 20120526". But we don't want to hardcode the date! We want it to be today's date, so we make it like this:


-- convert today's date to a yyyymmdd integer, e.g. 20120526
delete from FactHire
where SnapshotDateKey = convert(int, replace(convert(varchar(10), GETDATE(), 20), '-', ''))

The above snippet converts today's date into an integer. For a real warehouse project this is not acceptable, because the batch could start 1 millisecond before midnight and this OLE DB Command could end up executing on the next day. In a real warehouse project, the first thing to do in the ETL control flow is to set the batch date and time using a script task, like this:

And the script task populates a package-level variable, so that all the data flows in the package can use it. Of course, in a real project there would be a lot of dimension population data flows before it hits the fact table population.
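
To sketch that idea (the variable name below is illustrative, and an Execute SQL Task can capture the date just as well as a script task): set the batch date key once at the start of the control flow, map the result to a package variable such as User::BatchDateKey, and parameterise the delete with it:

-- run once at the start of the control flow; map the single-row result
-- to a package variable, e.g. User::BatchDateKey
select convert(int, convert(varchar(8), GETDATE(), 112)) as BatchDateKey

-- an Execute SQL Task before the data flow then runs the delete,
-- with ? mapped to that variable on its Parameter Mapping page
delete from FactHire where SnapshotDateKey = ?

That way, every step of the run uses the same batch date, even if the run straddles midnight.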

So that's it. We have done the 2 things we set out to do at the beginning of this article: process the No Match rows, and delete today's rows if they exist.

Hope this helps,

Vincent Rainardi, 26/5/2012

22 Comments

  1. […] As this is long enough. I’m going to put it as a separate article (link). […]

    Pingback by How to Populate a Fact Table using SSIS (part 1) « Data Warehousing and Business Intelligence — 26 May 2012 @ 9:57 pm

  2. Thank you very much, very useful for me as a beginner in BI.

    Comment by Med — 11 February 2015 @ 7:03 pm

  3. Hi, thank you for this great explanation.
    I have a question though: in my data warehouse the data is not updated daily but once a month, some once a quarter, and others once a year.
    So how do I deal, for example, with the "delete this month's rows from the fact table" part?
    Thank you.

    Comment by F_my — 8 April 2015 @ 10:38 pm

    • Hi Feten,
      “Delete this month’s data and repopulate this month’s data” is specifically for monthly periodic snapshot fact tables.
      For daily periodic snapshot fact tables, if you load it once a day, you don’t need to delete anything. But if you load intraday, then you will need to delete today’s rows.
      For yearly periodic snapshot fact tables, you need to delete this year’s data.
      For transactional fact tables, you need to load adjustments and corrections as NEW ROWS, not as updates.

      Comment by Vincent Rainardi — 9 April 2015 @ 9:44 pm

  4. Thanks once again for explaining this so clearly! Maybe you should consider making some videos as well.

    Comment by Danny — 2 May 2015 @ 4:57 am

  5. Wouldn't it be easier and better for performance to use a conditional split and filter the values smaller/bigger than getdate(),
    instead of deleting the fact data and refilling the table?

    Comment by Tom — 3 June 2015 @ 11:28 am

    • You are right Tom, in practice we don't delete or truncate the fact table. This is just simplified so that it's easier to understand.

      Comment by Vincent Rainardi — 4 June 2015 @ 7:08 pm

      • OK, then I got it right.
        Thanks for that great explanation!

        Comment by Tom — 8 June 2015 @ 11:29 am

    • By doing this, is the data not duplicated when the fact table receives the data? Or should I store the snapshot column in the fact table in DATETIME format?

      In the case of rows that do not meet the condition, what should we do with them?

      Comment by Rafael Lima - Brazil — 16 May 2017 @ 12:46 pm

      • Hi Rafael, when populating a fact table we do not delete all the rows from the fact table. We only insert new rows for today, with the snapshot date column = today's date. If today's rows already exist (for example because the first load was incorrect so we load it again), then we delete the existing today's rows before inserting the new ones.

        In the case of rows that do not meet the condition, we can do two things:
        1. We do not insert the bad row into the fact table.
        2. We insert the bad row into the fact table, after we "fix" it.
        For example:
        1. The source data has 20,000 rows, all with business date = "17 May 2017" (today's date). But there is 1 row with a business date of "65 April 2017", which is an invalid business date. This row has an amount of $100. In this case we do not insert the row, and the total amount for today will be $100 less than in the source data. We decide not to insert this row because we don't know under which Snapshot Date it should be inserted. Is it 17 May 2017? Is it 15 May 2017? We don't know.
        2. The source data has 20,000 rows, all with business date = "17 May 2017" (today's date). But there is 1 row with a Shipping Date of "65 April 2017", which is an invalid Shipping Date. In this case we still insert the row into the fact table: because we cannot find that Shipping Date in the date dimension, we set the Shipping Date Key column in the fact table to zero, which points to the unknown row in the Date Dimension. In this case the total amount for today in the fact table will be the same as in the source data.

        Comment by Vincent Rainardi — 17 May 2017 @ 6:40 am

  6. Considering the objective of this process flow, there should be no need to do a merge. Merging first requires sorting, which typically takes a really long time (hours). Perhaps using 'Union All' would be sufficient, taking far less time.

    Comment by Jed — 7 March 2016 @ 2:56 am

  7. Great Work! Thanks a lot Vincent

    Comment by Srini — 14 April 2016 @ 7:05 pm

  8. Hi

    Usually in a Dim table we maintain history. Say in DimCustomer there is a chance of having more than one record for the same customer ID, which means the lookup join will fetch more than one surrogate key and result in duplicate records. But we need only the latest surrogate key for the given customer ID. How do we handle this scenario?

    Thanks

    Comment by Karthik — 22 August 2016 @ 6:18 am

    • Hi Karthik, you would need to find out when the sale happened, and use this sale date to determine which customer row to fetch (which customer key should be put in that fact row)

      Comment by Vincent Rainardi — 22 August 2016 @ 7:10 am

  9. Thanks Vincent for the very quick response.

    Comment by Karthik — 22 August 2016 @ 3:29 pm

  10. Hello Vincent!
    In the process (1. Process the no match rows),
    I do not quite understand why it is necessary to set the CustomerKey to 0 if the CustomerId does not exist in the Customer dimension.
    Instead, shouldn't we add this row, which does not yet exist, to the dimension?

    Comment by Rafael Lima — 10 May 2017 @ 1:06 am

    • Hi Rafael, you are right, before we load data into the fact table, we create new customers in the dimension table. So in theory, when we load data into the fact table, we have all the customers. But sometimes, there are 1 or 2 customers which we can’t find in the dimension table and in this case in the fact table we set the CustomerKey column to 0.

      Comment by Vincent Rainardi — 10 May 2017 @ 4:41 am

      • Hi Vincent,
        Thank you so much for a very, very good and practical example.
        My question is about Rafael's comment: what do we have to do next after setting the CustomerKey to 0? Do we have to reload the Dim tables? And when?
        Thanks again.
        Ali

        Comment by aliyassinigmailcom — 6 December 2017 @ 10:44 pm

      • Very good question Ali. Apologies for the delay in replying. In most cases the customer dimension is loaded every day, so we will have the missing customer in the customer dimension the next day, or within 2 days at the most. This is because in most companies the source of the customer dimension is the customer master data, not the sales system. Say a sale happened on Monday in the sales system, for a new customer who was set up in the sales system that day; that customer won't be created in the customer master system until Tuesday. The reason the data warehouse takes customer data from the customer master system is the enrichment of attributes, including external customer data such as credit rating and social demography, and internal customer data such as customer classifications. The customer master system is usually a CRM system, such as Salesforce.

        Now to answer your question: when will the customer surrogate key column in the fact table be corrected?
        In most cases the next day, but occasionally 2 days later: whenever the customer is set up in the customer master system.
        Note: if the fact table is a snapshot fact table, the fact row for the next snapshot day will be correct, but the "same day" fact row will keep a customer SK of 0. So for the case above, on Monday the customer SK for that sales fact row will be 0, but on Tuesday it will be correct.
        If the fact table is a transaction fact table, then when the customer has been set up in the customer master the next day, the ETL process needs to find the fact row and correct it. Unlike a snapshot fact table, which has many fact rows for that sale (1 fact row per snapshot date), a transaction fact table has only 1 fact row for that sale in the entire table.

        I hope this helps.

        Comment by Vincent Rainardi — 10 December 2017 @ 8:33 am

  11. Hi Vincent,
    Thanks again, that was very helpful.
    I'm working on your example (HireBase), and I thought maybe we can prevent putting a 0 customer SK in FactHire while there are new customers in MasterCustomer, by finding the new customers and inserting them into DimCustomer before populating FactHire. Is this the right solution?
    Thank you so much.
    Ali

    Comment by aliyassinigmailcom — 11 December 2017 @ 7:35 pm

