In part 1 (link) we have populated a fact table (FactHire) by reading the source table (Hire) and looking up 3 dimensions (DimCustomer, DimVan and DimDate) to get the surrogate dimension keys and then insert into FactHire.
In the end of the article I said that there were 2 things that we needed to do:
- Process the no match rows
If in the Hire table we have a CustomerId which does not exist in the Customer Dimension, we need to set the CustomerKey to 0.
- Delete fact rows, if exists
If today’s rows exists in the target fact table, delete them first.
So let’s get started.
Process the No Match Rows
The Lookup transform we try to convert the natural keys from the Hire table (Customer ID, Registration No and Hire Date) to the dimension keys (CustomerKey, VanKey and HireDateKey) by looking up the dimension table. Consider that in the Hire table we have a CustomerId of 123. We search DimCustomer and we couldn’t find CustomerId 123. In this case we need to set the Customer Key to 0.
To do that, we need to process the “No Match Output” of the Lookup transform (the second green row). Then merge it back into the main flow. Here’s how it will look when finished:
So let’s begin. The first thing to do is to handle the “rows with no match entries”. Double click on [Get Customer Key] and change the [Specify how to handle rows with no matching entries] from [Failed component] to [Redirect rows to no match output], like this:
Now as you can see above there are 2 green arrows coming out of Get Customer Key lookup transformation. The [Lookup Matched Output] green arrow is now going to [Sort on HireId1], and we set the [Lookup No Match Output] to a Derived Column transformation called [Mismatched CustomerId]. In this [Mismatched Customer Id], we are simply setting the CustomerKey column = 0, like this:
We would like to combine the output of Match Output and No Match Output using a Merge transformation, like this:
But the Merge transformation requires us to sort each of the two inputs. So we have to pass them through a Sort transform like this:
In the [Sort on HireId2], we sort the data set on HireId, and pass all other columns through to the output, like this: (and we do exactly the same on [Sort on HireId 1])
So why do we sort on HireId? Why not on other column? Because HireId is a unique identifier. It uniquely identifies a hire transaction. Imagine this scenario: there are 1000 rows on the Hire table in HireBase. 997 of them has CustomerId exist in DimCustomer, and 3 of them has CustomerId which don’t exist in DimCustomer. So the Match Output will have 997 rows and No Match Output will have 3 rows. When we merge them again, we want to use a column which was a unique ID before they split up, so that when they merge this unique ID will identify them properly.
Now that we have merged the first lookup, we need to do the same on the 2nd and 3rd Lookups (Van Key and Hire Date Key). But this time we don’t need to sort the output of the derived column and lookup, because they are already sorted:
Now our lookup is complete and proper. We have processed the No Match rows.
Delete Target Rows if Exist
One of the basic principles of data warehouse ETL is “RERUNABILITY”. Meaning that we need to be able to run the ETL several times a day without causing any issue to the data. In terms of loading a Period Snapshot Fact table, this means that we need to delete the [today’s date rows] in the target fact table, before we load the new rows.
Let’s say that this morning the ETL ran, pumping 1000 rows into the FactHire table, all with Snapshot Date = today. And now (say 2pm) if we were to run it again, it will pump another 1000 rows with Snapshot Date = today, causing duplication of data. Because of this we need to delete today’s rows in the target fact table.
Unfortunately we can’t use truncate in this case because we only want to delete today’s rows, not yesterday’s ones or any other dates before today. We need an index to be created JUST on Snapshot Date. I repeat: JUST on the Snapshot Date. We don’t want covering index or query help index or any multi key index, we just want Snapshot Date alone. Why? Because if we put other column on the index it may not get used when we delete today’s rows because the selectivity is too low. Can we use table partitioning? Yes we can. We can (sometimes we should, if the volume warrants it) partition a periodic snapshot fact table on Snapshot Date. But sometimes we can’t because the partitioning key is used for something more selecting, such as on product Id. But in majority of the cases we can.
Any way I digress. We need to delete today’s rows. Should you want to discuss further about partitioning you can email me at firstname.lastname@example.org.
Deleting today’s rows is a simple OLEDB SQL Command transformation like this:
And inside, the OLE DB Command look like this:
As you can see the SqlCommand is “Delete from FactHire where SnapshotDateKey = 20120526”. But we don’t want to hardcode the date! We want it to be today’s date. So we make it like this:
delete from FactHire where SnapshotDateKey = convert(int, replace( convert(varchar(10), GETDATE(), 20) ,'-',''))
The above code snippet will convert today’s date into an integer. For real warehouse project this is not acceptable, because the batch could start 1 millisecond before midnight and this OLE DB Command could end up executing on the next day. For real warehouse project the first thing to do in the ETL Control Flow is to set the Batch Date and Time using a script task, like this:
And the script task populates a global variable so that all workflows in the package can use it. Of course in the real project there would be a lot of dimension population work flow, before it hits the fact table population.
So that’s it, we have done 2 things that we set out at the beginning of this article: to process the No Match Output and to delete today’s rows if any.
Hope this helps,
Vincent Rainardi, 26/5/2012