Data Platform and Data Science

28 May 2012

Month Attributes in the Date Dimension

Filed under: Data Warehousing — Vincent Rainardi @ 6:36 am

When building a date dimension we often have month attributes such as:

  • Month Name: January, February, March
  • Month Number: 1, 2, 3

The primary function of the month attributes is to support drilling down from year or quarter to date. For that purpose, the attributes above are useless, because they do not have the year element. You cannot drill down from 2012 to January; you can drill down from 2012 to January 2012. So you need the year element in the month attribute, like this:

  • Month Name: January 2012, February 2012, March 2012 (*1)
    or: 2012 Jan, 2012 Feb, 2012 Mar (*2)
  • Month Number: 2012-01, 2012-02, 2012-03 (*3)

In my experience, some users prefer *3 (the year in front and all numeric), so that it sorts automatically. In SSAS you can set a sort key so that *1 and *2 are sorted correctly too, but once the data has been exported to Excel, it is no longer sorted.
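If the date dimension is populated in T-SQL, all three formats (plus a numeric sort key for SSAS) can be derived directly from the date. A minimal sketch; the variable and the column aliases are just for illustration:

declare @d date = '2012-01-15'
select
 datename(month, @d) + ' ' + datename(year, @d) as MonthName1, -- *1: January 2012
 datename(year, @d) + ' ' + left(datename(month, @d), 3) as MonthName2, -- *2: 2012 Jan
 convert(varchar(7), @d, 20) as MonthNumber, -- *3: 2012-01, sorts correctly even as text
 year(@d) * 100 + month(@d) as MonthSortKey -- 201201, a numeric sort key for SSAS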

This is also the case with the quarter attribute. It is no good having Q1, Q2, Q3 without the year; we need the year there to be able to drill down, i.e. 2012 Q1, 2012 Q2, 2012 Q3. This way, the quarter attribute can be made into a hierarchy:

The same goes for the date attribute; it needs to have the year and month elements, i.e.

  • Date: 2012-05-14, 2012-05-15, 2012-05-16
  • or 2012 May 14, 2012 May 15, 2012 May 16
  • or 14 May 2012, 15 May 2012, 16 May 2012
  • or 14/5/2012, 15/5/2012, 16/5/2012

So that the date hierarchy will look like this in SSAS or BO/Cognos:
(left: all numeric with year in the front, right: month name in character)

My point here is: to make a date hierarchy (in SSAS/BO/Cognos), the month attribute must have more members than the quarter attribute, and the quarter attribute must have more members than the year attribute, so that we can drill down from year to quarter to month to date.

26 May 2012

Principles of Explaining Things

Filed under: Other — Vincent Rainardi @ 10:19 pm

Some of the principles I try to follow when writing a post/article are:

  1. Learn by example
    It is easier for the reader to understand if we give an example; it is often easier to understand the example than the definition. I believe that humans learn best from examples, not from definitions. For example, rather than explaining how to populate a fact table, it is better to actually do it in SSIS and take the readers through the process, step by step. An example gives the reader/learner something to copy.
  2. Show the end result first
    It is quicker to make readers understand a procedure if we show them what the end result looks like before we begin. For example, before explaining the 6 steps of how to build an SSIS workflow that populates a dimension, readers will have a clearer understanding if we first display the screenshot: “This is how it will look when we finish. Now let’s do it step by step.” Because now the readers have the big picture in their mind of what they are trying to achieve.
  3. Start with a simple one first, then build up
    Rather than trying to be perfect and explaining all 16 steps required to build a cube, it is better to start with just 6 steps, the most basic ones. Then, after we have succeeded in making the reader understand the basic 6 steps, we introduce 3 additional steps. And once they truly grasp the 9 steps, we add 3 more, and so on until we reach the complete 16 steps. But if you try to explain all 16 at the beginning, it will take them much longer to understand.
  4. It’s easier to show than to explain
    Rather than typing 1000 words trying to explain Resource Governor in SQL Server, it is quicker to just show them the screens. Ideally a video capture, but if you can’t, then static pictures will do. Pictures are much better than words at explaining things to other people.

How to Populate a Fact Table using SSIS (part 2)

Filed under: Data Warehousing,SSIS — Vincent Rainardi @ 9:47 pm

In part 1 (link) we populated a fact table (FactHire) by reading the source table (Hire), looking up 3 dimensions (DimCustomer, DimVan and DimDate) to get the dimension surrogate keys, and then inserting the rows into FactHire.

At the end of that article I said that there were 2 things we still needed to do:

  1. Process the no match rows
    If in the Hire table we have a CustomerId which does not exist in the Customer Dimension, we need to set the CustomerKey to 0.
  2. Delete fact rows, if they exist
    If today’s rows already exist in the target fact table, delete them first.

So let’s get started.

Process the No Match Rows

In the Lookup transforms we try to convert the natural keys from the Hire table (Customer ID, Registration No and Hire Date) into the dimension keys (CustomerKey, VanKey and HireDateKey) by looking up the dimension tables. Consider that in the Hire table we have a CustomerId of 123. We search DimCustomer and we can’t find CustomerId 123. In this case we need to set the CustomerKey to 0.
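Before building this in SSIS, it may help to see the end result in set-based terms. The query below is an illustration only (the package does it with Lookup and Derived Column transforms, not SQL): a left join from Hire to DimCustomer, defaulting unmatched rows to CustomerKey 0.

-- Illustration only: the set-based equivalent of the Customer Key lookup,
-- with no-match rows defaulted to CustomerKey 0
select h.HireId, isnull(c.CustomerKey, 0) as CustomerKey
from HireBase.dbo.Hire h
left join TopHireDW.dbo.DimCustomer c on c.CustomerId = h.CustomerId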

To do that in SSIS, we need to process the “No Match Output” of the Lookup transform (the second green arrow), then merge it back into the main flow. Here’s how it will look when finished:

So let’s begin. The first thing to do is to handle the “rows with no match entries”. Double click on [Get Customer Key] and change [Specify how to handle rows with no matching entries] from [Fail component] to [Redirect rows to no match output], like this:

Now, as you can see above, there are 2 green arrows coming out of the [Get Customer Key] Lookup transformation. The [Lookup Match Output] green arrow now goes to [Sort on HireId1], and we connect the [Lookup No Match Output] to a Derived Column transformation called [Mismatched CustomerId]. In [Mismatched CustomerId], we simply set the CustomerKey column to 0, like this:

We would like to combine the Match Output and the No Match Output using a Merge transformation, like this:

But the Merge transformation requires us to sort each of the two inputs. So we have to pass them through a Sort transform like this:

In [Sort on HireId2], we sort the data set on HireId and pass all other columns through to the output, like this (we do exactly the same in [Sort on HireId1]):

So why do we sort on HireId? Why not on another column? Because HireId is a unique identifier: it uniquely identifies a hire transaction. Imagine this scenario: there are 1000 rows in the Hire table in HireBase; 997 of them have a CustomerId that exists in DimCustomer, and 3 of them have a CustomerId which doesn’t. So the Match Output will have 997 rows and the No Match Output will have 3 rows. When we merge them again, we want to use a column which was a unique ID before they split up, so that when they merge, this unique ID identifies each row properly.

Now that we have merged the first lookup, we need to do the same on the 2nd and 3rd Lookups (Van Key and Hire Date Key). But this time we don’t need to sort the output of the derived column and lookup, because they are already sorted:

Now our lookup is complete and proper. We have processed the No Match rows.

Delete Target Rows if Exist

One of the basic principles of data warehouse ETL is RERUNNABILITY, meaning that we need to be able to run the ETL several times a day without causing any issues to the data. In terms of loading a periodic snapshot fact table, this means that we need to delete today’s rows in the target fact table before we load the new rows.

Let’s say that this morning the ETL ran, pumping 1000 rows into the FactHire table, all with Snapshot Date = today. If we ran it again now (say at 2pm), it would pump another 1000 rows with Snapshot Date = today, causing duplication of data. Because of this we need to delete today’s rows in the target fact table first.

Unfortunately we can’t use truncate in this case, because we only want to delete today’s rows, not yesterday’s or those of any earlier dates. What we need is an index created JUST on Snapshot Date. I repeat: JUST on Snapshot Date. We don’t want a covering index or any multi-key index; we want Snapshot Date alone. Why? Because if we put other columns on the index it may not get used when we delete today’s rows, as the selectivity becomes too low. Can we use table partitioning? Yes we can. We can (and sometimes we should, if the volume warrants it) partition a periodic snapshot fact table on Snapshot Date. But sometimes we can’t, because the partitioning key is needed for something more selective, such as Product Id. In the majority of cases, though, we can.
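For example, a minimal sketch (the index name is just a suggestion):

-- A single-column index on Snapshot Date alone,
-- so the daily delete can seek straight to today's rows
create index IX_FactHire_SnapshotDateKey on FactHire (SnapshotDateKey)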

Anyway, I digress. We need to delete today’s rows. Should you want to discuss partitioning further, you can email me at vrainardi@gmail.com.

Deleting today’s rows is a simple OLE DB Command transformation, like this:

And inside, the OLE DB Command looks like this:

As you can see the SqlCommand is “Delete from FactHire where SnapshotDateKey = 20120526”. But we don’t want to hardcode the date! We want it to be today’s date. So we make it like this:


delete from FactHire where SnapshotDateKey = convert(int, replace( convert(varchar(10), GETDATE(), 20) ,'-',''))

The above code snippet converts today’s date into an integer. For a real warehouse project this is not acceptable, because the batch could start 1 millisecond before midnight and this OLE DB Command could end up executing on the next day. For a real project, the first thing to do in the ETL control flow is to set the batch date and time using a script task, like this:

And the script task populates a global variable so that all workflows in the package can use it. Of course, in a real project there would be many dimension population workflows before it hits the fact table population.
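For example, the delete in the OLE DB Command could then use a parameter marker instead of calling GETDATE() itself. This is a sketch; the variable name User::BatchDateKey is an assumption, not taken from the package above:

-- Sketch: map a package variable (e.g. User::BatchDateKey, set once by the
-- script task at the start of the batch) to the ? parameter marker, so every
-- workflow in the batch deletes and loads using the same date:
delete from FactHire where SnapshotDateKey = ?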

So that’s it. We have done the 2 things we set out to do at the beginning of this article: process the No Match Output, and delete today’s rows if they exist.

Hope this helps,

Vincent Rainardi, 26/5/2012

How to Populate a Fact Table using SSIS (part 1)

Filed under: Data Warehousing,SSIS — Vincent Rainardi @ 2:24 pm

Seems to me that some people are still struggling to populate their fact tables using SSIS. When doing it, they come across issues like these:

  1. Where do I populate my fact table from?
  2. How do I get the dimension keys to put into my fact table?
  3. Where can I get the data for the measures columns?
  4. With what do I populate the snapshot date column?
  5. What is the primary key of my fact table?
  6. The source table doesn’t have a primary key. What should I do?
  7. I got duplicate rows in the fact table. How do I handle it?
  8. The rows with that snapshot date already exist. What should I do?
  9. Lookup cache: should I choose full cache or partial cache?
  10. I can’t find the row in the dimension table, what should I do?

As always, the best way to explain is by example, which is more effective than answering the above questions. So I’m going to do the following:

  1. Describe the background on the company and the data warehouse
  2. Create the source tables and populate them
  3. Create the dimension tables and populate them
  4. Create the fact table (empty)
  5. Build an SSIS package to populate the fact table, step by step

Background

It’s a van hire company called TopHire. It’s purely fictional of course. Their business system (HireBase) captures the rental information, including the customer information. HireBase has a fleet database where all vans are maintained. HireBase contains only 3 tables:

  1. Customer: contains 100 customers, e.g. name, date of birth, telephone number, etc. A column called CustomerId uniquely identifies a customer.
  2. Van: contains 20 vans that Top Hire operates. It’s their fleet. Contains the registration number, engine size, van size, colour, year of manufacture. The unique identifier for each van is the registration number.
  3. Hire: contains 1000 hire transactions since 1st Jan 2011. Every hire transaction stores: who the customer was, which van was rented out, the date it occurred, number of days and various hire fees/charges (the van, sat nav, insurance, damage waiver and total bill). For each transaction we have Hire ID, which is a unique identifier in this table.

The data warehouse contains 4 tables:

  1. Date dimension: contains every single date from 2006 to 2016.
  2. Customer dimension: contains 100 customers. To be simple we’ll make it type 1 so we don’t create a new row for each change.
  3. Van dimension: contains 20 vans. To be simple we’ll make it type 1 so we don’t create a new row for each change.
  4. Hire fact table: contains 1000 hire transactions since 1st Jan 2011. It is a daily periodic snapshot fact table, so every day we insert all 1000 rows into it. Over time we can then track the changes in total bill, van charges, satnav income, etc. (see the query sketch below).
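To illustrate point 4: once a few daily loads have run, a query along these lines (a sketch; not part of the setup scripts below) shows how the figures move from snapshot to snapshot:

-- Sketch: tracking the total bill across daily snapshots
select SnapshotDateKey, count(*) as HireRows, sum(TotalBill) as TotalBill
from FactHire
group by SnapshotDateKey
order by SnapshotDateKey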

Create the source tables and populate them

So now we are going to create the 3 tables in HireBase database: Customer, Van, and Hire. Then we populate them.

First I’ll show you how it looks when it’s done:

Customer table:

Van table:

Hire table:

And here is the script to create and populate them:

-- Create database
create database HireBase
go
use HireBase
go

-- Create customer table
if exists (select * from sys.tables where name = 'Customer')
drop table Customer
go

create table Customer
( CustomerId varchar(20) not null primary key,
 CustomerName varchar(30), DateOfBirth date, Town varchar(50),
 TelephoneNo varchar(30), DrivingLicenceNo varchar(30), Occupation varchar(30)
)
go

-- Populate Customer
truncate table Customer
go

declare @i int, @si varchar(10), @startdate date
set @i = 1
while @i <= 100
begin
 set @si = right('0'+CONVERT(varchar(10), @i),2)
 insert into Customer
 ( CustomerId, CustomerName, DateOfBirth, Town, TelephoneNo, DrivingLicenceNo, Occupation)
 values
 ( 'N'+@si, 'Customer'+@si, DATEADD(d,@i-1,'2000-01-01'), 'Town'+@si, 'Phone'+@si, 'Licence'+@si, 'Occupation'+@si)
 set @i = @i + 1
end
go

select * from Customer

-- Create Van table
if exists (select * from sys.tables where name = 'Van')
drop table Van
go

create table Van
( RegNo varchar(10) not null primary key,
 Make varchar(30), Model varchar(30), [Year] varchar(4),
 Colour varchar(20), CC int, Class varchar(10)
)
go

-- Populate Van table
truncate table Van
go

declare @i int, @si varchar(10)
set @i = 1
while @i <= 20
begin
 set @si = convert(varchar, @i)
 insert into Van
 ( RegNo, Make, Model, [Year], Colour, CC, Class)
 values
 ( 'Reg'+@si, 'Make'+@si, 'Model'+@si,
 case @i%4 when 0 then 2008 when 1 then 2009 when 2 then 2010 when 3 then 2011 end,
 case when @i%5<3 then 'White' else 'Black' end,
 case @i%3 when 0 then 2000 when 1 then 2500 when 2 then 3000 end,
 case @i%3 when 0 then 'Small' when 1 then 'Medium' when 2 then 'Large' end)
 set @i = @i + 1
end
go

select * from Van

-- Create Hire table
if exists (select * from sys.tables where name = 'Hire')
drop table Hire
go

create table Hire
( HireId varchar(10) not null primary key,
 HireDate date not null,
 CustomerId varchar(20) not null,
 RegNo varchar(10), NoOfDays int, VanHire money, SatNavHire money,
 Insurance money, DamageWaiver money, TotalBill money
)
go

-- Populate Hire table
truncate table Hire
go

declare @i int, @si varchar(10), @DaysFrom1stJan int, @CustomerId int, @RegNo int, @mi int
set @i = 1
while @i <= 1000
begin
 set @si = right('000'+convert(varchar(10), @i),4) -- string of i
 set @DaysFrom1stJan = (@i-1)%200 --The Hire Date is derived from i modulo 200
 set @CustomerId = (@i-1)%100+1 --The CustomerId is derived from i modulo 100
 set @RegNo = (@i-1)%20+1 --The Van RegNo is derived from i modulo 20
 set @mi = (@i-1)%3+1 --i modulo 3
 insert into Hire (HireId, HireDate, CustomerId, RegNo, NoOfDays, VanHire, SatNavHire, Insurance, DamageWaiver, TotalBill)
 values ('H'+@si, DateAdd(d, @DaysFrom1stJan, '2011-01-01'),
 'N'+right('0'+CONVERT(varchar(10),@CustomerId),2), 'Reg'+CONVERT(varchar(10), @RegNo), --matches the 'N'+2-digit CustomerId format used in the Customer table
 @mi, @mi*100, @mi*10, @mi*20, @mi*40, @mi*170)
 set @i += 1
end
go

select * from Hire

Create the Data Warehouse

So now we are going to create the 3 dimension tables and 1 fact table in the data warehouse: DimDate, DimCustomer, DimVan and FactHire. We are going to populate the 3 dimensions but we’ll leave the fact table empty. The purpose of this article is to show how to populate the fact table using SSIS.

First I’ll show you how it looks when it’s done:

Date Dimension:

Customer Dimension:

Van Dimension:

Hire Fact Table:

And then we do it. This is the script to create and populate those dim and fact tables:

-- Create the data warehouse
create database TopHireDW
go
use TopHireDW
go

-- Create Date Dimension
if exists (select * from sys.tables where name = 'DimDate')
drop table DimDate
go

create table DimDate
( DateKey int not null primary key,
 [Year] varchar(7), [Month] varchar(7), [Date] date, DateString varchar(10))
go

-- Populate Date Dimension
truncate table DimDate
go

declare @i int, @Date date, @StartDate date, @EndDate date, @DateKey int,
 @DateString varchar(10), @Year varchar(4),
 @Month varchar(7), @Date1 varchar(20)
set @StartDate = '2006-01-01'
set @EndDate = '2016-12-31'
set @Date = @StartDate

insert into DimDate (DateKey, [Year], [Month], [Date], DateString)
 values (0, 'Unknown', 'Unknown', '0001-01-01', 'Unknown') --The unknown row

while @Date <= @EndDate
begin
 set @DateString = convert(varchar(10), @Date, 20)
 set @DateKey = convert(int, replace(@DateString,'-',''))
 set @Year = left(@DateString,4)
 set @Month = left(@DateString, 7)
 insert into DimDate (DateKey, [Year], [Month], [Date], DateString)
 values (@DateKey, @Year, @Month, @Date, @DateString)
 set @Date = dateadd(d, 1, @Date)
end
go

select * from DimDate

-- Create Customer dimension
if exists (select * from sys.tables where name = 'DimCustomer')
drop table DimCustomer
go

create table DimCustomer
( CustomerKey int not null identity(1,1) primary key,
 CustomerId varchar(20) not null,
 CustomerName varchar(30), DateOfBirth date, Town varchar(50),
 TelephoneNo varchar(30), DrivingLicenceNo varchar(30), Occupation varchar(30)
)
go

insert into DimCustomer (CustomerId, CustomerName, DateOfBirth, Town, TelephoneNo,
 DrivingLicenceNo, Occupation)
select * from HireBase.dbo.Customer

select * from DimCustomer

-- Create Van dimension
if exists (select * from sys.tables where name = 'DimVan')
drop table DimVan
go

create table DimVan
( VanKey int not null identity(1,1) primary key,
 RegNo varchar(10) not null,
 Make varchar(30), Model varchar(30), [Year] varchar(4),
 Colour varchar(20), CC int, Class varchar(10)
)
go

insert into DimVan (RegNo, Make, Model, [Year], Colour, CC, Class)
select * from HireBase.dbo.Van
go

select * from DimVan

-- Create Hire fact table
if exists (select * from sys.tables where name = 'FactHire')
drop table FactHire
go

create table FactHire
( SnapshotDateKey int not null, --Daily periodic snapshot fact table
 HireDateKey int not null, CustomerKey int not null, VanKey int not null, --Dimension Keys
 HireId varchar(10) not null, --Degenerate Dimension
 NoOfDays int, VanHire money, SatNavHire money,
 Insurance money, DamageWaiver money, TotalBill money
)
go

select * from FactHire

Now you can see that the 3 dimensions have been populated. And the fact table is created and empty, ready for us to populate it.

Build the SSIS Package to Populate the Fact Table

First I show you how it looks when it’s done. This is the overall workflow we are going to build:

Here is a brief description of each of the 6 boxes above:

  1. Read Hire table in HireBase: First we get the original OLTP records from the hire table in HireBase.
  2. Get Snapshot Date Key: we get today’s date and convert it into an integer as the Snapshot Date Key. FactHire is a periodic snapshot fact table, which means that every day we populate it with all the rows from the source. Next week, on Monday 28th May, we insert 1000 rows into the fact table, i.e. all the rows in the Hire table. Then on Tuesday 29th May we insert 1010 rows, because there would be 10 new hire transactions in the Hire table by then. The 1000 “old” rows might change as well on Tuesday; not all of them will be the same as Monday’s. Perhaps 10 rows were modified, the amounts/fees changed, etc. Whatever the state of the Hire table on Tuesday, we take all its rows and put them into our data warehouse, and we do this every day: every day we put whatever is in the Hire table into FactHire. That is what a periodic snapshot fact table means. Hence in a periodic snapshot fact table we have a “Snapshot Date”: the date we captured the source table, or in other words the date these 1000 rows were inserted into the fact table. The value of this snapshot date is today’s date, the date the ETL runs.
  3. Get Customer Key: from Hire table we get the Customer ID which was involved in the hire transaction. We then go to the Customer Dimension to get the Customer Key for this Customer ID.
  4. Get Van Key: from Hire table we get the Registration Number which was rented out in the transaction. We then go to the Van Dimension to get the Van Key for this Registration Number.
  5. Get Hire Date Key: from Hire table we get the Hire Date, which is the date when the hire transaction happened. We then go to the Date Dimension to get the Date Key for this Hire Date.
  6. Populate FactHire: After we get all the required dimension keys, we insert the rows into the FactHire fact table.

Now that we know what to build, let’s build it. The box numbers below refer to the 6 boxes above.

Box 1. Read Hire table in HireBase

We create the data source in the workflow. It’s an OLE DB data flow source reading from HireBase. We take all columns, so we don’t specify a query:

Box 2. Get Snapshot Date Key

Create a Derived Column and set the Expression to:

(DT_I4)((DT_WSTR,4)YEAR(GETDATE()) + RIGHT("0" + (DT_WSTR,2)MONTH(GETDATE()),2) + RIGHT("0" + (DT_WSTR,2)DAY(GETDATE()),2))

See the complete article here.

Box 3. Get Customer Key

Create a Lookup transformation and set it to Full cache, OLE DB connection manager, Fail component, like this:

Full cache: SSIS reads all rows from DimCustomer into memory, then reads the Hire table; SSIS doesn’t need to touch DimCustomer any more, as all lookup operations are done in memory.
Partial cache: SSIS builds the memory cache while reading the rows from the Hire table. As SSIS reads each Hire row, it tries to find the Customer Key in the memory cache; if it can’t find it, it reads DimCustomer on disk.
No cache: SSIS doesn’t build a memory cache. SSIS joins DimCustomer to the Hire table (using an RDBMS operation) to find the Customer Key.

Cache Connection Manager (CCM): SSIS reads DimCustomer and puts it in a memory cache. This cache can be reused and shared between several Lookups, and across several data flows. CCM reduces database workload.

OLE DB connection manager: a table in the warehouse (in this case it’s DimCustomer), see Connection pane below:

In the Columns pane, connect Customer Id from the Hire table to the Customer Id on the DimCustomer table, and bring the CustomerKey as output:

Box 4: Get Van Key

It’s the same as Get Customer Key above, i.e. using a Lookup transform. When connecting [Get Customer Key] and [Get Van Key], choose Lookup Match Output. I’ll come back to the Lookup No Match Output at the end of the article; I just want to get this up and running first. We should always get a simple version up and running first, then build up from there.

Box 5. Get Hire Date Key

It’s the same as Get Customer Key above.

Box 6. Populate FactHire

It’s an OLE DB destination. On the Connection Manager pane, specify FactHire as the target table and leave everything else as is:

On the Mappings pane, the source columns have the same names as the destination columns, except HireDateKey:

That’s it. Run it and all should be green, inserting 1000 rows into the target fact table:

And the FactHire fact table now looks like this:

The 4 dimension key columns are populated with the surrogate key from the respective dimension.

Wrapping Up

The above workflow works but there are 2 things we need to do:

  1. The no match rows
    If in the Hire table we have a CustomerId which does not exist in the Customer Dimension, we need to set the CustomerKey to 0.
  2. Delete fact rows, if they exist
    If today’s rows already exist in the target fact table, delete them first.

As this article is long enough already, I’m going to put that in a separate article (link).

Hope this helps,

Vincent Rainardi

26/10/12: Data type error when mapping HireDate column to Date column

In Box 5, Get Hire Date Key, in the Lookup Transformation editor, on the Columns pane (see below), when we try to map the HireDate column from the Available Input Columns to the Date column in the Available Lookup Columns, we get an error message: Cannot map the input column, ‘HireDate’, to the lookup column, ‘Date’, because the data types do not match:

Thank you to Masu for pointing this out.

This is because in the Hire table (the input) the HireDate column is of the date data type, whereas in the DimDate table the Date column is varchar(10).

To solve this issue, in the Get Hire Date Key task, in the Connection pane, instead of “Use a table or a view”, select “Use results of an SQL query” and type:

select DateKey, Year, Month, convert(date, Date) as Date from DimDate where Date <> 'Unknown'

Because both are now of the date data type, in the Columns pane you won’t have any issue connecting HireDate to Date:

How to Convert Datetime to Int in SSIS

Filed under: SSIS — Vincent Rainardi @ 5:30 am

This usually happens when you need to insert a Date Key into a fact table: you need to convert a date from the source table, which has the datetime data type, into an integer, i.e. from ‘2012-05-26 00:00:00’ to 20120526.

In T-SQL you can do this to convert datetime to int:

convert(int, replace(convert(varchar(10), @Date, 20), '-', ''))

But in SSIS data flow, what do you do?

You need to use a Derived Column transformation, like this:

The Derived Column transformation will add a new column to the workflow, which is the integer Date Key.

In SSIS we have the following date and string functions:

So to convert today’s date to an integer, the idea is to get the year component, the month component and the day component (as 3 different strings), concatenate them, then convert the result to an integer.

So in the Derived Column we create these 3 components:


(DT_WSTR,4) YEAR(GETDATE())  ----- (1)

(DT_WSTR,2) MONTH(GETDATE())    ------ (2)

(DT_WSTR,2) DAY(GETDATE())   ------- (3)

But we need to remember that for November the output of (2) is 2 digits (11), while for May it is 1 digit (5). So 22nd May 2012 would come out as 2012522, because May is a single digit. We therefore need to pad it with a zero, like this:


RIGHT( "0" + (DT_WSTR,2) MONTH(GETDATE()) , 2 )

And the same with the day: the 1st to the 9th are single digits, so 6th May 2012 would come out as 2012056. We need to pad it with a zero too, like this:


RIGHT( "0" + (DT_WSTR,2) DAY(GETDATE()) , 2 )

And finally we combine those 3 components using “+” like this:


(DT_WSTR,4) YEAR(GETDATE()) + RIGHT( "0" + (DT_WSTR,2) MONTH(GETDATE()) , 2 ) + RIGHT( "0" + (DT_WSTR,2) DAY(GETDATE()) , 2 )

So we will get a string like this: “20120522”.

As the last step, we just need to convert it to an integer using (DT_I4), like this:


(DT_I4)((DT_WSTR,4)YEAR(GETDATE()) + RIGHT("0" + (DT_WSTR,2)MONTH(GETDATE()),2) + RIGHT("0" + (DT_WSTR,2)DAY(GETDATE()),2))

In SSIS we have the following type cast operators available to us:

Here’s how it looks as an expression on the Derived Column window:

So, once again, the expression we need in the Derived Column to convert today’s date (datetime) to an integer is:


(DT_I4) ((DT_WSTR,4) YEAR(GETDATE()) + RIGHT( "0" + (DT_WSTR,2) MONTH(GETDATE()) , 2 ) + RIGHT( "0" + (DT_WSTR,2) DAY(GETDATE()) , 2 ))

Hope this helps,

Vincent Rainardi 26/5/2012

19 May 2012

System Interfaces Contract

Filed under: Data Warehousing — Vincent Rainardi @ 1:11 pm

Data types should not be changed without informing the downstream systems. Imagine you were responsible for maintaining the company data warehouse, and imagine that one source system suddenly changed its data types from varchar(30) to varchar(100), or from decimal(12,4) to decimal(20,4), without talking to you first. One day your warehouse batch suddenly fails and your team needs to fix it there and then. That would not be appropriate, right?

So, if you are lucky enough to be the owner of an upstream system, you should not change your data types without first informing all downstream systems (including the data warehouse and BI systems) and agreeing an implementation date with them. On that date, both you and they change the data type at the same time.

So the source system can do whatever they like and we in the warehouse just suffer, right? Wrong. But in the practical world, that’s what happens. The source system can drop a column any time they like. They can even drop a table. If we in the DW read that column or table, then we are stuffed! Our ETL suddenly stops working because the whole table is gone from the source system. What a wonderful way to start the week, if on Monday morning we find that our DW batch failed because a table was dropped from the source. If this happens to you, insist that the source system back out the change. Then agree with them a date when they can reapply it; in the meantime you can prepare the DW to accept that change.

The first rule of system interfacing is that all parties must sign a contract. If you (the DW) read their system (OLTP), you agree something like this with them:

The interface will be a file of this format (attach the File Spec), published daily Mon-Fri at 11:15 pm on MQ (message queue), under this ID (attach the MQ spec).

If you are reading directly from their tables (using ODBC), then it would be something like this:

We (the DW) will be reading directly from your database. Any DB change must be notified to us 4 weeks in advance, especially table/view/column renaming or deletion (exception: new columns/tables/views).

That is called a “System Interface Contract” (SIC). The contract should be signed by all the parties involved. If you just read their tables without agreeing anything, they can change their tables at any time and you are stuffed!

9 May 2012

A Data Warehouse with No Surrogate Keys

Filed under: Data Warehousing — Vincent Rainardi @ 11:07 pm

Some companies build their data warehouses without surrogate keys. They have a customer table in their business system, with CustomerId as the primary key, and in the data warehouse they also use CustomerId as the primary key. One would think “oh, it’s probably an Inmon-style warehouse”, but no, it was a fact-and-dim, Kimball-style warehouse. In DimCustomer they don’t have a surrogate key; instead, they use CustomerId as the PK!

Talking to them, I tried to understand why they did it that way. “Because we just want to truncate DimCustomer and reload it every day. That makes it simple. We don’t need to find out which rows have changed and which rows are new. We simply delete everything in the dim and reload the customer table from the OLTP!” Then I realised that was the reason they couldn’t have a surrogate key in their DimCustomer: if they did, they could not truncate and reload the dim, because the surrogate keys would change. If on day 1 they have 1000 rows with CustomerKey 1 to 1000, then on day 2, after the truncate and reload, the CustomerKey would become 1001 to 2000. And that would make their lives even more difficult: they would have to change the CustomerKey in the fact table.

So, for that reason, they use CustomerId in the fact table as well, i.e. the PK from the OLTP. This way the PK in the dim is 1 to 1000, the PK in the fact table is also 1 to 1000, and they match!

Figure 1. A Data Warehouse with no Surrogate Keys

Well, if not having surrogate keys makes life easier, why do we bother creating them in the data warehouse?

  1. To maintain the history of attributes in the dimension (type 2)
  2. To be able to point the old fact table to the old row in the dimension

“But we don’t need the old version of the row. We always want to see the latest version! Why do we need to have SK in the DW?”

To integrate multiple source systems which may have the same natural keys, i.e. the same CustomerId or ProductId.

“Oh, that’s easy, we add a SourceId column in both the dimension and the fact and we can join OK!”

True, that is true, we can join multiple source systems by using SourceId column in both the fact and the dim.

Another reason: so that we can have the Unknown Row. If we don’t know who the customer is for a particular sales transaction, we point the fact row to this Unknown Row. The same goes for DimProduct and all other dimensions.

Also: if the CustomerId changes from numeric to alphanumeric in the OLTP, we don’t need to change any of the fact tables (we do if we use CustomerId as the primary key of the dimension). This is actually the core reason why we use a surrogate key: the natural key can change. We want to be able to update the natural key easily, without updating all the fact tables that use that dimension.

Another advantage of a surrogate key is performance: it is a single integer column. The natural key may consist of several columns; for example, the PK of a bank account table is branch code + product type code + account number. It is possible that 2 banks merge and their accounts are merged too, causing duplicates, so they prefix it again with the bank code and it becomes: bank code + branch code + product type code + account number. That is long. A single integer column is much shorter and more performant: it is quicker to index and quicker to find.
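To make the mechanics concrete, here is a minimal sketch of a surrogate-keyed dimension with an Unknown Row (simplified from the DimCustomer script in the fact table article on this blog):

-- Minimal sketch: a surrogate-keyed dimension with an Unknown Row.
-- The fact table stores CustomerKey, never CustomerId, so the natural key
-- can change type or length without touching any fact table.
create table DimCustomer
( CustomerKey int not null identity(1,1) primary key, -- surrogate key
 CustomerId varchar(20) not null, -- natural key from the OLTP
 CustomerName varchar(30)
)
go

set identity_insert DimCustomer on
insert into DimCustomer (CustomerKey, CustomerId, CustomerName)
values (0, 'Unknown', 'Unknown') -- the Unknown Row, for facts with no match
set identity_insert DimCustomer off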

Figure 2. A Data Warehouse with Surrogate Keys

So those are the reasons why we have surrogate keys in the data warehouse! I hope this article helps those who are convinced that they need to use a star schema and surrogate keys but never knew the reasons; I hope they now understand the reasoning behind surrogate keys in a star schema. Personally, this case helped me understand why some people choose not to use surrogate keys: throughout the years I have heard many arguments for using SKs, but very few arguments against them.

In this particular case, they didn’t even know how to maintain a surrogate key. They liked the idea of maintaining the history of the attributes after I explained it, but they didn’t know the technicalities of doing it in SSIS; it seemed very complex and looked like it would take a lot of their time. When I introduced the SCD transform in an SSIS workflow to them, their eyes were bright and cheerful. “Wow, that would make things a lot easier,” they said. Over the years, some companies have tried to make this surrogate key business easier: Kalido and Wherescape, for example, enable us to build a DW without worrying about the complications of surrogate keys.

Once again, the main reason for using surrogate keys is to detach the data warehouse from the source system, making the warehouse more independent of the source system and therefore giving us a more flexible design. And the main benefit is that SKs enable us to maintain the history of the attributes in the dimension using SCD type 2.

Vincent Rainardi, 9/5/2012
