Data Warehousing and Data Science

28 December 2012

SSIS 2012: Converting Date from Excel for Lookup

Filed under: SSIS — Vincent Rainardi @ 12:12 am

We have an Excel file like this (date column in DD-MM-YYYY):

Excel File

We have a Date Dimension table like this (date column in YYYY-MM-DD):

Date Dimension

Using SSIS 2012, we want to convert the date column in Excel from DD-MM-YYYY to YYYY-MM-DD so we can lookup the Date_Key value in the Date Dimension.

We want to output it in text file (error rows go to Error File):

Output FileError File

Overall

We need to build an SSIS 2012 package like this:

SSIS 2012 Package

The package reads the Excel file. It convert the date column in the Excel file to YYYY-MM-DD using Derived Column. It reads the Date Dimension table using Lookup transform, then lookup the Date_Key column. It then output the file to an Output file, and the error rows (failed lookup) to the Error File.

Before We Begin

SSIS Excel adapter can only run in 32 bit mode. So right click your project name, properties, and set Run64BitRunTime to False:

Run64bit

Step by Step

Step 1. Read the Excel File

On SSIS Toolbox, drag Data Flow Task to Control Flow canvas:

Data Flow Task

In SSIS Toolbox, double click Source Assistant, choose Excel:

Source Assistant

Click New, choose the Excel file to import. Data access mode = Table or View, Name of Excel sheet = Sheet1$. Click Preview:

Preview Excel File

Click Column on the left pane to verify that the column names are OK, then click OK to close the Excel Source Editor.

Step 2. Convert to YYYY-MM-DD Using Derived Column

From the SSIS Toolkit on the left hand side, drag Derived Column onto the Data Flow canvas. Configure it as follows:

Derived Column

Expression: (DT_WSTR,4)YEAR(Date) + “-” + RIGHT(“0” + (DT_WSTR,2)MONTH(Date),2) + “-” + RIGHT(“0” + (DT_WSTR,2)DAY(Date),2)

Date being the column from the Excel file.

At this step I would recommend setting up a Flat File destination and output the Derived Column result into it. This is to check that the date column in Excel file has been converted correctly.

If not, at least setup a Data Viewer so you can see what’s flowing out of the Derived Column.

Step 3. Read the Date Dimension using Lookup and get the Date Key

Drag Lookup transform from the Toolkit to the Data Flow canvas. Setup a connection to your SQL Server. Go to Connection pane, and configure as follows:

Lookup Transform

In the above SQL query box, you need to convert from varchar to nvarchar, otherwise it won’t match the output of the Derived Column which is WSTR (wide string / unicode). STR in SSIS data type = varchar, WSTR = nvarchar.

Click on Preview to see the data from the Date Dim:

Lookup - Preview

Click on Columns pane on the left and configure as follows:

Lookup - Columns

On the General tab, configure Redirect rows to Error output for No Matching Entries:

Lookup - General

4. Configure Output File and Error File

Drag Flat File Destination from the Toolkit panel onto the canvas and configure as follows:

Output File - General

Output File - Columns

Output File - Advanced

On the Output Flat File Destination Editor, configure the Mapping pane as follows:

Output Flat File Destination Mapping

Do the same with the Error Flat File Destination Editor:

Error Flat File Destination Mapping

That’s it you can now run the package:

Overall

3 October 2012

SSIS: Automating DDL Changes

Filed under: SSIS — Vincent Rainardi @ 5:58 pm
Tags:

See Preeti’s comment 3/10/12 on “SSIS: Importing a File with Dynamic Columns” article (link):

Will elaborate my question ?I have columns dynamically changing on source side like adding,deleting  which is SQL ,have my target as Oracle ,how to handle dynamically the source changes to be in sync with target (Oracle) structure before loading.I knw in any ETL process this has to be manually done like refreshing the source then updating the target table structure.Cannot this be automated?We have java transformation in ETL so cant we make use of it to automate the process ?If so then how?If not,are there any workarounds to do this?

This article is addressing the above.

—————————————————————

This can be done in theory, but in practice it is usually banned in production.

First, we obtain the new structure of the source table(s) by querying either information schema or system views/tables, i.e. if your source is SQL Server we query system catalogue views like this:

select c.name, dt.name, c.max_length, c.precision, c.scale, c.is_nullable
from sys.columns c
join sys.tables t on c.object_id = t.object_id
join sys.schemas s on t.schema_id = s.schema_id
join sys.types dt on c.user_type_id = dt.user_type_id
where t.name = ‘Table1’ and s.name = ‘Schema1’

We then obtain the structure of the target table(s) by querying either information schema or system views/tables, i.e. if your target is Oracle we can query ALL_TAB_COLUMNS like this:

SELECT COLUMN_NAME, DATA_TYPE, DATA_LENGTH, DATA_PRECISION, DATA_SCALE, NULLABLE
FROM ALL_TAB_COLUMNS
WHERE TABLE_NAME = ‘TABLE1’ AND OWNER = ‘OWNER1’

Both tasks above are done using Execute SQL Task in SSIS, storing it into a table (you create the table). For example:

As we can see, SubType and Value3Y doesn’t exist in the target table, and Value5Y doesn’t exist in the source table. The precision of Value1Y and Value5Y is different.

You can compare the source and target then use ALTER TABLE at the target to create the missing column, but this is risky, complex and causing headache. A simpler, more robust solution is to recreate the target table.

If you have different RDBMS between source and target, you will need to make a “data type conversion table”, i.e. varchar in SQL Server becomes varchar2 in Oracle, etc.

But as I said, this is usually banned in production because the Prod Support Manager can’t risk the instability the approach causes.

19 June 2012

SSIS SCD Wizard Performance Issue

Filed under: Data Warehousing,SSIS — Vincent Rainardi @ 7:23 pm
Tags: ,

In Microsoft BI, when we do a slowly changing dimension type 2, we instinctively do SCD Wizard. The problem with SCD Wizard is the performance. For each source row coming to the SCD box, SSIS does a lookup on the target dimension table to find out if the row already exists or not. That is a crazy idea. And SSIS also update the target dimension table on row by row basis. That is also a crazy idea.

If your dimension is 1000 rows and the source data is 1000 rows, the SCD takes a few minutes. But if your dimension is 500k and the source data is 500k, it’s 5-6 hours performance. The whole idea is fundamentally flawed. On the welcome screen of the SSIS box, there should be a warning: Warning, do not use SSIS SCD Wizard if your dimension is above 1000 rows.

So, if your dimension is 1m rows, what should you do?

  1. Do it on set basis, not row by row basis.
  2. Filter the incoming source rows to the changed rows.
  3. Don’t everything type 2, but only necessary columns.
  4. Index the dimension table on the seek columns.
  5. Group the updates to the target dimension table.
  6. Archive unused rows in the dimension table.

To do update an SCD type 2 dimension, in SSIS we should do this:

  1. Obtain Current Time (CT).
  2. Stage the source table. 1m rows (S).
  3. Identity changed rows on the staging table. 50k rows (C).
  4. Put the C rows on a new staging table.
  5. Identify which C rows already exist on the dimension table by comparing the business key.
    Say 40k rows exist (E) and 10k rows don’t exist (D).
  6. Out of the 40k E rows, identify which rows has changed attributes. Say there are 12k rows (CA).
  7. Update dimension table: expire the 12k CA rows (set Expiry Date = CT, IsActive = N).
  8. Insert 22k row (10k D + 12k CA) into dimension table (in 1 go, not in 2 goes1). Expiry Date = 9999, IsActive = Y.
    D rows: Effective Date = 0000. CA rows: Effective Date = CT.

Note:

  1. In step 2, enable fast load, do not put any constraint on the staging table.
  2. Step 3 is done by hashing or checksum.
  3. Staging table C on step 4 can be used by other work flow, especially the Fact Load.
  4. To support step 5 and 7, index the dimension table on the Natural Key column.
  5. To support step 6, index the dim table on the type 2 columns to be compared.
    Do the compare in the order of the indexed columns.
  6. We can use the Merge command to do step 7 & 8, but we need to do it twice. So the old Upsert (update + insert) works as efficiently as Merge. Probably more efficient as we don’t do delete.

This is one major weakness in SSIS compared to Informatica PowerCenter so I have no doubt Microsoft must have been building a set based SSIS SCD Wizard for the next release 🙂

26 May 2012

How to Populate a Fact Table using SSIS (part 2)

Filed under: Data Warehousing,SSIS — Vincent Rainardi @ 9:47 pm

In part 1 (link) we have populated a fact table (FactHire) by reading the source table (Hire) and looking up 3 dimensions (DimCustomer, DimVan and DimDate) to get the surrogate dimension keys and then insert into FactHire.

In the end of the article I said that there were 2 things that we needed to do:

  1. Process the no match rows
    If in the Hire table we have a CustomerId which does not exist in the Customer Dimension, we need to set the CustomerKey to 0.
  2. Delete fact rows, if exists
    If today’s rows exists in the target fact table, delete them first.

So let’s get started.

Process the No Match Rows

The Lookup transform we try to convert the natural keys from the Hire table (Customer ID, Registration No and Hire Date) to the dimension keys (CustomerKey, VanKey and HireDateKey) by looking up the dimension table. Consider that in the Hire table we have a CustomerId of 123. We search DimCustomer and we couldn’t find CustomerId 123. In this case we need to set the Customer Key to 0.

To do that, we need to process the “No Match Output” of the Lookup transform (the second green row). Then merge it back into the main flow. Here’s how it will look when finished:

So let’s begin. The first thing to do is to handle the “rows with no match entries”. Double click on [Get Customer Key] and change the [Specify how to handle rows with no matching entries] from [Failed component] to [Redirect rows to no match output], like this:

Now as you can see above there are 2 green arrows coming out of Get Customer Key lookup transformation. The [Lookup Matched Output] green arrow is now going to [Sort on HireId1], and we set the [Lookup No Match Output] to a Derived Column transformation called [Mismatched CustomerId]. In this [Mismatched Customer Id], we are simply setting the CustomerKey column = 0, like this:

We would like to combine the output of Match Output and No Match Output using a Merge transformation, like this:

But the Merge transformation requires us to sort each of the two inputs. So we have to pass them through a Sort transform like this:

In the [Sort on HireId2], we sort the data set on HireId, and pass all other columns through to the output, like this: (and we do exactly the same on [Sort on HireId 1])

So why do we sort on HireId? Why not on other column? Because HireId is a unique identifier. It uniquely identifies a hire transaction. Imagine this scenario: there are 1000 rows on the Hire table in HireBase. 997 of them has CustomerId exist in DimCustomer, and 3 of them has CustomerId which  don’t exist in DimCustomer. So the Match Output will have 997 rows and No Match Output will have 3 rows. When we merge them again, we want to use a column which was a unique ID before they split up, so that when they merge this unique ID will identify them properly.

Now that we have merged the first lookup, we need to do the same on the 2nd and 3rd Lookups (Van Key and Hire Date Key). But this time we don’t need to sort the output of the derived column and lookup, because they are already sorted:

Now our lookup is complete and proper. We have processed the No Match rows.

Delete Target Rows if Exist

One of the basic principles of data warehouse ETL is “RERUNABILITY”. Meaning that we need to be able to run the ETL several times a day without causing any issue to the data. In terms of loading a Period Snapshot Fact table, this means that we need to delete the [today’s date rows] in the target fact table, before we load the new rows.

Let’s say that this morning the ETL ran, pumping 1000 rows into the FactHire table, all with Snapshot Date = today. And now (say 2pm) if we were to run it again, it will pump another 1000 rows with Snapshot Date = today, causing duplication of data. Because of this we need to delete today’s rows in the target fact table.

Unfortunately we can’t use truncate in this case because we only want to delete today’s rows, not yesterday’s ones or any other dates before today. We need an index to be created JUST on Snapshot Date. I repeat: JUST on the Snapshot Date. We don’t want covering index or query help index or any multi key index, we just want Snapshot Date alone. Why? Because if we put other column on the index it may not get used when we delete today’s rows because the selectivity is too low. Can we use table partitioning? Yes we can. We can (sometimes we should, if the volume warrants it) partition a periodic snapshot fact table on Snapshot Date. But sometimes we can’t because the partitioning key is used for something more selecting, such as on product Id. But in majority of the cases we can.

Any way I digress. We need to delete today’s rows. Should you want to discuss further about partitioning you can email me at vrainardi@gmail.com.

Deleting today’s rows is a simple OLEDB SQL Command transformation like this:

And inside, the OLE DB Command look like this:

As you can see the SqlCommand is “Delete from FactHire where SnapshotDateKey = 20120526”. But we don’t want to hardcode the date! We want it to be today’s date. So we make it like this:


delete from FactHire where SnapshotDateKey = convert(int, replace( convert(varchar(10), GETDATE(), 20) ,'-',''))

The above code snippet will convert today’s date into an integer. For real warehouse project this is not acceptable, because the batch could start 1 millisecond before midnight and this OLE DB Command could end up executing on the next day. For real warehouse project the first thing to do in the ETL Control Flow is to set the Batch Date and Time using a script task, like this:

And the script task populates a global variable so that all workflows in the package can use it. Of course in the real project there would be a lot of dimension population work flow, before it hits the fact table population.

So that’s it, we have done 2 things that we set out at the beginning of this article: to process the No Match Output and to delete today’s rows if any.

Hope this helps,

Vincent Rainardi, 26/5/2012

How to Populate a Fact Table using SSIS (part 1)

Filed under: Data Warehousing,SSIS — Vincent Rainardi @ 2:24 pm
Tags: ,

Seems to me that some people are still struggling to populate their fact tables using SSIS. When doing it, they come across issues like these:

  1. Where do I populate my fact table from?
  2. How do I get the dimension keys to put into my fact table?
  3. Where can I get the data for the measures columns?
  4. With what do I populate the snapshot date column?
  5. What is the primary key of my fact table?
  6. The source table doesn’t have a primary key of the source table. What should I do?
  7. I got duplicate rows in the fact table. How do I handle it?
  8. The rows with that snapshot date already exist. What should I do?
  9. Lookup cache: should I choose full cache or partial cache?
  10. I can’t find the row in the dimension table, what should I do?

As always, the best way to explain is by example, which is more effective than answering the above questions. So I’m going to do the following:

  1. Describe the background on the company and the data warehouse
  2. Create the source tables and populate them
  3. Create the dimension tables and populate them
  4. Create the fact table (empty)
  5. Build an SSIS package to populate the fact table, step by step

Background

It’s a van hire company called TopHire. It’s purely fictional of course. Their business system (HireBase) captures the rental information, including the customer information. HireBase has a fleet database where all vans are maintained. HireBase contains only 3 tables:

  1. Customer: contains 100 customers, e.g. name, data of birth, telephone number, etc. A column called CustomerId uniquely identifies a customer.
  2. Van: contains 20 vans that Top Hire operates. It’s their fleet. Contains the registration number, engine size, van size, colour, year of manufacture. The unique identifier for each van is the registration number.
  3. Hire: contains 1000 hire transactions since 1st Jan 2011. Every hire transaction stores: who the customer was, which van was rented out, the date it occurred, number of days and various hire fees/charges (the van, sat nav, insurance, damage waiver and total bill). For each transaction we have Hire ID, which is a unique identifier in this table.

The data warehouse contains 4 tables:

  1. Date dimension: contains every single date from 2006 to 2016.
  2. Customer dimension: contains 100 customers. To be simple we’ll make it type 1 so we don’t create a new row for each change.
  3. Van dimension: contains 20 vans. To be simple we’ll make it type 1 so we don’t create a new row for each change.
  4. Hire fact table: contains 1000 hire transactions since 1st Jan 2011. It is a daily snapshot fact table so that every day we insert 1000 rows into this fact table. So over time we can track the changes of total bill, van charges, satnav income, etc.

Create the source tables and populate them

So now we are going to create the 3 tables in HireBase database: Customer, Van, and Hire. Then we populate them.

First I’ll show you how it looks when it’s done:

Customer table:

Van table:

Hire table:

And here is the script to create and populate them:

-- Create database
create database HireBase
go
use HireBase
go

-- Create customer table
if exists (select * from sys.tables where name = 'Customer')
drop table Customer
go

create table Customer
( CustomerId varchar(20) not null primary key,
 CustomerName varchar(30), DateOfBirth date, Town varchar(50),
 TelephoneNo varchar(30), DrivingLicenceNo varchar(30), Occupation varchar(30)
)
go

-- Populate Customer
truncate table Customer
go

declare @i int, @si varchar(10), @startdate date
set @i = 1
while @i <= 100
begin
 set @si = right('0'+CONVERT(varchar(10), @i),2)
 insert into Customer
 ( CustomerId, CustomerName, DateOfBirth, Town, TelephoneNo, DrivingLicenceNo, Occupation)
 values
 ( 'N'+@si, 'Customer'+@si, DATEADD(d,@i-1,'2000-01-01'), 'Town'+@si, 'Phone'+@si, 'Licence'+@si, 'Occupation'+@si)
 set @i = @i + 1
end
go

select * from Customer

-- Create Van table
if exists (select * from sys.tables where name = 'Van')
drop table Van
go

create table Van
( RegNo varchar(10) not null primary key,
 Make varchar(30), Model varchar(30), [Year] varchar(4),
 Colour varchar(20), CC int, Class varchar(10)
)
go

-- Populate Van table
truncate table Van
go

declare @i int, @si varchar(10)
set @i = 1
while @i <= 20
begin
 set @si = convert(varchar, @i)
 insert into Van
 ( RegNo, Make, Model, [Year], Colour, CC, Class)
 values
 ( 'Reg'+@si, 'Make'+@si, 'Model'+@si,
 case @i%4 when 0 then 2008 when 1 then 2009 when 2 then 2010 when 3 then 2011 end,
 case when @i%5<3 then 'White' else 'Black' end,
 case @i%3 when 0 then 2000 when 1 then 2500 when 2 then 3000 end,
 case @i%3 when 0 then 'Small' when 1 then 'Medium' when 2 then 'Large' end)
 set @i = @i + 1
end
go

select * from Van

-- Create Hire table
if exists (select * from sys.tables where name = 'Hire')
drop table Hire
go

create table Hire
( HireId varchar(10) not null primary key,
 HireDate date not null,
 CustomerId varchar(20) not null,
 RegNo varchar(10), NoOfDays int, VanHire money, SatNavHire money,
 Insurance money, DamageWaiver money, TotalBill money
)
go

-- Populate Hire table
truncate table Hire
go

declare @i int, @si varchar(10), @DaysFrom1stJan int, @CustomerId int, @RegNo int, @mi int
set @i = 1
while @i <= 1000
begin
 set @si = right('000'+convert(varchar(10), @i),4) -- string of i
 set @DaysFrom1stJan = (@i-1)%200 --The Hire Date is derived from i modulo 200
 set @CustomerId = (@i-1)%100+1 --The CustomerId is derived from i modulo 100
 set @RegNo = (@i-1)%20+1 --The Van RegNo is derived from i modulo 20
 set @mi = (@i-1)%3+1 --i modulo 3
 insert into Hire (HireId, HireDate, CustomerId, RegNo, NoOfDays, VanHire, SatNavHire, Insurance, DamageWaiver, TotalBill)
 values ('H'+@si, DateAdd(d, @DaysFrom1stJan, '2011-01-01'),
 left('N0'+CONVERT(varchar(10),@CustomerId),3), 'Reg'+CONVERT(varchar(10), @RegNo),
 @mi, @mi*100, @mi*10, @mi*20, @mi*40, @mi*170)
 set @i += 1
end
go

select * from Hire

Create the Data Warehouse

So now we are going to create the 3 dimension tables and 1 fact table in the data warehouse: DimDate, DimCustomer, DimVan and FactHire. We are going to populate the 3 dimensions but we’ll leave the fact table empty. The purpose of this article is to show how to populate the fact table using SSIS.

First I’ll show you how it looks when it’s done:

Date Dimension:

Customer Dimension:

Van Dimension:

Hire Fact Table:

And then we do it. This is the script to create and populate those dim and fact tables:

-- Create the data warehouse
create database TopHireDW
go
use TopHireDW
go

-- Create Date Dimension
if exists (select * from sys.tables where name = 'DimDate')
drop table DimDate
go

create table DimDate
( DateKey int not null primary key,
 [Year] varchar(7), [Month] varchar(7), [Date] date, DateString varchar(10))
go

-- Populate Date Dimension
truncate table DimDate
go

declare @i int, @Date date, @StartDate date, @EndDate date, @DateKey int,
 @DateString varchar(10), @Year varchar(4),
 @Month varchar(7), @Date1 varchar(20)
set @StartDate = '2006-01-01'
set @EndDate = '2016-12-31'
set @Date = @StartDate

insert into DimDate (DateKey, [Year], [Month], [Date], DateString)
 values (0, 'Unknown', 'Unknown', '0001-01-01', 'Unknown') --The unknown row

while @Date <= @EndDate
begin
 set @DateString = convert(varchar(10), @Date, 20)
 set @DateKey = convert(int, replace(@DateString,'-',''))
 set @Year = left(@DateString,4)
 set @Month = left(@DateString, 7)
 insert into DimDate (DateKey, [Year], [Month], [Date], DateString)
 values (@DateKey, @Year, @Month, @Date, @DateString)
 set @Date = dateadd(d, 1, @Date)
end
go

select * from DimDate

-- Create Customer dimension
if exists (select * from sys.tables where name = 'DimCustomer')
drop table DimCustomer
go

create table DimCustomer
( CustomerKey int not null identity(1,1) primary key,
 CustomerId varchar(20) not null,
 CustomerName varchar(30), DateOfBirth date, Town varchar(50),
 TelephoneNo varchar(30), DrivingLicenceNo varchar(30), Occupation varchar(30)
)
go

insert into DimCustomer (CustomerId, CustomerName, DateOfBirth, Town, TelephoneNo,
 DrivingLicenceNo, Occupation)
select * from HireBase.dbo.Customer

select * from DimCustomer

-- Create Van dimension
if exists (select * from sys.tables where name = 'DimVan')
drop table DimVan
go

create table DimVan
( VanKey int not null identity(1,1) primary key,
 RegNo varchar(10) not null,
 Make varchar(30), Model varchar(30), [Year] varchar(4),
 Colour varchar(20), CC int, Class varchar(10)
)
go

insert into DimVan (RegNo, Make, Model, [Year], Colour, CC, Class)
select * from HireBase.dbo.Van
go

select * from DimVan

-- Create Hire fact table
if exists (select * from sys.tables where name = 'FactHire')
drop table FactHire
go

create table FactHire
( SnapshotDateKey int not null, --Daily periodic snapshot fact table
 HireDateKey int not null, CustomerKey int not null, VanKey int not null, --Dimension Keys
 HireId varchar(10) not null, --Degenerate Dimension
 NoOfDays int, VanHire money, SatNavHire money,
 Insurance money, DamageWaiver money, TotalBill money
)
go

select * from FactHire

Now you can see that the 3 dimensions have been populated. And the fact table is created and empty, ready for us to populate it.

Build the SSIS Package to Populate the Fact Table

First I show you how it looks when it’s done. This is the overall workflow we are going to build:

This is a bit of description on each of the 6 boxes above:

  1. Read Hire table in HireBase: First we get the original OLTP records from the hire table in HireBase.
  2. Get Snapshot Date Key: we get today’s date and convert it into an integer as Snapshot Date Key. FactHire is a snapshot fact table. A periodic snapshot fact table. Which means that every day we populate the fact table with 1000 rows like this:Next week, on Monday 28th May, we insert 1000 rows to the fact table, which is all the rows in the Hire table. Then on Tuesday 29th  May we insert 1010 rows. Because there would be 10 new hire transactions in the Hire table in HireBase system. The 1000 “old” rows might change as well on Tuesday, not all of them will be the same as Monday. Perhaps there are 10 rows which are modified, the amount/fees changed, etc. Whatever the condition of the Hire table on Tuesday, we are taking all rows in put them into our data warehouse. And we do this every day. We every day, put whatever is in the Hire table into FactHire. That’s what Periodic Snapshot fact table means.Hence in a periodic snapshot fact table we have a “Snapshot Date”. Meaning: the date when we captured the source table. Or, the date when these 1000 rows were inserted into this fact table. The value of this snapshot date is today’s date (the date the ETL runs). The date the load happens.
  3. Get Customer Key: from Hire table we get the Customer ID which was involved in the hire transaction. We then go to the Customer Dimension to get the Customer Key for this Customer ID.
  4. Get Van Key: from Hire table we get the Registration Number which was rented out in the transaction. We then go to the Van Dimension to get the Van Key for this Registration Number.
  5. Get Hire Date Key: from Hire table we get the Hire Date, which is the date when the hire transaction happened. We then go to the Date Dimension to get the Date Key for this Hire Date.
  6. Populate FactHire: After we get all the required dimension keys, we insert the rows into the FactHire fact table.

Now that we know what to build, let’s build it. The box numbers below refer to the 6 boxes above.

Box 1. Read Hire table in HireBase

We create the data source in the workflow. It’s OLE DB Data Flow Source, from HireBase. We take all columns, so we don’t specify a query:

Box 2. Get Snapshot Date Key

Create a Derived Column and set the Expression to:

(DT_I4)((DT_WSTR,4)YEAR(GETDATE()) + RIGHT(“0” + (DT_WSTR,2)MONTH(GETDATE()),2) + RIGHT(“0” + (DT_WSTR,2)DAY(GETDATE()),2))

See the complete article here.

Box 3. Get Customer Key

Create a Lookup transformation and set to Full Cache, OLE DB connection manager, Fail component, like this:

Full cache: SSIS reads all rows from DimCustomer into memory, then read the Hire table and SSIS doesn’t need to read DimCustomer any more. All lookup operations are done in memory.
Partial cache: SSIS build the memory cache while reading the rows from Hire table. When SSIS reads the rows from Hire table, it tries to find the Customer Key in memory cache. If it can’t find it, it reads the DimCustomer on disk.
No cache: SSIS doesn’t build memory cache. SSIS joins DimCustomer to Hire table (using RDBMS operation) to find the Customer Key.

Cache Connection Manager (CCM): SSIS reads DimCustomer and put in memory cache. This cache can be reused and shared between several Lookups, and across several dataflows. CCM reduces database workload.

OLE DB connection manager: a table in the warehouse (in this case it’s DimCustomer), see Connection pane below:

In the Columns pane, connect Customer Id from the Hire table to the Customer Id on the DimCustomer table, and bring the CustomerKey as output:

Box 4: Get Van Key

It’s the same as Get Customer Key above, i.e. using Lookup transform. When connecting [Get Customer Key] and [Get Van Key] choose Lookup Match Output. I’ll come back for Lookup No Match Output at the end of the article. I just want to get this up and running first. We should always get a simple one first up and running, then build up from there.

Box 5. Get Hire Date Key

It’s the same as Get Customer Key above.

Box 6. Populate FactHire

It’s OLE DB destination. On the Connection Manager pane, specify FactHire as the target table, and leave everything as is:

On the Mapping pane, the source column has the same name as the destination column, except HireDateKey:

That’s it. Run it and all should be green, inserting 1000 rows into the target fact table:

And the FactHire fact table now looks like this:

The 4 dimension key columns are populated with the surrogate key from the respective dimension.

Wrapping Up

The above workflow works but there are 2 things we need to do:

  1. The no match rows
    If in the Hire table we have a CustomerId which does not exist in the Customer Dimension, we need to set the CustomerKey to 0.
  2. Delete fact rows, if exists
    If today’s rows exists in the target fact table, delete them first.

As this is long enough. I’m going to put it as a separate article (link).

Hope this helps,

Vincent Rainardi

26/10/12: Data type error when mapping HireDate column to Date column

On Box 5 Get Hire Date Key, on the Lookup Transformation editor, on the Columns pane (see below), when we try to map HireDate column from Available Input Columns to the Date column on Available Lookup Columns, we get an error message: Cannot map the input column, ‘HireDate’, to the lookup column, ‘Date’, because the data types do not match:

Thank you to Masu for pointing this out.

This is because on the Hire table (the input), the HireDate column is in date data type, whereas in the DimDate table, the Date column is in varchar(10).

To solve this issue, in the Get Hire Date Key task, in the Connection pane, instead of “Use a table or a view”, select “Use results of an SQL query” and type:

select DateKey, Year, Month, convert(date, Date) from DimDate where date <> ‘Unknown’

Because both are now of data type Date, in the column pane you won’t have any issue connecting HireDate to Date:

How to Convert Datetime to Int in SSIS

Filed under: SSIS — Vincent Rainardi @ 5:30 am

This usually happen if you need to insert Date Key into a fact table. You need do convert a date from the source table which has datetime data type to integer, i.e. from ‘2012-05-26 00:00:00’ to 20120526.

In T-SQL you can do this to convert datetime to int:

convert(int, replace( convert(varchar(10), @Date, 20) ,’-‘,”))

But in SSIS data flow, what do you do?

You need to use derived column like this:

The Derived Column transformation will add a new column to the workflow, which is the integer Date Key.

In SSIS we have the following date and string functions:

So to convert today’s date to integer, the idea is to get the year component, and the month component and the day component (as 3 different string), then concatenate them, then convert to integer.

So in the derived column we do these 3 components:


(DT_WSTR,4) YEAR(GETDATE())  ----- (1)

(DT_WSTR,2) MONTH(GETDATE())    ------ (2)

(DT_WSTR,2) DAY(GETDATE())   ------- (3)

But we need to remember that for November, the (2) output is 2 digit (11). But for May, the (2) output is 1 digit (5). So for 22nd May 2012 it would be 2012522 because May is single digit. So we need to pad it with zero like this:


RIGHT( "0" + (DT_WSTR,2) MONTH(GETDATE()) , 2 )

And the same with date: if the date is 1st to 9th, it will be a single digit. So 6th May 2012 will be 2012056, because 6th is a single digit. So we need to pad it with zero, like this:


RIGHT( "0" + (DT_WSTR,2) DAY(GETDATE()) , 2 )

And finally we combine those 3 components using “+” like this:


(DT_WSTR,4) YEAR(GETDATE()) + RIGHT( "0" + (DT_WSTR,2) MONTH(GETDATE()) , 2 ) + RIGHT( "0" + (DT_WSTR,2) DAY(GETDATE()) , 2 )

So we will get a string like this: “20120522”.

As the last step, we just need to convert it to integer, using (DT_I4), like this:


(DT_I4)((DT_WSTR,4)YEAR(GETDATE()) + RIGHT("0" + (DT_WSTR,2)MONTH(GETDATE()),2) + RIGHT("0" + (DT_WSTR,2)DAY(GETDATE()),2))

In SSIS we have the following type cast operator available to us:

Here’s how it looks as an expression on the Derived Column window:

So once again the expression we need to convert today’s date (datetime) in Derived Column to integer is:


(DT_I4) ((DT_WSTR,4) YEAR(GETDATE()) + RIGHT( "0" + (DT_WSTR,2) MONTH(GETDATE()) , 2 ) + RIGHT( "0" + (DT_WSTR,2) DAY(GETDATE()) , 2 ))

Hope this helps,

Vincent Rainardi 26/5/2012

20 March 2012

SSIS: Loading a Big File Fast

Filed under: SSIS — Vincent Rainardi @ 7:21 am
Tags:

Loading a big file into a table in SQL Server

The differences between a slow load and a very fast load are:

  1. Flat file source: ValidateExternalMetadata = false
  2. OLE DB Destination: AccessMode: OpenRowSet using FastLoad (default is OpenRowSet)

Flat file: 230,000 rows, 15 columns, pipe separated, qualifier: double quote
SSIS: data conversion to Unicode on 10 columns

Big difference in performance:

  • Before: 229 seconds (3.8 minutes)
  • After: 3 seconds

So next time you load a file into a table using SSIS, please remember to use these settings.

16 November 2011

SSIS: True/False and 1/0 values on Bit Columns

Filed under: SSIS — Vincent Rainardi @ 11:13 pm

The column is defined as bit on a SQL Server table. You created an SSIS package to export the table to a text file. In Mgt Studio when you query the table, the bit column displays 1 and 0 as its value, but in the output text file they are true and false.

You looked everywhere for what might have caused it to be true/false. You suspected the text file connection. But it is defined as string[50] there. So you changed the SQL query inside the workflow from

select column1 from table1

to: select convert(int, column1) from table1

But it is still the same. The output is still true/false not 1/0 as you see in Mgt Studio.

So you convert to varchar like this:

select convert(varchar, convert(int, column1)) from table1

You hoped that it will now be 1 and 0, but alas no. It is still true and false.

So you changed the data type in the text file connection from string to four byte signed integer.

What you get is -1 for true and 0 for false. Yes! Not 1 and 0 but -1 and 0.

So you changed the data type in the text file connection to four byte unsigned integer. Ha this will force it to be 1 and 0 you thought.

But alas, no! What you get is a very big number for true and 0 for false.f

So here the secret: right click on the OLE DB Source component in the data flow, and choose Show Advanced Editor. Go to the last tab, Input and Output Properties:

Go to OLE DB Source Output, Output Column (NOT External Columns!), then select the bit column. Change the DataType from Boolean to Four Byte Signed Integer.

Change your query to:

select convert(int, column1) as column1 from table1

Note: If you don’t the above the output will be -1 and 0, not 1 and 0.

And execute the data flow.

Voila, you get 1 and 0 in the output text file.

Note: whether in the flat file connection you defined the output columns as boolean or string or signed int, it will still be 1 and 0.

Hope it helps. As usual I welcome comments and corrections.

Vincent Rainardi, 16/11/2011

26 June 2011

Import/Export Files with Variable File Name Set at Run Time

Filed under: SSIS — Vincent Rainardi @ 6:37 am
Tags:

In ETL we often need to import files. But these files usually don’t have a fixed file name. They are variable. The file names change every day, because they have date on their file names. For example, yesterday it was called trans_20110627. Today it is called trans_20110628. Tomorrow it will be called trans_20110629.

How do we import them? We set the file name at run time.

How do we set the file name at run time? Using a script task, at run time we set the value of a variable. This variable is used as the file name of the file.

This might be one of the “oldest tricks in the book”, but still, I found that many people don’t know it. So it’s probably worth writing a post about it.

So here’s how to do it, step by step:

First, create a variable called TransFileName, of type string, scope = package:

Then create a script task on the control flow, call it Set File Name:

Edit the script task, put the TransFileName variable on the ReadWriteVariable:

Then click on Edit Script:

For simplicity in the above script I just put a fixed file name. But in the real project the logic is for example: find today’s file, find the latest file, etc, which is for another blog post. I’ll write that “find the latest file logic in using c# script” post in the next few days and put the link here.

Save the ScriptMain.cs window and close it. Click OK to come out of the Script Task Editor.

Create a Data Flow task:

Double click the Data Flow Task to edit it. Create Flat File Source and OLE DB Destination.

Double click on the Trans File to edit it. Click New on the Flat File Source Editor. Click Browse and select the file we want to import. For this exercise I’ve created a small text file like this:

It only contains 1 column and 2 rows. I also created a small table to illustrate the process:

create table VarFileName.Trans ( Col1 int )

So, click on Browse and select the file we want to import:

Click “Column names in the first data row” if your file has column names in the first row. Click on Columns tab on the left pane:

Specify the column delimiter e.g. pipe (|). Click on the Advanced tab on the left pane:

Set the data type for each column. Click OK to close the Flat File Connection Manager Editor and go back to the Flat File Source Editor:

Click on the Columns on the left pane:

Click OK to close the Flat File Source Editor.

Edit the OLE DB, point it to the target table, e.g. VarFileName.Trans:

Click on Mappings on the left pane and check that all columns on the file have been mapped properly to the table. In the example there is only 1 column:

Click OK to close the OLE DB Destination Editor and go back to the Data Flow window.

Now we are going to use the TransFileName variable as the file name of the file. We do a right click on the TransFile on the Connection Manager and select Properties.

Click on the ellipsis on the Expressions to open the Property Expression Editor. Set the Property column to Connection String. Under the Expression column, click on the ellipsis to open the Expression Builder. Choose the TransFileName variable, drag it to the Expression box:

 

Click OK to close the Expression Builder. Click OK again to close the Property Expressions Editor.

That’s it, we are done. At run time, the script task will set the value of the TransFileName variable. This variable is used as the file name of the file we want to import. Here’s the Control Flow (left) and Data Flow (right) at when we run it:

This can also be used to when we need to export to a file which we only know the name at run time. It works the same way.

Hope this helps,

Vincent Rainardi, 26/6/2011

21 June 2011

Updating a Variable based on a File (Script Task)

Filed under: SSIS — Vincent Rainardi @ 10:24 pm
Tags:

In the last article (link) I showed how to update a variable based on a file for script component. That’s useful when we are in a Work Flow. But sometimes the situation happens on the Control Flow. In this case we need to do a Script Task.

Using a Script Task it is a lot simpler than using a Script Component. We essentially need to do a bit of C# coding to read the file and set the variable.

This is the file that we want to read:

Pull the Script Task from the Toolbar to the Control Flow tab:

Define a variable to hold the Number of Customer:

Double click the Script Task to edit it. Set the read write variable to the variable we just created (NoOfCustomers):

Double click the Edit Script and edit it as follows:

using System;
using System.Data;
using Microsoft.SqlServer.Dts.Runtime;
using System.Windows.Forms;
using System.IO;

{
  public partial class ScriptMain : Microsoft.SqlServer.Dts.Tasks.ScriptTask.VSTARTScriptObjectModelBase
  {
    public void Main()
    {
      const string FILE_NAME = "C:\\Util\\File.txt";
      int ReturnValue;
      String line;
      char[] delimiters = "|".ToCharArray();
      string[] columns;
      try
      {
        if (!File.Exists(FILE_NAME))
        {
          MessageBox.Show("File " + FILE_NAME + " does not exist.");
          ReturnValue = (int)ScriptResults.Failure;
          return;
        }
        using (StreamReader sr = File.OpenText(FILE_NAME))
        {
          while ((line = sr.ReadLine()) != null)
          {
            //MessageBox.Show(line);
            columns = line.Split(delimiters);
            //MessageBox.Show(columns[0].ToString());
            //MessageBox.Show(columns[1].ToString());

            if (columns[0].ToString() == "Number Of Customers")
            {
              //MessageBox.Show(columns[0].ToString());
              //MessageBox.Show(columns[1].ToString());
              Dts.Variables["NoOfCustomers"].Value = Convert.ToInt32(columns[1].ToString());
            }
          }
          ReturnValue = (int)ScriptResults.Success;
        }
      }

      catch (Exception e)
      {
        MessageBox.Show(e.Message);
        ReturnValue = (int)ScriptResults.Failure;
      }

      Dts.TaskResult = ReturnValue;
    }
  }
}

// ref: http://msdn.microsoft.com/en-us/library/db5x7c0d.aspx

Next Page »

Blog at WordPress.com.