Data Warehousing and Business Intelligence

26 May 2012

How to Populate a Fact Table using SSIS (part 2)

Filed under: Data Warehousing,SSIS — Vincent Rainardi @ 9:47 pm

In part 1 (link) we have populated a fact table (FactHire) by reading the source table (Hire) and looking up 3 dimensions (DimCustomer, DimVan and DimDate) to get the surrogate dimension keys and then insert into FactHire.

In the end of the article I said that there are 2 things that we need to do:

  1. Process the no match rows
    If in the Hire table we have a CustomerId which does not exist in the Customer Dimension, we need to set the CustomerKey to 0.
  2. Delete fact rows, if exists
    If today’s rows exists in the target fact table, delete them first.

So let’s get started.

Process the No Match Rows

The Lookup transform we try to convert the natural keys from the Hire table (Customer ID, Registration No and Hire Date) to the dimension keys (CustomerKey, VanKey and HireDateKey) by looking up the dimension table. Consider that in the Hire table we have a CustomerId of 123. We search DimCustomer and we couldn’t find CustomerId 123. In this case we need to set the Customer Key to 0.

To do that, we need to process the “No Match Output” of the Lookup transform (the second green row). Then merge it back into the main flow. Here’s how it look when finished:

So let’s begin. The first thing to do is to handle the “rows with no match entries”. Double click on [Get Customer Key] and change the [Specify how to handle rows with no matching entries] from [Failed component] to [Redirect rows to no match output], like this:

Now as you can see above there are 2 green arrows coming out of Get Customer Key lookup transformation. The [Lookup Matched Output] green arrow is now going to [Sort on HireId1], and we set the [Lookup No Match Output] to a Derived Column transformation called [Mismatched CustomerId]. In this [Mismatched Customer Id], we are simply setting the CustomerKey column = 0, like this:

We would like to combine the output of Match Output and No Match Output using a Merge transformation, like this:

But the Merge transformation requires us to sort each of the two inputs. So we have to pass them through a Sort transform like this:

In the [Sort on HireId2], we sort the data set on HireId, and pass all other columns through to the output, like this: (and we do exactly the same on [Sort on HireId 1])

So why do we sort on HireId? Why not on other column? Because HireId is a unique identifier. It uniquely identifies a hire transaction. Imagine this scenario: there are 1000 rows on the Hire table in HireBase. 997 of them has CustomerId exist in DimCustomer, and 3 of them has CustomerId which  don’t exist in DimCustomer. So the Match Output will have 997 rows and No Match Output will have 3 rows. When we merge them again, we want to use a column which was a unique ID before they split up, so that when they merge this unique ID will identify them properly.

Now that we have merged the first lookup, we need to do the same on the 2nd and 3rd Lookups (Van Key and Hire Date Key). But this time we don’t need to sort the output of the derived column and lookup, because they are already sorted:

Now our lookup is complete and proper. We have processed the No Match rows.

Delete Target Rows if Exist

One of the basic principles of data warehouse ETL is “RERUNABILITY”. Meaning that we need to be able to run the ETL several times a day without causing any issue to the data. In terms of loading a Period Snapshot Fact table, this means that we need to delete the [today’s date rows] in the target fact table, before we load the new rows.

Let’s say that this morning the ETL ran, pumping 1000 rows into the FactHire table, all with Snapshot Date = today. And now (say 2pm) if we were to run it again, it will pump another 1000 rows with Snapshot Date = today, causing duplication of data. Because of this we need to delete today’s rows in the target fact table.

Unfortunately we can’t use truncate in this case because we only want to delete today’s rows, not yesterday’s ones or any other dates before today. We need an index to be created JUST on Snapshot Date. I repeat: JUST on the Snapshot Date. We don’t want covering index or query help index or any multi key index, we just want Snapshot Date alone. Why? Because if we put other column on the index it may not get used when we delete today’s rows because the selectivity is too low. Can we use table partitioning? Yes we can. We can (sometimes we should, if the volume warrants it) partition a periodic snapshot fact table on Snapshot Date. But sometimes we can’t because the partitioning key is used for something more selecting, such as on product Id. But in majority of the cases we can.

Any way I digress. We need to delete today’s rows. Should you want to discuss further about partitioning you can email me at vrainardi@gmail.com.

Deleting today’s rows is a simple OLEDB SQL Command transformation like this:

And inside, the OLE DB Command look like this:

As you can see the SqlCommand is “Delete from FactHire where SnapshotDateKey = 20120526”. But we don’t want to hardcode the date! We want it to be today’s date. So we make it like this:


delete from FactHire where SnapshotDateKey = convert(int, replace( convert(varchar(10), GETDATE(), 20) ,'-',''))

The above code snippet will convert today’s date into an integer. For real warehouse project this is not acceptable, because the batch could start 1 millisecond before midnight and this OLE DB Command could end up executing on the next day. For real warehouse project the first thing to do in the ETL Control Flow is to set the Batch Date and Time using a script task, like this:

And the script task populates a global variable so that all workflows in the package can use it. Of course in the real project there would be a lot of dimension population work flow, before it hits the fact table population.

So that’s it, we have done 2 things that we set out at the beginning of this article: to process the No Match Output and to delete today’s rows if any.

Hope this helps,

Vincent Rainardi, 26/5/2012

How to Populate a Fact Table using SSIS (part 1)

Filed under: Data Warehousing,SSIS — Vincent Rainardi @ 2:24 pm
Tags: ,

Seems to me that some people are still struggling to populate their fact tables using SSIS. When doing it, they come across issues like these:

  1. Where do I populate my fact table from?
  2. How do I get the dimension keys to put into my fact table?
  3. Where can I get the data for the measures columns?
  4. With what do I populate the snapshot date column?
  5. What is the primary key of my fact table?
  6. The source table doesn’t have a primary key of the source table. What should I do?
  7. I got duplicate rows in the fact table. How do I handle it?
  8. The rows with that snapshot date already exist. What should I do?
  9. Lookup cache: should I choose full cache or partial cache?

10. I can’t find the row in the dimension table, what should I do?

As always, the best way to explain is by example, which is more effective than answering the above questions. So I’m going to do the following:

  1. Describe the background on the company and the data warehouse
  2. Create the source tables and populate them
  3. Create the dimension tables and populate them
  4. Create the fact table (empty)
  5. Build an SSIS package to populate the fact table, step by step

Background

It’s a van hire company called TopHire. It’s purely fictional of course. Their business system (HireBase) captures the rental information, including the customer information. HireBase has a fleet database where all vans are maintained. HireBase contains only 3 tables:

  1. Customer: contains 100 customers, e.g. name, data of birth, telephone number, etc. A column called CustomerId uniquely identify a customer.
  2. Van: contains 20 vans that Top Hire operates. It’s their fleet. Contains the registration number, engine size, van size, colour, year of manufacture. The unique identifier for each van is the registration number.
  3. Hire: contains 1000 hire transactions since 1st Jan 2011. Every hire transaction stores: who the customer was, which van was rented out, the date it occurred, number of days and various hire fees/charges (the van, sat nav, insurance, damage waiver and total bill). For each transaction we have Hire ID, which is a unique identifier in this table.

The data warehouse contains 4 tables:

  1. Date dimension: contains every single date from 2006 to 2016.
  2. Customer dimension: contains 100 customers. To be simple we’ll make it type 1 so we don’t create a new row for each change.
  3. Van dimension: contains 20 vans. To be simple we’ll make it type 1 so we don’t create a new row for each change.
  4. Hire fact table: contains 1000 hire transactions since 1st Jan 2011. It is a daily snapshot fact table so that every day we insert 1000 rows into this fact table. So over time we can track the changes of total bill, van charges, satnav income, etc.

Create the source tables and populate them

So now we are going to create the 3 tables in HireBase database: Customer, Van, and Hire. Then we populate them.

First I’ll show you how it looks when it’s done:

Customer table:

Van table:

Hire table:

And here is the script to create and populate them:

-- Create database
create database HireBase
go
use HireBase
go

-- Create customer table
if exists (select * from sys.tables where name = 'Customer')
drop table Customer
go

create table Customer
( CustomerId varchar(20) not null primary key,
 CustomerName varchar(30), DateOfBirth date, Town varchar(50),
 TelephoneNo varchar(30), DrivingLicenceNo varchar(30), Occupation varchar(30)
)
go

-- Populate Customer
truncate table Customer
go

declare @i int, @si varchar(10), @startdate date
set @i = 1
while @i <= 100
begin
 set @si = right('0'+CONVERT(varchar(10), @i),2)
 insert into Customer
 ( CustomerId, CustomerName, DateOfBirth, Town, TelephoneNo, DrivingLicenceNo, Occupation)
 values
 ( 'N'+@si, 'Customer'+@si, DATEADD(d,@i-1,'2000-01-01'), 'Town'+@si, 'Phone'+@si, 'Licence'+@si, 'Occupation'+@si)
 set @i = @i + 1
end
go

select * from Customer

-- Create Van table
if exists (select * from sys.tables where name = 'Van')
drop table Van
go

create table Van
( RegNo varchar(10) not null primary key,
 Make varchar(30), Model varchar(30), [Year] varchar(4),
 Colour varchar(20), CC int, Class varchar(10)
)
go

-- Populate Van table
truncate table Van
go

declare @i int, @si varchar(10)
set @i = 1
while @i <= 20
begin
 set @si = convert(varchar, @i)
 insert into Van
 ( RegNo, Make, Model, [Year], Colour, CC, Class)
 values
 ( 'Reg'+@si, 'Make'+@si, 'Model'+@si,
 case @i%4 when 0 then 2008 when 1 then 2009 when 2 then 2010 when 3 then 2011 end,
 case when @i%5<3 then 'White' else 'Black' end,
 case @i%3 when 0 then 2000 when 1 then 2500 when 2 then 3000 end,
 case @i%3 when 0 then 'Small' when 1 then 'Medium' when 2 then 'Large' end)
 set @i = @i + 1
end
go

select * from Van

-- Create Hire table
if exists (select * from sys.tables where name = 'Hire')
drop table Hire
go

create table Hire
( HireId varchar(10) not null primary key,
 HireDate date not null,
 CustomerId varchar(20) not null,
 RegNo varchar(10), NoOfDays int, VanHire money, SatNavHire money,
 Insurance money, DamageWaiver money, TotalBill money
)
go

-- Populate Hire table
truncate table Hire
go

declare @i int, @si varchar(10), @DaysFrom1stJan int, @CustomerId int, @RegNo int, @mi int
set @i = 1
while @i <= 1000
begin
 set @si = right('000'+convert(varchar(10), @i),4) -- string of i
 set @DaysFrom1stJan = (@i-1)%200 --The Hire Date is derived from i modulo 200
 set @CustomerId = (@i-1)%100+1 --The CustomerId is derived from i modulo 100
 set @RegNo = (@i-1)%20+1 --The Van RegNo is derived from i modulo 20
 set @mi = (@i-1)%3+1 --i modulo 3
 insert into Hire (HireId, HireDate, CustomerId, RegNo, NoOfDays, VanHire, SatNavHire, Insurance, DamageWaiver, TotalBill)
 values ('H'+@si, DateAdd(d, @DaysFrom1stJan, '2011-01-01'),
 left('N0'+CONVERT(varchar(10),@CustomerId),3), 'Reg'+CONVERT(varchar(10), @RegNo),
 @mi, @mi*100, @mi*10, @mi*20, @mi*40, @mi*170)
 set @i += 1
end
go

select * from Hire

Create the Data Warehouse

So now we are going to create the 3 dimension tables and 1 fact table in the data warehouse: DimDate, DimCustomer, DimVan and FactHire. We are going to populate the 3 dimensions but we’ll leave the fact table empty. The purpose of this article is to show how to populate the fact table using SSIS.

First I’ll show you how it looks when it’s done:

Date Dimension:

Customer Dimension:

Van Dimension:

Hire Fact Table:

And then we do it. This is the script to create and populate those dim and fact tables:

-- Create the data warehouse
create database TopHireDW
go
use TopHireDW
go

-- Create Date Dimension
if exists (select * from sys.tables where name = 'DimDate')
drop table DimDate
go

create table DimDate
( DateKey int not null primary key,
 [Year] varchar(7), [Month] varchar(7), [Date] date, DateString varchar(10))
go

-- Populate Date Dimension
truncate table DimDate
go

declare @i int, @Date date, @StartDate date, @EndDate date, @DateKey int,
 @DateString varchar(10), @Year varchar(4),
 @Month varchar(7), @Date1 varchar(20)
set @StartDate = '2006-01-01'
set @EndDate = '2016-12-31'
set @Date = @StartDate

insert into DimDate (DateKey, [Year], [Month], [Date], DateString)
 values (0, 'Unknown', 'Unknown', '0001-01-01', 'Unknown') --The unknown row

while @Date <= @EndDate
begin
 set @DateString = convert(varchar(10), @Date, 20)
 set @DateKey = convert(int, replace(@DateString,'-',''))
 set @Year = left(@DateString,4)
 set @Month = left(@DateString, 7)
 insert into DimDate (DateKey, [Year], [Month], [Date], DateString)
 values (@DateKey, @Year, @Month, @Date, @DateString)
 set @Date = dateadd(d, 1, @Date)
end
go

select * from DimDate

-- Create Customer dimension
if exists (select * from sys.tables where name = 'DimCustomer')
drop table DimCustomer
go

create table DimCustomer
( CustomerKey int not null identity(1,1) primary key,
 CustomerId varchar(20) not null,
 CustomerName varchar(30), DateOfBirth date, Town varchar(50),
 TelephoneNo varchar(30), DrivingLicenceNo varchar(30), Occupation varchar(30)
)
go

insert into DimCustomer (CustomerId, CustomerName, DateOfBirth, Town, TelephoneNo,
 DrivingLicenceNo, Occupation)
select * from HireBase.dbo.Customer

select * from DimCustomer

-- Create Van dimension
if exists (select * from sys.tables where name = 'DimVan')
drop table DimVan
go

create table DimVan
( VanKey int not null identity(1,1) primary key,
 RegNo varchar(10) not null,
 Make varchar(30), Model varchar(30), [Year] varchar(4),
 Colour varchar(20), CC int, Class varchar(10)
)
go

insert into DimVan (RegNo, Make, Model, [Year], Colour, CC, Class)
select * from HireBase.dbo.Van
go

select * from DimVan

-- Create Hire fact table
if exists (select * from sys.tables where name = 'FactHire')
drop table FactHire
go

create table FactHire
( SnapshotDateKey int not null, --Daily periodic snapshot fact table
 HireDateKey int not null, CustomerKey int not null, VanKey int not null, --Dimension Keys
 HireId varchar(10) not null, --Degenerate Dimension
 NoOfDays int, VanHire money, SatNavHire money,
 Insurance money, DamageWaiver money, TotalBill money
)
go

select * from FactHire

Now you can see that the 3 dimensions have been populated. And the fact table is created and empty, ready for us to populate it.

Build the SSIS Package to Populate the Fact Table

First I show you how it looks when it’s done. This is the overall workflow we are going to build:

This is a bit of description on each of the 6 boxes above:

  1. Read Hire table in HireBase: First we get the original OLTP records from the hire table in HireBase.
  2. Get Snapshot Date Key: we get today’s date and convert it into an integer as Snapshot Date Key. FactHire is a snapshot fact table. A periodic snapshot fact table. Which means that every day we populate the fact table with 1000 rows like this:Next week, on Monday 28th May, we insert 1000 rows to the fact table, which is all the rows in the Hire table. Then on Tuesday 29thwe insert 1010 rows. Because there would be 10 new hire transaction in the Hire table in HireBase system. The 1000 “old” rows might change as well on Tuesday, not all of them will be the same as Monday. Perhaps there are 10 rows which are modified, the amount/fees changed, etc. Whatever the condition of the Hire table on Tuesday, we are taking all rows in put them into our data warehouse. And we do this every day. We every day, put whatever is in the Hire table into FactHire. That’s what Periodic Snapshot fact table means.Hence in a periodic snapshot fact table we have a “Snapshot Date”. Meaning: the date when we captured the source table. Or, the date when these 1000 rows were inserted into this fact table. The value of this snapshot date is today’s date (the date the ETL runs). The date the load happens.
  3. Get Customer Key: from Hire table we get the Customer ID which was involved in the hire transaction. We then go to the Customer Dimension to get the Customer Key for this Customer ID.
  4. Get Van Key: from Hire table we get the Registration Number which was rented out in the transaction. We then go to the Van Dimension to get the Van Key for this Registration Number.
  5. Get Hire Date Key: from Hire table we get the Hire Date, which is the date when the hire transaction happened. We then go to the Date Dimension to get the Date Key for this Hire Date.
  6. Populate FactHire: After we get all the required dimension keys, we insert the rows into the FactHire fact table.

Now that we know what to build, let’s build it. The box numbers below refer to the 6 boxes above.

Box 1. Read Hire table in HireBase

We create the data source in the workflow. It’s OLE DB Data Flow Source, from HireBase. We take all columns, so we don’t specify a query:

Box 2. Get Snapshot Date Key

Create a Derived Column and set the Expression to:

(DT_I4)((DT_WSTR,4)YEAR(GETDATE()) + RIGHT(“0″ + (DT_WSTR,2)MONTH(GETDATE()),2) + RIGHT(“0″ + (DT_WSTR,2)DAY(GETDATE()),2))

See the complete article here.

Box 3. Get Customer Key

Create a Lookup transformation and set to Full Cache, OLE DB connection manager, Fail component, like this:

Full cache: SSIS reads all rows from DimCustomer into memory, then read the Hire table and SSIS doesn’t need to read DimCustomer any more. All lookup operations are done in memory.
Partial cache: SSIS build the memory cache while reading the rows from Hire table. When SSIS reads the rows from Hire table, it tries to find the Customer Key in memory cache. If it can’t find it, it reads the DimCustomer on disk.
No cache: SSIS doesn’t build memory cache. SSIS joins DimCustomer to Hire table (using RDBMS operation) to find the Customer Key.

Cache Connection Manager (CCM): SSIS reads DimCustomer and put in memory cache. This cache can be reused and shared between several Lookups, and across several dataflows. CCM reduces database workload.

OLE DB connection manager: a table in the warehouse (in this case it’s DimCustomer), see Connection pane below:

In the Columns pane, connect Customer Id from the Hire table to the Customer Id on the DimCustomer table, and bring the CustomerKey as output:

Box 4: Get Van Key

It’s the same as Get Customer Key above, i.e. using Lookup transform. When connecting [Get Customer Key] and [Get Van Key] choose Lookup Match Output. I’ll come back for Lookup No Match Output at the end of the article. I just want to get this up and running first. We should always get a simple one first up and running, then build up from there.

Box 5. Get Hire Date Key

It’s the same as Get Customer Key above.

Box 6. Populate FactHire

It’s OLE DB destination. On the Connection Manager pane, specify FactHire as the target table, and leave everything as is:

On the Mapping pane, the source column has the same name as the destination column, except HireDateKey:

That’s it. Run it and all should be green, inserting 1000 rows into the target fact table:

And the FactHire fact table now looks like this:

The 4 dimension key columns are populated with the surrogate key from the respective dimension.

Wrapping Up

The above workflow works but there are 2 things we need to do:

  1. The no match rows
    If in the Hire table we have a CustomerId which does not exist in the Customer Dimension, we need to set the CustomerKey to 0.
  2. Delete fact rows, if exists
    If today’s rows exists in the target fact table, delete them first.

As this is long enough. I’m going to put it as a separate article (link).

Hope this helps,

Vincent Rainardi

How to Convert Datetime to Int in SSIS

Filed under: SSIS — Vincent Rainardi @ 5:30 am

This usually happen if you need to insert Date Key into a fact table. You need do convert a date from the source table which has datetime data type to integer, i.e. from ‘2012-05-26 00:00:00’ to 20120526.

In T-SQL you can do this to convert datetime to int:

convert(int, replace( convert(varchar(10), @Date, 20) ,’-',”))

But in SSIS data flow, what do you do?

You need to use derived column like this:

The Derived Column transformation will add a new column to the workflow, which is the integer Date Key.

In SSIS we have the following date and string functions:

So to convert today’s date to integer, the idea is to get the year component, and the month component and the day component (as 3 different string), then concatenate them, then convert to integer.

So in the derived column we do these 3 components:


(DT_WSTR,4) YEAR(GETDATE())  ----- (1)

(DT_WSTR,2) MONTH(GETDATE())    ------ (2)

(DT_WSTR,2) DAY(GETDATE())   ------- (3)

But we need to remember that for November, the (2) output is 2 digit (11). But for May, the (2) output is 1 digit (5). So for 22nd May 2012 it would be 2012522 because May is single digit. So we need to pad it with zero like this:


RIGHT( "0" + (DT_WSTR,2) MONTH(GETDATE()) , 2 )

And the same with date: if the date is 1st to 9th, it will be a single digit. So 6th May 2012 will be 2012056, because 6th is a single digit. So we need to pad it with zero, like this:


RIGHT( "0" + (DT_WSTR,2) DAY(GETDATE()) , 2 )

And finally we combine those 3 components using “+” like this:


(DT_WSTR,4) YEAR(GETDATE()) + RIGHT( "0" + (DT_WSTR,2) MONTH(GETDATE()) , 2 ) + RIGHT( "0" + (DT_WSTR,2) DAY(GETDATE()) , 2 )

So we will get a string like this: “20120522″.

As the last step, we just need to convert it to integer, using (DT_I4), like this:


(DT_I4)((DT_WSTR,4)YEAR(GETDATE()) + RIGHT("0" + (DT_WSTR,2)MONTH(GETDATE()),2) + RIGHT("0" + (DT_WSTR,2)DAY(GETDATE()),2))

In SSIS we have the following type cast operator available to us:

Here’s how it looks as an expression on the Derived Column window:

So once again the expression we need to convert today’s date (datetime) in Derived Column to integer is:


(DT_WSTR,4) YEAR(GETDATE()) + RIGHT( "0" + (DT_WSTR,2) MONTH(GETDATE()) , 2 ) + RIGHT( "0" + (DT_WSTR,2) DAY(GETDATE()) , 2 )

Hope this helps,

Vincent Rainardi 26/5/2012

20 March 2012

SSIS: Loading a Big File Fast

Filed under: SSIS — Vincent Rainardi @ 7:21 am
Tags:

Loading a big file into a table in SQL Server

The differences between a slow load and a very fast load are:

  1. Flat file source: ValidateExternalMetadata = false
  2. OLE DB Destination: AccessMode: OpenRowSet using FastLoad (default is OpenRowSet)

Flat file: 230,000 rows, 15 columns, pipe separated, qualifier: double quote
SSIS: data conversion to Unicode on 10 columns

Big difference in performance:

  • Before: 229 seconds (3.8 minutes)
  • After: 3 seconds

So next time you load a file into a table using SSIS, please remember to use these settings.

16 November 2011

SSIS: True/False and 1/0 values on Bit Columns

Filed under: SSIS — Vincent Rainardi @ 11:13 pm

The column is defined as bit on a SQL Server table. You created an SSIS package to export the table to a text file. In Mgt Studio when you query the table, the bit column displays 1 and 0 as its value, but in the output text file they are true and false.

You looked everywhere for what might have caused it to be true/false. You suspected the text file connection. But it is defined as string[50] there. So you changed the SQL query inside the workflow from

select column1 from table1

to: select convert(int, column1) from table1

But it is still the same. The output is still true/false not 1/0 as you see in Mgt Studio.

So you convert to varchar like this:

select convert(varchar, convert(int, column1)) from table1

You hoped that it will now be 1 and 0, but alas no. It is still true and false.

So you changed the data type in the text file connection from string to four byte signed integer.

What you get is -1 for true and 0 for false. Yes! Not 1 and 0 but -1 and 0.

So you changed the data type in the text file connection to four byte unsigned integer. Ha this will force it to be 1 and 0 you thought.

But alas, no! What you get is a very big number for true and 0 for false.f

So here the secret: right click on the OLE DB Source component in the data flow, and choose Show Advanced Editor. Go to the last tab, Input and Output Properties:

Go to OLE DB Source Output, Output Column (NOT External Columns!), then select the bit column. Change the DataType from Boolean to Four Byte Signed Integer.

Change your query to:

select convert(int, column1) as column1 from table1

Note: If you don’t the above the output will be -1 and 0, not 1 and 0.

And execute the data flow.

Voila, you get 1 and 0 in the output text file.

Note: whether in the flat file connection you defined the output columns as boolean or string or signed int, it will still be 1 and 0.

Hope it helps. As usual I welcome comments and corrections.

Vincent Rainardi, 16/11/2011

26 June 2011

Import/Export Files with Variable File Name Set at Run Time

Filed under: SSIS — Vincent Rainardi @ 6:37 am
Tags:

In ETL we often need to import files. But these files usually don’t have a fixed file name. They are variable. The file names change every day, because they have date on their file names. For example, yesterday it was called trans_20110627. Today it is called trans_20110628. Tomorrow it will be called trans_20110629.

How do we import them? We set the file name at run time.

How do we set the file name at run time? Using a script task, at run time we set the value of a variable. This variable is used as the file name of the file.

This might be one of the “oldest tricks in the book”, but still, I found that many people don’t know it. So it’s probably worth writing a post about it.

So here’s how to do it, step by step:

First, create a variable called TransFileName, of type string, scope = package:

Then create a script task on the control flow, call it Set File Name:

Edit the script task, put the TransFileName variable on the ReadWriteVariable:

Then click on Edit Script:

For simplicity in the above script I just put a fixed file name. But in the real project the logic is for example: find today’s file, find the latest file, etc, which is for another blog post. I’ll write that “find the latest file logic in using c# script” post in the next few days and put the link here.

Save the ScriptMain.cs window and close it. Click OK to come out of the Script Task Editor.

Create a Data Flow task:

Double click the Data Flow Task to edit it. Create Flat File Source and OLE DB Destination.

Double click on the Trans File to edit it. Click New on the Flat File Source Editor. Click Browse and select the file we want to import. For this exercise I’ve created a small text file like this:

It only contains 1 column and 2 rows. I also created a small table to illustrate the process:

create table VarFileName.Trans ( Col1 int )

So, click on Browse and select the file we want to import:

Click “Column names in the first data row” if your file has column names in the first row. Click on Columns tab on the left pane:

Specify the column delimiter e.g. pipe (|). Click on the Advanced tab on the left pane:

Set the data type for each column. Click OK to close the Flat File Connection Manager Editor and go back to the Flat File Source Editor:

Click on the Columns on the left pane:

Click OK to close the Flat File Source Editor.

Edit the OLE DB, point it to the target table, e.g. VarFileName.Trans:

Click on Mappings on the left pane and check that all columns on the file have been mapped properly to the table. In the example there is only 1 column:

Click OK to close the OLE DB Destination Editor and go back to the Data Flow window.

Now we are going to use the TransFileName variable as the file name of the file. We do a right click on the TransFile on the Connection Manager and select Properties.

Click on the ellipsis on the Expressions to open the Property Expression Editor. Set the Property column to Connection String. Under the Expression column, click on the ellipsis to open the Expression Builder. Choose the TransFileName variable, drag it to the Expression box:

 

Click OK to close the Expression Builder. Click OK again to close the Property Expressions Editor.

That’s it, we are done. At run time, the script task will set the value of the TransFileName variable. This variable is used as the file name of the file we want to import. Here’s the Control Flow (left) and Data Flow (right) at when we run it:

This can also be used to when we need to export to a file which we only know the name at run time. It works the same way.

Hope this helps,

Vincent Rainardi, 26/6/2011

21 June 2011

Updating a Variable based on a File (Script Task)

Filed under: SSIS — Vincent Rainardi @ 10:24 pm
Tags:

In the last article (link) I showed how to update a variable based on a file for script component. That’s useful when we are in a Work Flow. But sometimes the situation happens on the Control Flow. In this case we need to do a Script Task.

Using a Script Task it is a lot simpler than using a Script Component. We essentially need to do a bit of C# coding to read the file and set the variable.

This is the file that we want to read:

Pull the Script Task from the Toolbar to the Control Flow tab:

Define a variable to hold the Number of Customer:

Double click the Script Task to edit it. Set the read write variable to the variable we just created (NoOfCustomers):

Double click the Edit Script and edit it as follows:

using System;
using System.Data;
using Microsoft.SqlServer.Dts.Runtime;
using System.Windows.Forms;
using System.IO;

{
  public partial class ScriptMain : Microsoft.SqlServer.Dts.Tasks.ScriptTask.VSTARTScriptObjectModelBase
  {
    public void Main()
    {
      const string FILE_NAME = "C:\\Util\\File.txt";
      int ReturnValue;
      String line;
      char[] delimiters = "|".ToCharArray();
      string[] columns;
      try
      {
        if (!File.Exists(FILE_NAME))
        {
          MessageBox.Show("File " + FILE_NAME + " does not exist.");
          ReturnValue = (int)ScriptResults.Failure;
          return;
        }
        using (StreamReader sr = File.OpenText(FILE_NAME))
        {
          while ((line = sr.ReadLine()) != null)
          {
            //MessageBox.Show(line);
            columns = line.Split(delimiters);
            //MessageBox.Show(columns[0].ToString());
            //MessageBox.Show(columns[1].ToString());

            if (columns[0].ToString() == "Number Of Customers")
            {
              //MessageBox.Show(columns[0].ToString());
              //MessageBox.Show(columns[1].ToString());
              Dts.Variables["NoOfCustomers"].Value = Convert.ToInt32(columns[1].ToString());
            }
          }
          ReturnValue = (int)ScriptResults.Success;
        }
      }

      catch (Exception e)
      {
        MessageBox.Show(e.Message);
        ReturnValue = (int)ScriptResults.Failure;
      }

      Dts.TaskResult = ReturnValue;
    }
  }
}

// ref: http://msdn.microsoft.com/en-us/library/db5x7c0d.aspx

7 June 2011

SSIS: Update a Variable based on a File (Script Component)

Filed under: SSIS — Vincent Rainardi @ 7:15 pm
Tags:

Yesterday I wrote about setting a variable based on the value in a database table (link). Sometimes the source value is not stored in a database table but in a file. We need to read that file to get the value then update a package variable. Of course you can import that file into a table first, then query the table to update the variable via Result Set mapping like above. But sometimes you don’t want to do this. You want to update the variable straight from the file.

There are 2 ways to do this:

  1. Script component as a source
  2. Script Task

We use the Source Script Component when we want to set the variable as part of a Work Flow. We use the Script Task when we want to set the variable as part of a Control Flow? What’s the different? Work Flow is task where data flows from source to destination. Control Flow is a sequence of task, one after the other, not necessarily about data movement.

Out of those 2 above, in this article I will explain the first one (Update Variable from a File – Script Component as a Source). I will explain the second one (Script Task) in the next article.

OK Let’s start. Let’s say that the file we want to import is like below. Say what we want to get is the Number of Customers.

Let’s start. First we create a Data Flow by pulling the Data Flow task from the Toolbox on the left onto the Control Flow canvas:

Double click it to edit. Pull Script Component from the Toolbox onto the Work Flow. SSIS will ask whether we will use as a Source, Destination or Transformation. Choose Source and we get this:

Now create a variable to store the Number of Customer. First, click on the canvas of the Control Flow so that the variable scope is Package. Then create the variable, setting the type to Int32.

We go back to the data flow task, double click the script component:

On the ReadWriteVariables click on the Ellipses and select User::NoOfCustomers :

On the Input and Output tab, give the output a name (e.g. change it from “Output 0” to “ParamOutput”). Then create 2 output columns: ParamName and ParamValue. Set the Data Type to String for both of them:

On the Connection Managers, define a new connection:

Choose Flat File:

Give it a name, choose the file, and select “Column names in the first data row”:

Click on Columns on the left, check the preview (see below). Here we usually change the delimiter, but in this case it’s not required.

Click OK to close the window and back to the Script Transformation Editor Connection Managers tab. Choose the Connection Manager you just created (ParameterFile) and give it a name (e.g. ParamConn):

Now we are ready to edit the script (to be ready we need to have: connection, output and variable). Click on Script on the left and click on Edit Script:

Edit it like this:

using System;
using System.Data;
using Microsoft.SqlServer.Dts.Pipeline.Wrapper;
using Microsoft.SqlServer.Dts.Runtime.Wrapper;
using System.IO; // This namespace is required for File functionalities
using System.Windows.Forms; // This namespace is required for the Message Boxes

[Microsoft.SqlServer.Dts.Pipeline.SSISScriptComponentEntryPointAttribute]
public class ScriptMain : UserComponent
{
    // Declare the variables
    private StreamReader SR;
    private string File1;
    private int NoOfCustomers;

    public override void AcquireConnections(object Transaction)
    {
        // Get the ParamConn connection, store it as 'File1'
        IDTSConnectionManager100 CM = this.Connections.ParamConn;
        File1 = (string)CM.AcquireConnection(null);
    }

    public override void PreExecute()
    {
        // Create a reader for File1
        base.PreExecute();
        SR = new StreamReader(File1);
    }

    public override void PostExecute()
    {
        // Close the reader
        base.PostExecute();
        SR.Close();

        // Set the Package Variable
        Variables.NoOfCustomers = NoOfCustomers;
    }

    public override void CreateNewOutputRows()
    {
        // Declare variables
        string nextLine;
        string[] columns;
        char[] delimiters;

        // Set the delimiter
        delimiters = "|".ToCharArray();

        // Read the first line (header)
        nextLine = SR.ReadLine();

        // Split the line into columns
        columns = nextLine.Split(delimiters);

        // Read the second line and loop until the end of the file
        nextLine = SR.ReadLine();
        while (nextLine != null)
        {
            // Split the line into columns
            columns = nextLine.Split(delimiters);

            // Add a row on the output
            ParamOutputBuffer.AddRow();

            // Use MessageBox for Debugging
            // (you can debug a Script Task but not Script Component)
            MessageBox.Show(columns[0].ToString());
            MessageBox.Show(columns[1].ToString());

            // Set the values of the Script Component output according to the file content
            ParamOutputBuffer.ParamName = columns[0];
            ParamOutputBuffer.ParamValue = columns[1];

            // When we find the row, set the variable
            if (columns[0].ToString() == "Number Of Customers")
            {
                // Store the value of columns[1] to an internal variable
                NoOfCustomers = Convert.ToInt32(columns[1]);
                MessageBox.Show(NoOfCustomers.ToString());
                break; // Get out of the loop when we found the row
            }

            // Read the next line
            nextLine = SR.ReadLine();
        }
    }
}

Run it. You get message boxes 6 times containing: Analysis Type, Regulatory, Number Of Customers, 322514, Date Exported, 06/06/2011. Unfortunately we can’t debug a Script Component (we can debug a Script Task, but not a Script Component, see here), hence we use message boxes.

The code reads the file line by line, splitting each line on the delimiter. The split result is put in an array called columns. This array is mapped to the Output Buffer of the script component. When we found the “Number of Customer” row, set the internal variable then “break” out of the loop. At post execute, we set the package variable.

After the code is working properly don’t forget to comment out the message boxes.

Hope it’s helpful.

Vincent Rainardi, 7/6/2011

6 June 2011

SSIS: Updating a Variable based on Database

Filed under: SSIS — Vincent Rainardi @ 9:17 pm
Tags:

Sometimes when writing an SSIS package, we need to update the value of a package variable based on a database table. For example, we may have a variable called ImportFolder. This package variable contains the path to the folder which contains the files that we need to import. The value of ImportFolder variable is stored in the database. We have to query that database (select ImportFolder from table1 where col1 = …) and then set the value of the ImportFolder package variable.

In some cases the value that we need to update the package with is not stored in the database, but it is stored in a file. We need to open this file, read it and set the value of the package variable. In this article I will give an example on how to do the database one, and on the next article I’ll do the file one.

Execute SQL Task

We use an Execute SQL Task, then set the Result Set to a variable. The database doesn’t have to be SQL Server of course, it can be anything. Here’s how:

Pull Execute SQL task from the Toolbox to the Control Flow canvas:

Double click it to edit and type the query on the SQL Statement and set Result Set to Single Row.

Here’s an example of a SQL query to get the Import Folder from a database table:

After you set the SQL Statement and Result Set, click on the Result Set pane on the left handside and set the Result Name and Variable Name. Result Name is the column name on the query output and variable name is the name of the package variable (yes you need to create it first).

So click OK to come out of Execute SQL Task editor, create the variable as below.

Create the variable by clicking the first icon. You need to set the Data Type as String. To set the scope to Package, click on any point on the Control Flow canvas before creating the variable.

If you wonder “Where is this Variable window?” scroll down to the bottom of the screen, click Variables tab on the left, next to the Toolbar tab:

Now back to Result Set tab again (double click Execute SQL Task, click on Result Set tab on the left) and set the Result Name and Variable Name:

Run it. Before running it you may want to set a Breakpoint so you can see the value of the variable. Right click on the Execute SQL Task and choose Edit Breakpoints. Select the OnPostExecute event (this is after the task is run).

Then run the package. It will stop after it runs the Execute SQL task. To see the value of the ImportPath variables, type “ImportPath” in the Watch window:

Or expand the Locals window and find User::ImportPath:

If you wonder “Where are these Watch and Locals windows”, they are at the bottom of screen. If you can’t see them try: Windows menu, Reset Windows Layout.

Vincent Rainardi, 6/6/2011

PS. On the next article I’ll write about Updating a Variable from a File.

SSIS 2008 Data Profiler

Filed under: SSIS — Vincent Rainardi @ 5:30 am
Tags:

SQL Server 2008 has a new feature in SSIS: Data Profiler. For a given table, Data Profiler enables us to understand the number of distinct values in each column, and the distribution of data values. This is very useful in data warehousing, both in ETL and when building cubes in SSAS.

When building an ETL package, sometimes we need to know the data values in a column. For example, to understand whether a column is mostly empty (blank or null) or contains good data. Or to help us decide whether to exclude or include certain values or certain patterns, probably related to a data quality requirement. Or to help us identify potential hierarchies in a dimension table.

Data Profiler is also useful when doing performance tuning, for example when partitioning the a fact table or when partitioning a cube. Understanding the data distribution could assist us in creating similar size groups based on a particular column.

On the current SQL Server 2005 data warehousing project, I use Informatica Data Explorer (IDE) to understand the distribution of data values in the data warehouse fact and dimension tables, for the purpose of building SSAS cubes. I often asked myself: When will we be able to right click on any table in Management Studio, select ‘Analyze’ and get the data profile for that table?

Now in SSIS 2008 we have data profiler. Well, it’s not as friendly as doing a right click on any table and select ‘Analyze’. But it’s a start. I hope in the later releases of SQL Server we can right click any table in Management Studio and choose ‘Analyze’ to get the data profile.

So let’s find out what SSIS 2008 Data Profiler is all about. First, let’s have a look at its output:

Figure 1. Output of SSIS 2008 Data Profiler

The left pane shows the tables that we analyzed, along with the profiles that we executed for each table. The right hand pane shows the datails of the profile that we select on the left hand pane. For example, for ‘Column Value Distribution’ profile, the top right panel shows a list of columns and the number of distinct values in each column. If we click on any column on this panel, the bottom right panel displays the distribution of the data value, i.e. the value, number of rows containing that value, and the percentage of total rows. Figure 1 shows the distribution of the data values on the product_key column in the fact_sales table.

Other than analyzing the distribution of data value, SSIS 2008 Data Profiler can also analyze:

  1. The number of nulls in a column.
  2. The pattern of the data values, for example if a column contains values of 2008-01, 2008-02, …, 2008-12, then the pattern is “200d-dd” (d means digit, i.e. a number from 0 to 9).
  3. Minimum and maximum values of the data in each column.
  4. Mean (average) and standard deviation of the data values in each column. Standard deviation indicates how far the value is from the mean.
  5. Determine which column(s) is a good candidate for a primary key. This is useful for checking the 1st normal form rule, which states that each table must have a primary key that uniquely identifies each row.
  6. Checking whether a column is fully or partially dependant of other column(s). This is useful for checking the 2nd normal form and for identifying potential hierarchies within a dimension. In 2nd normal form, non primary key columns must be dependent on the primary key.
  7. Checking whether all values in a column exist in another table. In data warehousing, this is useful to identify if there are dimensional keys on the fact table which do not exist on the dimension table. For ODS design (or other normalized databases), this is useful to identify the 3rd normal form, which states that no column is transitively dependent on the primary key. In other words, it is useful for identifying potential foreign key relationships.

Now let’s understand the ‘how’.

Open BIDS, create an SSIS project, drag Data Profiling Task onto the design surface. Double click to edit it. Type the Destination File name.

Click on Profile Requests on the left pane. Select Column Value Distribution on the Profile Type column. Supply the connection details, select the Table and Column(s) at the bottom panel as shown on Figure 2 below and click OK.

Figure 2. Setting up a Data Profiling Task in SSIS

Run the SSIS package. Verify that the output file is created.

To view the output file, we need to use Data Profile Viewer utility. Go to C:\Program Files\Microsoft SQL Server\100\DTS\Binn (adjust to your SQL Server folder) and execute DataProfileViewer.exe. Click on Open, and open the output file that was produced by the SSIS Data Profiling Task earlier. It will look like figure 1 above.

In addition to Column Value Distribution Profiles that we chose above, we can also choose other profiles, such as Column Pattern Profiles, Column Statistics Profiles, Functional Dependency Profiles, Candidate Key Profiles, Column Null Ratio Profiles and Inclusion Profiles. These profiles provide the 7 analysis items I listed earlier.

The Data Profiling facility in SQL Server 2008 is quite basic. As is the the user friendliness. The viewer is not integrated in BI Development Studio, even in Release Candidate 0. But it is a good start. Whether it’s at an additional cost or not, it is a necessity these days that an ETL tool should have a data profiling facility. I’m sure in the future SSIS data profiling feature will improve a lot. I hope that in the future releases of SQL Server, the Data Profile Viewer utility is integrated into BI Development Studio and that we can right click any table in Management Studio and choose ‘Analyze’ to get the data profile.

Vincent Rainardi, 7th July 2008
Author of Building A Data Warehouse: With Examples in SQL Server

Repost from SQLServerCentral

Next Page »

Theme: Rubric. Blog at WordPress.com.

Follow

Get every new post delivered to your Inbox.

Join 52 other followers