Data Warehousing, BI and Data Science

3 October 2012

SSIS: Automating DDL Changes

Filed under: SSIS — Vincent Rainardi @ 5:58 pm
Tags:

See Preeti’s comment 3/10/12 on “SSIS: Importing a File with Dynamic Columns” article (link):

Will elaborate my question: I have columns dynamically changing (adding, deleting) on the source side, which is SQL Server, and my target is Oracle. How do I handle the source changes dynamically so that they are in sync with the target (Oracle) structure before loading? I know that in any ETL process this has to be done manually, i.e. refreshing the source and then updating the target table structure. Can't this be automated? We have a Java transformation in the ETL, so can't we make use of it to automate the process? If so, how? If not, are there any workarounds?

This article addresses that question.

—————————————————————

This can be done in theory, but in practice it is usually banned in production.

First, we obtain the new structure of the source table(s) by querying either the information schema or the system views/tables. For example, if your source is SQL Server, we query the system catalogue views like this:

select c.name as ColumnName, dt.name as DataType, c.max_length, c.precision, c.scale, c.is_nullable
from sys.columns c
join sys.tables t on c.object_id = t.object_id
join sys.schemas s on t.schema_id = s.schema_id
join sys.types dt on c.user_type_id = dt.user_type_id
where t.name = 'Table1' and s.name = 'Schema1'

We then obtain the structure of the target table(s) in the same way. For example, if your target is Oracle, we can query ALL_TAB_COLUMNS like this:

SELECT COLUMN_NAME, DATA_TYPE, DATA_LENGTH, DATA_PRECISION, DATA_SCALE, NULLABLE
FROM ALL_TAB_COLUMNS
WHERE TABLE_NAME = 'TABLE1' AND OWNER = 'OWNER1'

Both queries above are run using an Execute SQL Task in SSIS, storing the results into tables that you create beforehand. For example:

As we can see, SubType and Value3Y don't exist in the target table, and Value5Y doesn't exist in the source table. The precisions of Value1Y and Value5Y are also different.

You can compare the source and target and then use ALTER TABLE on the target to create the missing columns, but this is risky, complex and a source of headaches. A simpler, more robust solution is to drop and recreate the target table.
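For illustration, here is a minimal sketch of the comparison, assuming the two Execute SQL Tasks above stored their results into tables called SourceColumns and TargetColumns, each with a ColumnName column (these names are illustrative; you would use your own). Note that Oracle stores object names in upper case, hence the upper():

-- Columns added on the source, i.e. missing on the target:
select s.ColumnName
from SourceColumns s
left join TargetColumns t on t.ColumnName = upper(s.ColumnName)
where t.ColumnName is null

-- Columns deleted on the source, i.e. still present on the target:
select t.ColumnName
from TargetColumns t
left join SourceColumns s on upper(s.ColumnName) = t.ColumnName
where s.ColumnName is null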

If the source and target are on different RDBMSs, you will need to make a “data type conversion table”, e.g. varchar in SQL Server becomes VARCHAR2 in Oracle, and so on.
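Here is a sketch of such a conversion table. The mappings below are common ones but they are illustrative, not exhaustive; check them against your Oracle version:

create table DataTypeMap
( SqlServerType varchar(30), OracleType varchar(30) )

insert into DataTypeMap values ('varchar', 'VARCHAR2')
insert into DataTypeMap values ('nvarchar', 'NVARCHAR2')
insert into DataTypeMap values ('datetime', 'DATE')
insert into DataTypeMap values ('int', 'NUMBER(10)')
insert into DataTypeMap values ('decimal', 'NUMBER(p,s)')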

But as I said, this is usually banned in production because the Prod Support Manager can’t risk the instability the approach causes.


19 June 2012

SSIS SCD Wizard Performance Issue

Filed under: Data Warehousing,SSIS — Vincent Rainardi @ 7:23 pm
Tags: ,

In Microsoft BI, when we need a type 2 slowly changing dimension, we instinctively reach for the SCD Wizard. The problem with the SCD Wizard is performance. For each source row coming into the SCD box, SSIS does a lookup on the target dimension table to find out whether the row already exists. That is a crazy idea. And SSIS also updates the target dimension table on a row-by-row basis. That is also a crazy idea.

If your dimension is 1,000 rows and the source data is 1,000 rows, the SCD takes a few minutes. But if your dimension is 500k rows and the source data is 500k rows, it can take 5-6 hours. The whole idea is fundamentally flawed. On the welcome screen of the SCD Wizard there should be a warning: do not use the SSIS SCD Wizard if your dimension is above 1,000 rows.

So, if your dimension is 1m rows, what should you do?

  1. Do it on set basis, not row by row basis.
  2. Filter the incoming source rows to the changed rows.
  3. Don’t everything type 2, but only necessary columns.
  4. Index the dimension table on the seek columns.
  5. Group the updates to the target dimension table.
  6. Archive unused rows in the dimension table.

To update an SCD type 2 dimension in SSIS, we should do this (a set-based SQL sketch follows the notes below):

  1. Obtain Current Time (CT).
  2. Stage the source table. 1m rows (S).
  3. Identify the changed rows on the staging table. 50k rows (C).
  4. Put the C rows on a new staging table.
  5. Identify which C rows already exist on the dimension table by comparing the business key.
    Say 40k rows exist (E) and 10k rows don’t exist (D).
  6. Out of the 40k E rows, identify which rows have changed attributes. Say there are 12k rows (CA).
  7. Update dimension table: expire the 12k CA rows (set Expiry Date = CT, IsActive = N).
  8. Insert 22k rows (10k D + 12k CA) into the dimension table in one go, not two (see note 6). Expiry Date = 9999, IsActive = Y.
    D rows: Effective Date = low date (0001-01-01). CA rows: Effective Date = CT.

Note:

  1. In step 2, enable fast load and do not put any constraints on the staging table.
  2. Step 3 is done by comparing hashes or checksums.
  3. Staging table C from step 4 can be used by other workflows, especially the fact load.
  4. To support steps 5 and 7, index the dimension table on the natural key column.
  5. To support step 6, index the dim table on the type 2 columns to be compared.
    Do the comparison in the order of the indexed columns.
  6. We can use the Merge command to do steps 7 & 8, but we would need to run it twice. So the old upsert (update + insert) works as efficiently as Merge, probably more efficiently, as we don't do a delete.
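Below is a minimal set-based sketch of steps 6 to 8. It assumes a dimension called DimProduct with business key ProductId and a single type 2 attribute ProductGroup, and it assumes the changed rows (table C from step 4) are in a staging table called StgChanged. All of these names are illustrative:

declare @CT datetime
set @CT = getdate() -- step 1: obtain current time

-- Step 6: classify the incoming rows as D (new business key) or CA (changed attributes)
select c.ProductId, c.ProductGroup,
 case when d.ProductId is null then 'D' else 'CA' end as RowType
into #ToLoad
from StgChanged c
left join DimProduct d on d.ProductId = c.ProductId and d.IsActive = 'Y'
where d.ProductId is null or c.ProductGroup <> d.ProductGroup

-- Step 7: expire the current version of the CA rows in one set-based update
update d
set ExpiryDate = @CT, IsActive = 'N'
from DimProduct d
join #ToLoad t on t.ProductId = d.ProductId
where t.RowType = 'CA' and d.IsActive = 'Y'

-- Step 8: insert the D and CA rows in one go
insert into DimProduct (ProductId, ProductGroup, EffectiveDate, ExpiryDate, IsActive)
select ProductId, ProductGroup,
 case RowType when 'D' then '00010101' else @CT end, -- D rows: low date, CA rows: CT
 '99991231', 'Y'
from #ToLoad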

This is one major weakness of SSIS compared to Informatica PowerCenter, so I have no doubt Microsoft must be building a set-based SSIS SCD Wizard for the next release 🙂

26 May 2012

How to Populate a Fact Table using SSIS (part 1)

Filed under: Data Warehousing,SSIS — Vincent Rainardi @ 2:24 pm
Tags: ,

It seems to me that some people are still struggling to populate their fact tables using SSIS. When doing it, they come across issues like these:

  1. Where do I populate my fact table from?
  2. How do I get the dimension keys to put into my fact table?
  3. Where can I get the data for the measures columns?
  4. With what do I populate the snapshot date column?
  5. What is the primary key of my fact table?
  6. The source table doesn’t have a primary key of the source table. What should I do?
  7. I got duplicate rows in the fact table. How do I handle it?
  8. The rows with that snapshot date already exist. What should I do?
  9. Lookup cache: should I choose full cache or partial cache?
  10. I can’t find the row in the dimension table, what should I do?

As always, the best way to explain is by example, which is more effective than answering the above questions one by one. So I'm going to do the following:

  1. Describe the background on the company and the data warehouse
  2. Create the source tables and populate them
  3. Create the dimension tables and populate them
  4. Create the fact table (empty)
  5. Build an SSIS package to populate the fact table, step by step

Background

It’s a van hire company called TopHire. It’s purely fictional of course. Their business system (HireBase) captures the rental information, including the customer information. HireBase has a fleet database where all vans are maintained. HireBase contains only 3 tables:

  1. Customer: contains 100 customers, e.g. name, date of birth, telephone number, etc. A column called CustomerId uniquely identifies a customer.
  2. Van: contains the 20 vans that TopHire operates. It's their fleet. Contains the registration number, engine size, van size, colour and year of manufacture. The unique identifier for each van is the registration number.
  3. Hire: contains 1000 hire transactions since 1st Jan 2011. Every hire transaction stores: who the customer was, which van was rented out, the date it occurred, number of days and various hire fees/charges (the van, sat nav, insurance, damage waiver and total bill). For each transaction we have Hire ID, which is a unique identifier in this table.

The data warehouse contains 4 tables:

  1. Date dimension: contains every single date from 2006 to 2016.
  2. Customer dimension: contains 100 customers. To be simple we’ll make it type 1 so we don’t create a new row for each change.
  3. Van dimension: contains 20 vans. To be simple we’ll make it type 1 so we don’t create a new row for each change.
  4. Hire fact table: contains 1000 hire transactions since 1st Jan 2011. It is a daily periodic snapshot fact table, so every day we insert all the rows from the Hire table into this fact table. Over time we can track the changes in total bill, van charges, satnav income, etc.

Create the source tables and populate them

So now we are going to create the 3 tables in HireBase database: Customer, Van, and Hire. Then we populate them.

First I’ll show you how it looks when it’s done:

Customer table:

Van table:

Hire table:

And here is the script to create and populate them:

-- Create database
create database HireBase
go
use HireBase
go

-- Create customer table
if exists (select * from sys.tables where name = 'Customer')
drop table Customer
go

create table Customer
( CustomerId varchar(20) not null primary key,
 CustomerName varchar(30), DateOfBirth date, Town varchar(50),
 TelephoneNo varchar(30), DrivingLicenceNo varchar(30), Occupation varchar(30)
)
go

-- Populate Customer
truncate table Customer
go

declare @i int, @si varchar(10), @startdate date
set @i = 1
while @i <= 100
begin
 set @si = right('0'+CONVERT(varchar(10), @i),2)
 insert into Customer
 ( CustomerId, CustomerName, DateOfBirth, Town, TelephoneNo, DrivingLicenceNo, Occupation)
 values
 ( 'N'+@si, 'Customer'+@si, DATEADD(d,@i-1,'2000-01-01'), 'Town'+@si, 'Phone'+@si, 'Licence'+@si, 'Occupation'+@si)
 set @i = @i + 1
end
go

select * from Customer

-- Create Van table
if exists (select * from sys.tables where name = 'Van')
drop table Van
go

create table Van
( RegNo varchar(10) not null primary key,
 Make varchar(30), Model varchar(30), [Year] varchar(4),
 Colour varchar(20), CC int, Class varchar(10)
)
go

-- Populate Van table
truncate table Van
go

declare @i int, @si varchar(10)
set @i = 1
while @i <= 20
begin
 set @si = convert(varchar, @i)
 insert into Van
 ( RegNo, Make, Model, [Year], Colour, CC, Class)
 values
 ( 'Reg'+@si, 'Make'+@si, 'Model'+@si,
 case @i%4 when 0 then 2008 when 1 then 2009 when 2 then 2010 when 3 then 2011 end,
 case when @i%5<3 then 'White' else 'Black' end,
 case @i%3 when 0 then 2000 when 1 then 2500 when 2 then 3000 end,
 case @i%3 when 0 then 'Small' when 1 then 'Medium' when 2 then 'Large' end)
 set @i = @i + 1
end
go

select * from Van

-- Create Hire table
if exists (select * from sys.tables where name = 'Hire')
drop table Hire
go

create table Hire
( HireId varchar(10) not null primary key,
 HireDate date not null,
 CustomerId varchar(20) not null,
 RegNo varchar(10), NoOfDays int, VanHire money, SatNavHire money,
 Insurance money, DamageWaiver money, TotalBill money
)
go

-- Populate Hire table
truncate table Hire
go

declare @i int, @si varchar(10), @DaysFrom1stJan int, @CustomerId int, @RegNo int, @mi int
set @i = 1
while @i <= 1000
begin
 set @si = right('000'+convert(varchar(10), @i),4) -- string of i
 set @DaysFrom1stJan = (@i-1)%200 --The Hire Date is derived from i modulo 200
 set @CustomerId = (@i-1)%100+1 --The CustomerId is derived from i modulo 100
 set @RegNo = (@i-1)%20+1 --The Van RegNo is derived from i modulo 20
 set @mi = (@i-1)%3+1 --i modulo 3
 insert into Hire (HireId, HireDate, CustomerId, RegNo, NoOfDays, VanHire, SatNavHire, Insurance, DamageWaiver, TotalBill)
 values ('H'+@si, DateAdd(d, @DaysFrom1stJan, '2011-01-01'),
 'N'+right('0'+CONVERT(varchar(10),@CustomerId),2), 'Reg'+CONVERT(varchar(10), @RegNo), -- match the CustomerId format in the Customer table; the original left('N0'+...,3) only ever produced N01 to N09
 @mi, @mi*100, @mi*10, @mi*20, @mi*40, @mi*170)
 set @i += 1
end
go

select * from Hire

Create the Data Warehouse

So now we are going to create the 3 dimension tables and 1 fact table in the data warehouse: DimDate, DimCustomer, DimVan and FactHire. We are going to populate the 3 dimensions but we’ll leave the fact table empty. The purpose of this article is to show how to populate the fact table using SSIS.

First I’ll show you how it looks when it’s done:

Date Dimension:

Customer Dimension:

Van Dimension:

Hire Fact Table:

And then we do it. This is the script to create and populate those dim and fact tables:

-- Create the data warehouse
create database TopHireDW
go
use TopHireDW
go

-- Create Date Dimension
if exists (select * from sys.tables where name = 'DimDate')
drop table DimDate
go

create table DimDate
( DateKey int not null primary key,
 [Year] varchar(7), [Month] varchar(7), [Date] date, DateString varchar(10))
go

-- Populate Date Dimension
truncate table DimDate
go

declare @Date date, @StartDate date, @EndDate date, @DateKey int,
 @DateString varchar(10), @Year varchar(4), @Month varchar(7)
set @StartDate = '2006-01-01'
set @EndDate = '2016-12-31'
set @Date = @StartDate

insert into DimDate (DateKey, [Year], [Month], [Date], DateString)
 values (0, 'Unknown', 'Unknown', '0001-01-01', 'Unknown') --The unknown row

while @Date <= @EndDate
begin
 set @DateString = convert(varchar(10), @Date, 20)
 set @DateKey = convert(int, replace(@DateString,'-',''))
 set @Year = left(@DateString,4)
 set @Month = left(@DateString, 7)
 insert into DimDate (DateKey, [Year], [Month], [Date], DateString)
 values (@DateKey, @Year, @Month, @Date, @DateString)
 set @Date = dateadd(d, 1, @Date)
end
go

select * from DimDate

-- Create Customer dimension
if exists (select * from sys.tables where name = 'DimCustomer')
drop table DimCustomer
go

create table DimCustomer
( CustomerKey int not null identity(1,1) primary key,
 CustomerId varchar(20) not null,
 CustomerName varchar(30), DateOfBirth date, Town varchar(50),
 TelephoneNo varchar(30), DrivingLicenceNo varchar(30), Occupation varchar(30)
)
go

insert into DimCustomer (CustomerId, CustomerName, DateOfBirth, Town, TelephoneNo,
 DrivingLicenceNo, Occupation)
select * from HireBase.dbo.Customer

select * from DimCustomer

-- Create Van dimension
if exists (select * from sys.tables where name = 'DimVan')
drop table DimVan
go

create table DimVan
( VanKey int not null identity(1,1) primary key,
 RegNo varchar(10) not null,
 Make varchar(30), Model varchar(30), [Year] varchar(4),
 Colour varchar(20), CC int, Class varchar(10)
)
go

insert into DimVan (RegNo, Make, Model, [Year], Colour, CC, Class)
select * from HireBase.dbo.Van
go

select * from DimVan

-- Create Hire fact table
if exists (select * from sys.tables where name = 'FactHire')
drop table FactHire
go

create table FactHire
( SnapshotDateKey int not null, --Daily periodic snapshot fact table
 HireDateKey int not null, CustomerKey int not null, VanKey int not null, --Dimension Keys
 HireId varchar(10) not null, --Degenerate Dimension
 NoOfDays int, VanHire money, SatNavHire money,
 Insurance money, DamageWaiver money, TotalBill money
)
go

select * from FactHire

Now you can see that the 3 dimensions have been populated. And the fact table is created and empty, ready for us to populate it.

Build the SSIS Package to Populate the Fact Table

First I show you how it looks when it’s done. This is the overall workflow we are going to build:

Here is a brief description of each of the 6 boxes above:

  1. Read Hire table in HireBase: First we get the original OLTP records from the hire table in HireBase.
  2. Get Snapshot Date Key: we get today's date and convert it into an integer to use as the Snapshot Date Key. FactHire is a periodic snapshot fact table, which means that every day we populate it with the whole content of the Hire table. Next week, on Monday 28th May, we insert 1000 rows into the fact table: all the rows in the Hire table. Then on Tuesday 29th May we insert 1010 rows, because by then there will be 10 new hire transactions in the Hire table. The 1000 “old” rows might change as well on Tuesday; not all of them will be the same as on Monday. Perhaps 10 rows were modified, the amounts/fees changed, etc. Whatever the condition of the Hire table on Tuesday, we take all of its rows and put them into our data warehouse, and we do this every day. That is what a periodic snapshot fact table means. Hence in a periodic snapshot fact table we have a “Snapshot Date”: the date when we captured the source table, i.e. the date when these rows were inserted into the fact table. The value of this snapshot date is today's date, the date the ETL runs.
  3. Get Customer Key: from Hire table we get the Customer ID which was involved in the hire transaction. We then go to the Customer Dimension to get the Customer Key for this Customer ID.
  4. Get Van Key: from Hire table we get the Registration Number which was rented out in the transaction. We then go to the Van Dimension to get the Van Key for this Registration Number.
  5. Get Hire Date Key: from Hire table we get the Hire Date, which is the date when the hire transaction happened. We then go to the Date Dimension to get the Date Key for this Hire Date.
  6. Populate FactHire: After we get all the required dimension keys, we insert the rows into the FactHire fact table.

Now that we know what to build, let’s build it. The box numbers below refer to the 6 boxes above.

Box 1. Read Hire table in HireBase

We create the data source in the workflow. It’s OLE DB Data Flow Source, from HireBase. We take all columns, so we don’t specify a query:

Box 2. Get Snapshot Date Key

Create a Derived Column and set the Expression to:

(DT_I4)((DT_WSTR,4)YEAR(GETDATE()) + RIGHT("0" + (DT_WSTR,2)MONTH(GETDATE()),2) + RIGHT("0" + (DT_WSTR,2)DAY(GETDATE()),2))
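If you want to sanity-check that expression, the same integer key can be produced in T-SQL (style 112 formats a date as yyyymmdd):

select convert(int, convert(char(8), getdate(), 112)) -- e.g. 20120526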

See the complete article here.

Box 3. Get Customer Key

Create a Lookup transformation and set to Full Cache, OLE DB connection manager, Fail component, like this:

Full cache: SSIS reads all rows from DimCustomer into memory, then reads the Hire table; SSIS doesn't need to touch DimCustomer any more. All lookup operations are done in memory.
Partial cache: SSIS builds the memory cache while reading the rows from the Hire table. For each Hire row, it tries to find the Customer Key in the memory cache; if it can't find it, it reads DimCustomer on disk.
No cache: SSIS doesn't build a memory cache. SSIS joins DimCustomer to the Hire table (as an RDBMS operation) to find the Customer Key.

Cache Connection Manager (CCM): SSIS reads DimCustomer and puts it in a memory cache. This cache can be reused and shared between several Lookups, and across several data flows. A CCM reduces the database workload.

OLE DB connection manager: a table in the warehouse (in this case it’s DimCustomer), see Connection pane below:

In the Columns pane, connect CustomerId from the Hire table to CustomerId on the DimCustomer table, and bring CustomerKey through as an output column:
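If you want to verify the lookup results outside SSIS, the lookup is logically equivalent to this join (assuming HireBase and TopHireDW are on the same instance; adjust the names if not). The Lookup set to “Fail component” behaves like an inner join that errors on no-match rows; the left join here just makes any missing keys visible as nulls:

select h.HireId, h.CustomerId, c.CustomerKey
from HireBase.dbo.Hire h
left join TopHireDW.dbo.DimCustomer c on c.CustomerId = h.CustomerId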

Box 4: Get Van Key

It’s the same as Get Customer Key above, i.e. using Lookup transform. When connecting [Get Customer Key] and [Get Van Key] choose Lookup Match Output. I’ll come back for Lookup No Match Output at the end of the article. I just want to get this up and running first. We should always get a simple one first up and running, then build up from there.

Box 5. Get Hire Date Key

It’s the same as Get Customer Key above.

Box 6. Populate FactHire

It’s OLE DB destination. On the Connection Manager pane, specify FactHire as the target table, and leave everything as is:

On the Mappings pane, each source column has the same name as its destination column, except HireDateKey:

That’s it. Run it and all should be green, inserting 1000 rows into the target fact table:

And the FactHire fact table now looks like this:

The 4 dimension key columns are populated with the surrogate keys from their respective dimensions.

Wrapping Up

The above workflow works, but there are 2 things we still need to do:

  1. Handle the no-match rows
    If the Hire table contains a CustomerId which does not exist in the customer dimension, we need to set the CustomerKey to 0 (the unknown row).
  2. Delete today's fact rows, if they exist
    If today's rows already exist in the target fact table, delete them first so the load is rerunnable. A T-SQL sketch of both fixes follows this list.

As this is long enough, I'm going to cover the SSIS implementation of these two fixes in a separate article (link).

Hope this helps,

Vincent Rainardi

26/10/12: Data type error when mapping HireDate column to Date column

On Box 5 Get Hire Date Key, on the Lookup Transformation editor, on the Columns pane (see below), when we try to map HireDate column from Available Input Columns to the Date column on Available Lookup Columns, we get an error message: Cannot map the input column, ‘HireDate’, to the lookup column, ‘Date’, because the data types do not match:

Thank you to Masu for pointing this out.

This is because in the Hire table (the input) the HireDate column is of the date data type, whereas in the DimDate table the Date column is varchar(10).

To solve this issue, in the Get Hire Date Key task, in the Connection pane, instead of “Use a table or a view”, select “Use results of an SQL query” and type:

select DateKey, [Year], [Month], convert(date, [Date]) from DimDate where [Date] <> 'Unknown'

Because both are now of the date data type, in the Columns pane you won't have any issue connecting HireDate to Date:

20 March 2012

SSIS: Loading a Big File Fast

Filed under: SSIS — Vincent Rainardi @ 7:21 am
Tags:

Loading a big file into a table in SQL Server

The differences between a slow load and a very fast load are:

  1. Flat file source: ValidateExternalMetadata = false
  2. OLE DB Destination: AccessMode: OpenRowSet using FastLoad (default is OpenRowSet)

Flat file: 230,000 rows, 15 columns, pipe separated, qualifier: double quote
SSIS: data conversion to Unicode on 10 columns

Big difference in performance:

  • Before: 229 seconds (3.8 minutes)
  • After: 3 seconds

So next time you load a file into a table using SSIS, please remember to use these settings.

26 June 2011

Import/Export Files with Variable File Name Set at Run Time

Filed under: SSIS — Vincent Rainardi @ 6:37 am
Tags:

In ETL we often need to import files, but these files usually don't have a fixed file name. The file names change every day, because they contain a date. For example, yesterday the file was called trans_20110627, today it is called trans_20110628, and tomorrow it will be called trans_20110629.

How do we import them? We set the file name at run time.

How do we set the file name at run time? Using a Script Task: at run time we set the value of a variable, and this variable is used as the file name of the file.

This might be one of the “oldest tricks in the book”, but still, I found that many people don’t know it. So it’s probably worth writing a post about it.

So here’s how to do it, step by step:

First, create a variable called TransFileName, of type string, scope = package:

Then create a script task on the control flow, call it Set File Name:

Edit the script task, put the TransFileName variable on the ReadWriteVariable:

Then click on Edit Script:

For simplicity, in the above script I just put a fixed file name. In a real project the logic would be, for example: find today's file, find the latest file, etc., which is a subject for another blog post. I'll write that “find the latest file using a C# script” post in the next few days and put the link here.

Save the ScriptMain.cs window and close it. Click OK to come out of the Script Task Editor.

Create a Data Flow task:

Double click the Data Flow Task to edit it. Create Flat File Source and OLE DB Destination.

Double click on the Trans File to edit it. Click New on the Flat File Source Editor. Click Browse and select the file we want to import. For this exercise I’ve created a small text file like this:
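Something along these lines will do; the exact values don't matter for this exercise (these are made up):

Col1
1
2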

It only contains 1 column and 2 rows. I also created a small table to illustrate the process:

create table VarFileName.Trans ( Col1 int )

So, click on Browse and select the file we want to import:

Click “Column names in the first data row” if your file has column names in the first row. Click on the Columns tab on the left pane:

Specify the column delimiter e.g. pipe (|). Click on the Advanced tab on the left pane:

Set the data type for each column. Click OK to close the Flat File Connection Manager Editor and go back to the Flat File Source Editor:

Click on the Columns on the left pane:

Click OK to close the Flat File Source Editor.

Edit the OLE DB, point it to the target table, e.g. VarFileName.Trans:

Click on Mappings on the left pane and check that all columns on the file have been mapped properly to the table. In the example there is only 1 column:

Click OK to close the OLE DB Destination Editor and go back to the Data Flow window.

Now we are going to use the TransFileName variable as the file name of the file. Right click on TransFile in the Connection Managers pane and select Properties.

Click on the ellipsis next to Expressions to open the Property Expressions Editor. Set the Property column to ConnectionString. In the Expression column, click on the ellipsis to open the Expression Builder. Choose the TransFileName variable and drag it to the Expression box:

 

Click OK to close the Expression Builder. Click OK again to close the Property Expressions Editor.

That’s it, we are done. At run time, the script task will set the value of the TransFileName variable. This variable is used as the file name of the file we want to import. Here’s the Control Flow (left) and Data Flow (right) at when we run it:

This can also be used when we need to export to a file whose name we only know at run time. It works the same way.

Hope this helps,

Vincent Rainardi, 26/6/2011

21 June 2011

Updating a Variable based on a File (Script Task)

Filed under: SSIS — Vincent Rainardi @ 10:24 pm
Tags:

In the last article (link) I showed how to update a variable based on a file using a Script Component. That's useful when we are in a Data Flow, but sometimes the situation arises in the Control Flow. In that case we need to use a Script Task.

Using a Script Task is a lot simpler than using a Script Component. We essentially just need a bit of C# code to read the file and set the variable.

This is the file that we want to read:
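It's a pipe-delimited parameter file (C:\Util\File.txt in the code below). Based on the values used in this and the previous article, its content is along these lines (the header row is my assumption):

ParamName|ParamValue
Analysis Type|Regulatory
Number Of Customers|322514
Date Exported|06/06/2011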

Pull the Script Task from the Toolbar to the Control Flow tab:

Define a variable to hold the Number of Customers:

Double click the Script Task to edit it. Set the read write variable to the variable we just created (NoOfCustomers):

Click Edit Script and edit it as follows:

using System;
using System.Data;
using Microsoft.SqlServer.Dts.Runtime;
using System.Windows.Forms;
using System.IO;

namespace ST_UpdateVariableFromFile // SSIS auto-generates a namespace name here; yours will differ
{
  public partial class ScriptMain : Microsoft.SqlServer.Dts.Tasks.ScriptTask.VSTARTScriptObjectModelBase
  {
    // SSIS normally generates this enum in a hidden region of the script:
    enum ScriptResults
    {
      Success = Microsoft.SqlServer.Dts.Runtime.DTSExecResult.Success,
      Failure = Microsoft.SqlServer.Dts.Runtime.DTSExecResult.Failure
    };
    public void Main()
    {
      const string FILE_NAME = "C:\\Util\\File.txt";
      int ReturnValue;
      String line;
      char[] delimiters = "|".ToCharArray();
      string[] columns;
      try
      {
        if (!File.Exists(FILE_NAME))
        {
          MessageBox.Show("File " + FILE_NAME + " does not exist.");
          ReturnValue = (int)ScriptResults.Failure;
          return;
        }
        using (StreamReader sr = File.OpenText(FILE_NAME))
        {
          while ((line = sr.ReadLine()) != null)
          {
            //MessageBox.Show(line);
            columns = line.Split(delimiters);
            //MessageBox.Show(columns[0].ToString());
            //MessageBox.Show(columns[1].ToString());

            if (columns[0].ToString() == "Number Of Customers")
            {
              //MessageBox.Show(columns[0].ToString());
              //MessageBox.Show(columns[1].ToString());
              Dts.Variables["NoOfCustomers"].Value = Convert.ToInt32(columns[1].ToString());
            }
          }
          ReturnValue = (int)ScriptResults.Success;
        }
      }

      catch (Exception e)
      {
        MessageBox.Show(e.Message);
        ReturnValue = (int)ScriptResults.Failure;
      }

      Dts.TaskResult = ReturnValue;
    }
  }
}

// ref: http://msdn.microsoft.com/en-us/library/db5x7c0d.aspx

7 June 2011

SSIS: Update a Variable based on a File (Script Component)

Filed under: SSIS — Vincent Rainardi @ 7:15 pm
Tags:

Yesterday I wrote about setting a variable based on a value in a database table (link). Sometimes the source value is not stored in a database table but in a file. We need to read that file to get the value and then update a package variable. Of course you could import the file into a table first, then query that table to update the variable via a Result Set mapping as before. But sometimes you don't want to do that; you want to update the variable straight from the file.

There are 2 ways to do this:

  1. Script component as a source
  2. Script Task

We use the Script Component (as a source) when we want to set the variable as part of a Data Flow. We use the Script Task when we want to set the variable as part of a Control Flow. What's the difference? A Data Flow is a task where data flows from a source to a destination. A Control Flow is a sequence of tasks, one after the other, not necessarily about data movement.

Of those 2, in this article I will explain the first one (updating a variable from a file using a Script Component as a source). I will explain the second one (the Script Task) in the next article.

OK Let’s start. Let’s say that the file we want to import is like below. Say what we want to get is the Number of Customers.

Let’s start. First we create a Data Flow by pulling the Data Flow task from the Toolbox on the left onto the Control Flow canvas:

Double click it to edit. Pull a Script Component from the Toolbox onto the Data Flow. SSIS will ask whether we will use it as a Source, a Destination or a Transformation. Choose Source and we get this:

Now create a variable to store the Number of Customers. First, click on the canvas of the Control Flow so that the variable scope is Package. Then create the variable, setting the type to Int32.

We go back to the data flow task, double click the script component:

On the ReadWriteVariables click on the ellipsis and select User::NoOfCustomers:

On the Input and Output tab, give the output a name (e.g. change it from “Output 0” to “ParamOutput”). Then create 2 output columns: ParamName and ParamValue. Set the Data Type to String for both of them:

On the Connection Managers, define a new connection:

Choose Flat File:

Give it a name, choose the file, and select “Column names in the first data row”:

Click on Columns on the left, check the preview (see below). Here we usually change the delimiter, but in this case it’s not required.

Click OK to close the window and back to the Script Transformation Editor Connection Managers tab. Choose the Connection Manager you just created (ParameterFile) and give it a name (e.g. ParamConn):

Now we are ready to edit the script (to be ready we needed a connection, an output and a variable). Click on Script on the left and click on Edit Script:

Edit it like this:

using System;
using System.Data;
using Microsoft.SqlServer.Dts.Pipeline.Wrapper;
using Microsoft.SqlServer.Dts.Runtime.Wrapper;
using System.IO; // This namespace is required for File functionalities
using System.Windows.Forms; // This namespace is required for the Message Boxes

[Microsoft.SqlServer.Dts.Pipeline.SSISScriptComponentEntryPointAttribute]
public class ScriptMain : UserComponent
{
    // Declare the variables
    private StreamReader SR;
    private string File1;
    private int NoOfCustomers;

    public override void AcquireConnections(object Transaction)
    {
        // Get the ParamConn connection, store it as 'File1'
        IDTSConnectionManager100 CM = this.Connections.ParamConn;
        File1 = (string)CM.AcquireConnection(null);
    }

    public override void PreExecute()
    {
        // Create a reader for File1
        base.PreExecute();
        SR = new StreamReader(File1);
    }

    public override void PostExecute()
    {
        // Close the reader
        base.PostExecute();
        SR.Close();

        // Set the Package Variable
        Variables.NoOfCustomers = NoOfCustomers;
    }

    public override void CreateNewOutputRows()
    {
        // Declare variables
        string nextLine;
        string[] columns;
        char[] delimiters;

        // Set the delimiter
        delimiters = "|".ToCharArray();

        // Read the first line (header)
        nextLine = SR.ReadLine();

        // Split the line into columns
        columns = nextLine.Split(delimiters);

        // Read the second line and loop until the end of the file
        nextLine = SR.ReadLine();
        while (nextLine != null)
        {
            // Split the line into columns
            columns = nextLine.Split(delimiters);

            // Add a row on the output
            ParamOutputBuffer.AddRow();

            // Use MessageBox for Debugging
            // (you can debug a Script Task but not Script Component)
            MessageBox.Show(columns[0].ToString());
            MessageBox.Show(columns[1].ToString());

            // Set the values of the Script Component output according to the file content
            ParamOutputBuffer.ParamName = columns[0];
            ParamOutputBuffer.ParamValue = columns[1];

            // When we find the row, set the variable
            if (columns[0].ToString() == "Number Of Customers")
            {
                // Store the value of columns[1] to an internal variable
                NoOfCustomers = Convert.ToInt32(columns[1]);
                MessageBox.Show(NoOfCustomers.ToString());
                break; // Get out of the loop when we found the row
            }

            // Read the next line
            nextLine = SR.ReadLine();
        }
    }
}

Run it. You get 6 message boxes containing: Analysis Type, Regulatory, Number Of Customers, 322514, Date Exported, 06/06/2011. Unfortunately we can't debug a Script Component (we can debug a Script Task, but not a Script Component, see here), hence we use message boxes.

The code reads the file line by line, splitting each line on the delimiter. The split result is put in an array called columns, which is mapped to the output buffer of the script component. When we find the “Number Of Customers” row, we set the internal variable and then break out of the loop. In PostExecute, we set the package variable.

After the code is working properly don’t forget to comment out the message boxes.

Hope it’s helpful.

Vincent Rainardi, 7/6/2011

6 June 2011

SSIS: Updating a Variable based on Database

Filed under: SSIS — Vincent Rainardi @ 9:17 pm
Tags:

Sometimes when writing an SSIS package, we need to update the value of a package variable based on a database table. For example, we may have a variable called ImportFolder. This package variable contains the path to the folder which contains the files that we need to import. The value of ImportFolder variable is stored in the database. We have to query that database (select ImportFolder from table1 where col1 = …) and then set the value of the ImportFolder package variable.

In some cases the value that we need to update the package with is not stored in the database, but it is stored in a file. We need to open this file, read it and set the value of the package variable. In this article I will give an example on how to do the database one, and on the next article I’ll do the file one.

Execute SQL Task

We use an Execute SQL Task and then map the Result Set to a variable. The database doesn't have to be SQL Server, of course; it can be anything. Here's how:

Pull Execute SQL task from the Toolbox to the Control Flow canvas:

Double click it to edit, type the query in the SQL Statement, and set Result Set to Single Row.

Here’s an example of a SQL query to get the Import Folder from a database table:

After you set the SQL Statement and Result Set, click on the Result Set pane on the left hand side and set the Result Name and Variable Name. Result Name is the column name of the query output, and Variable Name is the name of the package variable (yes, you need to create it first).

So click OK to come out of the Execute SQL Task editor and create the variable as below.

Create the variable by clicking the first icon. You need to set the Data Type as String. To set the scope to Package, click on any point on the Control Flow canvas before creating the variable.

If you wonder “Where is this Variables window?”, scroll down to the bottom of the screen and click on the Variables tab on the left, next to the Toolbox tab:

Now back to Result Set tab again (double click Execute SQL Task, click on Result Set tab on the left) and set the Result Name and Variable Name:

Run it. Before running it you may want to set a breakpoint so you can see the value of the variable: right click on the Execute SQL Task and choose Edit Breakpoints, then select the OnPostExecute event (this fires after the task has run).

Then run the package. It will stop after it runs the Execute SQL Task. To see the value of the ImportFolder variable, type “ImportFolder” in the Watch window:

Or expand the Locals window and find User::ImportFolder:

If you wonder “Where are these Watch and Locals windows?”, they are at the bottom of the screen. If you can't see them, try the Window menu, Reset Window Layout.

Vincent Rainardi, 6/6/2011

PS. On the next article I’ll write about Updating a Variable from a File.

SSIS 2008 Data Profiler

Filed under: SSIS — Vincent Rainardi @ 5:30 am
Tags:

SQL Server 2008 has a new feature in SSIS: Data Profiler. For a given table, Data Profiler enables us to understand the number of distinct values in each column, and the distribution of data values. This is very useful in data warehousing, both in ETL and when building cubes in SSAS.

When building an ETL package, sometimes we need to know the data values in a column. For example, to understand whether a column is mostly empty (blank or null) or contains good data. Or to help us decide whether to exclude or include certain values or certain patterns, probably related to a data quality requirement. Or to help us identify potential hierarchies in a dimension table.

Data Profiler is also useful when doing performance tuning, for example when partitioning a fact table or when partitioning a cube. Understanding the data distribution can help us create similarly sized groups based on a particular column.

On the current SQL Server 2005 data warehousing project, I use Informatica Data Explorer (IDE) to understand the distribution of data values in the data warehouse fact and dimension tables, for the purpose of building SSAS cubes. I often asked myself: When will we be able to right click on any table in Management Studio, select ‘Analyze’ and get the data profile for that table?

Now in SSIS 2008 we have data profiler. Well, it’s not as friendly as doing a right click on any table and select ‘Analyze’. But it’s a start. I hope in the later releases of SQL Server we can right click any table in Management Studio and choose ‘Analyze’ to get the data profile.

So let’s find out what SSIS 2008 Data Profiler is all about. First, let’s have a look at its output:

Figure 1. Output of SSIS 2008 Data Profiler

The left pane shows the tables that we analyzed, along with the profiles that we executed for each table. The right hand pane shows the details of the profile that we select on the left hand pane. For example, for the ‘Column Value Distribution’ profile, the top right panel shows a list of columns and the number of distinct values in each column. If we click on any column in this panel, the bottom right panel displays the distribution of the data values, i.e. each value, the number of rows containing that value, and the percentage of total rows. Figure 1 shows the distribution of the data values in the product_key column of the fact_sales table.

Other than analyzing the distribution of data value, SSIS 2008 Data Profiler can also analyze:

  1. The number of nulls in a column.
  2. The pattern of the data values, for example if a column contains values of 2008-01, 2008-02, …, 2008-12, then the pattern is “200d-dd” (d means digit, i.e. a number from 0 to 9).
  3. Minimum and maximum values of the data in each column.
  4. Mean (average) and standard deviation of the data values in each column. Standard deviation indicates how far the value is from the mean.
  5. Determine which column(s) is a good candidate for a primary key. This is useful for checking the 1st normal form rule, which states that each table must have a primary key that uniquely identifies each row.
  6. Checking whether a column is fully or partially dependent on other column(s). This is useful for checking the 2nd normal form and for identifying potential hierarchies within a dimension. In 2nd normal form, non primary key columns must be dependent on the primary key.
  7. Checking whether all values in a column exist in another table. In data warehousing, this is useful to identify if there are dimensional keys on the fact table which do not exist on the dimension table. For ODS design (or other normalized databases), this is useful to identify the 3rd normal form, which states that no column is transitively dependent on the primary key. In other words, it is useful for identifying potential foreign key relationships.

Now let’s understand the ‘how’.

Open BIDS, create an SSIS project, and drag the Data Profiling Task onto the design surface. Double click it to edit it. Type the Destination File name.

Click on Profile Requests on the left pane. Select Column Value Distribution on the Profile Type column. Supply the connection details, select the Table and Column(s) at the bottom panel as shown on Figure 2 below and click OK.

Figure 2. Setting up a Data Profiling Task in SSIS

Run the SSIS package. Verify that the output file is created.

To view the output file, we need to use Data Profile Viewer utility. Go to C:\Program Files\Microsoft SQL Server\100\DTS\Binn (adjust to your SQL Server folder) and execute DataProfileViewer.exe. Click on Open, and open the output file that was produced by the SSIS Data Profiling Task earlier. It will look like figure 1 above.

In addition to Column Value Distribution Profiles that we chose above, we can also choose other profiles, such as Column Pattern Profiles, Column Statistics Profiles, Functional Dependency Profiles, Candidate Key Profiles, Column Null Ratio Profiles and Inclusion Profiles. These profiles provide the 7 analysis items I listed earlier.

The Data Profiling facility in SQL Server 2008 is quite basic, as is its user friendliness. The viewer is not integrated into BI Development Studio, even in Release Candidate 0. But it is a good start. Whether at an additional cost or not, it is a necessity these days that an ETL tool has a data profiling facility. I'm sure the SSIS data profiling feature will improve a lot in future. I hope that in future releases of SQL Server the Data Profile Viewer utility is integrated into BI Development Studio and that we can right click any table in Management Studio and choose ‘Analyze’ to get the data profile.

Vincent Rainardi, 7th July 2008
Author of Building A Data Warehouse: With Examples in SQL Server

Repost from SQLServerCentral

5 June 2011

SSIS: Export a Query Result to a File

Filed under: SSIS — Vincent Rainardi @ 11:01 pm
Tags:

This seems a simple task, but someone who has just begun using SSIS will surely come across it, and at that point will need an example to copy from. Importing a file into a table is a more common task in data warehousing than exporting to a file, so it is quite possible that an SSIS developer can do all of the following:

  • Import a file into table
  • Run SQL to copy table1 to table 2
  • Export a table into a file

but has never, to this day, exported the result of a query to a file.

There are 2 ways to do it:

  1. Using Data Flow task
  2. Using Export Import Wizard

I recommend number 1 over number 2 as we have more control, but I'll go through both. No, we can't use an Execute SQL Task for this, nor an OLE DB Command transformation.

Using Data Flow Task

Pull the Data Flow task into the Control Flow tab and double click it to edit:

On the Data Flow tab, pull the OLE DB Source and a Flat File Destination then connect them:

Double click the OLE DB Source to edit it:

Set the OLE DB connection manager to an existing one or click New to create a new one.

This is the most important part: set the Data access mode to SQL Command. Type the SQL query in SSMS, test it, then copy and paste it here. You can also use “Build Query”:
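For example, a simple query like this (reusing the Customer table from the fact table article, purely for illustration):

select CustomerId, CustomerName, Town
from Customer
where Town in ('Town01', 'Town02')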

Click Preview to see the query result. Click Columns on the left hand side to check the output columns (this is optional) and then click OK.

Double click the Flat File Destination to edit it.

When asked, select the file format as Delimited. It is the most common. You can experiment with the other types later.

Click on New to define a new Connection Manager and set the file name by clicking Browse. Select “Column names in the first data row” as needed and change the connection manager name to one of your choice.

On the left hand side click on Columns and set the Column delimiter to Vertical Bar {|}, see below. Again, this is the most common one; you can experiment with other delimiters later on. Then click OK.

Run it (Start Debugging) and check the output file. I would suggest that you put only 3 rows of data in your table so you can check whether you get 3 rows in the output file. If you had 10,000, how would you check it? 🙂

Using Export Import Wizard

On the Solution Explorer, right click on the SSIS Packages and select SSIS Import and Export Wizard. Navigate through the pages.

1. Source: for SQL Server 2008 choose SQL Server Native Client 10.0, and select the DB as below; otherwise choose as appropriate.

2. Destination: choose Flat File and specify the File name. Choose “Column names in the first data row” if needed, as below:

3. Specify Table Copy or Query: choose Write a query to specify the data to transfer, as below:

4. Provide Source Query: type the query in SSMS and copy paste the query here, as below:

5. Set the column delimiter to Vertical Bar. This is the most common one; you can experiment with other delimiters later on.

6. Click Next, Finish, Close.

On the Solution Explorer you should have a new package, e.g. Package1.dtsx. It contains 1 Data Flow. The Data Flow contains an OLE DB Source and a Flat File Destination. It is just like the package we created a little earlier.

Hope this helps.

Picture for “skipping to the next task”: (see dsentelle’s comment below):
Skip to next task

 

