Data Warehousing and Business Intelligence

7 June 2011

SSIS: Update a Variable based on a File (Script Component)

Filed under: SSIS — Vincent Rainardi @ 7:15 pm

Yesterday I wrote about setting a variable based on the value in a database table (link). Sometimes the source value is not stored in a database table but in a file. We need to read that file to get the value and then update a package variable. Of course you could import that file into a table first, then query the table to update the variable via Result Set mapping, as in yesterday's post. But sometimes you don't want to do that; you want to update the variable straight from the file.
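
If you did go down the import-into-a-table route first, the query behind the Result Set mapping would be something along these lines (just a sketch; the staging table and column names below are made up):

select ParamValue
from dbo.StagingParameterFile
where ParamName = 'Number Of Customers'

The rest of this article skips that staging step and reads the file directly.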

There are 2 ways to do this:

  1. Script component as a source
  2. Script Task

We use the Script Component (as a source) when we want to set the variable as part of a Data Flow. We use the Script Task when we want to set the variable as part of the Control Flow. What's the difference? A Data Flow is a task in which data flows from a source to a destination. The Control Flow is a sequence of tasks, one after the other, not necessarily about data movement.

Out of those 2 above, in this article I will explain the first one (Update Variable from a File – Script Component as a Source). I will explain the second one (Script Task) in the next article.

OK Let’s start. Let’s say that the file we want to import is like below. Say what we want to get is the Number of Customers.

Let’s start. First we create a Data Flow by pulling the Data Flow task from the Toolbox on the left onto the Control Flow canvas:

Double click it to edit. Pull a Script Component from the Toolbox onto the Data Flow. SSIS will ask whether we will use it as a Source, a Destination or a Transformation. Choose Source and we get this:

Now create a variable to store the Number of Customers. First, click on the canvas of the Control Flow so that the variable scope is Package. Then create the variable, setting the type to Int32.

We go back to the data flow task, double click the script component:

In ReadWriteVariables, click on the ellipsis and select User::NoOfCustomers:

On the Input and Output tab, give the output a name (e.g. change it from “Output 0” to “ParamOutput”). Then create 2 output columns: ParamName and ParamValue. Set the Data Type to String for both of them:

On the Connection Managers, define a new connection:

Choose Flat File:

Give it a name, choose the file, and select “Column names in the first data row”:

Click on Columns on the left, check the preview (see below). Here we usually change the delimiter, but in this case it’s not required.

Click OK to close the window and go back to the Connection Managers tab of the Script Transformation Editor. Choose the Connection Manager you just created (ParameterFile) and give it a name (e.g. ParamConn):

Now we are ready to edit the script (to be ready we need to have: connection, output and variable). Click on Script on the left and click on Edit Script:

Edit it like this:

using System;
using System.Data;
using Microsoft.SqlServer.Dts.Pipeline.Wrapper;
using Microsoft.SqlServer.Dts.Runtime.Wrapper;
using System.IO; // This namespace is required for File functionalities
using System.Windows.Forms; // This namespace is required for the Message Boxes

[Microsoft.SqlServer.Dts.Pipeline.SSISScriptComponentEntryPointAttribute]
public class ScriptMain : UserComponent
{
    // Declare the variables
    private StreamReader SR;
    private string File1;
    private int NoOfCustomers;

    public override void AcquireConnections(object Transaction)
    {
        // Get the ParamConn connection, store it as 'File1'
        IDTSConnectionManager100 CM = this.Connections.ParamConn;
        File1 = (string)CM.AcquireConnection(null);
    }

    public override void PreExecute()
    {
        // Create a reader for File1
        base.PreExecute();
        SR = new StreamReader(File1);
    }

    public override void PostExecute()
    {
        // Close the reader
        base.PostExecute();
        SR.Close();

        // Set the Package Variable
        Variables.NoOfCustomers = NoOfCustomers;
    }

    public override void CreateNewOutputRows()
    {
        // Declare variables
        string nextLine;
        string[] columns;
        char[] delimiters;

        // Set the delimiter
        delimiters = "|".ToCharArray();

        // Read the first line (header)
        nextLine = SR.ReadLine();

        // Split the line into columns
        columns = nextLine.Split(delimiters);

        // Read the second line and loop until the end of the file
        nextLine = SR.ReadLine();
        while (nextLine != null)
        {
            // Split the line into columns
            columns = nextLine.Split(delimiters);

            // Add a row on the output
            ParamOutputBuffer.AddRow();

            // Use MessageBox for Debugging
            // (you can debug a Script Task but not Script Component)
            MessageBox.Show(columns[0].ToString());
            MessageBox.Show(columns[1].ToString());

            // Set the values of the Script Component output according to the file content
            ParamOutputBuffer.ParamName = columns[0];
            ParamOutputBuffer.ParamValue = columns[1];

            // When we find the row, set the variable
            if (columns[0].ToString() == "Number Of Customers")
            {
                // Store the value of columns[1] to an internal variable
                NoOfCustomers = Convert.ToInt32(columns[1]);
                MessageBox.Show(NoOfCustomers.ToString());
                break; // Get out of the loop when we found the row
            }

            // Read the next line
            nextLine = SR.ReadLine();
        }
    }
}

Run it. You get a series of message boxes showing the parameter names and values from the file (Analysis Type, Regulatory, Number Of Customers, 322514, and so on), plus one showing the value assigned to the variable. Unfortunately we can't debug a Script Component (we can debug a Script Task, but not a Script Component, see here), hence the message boxes.

The code reads the file line by line, splitting each line on the delimiter. The split result is put in an array called columns, which is mapped to the output buffer of the script component. When we find the "Number Of Customers" row, we set the internal variable and then break out of the loop. In PostExecute, we set the package variable.

Once the code is working properly, don't forget to comment out the message boxes.

Hope it’s helpful.

Vincent Rainardi, 7/6/2011

6 June 2011

SSIS: Updating a Variable based on Database

Filed under: SSIS — Vincent Rainardi @ 9:17 pm

Sometimes when writing an SSIS package, we need to update the value of a package variable based on a database table. For example, we may have a variable called ImportFolder. This package variable contains the path to the folder which contains the files that we need to import. The value of the ImportFolder variable is stored in the database. We have to query that database (select ImportFolder from table1 where col1 = …) and then set the value of the ImportFolder package variable.

In some cases the value that we need to update the package with is not stored in the database, but in a file. We need to open this file, read it and set the value of the package variable. In this article I will give an example of how to do the database one, and in the next article I'll do the file one.

Execute SQL Task

We use an Execute SQL Task, then set the Result Set to a variable. The database doesn’t have to be SQL Server of course, it can be anything. Here’s how:

Pull Execute SQL task from the Toolbox to the Control Flow canvas:

Double click it to edit, type the query in SQLStatement, and set ResultSet to Single Row.

Here’s an example of a SQL query to get the Import Folder from a database table:

After you set the SQL Statement and Result Set, click on the Result Set pane on the left hand side and set the Result Name and Variable Name. Result Name is the column name in the query output and Variable Name is the name of the package variable (yes, you need to create it first).

So click OK to come out of the Execute SQL Task editor and create the variable as below.

Create the variable by clicking the first icon. You need to set the Data Type as String. To set the scope to Package, click on any point on the Control Flow canvas before creating the variable.

If you wonder "Where is this Variables window?", scroll down to the bottom of the screen and click the Variables tab on the left, next to the Toolbar tab:

Now go back to the Result Set tab (double click the Execute SQL Task, click Result Set on the left) and set the Result Name and Variable Name:

Before running it, you may want to set a breakpoint so you can see the value of the variable: right click on the Execute SQL Task, choose Edit Breakpoints and select the OnPostExecute event (i.e. after the task has run).

Then run the package. It will stop after it runs the Execute SQL Task. To see the value of the ImportFolder variable, type "ImportFolder" in the Watch window:

Or expand the Locals window and find User::ImportFolder:

If you wonder "Where are these Watch and Locals windows?", they are at the bottom of the screen. If you can't see them, try the Windows menu, Reset Windows Layout.

Vincent Rainardi, 6/6/2011

PS. On the next article I’ll write about Updating a Variable from a File.

SSIS 2008 Data Profiler

Filed under: SSIS — Vincent Rainardi @ 5:30 am

SQL Server 2008 has a new feature in SSIS: Data Profiler. For a given table, Data Profiler enables us to understand the number of distinct values in each column, and the distribution of data values. This is very useful in data warehousing, both in ETL and when building cubes in SSAS.

When building an ETL package, sometimes we need to know the data values in a column. For example, to understand whether a column is mostly empty (blank or null) or contains good data. Or to help us decide whether to exclude or include certain values or certain patterns, probably related to a data quality requirement. Or to help us identify potential hierarchies in a dimension table.

Data Profiler is also useful when doing performance tuning, for example when partitioning a fact table or a cube. Understanding the data distribution can help us create similarly sized groups based on a particular column.

On the current SQL Server 2005 data warehousing project, I use Informatica Data Explorer (IDE) to understand the distribution of data values in the data warehouse fact and dimension tables, for the purpose of building SSAS cubes. I often asked myself: When will we be able to right click on any table in Management Studio, select ‘Analyze’ and get the data profile for that table?

Now in SSIS 2008 we have data profiler. Well, it’s not as friendly as doing a right click on any table and select ‘Analyze’. But it’s a start. I hope in the later releases of SQL Server we can right click any table in Management Studio and choose ‘Analyze’ to get the data profile.

So let’s find out what SSIS 2008 Data Profiler is all about. First, let’s have a look at its output:

Figure 1. Output of SSIS 2008 Data Profiler

The left pane shows the tables that we analyzed, along with the profiles that we executed for each table. The right hand pane shows the details of the profile selected in the left hand pane. For example, for the 'Column Value Distribution' profile, the top right panel shows a list of columns and the number of distinct values in each column. If we click on any column in this panel, the bottom right panel displays the distribution of the data values, i.e. the value, the number of rows containing that value, and the percentage of total rows. Figure 1 shows the distribution of the data values in the product_key column of the fact_sales table.

Other than analyzing the distribution of data values, SSIS 2008 Data Profiler can also analyze:

  1. The number of nulls in a column.
  2. The pattern of the data values, for example if a column contains values of 2008-01, 2008-02, …, 2008-12, then the pattern is “200d-dd” (d means digit, i.e. a number from 0 to 9).
  3. Minimum and maximum values of the data in each column.
  4. Mean (average) and standard deviation of the data values in each column. Standard deviation indicates how far the value is from the mean.
  5. Determine which column(s) is a good candidate for a primary key. This is useful for checking the 1st normal form rule, which states that each table must have a primary key that uniquely identifies each row.
  6. Checking whether a column is fully or partially dependent on other column(s). This is useful for checking the 2nd normal form and for identifying potential hierarchies within a dimension. In 2nd normal form, non-key columns must depend on the whole primary key.
  7. Checking whether all values in a column exist in another table. In data warehousing, this is useful for identifying dimensional keys on the fact table which do not exist in the dimension table. For ODS design (or other normalized databases), this is useful for checking the 3rd normal form, which states that no column is transitively dependent on the primary key. In other words, it is useful for identifying potential foreign key relationships (a hand-written equivalent of this check is sketched after this list).
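
As a point of reference, item 7 is essentially the check we would otherwise write by hand. A minimal sketch, using the fact_sales table from Figure 1 and a hypothetical dim_product dimension:

-- Find fact rows whose product_key has no match in the dimension
select f.product_key, count(*) as row_count
from fact_sales f
left join dim_product d on d.product_key = f.product_key
where d.product_key is null
group by f.product_key

The Inclusion Profile gives us this kind of result without having to write the query.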

Now let’s understand the ‘how’.

Open BIDS, create an SSIS project, drag Data Profiling Task onto the design surface. Double click to edit it. Type the Destination File name.

Click on Profile Requests in the left pane. Select Column Value Distribution in the Profile Type column. Supply the connection details, select the Table and Column(s) in the bottom panel as shown in Figure 2 below and click OK.

Figure 2. Setting up a Data Profiling Task in SSIS

Run the SSIS package. Verify that the output file is created.

To view the output file, we need to use the Data Profile Viewer utility. Go to C:\Program Files\Microsoft SQL Server\100\DTS\Binn (adjust to your SQL Server folder) and execute DataProfileViewer.exe. Click on Open, and open the output file that was produced by the SSIS Data Profiling Task earlier. It will look like Figure 1 above.

In addition to Column Value Distribution Profiles that we chose above, we can also choose other profiles, such as Column Pattern Profiles, Column Statistics Profiles, Functional Dependency Profiles, Candidate Key Profiles, Column Null Ratio Profiles and Inclusion Profiles. These profiles provide the 7 analysis items I listed earlier.

The Data Profiling facility in SQL Server 2008 is quite basic, as is its user friendliness. The viewer is not integrated into BI Development Studio, even in Release Candidate 0. But it is a good start. Additional cost or not, a data profiling facility is a necessity for an ETL tool these days. I'm sure the SSIS data profiling feature will improve a lot. I hope that in future releases of SQL Server the Data Profile Viewer utility is integrated into BI Development Studio and that we can right click any table in Management Studio and choose 'Analyze' to get the data profile.

Vincent Rainardi, 7th July 2008
Author of Building A Data Warehouse: With Examples in SQL Server

Repost from SQLServerCentral

5 June 2011

SSIS: Export a Query Result to a File

Filed under: SSIS — Vincent Rainardi @ 11:01 pm

This seems to be a simple task, but someone who is just beginning to use SSIS will surely come across it, and at that point they will need an example to copy from. Importing a file into a table is a more common task in data warehousing than exporting to a file, so it is quite possible that an SSIS developer can:

  • Import a file into table
  • Run SQL to copy table1 to table2
  • Export a table into a file

but has never, to this day, exported the result of a query to a file.

There are 2 ways to do it:

  1. Using Data Flow task
  2. Using Export Import Wizard

I recommend no. 1 over no. 2 as we have more control, but I'll go through both. No, we can't use an Execute SQL Task for this, nor an OLE DB Command transformation.

Using Data Flow Task

Pull the Data Flow task into the Control Flow tab and double click it to edit:

On the Data Flow tab, pull an OLE DB Source and a Flat File Destination onto the canvas, then connect them:

Double click the OLE DB Source to edit it:

Set the OLE DB connection manager to an existing one or click New to create a new one.

This is the most important part: set the Data access mode to SQL command. Write the SQL query in SSMS, test it, then copy and paste it here. You can also use "Build Query":
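
For example, something like this (a made-up query; any SELECT that runs correctly in SSMS will do):

select c.CustomerKey, c.CustomerName, c.City
from dbo.DimCustomer c
where c.IsActive = 1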

Click Preview to see the query result. Click Columns on the left hand side to check the output columns (this is optional) and then click OK.

Double click the Flat File Destination to edit it.

When asked, select the file format as Delimited. It is the most common. You can experiment with the other types later.

Click on New to define a new Connection Manager and set the file name by clicking Browse. Select “Column names in the first data row” as needed and change the connection manager name to one of your choice.

On the left hand side click on Columns and set the Column delimiter to Vertical Bar {|}, see below. Again this is the most common one; you can experiment with the other delimiters later on. Then click OK.

Run it (Start Debugging) and check the file output. I would suggest putting only 3 rows of data in your table so you can check that you have 3 rows in the output file. If you had 10,000, how would you check it? :)

Using Export Import Wizard

On the Solution Explorer, right click on the SSIS Packages and select SSIS Import and Export Wizard. Navigate through the pages.

1. Source: for SQL Server 2008 choose SQL Server Native Client 10.0, and select the DB as below; otherwise choose as appropriate.

2. Destination: choose Flat File and specify the File name. Choose “Column names in the first data row” if needed, as below:

3. Specify Table Copy or Query: choose Write a query to specify the data to transfer, as below:

4. Provide Source Query: type the query in SSMS and copy paste the query here, as below:

5. Set the column delimiter to Vertical Bar. This is the most common one. You can experiment with other delimiters later on.

6. Click Next, Finish, Close.

On the Solution Explorer you should have a new package, e.g. Package1.dtsx. It contains 1 Data Flow. The Data Flow contains an OLE DB Source and a Flat File Destination. It is just like the package we created a little earlier.

Hope this helps. FYI this article is one of the articles I mentioned here, the Coming SSIS Articles. As I mentioned there I will be writing other basic SSIS articles:

  • Loading a file into an existing table (upsert)
  • Setting the value of a variable in a script
  • Finding the latest file
  • Dealing with file footers/headers
And the more architectural ones:
  • Configuration parameter: parent variable vs xml approach – advantages/disadvantages
  • Using component or child package to avoid rewriting the same thing again and again
  • Orchestrating an ETL batch using metadata approach
  • Using a centralised message hub for DI/ETL rather than point to point approach

If there is a topic you’d prefer me to write, please let me know. Vincent Rainardi, 5/6/2011, vrainardi@gmail.com.

4 June 2011

Coming SSIS Articles

Filed under: SSIS — Vincent Rainardi @ 2:57 pm

Looking at the SSIS section of my blog I felt that SSIS has not been represented fairly. It has far fewer articles than SSAS! Over the years I have come across several interesting things in SSIS, hence there are a few SSIS articles I'd like to write. Some are basic; some are more architectural.

The basic ones are, for example (I will put the links here one by one as I complete them):

  • export a query result into a file Link
  • load a file into an existing table (upsert)
  • set the value of a variable from database Link
  • set the value of a variable from a file Link1, Link2
  • find the latest file
  • dealing with the file footer/header
  • import an Excel file – problem with alphanumeric column
  • removing a column that was deleted from the table

I have found that some people have difficulty finding an example that they can copy. I believe that we humans "learn by example". When people google something, they want to find an article with an example that they can copy and follow, not a lengthy explanation of how it works. But it is so difficult to find these examples, even for simple things, e.g. writing into a variable. So in all my coming articles I will show things by example, rather than giving lengthy theoretical explanations.

The more architectural ones are, for example:

  • Configuration parameter: parent variable vs xml approach – advantages/disadvantages
  • Using custom task/component or child package to increase code reusability and maintainability
  • Orchestrating an ETL batch using metadata approach
  • Using a centralised message hub for DI/ETL rather than a point to point approach. Particularly applicable/beneficial for large corporations.
  • What’s new in SSIS 2011 – this one is not architectural :)

So in the next few weeks I will be writing them one by one. Hopefully they are useful for the people who need them. As I finish each article I'll put the link in the list above, as well as on the home page of course. If there are other topics in SSIS that you would like me to write about, please let me know (vrainardi@gmail.com). There's 1 Informatica article that I'd like to write: Importing a File with Dynamic Columns, i.e. like this but in Informatica.

Vincent Rainardi, 4/6/11

PS. Well, talking about unfair representation, SSRS, Informatica, BO, Cognos, and Teradata are also not represented fairly :) – I have no articles on them. None! Well, maybe next time, after SSIS. For now I'm going to concentrate solely on SSIS (well, and data modelling).

5 March 2011

SSIS: Importing a File with Dynamic Columns

Filed under: SQL Server,SSIS — Vincent Rainardi @ 9:25 am

One of the most difficult things to deal with in the ETL is importing a file with dynamic columns. The file does not have a set number of columns. The number of columns changes from time to time. For example, the file can have 5 columns, or 20 columns, or anything in between.

Below is an example of such a file:

File1:
Col1,Col2,Col3t1,Col3t2,Col3t3
a,a,1,1,1
b,b,2,2,2
c,c,3,3,3

In the above example the number of Col3 columns is 3. But it could be 4, 5, 6 or more; we don't know.

Before we look at the ETL side, we need to look at the data modelling side first. What kind of target table should we create to host these files?

Data Modelling

Essentially there are 2 choices.

1. We could make it normalised like this:

Table1:
Col1,Col2,Col3Type,Col3Value
a,a,t1,1
a,a,t2,1
a,a,t3,1
b,b,t1,2
b,b,t2,2
b,b,t3,2
c,c,t1,3
c,c,t2,3
c,c,t3,3

In principle we design the structure so that we have a fixed number of columns. It doesn’t matter how many columns in the source file, we will always have 4 columns in the table.

The drawback of this design is when we query it. If we need to produce File1 from Table1, we would need to pivot the table like this:

select Col1, Col2, t1 as Col3t1, t2 as Col3t2, t3 as Col3t3
from
( select Col1, Col2, Col3Type, Col3Value
from Table1
) P1
pivot
( sum(Col3Value) for Col3Type in (t1, t2, t3)
) as P2

Output:
Col1,Col2,Col3t1,Col3t2,Col3t3
a,a,1,1,1
b,b,2,2,2
c,c,3,3,3

If we have 2 tables like that, we need to join them together on the common column(s) and it becomes even trickier (see the sketch below). If we are dealing with a million rows, we risk serious performance degradation. This might not be a problem for a data mart, but for a transaction system where millisecond response times are expected, this design does not serve the purpose.
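
To illustrate the point, here is a minimal sketch of such a join, assuming a hypothetical second normalised table Table2 with the same Col1/Col2 keys and its own Col4Type/Col4Value pair (types s1 and s2):

with P1 as
( select Col1, Col2, t1, t2, t3
  from ( select Col1, Col2, Col3Type, Col3Value from Table1 ) a
  pivot ( sum(Col3Value) for Col3Type in (t1, t2, t3) ) p
),
P2 as
( select Col1, Col2, s1, s2
  from ( select Col1, Col2, Col4Type, Col4Value from Table2 ) b
  pivot ( sum(Col4Value) for Col4Type in (s1, s2) ) p
)
select P1.Col1, P1.Col2, P1.t1, P1.t2, P1.t3, P2.s1, P2.s2
from P1
join P2 on P2.Col1 = P1.Col1 and P2.Col2 = P1.Col2

Two pivots plus a join, just to get back to the shape of the original files.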

Alternative Design

The alternative is to find out how many Col3 columns there can be. We know it could be 3 to 5, but could it be 100? Spending 5 minutes with the subject matter expert to understand the characteristics of Col3 is well worth it. I have seen this dynamic-columns case 3 times (the first time was about 12 years ago); in every case they followed the book/theory (normalise) and every time they regretted it. In the second case (about 8 years ago), when I was involved in troubleshooting the performance issue, we ended up denormalising the table structure.

So I would recommend finding out how many Col3 columns there can be. 90% of the time it is not that many. The response I usually get from the SME is something like this: "Most likely Col3 varies between 3 and 5 columns; it certainly won't exceed 8". The data has a natural meaning, and that's why the range is limited. For example, Col3 could be the number of teams, or the number of projects, or the number of web sites, or the number of product types. In the business world most data falls within a certain range. You can't have 100 teams, for example; it's practically impossible for a company, as it is limited by budget constraints.

So once we have that information (the maximum number of columns), we create the table according to this limitation. Let's assume the SME said that the max is 8 columns, so we design Table1 with 10 Col3 columns (just in case), as follows:

create table table1
( Col1 varchar(10),
  Col2 varchar(10),
  Col3t1 int,
  Col3t2 int,
  Col3t3 int,
  Col3t4 int,
  Col3t5 int,
  Col3t6 int,
  Col3t7 int,
  Col3t8 int,
  Col3t9 int,
  Col3t10 int
)

Unless it is a staging table, don't forget to add a PK column. It is good design to be able to uniquely identify a row at all times. In this example I don't add a PK column because the target is a staging table.
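
For reference, if the target were not a staging table, adding a surrogate key could be as simple as this (a sketch; the key and constraint names are just examples):

alter table table1
  add table1_key int identity(1,1) not null
  constraint PK_table1 primary key clustered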

A table structure with a fixed number of columns like this makes our lives a lot easier. It directly mimics the structure of the source file and is able to accommodate the dynamic nature of Col3.

The Data Load

I’m going to explain the ETL for the denormalised table, i.e. the one with a fixed 10 columns for Col3. The ETL for the normalised table, i.e. the table with 4 columns will have to wait until next time I’m afraid, as I have a few other posts to write next week, plus a presentation for SQLBits 8, Advanced Dimensional Modelling. If you need the ETL for the normalised table with 4 columns please contact me.

Because of the dynamic nature of Col3 in the source file, the best way is to use the Script Component as a Source. There are a lot of resources about the Script Component on the internet. This post shows you the basics. This post (by Lee Everest) shows you how to use a Script Component to split the rows in the source file. Similarly, this MSDN post shows how to import a delimited source file using a Script Component.

The main concept is as follows: First create a Connection Manager for a File Connection (no need to use Flat File connection). Name this connection File1, like this:

Then create a Data Flow, and drag a Script Component into it. When asked, specify it as a source.

Then:

  • On the Connection Managers tab, set the connection to File1 (that's the middle column). In the first column, name it "File1Conn" (this is the name the script refers to).

  • On the Inputs and Outputs tab, rename "Output 0" to File1Output. Then create the Output Columns one by one, as per the structure of Table1:

    Col1, Col2, Col3t1, Col3t2, … Col3t10.


  • In this example we will convert the data type of the Col3 columns from string to int not in the Script Component, but later on using a Derived Column / Data Conversion transformation. So for all columns, set the data type to String, length 10.

  • On the Script tab, click on Edit Script and edit it as follows. Don't forget to add "using System.IO;" and, if you want to use MessageBox for debugging, "using System.Windows.Forms;". Remember that you can't debug an SSIS Script Component (see my post here), but you can debug an SSIS Script Task.
private StreamReader SR;
private string File1;

public override void AcquireConnections(object Transaction)
{
  // Get the connection for File1
  IDTSConnectionManager100 CM = this.Connections.File1Conn;
  File1 = (string)CM.AcquireConnection(null);
}

public override void PreExecute()
{
  // Create a reader for File1
  base.PreExecute();
  SR = new StreamReader(File1);
}

public override void CreateNewOutputRows()
{
  // Declare variables
  string nextLine;
  string[] columns;
  char[] delimiters;
  int Col3Count;
  // Initialise all 10 slots to empty strings so that any Col3 columns
  // beyond Col3Count are written as blanks rather than left as null
  String[] Col3Value = new string[10] { "", "", "", "", "", "", "", "", "", "" };

  // Set the delimiter
  delimiters = ",".ToCharArray();

  // Read the first line (header)
  nextLine = SR.ReadLine();

  // Split the line into columns
  columns = nextLine.Split(delimiters);

  // Find out how many Col3 there are in the file
  Col3Count = columns.Length - 2;
  // MessageBox.Show(Col3Count.ToString());

  // Read the second line and loop until the end of the file
  nextLine = SR.ReadLine();
  while (nextLine != null)
  {
    // Split the line into columns
    columns = nextLine.Split(delimiters);
    {
      // Add a row
      File1OutputBuffer.AddRow();

      // Set the Col3Value variable to the values in the file
      for (int i = 0; i <= Col3Count - 1; i++)
      {
        Col3Value[i] = columns[i + 2];
      }

      // Set the values of the Script Component output according to the file content
      File1OutputBuffer.Col1 = columns[0];
      File1OutputBuffer.Col2 = columns[1];
      File1OutputBuffer.Col3t1 = Col3Value[0];
      File1OutputBuffer.Col3t2 = Col3Value[1];
      File1OutputBuffer.Col3t3 = Col3Value[2];
      File1OutputBuffer.Col3t4 = Col3Value[3];
      File1OutputBuffer.Col3t5 = Col3Value[4];
      File1OutputBuffer.Col3t6 = Col3Value[5];
      File1OutputBuffer.Col3t7 = Col3Value[6];
      File1OutputBuffer.Col3t8 = Col3Value[7];
      File1OutputBuffer.Col3t9 = Col3Value[8];
      File1OutputBuffer.Col3t10 = Col3Value[9];

    }

    // Read the next line
    nextLine = SR.ReadLine();
  }
}

public override void PostExecute()
{
  // Close the reader
  base.PostExecute();
  SR.Close();
}
  • Save the script, close SSIS script editor, click OK to get out of the Script Transformation Editor, back to the Data Flow.
  • I would suggest not putting the output directly into a table in the database, but putting it into a file first to check the result. This way you don't need to worry about data conversion between string and int etc. Create a flat file connection with 12 columns. All the columns are string, with length = 10, as follows:

  • Run it. If it fails, check the source file and make sure it looks like this:

    Also check the script and make sure it is like the one above.

  • Create an OLE DB Destination or SQL Server Destination (see my post here on why, practically speaking, we can't use the SQL Server Destination). But create the table first by executing the DDL I wrote above. Note that if you are using the SQL Server Destination (because you are loading into your local instance), on Windows 7 or Vista you need to run BIDS as Administrator, otherwise you'll get an error like: Unable to prepare SSIS bulk insert, or something like that.
  • Create a data conversion to convert Col3 columns from string to int, like this:

  • Connect the output of the data conversion to the DB and map the DC output columns to the DB columns like this:

  • Now execute it again, and check the target table:

  • Overall the Data Flow looks like this:

As usual I welcome any question and discussion at vrainardi@gmail.com. Vincent, 5/3/11.


28 February 2011

SSIS: SQL Server Destination or OLEDB

Filed under: SSIS — Vincent Rainardi @ 12:28 am

We need to insert or update a SQL Server table using a Data Flow. Should we use the SQL Server destination or the OLE DB destination? This is one of the most frequently asked questions among SSIS developers. OLE DB is used more than SQL Server, which doesn't seem to make sense. If the target were an Oracle table we could understand that. But this is a SQL Server table, so of course it is better to use the SQL Server Destination, right?

A SQL Server destination has some things that an OLE DB destination doesn't. The left one is the SQL Server destination, the right one is the OLE DB destination:

One of the most important things to notice here is that in terms of features there's not much difference. Both of them are suitable for loading data into a SQL Server table. If your table has triggers, then yes, it matters. But who in data warehousing would create a dimension or fact table with a trigger anyway? Using the SQL Server Destination we can specify the last row, which is useful when you need to import only the first N rows (see my post here). But as the source is a SQL table anyway we can of course use TOP N in our select statement. So again, there's not much difference.

So the most crucial question is: which one is faster? To answer that I imported a file with 1 million rows (2 int columns) into a SQL Server table, first using an OLE DB Destination, then using a SQL Server Destination. The property settings are all as above.

OLE DB Destination: 1st run: 3.229, 2nd run: 3.464, 3rd run: 3.978 seconds (without tablock it is 8.284, 8.330, 8.331)

SQL Server Destination: 1st run: 2.637, 2nd run: 2.574, 3rd run: 2.870 seconds.

So SQL Server Destination is faster than OLE DB Destination. SQL Server average is 2.694s, 24% faster than OLE DB which is 3.557s.

This runs on my Dell Inspiron 1545 laptop, Pentium Dual Core CPU T4500 @2.30 GHz each, 4 GB RAM, Windows 7 Home Premium 64-bit. Hard drive: WD5000BEVT, 500GB, 2.5″ x 1/8H SATA-300, 5400 RPM, external Data Transfer Rate 300MBps, average latency 5.5ms, seek time 2ms.

The main reason the SQL Server Destination is not used is not performance. The main issue is that we can't edit the SSIS package on our PC; we have to edit it on the target SQL Server, which is ridiculous. If we edit the package on our PC we get an error message: "The selected data source is on a remote computer. The bulk insert operation can be executed only on the local computer."

This Connect post explains it. In that post Bob Bojanic mentioned that Microsoft would be fixing it in SQL 2011, enabling us to use the SQL Server Destination when editing the SSIS package on our PC. But a later post dated 13/6/08 from Microsoft said that they were not sure whether the fix would be done in SQL 2011 ("open for triage").

I still don’t understand why this restriction was there in the first place. And why it wasn’t issued as a 2008 patch, as to me the fix seems to be trivial: as Jonathan Bays said, “I would recommend removing this validation during design time and only put this restriction in at run time. If the package tries to load to a remote server, then it can throw an exception.”

27 February 2011

SSIS: Data Flow vs Stored Procedures

Filed under: SSIS — Vincent Rainardi @ 9:44 pm

When importing a file into a SQL table, we create a Data Flow. But for transferring data from one SQL Server table to another, is it better to use an Execute SQL Task (stored procedures) or a Data Flow?

This is a classic debate in SSIS. A lot of the time in data warehousing we need to transfer data from the staging tables to the fact and dimension tables. Should we use an Execute SQL Task or a Data Flow?

There are 4 main considerations here:

  1. Data quality checking
  2. ETL framework
  3. Performance
  4. Development time

Data Quality Checking

There are a number of data quality checks that we need to perform on the incoming data and log accordingly, potentially rejecting the incoming data: for example data type validations, the number of columns, whether the data is within a certain allowable range or conforms to a certain list, etc. These DQ checks should be built only once and used many times, avoiding redundant work. For that purpose, it is easier to build the DQ checks as stored procedures, running dynamic SQL on the staging tables one by one. One of the main principles in DQ is that any silly data in the incoming feed should not fail the data load; it should be gracefully recorded and the whole ETL batch should carry on. It is an order of magnitude more difficult to build the DQ routines as script tasks, which are executed before the data flows into the warehouse. On the other hand, data profiles are easier to build using the Data Profiling task. What I'm saying is that the decision whether to use a data flow or a stored procedure/Execute SQL Task is affected by how the DQ routines were built.
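
To make that concrete, here is a heavily simplified sketch of a range check written as a stored procedure running dynamic SQL (the procedure, the log table and the parameter names are made up for illustration):

create procedure dbo.usp_dq_range_check
  @staging_table sysname,
  @column_name   sysname,
  @min_value     int,
  @max_value     int
as
begin
  declare @sql nvarchar(max), @fail_count int

  -- Count the rows that fall outside the allowable range
  set @sql = N'select @cnt = count(*) from ' + quotename(@staging_table)
           + N' where ' + quotename(@column_name) + N' not between @min and @max'

  exec sp_executesql @sql,
       N'@cnt int output, @min int, @max int',
       @cnt = @fail_count output, @min = @min_value, @max = @max_value

  -- Log the result; the load itself carries on regardless
  insert into dbo.dq_check_log (check_time, table_name, column_name, failed_rows)
  values (getdate(), @staging_table, @column_name, @fail_count)
end

The same procedure can then be run for every staging table and column listed in the DQ metadata.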

ETL Framework

In every data warehousing or data integration project that uses SSIS as the ETL tool, the first step is to build an ETL framework. This framework handles error checking, alert notification, task failures, logging, execution history, file archiving and batch control. It is built as a "parent child" package system, supported by a series of ETL metadata tables, as per chapter 10 of my book, e.g. a data flow table, a package table and a status table. What I'm saying here is that the decision whether to use a data flow or stored procedures/Execute SQL Tasks is affected by your ETL framework. I know that it should be the other way around: the ETL framework should be built to accommodate both data flows and stored procedures. If that's the case in your project then excellent, there's no problem here. But practically speaking I've seen several cases where we could not implement a data transfer routine as a data flow because the ETL framework dictated that it be implemented as a stored procedure.

The next 2 points are the guts of the matter. They are the real reasons for choosing between the data flow approach and stored procedures if it is a green field, meaning that you have complete freedom to choose, without any existing corporate rules/architecture affecting your decision.

Performance

Performance is about how fast the data load is. Given the same amount of data to load from the staging table into the main table, which one is the faster method: an insert from a select statement, or a data flow? Generally speaking, if the data is less than 10k rows, there's no real difference in performance. It is how complicated your DQ stuff is that slows it down, not whether it's a data flow or a stored procedure. If you are lucky enough to be involved in a project that loads billions of rows every day, you should be using data flows; generally they are faster than stored procedures. The main issue with using a stored procedure to do a 1 billion row upsert in a SQL Server database is the bottleneck on tempdb and the log files. Your DBA wouldn't be happy if you blew up tempdb from a nice 2 GB to 200 GB. Ditto with the log files.

Using a data flow you can split a derived column transformation into several transformations, boosting the throughput by up to a factor of two; see here for details from the SQLCAT team. This principle applies to any synchronous transformation, including data conversion, lookup, row count, copy column and multicast; see here for an explanation of synchronous vs asynchronous transformations. The thing that gives the most performance gain is using multiple data flows to read different partitions of the source table simultaneously. This will certainly create a bottleneck on the target, so it too needs to be partitioned, pretty much the same way as the source table. The other thing that increases performance is the use of cache on the lookup transformation. Using Full Cache, the entire lookup table is pulled into memory before the data flow is executed, so the lookup operation is lightning fast. Using Partial Cache, the cache is built as the rows pass through: when a new row comes in, SSIS searches the cache (memory) for a match, and only if it doesn't find one does it fetch the data from the database. See here for details. You don't get any of this when you use stored procedures to transfer the data.

Development Time

You may say that development time matters less than performance when it comes to how much it influences the decision between a data flow and a stored procedure, but in reality this factor is significant. I have seen several cases where the ETL developer is more comfortable coding in Transact SQL than using SSIS transformations. They are probably twice as fast building it in stored procedures as doing it in SSIS transformations, due to their past experience. Understandably, this is because the majority of so-called "SSIS developers" were "SQL developers". They may have been doing SSIS for 2 years, but they have been doing SQL stored procedures for 10 years. For example, many developers are more conversant doing date conversion in Transact SQL than in a Derived Column.

Conclusion

If you are lucky enough to be able to choose freely, a data flow gives more performance and flexibility. But as with everything else in the real world, there are other factors which tie your hands, e.g. the data quality checking, the ETL framework and the development time.

As always I’d be glad to receive your comments and discussion at vrainardi@gmail.com. Vincent 27/2/11.

SSIS: Debugging a Script Component

Filed under: SSIS — Vincent Rainardi @ 10:17 am

I had mistakenly assumed that we can debug a script component, as in putting in breakpoints and inspecting the values of variables at run time. But no, in SSIS we cannot debug a script component. We can debug a script task, but not a script component. A script task is a Control Flow item, whereas a script component is a Data Flow item.

Because I have debugged a script task in the past, I thought we could debug a script component. So I put breakpoints in the code of my script component and then executed the data flow task. Alas, the data flow did not stop at the breakpoints; it carried on. I thought my Visual Studio 2008 installation was not right, i.e. it did not integrate with SQL Server BIDS. So I tried on my laptop, which only had SQL Server 2008 installed. I created a data flow with a script component and ran it, and I got the same thing: SSIS did not stop at the breakpoints inside the script component.

Then I read in MSDN that we can’t debug a script component: link. Because we can’t set breakpoints and inspect the variables, it suggests 4 methods:

  1. Message box (MessageBox.Show)
  2. Raise warning and error events
  3. Log messages
  4. Use a data viewer to see the output

In the end I did the debugging using a combination of:

  1. Message box
  2. Write to a text file (using StreamWriter)
  3. Set a package variable

SSIS: Importing Files – Read The First N Rows

Filed under: SSIS — Vincent Rainardi @ 8:17 am

Sometimes we need to read the first few rows of the file, not the whole file. Usually this is because the file contains a header that gives us information about the structure of the file, like this:

col1|col2|col3
a,a,a
col1|col2|col3|col4|col5
1,1,1,1,1
2,2,2,2,2
3,3,3,3,3

In the above example we want to read the first 2 rows, which are the header. But in SSIS there is no facility to read only the first N rows; we have to read the whole file, which is ridiculous if we have a file with 1 million rows. So how do we read the first N rows only?

To read the first N rows there are 3 alternatives:

  1. Use Konesans Row Number Transformation.
  2. Put all rows into a table then select top N from it.
  3. Use a script component as a source or transform.

Using Konesans Row Number Transformation

Thanks to Allan and Darren, the Row Number Transformation gives us a row number for every row in the file. This row number is added as a column to the output of the data flow. For further details please see their web site, SQLIS.com. We can then use a Conditional Split to split the data flow, like this:

One output of the Conditional Split, called "header" in this case, contains the first 2 rows only, which can then be loaded into the target table.

Using this method we still have to read the entire file, as summarised nicely by CTKeane here. So practically speaking this approach is ideal if the file we are importing is small, e.g. 1000 rows. However, it is also suitable for importing big files if we want to read or process the rest of the file as well. Say the file consists of 1 million rows: the first 2 rows are the header, and the remaining 999,998 rows are the details. We can then process the default output of the Conditional Split, which contains the detail rows.

The second thing we need to be aware of is that it is a third party component, which our infrastructure managers might not be willing to support. So before going down this route it is wise to check first with whoever is in charge of the production/UAT/dev servers in your company/project.

Putting all rows in a table then select the top N from it

In this approach we import the file into a staging table, then select the top N rows from that staging table. Again, this is OK if we want to process the rest of the file as well as the header. But if we don't want to read the rest of the file, this approach is wasteful because we still have to load the whole file.

If the file that we are importing consists of 2 sections, like the above example, then when defining the flat file connection manager we need to define the file as 1 column. Only 1 column. When we query that 1-column table we then split the long string into several columns based on the delimiters.

Transact SQL doesn’t have a string split function, not like in C#. We could use charindex, which gives us the location of the delimiters. So the idea is that we crawl the long string character by character, and when we found the delimiter we then use substring to capture that column data. For an example see this post on the SQLTeam web site.

Use a Script Component as a Source/Transform

If your flat file has a static structure (a fixed number of columns) then you can use the script component as a transform, and use a flat file source as normal, as Jamie Thomson described here. But if your flat file contains multiple structures, like in the above example, it is probably best to use the script component as a source.

To use a script component as a transform, first create a flat file source pointing to the file. Then add a script component and choose Transform when SSIS asks us to select Source, Destination or Transform.

  • On the Connection Managers tab we don’t set anything.
  • On the Input Columns tab we select the columns that we want. These input columns will be automatically added to the Input 0 section of the Inputs and Outputs tab.
  • On the Inputs and Outputs tab, rename the "Output 0" to something sensible, e.g. CustomerRecordFileOutput. Then create the Output Columns. Make these columns of type signed int, 4 bytes. Also, on the output, set the SynchronousInputID property to None, which makes it asynchronous (the default is the ID of the input, i.e. synchronous).
  • Finally, on the Script tab, set the ReadOnlyVariables to User::N, i.e. the package variable containing how many rows we want to read. This is important; otherwise the package variable will not be recognised within the script. Then edit the script as follows (C#):
    private int i = 1; // row counter; declared at class level so it keeps its value across rows

    public override void Input0_ProcessInputRow(Input0Buffer Row)
    {
        // Only copy the first N rows to the output
        if (i <= Variables.N)
        {
            Output0Buffer.AddRow();
            Output0Buffer.Col1 = Row.col1;
            i++;
        }
    }

Basically, in the above code, we create a row counter called i. Note that it must be declared at class level, not inside the method, so that it keeps its value from one input row to the next; we then increment it for every input buffer row. As long as i is less than or equal to N, we produce output; otherwise we don't. Leave the PreExecute and PostExecute methods alone; we just need the class-level counter and the Input0_ProcessInputRow method as above.

The data flow looks like this:

If you use the script component as a source, this is how:

  1. Don’t create a flat file source, just create a flat file connection manager.

    Suppose our source file is like this:

    col1|col2
    1|1
    2|2
    3|3
    4|4
    5|5
    

    So create a flat file connection manager with 2 columns, tick the “Column names in the first data row” checkbox, format: Delimited, Header row delimiter: {CR}{LF}, column delimiter: vertical bar.

  2. Add the script component. Choose “As a Source” when asked whether it’s a source, destination or transformation.
  3. Edit the script component and on the Connection Managers tab add the flat file connection that we created in step 1 above. For example, let's call it ConnFile2, like this:

  4. Then on the Inputs and Outputs tab, rename Output0 to File2Output. Then click on "Output Columns" and click Add Column twice to create Col1 and Col2, setting the data types according to the file you are importing:

  5. On the Script tab, add the package variable N to the ReadOnlyVariables:

  6. Then edit the script as follows: (based on MSDN article, Creating a Source with the Script Component, link)
using System;
using System.Data;
using Microsoft.SqlServer.Dts.Pipeline.Wrapper;
using Microsoft.SqlServer.Dts.Runtime.Wrapper;
using System.IO; // This namespace is required for StreamReader

[Microsoft.SqlServer.Dts.Pipeline.SSISScriptComponentEntryPointAttribute]
public class ScriptMain : UserComponent
{
    private StreamReader SR;
    private string File2;

    public override void AcquireConnections(object Transaction)
    {
        IDTSConnectionManager100 ConnMgr = this.Connections.ConnFile2;
        File2 = (string)ConnMgr.AcquireConnection(null);
    }

    public override void PreExecute()
    {
        base.PreExecute();
        SR = new StreamReader(File2);
    }

    public override void PostExecute()
    {
        base.PostExecute();
        SR.Close();
    }

    public override void CreateNewOutputRows()
    {
        string Row;
        string[] Cols;
        char[] Delimiter;
        Delimiter = "|".ToCharArray();

        //Skip the header first
        Row = SR.ReadLine();

        for(int i = 1; i <= Variables.N; i++)
        {
            Row = SR.ReadLine();
            if (Row == null) break; // stop early if the file has fewer than N data rows
            Cols = Row.Split(Delimiter);
            File2OutputBuffer.AddRow();
            File2OutputBuffer.Col1 = Cols[0];
            File2OutputBuffer.Col2 = Cols[1];

        }

    }

}

The Data Flow looks like this:
