Data Warehousing and Business Intelligence

4 June 2011

Coming SSIS Articles

Filed under: SSIS — Vincent Rainardi @ 2:57 pm

Looking at the SSIS section on my blog I felt that SSIS has not been represented fairly. It has far fewer articles than SSAS! Over the years I have encountered several interesting things in SSIS, hence there are a few SSIS articles I'd like to write. Some are basic; some are more architectural.

The basic ones are, for example: (I will put the links here one by one as I complete them)

  • export a query result into a file Link
  • load a file into an existing table (upsert)
  • set the value of a variable from database Link
  • set the value of a variable from a file Link1, Link2
  • find the latest file
  • dealing with the file footer/header
  • import an Excel file – problem with alphanumeric column
  • removing a column that was deleted from the table

I found that some people have difficulties finding an example that they can copy. I believe that we humans "learn by example". When people google something, they want to find an article with an example that they can copy or follow, not a lengthy explanation of how it works. But it is so difficult to find these examples, even for simple things, e.g. writing into a variable. So I will write all my coming articles to show things by example, rather than as lengthy theoretical explanations.

The more architectural ones are, for example:

  • Configuration parameter: parent variable vs xml approach – advantages/disadvantages
  • Using custom task/component or child package to increase code reusability and maintainability
  • Orchestrating an ETL batch using metadata approach
  • Using a centralised message hub for DI/ETL rather than a point-to-point approach. Particularly applicable/beneficial for large corporations.
  • What’s new in SSIS 2011 – this one is not architectural :)

So in the next few weeks I will be writing them one by one. Hopefully they are useful for people who need them. As I finish each article I'll put the link on the list above, as well as on the home page of course. If there are other topics in SSIS that you would like me to write about, please let me know (vrainardi@gmail.com). There's 1 Informatica article that I'd like to write: Importing a File with Dynamic Columns, i.e. like this but in Informatica.

Vincent Rainardi, 4/6/11

PS. Well, talking about unfair representation, SSRS, Informatica, BO, Cognos, and Teradata are also not represented fairly :) I have no articles on them. None! Well, maybe next time, after SSIS. For now I'm going to concentrate solely on SSIS (well, and data modelling).

5 March 2011

SSIS: Importing a File with Dynamic Columns

Filed under: SQL Server,SSIS — Vincent Rainardi @ 9:25 am

One of the most difficult things to deal with in ETL is importing a file with dynamic columns. The file does not have a set number of columns; the number of columns changes from time to time. For example, the file can have 5 columns, or 20 columns, or anything in between.

Below are some examples of such a file:

File1:
Col1,Col2,Col3t1,Col3t2,Col3t3
a,a,1,1,1
b,b,2,2,2
c,c,3,3,3

In the above example the number of columns for Col3 is 3. But it could be 4, 5, 6 or more; we don't know.

Before we look at the ETL side, we need to look at the data modelling side first. What kind of target table should we create to host these files?

Data Modelling

Essentially there are 2 choices.

1. We could make it normalised like this:

Table1:
Col1,Col2,Col3Type,Col3Value
a,a,t1,1
a,a,t2,1
a,a,t3,1
b,b,t1,2
b,b,t2,2
b,b,t3,2
c,c,t1,3
c,c,t2,3
c,c,t3,3

In principle we design the structure so that we have a fixed number of columns. It doesn't matter how many columns there are in the source file; we will always have 4 columns in the table.
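For reference, here is a minimal sketch of the DDL for this normalised design (the data types are assumptions, chosen to mirror the denormalised DDL later in this post):

-- Sketch of the normalised design: always 4 columns,
-- no matter how many Col3 columns the source file has
create table Table1
( Col1 varchar(10),
  Col2 varchar(10),
  Col3Type varchar(10),  -- e.g. 't1', 't2', 't3'
  Col3Value int
)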

The drawback of this design is when we query it. If we need to produce File1 from Table1, we would need to pivot the table like this:

select Col1, Col2, t1 as Col3t1, t2 as Col3t2, t3 as Col3t3
from
( select Col1, Col2, Col3Type, Col3Value
from Table1
) P1
pivot
( sum(Col3Value) for Col3Type in (t1, t2, t3)
) as P2

Output:
Col1,Col2,Col3t1,Col3t2,Col3t3
a,a,1,1,1
b,b,2,2,2
c,c,3,3,3

If we have 2 tables like that we need to join them together on the common column(s), and it becomes even trickier. If we are dealing with a million rows, we risk serious performance degradation. This might not be a problem for a data mart, but for a transaction system where millisecond response times are expected, this design does not serve the purpose.
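To illustrate the point, here is a sketch of such a join, assuming a hypothetical second table, Table2, normalised the same way as Table1 (the aliases and output column names are just for illustration):

select A.Col1, A.Col2,
       A.Col3t1, A.Col3t2, A.Col3t3,
       B.Col3t1 as T2Col3t1, B.Col3t2 as T2Col3t2, B.Col3t3 as T2Col3t3
from
( select Col1, Col2, t1 as Col3t1, t2 as Col3t2, t3 as Col3t3
  from ( select Col1, Col2, Col3Type, Col3Value from Table1 ) S1
  pivot ( sum(Col3Value) for Col3Type in (t1, t2, t3) ) P1
) A
join
( select Col1, Col2, t1 as Col3t1, t2 as Col3t2, t3 as Col3t3
  from ( select Col1, Col2, Col3Type, Col3Value from Table2 ) S2
  pivot ( sum(Col3Value) for Col3Type in (t1, t2, t3) ) P2
) B
  on A.Col1 = B.Col1 and A.Col2 = B.Col2

Both sides have to be pivoted before the join, which is where the extra work comes from on large row counts.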

Alternative Design

The alternative is to find out how many columns Col3 can have. We know it could be 3 to 5, but could it be 100? Spending 5 minutes with the subject matter expert to understand the characteristics of Col3 is well worth it. I have seen this dynamic columns case 3 times (the first time was about 12 years ago) and in every case they followed the book/theory (to normalise) and every time they regretted it. On the second case (about 8 years ago), when I was involved in troubleshooting the performance issue, in the end we denormalised the table structure.

So, I would recommend finding out how many columns Col3 can have. Ninety per cent of the time it is not that many. The response I usually get from the SME is something like this: "Most likely Col3 varies between 3 and 5 columns; it certainly won't exceed 8". This data has a natural meaning, and that's why the range is limited. For example, Col3 could be the number of teams, or the number of projects, or the number of web sites, or the number of product types. In the business world most data falls within a certain range. You can't have 100 teams, for example. It's practically impossible for a company, as it is limited by budget constraints.

So once we have that information (the maximum number of columns), we create the table according to this limitation. Let's assume that the SME said that the max is 8 columns. So we design Table1 with 10 Col3 columns (just in case), as follows:

create table table1
( Col1 varchar(10),
  Col2 varchar(10),
  Col3t1 int,
  Col3t2 int,
  Col3t3 int,
  Col3t4 int,
  Col3t5 int,
  Col3t6 int,
  Col3t7 int,
  Col3t8 int,
  Col3t9 int,
  Col3t10 int
)

Unless it is a staging table, don’t forget to add a PK column. It is a good design to be able to uniquely identify a row at all times. In this example I don’t add a PK column because the target is a staging table.
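For a target that is not a staging table, a sketch of a similar DDL with a surrogate key added might look like this (the table and key column names here are just illustrations):

-- Sketch: the same structure with a surrogate primary key,
-- for when the target is not a staging table
create table Table1_Final
( Table1Key int identity(1,1) not null primary key,
  Col1 varchar(10),
  Col2 varchar(10),
  Col3t1 int,
  Col3t2 int  -- and so on, up to Col3t10, as in the DDL above
)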

A table structure with a fixed number of columns like this makes our lives a lot easier. It directly mimics the structure of the source file, and is able to accommodate the dynamic nature of Col3.

The Data Load

I'm going to explain the ETL for the denormalised table, i.e. the one with a fixed 10 columns for Col3. The ETL for the normalised table, i.e. the table with 4 columns, will have to wait until next time I'm afraid, as I have a few other posts to write next week, plus a presentation for SQLBits 8, Advanced Dimensional Modelling. If you need the ETL for the normalised table with 4 columns, please contact me.

Because of the dynamic nature of Col3 in the source file, the best way is to use the Script Component as a source. There are a lot of resources on the internet about the Script Component. This post shows you the basics. This post (by Lee Everest) shows you how to use a Script Component to split the rows in the source file. Similarly, this MSDN post shows how to import a delimited source file using a Script Component.

The main concept is as follows: First create a Connection Manager for a File Connection (no need to use Flat File connection). Name this connection File1, like this:

Then create a Data Flow Task, and drag a Script Component into the data flow. When asked, specify it as a source.

Then:

  • On the Connection Managers tab, set the connection to File1 (that's the middle column). On the first column, name it "File1Conn".

  • On the Inputs and Outputs tab, rename "Output 0" to File1Output. Then create the Output Columns one by one, as per the structure of Table1:

    Col1, Col2, Col3t1, Col3t2, … Col3t10.


  • In this example we will convert the data type for the Col3 columns from string to int not in the Script Component, but later on using a Derived Column / Data Conversion transformation. So for all columns, set the data type to String, length 10.

  • On the Script tab, click on Edit Script and edit it as follows. Don't forget to add "using System.IO;" and, if you want to use MessageBox for debugging, "using System.Windows.Forms;". Remember that you can't debug an SSIS Script Component (see my post here), but you can debug an SSIS Script Task.
private StreamReader SR;
private string File1;

public override void AcquireConnections(object Transaction)
{
  // Get the connection for File1
  IDTSConnectionManager100 CM = this.Connections.File1Conn;
  File1 = (string)CM.AcquireConnection(null);
}

public override void PreExecute()
{
  // Create a reader for File1
  base.PreExecute();
  SR = new StreamReader(File1);
}

public override void CreateNewOutputRows()
{
  // Declare variables
  string nextLine;
  string[] columns;
  char[] delimiters;
  int Col3Count;
  String[] Col3Value = new string[10];

  // Set the delimiter
  delimiters = ",".ToCharArray();

  // Read the first line (header)
  nextLine = SR.ReadLine();

  // Split the line into columns
  columns = nextLine.Split(delimiters);

  // Find out how many Col3 there are in the file
  Col3Count = columns.Length - 2;
  // MessageBox.Show(Col3Count.ToString());

  // Read the second line and loop until the end of the file
  nextLine = SR.ReadLine();
  while (nextLine != null)
  {
    // Split the line into columns
    columns = nextLine.Split(delimiters);

    // Add a row
    File1OutputBuffer.AddRow();

    // Set the Col3Value variable to the values in the file
    for (int i = 0; i <= Col3Count - 1; i++)
    {
      Col3Value[i] = columns[i + 2];
    }

    // Set the values of the Script Component output according to the file content
    File1OutputBuffer.Col1 = columns[0];
    File1OutputBuffer.Col2 = columns[1];
    File1OutputBuffer.Col3t1 = Col3Value[0];
    File1OutputBuffer.Col3t2 = Col3Value[1];
    File1OutputBuffer.Col3t3 = Col3Value[2];
    File1OutputBuffer.Col3t4 = Col3Value[3];
    File1OutputBuffer.Col3t5 = Col3Value[4];
    File1OutputBuffer.Col3t6 = Col3Value[5];
    File1OutputBuffer.Col3t7 = Col3Value[6];
    File1OutputBuffer.Col3t8 = Col3Value[7];
    File1OutputBuffer.Col3t9 = Col3Value[8];
    File1OutputBuffer.Col3t10 = Col3Value[9];

    // Read the next line
    nextLine = SR.ReadLine();
  }
}

public override void PostExecute()
{
  // Close the reader
  base.PostExecute();
  SR.Close();
}
  • Save the script, close the SSIS script editor, click OK to get out of the Script Transformation Editor, and go back to the Data Flow.
  • I would suggest not putting the output into a database table directly, but writing the output to a file first to check the result. This way you don't need to worry about data conversion between string and int, etc. Create a flat file connection with 12 columns. All the columns are string, with length = 10, as follows:

  • Run it. If it fails, check the source file and make it like this:

    Also check the script, make sure it is like above.

  • Create an OLE DB Destination or SQL Server Destination (see my post here re why, practically speaking, we can't use SQL Server Destination). But create the table first by executing the DDL I wrote above. Note that if you are using SQL Server Destination (because you are loading into your local instance), then on Windows 7 or Vista you need to run BIDS as Administrator, otherwise you'll get an error like "Unable to prepare SSIS bulk insert", or something like that.
  • Create a data conversion to convert Col3 columns from string to int, like this:

  • Connect the output of the data conversion to the DB and map the DC output columns to the DB columns like this:

  • Now execute it again, and check the target table:

  • Overall the Data Flow looks like this:

As usual I welcome any question and discussion at vrainardi@gmail.com. Vincent, 5/3/11.


28 February 2011

SSIS: SQL Server Destination or OLEDB

Filed under: SSIS — Vincent Rainardi @ 12:28 am

We need to insert or update a SQL Server table using a Work Flow. Should we use the SQL Server destination or the OLE DB destination? This is one of the most frequently asked questions among SSIS developers. OLE DB is used more than SQL Server, which doesn't seem to make sense. If the target were an Oracle table we could understand that. But this is a SQL Server table, so of course it is better to use the SQL Server Destination, right?

The SQL Server destination has some things that an OLE DB destination doesn't. The left one is the SQL Server destination, the right one is the OLE DB destination:

One of the most important things to notice here is that in terms of features, there's not much difference. Both of them are suitable for loading data into a SQL Server table. If your table has triggers, then yes it matters. But who in data warehousing would create a dimension or fact table with a trigger anyway? Using the SQL Server Destination we can specify the last row, which is useful when you need to import only the first N rows (see my post here). But as the source is a SQL table anyway, we can of course use TOP N in our select statement. So again, there's not much difference.

So the most crucial question is: which one is faster? To answer that I imported a file with 1 million rows and 2 int columns into a SQL Server table, first using an OLE DB Destination, then using a SQL Server Destination. Property settings are all as above.

OLE DB Destination: 1st run: 3.229, 2nd run: 3.464, 3rd run: 3.978 seconds (without tablock it is 8.284, 8.330, 8.331)

SQL Server Destination: 1st run: 2.637, 2nd run: 2.574, 3rd run: 2.870 seconds.

So SQL Server Destination is faster than OLE DB Destination. SQL Server average is 2.694s, 24% faster than OLE DB which is 3.557s.

This runs on my Dell Inspiron 1545 laptop, Pentium Dual Core CPU T4500 @2.30 GHz each, 4 GB RAM, Windows 7 Home Premium 64-bit. Hard drive: WD5000BEVT, 500GB, 2.5″ x 1/8H SATA-300, 5400 RPM, external Data Transfer Rate 300MBps, average latency 5.5ms, seek time 2ms.

The main issue why SQL Server Destination is not used is not performance. The main issue is that we can't edit the SSIS package on our PC. We have to edit the SSIS package on the target SQL Server, which is ridiculous. If we edit the package on our PC we get an error message: "The selected data source is on a remote computer. The bulk insert operation can be executed only on the local computer."

This Connect post explains it. On that post Bob Bojanic mentioned that Microsoft would be fixing it in SQL 2011, enabling us to use the SQL Server Destination when editing the SSIS package on our PC. But a later post dated 13/6/08 from Microsoft said that they were not sure whether the fix would be done in SQL 2011 ("open for triage").

I still don’t understand why this restriction was there in the first place. And why it wasn’t issued as a 2008 patch, as to me the fix seems to be trivial: as Jonathan Bays said, “I would recommend removing this validation during design time and only put this restriction in at run time. If the package tries to load to a remote server, then it can throw an exception.”

27 February 2011

SSIS: Work Flow vs Stored Procedures

Filed under: SSIS — Vincent Rainardi @ 9:44 pm

When importing a file into a SQL table, we create a Work Flow. But for transferring data from one SQL Server table to another SQL Server table, is it better to use Execute SQL Task (Stored Procedures) or Work Flow?

This is a classic debate in SSIS. A lot of times in data warehousing we need to transfer data from the staging tables to the fact and dimension tables. Should we use SQL Task or Work Flow?

There are 4 main considerations here:

  1. Data quality checking
  2. ETL framework
  3. Performance
  4. Development time

Data Quality Checking

There are a number of data quality checks that we need to perform on the incoming data and log accordingly, potentially rejecting the incoming data. For example: data type validations, number of columns, whether the data is within a certain allowable range or conforms to a certain list, etc. These DQ checks should be built only once and used many times, avoiding redundant work. For that purpose, it is easier to build the DQ checks in the form of stored procedures, running dynamic SQL on many staging tables one by one. One of the main principles in DQ is that any silly data in the incoming file should not fail the data load. It should be gracefully recorded and the whole ETL batch should carry on. It is an order of magnitude more difficult to build the DQ routines as script tasks, which are executed before the data flows into the warehouse. On the other hand, data profiles are easier to build using the Data Profiling task. What I'm saying is that the decision whether to use a data flow or a stored procedure/Execute SQL Task is affected by how the DQ routines were built.
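As an illustration only (the DQRule and DQReject tables and their columns are made up for this sketch, not taken from any real framework), a metadata-driven range check in dynamic SQL could look something like this:

-- Sketch of a metadata-driven DQ range check using dynamic SQL
-- DQRule and DQReject are hypothetical tables; adjust to your own framework
declare @Table sysname, @Column sysname, @MinValue int, @MaxValue int, @SQL nvarchar(max);

select @Table = TableName, @Column = ColumnName,
       @MinValue = MinValue, @MaxValue = MaxValue
from DQRule
where RuleId = 1;

-- Record the offending rows in a reject table rather than failing the load
set @SQL = N'insert into DQReject (TableName, ColumnName, RejectedValue, RejectedDate)
             select ''' + @Table + ''', ''' + @Column + ''', ' + quotename(@Column) + ', getdate()
             from ' + quotename(@Table) + '
             where ' + quotename(@Column) + ' not between @Min and @Max;';

exec sp_executesql @SQL, N'@Min int, @Max int', @Min = @MinValue, @Max = @MaxValue;

Because the table and column names come from metadata, the same stored procedure can be pointed at many staging tables one by one, which is the reusability point made above.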

ETL Framework

In every data warehousing or data integration project that uses SSIS as the ETL tool, the first step is to build an ETL framework. This framework handles error checking, alert notification, task failures, logging, execution history, file archiving and batch control. It is built as a "parent child" package system, supported by a series of ETL metadata tables, as per chapter 10 of my book, e.g. a data flow table, a package table and a status table. What I'm saying here is that the decision of whether to use a data flow or stored procedures/Execute SQL Tasks is affected by your ETL framework. I know that it should be the other way around: the ETL framework should be built to accommodate both data flows and stored procedures. If that's the case in your project, that is excellent; there's no problem here. But practically speaking I've seen several cases where we could not implement a data transfer routine as a data flow because the ETL framework dictated that it had to be implemented as a stored procedure.

The next 2 points are the heart of the matter. They are the real reasons for choosing between the work flow approach and stored procedures, if it is a green field, meaning that you have complete freedom to choose, without any existing corporate rules/architecture affecting your decision.

Performance

Performance is about how fast the data load is. Given the same amount of data to load from the staging table into the main table, which one is the fastest method: using insert-select, or using a data flow? Generally speaking, if the data is less than 10k rows, there's no real difference in performance. It is how complicated your DQ checks are that slows it down, not whether it's a work flow or a stored procedure. If you are lucky enough to be involved in a project that loads billions of rows every day, you should be using a work flow; generally it is faster than a stored procedure. The main issue with using a stored procedure to do 1 billion upserts in a SQL Server database is the bottleneck on tempDB and the log files. Your DBA wouldn't be happy if you blew up tempDB from a nice 2 GB to 200 GB. Ditto with the log files.
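For reference, the kind of set-based upsert I mean is sketched below (the table and column names are purely illustrative); it is this single statement, run over hundreds of millions of rows, that hammers tempDB and the transaction log:

-- Sketch of a set-based upsert from staging into a dimension (illustrative names)
merge DimCustomer as target
using StagingCustomer as source
  on target.CustomerId = source.CustomerId
when matched then
  update set target.CustomerName = source.CustomerName,
             target.City = source.City
when not matched then
  insert (CustomerId, CustomerName, City)
  values (source.CustomerId, source.CustomerName, source.City);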

Using a work flow you can split a derived column transformation into several transformations, effectively boosting the throughput to up to twice as fast. See here for details from the SQLCat team. And this principle is applicable to any synchronous task, including the data conversion transform, lookup, row count, copy column and multicast. See here for an explanation about sync vs async tasks. The thing that gives us the most performance gain is to use multiple work flows to read different partitions of the source table simultaneously. This will of course create a bottleneck on the target, so it too needs to be partitioned, pretty much the same way as the source table. The other thing that increases performance is the use of cache on the lookup transformation. Using Full Cache, the entire lookup table is pulled into memory before the data flow is executed, so that the lookup operation is lightning fast. Using Partial Cache, the cache is built as the rows pass through. When a new row comes in, SSIS searches the cache (memory) for a match; only if it doesn't find one does it fetch the data from the database. See here for details. You don't get any of this when you use stored procedures to transfer the data.

Development Time

You may say that development time is a lesser consideration than performance when it comes to choosing between a work flow and a stored procedure. But in reality this factor is significant. I have seen several cases where the ETL developer is more comfortable coding in Transact SQL than using SSIS transformations. They are probably twice as fast building it in stored procedures as doing it in SSIS transformations, due to their past experience. Understandably, this is because the majority of so-called "SSIS developers" were "SQL developers" first. They may have been doing SSIS for 2 years, but they have been doing SQL stored procedures for 10 years. For example, many developers are more conversant doing date conversion in Transact SQL than in a Derived Column.
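To give a flavour of why: converting a datetime to a dd/mm/yyyy string is a one-liner in Transact SQL, whereas in a Derived Column you typically end up concatenating the day, month and year parts yourself:

-- One line in Transact SQL: datetime to dd/mm/yyyy (style 103)
select convert(varchar(10), getdate(), 103) as LoadDate;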

Conclusion

If you are lucky enough to be able to choose freely, work flow gives more performance and flexibility. But as with everything else in the real world, there are other factors which tie your hands, e.g. the data quality checking, the ETL framework and the development time.

As always I’d be glad to receive your comments and discussion at vrainardi@gmail.com. Vincent 27/2/11.

SSIS: Debugging a Script Component

Filed under: SSIS — Vincent Rainardi @ 10:17 am

I mistakenly assumed that we can debug a script component, as in putting breakpoints and inspecting the values of variables at run time. But no, in SSIS we cannot debug a script component. We can debug a script task, but not a script component. A script task is a control flow item, whereas a script component is a data flow item.

Because I have debugged a script task in the past, I thought we could debug a script component. So I put breakpoints on the code in my script component and then executed the data flow task. Alas, the data flow did not stop at the breakpoints; it carried on. I thought my Visual Studio 2008 installation was not right, i.e. it did not integrate with SQL Server BIDS. So I tried on my laptop, which only had SQL Server 2008 installed. I created a data flow with a script component and ran it, and I got the same thing: SSIS did not stop at the breakpoints inside the script component.

Then I read in MSDN that we can’t debug a script component: link. Because we can’t set breakpoints and inspect the variables, it suggests 4 methods:

  1. Message box (MessageBox.Show)
  2. Raise warning and error events
  3. Log messages
  4. Use a data viewer to see the output

In the end I did the debugging using the combination of:

  1. Message box
  2. Write to a text file (using StreamWriter)
  3. Set a package variable

SSIS: Importing Files – Read The First N Rows

Filed under: SSIS — Vincent Rainardi @ 8:17 am

Sometimes we need to read the first few rows of the file, not the whole file. Usually this is because the file contains a header that gives us information about the structure of the file, like this:

col1|col2|col3
a|a|a
col1|col2|col3|col4|col5
1|1|1|1|1
2|2|2|2|2
3|3|3|3|3

In the above example we want to read the first 2 rows, which are the header. But in SSIS there is no facility to read only the first N rows; we have to read the whole file, which is ridiculous if we have a file with 1 million rows. So how do we read only the first N rows?

To read the first N rows there are 3 alternatives:

  1. Use Konesans Row Number Transformation.
  2. Put all rows into a table then select top N from it.
  3. Use a script component as a source or transform.

Using Konesans Row Number Transformation

Thanks to Allan and Darren, Row Number Transformation gives us a row number for every row in the file. This row number is added as a column in the output of the data flow. For further details please see their web site, SQLIS.com. We can then use the Conditional Split to split the data flow, like this:

One output of the Conditional Split, called Header in this case, contains only the first 2 rows, which can then be loaded into the target table.

Using this method we have to read the entire file, as summarised nicely by CTKeane here. So practically speaking this approach is ideal if the file we are importing is small, e.g. 1000 rows. However, it is also suitable for importing big files, if we want to read or process the rest of the file as well. Say the file consists of 1 million rows. The first 2 rows are the header, and the remaining 999,998 rows are the details. So we can process the default output of the Conditional Split, which contains the detail rows.

The second thing we need to be aware of is that it is a third-party component which our infrastructure managers might not be willing to support. So before going down this route it is wise to check first with whoever is in charge of the production/UAT/dev servers in your company/project.

Putting all rows in a table then select the top N from it

In this approach we import the file into a staging table, then we select the top N rows from this staging table. Again this is OK if we want to process the rest of the file as well as the header. But if we don't want to read the rest of the file, we shouldn't use this solution.

If the file that we are importing consists of 2 sections like the above example, then when defining the flat file connection manager we need to define the file as 1 column. Only 1 column. When we query that 1-column table, we then split the long string into several columns based on the delimiters.

Transact SQL doesn't have a string split function like C# does. We can use CHARINDEX, which gives us the location of the delimiters. So the idea is that we crawl the long string character by character, and when we find a delimiter we use SUBSTRING to capture that column's data. For an example see this post on the SQLTeam web site.
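Here is a minimal sketch of that idea, assuming a hypothetical 1-column staging table called StagingRaw with a varchar column RowData, and a vertical bar delimiter:

-- Sketch: split one pipe-delimited string into columns using charindex/substring
-- StagingRaw(RowData varchar(1000)) is an assumed 1-column staging table
declare @Row varchar(1000), @Pos int, @NextPos int, @ColValue varchar(100);

select top 1 @Row = RowData from StagingRaw;

set @Pos = 1;
while @Pos <= len(@Row)
begin
    set @NextPos = charindex('|', @Row, @Pos);
    if @NextPos = 0 set @NextPos = len(@Row) + 1;  -- the last column has no trailing delimiter

    set @ColValue = substring(@Row, @Pos, @NextPos - @Pos);
    print @ColValue;  -- or insert it into a target column

    set @Pos = @NextPos + 1;
end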

Use a Script Component as a Source/Transform

If your flat file is of a static structure (fixed number of columns) then you can use the script component as a transform, and use flat file source as normal, as Jamie Thompson described here. But if your flat file contains multiple structures, like in the above example, it is probably best to use the script component as a source.

To use a script component as a transform, first create a flat file source pointing to the file. Then add a script component and choose Transform when SSIS asks us to select Source, Destination or Transform.

  • On the Connection Managers tab we don’t set anything.
  • On the Input Columns tab we select the columns that we want. These input columns will be automatically added to the Input 0 section of the Inputs and Outputs tab.
  • On the Inputs and Outputs tab, rename the Output 0 to something sensible, e.g. CustomerRecordFileOutput. Then create the Output Columns. Make these columns of type signed int 4 bytes. Also, on the Output 0, set the SynchronousInputID property to None, which makes it async (the default is the input column, i.e. sync).
  • Finally on the Script tab, set the ReadOnlyVariables to User::N, i.e. the package variable containing how many rows we want to read. This is important otherwise that package variable will not be recognised within the script. Then edit the script as follows (C#):
    // Row counter; declared at class level so that it keeps its value across rows
    private int i = 1;

    public override void Input0_ProcessInputRow(Input0Buffer Row)
    {
        // Only pass the first N rows to the output
        if (i <= Variables.N)
        {
            Output0Buffer.AddRow();
            Output0Buffer.Col1 = Row.col1;
            i++;
        }
    }

Basically, in the above code, we create a row counter called i. Note that i is declared at class level, outside the method; if it were declared inside Input0_ProcessInputRow it would be reset to 1 for every row and every row would be output. We increment i for every input buffer row. As long as i is less than or equal to N, we produce output; otherwise we don't produce any output. Leave the PreExecute and PostExecute methods as they are; we just need to amend the Input0_ProcessInputRow method as above.

The data flow looks like this:

If you use the script component as a source, this is how:

  1. Don’t create a flat file source, just create a flat file connection manager.

    Suppose our source file is like this:

    col1|col2
    1|1
    2|2
    3|3
    4|4
    5|5
    

    So create a flat file connection manager with 2 columns, tick the “Column names in the first data row” checkbox, format: Delimited, Header row delimiter: {CR}{LF}, column delimiter: vertical bar.

  2. Add the script component. Choose “As a Source” when asked whether it’s a source, destination or transformation.
  3. Edit the script component and on the Connection Managers tab add the flat file connection that we created in step 1 above. For example, let's call it ConnFile2. Like this:

  4. Then on the Inputs and Outputs tab, rename Output 0 to File2Output. Then click on the "Output Columns" and click Add Column. Call it Col1 and set the data type according to the file you are importing. Do the same for Col2:

  5. On the Script tab, add the package variable N to the ReadOnlyVariables:

  6. Then edit the script as follows: (based on MSDN article, Creating a Source with the Script Component, link)
public class ScriptMain : UserComponent
{
    private StreamReader SR;
    private string File2;

    public override void AcquireConnections(object Transaction)
    {
        // Get the file name from the ConnFile2 connection manager added in step 3
        IDTSConnectionManager100 ConnMgr = this.Connections.ConnFile2;
        File2 = (string)ConnMgr.AcquireConnection(null);
    }

    public override void PreExecute()
    {
        // Open a reader on the file
        base.PreExecute();
        SR = new StreamReader(File2);
    }

    public override void PostExecute()
    {
        // Close the reader
        base.PostExecute();
        SR.Close();
    }

    public override void CreateNewOutputRows()
    {
        string Row;
        string[] Cols;
        char[] Delimiter;
        Delimiter = "|".ToCharArray();

        // Skip the header first
        Row = SR.ReadLine();

        // Read the first N rows only
        for (int i = 1; i <= Variables.N; i++)
        {
            Row = SR.ReadLine();

            // Stop if the file has fewer than N data rows
            if (Row == null)
            {
                break;
            }

            Cols = Row.Split(Delimiter);
            File2OutputBuffer.AddRow();
            File2OutputBuffer.Col1 = Cols[0];
            File2OutputBuffer.Col2 = Cols[1];
        }
    }
}

The Data Flow looks like this:

29 July 2010

SSIS: Looping with ADO Enumerator

Filed under: SSIS — Vincent Rainardi @ 11:43 am

In SSIS, using the Foreach Loop Container we can loop around 7 different things: file, item, ADO, ADO.NET Schema Rowset, Variable, Nodelist and SMO.

  1. File: we loop around a collection of files in a folder
  2. Item: we loop around rows of data that we define ourselves (can be more than 1 column)
  3. ADO: we loop around rows of an existing table
  4. ADO.NET schema rowset: we loop around a collection of any of these: Catalogs, Character Sets, Check Constraints, Collations, Column Privileges, Columns, Columns Domain Usage, Constraint Column Usage, Constraint Table Usage, Foreign Keys, Indexes, Key Column Usage, Primary Keys, Procedure Columns, Procedure Parameters, Procedures, Provider Types, Referential Constraints, Schemata, Statistics, Table Constraints, Table Privileges, Tables, Translations, Usage Privileges, View Column Usage, Views, View Table Usage, SQL Languages, DB Info Keywords, DB Info Literals, Check Constraints by Table, Table Statistics, Tables Info, and Trustees.
  5. Variable: we loop around the values of a variable, for example a DataTable where we loop around the rows.
  6. Nodelist: we loop around a set of XML nodes, which are the results of applying XPath expression to an XML file.
  7. SMO: we loop around SQL Server Management Objects, such as tables, users, statistics, jobs, traces, brokers.

In this article we will use the ADO enumerator. We'll create a package that does the following:

  1. Create a DataSet
  2. Create a DataTable in the DataSet
  3. Populate the DataTable with 2 rows, row 1: file1.txt, row 2: file2.txt
  4. Assign that DataSet to variable Var1
  5. Have a FELC ADO Enumerator with source = Var1 to loop around the rows of the DataTable
  6. Set the Variable Mapping of this FELC to Var2
  7. Inside this FELC loop create a Script Task that checks if a file with file name = the value of Var2 exists or not.

The overall package looks like this:

True, we don't need to use a FELC to loop around the DataTable rows. We could do it in the first C# Script Task, right after step 3 (populating the DataTable). The reasons we might want to use a FELC are a) to use SSIS functionality such as transactions and checkpoints, and b) to reuse the variable that was transferred to the FELC in downstream task(s).

1-4 Create & Populate DataTable, assign to Var1

OK let’s start. First create a variable called Var1, of type object, scope = package:

Create the Script Task, and set Var1 as ReadWrite variable:

Edit Script, and create the DataSet and DataTable:

// Create the table
DataSet DS1 = new DataSet();
DataTable T1 = new DataTable();
DataColumn C1 = new DataColumn();
C1.DataType = Type.GetType("System.String");
C1.ColumnName = "Col1";
T1.Columns.Add(C1);
DS1.Tables.Add(T1);

Usually we populate the datatable from a database table, but in this instance I’m going to populate it from a file:

// Populate the table from the file
StreamReader sr1;
sr1 = new StreamReader("C:\\Data\\FELC per row\\list.txt");
String s1;
DataRow DR1;
while ((s1 = sr1.ReadLine()) != null)
{
    DR1 = DS1.Tables[0].NewRow();
    DR1["Col1"] = s1;
    DS1.Tables[0].Rows.Add(DR1);
}
sr1.Close();

The List.txt file only contains 2 lines:

After populating the DataSet, we assign it to Var1:

// Set Variable
Dts.Variables["Var1"].Value = DS1;
Dts.TaskResult = (int)ScriptResults.Success;

5. Create the Foreach Loop Container

Have a FELC ADO Enumerator with source = Var1 to loop around the rows of the DataTable

Set the collection of the FELC to ADO enumerator, rows in the first table:

6. Set the Variable Mapping

Set the Variable Mapping of this FELC to Var2 (create Var2 first, scope = package, type = string):

This means that every row on the DataTable will be assigned to Var2. It’s the first column only that will be assigned (Index 0 = first column).

7. Add a Script Task inside the loop

Add a Script Task inside the loop to process each row. Set Var2 as a ReadOnlyVariable because we want to use it within the script.

Display the value of Var2 and check the existence of the file named in Var2:

string Var2 = Dts.Variables["Var2"].Value.ToString();
System.Windows.Forms.MessageBox.Show(Var2);
if (File.Exists("C:\\Data\\FELC per row\\" + Var2))
{
    System.Windows.Forms.MessageBox.Show(Var2 + " exists");
}
else
{
    System.Windows.Forms.MessageBox.Show(Var2 + " does not exist");
}
Dts.TaskResult = (int)ScriptResults.Success;


Now if we run it, it will display the value of Var2 and whether that file exists or not:

Rainer Stropek, founder of "software architects", wrote a good article about the Foreach ADO Enumerator on The Code Project.
