Data Warehousing and Business Intelligence

27 February 2011

SSIS: Importing Files – Read The First N Rows

Filed under: SSIS — Vincent Rainardi @ 8:17 am

Sometimes we need to read only the first few rows of a file, not the whole file. Usually this is because the file starts with a header section that tells us about the structure of the file, like this:

col1|col2|col3
a,a,a
col1|col2|col3|col4|col5
1,1,1,1,1
2,2,2,2,2
3,3,3,3,3

In the above example we want to read only the first 2 rows, which form the header. But SSIS has no built-in facility for reading just the first N rows; we have to read the whole file, which is ridiculous if the file has 1 million rows. So how do we read only the first N rows?

To read the first N rows there are 3 alternatives:

  1. Use Konesans Row Number Transformation.
  2. Put all rows into a table then select top N from it.
  3. Use a script component as a source or transform.

Using Konesans Row Number Transformation

Thanks to Allan and Darren, Row Number Transformation gives us a row number for every row in the file. This row number is added as a column in the output of the data flow. For further details please see their web site, SQLIS.com. We can then use the Conditional Split to split the data flow, like this:

One output of the Conditional Split, called header in this case, contains only the first 2 rows, which can then be loaded into the target table.
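Assuming the Row Number Transformation's output column is called RowNumber (the column name is configurable in the component), the condition on the header output of the Conditional Split would be an SSIS expression along these lines:

```
RowNumber <= 2
```

Rows matching the expression go to the header output; everything else falls through to the default output.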

Using this method we still have to read the entire file, as summarised nicely by CTKeane here. So practically speaking this approach is ideal if the file we are importing is small, e.g. 1000 rows. However, it is also suitable for big files if we want to process the remainder of the file as well. Say the file consists of 1 million rows: the first 2 rows are the header, and the remaining 999,998 rows are the details. We can then process the default output of the Conditional Split, which contains the detail rows.

The second thing we need to be aware of is that it is a third-party component, which our infrastructure managers might not be willing to support. So before going down this route it is wise to check first with whoever is in charge of the production/UAT/dev servers in your company/project.

Putting all rows into a table then selecting the top N from it

In this approach we import the whole file into a staging table, then select the top N rows from that staging table. Again, this is fine if we want to process the rest of the file as well as the header. But if we don't want to read the rest of the file, we can't use this solution.
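As a sketch of this approach, suppose the file has been loaded into a hypothetical staging table Stg_File with a single wide column RawLine, plus an IDENTITY column RowNo added during the load to preserve file order:

```sql
-- RowNo is an IDENTITY column, so ORDER BY RowNo preserves
-- the order in which the rows were loaded from the file.
SELECT TOP (2) RawLine
FROM Stg_File
ORDER BY RowNo;
```

Note that TOP without an ORDER BY gives no guarantee about which rows come back, which is why the identity column matters here.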

If the file that we are importing consists of 2 sections, like the above example, then when defining the flat file connection manager we need to define the file as 1 column. Only 1 column. When we query that 1-column table we then split the long string into several columns based on the delimiters.

Transact-SQL doesn't have a built-in string split function like C# does. We can use charindex, which gives us the position of a delimiter. So the idea is that we walk along the long string, and each time we find a delimiter we use substring to capture that column's data. For an example see this post on the SQLTeam web site.
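To illustrate the charindex/substring idea, here is a minimal sketch for a row with 3 pipe-delimited values (the variable and column aliases are made up for the example):

```sql
DECLARE @s varchar(100) = 'a|b|c';
DECLARE @p1 int = CHARINDEX('|', @s);            -- position of the first pipe
DECLARE @p2 int = CHARINDEX('|', @s, @p1 + 1);   -- position of the second pipe

SELECT SUBSTRING(@s, 1, @p1 - 1)               AS col1,  -- 'a'
       SUBSTRING(@s, @p1 + 1, @p2 - @p1 - 1)   AS col2,  -- 'b'
       SUBSTRING(@s, @p2 + 1, LEN(@s) - @p2)   AS col3;  -- 'c'
```

For a variable number of columns we would loop with a WHILE, moving the start position past each delimiter as it is found.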

Use a Script Component as a Source/Transform

If your flat file has a static structure (a fixed number of columns) then you can use the script component as a transformation, with a flat file source as normal, as Jamie Thompson described here. But if your flat file contains multiple structures, like the above example, it is probably best to use the script component as a source.

To use a script component as a transform, first create a flat file source pointing to the file. Then add a script component and choose Transform when SSIS asks us to select Source, Destination or Transform.

  • On the Connection Managers tab we don’t set anything.
  • On the Input Columns tab we select the columns that we want. These input columns will be automatically added to the Input 0 section of the Inputs and Outputs tab.
  • On the Inputs and Outputs tab, rename Output 0 to something sensible, e.g. CustomerRecordFileOutput, then create the Output Columns and set their data types to match the file (in this example, four-byte signed integer). Also, on Output 0, set the SynchronousInputID property to None, which makes the output asynchronous (the default is the input, i.e. synchronous).
  • Finally on the Script tab, set the ReadOnlyVariables to User::N, i.e. the package variable containing how many rows we want to read. This is important otherwise that package variable will not be recognised within the script. Then edit the script as follows (C#):
    int i = 1;  // row counter; a class-level field so it persists across input rows

    public override void Input0_ProcessInputRow(Input0Buffer Row)
    {
        if (i <= Variables.N)
        {
            Output0Buffer.AddRow();
            Output0Buffer.Col1 = Row.col1;
            i++;
        }
    }

Basically, in the above code we create a row counter, i, and increment it for every input buffer row. As long as i is less than or equal to N we produce an output row; otherwise we don't. Leave the PreExecute and PostExecute methods as they are; we only need to amend the Input0_ProcessInputRow method as above.

The data flow looks like this:

If you use the script component as a source, this is how:

  1. Don’t create a flat file source, just create a flat file connection manager.

    Suppose our source file is like this:

    col1|col2
    1|1
    2|2
    3|3
    4|4
    5|5
    

    So create a flat file connection manager with 2 columns, tick the “Column names in the first data row” checkbox, format: Delimited, Header row delimiter: {CR}{LF}, column delimiter: vertical bar.

  2. Add the script component. Choose “As a Source” when asked whether it’s a source, destination or transformation.
  3. Edit the script component and on the Connection Managers add the flat file connection that we created on step 1 above. For example, let’s call it ConnFile2. Like this:

  4. Then on the Inputs and Outputs, rename Output0 to File2Output. Then click on the “Output Columns” and click Add Column. Call it Col1 and set the data type according to the file you are importing:

  5. On the Script tab, add the package variable N to the ReadOnlyVariables:

  6. Then edit the script as follows: (based on MSDN article, Creating a Source with the Script Component, link)
public class ScriptMain : UserComponent
{
    // StreamReader needs "using System.IO;" at the top of the script
    private StreamReader SR;
    private string File2;

    public override void AcquireConnections(object Transaction)
    {
        IDTSConnectionManager100 ConnMgr = this.Connections.ConnFile2;
        File2 = (string)ConnMgr.AcquireConnection(null);
    }

    public override void PreExecute()
    {
        base.PreExecute();
        SR = new StreamReader(File2);
    }

    public override void PostExecute()
    {
        base.PostExecute();
        SR.Close();
    }

    public override void CreateNewOutputRows()
    {
        string Row;
        string[] Cols;
        char[] Delimiter;
        Delimiter = "|".ToCharArray();

        //Skip the header first
        Row = SR.ReadLine();

        for (int i = 1; i <= Variables.N; i++)
        {
            Row = SR.ReadLine();
            if (Row == null) break;  // the file has fewer than N data rows
            Cols = Row.Split(Delimiter);
            File2OutputBuffer.AddRow();
            File2OutputBuffer.Col1 = Cols[0];
            File2OutputBuffer.Col2 = Cols[1];
        }

    }

}

The Data Flow looks like this:


5 Comments »

  1. […] specify the last row, which is useful when you need to import only the first N rows (see my post here). But as the source is a SQL table anyway we can of course the Top N on our select statement. So […]

    Pingback by SQL Server Destination or OLEDB « Data Warehousing and Business Intelligence — 28 February 2011 @ 12:30 am | Reply

  2. Thank you very much for this article. It help me a lot!!!

    Comment by Leo — 1 November 2011 @ 9:13 pm | Reply

  3. There is an alternative way to the Row Number Transformation, see http://support.microsoft.com/kb/186133

    Comment by todorp — 19 December 2011 @ 4:25 am | Reply

  4. You can use T-SQL to get first N rows from file:
    BULK INSERT [tablename] FROM 'filename' WITH (FIRSTROW=1, LASTROW=N)

    Comment by Koit — 18 October 2012 @ 3:42 pm | Reply

