Data Warehousing and Business Intelligence

5 March 2011

SSIS: Importing a File with Dynamic Columns

Filed under: SQL Server,SSIS — Vincent Rainardi @ 9:25 am
Tags: ,

One of the most difficult things to deal with in the ETL is importing a file with dynamic columns. The file does not have a set number of columns. The number of columns changes from time to time. For example, the file can have 5 columns, or 20 columns, or anything in between.

Below are some examples of such a file:

File1:
Col1,Col2,Col3t1,Col3t2,Col3t3
a,a,1,1,1
b,b,2,2,2
c,c,3,3,3

In the above example the number of columns for Column3 is 3. But it could be 4, 5, 6 or more, we don’t know.

Before we look at the ETL side, we need to look at the data modelling side first. What kind of target table should we create to host these files.

Data Modelling

Essentially there are 2 choices.

1. We could make it normalised like this:

Table1:
Col1,Col2,Col3Type,Col3Value
a,a,t1,1
b,b,t2,1
c,c,t3,1
a,a,t1,2
b,b,t2,2
c,c,t3,2
a,a,t1,3
b,b,t2,3
c,c,t3,3

In principle we design the structure so that we have a fixed number of columns. It doesn’t matter how many columns in the source file, we will always have 4 columns in the table.

The drawback of this design is when we query it. If we need to produce File1 from Table1, we would need to pivot the table like this:

select Col1, Col2, t1 as Col3t1, t2 as Col3t2, t3 as Col3t3
from
( select Col1, Col2, Col3Type, Col3Value
from Table3
) P1
pivot
( sum(Col3Value) for Col3Type in (t1, t2, t3)
) as P2

Output:
Col1,Col2,Col3t1,Col3t2,Col3t3
b,a,1,1,1
a,b,2,2,2
c,c,3,3,3

If we have 2 tables like that we need to join them together on the common column(s) and it becomes even trickier. If we are dealing with 1 million rows column, we risk a serious performance degradation. This might not be a problem for a data mart, but if it is for a transaction system where millisecond response time is expected, this design does not serve the purpose.

Alternative Design

The alternative is to find out how many columns Col3 is. We know it could be 3 to 5, but could it be 100 columns. Spending 5 minutes with the subject matter expert to understand the characteristic of Col3, is well worth it. I have seen this dynamic columns case 3 times (the first time was about 12 years ago) and on every case they followed the book/theory (to normalise) and everytime they regretted it. On the second case (about 8 years ago) when I was involved in troubleshooting the performance issue, in the end we changed the table structure to denormalised the table structure.

So, I would recommend to find out how many columns Col3 is. 90% of the chance is: it is not that many. The response I usually get from the SME is something like this: “Most likely Col3 varies between 3 and 5 columns, it certainly won’t exceed 8″. There is a natural meaning of this data, that’s why the range is limited. For example: Col3 could be the number of teams, or the number of projects, or the number of web sites, or the number of product types. In the business world most of the data is within a certain range. You can’t have 100 teams for example. It’s physically impossible for a company as it is limited by budget constraints.

So once we have that information (the maximum number of columns), then we create the table according to this limitation. Let’s assume that the SME said that the max is 8 columns. So we design Table1 with 10 columns (just in case) as follows:

create table table1
( Col1 varchar(10),
  Col2 varchar(10),
  Col3t1 int,
  Col3t2 int,
  Col3t3 int,
  Col3t4 int,
  Col3t5 int,
  Col3t6 int,
  Col3t7 int,
  Col3t8 int,
  Col3t9 int,
  Col3t10 int
)

Unless it is a staging table, don’t forget to add a PK column. It is a good design to be able to uniquely identify a row at all times. In this example I don’t add a PK column because the target is a staging table.

A table structure with a fixed number of columns like this makes our lives a lot easier. It directly mimic the structure of the source file, and able to accommodate the dynamic nature of Col3.

The Data Load

I’m going to explain the ETL for the denormalised table, i.e. the one with a fixed 10 columns for Col3. The ETL for the normalised table, i.e. the table with 4 columns will have to wait until next time I’m afraid, as I have a few other posts to write next week, plus a presentation for SQLBits 8, Advanced Dimensional Modelling. If you need the ETL for the normalised table with 4 columns please contact me.

Because of the dynamic nature of Col3 in the source file, the best way is to use the Script Component as a Source. There are a lot of sources on the internet Script Component. This post shows you the basics. This post (by Lee Everest) shows you how to use a Script Component to split the rows in the source file. Similarly, this MSDN post shows how to import a delimited source file using a Script Component.

The main concept is as follows: First create a Connection Manager for a File Connection (no need to use Flat File connection). Name this connection File1, like this:

Then create a workflow, and drag the Script Component into the workflow. When asked, specify as source.

Then:

  • On the Connection Managers tab, set the connection to File1 (that’s the middle column). On the first column, name it as “Conn1″.

  • On the Inputs and Outputs tab, rename the “Output 0″ as File1Output. Then create the Output Columns one by one, as per the structure of Table2:

    Col1, Col2, Col3t1, Col3t2, … Col3t10.


  • In this example we will convert the data type for col3 from string to int not in the Script Component, but later on using Derived Column / Data Conversion transformation. So for all columns, set the data type to String, length 10.

  • On the Script tab, click on the Edit Script and edit it as follows. Don’t forget to add “Using System.IO;” and if you want to use MessageBox for debugging you need to add “using System.Windows.Forms;”. Remember that you can’t debug an SSIS Script Component (see my post here), but you can debug SSIS Script Task.
private StreamReader SR;
private string File1;

public override void AcquireConnections(object Transaction)
{
  // Get the connection for File1
  IDTSConnectionManager100 CM = this.Connections.File1Conn;
  File1 = (string)CM.AcquireConnection(null);
}

public override void PreExecute()
{
  // Create a reader for File1
  base.PreExecute();
  SR = new StreamReader(File1);
}

public override void CreateNewOutputRows()
{
  // Declare variables
  string nextLine;
  string[] columns;
  char[] delimiters;
  int Col3Count;
  String[] Col3Value = new string[10];

  // Set the delimiter
  delimiters = ",".ToCharArray();

  // Read the first line (header)
  nextLine = SR.ReadLine();

  // Split the line into columns
  columns = nextLine.Split(delimiters);

  // Find out how many Col3 there are in the file
  Col3Count = columns.Length - 2;
  // MessageBox.Show(Col3Count.ToString());

  // Read the second line and loop until the end of the file
  nextLine = SR.ReadLine();
  while (nextLine != null)
  {
    // Split the line into columns
    columns = nextLine.Split(delimiters);
    {
      // Add a row
      File1OutputBuffer.AddRow();

      // Set the Col3Value variable to the values in the file
      for (int i = 0; i <= Col3Count - 1; i++)
      {
        Col3Value[i] = columns[i + 2];
      }

      // Set the values of the Script Component output according to the file content
      File1OutputBuffer.Col1 = columns[0];
      File1OutputBuffer.Col2 = columns[1];
      File1OutputBuffer.Col3t1 = Col3Value[0];
      File1OutputBuffer.Col3t2 = Col3Value[1];
      File1OutputBuffer.Col3t3 = Col3Value[2];
      File1OutputBuffer.Col3t4 = Col3Value[3];
      File1OutputBuffer.Col3t5 = Col3Value[4];
      File1OutputBuffer.Col3t6 = Col3Value[5];
      File1OutputBuffer.Col3t7 = Col3Value[6];
      File1OutputBuffer.Col3t8 = Col3Value[7];
      File1OutputBuffer.Col3t9 = Col3Value[8];
      File1OutputBuffer.Col3t10 = Col3Value[9];

    }

    // Read the next line
    nextLine = SR.ReadLine();
  }
}

public override void PostExecute()
{
  // Close the reader
  base.PostExecute();
  SR.Close();
}
  • Save the script, close SSIS script editor, click OK to get out of the Script Transformation Editor, back to the Data Flow.
  • I would suggest not to put the output into a table in the database directly, but put the output to a file first to check the result. This way you don’t need to worry about data conversion between string and int etc. Create a flat file connection with 12 columns. All the columns are string, with length = 10, as follows:

  • Run it. If it fails, check the source file and make it like this:

    Also check the script, make sure it is like above.

  • Create an OLE DB Destination or SQL Server Destination (see my post here re why practically speaking we can’t use SQL Server Destination). But create the table first by executing the DDL I wrote above. Note that if you are using SQL Server Destination (because you are doing it to your local instance) then notice that on Windows 7 or Vista you need to run BIDS as Administrator, otherwise you’ll get this error: Unable to prepare SSIS bulk insert, so something like that.
  • Create a data conversion to convert Col3 columns from string to int, like this:

  • Connect the output of the data conversion to the DB and map the DC output columns to the DB columns like this:

  • Now execute it again, and check the target table:

  • Overall the Data Flow looks like this:

As usual I welcome any question and discussion at vrainardi@gmail.com. Vincent, 5/3/11.


About these ads

10 Comments »

  1. [...] Informatica article that I’d like to write: Importing a File with Dynamic Columns, i.e. like this but in [...]

    Pingback by Coming SSIS Articles « Data Warehousing and Business Intelligence — 7 June 2011 @ 6:32 pm | Reply

  2. Hi,
    How about if we have source as sql and target as oracle?

    Comment by Preeti — 1 October 2012 @ 8:59 am | Reply

    • What about it Preeti? Yes of course you can have SQL Server as your source and Oracle as your target. SSIS can interface with most RDBMS.

      Comment by Vincent Rainardi — 2 October 2012 @ 6:11 pm | Reply

      • Will elaborate my question ?I have columns dynamically changing on source side like adding,deleting which is SQL ,have my target as Oracle ,how to handle dynamically the source changes to be in sync with target (Oracle) structure before loading.I knw in any ETL process this has to be manually done like refreshing the source then updating the target table structure.Cannot this be automated?We have java transformation in ETL so cant we make use of it to automate the process ?If so then how?If not,are there any workarounds to do this?

        Thanks in advance.
        preeti

        Comment by Preeti — 3 October 2012 @ 5:03 am

      • Hi Preeti, please see this article (link)

        Comment by Vincent Rainardi — 3 October 2012 @ 6:01 pm

  3. [...] See Preeti’s comment 3/10/12 on “SSIS: Importing a File with Dynamic Columns” article (link): [...]

    Pingback by SSIS: Automating DDL Changes « Data Warehousing and Business Intelligence — 3 October 2012 @ 5:58 pm | Reply

  4. It would work fine for a stage table, and I even might add columns to my tables (add column3t3 int etc. ) but how I’m gonna query those new columns for my reports? Every query I have would have to be dynamic?

    That’s why I’d rather go with the EAV approach…

    Comment by Joe — 21 March 2013 @ 8:14 pm | Reply

    • Thanks Joe for your comment. Yes you are right Entity Attribute Value data model is more flexible, particularly for reporting. I wrote this article because at that time our data source files have dynamic number of columns. I thought it could be useful if other people were in the same situation.
      Kind regards,
      Vincent

      Comment by Vincent Rainardi — 21 March 2013 @ 9:06 pm | Reply

      • Thanks for your reply .But I have a scenerio where the columns in source table dynamically changes and I cannot go and recreate target table everytime to be in sync with source structure and refresh my etl job .. Another challenge here is source and target db are different ,source is oracle target is sqlserver or vice versa . in that case the data types also differ. Is there any way to automate this .Am using informatica right now .If you take SSIS also ,we might be having java and xml transformations is there any way where we can do mapping of data types and address this dynamic changes instead of manual intervention every time which is tedious.

        Thanks in advance. Preeti

        On 3/22/13, Data Warehousing and Business Intelligence

        Comment by pretty — 22 March 2013 @ 6:12 am

      • Hi Pretti, in Informatica you can define the source as a file with just 1 column then use an Expression Transformation using INSTR and SUBSTR to split the column into several strings based on the delimiter. Then use Normalizer to create multiple rows based on a single row (parameter OCCUR = the number of columns, which we have determined in advance).

        Comment by Vincent Rainardi — 22 March 2013 @ 10:10 pm


RSS feed for comments on this post. TrackBack URI

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

The Rubric Theme. Blog at WordPress.com.

Follow

Get every new post delivered to your Inbox.

Join 204 other followers

%d bloggers like this: