Data Warehousing and Data Science

28 June 2011

Processing ROLAP Cube and ROLAP Dimension

Filed under: Analysis Services — Vincent Rainardi @ 9:41 pm

Someone asked me today: he created a ROLAP cube with MOLAP dimensions. He made some changes to a dimension and did a process full on the cube, expecting to see the changes when browsing the cube, but he couldn’t. He was wondering why, because if he processed full the database he could see the changes when browsing the cube.

After trying out a few things myself, I responded to him. My response could be useful for somebody else, so I am sharing it here. I understand there are many AS experts out there, and if any of the below is wrong, I will be glad to learn from them.

Process full does not affect all objects. Process full affects only the object that we process.
If we do process full on a cube, then only the cube will be processed. The dimensions will not be processed.
If we do process full on a dimension, then only that dimension will be processed. The cube will not be processed.
To process everything (both cube and dimensions), we need to process the database.
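
If you prefer to script these three cases rather than click through the processing dialog, they can also be expressed with AMO (Analysis Management Objects). Below is a minimal sketch; the server, database, cube and dimension names are made up for illustration. The point is that each Process call only touches the object it is invoked on.

using Microsoft.AnalysisServices;

class ProcessingScopeDemo
{
    static void Main()
    {
        Server server = new Server();
        server.Connect("Data Source=localhost");               // assumed server name

        Database db = server.Databases.FindByName("SalesDW");  // assumed database name

        // Process full on one dimension: only this dimension is processed
        db.Dimensions.FindByName("Dim Product").Process(ProcessType.ProcessFull);

        // Process full on one cube: only that cube is processed, the dimensions are not touched
        db.Cubes.FindByName("Sales").Process(ProcessType.ProcessFull);

        // Process the database: the cube(s) and the dimensions are all processed
        db.Process(ProcessType.ProcessFull);

        // Process data / process index work the same way, via ProcessType.ProcessData
        // and ProcessType.ProcessIndexes
        server.Disconnect();
    }
}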

With a ROLAP cube, if you add a row to your fact table, you do not need to process anything. You will automatically see the new row when you browse the cube.

But, in a cube there are measure groups (MG). And in a measure group there are partitions.
If you set the cube to ROLAP, but the MGs and partitions are still MOLAP, then you won’t see the changes in the fact table.
If you set the cube and the MGs to ROLAP but the partitions are still MOLAP, then you won’t see the changes in the fact table.

Because the partitions are MOLAP, you need to process the cube to see the changes in the fact table.
It is the partition that determines the effective storage mode.
If you set the cube and the MGs to MOLAP but the partitions to ROLAP, then you will see the changes in the fact table without processing anything.
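
One way to check where the storage mode is actually set is to walk from the cube down to its measure groups and partitions, for example with an AMO sketch like the one below (again, the server, database and cube names are made up for illustration):

using System;
using Microsoft.AnalysisServices;

class StorageModeDemo
{
    static void Main()
    {
        Server server = new Server();
        server.Connect("Data Source=localhost");            // assumed server name

        Cube cube = server.Databases.FindByName("SalesDW")  // assumed database name
                          .Cubes.FindByName("Sales");       // assumed cube name

        // StorageMode exists at cube, MG and partition level, but it is the
        // partition setting that determines how the data is actually stored
        Console.WriteLine("Cube: " + cube.StorageMode);
        foreach (MeasureGroup mg in cube.MeasureGroups)
        {
            Console.WriteLine("  MG " + mg.Name + ": " + mg.StorageMode);
            foreach (Partition p in mg.Partitions)
            {
                Console.WriteLine("    Partition " + p.Name + ": " + p.StorageMode);
            }
        }

        server.Disconnect();
    }
}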

Impact analysis:
If you do a process full on a dimension and do impact analysis, you get the partition, the MG and the cube.
If you do a process update on a dimension and do impact analysis, you don’t get anything, which means that (in theory) you don’t need to process the cube.
If you do a process full on a cube and do impact analysis, you don’t get the dimensions.

As I said above, if your cube is ROLAP, and you change your fact table, you do not need to process anything. You will automatically see the changes when you browse the cube.
But, if your dimension is ROLAP, and you change the value on the dimension table, then you need to process the dimension to see the change.

You can’t see the changes you made on the dimension table if you don’t process the dimension.
You don’t need to do a process full on the dimension. A process update or process data is enough to see the changes.

Going back to “Process full does not affect all objects”: if you want to process the affected objects as well, select “Change Settings” and choose “Process affected objects”. For example, when doing a process full on a dimension, if you select “Process affected objects” you will also process the partition, the MG and the cube.

Going back to “Process full affects only the object that we process”: process full processes both the data and the indexes.
Process data processes the data only; process index processes the indexes only.
But process full only processes the object we process, not all objects in the database.

Hope this helps. Vincent Rainardi, 28/6/2011.

27 June 2011

The Principle of “Do It Upstream” in Data Warehousing

Filed under: Data Warehousing — Vincent Rainardi @ 10:15 am

A data warehouse is fed from source systems. A calculation is better performed upstream in the source system than in the data warehouse. This is what I call the principle of “Do It Upstream” in data warehousing.

Why?

Why do we need to do it upstream? Why don’t we calculate it in the data warehouse? Because:

a) It is easier and more economical
b) The result can be used by other downstream systems, not only the DW
c) It forms a logical grouping architecturally

For example, say a credit risk system A is responsible for calculating various measures, ratios and indicators, and we then need 1 more ratio which is not already calculated in system A. It is easier to calculate it in system A, because that system is already calculating various ratios: the new ratio will be in the same place/program as the other calculations, and it will already have access to the database columns it requires.

What I mean by “logical grouping” in point c) above is that when the data is published/sent to downstream systems, the new ratio can be included in the existing group of ratios and published together with them, with minimal extra effort.

Exceptions

As always, there are exceptions to everything. There are 2 exceptions to this principle:

a) If the calculation involves data from several source/upstream systems, then we have to perform it in the DW, no argument about it.

b) If the calculation is complex and the DW is better equipped to do it, it could be better to perform it in the DW. An example of this is customer classification, e.g. a mobile phone company that wants to classify its customers based on spending patterns.

The Politics

It is usually a hard fight to ask an upstream system to do the work for a downstream system. “Why should I?” and “What do I get from this?” will have to be answered in their favour before the management of the upstream system agrees to do it.

And why should they, indeed. Think about it from their point of view: you are asking them to do some work, out of their own budget, which produces no benefit to them. No one in their right mind would agree to that. This is why, in reality, the work of adding an extra calculation usually falls under the DW project umbrella: “You need it, you do it”.

There are ways to deal with this politics. These are the 2 usual ways:

a) Ask somebody in a higher position who oversees both of you. Present the case as “This company will benefit $x if we do this upstream”. As always, when dealing with C level, just “show them the money”. Anybody in their right mind will follow the money.

b) Pay them. Cross-charge the activity. Totally funded by you, and therefore (important!) controlled by you.

Fighting with Yourself

This is a much harder fight. You yourself want the work to fall into your hands. As a DW Manager, you want this work to be done in the DW. You will get more budget (obviously, it would be foolish not to ask for an increase in money) and you will get more resources (time and people), meaning more power. In a resource-constrained project this could be a welcome relief. Everybody under you (Mr. Architect, Mr. Analyst, Mr. Developer, Mr. Tester) would prefer the work to be done in the DW, for obvious reasons that benefit them.

There are 2 usual ways to look at this:

a) By giving the work to the upstream system, you gain a degree of control over them, and possibly a promotion to a position overseeing both departments.

b) You build a closer relationship with the CEO (as you have to present the case to him; see the politics section above).

Closure

Not sure why I wrote about politics and management today. Probably because it just flows naturally from the “Upstream” topic. Probably because of the hot summer weather (32°C today in London!). I think I had better stay on the technical side, dealing with the “how to make it happen”. It’s simpler and less of a headache 🙂

Vincent Rainardi, 27/6/2011

26 June 2011

Import/Export Files with Variable File Name Set at Run Time

Filed under: SSIS — Vincent Rainardi @ 6:37 am

In ETL we often need to import files. But these files usually don’t have a fixed file name; the name varies. The file name changes every day, because it contains a date. For example, yesterday it was called trans_20110627, today it is called trans_20110628, and tomorrow it will be called trans_20110629.

How do we import them? We set the file name at run time.

How do we set the file name at run time? Using a script task, at run time we set the value of a variable. This variable is used as the file name of the file.

This might be one of the “oldest tricks in the book”, but still, I found that many people don’t know it. So it’s probably worth writing a post about it.

So here’s how to do it, step by step:

First, create a variable called TransFileName, of type string, scope = package:

Then create a script task on the control flow, call it Set File Name:

Edit the script task and put the TransFileName variable in the ReadWriteVariables property:

Then click on Edit Script:

For simplicity, in the above script I just put a fixed file name. But in a real project the logic would be, for example, to find today’s file or to find the latest file, which is a topic for another blog post. I’ll write that “find the latest file using a C# script” post in the next few days and put the link here.
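
As a rough illustration, the Main() method of the script could look like the sketch below, which builds today’s file name from a date pattern. The folder path and the trans_yyyyMMdd.txt pattern are assumptions for this example; replace them with your own logic. (The generated ScriptMain class already has using System at the top, and the ScriptResults enum is generated by SSIS.)

public void Main()
{
    // Assumed folder and file name pattern (trans_yyyyMMdd.txt); adjust to your environment
    string folder = "C:\\SourceFiles\\";
    string fileName = "trans_" + DateTime.Today.ToString("yyyyMMdd") + ".txt";

    // Set the package variable; it must be listed in the task's ReadWriteVariables
    Dts.Variables["TransFileName"].Value = folder + fileName;

    Dts.TaskResult = (int)ScriptResults.Success;
}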

Save the ScriptMain.cs window and close it. Click OK to come out of the Script Task Editor.

Create a Data Flow task:

Double click the Data Flow Task to edit it. Create a Flat File Source and an OLE DB Destination and connect them.

Double click on the Trans File to edit it. Click New on the Flat File Source Editor. Click Browse and select the file we want to import. For this exercise I’ve created a small text file like this:

It only contains 1 column and 2 rows. I also created a small table to illustrate the process:

create table VarFileName.Trans ( Col1 int )

So, click on Browse and select the file we want to import:

Click “Column names in the first data row” if your file has column names in the first row. Click on Columns tab on the left pane:

Specify the column delimiter e.g. pipe (|). Click on the Advanced tab on the left pane:

Set the data type for each column. Click OK to close the Flat File Connection Manager Editor and go back to the Flat File Source Editor:

Click on the Columns on the left pane:

Click OK to close the Flat File Source Editor.

Edit the OLE DB Destination and point it to the target table, e.g. VarFileName.Trans:

Click on Mappings on the left pane and check that all columns on the file have been mapped properly to the table. In the example there is only 1 column:

Click OK to close the OLE DB Destination Editor and go back to the Data Flow window.

Now we are going to use the TransFileName variable as the file name of the file. Right click on the TransFile connection in the Connection Managers pane and select Properties.

Click on the ellipsis on Expressions to open the Property Expressions Editor. Set the Property column to ConnectionString. Under the Expression column, click on the ellipsis to open the Expression Builder. Choose the TransFileName variable and drag it to the Expression box (the expression will read @[User::TransFileName]):

 

Click OK to close the Expression Builder. Click OK again to close the Property Expressions Editor.

That’s it, we are done. At run time, the script task will set the value of the TransFileName variable, and this variable is used as the file name of the file we want to import. Here’s the Control Flow (left) and Data Flow (right) when we run it:

This can also be used when we need to export to a file whose name we only know at run time. It works the same way.

Hope this helps,

Vincent Rainardi, 26/6/2011

23 June 2011

Data Sourcing

Filed under: Data Warehousing,Project Management — Vincent Rainardi @ 7:28 am

One of the trickiest parts of a data warehousing project is determining where we should get the data from. On the one hand, the business users know that they would like to have attribute X, but they don’t know where in the enterprise we could get it from. On the other hand, if you ask someone from a source system team, “Do you have field X?”, your terminology might be different from what they are using.

For example: a bank. The business users are from the Compliance department and they want to have “Third Party Name” and “Account Status” on their report. You set up a meeting with the business users, trying to clarify what “Third Party” means and what “Account Status” means. There are 150 systems in the bank and you identify 5 possible systems that may have the data. You spend days trying to get the source system teams into a meeting. After a few weeks, you conclude that there are only 2 possible systems you can source the data from. Then you ask for sample data. And what a disappointment: both systems do not contain the data you are looking for.

Say you do get hold of a system which has the data you are looking for. There is a usual data sourcing routine that you need to perform:

  1. Ask for sample data, import it into your system and examine the data quality. It’s no good if most of the rows are null for that column! Are there rows with null business keys?
  2. Check that the grain of the data matches what we are looking for. If you need it at trade level but the data is at account level, it’s no good! If you expect the data to be 1 row per account per day, and in the data file you find that it’s 1 row per customer (and not per day), then it’s no good!
  3. Understand how the data needs to be joined in our data warehouse. Test and retest again. Yes, it’s a left (outer) join of course, but are you sure that the data on the joined fields matches our warehouse?
  4. In a large enterprise you are not allowed to import directly from their database (point to point). You have to subscribe to their “publication”. In a large corporation you are not the only consumer of this data; there might be 5 other systems consuming the same data.
  5. Understand the data delivery: Is the feed incremental or a full load every time? What is the delivery mechanism, FTP or messaging? Is it a daily feed or continuous throughout the day? How often does the data get updated? What is their scheduled downtime?
  6. Understand the age of the data: If today you receive a file from that source system, does the content of the file reflect the condition in their system last night, or 2 days ago? Working days only, or weekends included? In which situations is this not the case, e.g. at year end?

Welcome to the world of data sourcing. This is one of the activities in data warehousing that takes a lot of time. Having been involved in many warehousing projects, I now feel that the most important thing in a data warehousing project is probably the data sourcing. We need to have somebody in our DW team who understands the source systems. The dimensional modelling is equally important, but if you don’t understand the source data/systems, then the model will not be correct, and the ETL spec will not be correct either. And data sourcing needs to be done early: if we don’t have the data we can’t do anything!

This is how “analysis” is different from “design”. This is where Data Quality work happens. You perform DQ check and recheck on the incoming data. Are there duplicate rows in the data? Do the values match what you expect?

Data sourcing is often overlooked and not taken seriously. Many PMs (especially those who do not come from a DW background) think “Ah, it’s just about getting data from the finance department, easy…” We should not underestimate data sourcing in a DW project. If we can’t get good, valid data, we won’t have a data warehouse, let alone a BI front end.

That is why, in any DW project, it is crucial that we develop and test with Production data. Lots of PMs don’t get this. They are so used to data entry application projects and they ask “Why do you need Production data for development?” Two reasons: the first is data quality / data sourcing / functionality; the second is performance / technical / size.

The first reason is about making sure that you get the right data, and therefore can build the correct functionality in the DW. The second reason is about making sure that we can deal with a vast amount of data. It’s all very well that the ETL package only took 2 hours in dev, but when we go live it could take 7 hours! Why? “Oh, we developed using 50,000 rows of sample data.” You must be joking! This is a data warehousing project, not a data entry application development project. And you are back to square one: redo the development, redo the ETL package, redo the testing, etc., which is a huge waste of valuable project time.

Vincent Rainardi, 23/6/2011

21 June 2011

Updating a Variable based on a File (Script Task)

Filed under: SSIS — Vincent Rainardi @ 10:24 pm

In the last article (link) I showed how to update a variable based on a file using a Script Component. That’s useful when we are in a Data Flow. But sometimes the situation happens on the Control Flow; in this case we need to use a Script Task.

Using a Script Task is a lot simpler than using a Script Component. We essentially just need to do a bit of C# coding to read the file and set the variable.

This is the file that we want to read:

Pull the Script Task from the Toolbox to the Control Flow tab:

Define a variable to hold the Number of Customers:

Double click the Script Task to edit it. Set the read write variable to the variable we just created (NoOfCustomers):

Click Edit Script and edit it as follows:

using System;
using System.Data;
using Microsoft.SqlServer.Dts.Runtime;
using System.Windows.Forms;
using System.IO;

namespace ST_UpdateVariableFromFile // the namespace name is auto-generated by SSIS; yours will differ
{
  public partial class ScriptMain : Microsoft.SqlServer.Dts.Tasks.ScriptTask.VSTARTScriptObjectModelBase
  {
    // The VSTA-generated region (which defines the ScriptResults enum) is omitted here
    public void Main()
    {
      const string FILE_NAME = "C:\\Util\\File.txt";
      int ReturnValue;
      String line;
      char[] delimiters = "|".ToCharArray();
      string[] columns;
      try
      {
        if (!File.Exists(FILE_NAME))
        {
          MessageBox.Show("File " + FILE_NAME + " does not exist.");
          // Set the task result before returning, otherwise the failure would be lost
          Dts.TaskResult = (int)ScriptResults.Failure;
          return;
        }
        using (StreamReader sr = File.OpenText(FILE_NAME))
        {
          while ((line = sr.ReadLine()) != null)
          {
            //MessageBox.Show(line);
            columns = line.Split(delimiters);
            //MessageBox.Show(columns[0].ToString());
            //MessageBox.Show(columns[1].ToString());

            if (columns[0].ToString() == "Number Of Customers")
            {
              //MessageBox.Show(columns[0].ToString());
              //MessageBox.Show(columns[1].ToString());
              Dts.Variables["NoOfCustomers"].Value = Convert.ToInt32(columns[1].ToString());
            }
          }
          ReturnValue = (int)ScriptResults.Success;
        }
      }

      catch (Exception e)
      {
        MessageBox.Show(e.Message);
        ReturnValue = (int)ScriptResults.Failure;
      }

      Dts.TaskResult = ReturnValue;
    }
  }
}

// ref: http://msdn.microsoft.com/en-us/library/db5x7c0d.aspx

14 June 2011

Role of a DW Designer/Dimensional Modeller in the ETL Development Phase of a DW/BI Project

Filed under: Data Warehousing — Vincent Rainardi @ 9:53 pm

In a Data Warehousing project, when a Data Warehouse Designer or a Dimensional Modeller completes his design and hands it over to the ETL developers, some people think his utilisation will drop from 100% to 10%. This is not true. During the ETL development phase, the DW Designer will still be fully occupied: the ETL developers will ask him a lot of questions for many weeks, and the testers will ask him questions too.

This “answering questions” task seems to be a light job, but in reality it takes quite a lot of time, particularly if the developers and testers are overseas, e.g. UK/US to India. Non face-to-face communication (e.g. telephone conferences supported by desktop sharing software) takes more effort than face-to-face.

What kind of questions do the ETL developers ask? Mainly ETL questions, but some are modelling questions. So it helps to have an ETL architect in your DW/BI project, because the ETL architect can take away the burden of the DW designer having to interface with many developers at once; instead, the DW designer only needs to interface with 1 person.

These are examples of the questions from the ETL developers:

  1. Column length, e.g. the natural key (aka business key) is varchar(20) in source table 1, but varchar(30) in source table 2.
  2. Records with NULL measures or NULL natural keys in the source system: what to do with them?
  3. Duplicate natural keys in the source system: how to de-dup them?
  4. Inconsistency between the dimension model and the data, e.g. the dim model says that the source column is X, but in the data there’s no column called X.
  5. Inconsistencies of data types, either between source system and staging or between staging and the dimensional data store.
  6. Decimal points: should we round to 2 decimal places when the source data has 6 decimal places, i.e. accept the truncation?
  7. The ETL rule says: if it’s null, set it to “unknown”. But this is a numeric column, not a string column. Should we set it to null or to 0?

In addition to answering the above questions asked by ETL developers, the DW designer still has to finish the dim model design. When a dim model is given to the developers to be built, in most cases it is not 100% complete, probably 90% or 95% complete, but not 100%. This is because:

a) There are source systems that we can’t get access to yet.

b) There are changes in the source systems.

c) Misunderstandings of the source systems, i.e. the dim model was designed without having seen the source data.

So a DW designer or a dim modeller (what’s the difference between a DW designer and a dim modeller, you might ask? I’ll explain after this) still needs to do these tasks while the development is being done. In some projects, these “grey” parts of the dim model are labelled as “phase 2”. It is common in DW projects that the ETL development (and the construction of test packages) is started when the dim model reaches a somewhat mature state. Usually most of the source systems are known to the designer, but there are always 1 or 2 which he hasn’t got yet. Of course, logically, he starts the design work without waiting for these 1 or 2 little gaps in the data sources. And guess what? 2 months later, after the design is “finished”, those 1 or 2 items are still not there, and they get pushed to phase 2. This is common.

Differences Between a DW Designer and a Dim Modeller

What’s the difference between a DW designer and a dim modeller, you might ask? A dimensional model is the fact and dimension tables that form the data warehouse. It is the data model of these tables, i.e. the Entity Relationship Diagram (ERD). It specifies the table names and how the tables relate to each other (through the surrogate keys). It specifies the columns in those tables, and the data type and length of each column. It specifies the default values, if any, and specific column properties such as “identity”. It also specifies related table properties such as indexes and materialised views.

A dimensional modeller is the person who designs/creates the dimensional model. His job title might be “data architect”, because he doesn’t merely design the data warehouse/mart, but he also designs OLTP databases. It is uncommon for someone to have the title of “dimensional modeller”. “Data Architect” is a much more common title.

Notice that a dimensional model does not explain how we are going to populate the tables. It describes the tables, but not how to populate them or where we are going to populate them from. This usually falls into the hands of an ETL architect, who, alongside the dim modeller, specifies how the tables will be populated. It is common in a DW/BI project that once the dim modeller has completed the design of a “subject area” of the dim model, that area is passed to the ETL architect so he can specify the data population for that area. So they work simultaneously, shortening the duration of the project. An ETL architect is also known as an ETL designer.

But in small DW projects (6 months end-to-end or less), 1 person does both the dim modelling and the ETL architecture. In this case the person is labelled a “DW designer”, or as it is more commonly known, a “DW Architect”.

Vincent Rainardi, 14/6/2011

7 June 2011

SSIS: Update a Variable based on a File (Script Component)

Filed under: SSIS — Vincent Rainardi @ 7:15 pm

Yesterday I wrote about setting a variable based on the value in a database table (link). Sometimes the source value is not stored in a database table but in a file. We need to read that file to get the value and then update a package variable. Of course you can import that file into a table first, then query the table to update the variable via a Result Set mapping as in that post. But sometimes you don’t want to do this; you want to update the variable straight from the file.

There are 2 ways to do this:

  1. Script component as a source
  2. Script Task

We use the Script Component (as a source) when we want to set the variable as part of a Data Flow. We use the Script Task when we want to set the variable as part of the Control Flow. What’s the difference? A Data Flow is a task where data flows from a source to a destination. The Control Flow is a sequence of tasks, one after the other, not necessarily about data movement.

Out of those 2 above, in this article I will explain the first one (Update Variable from a File – Script Component as a Source). I will explain the second one (Script Task) in the next article.

OK, let’s start. Say that the file we want to import is like below, and what we want to get is the Number of Customers.

First we create a Data Flow by pulling the Data Flow task from the Toolbox on the left onto the Control Flow canvas:

Double click it to edit it. Pull a Script Component from the Toolbox onto the Data Flow. SSIS will ask whether we will use it as a Source, a Destination or a Transformation. Choose Source and we get this:

Now create a variable to store the Number of Customers. First, click on the canvas of the Control Flow so that the variable scope is Package. Then create the variable, setting the type to Int32.

We go back to the data flow task, double click the script component:

On the ReadWriteVariables property, click on the ellipsis and select User::NoOfCustomers:

On the Inputs and Outputs tab, give the output a name (e.g. change it from “Output 0” to “ParamOutput”). Then create 2 output columns: ParamName and ParamValue. Set the Data Type to String for both of them:

On the Connection Managers, define a new connection:

Choose Flat File:

Give it a name, choose the file, and select “Column names in the first data row”:

Click on Columns on the left, check the preview (see below). Here we usually change the delimiter, but in this case it’s not required.

Click OK to close the window and back to the Script Transformation Editor Connection Managers tab. Choose the Connection Manager you just created (ParameterFile) and give it a name (e.g. ParamConn):

Now we are ready to edit the script (to be ready we need to have: connection, output and variable). Click on Script on the left and click on Edit Script:

Edit it like this:

using System;
using System.Data;
using Microsoft.SqlServer.Dts.Pipeline.Wrapper;
using Microsoft.SqlServer.Dts.Runtime.Wrapper;
using System.IO; // This namespace is required for File functionalities
using System.Windows.Forms; // This namespace is required for the Message Boxes

[Microsoft.SqlServer.Dts.Pipeline.SSISScriptComponentEntryPointAttribute]
public class ScriptMain : UserComponent
{
    // Declare the variables
    private StreamReader SR;
    private string File1;
    private int NoOfCustomers;

    public override void AcquireConnections(object Transaction)
    {
        // Get the ParamConn connection, store it as 'File1'
        IDTSConnectionManager100 CM = this.Connections.ParamConn;
        File1 = (string)CM.AcquireConnection(null);
    }

    public override void PreExecute()
    {
        // Create a reader for File1
        base.PreExecute();
        SR = new StreamReader(File1);
    }

    public override void PostExecute()
    {
        // Close the reader
        base.PostExecute();
        SR.Close();

        // Set the Package Variable
        Variables.NoOfCustomers = NoOfCustomers;
    }

    public override void CreateNewOutputRows()
    {
        // Declare variables
        string nextLine;
        string[] columns;
        char[] delimiters;

        // Set the delimiter
        delimiters = "|".ToCharArray();

        // Read the first line (header)
        nextLine = SR.ReadLine();

        // Split the line into columns
        columns = nextLine.Split(delimiters);

        // Read the second line and loop until the end of the file
        nextLine = SR.ReadLine();
        while (nextLine != null)
        {
            // Split the line into columns
            columns = nextLine.Split(delimiters);

            // Add a row on the output
            ParamOutputBuffer.AddRow();

            // Use MessageBox for Debugging
            // (you can debug a Script Task but not Script Component)
            MessageBox.Show(columns[0].ToString());
            MessageBox.Show(columns[1].ToString());

            // Set the values of the Script Component output according to the file content
            ParamOutputBuffer.ParamName = columns[0];
            ParamOutputBuffer.ParamValue = columns[1];

            // When we find the row, set the variable
            if (columns[0].ToString() == "Number Of Customers")
            {
                // Store the value of columns[1] to an internal variable
                NoOfCustomers = Convert.ToInt32(columns[1]);
                MessageBox.Show(NoOfCustomers.ToString());
                break; // Get out of the loop when we found the row
            }

            // Read the next line
            nextLine = SR.ReadLine();
        }
    }
}

Run it. You get message boxes showing the values read from the file: Analysis Type, Regulatory, Number Of Customers, 322514, and then the value stored in the variable (because of the break, the loop stops once the Number Of Customers row is found, so the Date Exported row is not read). Unfortunately we can’t debug a Script Component (we can debug a Script Task, but not a Script Component), hence we use message boxes.

The code reads the file line by line, splitting each line on the delimiter. The split result is put into an array called columns, which is mapped to the output buffer of the script component. When we find the “Number Of Customers” row, we set the internal variable and then break out of the loop. In PostExecute, we set the package variable.

After the code is working properly don’t forget to comment out the message boxes.

Hope it’s helpful.

Vincent Rainardi, 7/6/2011

6 June 2011

SSIS: Updating a Variable based on Database

Filed under: SSIS — Vincent Rainardi @ 9:17 pm

Sometimes when writing an SSIS package, we need to update the value of a package variable based on a database table. For example, we may have a variable called ImportFolder. This package variable contains the path to the folder which contains the files that we need to import. The value of ImportFolder variable is stored in the database. We have to query that database (select ImportFolder from table1 where col1 = …) and then set the value of the ImportFolder package variable.

In some cases the value that we need to update the package variable with is not stored in the database but in a file. We need to open the file, read it and set the value of the package variable. In this article I will give an example of how to do the database one, and in the next article I’ll do the file one.

Execute SQL Task

We use an Execute SQL Task and then map the Result Set to a variable. The database doesn’t have to be SQL Server of course; it can be anything. Here’s how:

Pull Execute SQL task from the Toolbox to the Control Flow canvas:

Double click it to edit it, type the query in the SQLStatement property, and set ResultSet to Single row.

Here’s an example of a SQL query to get the Import Folder from a database table:

After you set the SQLStatement and ResultSet, click on the Result Set pane on the left hand side and set the Result Name and Variable Name. The Result Name is the column name in the query output, and the Variable Name is the name of the package variable (yes, you need to create it first).

So click OK to come out of the Execute SQL Task Editor and create the variable as below.

Create the variable by clicking the first icon in the Variables window. You need to set the Data Type to String. To set the scope to Package, click anywhere on the Control Flow canvas before creating the variable.

If you wonder “Where is this Variables window?”, scroll down to the bottom of the screen and click the Variables tab on the left, next to the Toolbox tab:

Now back to Result Set tab again (double click Execute SQL Task, click on Result Set tab on the left) and set the Result Name and Variable Name:

Before running it, you may want to set a breakpoint so you can see the value of the variable: right click on the Execute SQL Task and choose Edit Breakpoints, then select the OnPostExecute event (this fires after the task has run).

Then run the package. It will stop after it runs the Execute SQL Task. To see the value of the ImportFolder variable, type “ImportFolder” in the Watch window:

Or expand the Locals window and find User::ImportFolder:

If you wonder “Where are these Watch and Locals windows?”, they are at the bottom of the screen. If you can’t see them, try the Window menu, Reset Window Layout.

Vincent Rainardi, 6/6/2011

PS. On the next article I’ll write about Updating a Variable from a File.

SSIS 2008 Data Profiler

Filed under: SSIS — Vincent Rainardi @ 5:30 am

SQL Server 2008 has a new feature in SSIS: Data Profiler. For a given table, Data Profiler enables us to understand the number of distinct values in each column, and the distribution of data values. This is very useful in data warehousing, both in ETL and when building cubes in SSAS.

When building an ETL package, sometimes we need to know the data values in a column. For example, to understand whether a column is mostly empty (blank or null) or contains good data. Or to help us decide whether to exclude or include certain values or certain patterns, probably related to a data quality requirement. Or to help us identify potential hierarchies in a dimension table.

Data Profiler is also useful when doing performance tuning, for example when partitioning a fact table or partitioning a cube. Understanding the data distribution can help us create similarly sized groups based on a particular column.

On my current SQL Server 2005 data warehousing project, I use Informatica Data Explorer (IDE) to understand the distribution of data values in the data warehouse fact and dimension tables, for the purpose of building SSAS cubes. I often ask myself: when will we be able to right click on any table in Management Studio, select ‘Analyze’ and get the data profile for that table?

Now in SSIS 2008 we have Data Profiler. Well, it’s not as friendly as doing a right click on any table and selecting ‘Analyze’, but it’s a start. I hope that in later releases of SQL Server we can right click any table in Management Studio and choose ‘Analyze’ to get the data profile.

So let’s find out what SSIS 2008 Data Profiler is all about. First, let’s have a look at its output:

Figure 1. Output of SSIS 2008 Data Profiler

The left pane shows the tables that we analyzed, along with the profiles that we executed for each table. The right hand pane shows the details of the profile that we selected on the left hand pane. For example, for the ‘Column Value Distribution’ profile, the top right panel shows a list of columns and the number of distinct values in each column. If we click on any column in this panel, the bottom right panel displays the distribution of the data values, i.e. each value, the number of rows containing that value, and the percentage of total rows. Figure 1 shows the distribution of the data values in the product_key column of the fact_sales table.

Other than analyzing the distribution of data value, SSIS 2008 Data Profiler can also analyze:

  1. The number of nulls in a column.
  2. The pattern of the data values, for example if a column contains values of 2008-01, 2008-02, …, 2008-12, then the pattern is “200d-dd” (d means digit, i.e. a number from 0 to 9).
  3. Minimum and maximum values of the data in each column.
  4. Mean (average) and standard deviation of the data values in each column. The standard deviation indicates how widely the values are spread around the mean.
  5. Determining which column(s) are good candidates for a primary key. This is useful for checking the 1st normal form rule, which states that each table must have a primary key that uniquely identifies each row.
  6. Checking whether a column is fully or partially dependent on other column(s). This is useful for checking the 2nd normal form and for identifying potential hierarchies within a dimension. In 2nd normal form, non primary key columns must be fully dependent on the whole primary key.
  7. Checking whether all values in a column exist in another table. In data warehousing, this is useful for identifying whether there are dimensional keys on the fact table which do not exist in the dimension table. For ODS design (or other normalized databases), this is useful for checking the 3rd normal form, which states that no column is transitively dependent on the primary key. In other words, it is useful for identifying potential foreign key relationships.

Now let’s understand the ‘how’.

Open BIDS, create an SSIS project, drag Data Profiling Task onto the design surface. Double click to edit it. Type the Destination File name.

Click on Profile Requests on the left pane. Select Column Value Distribution on the Profile Type column. Supply the connection details, select the Table and Column(s) at the bottom panel as shown on Figure 2 below and click OK.

Figure 2. Setting up a Data Profiling Task in SSIS

Run the SSIS package. Verify that the output file is created.

To view the output file, we need to use Data Profile Viewer utility. Go to C:\Program Files\Microsoft SQL Server\100\DTS\Binn (adjust to your SQL Server folder) and execute DataProfileViewer.exe. Click on Open, and open the output file that was produced by the SSIS Data Profiling Task earlier. It will look like figure 1 above.

In addition to Column Value Distribution Profiles that we chose above, we can also choose other profiles, such as Column Pattern Profiles, Column Statistics Profiles, Functional Dependency Profiles, Candidate Key Profiles, Column Null Ratio Profiles and Inclusion Profiles. These profiles provide the 7 analysis items I listed earlier.

The Data Profiling facility in SQL Server 2008 is quite basic, as is its user friendliness. The viewer is not integrated into BI Development Studio, even in Release Candidate 0. But it is a good start. Whether it comes at an additional cost or not, it is a necessity these days for an ETL tool to have a data profiling facility. I’m sure the SSIS data profiling feature will improve a lot in the future. I hope that in future releases of SQL Server the Data Profile Viewer utility is integrated into BI Development Studio, and that we can right click any table in Management Studio and choose ‘Analyze’ to get the data profile.

Vincent Rainardi, 7th July 2008
Author of Building A Data Warehouse: With Examples in SQL Server

Repost from SQLServerCentral

5 June 2011

SSIS: Export a Query Result to a File

Filed under: SSIS — Vincent Rainardi @ 11:01 pm

This seems to be a simple task, but someone who has just begun using SSIS will surely come across it, and at that point he will need an example to copy from. Importing a file into a table is a more common task in Data Warehousing than exporting to a file, so it is possible that an SSIS developer can:

  • Import a file into table
  • Run SQL to copy table1 to table 2
  • Export a table into a file

but has never to this day exported the result of a query to a file.

There are 2 ways to do it:

  1. Using Data Flow task
  2. Using Export Import Wizard

I recommend option 1 over option 2 as we have more control, but I’ll go through both. No, we can’t use an Execute SQL Task for this, nor the OLE DB Command transformation.

Using Data Flow Task

Pull the Data Flow task into the Control Flow tab and double click it to edit:

On the Data Flow tab, pull the OLE DB Source and a Flat File Destination then connect them:

Double click the OLE DB Source to edit it:

Set the OLE DB connection manager to an existing one or click New to create a new one.

This is the most important bit: set the Data access mode to SQL command. Type the SQL query in SSMS, test it, then copy and paste it here. You can also use “Build Query”:

Click Preview to see the query result. Click Columns on the left hand side to check the output columns (this is optional) and then click OK.

Double click the Flat File Destination to edit it.

When asked, select the file format as Delimited. It is the most common. You can experiment with the other types later.

Click on New to define a new Connection Manager and set the file name by clicking Browse. Select “Column names in the first data row” as needed and change the connection manager name to one of your choice.

On the left hand side click on the Columns and set the Column delimiter to Vertical Bar {|}, see below. Again this is the most common one; you can experiment with the other delimiter later on. Then click OK.

Run it (Start Debugging) and check the output file. I would suggest that you put only 3 rows of data in your table so you can check that you have 3 rows in the output file. If you have 10,000, how would you check it? 🙂

Using Export Import Wizard

On the Solution Explorer, right click on the SSIS Packages and select SSIS Import and Export Wizard. Navigate through the pages.

1. Source: for SQL Server 2008 choose SQL Server Native Client 10.0, and select the DB as below; otherwise choose as appropriate.

2. Destination: choose Flat File and specify the File name. Choose “Column names in the first data row” if needed, as below:

3. Specify Table Copy or Query: choose Write a query to specify the data to transfer, as below:

4. Provide Source Query: type the query in SSMS and copy paste the query here, as below:

5. Set the column delimiter to Vertical Bar. This is the most common one. You can experiment with other delimiter later on.

6. Click Next, Finish, Close.

On the Solution Explorer you should have a new package, e.g. Package1.dtsx. It contains 1 Data Flow. The Data Flow contains an OLE DB Source and a Flat File Destination. It is just like the package we created a little earlier.

Hope this helps.

Picture for “skipping to the next task”: (see dsentelle’s comment below):
Skip to next task

 
