Data Warehousing and Data Science

28 February 2011

SSIS: SQL Server Destination or OLEDB

Filed under: SSIS — Vincent Rainardi @ 12:28 am

We need to insert or update a SQL Server table using a Data Flow. Should we use the SQL Server destination or the OLE DB destination? This is one of the most frequently asked questions among SSIS developers. The OLE DB destination is used far more often than the SQL Server destination, which at first glance doesn't make sense. If the target were an Oracle table we could understand it, but this is a SQL Server table, so surely it is better to use the SQL Server Destination, right?

A SQL Server destination has a few settings that an OLE DB destination doesn't. In the screenshots below, the left one is the SQL Server destination and the right one is the OLE DB destination:

One of the most important things to notice here is that, in terms of features, there is not much difference. Both are suitable for loading data into a SQL Server table. If your table has triggers then yes, it matters, but who in data warehousing would create a dimension or fact table with a trigger anyway? Using the SQL Server Destination we can specify the last row, which is useful when you need to import only the first N rows (see my post here). But as the source is a SQL table anyway, we can of course use TOP N in our select statement. So again, there is not much difference.

So the most crucial question is: which one is faster? To answer that I imported a file with 1 million rows and two int columns into a SQL Server table, first using an OLE DB Destination, then using a SQL Server Destination. The property settings are as shown above.

OLE DB Destination: 1st run: 3.229, 2nd run: 3.464, 3rd run: 3.978 seconds (without tablock it is 8.284, 8.330, 8.331)

SQL Server Destination: 1st run: 2.637, 2nd run: 2.574, 3rd run: 2.870 seconds.

So the SQL Server Destination is faster than the OLE DB Destination: the SQL Server average is 2.694s, about 24% faster than the OLE DB average of 3.557s.

This runs on my Dell Inspiron 1545 laptop, Pentium Dual Core CPU T4500 @2.30 GHz each, 4 GB RAM, Windows 7 Home Premium 64-bit. Hard drive: WD5000BEVT, 500GB, 2.5″ x 1/8H SATA-300, 5400 RPM, external Data Transfer Rate 300MBps, average latency 5.5ms, seek time 2ms.

The main reason the SQL Server Destination is not widely used is not performance. The main issue is that we cannot edit the SSIS package on our PC; we have to edit it on the target SQL Server, which is ridiculous. If we edit the package on our PC we get an error message: “The selected data source is on a remote computer. The bulk insert operation can be executed only on the local computer.”

This Connect post explains it. On that post Bob Bojanic mentioned that Microsoft would be fixing it in SQL 2011, enabling us to use the SQL Server Destination when editing the SSIS package on our PC. But a later post from Microsoft, dated 13/6/08, said that they were not sure whether the fix would make it into SQL 2011 (“open for triage”).

I still don't understand why this restriction was there in the first place, and why it wasn't issued as a 2008 patch, as the fix seems trivial to me. As Jonathan Bays said, “I would recommend removing this validation during design time and only put this restriction in at run time. If the package tries to load to a remote server, then it can throw an exception.”

27 February 2011

Creating Test Data on SQL Server 2008

Filed under: SQL Server — Vincent Rainardi @ 11:31 pm

The key thing here is that you need Visual Studio 2010 to do this. You cannot generate test data using SQL Server 2008 alone; you need Visual Studio 2010 Premium or Ultimate. The normal Visual Studio (Professional edition) cannot do it. Alternatively you can do it using Visual Studio 2008 Database edition. In VS 2010 there is no Database edition; it only exists in VS 2008.

Unfortunately VS 2010 Premium is £3,695, a far cry from the normal VS 2010 (Professional edition), which is £694.

As for how to do it, Arshad Ali wrote a very good post here, and the formal MSDN guide is here. One of the most important things to notice is that the generated test data is rubbish, i.e. a series of random, unreadable characters. For numeric columns it's OK in the sense that they contain random numbers, which is what we want, but the numbers are still meaningless. Most users are used to meaningful data, so when they view reports containing this test data, their usual reaction is that they can't stand looking at the reports. In their view the reports are rubbish because they contain rubbish data. You won't understand it until you see it for yourself, so I'd recommend trying it (showing a user a report with dummy data). In my opinion, reports with dummy data are more appropriate for developers and testers within IT than for end users.

If rubbish data is not what you have in mind, do not allocate just three days in your project plan to generate test data. You will need to do it the old way, using stored procedures. Mitchel Sellers wrote a very good article on this topic: link.
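
If you go down the stored procedure route, the basic idea is to build readable values by combining small lists of realistic names, rather than generating random characters. Below is a minimal T-SQL sketch of that idea; all table and column names are made up for illustration, not taken from any real project.

-- Illustrative only: generate 100,000 customers with readable names
-- by combining small lists of real-looking first names and cities.
CREATE TABLE #FirstName (Name varchar(20));
INSERT INTO #FirstName VALUES ('John'),('Sarah'),('Amit'),('Maria'),('Chen');

CREATE TABLE #City (City varchar(20));
INSERT INTO #City VALUES ('London'),('Leeds'),('York'),('Bristol'),('Bath');

CREATE TABLE #TestCustomer
( CustomerId   int IDENTITY(1,1)
, CustomerName varchar(50)
, City         varchar(20)
, CreditLimit  money );

INSERT INTO #TestCustomer (CustomerName, City, CreditLimit)
SELECT TOP 100000
       f.Name + ' ' + c.City            -- readable, not random characters
     , c.City
     , ABS(CHECKSUM(NEWID())) % 10000   -- random but sensible number
FROM #FirstName f
CROSS JOIN #City c
CROSS JOIN sys.all_objects o1           -- cross joins just multiply the row count
CROSS JOIN sys.all_objects o2;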

Alternatively, you can use Red Gate's data generator tool (link), which is £175/user, much cheaper than buying VS 2010 Premium at £3,695. But, again, as with any test data generator tool, the data is rubbish: just a series of random, unreadable characters.

Editions and Prices of Visual Studio 2010

Filed under: Analysis Services — Vincent Rainardi @ 11:03 pm

This page on the Microsoft web site shows the differences between the 5 Microsoft Visual Studio 2010 editions, as below:

The prices for these 5 Microsoft Visual Studio editions are:
(all prices including VAT, from eCost Software 27/2/11, rounded to nearest GBP)

  1. Professional with MSDN Essentials £694
  2. Professional with MSDN Subscription £954
  3. Premium £3,695
  4. Ultimate £10,230
  5. Test Professional £1,954

A few important things to notice (besides the prices) are:

  1. All 5 editions include SQL Server 2008. It's for development only, not for production, but we buy VS for development purposes anyway.
  2. The Professional and Test Professional editions do not include the DB dev/testing features. In particular you can't generate test data.
  3. Only Premium and Ultimate include Microsoft Office.
  4. All editions except #1 include the Team Foundation features, e.g. VSS, build automation, team portal & reporting, BUT you need to buy a TFS server licence (£378) and a TFS CAL (£339/user).

SSIS: Data Flow vs Stored Procedures

Filed under: SSIS — Vincent Rainardi @ 9:44 pm

When importing a file into a SQL table, we create a Data Flow. But for transferring data from one SQL Server table to another SQL Server table, is it better to use an Execute SQL Task (stored procedures) or a Data Flow?

This is a classic debate in SSIS. A lot of the time in data warehousing we need to transfer data from the staging tables to the fact and dimension tables. Should we use an Execute SQL Task or a Data Flow?

There are 4 main considerations here:

  1. Data quality checking
  2. ETL framework
  3. Performance
  4. Development time

Data Quality Checking

There are a number of data quality checks that we need to perform on the incoming data and log accordingly, potentially rejecting the incoming data: for example data type validations, the number of columns, and whether the data is within a certain allowable range or conforms to a certain list. These DQ checks should be built only once and used many times, avoiding redundant work. For that purpose, it is easier to build the DQ checks as stored procedures, running dynamic SQL against the staging tables one by one. One of the main principles in DQ is that bad data in the incoming feed should not fail the data load; it should be gracefully recorded and the whole ETL package should carry on. It is an order of magnitude more difficult to build the DQ routines as script tasks, which are executed before the data flows into the warehouse. On the other hand, data profiles are easier to build using the Data Profiling task. What I'm saying is that the decision whether to use a data flow or a stored procedure/Execute SQL task is affected by how the DQ routines were built.
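
As an illustration of the stored-procedure style of DQ check, a sketch like the one below loops through the rules listed in a metadata table and logs any rows falling outside an allowable range, without failing the load. The table, column and procedure names (DQRule, DQLog, usp_DQ_RangeCheck) are assumptions for the sake of the example, not part of any real framework.

-- Minimal sketch: range checks driven by a metadata table, logged rather than failed
CREATE TABLE dbo.DQRule
( RuleId int, TableName sysname, ColumnName sysname
, MinValue decimal(18,4), MaxValue decimal(18,4) );

CREATE TABLE dbo.DQLog
( LogId int IDENTITY(1,1), RuleId int, FailedRows int, LogDate datetime DEFAULT GETDATE() );
GO
CREATE PROCEDURE dbo.usp_DQ_RangeCheck
AS
BEGIN
  DECLARE @RuleId int, @Table sysname, @Col sysname;
  DECLARE @Min decimal(18,4), @Max decimal(18,4), @sql nvarchar(max);

  DECLARE rule_cursor CURSOR FOR
    SELECT RuleId, TableName, ColumnName, MinValue, MaxValue FROM dbo.DQRule;
  OPEN rule_cursor;
  FETCH NEXT FROM rule_cursor INTO @RuleId, @Table, @Col, @Min, @Max;
  WHILE @@FETCH_STATUS = 0
  BEGIN
    -- Dynamic SQL so that one procedure serves many staging tables
    SET @sql = N'INSERT INTO dbo.DQLog (RuleId, FailedRows)
                 SELECT @RuleId, COUNT(*) FROM ' + QUOTENAME(@Table)
             + N' WHERE ' + QUOTENAME(@Col) + N' NOT BETWEEN @Min AND @Max';
    EXEC sp_executesql @sql,
         N'@RuleId int, @Min decimal(18,4), @Max decimal(18,4)',
         @RuleId, @Min, @Max;
    FETCH NEXT FROM rule_cursor INTO @RuleId, @Table, @Col, @Min, @Max;
  END
  CLOSE rule_cursor;
  DEALLOCATE rule_cursor;
END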

ETL Framework

In every data warehousing or data integration project that uses SSIS as the ETL tool, the first step is to build an ETL framework. This framework handles error checking, alert notification, task failures, logging, execution history, file archiving and batch control. It is built as a “parent child” package system, supported by a series of ETL metadata tables, as per chapter 10 of my book, e.g. a data flow table, a package table and a status table. What I'm saying here is that the decision whether to use a data flow or a stored procedure/Execute SQL task is affected by your ETL framework. I know it should be the other way around: the ETL framework should be built to accommodate both data flows and stored procedures. If that's the case in your project, excellent, there's no problem here. But practically speaking, I've seen several cases where we could not implement a data transfer routine as a data flow because the ETL framework dictated that it be implemented as a stored procedure.
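
To give a flavour of what such framework metadata might look like, here is a simplified sketch; these are illustrative tables only, not the actual ones from the book.

-- Simplified illustration of ETL framework metadata tables
CREATE TABLE dbo.EtlPackage
( PackageId   int IDENTITY(1,1) PRIMARY KEY
, PackageName varchar(100)
, Enabled     bit );

CREATE TABLE dbo.EtlDataFlow
( DataFlowId  int IDENTITY(1,1) PRIMARY KEY
, PackageId   int
, SourceTable varchar(100)
, TargetTable varchar(100) );

CREATE TABLE dbo.EtlExecutionStatus
( ExecutionId int IDENTITY(1,1) PRIMARY KEY
, PackageId   int
, StartTime   datetime
, EndTime     datetime
, Status      varchar(20)   -- e.g. Started, Succeeded, Failed
, RowsLoaded  int );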

The next 2 points are the guts of the matter. They are the real reasons for choosing between the data flow approach and stored procedures if it is a greenfield project, meaning you have complete freedom to choose, without any existing corporate rules or architecture affecting your decision.

Performance

Performance is about how fast the data load is. Given the same amount of data to load from the staging table into the main table, which is the faster method: an INSERT ... SELECT, or a data flow? Generally speaking, if the data is less than 10k rows, there is no real difference in performance; it is how complicated your DQ logic is that slows it down, not whether it's a data flow or a stored procedure. If you are lucky enough to be involved in a project that loads billions of rows every day, you should be using a data flow; generally it is faster than a stored procedure. The main issue with using a stored procedure to do a billion-row upsert in a SQL Server database is the bottleneck on tempdb and the log files. Your DBA wouldn't be happy if you blew up tempdb from a nice 2 GB to 200 GB. Ditto the log files.
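
For the stored procedure route, the core of the load is typically a set-based INSERT ... SELECT from the staging table, looking up the surrogate keys as it goes. A minimal sketch is below; the table and column names are illustrative assumptions.

-- Illustrative set-based load from staging into a fact table
INSERT INTO dbo.FactSales (DateKey, CustomerKey, ProductKey, Revenue, Cost)
SELECT d.DateKey
     , c.CustomerKey
     , p.ProductKey
     , s.Revenue
     , s.Cost
FROM staging.Sales s
JOIN dbo.DimDate     d ON d.SqlDate      = s.SaleDate
JOIN dbo.DimCustomer c ON c.CustomerCode = s.CustomerCode
JOIN dbo.DimProduct  p ON p.ProductCode  = s.ProductCode;
-- For very large loads, running this in batches (e.g. one date range per loop
-- iteration) helps keep transaction log growth under control.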

Using a data flow you can split a derived column transformation into several transformations, boosting the throughput by up to a factor of two; see here for details from the SQLCAT team. This principle applies to any synchronous transformation, including data conversion, lookup, row count, copy column and multicast; see here for an explanation of synchronous vs asynchronous components. The thing that gives us the most performance gain is using multiple data flows to read different partitions of the source table simultaneously. This will certainly create a bottleneck on the target, so it too needs to be partitioned, in much the same way as the source table. The other thing that increases performance is the use of caching on the lookup transformation. With Full Cache, the entire lookup table is pulled into memory before the data flow is executed, so the lookup operation is lightning fast. With Partial Cache, the cache is built as the rows pass through: when a new row comes in, SSIS searches the cache (memory) for a match and only fetches the data from disk if it doesn't find one. See here for details. You don't get any of this when you use stored procedures to transfer the data.

Development Time

You may say that development time is a minor consideration compared to performance when it comes to choosing between a data flow and a stored procedure, but in reality this factor is significant. I have seen several cases where the ETL developer is more comfortable coding in Transact-SQL than using SSIS transformations. They are probably twice as fast building it as stored procedures than as SSIS transformations, because of their past experience. Understandably, this is because the majority of so-called “SSIS developers” were “SQL developers” first. They may have been doing SSIS for 2 years, but they have been writing SQL stored procedures for 10 years. For example, many developers are more conversant with date conversion in Transact-SQL than in a Derived Column.

Conclusion

If you are lucky enough to be able to choose freely, the data flow gives more performance and flexibility. But as with everything else in the real world, there are other factors which tie your hands, e.g. the data quality checking, the ETL framework and the development time.

As always I’d be glad to receive your comments and discussion at vrainardi@gmail.com. Vincent 27/2/11.

SSIS: Debugging a Script Component

Filed under: SSIS — Vincent Rainardi @ 10:17 am

I mistakenly assumed that we can debug a script component, as in putting breakpoints in and inspecting the values of variables at run time. But no, in SSIS we cannot debug a script component. We can debug a script task, but not a script component. A script task is a control flow item, whereas a script component is a data flow item.

Because I have debugged a script task in the past, I thought we could debug a script component. So I put breakpoints on the code in my script component and executed the data flow task. Alas, the package did not stop at the breakpoints; it carried on. I thought my Visual Studio 2008 installation was not right, i.e. it had not integrated with SQL Server BIDS. So I tried on my laptop, which only had SQL Server 2008 installed. I created a data flow with a script component and ran it, and I got the same thing: SSIS did not stop at the breakpoints inside the script component.

Then I read on MSDN that we can't debug a script component: link. Because we can't set breakpoints and inspect variables, it suggests 4 methods:

  1. Message box (MessageBox.Show)
  2. Raise warning and error events
  3. Log messages
  4. Use a data viewer to see the output

In the end I did the debugging using a combination of:

  1. Message box
  2. Write to a text file (using StreamWriter)
  3. Set a package variable

SSIS: Importing Files – Read The First N Rows

Filed under: SSIS — Vincent Rainardi @ 8:17 am

Sometimes we need to read only the first few rows of a file, not the whole file. Usually this is because the file contains a header that gives us information about the structure of the file, like this:

col1|col2|col3
a,a,a
col1|col2|col3|col4|col5
1,1,1,1,1
2,2,2,2,2
3,3,3,3,3

In the above example we want to read the first 2 rows, which are the header. But SSIS has no facility to read only the first N rows; we have to read the whole file, which is ridiculous if the file has 1 million rows. So how do we read just the first N rows?

To read the first N rows there are 3 alternatives:

  1. Use Konesans Row Number Transformation.
  2. Put all rows into a table then select top N from it.
  3. Use a script component as a source or transform.

Using Konesans Row Number Transformation

Thanks to Allan and Darren, the Row Number Transformation gives us a row number for every row in the file. This row number is added as a column to the output of the data flow. For further details please see their web site, SQLIS.com. We can then use a Conditional Split to split the data flow, like this:

One output of the Conditional Split, called “header” in this case, contains only the first 2 rows, which can then be loaded into the target table.

Using this method we still have to read the entire file, as summarised nicely by CTKeane here. So practically speaking this approach is ideal if the file we are importing is small, e.g. 1000 rows. However, it is also suitable for importing big files if we want to read or process the remainder of the file as well. Say the file consists of 1 million rows: the first 2 rows are the header and the remaining 999,998 rows are the details. We can then process the default output of the Conditional Split, which contains the detail rows.

The second thing to be aware of is that it is a third-party component, which our infrastructure managers might not be willing to support. So before going down this route it is wise to check first with whoever is in charge of the production/UAT/dev servers in your company or project.

Putting all rows in a table then selecting the top N from it

In this approach we import the whole file into a staging table, then select the top N rows from that staging table. Again this is fine if we want to process the remainder of the file as well as the header, but if we don't want to read the remainder of the file, this solution isn't suitable.

If the file we are importing consists of 2 sections like the example above, then when defining the flat file connection manager we need to define the file as one column, and only one column. When we query that one-column table we then split the long string into several columns based on the delimiters.

Transact-SQL on SQL Server 2008 doesn't have a built-in string split function like C# does, but we can use CHARINDEX, which gives us the position of each delimiter. The idea is to walk along the long string, and each time we find a delimiter we use SUBSTRING to extract that column's data. For an example see this post on the SQLTeam web site.
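
Here is a small sketch of that CHARINDEX/SUBSTRING technique on SQL Server 2008, splitting a pipe-delimited string held in a variable; in practice the string would come from the one-column staging table.

-- Illustrative: split 'a|b|c' held in one wide column into three columns
DECLARE @Row varchar(1000) = 'a|b|c';
DECLARE @p1 int, @p2 int;

SET @p1 = CHARINDEX('|', @Row);            -- position of the first delimiter
SET @p2 = CHARINDEX('|', @Row, @p1 + 1);   -- position of the second delimiter

SELECT SUBSTRING(@Row, 1, @p1 - 1)               AS Col1
     , SUBSTRING(@Row, @p1 + 1, @p2 - @p1 - 1)   AS Col2
     , SUBSTRING(@Row, @p2 + 1, LEN(@Row) - @p2) AS Col3;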

Use a Script Component as a Source/Transform

If your flat file has a static structure (a fixed number of columns) then you can use the script component as a transformation and use a flat file source as normal, as Jamie Thompson described here. But if your flat file contains multiple structures, like the example above, it is probably best to use the script component as a source.

To use a script component as a transformation, first create a flat file source pointing to the file. Then add a script component and choose Transformation when SSIS asks whether it is a Source, Destination or Transformation.

  • On the Connection Managers tab we don’t set anything.
  • On the Input Columns tab we select the columns that we want. These input columns will be automatically added to the Input 0 section of the Inputs and Outputs tab.
  • On the Inputs and Outputs tab, rename Output 0 to something sensible, e.g. CustomerRecordFileOutput, then create the Output Columns. Make these columns of type signed int, 4 bytes. (Note that the output buffer name in the script follows the output name; the code below assumes the output is still called Output 0.) Also, on Output 0, set the SynchronousInputID property to None, which makes the output asynchronous (by default it is set to the input, i.e. synchronous).
  • Finally, on the Script tab, set ReadOnlyVariables to User::N, i.e. the package variable containing how many rows we want to read. This is important; otherwise that package variable will not be recognised within the script. Then edit the script as follows (C#):
    // Row counter; declared at class level so its value persists across input rows
    private int i = 1;

    public override void Input0_ProcessInputRow(Input0Buffer Row)
    {
        // Only pass the first N rows through to the output
        if (i <= Variables.N)
        {
            Output0Buffer.AddRow();
            Output0Buffer.Col1 = Row.col1;
            i++;
        }
    }

Basically, in the above code, we create a row counter called i. Note that it must be declared as a class-level field (not inside Input0_ProcessInputRow), so that its value persists from one input row to the next. As long as i is less than or equal to N we produce an output row and increment i; otherwise we produce no output. Leave the PreExecute and PostExecute methods as they are; we just need to amend the Input0_ProcessInputRow method as above.

The data flow looks like this:

If you use the script component as a source, this is how:

  1. Don’t create a flat file source, just create a flat file connection manager.

    Suppose our source file is like this:

    col1|col2
    1|1
    2|2
    3|3
    4|4
    5|5
    

    So create a flat file connection manager with 2 columns, tick the “Column names in the first data row” checkbox, format: Delimited, Header row delimiter: {CR}{LF}, column delimiter: vertical bar.

  2. Add the script component. Choose “As a Source” when asked whether it’s a source, destination or transformation.
  3. Edit the script component and on the Connection Managers tab add the flat file connection that we created in step 1 above. For example, let's call it ConnFile2, like this:

  4. Then on the Inputs and Outputs tab, rename Output 0 to File2Output. Then click on “Output Columns” and click Add Column. Add Col1 and Col2 and set their data types according to the file you are importing:

  5. On the Script tab, add the package variable N to the ReadOnlyVariables:

  6. Then edit the script as follows: (based on MSDN article, Creating a Source with the Script Component, link)
using System.IO;                                 // for StreamReader
using Microsoft.SqlServer.Dts.Runtime.Wrapper;   // for IDTSConnectionManager100

public class ScriptMain : UserComponent
{
    private StreamReader SR;
    private string File2;

    public override void AcquireConnections(object Transaction)
    {
        // Get the file name from the flat file connection manager added in step 3
        IDTSConnectionManager100 ConnMgr = this.Connections.ConnFile2;
        File2 = (string)ConnMgr.AcquireConnection(null);
    }

    public override void PreExecute()
    {
        base.PreExecute();
        SR = new StreamReader(File2);
    }

    public override void PostExecute()
    {
        base.PostExecute();
        SR.Close();
    }

    public override void CreateNewOutputRows()
    {
        string Row;
        string[] Cols;
        char[] Delimiter = "|".ToCharArray();

        // Skip the header row first
        Row = SR.ReadLine();

        // Read only the first N data rows
        for (int i = 1; i <= Variables.N; i++)
        {
            Row = SR.ReadLine();
            if (Row == null) break;   // stop if the file has fewer than N rows

            Cols = Row.Split(Delimiter);
            File2OutputBuffer.AddRow();
            File2OutputBuffer.Col1 = Cols[0];
            File2OutputBuffer.Col2 = Cols[1];
        }
    }
}

The Data Flow looks like this:

Parent Child Dimension: extra child node

Filed under: Analysis Services — Vincent Rainardi @ 8:03 am

In a Parent Child Dimension (PCD), using the MembersWithData property we can specify whether or not to display an extra child node with the same name as the parent. For example, if we have a table like this:

We expect to see this: (1)

But by default Analysis Services produces this: (2)

The above screenshot shows that, by default, Analysis Services gives us one extra child node, which is the parent itself.

To get (1), change the MembersWithData property from NonLeafDataVisible to NonLeafDataHidden, like this:

We need to set this MembersWithData property on the parent key column, not on the dimension key column.

Thank you to Mourad Faiz for bringing this to my attention. As always I welcome corrections and discussion at vrainardi@gmail.com. Vincent 27/2/11.

Mixed Grain Fact Table

Filed under: Data Warehousing — Vincent Rainardi @ 7:19 am

A mixed grain fact table is a fact table containing measures with different grains, for example one measure that is weekly and another that is monthly. In this post I'd like to go through the advantages and disadvantages of that approach. The Kimball Group clearly states that the measures in a fact table must all have the same grain; see chapter 2 of the Kimball Toolkit book.

As usual it is easier to explain “by example”:

Consider a data mart where the business has measures per week and per month, but no daily measures. Should we have 2 fact tables, one for weekly and one for monthly, like this: (1)

Or should we create a mixed grain fact table like this: (2)

In the above fact table, the black rows are weekly figures, whereas the red rows are monthly figures. They are both put in the same fact table, but in different measure columns. On the rows where the weekly measure exists, the monthly measure is zero, and vice versa. Therefore the weekly and monthly totals are both correct:

select D.Week, sum(F.WeeklyMeasure)
from FactMixedGrain F
join DimDate D on F.DimDate = D.DimDate
group by D.Week

Result:

select D.Month, sum(F.MonthlyMeasure)
from FactMixedGrain F
join DimDate D on F.DimDate = D.DimDate
group by D.Month

Result:
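
For reference, the kind of table structure the two queries above assume might look like the sketch below; the data types are assumptions.

CREATE TABLE FactMixedGrain
( DimDate        int            -- end-of-week date key on weekly rows, end-of-month date key on monthly rows
, WeeklyMeasure  decimal(18,4)  -- 0 on monthly rows
, MonthlyMeasure decimal(18,4)  -- 0 on weekly rows
);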

Usually the main reason for doing option (2) is to have both the weekly and monthly measures in one fact table. This is done to save development effort, particularly ETL effort: it is easier to populate one table than two.

That's the advantage. Now the disadvantages. The issue with option (2) is that the grain of the fact table is mixed: there are two grains. In practice we have other dimension key columns in the fact table, and we also have other measure columns. Some of these measures are weekly and some are monthly; perhaps some are daily.

The issues with a fact table with a mixed grain like that are:

  1. When reporting from the table, we risk mixing measures with different grains in the same formula, producing a meaningless result (see the query sketch after this list).
  2. The DateKey is used to indicate both the end of the week and the end of the month. This requires certain logic in the ETL, which prolongs its development.
  3. In the fact table we have black and red rows, indicating weekly and monthly grain: black for weekly and red for monthly. We may end up having yellow rows too, for daily, then blue rows for quarterly, and green rows for yearly. After that it becomes very confusing: when we query the fact table, we can't tell which row is which.
  4. When we join this fact table to another fact table (drilling across), we risk joining the wrong rows.
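
To illustrate point 1, a query like the sketch below looks harmless but quietly adds each month's figure to four or five weeks' figures, producing a number that means nothing:

-- Looks plausible, but mixes grains: the weekly and monthly measures
-- belong to different periods, so the total is meaningless.
SELECT D.Month
     , SUM(F.WeeklyMeasure + F.MonthlyMeasure) AS Total
FROM FactMixedGrain F
JOIN DimDate D ON F.DimDate = D.DimDate
GROUP BY D.Month;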

In option (2) above, the date dimension is used at 2 different grains in the fact table. Other examples of this are:

  1. Product dimension used at the product group level and at the product category level.
  2. Customer dimension used at the company level and at the group level.

When we have 2 or 3 dimensions, each with dual grain like that, it becomes really confusing and risky.

In the above example, the same dimension key column has dual grain. In other cases, the mixed grain in the fact table is caused by different dimension key columns: one measure corresponds to dim1, whereas another measure corresponds to dim2. Where measure1 has a value, measure2 is 0, and vice versa, like this:

And so it becomes really confusing. I would therefore recommend avoiding this design; rather than mixing measures with different grains in the same fact table, it is better to put them into separate fact tables.

As usual I welcome comments and discussion at vrainardi@gmail.com. Vincent 27/2/11.

26 February 2011

Denormalising a Fact Table

Filed under: Data Warehousing — Vincent Rainardi @ 9:00 am

To denormalise a fact table means we expand a dimension key column to include the attributes of that dimension. This practice usually happens in a small, simple data mart project, where one person is doing everything.

As usual, it is easier to “learn by example”:

The usual practice in dimensional modelling is to design a fact table like this:
Dim1Key, Dim2Key, Measure1, Measure2, Measure3 (1)
e.g. DateKey, EventKey, Revenue, Cost, Visitors (an exhibition organiser company)

Rather than doing that, should the fact table be like this?
Dim1Key, Dim2Attributes, Measure1, Measure2, Measure3 (2)
e.g. Date, EventName, EventType, EventLocation, Revenue, Cost, Visitors
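
Expressed as table definitions, the two options look roughly like the sketch below; the column data types are assumptions.

-- (1) Dimensional: the fact table carries keys only
CREATE TABLE FactEvent
( DateKey int, EventKey int
, Revenue money, Cost money, Visitors int );

CREATE TABLE DimEvent
( EventKey      int IDENTITY(1,1)
, EventName     varchar(100)
, EventType     varchar(50)
, EventLocation varchar(100) );

-- (2) Denormalised: the dimension attributes are pushed into the fact table
CREATE TABLE FactEventDenormalised
( EventDate     date
, EventName     varchar(100)
, EventType     varchar(50)
, EventLocation varchar(100)
, Revenue money, Cost money, Visitors int );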

The main consideration for (2) is to save ETL effort. By not creating the dimension tables and having everything in the fact table, it is simpler and quicker to build.

There are several disadvantages to approach (2):

  1. There might be other event attributes introduced in the future, and those new attributes will need to be put into the fact table as well.
  2. We lose the date attributes such as year and month. They will then need to be put into the fact table as well, making the fact table very wide.
  3. It is difficult to “browse the events”. We would need to do a select distinct on the fact table to provide a list of events (see the sketch after this list). The same goes for the other dimensions.
  4. Some dimensions such as customer and product can have many attributes, making the fact table even wider.
  5. If the fact table is big, say 1 billion rows, aggregating a measure by an attribute is slower than if we put the attributes in a small dimension table.
  6. We would need to index the fact table on these attributes. The indexes will be a lot bigger than if the attributes were in a dimension table, and they will take more effort to maintain.
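
Point 3 in code, using the illustrative tables sketched earlier: with design (2), getting a simple list of events means scanning the whole fact table, whereas with design (1) it is a cheap read of a small dimension table.

-- Design (2): browse events by scanning the (potentially huge) fact table
SELECT DISTINCT EventName, EventType, EventLocation
FROM FactEventDenormalised;

-- Design (1): browse events from the small dimension table
SELECT EventName, EventType, EventLocation
FROM DimEvent;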

For these reasons it is better not to denormalise a fact table.

In practice, at a high level, everybody agrees with this principle. But not everybody agrees with the details of it. I call this issue “one dimension or two”.

Again it is easier to learn by example:
In the above example, should event location be a separate dimension?

Many people I talk to agree that location should be a separate dimension. It would have its own dimension key, and its own attributes such as city, post code and nearest tube station.

Generally, if the attribute has many attributes of its own, it should be put into a separate dimension. But there is a disadvantage: we lose the direct link between location and event; we can only link them via the fact table. Which is why some designers “link the two dimensions via a back door”. Read here for details, case b.

Now a slightly less clear-cut case: should the event type be a separate dimension?

This time the response I get is about half and half. Some people think that event type should be a separate dimension; others think it should stay in the event dimension. Opinions will always vary (naturally), but the most important thing for us is to understand the reasons why they think so.

The main reason for keeping event type in the event dimension is that it has no attributes of its own. It consists of only one attribute, i.e. itself. It may have a code and a description, but from a business point of view it is only one attribute.

The main reason for separating event type into its own dimension is that some people (especially in the exhibition industry) think it does have its own attributes, i.e. it consists of 3 levels. Level 1: show or exhibition. Level 2: if it's a show, music or art; if it's an exhibition, commercial or art. Level 3: if it's a commercial exhibition, wedding, home or books; if it's an art exhibition, painting, contemporary or installation.

So the decision whether an attribute should be put into its own dimension depends on whether it has attributes of its own. Again, I must emphasise the main disadvantage of separating it into its own dimension: we lose the direct link and can only link via a fact table.

But that could be an advantage. It is possible that we could link via 2 different fact tables and end up with different results. If that is the nature of the data (there is no fixed relationship), then the attribute must be put into its own dimension. This is the second reason for separating an attribute (such as event type) into its own dimension.

As usual I welcome any comments and discussion at vrainardi@gmail.com. Vincent 26/2/11.

Normalising a Fact Table

Filed under: Data Warehousing — Vincent Rainardi @ 7:20 am

To normalise a fact table means we make it slim and tall. We do that by introducing a new dimension called “measure type” into the fact table and moving all the measures into just one measure column: amount. The meaning of this amount column depends on the value of the measure type key.

As always, it is easier to ”learn by example”.

Rather than creating a fact table like this: (1)
Dim1_Key, Dim2_Key, Dim3_Key, Measure1, Measure2, Measure3
e.g. Date_Key, Customer_Key, Product_Key, Revenue1, Revenue2, Revenue3

Why don’t we create it like this: (2)
Dim1_Key, Dim2_Key, Dim3_Key, Measure_Type_Key, Measure_Amount
e.g. Date_Key, Customer_Key, Product_Key, Revenue_Type_Key, Revenue_Amount

The usual argument is that “if we have a new measure (e.g. Revenue4), we don’t have to alter the table structure”.

And the usual situation is something like this: “we have 30 different measures/revenues in this company and it is very likely that in the next 3 months we will have another one”.

So, there we go. That’s the problem presented. The question now is, would you go for (1) or (2)? What do you think?

…… (your opinion here. I really like to hear it. Please let me know at vrainardi@gmail.com)

I would go for (1), even if I have to create a new measure every 3 months. Unless it is a GL mart.

Why?

  1. Additive. One of the sacred fundamentals of data warehousing (is there such a thing?) is that a measure is additive. Because it is additive, we can sum it up. This sounds very simple, but it is very powerful: it is fundamental to the analytic ability of a data warehouse. Most BI tools rely on this “additive characteristic”.
    If we go with (2) we lose this analytic ability. If one measure is not compatible with the other measures, we can't add them up. If all of them are revenues, all of them are daily, all of them are applicable per product and per customer, all of them are in the same currency and all of them are in the same units (say thousands), then yes, the measures are additive with each other. But once we introduce other measures such as costs, tax or non-monetary measures into the fact table, we may not be able to add them up together.
  2. One Pass Query.
    “Not really”, you might say. All you have to do is add “where revenue_type = 1” and you can sum up Revenue1. If you want Revenue2, just put “where revenue_type = 2” in your query and voila, you've got Revenue2. True, but if you want “revenue1 minus revenue2” we have to make two passes over the table. Essentially you need to run 2 SQL statements: the first sums up Revenue1 and the second sums up Revenue2, then you subtract one from the other. That's what “two passes” means (see the sketch after this list). Why is it a problem if it takes 2, 3 or 4 passes? Because it is slow. The more passes our SQL needs to make to produce a report, the slower the report.
  3. On the BI tool, if you don't put the measure type filter in, you risk the user getting “rubbish information”. From a usability point of view, this design is rubbish. A data warehouse or a data mart is designed primarily to analyse information; we do not store data for the sake of storing it. When we store data we arrange it in the format that is easiest to analyse. With this design the user needs to set the measure type dimension every single time, on every single BI query. When you see this in practice you will understand why the users go mad, particularly if they are analysing the data using OLAP cubes.
  4. Flexibility. One design principle of data warehousing is flexibility: we want to be able to add things later on. With regard to the fact table, the Kimball Group clearly says that we should group measures based on their grain. When we have a new measure, as long as it has the same grain, it can go into the same fact table. If we normalise the table, we may not be able to add another measure (e.g. a count), even if it has the same grain, because adding a count measure destroys the additivity. This way we end up with many more fact tables, because the measures are grouped not only by their grain but also by their data characteristics, e.g. all monetary values go to fact table 1, all non-monetary amounts (such as quantities) go to fact table 2, and all count measures go to fact table 3. Not a good design: it is not flexible, and we end up with a greater number of fact tables.
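
To make point 2 concrete, here is what “revenue1 minus revenue2” looks like against the two designs. The fact table names (FactRevenue, FactRevenueNormalised) are made up for the sketch; the columns follow the example above.

-- Design (1): one pass over the fact table
SELECT Date_Key, SUM(Revenue1 - Revenue2) AS NetRevenue
FROM FactRevenue
GROUP BY Date_Key;

-- Design (2): either two passes joined together ...
SELECT r1.Date_Key, r1.Amount - r2.Amount AS NetRevenue
FROM ( SELECT Date_Key, SUM(Measure_Amount) AS Amount
       FROM FactRevenueNormalised WHERE Measure_Type_Key = 1
       GROUP BY Date_Key ) r1
JOIN ( SELECT Date_Key, SUM(Measure_Amount) AS Amount
       FROM FactRevenueNormalised WHERE Measure_Type_Key = 2
       GROUP BY Date_Key ) r2
  ON r1.Date_Key = r2.Date_Key;

-- ... or a conditional aggregation, which still has to read every revenue type's rows
SELECT Date_Key
     , SUM(CASE WHEN Measure_Type_Key = 1 THEN Measure_Amount ELSE 0 END)
     - SUM(CASE WHEN Measure_Type_Key = 2 THEN Measure_Amount ELSE 0 END) AS NetRevenue
FROM FactRevenueNormalised
GROUP BY Date_Key;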

Unless It Is a GL Mart

A General Ledger is a collection of accounts. In a large company there are thousands of accounts. Usually there are 5 main types of accounts: asset accounts, liability accounts, capital accounts, revenue accounts and expense accounts. Each account has a balance at all times, and the balances change every day; transactions (or journal items) change these balances. Imagine that you are viewing a list of these accounts with their balances: that is more or less what is known as a “GL mart”. A large company has more than one ledger (more than one list), for instance one for each subsidiary company.

The fact table is something like this:

Period_Key, Account_Type_Key, Account_Key, Ledger_Key, Balance

The period is usually a calendar month, but in some companies a period is 4 calendar weeks. The account type is usually one of the 5 main account types I mentioned earlier, broken down into a very fine grouping of accounts, usually 3 to 4 levels. The account dimension is usually a parent-child dimension, meaning that each account has a parent account.
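
A minimal sketch of such a GL fact table and its parent-child account dimension is below; the table names and data types are illustrative assumptions.

CREATE TABLE DimAccount
( Account_Key        int IDENTITY(1,1) PRIMARY KEY
, Parent_Account_Key int NULL           -- parent-child: points to another row in this table
, Account_Code       varchar(20)
, Account_Name       varchar(100) );

CREATE TABLE FactGLBalance
( Period_Key       int
, Account_Type_Key int
, Account_Key      int
, Ledger_Key       int
, Balance          money );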

Usually there are other fact tables in a GL mart, for example a transaction fact table, which contains the details of every transaction. The idea is that when you are analysing data using a BI tool and select an account, you can drill down to the transaction level.

OK, enough with the intro; that's a GL mart. Now the key point: in a GL mart there is only one measure, Balance. The value of this measure depends on the account key. It is therefore natural that in the GL fact table we have only one measure, and the meaning of that measure depends on the account key column.

That is different from the “normalising a fact table” discussed in this post, where the fact table has many measure columns and we try to squeeze them into one measure column, which is unnatural.

As usual I welcome comments and discussion at vrainardi@gmail.com. Vincent 26/2/11.

