Data Warehousing, BI and Data Science

8 March 2011

Time Dimension

Filed under: Data Warehousing — Vincent Rainardi @ 10:03 pm
Tags:

This is a classic topic in dimensional modelling. I am not talking about date dimension here, but time dimension, e.g. 2011-03-08 19:31:05.289. Consider this case:

It is an order table in a stock brokerage company (click here for a background). It has these 3 columns: order datetime, cancellation datetime, settlement date. Of course it has other columns, but they are irrelevant to this post. Sample data:

Order Datetime          | Cancellation Datetime   | Settlement Date
2011-03-08 08:59:31.201 | NULL                    | 2011-03-11
2011-02-12 14:07:22.549 | 2011-02-13 06:38:30.216 | 2011-02-16

The SettlementDate is irrelevant in this matter because it is a date column. Let’s take the Order Datetime column as an example, because it is a datetime column. The question is: how are we going to store it in the warehouse/mart? We have 3 options:

a. Create OrderDateKey and OrderTimeKey in the fact table, linking to 2 dimensions: DimDate and DimTime.
b. Create OrderDatetimeKey in the fact table, linking to DimDatetime.
c. Don’t key it, but store it as a datetime column in the fact table.
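
As a minimal sketch (the table names, column names and data types here are my assumptions, not from the source system), the three options would shape the fact table like this:

-- Option a: key the date and the time separately, linking to DimDate and DimTime
create table FactOrderA
( OrderDateKey int not null,       -- links to DimDate
  OrderTimeKey int not null,       -- links to DimTime
  SettlementDateKey int not null   -- settlement date is a date, so a date key is enough
)

-- Option b: key the combined datetime, linking to DimDatetime
create table FactOrderB
( OrderDatetimeKey int not null,   -- links to DimDatetime
  SettlementDateKey int not null
)

-- Option c: no key; store the datetime value itself in the fact table
create table FactOrderC
( OrderDatetime datetime2(3) not null,   -- keeps the millisecond, e.g. 2011-03-08 08:59:31.201
  SettlementDateKey int not null
)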

Some of the things that we need to consider in this case are:

  1. Do you need the millisecond element of the time?
  2. Do you need the “second” element of the time or just the minutes?
  3. Do you need to band the time, e.g. into 1 hour slots or half hour slots?
  4. Do you need to analyse the time element separately from the date element?
    Or the reverse: do you need to analyse just the date element without the time?
  5. Is this a fact table or a dimension table?

In many DW projects option b is considered impractical. Options a and c are chosen more often than b. If the millisecond is needed, usually option c is more suitable. If the millisecond is not needed, usually option a is more suitable.

If it is a fact table, then the tendency is to key the datetime column (option a). If it is a dimension table, the tendency is not to key the datetime column (option c).

If the date is analysed separately from the time element, option a is preferred over option b.

Apart from the 3 options above, we have a “combi” option, i.e. the date column is keyed but the time column is not. This is rarely used; I’m struggling to see the justification, or a case where this approach is more suitable than options a, b and c.

Practically speaking, DimDatetime (option b) can only be used if the grain is minute. We can’t do option b if the grain is second or millisecond.

Here are the row counts for each dim (the word in brackets shows the grain of the dim; the DimDatetime figures assume 10 years of dates):

DimTime (min):              1,440
DimTime (sec):             86,400
DimTime (ms):          86,400,000

DimDatetime (min):      5,256,000
DimDatetime (sec):    315,360,000
DimDatetime (ms): 315,360,000,000

That is why, practically speaking, the last two cannot be used: the dimension would have 315 million and 315 billion rows respectively.
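
As an illustration of the minute grain, here is a sketch of how a DimTime at minute grain could be created and populated with its 1,440 rows. The table name, column names and the hour-slot banding column are my assumptions, not a prescribed design:

-- Minute-grain DimTime: 24 x 60 = 1,440 rows
create table DimTime
( TimeKey int not null primary key,   -- 0 to 1439 (minutes since midnight)
  [Hour] int not null,
  [Minute] int not null,
  Time24 char(5) not null,            -- e.g. '19:31'
  HourSlot char(13) not null          -- e.g. '19:00 - 19:59', for banding into 1 hour slots
)

declare @m int
set @m = 0
while @m < 1440
begin
  insert into DimTime (TimeKey, [Hour], [Minute], Time24, HourSlot)
  values
  ( @m,
    @m / 60,
    @m % 60,
    right('0' + cast(@m / 60 as varchar(2)), 2) + ':' + right('0' + cast(@m % 60 as varchar(2)), 2),
    right('0' + cast(@m / 60 as varchar(2)), 2) + ':00 - ' + right('0' + cast(@m / 60 as varchar(2)), 2) + ':59'
  )
  set @m = @m + 1
end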

As usual I welcome any feedback and discussion at vrainardi@gmail.com. Vincent 8/3/11.

Update, 21/4/2011: If you need to compose a hierarchy between date attributes and time attributes (say day and hour), then it is best to keep them as one dimension (not Dim_Date and Dim_Time separately).
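
As a sketch (all names here are my assumptions), a combined date-and-hour dimension stays small enough to be practical — 24 hours x 3,650 days = 87,600 rows for 10 years — and it supports a Year > Month > Day > Hour hierarchy:

create table DimDateHour
( DateHourKey int not null primary key,   -- e.g. 2011030819 for 2011-03-08 19:00
  [Date] date not null,
  [Year] int not null,
  [Month] int not null,
  [Day] int not null,
  [Hour] int not null
)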

SQLBits 8 Brighton: Advanced Dimensional Modelling

Filed under: Data Warehousing — Vincent Rainardi @ 7:23 pm
Tags:

I will be presenting the Advanced Dimensional Modelling session at SQLBits 8 in Brighton on Saturday 9th April, 12:10 to 13:10. The link to the agenda is here.

The session excerpt is below:

This session is for anybody involved in the design of a data warehouse. Many of us know what dimensions and fact tables are. But as I explained in my blog, dimensional modelling is more than just dimensions and fact tables. This session is about advanced dimensional modelling topics such as Fact Table Primary Key, Vertical Fact Tables, Aggregate Fact Tables, SCD Type 6, Snapshotting Transaction Fact Tables, 1 or 2 Dimensions, Dealing with Currency Rates, When to Snowflake, Dimensions with Multi Valued Attributes, Transaction-Level Dimensions, Very Large Dimensions, A Dimension With Only 1 Attribute, Rapidly Changing Dimensions, Banding Dimension Rows, Stamping Dimension Rows and Real Time Fact Table. Prerequisites: You need to have a basic knowledge of dimensional modelling and relational database design.

I’d like to say congratulations to all the speakers for getting your sessions chosen, particularly those who I know: Dr. John Tunnicliffe, Alistair Aitchison, Jamie Thomson, Chris Testa O’Neil, Marco Russo and Alberto Ferrari. For those who I don’t know yet, I hope I can make your acquaintance there. So to all the speakers and all the audience, see you all there.

If you have any specific question or request about advanced dimensional modelling, please contact me at vrainardi@gmail.com. I will be writing the material in the next 2 weeks. I know the word “advanced” has a different meaning from person to person, but I hope I won’t get too many of “that was way too easy for me” or “that went straight over my head”. Vincent 8/3/11.

My slide deck is here. I will put the slides on SlideShare in the next few days. As usual I welcome any comments and questions at vrainardi@gmail.com.

Update 14/4/11: Click here to view the PowerPoint presentation on SlideShare

5 March 2011

Putting Higher Grain Attributes into a Separate Dimension

Filed under: Data Warehousing — Vincent Rainardi @ 6:23 pm
Tags:

In a fact table there are a few dimension key columns. These columns link to dimension tables. Each dimension table has a grain, the lowest unit that constitutes a row (apart from SCD stuff). The question here is whether we should put an attribute from one of those dimensions into its own dimension. As usual, it is easier to explain by example.

Example 1. ProductGroupKey. In the classic case of a retail mart, in the sales fact table we have ProductKey, DateKey, StoreKey and CustomerKey. In the Product dimension we have, among other attributes, product group. A product group depends only on the product key, hence Product Group is an attribute of the product dimension. The grain of the product dimension is product. Product group has a higher grain than product. Should we separate the product group attributes into their own dimension and have a ProductGroupKey in the sales fact table?
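
To picture the two alternatives in Example 1, here is a minimal sketch (all names and data types are my assumptions for illustration):

-- Alternative 1: keep Product Group as an attribute of the Product dimension
create table DimProduct
( ProductKey int not null primary key,
  ProductName varchar(50) not null,
  ProductGroup varchar(30) not null   -- higher grain attribute kept in DimProduct
)

-- Alternative 2: move Product Group into its own dimension, keyed from the fact table
create table DimProductGroup
( ProductGroupKey int not null primary key,
  ProductGroupName varchar(30) not null
)

create table FactSales
( DateKey int not null,
  StoreKey int not null,
  CustomerKey int not null,
  ProductKey int not null,
  ProductGroupKey int not null,   -- only exists in alternative 2
  SalesAmount money not null
)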

Example 2. CustomerKey. In the stock broker data mart (see my SQLBits presentation here for a background), in the account balance fact table we have AccountKey, OrderDateKey, OrderTypeKey, ShareKey, SettlementDateKey, OrderStatusKey and CancellationDateKey. We have an Account dimension and we also have a Customer dimension. A customer can have several accounts but an account can only belong to one customer, so a customer is fully determined by an account. Hence the CustomerKey should be an attribute of the account dimension. The grain of the account dimension is account. Customer has a higher grain than account. There is no doubt we need a customer dimension, but should we keep the CustomerKey in the account dimension, or should we put the CustomerKey in the fact table?

Example 3. StatusKey. In the product or customer dimension, we have customer status and product status. There could be several statuses. For customer: credit status, CRM status, customer life cycle status, whereas for product: stock status, product life cycle status. Some statuses change once a year, some once a week. Should we put these statuses into a separate dimension and have a status key in the fact table?

The things that we need to consider when making these decisions are:

  1. Whether there are many other attributes at the same level or not. In example 2, customer name is an attribute of an account, but there are many other customer attributes which are at customer level, hence they need to be put into a separate dimension.
  2. This is an influential factor: Whether there are measures which are at that level. In example 1, if we have several measures at Product Group level, and we can’t break those measures down to Product level, then we will need to have a dimension at this grain. We need to have a Product Group dimension.
  3. One way for dealing with a rapidly changing attribute in a type 2 dimension is to put it in its own dimension, with StatusKey column in the fact table. If in the product dimension there are 3 statuses and they are changing every day, then they should be put into a separate dimension, with direct link to the fact tables.
  4. This is an influential factor: Whether the relationship between that higher grain attribute and the key of the dimension is many-to-many. If it is many-to-many, then it needs to be moved to a separate dimension. For example, in Example 1 above, if a product can belong to 2 product groups, then product group cannot stay in the product dimension.
  5. Whether the attribute is also an attribute of another dimension. For example, geography attributes such as city and country. In an insurance data warehouse, both the broker dimension and the risk dimension have a city attribute. We may want to consider putting “city” in a separate dimension called geography or location.

Numbers 2 and 4 above are influential factors, i.e. the decision is sort of “forced” (to borrow a chess term); you don’t have a choice. Numbers 1, 3 and 5, on the other hand, are “non-forced”, i.e. you don’t have to do it; you have a choice.

As usual I welcome questions and discussion at vrainardi@gmail.com. Vincent 5/3/11.

SSIS: Importing a File with Dynamic Columns

Filed under: SQL Server,SSIS — Vincent Rainardi @ 9:25 am
Tags: ,

One of the most difficult things to deal with in the ETL is importing a file with dynamic columns. The file does not have a set number of columns. The number of columns changes from time to time. For example, the file can have 5 columns, or 20 columns, or anything in between.

Below are some examples of such a file:

File1:
Col1,Col2,Col3t1,Col3t2,Col3t3
a,a,1,1,1
b,b,2,2,2
c,c,3,3,3

In the above example the number of columns for Col3 is 3. But it could be 4, 5, 6 or more; we don’t know.

Before we look at the ETL side, we need to look at the data modelling side first: what kind of target table should we create to host these files?

Data Modelling

Essentially there are 2 choices.

1. We could make it normalised like this:

Table1:
Col1,Col2,Col3Type,Col3Value
a,a,t1,1
a,a,t2,1
a,a,t3,1
b,b,t1,2
b,b,t2,2
b,b,t3,2
c,c,t1,3
c,c,t2,3
c,c,t3,3

In principle we design the structure so that we have a fixed number of columns. It doesn’t matter how many columns are in the source file; we will always have 4 columns in the table.
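
As a sketch, the DDL for this normalised version of Table1 could be something like this (the data types are my assumptions):

create table Table1
( Col1 varchar(10),
  Col2 varchar(10),
  Col3Type varchar(10),   -- t1, t2, t3, ... however many columns the file brings
  Col3Value int
)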

The drawback of this design is when we query it. If we need to produce File1 from Table1, we would need to pivot the table like this:

select Col1, Col2, [t1] as Col3t1, [t2] as Col3t2, [t3] as Col3t3
from
( select Col1, Col2, Col3Type, Col3Value
  from Table1
) P1
pivot
( sum(Col3Value) for Col3Type in ([t1], [t2], [t3])
) as P2

Output:
Col1,Col2,Col3t1,Col3t2,Col3t3
a,a,1,1,1
b,b,2,2,2
c,c,3,3,3

If we have 2 tables like that, we need to join them together on the common column(s), and it becomes even trickier. If we are dealing with a million rows, we risk serious performance degradation. This might not be a problem for a data mart, but for a transaction system where millisecond response times are expected, this design does not serve the purpose.

Alternative Design

The alternative is to find out how many columns Col3 can have. We know it could be 3 to 5, but could it be 100 columns? Spending 5 minutes with the subject matter expert to understand the characteristics of Col3 is well worth it. I have seen this dynamic columns case 3 times (the first time was about 12 years ago), and in every case they followed the book/theory (normalise) and every time they regretted it. In the second case (about 8 years ago), when I was involved in troubleshooting the performance issue, in the end we changed the table structure to denormalise it.

So, I would recommend finding out how many columns Col3 can have. Nine times out of ten it is not that many. The response I usually get from the SME is something like this: “Most likely Col3 varies between 3 and 5 columns; it certainly won’t exceed 8”. There is a natural meaning to this data, and that’s why the range is limited. For example, Col3 could be the number of teams, the number of projects, the number of web sites, or the number of product types. In the business world most data is within a certain range. You can’t have 100 teams, for example; it’s practically impossible for a company, as it is limited by budget constraints.

So once we have that information (the maximum number of columns), we create the table according to this limitation. Let’s assume the SME said that the max is 8 columns. So we design Table1 with 10 Col3 columns (just in case) as follows:

create table table1
( Col1 varchar(10),
  Col2 varchar(10),
  Col3t1 int,
  Col3t2 int,
  Col3t3 int,
  Col3t4 int,
  Col3t5 int,
  Col3t6 int,
  Col3t7 int,
  Col3t8 int,
  Col3t9 int,
  Col3t10 int
)

Unless it is a staging table, don’t forget to add a PK column. It is good design to be able to uniquely identify a row at all times. In this example I don’t add a PK column because the target is a staging table.

A table structure with a fixed number of columns like this makes our lives a lot easier. It directly mimics the structure of the source file, and is able to accommodate the dynamic nature of Col3.

The Data Load

I’m going to explain the ETL for the denormalised table, i.e. the one with a fixed 10 columns for Col3. The ETL for the normalised table, i.e. the table with 4 columns, will have to wait until next time I’m afraid, as I have a few other posts to write next week, plus a presentation for SQLBits 8 on Advanced Dimensional Modelling. If you need the ETL for the normalised table with 4 columns, please contact me.

Because of the dynamic nature of Col3 in the source file, the best way is to use a Script Component as a source. There are a lot of resources on the internet about the Script Component. This post shows you the basics. This post (by Lee Everest) shows you how to use a Script Component to split the rows in the source file. Similarly, this MSDN post shows how to import a delimited source file using a Script Component.

The main concept is as follows: first create a Connection Manager for a File Connection (no need to use a Flat File connection) and name this connection File1.

Then create a Data Flow, and drag the Script Component into it. When asked, specify that it will be used as a source.

Then:

  • On the Connection Managers tab, set the connection to File1 (that’s the middle column). In the first column, name it “File1Conn” — this is the name the script below uses to find the connection.

  • On the Inputs and Outputs tab, rename “Output 0” to File1Output. Then create the Output Columns one by one, as per the structure of Table1:

    Col1, Col2, Col3t1, Col3t2, … Col3t10.


  • In this example we will convert the data type of the Col3 columns from string to int not in the Script Component, but later on using a Derived Column / Data Conversion transformation. So for all columns, set the data type to String, length 10.

  • On the Script tab, click on Edit Script and edit it as follows. Don’t forget to add “using System.IO;” and, if you want to use MessageBox for debugging, “using System.Windows.Forms;”. Remember that you can’t debug an SSIS Script Component (see my post here), but you can debug an SSIS Script Task.
private StreamReader SR;
private string File1;

public override void AcquireConnections(object Transaction)
{
  // Get the connection for File1
  IDTSConnectionManager100 CM = this.Connections.File1Conn;
  File1 = (string)CM.AcquireConnection(null);
}

public override void PreExecute()
{
  // Create a reader for File1
  base.PreExecute();
  SR = new StreamReader(File1);
}

public override void CreateNewOutputRows()
{
  // Declare variables
  string nextLine;
  string[] columns;
  char[] delimiters;
  int Col3Count;
  String[] Col3Value = new string[10];

  // Initialise all 10 values to empty strings, so that Col3 columns which are
  // not present in the file are output as empty strings rather than null
  for (int i = 0; i < Col3Value.Length; i++)
  {
    Col3Value[i] = "";
  }

  // Set the delimiter
  delimiters = ",".ToCharArray();

  // Read the first line (header)
  nextLine = SR.ReadLine();

  // Split the header into columns
  columns = nextLine.Split(delimiters);

  // Find out how many Col3 columns there are in the file
  Col3Count = columns.Length - 2;
  // MessageBox.Show(Col3Count.ToString());

  // Read the second line and loop until the end of the file
  nextLine = SR.ReadLine();
  while (nextLine != null)
  {
    // Split the line into columns
    columns = nextLine.Split(delimiters);

    // Add a row to the output buffer
    File1OutputBuffer.AddRow();

    // Copy the Col3 values from the file into the array
    for (int i = 0; i <= Col3Count - 1; i++)
    {
      Col3Value[i] = columns[i + 2];
    }

    // Set the values of the Script Component output according to the file content
    File1OutputBuffer.Col1 = columns[0];
    File1OutputBuffer.Col2 = columns[1];
    File1OutputBuffer.Col3t1 = Col3Value[0];
    File1OutputBuffer.Col3t2 = Col3Value[1];
    File1OutputBuffer.Col3t3 = Col3Value[2];
    File1OutputBuffer.Col3t4 = Col3Value[3];
    File1OutputBuffer.Col3t5 = Col3Value[4];
    File1OutputBuffer.Col3t6 = Col3Value[5];
    File1OutputBuffer.Col3t7 = Col3Value[6];
    File1OutputBuffer.Col3t8 = Col3Value[7];
    File1OutputBuffer.Col3t9 = Col3Value[8];
    File1OutputBuffer.Col3t10 = Col3Value[9];

    // Read the next line
    nextLine = SR.ReadLine();
  }
}

public override void PostExecute()
{
  // Close the reader
  base.PostExecute();
  SR.Close();
}
  • Save the script, close SSIS script editor, click OK to get out of the Script Transformation Editor, back to the Data Flow.
  • I would suggest not putting the output into a database table directly, but sending the output to a file first to check the result. This way you don’t need to worry about data conversion between string and int etc. Create a flat file connection with 12 columns. All the columns are string, with length 10.

  • Run it. If it fails, check that the source file matches the sample above, and that the script matches the listing above.

  • Create an OLE DB Destination or SQL Server Destination (see my post here for why, practically speaking, we can’t use the SQL Server Destination). But create the table first by executing the DDL I wrote above. Note that if you are using the SQL Server Destination (because you are loading to your local instance), then on Windows 7 or Vista you need to run BIDS as Administrator, otherwise you’ll get this error: “Unable to prepare SSIS bulk insert”, or something like that.
  • Create a Data Conversion transformation to convert the Col3 columns from string to int.

  • Connect the output of the Data Conversion to the destination and map the Data Conversion output columns to the table columns.

  • Now execute it again, and check the target table.

  • Overall, the Data Flow is: the Script Component source, then the Data Conversion, then the destination.

As usual I welcome any questions and discussion at vrainardi@gmail.com. Vincent, 5/3/11.


2 March 2011

A Measure or An Attribute

Filed under: Data Warehousing — Vincent Rainardi @ 7:57 am
Tags:

Sometimes we are not sure whether a number should be a measure or an attribute. As always it is easier to learn by example.

Example 1. It’s a customer loyalty programme in an eShop, an eCommerce company. Every year they give all customers a simple survey containing 3 questions. One of the questions is: how satisfied are you with eShop.com? It’s an integer between 1 (lowest) and 10 (highest). The question is: is this customer satisfaction number a measure or an attribute? Would you a) create a fact table with CustomerKey and YearKey and have CustomerSatisfaction as a measure, or b) put it as an attribute in DimCustomer?

Example 2. eEverything.com sells thousands of product items in 5 categories: eBook, eFilm, eMusic, eSoftware, eGames. Every December, they analyse everything that each customer bought from eEverything.com and give them 5 numbers: the probability that the customer will purchase an eBook/eFilm/eMusic/eSoftware/eGames in the coming year. These 5 probability numbers will drive the CRM email marketing campaign in the next year. Are these probability numbers measures or attributes? Would you a) create a fact table with CustomerKey and YearKey and have them as 5 measure columns, or b) put them as 5 attributes in DimCustomer?

Example 3. MyStore.com is an eCommerce company that lets customers build their own stores. Customers can search millions of products and choose what they want to put in their stores, from socks to cups, from notebooks to TVs. Rather than “browsing the whole internet” they have everything that they like in one place. They can even add the same product from two different suppliers (with a different price, of course). Because MyStore.com has millions of customers, they can aggregate the products that are wanted by many customers. Say that this month, product A (which is $10) is wanted by 100,000 people. MyStore then asks the supplier for a 5% discount. Knowing that they will get a lot of sales, the supplier usually agrees to give the discount. MyStore then asks the customers if they would like to buy it for $9.50. Each product has a number which shows the probability that the product will be given a discount, because the demand hits a “mass scale”. Every day, the MyStore data warehouse system calculates this number, which is called PFD, Probability For a Discount. Would you put PFD as a measure in a fact table (with ProductKey and DateKey) or as an attribute of the product dimension?

In my opinion, the factors that we need to consider when making this decision are:

  1. How frequently does it change? If it changes daily, then put it as a measure. If it never changes, then put it as an attribute. If it changes once a year, we can handle it as SCD, provided the dim is not too large (say up to 1 million members).
  2. Usually a number is stored as a measure, but if it is not aggregatable then what’s the point? It makes sense to put it in a dimension as an attribute. But see grain below (point 3). If it is aggregatable, then there is an advantage in putting it as a measure.
    All 3 examples above are not aggregatable.
  3. Is the grain the same as the dimension? In the 3 examples above, the grain is only the customer or the product (plus time of course). But if the grain spans several dimensions (like this example) then we cannot put it in one dimension; it needs to be put in a fact table.
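
Taking Example 1 as an illustration, the two designs would look something like this (a sketch only; names and data types are my assumptions):

-- Option a: customer satisfaction as a measure in a fact table
create table FactCustomerSurvey
( CustomerKey int not null,
  YearKey int not null,
  CustomerSatisfaction tinyint not null   -- 1 to 10; not aggregatable, so don't SUM it
)

-- Option b: customer satisfaction as an attribute of the customer dimension
-- (it is refreshed once a year, so it could be handled as SCD if history is needed)
create table DimCustomer
( CustomerKey int not null primary key,
  CustomerName varchar(50) not null,
  CustomerSatisfaction tinyint null
)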

As usual I welcome comments or discussion at vrainardi@gmail.com. Vincent 2/3/11.

1 March 2011

Non Aggregatable Measures

Filed under: Data Warehousing — Vincent Rainardi @ 10:46 pm
Tags:

In insurance, the probability that an event happens is a non aggregatable measure. There are also several measures in investment banking which are not aggregatable, particularly those related to risk. Not aggregatable means you can’t sum it up. For example:

  • The probability of building A in City X getting an earthquake in 2011 is 3×10⁻⁶.
  • The probability of building B in City Y getting an earthquake in 2011 is 4×10⁻⁶.
  • The probability of either building A or building B getting an earthquake in 2011 is 5.6×10⁻⁶. You can’t just add them up.

Another example:

  • The probability of building A in City X getting an earthquake in 2011 is 3×10⁻⁶.
  • The probability of building A in City X getting a windstorm in 2011 is 2×10⁻⁶.
  • The probability of building A in City X getting an earthquake or a windstorm in 2011 is 4.7×10⁻⁶. You can’t add them up.

We would probably design the fact table as normal:

BuildingKey, CityKey, YearKey, PerilKey, Probability
1, 1, 1, 1, 3×10⁻⁶
2, 2, 1, 1, 4×10⁻⁶

But we need to remember that in the BI (reports, cubes) we can’t sum the measure up.

One thing that we need to ask ourselves is: on which dimension is it not aggregatable? In the example above, is it aggregatable on year?

If a measure is not aggregatable on one dimension, it does not mean that it’s not aggregatable in all dimensions.
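
As a sketch of the design above (names and data types are my assumptions), with a reminder in the comment that the BI layer must not simply sum this measure:

create table FactEventProbability
( BuildingKey int not null,
  CityKey int not null,
  YearKey int not null,
  PerilKey int not null,
  Probability float not null   -- e.g. 0.000003; non aggregatable: do not SUM across buildings or perils
)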
