Data Warehousing and Data Science

20 March 2012

SSIS: Loading a Big File Fast

Filed under: SSIS — Vincent Rainardi @ 7:21 am
Tags:

Loading a big file into a table in SQL Server

The differences between a slow load and a very fast load are:

  1. Flat file source: ValidateExternalMetadata = false
  2. OLE DB Destination: AccessMode: OpenRowSet using FastLoad (default is OpenRowSet)

Flat file: 230,000 rows, 15 columns, pipe separated, qualifier: double quote
SSIS: data conversion to Unicode on 10 columns

Big difference in performance:

  • Before: 229 seconds (3.8 minutes)
  • After: 3 seconds

So next time you load a file into a table using SSIS, please remember to use these settings.

14 March 2012

Name of Surrogate Key Columns

Filed under: Data Warehousing — Vincent Rainardi @ 5:15 am
Tags:

The convention amongst Kimball practitioners is to name the surrogate key columns as the dimension name + “_key”, i.e product_key, claim_key, risk_key. And this is done both in the fact table and in the dimension table. This way, when you create the BO universe, or SSAS cube, or MicroStrategy metadata/repository, it’s easier because we can say ‘link on the same name’

But recently I came across a different naming convention. In the fact table the surrogate key column is called PK_[Dim Name]_ID and in the dimension it is called FK_[Dim_Name]_ID. For example, the surrogate key of the product dimension is PK_Product_ID and in the fact table it is called FK_Product_ID. The benefit of this naming convention is so that in Toad, or in SSMS, when you type

select * from fact_daily_sales f
join dim_product dp on dp.

As soon as you type “.” the intellisense pops out displaying all the columns in dim_product and when you type “PK” there is only 1 column selected: PK_Product_ID. So it becomes:

select * from fact_daily_sales f
join dim_product dp on dp.pk_product_id = f.

And as soon as you type “.” the intellisense pops out displaying all the columns in fact_daily_sales table and when you type “FK” Toad/SSMS offers you all the surrogate keys in that fact table. Beautiful isn’t it?

Yes, but with 1 disadvantage: you can’t join on names between the fact table and the dimension table because the surrogate key columns have different names.

 

Name of Dimension and Fact Tables

Whilst I’m at it (naming), I might as well write about naming the dimension and fact tables. The convention amongst Kimball practitioners is to prefix the dimension table name with “dim_”, i.e. dim_[Dim Name]. And for the fact table the prefix is “fact_”, i.e. dim_[Fact Table Name]. For example: dim_product, dim_customer, fact_customer_profitability. This is because of the examples set by the Kimball Group, and by Microsoft SQL Server product team when they created Adventure Works data warehouse.

To differentiate transaction fact tables from snapshot fact tables, the data warehouse team in some companies added “_trans” or “_snapshot” suffix. And to differentiate accumulating snapshot fact table form periodic snapshot fact table, they add “_acc” suffix. For example: fact_sales_trans and fact_inventory_level_snapshot_acc. If it is a periodic snapshot, they mention the frequency, i.e. daily or monthly: fact_daily_sales_snapshot, fact_monthly_sales_snapshot.

In some companies, they shorten it further by naming the dimension table “D_” instead of “dim_” and the fact tables as “F_” instead of “fact_”. I personally prefer the D and F convention rather than “dim_” and “fact_”. Reason: we know that there are only 3 types of tables in the whole warehouse: fact, dimension and bridge.

Rather than using underscore (“_”) to separate words, I prefer to camel case it, to match Microsoft / .NET naming standard, i.e. fDailySales, dProduct, dCustomer. I prefer to type the full words rather than abbreviated them, i.e. instead of fProdOutput I type fProductionOutput.

I have yet come across data warehouse team who put the word “dim” and “fact” at the end, i.e. SalesFact, sales_fact, ProductDim or product_dim. Everyone seems to be putting the “dim” and “fact” in the front.

A data warehouse user once said to me: “Fact tables are fine, but dim tables? We want bright tables, not dim ones” 🙂

13 March 2012

Star Schema or Snowflake

Filed under: Data Warehousing — Vincent Rainardi @ 8:05 am

Last week somebody asked me that classic question again: which one should I do, star schema or snowflake and why?

Star schema, because:
1. Query performance
Because it is 1 level join, star schema perform better in query times than snowflake. We don’t need to join multiple tables to form a dimension. All we need to join is the dimension table and the fact table. If your BI tool is ‘all in memory’ such as QlikView, PowerPivot, and SSAS (MOLAP), then they are fast. But if your BI tool is ‘touch the disk everytime’ such as BO InfoView and SSRS, then you need

2. Simplicity in design
This is the main reason why I prefer star schema than snowflake. It is simple. We do not normalise the dimension. We put all the product attributes in 1 dimension. It makes it simple. It is easy to relate one attribute and another. We don’t need to join tables.
The most important reason to choose Star Schema is this: simplicity in SCD type 2. With Star Schema, product dimension is one table so it is easier to make it type 2. With Snowflake it is scattered over 5-6 tables and it is a nightmare to implement type 2.

3. Flexibility for hierarchy
The relationship between hierarchy levels can be different for each dimension. The relationship of city-area-region in the customer dimension can be different from the relationship of city-area-region in the branch dimension. And it can be different in supplier dimension. We don’t put them in 1 central location like in Snowflake. This makes it flexible to define the hierarchy.

4. Compatibility with BI Tools and database engine
Star schema has wider recognition in the DWBI industry. Some database engines such as SQL Server 2008 recognise that it is a star schema and optimise the query accordingly. Some BI tools such as BO and MicroStrategy can create the metadata layer automatically based on star schema.

As a principle, we always have to look at the other side. We must not look at one side only. The disadvantages of Star Schema are:

1. Longer load time
Because we need to perform joins to get all the data required, the ETL in Star Schema takes longer time than Snowflake. But please see SCD type 2 in point 2 above. It make it longer to load a Snowflake Schema if the dimension is type 2.

2. Inflexibility
In Star Schema the dimension is already formed in its entirity. So it’s not possible to dismantle it and offer half of it to one fact table, and offer the other half to another fact table. In Snowflake we can. The dimension is stored in several pieces, enabling us to take the pieces we need.

Now, after all above, you need to read this: When to snowflake (link).

Above is my opinion. I know there are many experts who know much more than me, who may have different opinions. So as always I welcome any comments and discussions at vrainardi@gmail.com

Vincent Rainardi 13/3/2012

Blog at WordPress.com.