Data Warehousing and Business Intelligence

15 December 2009

Building Cubes from Transaction Systems – Creating a dimension from a large table

Filed under: Analysis Services — Vincent Rainardi @ 7:02 am

At SQLBits last month I presented a session about building cubes from transaction systems. Presentation slides are here. I’m going to write some of the points from that session as blog posts.

One of the primary challenges of building cubes from a transaction system is that we don't have nice, small dimension tables. Sometimes we need to create a dimension from a large table, say 100 million rows. If we simply use the primary key of the large table as the dimension key, the DSV query has the same grain as the table itself, and we end up with 100 million rows in the dimension:
select pk as dim_key, attribute1, attribute2 from large_table

Dimension processing will be a lot quicker if we select distinct those attributes and use their concatenation as the dim key:
select distinct attribute1 + attribute2 as dim_key, attribute1, attribute2 from large_table

This way the dimension contains only the distinct combinations of attribute1 and attribute2 (say 10,000 rows) rather than 100 million rows. The query will be even faster if attribute1 and attribute2 are included in a covering index.
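For illustration, the covering index could look like this (a sketch using the table and column names from the example; the index name is made up):

create nonclustered index ix_large_table_attributes
on large_table (attribute1, attribute2)
-- the select distinct can be answered from this index alone,
-- without scanning the 100-million-row table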

Some notes when doing this:

  1. NULL + 'a' = NULL, so convert NULLs to blank strings like this: isnull(attribute1,'')
  2. Put a separator such as |, ~ or ¬ between the attributes, both for clarity and so that two different attribute combinations cannot concatenate to the same key (see the snippet after this list)
  3. Trim the attributes that make up the dim key. The dim key also appears on the fact table, and the fact table is usually a large table, so we want its columns to be slim.
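To illustrate point 2: without a separator, two different attribute combinations can collide into a single dim key. A quick check in SQL Server:

select 'ab' + 'c' as key1, 'a' + 'bc' as key2
-- both return 'abc': two different combinations, one dim key

select 'ab' + '|' + 'c' as key1, 'a' + '|' + 'bc' as key2
-- returns 'ab|c' and 'a|bc': the keys stay distinct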

So the query becomes:
select distinct isnull(rtrim(attribute1),'') + '|' + isnull(rtrim(attribute2),'') as dim_key,
attribute1, attribute2 from large_table

In the fact table SQL we produce the same dim key by joining to the large table on its PK:
select …, isnull(rtrim(attribute1),'') + '|' + isnull(rtrim(attribute2),'') as dim_key
from trans_table A
left join large_table B on A.fk = B.pk
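Because the same dim key expression now lives in two DSV queries, it is easy for them to drift apart. One way to avoid that (a sketch, assuming the table and column names above; the view name is made up) is to define the expression once in a view and have both the dimension and the measure group queries select from it:

create view large_table_dim_key as
select pk,
    isnull(rtrim(attribute1),'') + '|' + isnull(rtrim(attribute2),'') as dim_key,
    attribute1,
    attribute2
from large_table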

But that fact query joins a large table to a large table, and that is going to be slow! Yes, but:
a) We can partition the measure group so each partition query reads only a slice of the fact data (see the sketch after this list), whereas a dimension cannot be partitioned.
b) SSAS puts a select distinct around the dimension query, and a select distinct over 100 million rows takes a while. The partition query, on the other hand, is a straightforward query with no select distinct around it.
c) When A and B are master-child transaction tables (like order header and order detail), both tables are indexed on the PK-FK columns (order ID), so the join is efficient.
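As an illustration of point a), here is what a monthly partition query might look like. This is only a sketch: the measure column amount and the date column trans_date are assumptions, not columns from the original example.

select A.amount, -- hypothetical measure column
    isnull(rtrim(B.attribute1),'') + '|' + isnull(rtrim(B.attribute2),'') as dim_key
from trans_table A
left join large_table B on A.fk = B.pk
where A.trans_date >= '20091101' -- hypothetical date column
  and A.trans_date < '20091201'  -- this partition covers November 2009 only

Each partition scans one month of trans_table instead of the whole fact table, so no single processing query has to churn through all the rows at once.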

I found that overall it is quicker to put the big query on the measure group than on the dimension.
