At SQLBits last month I presented a session about building cubes from transaction systems. Presentation slides are here. I’m going to write some of the points from that session as blog posts.
One of the primary challenges of building cubes from a transaction system is that we don’t have nice small dimension tables. Sometimes we need to create a dimension from a large table, say 100 million rows. When we do this, if we use the primary key of the large table, the DSV query will have the same grain as the large table and we will end up with 100 million rows in our DSV query:
select pk as dim_key, attribute1, attribute2 from large_table
Dimension processing will be a lot quicker if we select distinct those attribute, and use their concatenation as the dim key:
select distinct attribute1 + attribute2 as dim_key, attribute1, attribute2 from large_table
This way the dimension will only contain the crossjoin of attribute1 & attribute2 (say 10,000 rows) rather than 100 million rows. The query will be even faster if attribute1 & 2 are included in a covering index.
Some notes when doing this:
- NULL + ‘a’ = NULL, so convert NULL to blank string like this: isnull(attribute1,’’)
- Use |, ~ or ¬ to separate attributes for clarity
- Trim the attributes on the dim key. Dim key is also on the fact table and fact table is usually a large table so we want its columns to be slim.
So the query becomes:
select distinct isnull(rtrim(attribute1),’’) + ‘|’ + isnull(rtrim(attribute2),’’) as dim_key,
attribute1, attribute2 from large_table
On the fact table SQL we produce the same dim key, by joining on the PK of the large table:
select …, isnull(rtrim(attribute1),’’) + ‘|’ + isnull(rtrim(attribute2),’’) as dim_key
from trans_table A
left join large_table B on A.fk = B.pk
But that’s joining a large table with a large table and it’s going to be slow! Yes but:
a) We can partition the measure group so the query is not so slow. Whereas on the dimension we can’t partition it.
b) SSAS put select distinct around the dimension query. A select distinct on 100 million takes a while. Whereas the partition query is a straight forward query (no select distinct around it).
c) In the case A & B being master-child trans tables (like order header & order detail), both tables are indexed on the PK-FK columns (order ID) so the join is efficient.
I found that overall it is quicker to put the big query on the measure group compared to putting it on the dimension.