Pure Kimball practitioners usually leave degenerate dimension (DD) attributes in the transaction fact table. In some data warehouse implementations, however, people place these attributes in a dimension called the Transaction Dimension. It usually has the same grain as the transaction fact table (which is why it is discouraged) and contains DD attributes such as the transaction number. In this article I will discuss the transaction dimension: what it is, the pros and cons, and how to implement it.
What is a Transaction Dimension?
A Transaction Dimension is a dimension table containing the textual details of a business transaction that are left over in the source transaction table after all the other attributes have been moved to other dimensions. Consider a classic retail scenario: a supermarket. The source transaction tables are the order table and the order line table. The transaction fact table's grain is the order line; in other words, there is 1 row in the fact table for each row in the order line table. Columns from these 2 tables are put into 4 main dimensions: customer, product, store and date.
What's left on the order tables are the order ID, order line ID and scan timestamp columns. We could either: a) leave these columns in the fact table, b) put them in the transaction dimension, or c) throw them away.
If we put the columns in the fact table, we will find that the order ID is repeated across several fact rows, because the order ID is the same for all the lines in the order. In some cases we may find other attributes that are also repeated across fact rows, for example: cashier, price scheme, advertisement, discount scheme, exchange rate profile, notes, special condition and scan timestamp. Generally, if an attribute is repeated across fact rows, it shouldn't be put in the fact table; it should be put in a dimension table.
Some of these attributes have low cardinality (10,000 or fewer distinct values) and hence should go into their own dimensions (not the transaction dimension), for example: a discount scheme dimension, a cashier dimension and a time of day dimension (truncated to the minute or second). But some of them have very high cardinality, which means that they are different for almost every transaction, for example: order ID, order line ID, notes, special condition and scan timestamp. These high cardinality attributes can be put in the transaction dimension.
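The cardinality test above is easy to mechanise in the ETL. Here is a minimal sketch; the column names, sample data and threshold are illustrative assumptions, not from any specific tool:

```python
# Split leftover source columns by cardinality: low-cardinality columns get
# their own dimension (e.g. cashier), high-cardinality columns go to the
# transaction dimension (e.g. order ID). Threshold is an assumption.

def classify_leftover_columns(rows, columns, threshold=10_000):
    own_dimension, transaction_dimension = [], []
    for col in columns:
        distinct = len({row[col] for row in rows})
        if distinct <= threshold:
            own_dimension.append(col)
        else:
            transaction_dimension.append(col)
    return own_dimension, transaction_dimension

# Tiny illustrative sample: cashier repeats across rows, order_id is
# (almost) unique per row, so a low threshold separates them.
rows = [
    {"cashier": "C1", "order_id": 1001},
    {"cashier": "C1", "order_id": 1002},
    {"cashier": "C2", "order_id": 1003},
]
own_dim, trans_dim = classify_leftover_columns(
    rows, ["cashier", "order_id"], threshold=2)
```

In practice you would run the distinct counts as SQL against the source tables rather than in Python, but the decision rule is the same.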
Reasons for Creating a Transaction Dimension (the Pros)
There are 4 reasons why we create a transaction dimension:
- Performance. By creating a transaction dimension we essentially move the DD columns to a dimension table, away from the heavily trafficked fact table. DW queries don't touch those DD columns unless they need to. If you have a few DD attributes totalling, say, 30 bytes, they are replaced by a single 4-byte int column. That makes the fact table slimmer, which is better for queries.
- Snapshot. If you build a periodic snapshot fact table, the saving is even greater. Say your monthly snapshot fact table stores the last 10 years of data = 120 months = 120 copies of the transaction fact table. Rather than repeating the DDs 120 times, they are stored once in the transaction dimension, and all that is left on the fact table is a single slim int column: the transaction key.
- Greater grains. Some fact tables have grains greater than the transaction. For example, a financial transaction (say a payment received from a customer) could be posted into 4 different accounts in the general ledger fact table: 1 bank account, 2 income accounts and 1 VAT account. That single financial transaction becomes 4 G/L lines, i.e. 4 fact rows. Because it's a single financial transaction, it has only 1 row in the transaction dimension, and all 4 fact rows have the same value in the transaction key column. Just like a normal dimension, the transaction dimension is coarser-grained than the fact table: if the fact table contains 10 million rows, the transaction dimension might contain 3 million rows, because 1 transaction creates 3 to 4 rows in the fact table but only 1 row in the transaction dimension.
- Linked/related transactions. We may also find that some transactions are related. For example, in the mobile industry, the 3rd call that a customer makes in a day may get a cheaper tariff, say half price compared to the first 2 calls. Or, in retail, a purchase of a kitchen might need to be created as 2 related orders (delivered on the same date), because the worktop is made to order. In these situations, rather than creating a "related order" column on the fact table, it is better to create the "related order" column on the transaction dimension, because a) an order can consist of many fact rows (1 row per order line), so the "related order" number would be duplicated across these fact rows, and b) by putting the "related order" column in the transaction dimension, the fact table becomes slimmer and faster.
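The performance and snapshot arguments above are easy to sanity-check with back-of-envelope arithmetic, using the article's illustrative numbers (30 bytes of DD attributes, a 4-byte int key, 120 monthly snapshots) plus an assumed 10-million-row fact table:

```python
# Rough storage saving from moving DD attributes out of a snapshotted
# fact table. All figures are illustrative assumptions.

FACT_ROWS = 10_000_000   # assumed fact table size
SNAPSHOTS = 120          # 10 years of monthly snapshots
DD_BYTES  = 30           # DD attributes if left in the fact table
KEY_BYTES = 4            # int surrogate key to the transaction dimension

with_dd   = FACT_ROWS * SNAPSHOTS * DD_BYTES    # DDs repeated 120 times
with_key  = FACT_ROWS * SNAPSHOTS * KEY_BYTES   # slim key repeated instead
saving_gb = (with_dd - with_key) / 1024**3

# The DDs are still stored once, in the transaction dimension itself:
dim_overhead_gb = FACT_ROWS * DD_BYTES / 1024**3

print(f"Fact-table saving: {saving_gb:.1f} GB "
      f"(dimension costs {dim_overhead_gb:.2f} GB once)")
```

The dimension stores the 30 bytes only once per transaction, so the net saving remains close to the gross figure.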
OK, those are the benefits. What are the disadvantages? In what situations is it not suitable?
- If your fact table is a transaction fact table, the grain of the transaction dimension is the same as the grain of the fact table, and you only have 1 DD column, it is perhaps better to leave the DD in the fact table. You don't save much space or gain much speed by putting it in a transaction dimension.
- If the mart is only used to supply SSAS, then there is little point in having a physical transaction dimension table, because in SSAS you can model the transaction dimension as a "fact dimension": SSAS creates it "on the fly" from the fact table.
- If you use the transaction dimension as a place to put many attributes (as opposed to storing them in the main dimension), with the argument "well, that's the value of the attribute when the transaction happened, as opposed to the value of the attribute now" – this is not the right approach; use a type 2 SCD for this.
- Any dimension with the same grain as the fact table (like the transaction dimension) is questionable: do we really need this dimension? Is it really going to be used? Wide textual columns such as notes – are users going to query them? If the purpose of creating the mart/warehouse is analytics, will that free-text notes column ever be used in OLAP?
- Any dimension with the same grain as the fact table can potentially cause performance issues (unless your fact table is small). Say your fact table is 10 million rows; the transaction dimension is also 10 million rows. Joining 10 million rows to 10 million rows can be slow, especially if the transaction dimension is not physically ordered on the joining column.
How do we use it?
A very large percentage of data warehouse queries is on the main dimension (account in banking, product in retail, customer in utilities, policy in insurance, patient in healthcare, ticket in airlines, call in telecom, etc.) and the date dimension. Only a very small percentage of data warehouse queries – probably less than 1% – is on the transaction dimension. This is understandable, as by definition a data warehouse is not a transactional system.
If you find that transactional queries are more than 10%, ask yourself this question: why are these queries not run against the operational systems or the ODS? In some cases the answers are genuine: because we don't have an ODS. Or worse: because we don't have a reporting facility in the operational system – our operational system is custom built, and when we built it we decided that all reporting would be done from the warehouse. That doesn't mean these are the right reasons, though; quite the opposite.
Anyway, I digress. Back to the topic of this section: how do we use a transaction dimension? A typical way is to constrain the star query to a particular transaction ID, then go to the fact table via the transaction key to get the necessary rows.
Another way of using a transaction dimension is to put its key on the main dimension, e.g. customer or account, so you can go from the customer/account dimension directly to the transaction dimension without going through the fact table.
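The typical star query described above can be sketched end to end. This example uses an in-memory SQLite database as a stand-in for the warehouse; all table and column names are illustrative assumptions:

```python
# Star query sketch: constrain the transaction dimension on a transaction ID,
# then reach the fact rows via the transaction key.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE dim_transaction (
        transaction_key INTEGER PRIMARY KEY,
        order_id        TEXT,
        scan_timestamp  TEXT
    );
    CREATE TABLE fact_sales (
        transaction_key INTEGER,
        product_key     INTEGER,
        amount          REAL
    );
    INSERT INTO dim_transaction VALUES (1, 'ORD-1001', '2024-01-05 09:13:02');
    INSERT INTO dim_transaction VALUES (2, 'ORD-1002', '2024-01-05 09:14:40');
    -- one order with two order lines -> two fact rows share transaction_key 1
    INSERT INTO fact_sales VALUES (1, 501, 9.99);
    INSERT INTO fact_sales VALUES (1, 502, 4.50);
    INSERT INTO fact_sales VALUES (2, 501, 9.99);
""")

# Constrain on the transaction ID, then join to the fact table on the key.
rows = con.execute("""
    SELECT f.product_key, f.amount
    FROM dim_transaction d
    JOIN fact_sales f ON f.transaction_key = d.transaction_key
    WHERE d.order_id = 'ORD-1001'
    ORDER BY f.product_key
""").fetchall()
print(rows)  # the two order lines of ORD-1001
```

Note how the filter is applied on the slim dimension table, and only the matching fact rows are touched.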
Can we have 2 transaction dimensions? It is rare but possible, e.g. when there are 2 grains (a higher level and a lower level), or when there are 2 different types of transactions, such as premium vs claim in insurance, or mortgage vs checking transactions in banking. In the latter case, each type of transaction is normally located in a different fact table, so fact table 1 uses transaction dimension 1, while fact table 2 uses transaction dimension 2.
Indexing and Loading
Create nonclustered indexes on attributes that have a) low density (many distinct values) and b) high selectivity (the where clause / query predicate is specific, e.g. selecting only 1 row out of 1 million). Create a clustered index on the transaction key, just like on any other dimension. This is very important, as the transaction dimension is almost always accessed by joining from the fact table, and this index will be used for that join.
Consider partitioning the transaction dimension if it contains more than 1 million rows. The question is always "on what key?" A range (left or right doesn't matter) of Transaction ID is a good candidate, as Transaction ID will frequently be used in the query predicate. If you do partition it, you still need the clustered index on the transaction key (as the second key, because the first key must be the partitioning key). It is possible to partition on the transaction key (the dimension's PK), but maintaining that partitioning key would require quite a bit of work.
99.9% of the time the transaction dimension is type 1, not type 2, because the attributes do not change – they are DD attributes. If you do make it type 2, it is beneficial to add an "is current" flag; we can't rely on the validity date columns for query performance.
The daily loading routine is to insert new rows from the source transaction table into both the fact table and the transaction dimension. The identity column on the transaction key (the clustered PK) makes this a quick insert. Updates and deletes are very unlikely, as the transaction dimension only contains degenerate dimension attributes.
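The insert-only daily load can be sketched as follows, again using in-memory SQLite as a stand-in for the warehouse. The table names, the natural-key matching and the identity-column mechanics (`AUTOINCREMENT` here, an `IDENTITY` column in SQL Server) are illustrative assumptions:

```python
# Daily load sketch: each new source row becomes one row in the transaction
# dimension and one row in the fact table. Re-running loads nothing new,
# since the routine is insert-only (no updates or deletes).
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE src_order_line (order_id TEXT, line_no INTEGER, amount REAL);
    CREATE TABLE dim_transaction (
        transaction_key INTEGER PRIMARY KEY AUTOINCREMENT,  -- identity column
        order_id TEXT,
        line_no  INTEGER
    );
    CREATE TABLE fact_sales (transaction_key INTEGER, amount REAL);
    INSERT INTO src_order_line VALUES ('ORD-1001', 1, 9.99);
    INSERT INTO src_order_line VALUES ('ORD-1001', 2, 4.50);
""")

def daily_load(con):
    # Find source rows not yet in the dimension (matched on the natural key).
    new_rows = con.execute("""
        SELECT s.order_id, s.line_no, s.amount
        FROM src_order_line s
        LEFT JOIN dim_transaction d
          ON d.order_id = s.order_id AND d.line_no = s.line_no
        WHERE d.transaction_key IS NULL
    """).fetchall()
    for order_id, line_no, amount in new_rows:
        cur = con.execute(
            "INSERT INTO dim_transaction (order_id, line_no) VALUES (?, ?)",
            (order_id, line_no))
        # The generated identity value becomes the fact row's transaction key.
        con.execute("INSERT INTO fact_sales VALUES (?, ?)",
                    (cur.lastrowid, amount))
    con.commit()

daily_load(con)
daily_load(con)  # second run inserts nothing: all source rows already loaded
```

In a real warehouse the "new rows" filter would typically be a watermark on a load date or transaction ID rather than a full anti-join, but the shape of the routine is the same.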
Like everything else in data warehousing, with the Transaction Dimension there are a lot of "it depends", so I'd welcome any discussion, corrections and opinions.
Source: Statistics Used by the Query Optimizer in Microsoft SQL Server 2005, by Eric Hanson and Lubor Kollar