Pure Kimball practitioners usually leave degenerate dimension (DD) attributes in the transaction fact table. In some data warehouse implementations, however, people place these attributes in a dimension called the Transaction Dimension. It usually has the same grain as the transaction fact table (which is why it is discouraged) and contains DD attributes such as the transaction number. In this article I will discuss the transaction dimension: what it is, the pros and cons, and how to implement it.
What is a Transaction Dimension?
A Transaction Dimension is a dimension table containing the textual details of a business transaction that are left over in the source transaction table after all the other attributes have been moved to other dimensions. Consider a classic retail scenario: a supermarket. The source transaction tables are the order table and the order line table. The transaction fact table's grain is the order line; in other words, there is 1 row in the fact table for each row in the order line table. Columns from these 2 tables are put into 4 main dimensions: customer, product, store and date.
What's left on the order tables are the order ID, order line ID and scan timestamp columns. We could either: a) leave these columns in the fact table, b) put them in the transaction dimension, or c) throw them away.
If we put the columns in the fact table, we will find that the order ID is repeated across several fact rows, because the order ID is the same for all the lines in the order. In some cases we may find other attributes that are also repeated across fact rows, for example: cashier, price scheme, advertisement, discount scheme, exchange rate profile, notes, special condition and scan timestamp. Generally, if an attribute is repeated across fact rows, it shouldn't be put in the fact table; it should be put in a dimension table.
Some of these attributes have low cardinality (10,000 or fewer distinct values) and hence should go into their own dimensions (not the transaction dimension), for example: a discount scheme dimension, a cashier dimension and a time of day dimension (truncated to the minute or second). But some of them have very high cardinality, which means that they are different for almost every transaction, for example: order ID, order line ID, notes, special condition and scan timestamp. These high cardinality attributes can be put in the transaction dimension.
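The cardinality test above is easy to mechanise in the ETL. Here is a minimal sketch; the column names, sample data and threshold are illustrative assumptions, not from any specific tool:

```python
# Split leftover source columns by cardinality: low-cardinality columns get
# their own dimension (e.g. cashier), high-cardinality columns go to the
# transaction dimension (e.g. order ID). Threshold is an assumption.

def classify_leftover_columns(rows, columns, threshold=10_000):
    own_dimension, transaction_dimension = [], []
    for col in columns:
        distinct = len({row[col] for row in rows})
        if distinct <= threshold:
            own_dimension.append(col)
        else:
            transaction_dimension.append(col)
    return own_dimension, transaction_dimension

# Tiny illustrative sample: cashier repeats across rows, order_id is
# (almost) unique per row, so a low threshold separates them.
rows = [
    {"cashier": "C1", "order_id": 1001},
    {"cashier": "C1", "order_id": 1002},
    {"cashier": "C2", "order_id": 1003},
]
own_dim, trans_dim = classify_leftover_columns(
    rows, ["cashier", "order_id"], threshold=2)
```

In practice you would run the distinct counts as SQL against the source tables rather than in Python, but the decision rule is the same.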
Reasons for Creating a Transaction Dimension (the Pros)
There are 4 reasons why we create a transaction dimension:
- Performance. By creating a transaction dimension we essentially move the DD columns to a dimension table, away from the heavily trafficked fact table. DW queries don't touch those DD columns unless they need to. If you have a few DD attributes totalling, say, 30 bytes, they are replaced by a single 4-byte int column. That makes the fact table slimmer, which is better for queries.
- Snapshot. If you build a periodic snapshot fact table, the saving is even greater. Say your monthly snapshot fact table stores the last 10 years of data = 120 months = 120 copies of the transaction fact table. Rather than repeating the DDs 120 times, they are stored once in the transaction dimension, and all that is left on the fact table is a single slim int column: the transaction key.
- Greater grains. Some fact tables have grains greater than the transaction. For example, a financial transaction (say a payment received from a customer) could be posted into 4 different accounts in the general ledger fact table: 1 bank account, 2 income accounts and 1 VAT account. That single financial transaction becomes 4 G/L lines, i.e. 4 fact rows. Because it's a single financial transaction, it has only 1 row in the transaction dimension, and all 4 fact rows have the same value in the transaction key column. Just like a normal dimension, the transaction dimension is coarser-grained than the fact table: if the fact table contains 10 million rows, the transaction dimension might contain 3 million rows, because 1 transaction creates 3 to 4 rows in the fact table but only 1 row in the transaction dimension.
- Linked/related transactions. We may also find that some transactions are related. For example, in the mobile industry, the 3rd call that a customer makes in a day may get a cheaper tariff, say half price compared to the first 2 calls. Or, in retail, a purchase of a kitchen might need to be created as 2 related orders (delivered on the same date), because the worktop is made to order. In these situations, rather than creating a "related order" column on the fact table, it is better to create the "related order" column on the transaction dimension, because a) an order can consist of many fact rows (1 row per order line), so the "related order" number would be duplicated across these fact rows, and b) by putting the "related order" column in the transaction dimension, the fact table becomes slimmer and faster.
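The performance and snapshot arguments above are easy to sanity-check with back-of-envelope arithmetic, using the article's illustrative numbers (30 bytes of DD attributes, a 4-byte int key, 120 monthly snapshots) plus an assumed 10-million-row fact table:

```python
# Rough storage saving from moving DD attributes out of a snapshotted
# fact table. All figures are illustrative assumptions.

FACT_ROWS = 10_000_000   # assumed fact table size
SNAPSHOTS = 120          # 10 years of monthly snapshots
DD_BYTES  = 30           # DD attributes if left in the fact table
KEY_BYTES = 4            # int surrogate key to the transaction dimension

with_dd   = FACT_ROWS * SNAPSHOTS * DD_BYTES    # DDs repeated 120 times
with_key  = FACT_ROWS * SNAPSHOTS * KEY_BYTES   # slim key repeated instead
saving_gb = (with_dd - with_key) / 1024**3

# The DDs are still stored once, in the transaction dimension itself:
dim_overhead_gb = FACT_ROWS * DD_BYTES / 1024**3

print(f"Fact-table saving: {saving_gb:.1f} GB "
      f"(dimension costs {dim_overhead_gb:.2f} GB once)")
```

The dimension stores the 30 bytes only once per transaction, so the net saving remains close to the gross figure.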
OK, those are the benefits. What are the disadvantages? In what situations is it not suitable?
- If your fact table is a transaction fact table, the grain of the transaction dimension is the same as the grain of the fact table, and you only have 1 DD column, it is perhaps better to leave the DD in the fact table. You don't save much space or gain much speed by putting it in a transaction dimension.
- If the mart is only used to supply SSAS, then there is little point in having a physical transaction dimension table, because in SSAS you can model the transaction dimension as a "fact dimension": SSAS creates it "on the fly" from the fact table.
- If you use the transaction dimension as a place to put many attributes (as opposed to storing them in the main dimension), with the argument "well, that's the value of the attribute when the transaction happened, as opposed to the value of the attribute now" – this is not the right approach; use a type 2 SCD for this.
- Any dimension with the same grain as the fact table (like the transaction dimension) is questionable: do we really need this dimension? Is it really going to be used? Wide textual columns such as notes – are users going to query them? If the purpose of creating the mart/warehouse is analytics, will that free-text notes column ever be used in OLAP?
- Any dimension with the same grain as the fact table can potentially cause performance issues (unless your fact table is small). Say your fact table is 10 million rows; the transaction dimension is also 10 million rows. Joining 10 million rows to 10 million rows can be slow, especially if the transaction dimension is not physically ordered on the joining column.
How do we use it?
A very large percentage of data warehouse queries is on the main dimension (account in banking, product in retail, customer in utilities, policy in insurance, patient in healthcare, ticket in airlines, call in telecom, etc.) and the date dimension. Only a very small percentage of data warehouse queries – probably less than 1% – is on the transaction dimension. This is understandable, as by definition a data warehouse is not a transactional system.
If you find that transactional queries are more than 10%, ask yourself this question: why are these queries not run against the operational systems or the ODS? In some cases the answers are genuine: because we don't have an ODS. Or worse: because we don't have a reporting facility in the operational system – our operational system is custom built, and when we built it we decided that all reporting would be done from the warehouse. That doesn't mean these are the right reasons, though; quite the opposite.
Anyway, I digress. Back to the topic of this section: how do we use a transaction dimension? A typical way is to constrain the star query to a particular transaction ID, then go to the fact table via the transaction key to get the necessary rows.
Another way of using a transaction dimension is to put its key on the main dimension, e.g. customer or account, so you can go from the customer/account dimension directly to the transaction dimension without going through the fact table.
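The typical star query described above can be sketched end to end. This example uses an in-memory SQLite database as a stand-in for the warehouse; all table and column names are illustrative assumptions:

```python
# Star query sketch: constrain the transaction dimension on a transaction ID,
# then reach the fact rows via the transaction key.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE dim_transaction (
        transaction_key INTEGER PRIMARY KEY,
        order_id        TEXT,
        scan_timestamp  TEXT
    );
    CREATE TABLE fact_sales (
        transaction_key INTEGER,
        product_key     INTEGER,
        amount          REAL
    );
    INSERT INTO dim_transaction VALUES (1, 'ORD-1001', '2024-01-05 09:13:02');
    INSERT INTO dim_transaction VALUES (2, 'ORD-1002', '2024-01-05 09:14:40');
    -- one order with two order lines -> two fact rows share transaction_key 1
    INSERT INTO fact_sales VALUES (1, 501, 9.99);
    INSERT INTO fact_sales VALUES (1, 502, 4.50);
    INSERT INTO fact_sales VALUES (2, 501, 9.99);
""")

# Constrain on the transaction ID, then join to the fact table on the key.
rows = con.execute("""
    SELECT f.product_key, f.amount
    FROM dim_transaction d
    JOIN fact_sales f ON f.transaction_key = d.transaction_key
    WHERE d.order_id = 'ORD-1001'
    ORDER BY f.product_key
""").fetchall()
print(rows)  # the two order lines of ORD-1001
```

Note how the filter is applied on the slim dimension table, and only the matching fact rows are touched.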
Can we have 2 transaction dimensions? It is rare but possible, e.g. when there are 2 grains (a higher level and a lower level), or when there are 2 different types of transactions, such as premium vs claim in insurance, or mortgage vs checking transactions in banking. In the latter case, each type of transaction is normally located in a different fact table, so fact table 1 uses transaction dimension 1, while fact table 2 uses transaction dimension 2.
Indexing and Loading
Create nonclustered indexes on attributes that have a) low density (many distinct values) and b) high selectivity (the where clause / query predicate is specific, e.g. selecting only 1 row out of 1 million). Create a clustered index on the transaction key, just like on any other dimension. This is very important, as the transaction dimension is almost always accessed by joining from the fact table, and this index will be used for that join.
Consider partitioning the transaction dimension if it contains more than 1 million rows. The question is always "on what key?" A range (left or right doesn't matter) of Transaction ID is a good candidate, as Transaction ID will frequently be used in the query predicate. If you do partition it, you still need the clustered index on the transaction key (as the second key, because the first key must be the partitioning key). It is possible to partition on the transaction key (the dimension's PK), but maintaining that partitioning key would require quite a bit of work.
99.9% of the time the transaction dimension is type 1, not type 2, because the attributes do not change – they are DD attributes. If you do make it type 2, it is beneficial to add an "is current" flag; we can't rely on the validity date columns for query performance.
The daily loading routine is to insert new rows from the source transaction table into both the fact table and the transaction dimension. The identity column on the transaction key (the clustered PK) makes this a quick insert. Updates and deletes are very unlikely, as the transaction dimension only contains degenerate dimension attributes.
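The insert-only daily load can be sketched as follows, again using in-memory SQLite as a stand-in for the warehouse. The table names, the natural-key matching and the identity-column mechanics (`AUTOINCREMENT` here, an `IDENTITY` column in SQL Server) are illustrative assumptions:

```python
# Daily load sketch: each new source row becomes one row in the transaction
# dimension and one row in the fact table. Re-running loads nothing new,
# since the routine is insert-only (no updates or deletes).
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE src_order_line (order_id TEXT, line_no INTEGER, amount REAL);
    CREATE TABLE dim_transaction (
        transaction_key INTEGER PRIMARY KEY AUTOINCREMENT,  -- identity column
        order_id TEXT,
        line_no  INTEGER
    );
    CREATE TABLE fact_sales (transaction_key INTEGER, amount REAL);
    INSERT INTO src_order_line VALUES ('ORD-1001', 1, 9.99);
    INSERT INTO src_order_line VALUES ('ORD-1001', 2, 4.50);
""")

def daily_load(con):
    # Find source rows not yet in the dimension (matched on the natural key).
    new_rows = con.execute("""
        SELECT s.order_id, s.line_no, s.amount
        FROM src_order_line s
        LEFT JOIN dim_transaction d
          ON d.order_id = s.order_id AND d.line_no = s.line_no
        WHERE d.transaction_key IS NULL
    """).fetchall()
    for order_id, line_no, amount in new_rows:
        cur = con.execute(
            "INSERT INTO dim_transaction (order_id, line_no) VALUES (?, ?)",
            (order_id, line_no))
        # The generated identity value becomes the fact row's transaction key.
        con.execute("INSERT INTO fact_sales VALUES (?, ?)",
                    (cur.lastrowid, amount))
    con.commit()

daily_load(con)
daily_load(con)  # second run inserts nothing: all source rows already loaded
```

In a real warehouse the "new rows" filter would typically be a watermark on a load date or transaction ID rather than a full anti-join, but the shape of the routine is the same.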
Like everything else in data warehousing, with the Transaction Dimension there are a lot of "it depends", so I'd welcome any discussion, corrections and opinions.
Source: Statistics Used by the Query Optimizer in Microsoft SQL Server 2005, by Eric Hanson and Lubor Kollar