Should I put this column in a Fact table or in a Dimension table? Have you ever come across that question?
An old, simple rule in data warehousing is: if it is a number, it goes to the fact table. If it is a text, it goes to the dimension table.
Today somebody asked me: “Is it always like that?”
As with everything else in the world, the answer is “Not always. There are exceptions”.
In general, the principles are:
- If the number is a monetary amount (money), it should be in a fact table
- If the number is aggregatable, it should be in a fact table
- If the number is static (doesn’t change with time), it should be in a dimension
- If the number changes every day, it should be in a fact table
- If the number is an identifier, it should be in a fact table as a degenerate dimension column
As usual it is better to learn by example, rather than by definition. A few cases which are interesting are below:
1. Number of …
[Number of Purchases], [Number of Customers], [Number of Product Lines], … these columns are aggregatable, so they should be in a fact table. And they tend to change over time, so they should be in the fact table. For example: in a customer dimension, we have [Number of Accounts]. [Number of Accounts] is ideally put in a fact table, so that it can be aggregated across customers. And it can change. If a customer opens another account or close his account then the [Number of Accounts] changes. If we put it in a periodic snapshot fact table, we can see the trend across all demographic, across time, across areas, across products, etc.
But what if we don’t need to aggregate them? Then put it in the dimension. But think about the above paragraph again.
Any example of a [Number of …] column that should definitely be in a dimension? Yes. [Number of Wheels] column in a Fleet dimension. The context is a logistics / delivery company that owns 40-50 trucks and vans, large and small. [Number of Wheels] is a permanent, fixed property of a vehicle. If a vehicle has 6 wheels, it will always be 6 weeks. So in this case it is absolutely appropriate that [Number of Wheels] goes to a dimension, not a fact table. It is alright if we can’t aggregate. Who wants to know the number of wheels across the whole fleet?
[Number of Bed Rooms] in a property dimension. The context is property agent marketing data warehouse. If a property has 4 bed rooms, it will always have 4 bed rooms. What if a property is renovated to add 1 extra bed room? Well there are always exceptions to the rule. But it is still appropriate to put [Number of Bed Rooms] as an attribute in a dimension, as opposed to a fact table. But if it is in a dimension we can’t aggregate! Bah, who wants to know the number of bed rooms across the whole city anyway? Certainly not the property agent business!
2. Ratio (Percentage) Column
A ratio column contains a number between 0 and 1. It can be expressed as a decimal (e.g. 0.213), or as a percentage (e.g. 21.3%).
In a business-to-business scenario, the customers (which are companies) have various financial ratios. This Wikipedia page explains about various different types of financial ratios. And this Yahoo Finance page shows an example for BT. Price/Earning, Price/Sales, Debt/Equity, etc.
Do we put those ratios in the customer dimension or in a fact table? Some of them changes every day, so they should be in a fact table. So that we can understand its trend across time, and across other dimension.
Some of them changes only once or twice a year, like Return on Asset and Working Capital Ratio. In this case it is appropriate to put them in the customer dimension. Because we are using these financial ratios as the static properties of a customer. But if we are in the investment business and we need to evaluate how each company grows every year, then we need to put these financial ratios in a fact table. Because we are measuring them against other attributes.
One thing that we have to remember when having a ratio column in a fact table is that: to aggregate it, we need to store it in 2 columns: nomination and denomination.
3. Rank Column
A rank is a sorting order based on certain value. For example, sorting the products based on their sales. Or sorting the customers based on the income we get from them over the years.
A rank can be a simple 1,2,3,… But it can also be a quantile (4 bucket), quintile (5 bucket) or decile (10 buckets). In a decile scenario, we put the rank into 10 buckets. Imagine we have 100 products. We rank them as 1, 2, 3, etc. Then we put number 1 to 10 into bucket 1, number 11 to 20 to bucket 2, etc. If the decile of a product is bucket 1, we know it is a popular product, because it is in the top bucket. Product with decile = 4 is more popular than product with decile = 5.
Should a rank column be in a dimension or in a fact table? Rank is a property of the product / customer / account / etc. If the rank changes every day and we want to track how they change over time, we put it in a periodic snapshot fact table. If the rank changes twice a year, we put it as an attribute in a dimension. If we don’t need to track its changes, we set it as a type 1 attribute. If we need to track its changes, we set it as a type 2 attribute.
4. Time Duration Column
In a car hire company, we have [Days Car Out], which is the time duration from time the customer hire the car to the time the customer returns the car. The value can be an integer e.g. 2 days, or a decimal e.g. 2.21 days.
In banking warehouse, for each credit facility we have Tenor, which is time duration from effective date to maturity date, expressed in years. The value is a decimal e.g. 3.45 years.
Do we put these time duration columns in a fact table or in a dimension? Again the questions are the same: do we need to aggregate it? Do we need to slice it across different dimensions? Do we need to analyse it across time?
In the case of the [Days Car Out] column in a car hire company, [Days Car Out] should be put as in a fact table, because it will aggregated up, and will be analysed across different dimensions and across time. In the case of the [Tenor] column in a bank, it should be put as an attribute of the account dimension. Because we don’t aggregate it across accounts, and we don’t analyse it across time, or across other dimensions.