What I mean by a measure here is a time-variant, numerical property of an entity. It is best to explain this by example. In the investment industry, we have different asset classes: equities, bonds, funds, ETFs, etc. Each asset class has different measures. Equities have opening and closing prices, daily volume, market capitalisation, daily high and low prices, as well as annual and quarterly measures such as turnover, pretax profit and EPS. Bonds have different daily measures: clean and dirty prices, accrued interest, yield and duration. Funds have different daily measures: NAV, alpha, sharpe ratio, and volatility, as well as monthly measures such as 3M return, 1Y return, historic yield, fund size and number of holdings. ETFs have daily bid, mid and offer prices, year high and low, and volume; as well as monthly measures such as performance. The question is: what is an appropriate data model for this situation?
We have three choices:
- Put all measures from different product types into a single table.
- Separate measures from each product types into different tables.
- Put the common measures into one table, and put the uncommon measures into separate tables.
My preference is approach 2, because we don’t need to join across table for each product type. Yes we will need to union across different tables to sum up across product types, but union is much more performant than join operation. The main weakness of approach a is column sparsity.
On top of this of course we will need to separate the daily measures and monthly measures into two different tables. Annual and quarterly measures for equities (such as financial statement numbers) can be combined into one table. We need to remember that measures with different time granularity usually are from different groups. For example, the prices are daily but the performance are monthly.
In addition to different time-variant properties (usually numerical), each asset class also different static properties (can be textual, date or numeric). For example, equities have listed exchanges, industry sectors, country of domicile, and dividend dates. Bonds have issuers, call and maturity dates, and credit ratings. Funds have benchmark, trustee, legal structure and inception date. Examples of numerical properties are minimum initial investment and annual charges for funds; outstanding shares and denomination for equities; par and coupon for bonds. Some static properties are common across asset classes, such as ISIN, country of risk, currency.
Static properties from different asset classes are best stored in separate tables. So we have equity table, bond table, fund table and ETF table. Common properties such as ISIN, country of risk, etc. are best stored in a common table (usually named security table or instrument table).
Why not store all static properties in a common table? Because the properties are different for each asset class so it is like forcing a square peg into a round hole.
For time variant properties it is clear that the table already stores historical data in the rows. Different dates are stored as different rows. What we are discussing here is the historical data of the static attributes. Here we have two choices:
- Using SCD approach: store the historical values on different rows (called versions), and each row is only valid for certain time period. SCD stands for slowly changing dimension, a Kimball approach in data warehousing.
- Using Audit Table approach: store the historical rows in an audit table (also called history table). This is the traditional approach in normalised modelling. The main advantage is that the main table is light weight and performant.
When to use them? Approach a is suitable for situations where the historical versions are accessed a lot, whereas approach b is suitable for situations where the historical versions are very rarely accessed.
The main issue with approach a is that we need to use “between” on the validity date columns. In data warehousing we have a surrogate key to resolve this issue, but in normalised modelling we don’t. Well, we could and we should. Regardless we are using appraoch a or b, in the time-variant tables we need to store the ID of the historical row for that date. This will make getting historical data a lot faster.