Data Warehousing, BI and Data Science

25 September 2018

Data Warehouse for Asset Management

Filed under: Data Warehousing — Vincent Rainardi @ 6:50 am

A. Brief Summary

A.1. Fact Tables

The main fact tables in an asset management warehouse are position, performance, risk and transaction. In the next 4 paragraph I’ll explain what they are briefly. Later on in the article I will be explaining them in more detail (or in a separate article).

The position fact table (aka valuation fact table) contains the daily holdings and benchmark constituents for every portfolio, both security/instrument positions and cash positions. The main issue with this fact table is the matching between holding rows and benchmark rows. We have the option of not attaching the benchmark, which makes it flexible particularly for portfolios having more than one benchmarks. The position fact table is mainly used to provide portfolio breakdown, and top N holdings. The other issue is lookthrough.

The performance fact table contains the performance returns of every portfolios compared with their benchmarks, for many different periods e.g. 3M, 6M, 1Y, 3Y. There is also another type of performance fact table, which contains the performance attribution to different factors such as yield curve, sector allocation and security selection. Performance can also be allocated to different sectors, currencies and asset classes, each compared to the benchmark.

Because of the non additive nature of risk numbers, there are many risk fact tables. They mainly hold the risk sensitivities such as interest rate sensitivities (PV01, DV01), inflation sensitivities (IE01), credit rate sensitivities (CR01) and exchange rate sensitivities. The risk fact table also stores the VAR data (value at risk) at portfolio level and at “carved-up” level (for example, industry sector, currency and asset class)

Transaction fact table stores every buy, sell and corporate action in every portfolio, including the fees and charges. The transaction status, which changes at every stage on different dates, can be stored on an accumulated snapshot fact table. Whereas the trades data can be stored on a transaction fact table.

A.2. Dimensions

In an asset/investment management DW, the following dimensions are commonly found:

  1. Security dimension (aka Instrument dimension or Asset dimension) which contains 3 types of securities: held securities, benchmark securities and watchlist securities.
    Note: cash is a financial instrument but not a security. Only equity, fixed income and money market are financial securities, including derivatives (refer to The Securities Exchange Act, section 3 part 10)
  2. Client dimension (aka customer dimension) containing a list of clients from the CMS such as Salesforce, with their usual attributes such as client type, Mifid category, industry sector, and address.
  3. Portfolio dimension (aka account dimension, product dimension or fund dimension) which contains a list of open funds, segregated account/mandate and trust funds that the company runs.
  4. Benchmark dimension, containing a list of benchmarks which the portfolios are measured against, including blended benchmarks. The “pairing” between portfolios and benchmarks is not done in this dimension.
  5. Country dimension covers two country code schemes (ISO 2 chars) and (ISO 3 chars) and 2 to 3 country name columns (one for each OMS/IMS, plus ISO).
  6. Currency dimension is pretty static. A country column is not useful here, as the currency name already includes the country and some currencies are used by several countries. It is usually a good practice to update country and currency dimensions manually.
  7. Rating dimension. Covers Fitch, Moody’s, and S&P. A good rating dimension have the 3 agencies as columns so we get the translation between them (instead of stacking them up as rows with a “type” column). Also covers “major” rating and “notched” rating as two different attributes/columns (AA is a major and AA+ is a notched). Also covers cash ratings (Moody’s), which is quite different from credit rating. Don’t put the “watch” rating and “outlook” here, they should be a DD on the security dimension.
  8. Industry Sector dimension. Covers at least ICB and GICS, along with major providers such as ML, Barclays, Bloomberg. Some are 4 levels deep, some 7 levels. Unlike rating, don’t try to do translation between providers here, so stack them up with a “provider” column, then level 1 to level 7 columns.
  9. Asset class dimension (aka asset type, instrument type, asset group) which contains a list like: equity, fixed income, cash, property, commodity as level 1. Level 2 contains the breakdown for example for fixed income the level 2s are bond, floating rate note, CDS, bond future. For bond the level 3s are callable/puttable or government/corporate.
  10. Dimensions related to performance return attributes: return type (gross or net of fees), accounting method, cumulative or annualised, arithmetic or geometric. We could model them as a junk dimension because most attributes only have 2-3 values, if we don’t mind losing the description. I don’t recommend the performance period to be put here (e.g. 3 month, 1 year, 5 years), best to put them as columns in the performance return fact table (having them as a column in the performance return is second best).
  11. For the sake of completeness, date dimension.

A.3. Issues

The usual issues/dilemmas with designing the dimensions are:

  1. Do we include country of risk/incorporation, currency, industry sector, etc. in the security dimension? Absolutely, as type 2 attributes. Plus we also have them as separate dimensions in their own right.
  2. How do you model derivative instruments which has very different attributes? By creating Option, Swaps, Futures and CDS as separate dimensions. Don’t separate the OTCs (Over The Counter) Derivatives, e.g. options, swaps, CDS) from ETDs (Exchange Traded Derivatives, e.g. futures, forwards). Don’t create too many dimensions for derivatives, but group them up. 4 or 5 dimensions sounds about right. Don’t worry too much if certain attributes are duplicated between the main security dimension and the derivative instrument dimension, it’s ok.
  3. Do we put cash in the security dimension or not? Don’t. Only real financial instruments should be in the security dimension (including CD, Repo, OTC and ETD), but leave settled and unsettled cash out (including collateral cash for margin requirements) See my article here for more info: link. Cash in different currencies should be put directly in the position fact table, clearly labelled as settled or unsettled as many portfolio management decisions hang on this label.

B. Details – Fact Tables

B.1. Position Fact Tables

Position fact tables store the valuation for each position from the back office accounting system, including adjustments. The valuation is daily, but the one at the month end is continuously adjusted / restated until a few weeks later.

For example, every business day we could be receiving:

  1. Daily position valuation from the OMS system
  2. Daily position valuation from the accounting system
  3. Today’s version of the last month end position valuation from the accounting system

The valuation is not only on the securities that we hold, but also the cash. There are a few things we need to be aware here, with regards to the validity of our position fact table:

  1. The cash amount is dependent on the settlement status of the transactions for that day.
    For example, this morning we had £1m cash, and this afternoon we bought £300k worth of equities and bonds. So what’s our cash position at the end of today? Answer: Settled cash = £700k, unsettled cash = £300k.
  2. The securities we hold is affected by the trade status.
    For example, today the portfolio manager put an order to buy £200k of bond X. If the status of the order is “executed by the broker” (it’s called “filled orders”) then we will have that security in today’s position. But if the status of the order is “not yet executed by the broker” then we don’t have that security in today’s position. There are many stages of an order/trade including “draft/created”, “verified”, “authorised”, “compliance checked”, “placed with/sent to the broker”, “confirmed by the broker”, “filled/executed by the broker”, “allocated”, “settled”, and “closed”. The position is created when the order is “executed”.
  3. The value of an equity position is the number of shares we hold x the previous day price of that equity (close of business price, e.g. when the exchange ended trading – the exact hour depends on which country the exchange is located). Like wise of bonds and any exchange traded securities. But for OTC derivative positions like options and swaps, the valuation requires some calculations to determine the “fair” market value, e.g. Black-Scholes model for options (this discipline is generally called “derivative pricing”). This is because OTC derivatives are not traded on exchanges, but a contract between two parties.
  4. Some of the positions could be valued in-house, and some positions could be valued by an external company to whom we outsource our back office operation. In this case we may need to merge two sources to get the holistic view.

Design of Position Fact Tables:

  • Name: Fact Holding, Fact Valuation or Fact Position. “Holding” restricts us from including benchmark positions. “Valuation” restricts us from including fixed income analytics. So in my opinion “Position” is the best name.
  • Grain: 1 row for each position in every portfolio and benchmark per day.
  • Portfolio and benchmark: two separate position fact tables would cause all sort of painful issues when joining, so put into 1 fact table but join on the front end OMS (Charles River, Aladdin, Simcorp, ThinkFolio). Do not join portfolio and benchmark positions in the DW. This is probably the most important point in the this article.
  • Front end and back end: put data from back end (State Street, Fund Accounting, Northern Trust) and front end (CRD, Aladdin, TF) in two separate position fact tables. The former contains accounting/finance valuations and the latter contains daily trade-by-trade valuations.
  • Surrogate keys: date key (called reporting date), portfolio key or benchmark key, security key. Those are SKs which make up the grain. Other SKs are: currency key, client key.
  • History: stores 5 years of daily snapshot.
  • Partition: partitioned on reporting date.
  • Indexing: index on date key, portfolio key and security key. Also index on other SK columns.
  • Restated month end valuation: stores in a separate fact table.
  • Measures: valuations of each position in 3 currencies (portfolio, base and local) and their implied FX rates, fixed income analytics such as durations, yields and spreads, maturities (in years, 4 dp).
    Note: Analytics numbers are from front end OMS. Fund accounting/finance/back office only provides valuation numbers.

Loading: daily file from back office outsourcing/finance and daily export from OMS.
If the valuation of OTC derivative positions is conducted in house rather than outsourced, this needs to be merged into the back office data.

Usage: correct security key is vital as this fact table is the main source for reporting aggregated positions (portfolio breakdown) by currency, asset class, country, industry sector, rating, etc. which is required by both clients and regulator. Mostly usage are only interested in the month end, but settled positions. Some requires executed (unsettled) positions.


  • Lookthrough: store the positions without a lookthrough. A lookthrough position is a fund held in the portfolio, usually our own funds. In a portfolio, in addition to holding cash, equity or fixed income position, we also hold a fund. We should not “expand” the “content” of this fund. Instead we should store the value of the whole fund. If we “expand”, the valuation of other positions become incorrect, particularly cash positions.
  • Unsettled cash positions: in the Position fact table, separate unsettled cash positions from settled cash. The separation and calculation of the unsettled cash should be done within fund accounting, not within the DW.
  • Mid day vs end of day valuations: always take end of day valuations, even if we do the valuations in house (e.g. for options and swaps).

B.2. Performance Fact Tables

Performance is calculated in the back office system such as Stat Pro, Barclays Point (now Bloomberg Port), MSCI BarraOne, or outsourced to back office provider such as State Street or Northern Trust (Passport).


  • Name: Fact Performance Return (to differentiate from Performance Attribution).
  • Grain: 1 row for each share class, per month, per return type (gross or net return), per return calculation (arithmetic or geometric return), per data source, per row type (portfolio/benchmark/excess return), per cumulative type (annualised or cumulative), per price/total type (price return or total return which is price+income).
  • Surrogate keys: reporting date key, account key (1 account for each share class), benchmark key, currency key, return type key, return calculation key, data source key
  • Degenerate dimensions: performance row type (P for portfolio, B for benchmark, E excess return = R for relative return = A for active return), cumulative type (A for annualised, C for cumulative), performance total/price type (P for Price Only Returns, T for Total Return which is Price + Income), return calculation (A for arithmetic, G for geometric).
    Alternatively we can also put all these DD columns in a junk dimension.
  • Measures: 1M, QTD, 3M, 6M, YTD, 1Y, 3Y, 5Y, 10Y (M = month, Q = Quarter, Y = Year). Some asset managers also have 2Y and 15Y.
  • Loading: from StatPro/Simcorp/Aladdin output or files from the back office profider such as NT or State Street.


  • Some have MTD (month to date), which stores the daily intra-month returns.
  • A popular alternative design is to create a period dimension, containing 1M to 10Y, effectively making this fact table a vertical fact table. I prefer to have a separate measure column for each period, because the first thing we do when we have a vertical performance fact table is to pivot it so that the 1 column of numbers becomes 1 row of numbers. Also because the number of periods are relatively fixed.
  • Performance numbers have to be signed off. Before sign off the data is available, and can be used for reporting, but there a (small) risk that the data is incorrect – which in some cases is acceptable. So we need to add a DD column (or part of the junk dimension) indicating the sign off status.
  • Another attribute which should be added to this fact table is the sign off date (as a date key).

Performance Attribution Fact Table

The active return or excess return or relative return is the return of the portfolio minus the return of the benchmark. This active return can be attributed/allocated to several factors, i.e. yield curve, duration, security selection, currency, credit, carry and sector allocation. This is called performance attribution data. For example, a relative return of 6% can be attributed to:

  • 2% because of duration
  • 1% because of security selection
  • 2% because of sector allocation
  • 1% because of carry


  • Name: Fact Performance Attribution
  • Grain: 1 row per portfolio, per reporting date.
  • Surrogate keys: reporting date key, portfolio key, benchmark key.
    Additionally: currency key.
  • Measures: yield curve, duration, security selection, credit, carry, sector allocation, and total.
  • Loading: from StatPro/Point/Barra output or files from the back office profider such as NT or State Street.

B.3. Risk Fact Tables

The risk fact tables store the sensitivities numbers and value at risk. They also stores the results of “stress test” on multiple scenarios, such as European sovereign default, global economic recession, oil price fluctuation, tightened credit markets, liquidity squeze, exchange rate fluctuation, Brexit scenarios, raise of US interest rates.

I’ll start with sensitivities fact table, the VAR fact table, then stress test fact table. Note: IE01 reflects how sensitive the portfolio is to inflation. PV01 reflects how sensitive the portfolio is to interest rates.

Sensitivities fact table:

  • High level: combine IE01, PV01 and DV01 in one fact table, storing them as different measures.
  • Name: Fact Portfolio Sensitivities
  • Grain: 1 row per portfolio, per day.
  • Surrogate keys: calculation date key (when the sensitivities were calculated), portfolio key
  • Measures: IE01, PV01, DV01. Some asset managers also stores CR01 and IR01.
  • Loading: from risk system such as IBM Algo Risk, MSCI BarraOne, MSCI RiskMetrics, Simcorp. Daily load of the output files.
  • Specifics: IE01, PV01 and DV01 also calculated at position level. Put them in the position fact table. Keep the ones in portfolio sensitivities at fund/portfolio level.

VAR fact table:

  • Description: stores value at risk at portfolio level, and at sector and country levels
  • Name: Fact Portfolio VAR

Grain: 1 row for each sector/country, per day.


6 September 2018

Modern Data Warehousing

Filed under: Data Warehousing — Vincent Rainardi @ 7:09 pm

What should 2018 Data Warehousing look like (as opposed to in year 2005)? That’s what the word “Modern” refers to. We now have streaming data, we have big data, we have data lake, and we have data scientists. A modern DW in 2018 incorporates these 4 factors. An “old” DW in 2005 did not incorporate these 4 factors.

The first attempt to define a Modern DW was from Bill Inmon in his DW 2.0 concept in 2006, where he added unstructured data, different user communities and (whitepaper: link, book: link). There was no big data or NoSQL at that time, but we already had unstructured data in data warehousing. I would argue that metadata and different archiving lines were not new, they had been in DW before 2005.

Then there was an attempt to define a Modern DW as Lambda, i.e. streaming and batch. The best definition is probably from Melissa Coates: link, search “characteristics of a modern data warehouse” on this page. Here she defines other importance characteristics such as Hadoop, data lake, NoSQL, bimodal (sandbox). I would argue that data virtualisation, near real-time, governance, agile delivery, self service BI, large volumes, MDM and high variety are not new. They had been in DW before 2005.

The Azure community is quite adamant with the Lamda concept of a modern DW. Notably Adatis, in particular Simon Whiteley in April 2017 (link), and in SQL Saturday. Microsoft supports this Lambda idea in Azure (concept: link, diagram: link).

Then there is a concept of “Data Lake side-by-side with DW”, notably proposed by Barry Devlin, supported by Hortonworks (link) and also proposed by BlueGranite (link) where Melissa works (see her SQL Saturday presentation is here). The diagram on page 5 of the BlueGranite eBook, which essentially the same as page 5 of Melissa’s SQL Saturday, is in my opinion the best diagram describing a modern DW, as we can see below:

BlueGranite diagram

Here we have the data lake as the staging area for both the streaming data and the data warehousing. And it is also the staging area for the NoSQL data store such as graph database. And the data lake is also accessed by the machine learning. So there are 4 purposes of the data lake.

Arguably it’s Microsoft centric (i.e. Azure), but they are the best vendor at the moment with the most complete components for modern DW (in my opinion). A close second is AWS, with Redshift for the DW and S3 for the data lake (catalogue & ETL on Glue). The Hadoop and Spark are on EMR (Elastic MapReduce). GCP is on the 3rd place with BigQuery as the DW, Cloud Storage as the data lake, and Cloud DataFlow for the streaming data. The Hadoop and Spark are on Cloud Dataproc.

Blog at