Data Warehousing, BI and Data Science

9 December 2018

Performance Attribution

Filed under: Data Warehousing — Vincent Rainardi @ 9:37 pm

In asset management, the business is about managing portfolios. Either it is a public fund where individual investors can put their money in, or a portfolio specifically created for one client such as a pension fund or an insurance.

Almost every portfolio has a benchmark*. The objective a portfolio is to beat the benchmark, meaning that the performance return of the portfolio should be higher than the performance of the benchmark. A benchmark is usually an index which operates in the same asset class, region and sector. For example, if the fund is UK equity, the benchmark can be FTSE 100 All Share.
*Note: an absolute return portfolio doesn’t usually have a benchmark as it is compared to zero (or 3 month Libor in some cases)

A portfolios grows over time, say 12% in the last 1 year. The benchmark to which the porfolio is compared to also grows, say 10% in the last 1 year. The art of understanding where this 2% difference is coming from is called performance attribution. Is it from the US companies or European companies? Is it from technology companies or mining companies?

Fortunately there are mathematics which can explain it. Knowing what makes this 2% or where this 2% is from is very important because it gives us insight about which parts of the portfolio are successful and which parts are not.

This 2% is called Relative Performance. Relative between the portfolio and the benchmark. It is also called Excess Performance or Active Performance. Sometimes people call it outperformance (or underperformance if the portfolio return is less than the benchmark). In this article I would like to describe how to allocate the excess performance into different geography regions/countries, different asset classes, different sectors or different currencies.

But before getting into that I would like give a background on what is a portfolio, what is a holding, what is performance return, what is a benchmark, and finally what is performance attribution. I hope with this background/introduction the readers without any background in investment management will be able to understand what performance return is, before getting into performance attribution.

Part 1. Portfolio, Weight, Benchmark and Performance

In a portfolio there are many assets. They are known as holdings. A holding can be a share, a bond or any other type of security. And cash. A global equity portfolio for example, can contain 50 holdings, some of which are shares of US companies, some are shares of European companies, some are companies in Asia, UK, Japan and Latin America.

Let’s say that the total value of this portfolio is 100 million USD. This $100m is called AUM (Asset Under Management). Let’s say that we invested in 10 US companies, and we put $5 million each, so the value of our holdings in these 10 US companies are $50m. That is 50% of the AUM. This 50% is called weight.

Each of the 50 companies in the portfolios has a share price. The share price changes every day. Some increases, some decreases. Let’s say that from the total value of $100m in 31st Oct 2017, within 1 year it became $111.32m in 31st Oct 2018, an increase of 11.32%. This 11.32% is called Performance Return. We say that our portfolio has a performance return of 11.32%.

In daily conversation people don’t say the “Return”, they just say “performance”. We can measure the performance of the fund in the last 1 month, in the last 3 months, in the last 6 months or in the last 1 year. So if today is 3rd of December 2018, the 1 month performance of our fund is measured by comparing the total value of the fund (the AUM) on 31st Oct 2018 to 30th Sep 2018. The 3 months performance is measured by comparing 31st Oct to 31st July 2018. And 1 year performance is comparing 31st Oct 2018 (the last month end) to 31st Oct 2018 (1 year before).

Accumulative and Annualised Return

In addition to 1,3,6 and 12 months, we also have 3 years, 5 years and 10 years performance return. We can present the performance returns greater than 1 year in two different ways. The first one is called accumulative, i.e. for 3 years we compare the value of the portfolio on 31st Oct 2018 (the latest month end) to the value of the portfolio on 31st Oct 2015 (3 years before). Let’s say that the value of the portfolio in 3 years grows from $100m to $133.1m, so it grows by 33.1%. This 33.1% is the accumulative performance return for 3 years.

The second is called annualised return. What is x, so that if the portfolio grows by x every year for 3 years, it grows from 100m to 133.1m? In this case it is 10%. From $100m it becomes $110m, then £121m, then $133.1m. This 10% is called annualised performance return.

Arithmetic and Geometric Annualised Return

When calculating annualised return we have two options. From the 3 years accumulated return of 33.1% we can divide by 3, getting 11.033% per year. This is called arithmetic annualised return.

The other way is to find x% where $100m * (1+x) * (1+x) * (1+x) = $133.1m. And in this case that x is 10%. This is called geometric annualised return.

We can see here that the geometric annualised return (10%) is lower than the arithmetic annualised return (11.033%). This is because of compound growth, i.e. the amount of growth in year 2 is bigger than the growth in year 1, and the amount of growth in year 3 is bigger than year 2 (for the same % growth).

Geometric annualised return is the real thing. It is the right thing to use. Whereas arithmetic annualised return is technically incorrect, so should not be used. It is only there for a quick “back-of-the-envelope” calculations.

Gross and Net Returns

As a fund manager we charge fees, so there are two option: before the fees are deducted (called gross) or after the fees are deducted (called net). If a fund is growing by 10% per year (that’s the gross performance return), and the total of fees and charges are 1%, the net performance return is 9%.

When investors ask “What’s the performance of your fund?”, they usually mean the net return, i.e. after the costs and charges/fees. But a fund manager usually prefer to specify the gross return, because that’s what she/he delivers (the charges are imposed by the company, not them) and because the gross return is higher, so sounds better.

Usage of Performance Return

Investors don’t like performance return numbers which is less than one year. Many pension funds prefer to invest only in funds which is 3 years old or more. They would like to know the 3 years annualised net performance of the fund, to be compared with the 1 year net return to understand the long term and short term view on performance. And to be compared with the volatility (also 1 year and 3 years) to understand the risk, as well as to Alpha, Beta and Sharpe numbers.

Part 2. Performance Attribution

Now that we know what performance return is, let’s talk about performance attribution, which is the topic of this article.

Let’s have a look at this global equity portfolio:

Region Portfolio Weight Benchmark Weight Portfolio Return Benchmark Return Stock Selection Regional Allocation Interaction Factor Total
US 50% 44% 15% 12% 1.32% 0.19% 0.18% 1.69%
UK 10% 15% 10% 8% 0.30% 0.04% -0.10% 0.24%
Europe 18% 14% 5% 3% 0.28% -0.23% 0.08% 0.13%
Asia Pacific 3% 9% 2% 4% -0.18% 0.29% 0.12% 0.23%
Japan 14% 8% 12% 8% 0.32% -0.05% 0.24% 0.51%
Other 3% 10% 6% 9% -0.30% -0.01% 0.21% -0.10%
Cash 2% 0% 0% 0% 0.00% -0.18% 0.00% -0.18%
Total 100% 100% 11.32% 8.80% 1.74% 0.05% 0.73% 2.52%

Table 1: Example of a Global Equity Portfolio

Portfolio and Benchmark Weight and Return

The performance return of this portfolio is 11.32% (let’s say this is 1 year performance), and the performance return of the benchmark is 8.80%. So the relative performance between the portfolio and benchmark is 11.32% -8.80% = 2.52%.

The portfolio contains equities from the US, UK, Europe (excluding UK), Japan, Asia Pacific (excluding Japan) and other region (such as Africa or Latin America). In the second column we can see that the US equity is 50% of the portfolio, UK is 10% of the portfolio, and so on. The portfolio also contains 2% cash.

The benchmark also contains US, UK, Europe, Japan, Apac and other, but the percentages (the “weights”) are different. US occupies 44% of the benchmark, UK 15%, Europe 14% and so on. The benchmark does not contain any cash. We can see this in the third column.

In the fourth column is the portfolio return. We can see that the US equity grew by 15%, UK by 10%, Europe by 5%, and so on. And the total at the bottom is 11.32%. This is the growth of the portfolio as a whole.

In the fifth column we can see the performance return of the benchmark. The US equity grew by 12%, UK equity by 8%, Europe by 3% and so on. And at the bottom we can see that the benchmark as a whole grew by 8.8%.

The job of a fund manager is to invest in the region which will have most growth, i.e. US, UK and Japan.

Performance Attribution

We are now ready to discuss the last 4 columns: the performance attribution columns. On the last column we can see what makes up the 2.52% difference between the portfolio return and the benchmark return, i.e. 1.69% is because of US equity, 0.24% is because of UK equity, 0.13% is because of Europe equity, and so on.

Region Portfolio Weight Benchmark Weight Portfolio Return Benchmark Return Stock Selection Regional Allocation Interaction Factor Total
US 50% 44% 15% 12% 1.32% 0.19% 0.18% 1.69%
UK 10% 15% 10% 8% 0.30% 0.04% -0.10% 0.24%
Europe 18% 14% 5% 3% 0.28% -0.23% 0.08% 0.13%
Asia Pacific 3% 9% 2% 4% -0.18% 0.29% 0.12% 0.23%
Japan 14% 8% 12% 8% 0.32% -0.05% 0.24% 0.51%
Other 3% 10% 6% 9% -0.30% -0.01% 0.21% -0.10%
Cash 2% 0% 0% 0% 0.00% -0.18% 0.00% -0.18%
Total 100% 100% 11.32% 8.80% 1.74% 0.05% 0.73% 2.52%

Table 2: Performance Attribution – Regional Level

But not only that, we can also breakdown this 2.52% outperformance into:

  1. How much of it is because the allocation into regions
  2. How much of it is because of the stocks/shares that were chosen within each region
  3. And how much of it is because of the combination of #1 and #2 above

In the above case, we can see these 3 numbers down at the bottom of the table:

  1. 05% is because of the allocation into regions.
    For example, the PM (portfolio manager) chose to allocate 50% into US, which enjoyed a high growth (15%).
  2. 74% is because of the stocks/shares that were chosen within each region.
    For example, the US stocks raised by 15% whereas the benchmark raised only by 12%. If we use the benchmark weighting (44%) for both the portfolio and benchmark, we get:
    Performance of the US equity in the portfolio = 44% x 15% = 6.6%
    Performance of the US equity in the benchmark = 44% x 12% = 5.28%
    So the portfolio is outperforming the benchmark by 1.32% as we can see on the table above (the “Stock Selection” factor for US)
  3. 73% is because of the combination between stock selection factor and the sector allocation factor.

For a credit portfolio we have one more factor that we can calculate: the yield curve. This is about how much performance is caused by allocating the portfolio into different maturity buckets, e.g. 1 year, 3 years, 5 years, 10 years, 20 years, etc. Generally speaking, in a normal market condition, the longer the maturity the higher the performance.

Last but not least, each of the region attribution (1.69% for US equity for example) can be allocated into stock selection, sector allocation and interaction.

All this is incredably useful for portfolio managers so that they know whether their allocation is correct or not, and whether their stock selection is correct or not. Note that the words “stock selection” is not accurate for a credit portfolio. “Security selection” is a better choice of words.

Sometimes Performance Attribution is also called Performance Contribution, i.e. how much each factor (or region) contributes to the overall performance.


Finally we come to the core of this article: the calculation.

Region Portfolio Weight Benchmark Weight Portfolio Return Benchmark Return Stock Selection Regional Allocation Interaction Factor Total Contribution
Asia Pacific
Total 100% 100% Total PR Total BR Total TA

Table 3: Performance Attribution Calculation

The calculations of the Stock Selection (SS), Regional Allocation (RA), Interaction Factor (IF) and Total Contribution (TC) columns are as follows:

  1. SS = BW x (PR – BR)
  2. RA = (PW – BW) x (BR – Total BR)
  3. IF = (PW – BW) x (PR – BR)
  4. TC = SS + RA + IF
  5. Total PR = Sum (PW x PR)
  6. Total BR = Sum (BW x BR)
  7. Total TC = Sum (TC) = Total PR – Total BR

Logic Behind The Calculations

Now let’s try to understand why the formula for the Stock Selection (SS), Regional Allocation (RA), Interaction Factor (IF) and Total Contribution (TC) columns are as above.

1. Stock Selection

Stock Selection is how different the US stocks/securities in the portfolio perform, compared to the US stocks in the benchmark. So to understand that we use the same weighting for both the portfolio and the benchmark.

In the above case, the US stock has a weight of 50% in the portfolio and 44% in the benchmark. It has 15% performance in the portfolio and 12% in the benchmark. So the performance attribution of the US stock in the portfolio is 50% x 15% = 7.50%. And the performance attribution of the US stock in the benchmark is 44% x 12% = 5.28%.

But to understand the effect of just stock selection, we have to use the same allocation. So let us use 44% for both of them. The performance attribution of the US stock in the portfolio is now 44% x 15% = 6.6%. And the performance of the US stock in the benchmark is still the same as above, which is 5.28%.

Because we use the same allocation (44%) for both the portfolio and the benchmark, the effect we are observing now is purely because of stock selection. The difference between 6.6% and 5.28% is purely because of stock selection. 6.6% minus 5.28% = 1.32% – that is the effect of stock selection.

That is why SS = BW x (PR – BR). It is the difference in return, when the weight is kept constant.

2. Regional Allocation

So how do we find out what is the effect of regional allocation? By keeping the return constant.

So something like this:
RA = (PW – BW) x BR

In the above, we kept the return constant (BR) and so we can understand the effect of the weighting.

Putting in the numbers for the US stocks:
RA for US = (50% – 44% ) x 12% = 6% x 12% = 0.72%

But 12% is not quite right. Because 12% is the return of the US stock. What we need is how well the US stocks perform in comparison to the other regions. Well, as a whole, the benchmark returned 8.8%. So the US is returning 3.2% above the average. That is the figure we need. So the formula becomes:

RA = (PW – BW) x (BR – Total BR) = (50% – 44%) x (12% – 8.8%) = 6% x 3.2% = 0.192%.

This is fairer. The effect of allocating more to the US is not 0.72% but only 0.192%, because it is not out of 12% (the return of US stocks) but out of 3.2% (the outperformance of the US stock compared to the other regions).

3. Interaction Factor

To calculate the Interaction Factor we must not kept the weight constant, and we must not kept the return constant either. We should use the actual differences in weight and return.

The differences in weight for the US stocks is 50% – 44% = 6%
The differences in return for the US stocks is 15% – 12% = 3%

If we multiply them we get the effect caused by the differences of both the weight and the return.

Hence the formula for the Interaction Factor is (PW – BW) x (PR – BR).

4. Total Contribution

Total Contribution is the sum of the 3 factors above, i.e.

TC = SS + RA + IF

Country Level

We can use this method to understand the effect of allocation not only at region level, but also at country level. So in the above, how each country in Europe contributed to the total European return.

For example, it’s like this:

Country Port Weight PW in Europe Bench Weight BW in Europe Port Return Bench Return Stock Selection Country Allocation Interaction Factor Total Contribution
France 4% 22.2% 2% 14.3% 4% 3% 0.14% 0.00% 0.08% 0.22%
Germany 2% 11.1% 1% 7.1% 8% 2% 0.43% -0.04% 0.24% 0.63%
Italy 1% 5.6% 2% 14.3% 4% 4% 0.00% -0.09% 0.00% -0.09%
Ireland 1% 5.6% 3% 21.4% 5% 4% 0.21% -0.16% -0.16% -0.10%
Netherland 2% 11.1% 1% 7.1% 4% 2% 0.14% -0.04% 0.08% 0.18%
Norway 5% 27.8% 3% 21.4% 7% 2% 1.07% -0.06% 0.32% 1.33%
Spain 3% 16.7% 2% 14.3% 2% 3% -0.14% 0.00% -0.02% -0.17%
Total Europe 18% 100% 14% 100% 5.00% 3.00% 1.86% -0.39% 0.53% 2.00%

Table 4: Performance Attribution – Country Level

Here we breakdown the 2% outperformance in European tocks (5% in the portfolio vs 3% in the benchmark) into 0.22% is from France, 0.63% is from Germany, -0.09% is from Italy, and so on.

And within 0.22% for France we further break it down to 0.14% is from Stock Selection, 0% is from Regional Allocation and 0.08% is from Interaction Factor.

They key in breaking down the outperformance from Region level to Country level is to compute the Portfolio Weight in Europe and Benchmark Weight in Europe columns. These 2 columns are calculated by dividing the Portfolio Weight (4% for France, 2% for Germany, and so on) by the total Europe weight (18%). So France’s portfolio weight in Europe is 4% / 18% = 22.2%. Germany portfolio weight in Europe is 2% / 18% = 11.1%. Then we do the same thing for the Benchmark Weight in Europe (BW in Europe column).

The question remain as to why in the country level the total of Stock Allocation for Europe is 1.86% where as in the Region level the total of the Stock Allocation for Europe is 0.28% (see table 5 and 6 below).

Region Portfolio Weight Benchmark Weight Portfolio Return Benchmark Return Stock Selection Regional Allocation Interaction Factor Total Contribution
Europe 18% 14% 5% 3% 0.28% -0.23% 0.08% 0.13%
Total 100% 100% 11.32% 8.80% 1.74% 0.05% 0.73% 2.52%

Table 5: European Contribution = 0.13%

Country Portfolio Weight Benchmark Weight Portfolio Return Benchmark Return Stock Selection Country Allocation Interaction
Total Europe 18% 14% 5.00% 3.00% 1.86% -0.39% 0.53% 2.00%

Table 6: European Contribution = 2%

This is because when calculating the 0.28% we are trying to allocate the 2.52% into region level (table 5). In context of the 2.52%, the Europan stocks get 0.13% and the European Stock Selection get 0.28%. Where as when calculating the 1.86% we are trying to allocate the 2% European outperformance into countries. In this context the total of Stock Selection for Europe is 1.86%. So both are right, it’s just that they have different context. One is global, and the other is within Europe.

Sector, Currency, Asset Class

In addition to the performance attribution by region and country, we can also produce the performance attribution by industry sector, currency and asset class, like this:

Industry Sector Portfolio Weight Benchmark Weight Portfolio Return Benchmark Return Stock Selection Sector Allocation Interaction Factor Total Contribution
Energy 3.7% 7.5% 3.2% 4.3% -0.08% 0.11% 0.04% 0.07%
Materials 5.3% 8.9% 4.7% 5.2% -0.04% 0.07% 0.02% 0.04%
Industrials 6.8% 9.3% 5.7% 6.1% -0.04% 0.03% 0.01% 0.00%
Consumer Discret. 8.9% 9.3% 9.4% 7.3% 0.20% 0.00% -0.01% 0.19%
Consumer Staples 14.5% 8.8% 11.3% 9.7% 0.14% 0.15% 0.09% 0.38%
Health Care 11.6% 9.4% 7.1% 7.7% -0.06% 0.01% -0.01% -0.06%
Financials 7.7% 13.6% 4.3% 5.7% -0.19% 0.09% 0.08% -0.02%
IT 8.6% 11.2% 10.6% 8.2% 0.27% -0.03% -0.06% 0.18%
Communication 6.8% 8.1% 9.3% 10.3% -0.08% -0.04% 0.01% -0.11%
Utilities 13.8% 7.2% 8.9% 6.7% 0.16% -0.03% 0.15% 0.27%
Real Estate 10.3% 6.7% 9.4% 7.9% 0.10% 0.03% 0.05% 0.18%
Cash 1.9% 0.0% 0.0% 0.0% 0.00% -0.14% 0.00% -0.14%
Total 100.0% 100.0% 8.13% 7.15% 0.37% 0.24% 0.37% 0.98%

And this:

Currency Portfolio Weight Benchmark Weight Portfolio Return Benchmark Return Stock Selection Currency Allocation Interaction Factor Total Contribution
USD 27.9% 24.3% 6.7% 5.6% 0.27% -0.04% 0.04% 0.26%
GBP 19.5% 15.6% 5.6% 5.3% 0.05% -0.06% 0.01% 0.00%
EUR 24.8% 28.4% 7.8% 6.2% 0.45% 0.02% -0.06% 0.42%
CAD 6.9% 8.8% 9.4% 8.9% 0.04% -0.04% -0.01% -0.01%
JPY 11.3% 13.6% 11.7% 9.4% 0.31% -0.06% -0.05% 0.20%
AUD 7.7% 9.3% 6.7% 8.5% -0.17% -0.03% 0.03% -0.17%
Other 1.9% 0.0% 0.0% 0.0% 0.00% -0.13% 0.00% -0.13%
Total 100.0% 100.0% 7.38% 6.80% 0.96% -0.34% -0.04% 0.58%

Knowing how the currency allocation affect the performance is useful for the people on the desk (PM and the analysts). And so is knowing how the industry sector allocation affect the performance. It helps the desk knowing where their outperformances are from.

And like when regions can be further broken down into countries, the sectors can further be broken down into industries.

The key to be able to calculate all the above is: a) knowing the value of each position at the beginning and at the end of the period, and b) putting the country of risk, industry sector, currency and asset class for every position in the holding and benchmark. That’s what enables us to group each holding into regions/countries, industries/sectors, currency and asset class/sub class, and table out the weights and performances for each section, both portfolio and benchmark. And from there we can apply the above logic/method to calculate the performance attribution.


4 November 2018

Percentage of Total and Filter on 2 Columns

Filed under: Data Warehousing — Vincent Rainardi @ 8:02 pm

This query shows how to Filter on 2 columns, add 2 measures using AddColumns, and create a measures to calculate the percentage of total for a certain year and country.

DEFINE MEASURE financials[Sales%] =
SUM(financials[ Sales]) /
( SUM(financials[ Sales]),
  ALLEXCEPT(financials, financials[Year], financials[Country])
) * 100

    ( financials,
      ( financials[Country] = "Canada",
        financials[Year] = 2014
    financials[Month Name],
    financials[Month Number]
  "Sales", FORMAT(CALCULATE(SUM(financials[ Sales])), "#,###"),
  "Sales%", FORMAT(financials[Sales%],"##.##")
ORDER BY financials[Month Number]


  • The Sales% is defined as Sales divided by the sum of Sales for the year and country currently selected.
  • The financial table is filtered on Country = Canada and Year = 2014.
  • Then two columns are chosen from this table: Month Name and Month Number.
  • Then 2 measures got added: Sales and Sales% (which was defined above)
  • Finally it’s ordered on Month Number.

Note: we can’t filter on 3 columns using And function. The maximum argument on And function is two. The And operator (&&) can take more than 2 arguments.

PS.Any one know how to hide the Month Number column?

DAX Studio and DAX Guide

Filed under: Data Warehousing — Vincent Rainardi @ 7:36 pm

Thanks to Darren Gosbel and others, the Power BI, Power Pivot and SSAS Tabular communities now have a tool to test DAX. It is DAX Studio (, which is a superb tool. Not only it enables us to test ideas flexibly, but it also enables us to troubleshoot and optimise DAX queries.

With DAX Studio we can:

  • Connect to Power BI, a Tabular server and to Power Pivot and run DAX queries on them all.
  • Browse the model to understand the tables and columns
  • Run DMVs to get the measures, hierarchies, dimensions
  • Understand the query plans, Storage Engine and Formula Engine and timing.
  • Format our DAX queries
  • Capture DAX queries that we run in Power BI (and therefore allowing us to improve their performance)

It is an indispensable tool for anyone working in DAX. With Power BI being ever so popular, the demand for DAX is highly increasing. It was Darren Gosbel who primarily created DAX Studio, with Marco Russo wrote a small part of it.

Marco and Alberto wrote DAX Guide ( It is so comprehensive, so detail, so complete. It contains every DAX function in Power BI, in Excel, in Tabular, Azure AS and in SSDT. For every function it describe the syntax, the return values, the parameter, the example, and it describes them in detail. Just look at how it explain the Calculate function. The context transition is there, the filter is there, the order of evaluation is there, the modifier is there. It’s so complete.

Thank you Darren, Marco & Alberto, for creating these two very useful tools for all of us.

PS. All other tools related to DAX, Tabular and indeed other BI tasks are listed here:, e.g. VertiPaq Analyzer, Tabular Editor, BISM Normalizer, DAX Editor, Power Pivot Utilities, DAX Date Template, SSAS Tabular Translator, etc. Note: these tools are written by many people.

Correction, 5/11: DAX Studio was written mostly by Darren Gosbel:, not Marco or Alberto. My apologies for this mistake. Thanks Marco for your correction.

A few DAX functions

Filed under: Data Warehousing — Vincent Rainardi @ 2:34 pm

This query shows summarise, addcolumns, filter, union, row and format DAX functions:

    ( FILTER
      ( financials,
        financials[Product] = "Montana"
    "Sales", FORMAT(CALCULATE(SUM(financials[ Sales])), "#,###")
  ROW("Total", "Total", "Sales", FORMAT(SUM(financials[ Sales]), "#,###"))
  • The financials table is filtered on Product = Montana
  • Display just the Country column using Summarize
  • Sales column is added using AddColumns functions
  • Total row is added using Union function
  • Add commas for thousand separator using Format function

Canada 24,887,655
Germany 23,505,341
France 24,354,172
Mexico 20,949,352
USA 25,029,830
Total 118,726,350

Data source: link
Ref: Whytheq: link, SQLBI: link

5 October 2018

Junk Dimension Revisited

Filed under: Data Warehousing — Vincent Rainardi @ 5:27 am

If the data we get from the source system consists of many columns, and 11 of those columns are like this:

  • Small, medium, large (2 columns)
  • High, medium, low (2 columns)
  • Red, green, blue, brown (1 column)
  • Red, amber, green (2 columns)
  • Yes, no (2 columns)
  • Long, short (2 column)

1. Source Row

How should we model it in dimensional model?

The answer is of course a junk dimension. If in the above data we have not 5 rows but 5000 rows, then the distinct combination of values is probably only 200. That is the power of the junk dimension. It puts all these attributes into one table, in a compact form.


One of the challenge of implementing a junk dimension is the lookup, to check if the row coming from the source data already exists in the dimension table or not. If it’s only 11 columns like above then we can compare the source row and the target row directly by comparing all the columns. But if the number of columns is 100, then it would be cumbersome to type it all in.

One common solution is to use hasing. We hash the 100 source columns and compare it with the hash column in the dimension which has been computed previously. If we find a match, that means the row already exists in the dimension table.

There are a few functions in SQL Server which can do this. One of them is checksum. But it is possible that two different inputs generate the same hash output. A better function is hashbytes, with much lower possibility of producing the same hash output for different input, particularly if we use it with MD5 algorithm, as we can read on Books Online and on SQL Server Central (link).

With Hashbytes we can choose different algorithms: MD2, MD4, MD5, SHA, SHA1, SHA2. Note that from SQL 2016 only SHA2_256 and SHA2_512 are recommended. All other algorithm are deprecated. So I recommend using SHA2_512 rather than MD5, as it has less chance of producing the same output value for different input values.

Impossible To Have Duplicates

Having the same output value for different input values is called a collision. MD5 is very weak, it took only a few seconds using a normal computer to produce a collision (link).

SHA1 is the second strongest, We need to try 10^19 different inputs before collision occurs, which takes about 100 year if we use a server with 64 cores. Last year Google and CWI Amsterdam managed to produce a collision with SHA1 (link).

With SHA2 using 64 bytes (512 bits) it is practically impossible to have a duplicate/collision. We need to try 10^77 different inputs before a duplicate/collision occurs. If we use a server with 64 cores, it will take about 5800 years.

Warning: Functions Are Slow

One warning when using functions in SQL Server (any function including hashing functions), is that function is always perform a row by row operations. Or RBAR as Jeff Moden calls it, stands for Row By Agonising Row. In relational databases, be it Oracle or SQL Server, or anything else, when dealing with high number of rows we should always perform a set operation, not a row-by-row operation. If the incoming data from source is over 1m rows, hashing them will take time, particularly if when we use a complex algorithm such as MD5.

First Character Technique

One alternative is taking the first character of every column and use it as the business key for lookup, like this:

2. First Character

By comparing the First Character column between the incoming source row and the one stored in the dimension table, we can check if the row already exists or not.

The advantage of using this technique is that there is no way that two different inputs will produce the same output.

But (as always there is a but) if in one column we have two values with the same first character like below, we need to differentiate them:

3. Same first character

In the above example the Status column have two values with the first character = O, i.e. Outstanding and On Hold. So we need to differentiate them, for example by making On Hold as H.

To implement that we will need either a case statement or a lookup table. For example: case when Status = ‘On Hold’ then ‘H’ else Left(Status,1).

If the upper and lower case in the source data is consistent, we don’t need to convert the first character to uppercase. But if the upper/lower case in the source data is inconsistent then we need to convert it to uppercase. For example, if in the source data the Status column contains both “Active” and “active”, then we need to convert both to “A” by taking the uppercase using UPPER() function.

Dealing With Null

If we use Hashbytes, Null values need to be converted to “Unknown” or “1/1/1900” or -1 depending on the data type. Otherwise it won’t work. We can also opt to convert NULL to the appropriate data type before inputting it to the Hashbytes function.

The advantage of Checksum is that it can deal with NULL. We don’t have to convert NULL values to something else. The other advantage of Checksum is that it is faster than Hashbytes with SHA_512.

Data Type

Varchar and Nvarchar produce different output, even through the input text is the same. Similarly Char and Nchar produce different output.

Varchar(100) and Varchar(200) produce the same output, if the input text is the same. So length of the varchar does not affect the output.

If the collations are different, the outputs are different.

We need to remember all the above points when later on we want to change the data type of the attribute column, either in the staging or in the dimension. Or if we want to change the collation of the database.

8000 bytes limitation

The main limitation of Hashbytes function is that it can only accept a maximum of 8000 bytes. So if the total bytes of all the input columns is more than 8000, we will need to split the input columns into two or three and hash them separately.

Hashbytes Performance

So how fast or show is Hashbytes function? A table with 49 columns and 1.5 million rows takes about 44 seconds with MD5 (link).

A table with one varchar(10) column and 11 million rows takes about 17 seconds with SHA2 256 (link)

Obviously it depends on how many CPUs are in the server.

Back to Junk Dimension

So if there are many distinct values in a column (say 100) then we can’t use the first character technique, because there will be many duplicate. In this case Hashbytes is better, because it can take thousands distinct values, and it can take many columns too.

Yes/No Columns

If the source data contains many boolean columns (say 100 columns, all True and False or Yes and No), then we can either put it in the fact table as degenerate dimension columns, or create a junk dimension. I would advise to create a junk dimension, because it will declutter the fact table. And because those columns are attributes so they should be in a dimension table. And because from space consideration the junk dimension solution is much more compact, particularly if the fact table is a snapshot fact table.

25 September 2018

Data Warehouse for Asset Management

Filed under: Data Warehousing — Vincent Rainardi @ 6:50 am

A. Brief Summary

A.1. Fact Tables

The main fact tables in an asset management warehouse are position, performance, risk and transaction. In the next 4 paragraphs I’ll explain what they are briefly. Later on in the article I will be explaining them in more detail.

The position fact table (aka valuation fact table) contains the daily holdings and benchmark constituents for every portfolio, both security/instrument positions and cash positions. The main issue with this fact table is the matching between the holding/portfolio positions and benchmark positions. We have the option of not attaching the benchmark, which makes it flexible particularly for portfolios having more than one benchmarks. The position fact table is mainly used to provide portfolio breakdown, and top N holdings. The other issue is lookthrough.

The performance fact table contains the performance returns of every portfolios compared with their benchmarks, for many different periods e.g. 3M, 6M, 1Y, 3Y. There is also another type of performance fact table, which contains the performance attribution to different factors such as yield curve, sector allocation and security selection. Performance can also be allocated to different sectors, currencies and asset classes, each compared to the benchmark. Please see my article here.

Because of the non additive nature of risk numbers, there are many risk fact tables. They mainly hold the risk sensitivities such as interest rate sensitivities (PV01, DV01), inflation sensitivities (IE01), credit rate sensitivities (CR01) and exchange rate sensitivities. The risk fact table also stores the VAR data (value at risk) at portfolio level and at “carved-up” level (for example, industry sector, currency and asset class)

Transaction fact table stores every buy, sell and corporate action in every portfolio, including the fees and charges. The transaction status, which changes at every stage on different dates, can be stored on an accumulated snapshot fact table. Whereas the trades data can be stored on a transaction fact table.

Other business areas requiring fact tables are:

  • Fund administration e.g. redemptions, subscriptions, distribution payments
  • Liquidity management e.g. cash ladder (projected cash)
  • Collateral management and counterparty exposure
  • ESG data (Environment, Social, Governance)

A.2. Dimensions

In an asset/investment management DW, the following dimensions are commonly found:

  1. Security dimension (aka Instrument dimension or Asset dimension) which contains 3 types of securities: held securities, benchmark securities and watchlist securities.
    Note: cash is a financial instrument but not a security. Only equity, fixed income and money market are financial securities, including derivatives (refer to The Securities Exchange Act, section 3 part 10)
  2. Issuer dimension (aka company dimension) containing a list of public and private companies who issue shares or bonds, had an OTC trade/deal with us, or has ESG data.
  3. Client dimension (aka customer dimension) containing a list of clients from the CMS such as Salesforce, with their usual attributes such as client type, Mifid category, industry sector, and address.
  4. Portfolio dimension (aka account dimension, product dimension or fund dimension) which contains a list of open funds, segregated account/mandate and trust funds that the company runs.
  5. Benchmark dimension, containing a list of benchmarks which the portfolios are measured against, including blended benchmarks. The “pairing” between portfolios and benchmarks is not done in this dimension.
  6. Country dimension covers two country code schemes (ISO 2 chars) and (ISO 3 chars) and 2 to 3 country name columns (one for each OMS/IMS, plus ISO).
  7. Currency dimension is pretty static. A country column is not useful here, as the currency name already includes the country and some currencies are used by several countries. It is usually a good practice to update country and currency dimensions manually.
  8. Rating dimension. Covers Fitch, Moody’s, and S&P. A good rating dimension have the 3 agencies as columns so we get the translation between them (instead of stacking them up as rows with a “type” column). Also covers “major” rating and “notched” rating as two different attributes/columns (AA is a major and AA+ is a notched). Also covers cash ratings (Moody’s), which is quite different from credit rating. Don’t put the “watch” rating and “outlook” here, they should be a DD on the security dimension.
  9. Industry Sector dimension. Covers at least ICB and GICS, along with major providers such as ML, Barclays, Bloomberg. Some are 4 levels deep, some 7 levels. Unlike rating, don’t try to do translation between providers here, so stack them up with a “provider” column, then level 1 to level 7 columns.
  10. Asset class dimension (aka asset type, instrument type, asset group) which contains a list like: equity, fixed income, cash, property, commodity as level 1. Level 2 contains the breakdown for example for fixed income the level 2s are bond, floating rate note, CDS, bond future. For bond the level 3s are callable/puttable or government/corporate.
  11. Dimensions related to performance return attributes: return type (gross or net of fees), accounting method, cumulative or annualised, arithmetic or geometric. We could model them as a junk dimension because most attributes only have 2-3 values, if we don’t mind losing the description. I don’t recommend the performance period to be put here (e.g. 3 month, 1 year, 5 years), best to put them as columns in the performance return fact table (having them as a column in the performance return is second best).
  12. For the sake of completeness, date dimension.

A.3. Issues

The usual issues/dilemmas with designing the dimensions are:

  1. Do we include country of risk/incorporation, currency, industry sector, etc. in the security dimension? Absolutely, as type 2 attributes. Plus we also have them as separate dimensions in their own right.
  2. How do you model derivative instruments which has very different attributes? By creating Option, Swaps, Futures and CDS as separate dimensions. Don’t separate the OTCs (Over The Counter) Derivatives, e.g. options, swaps, CDS) from ETDs (Exchange Traded Derivatives, e.g. futures, forwards). Don’t create too many dimensions for derivatives, but group them up. 4 or 5 dimensions sounds about right. Don’t worry too much if certain attributes are duplicated between the main security dimension and the derivative instrument dimension, it’s ok.
  3. Do we put cash in the security dimension or not? Don’t. Only real financial instruments should be in the security dimension (including CD, Repo, OTC and ETD), but leave settled and unsettled cash out (including collateral cash for margin requirements) See my article here for more info: link. Cash in different currencies should be put directly in the position fact table, clearly labelled as settled or unsettled as many portfolio management decisions hang on this label.

B. Details – Fact Tables

B.1. Position Fact Tables

Position fact tables store the valuation for each position from the back office accounting system, including adjustments. The valuation is daily, but the one at the month end is continuously adjusted / restated until a few weeks later.

For example, every business day we could be receiving:

  1. Daily position valuation from the OMS system
  2. Daily position valuation from the accounting system
  3. Today’s version of the last month end position valuation from the accounting system

The valuation is not only on the securities that we hold, but also the cash. There are a few things we need to be aware here, with regards to the validity of our position fact table:

  1. The cash amount is dependent on the settlement status of the transactions for that day.
    For example, this morning we had £1m cash, and this afternoon we bought £300k worth of equities and bonds. So what’s our cash position at the end of today? Answer: Settled cash = £700k, unsettled cash = £300k.
  2. The securities we hold is affected by the trade status.
    For example, today the portfolio manager put an order to buy £200k of bond X. If the status of the order is “executed by the broker” (it’s called “filled orders”) then we will have that security in today’s position. But if the status of the order is “not yet executed by the broker” then we don’t have that security in today’s position. There are many stages of an order/trade including “draft/created”, “verified”, “authorised”, “compliance checked”, “placed with/sent to the broker”, “confirmed by the broker”, “filled/executed by the broker”, “allocated”, “settled”, and “closed”. The position is created when the order is “executed”.
  3. The value of an equity position is the number of shares we hold x the previous day price of that equity (close of business price, e.g. when the exchange ended trading – the exact hour depends on which country the exchange is located). Like wise of bonds and any exchange traded securities. But for OTC derivative positions like options and swaps, the valuation requires some calculations to determine the “fair” market value, e.g. Black-Scholes model for options (this discipline is generally called “derivative pricing”). This is because OTC derivatives are not traded on exchanges, but a contract between two parties.
  4. Some of the positions could be valued in-house, and some positions could be valued by an external company to whom we outsource our back office operation. In this case we may need to merge two sources to get the holistic view.

Design of Position Fact Tables:

  • Name: Fact Holding, Fact Valuation or Fact Position. “Holding” restricts us from including benchmark positions. “Valuation” restricts us from including fixed income analytics. So in my opinion “Position” is the best name.
  • Grain: 1 row for each position in every portfolio and benchmark per day.
  • Portfolio and benchmark: two separate position fact tables would cause all sort of painful issues when joining, so put into 1 fact table but join on the front end OMS (Charles River, Aladdin, Simcorp, ThinkFolio). Do not join portfolio and benchmark positions in the DW. This is probably the most important point in the this article.
  • Front end and back end: put data from back end (State Street, Fund Accounting, Northern Trust) and front end (CRD, Aladdin, TF) in two separate position fact tables. The former contains accounting/finance valuations and the latter contains daily trade-by-trade valuations.
  • Surrogate keys: date key (called reporting date), portfolio key or benchmark key, security key. Those are SKs which make up the grain. Other SKs are: currency key, client key.
  • History: stores 5 years of daily snapshot.
  • Partition: partitioned on reporting date.
  • Indexing: index on date key, portfolio key and security key. Also index on other SK columns.
  • Restated month end valuation: stores in a separate fact table.
  • Measures: valuations of each position in 3 currencies (portfolio, base and local) and their implied FX rates, fixed income analytics such as durations, yields and spreads, maturities (in years, 4 dp).
    Note: Analytics numbers are from front end OMS. Fund accounting/finance/back office only provides valuation numbers.

Loading: daily file from back office outsourcing/finance and daily export from OMS.
If the valuation of OTC derivative positions is conducted in house rather than outsourced, this needs to be merged into the back office data.

Usage: correct security key is vital as this fact table is the main source for reporting aggregated positions (portfolio breakdown) by currency, asset class, country, industry sector, rating, etc. which is required by both clients and regulator. Mostly usage are only interested in the month end, but settled positions. Some requires executed (unsettled) positions.


  • Lookthrough: store the positions without a lookthrough. A lookthrough position is a fund held in the portfolio, usually our own funds. In a portfolio, in addition to holding cash, equity or fixed income position, we also hold a fund. We should not “expand” the “content” of this fund. Instead we should store the value of the whole fund. If we “expand”, the valuation of other positions become incorrect, particularly cash positions.
  • Unsettled cash positions: in the Position fact table, separate unsettled cash positions from settled cash. The separation and calculation of the unsettled cash should be done within fund accounting, not within the DW.
  • Mid day vs end of day valuations: always take end of day valuations, even if we do the valuations in house (e.g. for options and swaps).

B.2. Performance Fact Tables

Performance is calculated in the back office system such as Stat Pro, Barclays Point (now Bloomberg Port), MSCI BarraOne, or outsourced to back office provider such as State Street or Northern Trust (Passport).


  • Name: Fact Performance Return (to differentiate from Performance Attribution).
  • Grain: 1 row for each share class, per month, per return type (gross or net return), per return calculation (arithmetic or geometric return), per data source, per row type (portfolio/benchmark/excess return), per cumulative type (annualised or cumulative), per price/total type (price return or total return which is price+income).
  • Surrogate keys: reporting date key, account key (1 account for each share class), benchmark key, currency key, return type key, return calculation key, data source key
  • Degenerate dimensions: performance row type (P for portfolio, B for benchmark, E excess return = R for relative return = A for active return), cumulative type (A for annualised, C for cumulative), performance total/price type (P for Price Only Returns, T for Total Return which is Price + Income), return calculation (A for arithmetic, G for geometric).
    Alternatively we can also put all these DD columns in a junk dimension.
  • Measures: 1M, QTD, 3M, 6M, YTD, 1Y, 3Y, 5Y, 10Y (M = month, Q = Quarter, Y = Year). Some asset managers also have 2Y and 15Y.
  • Loading: from StatPro/Simcorp/Aladdin output or files from the back office profider such as NT or State Street.


  • Some have MTD (month to date), which stores the daily intra-month returns.
  • A popular alternative design is to create a period dimension, containing 1M to 10Y, effectively making this fact table a vertical fact table. I prefer to have a separate measure column for each period, because the first thing we do when we have a vertical performance fact table is to pivot it so that the 1 column of numbers becomes 1 row of numbers. Also because the number of periods are relatively fixed.
  • Performance numbers have to be signed off. Before sign off the data is available, and can be used for reporting, but there a (small) risk that the data is incorrect – which in some cases is acceptable. So we need to add a DD column (or part of the junk dimension) indicating the sign off status.
  • Another attribute which should be added to this fact table is the sign off date (as a date key).

Performance Attribution Fact Table

The active return or excess return or relative return is the return of the portfolio minus the return of the benchmark. This active return can be attributed/allocated to several factors, i.e. yield curve, duration, security selection, currency, credit, carry and sector allocation. This is called performance attribution data. For example, a relative return of 6% can be attributed to:

  • 2% because of duration
  • 1% because of security selection
  • 2% because of sector allocation
  • 1% because of carry


  • Name: Fact Performance Attribution
  • Grain: 1 row per portfolio, per reporting date.
  • Surrogate keys: reporting date key, portfolio key, benchmark key.
    Additionally: currency key.
  • Measures: yield curve, duration, security selection, credit, carry, sector allocation, and total.
  • Loading: from StatPro/Point/Barra output or files from the back office profider such as NT or State Street.

B.3. Risk Fact Tables

The risk fact tables store the sensitivities numbers and value at risk. They also stores the results of “stress test” on multiple scenarios, such as European sovereign default, global economic recession, oil price fluctuation, tightened credit markets, liquidity squeze, exchange rate fluctuation, Brexit scenarios, raise of US interest rates.

I’ll start with sensitivities fact table, the VAR fact table, then stress test fact table. Note: IE01 reflects how sensitive the portfolio is to inflation. PV01 reflects how sensitive the portfolio is to interest rates.

Sensitivities fact table:

  • Description: combine IE01, PV01 and DV01 in one fact table, storing them as different measures.
  • Name: Fact Portfolio Sensitivities
  • Grain: 1 row per portfolio, per day.
  • Surrogate keys: calculation date key (when the sensitivities were calculated), portfolio key
  • Measures: IE01, PV01, DV01. Some asset managers also stores CR01 and IR01.
  • Loading: from risk system such as IBM Algo Risk, MSCI BarraOne, MSCI RiskMetrics, Simcorp. Daily load of the output files.
  • Noge: IE01, PV01 and DV01 also calculated at position level. Put them in the position fact table. Keep the ones in portfolio sensitivities at fund/portfolio level.

VAR fact table:

  • Description: stores value at risk (VAR) at portfolio level, and at sector and country levels. Not that VAR measure is not additive, so we can’t aggregate the VAR of all European countries to calculate the VAR for Europe. So we need to store the VAR at country level and at region level. Ditto for the sector. A VAR measure depends on the confidence level (i.e. 95%, 99%, 99.9%, etc.) and the time horizon (i.e. 1 day, 10 days, etc.) For example: on 20th Nov 2018, the one day VAR of this portfolio is $1 million with a confidence level of 99%. Which means that the possibility of this portfolio losing $1m or more in one day is 1%. Whereas “10-day VAR with a confidence level of 95% is $4m” means: the possibility of losing $4m in 10 days is 5%.
  • Name: Fact Portfolio VAR
  • Grain: 1 row for each sector/country, per portfolio, per day, per horizon, per confidence level.
  • Surrogate keys: calculation date key, portfolio key, sector/country key
  • Measure: because the time horizon is fixed at 1 day and 10 days, and the confidence levels are fixed at 99.9%, 99% and 95%. I prefer to have the time horizons and confidence levels as columns so we have 6 VAR measures / 6 columns like this: VAR_1D_999, VAR_1D_99, VAR_1D_95, VAR_10D_999, VAR_10D_99, VAR_10_95. This saves us from pivoting the result all the time.
  • Rows: the rows in this fact table are one of the following 5 types:
    1. Portfolio Level
    2. Country Level
    3. Region Level
    4. Industry Level (we pick 1 scheme, such as Merrill Lynch or Barclays)
    5. Sector Level
  • Note: the tricky thing here is storing the sector or geographical segments because the measure is non-aggregatable. We do have SKs for each country, but how about the regions? Should we create rows for Asia, Latin America, Europe, etc. in the country dimension? No. We must not change the grain of a dimension, by adding higher level rows. The best way I find is the create a separate (new) VAR Segment dimension, which has these columns: Segment Scheme (value: Country, Region, Industry or Sector), Segment L1, Segment L2, Segment L3. The rows with Segment Scheme = Country have the L1 populated with country, L2 with region, L3 blank. The rows with Segment Scheme = Region have the L1 populated with region, L2 and L3 blank. The rows with Segment Scheme populated with Industry have the L1 populated with Industry, L2 with Sector, L3 with Sector Group.

Why not separating the VAR fact table as 3 tables? 1 for portfolio level, 1 for country (and region) level and 1 for industry (and sector) level? Because they are not aggregatable anyway, we can’t sum the rows. So might as well put them all in one table.

One note with regards to ML sector or Barclays sector, or GICS sectors, etc.: it is not a problem obtaining the classification scheme i.e. how different industries are grouped into sectors and super sectors. The problem is to get the data of 20,000 companies in the world, each of them is allocated into a sector. Is that company manufacturing or retail? Obtaining this data is the issue. Not just for 100 or 1000 companies, but for 20,000 public companies in the world, that is the issue. So in reality we will use the industrial classification used by our security data provider, i.e. if we are sourcing the securities data from Bloomberg then we use the industrial classification provided by Bloomberg. If we source the securities data from Thomson Reuters or Factset, then we use the industry data from them.

This is part of security/instrument mastering process. I mean attaching the ML, GICS (MSCI and S&P) or Barclays sector into each security. And attaching ratings into each security. And attaching country of risk and country of incorporation into each security (and country of domicile).

We have discussed the VAR on industry sector and on geography (countries). How about the carving-up of the VAR based on the asset classes and ratings? These two are not as common as the industry and countries, but it is possible to run them pass the risk system and get an output. If this is the case in your company, then we put them into the same VAR fact table above, with the VAR Segment Key column pointing to rows in the VAR Segment dimension for asset classes (such as IRS, CDS, bond and equity) and ratings (such as AA-, BBB+, etc.)

B.4. Transaction Fact Table

Background: Internal and external orders

There are two types of order: the one which the fund managers put in to our trader (this is called “internal-facing”) and the one which our trader put out to our brokers (this is called “external-facing”). Within our company, it is possible two fund managers (or more) put an order for the same security (on the same say or thereabout), and our trader combines them into one external order. After the order is executed, the trader allocates the security into two portfolios. The opposite can also happens, i.e. one order from a fund manager is very big so the trader decided to split it into two smaller external orders and put them to separate brokers to get better price. After the orders are executed, the trader combines the two external orders and put them both to the same portfolio belonging to that fund manager.

The transaction fact table here stores the internally-facing orders. Not the externally facing orders.

Background: order lifecycle

  • A member of the desk (e.g. a financial analyst, or the fund manager) sends an email to a trader asking them to buy or sell an exchange traded security, or make an OTC derivative deal. He specifies the quantity and the price boundary (a range of price where the desk is prepared to except) and the timing (must be same day, or can it be next day, or even day after next). If is a bond, he specifies the issue, i.e. the maturity and coupon. If it is equity, they specify the class (e.g. Alphabet class A or C, with or without voting rights), the ADR (for US), HK or Shanghai (for Chinese equity). If it’s a derivatives they specify the terms of the deal, e.g. reference rate, maturity, etc.
  • A trader enter the order into their blotter, recording the security (e.g. ISIN/SEDOL), maturity, coupon, dates, class, etc. and to which portfolio this order is for. The trader enter the order on the Order Management System (OMS), aka Investment Management System (IMS) and the order status is Draft.
  • When the order is created, various compliance checks automatically run on the OMS for that portfolio, and if this order breaches one of the rules (be it mandate or regulatory), OMS will block this order and it cannot continue to the next stage.
  • If the order passes the compliance check, the manager of that fund then get notified automatically, asking them to approve this order (known formally as “authorise”). The fund manager checks the impact of this order to her/his portfolio e.g. the valuations, the risks, and the portfolio characteristics such as duration, and then authorise the order.
  • In the mean time the trader tries to find the best price by comparing quotes from several brokers. If it is an OTC deal, the risk system automatically checks the counterparty risk level of the broker. If the risk level is breached, this order is blocked and can’t continue.
  • Once the order is authorised, and the risk + compliance systems are both green, the trader go ahead with one of the brokers (or two) and place an order with them. At this point the status of the order becomes “sent”. There is a new MiFID regulation with regards to best execution, i.e. an asset management company must execute every order to the best of the client’s interest, not the company’s interest. This means finding the best price amongst the brokers and this is reported to the regulator.
  • The broker then send the order to the exchange, to find a match on the opposite side, i.e. if it is a sell order they find a buy side, and vice versa. Once a match is found and agreed the order becomes a trade, and the order status becomes “filled”. The exchange sends the trade details to the broker i.e. who the other side is, and at what price the trade happened. The exchange also sends the trade details to the custodian i.e. the bank who holds the securities on behalf of the asset manager (usually also do settlements and administration such as dividend and coupon payments).
    Note that trade means both sides whereas order means one side.
  • The broker informs our trader that the trade has been executed, along with the trade details such as price, who the opposite side is and the settlement date. This is called “confirmation”. The broker confirms all the trade details with the trader. The broker on the other side also confirm the trade details with their client. Any incorrect details would be corrected at this stage.
  • Two days after the trade was made, we settle the trade, known as T+2 (some securities are T+1 or T+3). If we are buying, our back office will transfer the money to the clearing house, and the clearing house sends the security to our custodian. If we are selling, our custodian will send the security to the clearing house and our back office will receive the payment. Often the custodian bank offers settlement services, which means that, instead of the back office making and receiving the payments, it is the custodian bank who is making and receiving payments to and from the clearing house. After this stage, the status of the order becomes “settled”.
  • Then the securities is allocated into the relevant portfolios (if we are buying, and the trader combines several internal orders into 1 external order). At this point the status of the order is “allocated”. But, the new security was put in the portfolio way before “allocation”. The security is put as “holding” in the portfolio (and therefore affects the value of the portfolio), when the order status is “filled”.
  • After the order is allocated, the order is closed. The order status at that point becomes “closed”.
  • Before an order is executed (order status = “filled”), it can be cancelled and the order status becomes “cancelled”. After the order is filled, the order can’t be cancelled as both sides have committed*. If the broker made a mistake, e.g. the buying the wrong security, then the broker pay for the wrong security out of their pocket (which they will sell later) and then buy the correct security for us the AMC.
    *Note: The exchange usually allows cancellation, as long as both sides agree. But the opposite side of the trade rarely agree to the request to cancel the trade. Correction, yes (at confirmation stage) but cancelling, usually no.

Ok that’s a lot of background. Now let’s specify the design of the fact table.

Transaction Fact Table

  • Description: stores every order for every portfolio. In the transaction fact table we do not store the other side of the trade. We only store our side of the trade. We store the quantity, price, value and the dates of every stage.
  • Name: Fact Transaction
  • Grain: 1 row for every order
  • Fact Table Type: accumulated snapshot
  • Surrogate Keys: Transaction Key (aka Order Key), Order Status Key, Broker Key, Trader Key, Portfolio Key, Security Key, Exchange Key, Authoriser Key, Order Created Date Key, Authorised Date Key, Sent Date Key, Filled Date Key, Confirmation Date Key, Settlement Date Key, Cancellation Date Key, Closed Date Key.
  • Degenerate Dimension: Order ID (internal), Order Type (B for Buy and S for Sell), Trade ID (external)
  • Measures: Quantity, Trade Price, Order Value, Traded Yield, Traded Spread
  • Loading: once a day from the “trade file” exported by the OMS at the end of the day (about 11pm London time, after the US market is closed). Just like position file. This trade file is loaded to the data warehouse straight after it is produced by the OMS. All orders which are not “closed” are “loaded” into this fact table: new orders are inserted, existing orders are updated.

Note: Bear in mind that 2 internal orders (2 order IDs) can be combined into 1 external trade (1 trade ID). And 1 internal order (1 order ID) can be split into 2 external trades (2 trade IDs), each going to different brokers, each has different opposite side. The grain of this fact table is 1 row per internal order in the OMS. If 2 orders are combined into 1 trade, 2 rows will have the same trade ID. If an order is split into 2 trades, only the trade ID of the primary trade is stored here on that row (the largest trade, by value).

Unlike in the retail industry (supermarkets, shops) where an order usually contain multiple items, in OMS an order contains one security.

External Trade Fact Table

Now that we know what a transaction fact table is, we can create an external trade fact table which stores externally facing trades.

  • Name: Fact External Trade
  • Grain: 1 row for each trade
  • Fact Table Type: Accumulated Snapshot
  • Surrogate Keys: Trade Key, Trade Date Key, Exchange Key, Broker Key, Trade Type Key, Security Key, Counterparty Key, Confirmation Date Key, Settlement Date Key.
  • Degenerate Dimension: Trade ID
  • Measures: Quantity, Trade Price, Value
  • Loading: once a day from the OMS export

C. Dimensions

C.1. Security Dimension

A security is a financial instrument, such as shares/stocks of a company. The followings are also securities:

  • Bond: an instrument where we get paid coupon (interest) every 6 months
  • Option: an agreement to buy or sell shares or bonds (or currencies, or indexes)
  • Future: an agreement to buy or sell bonds or commodities in the future
  • Forward: like Future but for currencies or commodities
  • Swap: an agreement to exchange income from a security
  • Foreign Exchange (FX): an agreement to buy a currency
  • Credit Default Swap (CDS): an agreement to get compensation if a bond is defaulted (the issuer can’t pay the coupon/interest)
  • Loan: an agreement to borrow money with interest
  • Securitised: a bond backed by a collection of loans or mortgages
  • Certificate of Deposit: a certificate from a bank which indicates that we have deposited a large sum of money in that bank

Note: the above is called “asset class”.

The followings are not a security:

  • Cash, i.e. money in different currencies (but an option to buy money in other currency is a security)
  • Account, i.e. we put money into a broker or a bank (e.g. for margin trading)
  • Real estate, i.e. property such as a house or a land
  • Commodity, such as oil, cotton, iron and copper (but the a future on commodity is a security)

In the broadest sense securities can be grouped into two: bond or equity. Bond is formally known as “debt”, to include other fixed income securities outside bonds, such as derivatives (options, swaps and forward).

A security can be listed in an exchange. For example: a share can be listed in a stock exchange, a future can be listed in a future exchange. A security can be negotiated just directly with an investment bank (they are called the “sell side”, we are called the “buy side”) and this type is called OTC, stands for over-the-counter, meaning that we make a deal outside the exchange. OTCs are mostly swaps and options, but an equity and future can also be bought outside the exchange (directly from the seller, which is mostly banks). This attribute is called security type, i.e. listed or OTC. Listed security is also called ETS, stands for Exchange Traded Security, or ETD, stands for Exchange Traded Derivatives (if it is a derivative).

Security Master application

An asset/investment management company usually has an instrument mastering system, aka Security Master application. This application enables the creation of a security (financial instrument*) in real time whenever the following event occurs (or before hand):

  1. A benchmark contains a new security (not already exists in security master).
  2. An order contains a new security.
  3. An OTC deal/trade is made.
  4. A file from ESG vendors contains a new security

*Note: cash is a financial instrument but not a security (ditto deposits).
Asset is everything that is held in a portfolio, including cash, deposits and derivatives.

Security Master

The Security Master application stores every security used in the company. Equity, bonds, options, futures, swaps, loans, certificate of deposit, treasury, money market, ETF, everything. Old securities which have matured are also stored here. Every security held in every portfolio is stored here. Every security held in any benchmark used by the company is stored here. Every security resulting from any over-the-counter trade is stored here (e.g. IRS, CDS, FX option). Every ISIN or Sedol mentioned in any ESG file from any ESG vendors (e.g. Bloomberg, Trucost, Sustainalytics, MSCI, ISS, Factset, BoardEx, CDP). The ESG data is actually per company, not per security. But, next to the company name, the vendor put the primary ISIN or Sedol of that company (usually the common equity).

The primary source of a Security Master application is the Bloomberg security universe. There are about 650,000 securities there, but an investment/asset management company usually only request the actively traded ones, which is about 67,000. This way, when fund managers are considering to buy a new equity or bond which is not in the index, they will find it in the OMS. Because the OMS takes the security from the Security Master and the Security Master has every actively traded security, be it shares, bonds, treasury, listed options or listed futures.

Of course the Bloomberg security universe does not contain OTC derivatives*, because an OTC derivative is a security created by two companies making a deal between them to buy and sell an option or a swap. They are not listed in any exchange. Therefore the OTC system needs to connect to the Security Master.
*Note: are there OTCs which are not derivatives? Yes, an equity can be bought outside an exchange.

The main purpose of a Security Master app is to enrich the “skeleton” security with various attributes (usually from Bloomberg) such as security name, country of risk, country of incorporation, country of domincile, currency, rating, issuer, exchange, and industry sectors, along with various identifier such as ISIN, Sedol, Cusip and Ticker. Security Master also have screens/pages where user can manually alter the attributes of a security.

Security Dimension

The ideal source of a security is the security master application where the creation, integration and enrichment happens, but if it is not available (usually in the case of a small asset manager), then security dimension becomes the creation and integration point. Creation point means that new securities such as OTC are created in the security dimension (skeleton only) and then enriched by making a “per sec” call to Bloomberg to get the remaining attributes. I have to emphasise that a security dimension is not the right place for mastering security in the company, because a) the mastering process requires a lot of manual checking, and potentially manual entries, b) it requires judgements with regards to what value we want to allocate to security, when there are conflicting values coming from different data sources.

Dimension Design:

  • Dimension Name: Dim Security
  • Dimension Type: SCD Type 2
  • Grain: 1 row for each version of a security
  • Business Key: Security ID (from Security Master)
  • Surrogate Key: Security Key
  • Description: contains listed and OTC securities but not cash, funds or deposits. This is a “core and custom” dimensions, i.e. Dim Security as the core dimension contains about 40-50 common attributes across asset class and Dim Security Equity/Fixed Income/Swap/Option/CDS as custom dimensions covering attributes specific to those asset classes.
    Note: The “core and custom” dimensions is also known as “supertype and subtype” dimension, see Jeremy Kashel’s explanation here on Adatis blog and a short description here on Kimball group website. Also see the Learn Data Modeling post on this here and Oracle Academy presentation here.
  • Number of rows: about 500,000 to 1m (30-40,000 distinct securities, 10-20 version each)
  • Number of columns: about 40-50
  • Attributes: Security ID (from Security Master), Security Name, Description, ISIN, Sedol, Cusip, BBG ID (Bloomberg), Ticker (Bloomberg), FSYM ID (Factset), RED ID (Markit), RIC (Thomson Reuters), Master Issue ID (Moody’s), Exchange, Issuer, Currency, Country (Risk, Domicile, Incorporation), Market, Industry (GICS, ICB, ML, BBG, Barclays), Issue Date, Security Type, Maturity Date, Underlying Issuer, Asset Type/Class/Group.
  • Security attributes not included: Series, Tenor, Reference Name (these are CDS attributes), callable, call dates, puttable, put dates (these are Option attributes), Nominal, Principal, Coupon, Debt Type, Convertible Flag, Perpetual Flag, Securitisation, Zero Coupon, Sinking Fund, Credit Ratings, Reference Yield Curve (these are fixed income attributes). These attributes are stored in Dim Security Fixed Income, Dim Security Equity, Dim Security Swap, Dim Security Option and Dim Security CDS
  • System columns: inserted datetime, last updated datetime, latest flag.
  • Loading: from security master. Rows in the security master is already deduplicated, meaning that it’s 1 row per security. So we can take an export of the security master, and apply the SCD type 2 process on the attribute that we like to keep the historical value.


  • Dimension Name: should it be called instrument dimension or security dimension? Either one is fine. Financial instrument includes cash and deposit, whereas security doesn’t. Is oil a financial instrument? No. Is it a security? No. Commodities like cotton, wheat, soya beans, iron ore, copper and oil are not financial instruments or securities. But the options on them are financial instruments and security.
    Gold is not a financial instrument. A loan is a financial instrument, because it is tradable. An FX option is a financial instrument because it is a trade (a deal) between 2 parties. Property is not a financial instrument.
    The definitive definition of a financial instrument is given by the Securities Act (link) and the IMF: link.
  • The same CDS is issued by several banks, for example a 5 year CDS on a Tesco bond can be issued by JP Morgan and Citi Group. Tesco is the reference entity, whereas JPM and Citi are the issuer. So the “issuer” of a CDS is different from the issuer of a bond or shares.
  • Licence on identifiers: to use the proprietary security identifier listed above (FSYM, BBG ID, RIC, etc) we need to pay a licence fee. Ditto Cusip (link). We can’t store them without paying first. But not ISIN and Sedol, as far as I know.
  • Ticker: Each vendor has their own ticker, e.g. Bloomberg tickers are different from Factset tickers, and are different from Reuters tickers. They are all licenced (not free). For example: The ticker for PT Telekomunikasi Indonesia are TKLM:IJ (Bloomberg) and TLKM:JK (Reuters). The Exchange Ticker code (known as EPIC or TIDM in UK) is free to use as far as I know. For example: NYSE: T, LSE: LLOY, ETR: BMW, NASDAQ: AMZN and IDX: TKLM.
  • The shares of a company can be traded on different exchanges under different ISINs. For example, RBS is traded in London (ISIN: GB00B7T77214) and New York (ISIN: US7800976893, as ADR).
  • OTC swaps such as IRS are stored in the security dimension as one row, not 2 rows. Most of the attributes are stored in Dim_Security_Swaps. Only the core attributes are stored in Dim_Security. In Dim_Security_Swaps IRSes are stored as 2 rows: 1 for the fixed leg and 1 for the floating leg. It is possible that both floating, in the case of multi-currencies swap.

C.2. Issuer Dimension

The issuer dimension stores four things:

  • All public companies which issue shares or bonds.
  • All central governments, municipal governments and government related bodies/agencies which issue bonds.
  • All the investment banks which issue credit default swaps (CDS) and certificate of deposit (CD).
  • All companies which have ESG data (both public and private)

Dimension Design:

  • Dimension Name: Dim Issuer
  • Dimension Type: type 2
  • Grain: 1 row for each version of company, government, or agency
  • Business Key: Bloomberg Company ID, or Factset Entity ID, or Markit RED ID, or LEI, or DUNS ID.
  • Surrogate Key: Issuer Key
  • Attributes: Issuer Code (one of the business keys above), issuer name (legal name) – this is type 1, issuer short name, several entity identifiers (BBCID, RED, FSYM, LEI, DUNS) – this is type 1, several primary security identifiers (primary ISIN, primary SEDOL, primary CUSIP, primary ticker, primary exchange) – this is type 2, issuer type (public company, private company, central government, municipal goverment, or government agency) – this is type 2, headquarter address, telephone, contact person, country of risk (primary country of operation), country of domicile, country of incorporation (all countries are stored as 2 character ISO codes, not as surrogate keys), parent company ID (this is the business key e.g. BBG CID, not the surrogate key), industry sector (several columns, i.e. GICS, ICB, SIC, NAICS, ISIC, ML, BBG, Barclays).
  • Attributes not included: CEO (this is stored in the Executive Dimension because it date dependent), market cap (this is stored in a fact table because it is date dependent), number of employees (date dependent – should be stored in a fact table)
  • System columns: inserted datetime, last updated datetime, latest flag.
  • Loading: a daily export of the issuer table from the security master get loaded into Dim Issuer, with SCD type 2 process.


  • Name: we should not call this dimension Dim Company because the governments are not companies.
    Dim Entity or Dim Legal Entity is possible but it is not popular nor clear.
  • Other entity identifiers are listed on Alacra (link). Most of them are either sectoral (e.g. only banks) or geographical (e.g. only Japan).
  • The main challenge in populating this dimension is to make sure that each company is only inserted once. This matching process (to identify if a company already exists) thankfully does not happen here in the issuer dimension, but in the security master system.
    The security master system, after creating a new security, it populates the issuer code and underlying issuer code for that new security. At this time, it tries to find that issuer in its internal issuer table, using the natural key (e.g. RED, FSYM or BBG CID). Because issuers are already deduplicated in the security master, we in the DW team just need to consume it.
    But the ESG data sources will have companies not in the “investment” data space. They need to be added into the security master, and come out into the DW.

C.3. Client Dimension

A client is from whom we receive a mandate and money to invest on their behalf. It could be a segregated mandate, which means that we manage their money separately, i.e. in a separate portfolio. It could be a pooled fund, which means that we put their money into one of our existing fund, and manage it together with the money from other clients. So in this sense, our clients are companies or government bodies.

The client dimension is usualy built from Salesforce which stores the detail of every single client and prospect. Most of the asset managers that I know use Salesforce for their CRM (Salesforce is used by many other industry too, e.g. banking and insurance).

Many asset managers are institutional, e.g. they only have corporate clients. In this case the client dimension will contain only companies and governments/goverment bodies. Some asset managers are retail, e.g. they have individual investors as well as corporate clients. In this case the client dimension will contain companies, government/goverment bodies and individual people. Companies have different attributes to people. Companies have different attributes to governments too, but not as different as people’s attributes, i.e. companies and governments and government bodies/agencies can be put into one table.

The question is, should we create two client dimensions, one for company/government and one for individual people? Yes we should. The retail client have different requirements, different attributes and different processes to the corporate clients. And they access different funds too.

Within corporate clients, we have a few major categories: insurance clients, pension funds, consultants, corporate investors, and government/sovereign clients.

Insurance clients usually request the following Solvency II related reports: Pillar 3 QRT (Quantitative Reporting Templates), Tripartite template, capital charge calculation with risk pillar attribution, and capital charge analysis.

Whereas pension fund clients usually only needs the following (insurance clients need these too): asset valuation (holding level and higher level aggregations, e.g. sector, rating, duration), performance attribution, performance return. All of the above are comparing the portfolio to the benchmark. Plus they also need market review, market outlook and trades report.

Dimension Design:

  • Dimension Name: Dim Client
  • Dimension Type: type 2
  • Grain: 1 row for each version of a client
  • Business Key: Salesforce Account Number
  • Attributes: Client Name, Address (including post code and country), Billing Address (if different), Telephone, Website, Main Contact Person, Main Email Address, Account Manager, Industry, Client Type, Client Sector, Account Source, Parent Company ID, Date became client, Date of first mandate.
  • System columns: the usual 3.
  • Data source: from Salesforce.

If we are administering our own retail funds, we need to create a subscriber dimension, which contains the individual investors. We need to install and use a fund administrator software, and the subscribe dimension is populated from the export of the fund administor system. If the fund administration is outsourced to a third party, we get our subscriber data from the third party.

Dimension Design:

  • Dimension name: Dim Subscriber (or Dim Investor)
  • Dimension Type: type 2
  • Grain: 1 row for each version of a subscriber
  • Business Key: Subscriber
  • Surrogate Key: Subscriber Key
  • Attributes: Subscriber name, Address, Date of birth, Email address, Telephone number, Date of first subscription.
  • Data source: fund administration system

It is rare that we need to create this Dim Subscriber because we are not dealing with the day-to-day subscription (people putting money in) or redemption (people pulling their money out), or income distribution (allocating the dividends from the company in the fund to the subscribers). The day to day task of administering a fund is usually outsourced. If we do manage the day to day fund administration, then we also need to create a fact table for subscriptions and redemptions, and any other movement of money within the funds.

C.4. Portfolio Dimension

In layman term, a portfolio is a collection of shares or bonds that we manage as one entity. In addition shares and bonds, we can also have derivatives such as options, futures, swaps, FX, CDS, securitised, CD, etc (see section C1 above where I explained about various securities), and cash.

A typical equity portfolio consists of 30 to 60 shares occupying 98% of the fund, with the remaining 2% as cash. It is open to the public, meaning that individual investor can in that fund. The fund could be available in different share classes, i.e.

  • In different currencies such as USD, EUR and GBP
  • With different distribution type, i.e. accumulation or income. Accumulation means the dividend from the shares is reinvested into the fund. Income means the dividend from the shares is distributed to investors.
  • For different target invetors, i.e. institutional or retail investor. One share class might have minimum investment of £100 because it is for retail investor, whereas another share class might have minimum investment of £100,000 because it is for institutional investor.
  • With different charges, i.e. one share class having Initial Charge of 5% and Ongoing Charges Figure (OCF) of 1%, whereas another share class has Initial Charge of 4% and OCF of 1.5%. OCF includes expenses that we pay for administrator, accountant, auditor, legal, depositary and registrar.
    A depositary bank (with an o in the US, depository) holds the shares in the fund and distribute the dividends from the shares. It is also known as custodian bank or fiduciary.
    A registrar keeps a register of shareholders of the fund (including AML and KYC) and produce capital account report and distributor position report. For a fund the correct term is “subscriber”, for a unit trust it is “shareholder”. AML stands for Anti Money Laundering, to check that the money from the subscriber is from valid source. KYC stands for Know Your Customer, to check the identity of the subscriber to make sure they are who they say they are, and investigate if they are polically connected or domiciled in restricted countries.
  • With different performance fees, i.e. one share class may charge a performance fee, whereas the other share class doesn’t charge a performance fee. For example the investor pays 10% of the amount by which the NAV of the fund exceeds the FTSE Allshare index.
  • ISA compliant, meaning whether individual investors can keep this fund in their Individual Saving Account (ISA) or not.
  • SIPP compliant, meaning whether individual investors can keep this fund in their Self Investing Pension Plan account (SIPP) or not.

A fund can be open ended, meaning that when there is a new subscriber they issue new shares (or unit of shares, to be precise) and when a subscriber sells their holding their shares are sold (called “redemption”). Or it can be a closed ended fund, meaning that the number of shares is fixed, and the fund is listed in the stock exhange so investors can buy and sell it.

A fund has specific objective, which is to invest in certain markets, in certain financial instrument. For example, only equity or only bonds (or both). Only in Europe or only in Asia (or global). Only small companies or only big companies (or both). Only emerging market or only developed market (or both). Only in bio technology companies only in financial services companies, or only in mining companies (or all industry sectors).

A fund has a benchmark, for example to outperform FTSE 100 GBP Total Return Index. Total Return means that the dividend is reinvested back into the index. FTSE 100 is the 100 largest companies listed in London stock exchange.

So that is one end of the spectrum: a portfolio which is a fund. The other end of the spectrum is a portfolio which is a segregated mandate. We receive a mandate (instruction) from the pension fund trustee of Supermarket ABC to invest £20 million of their money in a certain way, for example: 80% of it must be invested as fixed income in the US and European markets separately, we must not take any short position and we must not use derivative other than FX options to hedge currency risk. We create a client account called Supermarket ABC, and under this account we create a portfolio called “Supermarket ABC US Fixed Income” and “Supermarket ABC European Fixed Income”.

Now that we know the difference between a portfolio, a fund, a mandate and an account, we are ready to design the portfolio dimension. But one more thing: some asset manager called a portfolio as a “product”, because that’s what it is known as in the sales and marketing department. In my opinion this is incorrect because a product can consists of several portfolios.


  • Dimension Name: Dim Portfolio (Dim Account, Dim Fund, Dim Product are not correct names, due to the reasons I described above)
  • Dimension Type: type 2, particularly on portfolio manager, status, valuation point, distribution date, depository because we need to record the previous values. Not on inception date, portfolio name, description, etc – they are type 1 attributes because the changes means correction.
  • Grain: 1 row for each version of the portfolio.
  • Business Key: Portfolio Code
  • Surrogate Key: Portfolio Key
  • Attributes: Portfolio Name, Portfolio Description, ISIN, Client Code (not Client Key that’s snowflaking), Legal Structure (ICVC, OEIC, Unit Trust, etc.), Portfolio Type (Fund, Sub Fund, Segregated or Shell/Composite) Main Asset Class (Equity, Debt or both), Distribution Type (Accumulation or Income), Parent Portfolio Code, Main Share Class, Base Currency, Portfolio Manager (or Fund Manager), Status (Active, Soft Close or Closed), Inception Date, Closed Date, Distribution Date, Depositary Bank, Valuation Point,
  • Attributes which should not be included: number of holdings, fund size, sharpe, alpha, beta, yield because they are date dependent so they should be stored in fact tables.


C.5. Benchmark Dimension

C.6. Country Dimension

C.7. Currency Dimension

C.8. Rating Dimension

C.9. Industry Dimension

C.10. Asset Class Dimension

C.11. Performance related dimensions

C.12. Date Dimension

C.13. Broker Dimension

6 September 2018

Modern Data Warehousing

Filed under: Data Warehousing — Vincent Rainardi @ 7:09 pm

What should 2018 Data Warehousing look like (as opposed to in year 2005)? That’s what the word “Modern” refers to. We now have streaming data, we have big data, we have data lake, and we have data scientists. A modern DW in 2018 incorporates these 4 factors. An “old” DW in 2005 did not incorporate these 4 factors.

The first attempt to define a Modern DW was from Bill Inmon in his DW 2.0 concept in 2006, where he added unstructured data, different user communities and (whitepaper: link, book: link). There was no big data or NoSQL at that time, but we already had unstructured data in data warehousing. I would argue that metadata and different archiving lines were not new, they had been in DW before 2005.

Then there was an attempt to define a Modern DW as Lambda, i.e. streaming and batch. The best definition is probably from Melissa Coates: link, search “characteristics of a modern data warehouse” on this page. Here she defines other importance characteristics such as Hadoop, data lake, NoSQL, bimodal (sandbox). I would argue that data virtualisation, near real-time, governance, agile delivery, self service BI, large volumes, MDM and high variety are not new. They had been in DW before 2005.

The Azure community is quite adamant with the Lamda concept of a modern DW. Notably Adatis, in particular Simon Whiteley in April 2017 (link), and in SQL Saturday. Microsoft supports this Lambda idea in Azure (concept: link, diagram: link).

Then there is a concept of “Data Lake side-by-side with DW”, notably proposed by Barry Devlin, supported by Hortonworks (link) and also proposed by BlueGranite (link) where Melissa works (see her SQL Saturday presentation is here). The diagram on page 5 of the BlueGranite eBook, which essentially the same as page 5 of Melissa’s SQL Saturday, is in my opinion the best diagram describing a modern DW, as we can see below:

BlueGranite diagram

Here we have the data lake as the staging area for both the streaming data and the data warehousing. And it is also the staging area for the NoSQL data store such as graph database. And the data lake is also accessed by the machine learning. So there are 4 purposes of the data lake.

Arguably it’s Microsoft centric (i.e. Azure), but they are the best vendor at the moment with the most complete components for modern DW (in my opinion). A close second is AWS, with Redshift for the DW and S3 for the data lake (catalogue & ETL on Glue). The Hadoop and Spark are on EMR (Elastic MapReduce). GCP is on the 3rd place with BigQuery as the DW, Cloud Storage as the data lake, and Cloud DataFlow for the streaming data. The Hadoop and Spark are on Cloud Dataproc.

15 July 2018

Power BI Q&A

Filed under: Data Warehousing — Vincent Rainardi @ 5:59 pm
  1. Q: Can Power BI read SSAS cube?
    A: Yes, both tabular (link) and multidimensional (link)
  2. Q: Can Power BI display SSRS reports?
    A: Yes but on-premise only, by installing Power BI Report Server (link)
    SSRS (paginated) reports are not available in the Power BI Cloud/Service, but we can pin SSRS (paginated) reports on Power BI Dashboards (link)
  3. Q: Can I use natural language to query the data in Power BI, e.g. “What were the total sales last year?”
    A: Yes, only in the cloud version, only in English language. On a Power BI dashboard, we can type our question in the “Ask a question about your data” box. Power BI Q&A uses the names of the tables, columns, and calculated fields (link).
  4. Q: What is the difference between a Power BI Dashboard and a Power BI Report?
    A: Dashboard is 1 page, whereas Report is multiple pages. Dashboard is only available in the cloud/service version, whereas reports are available on both cloud and on-prem/desktop version. You can subscribe to a report, but not to a dashboard (link).
  5. Q: What is the difference between the “free” version and the “pay” version of Power BI?
    A: On the “free” version we can’t publish and consume app, can’t share dashboard or subscribe to dashboard (link).
  6. Q: Can we view Power BI reports on Android, or just iPads?
    A: We can view Power BI reports on iPad and iPhone (link) as well as on Android tablets and phones (link) and Windows 10 devices (link)
  7. Q: How can Power BI in the cloud connect securely to an SQL Server on-prem?
    A: By installing a data gateway on-prem, which protects the on-prem SQL Server.
  8. Q: What’s the best way for inputing simple data into Power BI report?
    A: Using embedded PowerApps (link)
  9. Q: Can I run R scripts in Power BI?
    A: Yes, Get Data > More, Other > R script. You can then use the output (data frame) in the Power BI visualisation (link).
  10. Q: How much is Power BI?
    A: Power BI Desktop is free. Power BI Pro is $9.99/month per user. Power BI Premium is $5000 to $6000/month for 1 to 1000 users, if only 100 users are authoring reports (link), i.e. the rest are just consuming the reports/dashboards.


11 May 2018

Big Volume Data Warehouse

Filed under: Data Warehousing — Vincent Rainardi @ 7:42 pm

Some people are worried with the volume of a snapshot fact table. If we insert half a million rows a day, after a year we would get 180 million rows, and after 5 years close to 1 billion rows. For example, position valuation of a large asset manager, or account balances of a large retail bank.

The volume issue also occurs on transaction fact table, i.e. we could be inserting a few million rows a day, resulting a few billion rows a year. For example, stock exchange bids and offers, search engines searches, the browsing traffic of a large ecommerce sites, or the good old supermarket checkouts case?

If we haven’t dealt with billions rows fact table like above, we would be worried about the query performance. How long will it take to query a fact table with 10 billion rows? Would it take a minute? 10 minutes? Well it could be, if the database engine has to scan the table. But if it is partitioned, it could only takes a second.

So the question now becomes: ok, great to know that it can take a second to query a 10 billion rows fact table. But how do I partition that big fact table? Well it depends on what fact table it is. If it a snapshot fact table, the partition should be on the snapshot date. If it is a transaction fact table, the partition should be on the transaction date.

Ok, that’s great. But what if we need to query across dates? For example, in the stock exchange data warehouse, to find out which are the top 10 most traded shares by volume (or by value) in the last 3 months. That query will take a lot of minutes. To do this, we need to prepare a fact table which stores the transaction volume and value of every share on every date. We populate this daily snapshot fact table every day, immediately after the main transaction fact table is populated.

For example, today is 11th May 2018. Right after the 10th May data is loaded into the shares transaction table, we calculate the volume and value of every shares traded on 10th May and insert these rows into this daily snapshot fact table. By querying this table it will only take a second to find out what are the top 10 most traded shares in the last 3 months.

15 April 2018

BI vs Data Science

Filed under: Data Warehousing — Vincent Rainardi @ 5:57 am

Last night there was a talk about Business Intelligence (BI) vs Data Science at Dataiku London. Unfortunately I was not able to attend but that event inspired me to write about this topic as it has always been an interesting topic for me. And I think it is very fitting to this blog as it is about Data Warehousing and BI, and recently I added about Data Science too.


Before we go too much further let’s remind ourselves what BI and Data Science are. Data Science is about scientific approaches to manage and analyse data using statistics, machine learning and visualisation (link). BI (Business Intelligence) is also about managing and analysing data and visualisation, but using business approaches, not scientific approaches (link, link).

I have been creating business intelligence and analytics applications across many industries including travel, banking, pharmaceutical, manufacturing, insurance. Both with a data warehouse/mart, and without. They are used for analysing the business, for reporting, and for dashboarding.

I have done Big Data development for a few years, including Hadoop, Hive, Data Lake, Spark, graph and document databases. In the last 6 months I have been learning Machine Learning and Neural Network, using Jupyter, Matlab and R, and recently Tensorflow. I understand how they are used for visual recognition, predicting prices, network security, marketing and playing games.

This Article

I have written a few articles about “Choosing between Big Data and Data Warehousing” (link, link, link) which differentiate the back-end part of BI and Data Science. So in this article I would like to focus on the front-end part, i.e. how they are used.

One of the best approaches to compare the usages of BI vs Data Science is probably using use cases from a particular industry. So in a particular industry we compare what BI can do and what Data Science can do. For example, in insurance BI is used for analysing customers, earnings, risks, financials and claims, and for reporting (operational, management and regulatory reports). Whereas Data Science is used to forecast losses, income and profitability (both account and product), adjusting rates, classifying customers for renewal, and grouping potential customers.

If we dive into other industry, say travel, we will understand the differences between BI and Data Science a bit more. The more industries we study, the more we understand the differences of how BI and Data Science are used. One is using business approach, and the other is using scientific approach.

With the above background I think we are ready to dive into core of this article, so let’s start with the first industry: insurance. I will then do one more industry: travel. I am a beginner in data science and particularly in machine learning. Suggestions and advice from readers will be most welcome, with regards to both about the approach and the content at or via the comment box below. Once again in this article I’m not discussing the how or technical aspect, but the usage or business advantages, comparing the BI vs Data Science.


General insurance is about insuring buildings, ships and businesses from risks like storms, sinking and frauds. There are about 20 classes in general insurance including property, aviation, marine, construction, oil & gas, professional indemnity, public liability and riots. It is a little bit different from life and pension business, and from retail business, i.e. home, car, health, pet and travel insurance.

I guess in this example we need to pick one line out of the above 3 main insurance businesses, which branches into many lines. Let us use car insurance for this example, because it is retail business so many people can relate to, and it’s easier to understand. It has many attributes which provide good illustration for the BI and Data Science usage.

The main usage of BI in car insurance is to analyse premium income and claim costs. These two numbers (and many of their derivatives, i.e. net, gross, subs, etc.) are analysed with regards to vehicle, time, office and customer attributes. Two of most common methods of this analysis is by using OLAP cubes and dashboards. The dashboards presents the most important findings such as income growth by vehicle types and profitability by customer groups, with guided drilldown along chosen pathways. The cubes provide a free-to-roam exploration platform, enabling drilling any numbers to the lowest details on any available pathways.

The business values of these activities are from knowing when and where exactly the revenues and costs are coming from and what factors influence them. This knowledge gives the business greater control and ability to grow the profitability across all business segments. Understanding claims and premiums profile for each type of vehicle (age, van/SUV, fuel, gearbox, engine size, parking, usage, milage) and each type policy holders / drivers (location, years of licence, profession, accident history, health, penalty points) enables the business to target the sales and marketing budgets on the more promising customers, as well as pricing the policy at the correct level. It also enables more precise filtering with regards to following up the leads from the Aggregates such as GoCompare and MoneySupermarket, and from brokers.

The Data Science is used to analyse customer churns, establishing rates, and analyse black box telematics with regards to risks and impact to premium levels. Standard actuarial rates give the business non-competitive advantage as they use standard factors such as driver age, engine size, mileage and parking location. Data Science enables insurers to factor-in new attributes such as driving behaviour, driving routes and driver’s social media, calculating the underwriting rates more specifically and more precisely for each policy. This enables the business to win new businesses, both commercial and retail, as quotes/pricing is the most important factor influencing new customers. Also, machine learning is used to improve the efficiency of online advertising, by predicting which how many users (and of which types) would click-through on different types of ads and by optimising bidding.

So BI and Data Science give different business advantages. It is not true that Data Science will replace BI in car insurance business, and in insurance in general. On the contrary Data Science will complement the business advantages that BI currently delivers.


Selling flights, hotel, holidays and cruises are the core businesses of travel companies. Car hire, travel insurance, excursions, airport transfer, airport parking, tours, restaurants, attractions, shows, villa, currency exchange and rail passes are the secondary businesses. They have retail and commercial business lines, both heavily utilising internet as a marketing platform because it reaches many countries and very cost efficient.

The BI is used for analysing web traffic such as funnels and conversion rates, revenue breakdown, customer segmentation, customer loyalty programs, and marketing campaigns such as email alerts and newsletters. Funnels enable the business to understand how much traffic filters trough each step of the purchasing process, and at which page each customer stops, as well as the sequence of pages they viewed and for how long each. This ultimately enables us to improve the website content and the products, resulting in higher conversion rates. Revenue breakdown is by product, by customer types, by country, by platform, by campaign, and by time. Marketing campaigns enable the business to drive more sales, with the content of each email/newsletter tailored differently to each individual customer. Each image on the newsletter is uniquely tagged enabling us to track which click from which customer triggered each sale.

The business value of these activities are: increased revenue, wider customer base, and increased customer loyalty. The revenue is higher because of higher conversion rates on web traffic and because the marketing campaigns drive more sales to the websites. Also because as a result of breakdown analysis we can focus our budget and time on the more profitable and promising product types and customer segments. The customer base is wider because market penetration from the campaigns. Customer loyalty increased because the offer is tailored specifically for each customer segment, sometimes down to each individual customer.

The Data Science can be used for customer segmentation, predicting campaign results, and analysing web traffic. It seems overlapping with the above BI activities, but if we dive into the detail levels they are actually complementing. By associating real time web traffic data and past purchase data using machine learning, travel companies can find out which customers have the best potential for purchasing products (using cookies to identify customers). The learned parameters are then used to display offers on the web pages the customers are currently on. The offers are specifically tailored to the highest potential that they are interested, based on their browsing behaviours (current and past sessions).

Predicting campaign results is done by first tracking back each sale to find its origin and contributing factors. The origin can be a click on an image on a particular email sent by a particular campaign. This can be identified by the tag on that image, which has the customer ID, campaign ID and product ID. The origin can also be a Google or social media advert from a particular campaign that the travel company runs, giving us the ID of the ad and possibly the user/audience ID. The contributing factors of a successful campaign can be the product, creative, layout, timing, pricing, audience and origin. Knowing which factors contribute the most to the success or failure of each campaign can be very revealing, from which we can then improve the subsequent campaigns. Various machine learning algorithms can be used for this including support vector machine, random forest and neural network.

Next Page »

Blog at