Data Warehousing, BI and Data Science

21 February 2019

Data Files – Delimiter and Qualifier

Filed under: Data Warehousing — Vincent Rainardi @ 8:06 am

Suppose you got to specify a file spec for your data supplier to send their data via FTP. Would you request them to send you Excel files or CSV files? Do you prefer a tab delimited text file or pipe delimited? Do you ask them to qualify their string with double quotes?

Excel or CSV

I would prefer CSV files to Excel files, because the number of columns in an Excel file that SSIS can read is limited to 255 columns (see link), whereas in CSV files there is no limitation on the number of columns.

To overcome this limitation we need to import the file twice (or three times) using two data source components and then join them. But in order to join them we need an identity column, which becomes the join criterion (see link). We also need to be careful with performance when joining, because a merge join can be slow.

The second issue with Excel files is the OLE DB provider, which must be installed on the server where SSIS is running; otherwise we get an error message saying that “OLE DB provider Microsoft.ACE.OLEDB.12.0 is not registered” (see link). The “Use 32 bit runtime” setting in the SQL Agent job step and the Run64BitRunTime project property also affect this.

The other disadvantage of importing an Excel file is dealing with leading zeros in a string, such as “007”, which automatically becomes the number 7.

There are also numbers in scientific notation, such as 1.23E21, which will be imported as a string rather than a number, causing a failure. In a CSV file it is written as 1230000000000000000000 and imported as a number.

The other limitation of Excel is that cells containing long strings, such as 2000 characters, are cut to 256 characters. This issue only happens in editions before Excel 2007.

Delimiter

What delimiter do you prefer: comma, pipe or tab?

The problem with CSV, or any comma delimited file, is when we have commas in the data. This causes misalignment when the file is imported. For example, if there are 10 columns and one of them contains a comma, that row becomes 11 columns. This problem is known in the IT world as “delimiter collision”, see here: link.

A comma in the data is very common when dealing with numeric fields, such as “123,000”. Or, in countries like Indonesia which use a comma as the decimal separator, it looks like this: “100,23”.

We can enclose it with double quotes, but why not eliminate the problem in the first place? That’s why for text files people prefer pipe delimiter or tab delimiter. Text file with tab delimiter is known as tab-separated values, or TSV (link), but pipe delimited files are not known as PSV. We do have DSV though, which stands for Delimiter Separated Values (link).

Pipe is generally preferable because of the perception that it is rarer than tab. Some strings may contain a tab, for example in a commentary field.

Qualifier

Qualifier means enclosing the data with something, such as a double quote. Double quote is the most common qualifier. Other qualifiers (far less common) are single quotes and brackets. There are 3 different types of bracket, i.e. [ ], < > and { }.

The issue with the double quote qualifier is that the data may contain double quotes, such as in commentary fields. This applies to the other qualifiers too. That’s why it is ideal to use a pipe delimiter, without a qualifier.

So that’s my preferred choice for the format of a data file, if I can choose it: a pipe delimited text file without qualifiers. But unfortunately, in reality, we very rarely get the opportunity to choose the format. Usually the vendor dictates the format because the same file goes to many customers.
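As an illustration, here is a minimal T-SQL sketch of loading such a file into a staging table using BULK INSERT. The table name, file name and columns are made up for illustration, not from a real system:

CREATE TABLE dbo.StagingCustomerOrder
( OrderNumber  varchar(20),
  CustomerName varchar(100),
  OrderAmount  decimal(18,2)
)

BULK INSERT dbo.StagingCustomerOrder
FROM 'D:\Incoming\CustomerOrder_20190221.txt'
WITH
( FIELDTERMINATOR = '|',   -- pipe delimiter, no text qualifier
  ROWTERMINATOR   = '\n',  -- one row per line
  FIRSTROW        = 2      -- skip the header row
)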


20 February 2019

Transactional Fact Tables

Filed under: Data Warehousing — Vincent Rainardi @ 9:11 pm

Transactional fact tables are not as popular as periodic snapshot fact tables. In this article I would like to compare transactional and periodic snapshot fact tables, list their advantages and disadvantages, and give two cases from my own experience where I needed to decide between the two. But first let me explain what these two types of fact tables are.

What is a transactional fact table?

In dimensional data modelling we have 3 types of fact tables: transactional, periodic snapshot and accumulating snapshot. I’m going to explain the first two below.

1. Transactional Fact Table

A transactional fact table is a fact table where:

  • Each event is stored in the fact table only once.
  • It has a date column indicating when the event occurred.
  • It has an identifier column which identifies each event.
  • The number of rows is the same as the source table.

A classic example is when in the source system we have a sales table containing customer orders (for example, in a shop, a restaurant, or a factory). Say Monday we had 80 orders, Tuesday 70 orders, Wednesday 90, and so on. So on Monday night we load the 80 rows for Monday into the data warehouse. On Tuesday night we load the 70 rows for Tuesday, and on Wednesday night we load the 90 rows for Wednesday.

Figure 1. A transactional fact table loading 3 days of data

In the data warehouse we store the customer orders in the Sales Fact Table, which is a transactional fact table.

  • In this Sales Fact Table, every order is stored only once. The 70 orders on Tuesday are different to the 80 orders on Monday. And they are also different to the 90 Wednesday orders.
  • For this we use the order date column. In the above example the Monday, Tuesday and Wednesday are the order dates. This order date column indicates when the event occurred, when the order happened.
  • In this sales table we also have a sales identifier, such as order number if it is a shop, or ticket number if it is a restaurant.
  • On Wednesday night, after the warehouse load finishes, we have 80+70+90 = 240 rows, the same as in the source system.

In addition to inserts, we also have updates in the source table. In the example above, in addition to the 70 new orders on Tuesday, we also have updates to some of the 80 Monday orders. This is the key difference from the Periodic Snapshot fact table: a Transactional fact table updates existing rows, and therefore loses some history.

Other ideal examples of a transactional fact table are the journal table in an accounting system, a trade table in an investment banking system, a premium table in an insurance system, a payment table in a payment system, a transaction table in a retail banking system, and a call table in a telecommunication system.
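To make the nightly load concrete, here is a minimal T-SQL sketch of the insert-only load described above. The table and column names are illustrative, not from a specific system:

DECLARE @LastLoadDate date = '2019-02-18'  -- normally read from an ETL control table

INSERT INTO dbo.FactSales
  (OrderDateKey, OrderNumber, CustomerKey, SalesAmount)
SELECT
  CONVERT(int, CONVERT(char(8), S.OrderDate, 112)),  -- order date as a YYYYMMDD date key
  S.OrderNumber,
  C.CustomerKey,
  S.SalesAmount
FROM Staging.CustomerOrder S
JOIN dbo.DimCustomer C ON C.CustomerId = S.CustomerId
WHERE S.OrderDate > @LastLoadDate  -- only the orders received since the last load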

2. Periodic Snapshot Fact Table

A periodic snapshot fact table is a fact table where:

  • The whole source system is copied into the fact table regularly.
  • The same event is stored multiple times.
  • It has a snapshot date column indicating when a copy of the source table was created.

An ideal example of a periodic snapshot fact table is the bank account balance. At the end of each day, the balances of every customer account in the bank are stored in this account balance table. Say there were 20,000 customers on Monday, 22,000 customers on Tuesday and 24,000 customers on Wednesday.

Figure 2. Periodic Snapshot Fact Table

  • Every day we copy the whole content of the account balance table into the periodic snapshot fact table.
  • So on Monday night we stored 20,000 rows in the account balances periodic snapshot fact table, on Tuesday night 22,000 rows and on Wednesday night 24,000 rows. So an account is copied every day to the fact table, each day with potentially a different balance amount.
  • In the fact table we have a column called snapshot date. For all the rows created on Monday night, we set the snapshot date column to (for example) 11th Feb 2018. For the rows created on Tuesday night we set the snapshot date to 12th Feb 2018 and for the Wednesday rows we set the snapshot date to 13th Feb 2018.

Of course there are accounts which were closed on Tuesday and are no longer in the account balance table in the source system. In the fact table, the Monday data set contains these accounts, but the Tuesday data set doesn’t, and neither does the Wednesday data set.

And there are accounts which were updated on Tuesday. These changes are reflected in the Tuesday snapshot in the fact table, different to their Monday rows.

Other examples are the inventory table in manufacturing, the holdings table in fund management, the billing balance table in telecommunications, and the daily temperature table in a weather system.
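Here is a minimal T-SQL sketch of the daily snapshot load (the names are illustrative): every night the whole source balance table is inserted into the fact table, stamped with the snapshot date.

INSERT INTO dbo.FactAccountBalanceSnapshot
  (SnapshotDateKey, AccountKey, BalanceAmount)
SELECT
  CONVERT(int, CONVERT(char(8), GETDATE(), 112)) AS SnapshotDateKey,  -- today as YYYYMMDD
  A.AccountKey,
  S.BalanceAmount
FROM Staging.AccountBalance S
JOIN dbo.DimAccount A ON A.AccountNumber = S.AccountNumber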

Advantages and Disadvantages

The advantages of a transactional fact table are:

  • It mimics the source table
  • It is simpler as we only have one version of each event

The disadvantages of a transactional fact table are:

  • We don’t have the previous values
  • Updates are slow if the fact table is large, which is potentially a performance issue

Now let’s get on with the two cases of transactional fact table implementations.

Case 1: Retail Association

Let’s suppose we are running a Retail Association. Every shop in the country reports their annual sales data on our website, within one month of their financial year end. In January 2019 there were 8,000 shops reporting their sales data on our website, and in February 2019 there were 9,000 shops reporting their sales data. There are about 100,000 shops in total.

Every month we get a CSV file containing the previous month’s data. The file is called Data_YYYYMMDD.csv and the file contains a column called Reporting Date. So:

  • The Data_20190131.csv contains the 8,000 shops reporting in January 2019, with the Reporting Date column containing dates from 1st to 31st Jan 2019.
  • The Data_20190228.csv contains the 9,000 shops reporting in February 2019, with the Reporting Date column containing dates from 1st to 28th Feb 2019.

Figure 3. Retail Association monthly data file loaded into a transactional fact table

Because in the February file the January data is not given in full (only the changed and new rows), we can’t make it a periodic snapshot fact table. So in this case a transactional fact table is the only option.

Heart of Transactional Fact Tables

In theory the loading is straightforward: the Jan file is loaded, then the Feb file, and so on. But in reality this is rarely the case. A few shops which were supposed to report in January were late, and they reported in February. So in the February file we also get a few rows from the January “late reporter” shops.

Some shops made an error in their January submission and corrected it in February. So in the February file we also have a few rows containing January corrections.

That is the heart of transactional fact tables: performing updates to the previous months’ data. In this case it goes back only one month, but in real cases it could be a few months.
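Here is a minimal T-SQL sketch of that monthly load (the table and column names are illustrative, not the association’s actual schema): new shops and late reporters are inserted as new rows, while corrections update the rows loaded in previous months.

MERGE dbo.FactShopSales AS T
USING Staging.ShopSales AS S
  ON T.ShopKey = S.ShopKey AND T.ReportingYear = S.ReportingYear
WHEN MATCHED THEN                -- correction to a previously loaded row: update it
  UPDATE SET T.ReportingDateKey = S.ReportingDateKey,
             T.SalesAmount      = S.SalesAmount
WHEN NOT MATCHED BY TARGET THEN  -- new shop or late reporter: insert a new row
  INSERT (ShopKey, ReportingYear, ReportingDateKey, SalesAmount)
  VALUES (S.ShopKey, S.ReportingYear, S.ReportingDateKey, S.SalesAmount);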

Case 2: Waste Reporting

Suppose we are working with a government department responsible for implementing the waste regulations in our country. Every company and government agency in the country needs to report the amount and types of their packaging waste, electronic waste, commercial waste and recyclables, e.g. plastic, paper and metal. This report is annual (once a year). Every town council also needs to report the amount and types of their household waste, also annually.

The timing of when they have to submit the report is different for every company, agency and council, depending on their financial year end. But the report covers the same year, which is the previous calendar year. So companies/agencies/councils with a financial year ending on 31st March 2018 need to report the waste produced from Jan to Dec 2017. Those with a financial year ending on 31st August 2018 also need to report the waste produced from Jan to Dec 2017.

The data we get from this central government department is monthly, i.e. every month we get a file. The file contains the waste produced in the previous calendar year. So the file we get in all 12 months of 2019 contains the waste data for 2018, but growing every month. For example, in Jan 2019 the file contains 17,000 entities (i.e. companies, agencies and councils), in Feb 2019 the file contains 32,000 entities, the March 2019 file contains 49,000 entities, and so on, like below.

Figure 4. Waste Regulation monthly data file loaded into a transactional fact table

The February file contains all the 17,000 January entities plus the 15,000 entities that report in February. But the February file also contains corrections for January. In other words, in the February file the waste data for some of these 17,000 January entities might have changed. For example, company A, which in January reported 11,000 kg of waste for the 2018 calendar year, might in February submit a correction to change this figure to 11,500 kg. This correction may happen up to 6 months after the submission deadline. So a correction to the January data can appear in the July file.

Now we need to decide whether a transactional fact table is the right approach for this case.

Figure 5. Comparing Transactional Fact Table and Periodic Snapshot

If we build this fact table as a periodic snapshot fact table, we will have every version of the data. We will have the January version when Entity A was 11,000 kg, and we will also have the February and March version when entity A was 11,500 kg.

If we build it as a transactional fact table, we will only have the latest version.

From my experience many people simply go for the snapshot model because of “just in case”, i.e. in case we need the January version.

The other reason, which is more valid, is traceability. If we use a transactional fact table, the report we produced in Feb 2019 (based on the Jan data) will have different numbers to the report we produce in Apr 2019 (based on the Mar data). If we use the snapshot model, we can explain the differences, i.e. because in January Entity A was 11,000 kg, not 11,500 kg.

So in case 1 we are dealing with “incremental” monthly data files, whereas in case 2 we are dealing with “accumulated” monthly data files.

 

15 February 2019

Data Type

Filed under: Data Warehousing — Vincent Rainardi @ 6:53 pm

In the data warehouse, should we make the data type the same as in the source system? If in the source system the data type of the currency code column is varchar(50), in our data warehouse should we set the data type to varchar(50) or to varchar(3)? If an ISIN column is defined as varchar(100) in the source system, should you set the data type to varchar(100) or varchar(12)?

The short answer is: set it to whatever the source system specifies.

The long answer is:

  1. If the source system specifies a ridiculously long data type such as varchar(500) for a currency code, varchar(1000) for an ISIN, or varchar(8000) for a description column, and the actual data in those columns is much shorter than the data type (e.g. 3, 12 and 200 characters respectively for the above 3 examples), then set the data type to fit the actual data length, plus some spare.
  2. Have a standard in the warehouse for lengths. For example: 1, 5, 10, 20, 50, 100, 200, and pick the next one up. For example if the actual data length is 12 set the data type for that column to 20, and if the actual data length is 30 set the data type to 50.
  3. In the ETL, if the actual data exceeds the column data type, cut the data. For example, if the data type of the attribute column in the dimension is 100 and the actual data is 101 characters, then cut it to 100 characters and load it into the dimension, but flag this in the ETL error log (see the sketch after this list).
  4. Talk to the source system team. Perhaps there is a reason why the designer of that source column set it to be wider than it should be. Perhaps when there is no ISIN, they put another security identifier such as currency + SEDOL in there? Or an internal security identifier, which is quite long?
  5. If the source system team confirms that the data they put into that column will not exceed 12 characters (even though the column is defined as 100 characters) then set the column in the warehouse to 12. But again, in the ETL cut it to 12 if it exceeds 12, and log it.
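Here is a minimal T-SQL sketch of point 3 above (the table, column and log names are illustrative): log the source rows that are longer than the dimension column, then load the value cut to 100 characters.

INSERT INTO dbo.EtlErrorLog (TableName, ColumnName, KeyValue, ErrorMessage, LoggedAt)
SELECT 'DimProduct', 'ProductDescription', S.ProductCode,
       'Value truncated from ' + CAST(LEN(S.ProductDescription) AS varchar(10)) + ' to 100 characters',
       GETDATE()
FROM Staging.Product S
WHERE LEN(S.ProductDescription) > 100

UPDATE D
SET    D.ProductDescription = LEFT(S.ProductDescription, 100)  -- cut to the column length
FROM   dbo.DimProduct D
JOIN   Staging.Product S ON S.ProductCode = D.ProductCode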

 

31 January 2019

Column Store Index

Filed under: Data Warehousing — Vincent Rainardi @ 8:57 am

I write this article specifically to answer this question:
Do column store indices improve query performance in data warehousing?

So I’m not looking at data loading performance, and I’m looking specifically at data warehousing, i.e. querying fact tables. From past experience I know that the answer is yes if the fact tables are large and the query is selective. Otherwise the answer is no. But I don’t have the numbers. So with this article I would like to measure it. Thank you to Nadir Doctor for suggesting this topic.

Experiment Design

I will first query a small fact table (50k rows) which is indexed the normal way, i.e. clustered on the PK, non-clustered on the SKs. For this I will use the internet sales fact table in the AdventureWorks DW database. I’ll execute two queries. The first one will select all rows, and the second one will select only 0.1% of the rows.

I will then create column store indices on this fact table, and execute the same two queries. I expect that there will be no performance improvement because the fact table is small.

Then I’ll add more rows to the internet sales fact table so that it becomes 10 million rows and execute the same queries. I expect there won’t be a performance improvement on the select all, but a significant improvement on the selective query.
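As a side note, one simple way to grow the table to roughly 10 million rows (this is only a sketch of the idea, not necessarily the exact script behind the numbers below) is to copy the existing ~60k rows about 165 times into a new table, keeping a copy number so the rows remain distinguishable:

SELECT F.*, N.n AS CopyNumber
INTO dbo.FactInternetSales10M
FROM dbo.FactInternetSales F
CROSS JOIN
( SELECT TOP (165) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS n
  FROM sys.all_objects
) N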

To summarise:

Fact table size | Query            | Expected improvement from column store index
50k rows        | Select all rows  | No
50k rows        | Selective (0.1%) | No
10 million rows | Select all rows  | No
10 million rows | Selective        | Yes

Note: The reason I put the “select all” in there is because that’s how OLAP tools work, e.g. Power BI, SSAS, QlikView: they select all rows from the fact table. Whereas reporting tools such as SSRS work on selective queries (Power BI can be selective too).

SQL Queries and DDL

The query I use is as follows:

select F.SalesOrderNumber, F.SalesOrderLineNumber, DC1.FirstName + ' ' + DC1.LastName as CustomerName, DP.ProductAlternateKey, DP.ModelName, DC2.CurrencyAlternateKey as Currency, F.UnitPrice, F.OrderQuantity, F.SalesAmount
from FactInternetSales F
left join DimCustomer DC1 on F.CustomerKey = DC1.CustomerKey
left join DimProduct DP on F.ProductKey = DP.ProductKey
left join DimSalesTerritory DST on F.SalesTerritoryKey = DST.SalesTerritoryKey
left join DimCurrency DC2 on F.CurrencyKey = DC2.CurrencyKey
where DST.SalesTerritoryCountry = 'United Kingdom' and F.OrderDateKey/100 = 201401

For select all rows I use the same query but without the where clause.

The DDL to create the indexes is as follows:

-- Column store scenario: a clustered column store index on the fact table
create clustered columnstore index PK_FactInternetSales_SalesOrderNumber_SalesOrderLineNumber on dbo.FactInternetSales

-- Row store scenario (the "normal way"): clustered index on the PK, non-clustered indexes on the SKs
create clustered index PK_FactInternetSales_SalesOrderNumber_SalesOrderLineNumber on dbo.FactInternetSales(SalesOrderNumber, SalesOrderLineNumber)
create nonclustered index FactInternetSales_CustomerKey on dbo.FactInternetSales(CustomerKey)
create nonclustered index FactInternetSales_ProductKey on dbo.FactInternetSales(ProductKey)
create nonclustered index FactInternetSales_SalesTerritoryKey on dbo.FactInternetSales(SalesTerritoryKey)
create nonclustered index FactInternetSales_CurrencyKey on dbo.FactInternetSales(CurrencyKey)

Results

Figure: Results (query performance numbers are in milliseconds)

So as expected, the column store indices only improve the query performance if the table is large and the query is selective.

Other Considerations

Apart from the query performance, of course we also have to consider the maintenance, i.e. how much effort we need to spend to keep the column store indices up to date. Is it a lot more than for row store indices? (That is beyond the remit of this article.)
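As a starting point only (not a full maintenance strategy), here is a hedged example for SQL Server 2016 onwards of checking the column store row group health and reorganising the index:

SELECT i.name AS index_name, rg.state_desc, rg.total_rows, rg.deleted_rows
FROM sys.dm_db_column_store_row_group_physical_stats rg
JOIN sys.indexes i
  ON i.object_id = rg.object_id AND i.index_id = rg.index_id
WHERE rg.object_id = OBJECT_ID('dbo.FactInternetSales')

ALTER INDEX PK_FactInternetSales_SalesOrderNumber_SalesOrderLineNumber
  ON dbo.FactInternetSales REORGANIZE WITH (COMPRESS_ALL_ROW_GROUPS = ON)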

References:

  1. Column Store Index – Design Guidance: link
  2. Column Store Index – Data Warehouse: link
  3. Niko Neugebauer on Column Store: link

28 January 2019

SSIS: Numeric Column Loaded as NULL

Filed under: Data Warehousing — Vincent Rainardi @ 6:08 am

Why does SSIS load a numeric column in Excel as NULL?
(only applies to Excel 97-2003 files, does not apply to Excel 2013 files)

What Happened

I have an Excel 97-2003 file containing a numeric column. The first 10 rows of this column are blank. There are numbers from row 11 onwards. When SSIS loads this file, all columns are imported correctly into the target table, except this one column which is imported as NULL.

Figure 1. Loading an Excel 97-2003 file (left: Excel 97-2003 source file, right: target table in SQL Server)

Why It Happened

Why is Column2 loaded as NULL, but Column3 loaded OK?

This has nothing to do with SSIS. It is because of the Microsoft Excel driver. The Excel driver checks the first 8 rows of that column to find out the data type. Because they are blank, the Excel driver (not SSIS) sets the expected data type to Text. And because the actual data is a number, it is imported as NULL.

How It Happened

Here is how it happened (the detailed mechanics).

When we set up the Excel Connection Manager in SSIS (see below), the Excel driver installed on our PC performs these steps:

Figure 2. Setting up the Excel source

Step 1. The Excel driver looks at the TypeGuessRows and ImportMixedTypes registry settings under: HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Office\15.0\Access Connectivity Engine\Engines\Excel

Figure 3. Registry Settings

Note: The branch in the registry is different for each Office version and 32/64 bit, see list here.

Step 2. The Excel Driver reads the first N rows from the Excel file to determine the data type, where N is the TypeGuessRows.

Note:

  • The default for TypeGuessRows is 8.
  • If the TypeGuessRows is set to 0, it reads the first 16384 rows (2^14).
  • To check all rows, the TypeGuessRows needs to be set to 1,048,576 (2^20), the maximum number of rows allowed in Excel.

Step 3. The Excel Driver sets the expected data type for that column as follows:

  • If the value of ImportMixedTypes is “Text”, the expected data type is set based on the majority of the first N rows, where N is the value of TypeGuessRows. There are 5 possible data types: Text, Number, Date, Currency, Boolean.
  • For example, if in the first 8 rows there are 4 dates, 2 texts, 1 number and 1 currency, the expected data type will be set to Date.
  • Another example: if in the first 8 rows there are 5 texts and 3 numbers, the expected data type will be set to Text.
  • If there is a tie like (4 and 4) or (3,3,2) or (2,2,2,2) then the Number wins, followed by Currency, Date, Text and the last one is Boolean.
  • If the first 8 rows are all NULL the expected data type is set to the ImportMixedTypes.
  • In the above case, Column1 is set to Number (majority of 8), Column2 to Text (all NULL), Column3 to Number (majority of 1), Column4 to Text (all NULL).

Step 4. When we click the Preview button (see Appendix 1), the Excel driver reads the first 200 rows from the Excel file and displays the values on the screen. If a value from the file does not match the expected data type, the Excel driver returns NULL to SSIS.
This is why in the above case Column2 is shown as NULL in the Preview screen.
The same happens during the actual load process.

Excel 2007 or Excel 2013

What happens if we import an Excel 2007/2013 file?

If we import an Excel 2007 or Excel 2013 file, the numeric column is loaded correctly, see Column2 below.

Figure 4. Loading Excel 2007 or 2013

When loading an Excel 97-2003 file we get the same result as above if we set the Excel version on the Excel Connection Manager to “Microsoft Excel 2007”, like this:

Figure 5. Changing the Excel Version

So when we face a problem with a numeric column being NULL when loading from Excel 97-2003, we can set the Excel version to 2007 in the Excel Connection Manager and SSIS will load it correctly.

 

27 January 2019

Starting a Post

Filed under: Data Warehousing — Vincent Rainardi @ 6:09 am

I usually start writing a post just like that. As if I was explaining it to someone in front of me.

But there is a better way to start a post, i.e. by formulating the thing you want to explain in the post into a question. This will grab the attention of the reader straight away.

For example, I am about to write about an SSIS issue that happens when we import an Excel file containing a numeric column. The issue is: if the first 10 rows of that column are blank, SSIS will import all rows of that column as NULL.

I could start the usual way by explaining the issue (what happened), why it happened and what the solution is. I still think that is a good way to write it. But a better way to start the article is to formulate the issue into a question: Why does SSIS import a numeric column in Excel as NULL?

So that’s what I’m going to do in the next post.

5 January 2019

Asset Management Companies in the UK

Filed under: Data Warehousing — Vincent Rainardi @ 11:21 am

Having worked at several investment banks and asset management companies in the UK, I was wondering which ones are the biggest asset managers in the UK. I don’t mean to list just the UK fund managers, but also the US fund managers operating in the UK. So I searched Wikipedia, Google and the individual websites of each company, and wrote what I found below. Links to sources are on the AUM figures.

Largest asset management companies in the UK are (AUM $0.4tn or more, rounded to $0.1tn):
Note: only the asset management division/arm, not the whole group.

Smaller ones are (AUM under $0.4tn, rounded to $bn):

Those with AUM under $10bn are (please correct me if I’m wrong): Amati, Chelverton, City Financial, Downing, Hawksmoor, Premier, Miton, Neptune, Sanford DeLand, Wise. Those with AUM not known are: Cavendish, Evenlode, Unicorn.

Whereas the investment banks operating in London are (HQ mostly in Canary Wharf):

  1. Barclays
  2. BNP Paribas
  3. Bank of America Merrill Lynch
  4. Citigroup
  5. Credit Suisse
  6. Deutsche Bank
  7. Goldman Sachs
  8. HSBC
  9. JP Morgan
  10. Mizuho
  11. Mitsubishi UFJ
  12. Morgan Stanley
  13. Nomura
  14. UBS

 

9 December 2018

Performance Attribution

Filed under: Data Warehousing — Vincent Rainardi @ 9:37 pm

In asset management, the business is about managing portfolios. Either it is a public fund which individual investors can put their money into, or a portfolio created specifically for one client such as a pension fund or an insurance company.

Almost every portfolio has a benchmark*. The objective of a portfolio is to beat the benchmark, meaning that the performance return of the portfolio should be higher than the performance of the benchmark. A benchmark is usually an index which operates in the same asset class, region and sector. For example, if the fund is UK equity, the benchmark can be the FTSE All-Share.
*Note: an absolute return portfolio doesn’t usually have a benchmark as it is compared to zero (or 3 month Libor in some cases)

A portfolio grows over time, say 12% in the last 1 year. The benchmark to which the portfolio is compared also grows, say 10% in the last 1 year. The art of understanding where this 2% difference comes from is called performance attribution. Is it from the US companies or the European companies? Is it from technology companies or mining companies?

Fortunately there is mathematics which can explain it. Knowing what makes up this 2%, or where this 2% comes from, is very important because it gives us insight into which parts of the portfolio are successful and which parts are not.

This 2% is called Relative Performance: relative between the portfolio and the benchmark. It is also called Excess Performance or Active Performance. Sometimes people call it outperformance (or underperformance if the portfolio return is less than the benchmark). In this article I would like to describe how to allocate the excess performance into different geographic regions/countries, different asset classes, different sectors or different currencies.

But before getting into that I would like to give some background on what a portfolio is, what a holding is, what a performance return is, what a benchmark is, and finally what performance attribution is. I hope with this background/introduction, readers without any background in investment management will be able to understand what performance return is, before getting into performance attribution.

Part 1. Portfolio, Weight, Benchmark and Performance

In a portfolio there are many assets. They are known as holdings. A holding can be a share, a bond or any other type of security. And cash. A global equity portfolio, for example, can contain 50 holdings, some of which are shares of US companies, some are shares of European companies, and some are companies in Asia, the UK, Japan and Latin America.

Let’s say that the total value of this portfolio is 100 million USD. This $100m is called the AUM (Assets Under Management). Let’s say that we invested in 10 US companies, and we put $5 million in each, so the value of our holdings in these 10 US companies is $50m. That is 50% of the AUM. This 50% is called the weight.

Each of the 50 companies in the portfolio has a share price. The share price changes every day. Some increase, some decrease. Let’s say that from a total value of $100m on 31st Oct 2017, within 1 year it became $111.32m on 31st Oct 2018, an increase of 11.32%. This 11.32% is called the Performance Return. We say that our portfolio has a performance return of 11.32%.

In daily conversation people don’t say the “Return”, they just say “performance”. We can measure the performance of the fund over the last 1 month, the last 3 months, the last 6 months or the last 1 year. So if today is the 3rd of December 2018, the 1 month performance of our fund is measured by comparing the total value of the fund (the AUM) on 31st Oct 2018 to 30th Sep 2018. The 3 months performance is measured by comparing 31st Oct 2018 to 31st July 2018. And the 1 year performance compares 31st Oct 2018 (the last month end) to 31st Oct 2017 (1 year before).

Accumulative and Annualised Return

In addition to 1, 3, 6 and 12 months, we also have 3 years, 5 years and 10 years performance returns. We can present the performance returns greater than 1 year in two different ways. The first one is called accumulative, i.e. for 3 years we compare the value of the portfolio on 31st Oct 2018 (the latest month end) to the value of the portfolio on 31st Oct 2015 (3 years before). Let’s say that the value of the portfolio over 3 years grows from $100m to $133.1m, so it grows by 33.1%. This 33.1% is the accumulative performance return for 3 years.

The second is called the annualised return. What is x, so that if the portfolio grows by x every year for 3 years, it grows from $100m to $133.1m? In this case it is 10%. From $100m it becomes $110m, then $121m, then $133.1m. This 10% is called the annualised performance return.

Arithmetic and Geometric Annualised Return

When calculating the annualised return we have two options. From the 3 years accumulative return of 33.1% we can divide by 3, getting 11.033% per year. This is called the arithmetic annualised return.

The other way is to find x% where $100m * (1+x) * (1+x) * (1+x) = $133.1m. And in this case that x is 10%. This is called geometric annualised return.

We can see here that the geometric annualised return (10%) is lower than the arithmetic annualised return (11.033%). This is because of compound growth, i.e. the amount of growth in year 2 is bigger than the growth in year 1, and the amount of growth in year 3 is bigger than in year 2 (for the same % growth).

The geometric annualised return is the real thing. It is the right one to use. The arithmetic annualised return is technically incorrect, so it should not be used. It is only there for quick “back-of-the-envelope” calculations.
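As a quick worked example, using T-SQL simply as a calculator for the $100m to $133.1m growth above:

SELECT
  (133.1 / 100.0 - 1) / 3           AS arithmetic_annualised,  -- 0.1103 = 11.033% per year
  POWER(133.1 / 100.0, 1.0 / 3) - 1 AS geometric_annualised    -- 0.10   = 10% per year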

Gross and Net Returns

As fund managers we charge fees, so there are two options: before the fees are deducted (called gross) or after the fees are deducted (called net). If a fund is growing by 10% per year (that’s the gross performance return), and the total of fees and charges is 1%, the net performance return is 9%.

When investors ask “What’s the performance of your fund?”, they usually mean the net return, i.e. after the costs and charges/fees. But a fund manager usually prefers to quote the gross return, because that’s what she/he delivers (the charges are imposed by the company, not by them) and because the gross return is higher, so it sounds better.

Usage of Performance Return

Investors don’t like performance return numbers covering less than one year. Many pension funds prefer to invest only in funds which are 3 years old or more. They would like to know the 3 years annualised net performance of the fund, to be compared with the 1 year net return to understand the long term and short term view of performance. And to be compared with the volatility (also 1 year and 3 years) to understand the risk, as well as with the Alpha, Beta and Sharpe numbers.

Part 2. Performance Attribution

Now that we know what performance return is, let’s talk about performance attribution, which is the topic of this article.

Let’s have a look at this global equity portfolio:

Region       | Portfolio Weight | Benchmark Weight | Portfolio Return | Benchmark Return | Stock Selection | Regional Allocation | Interaction Factor | Total
US           | 50%              | 44%              | 15%              | 12%              | 1.32%           | 0.19%               | 0.18%              | 1.69%
UK           | 10%              | 15%              | 10%              | 8%               | 0.30%           | 0.04%               | -0.10%             | 0.24%
Europe       | 18%              | 14%              | 5%               | 3%               | 0.28%           | -0.23%              | 0.08%              | 0.13%
Asia Pacific | 3%               | 9%               | 2%               | 4%               | -0.18%          | 0.29%               | 0.12%              | 0.23%
Japan        | 14%              | 8%               | 12%              | 8%               | 0.32%           | -0.05%              | 0.24%              | 0.51%
Other        | 3%               | 10%              | 6%               | 9%               | -0.30%          | -0.01%              | 0.21%              | -0.10%
Cash         | 2%               | 0%               | 0%               | 0%               | 0.00%           | -0.18%              | 0.00%              | -0.18%
Total        | 100%             | 100%             | 11.32%           | 8.80%            | 1.74%           | 0.05%               | 0.73%              | 2.52%

Table 1: Example of a Global Equity Portfolio

Portfolio and Benchmark Weight and Return

The performance return of this portfolio is 11.32% (let’s say this is the 1 year performance), and the performance return of the benchmark is 8.80%. So the relative performance between the portfolio and the benchmark is 11.32% - 8.80% = 2.52%.

The portfolio contains equities from the US, UK, Europe (excluding UK), Japan, Asia Pacific (excluding Japan) and other regions (such as Africa or Latin America). In the second column we can see that US equity is 50% of the portfolio, UK is 10% of the portfolio, and so on. The portfolio also contains 2% cash.

The benchmark also contains US, UK, Europe, Japan, Apac and other, but the percentages (the “weights”) are different. US occupies 44% of the benchmark, UK 15%, Europe 14% and so on. The benchmark does not contain any cash. We can see this in the third column.

In the fourth column is the portfolio return. We can see that the US equity grew by 15%, UK by 10%, Europe by 5%, and so on. And the total at the bottom is 11.32%. This is the growth of the portfolio as a whole.

In the fifth column we can see the performance return of the benchmark. The US equity grew by 12%, UK equity by 8%, Europe by 3% and so on. And at the bottom we can see that the benchmark as a whole grew by 8.8%.

The job of the fund manager is to invest in the regions which will have the most growth, in this example the US, UK and Japan.

Performance Attribution

We are now ready to discuss the last 4 columns: the performance attribution columns. In the last column we can see what makes up the 2.52% difference between the portfolio return and the benchmark return, i.e. 1.69% is because of US equity, 0.24% is because of UK equity, 0.13% is because of European equity, and so on.

Region       | Portfolio Weight | Benchmark Weight | Portfolio Return | Benchmark Return | Stock Selection | Regional Allocation | Interaction Factor | Total
US           | 50%              | 44%              | 15%              | 12%              | 1.32%           | 0.19%               | 0.18%              | 1.69%
UK           | 10%              | 15%              | 10%              | 8%               | 0.30%           | 0.04%               | -0.10%             | 0.24%
Europe       | 18%              | 14%              | 5%               | 3%               | 0.28%           | -0.23%              | 0.08%              | 0.13%
Asia Pacific | 3%               | 9%               | 2%               | 4%               | -0.18%          | 0.29%               | 0.12%              | 0.23%
Japan        | 14%              | 8%               | 12%              | 8%               | 0.32%           | -0.05%              | 0.24%              | 0.51%
Other        | 3%               | 10%              | 6%               | 9%               | -0.30%          | -0.01%              | 0.21%              | -0.10%
Cash         | 2%               | 0%               | 0%               | 0%               | 0.00%           | -0.18%              | 0.00%              | -0.18%
Total        | 100%             | 100%             | 11.32%           | 8.80%            | 1.74%           | 0.05%               | 0.73%              | 2.52%

Table 2: Performance Attribution – Regional Level

But not only that, we can also break down this 2.52% outperformance into:

  1. How much of it is because of the allocation into regions
  2. How much of it is because of the stocks/shares that were chosen within each region
  3. And how much of it is because of the combination of #1 and #2 above

In the above case, we can see these 3 numbers down at the bottom of the table:

  1. 0.05% is because of the allocation into regions.
    For example, the PM (portfolio manager) chose to allocate 50% into the US, which enjoyed a high growth (15%).
  2. 1.74% is because of the stocks/shares that were chosen within each region.
    For example, the US stocks rose by 15% whereas the benchmark rose by only 12%. If we use the benchmark weighting (44%) for both the portfolio and the benchmark, we get:
    Performance of the US equity in the portfolio = 44% x 15% = 6.6%
    Performance of the US equity in the benchmark = 44% x 12% = 5.28%
    So the portfolio is outperforming the benchmark by 1.32%, as we can see in the table above (the “Stock Selection” factor for the US).
  3. 0.73% is because of the combination of the stock selection factor and the regional allocation factor.

For a credit portfolio we have one more factor that we can calculate: the yield curve. This is about how much performance is caused by allocating the portfolio into different maturity buckets, e.g. 1 year, 3 years, 5 years, 10 years, 20 years, etc. Generally speaking, in a normal market condition, the longer the maturity the higher the performance.

Last but not least, each of the regional attributions (1.69% for US equity, for example) can be allocated into stock selection, regional allocation and interaction.

All this is incredibly useful for portfolio managers, so that they know whether their allocation is correct or not, and whether their stock selection is correct or not. Note that the term “stock selection” is not accurate for a credit portfolio; “security selection” is a better choice of words.

Sometimes Performance Attribution is also called Performance Contribution, i.e. how much each factor (or region) contributes to the overall performance.

Calculation

Finally we come to the core of this article: the calculation.

Region       | Portfolio Weight | Benchmark Weight | Portfolio Return | Benchmark Return | Stock Selection | Regional Allocation | Interaction Factor | Total Contribution
US           | PW               | BW               | PR               | BR               | SS              | RA                  | IF                 | TC
UK           |                  |                  |                  |                  |                 |                     |                    |
Europe       |                  |                  |                  |                  |                 |                     |                    |
Asia Pacific |                  |                  |                  |                  |                 |                     |                    |
Japan        |                  |                  |                  |                  |                 |                     |                    |
Other        |                  |                  |                  |                  |                 |                     |                    |
Cash         |                  |                  |                  |                  |                 |                     |                    |
Total        | 100%             | 100%             | Total PR         | Total BR         |                 |                     |                    | Total TC

Table 3: Performance Attribution Calculation

The calculations of the Stock Selection (SS), Regional Allocation (RA), Interaction Factor (IF) and Total Contribution (TC) columns are as follows (a SQL sketch of these formulas follows the list):

  1. SS = BW x (PR – BR)
  2. RA = (PW – BW) x (BR – Total BR)
  3. IF = (PW – BW) x (PR – BR)
  4. TC = SS + RA + IF
  5. Total PR = Sum (PW x PR)
  6. Total BR = Sum (BW x BR)
  7. Total TC = Sum (TC) = Total PR – Total BR
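As a minimal sketch of these formulas in SQL, assuming a table dbo.RegionWeightReturn holding the four input columns per region (the table and column names are illustrative):

WITH Totals AS
( SELECT SUM(BW * BR) AS TotalBR
  FROM dbo.RegionWeightReturn
)
SELECT r.Region,
       r.BW * (r.PR - r.BR)               AS StockSelection,     -- SS = BW x (PR - BR)
       (r.PW - r.BW) * (r.BR - t.TotalBR) AS RegionalAllocation,  -- RA = (PW - BW) x (BR - Total BR)
       (r.PW - r.BW) * (r.PR - r.BR)      AS InteractionFactor,   -- IF = (PW - BW) x (PR - BR)
       r.BW * (r.PR - r.BR)
         + (r.PW - r.BW) * (r.BR - t.TotalBR)
         + (r.PW - r.BW) * (r.PR - r.BR)  AS TotalContribution    -- TC = SS + RA + IF
FROM dbo.RegionWeightReturn r
CROSS JOIN Totals t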

Logic Behind The Calculations

Now let’s try to understand why the formulas for the Stock Selection (SS), Regional Allocation (RA), Interaction Factor (IF) and Total Contribution (TC) columns are as above.

1. Stock Selection

Stock Selection is how different the US stocks/securities in the portfolio perform, compared to the US stocks in the benchmark. So to understand that we use the same weighting for both the portfolio and the benchmark.

In the above case, the US stock has a weight of 50% in the portfolio and 44% in the benchmark. It has 15% performance in the portfolio and 12% in the benchmark. So the performance attribution of the US stock in the portfolio is 50% x 15% = 7.50%. And the performance attribution of the US stock in the benchmark is 44% x 12% = 5.28%.

But to understand the effect of just stock selection, we have to use the same allocation. So let us use 44% for both of them. The performance attribution of the US stock in the portfolio is now 44% x 15% = 6.6%. And the performance of the US stock in the benchmark is still the same as above, which is 5.28%.

Because we use the same allocation (44%) for both the portfolio and the benchmark, the effect we are observing now is purely because of stock selection. The difference between 6.6% and 5.28% is purely because of stock selection. 6.6% minus 5.28% = 1.32% – that is the effect of stock selection.

That is why SS = BW x (PR – BR). It is the difference in return, when the weight is kept constant.

2. Regional Allocation

So how do we find out what is the effect of regional allocation? By keeping the return constant.

So something like this:
RA = (PW – BW) x BR

In the above, we kept the return constant (BR) and so we can understand the effect of the weighting.

Putting in the numbers for the US stocks:
RA for US = (50% – 44% ) x 12% = 6% x 12% = 0.72%

But 12% is not quite right. Because 12% is the return of the US stock. What we need is how well the US stocks perform in comparison to the other regions. Well, as a whole, the benchmark returned 8.8%. So the US is returning 3.2% above the average. That is the figure we need. So the formula becomes:

RA = (PW – BW) x (BR – Total BR) = (50% – 44%) x (12% – 8.8%) = 6% x 3.2% = 0.192%.

This is fairer. The effect of allocating more to the US is not 0.72% but only 0.192%, because it is not out of 12% (the return of US stocks) but out of 3.2% (the outperformance of the US stock compared to the other regions).

3. Interaction Factor

To calculate the Interaction Factor we must not keep the weight constant, and we must not keep the return constant either. We should use the actual differences in weight and return.

The difference in weight for the US stocks is 50% – 44% = 6%
The difference in return for the US stocks is 15% – 12% = 3%

If we multiply them we get the effect caused by the differences of both the weight and the return.

Hence the formula for the Interaction Factor is (PW – BW) x (PR – BR).

4. Total Contribution

Total Contribution is the sum of the 3 factors above, i.e.

TC = SS + RA + IF

Country Level

We can use this method to understand the effect of allocation not only at region level, but also at country level. So in the above, how each country in Europe contributed to the total European return.

For example, it’s like this:

Country      | Port Weight | PW in Europe | Bench Weight | BW in Europe | Port Return | Bench Return | Stock Selection | Country Allocation | Interaction Factor | Total Contribution
France       | 4%          | 22.2%        | 2%           | 14.3%        | 4%          | 3%           | 0.14%           | 0.00%              | 0.08%              | 0.22%
Germany      | 2%          | 11.1%        | 1%           | 7.1%         | 8%          | 2%           | 0.43%           | -0.04%             | 0.24%              | 0.63%
Italy        | 1%          | 5.6%         | 2%           | 14.3%        | 4%          | 4%           | 0.00%           | -0.09%             | 0.00%              | -0.09%
Ireland      | 1%          | 5.6%         | 3%           | 21.4%        | 5%          | 4%           | 0.21%           | -0.16%             | -0.16%             | -0.10%
Netherland   | 2%          | 11.1%        | 1%           | 7.1%         | 4%          | 2%           | 0.14%           | -0.04%             | 0.08%              | 0.18%
Norway       | 5%          | 27.8%        | 3%           | 21.4%        | 7%          | 2%           | 1.07%           | -0.06%             | 0.32%              | 1.33%
Spain        | 3%          | 16.7%        | 2%           | 14.3%        | 2%          | 3%           | -0.14%          | 0.00%              | -0.02%             | -0.17%
Total Europe | 18%         | 100%         | 14%          | 100%         | 5.00%       | 3.00%        | 1.86%           | -0.39%             | 0.53%              | 2.00%

Table 4: Performance Attribution – Country Level

Here we break down the 2% outperformance in European stocks (5% in the portfolio vs 3% in the benchmark) into 0.22% from France, 0.63% from Germany, -0.09% from Italy, and so on.

And within the 0.22% for France we further break it down into 0.14% from Stock Selection, 0.00% from Country Allocation and 0.08% from Interaction Factor.

The key to breaking down the outperformance from Region level to Country level is to compute the Portfolio Weight in Europe and Benchmark Weight in Europe columns. These 2 columns are calculated by dividing the Portfolio Weight (4% for France, 2% for Germany, and so on) by the total Europe weight (18%). So France’s portfolio weight in Europe is 4% / 18% = 22.2%. Germany’s portfolio weight in Europe is 2% / 18% = 11.1%. Then we do the same thing for the Benchmark Weight in Europe (the BW in Europe column).
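A small SQL sketch of that re-weighting step, assuming a table of European country weights (the names are illustrative):

SELECT Country,
       PW,
       PW / SUM(PW) OVER () AS PWInEurope,  -- e.g. 4% / 18% = 22.2% for France
       BW,
       BW / SUM(BW) OVER () AS BWInEurope
FROM dbo.EuropeCountryWeightReturn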

The question remains as to why at the country level the total Stock Selection for Europe is 1.86%, whereas at the region level the Stock Selection for Europe is 0.28% (see tables 5 and 6 below).

Region       | Portfolio Weight | Benchmark Weight | Portfolio Return | Benchmark Return | Stock Selection | Regional Allocation | Interaction Factor | Total Contribution
US           |                  |                  |                  |                  |                 |                     |                    |
UK           |                  |                  |                  |                  |                 |                     |                    |
Europe       | 18%              | 14%              | 5%               | 3%               | 0.28%           | -0.23%              | 0.08%              | 0.13%
Apac         |                  |                  |                  |                  |                 |                     |                    |
Japan        |                  |                  |                  |                  |                 |                     |                    |
Other        |                  |                  |                  |                  |                 |                     |                    |
Cash         |                  |                  |                  |                  |                 |                     |                    |
Total        | 100%             | 100%             | 11.32%           | 8.80%            | 1.74%           | 0.05%               | 0.73%              | 2.52%

Table 5: European Contribution = 0.13%

Country      | Portfolio Weight | Benchmark Weight | Portfolio Return | Benchmark Return | Stock Selection | Country Allocation | Interaction Factor | Total Contribution
France       |                  |                  |                  |                  |                 |                    |                    |
Germany      |                  |                  |                  |                  |                 |                    |                    |
Italy        |                  |                  |                  |                  |                 |                    |                    |
Ireland      |                  |                  |                  |                  |                 |                    |                    |
Netherland   |                  |                  |                  |                  |                 |                    |                    |
Norway       |                  |                  |                  |                  |                 |                    |                    |
Spain        |                  |                  |                  |                  |                 |                    |                    |
Total Europe | 18%              | 14%              | 5.00%            | 3.00%            | 1.86%           | -0.39%             | 0.53%              | 2.00%

Table 6: European Contribution = 2%

This is because when calculating the 0.28% we are trying to allocate the 2.52% to region level (table 5). In the context of the 2.52%, the European stocks get 0.13% and the European Stock Selection gets 0.28%. Whereas when calculating the 1.86% we are trying to allocate the 2% European outperformance to countries. In this context the total Stock Selection for Europe is 1.86%. So both are right, it’s just that they have different contexts. One is global, and the other is within Europe.

Sector, Currency, Asset Class

In addition to the performance attribution by region and country, we can also produce the performance attribution by industry sector, currency and asset class, like this:

Industry Sector   | Portfolio Weight | Benchmark Weight | Portfolio Return | Benchmark Return | Stock Selection | Sector Allocation | Interaction Factor | Total Contribution
Energy            | 3.7%             | 7.5%             | 3.2%             | 4.3%             | -0.08%          | 0.11%             | 0.04%              | 0.07%
Materials         | 5.3%             | 8.9%             | 4.7%             | 5.2%             | -0.04%          | 0.07%             | 0.02%              | 0.04%
Industrials       | 6.8%             | 9.3%             | 5.7%             | 6.1%             | -0.04%          | 0.03%             | 0.01%              | 0.00%
Consumer Discret. | 8.9%             | 9.3%             | 9.4%             | 7.3%             | 0.20%           | 0.00%             | -0.01%             | 0.19%
Consumer Staples  | 14.5%            | 8.8%             | 11.3%            | 9.7%             | 0.14%           | 0.15%             | 0.09%              | 0.38%
Health Care       | 11.6%            | 9.4%             | 7.1%             | 7.7%             | -0.06%          | 0.01%             | -0.01%             | -0.06%
Financials        | 7.7%             | 13.6%            | 4.3%             | 5.7%             | -0.19%          | 0.09%             | 0.08%              | -0.02%
IT                | 8.6%             | 11.2%            | 10.6%            | 8.2%             | 0.27%           | -0.03%            | -0.06%             | 0.18%
Communication     | 6.8%             | 8.1%             | 9.3%             | 10.3%            | -0.08%          | -0.04%            | 0.01%              | -0.11%
Utilities         | 13.8%            | 7.2%             | 8.9%             | 6.7%             | 0.16%           | -0.03%            | 0.15%              | 0.27%
Real Estate       | 10.3%            | 6.7%             | 9.4%             | 7.9%             | 0.10%           | 0.03%             | 0.05%              | 0.18%
Cash              | 1.9%             | 0.0%             | 0.0%             | 0.0%             | 0.00%           | -0.14%            | 0.00%              | -0.14%
Total             | 100.0%           | 100.0%           | 8.13%            | 7.15%            | 0.37%           | 0.24%             | 0.37%              | 0.98%

And this:

Currency | Portfolio Weight | Benchmark Weight | Portfolio Return | Benchmark Return | Stock Selection | Currency Allocation | Interaction Factor | Total Contribution
USD      | 27.9%            | 24.3%            | 6.7%             | 5.6%             | 0.27%           | -0.04%              | 0.04%              | 0.26%
GBP      | 19.5%            | 15.6%            | 5.6%             | 5.3%             | 0.05%           | -0.06%              | 0.01%              | 0.00%
EUR      | 24.8%            | 28.4%            | 7.8%             | 6.2%             | 0.45%           | 0.02%               | -0.06%             | 0.42%
CAD      | 6.9%             | 8.8%             | 9.4%             | 8.9%             | 0.04%           | -0.04%              | -0.01%             | -0.01%
JPY      | 11.3%            | 13.6%            | 11.7%            | 9.4%             | 0.31%           | -0.06%              | -0.05%             | 0.20%
AUD      | 7.7%             | 9.3%             | 6.7%             | 8.5%             | -0.17%          | -0.03%              | 0.03%              | -0.17%
Other    | 1.9%             | 0.0%             | 0.0%             | 0.0%             | 0.00%           | -0.13%              | 0.00%              | -0.13%
Total    | 100.0%           | 100.0%           | 7.38%            | 6.80%            | 0.96%           | -0.34%              | -0.04%             | 0.58%

Knowing how the currency allocation affects the performance is useful for the people on the desk (the PM and the analysts). And so is knowing how the industry sector allocation affects the performance. It helps the desk know where their outperformance comes from.

And just as regions can be broken down further into countries, sectors can be broken down further into industries.

The key to being able to calculate all the above is: a) knowing the value of each position at the beginning and at the end of the period, and b) putting the country of risk, industry sector, currency and asset class on every position in the holdings and the benchmark. That’s what enables us to group each holding into regions/countries, industries/sectors, currencies and asset classes/sub classes, and tabulate the weights and performances for each section, for both the portfolio and the benchmark. And from there we can apply the above logic/method to calculate the performance attribution.

4 November 2018

Percentage of Total and Filter on 2 Columns

Filed under: Data Warehousing — Vincent Rainardi @ 8:02 pm

This query shows how to filter on 2 columns, add 2 measures using ADDCOLUMNS, and create a measure to calculate the percentage of total for a certain year and country.


DEFINE MEASURE financials[Sales%] =
SUM(financials[ Sales]) /
CALCULATE
( SUM(financials[ Sales]),
  ALLEXCEPT(financials, financials[Year], financials[Country])
) * 100

EVALUATE
ADDCOLUMNS
( SUMMARIZE
  ( FILTER
    ( financials,
      AND
      ( financials[Country] = "Canada",
        financials[Year] = 2014
      )
    ),
    financials[Month Name],
    financials[Month Number]
  ),
  "Sales", FORMAT(CALCULATE(SUM(financials[ Sales])), "#,###"),
  "Sales%", FORMAT(financials[Sales%],"##.##")
)
ORDER BY financials[Month Number]

Explanation:

  • The Sales% is defined as Sales divided by the sum of Sales for the year and country currently selected.
  • The financial table is filtered on Country = Canada and Year = 2014.
  • Then two columns are chosen from this table: Month Name and Month Number.
  • Then 2 measures are added: Sales and Sales% (which was defined above)
  • Finally it’s ordered on Month Number.

Note: we can’t filter on 3 columns using the AND function. The maximum number of arguments of the AND function is two. The AND operator (&&), however, can be chained to combine more than 2 conditions.

PS. Anyone know how to hide the Month Number column?

DAX Studio and DAX Guide

Filed under: Data Warehousing — Vincent Rainardi @ 7:36 pm

Thanks to Darren Gosbell and others, the Power BI, Power Pivot and SSAS Tabular communities now have a tool to test DAX. It is DAX Studio (http://daxstudio.org/), which is a superb tool. Not only does it enable us to test ideas flexibly, it also enables us to troubleshoot and optimise DAX queries.

With DAX Studio we can:

  • Connect to Power BI, a Tabular server and to Power Pivot and run DAX queries on them all.
  • Browse the model to understand the tables and columns
  • Run DMVs to get the measures, hierarchies, dimensions
  • Understand the query plans, and the Storage Engine and Formula Engine timings.
  • Format our DAX queries
  • Capture DAX queries that we run in Power BI (and therefore allowing us to improve their performance)

It is an indispensable tool for anyone working in DAX. With Power BI being ever so popular, the demand for DAX is increasing rapidly. It was Darren Gosbell who primarily created DAX Studio, with Marco Russo writing a small part of it.

Marco and Alberto wrote the DAX Guide (https://dax.guide/). It is so comprehensive, so detailed, so complete. It contains every DAX function in Power BI, Excel, Tabular, Azure AS and SSDT. For every function it describes the syntax, the return value, the parameters and the examples, and it describes them in detail. Just look at how it explains the CALCULATE function: the context transition is there, the filter is there, the order of evaluation is there, the modifiers are there. It’s so complete.

Thank you Darren, Marco & Alberto, for creating these two very useful tools for all of us.

PS. All other tools related to DAX, Tabular and indeed other BI tasks are listed here: https://www.sqlbi.com/tools/, e.g. VertiPaq Analyzer, Tabular Editor, BISM Normalizer, DAX Editor, Power Pivot Utilities, DAX Date Template, SSAS Tabular Translator, etc. Note: these tools are written by many people.

Correction, 5/11: DAX Studio was written mostly by Darren Gosbell: http://geekswithblogs.net/darrengosbell/Default.aspx, not Marco or Alberto. My apologies for this mistake. Thanks Marco for your correction.
