January | 2019 | Data Platform and Data Science

31 January 2019

Column Store Index

Filed under: Data Warehousing — Vincent Rainardi @ 8:57 am

I write this article specifically to answer this question:
Do column store indices improve query performance in data warehousing?

So I’m not looking at data loading performance, and I’m looking specifically data warehousing, i.e. querying fact tables. From past experience I know that the answer is yes if the fact tables are large and the query is selective. Otherwise the answer is no. But I don’t have the numbers. So with this article I would like to measure it. Thank you to Nadir Doctor for suggesting this topic.

Experiment Design

I will first query a small fact table (50k rows) which is indexed the normal way, i.e. clustered on PK, non-clustered on SKs. For this I will use the internet sales fact table on Adventure DW. I’ll execute two queries. The first one will select all rows, and the second one will select only 0.1% of the rows.

I will then create column store indices on this fact table, and execute the same two queries. I expect that there will be no performance improvement because the fact table is small.

Then I’ll add more rows to the internet sales fact table so that it becomes 10 million rows and execute the same queries. I expect there won’t be performance improvement on the select all, but a significant improvement on the selective query.

To summarise:
Experiment Design.JPG

Note: The reason I put the “select all” in there because that’s how OLAP tools works, e.g. Power BI, SSAS, Qlikview. They select all rows from the fact table. Where reporting tool such as SSRS works on selective query (Power BI can be selective too).

SQL Queries and DDL

The query I use is as follows:

select F.SalesOrderNumber, F.SalesOrderLineNumber, DC1.FirstName + ' ' + DC1.LastName as CustomerName, DP.ProductAlternateKey, DP.ModelName, DC2.CurrencyAlternateKey as Currency, F.UnitPrice, F.OrderQuantity, F.SalesAmount
from FactInternetSales F
left join DimCustomer DC1 on F.CustomerKey = DC1.CustomerKey
left join DimProduct DP on F.ProductKey = DP.ProductKey
left join DimSalesTerritory DST on F.SalesTerritoryKey = DST.SalesTerritoryKey
left join DimCurrency DC2 on F.CurrencyKey = DC2.CurrencyKey
where DST.SalesTerritoryCountry = 'United Kingdom' and F.OrderDateKey/100 = 201401

For select all rows I use the same query but without the where clause.

The DDL for create the indexes are as follows:

create clustered columnstore index PK_FactInternetSales_SalesOrderNumber_SalesOrderLineNumber on dbo.FactInternetSales
create clustered index PK_FactInternetSales_SalesOrderNumber_SalesOrderLineNumber on dbo.FactInternetSales(SalesOrderNumber, SalesOrderLineNumber)
create nonclustered index FactInternetSales_CustomerKey on dbo.FactInternetSales(CustomerKey)
create nonclustered index FactInternetSales_ProductKey on dbo.FactInternetSales(ProductKey)
create nonclustered index FactInternetSales_SalesTerritoryKey on dbo.FactInternetSales(SalesTerritoryKey)
create nonclustered index FactInternetSales_CurrencyKey on dbo.FactInternetSales(CurrencyKey)

Results

results
Query performance numbers are in milliseconds

So as expected, the column store indices only improve the query performance if the table is large and the query is selective.

Other Considerations

Apart of the query performance, of course we also have to consider the maintenance, i.e. how much effort we need to spend to keep the column store indices up-to-date. Is it a lot more than row store indices? (that is beyond the remit of this article)

References:

Column Store Index – Design Guidance: link
Column Store Index – Data Warehouse: link
Niko Neugebauer on Column Store: link

Leave a Comment

28 January 2019

SSIS: Numeric Column Loaded as NULL

Filed under: Data Warehousing — Vincent Rainardi @ 6:08 am

Why SSIS loads a numeric column in Excel as NULL?
(only applies to Excel 97-2003 files, does not apply to Excel 2013 files)

What Happened

I have an Excel 97-2003 file containing a numeric column. The first 10 rows of this column are blank. There are numbers from row 11 onwards. When SSIS loads this file, all columns are imported correctly into the target table, except this one column which is imported as NULL.

Left: Excel 97-2003 source file, right: target table in SQL Server

Figure 1. Loading an Excel 97-2003 file

Why It Happened

Why Column2 is loaded as NULL, but Column3 is loaded ok?

This has nothing to do with SSIS. It is because of Microsoft Excel Driver. The Excel driver checks the first 8 rows of that column to find out the data type. Because they are blank, the Excel Driver (not SSIS) sets the expected data type to Text. And because the actual data is a Number, it is set as NULL.

How It Happened

Here is how it happened (the detailed mechanics).

When we setup the Excel Connection Manager in SSIS (see below), the Excel Driver installed in our PC does these steps:

Figure 2. Setting up Excel source

Step 1. The Excel Driver looks at the TypeGuessRows and ImportMixedTypes registry setting: HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Office\15.0\Access Connectivity Engine\Engines\Excel

Figure 3. Registry Settings

Note: The branch in the registry is different for each Office version and 32/64 bit, see list here.

Step 2. The Excel Driver reads the first N rows from the Excel file to determine the data type, where N is the TypeGuessRows.

Note:

The default for TypeGuessRows is 8.
If the TypeGuessRows is set to 0, it reads the first 16384 rows (2^14).
To check all rows, the TypeGuessRows needs to be set to 1,048,576 (2^20), the maximum number of rows allowed in Excel.

Step 3. The Excel Driver sets the expected data type for that column as follows:

If the value of the ImportMixedTypes is “Text”, the expected data type is set based on the majority of the N rows where N the value of TypeGuessRows. There are 5 possible data types: Text, Number, Date, Currency, Boolean.
For example, if in the first 8 rows there are 4 dates, 3 texts, 2 number and 1 currency, the expected data type will be set to Date.
Another example: if in the first 8 rows there are 5 text and 3 numbers, the expected data type will be set to Text.
If there is a tie like (4 and 4) or (3,3,2) or (2,2,2,2) then the Number wins, followed by Currency, Date, Text and the last one is Boolean.
If the first 8 rows are all NULL the expected data type is set to the ImportMixedTypes.
In the above case, Column1 is set to Number (majority of 8), Column2 to Text (all NULL), Column3 to Number (majority of 1), Column4 to Text (all NULL).

Step 4. When we click the Preview button (see Appendix 1), the Excel Driver reads the first 200 rows from the Excel file and display the values on the screen. If the value from the file does not match the expected data type, the Excel Driver returns NULL to SSIS.
This is why in the above case Column2 is shown as NULL in the Preview screen.
The same happens during the actual load process.

Excel 2007 or Excel 2013

What happens if we import an Excel 2007/2013 file?

If we import an Excel 2007 or Excel 2013 the numeric column is loaded correctly, see Column2 below.

Figure 4. Loading Excel 2007 or 2013

When loading an Excel 97-2003 file we get the same result as above if we set the Excel version on the Excel Connection Manager to “Microsoft Excel 2007”, like this:

Figure 5. Changing the Excel Version

So when we face a problem with numeric column being null when loading from Excel 97-2003, we can set the Excel version to 2007 in Excel Connection Manager and SSIS will load it correctly.

Leave a Comment

27 January 2019

Starting a Post

Filed under: Data Warehousing — Vincent Rainardi @ 6:09 am

I usually start writing a post just like that. As if I was explaining it to someone in front of me.

But there is a better way to start a post, i.e. by formulating the thing you want to explain in the post into a question. This will grab the attention of the reader straight away.

For example, I am about to write an SSIS issue that happens when we import an Excel file containing a numeric column. The issue is: if the first 10 rows of that column is blank, Excel will import all rows of that column as blank.

I could start the usual way by explaining the issue (what happened), why it happened and what the solution is. I still think that is a good way to write it. But a better way to start the article is to formulate the issue into a question: Why SSIS imports a numeric column in Excel as null?

So that’s what I’m going to do in the next post.

Leave a Comment

5 January 2019

Asset Management Companies in the UK

Filed under: Data Warehousing — Vincent Rainardi @ 11:21 am

Having worked at several investment banks and asset management in the UK I was wondering which ones are the biggest asset managers in the UK. I don’t mean to list just the UK fund managers, but also US fund managers operating in the UK. So I searched in Wikipedia and Google, and the individual websites of each companies and wrote what I found below. Links to sources are on the AUM figures.

Largest asset management companies in the UK are (AUM $0.4tn or more, rounded to $0.1tn):
Note: only the asset management division/arm, not the whole group.

1. Blackrock $6.3tn
2. Vanguard $5.3tn
3. Fidelity $2.5tn
4. Allianz $1.7tn
5. JP Morgan $1.7tn
6. L&G $1.2tn
7. Invesco $1.0tn
8. Axa $0.9tn
9. UBS AM $0.8tn
10. Insight $0.8tn

11. Legg Mason $0.8tn
12. Franklin Templeton $0.7tn
13. Pictet $0.5tn
14. Aberdeen Standard $0.7tn
15. Schroders $0.6tn
16. Columbia Threadneedle $0.5tn
17. HSBC GAM $0.5tn
18. Aviva $0.4tn
19. M&G Prudential $0.4tn
20. Janus Henderson $0.4tn

Smaller ones are (AUM under $0.4tn, rounded to $bn):

1. BMO GAM $260bn
2. Baillie Gifford $250bn
3. Investec $211bn
4. Lazard $211bn
5. First State $155bn
6. Royal London $151bn
7. Man GLG $114bn
8. Newton $68bn
9. Jupiter $61bn
10. Carmignac $57bn

11. Kames $54bn
12. Hermes $46bn
13. Jo Hambro $39bn
14. Artemis $38bn
15. Ruffer $27bn
16. Lindsell Train $21bn
17. Polar Capital $19bn
18. Twenty Four $18bn
19. Hargreaves Lansdown $12bn

Those with AUM under $10bn are (please correct me if I’m wrong): Amati, Chelverton, City Financial, Downing, Hawksmoor, Premier, Miton, Neptune, Sanford DeLand, Wise. Those with AUM not known are: Cavendish, Evenlode, Unicorn.

Whereas the investment banks operating in London are (HQ mostly in Canary Wharf):

1. Barclays
2. BNP Paribas
3. Bank of America Merrill Lynch
4. Citigroup
5. Credit Suisse
6. Deutsche Bank
7. Goldman Sachs

8. HSBC
9. JP Morgan
10. Mizuho
11. Mitsubishi UFJ
12. Morgan Stanley
13. Nomura
14. UBS

Leave a Comment