Data Platform and Data Science

31 January 2019

Column Store Index

Filed under: Data Warehousing — Vincent Rainardi @ 8:57 am

I write this article specifically to answer this question:
Do column store indices improve query performance in data warehousing?

So I’m not looking at data loading performance, and I’m looking specifically data warehousing, i.e. querying fact tables. From past experience I know that the answer is yes if the fact tables are large and the query is selective. Otherwise the answer is no. But I don’t have the numbers. So with this article I would like to measure it. Thank you to Nadir Doctor for suggesting this topic.

Experiment Design

I will first query a small fact table (50k rows) which is indexed the normal way, i.e. clustered on PK, non-clustered on SKs. For this I will use the internet sales fact table on Adventure DW. I’ll execute two queries. The first one will select all rows, and the second one will select only 0.1% of the rows.

I will then create column store indices on this fact table, and execute the same two queries. I expect that there will be no performance improvement because the fact table is small.

Then I’ll add more rows to the internet sales fact table so that it becomes 10 million rows and execute the same queries. I expect there won’t be performance improvement on the select all, but a significant improvement on the selective query.

To summarise:
Experiment Design.JPG

Note: The reason I put the “select all” in there because that’s how OLAP tools works, e.g. Power BI, SSAS, Qlikview. They select all rows from the fact table. Where reporting tool such as SSRS works on selective query (Power BI can be selective too).

SQL Queries and DDL

The query I use is as follows:

select F.SalesOrderNumber, F.SalesOrderLineNumber, DC1.FirstName + ' ' + DC1.LastName as CustomerName, DP.ProductAlternateKey, DP.ModelName, DC2.CurrencyAlternateKey as Currency, F.UnitPrice, F.OrderQuantity, F.SalesAmount
from FactInternetSales F
left join DimCustomer DC1 on F.CustomerKey = DC1.CustomerKey
left join DimProduct DP on F.ProductKey = DP.ProductKey
left join DimSalesTerritory DST on F.SalesTerritoryKey = DST.SalesTerritoryKey
left join DimCurrency DC2 on F.CurrencyKey = DC2.CurrencyKey
where DST.SalesTerritoryCountry = 'United Kingdom' and F.OrderDateKey/100 = 201401

For select all rows I use the same query but without the where clause.

The DDL for create the indexes are as follows:

create clustered columnstore index PK_FactInternetSales_SalesOrderNumber_SalesOrderLineNumber on dbo.FactInternetSales
create clustered index PK_FactInternetSales_SalesOrderNumber_SalesOrderLineNumber on dbo.FactInternetSales(SalesOrderNumber, SalesOrderLineNumber)
create nonclustered index FactInternetSales_CustomerKey on dbo.FactInternetSales(CustomerKey)
create nonclustered index FactInternetSales_ProductKey on dbo.FactInternetSales(ProductKey)
create nonclustered index FactInternetSales_SalesTerritoryKey on dbo.FactInternetSales(SalesTerritoryKey)
create nonclustered index FactInternetSales_CurrencyKey on dbo.FactInternetSales(CurrencyKey)

Results

results
Query performance numbers are in milliseconds

So as expected, the column store indices only improve the query performance if the table is large and the query is selective.

Other Considerations

Apart of the query performance, of course we also have to consider the maintenance, i.e. how much effort we need to spend to keep the column store indices up-to-date. Is it a lot more than row store indices? (that is beyond the remit of this article)

References:

  1. Column Store Index – Design Guidance: link
  2. Column Store Index – Data Warehouse: link
  3. Niko Neugebauer on Column Store: link

28 January 2019

SSIS: Numeric Column Loaded as NULL

Filed under: Data Warehousing — Vincent Rainardi @ 6:08 am

Why SSIS loads a numeric column in Excel as NULL?
(only applies to Excel 97-2003 files, does not apply to Excel 2013 files)

What Happened

I have an Excel 97-2003 file containing a numeric column. The first 10 rows of this column are blank. There are numbers from row 11 onwards. When SSIS loads this file, all columns are imported correctly into the target table, except this one column which is imported as NULL.

Left: Excel 97-2003 source file, right: target table in SQL Server
excel 97-2003
Figure 1. Loading an Excel 97-2003 file

Why It Happened

Why Column2 is loaded as NULL, but Column3 is loaded ok?

This has nothing to do with SSIS. It is because of Microsoft Excel Driver. The Excel driver checks the first 8 rows of that column to find out the data type. Because they are blank, the Excel Driver (not SSIS) sets the expected data type to Text. And because the actual data is a Number, it is set as NULL.

How It Happened

Here is how it happened (the detailed mechanics).

When we setup the Excel Connection Manager in SSIS (see below), the Excel Driver installed in our PC does these steps:

excel source editor
Figure 2. Setting up Excel source

Step 1. The Excel Driver looks at the TypeGuessRows and ImportMixedTypes registry setting: HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Office\15.0\Access Connectivity Engine\Engines\Excel

registry settings
Figure 3. Registry Settings

Note: The branch in the registry is different for each Office version and 32/64 bit, see list here.

Step 2. The Excel Driver reads the first N rows from the Excel file to determine the data type, where N is the TypeGuessRows.

Note:

  • The default for TypeGuessRows is 8.
  • If the TypeGuessRows is set to 0, it reads the first 16384 rows (2^14).
  • To check all rows, the TypeGuessRows needs to be set to 1,048,576 (2^20), the maximum number of rows allowed in Excel.

Step 3. The Excel Driver sets the expected data type for that column as follows:

  • If the value of the ImportMixedTypes is “Text”, the expected data type is set based on the majority of the N rows where N the value of TypeGuessRows. There are 5 possible data types: Text, Number, Date, Currency, Boolean.
  • For example, if in the first 8 rows there are 4 dates, 3 texts, 2 number and 1 currency, the expected data type will be set to Date.
  • Another example: if in the first 8 rows there are 5 text and 3 numbers, the expected data type will be set to Text.
  • If there is a tie like (4 and 4) or (3,3,2) or (2,2,2,2) then the Number wins, followed by Currency, Date, Text and the last one is Boolean.
  • If the first 8 rows are all NULL the expected data type is set to the ImportMixedTypes.
  • In the above case, Column1 is set to Number (majority of 8), Column2 to Text (all NULL), Column3 to Number (majority of 1), Column4 to Text (all NULL).

Step 4. When we click the Preview button (see Appendix 1), the Excel Driver reads the first 200 rows from the Excel file and display the values on the screen. If the value from the file does not match the expected data type, the Excel Driver returns NULL to SSIS.
This is why in the above case Column2 is shown as NULL in the Preview screen.
The same happens during the actual load process.

Excel 2007 or Excel 2013

What happens if we import an Excel 2007/2013 file?

If we import an Excel 2007 or Excel 2013 the numeric column is loaded correctly, see Column2 below.

excel 2007 or 2013
Figure 4. Loading Excel 2007 or 2013

When loading an Excel 97-2003 file we get the same result as above if we set the Excel version on the Excel Connection Manager to “Microsoft Excel 2007”, like this:

set excel version
Figure 5. Changing the Excel Version

So when we face a problem with numeric column being null when loading from Excel 97-2003, we can set the Excel version to 2007 in Excel Connection Manager and SSIS will load it correctly.

 

27 January 2019

Starting a Post

Filed under: Data Warehousing — Vincent Rainardi @ 6:09 am

I usually start writing a post just like that. As if I was explaining it to someone in front of me.

But there is a better way to start a post, i.e. by formulating the thing you want to explain in the post into a question. This will grab the attention of the reader straight away.

For example, I am about to write an SSIS issue that happens when we import an Excel file containing a numeric column. The issue is: if the first 10 rows of that column is blank, Excel will import all rows of that column as blank.

I could start the usual way by explaining the issue (what happened), why it happened and what the solution is. I still think that is a good way to write it. But a better way to start the article is to formulate the issue into a question: Why SSIS imports a numeric column in Excel as null?

So that’s what I’m going to do in the next post.

5 January 2019

Asset Management Companies in the UK

Filed under: Data Warehousing — Vincent Rainardi @ 11:21 am

Having worked at several investment banks and asset management in the UK I was wondering which ones are the biggest asset managers in the UK. I don’t mean to list just the UK fund managers, but also US fund managers operating in the UK. So I searched in Wikipedia and Google, and the individual websites of each companies and wrote what I found below. Links to sources are on the AUM figures.

Largest asset management companies in the UK are (AUM $0.4tn or more, rounded to $0.1tn):
Note: only the asset management division/arm, not the whole group.

Smaller ones are (AUM under $0.4tn, rounded to $bn):

Those with AUM under $10bn are (please correct me if I’m wrong): Amati, Chelverton, City Financial, Downing, Hawksmoor, Premier, Miton, Neptune, Sanford DeLand, Wise. Those with AUM not known are: Cavendish, Evenlode, Unicorn.

Whereas the investment banks operating in London are (HQ mostly in Canary Wharf):

  • 1.     Barclays
  • 2.     BNP Paribas
  • 3.     Bank of America Merrill Lynch
  • 4.     Citigroup
  • 5.     Credit Suisse
  • 6.     Deutsche Bank
  • 7.     Goldman Sachs
  • 8.     HSBC
  • 9.     JP Morgan
  • 10.  Mizuho
  • 11.  Mitsubishi UFJ
  • 12.  Morgan Stanley
  • 13.  Nomura
  • 14.  UBS

 

Blog at WordPress.com.