I write this article specifically to answer this question:
Do column store indices improve query performance in data warehousing?
So I’m not looking at data loading performance, and I’m looking specifically data warehousing, i.e. querying fact tables. From past experience I know that the answer is yes if the fact tables are large and the query is selective. Otherwise the answer is no. But I don’t have the numbers. So with this article I would like to measure it. Thank you to Nadir Doctor for suggesting this topic.
Experiment Design
I will first query a small fact table (50k rows) which is indexed the normal way, i.e. clustered on PK, non-clustered on SKs. For this I will use the internet sales fact table on Adventure DW. I’ll execute two queries. The first one will select all rows, and the second one will select only 0.1% of the rows.
I will then create column store indices on this fact table, and execute the same two queries. I expect that there will be no performance improvement because the fact table is small.
Then I’ll add more rows to the internet sales fact table so that it becomes 10 million rows and execute the same queries. I expect there won’t be performance improvement on the select all, but a significant improvement on the selective query.
To summarise:
Note: The reason I put the “select all” in there because that’s how OLAP tools works, e.g. Power BI, SSAS, Qlikview. They select all rows from the fact table. Where reporting tool such as SSRS works on selective query (Power BI can be selective too).
SQL Queries and DDL
The query I use is as follows:
select F.SalesOrderNumber, F.SalesOrderLineNumber, DC1.FirstName + ' ' + DC1.LastName as CustomerName, DP.ProductAlternateKey, DP.ModelName, DC2.CurrencyAlternateKey as Currency, F.UnitPrice, F.OrderQuantity, F.SalesAmount from FactInternetSales F left join DimCustomer DC1 on F.CustomerKey = DC1.CustomerKey left join DimProduct DP on F.ProductKey = DP.ProductKey left join DimSalesTerritory DST on F.SalesTerritoryKey = DST.SalesTerritoryKey left join DimCurrency DC2 on F.CurrencyKey = DC2.CurrencyKey where DST.SalesTerritoryCountry = 'United Kingdom' and F.OrderDateKey/100 = 201401
For select all rows I use the same query but without the where clause.
The DDL for create the indexes are as follows:
create clustered columnstore index PK_FactInternetSales_SalesOrderNumber_SalesOrderLineNumber on dbo.FactInternetSales create clustered index PK_FactInternetSales_SalesOrderNumber_SalesOrderLineNumber on dbo.FactInternetSales(SalesOrderNumber, SalesOrderLineNumber) create nonclustered index FactInternetSales_CustomerKey on dbo.FactInternetSales(CustomerKey) create nonclustered index FactInternetSales_ProductKey on dbo.FactInternetSales(ProductKey) create nonclustered index FactInternetSales_SalesTerritoryKey on dbo.FactInternetSales(SalesTerritoryKey) create nonclustered index FactInternetSales_CurrencyKey on dbo.FactInternetSales(CurrencyKey)
Results
Query performance numbers are in milliseconds
So as expected, the column store indices only improve the query performance if the table is large and the query is selective.
Other Considerations
Apart of the query performance, of course we also have to consider the maintenance, i.e. how much effort we need to spend to keep the column store indices up-to-date. Is it a lot more than row store indices? (that is beyond the remit of this article)
References: