Data Platform and Data Science

25 May 2010

Optimising Cube Query Performance and Processing Performance

Filed under: Analysis Services — Vincent Rainardi @ 7:43 am

Background

Several times this question has cropped up: how do you optimise the query performance of a large cube? Eight times out of ten, when people ask this question, they mean “my users are complaining that their reports are slow”, the report being Excel, SSRS or PPS. “Slow” usually means it used to take 1 second but today it takes 1 minute.

The other 20% is genuine cube improvement: cube developers looking for ways to make their cubes better and faster. Usually they also seek ways to optimise the processing time (the cube refresh time).

In this article I’ll try to (briefly) explain where and what to look at, i.e. just the pointers. I’ll write down my experience and thoughts so readers get the high-level idea, and point to more complete sources for those who need more detailed information on the how.

Query Performance

I usually look/work in this order: MDX queries, aggregations, partitioning, server settings. First I’d look at the MDX queries to see whether one specific query (or a few) causes the issue. If so, I’d try to optimise that query. The two most frequent problems are: 1) cell-by-cell instead of block computation, and 2) attributes from the same dimension not being “next to each other”, causing the number of tuples to exceed 4.2 billion (read Mosha’s blog on this, and mine). We can capture the MDX using Profiler. No, OlapQueryLog doesn’t give us the MDX (its 101010… column corresponds to the attribute order).
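
As an illustration, this is roughly the kind of query I run against the query log (assuming the query log is enabled and written to dbo.OlapQueryLog with its default columns) to see which subcubes are hit most often; the Dataset column is that 101010… attribute vector, not the MDX text:

SELECT TOP 20
    MSOLAP_ObjectPath,            -- database / cube / measure group hit
    Dataset,                      -- the 101010... attribute vector
    COUNT(*)      AS Hits,
    AVG(Duration) AS AvgDuration  -- in ms, if I remember correctly
FROM dbo.OlapQueryLog
GROUP BY MSOLAP_ObjectPath, Dataset
ORDER BY COUNT(*) DESC;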

If the MDX queries are OK, I’ll look at the aggregations. If it’s 2008, design the aggregations for each measure group (MG): per partition if possible, and usage based if possible. The third thing to look at is partitioning. I’ll try to either split the MG so that the majority of queries hit a small partition (the current month, for example), or split the MG so that the queries are distributed evenly across many partitions (by product, for example), or do both.
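
For illustration, a monthly partition is typically bound to a source query along these lines (the table and column names here are made up); the point is that each partition only reads its own slice of the fact table:

-- source query bound to the May 2010 partition of the measure group
SELECT *
FROM dbo.FactSales
WHERE DateKey BETWEEN 20100501 AND 20100531;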

Last is server settings. There are three major areas: memory, disk/IO and CPU. If the AS server is on the same server as SQL Server, look at the memory counters. Limit the SQL Server memory if possible; if not, move AS out to another server, dedicated if possible. Also look at the upper/lower memory limit settings, processing vs query memory, and the Windows vs SSAS heap (look under Memory Heap Type in the 2008 performance guide).
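
For example, capping the relational engine’s memory on a shared box is a one-liner in T-SQL (the 16 GB figure below is purely illustrative, not a recommendation):

EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'max server memory (MB)', 16384;   -- leave the rest for SSAS and the OS
RECONFIGURE;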

For IO we can look at Threads\Query\MaxThreads (refer to the 2008 performance guide for the settings), ThreadPool\QueryPoolJobQueueLength, IdleThreads and BusyThreads. When configuring a new AS server, it’s better to choose RAID 1+0 over RAID 5 or RAID 1; AS data files are not as big as SQL data files. If a big query blocks many small queries, set the Coordinator Query Balancing Factor (and Boost Priority). If there is a ROLAP dimension or MG, look at the underlying table/view.

If you want to be systematic, first determine whether the problem is in the Storage Engine (SE) or the Query Processor (QP). We can do this using event subclass 22 in Profiler. If it’s an SE issue, look at: aggregations, partitions and dimensions (hierarchies and attribute relationships). If it’s a QP issue, look at: MDX queries, cache warming, and server settings, e.g. IO and memory. Cache warming: create the cache or run the query beforehand.

Processing Performance

First look at the processing output window to see whether the issue is in a dimension or a measure group. If it’s a dimension issue, consider: creating a proper dimension table (instead of select distinct from the transaction/fact table), using process default, and a materialised/indexed view (to provide the distinct values).
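
As a rough sketch (table and column names are assumed), an indexed view that serves the distinct values to dimension processing looks like this; note that SQL Server requires SCHEMABINDING and COUNT_BIG(*) here:

CREATE VIEW dbo.vDimProductSource
WITH SCHEMABINDING
AS
SELECT ProductCode, COUNT_BIG(*) AS RowCnt   -- COUNT_BIG is mandatory for an indexed view with GROUP BY
FROM dbo.FactSales
GROUP BY ProductCode;
GO
-- the unique clustered index is what materialises the view
CREATE UNIQUE CLUSTERED INDEX IX_vDimProductSource ON dbo.vDimProductSource (ProductCode);
GO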

If it’s an MG issue, look at: partitioning the MG (so simultaneous reads happen), physical table partitioning (aligned to the MG partitions), incremental processing (only process the partitions that have changed; this needs to be flagged in the ETL, as in the sketch below), and aggregations (are they required? Compare with usage, using Profiler with the output logged to a table).
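
On the incremental processing point, the ETL needs some way of flagging which months it touched. One simple (purely hypothetical) way is to derive the changed months from a load timestamp audit column on the fact table and process only those MG partitions:

-- which monthly partitions did last night's load touch?
-- (LoadTimestamp is an assumed ETL audit column)
SELECT DISTINCT DateKey / 100 AS MonthKey   -- yyyymm from a yyyymmdd date key
FROM dbo.FactSales
WHERE LoadTimestamp >= DATEADD(day, -1, GETDATE());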

If you are building cubes from operational systems, a materialised view can improve performance hundreds of times. The issue is that there are so many restrictions; we can’t use many constructs when writing the select statement. And they may not allow us to create the view on the operational system’s database anyway.

Aligning the table partitions to the MG partitions means: if the MG is partitioned by month, then the table should be partitioned by month too. This way, when the MG is being processed, all MG partitions read simultaneously from different table partitions, avoiding IO contention. There’s no point partitioning the table if all MG partitions will be hitting the same table partition; it will only bring the SQL Server to its knees.
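
For illustration, aligning on month means the relational side looks roughly like this (the boundary values, filegroup mapping and names are all illustrative):

CREATE PARTITION FUNCTION pfMonth (int)
    AS RANGE RIGHT FOR VALUES (20100101, 20100201, 20100301, 20100401, 20100501);

CREATE PARTITION SCHEME psMonth
    AS PARTITION pfMonth ALL TO ([PRIMARY]);   -- map to separate filegroups in real life

CREATE TABLE dbo.FactSales
(
    DateKey     int   NOT NULL,
    ProductKey  int   NOT NULL,
    SalesAmount money NOT NULL
) ON psMonth (DateKey);   -- one table partition per month, matching the MG partitions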

——————————–

Apologies, I wrote this in a hurry on my way to work, hence the somewhat unedited, rushed writing style. But I do hope it’s helpful for some. If you spot anything inaccurate, I’d be grateful if you’d point it out. There are many SSAS experts (see my who’s who list) who may have different opinions; I’d be more than happy to listen to their advice.

19 May 2010

Data Warehousing and Business Intelligence

Filed under: Analysis Services — Vincent Rainardi @ 5:59 pm

But you have written about that in the past, in SQL Server Central, you might say. I know, I know. I even spent a whole month writing chapter one of my book, whose main purpose is to explain what a data warehouse is! But I’ve moved on and have more experience, so may I have a different view? This time I’m not going to dive into the theoretical aspects, but only offer you simpler definitions:

Data Warehousing: data modelling, database design, ETL

BI: reports, cubes, data mining

I finally came to these simple definitions because I frequently have to explain what data warehousing is and what BI is, and most people only have 30 seconds to hear the answer. Who knows, the short definitions above might be useful the next time somebody asks you that question!

17 May 2010

Storing percentage measures in the fact table

Filed under: Data Warehousing — Vincent Rainardi @ 7:44 am

A few weeks ago somebody asked me at SQL Server Central how he should store percentages in the fact table, and how he could aggregate them up.

A percentage measure (or ratio) needs to be stored as two columns in the fact table: one stores the numerator and the other the denominator. The BI application (report or cube) then calculates the percentage or ratio on the fly. It is necessary to store it as two columns in order to be able to aggregate it up.

For example, say you have a measure called Ratio1, defined as A divided by B. In the fact table you create two measure columns: A and B. Then in the report or cube we create a measure called Ratio1, defined as sum(A) divided by sum(B). This way, it doesn’t matter at what level* this measure is used, the ratio will always be correct.

*level: I’m referring to the hierarchy in the dimension table.

This is what we store in the fact table: (without the “Total” row of course)

This is what we store in the dim1 table:

And this is what the report (or cube) displays on the screen:

Notice that 44% is not the sum (or the average) of 67%, 20% and 56%; it is calculated as (the sum of A) divided by (the sum of B) at that level.
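
To make it concrete, here is the same pattern in SQL. The A and B values below are made up purely so that the row ratios come out as 67%, 20% and 56% and the grand total as 44%:

CREATE TABLE dbo.FactRatio
(
    Dim1Key int   NOT NULL,
    A       money NOT NULL,   -- numerator
    B       money NOT NULL    -- denominator
);

INSERT INTO dbo.FactRatio (Dim1Key, A, B)
VALUES (1, 4, 6),    -- 67%
       (2, 2, 10),   -- 20%
       (3, 5, 9);    -- 56%

-- Ratio1 is always sum(A) / sum(B), never a sum or average of percentages
SELECT Dim1Key, SUM(A) * 100.0 / SUM(B) AS Ratio1Pct
FROM dbo.FactRatio
GROUP BY Dim1Key;

SELECT SUM(A) * 100.0 / SUM(B) AS Ratio1Pct   -- 44% at the grand total level
FROM dbo.FactRatio;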

So by storing the numerator and denominator of the percentage measure in the fact table, we can aggregate it correctly at any level in the report or cube.

9 May 2010

Primary Key In Accumulating Snapshot Fact Tables

Filed under: Data Warehousing — Vincent Rainardi @ 11:29 pm

In data warehousing it is common to have a fact table whose grain is the same as the grain of a dimension. That dimension is usually the main dimension, i.e. in CRM: customer; in manufacturing: product; in health care: patient; in insurance: policy; and in banking: account. For example, in retail we might have a Fact Customer Purchasing Behaviour table. The grain of this fact table is one row for each customer, and it contains the amount of money each customer spent on each of the 7 major product classes in the last 12 months.

Another example is the workflow of the mortgage approval process in retail banking. Grain: one row for each account. Measures: the number of working days spent in each of the 5 stages of mortgage approval, from the date the customer applied until the date the agreement is signed (or until the customer gets the money): application days, questionnaire days, verification days, approval days and signing days.

These fact tables are usually accumulating snapshots. In maintaining such a table we issue more SQL update statements than insert statements. Usually these fact tables also have other dimension keys; in the retail banking example above: the mortgage officer dimension, the branch dimension and the mortgage (type) dimension.
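
As a sketch (the table and column names are assumed), the day-to-day maintenance of such a fact table is mostly updates like this, touching the existing row for the account rather than inserting a new one:

UPDATE f
SET    f.VerificationDays = s.VerificationDays   -- the stage that has just completed
FROM   dbo.FactMortgageApproval f
JOIN   stage.MortgageStageChange s ON s.AccountKey = f.AccountKey;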

Right, that’s the background; now the core of this article: the Primary Key. We have two questions:

  1. In the fact table, should we make the main dimension key (like the account key in the mortgage example above) the Primary Key?
  2. Should we make the account key in the fact table a Foreign Key to the account dimension?

As usual, we should not focus on what’s right and what’s wrong, but on: in which cases should I make it a PK, and in which cases should I not?

Should we make the main dimension key the primary key of the fact table?

Before we start discussing this, I’d like to mention that if you need background on the subject of “Primary Key on the Fact Table”, you can read this article: Primary Key on the Fact Table.

Whether we should make the main dimension key the PK of the fact table depends on two things:

  1. Is the grain of the fact table really the same as the grain of the main dimension?
  2. Whether we plan to have a surrogate fact key.

The first factor seems obvious, but we really should examine it carefully. I have found several cases where at first sight the grain seemed the same, but on closer examination the grain of the fact table turned out to be more detailed than the main dimension.

It’s easier to explain with an example. Take the mortgage status fact table example above. The grain of this fact table is (supposedly) one row for each mortgage account; we are not supposed to have an account with two (or more) rows in this fact table. But we also have other dimension keys in this fact table: the branch key, mortgage type key and mortgage officer key. What if it is possible for a mortgage to have more than one officer? What if a mortgage was transferred from one branch to another, e.g. in 2008 it was in branch A but in 2009 it was in branch B? What if a mortgage changed its type, say from “interest only” to “repayment”, or from “fixed rate” to “tracker”? In these cases, one mortgage account could have two rows in the fact table.
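
A quick way to test the assumed grain (names are illustrative) is simply to look for accounts with more than one row:

-- if this returns anything, the fact table is more granular than one row per account
SELECT AccountKey, COUNT(*) AS RowsPerAccount
FROM dbo.FactMortgageApproval
GROUP BY AccountKey
HAVING COUNT(*) > 1;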

In these cases, we have three options:

  1. Maintain the principle that a mortgage can only have one row in this fact table,
  2. Allow a mortgage to have several rows in this fact table, or
  3. Take the offending dimension key out.

If we stick with the principle that a mortgage can only have one row in this fact table, then we restrict the other dimension keys. In the above example the other dimension keys are the branch key, mortgage type key and mortgage officer key. We modify them as follows:

  • Mortgage officer key: change to “main mortgage officer key”, which specifies the main* mortgage officer dealing with the customer for this mortgage. *see below.
  • Branch key: change to “current branch key”, which is the latest branch for this mortgage. If a mortgage was moved from branch A to branch B, then this column contains the key to branch B (see the sketch after this list).
  • Mortgage type key: change to “current mortgage type key”, which is the latest mortgage type for this mortgage. If a mortgage was changed from “fixed rate” to “tracker”, then this column contains the key to the tracker mortgage type.
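
As an illustration of the “current branch key” (the history table below is hypothetical), the ETL can simply pick the latest branch assignment per account:

WITH LatestBranch AS
(
    SELECT AccountKey, BranchKey,
           ROW_NUMBER() OVER (PARTITION BY AccountKey ORDER BY EffectiveDate DESC) AS rn
    FROM dbo.AccountBranchHistory
)
SELECT AccountKey, BranchKey AS CurrentBranchKey
FROM LatestBranch
WHERE rn = 1;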

I would like to comment further on the word “main” that I marked with * above. To determine which officer is the main one we need a business rule, for example: “if there are two account officers assigned to a particular mortgage, then the main mortgage officer is the officer with the highest rank at the time the mortgage was approved”. It is possible for an officer to outrank another officer over time (for example, because of a promotion), hence the business rule specifies “at the time the mortgage was approved”. It is also quite possible that the two officers have the same rank, in which case the rule can further specify, for example: “if the officers have the same rank, the main mortgage officer is the first signatory on the approval document”.

I have found over the years that the main problems with defining business rules are:

  1. There isn’t an existing business practice for it, different parts of the business have different views on the proposed rule, and we can’t get them to agree.
  2. The rule is technically “not implementable”, i.e. the data required to implement the rule is not captured anywhere in the system.

In these situations we need to be practical. If the missing data can be captured, that is usually the best alternative. If not, we need to create the rule with the data that we have. In the case of disagreeing departments or business units, it is often useful to go higher: at a higher level the business priority is usually more obvious, making it easier to decide.

The other two options are to allow a mortgage to have several rows in this fact table, or to remove the offending dimension key column. Allowing a mortgage to have several rows usually adds more issues with regard to the measure values: usually either a) how to split the measures across rows, or b) double counting.

Removing the offending dimension key column (and keeping the grain of the fact table at the mortgage account level) can be a good, practical approach. To get the branch key, mortgage type key and mortgage officer key we will need to join to another fact table which has these keys. Unless you have an issue with query performance, this is the “cleanest” approach, as you keep those dimension keys where they belong and do not duplicate them.

Whether we plan to have a surrogate fact key

The second factor affecting whether we should make the main dimension key the primary key of the fact table is whether we plan to have a surrogate fact key.

I believe that saying “it’s a standard” is an unwise blanket approach. Having a policy to create a surrogate fact key on each and every fact table, without considering their structure and usage, is disadvantageous. True, a surrogate (PK) fact key has its advantages, including better query performance, better indexing (specific to the SQL Server platform, not applicable to Oracle), and the ability to uniquely identify a fact row. But there are cases where these advantages don’t apply.

One of those cases is an accumulating snapshot fact table whose grain equals the main dimension. Consider the mortgage status fact table above, which enables us to monitor closely the number of working days it took a mortgage to progress from one stage to another. If the grain of this fact table is one row for each mortgage account, it makes sense to make the mortgage account key the primary key of the fact table, rather than creating a surrogate fact key. A surrogate fact key is not necessary because only one of the dimension key columns in this fact table determines its granularity; the need for a surrogate fact key arises when more than one dimension key column determines the granularity.
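
So for the mortgage example, a sketch of the fact table (the column names are assumed) would simply be:

CREATE TABLE dbo.FactMortgageApproval
(
    AccountKey        int NOT NULL
        CONSTRAINT PK_FactMortgageApproval PRIMARY KEY,   -- the main dimension key is the PK
    MainOfficerKey    int NOT NULL,
    CurrentBranchKey  int NOT NULL,
    CurrentTypeKey    int NOT NULL,
    ApplicationDays   int NULL,
    QuestionnaireDays int NULL,
    VerificationDays  int NULL,
    ApprovalDays      int NULL,
    SigningDays       int NULL
);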

Should we make the main dimension key in the fact table a Foreign Key to the main dimension?

In the above example this means: should we make the account key in the fact table a FK to the account dimension?

In my opinion the answer is yes (an identifying relationship, since the account key is also the PK), because some accounts may not be in the fact table, cancelled accounts for example. This also guarantees that all the account keys in the fact table are valid: in the fact table we should not have the 0 (unknown) account key. The warehouse load (ETL) should check this, in addition to the FK constraint.
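
In SQL Server terms that is just (names assumed):

ALTER TABLE dbo.FactMortgageApproval
    ADD CONSTRAINT FK_FactMortgageApproval_DimAccount
    FOREIGN KEY (AccountKey) REFERENCES dbo.DimAccount (AccountKey);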

An accumulating snapshot fact table is usually not too big, and having an FK (on only one column) doesn’t have a negative impact on load performance: when the fact table is loaded we do more updates than inserts, so this FK has little effect on the load. On the other hand, it has an advantage for query performance (when we join the fact table to the account dimension).

8 May 2010

Why a 400 GB Data Warehouse Requires 20 TB of Disks

Filed under: Data Warehousing — Vincent Rainardi @ 7:43 am

A lot of people wonder why, to host a 400 gigabyte data warehouse, the production SQL Server requires 20 terabytes of disk. For example, say your BI Project Manager asks you to size your data warehouse and you come up with something like this:

  • 5 base fact tables, 1-30 million rows each: 10 GB
  • 10 snapshot tables, 100-300m rows each: 74 GB
  • 10 aggregate tables, 10-200k rows each: 2 GB
  • 20 dimension tables, 12 small (under 1000 rows), 5 medium (1-100k rows) and 3 large (>100k rows): 2 GB

Total data: 78 GB. (rounded: 80 GB)

  • Materialised/indexed views, 25% of data = 25% x 80GB = 20 GB
  • Indexes = 1.5x data: 120 GB
  • Unused + Unallocated space: 2x data: 160 GB (see here for background re what they are)

Estimated database size: 80+20+120+160 = 380 GB (rounded: 400 GB)

Growth for 2 years: 3x (1.2 TB). So, DDS database size = 1.6 TB.
Say NDS/ODS, staging and marts: 1 TB

Total: 2.6 TB

PS.
DDS = Dimensional Data Store, i.e. the star/snowflake schema
NDS = Normalized Data Store, aka Enterprise DW, see my book, chapter 2.
ODS = Operational Data Store

LUN / disk carving

  • Disk = 146 GB unformatted (137 GB formatted), SAS 15k RPM, about £175 each.
  • 20 data volumes, RAID 10, 20×4 disks = 80 disks (total capacity 20x137GB=2.7TB)
  • 2 log volumes, RAID 10, 2×4 disks = 8 disks
  • Tempdb files: 20 cores, one file per core, one on each data volume (shared with the data files)
  • 2 backup volumes, RAID 5, 2×25 disks = 50 disks (24x137GB=3.2 TB usable space)
  • Hot spare disks: 4
  • Total disks: 80+8+50+4=142 disks

Total: 142 disks. Raw disks capacity: 142x146GB = about 20 TB.
That’s just for prod, I haven’t mentioned DR, dev, UAT and Prod Support!

An example is the RamSan 5000 (SSD), which costs £1m (ref)! A more conventional SAN example is the CX4. The cheapest route is DAS (direct attached), probably £70-80k, e.g. about £2.5k per enclosure (25 bays, controller included); we would need 6 of them, plus the disks.

Notes:

  • Log files can go on the data disks (and vice versa), but not for the same database. For best results, separate them.
  • ODS/NDS and DDS both have partitioned tables: we need to split each data disk into several volumes (6 to 10 volumes).
  • Many companies are now starting to use solid state disks for DW: around 500x faster, about £500-£600 for 256 GB (ref). If we use SSD we don’t need RAID 10, just RAID 1 or 5. Read/write speed = 200 MB/s. I think it’s only a matter of time before everybody uses SSD.

To recap: the initial calculation for the “data warehouse” is 400 GB.
We need to cater for 2 years’ growth (3x): 1.6 TB.
Add ODS/NDS, staging and data marts: 2.6 TB (plus metadata, standing data, the control DB, and audit/DQ).
To provide 2.6 TB using RAID 10, we are looking at roughly 20 TB of raw disks (including the backup volumes).
