Data Warehousing and Business Intelligence

30 November 2010

SSAS Developer Interview Questions

Filed under: Analysis Services,Other — Vincent Rainardi @ 9:47 pm
Tags: ,

There are 5 key areas to test:

  1. Cube design: the stages of creating a cube in BIDS, i.e. data source, DSV, dimension, cube.
  2. Performance tuning: troubleshooting query performance as well as processing performance.
  3. MDX: the multi dimensional expression that we all love.
  4. Client tools: creating reports and dashboards on Strategy Companion, Excel, Tableau, Panorama, etc.
  5. Administration: installing and upgrading SSAS servers, configuring SSAS server settings, security, processing, deployment, migration, backing up and restoring the cubes.

Beyond these 5, I think it is very important for a SSAS developer (and ETL developer) to understand the concept of dimensional modelling well. Hence I like throw in one or two Kimball questions. Usually candidates understand dimensional modelling, at least the basic, but we need to make sure.

1. Cube design

Out of the 5 things above, Cube Design is the most important one. It is the bread and butter of the job. We need to establish 2 things here:

a) Do the candidates know the basic cube design (things that your company use now)

b) Do they know the advanced cube design (features that your company don’t use now)

Because it depends on the SSAS features your company use, what constitutes “basic” are different from company to company. But it’s probably not far from this list:

  • Data source: connection string, impersonation
  • DSV: table, view, named query, relationships between entities
  • Dimension: hierarchy, sort order, attribute relationships, OrderByAttribute, display folder, default member
  • Cube: referenced dimension, many-to-many, partitioning, aggregation, measure expression, format string, calculated members, error configuration, display folder
  • Role: membership, cube access

Advanced cube design:

  • DS & DSV: multiple DS, multiple DSVs, logical primary key, friendly name, retrieve relationship, schema restriction, number of connections
  • Dimension: parent child dim, translations, date calculation dimension, multiple KeyColumns, ragged dim, type 2 dim, custom rollup, unary operator, write enabled, ROLAP.
  • Cube: actions (URL & drill through), translations, perspectives, HOLAP, proactive caching, input dimension (dimension which is not connected to any MG, but used in calculation), IsAggregatable, KPI.
  • Role: dimensional security, cell security
  • AMO: doing all of the above programmatically using C#.

I usually like to establish first if the candidate can do the job, i.e. basic stuff above/things that we used in the company. I do this by picking 3 from the basic stuff, see if they are comfortable. If they do, I pick 3 from the advanced stuff. If they are not comfy with the basic, the question I need to answer is whether the candidate would be able to sort it out by themselves on the job. I do that by giving them some hints.

Whenever possible, I prefer to wrap the question in a business context, rather than bare theory. For example, question a) below is with business context, where as question b) is theory:

a) Explain how you handle currency conversion.
They can use measure expression, many-to-many, but we know immediately if they have done this stuff or not.

b) Explain different modes of impersonation.
That is just theory. I prefer to give a context i.e. when would we want to use service account and when to use a specific user name? Or: if the connection string already defines how to connect to the database, then what’s impersonation for?

2. Performance tuning

I think what I wrote here [Optimising Cube Query and Processing Performance] is quite comprehensive to be used as a base for our questions. As per its title, it covers the troubleshooting of both query performance and processing performance.

For query performance we can ask MDX queries, aggregation, partitioning, server settings. For processing performance: dimension, materialized view, partitioning and the most important of them all is incremental processing. I explain more about processing performance here.

Again I like to wrap them in a case, for example: a cube takes 5 hours to process and we need to get it down to 1 hour. How can we do that? Many candidates never optimise dimension processing, but any good SSAS developer will at least understand partitioning. An expert developer may even give you a lecture about why aligning table partitioning and incremental partition processing are very important.

Another favourite question of mine is what to do if a) a query is slow, and b) a certain measure is slow (other measures are fine). For a) trap the MDX, run it in MDX Studio (thanks Mosha) and find out which part makes it slow by altering the query bit by bit. For b) establish first if it’s a calculated measure or base measure. If it’s calculated, look at the MDX and analyse on MDX Studio. If it’s a base measure, look at the storage mode (ROLAP?), the dimension usage tab (relationship type = fact or many to many? Try removing some relationships & reprocess), measure expression (is it a simple A/B formula?), simplify the attributes involved in a query (perhaps the issue is with the attributes, not with the measures?).

3. MDX

This is always tricky. Not about figuring out what to ask, but what level of MDX do you need in your company/project. Any SSAS developer can do basic MDX, right? Not quite. Yes they can do the basic MDX select statement, but they might not know even the basic stuff like filter and sorting. So we need to test those. I’d recommend the following list:

Count, YTD, parallel period, filter, existing, order, first/lastchild, as/descendents, currentmember, if/iif/case, existing, head & tail, crossjoin, nonempty, except, scope, top/bottomcount.

They are considered every day use. Any SSAS developer who always queries their cubes using client tools would be separated from those who write MDX to query the cubes. Obviously we don’t need to ask all of them, just pick 3 or 4.

Before we test those functions, it’s worth establishing if they grasp the concepts of multidimensional database (OLAP) or not. For example: tuple, set, cube space, axis, level, member. If they can’t differentiate a tuple from a set, it’s probably not worth asking them any MDX function.

You might argue that “O but we don’t use MDX here”. But it is essential that they understand the concept and the basic functions. You can’t effectively compose SSRS reports displaying data from your cubes if your SSAS developer can’t write MDX. You can’t effectively build a cube if you don’t understand multidimensional database concepts.

XMLA (ASSL) is a bit different. They are not as important as MDX because you can script them on SSMS. Whether it is to create or to alter SSAS objects, or to process them, you can script them on SSMS. Besides, we have AMO for creating/altering SSAS objects programmatically. With AMO you get proper development environment like looping and branching, so you have more control. And it’s .NET. True that you can use tools like Excel, Panorama & ProClarity to get the MDX, or use Profiler to trap Strategy Companion & Tableau’s MDX statements, but you can’t optimise the calculated members if your SSAS developer doesn’t know MDX.

Another thing to consider is whether you need to cover DAX. This depends on whether you are using (or will be using) PowerPivot or not. But DAX is becoming more and more central in MS OLAP technology so in the long run (like 2 years from now) you will need to cover DAX. But for now it’s still optional, depending on the use of PowerPivot.

4. Client Tools

Specific features are different from one client tool to the next, but the basic ideas are similar: creating report, chart and dashboard. Some of the specific features to test are: create custom set, create calculated member, time calculation function such as YTD and QTD, hyperlink to differ reports, drilldown. One of my favourite questions is whether a specific calculation should be done in the client tool or in SSAS.

The administration of the client tool is usually done by somebody else, not the SSAS developer, so we shouldn’t worry about it in this interview. But if you are a small company then usually one person does it all, in which case you will need to ask the candidate about installation, migration, user security, and other aspects of the client tool.

5. Administration

Administration tasks are not the responsibility of an SSAS developer. They are the responsibility of the SQL DBA. These days, in large corporations, SQL Server DBAs are asked to manage all 4 aspects of SQL Server. They manage not only the relational engine, but also SSAS server, SSIS packages and SSRS servers. But in smaller companies SSAS developers are involved in administering the servers. So I include this section (section 5, Admin) in the interview.

The Questions

Enough with the background, let’s get on with the questions. I’m going to label each question with 1-5 indicating the 5 areas above, and LMH (low, medium, high) indicating the difficulty. I’ll put the label in curly brackets: {}. Hopefully the labels help you compose the right mix of interview questions.

  1. How do you ensure that January, February, March, etc will be in the correct sequence when the user browse the cube? {1-L}
  2. In SSAS, how do you design currency conversion for a) from many transaction currencies to 1 reporting currency, b) from 1 transaction currency to several reporting currencies and c) from many transaction currencies to many reporting currencies? {1-M}
  3. A user reported that a Panorama report that she uses every day is very slow today. Normally it opens in 1 or 2 seconds, but today it’s didn’t open at all (she has waited for 4 minutes). Explain the approach how you are going to handle this case systematically. {4-L}
    Note: you can still ask this question if though you are not using Panorama.
  4. We have a risk analysis cube. This cube is 500 GB in size, containing 20 billion fact rows. Explain how we: a) process this cube efficiently; b) ensure a good query performance. {2-M}
  5. Explain (step by step) how you build a cube from a data mart. {1-L}
  6. Explain how you migrate a cube from a dev AS server to production. {5-L}
    Follow up with advantage/disadvantage of each approach. {5-M}
  7. A cube has 300 partitions and we want to process only the partitions where the fact table row changed. We don’t want to process the entire measure group. How do we do that? {2-M}
  8. We have a cube with 1 measure and 1 dimension. That dimension has only 1 attribute. Write an MDX to list the members of that attribute of which measure1 is bigger than 10. We want to sort it in descending order, i.e. largest amount first. {3-L}
  9. Like above but sort on the attribute, not on the measure. {3-L}
  10. People say that we shouldn’t use cell security because it’s slow. Explain how we should use it then? How do you combine it with dimensional security? {2-H}
  11. What’s the difference between a tuple and a set? What is an axis? {3-L}
  12. You need to grant access to 200 users to access a cube. Each user can only access certain data within the cube. The access is by product and region, i.e. user1-20 can only access product1-50 and region1-3, user21-40 can only access product30-80 and region2-5. How would you do that? {5-M}
  13. Some users need to see the sales figures in USD, other users in GBP. How would you do that? {1-H}
  14. You need to upgrade 4 SSAS servers (dev, UAT, prod, DR) from 2005 to 2008. Explain would you approach this. {5-L}
  15. People say that we can have a real time cube using ROLAP or ProActive Caching. Explain how, and the plus/minus of each approach. {1-H}
  16. Cube processed OK but all measures return no value. How do you troubleshoot this issue systematically? {1-M}
  17. How do you do Slowly Changing Dimension type 2 in SSAS? {1-M}
  18. The data mart only has 1 measure: amount. This measure could be asset, expense, income, etc depending on an attribute called “Account”. How do you implement this in SSAS cube? {1-H}
  19. The value of measure “price” is only valid if the product dimension is at the leaf level and the location dimension is at the “store” level. How do you set it? {2-M}
  20. Attribute relationship: In what case would we want to set: a) the cardinality to “one to one”? b) The relationship type to rigid? What’s the risk (of doing so)? {1-M}
  21. Why do you need to design aggregations? How do you design aggregations (usage-based and non usage-based)? Explain how you test it. {1-H}
  22. People say that perspective is not a security measure. What do they mean by that? {1-M}
  23. Star vs snowflake: what’s the benefit of having a snowflake data mart in SSAS? {1-M}
  24. We have a dimension which is as big as the fact table (in terms of the number of rows). What do you suggest we do with this dimension? And with regards to SSAS, what would you do differently? {1-H}
  25. Excel: how do you change language from English to French when browsing a cube?

Answers:

  1. Order by key and put month number in the key column. If they can’t answer this you should worry.
  2. It’s best if the candidate can explain about the currency conversion BI wizard. But if he can explain the “traditional way” it’s quite good too: using calculated measure and many-to-many relationship.
  3. As with any support call, we need to find out which report. Try to reproduce what the user did in NovaView. Find what the MDX is (Tools menu, Direct MDX, ctrl-alt-V), execute in SSMS. If this is slow then find out which part of the MDX makes it slow by building the query step-by-step. Then consider: a) aggregation, b) block computation, c) convert to base measure (do calculation in the relational/ETL), d) improve the MDX if it’s a calculated measure, e) partitioning. Most of the time, when a user says that usually it is 2 seconds but today it timed out, it is not the MDX or the cube structure. It’s usually either the server or the connection. So check that a) the server is up and running and b) from Panorama you can “open” the cube in that server.
  4. Incremental processing (IP). If you have huge cubes and the candidate has never done IP before, end the interview. If you have huge cubes in your company (100 GB), your questions should be focused on performance (processing & query). The job spec should clearly say: wanted: SSAS performance tuning specialist (not a “normal” AS developer) and prepare to pay 20% premium.
  5. It’s an easy question; every single candidate should be able to answer this question as it is bread and butter. But their answer shows their class. One of the important things I’m looking for here is requirement analysis. An SSAS developer who “just build it” without analysing the requirements is usually less helpful then those who analyse the requirements. The other thing I’m looking for here is performance, e.g. when building the DSV do they ensure that the SQL is efficient? When creating calculated measure, do they ensure that the MDX is efficient? The last thing I’m looking for here is how they do unit testing.
  6. a) backup-restore, b) script (XMLA)-execute-process, c) deployment wizard.
  7. First detect (in relational) which partition changed (you need some kind of flag / timestamp column on the fact table). In the metadata table, update the status of each partition of which the data has changed. Then based on this metadata table, process the changed partitions using either a) AMO or b) “AS execute DDL” task in SSIS.
  8. See below.
  9. See below.
  10. Look at Jason Thomas’ post herehere, and Bryan Smith’s post here.
  11. A tuple is a cut of a cube; a set is a collection of several tuples (could be 1 tuple). A tuple is enclosed with parentheses: (); a set is enclosed with curly brackets: {}. An axis is an edge of a multidimensional result set returned by an MDX query (columns, rows, pages, sections, chapters)
  12. Put users into AD groups and add these groups as the members of the cube roles. For each role, define the dimensional security (dimension data tab) on product and region dims.
  13. Create 2 roles: USD and GBP and add users into these roles via AD groups. In each role, set the default member in the currency dimension using dimensional security (dimension data tab in the role).
  14. Upgrade Dev first (document the process), test the cubes by running the reports (Panorama/Strategy Companion/Excel, etc). Use the documented process to upgrade UAT, and test the UAT cubes. Then upgrade prod and DR at the same time (using the documented process). Or DR first, test then Prod. Before upgrading Dev, you might want to copy a Prod cube to Dev and export some reports into Excel. After upgrading Dev, run the same reports and compare with the results before upgrade.
  15. There are 2 things here: dim and MG. In reality, a real time cube doesn’t mean that all dims and all MGs are real time. But only 1 or 2 MG are real time. Usually all dims are MOLAP, perhaps with the exceptions of 1 or 2 dims. To make an MG real time, we can define it as ROLAP storage mode. For performance reason, best not to define the whole of MG as ROLAP, but only 1 partition (usually the last one, if partition by month/period). Unless it’s a small MG (<100,000 rows). For dims, to make it ROLAP set the StorageMode property. ProActive Caching (PC) provides automatic management of MOLAP storage. The cube incorporates the changes in the relational tables, based on a notification mechanism. You can do PC on partition and dim. If the fact table (or partition of the fact table) is small, and data is changing frequently (like every minute), best to use ROLAP. If the fact table (or partition of the fact table) is large (like 8m rows in partition), and the data is rarely changed, best to use PC.
  16. Do “explore data” on the fact table on DSV. Run the SQL if it’s a named query. If this doesn’t return anything, check the fact table. If this returns rows, check the dimension usage tab. Focus on 1 MG. Remove all relationships (keep the cube dims), reprocess cube and see if you get a measure at the top level. If you do, add the relationship to the dims one by one until you find out which dim causing the issue.
  17. No special settings, just create the dim as normal from the SCD2 table. The only difference that it has 3 extra attributes: IsCurrent, Valid From & To date.
  18. Parent child dim, unary operator column, custom rollup.
  19. Scope statement
  20. a) If for each member of attribute1 there is only 1 member of attribute2, b) if “member mappings” are fixed. Risk: if a member changes its mapping.
  21. Why: to increase query performance. How (usage based): OLAP query log. How (non usage based): based on attributes which are queried/used most. Refer to SSAS2008PerfGuide section 3.4.3.1. Test in Mosha’s MDX Studio, before and after, 2 different cubes, same query.
  22. You can’t prevent access to dim/attribute/measure using perspective. They just can’t see it, but if they know the name of the dim/attribute/measure they can access it.
  23. Flexibility in building a dimension, speed of processing. No difference in query performance.
  24. Dimensional modelling: break into smaller dimensions and/or snow flake the dim. SSAS: limit the number of attributes (don’t just take all attributes but be selective), index on those attributes (relational), limit the number of rows going into the cube by using a where clause on the DSV/view, put attributes with lower grain (smaller number of distinct values) into a referenced dim, linked via the main dim to the MG.
  25. Modify the connection string in the ODC file, see my post here.

Number 8:

select order
( filter
  ( {[dim1].[attribute1].[All].Children},
    [Measures].[Measure1] &amp;gt; 10
  ),
  [Measures].[Internet Sales Amount],
  desc
) on rows,
[Measures].[Measure1] on columns
from [Cube]

Number 9:


select order
( filter
  ( {[dim1].[attribute1].[All].Children},
    [Measures].[Measure1] &amp;gt; 10
  ),
  [dim1].[attribute1].CurrentMember.Name,
  desc
) on rows,
[Measures].[Measure1] on columns
from [Cube];

 

 

24 November 2010

SQL Parsing in BIDS

Filed under: Analysis Services — Vincent Rainardi @ 12:09 am

BIDS and SSMS have differences in terms of SQL parsing. In programming languages, to “parse” a statement is to break the sentence into the smallest components (called token) and check if all the tokens a) are meaningful symbol and b) are forming an allowable expression. A SQL query that runs in SSMS might not run in BIDS.

In SSMS the following SQL runs successfully:

SELECT Table4.Col1 C1, Table2.Col2 C2
FROM Table2
INNER JOIN Table3 ON Table2.Col3 = Table3.Col4
INNER JOIN Table4 ON Table3.Col2 = Table4.Col1
INNER JOIN Table5 ON Table4.Col7 = Table5.Col7
UNION
SELECT Table4.Col1, Table2.Col2
FROM Table4
INNER JOIN Table5 ON Table4.Col7 = Table5.Col7
INNER JOIN Table2 ON Table4.Col8 = Table2.Col2

But when we paste that SQL onto BIDS’s DSV and run it, BIDS changes the SQL to:

SELECT Table4.Col1 C1, Table2.Col2 C2
FROM Table2
INNER JOIN Table3 ON Table2.Col3 = Table3.Col4
INNER JOIN Table4 ON Table3.Col2 = Table4.Col1
INNER JOIN Table5 ON Table4.Col7 = Table5.Col7
UNION
SELECT Table4.Col1, Table2.Col2
FROM Table4 AS Table4_1
INNER JOIN Table5 AS Table5_1 ON Table4.Col7 = Table5.Col7
INNER JOIN Table2 AS Table2_1 ON Table4.Col8 = Table2.Col2

And gives this error message:

Multi-part identifier Table4.Col1, Table2.Col2, Table4.Col7, Table5.Col7, Table4.Col8, Table2.Col2 could not be found.

This was because BIDS detected that Table4, Table5 and Table2 were already mentioned on the first part of the query (before the union). So for the 2nd part of the query (after the union) BIDS added table aliases to Table4, Table5 and Table2 to make them unique. BIDS adds “_1” to the table name. So Table4 is now known as Table4_1, Table5 is now known as Table5_1 and Table2 is now known as Table2_1. And once you have given an alias, you need to refer the table using the alias. You can’t refer them using the original name.

The solution is adding “_1” to the multi-part identifier after the union:

SELECT Table4.Col1 C1, Table2.Col2 C2
FROM Table2
INNER JOIN Table3 ON Table2.Col3 = Table3.Col4
INNER JOIN Table4 ON Table3.Col2 = Table4.Col1
INNER JOIN Table5 ON Table4.Col7 = Table5.Col7
UNION
SELECT Table4_1.Col1, Table2_1.Col2
FROM Table4 AS Table4_1
INNER JOIN Table5 AS Table5_1 ON Table4_1.Col7 = Table5_1.Col7
INNER JOIN Table2 AS Table2_1 ON Table4_1.Col8 = Table2_1.Col2

And that runs OK in BIDS.

Note: in a union you don’t need to give an alias to the selected column names, but it’s better to do so:

SELECT Table4.Col1 C1, Table2.Col2 C2
FROM …
UNION
SELECT Table4_1.Col1 C1, Table2_1.Col2 C2
FROM …

If you name the columns before and after the UNION differently, SQL will use column names before the UNION.

18 November 2010

Tuning Cube Processing Performance

Filed under: Analysis Services — Vincent Rainardi @ 10:24 am

I wrote about performance tuning on SSAS about 5 months ago. That was on both processing performance and query performance. This time I would like to write only about processing performance but in a bit more detail.

So the usual case is something like “my cube processing takes 5 hours, can we reduce it to 2 hours?” Now what do we need to do? The wrong thing to do is blaming the hardware. The right thing to do is to systematically find out which part of the cube takes a long time to process.

Step 1. Identify the Bottleneck

When it takes a long time to process your cube database, the first thing to do is to get information about how long each dimension and measure group take. The purpose of this “time measurement” exercise is to identify the top issues, i.e. which part of the cube takes the longest time to process.

This can be done in 2 ways: a) using either BIDS or SSMS to process the cube, or b) use SSIS processing cube task. The advantage of using SSIS is that the process is logged so you can get the timing information in detail. But the disadvantage is that it’s slightly less readable than the process dialog box on BIDS and SSMS. If you have complete control of the servers (meaning that you can run / process anything you like, whenever you like, as in Dev SSAS server), I would recommend to process it in SSMS first to quickly identify the top issues, and then if you need the detail you can use SSIS.

The “processing output” window gives us information about the time it took to process each dimension and each measure group. If your cube contains 10 dimension, and dimension 1 to 9 took 1 minute each but dimension 10 took 30 minutes, then take a note that dimension 10 is an issue and you will need to look at it.

If your cube contains 10 measure groups (MG), and MG1 to MG9 took 5 minutes each but MG10 took 3 hours, make a note that MG10 is an issue and you will look at MG10 later. Apologies with pointing out  the really obvious above, but I want us to be clear on how to identify which ones are the issues.

What if MG10 takes 3 hours and MG9 takes 1 hour, do you need to worry about MG9? Well at the moment, no. But when you can get MG10 to process in 30 minutes, then MG9 becomes an issue. This is because the MGs are processed in parallel (see next paragraph*). So after reducing MG30 to 30 minutes, MG9 is now the bottleneck.

*The number of parallel threads depends on the “Maximum numbers of connections” settings on the Data Source, the default is 10). Each thread processes not an MG, but partitions within the MG. So if MG1 has 30 partitions, AS might not process MG2 to MG10 during the first minute, because all 10 threads are being used to retrieve data for MG1 partitions.

Step 2. Dimensions

Now that you know that the bottleneck is (for example) in Dimension 10 and Measure Group 10, we could work on them. Let’s do dimension first.

When processing a dimension, AS does “select distinct” on each attribute. Say your dimension has 3 attributes: attr1, attr2, attr3. Your dimension table (dim10) has 4 columns: dimkey, attr1, attr2, attr3. To process attr1, AS basically does “select distinct attr1 from dim10”. Then it does “select distinct attr2 from dim10”, and so on.

So far so good. But if your dimension is fed not from a physical table, but from a view, and this view joins 20 physical tables, then it could take a long time to process the dimension. Why? I’ll explain why.

Say your dim10 is created from view10 and it has 40 attributes. View10 takes 3 minutes to run, joining 20 tables, returning 100,000 rows. To process attribute1, SSAS does “select distinct attribute1 from view10”, which takes 3 minutes. Then it does “select distinct attr2 from view10”, which takes another 3 minutes. And so on until attr40. So it will take 3×40 = 120 minutes = 2 hours for SSAS to process dim10.

So what do we do when we face this situation? We need to materialise that view. Open the view definition, which is something like “create view view10 as select … from … inner join … left join … where …” Create a table to store the data on this view. This is can be done simply by using “select into”, i.e. “select … into table10 from … inner join … left join … where …” This will create table1 that contains view10 rows.

Now before you can process the dim, you need to repoint dim10 on the DSV from view10 to table10. Right click the dim10 table on DSV, choose “Replace Table” and choose “With Other Table”. Select table10 and click OK, now process the cube. It will finish in 3 minutes. Why? Because the data is ready on table10. SQL Server doesn’t need to run view10 every time to get the data from the underlying 20 tables.

Named Query

What if your DSV doesn’t use a view, but using a named query instead?

Same thing. Materialise the named query. Copy the SQL on the named query and paste it onto a blank query window on SSMS. Add “into table X” to the select statement to create a physical table and then execute it. Then repoint the named query on the DSV to table X.

Big Dimension Table

If the materialised dim table is under 100,000 rows, select distinct would finish in a few seconds. But if it’s 100 million rows (e.g. dim_customer in CRM data warehouse) then it could take 20-30 minutes to do select distinct on each column. In this case, we need to index that column. It is best to index the column separately/individually, i.e. if your dimension has 5 attributes, create 5 different indexes. Each index contains 1 key column. This is because SSAS issues a “select distinct” on each column individually. Don’t add extra columns like when you are creating a “covering index”. Index only the columns that are used by the cube. Don’t use columns that are not used by the cube.

Step 3. Measure Groups

Having tuned the dimensions, we now tackle the measure groups. The keyword here is partitioning. We need to partition both the table and the cube.

To partition the table, first create additional file groups (FG). Most databases use only 1 FG. We need to create 8 or 16 new FGs, depending on the disks on the server. If we only have 1 logical volumes, then create between 6 and 9 additional FGs. This number comes from experience. This number works for both RAID 5 and RAID 10 disks. If you want a precise, very optimum setting, then you will have to try different numbers of FGs e.g. 4, 8, 12 for example, then create the partitioned table, and try querying the table to see the performance.

If you have multiple logical volumes, you are lucky. You will get better query performance. Say you have 4 logical volumes. Create 16 FGs and place them into round robin fashion into those volumes, i.e. FG1 on vol1, FG2 on vol2, FG3 on vol3, FG4 on vol4, FG5 on vol1, FG6 on vol2, and so on.

Details of Partitioning

You can use the GUI but if you want to script it, the command to create the FGs is:

USE master
GO

ALTER DATABASE DataWarehouse ADD FILEGROUP FG2
ALTER DATABASE DataWarehouse ADD FILEGROUP FG3
ALTER DATABASE DataWarehouse ADD FILEGROUP FG4
etc

It’s a matter of taste but I always start from FG2 because we already have 1 FG, i.e. the Primary file group.

Now you need to create the additional database files and add them to those FGs:

ALTER DATABASE DataWarehouse ADD FILE
( NAME = N'Data2',
  FILENAME = N'H:\Path\DataWarehouse2.ndf',
  SIZE = 20GB , FILEGROWTH = 5GB
) TO FILEGROUP [FG2]

ALTER DATABASE DataWarehouse ADD FILE
( NAME = N'Data3',
  FILENAME = N'H:\Path\DataWarehouse3.ndf',
  SIZE = 20GB , FILEGROWTH = 5GB
) TO FILEGROUP [FG3]
etc

Note: calculate the size and growth based on the size of the original table (right click table name, reports, standard reports, disk usage by top table). If the original table is 200 GB, and you have 8 FGs, then 200/8=25 so set the initial size = 30 GB with 5 GB growth increment. Don’t use %, use absolute as the growth is more controlled. Large increment is generally better, provided you have the disk space. Both the size and the increment must be uniform across all FGs. This is because we intend to distribute the data equally across those FGs.

Now create the partition function and partition scheme:

USE DataWarehouse
GO

CREATE PARTITION FUNCTION PF_MONTH (int)
AS RANGE LEFT FOR VALUES
( 1, 2, 3, … 98, 99, 100)

CREATE PARTITION SCHEME PS_MONTH
AS PARTITION PF_MONTH TO
( FG2, FG3, FG4, FG5, FG6, FG7, FG8, FG9,
  FG2, FG3, FG4, …
  FG2, FG3, FG4, FG5, FG6
)

The number of FGs on the partition scheme = the number of values on the partition function plus 1, i.e. in the above example, the number of FGs should be 101.

Try to keep the number of rows in each partition under 10 million. 2-5 million would be ideal but 20 million is not good. It’s too big and too slow. This is the physical table partitioning we are talking about, not cube/measure group partitioning. As far as possible, the distribution of the data should be uniform. Meaning that the number of rows is roughly the same in all partitions. It’s not good to put 10 million rows on partition 1, but only 1 million on partition 2.  So pick your partition key carefully. A column with many distinct values is better for partitioning key. Tailor the partitioning function ranges to group them equally. Sometimes we can’t distribute the rows equally because it depends on how the data is spread across the partitioning key, but try your best because the loading performance and the query performance of that partitioned table depend on this distribution profile.

Next create the table and place it on the partition scheme:

CREATE TABLE Partitioned_Fact_Table
( Fact_Key bigint IDENTITY(1,1) NOT NULL,
  Month1_Key int,
  Dim1_Key int,
  Dim2_Key int,
  etc
  Measure1 money,
  Measure2 decimal(18,6),
  etc
  CONSTRAINT [PK_Partitioned_Fact_Table] PRIMARY KEY CLUSTERED
  ( [Month_Key], [Fact_Key] ASC )
  ) ON PS_MONTH(Month_Key)

Now load the data in:

INSERT INTO Partitioned_Fact_Table WITH (TABLOCK)
( [columns] )
SELECT [columns]
FROM Original_Fact_Table WITH (NOLOCK)

Notes:

a) The purpose of TABLOCK is so that the insert is done minimally logged. Meaning that the load will be faster.

b) The purpose of NOLOCK is so that the query is pessimistic (dirty read is OK), allowing other queries to run on the source table.

c) If you have 500 million rows this query could run for 4 hours, depending on your RAIDs. So try with 1 million rows first, i.e. SELECT TOP 1000000 FROM …, and see how much time a million takes. If it takes 30 seconds to do 1 million then it will be 120 million per hour. If it’s more than 100 million rows, you need to add approx 30 minutes to this calculation as there is an overhead at the beginning of the process. So 360 million is 3.5 hours.

d) During the loading process do not issue select count(*) to get the number of rows as this interfere with the “select into” that is running. Do not right click on target table name on the explorer and choose properties, storage to get the row count either. That also interferes. If you want to know the progress (number of rows), query the partition system view:

select sum(P.rows) from sys.partitions P with (nolock)
join sys.tables T with (nolock) on P.object_id = T.object_id
where T.Name = 'Partitioned_Fact_Table'

e) During the first 30-45 minutes that query on d) above return 0. Meaning that SQL Server doesn’t immediately load the table, but there’s a delay. But after it loads the first row it will then grow steadily and quickly. Quicker than SSIS workflow. Using SSIS workflow the rows immediately landed on the target, but the pace is lower. I’ll blog this (with numbers) when I got a chance later.

f) Remember that the partition key must be part of the primary key. Even if your PK candidate is unique across partition, you will still need to include the partition key in the primary key.

Step 4. Indexes

If your MG query involves a “group by” or a join, the best thing to do is to go back to your data warehouse designer and ask for that summary fact table to be created. Overall it is more efficient to do it that way than you (cube developer/designer) having to issue a “group by” and join query, which is very costly. ETL tools like Informatica and DataStage (WebSphere) has specific transformation for calculating aggregates and doing joins and they operate very efficiently. Doing it at SQL engine directly for a large volume of data is looking for trouble: a) your tempdb will grow enormously. We are talking 200, 400 GB here, b) SQL engine will need to sort the data before the “group by” can take place and this takes a lot of time. We are talking 1-2 hours  or even 4 hours, before  the first row is returned.

Well enough said. If after all that warning you still insist to do the aggregation and join on SSAS, read on. You will need to index your partitioned fact table. Here are some of the principles:

a) Index columns used in the joins. Index them individually. If for example you have 2 joins, the first one hits the fact table at column 3 and the second one at column 5, then create 2 indexes. One for column 3 and one for column 5. SQL needs to sort the column before entering the join. These indexes will make that sorting quicker.

b) Index the columns used in the “group by”. This needs to be a “covering index”, meaning that one index with many keys. The order of the keys needs to match the order of the group by. The purpose of doing this is the speed up the sorting. More than 50% of the cost is on the sorting. Sometime 80%. So it is a “must” that you have a covering index for your “group by” query. Otherwise the query would be painfully slow as SQL is doing table scan. If your “group by” on the partition query is not “big buckets” to “small buckets” then reorder them.

c) If the “sum” column on your “group by” query is only 1 column, add this column to the covering index. This prevents the SQL engine from going to the base table to get the measure, as the measure column is conveniently included in the index file.

d) Selectivity. In indexing, selectivity is the name of the game. If it’s not selective it’s not used. Selective means: out of 1 million rows, SQL select only 10 rows. If out of 1m rows SQL select 500k that’s not selective, index won’t be used. So how do you ensure selectivity? Number of distinct values. If it’s only 2 or 5, don’t bother indexing it (there are exceptions, for example if it’s involved in multiple key). If it’s 15k distinct values it’s will be used (again there are exceptions, for examples if it’s out of date, or multiple key).

e) Sort. The main purpose of an index is to sort a column. Before SQL can arrive at a certain value, it needs to sort it first. So if you see a sort icon on the query plan, think if indexing that column would help or not.

f) Maintenance. This is the must important thing of indexing. So please remember this well. It’s all very well indexing here and there, but remember that you have to maintain that index. Every insert on the fact table SQL needs to update the index (in DW we don’t usually delete or update the fact table, only on dimension table we do those). Inserting a row to the table page is quick – it goes to the last page. But then SQL needs to insert that row into the index as well, and that’s slow. So if you have 24 indexes on your fact table, totalling of 150 GB, bear in mind that your fact table load will suffer. If without any index your daily load on that fact table is 1 hour, with those 24 indexes your load could become 3 hours. So weigh the benefit of each index against the cost of maintaining it. How do you know how much is the benefit and how much is the cost? Simple. Measure it! For the benefit: execute your query before and after indexing. For the cost of maintenance: do your insert before and after indexing.

Verify with estimated execution plan. I know it’s only an estimated, not actual, but SQL engine is very good at predicting what’s going to happen. 99% of the time it’s usually true. So you don’t need to do the actual execution plan. But if you have to verify using the actual plan, don’t use the original partitioned fact table. You could be spending hours if you do that. Instead, do a “select into” to create a copy of that big table but load only 1 partition into it. Or, you can use the original table, but limit the query to only use 1 partition by adding a where clause. The purpose of this exercise is so that we get an actual execution plan that mimic the true data behaviour, but at a fraction of the time.

What to see in the estimated execution plan?

a) The “group by” is doing “index seek”, not “table scan”

b) The join sorting is using the index.

c) General observations such as: total cost (left most icon), parallelism (crucial when dealing with partitioned table), sort icons (would an index help)

It might be difficult to interpret the total cost. For guidance: 5000 means 3 minutes. Well, you know as well as I do that it’s never right or precise, because it depends on the hardware. But it’s useful as guidance. So if the cost is 50,000 it’s half an hour. And if it’s 500,000 you start to worry and look for ways to optimise your query.

Step 5. Queries hitting the same table

See if any of the DSV queries (particularly the MGs/Measure Groups) are hitting the same table or not. Example: MG1, MG2, MG3 and MG4 are all querying table1. Or worse: MG1  and 2 are querying table1, and MG3 and 4 are aggregating table1 (aggregating means “group by”).

SSAS processing is parallel. Because of that if you have 10 partitions on MG1 and 10 partitions on MG2 and they are hitting the same table, some waiting will occur. Thread 1 can’t query table1 because it is being queried by tread 4, 5, 6 and 7. I’m not talking about SQL, I’m talking about I/O, i.e. physical disk. SQL would happily issue 5 threads hitting the same table at the same time, but the RAID stripes are access sequentially. While that stripe is being read by thread1, other thread needs to wait (not if that stripe has been cached into the controller’s memory or if SQL has put the data in the server memory – hence having 128 GB RAM in your DW SQL Server helps a lot in speeding up the queries). This prolongs the processing time. From the cube processing side, the ideal solution is to have a second table. But that’s not efficient from modelling and ETL point of view, so that doesn’t usually occur in real life. So usually you have to survive with only 1 table.

What can we do to resolve this? Break that MG into as many partitions as possible. And partition the underlying table into as many partitions as possible. So those MGs partitions will be hitting DIFFERENT table partitions at the same time.

As usual I welcome any feedback and comments on vrainardi@gmail.com

Vincent Rainardi 18/11/2010

12 November 2010

SELECT COUNT(*) FOR BIG TABLE

Filed under: Analysis Services — Vincent Rainardi @ 7:39 pm

To get the number or rows in a table we usually do:


SELECT COUNT(*) FROM VERY_BIG_TABLE

For a small table (<100k rows) it’s fine but if it’s a very big table it takes a long time.

A table with 250 million rows , unpartitioned, no index, 40 columns: it takes 2 mins 18 seconds to do select count(*). (SQL Server with dual processors, 32GB RAM, single RAID5).

This one only takes 1 second:


select sum(P.rows) from sys.partitions P with (nolock)

join sys.tables T with (nolock) on P.object_id = T.object_id

where T.Name = 'VERY_BIG_TABLE' and index_id = 1

Would be very useful to memorise that one! As in data warehousing we deal with lots of very big tables.

Well the “nolock” hint is totally optional. I was just being too careful there. No need for it really as they are system tables which are not modifiable by users.

Create a free website or blog at WordPress.com.