Data Platform and Data Science

29 December 2010

SSAS DMV: Join using DataSet

Filed under: Analysis Services — Vincent Rainardi @ 4:32 pm
Tags:

In SSAS 2008 we have Dynamic Management Views (DMVs), which are basically XMLA Discover commands wrapped as DMX select statements. In SSMS, if we connect to an SSAS instance, right-click the server name, choose New Query > DMX, type "select * from $system.discover_connections", and click Execute, we will see a list of users currently connected to that server.

The output looks like this (I picked one row and transposed it so we can see the column names clearly):

CONNECTION_ID 42
CONNECTION_USER_NAME domain\user1
CONNECTION_IMPERSONATED_USER_NAME
CONNECTION_HOST_NAME 10.1.27.146:3315
CONNECTION_HOST_APPLICATION SQL Server Management Studio
CONNECTION_START_TIME 29/12/2009 11:34
CONNECTION_ELAPSED_TIME_MS 3089784
CONNECTION_LAST_COMMAND_START_TIME 29/12/2009 11:45:03
CONNECTION_LAST_COMMAND_END_TIME 29/12/2009 11:45:04
CONNECTION_LAST_COMMAND_ELAPSED_TIME_MS 62
CONNECTION_IDLE_TIME_MS 1972821
CONNECTION_BYTES_SENT 3610612
CONNECTION_DATA_BYTES_SENT 15880694
CONNECTION_BYTES_RECEIVED 1502120
CONNECTION_DATA_BYTES_RECEIVED 1502120

And if we type “select * from $system.discover_sessions” we will get:

SESSION_ID D1A8209F-27C4-4013-AB23-154704213CB3
SESSION_SPID 357640
SESSION_CONNECTION_ID 661
SESSION_USER_NAME domain\user1
SESSION_CURRENT_DATABASE Cube1
SESSION_USED_MEMORY 71
SESSION_PROPERTIES
SESSION_START_TIME 29/12/2009 11:57
SESSION_ELAPSED_TIME_MS 3193141
SESSION_LAST_COMMAND_START_TIME 29/12/2009 12:09
SESSION_LAST_COMMAND_END_TIME 29/12/2009 12:09
SESSION_LAST_COMMAND_ELAPSED_TIME_MS 0
SESSION_IDLE_TIME_MS 2476374
SESSION_CPU_TIME_MS 87687
SESSION_LAST_COMMAND {MDX Statement}
SESSION_LAST_COMMAND_CPU_TIME_MS 62
SESSION_STATUS 0
SESSION_READS 103571
SESSION_WRITES 0
SESSION_READ_KB 6593264
SESSION_WRITE_KB 0
SESSION_COMMAND_COUNT 1854

And when we type “select * from $system.discover_commands” we get:

SESSION_SPID 356127
SESSION_COMMAND_COUNT 31
COMMAND_START_TIME 29/12/2009 11:36
COMMAND_ELAPSED_TIME_MS 0
COMMAND_CPU_TIME_MS 281
COMMAND_READS 102
COMMAND_READ_KB 6154
COMMAND_WRITES 0
COMMAND_WRITE_KB 0
COMMAND_TEXT {MDX Statement}
COMMAND_END_TIME 29/12/2009 11:36

For other DMVs please see my June presentation at Microsoft TVP here. Also see my recent article on SSAS DMV here.

We can link discover_sessions and discover_connections on CONNECTION_ID, and we can link discover_sessions and discover_commands on SESSION_SPID.

If we want to join the three DMVs (command, session, connection), we have 3 options:

  1. Join using Linked Server
  2. Join using SSIS
  3. Join using DataSet

Vidas Matelis wrote about the Linked Server approach. Basically it's using OpenQuery.
Chris Webb wrote about the SSIS one.
This article is about using DataSet to join the DMVs. It’s using ADOMD.

What is a DataSet? We can think of a DataSet as an in-memory ERD: a collection of tables with relational integrity (relationships) between them.

So, we are going to issue Discover_Connections, Discover_Sessions and Discover_Commands and then join them in the DataSet. The steps are:

  1. Add reference to ADOMDClient
  2. Open a connection and execute Discover_Commands
  3. Create a DataTable for Discover_Commands and populate it
  4. Execute Discover_Sessions
  5. Create a DataTable for Discover_Sessions and populate it
  6. Execute Discover_Connections
  7. Create a DataTable for Discover_Connections and populate it
  8. Add all three tables to the DataSet and join them
  9. Display the joined table to the screen

1. Add reference to ADOMDClient

Create a new C# project. You can use a Windows Forms, WPF or web project, or a Silverlight project if you use VS2010. Add a reference to Microsoft.AnalysisServices.AdomdClient.

2. Open a connection and execute Discover_Commands

AdomdConnection ASConn = new AdomdConnection("Data Source=ServerName;Catalog=;Integrated Security=SSPI;");
ASConn.Open();
AdomdCommand Cmd = ASConn.CreateCommand();
Cmd.CommandText = "select * from $system.discover_commands";
AdomdDataReader R = Cmd.ExecuteReader();

Leave the Catalog empty. We don't need to specify it if we only want to execute discover_commands, discover_sessions and discover_connections, but we do need to specify the database name for some DMVs.

3. Create a DataTable for Discover_Commands and populate it

First we create a DataTable called tblCommand:
DataTable tblCommand = new DataTable("Command");

Then we get the column names from the schema:
DataTable tblSchema = new DataTable("Schema");
tblSchema = R.GetSchemaTable();
foreach (DataRow schemarow in tblSchema.Rows)
{ tblCommand.Columns.Add(schemarow.ItemArray[0].ToString(), System.Type.GetType(schemarow.ItemArray[5].ToString())); }

Then we populate the DataTable from the ADOMD Data Reader:
while (R.Read())
{object[] ColArray = new object[R.FieldCount];
for (int i = 0; i < R.FieldCount; i++)
{if (R[i] != null) ColArray[i] = R[i];}
tblCommand.LoadDataRow(ColArray, true);
}
R.Close();

4. Execute Discover_Sessions

We reuse the data reader:
Cmd.CommandText = "select * from $system.discover_sessions";
R = Cmd.ExecuteReader();

5. Create a DataTable for Discover_Sessions and populate it

First we create a DataTable called tblSession:
DataTable tblSession = new DataTable("Session");

Then we get the column names from the schema:
tblSchema = R.GetSchemaTable();
foreach (DataRow schemarow in tblSchema.Rows)
{tblSession.Columns.Add(schemarow.ItemArray[0].ToString(), System.Type.GetType(schemarow.ItemArray[5].ToString()));}

Then we populate the DataTable from the ADOMD Data Reader:
while (R.Read())
{object[] ColArray = new object[R.FieldCount];
for (int i = 0; i < R.FieldCount; i++)
{if (R[i] != null) ColArray[i] = R[i]; }
tblSession.LoadDataRow(ColArray, true);
}
R.Close();

6. Execute Discover_Connections (reuse reader)
Cmd.CommandText = "select * from $system.discover_connections";
R = Cmd.ExecuteReader();

7. Create a DataTable for Discover_Connections and populate it

First we create a DataTable called tblConnection:
DataTable tblConnection = new DataTable("Connection");

Then we get the column names from the schema (tblSchema is defined above). The column name is in the 1st column and the data type is in the 6th column:
tblSchema = R.GetSchemaTable();
foreach (DataRow schemarow in tblSchema.Rows)
{ tblConnection.Columns.Add(schemarow.ItemArray[0].ToString(), System.Type.GetType(schemarow.ItemArray[5].ToString())); }

Then we populate the Connection DataTable:
while (R.Read())
{ object[] ColArray = new object[R.FieldCount];
for (int i = 0; i < R.FieldCount; i++)
{ if (R[i] != null) ColArray[i] = R[i];}
tblConnection.LoadDataRow(ColArray, true);
}

Then we close the data reader and the connection because we have finished reading data from SSAS. After this everything happens in memory.
R.Close();
ASConn.Close(true);
ASConn.Dispose();

8. Add all three DataTables to the DataSet and join them

Create a DataSet and add the DataTables:
DataSet DS = new DataSet();
DS.Tables.Add(tblCommand);
DS.Tables.Add(tblSession);
DS.Tables.Add(tblConnection);

Create a relationship between Session and Command on Session SPID:
DataColumn ParentColumn = DS.Tables["Session"].Columns["SESSION_SPID"];
DataColumn ChildColumn = DS.Tables["Command"].Columns["SESSION_SPID"];
DataRelation Rel = new DataRelation("Session_Command", ParentColumn, ChildColumn);
DS.Tables["Command"].ParentRelations.Add(Rel);

Create a relationship between Connection and Session on Connection ID:
ParentColumn = DS.Tables["Connection"].Columns["CONNECTION_ID"];
ChildColumn = DS.Tables["Session"].Columns["SESSION_CONNECTION_ID"];
DataRelation Rel2 = new DataRelation("Connection_Session", ParentColumn, ChildColumn);
DS.Tables["Session"].ParentRelations.Add(Rel2);

Join the three tables (David M provides a good reference, using arrays):

First we create a DataTable called tblJoin to store the final result:
DataTable tblJoin = new DataTable("Join");

Then we create the columns of tblJoin by getting the column names and data types from tblConnection, tblSession and tblCommand. To avoid duplicate column names, we append "2" to the second occurrence of a column name.

for (int i = 0; i < tblConnection.Columns.Count; i++)
{ tblJoin.Columns.Add(tblConnection.Columns[i].ColumnName,
     tblConnection.Columns[i].DataType);
}

for (int i = 0; i < tblSession.Columns.Count; i++)
{ if (!tblJoin.Columns.Contains(tblSession.Columns[i].ColumnName))
    tblJoin.Columns.Add(tblSession.Columns[i].ColumnName, tblSession.Columns[i].DataType);
  else
    tblJoin.Columns.Add(tblSession.Columns[i].ColumnName + "2", tblSession.Columns[i].DataType);
}

for (int i = 0; i < tblCommand.Columns.Count; i++)
{ if (!tblJoin.Columns.Contains(tblCommand.Columns[i].ColumnName))
    tblJoin.Columns.Add(tblCommand.Columns[i].ColumnName, tblCommand.Columns[i].DataType);
  else
    tblJoin.Columns.Add(tblCommand.Columns[i].ColumnName + "2", tblCommand.Columns[i].DataType);
} 

tblJoin.BeginLoadData();
foreach (DataRow ConnectionRow in tblConnection.Rows)
{ DataRow[] SessionRows = ConnectionRow.GetChildRows(Rel2);
  if (SessionRows != null && SessionRows.Length > 0)
  { object[] ConnectionArray = ConnectionRow.ItemArray;
    foreach (DataRow SessionRow in SessionRows)
    { DataRow[] CommandRows = SessionRow.GetChildRows(Rel);
      if (CommandRows != null && CommandRows.Length > 0)
      { object[] SessionArray = SessionRow.ItemArray;
        foreach (DataRow CommandRow in CommandRows)
        { object[] CommandArray = CommandRow.ItemArray;
          object[] JoinArray = new object[ConnectionArray.Length
            + SessionArray.Length + CommandArray.Length];
          Array.Copy(ConnectionArray, 0, JoinArray, 0, ConnectionArray.Length);
          Array.Copy(SessionArray, 0, JoinArray, ConnectionArray.Length, SessionArray.Length);
          Array.Copy(CommandArray, 0, JoinArray, ConnectionArray.Length +
          SessionArray.Length, CommandArray.Length);
          tblJoin.LoadDataRow(JoinArray, true);
        }
      }
    }
  }
}

9. Display the joined table to the screen

Here I used a DataGridView control. If you want, its AutoSizeColumnsMode property can be set so the columns fit the widest content.
dataGridView1.DataSource = tblJoin;

We can also display only the columns that we want (not all of them).
We can put the server name in a text box so we can change the SSAS server easily.
The result contains 48 columns: (the tblConnection columns are not shown)

The code is attached (C#): App1v2 (it’s pptx, rename to .zip first).
The zip file contains the Visual Studio solution (VS 2008).
Version 1 (only Session and Command, no Connection):  App1 (it’s pptx, rename to .zip first)

Potential for further development:

  • Rename columns on the dataGridView to be more user friendly. Hide unnecessary columns.
  • Refresh every 2 seconds
  • Cascade drill down: you click + on a connection and it displays its sessions; you click + on a session and it displays its commands.
  • Execute other DMVs, such as displaying object resources (memory etc) used by each command or connection.
  • Give warning if the amount of resources used by a connection exceeds certain pre-set limits.

27 December 2010

Data Warehousing Interview Questions

Filed under: Data Warehousing — Vincent Rainardi @ 12:16 pm
Tags:

Following my article last week "SSAS Developer Interview Questions", which generated interesting responses, I thought I'd write a similar thing on data warehousing. Like before, I will be using LMH to show the complexity (Low, Medium, High). This time, I'll add the "Purpose", i.e. what the question is designed for.

Data warehousing skill never stands on its own. You don’t interview people just for their data warehousing knowledge. Usually you are either looking for an ETL developer (Informatica, DataStage, BODI, SSIS), a Microsoft BI developer (RS/AS/IS), a BO/Cognos (Report) Developer, a Data Warehouse Architect, a BI Solution Architect, an ETL Architect, a Data Architect or an Enterprise Architect. All these roles require data warehouse knowledge. So most likely you will be combining these questions with other questions, e.g. for an Informatica developer you will be combining them with Informatica questions.

When I am filling a role, what I am particularly looking for is whether the candidate can do the job or not. Nowadays it is different from 10 years ago. We now have Google and Bing. Anything we don't know we can quickly Google. So the amount of knowledge is not important now. What is important (to me) is a) experience, b) problem solving and c) character. I remember about 3 years ago I was interviewing a candidate for a data architect role. The candidate was a speaker at a data architecture conference, so we were convinced she must be very good. I asked her what the rule for the 2nd normal form was. She could not answer it. I asked about the 1st and 3rd normal forms and she could not answer them either. This is bread and butter. You can't do data modelling without knowing the normalisation rules.

But then in an interview with a bank 3 years ago I was also "blank" like her. I was an author of a data warehousing & BI book on SQL Server, so they thought highly of me with regards to SSIS, SSAS and SSRS. They asked me what the 5 tabs in SSIS BIDS were. I could mention 4 but could not remember the 5th one, even though I was using SSIS almost every day. Since then, when I interview I do not look for the amount of knowledge, but for whether the candidate can solve a problem. I remember one day my manager and I were interviewing for a Teradata developer role. I said to the candidate that the amount of Teradata knowledge he had was not important. Within 5 minutes of opening the manual or Googling he would be able to get that knowledge. So I said I would not ask him any Teradata SQL or BTEQ functions. Instead I gave him 2 real world problems that we were facing in the project and asked him to give me the solutions in about 5 minutes. The way he interrogated us with questions to get information about the project and finally suggested a brilliant solution really impressed us, so we offered him the job. I can completely understand that some people disagree with my approach. After that interview my boss pulled me aside and told me off: "You must not say that in front of the candidates Vincent. Of course the amount of Teradata knowledge they possess is important! Why do you think we hire them for?"

So in the interview questions below, which crystallised from my experience, I put both knowledge-based and experience/problem-solving questions. Generally I'm less interested in "theory only" questions, so I try to wrap them up in real world problem situations.

I'm a firm believer that experience is the best teacher. So at interview I always try to find out if the candidate has done it before, and I test them using everyday problems. People who have done data warehousing will surely have come across those problems and understand what the solutions are. People who are new to data warehousing, or only know the DW theory from books, wouldn't have encountered those problems and would not have a clue what the answers are.

1. Question: How do you implement Slowly Changing Dimension type 2? I am not looking for the definition, but the practical implementation e.g. table structure, ETL/loading. {L}

Answer: Create the dimension table as normal, i.e. first the dim key column as an integer, then the attributes as varchar (or varchar2 if you use Oracle). Then I'd create 3 additional columns: an IsCurrent flag, "Valid From" and "Valid To" (datetime columns). With regards to the ETL, I'd first check if the row already exists by comparing the natural key. If it exists, "expire" the existing row and insert a new row. Set the "Valid From" date to today's date or the current date time.

An experienced candidate (particularly a DW ETL developer) will not set the "Valid From" date to the current date time, but to the time the ETL started. This is so that all the rows in the same load have the same Valid From, which is 1 millisecond after the expiry time of the previous version, thus avoiding issues with ETL workflows that run across midnight.
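
To illustrate, here is a minimal T-SQL sketch of the table structure and the expire-and-insert logic described above. The table, column names and values (DimCustomer, CustomerCode, 'CUST001', etc.) are just hypothetical examples, and in practice this logic would normally live inside the ETL tool rather than in hand-written SQL:

-- Hypothetical SCD type 2 dimension (SQL Server syntax)
create table DimCustomer
( CustomerKey  int identity(1,1) primary key,  -- surrogate dim key
  CustomerCode varchar(20),                    -- natural/business key
  CustomerName varchar(100),
  City         varchar(50),
  IsCurrent    bit,
  ValidFrom    datetime,
  ValidTo      datetime
);

-- Use the ETL batch start time so every row in the same load gets the same Valid From
declare @EtlStartTime datetime = '2010-12-29 01:00:00';

-- Expire the current version of the changed row
update DimCustomer
set IsCurrent = 0,
    ValidTo   = dateadd(ms, -1, @EtlStartTime)  -- previous version ends just before the new one starts
where CustomerCode = 'CUST001' and IsCurrent = 1;

-- Insert the new version
insert into DimCustomer (CustomerCode, CustomerName, City, IsCurrent, ValidFrom, ValidTo)
values ('CUST001', 'Customer 1 Ltd', 'Leeds', 1, @EtlStartTime, '9999-12-31');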

Purpose: SCD 2 is one of the first things that we learn in data warehousing. It is considered fundamental. The purpose of this question is to separate the quality candidates from the ones who are bluffing. If the candidate cannot answer this question you should worry.

2. Question: How do you index a fact table? And explain why. {H}

Answer: Index all the dim key columns individually, non-clustered (SQL Server) or bitmap (Oracle). The dim key columns are used to join to the dimension tables, so if they are indexed the join will be faster. An exceptional candidate will suggest 3 additional things: a) index the fact key separately, b) consider creating a covering index in the right order on the combination of dim keys, and c) if the fact table is partitioned, the partitioning key must be included in all indexes.
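
As a rough sketch (SQL Server syntax, hypothetical table and column names), the indexing described above would look something like this:

-- Hypothetical fact table
create table FactSales
( FactSalesKey bigint identity(1,1) not null,  -- fact key
  DateKey      int not null,
  CustomerKey  int not null,
  ProductKey   int not null,
  SalesAmount  money
);

-- Index the fact key separately (here as the clustered index)
create clustered index IX_FactSales_FactKey on FactSales (FactSalesKey);

-- Individual non-clustered indexes on each dim key column
create nonclustered index IX_FactSales_DateKey     on FactSales (DateKey);
create nonclustered index IX_FactSales_CustomerKey on FactSales (CustomerKey);
create nonclustered index IX_FactSales_ProductKey  on FactSales (ProductKey);

-- Optionally, a covering index on a combination of dim keys used together in queries
create nonclustered index IX_FactSales_Date_Product
on FactSales (DateKey, ProductKey) include (SalesAmount);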

Purpose: Many people know data warehousing only in theory. This question is designed to separate those who have actually built a warehouse and those who haven’t.

3. Question: In the source system, your customer record changes like this: customer1 and customer2 now become one company called customer99. Explain a) the impact on the customer dim (SCD1), b) the impact on the fact tables. {M}

Answer: In the customer dim we update the customer1 row, changing it to customer99 (remember that it is SCD1). We soft delete the customer2 row by updating its IsActive flag column (hard delete is not recommended). On the fact table we find the surrogate keys for customer1 and 2 and update them with customer99's SK.
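
A minimal sketch of that in SQL, reusing the hypothetical DimCustomer and FactSales tables from the earlier examples (with an added IsActive flag, and customer codes CUST1, CUST2, CUST99 as placeholder values):

-- SCD1: overwrite customer1's attributes so the row becomes customer99
update DimCustomer
set CustomerCode = 'CUST99', CustomerName = 'Customer 99 Ltd'
where CustomerCode = 'CUST1';

-- Soft delete customer2 (no hard delete)
update DimCustomer
set IsActive = 0
where CustomerCode = 'CUST2';

-- Re-point the fact rows to customer99's surrogate key
-- (customer1's fact rows already carry that SK because its dim row was updated in place)
update FactSales
set CustomerKey = (select CustomerKey from DimCustomer where CustomerCode = 'CUST99')
where CustomerKey = (select CustomerKey from DimCustomer where CustomerCode = 'CUST2');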

Purpose: This is a common problem that everybody in data warehousing encounters. By asking this question we will know if the candidate has enough experience in data warehousing. If they have not come across this (probably they are new in DW), we want to know if they have the capability to deal with it or not.

4. Question: What are the differences between Kimball approach and Inmon’s? Which one is better and why? {L}

Answer: If you are looking for a junior role, e.g. developer, then the expected answer is: in Kimball we do dimensional modelling, i.e. fact and dim tables, whereas in Inmon we do CIF, i.e. an EDW in normalised form and we then create a DM/DDS from the EDW. Junior candidates usually prefer Kimball, because of query performance and flexibility, or because that's the only one they know, which is fine. But if you are interviewing for a senior role, e.g. senior data architect, then they need to say that the approach depends on the situation. Both Kimball's and Inmon's approaches have advantages and disadvantages. Some of the main reasons for having a normalised DW can be found here.

Purpose: a) to see if the candidate understands the core principles of data warehousing or they just "know the skin", b) to find out if the candidate is open minded, i.e. the solution depends on what we are trying to achieve (there's no right or wrong answer), or if they are blindly using Kimball for every situation.

5. Question: Suppose a fact row has unknown dim keys, do you load that row or not? Can you explain the advantage/disadvantages? {M}

Answer: We need to load that row so that the total of the measure/fact is correct. To enable us to load the row, we need to either set the unknown dim key to 0 or to the dim key of a newly created dim row. We can also choose not to load that row (so the total of the measure will be different from the source system) if the business requirements prefer it. In this case we load the fact row into a quarantine area complete with error processing, a DQ indicator and an audit log, and on the next day, after we receive the dim row, we load the fact row. This is commonly known as Late Arriving Dimension Rows and there are many sources for further information; one of the best is Bob Becker's article here in 2006. Others refer to this as Early Arriving Fact Rows, which Ralph Kimball explained here in 2004.
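
As a rough sketch of the "newly created dim row" option (often called an inferred member), assuming hypothetical StagingSales, DimCustomer and FactSales tables (with an IsInferred flag on the dimension):

-- Create an inferred dim row for any customer the fact refers to but we have not received yet
insert into DimCustomer (CustomerCode, CustomerName, IsInferred)
select distinct s.CustomerCode, 'Unknown', 1
from StagingSales s
where not exists (select 1 from DimCustomer d where d.CustomerCode = s.CustomerCode);

-- The fact load now always finds a dim key, so the measure totals stay correct
insert into FactSales (DateKey, CustomerKey, SalesAmount)
select s.DateKey, d.CustomerKey, s.SalesAmount
from StagingSales s
join DimCustomer d on d.CustomerCode = s.CustomerCode;

-- When the real customer details arrive later, the ETL overwrites the inferred row's
-- attributes (and resets IsInferred), so the fact rows do not need to be touched.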

Purpose: again this is a common problem that we encounter on a regular basis in data warehousing. With this question we want to see if the candidate's experience level is up to expectation or not.

6. Question: Please tell me your experience on your last 3 data warehouse projects. What were your roles in those projects? What were the issues and how did you solve them? {L}

Answer: There's no wrong or right answer here. With this question you are looking at a) whether they have done similar things to your current project, b) whether they have done the same role as the role you are offering, c) whether they have faced the same issues as your current DW project.

Purpose: Some of the reasons why we pay more to certain candidates compared to others are: a) they have done it before so they can deliver quicker than those who haven't, b) they come from our competitors so we would know what's happening there and we can make a better system than theirs, c) they have solved similar issues so we could "borrow their techniques".

7. Question: What are the advantages of having a normalised DW compared to dimensional DW? What are the advantages of dimensional DW compared to normalised DW? {M}

Answer: For the advantages of having a normalised DW see here and here. The advantages of a dimensional DW are: a) flexibility, e.g. we can accommodate changes in the requirements with minimal changes on the data model, b) performance, e.g. you can query it faster than a normalised model, c) it's quicker and simpler to develop than a normalised DW and easier to maintain.

Purpose: to see if the candidate has seen "the other side of the coin". Many people in data warehousing only know Kimball/dimensional. The second purpose of this question is to check if the candidate understands the benefit of dimensional modelling, which is a fundamental understanding in data warehousing.

8. Question: What is 3rd normal form? {L} Give me an example of a situation where the tables are not in 3rd NF, then make it 3rd NF. {M}

Answer: No non-key column is transitively dependent on the PK. For example, column2 depends on column1 (the PK) and column3 depends on column2; in this case column3 is "transitively dependent" on column1. To make it 3rd NF we need to split it into 2 tables: table1 with column1 & column2, and table2 with column2 & column3.
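
For example (a hypothetical Customer table): City determines Country, so Country is transitively dependent on the PK via City, and splitting the table fixes it:

-- Not in 3rd NF: Country depends on City, and City depends on the PK (CustomerId)
create table Customer_Not3NF
( CustomerId   int primary key,
  CustomerName varchar(100),
  City         varchar(50),
  Country      varchar(50)   -- transitively dependent on CustomerId via City
);

-- In 3rd NF: split into two tables
create table Customer
( CustomerId   int primary key,
  CustomerName varchar(100),
  City         varchar(50)
);

create table CityCountry
( City    varchar(50) primary key,
  Country varchar(50)
);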

Purpose: A lot of people talk about “3rd normal form” but they don’t know what it means. This is to test if the candidate is one of those people. If they can’t answer 3rd NF, ask 2nd NF. If they can’t answer 2nd NF, ask 1st NF.

9. Question: Tell me how to design a data warehouse, i.e. what are the steps of doing dimensional modelling? {M}

Answer: There are many ways, but it should not be too far from this order: 1. Understand the business process, 2. Declare the grain of the fact table, 3. Create the dimension tables including attributes, 4. Add the measures to the fact tables (from Kimball’s Toolkit book chapter 2). Step 3 and 4 could be reversed (add the fact first, then create the dims), but step 1 & 2 must be done in that order. Understanding the business process must always be the first, and declaring the grain must always be the second.

Purpose: This question is for data architect or data warehouse architect to see if they can do their job. It’s not a question for an ETL, report or cube developer.

10. Question: How do you join 2 fact tables? {H}

Answer: It’s a trap question. You don’t usually join 2 fact tables especially if they have different grain. When designing a dimensional model, you include all the necessary measures into the same fact table. If the measure you need is located on another fact table, then there’s something wrong with the design. You need to add that measure to the fact table you are working with. But what if the measure has a different grain? Then you add the lower grain measure to the higher grain fact table. What if the fact table you are working with has a lower grain? Then you need to get the business logic for allocating the measure.

It is possible to join 2 fact tables using the common dim keys, but the performance is usually horrible. For example: if FactTable1 has dim1key, dim2key, dim3key and FactTable2 has dim1key and dim2key, then join them like this:

select f2.dim1key, f2.dim2key, f1.measure1, f2.measure2
from
( select dim1key, dim2key, sum(measure1) as measure1
  from FactTable1
  group by dim1key, dim2key
) f1
join FactTable2 f2
on f1.dim1key = f2.dim1key and f1.dim2key = f2.dim2key

So if we don't join 2 fact tables that way, how do we do it? The answer is a fact key column. It is good practice (especially in SQL Server, because of the concept of the clustered index) to have a fact key column to enable us to identify rows on the fact table (see my article here). The performance would be better (than joining on dim keys), but you need to plan this in advance as you need to include the fact key column on the other fact table.

select f2.dim1key, f2.dim2key, f1.measure1, f2.measure2
from FactTable1 f1
join FactTable2 f2 on f2.fact1key = f1.factkey

I implemented this technique originally for self-joining, but then expanded its use to joining to other fact tables. But this must be used on an exception basis rather than as the norm.

Purpose: not to trap the candidate, of course, but to see if they have experience dealing with a problem which doesn't happen every day.

11. Question: How do you index a dimension table? {L}

Answer: a clustered index on the dim key, and non-clustered indexes (individual) on the attribute columns which are used in query where clauses.
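
A quick sketch (SQL Server syntax), assuming the hypothetical DimCustomer table from question 1. Its primary key already gives a clustered index on the dim key, so we only add the individual non-clustered indexes on the attributes used in where clauses:

-- Non-clustered indexes, one per attribute column used for filtering
create nonclustered index IX_DimCustomer_CustomerName on DimCustomer (CustomerName);
create nonclustered index IX_DimCustomer_City         on DimCustomer (City);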

Purpose: this question is critical to ask if you are looking for a Data Warehouse Architect (DWA) or a Data Architect (DA). Many DWAs and DAs only know the logical data model. Many of them don't know how to index. They don't know how different the physical tables are in Oracle compared to Teradata. This question is not essential if you are looking for a report or ETL developer. It's good for them to know, but it's not essential.

12. Question: Tell me what you know about William Inmon? {L} Alternatively: Ralph Kimball.

Answer: He was the one who introduced the concept of data warehousing. Arguably Barry Devlin was the first one, but he's not as popular as Inmon. If you ask who Barry Devlin or Claudia Imhoff is, 99.9% of the candidates wouldn't know. But every decent practitioner in data warehousing should know about Inmon and Kimball.

Purpose: to test if the candidate is a decent practitioner in data warehousing or not. You'll be surprised (especially if you are interviewing a report developer) how many candidates don't know the answer. If someone is applying for a BI architect role and has never heard of Inmon you should worry.

13. Question: What is the difference between a data mart and a data warehouse? {L}

Answer: Most candidates will answer that one is big and the other is small. Some good candidates (particularly Kimball practitioners) will say that a data mart is one star, whereas a DW is a collection of all the stars. An excellent candidate will say all of the above, plus that a DW could be the normalised model that stores the EDW, whereas a DM is the dimensional model containing 1-4 stars for a specific department (both relational DB and multidimensional DB).

Purpose: The question has 3 different levels of answer, so we can see how deep the candidate’s knowledge in data warehousing.

14. Question: What is the purpose of having a multidimensional database?

Answer: Many candidates don't know what a multidimensional database (MDB) is. They have heard about OLAP, but not MDB. So if the candidate looks puzzled, help them by saying "an MDB is an OLAP database". Many will say "Oh… I see" but actually they are still puzzled, so it will take a good few moments before they are back to earth again. So ask again: "What is the purpose of having an OLAP database?" The answer is performance and easier data exploration. An MDB (aka a cube) is a hundred times faster than a relational DB for returning an aggregate. An MDB will be very easy to navigate, drilling up and down the hierarchies and across attributes, exploring the data.

Purpose: This question is irrelevant to report or ETL developer, but a must for a cube developer and DWA/DA. Every decent cube developer (SSAS, Hyperion, Cognos) should be able to answer the question as it’s their bread and butter.

Q&A on: Impersonation

Filed under: Analysis Services — Vincent Rainardi @ 10:05 am
Tags:

Q: What’s the difference between impersonation and connection string?

A: Let’s assume that the source database is in Oracle, running on UNIX. In BIDS, in the data source we have Connection String and Impersonation Information:

  1. Connection String: The credential specified here is the database user account (user ID and password) that Oracle will accept, allowing us to read the rows in its tables.
  2. Impersonation Info: This is the Windows credential/user account under which Analysis Services will run when it tries to connect to that UNIX server to talk to Oracle.

For number 1 above, we need to ask the Oracle DBA (or Sybase, Informix, DB2, etc., depending on what the source RDBMS is) to create a database user account for us. If the RDBMS is running on Windows, this can be a Windows user account, provided this account is authenticated by the RDBMS. If the source RDBMS is SQL Server, then this can be a SQL Server login.

Number 2 must be a Windows account, because Analysis Services runs on Windows. Let’s say that our SSAS server is running (all the time) under a service account called Domain1\ServiceAccount1. In the example above, when SSAS connects to the UNIX server where the Oracle is running, we may want SSAS to connect using Domain1\ServiceAccount2 which has fewer privileges than ServiceAccount1. In this case ServiceAccount2 is the impersonation account.

Q: Are the connection string & impersonation in SSMS the same as the ones in BIDS?

A: Yes, they are the same. BIDS is just a tool to define connection string & impersonation. On the BIDS’ Solution Explorer, when we right click on Data Source, we are defining the connection string and the impersonation information. When we deploy the AS DB, these 2 things are copied across (implemented) to the AS server. That’s what we see in SSMS.

Q: Can we process a cube from a SQL server source using SQL Server authentication?

A: No, we can’t “process” a cube using a SQL Server account. SSAS needs a Windows account to impersonate when processing a SSAS database. “Process” doesn’t mean just connecting to the data source. “Process” means connect, read data, populate the dimensions and cubes, calculate aggregate and indexes, etc.

So yes, you can connect to the SQL Server source database using SQL Server authentication, but you can’t process the cube using a SQL Server account.

Q: If we can access the source database using connection string, why do we need impersonation?

A: If you use a user name and password in the connection string, then they are used to connect to the source database. The impersonation account is not only used to connect to the source relational database, but also to maintain the AS OLAP DB.

Q: Why can’t I use my Windows user account for impersonation?

A: You can use your Windows account, but there are disadvantages when the cube is migrated to Production: a) if you use your user account to connect to the relational DB, other developers can't maintain the BIDS project, and b) when the project is deployed to Production it won't run, because your account has access to the dev database but not to the prod database. Using a service account solves these problems. Impersonation is for specifying which account is used by the cube to connect to the relational database.

Analysis Services is a Windows service, running under the SSAS service account (to view this, double-click the SSAS service in the Control Panel, then click on the Log On tab; see screenshot below).

The SSAS service account does not necessarily have access to the database. Hence the concept of impersonation: at run time the SSAS service account pretends to be (impersonates), or uses, the impersonation account to access the database.

Q: How many modes of impersonation are there and what are the differences between them?

A: There are four impersonation modes:

a) Use a specific Windows user name & password: a domain user account created specifically for this purpose. This is the best practice because all the impersonation account needs is read access to the source database.

b) Use the service account: the SSAS service account is usually set up as a powerful domain account which has access to multiple SQL Servers. Using the SSAS service account gives you access to many databases on many SQL Servers. You don't have to worry about which account has access to which databases, as in a) above.

c) Use the credentials of the current user
SSAS will use your Windows account to access the database. When the cube is under development this is handy if you have access to the dev database. This way you can start the development straight away, without waiting for any account to be set up. If you hand over the BIDS project to a colleague developer, however, they need to have access to that source database too. Otherwise they can't process the cube.
Note that if you do this, when the cube is moved to production you will need to change the account to the proper production account as in a) above, or use the service account as in b) above. As a developer, your Windows account does not normally have access to production databases. I'd be surprised if it did, because it would create a security issue.

d) Inherit
SSAS will use whatever is specified in the Database Properties. To view the Database Properties used for this, in SSMS right-click an SSAS database and choose Properties. Under Security Settings there's Data Source Impersonation Info. This is what Inherit uses. See screenshot below.

25 December 2010

Who Are Using The Cubes Now?

Filed under: Analysis Services — Vincent Rainardi @ 1:26 pm
Tags:

To find out who is accessing the cubes right now, there are several ways:

  1. Query the SSAS query log table
  2. Use Activity Viewer application
  3. Query the Discover_Connection DMV (or the schema rowset XMLA)

1. Query log table

select MSOLAP_Database, max(StartTime) as last_access_time,
MSOLAP_User, count(*) as hits, sum(duration) as duration
from OlapQueryLog
where MSOLAP_User is not null -- exclude system/internal
and StartTime >= getdate()-0.5/24 -- in the last 1/2 hour
and MSOLAP_User <> 'Domain\MyUser' -- exclude me
and MSOLAP_User <> 'Domain\ServiceUser' -- exclude service account
group by MSOLAP_Database, MSOLAP_User
order by MSOLAP_Database desc, max(StartTime) desc

Note:

  1. We need to activate the query log first. See William E. Pearson III's article here and Edward Melomed's article here, both on how to configure the SSAS Query Log.
  2. If you do enable the query log on a busy production machine, I would advise to archive the entries to a “usage mart” (see here), and truncate the table every day/week.
  3. To see individual log entry remove the group by clause.
  4. To see all queries by a particular user:
    select * from OlapQueryLog where MSOLAP_User = 'Domain\User1' order by StartTime desc
  5. To find out what the user in #4 above was executing, look at the Dataset column. It is a series of 0s and 1s with several commas, like this:
    0100,00000,010,00,000
    This means: the measure group has 5 dimensions (the groups of digits separated by commas). The query is using dim1 and dim3 (the ones with a 1 in them). Dim1 has 4 attributes, dim2 has 5 attributes, dim3 has 3 attributes, dim4 has 2 attributes and dim5 has 3 attributes.
  6. To know what dim1 and dim2 are in #5 above, look at the cube structure tab in BIDS. That's the order. Alternatively you can script the cube to XMLA and look under the Cube Dimension section. Not the database dimensions but the cube dimensions. I should mention that sometimes the order of the dimensions in the Query Log doesn't match the cube dimension order. I have seen cases where the order of dims in the cube dimensions is 1,2,3,4,5,6,7,…,20,21,22 but the order in the Dataset column in the query log is 1,2,3,4,5,6,9,11,12,13,14,8,10,15,16,17,18,19,20,7,21,22. So I would advise testing: query the cube with just 1 dim on the rows and compare the value in the Dataset column.
  7. In #5 above, if all attributes in a dim have the value 1, it means that one of the attributes on that dim has a default value. For example:
    0100,00000,010,00,111111
    Dim5 has all attributes = 1, which means that 1 (or more) attribute on Dim5 has a default value.
  8. The number of digits in the attribute count in #5 above includes hidden attributes and disabled attributes, but excludes the hierarchies. For example: dim product has 3 attributes. Attribute 1 is disabled (enabled = false), attribute 2 is hidden (visible = false) and attribute 3 is active. The number of digits for dim product is 3, not 2 or 1, i.e. 000, not 00 or 0. If dim product has a hierarchy, the number of digits is still 3, i.e. 000 not 0000.

2. Use Activity Viewer application

Activity Viewer is an application that enables us to see what is currently happening on the SSAS server, i.e. users, sessions, queries, I/O and CPU. It is written by Luis Ballesteros, Alli Curley, Malini Jagannadhan, Meera Srinivasan, and others. The 2008 version is here. The 2005 version is here.

What I describe here is for the 2008 version. When started for the first time, it asks for the SSAS server name. This server name is stored in an XML file in the same folder as the exe. We can add more than 1 server.

Then click on "Users" on the left. This displays who's currently connected to the SSAS server. If a user has high memory or CPU, click on that user then click on Details at the bottom. This displays the session details. Scroll to the right to find the 'Session Last Command' column, which contains the MDX executed by that user. Unfortunately we can't copy the MDX.

If we want to terminate a session (I usually combine with OLAP Heart Beat), click on Cancel Session on the bottom right. Click ‘Back’ to go back to the user list.

3. Query the Discover_Connection DMV

Open a DMX window and type “select * from $system.discover_connections”

For details see SSAS DMV.

Comparing Oracle 11g OLAP Option and SSAS

For my next DWBI project, I’m thinking about implementing Oracle 11g OLAP option instead of SSAS, so I’m comparing them.

  1. Browsability: users can use OBI (e.g. Answers), BO (since 9i AW/2003), Excel or other tools to browse Oracle OLAP. Users are not expected to use SQL to query the cube. In AS we now have Crescent to look forward to. In the meantime we could use Tableau (see Jen's session at SQLBits 8), Strategy Companion, CubePlayer or Excel 2010.
  2. Oracle OLAP cubes are stored as a true multidimensional store (I'm interpreting this as compressed MOLAP), not as relational tables with materialized views/MVs. This MOLAP storage solves the classic ROLAP problem of "MV aggregates can't cover every permutation of every member of every attribute against other members of other attributes".
  3. Oracle mentions compression & cost-based aggregation (see page 7). SSAS is certainly compressed, but I’m not sure if SSAS aggregations are cost-based.
  4. Oracle OLAP has the ability to blend multidimensional and SQL models within the same query (see Bud's comments below). This "blend concept" is similar to SQL Server's BISM concept in Denali (2011). BISM can expose itself as a multidimensional interface (MDX) or a SQL/relational interface (DAX). The difference is that BISM is relational (tables), with "in-memory column" storage (VertiPaq), whereas 11g's OLAP is compressed MOLAP.
  5. Many-to-many relationships. If you are not familiar with M2M in SSAS, please read Marco's paper about the M2M Revolution. This is critical for many business situations; can't live without it. I have searched the whole library and I couldn't find M2M in 11g OLAP. Bud said we can do it (see below), and from what I understand here, this is done by simply mapping a base measure to the fact table containing the M2M relationship, just like the way we create any base measure. Not sure how we hide the measure though (because the measure is a dummy, only used to establish the M2M relationship between 2 dimensions).
  6. Yes, 11g OLAP Option can create calculated measures. But as we in the SSAS world know, a calculated measure such as YTD or growth is not a problem if it's only 1 or 2 measures, but if we have 50 measures we use a date tool aka date calculation dimension (see Chris, Alberto & Marco's most useful SSAS book here, chapter 6). It seems that we can't do this in 11g OLAP. Surely we can't create 50 * 10 calc measures? Well we can, but how about the usability?
    Update 27/12: I have read Bud's comment below, and I know we can create a "measure dim" and a "calculation dim", but in SSAS we need to create a hidden measure and write some "MDX calculations" to establish how the growth/2 yr growth, etc. will be calculated. I have read about creating calc measures in 11g here, but couldn't find how that "MDX calculation" is done in Oracle OLAP. Could anybody point me in a direction?
  7. Complex calculations/MDX, i.e. existing (results depend on what's on the slicer), hidden named sets which are used by a calc member, etc. How do we do that in 11g OLAP?
    Update 27/11: I have explored Simba's MDX Provider for Oracle OLAP (link, link, link) and could not find comprehensive OLAP/MDX functions such as this. It seems that Simba's product only provides connectivity, not the ability to do complex calculations. I would be grateful if somebody could enlighten me (a list of Simba's MDX functions for Oracle 11g OLAP). I have also looked at Bizzantz's DeltaMaster methods, but it's far below the richness of AS' MDX, e.g. scoping, existing, etc.
  8. MDX interface. 11g OLAP cube does mention MDX interface to OBIEE, but I expect this is basic “select … on columns, … on rows from cube” without the ability to use MS’ MDX functions such as NECJ, parallel period, etc. See page 14.
    Update 27/11: same comment as point 7. Update 28/11: see OLAP DML Functions here.
  9. Measure expression. This is required for example for currency conversion. If we don’t have measure expression in 11g OLAP (and we don’t have M2M either), how do we do currency conversion?
    Update 27/11: we can do this in 11g OLAP using a "calculated measure" (see here); we just need to specify the correct "calculation type" and "calculation". The functions are quite comprehensive with regards to period/time series computation, but outside that I'm not sure. Update 28/11: found it. It's called OLAP DML Functions, see here. It's very comprehensive.
  10. Dimensional/cell security. In AS I sort of take it for granted that group1 can only see member1 whereas group2 can only see member2, along with giving users different default members based on role. But in 11g OLAP?
    Update 28/11: as Bud said, yes we can do dimensional security in Oracle OLAP, see “Selecting data by criteria” here.
  11. In AS we can control the incremental cube processing, i.e. process changed partitions only, etc. It is mentioned that in 11g OLAP this can be automated (see the end of page 28 re Fast cube refresh), and that it can update incrementally by reading the log files (see the last paragraph of page 6). This is a good thing (compared to AS).

Oracle OLAP blog is here. The Wiki is here. 11g OLAP user guide is here. OLAP Option documentation is here.

If you find something incorrect above, I'd be grateful if you could let me know, either via the comments or vrainardi@gmail.com, and I will correct it. Thanks again to Bud Eggress for your corrections.

22 December 2010

Current Trend in Data Warehousing and Business Intelligence

Filed under: Business Intelligence,Data Warehousing,Other — Vincent Rainardi @ 9:24 am
Tags: , ,

It is near the end of the year, 22nd December 2010. It's been snowing heavily here in the UK, which is rare. As the year end approaches I'd like to write about the current trends in DWBI: reflect on what has happened in 2010, and what might take off in 2011-2012.

Before I start I'd like to remind the reader that this is what I see, so it's not the entire truth. It's only a fragment of the market, i.e. towards where I live (UK) and the products I'm familiar with. If you are a DWBI vendor, I'm sure you see a more complete picture than me in terms of your products. So please correct me.

Trends in DWBI Software

DWBI Suite: As a complete DWBI tool set/suite, by far MSBI is the most widely used (number of companies using it, not revenue). OBIEE, SAS, BO & Cognos follow at a far distance. Cognos is not a suite: where's the ETL tool? It's more appropriate to call it the "IBM suite", where they bundle InfoSphere DataStage and DB2 into the mix. There was an ETL tool in Cognos 8 BI (called Data Integration), but in V10 it's gone. SAS is not a complete DWBI suite as it has no DW; their ETL is DI Studio. BO has an ETL tool (BODI), but no DW/DB either.

ETL: In terms of features, Informatica 9 is at the top, but SSIS is the most widely used. By far. Others are DataStage (InfoSphere), OWB (now ODI in 11g), Ab Initio and BODI. For a more comprehensive list (of ETL tools) see my old article here. Some of my colleagues who have used Ab Initio claim that it is the fastest ETL tool due to in-memory lookup/pinning and parallel operations, and that it scales well. But I doubt it. Their approach is very secretive. We had to sign an NDA before seeing the presentation. This usually means that there is nothing behind the wall; it's a marketing technique where if you are low it's best to keep yourself hidden. ODI & BODI are about equal at number 3 behind SSIS & Informatica (in terms of usage). There has been a change of perception since 2009 about SSIS, that it is no longer an inferior product like DTS, and the market take-up since 2008 has been amazing. There are a lot more SSIS jobs than Informatica ones. Here in London many insurance companies use SSIS as their ETL tool, whereas banks use more Informatica, mainly for historical reasons. One or two banks still use Hummingbird Genio. I admit the new ODI has very good features (in particular the Knowledge Modules), which could position it above SSIS (in terms of features, not usage).

Reporting tool: unlike 8 years ago, companies are no longer buying a stand-alone reporting product like Crystal Reports. They buy a suite. Lots of companies use SSRS because it comes "free" with SQL Server and the IT dept personnel are able to use it (because they can download & install it, unlike BO & Cognos). Companies with more budget usually opt for BO, Cognos & MicroStrategy. BO XIR2 developer edition is £1750 per user whereas Webi is £500 per user. The Dev Ed allows you to create reports, whereas Webi is to read/run reports. Compare that price with SSRS, which is free. Which is why some companies are migrating from BO to RS (I've come across 3 companies in London this year). With regards to prices I want to point out that they vary widely from company to company, and not only for BO but for all vendors. Company A could pay £1000/user whereas company B pays only £200/user for the same product.

BPM (Business Performance Management), aka EPM (Enterprise PM) or CPM (Corporate PM): Hyperion (Oracle), SAS, Board, BO & Cognos. Board is semi Self Service BI. BO's offering is called Planning and Consolidation. In 2007 SAP bought OutlookSoft and Pilot but they seem to be gone now. Hyperion is by far the top in this market. Pity OutlookSoft is no longer there to challenge Hyperion. SAS looks good and has proper Financial Mgt.
The BPM definition is never clear, lots of marketing & labelling. Budget vs actual, planning, KPIs, all thrown into the mix. Traditionally it's Financial Mgt, like Hyperion & SAS, but then all sorts of BI are thrown in, including dashboards, scorecards & KPIs. So when people say B/E/CPM, clarify first what they mean by that.

OLAP: SSAS is as strong as ever. By far the most widely used cube tool. I would estimate AS is probably 60% of the market now (usage, not revenue, i.e. out of 100 companies who use cubes, 60 of them use SSAS; my estimate, not market research). I've seen a few banks now use SSAS. Oracle Essbase & IBM PowerPlay follow far behind. Oracle 11g OLAP Option could be a contender in the next 2 years. MicroStrategy is a respectable player in OLAP and has been in the market for a very long time (20 years?), consistent with ROLAP since the beginning. It is very useful to read their comparison to other vendors here. There is a big change going on with SSAS in terms of UDM & in-memory OLAP. Read about MS BISM (Semantic Model) here (TK Anand), here (Chris Webb), here (Boyan Penev), here (Teo Lachev), here (Marco Russo), here (Jason Thomas) and here (Thomas Ivarsson).

RDBMS (SMP): the most widely used RDBMS for DW is without a doubt SQL Server (usage, not revenue). By far. Then Oracle, then DB2. Oracle and especially DB2 are used only by big companies, whereas SQL Server is used by companies of all sizes. Sybase & Informix are not in the game. Just like MySQL & Ingres, I only hear about once a year (if that) of people using Sybase or Informix for data warehousing. There has been a total change of perception with regards to SQL Server's capacity to handle big DWs. 5 years ago the perception (especially amongst London banks) was that SQL Server can't handle a big DW (say upwards of 3 TB). So they used Oracle/DB2 for "big DWs" and SQL Server was approved for "small DWs". The perception now is that SQL Server 2008's capability for large DWs is equal to Oracle 11g. I know an investment bank that uses SQL 2008 for their huge DW (upwards of 15 TB). This take-up is largely due to SQL 2008's partitioning capability. Both Oracle and DB2 are still very respectable DW platforms. Some people still think that they are "slightly better" than SQL Server in terms of performance for >5 TB DWs.

MPP: I think Teradata & Netezza are still leading (again, number of companies using it, not revenue). Exadata is gaining popularity, but it's kind of 1/4 MPP. Definitely not full MPP. Not even half. Read here (vs Twinfin) and here (Teradata). Good if you are thinking transactional, but for an MPP DW it's a question mark. It seems that PDW & NeoView are not in the game. This year I heard their promotions/marketing but not their implementations. Even Greenplum is more used than PDW & NeoView (London, not sure about the US & other countries). In my opinion if you use MS BI (SSIS, SSAS, SSRS) and are choosing an MPP, it is better to choose Teradata than PDW. We have OLE DB for Teradata, which enables IS, RS & AS to connect to Teradata better. Read about MS-TD integration here and here. I was disappointed to hear that PDW had technical problems and that the take-up is not good. When I wrote this I was very optimistic about PDW. Kognitio WX2 is MPP, but using commodity blades.

Visualisation: What's that? OLAP client? Reporting client? Charting? Well, all of the above. MicroStrategy, CubePlayer, Tableau, QlikView, Excel, SSRS, SCA (Strategy Companion Analyzer), Panorama, Crescent. For an SSAS client, I think Crescent will be the winner, but that's 12 months away. How about now? On features, not Panorama, but Tableau, CubePlayer and SCA. On architecture/installation, etc., not Excel, but SCA. Vidas Matelis composed a comprehensive list of SSAS clients here. The issue with SSAS clients is not the features, but the vendors. The majority are small vendors. Which ones will still stand in 10 years' time? How can we be sure that they will not be bought and then killed like ProClarity?

Trend in DWBI Jobs

It is important for DWBI architects and managers to consider the resource/people/skill factor when choosing DWBI software. Employees & contractors also need to actively look at the trend, to consider what training they need to do in 2011. Agencies know this well, but employees/contractors usually don't: the best site for salary/rate figures and trends is ITJobsWatch. Below I only mention the rate (contract); for salary just click the "Permanent" tab on the IT Jobs Watch site.
ETL: The Informatica rate is higher than SSIS. In the London market (not UK rates), the current SSIS market rate is £350-450 DOE, whereas Informatica is £500-600 (because you can't download the product like SSIS). Note that the £450 SSIS person also comes with SSRS & SQL Server skills and to a degree SSAS, so it's a clear advantage to the recruiter.
OBIEE: is gaining popularity. More people want OBIEE Developers. This is caused by version 11g. Good rate too: OBIEE 11g was £500 in Nov but is now £600. Big shortage (many installations but few developers). Unlike MS BI, which is saturated.
MPP: No change in Teradata & Netezza skills. Still very low volume. Stable/no change in rates, 375-450. Risky to get into because of low volume. And for most people, it's impossible to get into. Not only can you not download an evaluation version, but where can you get the hardware from? The only chance is taking training at Marylebone (Teradata) or Marble Arch (Netezza).
Solvency II: Solvency II is the name of the game now, driving lots of DWBI projects. So people with Lloyd's market knowledge have an advantage. And this trend is going on until 2012, perhaps 2013. Again it's London (and Europe, e.g. Dublin, Brussels, Zurich) but it does not happen in, for example, Boston or Melbourne.
BO: BO is interesting: decreasing volume (companies are moving to RS) but increasing rate (400-450): see here. The best thing about BO is the chance to get SAP experience.
QlikView: the skill is more in demand compared to last year (the number of jobs doubled) but rates decreased (understandably): was 400, now 350. Again this is the London rate; outside London is lower, see here.

Future Trend in BI (not DW but BI)

Some people ask me what's next in BI. This is my usual answer: there are 2 things that will gain popularity in the next 2 years.
Self Service BI. It has already gained popularity. Users want to compose their own reports and connect to the DW/DM themselves. I can imagine it will be a mess in a few years, but there you go, that's the trend. Mainstream BI will still be built by IT; only the "personal" views are in the users' hands. The main players are QlikView & PowerPivot, and Board to a degree. Nobody has heard of Rosslyn yet ("What's that?" is the response I'm getting at the moment). In terms of features, QlikView is still perceived to be much better than PowerPivot, as PowerPivot is still version 1. But the take-up is good. I've seen a few companies using it, so it's only a matter of time before PowerPivot overtakes QlikView.
Streaming BI. In weather and science research streaming BI is old news, temperature analysis for example. But in commercial companies this is new. StreamInsight is the major player (part of SQL Server 2008 R2). I think it will take off in the next 2 years. People want to analyse the data as soon as it comes in. FX trading is an example (well, other asset classes too). Any kind of online investment trading really. Oh yes, and risk analytics. It's kind of not a stream at the moment, but quite a big flow. It's kind of against all DW principles, that's why it's a bit odd. DW is batch. Nightly. Not second-by-second streaming. It's a revolution, I know. But it's coming. Read here for an intro and here for the architecture.

People make mistakes, so if you know anything is incorrect please correct me. As usual I welcome discussion & questions at vrainardi@gmail.com.

Vincent Rainardi, 22/12/2010

18 December 2010

Reporting and Analytics

Filed under: Business Intelligence — Vincent Rainardi @ 8:11 am
Tags:

I've touched on this topic several times in different articles, but I'll write about it again, specifically just Reporting and Analytics, because I came across it again yesterday and it is a very important concept in BI, as a user and, equally important, as a BI architect.

As a user, you would want to do 2 things in BI: reporting and analytics.

  1. Reporting is where you read a report. Columnar tables. Layout. Appearance, corporate logo, cosmetics. Regulatory (to get certain information at a certain time to a certain government body). Font, header, footer. You know what information you need. You need a report built for you to get the data out (of the warehouse, of the ODS, and of the transaction system). So you have warehouse reports vs operational reports. The layout is very important to you. It has to contain specific columns, etc.
  2. Analytics is where you explore a cube. Slice and dice. Browsing the data. Drill down, drill up, drill across. You have an idea of what you are looking for, but you don't know the specifics yet. For example, you want to find the trend of premium and claim reserves across different classes but you don't know which measures and for which period. You want all the data in the system at your fingertips. You want to explore the data to find things. Things that you don't know yet for sure, but the further you explore the data, the clearer the picture in your head. Then after you get a clear picture, you want a dashboard. Things you need to look at every day. You want KPIs, your team's targets. You want BPM, tracking your budgets.

Years ago, in many companies where ERP was introduced, there was a need for reporting. Users ordered reports from “the IT department”. 4GL reports, Easytrieve, Cobol, AS/400 reports. Everything was on the mainframe. Then minicomputers. Then everything was on the PC. R&R was very popular. Clipper. FoxPro. Access. Suddenly everything became PC based. Crystal Reports was number 1. Still, the essence is the same: users ask IT to extract the information via reports. The “report developer” profession at that time was very popular.

Then the BI era. Dimensional modelling. OLAP. Cognos, Essbase/Hyperion. MicroStrategy. Cartesis. The concepts of dashboards, KPIs, scorecards. BPM/EPM (Hyperion was right at the front). The whole world was doing BI & DW. Until today. But the old reporting still exists. Not as elegant as before, but still required. Strongly required. For routine, monthly progress reports (sales, production, etc). For regulatory purposes (FSA, etc). For senior management. Operational reports for the sales managers and branch managers. Reports still exist strongly today. But they are never the selling point when a BI consultant pitches their products & services. Usually it’s analytics and visualisation; they are always the key selling points. Secondary selling points are DQ, MDM, DG (data governance), DI (data integration) and the “realtime-ness”.

So if someone asks whether we need reporting or analytics, the answer is clearly both. Reporting is the basis of “information extract”. Users can survive with reporting alone, but not for long. Your company can’t excel or perform without “BI”, i.e. the analytics. In my book I said there are 3 aspects of BI: reporting, analytics and data mining. Forget about mining for now (predictive analytics, etc). Let’s focus on the first two; this article is about the first two, reporting and analytics. You need both. Organic growth of “information extract” within a company will start with reporting, but then move on to “using that extract intelligently”, “exploration”, “trend analysis”, etc, and before you know it the company is in the middle of BI. Dashboards, KPIs, BPM. BI has major selling points (from the IT director to the board). Hence it has been taking off everywhere in the last 5 years. Yes, the economic downturn in the last 3 years killed off many BI projects, but it is still one of the most important items on the agenda of CIOs. Even with a limited budget, some BI projects still need to continue. Tactical ones especially. Within the targeted business area, usually the one that is a key revenue generator for the business, e.g. credit (bank), underwriting (insurance), sales (retail), tariff (telco).

11 December 2010

Data Warehousing Interview Questions

Filed under: Data Warehousing,Other — Vincent Rainardi @ 10:59 pm
Tags: ,

Following my article last week, “SSAS Developer Interview Questions”, which generated quite a lot of responses, I thought I’d write a similar thing on data warehousing. Like before, I will be using LMH to show the complexity (Low, Medium, High). This time, I’ll add the “purpose”, i.e. what each question is designed for.

Data warehousing skill never stands on its own. You don’t interview people just for their data warehousing knowledge. Usually you are either looking for an ETL developer (Informatica, DataStage, BODI, SSIS), a Microsoft BI developer (RS/AS/IS), a BO/Cognos (Report) Developer, a Data Warehouse Architect, a BI Solution Architect, an ETL Architect, a Data Architect or an Enterprise Architect. All these roles require data warehousing knowledge. So most likely you will be combining these questions with other questions, e.g. for an Informatica developer you will be combining them with Informatica questions.

When I interview to fill a role, what I look for in particular is whether the candidate can do the job or not. Nowadays it is different from 10 years ago. We now have Google and Bing; anything we don’t know we can quickly Google. So the amount of knowledge is not that important now. What is important (to me) is a) experience, b) problem solving and c) character. I remember about 3 years ago I was interviewing a candidate for a data architect role. The candidate was a speaker at a data architecture conference, so we were convinced she must be very good. I asked her what the rule for the 2nd normal form was. She could not answer it. I asked about the 1st and 3rd normal forms and she could not answer those either. This is bread and butter. You can’t do data modelling without knowing the normalisation rules.

But then in an interview with a bank 3 years ago I was also “blank” like her. I was the author of a SQL Server data warehousing & BI book, so they thought highly of me with regards to SSIS, SSAS and SSRS. They asked me what the 5 tabs in SSIS BIDS were. I could mention 4 but could not remember the 5th one, even though I was using SSIS almost every day. Since then, when I interview, I don’t look for the amount of knowledge but whether the candidate can solve problems instead. I remember one day my manager and I were interviewing for a Teradata developer role. I said to the candidate that the amount of Teradata knowledge he had was not important. Within 5 minutes of opening the manual or Googling he would be able to get that information. So I said I would not ask him any Teradata SQL or BTEQ functions. Instead I gave him 2 real world problems that we were facing in the project and asked him to give me the solutions in about 5 minutes. The way he interrogated us with questions to get information about the project and finally suggested a good solution really impressed us, so we offered him the job. I can completely understand that some people disagree with my approach. After that interview my boss pulled me aside and told me off: “You must not say that in front of the candidates Vincent. Of course the amount of Teradata knowledge they possess is important! Why do you think we hire them for?”

So in the interview questions below, which crystallise from my experience, I include both knowledge-based and experience/problem-solving questions. Generally I’m less interested in “theory only” questions, so I try to wrap them up in real world problems/situations.

I’m a firm believer that experience is the best teacher. So at interviews I always try to find out if the candidate has done it before, and I test them using everyday problems. People who have done data warehousing will surely have come across those problems and understand what the solutions are. People who are new to data warehousing, or only know the DW theory from books, wouldn’t have encountered those problems and would not have a clue what the answers are.

1. Question: How do you implement Slowly Changing Dimension type 2? I am not looking for the definition, but the practical implementation e.g. table structure, ETL/loading. {M}

Answer: Create the dimension table as normal, i.e. first the dim key column as an integer, then the attributes as varchar (or varchar2 if you use Oracle). Then I’d create 3 additional columns: IsCurrent flag, “Valid From” and “Valid To” (they are datetime columns). With regards to the ETL, I’d check first if the row already exists by comparing the natural key. If it exists then “expire the row” and insert a new row. Set the “Valid From” date to today’s date or the current date time.

An experienced candidate (particularly a DW ETL developer) will not set the “Valid From” date to the current date time, but to the time when the ETL batch started. This is so that all the rows in the same load have the same Valid From, which is 1 millisecond after the expiry time of the previous version, thus avoiding issues with ETL workflows that run across midnight.
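To make this concrete, here is a minimal T-SQL sketch of the expire-and-insert pattern described above. The table, column and variable names (DimCustomer, CustomerCode, @EtlStartTime, etc.) are hypothetical, not part of the question:

declare @EtlStartTime datetime = getdate()   -- in practice, the start time of the ETL batch
declare @CustomerCode varchar(20) = 'CUST001', @NewName varchar(100) = 'New Name', @NewCity varchar(50) = 'London'

create table DimCustomer
( CustomerKey int identity(1,1) primary key,  -- surrogate dim key
  CustomerCode varchar(20),                   -- natural key from the source system
  CustomerName varchar(100),
  City varchar(50),
  IsCurrent bit,
  ValidFrom datetime,
  ValidTo datetime
)

-- expire the current version of a changed customer
update DimCustomer
set IsCurrent = 0, ValidTo = @EtlStartTime
where CustomerCode = @CustomerCode and IsCurrent = 1

-- insert the new version, valid from the ETL start time
insert into DimCustomer (CustomerCode, CustomerName, City, IsCurrent, ValidFrom, ValidTo)
values (@CustomerCode, @NewName, @NewCity, 1, @EtlStartTime, '9999-12-31')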

Purpose: SCD 2 is one of the first things that we learn in data warehousing. It is considered basic/fundamental. The purpose of this question is to separate the quality candidates from the ones who are bluffing. If the candidate cannot answer this question you should worry.

2. Question: How do you index a fact table? And explain why. {H}

Answer: Index all the dim key columns, individually, non clustered (SQL Server) or bitmap (Oracle). The dim key columns are used to join to the dimension tables, so if they are indexed the join will be faster. An exceptional candidate will suggest 3 additional things: a) index the fact key separately, b) consider creating a covering index in the right order on the combination of dim keys, and c) if the fact table is partitioned the partitioning key must be included in all indexes.
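As an illustration, a hedged sketch in SQL Server syntax; FactSales and its columns are made-up names:

-- one non clustered index per dim key column, to speed up joins to the dimensions
create nonclustered index IX_FactSales_DateKey     on FactSales (DateKey)
create nonclustered index IX_FactSales_CustomerKey on FactSales (CustomerKey)
create nonclustered index IX_FactSales_ProductKey  on FactSales (ProductKey)

-- the optional extras mentioned above: index the fact key separately,
-- and a covering index on the combination of dim keys in the right order
create unique nonclustered index IX_FactSales_FactKey on FactSales (FactSalesKey)
create nonclustered index IX_FactSales_Covering on FactSales (DateKey, CustomerKey, ProductKey) include (SalesAmount)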

Purpose: Many people know data warehousing only in theory or only in logical data model. This question is designed to separate those who have actually built a data warehouse and those who haven’t.

3. Question: In the source system, your customer record changes like this: customer1 and customer2 now become one company called customer99. Explain a) the impact to the customer dim (SCD1), b) the impact to the fact tables. {M}

Answer: In the customer dim we update the customer1 row, changing it to customer99 (remember that it is SCD1). We do soft delete on the customer2 row by updating the IsActive flag column (hard delete is not recommended). On the fact table we find the Surrogate Key for customer1 and 2 and update it with customer99’s SK.
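A hedged sketch of the answer in T-SQL; DimCustomer, FactSales and the column names are hypothetical:

-- SCD 1: overwrite customer1's row in place so it becomes customer99
update DimCustomer
set CustomerCode = 'customer99', CustomerName = 'customer99'
where CustomerCode = 'customer1'

-- soft delete the customer2 row (no hard delete)
update DimCustomer
set IsActive = 0
where CustomerCode = 'customer2'

-- repoint customer2's fact rows to the surviving surrogate key
update f
set f.CustomerKey = (select CustomerKey from DimCustomer where CustomerCode = 'customer99')
from FactSales f
where f.CustomerKey = (select CustomerKey from DimCustomer where CustomerCode = 'customer2')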

Purpose: This is a common problem that everybody in data warehousing encounters. By asking this question we will know if the candidate has enough experience in data warehousing. If they have not come across this (probably they are new in DW), we want to know if they have the capability to deal with it or not.

4. Question: What are the differences between Kimball approach and Inmon’s? Which one is better and why? {L}

Answer: If you are looking for a junior role, e.g. a developer, then the expected answer is: in Kimball we do dimensional modelling, i.e. fact and dim tables, whereas in Inmon’s we do CIF, i.e. an EDW in normalised form and we then create a DM/DDS from the EDW. Junior candidates usually prefer Kimball, because of query performance and flexibility, or because that’s the only one they know; which is fine. But if you are interviewing for a senior role, e.g. senior data architect, then they need to say that the approach depends on the situation. Both Kimball’s & Inmon’s approaches have advantages and disadvantages. I explained some of the main reasons for having a normalised DW here.

Purpose: a) to see if the candidate understands the core principles of data warehousing or whether they only know the surface, b) to find out if the candidate is open minded, i.e. that the solution depends on what we are trying to achieve (there’s no right or wrong answer), or if they blindly use Kimball for every situation.

5. Question: Suppose a fact row has unknown dim keys, do you load that row or not? Can you explain the advantage/disadvantages? {M}

Answer: We need to load that row so that the total of the measure/fact is correct. To enable us to load the row, we need to either set the dim key to 0 (the unknown row) or to the dim key of a newly created dim row. We can also choose not to load that row (so the total of the measure will be different from the source system) if the business requirements prefer it. In this case we load the fact row into a quarantine area complete with error processing, DQ indicator and audit log. On the next day, after we receive the dim row, we load the fact row. This is commonly known as Late Arriving Dimension Rows and there are many sources for further information; one of the best is Bob Becker’s article here in 2006. Others refer to this as Early Arriving Fact Rows, which Ralph Kimball explained here in 2004.
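A minimal sketch of the first option (load the row against the unknown member), assuming hypothetical StageSales, DimCustomer and FactSales tables where dim key 0 is the unknown row:

insert into FactSales (DateKey, CustomerKey, SalesAmount)
select s.DateKey,
       coalesce(d.CustomerKey, 0) as CustomerKey,   -- no dim match yet: point to the unknown member (key 0)
       s.SalesAmount
from StageSales s
left join DimCustomer d on d.CustomerCode = s.CustomerCode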

Purpose: again, this is a common problem that we encounter on a regular basis in data warehousing. With this question we want to see if the candidate’s experience level is up to expectation or not.

6. Question: Please tell me your experience on your last 3 data warehouse projects. What were your roles in those projects? What were the issues and how did you solve them? {L}

Answer: There’s no right or wrong answer here. With this question you are looking for a) whether they have done things similar to your current project, b) whether they have done the same role as the one you are offering, c) whether they have faced the same issues as your current DW project.

Purpose: Some of the reasons why we pay more for certain candidates compared to others are: a) they have done it before, so they can deliver quicker than those who haven’t, b) they come from our competitors, so we would know what’s happening there and we can build a better system than theirs, c) they have solved similar issues so we could “borrow their techniques”.

7. Question: What are the advantages of having a normalised DW compared to dimensional DW? What are the advantages of dimensional DW compared to normalised DW? {M}

Answer: For advantages of having a normalised DW see here and here. The advantages of dimensional DW are: a) flexibility, e.g. we can accommodate changes in the requirements with minimal changes on the data model, b) performance, e.g. you can query it faster than normalised model, c) it’s quicker and simpler to develop than normalised DW and easier to maintain.

Purpose: to see if the candidate has seen “the other side of the coin”. Many people in data warehousing only know Kimball/dimensional. The second purpose of this question is to check if the candidate understands the benefit of dimensional modelling, which is a fundamental understanding in data warehousing.

8. Question: What is 3rd normal form? {L} Give me an example of a situation where the tables are not in 3rd NF, then make it 3rd NF. {M}

Answer: No non-key column is transitively dependent on the PK. For example, suppose column1 is the PK, column2 depends on column1, and column3 depends on column2. In this case column3 is “transitively dependent” on column1. To make it 3rd NF we need to split it into 2 tables: table1 with column1 & column2, and table2 with column2 & column3.
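As a concrete illustration (the Employee/Department example is mine, not part of the question):

-- before (not in 3NF): DeptName depends on DeptCode, not directly on the PK
create table EmployeeBefore
( EmployeeId int primary key,
  EmployeeName varchar(100),
  DeptCode varchar(10),
  DeptName varchar(100)      -- transitively dependent on EmployeeId via DeptCode
)

-- after (3NF): the transitive dependency is moved to its own table
create table Employee
( EmployeeId int primary key,
  EmployeeName varchar(100),
  DeptCode varchar(10)       -- foreign key to Department
)

create table Department
( DeptCode varchar(10) primary key,
  DeptName varchar(100)
)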

Purpose: A lot of people talk about “3rd normal form” but they don’t know what it means. This is to test if the candidate is one of those people. If they can’t answer 3rd NF, ask 2nd NF. If they can’t answer 2nd NF, ask 1st NF.

9. Question: Tell me how to design a data warehouse, i.e. what are the steps of doing dimensional modelling? {M}

Answer: There are many ways, but it should not be too far from this order: 1. Understand the business process, 2. Declare the grain of the fact table, 3. Create the dimension tables including attributes, 4. Add the measures to the fact tables (from Kimball’s Toolkit book chapter 2). Step 3 and 4 could be reversed (add the fact first, then create the dims), but step 1 & 2 must be done in that order. Understanding the business process must always be the first, and declaring the grain must always be the second.

Purpose: This question is for data architect or data warehouse architect to see if they can do their job. It’s not a question for an ETL, report or cube developer.

10. Question: How do you join 2 fact tables? {H}

Answer: It’s a trap question. You don’t usually join 2 fact tables, especially if they have different grains. When designing a dimensional model, you include all the necessary measures in the same fact table. If the measure you need is located in another fact table, then there’s something wrong with the design: you need to add that measure to the fact table you are working with. But what if the measure has a different grain? Then you add the lower grain measure to the higher grain fact table. What if the fact table you are working with has a lower grain? Then you need to get the business logic for allocating the measure.

It is possible to join 2 fact tables, i.e. using the common dim keys, but the performance is usually horrible, hence people don’t do this in practice, except for small fact tables (<100k rows). For example: if FactTable1 has dim1key, dim2key, dim3key and FactTable2 has dim1key and dim2key, then you could join them like this:

select f2.dim1key, f2.dim2key, f1.measure1, f2.measure2
from
( select dim1key, dim2key, sum(measure1) as measure1
  from FactTable1
  group by dim1key, dim2key
) f1
join FactTable2 f2
on f1.dim1key = f2.dim1key and f1.dim2key = f2.dim2key

So if we don’t join 2 fact tables that way, how do we do it? The answer is by using the fact key column. It is good practice (especially in SQL Server, because of the concept of the clustered index) to have a fact key column to enable us to identify rows in the fact table (see my article here). The performance would be much better (than joining on dim keys), but you need to plan this in advance as you need to include the fact key column on the other fact table.

select f2.dim1key, f2.dim2key, f1.measure1, f2.measure2
from FactTable1 f1
join FactTable2 f2
on f2.fact1key = f1.factkey

I originally implemented this technique for self-joining, but then expanded the usage to join to other fact tables. But this must be used on an exception basis rather than as the norm.

Purpose: not to trap the candidate of course. But to see if they have the experience dealing with a problem which doesn’t happen every day.

11. Question: How do you index a dimension table? {L}

Answer: a clustered index on the dim key, and non clustered indexes (individual) on the attribute columns which are used in the query’s where clause.
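A hedged sketch in SQL Server syntax, with made-up table and column names:

-- clustered index on the surrogate dim key
create table DimProduct
( ProductKey int not null,
  ProductCode varchar(20),
  ProductGroup varchar(50),
  constraint PK_DimProduct primary key clustered (ProductKey)
)

-- individual non clustered indexes on attributes used in where clauses
create nonclustered index IX_DimProduct_ProductCode  on DimProduct (ProductCode)
create nonclustered index IX_DimProduct_ProductGroup on DimProduct (ProductGroup)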

Purpose: this question is critical if you are looking for a Data Warehouse Architect (DWA) or a Data Architect (DA). Many DWAs and DAs only know the logical data model. Many of them don’t know how to index. They don’t know how different the physical tables are in Oracle compared to Teradata. This question is not essential if you are looking for a report or ETL developer. It’s good for them to know, but it’s not essential.

12. Question: Tell me what you know about William Inmon? {L} Alternatively: Ralph Kimball.

Answer: He was the one who introduced the concept of data warehousing. Arguably Barry Devlin was the first, but he’s not as popular as Inmon. If you ask who Barry Devlin or Claudia Imhoff is, 99.9% of candidates wouldn’t know. But every decent practitioner in data warehousing should know about Inmon and Kimball.

Purpose: to test whether the candidate is a decent practitioner in data warehousing or not. You’ll be surprised (especially if you are interviewing a report developer) how many candidates don’t know the answer. If someone is applying for a BI architect role and has never heard of Inmon, you should worry.

13. Question: How do we build a real time data warehouse? {H}

Answer: If the candidate asks “Do you mean real time or near real time?” it may indicate that they have a good amount of experience dealing with this in the past. There are two ways we build a real time data warehouse (and this is applicable for both normalised DW and dimensional DW):

a) By storing previous periods’ data in the warehouse and then putting a view on top of it pointing to the source system’s current period data (see the sketch after this answer). “Current period” is usually 1 day in a DW, but in some industries, e.g. online trading and ecommerce, it is 1 hour.

b) By storing previous periods’ data in the warehouse then use some kind of synchronous mechanism to propagate current period’s data. An example of synchronous data propagation mechanism is SQL Server 2008’s Change Tracking or the old school’s trigger.

Near real time DW is built using asynchronous data propagation mechanism, aka mini batch (2-5 mins frequency) or micro batch (30s – 1.5 mins frequency).
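A minimal sketch of approach (a), assuming a hypothetical FactOrder fact table, a DimCustomer dimension and a linked server called SourceServer pointing at the source system:

create view dbo.FactOrderRealTime as
-- previous periods: already loaded into the warehouse by the nightly batch
select OrderDateKey, CustomerKey, OrderAmount
from dbo.FactOrder
where OrderDateKey < cast(convert(char(8), getdate(), 112) as int)
union all
-- current period: read straight from the source system at query time
select cast(convert(char(8), o.OrderDate, 112) as int) as OrderDateKey,
       c.CustomerKey,
       o.OrderAmount
from SourceServer.SalesDB.dbo.[Order] o
join dbo.DimCustomer c on c.CustomerCode = o.CustomerCode
where o.OrderDate >= cast(getdate() as date)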

Purpose: to test if the candidate understands complex, non-traditional mechanisms and follows the latest trends. Real time DW was considered impossible 5 years ago and has only been developed in the last 5 years. If the DW is normalised it’s easier to make it real time than if the DW is dimensional, as there are dim key lookups involved.

14. Question: What is the difference between a data mart and a data warehouse? {L}

Answer: Most candidates will answer that one is big and the other is small. Some good candidates (particularly Kimball practitioners) will say that a data mart is one star, whereas a DW is the collection of all the stars. An excellent candidate will give all the above answers, plus they will say that a DW could be the normalised model that stores the EDW, whereas a DM is the dimensional model containing 1-4 stars for a specific department (both relational DB and multidimensional DB).

Purpose: The question has 3 different levels of answer, so we can see how deep the candidate’s knowledge of data warehousing goes.

15. Question: What is the purpose of having a multidimensional database? {L}

Answer: Many candidates don’t know what a multidimensional database (MDB) is. They have heard about OLAP, but not MDB. So if the candidate looks puzzled, help them by saying “an MDB is an OLAP database”. Many will say “Oh… I see” but actually they are still puzzled, so it will take a good few moments before they are back to earth again. So ask again: “What is the purpose of having an OLAP database?” The answer is performance and easier data exploration. An MDB (aka a cube) is a hundred times faster than a relational DB for returning an aggregate. An MDB is very easy to navigate, drilling up and down the hierarchies and across attributes, exploring the data.

Purpose: This question is irrelevant to report or ETL developer, but a must for a cube developer and DWA/DA. Every decent cube developer (SSAS, Hyperion, Cognos) should be able to answer the question as it’s their bread and butter.

16. Question: Why do you need a staging area? {M}

Answer: Because:

a) Some data transformations/manipulations from the source system to the DWH can’t be done on the fly, but require several stages and therefore need to “be landed on disk first”

b) The time to extract data from the source system is limited (e.g. we were only given 1 hour window) so we just “get everything we need out first and process later”

c) For traceability and consistency, i.e. some data transforms are simple and some are complex, but for consistency we put all of them on stage first, then pick them up from stage for further processing

d) Some data is required by more than 1 part of the warehouse (e.g. ODS and DDS) and we want to minimise the impact to the source system’s workload. So rather than reading twice from the source system, we “land” the data on the staging area and then both the ODS and the DDS read the data from staging.

Purpose: This question is intended more for an ETL developer than a report/cube developer. Obviously a data architect needs to know this too.

17. Question: How do you decide that you need to keep it as 1 dimension or split it into 2 dimensions? Take for example dim product: there are attributes which are at product code level and there are attributes which are at product group level. Should we keep them all in 1 dimension (product) or split them into 2 dimensions (product and product group)? {H}

Answer: Depends on how they are going to be used, as I explained in my article “One or two dimensions” here.

Purpose: To test if the candidate is conversant in dimensional modelling. This question especially is relevant for data architects and cube developers and less relevant for a report or ETL developer.

18. Question: Fact table columns are usually numeric. In what case does a fact table have a varchar column? {M}

Answer: degenerate dimension

Purpose: to check if the candidate has ever involved in detailed design of warehouse tables. Follow up with question 19.

19. Question: What kind of dimension is a “degenerate dimension”?  Give me an example. {L}

Answer: A “dimension” which stays in the fact table. It is usually the reference number of the transaction. For example: transaction ID, payment reference and order ID.
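For illustration, a degenerate dimension in a hypothetical fact table:

create table FactSalesOrder
( DateKey int,
  CustomerKey int,
  ProductKey int,
  OrderId varchar(20),   -- degenerate dimension: the transaction reference lives in the fact table, no dim table needed
  Quantity int,
  SalesAmount money
)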

Purpose: Just another question to test the fundamentals.

20. Question: What is snowflaking? What are the advantages and disadvantages? {M}

Answer: In dimensional modelling, snowflaking is breaking a dimension into several tables by normalising it. The advantages are: a) performance when processing dimensions in SSAS, b) flexibility if the sub dim is used in several places, e.g. city is used in dim customer and dim supplier (or in an insurance DW: dim policy holder and dim broker), c) one place to update, and d) the DW load is quicker as there is less duplication of data. The disadvantages are: a) more difficulty in “navigating the star*”, i.e. you need to join a few more tables, b) worse “sum group by*” query performance (compared to a “pure star*”), c) less flexibility in accommodating requirements, i.e. without snowflaking the city attributes for dim supplier don’t have to be the same as the city attributes for dim customer, and d) the DW load is more complex, because you have to integrate the city data, whereas without snowflaking it is simpler.

*: a “star” is a fact table with all its dimensions, “navigating” means browsing/querying, “sum group by” is a SQL select statement with a “group by” clause, pure star is a fact table with all its dimensions and none of the dims are snow-flaked.
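To illustrate the city example in the answer above, a hedged sketch with made-up table names:

-- the city attributes are normalised out into their own table
create table DimCity
( CityKey int primary key,
  CityName varchar(50),
  Country varchar(50)
)

-- both dimensions share the snow-flaked city table instead of repeating the city columns
create table DimCustomer
( CustomerKey int primary key,
  CustomerName varchar(100),
  CityKey int          -- foreign key to DimCity; in a pure star these would be CityName and Country columns here
)

create table DimSupplier
( SupplierKey int primary key,
  SupplierName varchar(100),
  CityKey int          -- foreign key to DimCity
)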

Purpose: Snowflaking is one of the classic debates in the dimensional modelling community. It is useful to check whether the candidate understands the reasons rather than just “following blindly”. This question is particularly applicable for data architects and OLAP designers. If their answers are way off then you should worry. But it is also relevant to ETL and report developers as they will be populating and querying the structure.

I hope these interview questions will be useful for the data warehousing community. Not only for the interviewees but also for the interviewers. I’m sure I made some mistakes in this article (everybody does) so if you spot one please contact me. As usual I welcome any question & discussion at vrainardi@gmail.com.

Vincent Rainardi, 11/12/2010

Update 6/5/2012: Just a quick note that Arshad Khan has put together 500 data warehousing and BI questions and answers in his new book: Business Intelligence & Data Warehousing Simplified: 500 Questions, Answers, & Tips. It is by far the most comprehensive set of interview questions I have ever seen and will be very useful for preparing for interviews. Some of the questions are:

  • What is a fact?
  • What are the different types of facts?
  • What are the characteristics of a fact table?
  • What is fact table granularity?
  • What is OLAP?
  • What are the benefits of OLAP?
  • What are OLAP limitations?
  • How does OLAP impact the data warehouse?
  • What are OLAP Models?
  • What is the Inmon approach?
  • What is the Kimball approach?
  • What is a star schema?
  • What are the benefits of a star schema?
  • What is the snowflake schema?

8 December 2010

Normalised Data Warehouse (2)

Filed under: Data Warehousing — Vincent Rainardi @ 11:59 pm
Tags:

In June I wrote about Normalised Data Warehouse, see here. At that time I was very brief. I’d like to expand it a little bit, particularly with regards to “what is it?”, as this seems to be asked by many people. Hence this 2nd article.

Before I start, I’d like to mention that the idea of a normalised data warehouse originated from William H. Inmon in his book “Building the Data Warehouse”. His architecture is known as the Corporate Information Factory (CIF). This idea was further developed by Claudia Imhoff, Nicholas Galemmo and Jonathan G. Geiger in their book “Mastering Data Warehouse Design” (the book doesn’t just talk about the normalised DW, but the dimensional DW too). Many other authors have also written about this concept, and their books can be found here. Two books I’d like to mention in particular are #6, DW 2.0 by Inmon, which further developed the idea, and #5 (Barry Devlin’s), which is also relevant. Thanks to Alberto Ferrari for reminding me about this.

The idea of having a centralised data warehouse in normalised format can be confusing. Some people enter the world of data warehousing without having prior experience of database design. They have never done the normalisation process. They have never designed a transactional database. They have designed a dimensional data mart, but they don’t know what Codd’s rules are. This week I came across a case like this, and it was the 3rd case I have encountered. So I am writing this article to help in similar situations, which I’m sure will happen again in the future.

I’ll explain the issues, then address them one by one.

  • First, they don’t know what a normalised warehouse looks like. They still think about fact and dimension tables.
  • Second, they don’t know why a warehouse should be normalised. They have never heard of the concept of a transactional data warehouse.
  • Third, they don’t know how to normalise a database. They don’t know the first, second and third normal forms. They don’t know where to put foreign keys.

A lot of people come into data warehousing straight into the Kimball model. Into dimensional modelling. To them, a data warehouse is identical to fact and dimension tables. There can be no other form. They read Kimball’s Toolkit book. Probably only the first 3 chapters. Then they feel that they now know what data warehousing is. “It is about creating a fact table in the middle and several dimension tables around it”, they say confidently.

This is understandable, because when you Google (or Bing these days) “data warehousing” or “data warehousing books” the results contain the dimensional ones far more than the normalised ones. People’s opinions are also like that. For some BI/DW architects and managers, Kimball is the only model of data warehousing. They can explain a lot about the benefits of being dimensional. To them, a normalised warehouse is just a glorified staging area.

OK, that’s all very well, but when I asked whether they had seen a normalised warehouse, they said they hadn’t. Had they learned about a normalised warehouse? They hadn’t. Did they know why some companies create their DW in a normalised fashion? No, they didn’t. So it seems one-sided. We need to know both sides in order to make informed decisions. So in the rest of this article I’ll explain what a normalised warehouse looks like and why a warehouse should be normalised, i.e. the reasons.

What a Normalised Data Warehouse Looks Like

A normalised data warehouse looks like the database of a transaction system. It does not have fact or dimension tables. It has master and transaction tables. It also has auxiliary tables.

Master tables contain reference data. They are the primary entities (objects) of a business process. They are the what, the who, the where, and the whom of a business process, such as product, customer, account, invoice, region, job, patient, policy, claim, shares, market, and mobile numbers. Transaction tables are entities containing business events, such as payments, orders, subscriptions, gas bills, phone calls, downloads, sales, purchases, and journals. Auxiliary tables are code-decode tables. The master tables only contain the codes; the descriptions are located in the auxiliary tables. For example, in the policy master table, the status of a policy is “W”. What “W” means is stored in the auxiliary tables (it means “Written”, as in “underwritten”, a term in insurance).
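To give an idea of the shape, here is a hedged sketch using the insurance example above (the table and column names are illustrative only):

-- master table: the "what/who" of the business process; Status holds only the code
create table Policy
( PolicyId int primary key,
  PolicyNumber varchar(20),
  PolicyHolderId int,
  Status char(1)             -- e.g. 'W'; the description lives in the auxiliary table
)

-- auxiliary table: code-decode
create table PolicyStatus
( Status char(1) primary key,
  Description varchar(50)    -- e.g. 'W' = 'Written'
)

-- transaction table: the business events
create table Claim
( ClaimId int primary key,
  PolicyId int,              -- foreign key to the master table
  ClaimDate datetime,
  ClaimAmount money
)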

A normalised data warehouse contains no data redundancy. Meaning that the data is not repeated. Each piece of data is defined only in 1 place. This is because the tables have been normalised. Normalisation is a process of eliminating redundant data.

A normalised data warehouse contains many tables, like 80 to 100 tables, in some cases more than 100. It covers many areas in the organisation, not only 1 or 2 areas. For example, if it is a retail bank, it covers customer service, loans, withdrawals and deposits, account servicing, credit worthiness, settlement, marketing and promotions, taxes, contact and communication, cheque clearance, credit cards, finance, etc. No, it does not usually contain IT or HR data.

A normalised warehouse often contains data from many companies or subsidiaries. For example, if a group has 29 subsidiary companies in 17 countries, the normalised warehouse probably contains data from 19 out of the 29 companies. No, it does not usually contain all companies, because of practical and political reasons. Practical as in project management, budget, and time constraints. Political as in the US, Europe and Asia have their own IT organisations & budgets, hence their own DW projects.

A normalised data warehouse is often updated incrementally. In the source system there is usually a transaction identifier which is incremental (as in, going from a small number to a big one), such as invoice ID, order ID, transaction ID, or call ID. So we can update the warehouse transaction table incrementally (as in, the opposite of truncate-reload). Yes, sometimes this transaction identifier is recycled (goes back to zero after 999999…) but there are techniques to deal with that, so we can still load the warehouse incrementally. Yes, in some cases some old transaction rows are updated, but there are techniques to deal with that too (please refer to my article on SQL Server Central titled Data Extraction Methods). These days, with change tracking and CDC built into SQL 2008, and practically all other top RDBMSes as well, it’s only a matter of time before all transaction systems implement some kind of mechanism to enable incremental extraction of data.
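A minimal sketch of incremental loading using the transaction identifier as a watermark; the Order table and the SourceServer linked server are hypothetical:

-- the highest order id already in the warehouse is the watermark
declare @LastOrderId int
select @LastOrderId = isnull(max(OrderId), 0) from dbo.[Order]

-- pull only the rows created since the last load
insert into dbo.[Order] (OrderId, CustomerId, OrderDate, OrderAmount)
select OrderId, CustomerId, OrderDate, OrderAmount
from SourceServer.SalesDB.dbo.[Order]
where OrderId > @LastOrderId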

How about master tables? Are they extracted incrementally? More often than not, for small master tables the approach is “upsert all rows”. This is because it’s quicker, more robust, and more practical. Small as in 10k rows and below. How about deletion? In the master tables I mean, not in the transaction tables. Do we delete the rows in the normalised warehouse? No, they are only marked (soft delete). Why? a) Because they could be referenced by the transaction tables, and b) to keep the information because it is still useful.

A normalised DW is often the amalgamated (or integrated) version of several systems. Yes, in a way it is a glorified staging area. I would prefer to call it an “integration area” though, because that’s what is happening in a normalised warehouse: data integration. In fact, data integration is where the biggest effort is spent in building a normalised warehouse. It is often one of the main reasons for creating a normalised warehouse. More on this later, when I explain the “why”. Right now I’m focusing on the “what”.

Sometimes a normalised DW contains snapshotted transaction history. It is basically a copy of a transaction table as it was some time ago. Or a copy of several transaction tables joined together, as they were some time ago. Some time ago as in, every month end. Snapshotting transaction history is one way of storing the old version of transaction data.

Old version of transaction data

Let’s first be clear about what “old version of transaction data” means, because a lot of the time people confuse it with “transaction history”. Transaction data is, by its very nature, historical. The order table stores orders from many years ago until today. Old years (say 1998) are usually archived into another table, or offline (as in, not in the system, but able to be restored if required), or they are simply deleted (gone forever, not even stored on offline storage). Only recent years are in the transaction table, say the last 2 years. So, your transaction table (say the order table) contains 2009 and 2010 data, and the 2000 to 2008 orders are put into a table called “order history”. This is what we call “transaction history”. That is not what I mean by “old version of transaction data”.

Say payment1 was created on 1/1/2001 for customer1, amount = £300. Two months later, on 1/3/2001, payment1 was modified: the amount was changed to £200, or the customer was changed to customer2. That’s what I mean by the old version of transaction data, i.e. £300 and customer1. The new version is £200 and customer2.

One way of keeping the old version of transaction data is by using the “snapshotting technique”, as in Kimball’s “snapshot fact table” technique. Another way is using the same technique that SQL Server’s CDC employs: having a second table with the same structure, plus 3-4 extra columns at the end. Before the transaction row is updated, it is copied into this second table first. This was previously known as the “audit table technique”, as it was used for auditing purposes. A normalised data warehouse keeps the old version of transaction data using one of the two techniques above.
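A minimal sketch of the audit table technique, assuming a hypothetical Payment table with PaymentId, CustomerId and Amount columns:

-- second table with the same structure plus a few audit columns at the end
create table PaymentAudit
( PaymentId int,
  CustomerId int,
  Amount money,
  AuditAction char(1),       -- 'U' = update, 'D' = delete
  AuditDate datetime,
  AuditUser varchar(50)
)
go

-- before a payment row is changed, copy the old version into the audit table
create trigger trg_Payment_Audit on Payment
after update, delete
as
insert into PaymentAudit (PaymentId, CustomerId, Amount, AuditAction, AuditDate, AuditUser)
select d.PaymentId, d.CustomerId, d.Amount,
       case when exists (select * from inserted) then 'U' else 'D' end,
       getdate(), suser_sname()
from deleted d
go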

Normalised data warehouses are often found on parallel database systems (MPP = Massively Parallel Processing) such as Teradata, Netezza, Oracle Exadata and Microsoft PDW. Querying a normalised data warehouse requires more computing power, because we have to join a lot of tables. If you are using an SMP database system, indexing and partitioning will play a major part in the performance. If you are using an MPP database system, data distribution is the key.

A normalised DW sometimes contains the “old versions” of master data. It keeps the old version using the “audit table technique” described above. No, it does not use the SCD techniques like in the dimensional warehouse.

Reasons for Normalising a Data Warehouse

The main reason for normalising a data warehouse is so that there is no data redundancy in the data warehouse. This has multiple benefits. The obvious one is that there is only 1 place to update when the data changes. This means that the ETL/data load is easier and more efficient (faster).

The second reason is to keep the data warehouse real time. Because there is only 1 place to update (or insert), the update would be quick and efficient.

Third, to manage master data. The idea is that rather than having MDM as a separate system, the master tables in the normalised warehouse become the master store. This is because the normalised warehouse is updated by all source systems. The normalised warehouse broker table (in the case of an insurance warehouse) becomes the master broker table. The normalised warehouse airport table (in the case of a travel data warehouse) becomes the master airport table. The customer table in the normalised warehouse becomes the master customer table: the source of all customer data in the enterprise. It is a popular choice to place the CDI (Customer Data Integration, MDM for customer) for a CRM project in a normalised DW instead of a dimensional DW.

Fourth, to enable the enterprise to maintain consistency between multiple dimensional data marts (please refer to chapter 2 of my book). With a normalised DW you can create a dimensional DM for any business area, for a specified time period, for a particular range of products/customers/geography. For businesses where data is a sellable asset, this capability really has business appeal, e.g. you can take different cuts of your data and price each cut differently.

Fifth, to make data integration easier. If you have integrated multiple business systems you will appreciate that the core of the work is in the mapping, i.e. where each column from each system should be loaded in the target system. Because in a normalised DW each individual piece of data is located in only one place, it is easier to update the target. Secondly, because the source systems are usually normalised, it is easier to map them to a normalised DW: both the source and the target are normalised, so there are more similarities than with a dimensional DW. In a normalised DW the data is located in different tables, not combined into 1 big dimension table such as a customer dim or product dim.

Sixth, to save development effort. This is an argument that MPP people usually make. Why have a 1-year DW project to create a dimensional model of your data, create the ETL to populate it, and update it every day? Why don’t we just copy the structure of the source business system into the MPP database system, where we can query it 1000 times faster? Or, if we are not copying it, at least we can have a very similar structure/data model so that the data load is very simple, as the tables in the normalised DW map almost one to one to the source system.

So there you are. Those are the “what” and the “why” of a normalised DW. I hope this article is useful for those who have never seen a normalised DW, at least giving them an idea. As usual I might have made mistakes, and you may have different opinions, etc., so I welcome any suggestions, discussion and corrections at vrainardi@gmail.com.

Vincent Rainardi, 8/12/2010

PS. I wrote this about 2 months ago, but didn’t have a chance to complete it. I wrote the last 3 paragraphs today.
