Following my article last week, “SSAS Developer Interview Questions”, which generated interesting responses, I thought I’d write a similar thing on data warehousing. Like before, I will be using LMH to indicate the complexity (Low, Medium, High). This time, I’ll also add the “Purpose”, i.e. what the question is designed for.
Data warehousing skill never stands on its own. You don’t interview people just for their data warehousing knowledge. Usually you are looking for an ETL developer (Informatica, DataStage, BODI, SSIS), a Microsoft BI developer (RS/AS/IS), a BO/Cognos (Report) Developer, a Data Warehouse Architect, a BI Solution Architect, an ETL Architect, a Data Architect or an Enterprise Architect. All these roles require data warehouse knowledge, so most likely you will be combining these questions with others, e.g. for an Informatica developer you would combine them with Informatica questions.
When I am filling a role, what I am particularly looking for is whether the candidate can do the job or not. It is different nowadays from 10 years ago. We now have Google and Bing; anything we don’t know we can quickly Google. So the amount of knowledge is not that important now. What is important (to me) is a) experience, b) problem solving and c) character. I remember about 3 years ago I was interviewing a candidate for a data architect role. The candidate was a speaker at a data architecture conference, so we were convinced she must be very good. I asked her what the rule for the 2nd normal form was. She could not answer it. I asked about the 1st and 3rd normal forms and she could not answer them either. This is bread and butter. You can’t do data modelling without knowing the normalisation rules.
But then in an interview with a bank 3 years ago I was also “blank” like her. I was the author of a data warehousing & BI book on SQL Server, so they were thinking highly of me with regards to SSIS, SSAS and SSRS. They asked me what the 5 tabs in SSIS BIDS were. I could mention 4 but could not remember the 5th one, even though I was using SSIS almost every day. Since then, when I interview I do not look for the amount of knowledge, but for whether the candidate can solve a problem. I remember one day my manager and I were interviewing for a Teradata developer role. I said to the candidate that the amount of Teradata knowledge he had was not important: within 5 minutes of opening the manual or Googling he would be able to get that knowledge. So I said I would not ask him about any Teradata SQL or BTEQ functions. Instead I gave him 2 real-world problems that we were facing in the project and asked him to give me the solutions in about 5 minutes. The way he interrogated us with questions to get information about the project, and finally suggested a brilliant solution, really impressed us, so we offered him the job. I can completely understand that some people disagree with my approach. After that interview my boss pulled me aside and told me off: “You must not say that in front of the candidates, Vincent. Of course the amount of Teradata knowledge they possess is important! Why do you think we hire them?”
So in the interview questions below, which crystallised from my experience, I put both knowledge-based and experience/problem-solving questions. Generally I’m less interested in “theory only” questions, so I try to wrap them up in real-world problem situations.
I’m a firm believer that experience is the best teacher. So in interviews I always try to find out if the candidate has done it before, and I test them using everyday problems. People who have done data warehousing will surely have come across those problems and understand what the solutions are. People who are new to data warehousing, or only know the DW theory from books, wouldn’t have encountered those problems and would not have a clue what the answers are.
1. Question: How do you implement Slowly Changing Dimension type 2? I am not looking for the definition, but the practical implementation, e.g. table structure and ETL/loading. {L}
Answer: Create the dimension table as normal, i.e. first the dim key column as an integer, then the attributes as varchar (or varchar2 if you use Oracle). Then I’d create 3 additional columns: an IsCurrent flag, “Valid From” and “Valid To” (the latter two are datetime columns). With regards to the ETL, I’d first check if the row already exists by comparing the natural key. If it exists, then “expire” the existing row and insert a new row, setting the “Valid From” date to today’s date or the current date time.
An experienced candidate (particularly a DW ETL developer) will not set the “Valid From” date to the current date time, but to the time the ETL batch started. This is so that all the rows in the same load have the same Valid From, which is 1 millisecond after the expiry time of the previous version, thus avoiding issues with ETL workflows that run across midnight.
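As an illustration, here is a minimal sketch in SQL Server syntax (the table, column and variable names are just examples, not from any particular system):

declare @EtlStartTime datetime = '2011-08-13 02:00:00'  -- when the ETL batch started
declare @CustomerId varchar(20) = 'CUST001'             -- natural key of the incoming row
declare @CustomerName varchar(100) = 'Customer One Ltd'
declare @City varchar(50) = 'London'

create table DimCustomer
( CustomerKey int identity(1,1) primary key  -- surrogate dim key
, CustomerId varchar(20)                     -- natural key from the source system
, CustomerName varchar(100)                  -- attribute columns
, City varchar(50)
, IsCurrent bit                              -- 1 = current version of the row
, ValidFrom datetime
, ValidTo datetime                           -- e.g. 9999-12-31 for the current row
)

-- If the natural key already exists, expire the current version.
-- datetime is only accurate to about 3 ms, so this is the closest we can get
-- to "1 millisecond before" the new row's Valid From.
update DimCustomer
set IsCurrent = 0, ValidTo = dateadd(ms, -3, @EtlStartTime)
where CustomerId = @CustomerId and IsCurrent = 1

-- Then insert the new version of the row
insert into DimCustomer (CustomerId, CustomerName, City, IsCurrent, ValidFrom, ValidTo)
values (@CustomerId, @CustomerName, @City, 1, @EtlStartTime, '9999-12-31')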
Purpose: SCD 2 is one of the first things that we learn in data warehousing. It is considered basic/fundamental. The purpose of this question is to separate the quality candidates from the ones who are bluffing. If the candidate cannot answer this question you should worry.
2. Question: How do you index a fact table? And explain why. {H}
Answer: Index all the dim key columns, individually, non-clustered (SQL Server) or bitmap (Oracle). The dim key columns are used to join to the dimension tables, so if they are indexed the joins will be faster. An exceptional candidate will suggest 3 additional things: a) index the fact key separately, b) consider creating a covering index in the right order on the combination of dim keys, and c) if the fact table is partitioned, the partitioning key must be included in all indexes.
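For illustration, in SQL Server the basic indexing could look like this (FactSales and the key columns are hypothetical names):

-- Fact key indexed separately, here as the clustered index
create unique clustered index IX_FactSales_FactKey on FactSales (FactSalesKey)

-- One non-clustered index on each dim key column, individually
create nonclustered index IX_FactSales_DateKey on FactSales (DateKey)
create nonclustered index IX_FactSales_CustomerKey on FactSales (CustomerKey)
create nonclustered index IX_FactSales_ProductKey on FactSales (ProductKey)

-- In Oracle the equivalent would be bitmap indexes, e.g.
-- create bitmap index IX_FactSales_DateKey on FactSales (DateKey);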
Purpose: Many people know data warehousing only in theory. This question is designed to separate those who have actually built a data warehouse from those who haven’t.
3. Question: In the source system, your customer record changes like this: customer1 and customer2 now become one company called customer99. Explain a) the impact to the customer dim (SCD 1), b) the impact to the fact tables. {M}
Answer: In the customer dim we update the customer1 row, changing it to customer99 (remember that it is SCD 1). We do a soft delete on the customer2 row by updating its IsActive flag column (a hard delete is not recommended). On the fact table we find the surrogate keys for customer1 and customer2 and update them with customer99’s SK.
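A minimal sketch of those statements, assuming hypothetical table and column names (CustomerKey is the surrogate key, CustomerId the natural key):

-- SCD 1: overwrite the customer1 row in place; its surrogate key is unchanged
update DimCustomer
set CustomerId = 'customer99', CustomerName = 'Customer 99'
where CustomerId = 'customer1'

-- Soft delete the customer2 row
update DimCustomer
set IsActive = 0
where CustomerId = 'customer2'

-- Repoint customer2's fact rows to customer99's surrogate key
-- (customer1's fact rows already point to the right SK, because its dim row was updated in place)
update FactSales
set CustomerKey = (select CustomerKey from DimCustomer where CustomerId = 'customer99')
where CustomerKey = (select CustomerKey from DimCustomer where CustomerId = 'customer2')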
Purpose: This is a common problem that everybody in data warehousing encounters. By asking this question we will know if the candidate has enough experience in data warehousing. If they have not come across it (probably because they are new to DW), we want to know if they have the capability to deal with it or not.
4. Question: What are the differences between the Kimball approach and Inmon’s? Which one is better and why? {L}
Answer: If you are looking for a junior role, e.g. a developer, then the expected answer is: in Kimball we do dimensional modelling, i.e. fact and dim tables, whereas in Inmon we do CIF, i.e. an EDW in normalised form, and we then create a DM/DDS from the EDW. Junior candidates usually prefer Kimball, because of query performance and flexibility, or because that’s the only one they know; which is fine. But if you are interviewing for a senior role, e.g. a senior data architect, then they need to say that the approach depends on the situation. Both Kimball’s and Inmon’s approaches have advantages and disadvantages. Some of the main reasons for having a normalised DW can be found here.
Purpose: a) to see if the candidate understands the core principles of data warehousing or just “knows the skin”, b) to find out if the candidate is open-minded, i.e. understands that the solution depends on what we are trying to achieve (there’s no universally right or wrong answer), or if they blindly use Kimball for every situation.
5. Question: Suppose a fact row has unknown dim keys. Do you load that row or not? Can you explain the advantages/disadvantages? {M}
Answer: We need to load that row so that the total of the measure/fact is correct. To enable us to load the row, we either set the unknown dim key to 0 or to the dim key of a newly created dim row. Alternatively we can choose not to load that row (so the total of the measure will be different from the source system) if the business requirement prefers it. In this case we load the fact row into a quarantine area, complete with error processing, a DQ indicator and an audit log. On the next day, after we receive the dim row, we load the fact row. This is commonly known as Late Arriving Dimension Rows and there are many sources for further information; one of the best is Bob Becker’s article here from 2006. Others refer to this as Early Arriving Fact Rows, which Ralph Kimball explained here in 2004.
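As a sketch of the first approach (hypothetical staging and warehouse tables, with dim key 0 reserved for an “Unknown” row in DimCustomer):

-- Load the fact row even when the customer is not in the dimension yet:
-- a failed lookup falls back to the Unknown member (dim key 0)
insert into FactSales (DateKey, CustomerKey, ProductKey, SalesAmount)
select s.DateKey
     , coalesce(d.CustomerKey, 0) as CustomerKey   -- 0 = the Unknown dim row
     , s.ProductKey
     , s.SalesAmount
from StagingSales s
left join DimCustomer d
  on d.CustomerId = s.CustomerId and d.IsCurrent = 1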
Purpose: Again, this is a common problem that we encounter on a regular basis in data warehousing. With this question we want to see if the candidate’s experience level is up to expectation or not.
6. Question: Please tell me about your experience on your last 3 data warehouse projects. What were your roles in those projects? What were the issues and how did you solve them? {L}
Answer: There’s no wrong or right answer here. With this question you are looking for a) whether they have done similar things to your current project, b) whether they have done the same role as the one you are offering, c) whether they have faced the same issues as your current DW project.
Purpose: Some of the reasons why we pay more to certain candidates compared to others are: a) they have done it before, so they can deliver more quickly than those who haven’t, b) they come from our competitors, so we would know what’s happening there and can make a better system than theirs, c) they have solved similar issues, so we could “borrow their techniques”.
7. Question: What are the advantages of having a normalised DW compared to a dimensional DW? What are the advantages of a dimensional DW compared to a normalised DW? {M}
Answer: For the advantages of having a normalised DW see here and here. The advantages of a dimensional DW are: a) flexibility, e.g. we can accommodate changes in the requirements with minimal changes to the data model, b) performance, e.g. you can query it faster than a normalised model, c) it’s quicker and simpler to develop than a normalised DW, and easier to maintain.
Purpose: To see if the candidate has seen “the other side of the coin”. Many people in data warehousing only know Kimball/dimensional. The second purpose of this question is to check if the candidate understands the benefits of dimensional modelling, which is a fundamental understanding in data warehousing.
8. Question: What is 3rd normal form? {L} Give me an example of a situation where the tables are not in 3rd NF, then make them 3rd NF. {M}
Answer: No non-key column is transitively dependent on the PK. For example, say column1 is the PK, column2 is dependent on column1, and column3 is dependent on column2. In this case column3 is “transitively dependent” on column1, so the table is not in 3rd NF. To make it 3rd NF we need to split it into 2 tables: table1, which has column1 & column2, and table2, which has column2 & column3.
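To make this concrete, here is a sketch with hypothetical tables. CustomerCity depends on CustomerId, and CustomerId depends on OrderId (the PK), so CustomerCity is transitively dependent on OrderId:

-- Not in 3rd NF
create table OrdersDenormalised
( OrderId int primary key
, CustomerId int
, CustomerCity varchar(50)   -- depends on CustomerId, not on OrderId
)

-- In 3rd NF: split into two tables
create table Orders
( OrderId int primary key
, CustomerId int
)
create table Customers
( CustomerId int primary key
, CustomerCity varchar(50)
)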
Purpose: A lot of people talk about “3rd normal form” but don’t know what it means. This is to test if the candidate is one of those people. If they can’t answer 3rd NF, ask about 2nd NF. If they can’t answer 2nd NF, ask about 1st NF.
9. Question: Tell me how to design a data warehouse, i.e. what are the steps of doing dimensional modelling? {M}
Answer: There are many ways, but it should not be too far from this order: 1. understand the business process, 2. declare the grain of the fact table, 3. create the dimension tables including attributes, 4. add the measures to the fact tables (from chapter 2 of Kimball’s Toolkit book). Steps 3 and 4 could be reversed (add the facts first, then create the dims), but steps 1 & 2 must be done in that order: understanding the business process must always come first, and declaring the grain must always come second.
Purpose: This question is for a data architect or data warehouse architect, to see if they can do their job. It’s not a question for an ETL, report or cube developer.
10. Question: How do you join 2 fact tables? {H}
Answer: It’s a trap question. You don’t usually join 2 fact tables, especially if they have different grains. When designing a dimensional model, you include all the necessary measures in the same fact table. If the measure you need is located in another fact table, then there’s something wrong with the design: you need to add that measure to the fact table you are working with. But what if the measure has a different grain? Then you add the lower grain measure to the higher grain fact table. What if the fact table you are working with has the lower grain? Then you need to get the business logic for allocating the measure.
It is possible to join 2 fact tables using the common dim keys, but the performance is usually horrible. For example, if FactTable1 has dim1key, dim2key and dim3key, and FactTable2 has dim1key and dim2key, then join them like this:
select f2.dim1key, f2.dim2key, f1.measure1, f2.measure2
from ( select dim1key, dim2key, sum(measure1) as measure1
       from FactTable1
       group by dim1key, dim2key ) f1
join FactTable2 f2
  on f1.dim1key = f2.dim1key and f1.dim2key = f2.dim2key

So if we don’t join 2 fact tables that way, how do we do it? The answer is the fact key column. It is good practice (especially in SQL Server, because of the concept of the clustered index) to have a fact key column to enable us to identify rows on the fact table (see my article here). The performance would be better than joining on dim keys, but you need to plan this in advance, as you need to include the fact key column in the other fact table:
select f2.dim1key, f2.dim2key, f1.measure1, f2.measure2
from FactTable1 f1
join FactTable2 f2 on f2.fact1key = f1.factkey

I originally implemented this technique for self-joins, but then expanded its use to join to other fact tables. It should be used on an exception basis rather than as the norm.
Purpose: Not to trap the candidate, of course, but to see if they have experience dealing with a problem which doesn’t happen every day.
11. Question: How do you index a dimension table? {L}
Answer: A clustered index on the dim key, and non-clustered indexes (individually) on the attribute columns which are used in query where clauses.
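For example, in SQL Server (hypothetical table and column names):

-- Clustered index on the dim key
create unique clustered index IX_DimCustomer_CustomerKey on DimCustomer (CustomerKey)

-- Individual non-clustered indexes on attributes used in where clauses
create nonclustered index IX_DimCustomer_City on DimCustomer (City)
create nonclustered index IX_DimCustomer_Country on DimCustomer (Country)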
Purpose: This question is critical to ask if you are looking for a Data Warehouse Architect (DWA) or a Data Architect (DA). Many DWAs and DAs only know the logical data model. Many of them don’t know how to index. They don’t know how different the physical tables are in Oracle compared to Teradata. This question is not essential if you are looking for a report or ETL developer. It’s good for them to know, but it’s not essential.
12. Question: Tell me what you know about William Inmon. {L} Alternatively: Ralph Kimball.
Answer: He was the one who introduced the concept of data warehousing. Arguably Barry Devlin was the first, but he’s not as well known as Inmon. If you ask who Barry Devlin or Claudia Imhoff is, 99.9% of candidates wouldn’t know. But every decent practitioner in data warehousing should know about Inmon and Kimball.
Purpose: To test if the candidate is a decent practitioner in data warehousing or not. You’ll be surprised (especially if you are interviewing a report developer) how many candidates don’t know the answer. If someone is applying for a BI architect role and has never heard of Inmon, you should worry.
13. Question: What is the difference between a data mart and a data warehouse? {L}
Answer: Most candidates will answer that one is big and the other is small. Some good candidates (particularly Kimball practitioners) will say that a data mart is one star, whereas a DW is a collection of all the stars. An excellent candidate will say all of the above, plus that a DW could be the normalised model that stores the EDW, whereas a DM is the dimensional model containing 1-4 stars for a specific department (in both a relational DB and a multidimensional DB).
Purpose: The question has 3 different levels of answer, so we can see how deep the candidate’s knowledge of data warehousing is.
14. Question: What is the purpose of having a multidimensional database?
Answer: Many candidates don’t know what a multidimensional database (MDB) is. They have heard about OLAP, but not MDB. So if the candidate looks puzzled, help them by saying “an MDB is an OLAP database”. Many will say “Oh… I see”, but actually they are still puzzled, so it will take a good few moments before they are back to earth again. So ask again: “What is the purpose of having an OLAP database?” The answer is performance and easier data exploration. An MDB (aka a cube) is a hundred times faster than a relational DB at returning an aggregate, and an MDB is very easy to navigate: drilling up and down the hierarchies and across attributes, exploring the data.
Purpose: This question is irrelevant for a report or ETL developer, but a must for a cube developer and a DWA/DA. Every decent cube developer (SSAS, Hyperion, Cognos) should be able to answer it, as it’s their bread and butter.
Excellent, sir. This should be the way to conduct interviews. Thanks for the great article.
But I have one disagreement, with your SCD 1 question. Would it not be a better idea to create a new record for “customer99”, as the basic definition of both entities 1 and 2 has changed? The purpose of a DW is also historical information. As the 2 companies have merged, it means there is a new entity. The fact table also should not be updated; from the current load onwards the facts will reference “99”. For any future calculation involving historical data, a metadata table can be used to store this kind of information and used as and when required.
Comment by sasanka ghosh (India) — 13 August 2011 @ 3:41 pm
Thank you Sasanka for your comment.
If it were SCD type 2 we would create a new row for “customer99”. But because it is SCD type 1, we update the existing row.
Comment by Vincent Rainardi — 13 August 2011 @ 4:10 pm
Pardon me if I’m bothering you.

SCD type 2 I understand: the attributes change without actually changing the “main definition” of the entity, and we still create a new record if, in the metadata, the changed attributes are candidates for an SCD type 2 change.

But if company A and company B merge, in modelling terms you are changing the basic definition of the entity, so it is a case of a fresh entry.

In that case I think these are the solutions, in order of preference:

SCD type 1 — a new record, with a new surrogate key and a new natural key (with some metadata storing the previous relationship in some other table)

SCD type 2 — a new record with a new surrogate key and a new natural key

SCD type 3 — with two new fields (but what happens when this company merges with another?)

Multiple version overlay, with some tweaking

Best solution: a snowflake schema, but design the system in such a way that it will support star transformation.
Comment by sasanka ghosh (India) — 14 August 2011 @ 3:08 pm
Hi Vincent!
Awesome article! It’s a pleasure reading this. I want to add a few comments – let me know if you agree/disagree.
****About Question 2, indexing the fact table:
– Do not use bitmap indexes on the fact table if you plan to load it using multiple parallel streams (e.g. pass-through partitions in Informatica, or multiple jobs loading together), as bitmap indexes create lock contention.
– Indexes adversely affect loading performance.
****About Question 3 (in response to the above comments by Sasanka Ghosh):
– It is important to preserve the history; however, we need to see if this is a merger scenario where one company is merging with the other, or whether both companies are being dissolved and a completely new entity is being formed.
Take this example. “Company A” has business registration number (BRN) 100 and “company B” has business registration number 200. If B merges with A (the merger scenario), A will still have the same BRN. In that case, we should just soft-delete B’s record in the dimension (but not update the fact – let the older records still refer to B).
But if A and B are both dissolved and a new entity C is formed, then we have to create a new record and soft-delete both A and B. Let’s not touch the older records in the fact – they represent the version of the truth that existed before.
****Question 10, about joining 2 facts:
I am not a data modeller, but this question has always troubled me. Here it is: I do not want to join 2 facts, but I want to do this analysis: I want to analyse the impact of marketing propaganda on my revenue. My revenue is captured daily and kept in one fact table. My marketing initiatives are captured daily (measures like the number of road shows arranged) and loaded into some other fact table. For obvious reasons, the marketing data and revenue data are kept in different fact tables (if not in different data marts).
So in order to analyse the impact of marketing on sales, do you suggest I create a separate fact table containing both sets of data? Or should I just join the 2 facts?
By the way, I also run a website on various subjects of data warehousing at http://www.dwbiconcepts.com and I have referred to this article on one of the pages of my site, with due credit.
Comment by Akash — 10 May 2012 @ 8:39 am
Hi Akash, thank you for your comments.
I agree with your opinions on Q2 and Q3. During a merger, one company buys the other company’s shares, either with their own shares or with cash. On Q10, I suggest we join the 2 fact tables using the common dimensions between them. Thanks for informing us about your DW website.
Comment by Vincent Rainardi — 11 May 2012 @ 9:59 pm
Excellent Vincent!
But I would disagree on Q2, indexing.
In Oracle, bitmap indexes on the fact table dim keys give the best performance with a STAR TRANSFORMATION join to the dimensions. Yes, bitmap indexes would create locks if the fact table is loaded in parallel, but nowadays the majority of fact tables are partitioned and the exchange partition technique is used: you load an empty table, index it, and then make this segment a partition of the fact table. B-tree indexes can be used, but they are usually much bigger for dim keys and do not perform as well.
Creating a covering index in the right order on the combination of dim keys was used for the STAR join in the past (in Oracle); now the STAR TRANSFORMATION join is much more efficient. Also, different queries need different compound indexes, so the number of such indexes can be big. Consider the space and maintenance expenses for all these indexes.
In Oracle, if a column is the partitioning key it is not necessary to index it, because partition pruning already helps the search.
Comment by Sergey Alentyev — 12 October 2012 @ 6:02 am
Thanks Sergey
Comment by Vincent Rainardi — 13 October 2012 @ 1:26 pm