# Data Warehousing and Machine Learning

## 7 February 2015

### Data Scientist

Filed under: Business Intelligence,Other — Vincent Rainardi @ 8:57 am

Some people wonder what the term data scientist actually means. Put simply, a data scientist is an expert in:

1. Statistical analysis
2. Data mining
3. Big data
4. Data visualisation

In addition to the above technical skills, a data scientist usually has good business knowledge in one or two sectors, such as banking, insurance, investment, oil or biotech.

In this article, I will explain each of the 4 technical areas, and then give an example.

Statistical Analysis

I consider the following statistical analysis skills to be important to have for a data scientist:

1. Generalised Linear Models (Bernoulli distribution, Bayesian methods, Gaussian regression)
2. Cluster analysis (k-means, Expectation Maximisation, fuzzy clustering)
3. Goodness of Fit Tests (Kolmogorov Smirnov, Pearson’s Chi2, Anderson-Darling)
4. Bayesian Statistics (Naïve Bayes Classifier, Bayesian Network, Hidden Markov Models)
5. Factor Analysis (Maximum Likelihood Extraction, Principal Component Extraction)
6. Time Series, Matrix computations, Latent Dirichlet Allocation
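To make one of these techniques concrete, here is a minimal k-means clustering sketch in plain Python (illustration only: the one-dimensional data points and k=2 are invented, and in practice you would use a statistical package rather than hand-rolled code):

```python
import random

def kmeans(points, k, iterations=20, seed=42):
    """Minimal 1-D k-means: assign each point to its nearest centroid,
    then move each centroid to the mean of its assigned points."""
    random.seed(seed)
    centroids = random.sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: (p - centroids[i]) ** 2)
            clusters[nearest].append(p)
        # Keep the old centroid if a cluster ends up empty
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return sorted(centroids)

# Two obvious groups: values near 1 and values near 10
data = [0.9, 1.0, 1.1, 9.8, 10.0, 10.2]
print(kmeans(data, 2))
```

The same assign-then-update loop underlies Expectation Maximisation as well; EM simply replaces the hard nearest-centroid assignment with probabilistic (soft) assignments.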

For many candidates, this first subject (Statistical Analysis) is the biggest stumbling block. The topics above are advanced statistics, some of them at PhD level. Very few people in IT know them, because they are usually only taught in university mathematics, physics or engineering courses.

As with anything else, Statistical Analysis requires the use of software tools. The most widely used statistical analysis tools are SPSS and SAS; these two are the de facto industry standard. Matlab has a special place: it is not as user-friendly or comprehensive as SPSS and SAS, but it can be programmed to do almost anything and is extremely flexible for any kind of statistical processing.

Statistical analysis also requires good programming skills, particularly in R and S. Knowledge of parallel processing and multithreading is also useful. R is the de facto standard language for statistical computing, data analysis and data mining. R is derived from S, with lexical scoping semantics added to it; S provides a good foundation but is practically no longer used. Some popular IDEs for writing R are RStudio, Rattle GUI and RWeka.
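Lexical scoping is what lets an R function capture variables from the environment in which it was defined. The same idea can be sketched in Python (used here purely for illustration, since Python shares R's lexical scoping):

```python
def make_counter():
    # The inner function captures 'count' from its defining (lexical)
    # scope -- the semantics R added on top of S.
    count = 0
    def increment():
        nonlocal count
        count += 1
        return count
    return increment

counter = make_counter()
print(counter(), counter(), counter())  # 1 2 3
```

Each call to `make_counter` produces an independent counter with its own captured `count`, which is exactly how R closures carry their own state.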

Data Mining

Data Mining is required for predictive analysis (forecasting the future) as well as descriptive analysis (explaining the past). For this you need to be able to build data mining models, train them, and use them for forecasting. Data mining requires a strong mathematical foundation in techniques such as clustering, regression and neural networks. It also requires knowledge of specific tools such as SAS and Analysis Services.
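The train-then-forecast idea can be illustrated with the simplest possible model, an ordinary least squares line, sketched here in plain Python rather than SAS or Analysis Services (the monthly sales figures are invented):

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x -- the simplest 'model training'."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
        / sum((x - mean_x) ** 2 for x in xs)
    a = mean_y - b * mean_x
    return a, b

# Descriptive: explain the past (months 1-6 of sales, made-up numbers)
months = [1, 2, 3, 4, 5, 6]
sales  = [100, 112, 119, 131, 140, 152]
a, b = fit_line(months, sales)

# Predictive: forecast month 7 from the trained model
forecast = a + b * 7
print(round(forecast, 1))
```

The fitted slope `b` describes the past (roughly how much sales grow per month), while evaluating the line at month 7 is the forecast; real mining models are far richer, but the train/score split is the same.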

Data mining requires knowledge of data warehousing and BI, because data mining can only uncover patterns actually present in the data. As the data is usually a very large set, it is commonly stored in a data warehouse, and undergoes data warehousing processes such as data integration, denormalised modelling, data quality improvement and data cleansing.
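As a toy illustration of the cleansing step (a plain-Python sketch; the record layout and names are invented), here is the kind of trimming, case standardisation and de-duplication that raw source rows typically undergo before they reach the warehouse:

```python
def cleanse(rows):
    """Standardise and de-duplicate raw (name, city) rows: trim whitespace,
    normalise case, and keep the first occurrence of each key."""
    seen, clean = set(), []
    for name, city in rows:
        key = (name.strip().title(), city.strip().title())
        if key not in seen:
            seen.add(key)
            clean.append(key)
    return clean

raw = [("  john smith", "LONDON"), ("John Smith ", "london"), ("ann lee", "Leeds")]
print(cleanse(raw))  # [('John Smith', 'London'), ('Ann Lee', 'Leeds')]
```

Without this step, the two "John Smith" variants would look like two different customers to a mining model, which is exactly why cleansing matters for pattern discovery.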

Data mining requires business knowledge, such as Customer Relationship Management, market basket analysis, credit card fraud, insurance claims, marine underwriting, credit risk and FX trading. Without the business knowledge, it is impossible to create a good mining model.

Big Data

A data scientist needs to know about big data because, increasingly, more data is stored in big data architectures, i.e. Hadoop, HBase, MapReduce, EC2/S3. They do not need to know Pig, Oozie, Lucene, Flume or Sqoop in detail, but they do need experience with the platform that the company uses, such as Hortonworks, Cloudera, BigInsights (IBM) or HDInsight (Microsoft). These platforms are fully equipped with all the tools you need to load and process data in Hadoop: data access layers, streaming, query language, security, scheduling and governance, all rolled into an enterprise-ready platform.

A data scientist may not write complex MapReduce transformations in Pig Latin or extend the functions using Python or Ruby (that's for the programmer to do), but they do need to understand the basic concepts. For instance: how the Reduce job combines the output tuples from a Map into a smaller set of tuples, what keys and values are in Hadoop, why you need a Shuffle between a Map and a Reduce, etc. Whether you use Hortonworks, BigInsights or HDInsight, the implementations differ between companies, but the core concepts are always the same.
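Those core concepts can be sketched in a few lines of plain Python (a conceptual word-count sketch, not actual Hadoop code): Map emits (key, value) tuples, Shuffle groups them by key, and Reduce combines each group into a smaller set of tuples.

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (word, 1) tuple for every word in the input
    return [(word, 1) for line in lines for word in line.split()]

def shuffle_phase(pairs):
    # Shuffle: group all values by key, so each reducer sees one key's values
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: combine each group's values into a single (key, total) tuple
    return {key: sum(values) for key, values in groups.items()}

lines = ["big data big deal", "big data"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
print(counts)  # {'big': 3, 'data': 2, 'deal': 1}
```

In real Hadoop the map and reduce tasks run in parallel on many nodes and the shuffle moves data across the network, but the key/value flow is exactly this.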

Using a big data platform such as BigInsights enables data scientists to do data discovery and visualisation. It comes with advanced text analytics, machine learning analytics, large-scale indexing, adaptive MapReduce, compression, security and stream analytics. Without knowledge of such a platform, a data scientist limits their capability to process the data.

Data Visualisation

Data visualisation is the buzzword for BI tools. Tableau, QlikView and Spotfire are the most popular. IBM has five: Cognos Insight, Cognos Express, SPSS Analytic Catalyst, SPSS Visualisation Designer and Watson Analytics. SAP has Lumira. SAS has Visual Analytics and Visual Statistics. And there are tons of other tools: Dundas, iDashboards, Datazen, MicroStrategy, Birst, Roambi, Klipfolio, InetSoft, to name a few.

A good data scientist must have experience creating visualisations using one or two of the popular tools above. None of these tools is difficult to use, compared to programming in Python or Ruby for example, or even compared to statistical analysis. Within 2 weeks you will be able to grasp the basics, and within a month you should be able to use them fluently. They are very user-friendly, with highly visual GUIs (point and click, drag and drop, that sort of thing).

Example

One of the most famous teams of data scientists is AIG’s. As the Wall Street Journal reported (link): “AIG analyzes reams of data, from massive databases to handwritten notes from claim adjusters, to identify potentially fraudulent activity. It produces a list of around 100 claims to investigate, ranked highest priority to lowest. It also produces charts and other visualizations, like heat maps, that helps investigators understand why machine learning algorithms make the connections they do.”

Jobs and Pay

I recently saw a job advert for a data scientist at £800 per day. Admittedly this is the highest I have ever seen. I believe the normal range is £400 to £600 per day, depending on the industry sector: banking pays the highest, insurance or pharma probably second, and the public sector (including the NHS) the lowest.

• ITJobsWatch (link) reported that the average salary for a permanent position is £57,500, whereas for a contract position it is £450/day.
• Glassdoor (link) reported that the national average for a data scientist is £40,000.
• PayScale (link) reported that the median is £39,691.

It is impossible to get all 4 skills in 1 person. So in many companies this role is carved up into two or three positions: a big data guy, a visualisation developer and a statistical analyst. In some companies you don’t need to know any statistics to be called a Data Scientist; you just need to know SQL. They call the SQL database guy a Data Scientist just to make the role look more attractive. Ten years ago the “SQL guy” used to be called a “report developer”, but nowadays he could be (mis)labelled a “data scientist”.

## 26 May 2012

### Principles of Explaining Things

Filed under: Other — Vincent Rainardi @ 10:19 pm

Some of the principles I try to follow when writing a post/article are:

1. Learn by example
It is easier for the reader to understand if we give an example; it is often easier for them to understand the example than the definition. I believe that humans learn best from examples, not from definitions. For example, rather than explaining how to populate a fact table, it is better to actually do it in SSIS and take the readers through the process, step by step. An example gives the reader/learner something to copy.
2. Show the end result first
It is quicker to make readers understand a procedure if we show them what the end result looks like before we begin. For example, before explaining the 6 steps of how to build an SSIS workflow that populates a dimension, readers will have a clearer understanding if we first display the screenshot: “This is how it will look when we finish. Now let’s do it step by step.” Now the readers have the big picture in their minds of what they are trying to achieve.
3. Start with the basics, then build up
Rather than trying to be perfect and explaining all 16 steps required to build a cube, it is better to start with just 6 steps, the most basic ones. Then, after we have succeeded in making the user/reader understand the basic 6 steps, we introduce 3 additional steps. And after they have truly grasped those 9 steps, we add 3 more, and so on until we reach the full 16 steps. But if you try to explain all 16 at the beginning, it will take them longer to understand.
4. It’s easier to show it than to explain it
Rather than typing 1000 words trying to explain Resource Governor in SQL Server, it is quicker to just show them the screens. Ideally it’s a video capture, but if you can’t, then static pictures will do. Pictures are much better than words at explaining things to other people.

## 4 June 2011

### Back to Microsoft Word

Filed under: Other — Vincent Rainardi @ 8:03 am

In the end, after a few weeks on Windows Live Writer, I decided to go back to using MS Word 2010. The main reasons were:

1. Images. I do a lot of screenshots in my articles, usually to describe software or a process, e.g. SQL Server BI. With WLW, every time I updated an article it reloaded all the images under new names. I recall one of the comments on the WLW blog was about this. Having experienced it myself, I felt I had to correct it, for space reasons (and tidiness), e.g. delete all the duplicate copies of the images from WordPress, leaving just 1 copy. This is very time consuming. The Word 2010 blog template has the same problem, so I had to revert to a normal Word template (not the blog one). This way there is only 1 copy of each image, because I load the images one by one into WordPress myself. I use MS Word only for writing the article and inserting images, then copy and paste the text into WordPress manually and upload the images manually. But if the article doesn’t contain any images, like this one, I use the blog template feature of MS Word, which simplifies the uploading process tremendously.
2. Features. There are so many features in Word that I missed: shortcuts, spelling/proofing, fonts, and its ease of use. I have been using Word since 1990, when it was DOS based (MS Word 5). I was a WP user before, and Chi Writer, but MS Word was better so I converted. Having been with Word since 1990, I’m very familiar with all the shortcuts etc. WLW, on the other hand, I found quite limited in terms of features.
3. Management. I maintain a copy of all my articles on my laptop, backed up to a USB disk (every week) and an external HD (every 3 months). I found it easier to manage the documents using Word. With WLW they always go to MyWeblogPosts under My Documents, and I had to move them manually to my DWBI Blog folder. I didn’t have the freedom to create the documents/files in the location I wanted.

Vincent, 4/6/11.

## 22 April 2011

### Blogging Using Windows Live Writer

Filed under: Other — Vincent Rainardi @ 7:33 am

• Simple, easy to use. Simpler than Microsoft Word.
• You can set up multiple blog accounts, and it is very easy to switch between them, even if you have the same user name for different blogs.
• You can add pictures. No need to load them separately to WordPress.
• Font, colour, bold, italic, underline
• You can set the tags as well as the categories. And you can set the post date too.
• You can open a post from WordPress, edit it and update it.
• You can edit the source (HTML) code/script.
• You have a Preview facility (how it would look on WordPress).

• You can’t change the location of the local draft*. Windows Live Writer lets us save our posts on our computer, but always in My Document\My Weblog Post\Draft; I’m not able to point this elsewhere. I’ve checked File|Options and Blog Account|Blog Options but there’s no option to set the directory for the local draft.
*There are 2 kinds of drafts: other than the local draft on your PC, you can also post a draft on your blog.
• After you publish your post to WordPress, if you modify it on WordPress you can’t edit it again in WLW. Well, you can, but you get an old copy.
• You can’t use

## 21 April 2011

### Blogging Using Microsoft Word

Filed under: Other — Vincent Rainardi @ 10:23 pm

In the last couple of months I have blogged probably 30 posts using Microsoft Word (2010) and felt it is so much easier than typing directly into WordPress. I have 3 blogs, all on WordPress. One is about data warehousing and BI (this blog), and the other two are about chess and life thoughts. The main advantages of using Microsoft Word to blog are:

1. It is quick and simple.
2. Using Microsoft Word, I don’t have to upload the pictures one by one into WordPress.
3. All the formatting is preserved, for example: italics, bold, underline, fonts, colours.
4. You can set the category from within Word.
5. You can use the Format Painter which is very useful.
6. You have the thesaurus & dictionary nearby.
7. You can insert symbols, shapes and screenshots easily.

1. When you use the blog template on Word, you can’t set the View to large font. (I used to do 140% on Word)
2. Writing code (e.g. SQL scripts) is not so user friendly. I can copy and paste from SSMS into Word, and Word preserves all the blue, red, green and grey colours, which is fantastic, but when I publish, they get messed up.
3. You can’t set the tags. You can set the category, but not the tags. So you need to go to WordPress to add your tags.
4. If you have several blogs in WordPress under 1 account, you can’t switch between accounts. You need to manually edit the blog URL every time you switch. Manage Accounts can only set up several accounts if they have different user names.

## 16 January 2011

### Team Setup for DWBI Projects

Filed under: Business Intelligence,Data Warehousing,Other — Vincent Rainardi @ 8:40 am

A few people have contacted me with regard to the team setup for a BI/DW project. One piece of feedback I received was something like: I’m a PM running a BI project. It’s MSBI end to end: DW/DM on SQL05, IS as ETL, RS reports, AS cubes. I have 3 people working on the project: 1 ETL developer, 1 report/cube developer and 1 data architect/modeller. How do I run the project to ensure that everybody is “loaded”? I.e. in the first month, while the DW is being modelled/designed, the report/cube guy will not be doing anything, and ditto the ETL guy.

Another piece of feedback I received today was: my experience with my clients is that BI teams most often tend to be (way) too small for the task at hand. The Kimball books suggest a team setup with a multitude of roles, spread over preferably as many people; comparing that to my reality, the gap is huge: most roles do exist, but spread over only a handful of people. I find it hard to separate concerns too much, since everything within BI/DW is interconnected; architectural needs are often captured through report demands, two ends of the same rope. What’s a good team size and setup (of course this is related to the project size, but in general)?

I’ve written about this topic in chapter 3 (DW Development Methodology) of my book, and in the early part of chapter 4. But for now I’ll just focus on directly addressing the 2 situations above.

I’m a believer that we learn better from examples than from theory, so I will explain this team setup using an example. The project is very small: only 4 people.

Team Members

In this project we have 4 people: 1 architect, 1 BA, 1 PM, 1 BI Developer. Because the project is very small, we want a hands-on architect. Do not use architects who can only do data modelling. Advertise for a “BI architect”, not a “data architect”; they are completely different. The architect must have DBA experience, i.e. they know how to partition a table, update statistics, etc. When selecting a PM, choose someone who has delivered at least 3 DWBI projects, and ask which part of the DWBI project he delivered, because some PMs only did part of the project, e.g. just the reporting part. It is better to pay an experienced PM highly than an inexperienced PM cheaply. Do not use PMs who are not DWBI PMs. I know that a DWBI PM is difficult to get, but it is a must that you use a DWBI PM, not an “App Dev” PM or any other PM.

The BI Developer must be an all-rounder. Reject people who can only do RS or IS. He or she must be able to do T-SQL, IS, RS and AS. I know they are more expensive, but you have no chance of succeeding if your developer is one-sided (i.e. just RS). And choose someone who has delivered at least 3 BI projects; do not use developers who have done fewer than 3, and ask which part of the BI project they did, i.e. ETL or reports. Pick a BA who really understands the business; ideally he/she has been with the business for at least 3 years. The BA will be involved throughout the project, all the way to the end.

The goal of the project is to build a data warehouse on SQL Server 08, with 10 reports and 3 cubes (TBD), in 6 months. Say it’s a utility company, selling gas & electricity to 3 million customers. The source is 1 business system used for sales, stock, bulk purchase, distribution, etc. They want to analyse sales figures, profitability, supplier performance and distribution performance. The architecture is 1 DDS (see my book, chapter 2), with no ODS or NDS in the middle.

Break Into Small Parts

The key to managing this project effectively (in terms of its resource use and time) is to break the project into several parts, based on functionality. The team first works together to deliver part 1 (say it’s sales analysis). In week 1, the BA & Architect sit down for a day or two, scoping the functionality. Then the architect spends 3 days doing data modelling (designing the DW table structure), just for part 1. @Architect: I would emphasise not worrying too much about getting it 100% right. Just take a first stab at it; you can refine it later. In week 2, the Architect & the BA spend 2 days specifying the mappings (which column in the source goes to which column in the DW). The architect then spends 2 days creating the DW tables. On the first day of week 2, the architect creates the stage tables, which mimic the structure of the source tables. This way the BI Dev will not be idle in week 2: he/she needs to populate the stage tables. Please do not take the number of days literally, as they depend on the size of the work. You could easily double all the numbers, but the proportions should be more or less the same.

Remember that at this point, what we are doing is Sales Analysis (SA). We do not touch customer profitability, supplier performance or distribution performance. Say the ETL goes on for 2 weeks (SA only), done by the BI Developer. In this time the Architect is designing the reports and cube (SA only): starting with the UI (done together with the BA), then “where the data should be sourced”, and then the SQL statements (rough SQL only; leave the details to the developer). The one question the BA should ask him/herself all the time is: is this what the business needs? And the one question the Architect should ask him/herself all the time is: will it work?

Testing and Rhythm

The Architect needs to be able to guide the developer. If the Developer is stuck on a particular point (doesn’t know how to do it), it is the Architect’s job to show how to do it. If there are several approaches to how it can be built (say a view or direct SQL), it is the Architect’s job to select the best approach. While the developer is busy building the reports and the cube, the Architect and the BA do two things: a) create a test plan for SA, and b) design the Customer Profitability (CP) ETL, report & cube. The test plan will probably take half a week, whereas designing CP will probably take 1-1.5 weeks. By the time they are building CP, they will probably have got the hang of the rhythm (the repeated cycle of timing). And so onwards with part 3 (supplier performance) and part 4 (distribution performance).

How about testing? The BA writes the test plan. The BA (helped by the Architect) does the system testing & UAT. The architect prepares the DWBI system for testing. The developer does the unit testing.

It is my experience that a pair of Architect + BA will be able to feed 1 ETL developer and 1 report/cube developer quite adequately. This is the point where I find the balance: the workload is not too much or too little on either side.

Management

What does the PM do? The PM is ultimately responsible for the delivery of the project. Apart from managing the project (project plan, monitoring progress, providing adequate resources, budgeting), there are 2 main things the PM must do: a) shield the team from outside pressures, and b) communicate with outside parties/stakeholders. Some people say that the PM role (in this team structure of 4-5 people) is a part-time job, and I tend to agree. The PM can take on 1 more project (or two) while running this BI project. Some PMs keep 2 schedules: 1 for the external stakeholders and 1 for the team members. This does not work; within 3 weeks he or she will get caught. You need to have 1 schedule and be open. As with any project, the name of the game is trust. For the project to run well, you need to trust the other team members.

For a bigger project, the 3 pillars are still there: PM, BA, Architect. These people are the ones who drive the rest of the project team, from 3 different sides: 1 from the management side, 1 from the business side and 1 from the technical side. You can add 2 more developers and 1 more tester; as long as you break the work into smaller parts as above, and the 3 pillars stand, you will be able to run the project in balance and keep everyone occupied.

Hope this helps. As usual I welcome any question and discussion at vrainardi@gmail.com or via comments. Vincent Rainardi, 16/1/2011

## 22 December 2010

### Current Trend in Data Warehousing and Business Intelligence

Filed under: Business Intelligence,Data Warehousing,Other — Vincent Rainardi @ 9:24 am

It is near the end of the year: 22nd December 2010. It’s been snowing heavily here in the UK, which is rare. As the year end approaches, I’d like to write about the current trends in DWBI: reflect on what has happened in 2010, and what might take off in 2011-2012.

Before I start I’d like to remind the reader that this is what I see, so it’s not the entire truth. It’s only a fragment of the market, i.e. towards where I live (the UK) and the products I’m familiar with. If you are a DWBI vendor, I’m sure you see a more complete picture than me in terms of your products, so please correct me.

Trends in DWBI Software

DWBI Suite: As a complete DWBI tool set/suite, MSBI is by far the most widely used (by number of companies using it, not revenue). OBIEE, SAS, BO & Cognos follow at a far distance. Cognos is not a suite: where’s the ETL tool? It’s more appropriate to call it the “IBM suite”, where they bundle InfoSphere DataStage and DB2 into the mix. There was an ETL tool in Cognos 8 BI (called Data Integration), but in V10 it’s gone. SAS is not a complete DWBI suite as it has no DW; their ETL is DI Studio. BO has an ETL tool (BODI), but no DW/DB either.

ETL: In terms of features, Informatica 9 is at the top, but SSIS is the most widely used, by far. Others are DataStage (InfoSphere), OWB (now ODI in 11g), Ab Initio and BODI. For a more comprehensive list of ETL tools see my old article here. Some of my colleagues who have used Ab Initio claim that it is the fastest ETL tool, due to in-memory look-up/pinning and parallel operations, and that it scales well. But I doubt it. Their approach is very secretive: we had to sign an NDA before seeing the presentation. This usually means that there is nothing behind the wall; it’s a marketing technique where, if you are low, it’s best to keep yourself hidden. ODI & BODI are about equal at number 3 behind SSIS & Informatica (in terms of usage). There has been a change of perception since 2009 about SSIS: it is no longer an inferior product like DTS, and the market take-up since 2008 has been amazing. There are a lot more SSIS jobs than Informatica ones. Here in London many insurance companies use SSIS as their ETL tool, whereas banks use more Informatica, mainly for historical reasons. One or two banks still use Hummingbird Genio. I admit the new ODI has very good features (in particular the Knowledge Module), which could position it above SSIS (in terms of features, not usage).

Reporting tool: Unlike 8 years ago, companies no longer buy a stand-alone reporting product like Crystal Reports. They buy a suite. Lots of companies use SSRS because it comes “free” with SQL Server and the IT dept personnel are able to use it (because they can download & install it, unlike BO & Cognos). Companies with more budget usually opt for BO, Cognos or MicroStrategy. The BO XIR2 developer edition is £1750 per user, whereas Webi is £500 per user; the Dev Ed allows you to create reports, whereas Webi is for reading/running reports. Compare that price with SSRS, which is free. This is why some companies are migrating from BO to RS (I’ve come across 3 companies in London this year). With regard to prices, I want to point out that they vary widely from company to company, and not only for BO but for all vendors: company A could pay £1000/user whereas company B pays only £200/user for the same product.

BPM (Business Performance Management), aka EPM (Enterprise PM) or CPM (Corporate PM): Hyperion (Oracle), SAS, Board, BO & Cognos. Board is semi self-service BI. BO’s offering is called Planning and Consolidation. In 2007 SAP bought OutlookSoft and Pilot, but they seem to be gone now. Hyperion is by far the top in this market; it’s a pity OutlookSoft is no longer there to challenge it. SAS looks good and has proper Financial Mgt.
The definition of BPM is never clear; there is lots of marketing & labelling. Budget vs actual, planning, KPIs, all thrown into the mix. Traditionally it’s Financial Mgt, like Hyperion & SAS, but then all sorts of BI are thrown in, including dashboards, scorecards & KPIs. So when people say B/E/CPM, first clarify what they mean by it.

OLAP: SSAS is as strong as ever: by far the most widely used cube tool. I would estimate AS is probably 60% of the market now (usage, not revenue, i.e. out of 100 companies who use cubes, 60 of them use SSAS; my estimate, not market research). I’ve seen a few banks now using SSAS. Oracle Essbase & IBM PowerPlay follow far behind. The Oracle 11g OLAP option could be a contender in the next 2 years. MicroStrategy is a respectable player in OLAP and has been in the market for a very long time (20 years?), consistently ROLAP since the beginning. It is very useful to read their comparison to other vendors here. There is a big change going on with SSAS in terms of UDM & in-memory OLAP. Read about MS BISM (Semantic Model) here (TK Anand), here (Chris Webb), here (Boyan Penev), here (Teo Lachev), here (Marco Russo), here (Jason Thomas) and here (Thomas Ivarsson).

RDBMS (SMP): The most widely used RDBMS for DW is without a doubt SQL Server (usage, not revenue), by far. Then Oracle, then DB2. Oracle and especially DB2 are used only by big companies, whereas SQL Server is used by companies of all sizes. Sybase & Informix are not in the game; just like MySQL & Ingres, I hear of people using Sybase or Informix for data warehousing perhaps once a year, if that. There has been a total change of perception with regard to SQL Server’s capacity to handle big DWs. 5 years ago the perception (especially amongst London banks) was that SQL Server couldn’t handle big DWs (say upwards of 3 TB), so they used Oracle/DB2 for “big DWs” and SQL Server was approved for “small DWs”. The perception now is that SQL Server 2008’s capability for large DWs is equal to Oracle 11g’s. I know an investment bank that uses SQL 2008 for their huge DW (upwards of 15 TB). This take-up is largely due to SQL 2008’s partitioning capability. Both Oracle and DB2 are still very respectable DW platforms; some people still think they are “slightly better” than SQL Server in terms of performance for >5 TB DWs.

Visualisation: What’s that? OLAP client? Reporting client? Charting? Well, all of the above: MicroStrategy, CubePlayer, Tableau, QlikView, Excel, SSRS, SCA (Strategy Companion Analyzer), Panorama, Crescent. For an SSAS client, I think Crescent will be the winner, but that’s 12 months away. How about now? On features: not Panorama, but Tableau, CubePlayer and SCA. On architecture/installation etc.: not Excel, but SCA. Vidas Matelis has composed a comprehensive list of SSAS clients here. The issue with SSAS clients is not the features but the vendors: the majority are small vendors. Which ones will still be standing in 10 years’ time? How can we be sure that they will not be bought and then killed, like ProClarity?

Trend in DWBI Jobs

It is important for DWBI architects and managers to consider the resource/people/skill factor when choosing DWBI software. Employees & contractors also need to actively watch the trends, to consider what training they need to do in 2011. Agencies know this well, but employees/contractors usually do not. The best site for salary/rate trends is ITJobsWatch. Below I only mention the contract rates; for salaries, just click the “Permanent” tab on the ITJobsWatch site.
ETL: Informatica rates are higher than SSIS. In the London market (not UK rates), the current SSIS market rate is £350-450 DOE, whereas Informatica is £500-600 (because you can’t download the product like you can SSIS). Note that the £450 SSIS contractor also comes with SSRS & SQL Server skills, and to a degree SSAS, so it’s a clear advantage to the recruiter.
OBIEE: Gaining popularity; more people want OBIEE developers. This is caused by version 11g. Good rates too: OBIEE 11g was £500 in Nov but is now £600. Big shortage (many installations but few developers), unlike MS BI, which is saturated.
MPP: No change in Teradata & Netezza skills. Still very low volume. Stable rates, 375-450. Risky to get into because of the low volume, and for most people it’s impossible to get into: not only can you not download an evaluation version, but where would you get the hardware from? The only chance is taking training at Marylebone (Teradata) or Marble Arch (Netezza).
Solvency II: Solvency II is the name of the game now, driving lots of DWBI projects, so people with Lloyd’s market knowledge have an advantage. This trend will run until 2012, perhaps 2013. Again, it’s London (and Europe, e.g. Dublin, Brussels, Zurich); it does not happen in, for example, Boston or Melbourne.
BO: BO is interesting: decreasing volume (companies are moving to RS) but increasing rates (400-450): see here. The best thing about BO is the chance to get SAP experience.
QlikView: The skill is more in demand than last year (the number of jobs has doubled) but rates have decreased (understandably): was 400, now 350. Again, this is the London rate; outside London it is lower, see here.

Future Trend in BI (not DW but BI)

Some people ask me what’s next in BI. This is my usual answer: there are 2 things that will gain popularity in the next 2 years.
Self-Service BI. It is already gaining popularity. Users want to compose their own reports and connect to the DW/DM themselves. I can imagine it will be a mess in a few years, but there you go, that’s the trend. Mainstream BI will still be built by IT; only “personal” views are in the users’ hands. The main players are QlikView & PowerPivot, and Board to a degree. Nobody has heard of Rosslyn yet (“What’s that?” is the response I’m getting at the moment). In terms of features, QlikView is still perceived to be much better than PowerPivot, as PowerPivot is still version 1. But the take-up is good; I’ve seen a few companies using it, so it’s only a matter of time before PowerPivot overtakes QlikView.
Streaming BI. In weather and science research streaming BI is old news, temperature analysis for example, but in commercial companies this is new. StreamInsight is the major player (part of SQL Server 2008 R2). I think it will take off in the next 2 years. People want to analyse the data as soon as it comes in. FX trading is an example (well, other asset classes too); any kind of online investment trading really. Oh yes, and risk analytics. It’s not quite a stream at the moment, more of a big flow. It goes against all DW principles, which is why it feels a bit odd: DW is batch, nightly, not second-by-second streaming. It’s a revolution, I know, but it’s coming. Read here for an intro and here for the architecture.

People make mistakes, so if you know anything is incorrect please correct me. As usual I welcome discussion & questions at vrainardi@gmail.com.

Vincent Rainardi, 22/12/2010

## 11 December 2010

### Data Warehousing Interview Questions

Filed under: Data Warehousing,Other — Vincent Rainardi @ 10:59 pm
Tags: ,

Following my article last week, “SSAS Developer Interview Questions”, which generated quite a lot of responses, I thought I’d write a similar thing on data warehousing. Like before, I will be using LMH to show the complexity (Low, Medium, High). This time, I’ll add the “purpose”, i.e. what each question is designed for.

Data warehousing skills never stand on their own. You don’t interview people just for their data warehousing knowledge. Usually you are looking for an ETL developer (Informatica, DataStage, BODI, SSIS), a Microsoft BI developer (RS/AS/IS), a BO/Cognos (Report) Developer, a Data Warehouse Architect, a BI Solution Architect, an ETL Architect, a Data Architect or an Enterprise Architect. All these roles require data warehousing knowledge. So most likely you will be combining these questions with other questions, e.g. for an Informatica developer you would combine them with Informatica questions.

When I interview to fill a role, what I look for particularly is whether the candidate can do the job or not. Nowadays it is different from 10 years ago: we now have Google and Bing, and anything we don’t know we can quickly Google. So the amount of knowledge is not as important now. What is important (to me) is a) experience, b) problem solving and c) character. I remember about 3 years ago I was interviewing a candidate for a data architect role. The candidate was a speaker at a data architecture conference, so we were convinced she must be very good. I asked her what the rule for the 2nd normal form was. She could not answer it. I asked about the 1st and 3rd normal forms and she could not answer them either. This is bread and butter; you can’t do data modelling without knowing the normalisation rules.

So in the interview questions below, which crystallise from my experience, I put both knowledge-based and experience/problem-solving questions. Generally I’m less interested in “theory only” questions, so I try to wrap them up in real-world problems/situations.

I’m a firm believer that experience is the best teacher, so at interviews I always try to find out if the candidate has done it before, and I test them using everyday problems. People who have done data warehousing will surely have come across those problems and understand what the solutions are. People who are new to data warehousing, or only know the DW theory from books, wouldn’t have encountered those problems and would not have a clue what the answers are.

1. Question: How do you implement Slowly Changing Dimension type 2? I am not looking for the definition, but the practical implementation e.g. table structure, ETL/loading. {M}

Answer: Create the dimension table as normal, i.e. first the dim key column as an integer, then the attributes as varchar (or varchar2 if you use Oracle). Then I’d create 3 additional columns: IsCurrent flag, “Valid From” and “Valid To” (they are datetime columns). With regards to the ETL, I’d check first if the row already exists by comparing the natural key. If it exists then “expire the row” and insert a new row. Set the “Valid From” date to today’s date or the current date time.

An experienced candidate (particularly DW ETL developer) will not set the “Valid From” date to the current date time, but to the time when the ETL started. This is so that all the rows in the same load will have the same Valid From, which is 1 millisecond after the expiry time of the previous version thus avoiding issue with ETL workflows that run across midnight.
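As a sketch, the expire-and-insert step might look like this in SQL Server syntax (the table and column names are illustrative, and @EtlStartTime is assumed to hold the time the ETL run began):

```
-- Expire the current version of the row
update dim_customer
set    IsCurrent   = 0,
       [Valid To]  = @EtlStartTime
where  customer_id = @NaturalKey
  and  IsCurrent   = 1

-- Insert the new version, valid from the ETL start time
insert into dim_customer
  (customer_id, customer_name, IsCurrent, [Valid From], [Valid To])
values
  (@NaturalKey, @NewName, 1, @EtlStartTime, '9999-12-31')
```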

Purpose: SCD 2 is one of the first things that we learn in data warehousing; it is considered basic/fundamental knowledge. The purpose of this question is to separate the quality candidates from the ones who are bluffing. If the candidate cannot answer this question you should worry.

2. Question: How do you index a fact table? And explain why. {H}

Answer: Index all the dim key columns, individually, non clustered (SQL Server) or bitmap (Oracle). The dim key columns are used to join to the dimension tables, so if they are indexed the join will be faster. An exceptional candidate will suggest 3 additional things: a) index the fact key separately, b) consider creating a covering index in the right order on the combination of dim keys, and c) if the fact table is partitioned the partitioning key must be included in all indexes.
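For illustration (SQL Server syntax; the fact table and dim key names are hypothetical):

```
-- One non-clustered index on each dim key column, individually
create nonclustered index IX_FactSales_DateKey     on FactSales (DateKey)
create nonclustered index IX_FactSales_CustomerKey on FactSales (CustomerKey)
create nonclustered index IX_FactSales_ProductKey  on FactSales (ProductKey)

-- In Oracle, bitmap indexes would be used instead, e.g.:
-- create bitmap index ix_factsales_datekey on FactSales (DateKey)
```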

Purpose: Many people know data warehousing only in theory or only in logical data model. This question is designed to separate those who have actually built a data warehouse and those who haven’t.

3. Question: In the source system, your customer record changes like this: customer1 and customer2 now becomes one company called customer99. Explain a) impact to the customer dim (SCD1), b) impact to the fact tables. {M}

Answer: In the customer dim we update the customer1 row, changing it to customer99 (remember that it is SCD1). We do a soft delete on the customer2 row by updating its IsActive flag column (a hard delete is not recommended). On the fact table we find the rows carrying customer2’s surrogate key and update them with customer99’s SK (which is customer1’s old SK, since that row was updated in place).
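These updates could be sketched like this (illustrative table and column names, assuming customer99 keeps customer1’s surrogate key):

```
-- SCD1: overwrite customer1's attributes in place
update dim_customer
set    customer_name = 'customer99'
where  customer_id = 'customer1'

-- Soft delete customer2's row
update dim_customer
set    IsActive = 0
where  customer_id = 'customer2'

-- Repoint fact rows from customer2's SK to customer99's SK
update fact_sales
set    customer_key = (select customer_key from dim_customer where customer_id = 'customer1')
where  customer_key = (select customer_key from dim_customer where customer_id = 'customer2')
```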

Purpose: This is a common problem that everybody in data warehousing encounters. By asking this question we will know if the candidate has enough experience in data warehousing. If they have not come across this (probably they are new in DW), we want to know if they have the capability to deal with it or not.

4. Question: What are the differences between Kimball approach and Inmon’s? Which one is better and why? {L}

Answer: If you are looking for a junior role, e.g. a developer, then the expected answer is: in Kimball we do dimensional modelling, i.e. fact and dim tables, whereas in Inmon’s we do CIF, i.e. an EDW in normalised form, and we then create a DM/DDS from the EDW. Junior candidates usually prefer Kimball, because of query performance and flexibility, or because that’s the only one they know; which is fine. But if you are interviewing for a senior role, e.g. a senior data architect, then they need to say that the approach depends on the situation. Both Kimball’s & Inmon’s approaches have advantages and disadvantages. I explained some of the main reasons for having a normalised DW here.

Purpose: a) to see if the candidate understands the core principles of data warehousing or just “knows the skin”, b) to find out if the candidate is open-minded, i.e. understands that the solution depends on what we are trying to achieve (there’s no right or wrong answer), or if they blindly use Kimball for every situation.

5. Question: Suppose a fact row has unknown dim keys, do you load that row or not? Can you explain the advantage/disadvantages? {M}

Answer: We need to load that row so that the total of the measure/fact is correct. To enable us to load the row, we need to set the unknown dim key either to 0 (the unknown member) or to the dim key of a newly created dim row. Alternatively, we can choose not to load that row (so the total of the measure will be different from the source system) if the business requirement prefers it. In that case we load the fact row into a quarantine area, complete with error processing, a DQ indicator and an audit log. On the next day, after we receive the dim row, we load the fact row. This is commonly known as Late Arriving Dimension Rows, and there are many sources for further information; one of the best is Bob Becker’s article here from 2006. Others refer to this as Early Arriving Fact Rows, which Ralph Kimball explained here in 2004.
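The “unknown member” variant could be sketched like this during the fact load (hypothetical names; dim key 0 is assumed to be the unknown member):

```
-- Look up the dim key; fall back to 0 (the unknown member)
-- if the dim row has not arrived yet
update stage_fact
set    customer_key = coalesce(
         ( select d.customer_key
           from   dim_customer d
           where  d.customer_id = stage_fact.customer_id
             and  d.IsCurrent = 1 ), 0)
```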

Purpose: again this is a common problem that we encounter on a regular basis in data warehousing. With this question we want to see if the candidate’s experience level is up to expectations or not.

6. Question: Please tell me your experience on your last 3 data warehouse projects. What were your roles in those projects? What were the issues and how did you solve them? {L}

Answer: There’s no wrong or right answer here. With this question you are looking for a) whether they have done similar things to your current project, b) whether they have done the same role as the role you are offering, c) whether they have faced the same issues as your current DW project.

Purpose: Some of the reasons why we pay more to certain candidates compared to others are: a) they have done it before, so they can deliver quicker than those who haven’t, b) they come from our competitors, so we would know what’s happening there and can build a better system than theirs, c) they have solved similar issues so we could “borrow their techniques”.

7. Question: What are the advantages of having a normalised DW compared to dimensional DW? What are the advantages of dimensional DW compared to normalised DW? {M}

Answer: For the advantages of having a normalised DW see here and here. The advantages of a dimensional DW are: a) flexibility, e.g. we can accommodate changes in the requirements with minimal changes to the data model, b) performance, e.g. you can query it faster than a normalised model, c) it’s quicker and simpler to develop than a normalised DW and easier to maintain.

Purpose: to see if the candidate has seen “the other side of the coin”. Many people in data warehousing only know Kimball/dimensional. The second purpose of this question is to check if the candidate understands the benefits of dimensional modelling, which is a fundamental understanding in data warehousing.

8. Question: What is 3rd normal form? {L} Give me an example of a situation where the tables are not in 3rd NF, then make it 3rd NF. {M}

Answer: No column is transitively dependent on the PK. For example, suppose column2 depends on the PK (column1), and column3 depends on column2. In this case column3 is “transitively dependent” on column1. To make it 3rd NF we need to split it into 2 tables: table1, which has column1 & column2, and table2, which has column2 & column3.
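A concrete example (illustrative tables): an orders table that stores the customer’s city is not in 3rd NF, because the city depends on customer_id rather than on the PK order_id:

```
-- Not in 3rd NF: customer_city is transitively dependent
-- on order_id via customer_id
create table orders
( order_id      int primary key,
  customer_id   int,
  customer_city varchar(50)
)

-- In 3rd NF: the transitive dependency is split into its own table
create table orders
( order_id    int primary key,
  customer_id int
)
create table customers
( customer_id   int primary key,
  customer_city varchar(50)
)
```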

Purpose: A lot of people talk about “3rd normal form” but they don’t know what it means. This is to test if the candidate is one of those people. If they can’t answer 3rd NF, ask 2nd NF. If they can’t answer 2nd NF, ask 1st NF.

9. Question: Tell me how to design a data warehouse, i.e. what are the steps of doing dimensional modelling? {M}

Answer: There are many ways, but it should not be too far from this order: 1. Understand the business process, 2. Declare the grain of the fact table, 3. Create the dimension tables including attributes, 4. Add the measures to the fact tables (from Kimball’s Toolkit book chapter 2). Step 3 and 4 could be reversed (add the fact first, then create the dims), but step 1 & 2 must be done in that order. Understanding the business process must always be the first, and declaring the grain must always be the second.

Purpose: This question is for data architect or data warehouse architect to see if they can do their job. It’s not a question for an ETL, report or cube developer.

10. Question: How do you join 2 fact tables? {H}

Answer: It’s a trap question. You don’t usually join 2 fact tables especially if they have different grain. When designing a dimensional model, you include all the necessary measures into the same fact table. If the measure you need is located on another fact table, then there’s something wrong with the design. You need to add that measure to the fact table you are working with. But what if the measure has a different grain? Then you add the lower grain measure to the higher grain fact table. What if the fact table you are working with has a lower grain? Then you need to get the business logic for allocating the measure.

It is possible to join 2 fact tables, i.e. using the common dim keys. But the performance is usually horrible, hence people don’t do this in practice, except for small fact tables (<100k rows). For example: if FactTable1 has dim1key, dim2key, dimkey3 and FactTable2 has dim1key and dim2key then you could join them like this:

```
select f2.dim1key, f2.dim2key, f1.measure1, f2.measure2
from
( select dim1key, dim2key, sum(measure1) as measure1
  from FactTable1
  group by dim1key, dim2key
) f1
join FactTable2 f2
  on f1.dim1key = f2.dim1key
 and f1.dim2key = f2.dim2key
```

So if we don’t join 2 fact tables that way, how do we do it? The answer is using the fact key column. It is a good practice (especially in SQL Server because of the concept of cluster index) to have a fact key column to enable us to identify rows on the fact table (see my article here). The performance would be much better (than joining on dim keys), but you need to plan this in advance as you need to include the fact key column on the other fact table.

```
select f2.dim1key, f2.dim2key, f1.measure1, f2.measure2
from FactTable1 f1
join FactTable2 f2
  on f2.fact1key = f1.factkey
```

I implemented this technique originally for self-joining, but then expanded its usage to joining to other fact tables. But this must be used on an exception basis rather than as the norm.

Purpose: not to trap the candidate of course. But to see if they have the experience dealing with a problem which doesn’t happen every day.

11. Question: How do you index a dimension table? {L}

Answer: clustered index on the dim key, and non clustered index (individual) on attribute columns which are used on the query’s “where clause”.

Purpose: this question is critical if you are looking for a Data Warehouse Architect (DWA) or a Data Architect (DA). Many DWAs and DAs only know the logical data model; many of them don’t know how to index, or how different the physical tables are in Oracle compared to Teradata. This question is not essential if you are looking for a report or ETL developer. It’s good for them to know, but it’s not essential.

12. Question: Tell me what you know about William Inmon? {L} Alternatively: Ralph Kimball.

Answer: He was the one who introduced the concept of data warehousing. Arguably Barry Devlin was the first one, but he’s not as popular as Inmon. If you ask who is Barry Devlin or who is Claudia Imhoff 99.9% of the candidates wouldn’t know. But every decent practitioner in data warehousing should know about Inmon and Kimball.

Purpose: to test if the candidate is a decent practitioner in data warehousing or not. You’ll be surprised (especially if you are interviewing a report developer) how many candidates don’t know the answer. If someone is applying for a BI architect role and has never heard of Inmon, you should worry.

13. Question: How do we build a real time data warehouse? {H}

Answer: if the candidate asks “Do you mean real time or near real time” it may indicate that they have a good amount of experience dealing with this in the past. There are two ways we build a real time data warehouse (and this is applicable for both Normalised DW and Dimensional DW):

a) By storing previous periods’ data in the warehouse then putting a view on top of it pointing to the source system’s current period data. “Current period” is usually 1 day in DW, but in some industries e.g. online trading and ecommerce, it is 1 hour.
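Approach a) could be sketched as a view like this (SQL Server syntax; all names are illustrative):

```
-- History from the warehouse plus today's rows straight from the source
create view fact_sales_realtime as
select sale_date, customer_key, amount
from   dw.fact_sales                -- previous periods, loaded in batch
where  sale_date <  cast(getdate() as date)
union all
select s.sale_date, d.customer_key, s.amount
from   source_db.sales s            -- current period, read live from source
join   dw.dim_customer d on d.customer_id = s.customer_id
where  s.sale_date >= cast(getdate() as date)
```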

b) By storing previous periods’ data in the warehouse then use some kind of synchronous mechanism to propagate current period’s data. An example of synchronous data propagation mechanism is SQL Server 2008’s Change Tracking or the old school’s trigger.

Near real time DW is built using asynchronous data propagation mechanism, aka mini batch (2-5 mins frequency) or micro batch (30s – 1.5 mins frequency).

Purpose: to test if the candidate understands complex, non-traditional mechanisms and follows the latest trends. Real time DW was considered impossible 5 years ago and has only been developed in the last 5 years. If the DW is normalised it’s easier to make it real time than if the DW is dimensional, as there’s a dim key lookup involved.

14. Question: What is the difference between a data mart and a data warehouse? {L}

Answer: Most candidates will answer that one is big and the other is small. Some good candidates (particularly Kimball practitioners) will say that a data mart is one star, whereas a DW is a collection of stars. An excellent candidate will say all of the above, plus that a DW could be the normalised model that stores the EDW, whereas a DM is the dimensional model containing 1-4 stars for a specific department (both relational DB and multidimensional DB).

Purpose: The question has 3 different levels of answer, so we can see how deep the candidate’s knowledge in data warehousing.

15. Question: What is the purpose of having a multidimensional database? {L}

Answer: Many candidates don’t know what a multidimensional database (MDB) is. They have heard about OLAP, but not MDB. So if the candidate looks puzzled, help them by saying “an MDB is an OLAP database”. Many will say “Oh… I see” but actually they are still puzzled so it will take a good few moments before they are back to earth again. So ask again: “What is the purpose of having an OLAP database?” The answer is performance and easier data exploration. An MDB (aka cube) is a hundred times faster than relational DB for returning an aggregate. An MDB will be very easy to navigate, drilling up and down the hierarchies and across attributes, exploring the data.

Purpose: This question is irrelevant to report or ETL developer, but a must for a cube developer and DWA/DA. Every decent cube developer (SSAS, Hyperion, Cognos) should be able to answer the question as it’s their bread and butter.

16. Question: Why do you need a staging area? {M}

a) Some data transformations/manipulations from source system to DWH can’t be done on the fly, but requires several stages and therefore needs to “be landed on disk first”

b) The time to extract data from the source system is limited (e.g. we were only given 1 hour window) so we just “get everything we need out first and process later”

c) For traceability and consistency, i.e. some data transform are simple and some are complex but for consistency we put all of them on stage first, then pick them up from stage for further processing

d) Some data is required by more than 1 parts of the warehouse (e.g. ODS and DDS) and we want to minimise the impact to the source system’s workload. So rather than reading twice from the source system, we “land” the data on the staging then both the ODS and the DDS read the data from staging.

Purpose: This question is intended more for an ETL developer than a report/cube developer. Obviously a data architect needs to know this too.

17. Question: How do you decide that you need to keep it as 1 dimension or split it into 2 dimensions? Take for example dim product: there are attributes which are at product code level and there are attributes which are at product group level. Should we keep them all in 1 dimension (product) or split them into 2 dimensions (product and product group)? {H}

Answer: Depends on how they are going to be used, as I explained in my article “One or two dimensions” here.

Purpose: To test if the candidate is conversant in dimensional modelling. This question especially is relevant for data architects and cube developers and less relevant for a report or ETL developer.

18. Question: Fact table columns are usually numeric. In what case does a fact table have a varchar column? {M}

Answer: When it stores a degenerate dimension, i.e. a transaction reference such as an order ID (see question 19).

Purpose: to check if the candidate has ever involved in detailed design of warehouse tables. Follow up with question 19.

19. Question: What kind of dimension is a “degenerate dimension”?  Give me an example. {L}

Answer: A “dimension” which resides in the fact table. It is usually the reference number of the transaction, for example: transaction ID, payment ref and order ID.

Purpose: Just another question to test the fundamentals.

20. Question: What is snowflaking? What are the advantages and disadvantages? {M}

Answer: In dimensional modelling, snowflaking is breaking a dimension into several tables by normalising it. The advantages are: a) performance when processing dimensions in SSAS, b) flexibility if the sub-dim is used in several places, e.g. city is used in dim customer and dim supplier (or in an insurance DW: dim policy holder and dim broker), c) one place to update, and d) the DW load is quicker as there is less duplication of data. The disadvantages are: a) more difficulty in “navigating the star*”, i.e. you need to join a few tables, b) worse “sum group by*” query performance (compared to a “pure star*”), c) less flexibility in accommodating requirements, i.e. the city attributes for dim supplier have to be the same as the city attributes for dim customer, d) the DW load is more complex as you have to integrate the city data.

*: a “star” is a fact table with all its dimensions, “navigating” means browsing/querying, “sum group by” is a SQL select statement with a “group by” clause, pure star is a fact table with all its dimensions and none of the dims are snow-flaked.
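For illustration, a snowflaked customer dimension might look like this (hypothetical names):

```
-- The city attributes are normalised out into their own table,
-- shared by dim_customer and dim_supplier
create table dim_city
( city_key  int primary key,
  city_name varchar(50),
  country   varchar(50)
)
create table dim_customer
( customer_key  int primary key,
  customer_name varchar(100),
  city_key      int references dim_city (city_key)
)
```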

Purpose: Snowflaking is one of the classic debates in the dimensional modelling community. It is useful to check if the candidate understands the reasons, rather than just “following blindly”. This question is applicable particularly to data architects and OLAP designers; if their answers are way off then you should worry. But it is also relevant to ETL and report developers, as they will be populating and querying the structure.

I hope these interview questions will be useful for the data warehousing community. Not only for the interviewees but also for the interviewers. I’m sure I made some mistakes in this article (everybody does) so if you spot one please contact me. As usual I welcome any question & discussion at vrainardi@gmail.com.

Vincent Rainardi, 11/12/2010

Update 6/5/2012: Just a quick note that Arshad Khan has put together 500 data warehousing and BI questions and answers in his new book: Business Intelligence & Data Warehousing Simplified: 500 Questions, Answers, & Tips. It is by far the most comprehensive set of interview questions I have ever seen and will be very useful for preparing for interviews. Some of the questions are:

• What is a fact?
• What are the different types of facts?
• What are the characteristics of a fact table?
• What is fact table granularity?
• What is OLAP?
• What are the benefits of OLAP?
• What are OLAP limitations?
• How does OLAP impact the data warehouse?
• What are OLAP Models?
• What is the Inmon approach?
• What is the Kimball approach?
• What is a star schema?
• What are the benefit of a star schema?
• What is the snowflake schema?

## 30 November 2010

### SSAS Developer Interview Questions

Filed under: Analysis Services,Other — Vincent Rainardi @ 9:47 pm
Tags: ,

There are 5 key areas to test:

1. Cube design: the stages of creating a cube in BIDS, i.e. data source, DSV, dimension, cube.
2. Performance tuning: troubleshooting query performance as well as processing performance.
3. MDX: the multidimensional expression language that we all love.
4. Client tools: creating reports and dashboards on Strategy Companion, Excel, Tableau, Panorama, etc.
5. Administration: installing and upgrading SSAS servers, configuring SSAS server settings, security, processing, deployment, migration, backing up and restoring the cubes.

Beyond these 5, I think it is very important for an SSAS developer (and an ETL developer) to understand the concept of dimensional modelling well. Hence I like to throw in one or two Kimball questions. Usually candidates understand dimensional modelling, at least the basics, but we need to make sure.

1. Cube design

Out of the 5 things above, Cube Design is the most important one. It is the bread and butter of the job. We need to establish 2 things here:

a) Do the candidates know basic cube design (the features that your company uses now)?

b) Do they know advanced cube design (the features that your company doesn’t use now)?

Because it depends on the SSAS features your company uses, what constitutes “basic” differs from company to company. But it’s probably not far from this list:

• Data source: connection string, impersonation
• DSV: table, view, named query, relationships between entities
• Dimension: hierarchy, sort order, attribute relationships, OrderByAttribute, display folder, default member
• Cube: referenced dimension, many-to-many, partitioning, aggregation, measure expression, format string, calculated members, error configuration, display folder
• Role: membership, cube access

• DS & DSV: multiple DS, multiple DSVs, logical primary key, friendly name, retrieve relationship, schema restriction, number of connections
• Dimension: parent child dim, translations, date calculation dimension, multiple KeyColumns, ragged dim, type 2 dim, custom rollup, unary operator, write enabled, ROLAP.
• Cube: actions (URL & drill through), translations, perspectives, HOLAP, proactive caching, input dimension (dimension which is not connected to any MG, but used in calculation), IsAggregatable, KPI.
• Role: dimensional security, cell security
• AMO: doing all of the above programmatically using C#.

I usually like to establish first if the candidate can do the job, i.e. the basic stuff above / the things that we use in the company. I do this by picking 3 questions from the basic stuff and seeing if they are comfortable. If they are, I pick 3 from the advanced stuff. If they are not comfy with the basics, the question I need to answer is whether the candidate would be able to sort things out by themselves on the job. I do that by giving them some hints.

Whenever possible, I prefer to wrap the question in a business context, rather than bare theory. For example, question a) below has a business context, whereas question b) is theory:

a) Explain how you handle currency conversion.
They can use measure expressions or many-to-many relationships; either way, we know immediately whether they have done this stuff or not.

b) Explain different modes of impersonation.
That is just theory. I prefer to give a context, i.e. when would we want to use the service account and when a specific user name? Or: if the connection string already defines how to connect to the database, then what’s impersonation for?

2. Performance tuning

I think what I wrote here [Optimising Cube Query and Processing Performance] is quite comprehensive to be used as a base for our questions. As per its title, it covers the troubleshooting of both query performance and processing performance.

For query performance we can ask MDX queries, aggregation, partitioning, server settings. For processing performance: dimension, materialized view, partitioning and the most important of them all is incremental processing. I explain more about processing performance here.

Again I like to wrap them in a case, for example: a cube takes 5 hours to process and we need to get it down to 1 hour. How can we do that? Many candidates never optimise dimension processing, but any good SSAS developer will at least understand partitioning. An expert developer may even give you a lecture about why aligning table partitioning and incremental partition processing are very important.

Another favourite question of mine is what to do if a) a query is slow, and b) a certain measure is slow (other measures are fine). For a) trap the MDX, run it in MDX Studio (thanks Mosha) and find out which part makes it slow by altering the query bit by bit. For b) establish first if it’s a calculated measure or base measure. If it’s calculated, look at the MDX and analyse on MDX Studio. If it’s a base measure, look at the storage mode (ROLAP?), the dimension usage tab (relationship type = fact or many to many? Try removing some relationships & reprocess), measure expression (is it a simple A/B formula?), simplify the attributes involved in a query (perhaps the issue is with the attributes, not with the measures?).

3. MDX

This is always tricky; not figuring out what to ask, but what level of MDX is needed in your company/project. Any SSAS developer can do basic MDX, right? Not quite. Yes, they can do a basic MDX select statement, but they might not know even basic things like filtering and sorting. So we need to test those. I’d recommend the following list:

Count, YTD, ParallelPeriod, Filter, Existing, Order, FirstChild/LastChild, Ascendants/Descendants, CurrentMember, If/Iif/Case, Head & Tail, CrossJoin, NonEmpty, Except, Scope, TopCount/BottomCount.

These are considered everyday functions. They separate SSAS developers who only query their cubes using client tools from those who write MDX to query the cubes. Obviously we don’t need to ask all of them; just pick 3 or 4.

Before we test those functions, it’s worth establishing if they grasp the concepts of multidimensional database (OLAP) or not. For example: tuple, set, cube space, axis, level, member. If they can’t differentiate a tuple from a set, it’s probably not worth asking them any MDX function.

You might argue, “Oh, but we don’t use MDX here.” But it is essential that they understand the concepts and the basic functions. You can’t effectively compose SSRS reports displaying data from your cubes if your SSAS developer can’t write MDX. You can’t effectively build a cube if you don’t understand multidimensional database concepts.

XMLA (ASSL) is a bit different. It is not as important as MDX, because you can script it in SSMS, whether to create or alter SSAS objects, or to process them. Besides, we have AMO for creating/altering SSAS objects programmatically. With AMO you get a proper development environment with looping and branching, so you have more control. And it’s .NET. True, you can use tools like Excel, Panorama & ProClarity to get the MDX, or use Profiler to trap Strategy Companion & Tableau’s MDX statements, but you can’t optimise the calculated members if your SSAS developer doesn’t know MDX.

Another thing to consider is whether you need to cover DAX. This depends on whether you are using (or will be using) PowerPivot or not. But DAX is becoming more and more central in MS OLAP technology so in the long run (like 2 years from now) you will need to cover DAX. But for now it’s still optional, depending on the use of PowerPivot.

4. Client Tools

Specific features differ from one client tool to the next, but the basic ideas are similar: creating reports, charts and dashboards. Some of the specific features to test are: creating a custom set, creating a calculated member, time calculation functions such as YTD and QTD, hyperlinks to different reports, and drilldown. One of my favourite questions is whether a specific calculation should be done in the client tool or in SSAS.

The administration of the client tool is usually done by somebody else, not the SSAS developer, so we shouldn’t worry about it in this interview. But if you are a small company then usually one person does it all, in which case you will need to ask the candidate about installation, migration, user security, and other aspects of the client tool.

5. Administration

Administration tasks are not the responsibility of an SSAS developer. They are the responsibility of the SQL DBA. These days, in large corporations, SQL Server DBAs are asked to manage all 4 aspects of SQL Server. They manage not only the relational engine, but also SSAS server, SSIS packages and SSRS servers. But in smaller companies SSAS developers are involved in administering the servers. So I include this section (section 5, Admin) in the interview.

The Questions

Enough with the background, let’s get on with the questions. I’m going to label each question with 1-5 indicating the 5 areas above, and LMH (low, medium, high) indicating the difficulty. I’ll put the label in curly brackets: {}. Hopefully the labels help you compose the right mix of interview questions.

1. How do you ensure that January, February, March, etc will be in the correct sequence when the user browses the cube? {1-L}
2. In SSAS, how do you design currency conversion for a) from many transaction currencies to 1 reporting currency, b) from 1 transaction currency to several reporting currencies and c) from many transaction currencies to many reporting currencies? {1-M}
3. A user reported that a Panorama report that she uses every day is very slow today. Normally it opens in 1 or 2 seconds, but today it didn't open at all (she has waited for 4 minutes). Explain how you would handle this case systematically. {4-L}
Note: you can still ask this question even if you are not using Panorama.
4. We have a risk analysis cube. This cube is 500 GB in size, containing 20 billion fact rows. Explain how we: a) process this cube efficiently; b) ensure a good query performance. {2-M}
5. Explain (step by step) how you build a cube from a data mart. {1-L}
6. Explain how you migrate a cube from a dev AS server to production. {5-L}
7. A cube has 300 partitions and we want to process only the partitions whose fact table rows have changed. We don't want to process the entire measure group. How do we do that? {2-M}
8. We have a cube with 1 measure and 1 dimension. That dimension has only 1 attribute. Write an MDX to list the members of that attribute of which measure1 is bigger than 10. We want to sort it in descending order, i.e. largest amount first. {3-L}
9. Like above but sort on the attribute, not on the measure. {3-L}
10. People say that we shouldn’t use cell security because it’s slow. Explain how we should use it then? How do you combine it with dimensional security? {2-H}
11. What’s the difference between a tuple and a set? What is an axis? {3-L}
12. You need to grant access to 200 users to access a cube. Each user can only access certain data within the cube. The access is by product and region, i.e. user1-20 can only access product1-50 and region1-3, user21-40 can only access product30-80 and region2-5. How would you do that? {5-M}
13. Some users need to see the sales figures in USD, other users in GBP. How would you do that? {1-H}
14. You need to upgrade 4 SSAS servers (dev, UAT, prod, DR) from 2005 to 2008. Explain how you would approach this. {5-L}
15. People say that we can have a real time cube using ROLAP or ProActive Caching. Explain how, and the plus/minus of each approach. {1-H}
16. Cube processed OK but all measures return no value. How do you troubleshoot this issue systematically? {1-M}
17. How do you do Slowly Changing Dimension type 2 in SSAS? {1-M}
18. The data mart only has 1 measure: amount. This measure could be asset, expense, income, etc depending on an attribute called “Account”. How do you implement this in SSAS cube? {1-H}
19. The value of measure “price” is only valid if the product dimension is at the leaf level and the location dimension is at the “store” level. How do you set it? {2-M}
20. Attribute relationship: In what case would we want to set: a) the cardinality to “one to one”? b) The relationship type to rigid? What’s the risk (of doing so)? {1-M}
21. Why do you need to design aggregations? How do you design aggregations (usage-based and non usage-based)? Explain how you test it. {1-H}
22. People say that perspective is not a security measure. What do they mean by that? {1-M}
23. Star vs snowflake: what’s the benefit of having a snowflake data mart in SSAS? {1-M}
24. We have a dimension which is as big as the fact table (in terms of the number of rows). What do you suggest we do with this dimension? And with regards to SSAS, what would you do differently? {1-H}
25. Excel: how do you change language from English to French when browsing a cube?

1. Order by key and put month number in the key column. If they can’t answer this you should worry.
2. It’s best if the candidate can explain about the currency conversion BI wizard. But if he can explain the “traditional way” it’s quite good too: using calculated measure and many-to-many relationship.
3. As with any support call, we need to find out which report it is. Try to reproduce what the user did in NovaView. Find the MDX (Tools menu, Direct MDX, ctrl-alt-V) and execute it in SSMS. If it is slow, find out which part of the MDX makes it slow by building the query step by step. Then consider: a) aggregation, b) block computation, c) converting to a base measure (doing the calculation in the relational/ETL layer), d) improving the MDX if it's a calculated measure, e) partitioning. Most of the time, when a user says that a report usually takes 2 seconds but today it timed out, it is not the MDX or the cube structure. It's usually either the server or the connection. So check that a) the server is up and running, and b) from Panorama you can "open" the cube on that server.
4. Incremental processing (IP). If you have huge cubes and the candidate has never done IP before, end the interview. If you have huge cubes in your company (100 GB+), your questions should be focused on performance (processing and query). The job spec should clearly say: wanted, an SSAS performance tuning specialist (not a "normal" AS developer), and be prepared to pay a 20% premium.
5. It's an easy question; every single candidate should be able to answer it, as it is bread and butter. But their answer shows their class. One of the important things I'm looking for here is requirement analysis: an SSAS developer who "just builds it" without analysing the requirements is usually less helpful than one who analyses the requirements. The other thing I'm looking for is performance, e.g. when building the DSV, do they ensure that the SQL is efficient? When creating a calculated measure, do they ensure that the MDX is efficient? The last thing I'm looking for is how they do unit testing.
6. a) backup-restore, b) script (XMLA)-execute-process, c) deployment wizard.
7. First detect (in relational) which partition changed (you need some kind of flag / timestamp column on the fact table). In the metadata table, update the status of each partition of which the data has changed. Then based on this metadata table, process the changed partitions using either a) AMO or b) “AS execute DDL” task in SSIS.
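For reference, the per-partition process command that AMO or the "AS Execute DDL" task would submit looks roughly like this XMLA (all IDs below are invented):

```
<Process xmlns="http://schemas.microsoft.com/analysisservices/2003/engine">
  <Object>
    <DatabaseID>SalesDW</DatabaseID>
    <CubeID>Sales</CubeID>
    <MeasureGroupID>FactSales</MeasureGroupID>
    <PartitionID>FactSales_201009</PartitionID>
  </Object>
  <Type>ProcessData</Type>
</Process>
```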
8. See below.
9. See below.
10. Look at Jason Thomas' post here, and Bryan Smith's post here.
11. A tuple is a coordinate in the cube, identifying a single cell; a set is a collection of tuples (possibly just 1). A tuple is enclosed in parentheses: (); a set is enclosed in curly brackets: {}. An axis is an edge of the multidimensional result set returned by an MDX query (columns, rows, pages, sections, chapters).
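A small MDX sketch (cube, dimension and member names invented) that shows all three concepts at once:

```
select
  {[Measures].[Sales Amount]} on columns,            -- axis 0: a set containing 1 tuple
  { ([Date].[Year].[2010], [Product].[Category].[Bikes]),
    ([Date].[Year].[2011], [Product].[Category].[Bikes])
  } on rows                                          -- axis 1: a set of 2 tuples
from [Sales]
```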
12. Put users into AD groups and add these groups as the members of the cube roles. For each role, define the dimensional security (dimension data tab) on product and region dims.
13. Create 2 roles: USD and GBP and add users into these roles via AD groups. In each role, set the default member in the currency dimension using dimensional security (dimension data tab in the role).
14. Upgrade Dev first (document the process), test the cubes by running the reports (Panorama/Strategy Companion/Excel, etc). Use the documented process to upgrade UAT, and test the UAT cubes. Then upgrade prod and DR at the same time (using the documented process). Or DR first, test then Prod. Before upgrading Dev, you might want to copy a Prod cube to Dev and export some reports into Excel. After upgrading Dev, run the same reports and compare with the results before upgrade.
15. There are 2 things here: dimensions and measure groups (MGs). In reality, a real time cube doesn't mean that all dims and all MGs are real time; usually only 1 or 2 MGs are, and all dims are MOLAP, perhaps with the exception of 1 or 2. To make an MG real time, we can define its storage mode as ROLAP. For performance reasons it's best not to define the whole MG as ROLAP, but only 1 partition (usually the last one, if partitioned by month/period), unless it's a small MG (<100,000 rows). For dims, set the StorageMode property to ROLAP. ProActive Caching (PC) provides automatic management of MOLAP storage: the cube incorporates the changes in the relational tables based on a notification mechanism. You can do PC on partitions and dims. If the fact table (or a partition of it) is small and the data changes frequently (like every minute), it's best to use ROLAP. If the fact table (or a partition of it) is large (like 8m rows in a partition) and the data rarely changes, it's best to use PC.
16. Do "explore data" on the fact table in the DSV. Run the SQL if it's a named query. If this doesn't return anything, check the fact table. If it returns rows, check the dimension usage tab. Focus on 1 MG. Remove all relationships (keep the cube dims), reprocess the cube and see if you get a measure at the top level. If you do, add the relationships back one by one until you find which dim is causing the issue.
17. No special settings; just create the dim as normal from the SCD2 table. The only difference is that it has 3 extra attributes: IsCurrent, Valid From date and Valid To date.
18. Parent child dim, unary operator column, custom rollup.
19. Scope statement
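The kind of SCOPE assignment meant here can be sketched as below. This is only an illustration (the measure, dimension and attribute names are invented); it relies on the fact that a query above a given attribute's level sits on that attribute hierarchy's All member:

```
-- Price is not valid above the product leaf level
Scope ([Measures].[Price], [Product].[Product].[All]);
    This = null;
End Scope;

-- Price is not valid above the Store level of the location dimension
Scope ([Measures].[Price], [Location].[Store].[All]);
    This = null;
End Scope;
```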
20. a) When for each member of attribute1 there is only 1 member of attribute2; b) when the member mappings are fixed. The risk: if a member changes its mapping, processing fails and the dimension needs a full process (a rigid relationship cannot absorb the change).
21. Why: to increase query performance. How (usage based): OLAP query log. How (non usage based): based on attributes which are queried/used most. Refer to SSAS2008PerfGuide section 3.4.3.1. Test in Mosha’s MDX Studio, before and after, 2 different cubes, same query.
22. You can't prevent access to a dim/attribute/measure using a perspective. Users just don't see it, but if they know the name of the dim/attribute/measure they can still access it.
23. Flexibility in building a dimension, speed of processing. No difference in query performance.
24. Dimensional modelling: break it into smaller dimensions and/or snowflake the dim. SSAS: limit the number of attributes (don't just take all attributes, be selective), index those attributes (relational), limit the number of rows going into the cube by using a where clause in the DSV/view, and put attributes with lower grain (a smaller number of distinct values) into a referenced dim, linked via the main dim to the MG.
25. Modify the connection string in the ODC file, see my post here.

Number 8:

```
select order
( filter
  ( {[dim1].[attribute1].[All].Children},
    [Measures].[Measure1] > 10
  ),
  [Measures].[Measure1],
  desc
) on rows,
[Measures].[Measure1] on columns
from [Cube];
```

Number 9:

```
select order
( filter
  ( {[dim1].[attribute1].[All].Children},
    [Measures].[Measure1] > 10
  ),
  [dim1].[attribute1].CurrentMember.Name,
  desc
) on rows,
[Measures].[Measure1] on columns
from [Cube];
```

## 28 September 2010

### Why write a blog?

Filed under: Other — Vincent Rainardi @ 8:30 am
Tags:

Some people asked why I write a blog. Some people asked why I wrote a book, why I spend so many hours on it, and why I don't do other things instead, like relaxing in the garden. They wonder what's in it for me.

I write blog posts usually at night, or on the train like now. I spend on average probably 3 hours a week writing this blog. In 2007 I spent the whole year, from January to December, writing a book, almost every night and every weekend. And my friend asked why I did that.

I also spend a lot of hours preparing presentations: for SQLBits, for SSUG and for other events. For the Stock Broker case study I presented at SQLBits 6 in Wales last November, I spent a lot of hours creating the data. My friends wondered why I gave it away. I spent a lot of hours preparing the demos for SSUG at Microsoft (and the coming SQLBits 7). My colleagues wondered why.

I write and present because I want to share my experience and thoughts with others. I believe that the experience, thoughts and ideas that I have were given by God, and in return I have to share them. I understand that not everybody believes in God; many people don't. So I'll share this next reason:

By writing we increase our knowledge. If, say, your level is 5 before writing a book, it would be 8 after you finish writing it. The same goes for writing a blog. This is because you have to do research: you have to read, and you have to test what you write. You need to be sure that what you write is correct, and in doing so you grow your knowledge.


Blog at WordPress.com.