Data Platform and Data Science

16 January 2011

Team Setup for DWBI Projects

Filed under: Business Intelligence, Data Warehousing, Other — Vincent Rainardi @ 8:40 am

A few people have contacted me regarding the team setup for a BI/DW project. One piece of feedback I received was something like: I’m a PM running a BI project. It’s MSBI end to end: DW/DM on SQL05, IS as ETL, RS reports, AS cubes. I have 3 people working on the project: 1 ETL developer, 1 report/cube developer and 1 data architect/modeller. How do I run the project to ensure that everybody is “loaded”? I.e. in the first month, when the DW is being modelled/designed, the report/cube guy will not be doing anything, and ditto the ETL guy.

Another piece of feedback, received today, was: My experience with my clients is that BI teams most often tend to be (way) too small for the task at hand. The Kimball books suggest a team setup with a multitude of roles, spread across preferably as many people; comparing that to my reality, the gap is huge: most roles do exist, but spread over only a handful of people. I find it a bit hard to separate concerns too much, since everything within BI/DW is interconnected; architectural needs are often captured through report demands, two ends of the same rope. What’s a good team size and setup (of course this is related to the project size, but in general)?

I’ve written about this topic in chapter 3 (DW Development Methodology) of my book, and in the early part of chapter 4. But for now I’ll just focus on directly addressing the 2 situations above.

I’m a believer that we learn better from examples than from theory, so I will explain this team setup using an example. The project is very small: only 4 people on the project.

Team Members

In this project we have 4 people: 1 architect, 1 BA, 1 PM, 1 BI Developer. Because the project is very small, we want a hands-on architect. Do not use architects who can only do data modelling. Advertise for a “BI architect”, not for a “data architect”; they are completely different. The architect must have DBA experience, i.e. they know how to partition a table, update statistics, etc. When selecting a PM, choose someone who has delivered at least 3 DWBI projects. And ask which part of the DWBI project he delivered, because some PMs were only involved part way through the project, e.g. just the reporting part. It is better to pay an experienced PM highly than an inexperienced PM cheaply. Do not use PMs who are not DWBI PMs. I know that a DWBI PM is difficult to get, but it is a must that you use a DWBI PM, not an “App Dev” PM or any other PM.
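To make “DBA experience” concrete, here is a minimal T-SQL sketch of the kind of routine work the architect should be comfortable with (the table, index and partition function names are hypothetical):

  -- Refresh optimiser statistics on a large fact table
  UPDATE STATISTICS dbo.FactSales WITH FULLSCAN;

  -- Rebuild a fragmented index
  ALTER INDEX IX_FactSales_OrderDateKey ON dbo.FactSales REBUILD;

  -- See how the rows are spread across partitions
  -- (assumes a partition function called pfMonthly exists)
  SELECT $PARTITION.pfMonthly(OrderDateKey) AS PartitionNo,
         COUNT(*) AS NumRows
  FROM dbo.FactSales
  GROUP BY $PARTITION.pfMonthly(OrderDateKey);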

The BI Developer must be an all-rounder. Pass on people who can only do RS or IS. He or she must be able to do T-SQL, IS, RS and AS. I know they are more expensive, but you have no chance of succeeding if your developer is one-sided (i.e. just RS). And choose someone who has delivered at least 3 BI projects. Do not use developers who have done fewer than 3 projects. And ask which part of the BI project they did, i.e. ETL or reports. Pick a BA who really understands the business. Ideally he/she has been with the business for at least 3 years. The BA will be involved throughout the project, all the way to the end.

The goal of the project is to build a data warehouse on SQL Server 08, with 10 reports and 3 cubes (TBD), in 6 months. Say it’s a utility company, selling gas & electricity to 3 million customers. The source is 1 business system used for sales, stock, bulk purchase, distribution, etc. They want to analyse sales figures, profitability, supplier performance and distribution performance. The architecture is 1 DDS (see my book chapter 2), with no ODS or NDS in the middle.

Break Into Small Parts

The key to managing this project effectively (in terms of its resource use and time) is to break the project into several parts, based on functionality. The team first works together to deliver part 1 (say it’s sales analysis). In week 1, the BA & Architect sit down for a day or two, scoping the functionality. Then the architect spends 3 days doing data modelling (designing the DW table structure), just for part 1. @Architect: I would emphasise not to worry too much about getting it 100% right. Just take a first stab at it. You can refine it later. In week 2, the Architect & the BA spend 2 days specifying the mappings (which column in the source goes to which column in the DW). The architect then spends 2 days creating the DW tables. On the first day of week 2, the architect creates the stage tables, which mimic the structure of the source tables. This way the BI Dev will not be idle in week 2: he/she needs to populate the stage tables. Please do not take the number of days literally, as they depend on the size of the work. You could easily double all the numbers. But the proportions should be more or less the same.
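As a minimal sketch of that stage layer (the schema, table and server names are made up), the stage table simply mimics the source table, with no constraints, plus a load timestamp. In its simplest form the BI Dev can populate it with an IS data flow or an INSERT … SELECT over a linked server:

  CREATE TABLE stage.SalesOrder
  ( OrderId      INT,
    CustomerId   INT,
    ProductId    INT,
    OrderDate    DATETIME,
    Quantity     INT,
    Amount       MONEY,
    LoadDateTime DATETIME DEFAULT GETDATE()  -- when the row was staged
  );

  -- Pull yesterday's and today's orders from the source system
  INSERT INTO stage.SalesOrder
    (OrderId, CustomerId, ProductId, OrderDate, Quantity, Amount)
  SELECT OrderId, CustomerId, ProductId, OrderDate, Quantity, Amount
  FROM SourceServer.SalesDB.dbo.SalesOrder
  WHERE OrderDate >= DATEADD(DAY, -1, CAST(GETDATE() AS DATE));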

Remember that at this time, what we are doing is Sales Analysis (SA). We do not touch customer profitability, supplier performance or distribution performance. Say the ETL takes 2 weeks (SA only), done by the BI Developer. During this time the Architect is designing the reports and the cube (SA only): starting with the UI (done together with the BA), then moving on to where the data should be sourced, and then the SQL statement (rough SQL only, leave the details to the developer). One question that the BA should ask him/herself all the time is: Is this what the business needs? And the one question that the Architect should ask him/herself all the time is: Will it work?
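To illustrate what “rough SQL” might look like for an SA report (the fact and dimension names are illustrative, not from the project), something like this is enough for the developer to take forward:

  SELECT d.CalendarMonth,
         p.ProductGroup,
         SUM(f.Quantity)    AS TotalQuantity,
         SUM(f.SalesAmount) AS TotalSales
  FROM dbo.FactSales f
  JOIN dbo.DimDate    d ON d.DateKey    = f.OrderDateKey
  JOIN dbo.DimProduct p ON p.ProductKey = f.ProductKey
  WHERE d.CalendarYear = 2010
  GROUP BY d.CalendarMonth, p.ProductGroup
  ORDER BY d.CalendarMonth, p.ProductGroup;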

Testing and Rhythm

The Architect needs to be able to guide the developer. If the Developer is stuck on a particular point (doesn’t know how to do it), it is the Architect’s job to show how to do it. If there are several approaches to how it can be built (say a view or direct SQL), it is the Architect’s job to select the best approach. While the developer is busy building the reports and the cube, the Architect and the BA do two things: a) create a test plan for SA, and b) design the Customer Profitability (CP) ETL, reports & cube. The test plan will probably take ½ a week, whereas designing CP will probably take 1-1.5 weeks. By the time they are building CP, they will probably have got the hang of the rhythm (the repeated cycle of timing). And so, onwards with part 3 (supplier performance) and part 4 (distribution performance).

How about testing? The BA writes the test plan. The BA (helped by the Architect) does the system testing & UAT. The architect prepares the DWBI system for testing. The developer does the unit testing.

It is my experience that a pair of Architect + BA will be able to feed 1 ETL developer and 1 report/cube developer quite adequately. This is the point where I find the balance: the workload is not too much or too little on either side.

Management

What does the PM do? The PM is the one ultimately responsible for the delivery of the project. Apart from managing the project (project plan, monitoring progress, providing adequate resources, budgeting), there are 2 main things that the PM must do: a) shield the team from outside pressures, and b) communicate with outside parties/stakeholders. Some people say that the PM role (in this team structure of 4-5 people) is a part-time job, and I tend to agree. The PM can take on 1 more project (or two), while at the same time running this BI project. Some PMs keep 2 schedules: 1 for the external stakeholders and 1 for the team members. This does not work. Within 3 weeks he or she will get caught. Keep 1 schedule and be open. As with any project, the name of the game is trust. In order for the project to run well, you need to trust the other team members.

For a bigger project, the 3 pillars are still there: PM, BA, Architect. These people are the ones who drive the rest of the project team. They drive it from 3 different sides: 1 from the management side, 1 from the business side and 1 from the technical side. You can add 2 more developers and 1 more tester; as long as you break the work into smaller parts like above, and the 3 pillars stand, you will be able to run the project in balance and keep everyone occupied.

Hope this helps. As usual I welcome any questions and discussion at vrainardi@gmail.com or via comments. Vincent Rainardi, 16/1/2011

11 January 2011

4 Sessions for SQLBits 8, Brighton, April 2011

Filed under: Event — Vincent Rainardi @ 2:54 pm

I have submitted 4 sessions for SQLBits 8, 7-9 April 2011, Brighton:

1. Testing Your Data Warehouse Using SSAS

I wrote a book about building a data warehouse, and one chapter of it is about testing a data warehouse. Testing a data warehouse is an art. On the one hand we need to make sure that the figures in the warehouse tally with the source systems. On the other hand there are billions of numbers to test. And to add complexity, sometimes the numbers from the source are modified on their way to the warehouse. One technique for testing the data warehouse is to use SSAS: we build a cube and put the numbers from both the source and the DW into this cube. Not only does this technique save a lot of time, but it also lets us quickly pinpoint where the problems are.
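The underlying idea, as a hedged sketch (the table names are illustrative): feed the cube the same measure from both sides, tagged with a column that says where each number came from, so any mismatch shows up as soon as you slice by that column:

  -- Same measure, once from the source system and once from the warehouse
  SELECT 'Source' AS DataOrigin,
         YEAR(OrderDate)  AS CalendarYear,
         MONTH(OrderDate) AS CalendarMonth,
         SUM(Amount)      AS SalesAmount
  FROM SourceServer.SalesDB.dbo.SalesOrder
  GROUP BY YEAR(OrderDate), MONTH(OrderDate)
  UNION ALL
  SELECT 'Warehouse', d.CalendarYear, d.CalendarMonth, SUM(f.SalesAmount)
  FROM dbo.FactSales f
  JOIN dbo.DimDate d ON d.DateKey = f.OrderDateKey
  GROUP BY d.CalendarYear, d.CalendarMonth;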

2. Advanced Dimensional Modelling

This session is for anybody involved in the design of a data warehouse. Many of us know what dimensions and fact tables are. But as I explained in my blog, dimensional modelling is more than just dimensions and fact tables. This session is about advanced dimensional modelling topics such as Fact Table Primary Key, Vertical Fact Tables, Aggregate Fact Tables, SCD Type 6, Snapshotting Transaction Fact Tables, 1 or 2 Dimensions, Dealing with Currency Rates, When to Snowflake, Dimensions with Multi Valued Attributes, Transaction-Level Dimensions, Very Large Dimensions, A Dimension With Only 1 Attribute, Rapidly Changing Dimensions, Banding Dimension Rows, Stamping Dimension Rows and Real Time Fact Table. Prerequisites: You need to have a basic knowledge of dimensional modelling and relational database design.

3. Partitioning Your Fact Tables

In data warehousing, partitioning is one of the top 3 things that can speed up performance (the other two being summary tables and indexing, as I explained in my book). Partitioning your fact table can speed up performance up to 10 times, for both querying and loading. Yet 9 out of 10 fact tables are not partitioned. One of the reasons is that initially there were only 5-10 million rows. After a year, the fact table contains 1 billion rows and loading time degrades from 2 hours to 5 hours. Queries slow down too, from 1 second to 20 seconds. In this session I’m going to explain only partitioning, but I’ll leave some time for Q&A if you want to ask about the other two. Prerequisites: requires a little bit of data warehousing know-how.
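As a minimal sketch of what this looks like on SQL Server 2008 (the names and boundary values are illustrative), a fact table partitioned by month on an integer date key:

  -- Monthly boundaries on an integer date key (YYYYMMDD)
  CREATE PARTITION FUNCTION pfMonthly (INT)
  AS RANGE RIGHT FOR VALUES (20110101, 20110201, 20110301);

  CREATE PARTITION SCHEME psMonthly
  AS PARTITION pfMonthly ALL TO ([PRIMARY]);

  CREATE TABLE dbo.FactSales
  ( OrderDateKey INT NOT NULL,
    ProductKey   INT NOT NULL,
    Quantity     INT,
    SalesAmount  MONEY
  ) ON psMonthly (OrderDateKey);

Queries that filter on OrderDateKey then touch only the relevant partitions (partition elimination), and a daily load can be built in a staging table and switched in almost instantly with ALTER TABLE … SWITCH, which is where the loading speed-up comes from.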

4. SQL Server Data Warehousing

8 years ago people raised their eyebrows when you said you were building a data warehouse on SQL Server. In today’s terms, it is probably similar to saying that you are building a data warehouse on MS Access. In those days DB2 and Oracle ruled the market.

Thanks to Microsoft, today SQL Server is a respectable platform for data warehousing, and for business intelligence. It’s completely reversed now: if we say we are building a DW-BI system on DB2 or Oracle, people ask “Why not SQL Server?” This session in a way celebrates this golden era of SQL Server DWBI, just like my book. As time is limited I’m going to talk about the DW side only, not the BI side (RS, IS, AS), e.g. merge, change data capture, star join query, bitmap filter, change tracking, minimally logged insert, resource governor, backup compression, data compression and indexed views.
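As a taster, hedged sketches of three of those features (the table and column names are made up):

  -- MERGE: upsert a dimension in one statement
  MERGE dbo.DimCustomer AS tgt
  USING stage.Customer AS src
    ON tgt.CustomerId = src.CustomerId
  WHEN MATCHED THEN
    UPDATE SET tgt.CustomerName = src.CustomerName
  WHEN NOT MATCHED THEN
    INSERT (CustomerId, CustomerName)
    VALUES (src.CustomerId, src.CustomerName);

  -- Change data capture: track changes to a source table
  EXEC sys.sp_cdc_enable_db;
  EXEC sys.sp_cdc_enable_table
       @source_schema = N'dbo',
       @source_name   = N'SalesOrder',
       @role_name     = NULL;

  -- Data compression: compress a large fact table
  ALTER TABLE dbo.FactSales REBUILD WITH (DATA_COMPRESSION = PAGE);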

For further info about SQLBits please look at the SQLBits web site, www.sqlbits.com.

8 January 2011

Oracle ETL Tools

Filed under: Oracle — Vincent Rainardi @ 8:59 am

For years the ETL tool for Oracle data warehousing has been OWB (Oracle Warehouse Builder). But in 2007 Oracle introduced another ETL tool called ODI, Oracle Data Integrator (not to be confused with BODI, Business Objects Data Integrator). ODI originally came from Sunopsis. Sunopsis had an ETL product called Data Conductor, famous for its ELT approach, i.e. load into the target server first, then use the power of the DB engine to transform and upsert into the target tables. This approach suited Oracle, so Oracle bought Sunopsis in 2006 and rebranded Data Conductor as ODI.
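The ELT principle, as a rough sketch (the names are illustrative, and the SQL is hand-written only to show the idea; in ODI this code is generated by the tool): first bulk-load the raw rows into a staging table on the target server, then let the target database engine do the set-based transform and upsert:

  -- Step 2 of ELT: transform + upsert inside the target DB
  -- (Oracle-style MERGE; step 1 was a plain bulk load into stage.Product)
  MERGE INTO dw.DimProduct tgt
  USING (SELECT ProductCode,
                UPPER(TRIM(ProductName)) AS ProductName  -- the "T" of ELT
         FROM stage.Product) src
  ON (tgt.ProductCode = src.ProductCode)
  WHEN MATCHED THEN
    UPDATE SET tgt.ProductName = src.ProductName
  WHEN NOT MATCHED THEN
    INSERT (ProductCode, ProductName)
    VALUES (src.ProductCode, src.ProductName);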

It was (and still is) Oracle’s intention to combine OWB and ODI into 1 product. In January 2009 they bundled ODI and OWB into 1 licence and called it ODIEE: still two separate pieces of software, but 1 licence. Some ODI functionality was added into OWB, and the result was called OWBEE. Formal statement here. Formal OWB blog here. ODI product page here. Data Integration blog here. OWB product page here.

So today Oracle has 4 ETL products:

  • ODI: the one from Sunopsis. This is the future.
  • OWB: the old OWB (without the new ODI functions), available with Oracle DB. Aka “Basic ETL”.
  • ODIEE: bundled license of OWB & ODI. EE stands for “Enterprise Edition”.
  • OWBEE: modified OWB, with the new ODI functions. EE stands for “Enterprise ETL”, not “Enterprise Edition”.

In the future they will be merged into 1 product: ODI. For now (2011), companies which are new to Oracle ETL should use ODI, because that is the future; ODI is the strategic product. Companies which have already used OWB for a long time should use OWBEE. In the end, the old OWB will be gone, replaced by ODI. But they can’t do that now, because there are many companies using OWB at the moment. There must be a smooth way of migrating them to ODI, and there must be support for OWB for many more years. But there is no escaping that they must all migrate to ODI, just like DTS to SSIS. It will probably take 5 years for the world to change. For companies using OWB, this means a big project, and significant costs.

But there’s no “middle way”. Oracle as a company can’t maintain 2 ETL tools, because of the cost; they must have only 1 tool. In 5 years’ time ODI is the ETL product that will survive, the product that will be enhanced and developed further. ODI is Oracle’s strategic ETL tool, as per the formal statement here.

The alternative is to move all the ODI core features to OWB, which in my opinion can’t be done. There is a fundamental difference in the principle of how they work. If OWB is like a house, we can’t just add the core ODI features on top of that house; we would have to demolish that house, build a new foundation (the ODI way), then rebuild the house.

In addition to OWB & ODI, Oracle also has GoldenGate, a famous CDC (change data capture) tool that Oracle acquired in July 2009. GoldenGate captures the changes in the source systems by reading the transaction log files and applies the data changes to the target system, such as a DW or ODS, using a native DB interface or ODBC. The component that applies the data to the target is called Replicat. The GoldenGate 11g Admin Guide is here; the architecture diagram is on page 12. The real beauty of it is that GoldenGate can operate between various RDBMS, i.e. DB2, MySQL, Oracle, SQL Server, Sybase and Teradata, integrating data between them in real time (synchronous) or in batch using CDC. Using GoldenGate is not easy; there is a lot of command line work. It is not as easy (nor as modern in its UI) as, for example, DataMirror, DBMoto and Attunity. But functionality-wise, GoldenGate is probably the richest.

One word about the term “Data Integration”. In 2006 all the ETL companies seemed to agree that the word “ETL” had a bad perception: it only covered a small part of the scope. They all rebranded their products as “Data Integrator”: a) they could add/integrate data quality and data profiling into the product, b) they could get into the MDM market with their tool, c) they could get into data integration projects (not only data warehousing projects), and d) they could get into operational systems such as ERP, business systems, and ODS projects. Look at Business Objects: they called it BODI, Business Objects Data Integrator. SSIS: Integration Services. In 2007 in Boston I listened to Michael Gonzales’ ETL session at the TDWI conference. He said that “ETL is dead, it is now called Data Integration”. I totally agreed: it should be called Integration, not ETL. But in 2010, because everybody in the market (and the press) still calls it ETL, the word ETL seems to have revived. Hence the title of this post 🙂

If you find something incorrect above, I’d be grateful if you could let me know, either via comments or vrainardi@gmail.com, and I will correct it.

7 January 2011

Introduction to Oracle BI for MS BI Developer

Filed under: Oracle — Vincent Rainardi @ 8:15 pm

If you are an MS BI developer or architect (SS, IS, RS, AS) you need to look at Oracle BI: a) so that we know what’s out there, and b) because we increase our understanding of our own area by studying other areas. When I ask people which BI stack they implemented, these days the answer is almost always MS BI. There are other BI suites: SAS, SAP (BW & BO), IBM (Cognos + InfoSphere) and Oracle. Out of these, OBI is not the most popular (BO or Cognos probably is), but OBI is the “latest trend”, i.e. with the arrival of OBIEE 11g in the 2nd half of 2010, people started considering OBI as an alternative to MSBI.

The documentation for OBIEE in general is here. Release 11g is here. Release 10g is here.

History

OBIEE 11g originated from Siebel Analytics (the latest version was 7.8, Jan 2006). Answers, Interactive Dashboards and Delivers came from Siebel Analytics (read here). Discoverer was not from Siebel Analytics; it was from Fusion Middleware (see this book, page 31), along with Forms, Reports and Portal. At that time, OBI was part of “Oracle Fusion Middleware 11g Release 1”, see here. Apart from OBI, OFM 11g R1 consisted of Communication Services, Identity Management, SOA Suite, WebCenter, WebLogic, HTTP Server, and the Application Development Framework.

So, in Jan 2006, OBI was located in 2 places: a) Siebel Analytics contained Answers, Delivers and Interactive Dashboards, and b) OFM 11g R1 contained Discoverer, Forms, Reports and Portal.

Then OBI 10g was released in June 2006 (10.1.3.2). It consisted of Answers, Delivers, Interactive Dashboards, BI Publisher and Disconnected Analytics. Discoverer 10g was a separate product, released in Dec 2004 (10.1.2, aka Drake); see this book, page 4. Discoverer was a very mature product: it had been in the market for a long time, first released in April 1997. Other Oracle BI tools (Beans, Reports, Data Miner, OWB, Spreadsheet Add-In) were all separate tools, not part of the “OBI 10g” suite.

OBIEE 11g was released in July 2010. 10g’s Answers was renamed Analysis, 10g’s Interactive Dashboards was renamed Dashboards, and 10g’s Delivers was renamed Agents. New features were added: Scorecards, KPIs, Actions, Conditions, Filters, Prompts and the OBI Add-in for MS Office. BI Publisher (Report Designer) and Real-Time Decisions are also part of OBIEE 11g.

Components

I’m going to use 11g terms here, instead of 10g.

An Analysis is a DW/BI SQL query presented in the form of tables, pivot tables and charts. These tables and charts can then be included in a Dashboard. We can create a prompt in an Analysis to allow the users to select a value to filter the result of the Analysis. We can integrate an Analysis with an Excel Internet Query (IQY) to enable us to run the Analysis from Excel. We can create an Agent from an Analysis to schedule the Analysis to run at a certain time. We can use variables practically anywhere within an Analysis (title, column, formula, formatting conditions, SQL, header, etc.).

A Dashboard is a collection of Analyses. A Dashboard consists of one or several pages (shown as “tabs” along the top of the Dashboard). Apart from displaying Analyses, a Dashboard page can also display images, text, alerts, action links, views, reports, URL links and embedded objects. There is a toolbar at the top right of the Dashboard containing buttons such as Edit, Print, etc. We can create a Skin to change the appearance of a Dashboard, e.g. the background colour, logo and style sheets. A Skin can be assigned to users so each user can have a different look. A Style controls how a Dashboard is formatted, e.g. the colour, font & size of the text, table borders, graph attributes. Styles are organised into folders that contain Cascading Style Sheets, images and graph templates. We can create a Dashboard Template, which acts as a starting point for users to build their own Dashboard pages.

A Briefing Book is a collection of Dashboards. A Briefing Book can also contain an individual Analysis and a BI Publisher Report. The Dashboards, Analyses and Reports in a Briefing Book can be static (can’t be updated) or updatable (the content is refreshed every time we open it). We can export a Briefing Book to PDF. We can deliver a Briefing Book to users using an Agent.

BI Publisher is a tool to create and publish reports. We can read data from databases (using SQL), spreadsheets, Analyses and OLAP DBs (using MDX). We can also read from LDAP, Web Services, HTTP feeds and XML files. If the data sources are unrelated (such as individual files), we can create relationships between them. Using Publisher we can create layouts for our reports and publish the reports in various output formats, including HTML, PDF, Excel, PowerPoint, XML, CSV and Rich Text. The output can be sent to: printer, email, file, fax, WebDAV or FTP. We can split a single report into multiple sections, each section sent to different recipients at multiple destinations, using different formats; this is called Bursting. We can create a Template that contains style information that can be applied to RTF layouts. For PDF output, we can split the PDF into several smaller files (called PDFZ). We can schedule a report to produce multiple outputs, each with a different layout and each sent to a different destination. There is an interactive viewer, where users can sort or filter a table, scroll the table, get a pop-up graph, and do “propagated filtering” across different areas of the report.

Real-Time Decisions (RTD) is a tool to create/model business processes, gather data/statistics, and make recommendations (all of which is called an Inline Service). An Inline Service consists of several elements including Performance Goals, Choices, Rules, a Statistics Collector and Models. A Performance Goal is a KPI used to set the criteria for the scoring of choices. A Rule is used to target a segment of the population; a Rule can also be used to decide whether a choice is eligible. A Model is self-learning and predictive, used for optimising decisions and providing real-time analysis. RTD is fully integrated with the Eclipse development environment. We can deploy an Inline Service, run it to process requests, monitor its performance, gather and view the statistics, and fine-tune/optimise it.

How these components fit together is shown in the OBIEE system architecture diagram here.

