One of the most challenging issues in managing a data warehouse project is that the report/analytics development can’t start before the ETL is completed and populated the warehouse tables with data.
This is traditionally solved by using “data interface”, i.e. the fact & dim tables in the warehouse are created and populated manually with minimum data, in order for the report/cube developer to start their work.
But the issues when doing that are:
- It takes a long time and large effort to create that data
- The manually-handcrafted data does not reflect the real situations
- The manually-handcrafted data does not cover many scenarios
A data interface is an agreement between the report building part and the ETL part of a DW project that specifies how the data will look like. A data interface consists of two essential parts:
- The data structure
- The data content (aka data values)
The data structure takes a long time to create, because it requires two inputs:
- The requirements, which determines what fields are required in the DW
- The source data, which determines the data types of those fields
Hence the project plan for a traditional data warehouse or data mart project looks like this: (let’s assume the project starts in January and finished in 12 months/December, and we have 1 ETL developer, 1 report/analytic developer, 1 data architect, 1 BA, 1 PM)
- Jan: Inception, who: PM, task: produce business case and get it approved to get funding.
- Feb-Apr: Requirement Analysis, who: BA, task: create functional requirements.
- Apr-May: Design, who: DA, task: create data model and ETL specs.
- Jun-July: Design, who: DA, task: report & cube specs
- Jun-Aug: Build, who: ETL Dev, task: create ETL packages
- Aug-Oct: Build, who: Report Dev, task: create report & cubes
- Sep: Test, who: BA & ETL Dev, task: test ETL packages and fix
- Nov: Test, who: BA & Report Dev, task: test report & cubes and fix
- Dec: Implementation, who: BA, DA & Dev, task: resolve production issues
The above looks good, but it is actually it is not good from resourcing point of view. There are a lot of empty pockets burning budgets while people are sitting idle. There are 27 empty boxes (man-months) in the resource chart below, which is 37.5% of the total 60 man-months.
The report developer starts in August. If we can make the report developer starts in June at the same time as the ETL developer, we would be able to shorten the project by 2 months or so. But how can we do that?
The answer is data interface. The DA creates the DW tables and populates them with real data from the source system. This is quicker than trying to manually create the data, and the data reflect the real situations, covering many scenarios. Real scenarios (the ones which are likely to happens) with realistic data, not made-up scenarios with unrealistic data which are unlikely to happen.
Using the populated DW tables (facts and dims), with real data, the report developer will be able to create the reports and the BA will be able to test/verify the numbers in the reports, at the same time when the ETL is being developed to populate the tables. We are removing the dependency between the report development and the ETL development, which a crucial link in the project that prolongs the project duration.
The resource chart now looks like this, 2 months quicker the original plan:
The value of this 2 months is approx. : 2 months x 5 people x $500/person/day x 22 days/month = $110k. It is a significant figure. The DW development cost goes down 16.7% from $660k to $550k.
We notice that there is a big white area on the lower left of the resource chart, i.e. the DA and 2 developers are not doing anything in the first 3-5 months. This can be tackled using iterative development, i.e. break the project into 3 parts (by functional areas) and deliver these 3 parts in one after the other. In the chart below, part 1 is yellow, part 2 is green and part 3 is purple.
Part 1 (the yellow boxes) goes live in June, part 2 goes live in Aug, part 3 in Sep. The big system test at the end of the project won’t be required any more, because we go live bit by bit. During April the BA can prepare the test plan.
The resource utilisation is now higher. We now only have 8 white boxes, out of 54. That’s 14.8%.
The project duration is shortened further, only 9 months rather than 10. That’s another $55k cost saved. The cost is now $495k, 75% of the original cost ($660k).