Data Warehousing and Data Science

23 June 2011

Data Sourcing

Filed under: Data Warehousing,Project Management — Vincent Rainardi @ 7:28 am
Tags: ,

One of the trickiest parts of a data warehousing project is to determine where we should get the data from. On the one hand, the business user knows that they would like to have attribute X, but they won’t know where in the enterprise we could get it from. On the other hand, if you ask someone from the source system team, “Do you have field X?”, your terminology might be different from what they are using.

For example: a bank. The business users are from the Compliance department and they want to have “Third Party Name” and “Account Status” on their report. You setup a meeting with the business users, trying to clarify what “Third Party” means and what “Account Status” means. There are 150 systems in the bank and you identified 5 possible systems that may have the data. You spent days trying to get the source system teams into a meeting. After a few weeks, you concluded that only 2 possible systems you can source the data from. Then you ask for sample data. And what a disappointment, both systems do not contain the data you are looking for.

Say you do get hold of a system which had the data you are looking for. There are usual data sourcing routine that you need to perform:

  1. Ask for sample data, import it into your system and examine the data quality. It’s no good if most of the rows are null for that column! Are there rows with null business keys?
  2. Check if the grain of the data matches what we are looking for. If you need it at trading level but the data is at account level, it’s no good! If you expect the data to be 1 row for each account per day, and in the data file you find that it’s 1 row for each customer (not per day) , then it’s no good!
  3. Understand how the data needs to be joined in our data warehouse. Test and retest again. Yes it’s left (outer) join of course, but are you sure that the data on the joined fields matches our warehouse?
  4. In a large enterprise you are not allowed to import directly from their database (point to point). You have to subscribe to their “publication”. In a large corporation, you are not the only consumer of this data; there might be 5 other systems consuming the same data.
  5. Understand the data delivery: Is the feed incremental or full load every time. What is delivery mechanism, FTP or messaging? Is it daily feed or continuous throughout the day? How often does the data get updated? What’s their scheduled down time?
  6. Understand the age of the data: If today you receive a file from that source system, does the content of this file reflect the condition in their system last night, or 2 days ago? Working days only, or week end included? In which situation this is not the case, at year end?

Welcome to the world of data sourcing. This is one of the activities in data warehousing that takes a lot of time. Having been into many warehousing projects, I feel now that the most important thing in a data warehousing project is probably the data sourcing. We need to have somebody in our DW team who understands the source systems. The dimensional modelling is equally important, but if you don’t understand the source data/systems, then the model would not be correct, and the ETL spec would not be correct either. And data sourcing needs to be done early: if we don’t have the data we can’t do anything!

This is how “analysis” is different from “design”. This is where Data Quality work happens. You perform DQ check and recheck on the incoming data. Are there duplicate rows in the data? Do the values match what you expect?

Data sourcing is often overlooked. It is often not considered seriously. Many PMs (especially those who are not from DW background) think like “Ah it’s about getting data from finance department, easy…” We should not underestimate data sourcing in a DW project. If we can’t get good, valid data, we won’t have a data warehouse, let alone producing a BI front end.

That is why, in any DW project, it is crucial that we develop and test with Production data. Lots of PMs don’t get this. They are so used to Data Entry Application project and they ask “Why do you need Production data for development?” Two reasons: the first reason is data quality / data sourcing / functionality. The second reason is performance / technical / size.

The first reason is about making sure that you get the right data, and therefore can build the correct functionality in the DW. The second reason is about making sure that we can deal with vast amount of data. It’s all very well that the ETL package only took 2 hours in dev, but when we go live it could take 7 hours! Why? “Oh we develop using 50,000 rows sample data”. You must be joking! This is a data warehousing project, not a data entry system application development project. And you are back to square one: redo the development, redo the ETL package, redo the testing, etc, which would be a huge waste of valuable project time.

Vincent Rainardi, 23/6/2011

17 May 2011

Don’t Worry About the Sign Off

Filed under: Project Management — Vincent Rainardi @ 5:26 pm

A traditional DW/BI project has requirement, design, build and test phases. And between each phase we have deliverables. Dimensional Model for example, is one of the early deliverables, produced at design phase. If you look into the design phase in more detail, it consists of several activities: ETL Design, Dim Model Design (aka Warehouse Design), etc. At some point, a design document (or business requirement/analysis document) needs to be signed off, and it is used as a base for the next phase.

Here’s an example: business requirement document (BRD) –> functional requirement document (FRD) –> technical specification (TS) –> coding/build. The person writing the FRD will use the requirements defined in BRD. And the person writing the TS will use the functionalities defined in the FRD. Of course there are other things like quality requirement, non functional requirement, etc but that’s beside the point.

So a BRD needs to be signed off, before the FRD can be written. And the FRD needs to be signed off, before the TS can be written. And the TS needs to be signed off, before the code can be written. Right? Wrong. That’s the old concept. In these days and age, in BI projects, we don’t wait until a document is signed off, before we start the next stage. A soon as the author declared “I think I have the guts of it written”, then the next stage begins. Of course, when the upstream document changes, the downstream document needs to be updated. But that’s fine, because usually it’s minor changes. This method can save the project delivery time by a quarter. Now we are talking huh?

The second reason is: because no body knows the requirement in details. They can only specify the requirement to the best of their knowledge at that time. Of course it will change. Especially in BI project. Once you show the cube/report only then the business understand it better, and from that understanding comes out new requirements. Of course. That’s natural. Best systems are the ones which achieved version 5. This means that the software have incorporated feedback from the first version, second version, etc. It is a mature product.

The same goes with BI. It’s not a good idea to do it like 20 years ago: spending a lot of time interrogating the business users to get the requirements and then ask them to sign it off. So that if something goes wrong, you have it on paper. That’s a thing of the past. We don’t do that any more in 2011. Why? Because 20 years ago, the pace is very slow. You have 3 years to design a system. Now? A year is probably too long. Especially in BI and data warehousing.

Agile? Yes. Should we use Sprint? Absolutely. This actually removes the responsibility for the project team to “deliver on time”. Because we are not measured by time. We are measured by the user satisfaction. What’s the point delivering the system on time and on budget but doesn’t satisfy the business? We delivered according to the requirements, and they signed off the requirements. That’s why it’s not difficult to sell the Agile concept to the business, because what they need is satisfaction, not a system that is delivered on time.

Blog at WordPress.com.