One of the trickiest parts of a data warehousing project is determining where we should get the data from. On the one hand, the business user knows that they would like to have attribute X, but they don't know where in the enterprise we could get it from. On the other hand, if you ask someone from the source system team, "Do you have field X?", your terminology might be different from what they are using.
For example, consider a bank. The business users are from the Compliance department and they want "Third Party Name" and "Account Status" on their report. You set up a meeting with the business users, trying to clarify what "Third Party" means and what "Account Status" means. There are 150 systems in the bank, and you identify 5 possible systems that may have the data. You spend days trying to get the source system teams into a meeting. After a few weeks, you conclude that there are only 2 possible systems you can source the data from. Then you ask for sample data. And what a disappointment: neither system contains the data you are looking for.
Say you do get hold of a system which has the data you are looking for. There is a usual data sourcing routine that you need to perform:
- Ask for sample data, import it into your system and examine the data quality. It’s no good if most of the rows are null for that column! Are there rows with null business keys?
- Check if the grain of the data matches what we are looking for. If you need it at trade level but the data is at account level, it's no good! If you expect the data to be 1 row for each account per day, and in the data file you find that it's 1 row for each customer (not per day), then it's no good!
- Understand how the data needs to be joined in our data warehouse. Test and retest. Yes, it's a left (outer) join of course, but are you sure that the data in the joined fields matches our warehouse?
- In a large enterprise you are usually not allowed to import directly from the source database (point to point); you have to subscribe to their "publication". You are not the only consumer of this data; there might be 5 other systems consuming the same feed.
- Understand the data delivery: Is the feed incremental or a full load every time? What is the delivery mechanism, FTP or messaging? Is it a daily feed or continuous throughout the day? How often does the data get updated? What is their scheduled downtime?
- Understand the age of the data: If you receive a file from the source system today, does its content reflect the condition of their system last night, or 2 days ago? Working days only, or weekends included? In which situations is this not the case, e.g. at year end?
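The first few checks in the routine above can be sketched with a short script. This is a minimal sketch, assuming a hypothetical feed with columns `account_id`, `snapshot_date` and `third_party_name`, where `account_id` is the business key, the expected grain is 1 row per account per day, and `warehouse_accounts` stands in for the keys already in our warehouse (all names and sample values are invented for illustration):

```python
import csv
from collections import Counter
from io import StringIO

# Hypothetical sample feed; in practice, read the file the source team sends.
sample = """account_id,snapshot_date,third_party_name
A1,2011-06-20,Acme Ltd
A1,2011-06-21,Acme Ltd
A2,2011-06-20,
,2011-06-20,Orphan Row
"""

rows = list(csv.DictReader(StringIO(sample)))

# 1. Null rate per column: no good if most of the rows are null.
null_rate = {
    col: sum(1 for r in rows if not r[col]) / len(rows)
    for col in rows[0]
}

# 2. Rows with null business keys (here the key is account_id).
null_keys = [r for r in rows if not r["account_id"]]

# 3. Grain check: expect exactly 1 row per account per day.
grain = Counter((r["account_id"], r["snapshot_date"]) for r in rows)
grain_violations = {k: n for k, n in grain.items() if n > 1}

# 4. Join coverage: which keys in the feed won't match our warehouse?
warehouse_accounts = {"A1", "A3"}  # assumed keys already in the warehouse
unmatched = {r["account_id"] for r in rows} - warehouse_accounts - {""}

print(null_rate)
print(len(null_keys), grain_violations, unmatched)
```

On this sample, the script flags the row with the empty business key, a 25% null rate on `third_party_name`, and the account `A2` that would drop out of (or orphan) a join against the warehouse.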
Welcome to the world of data sourcing. This is one of the activities in data warehousing that takes a lot of time. Having been involved in many warehousing projects, I now feel that the most important thing in a data warehousing project is probably the data sourcing. We need to have somebody in our DW team who understands the source systems. The dimensional modelling is equally important, but if you don't understand the source data/systems, then the model will not be correct, and the ETL spec will not be correct either. And data sourcing needs to be done early: if we don't have the data we can't do anything!
This is how "analysis" differs from "design". This is where Data Quality work happens: you perform DQ checks and rechecks on the incoming data. Are there duplicate rows in the data? Do the values match what you expect?
Data sourcing is often overlooked and not taken seriously. Many PMs (especially those not from a DW background) think, "Ah, it's about getting data from the finance department, easy…" We should not underestimate data sourcing in a DW project. If we can't get good, valid data, we won't have a data warehouse, let alone a BI front end.
That is why, in any DW project, it is crucial that we develop and test with Production data. Lots of PMs don't get this. They are so used to data entry application projects that they ask, "Why do you need Production data for development?" Two reasons: the first is data quality / data sourcing / functionality; the second is performance / technical / size.
The first reason is about making sure that you get the right data, and therefore can build the correct functionality in the DW. The second is about making sure that we can deal with vast amounts of data. It's all very well that the ETL package only took 2 hours in dev, but when we go live it could take 7 hours! Why? "Oh, we developed using a 50,000-row sample." You must be joking! This is a data warehousing project, not a data entry application development project. And you are back to square one: redo the development, redo the ETL package, redo the testing, etc., which is a huge waste of valuable project time.
Vincent Rainardi, 23/6/2011