All these years I always assumed that data warehouse was the place for a single version of the truth. It is the place for integration. If the department A has data different from department B, then we integrate it in the DW. If the DW is for a group of companies, and the companies have different data, we integrate it there. And I have done it many times (integration).
1. Integrating Rows: (as always it is easier to explain by example)
For example: it is a group DW for 2 companies. Company A has Customer1 to Customer100. Company B has Customer80 to Customer 140. They have 20 common customers. We integrate this in DW. Our DW Customer Dimension has 120 rows (plus SCD2 rows).
2. Integrating Attributes: If company A has 3 customer attributes (A, B, C) and company B has 4 attributes (A, B, D, E) then the DW customer dim has 5 attributes (A, B, C, D, E). We populate attribute A and B for all 120* rows. We populate attribute C only for 80* rows (Customer1 to 80). We populate attribute D & E for 60 rows* (customer80 to 140). *Plus SCD2 rows.
The whole point of the “DW is the point of integration” theory is so that we can report at group wide numbers.
3. Multiple Versions of the Truth
But this week I learned something contradictive: different departments/projects may want different versions. Again it is easier to explain by example:
Application1 creates and stores customer data into a database. This customer data is then fed into the company DW.
Application2 takes the customer data from the app1’s database. App2 then modifies the customer attributes and uses it for its calculations. App2 has rules like this: if Attribute A is value1 then replace it with value2. App2 stores the modified customer attribute into its database. Clearly, App2 customer table is contains different values to App1.
Say the attribute is Class. In App1 the Class contains value from A to Z. In App2, there is rule which says: if the Class is H, I or J, change it to L. So in App1, the value of Class is A to Z. In App2, the value Class is A to W. Say in App1 we have 10,000 customers. In App2 we have 2,000 customers, all of which exist in App1’s list of 10,000 customers.
Now, how do we build the DW’s customer dim?
We need to have two Class attributes: “App1 Class” and “App2 Class”. We populate App1 Class for 10,000 rows. We populate App2 Class for 2,000 rows.
Can we have 1 Class attribute in the DW? No we can’t. Why? Because both Apps want to see different things. App2 rule clearly says: if the Class is X, Y or Z then change it to L.
The basic principle of Data Warehousing is that: what you see in the source system must be what you get from the DW/BI. It does not make sense if Customer1 in App2 has Class = L and in the BI report it has Class = Y. If in App2 the Class of Customer1 is L then in the DW/BI it must be L.
This “DW must report what is in the source system” principle is the most important principle of data warehousing. It takes the precedence over any other principles, including the principle of integration.
4. Measures
Say in App1 we have a column called CFI, with value ranging from 0.0 to 1.0. This in the Data Warehouse becomes a measure in a fact table. App2 reads App1’s CFI and added a rule: if CFI is >= 0.1 use it. But if CFI < 0.1 then use the a specific formula to calculate CFI.
So in DW we should have 2 measures: App1 CFI and App2 CFI. The App1 CFI can just be called CFI because that’s what everyone thinks CFI is about, but The CFI from App2 must be clearly labelled in the DW as “App2 CFI”.
5. The Importance of Timing
App1 and App2 data may arrive in the Data Warehouse at different timing. App1 data arrives a day after it is created in App1. App2 data arrives in DW 2 days after it is created in App2. Why? Because App2 reads the App1 database overnight.
When creating App1 CFI and App2 CFI the DW/BI must be aware that App2 CFI is 1 day later than App1 CFI. So the DW/BI must not add them together. Tuesday App1 CFI can be summed with Wednesday App2 CFI. But Tuesday App1 CFI cannot be summed with Tuesday App2 CFI. This is in the DW’s ETL and also in the BI’s Report. You can’t add data from different timing.
Similarly, on the dimension/measure front, App1 Class has different timing to App2 Class so when a user tries to compare them, they need to take Tuesday App1 Class and compare it with Wednesday App2 Class. They can’t compare Tuesday with Tuesday.
6. The downstream principle
If App2 is getting data from App1, and App3 is getting data from App2, then
a) There’s an increased chance of getting it wrong (Chinese whispers)
b) We are creating dependency on App2
It is better that App3 reads directly from App1. In large corporation this is the known as the downstream principle: flowing data downstream is not allowed. All downstream apps must get the data direct from the app that creates the data.
It is not too bad in small companies. But if a group has many companies it becomes a big problem. In big corporation there is a concept of “middleware”, a software that publishes data from 1 app to another app. Middleware arrange the messages, queuing, the resilience, etc between apps. The middleware prevent direct ODBC connection between the consuming app and the source app.
Yes the consuming app must get data direct from the source app (downstream principle) but it must not use direct ETL i.e. direct DB to DB connectivity. The consuming app must read the data from the Message Queuing System (MQS) middleware, such as MSMQ and WebSphere MQ. And the source app must send the data to the MQS.
This is because it is likely that there are many consumer of that data, not only the DW, but also other downstream app. That is why: “Publish once consume many times”. This principle is enabled by the MQS, aka Message Broker System. If all consuming apps read direct from the source app, the source app can’t handle it. There will be a bottleneck on the source app.
Conclusion
If Customer1 in App2 has Class = L, in the DW it must be L. A data warehouse must always reflect the values in the source system. A MQS middleware enables many downstream apps including DW to consume the same data from the source app without causing bottleneck because the source app publishes it only once.