I realise that “point to point” integration, such as ETL, is a costly option if you have many consumers. The alternative is for everybody to come to a Central Point to get data. Consider this “Point to Point” diagram:
System A publishes to 3 systems (B, C, D), indicated by the red arrows. C consumes 3 sources (A, B, D). Real life is like that too: a transaction system is consumed by many systems, whereas a data warehouse or mart consumes from many. In large corporations there could be 100 systems instead of just 4, and then it gets really complicated. Below is a diagram for just 10 systems; imagine it with 100 nodes.
In the above diagram, systems B, C, D, G and J essentially do the same thing: all of them consume data from system A. “Yes, but they consume different things.” OK, fair point, but there is still a cost saving opportunity here. If A publishes its data once, and B, C, D, G and J all subscribe to it, then it would be simpler. Imagine all data sources publish to a Central Point and all data consumers read from that Central Point:
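To make the publish/subscribe idea concrete, here is a minimal in-memory sketch. It is only an illustration, not how any particular integration product works: the `CentralPoint` class, the topic names (`A.trades`, `A.customers`) and the sample record are all made up for this example. Topics are how I would handle the "they consume different things" objection: A publishes each data set once, and each consumer subscribes only to what it needs.

```python
from collections import defaultdict


class CentralPoint:
    """Hypothetical in-memory hub: publishers push in, consumers subscribe."""

    def __init__(self):
        # topic name -> list of consumer callbacks
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        self._subscribers[topic].append(callback)

    def publish(self, topic, record):
        """Deliver one record to every subscriber of the topic.

        Returns the number of consumers the record was delivered to.
        """
        for callback in self._subscribers[topic]:
            callback(record)
        return len(self._subscribers[topic])


hub = CentralPoint()
received = defaultdict(list)  # consumer name -> records it has received

# B, C, D, G and J all want A's trades (assumed topic name).
for consumer in ["B", "C", "D", "G", "J"]:
    hub.subscribe("A.trades", lambda rec, name=consumer: received[name].append(rec))

# Only C also wants A's customer data.
hub.subscribe("A.customers", lambda rec: received["C"].append(rec))

# System A publishes once; the hub fans the record out to all five consumers.
delivered = hub.publish("A.trades", {"trade_id": 1, "amount": 100.0})
print(delivered)
```

System A knows nothing about B, C, D, G or J here. Adding a sixth consumer is one more `subscribe` call at the Central Point, with no change to A.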
From a data integration point of view, that’s a lot simpler. So in large corporations, that’s what happens. The data warehouse does not read directly from many source systems, but consumes data from a Central Point. That’s Data Integration.
Apart from simplifying the routes, the other benefit of having a Central Point is a standard mechanism for publishing and consuming data. This standardisation also reduces the cost of development, because the mechanism is developed only once and all systems then use it.
By having a Central Point, we can monitor the data traffic. We can police the traffic. We can ensure that everybody adheres to the standard format. We have a catalogue of publishers: the data they publish, when it was published, how frequently, what validation was done before publishing, who the consumers are, when the data was last sent out, who the support team is, etc. We have a catalogue of consumers: the data they consume, when it was last consumed, etc. We can build a logging mechanism: number of records sent, kilobytes sent, volume of data from each publisher, etc.
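As a rough sketch of what one publisher entry in such a catalogue could look like, with the logging counters attached: the field names, the `log_publication` helper and the sample values below are my own assumptions for illustration, not a standard schema.

```python
import json
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class PublisherEntry:
    """One row in a hypothetical publisher catalogue."""
    name: str
    dataset: str
    frequency: str                      # e.g. "daily"
    support_team: str
    consumers: List[str] = field(default_factory=list)
    last_published: Optional[datetime] = None
    records_sent: int = 0
    bytes_sent: int = 0


def log_publication(entry, records):
    """Update the catalogue entry each time the publisher sends a batch."""
    payload = json.dumps(records).encode("utf-8")  # size as sent on the wire
    entry.last_published = datetime.now(timezone.utc)
    entry.records_sent += len(records)
    entry.bytes_sent += len(payload)
    return entry


entry = PublisherEntry(
    name="A",
    dataset="trades",
    frequency="daily",
    support_team="Team A support",      # made-up value
    consumers=["B", "C", "D", "G", "J"],
)
log_publication(entry, [{"trade_id": 1}, {"trade_id": 2}])
print(entry.records_sent, entry.bytes_sent)
```

Because every batch passes through the Central Point, these counters are updated in one place rather than in every point-to-point feed.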
Because all traffic must go through a central point, monitoring is a lot easier. The disadvantage, of course, is a single point of failure. The other is bandwidth: we need to provide adequate bandwidth at the Central Point. Considering the benefits (cost saving, simplification, monitoring, compliance, validation) and the disadvantages, the key decision factor is the number of nodes. If you are a small company with only 4 systems, then it’s not worth it. If you are a large corporation with 40 systems, then it’s definitely worth it.
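To see why the number of nodes matters so much, compare the worst-case number of feeds in each approach. This is back-of-envelope arithmetic under two simplifying assumptions of mine: in point to point, every system could feed every other system directly, giving n × (n − 1) one-way feeds; with a Central Point, every system has at most one feed in and one feed out, giving 2 × n. Real landscapes are sparser than the worst case, but the growth pattern is the point.

```python
def point_to_point_links(n):
    """Worst case: every system feeds every other system directly."""
    return n * (n - 1)


def central_point_links(n):
    """Worst case: every system both publishes to and reads from the hub."""
    return 2 * n


for n in (4, 10, 40, 100):
    print(n, point_to_point_links(n), central_point_links(n))
# 4 12 8
# 10 90 20
# 40 1560 80
# 100 9900 200
```

With 4 systems the difference (12 versus 8) does not justify building and running a hub. With 40 systems it is 1560 versus 80, which is a different conversation.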
That’s data integration. Not ETL.
As usual I welcome questions and comments at email@example.com. Vincent, 5/3/11.