Data Warehousing and Business Intelligence

5 April 2011

ETL and Data Integration

Filed under: Data Warehousing — Vincent Rainardi @ 6:06 pm

I realise that a “point to point” approach such as ETL is a costly option if you have many consumers. The alternative is for everybody to come to a Central Point to get data. Consider this “Point to Point” diagram:

System A publishes to 3 systems (B, C, D), indicated by the red arrows. C consumes 3 sources (A, B, D). Real life is like that too: a transaction system is consumed by many, whereas a data warehouse/mart consumes many. In large corporations there could be 100 systems instead of just 4, and here it gets really complicated. Below is a diagram for just 10 systems; imagine if it were 100.

In the above diagram, systems B, C, D, G and J essentially do the same thing: all of them consume data from system A. Yes, but they consume different things. OK, fair point, but there is still a cost saving opportunity here. If A publishes its data, and B, C, D, G and J all subscribe to it, then it would be simpler. Imagine all data sources publish to a Central Point and all data consumers read from that Central Point:

From a data integration point of view, that’s a lot simpler. So in large corporations, that’s what’s happening. The data warehouse does not read directly from many source systems, but consumes data from a Central Point. That’s Data Integration.
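To make the publish/subscribe idea concrete, here is a minimal in-memory sketch. The class name CentralPoint, the topic name "A.trades" and the record fields are all made up for illustration; a real Central Point would sit on a message broker or an integration platform, not a Python dictionary.

```python
from collections import defaultdict
from typing import Callable, Dict, List

Record = dict
Handler = Callable[[Record], None]


class CentralPoint:
    """Hypothetical hub: publishers push to a topic, consumers subscribe to it."""

    def __init__(self) -> None:
        # topic name -> list of consumer callbacks
        self._subscribers: Dict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, record: Record) -> int:
        """Deliver one record to every subscriber; return how many received it."""
        handlers = self._subscribers.get(topic, [])
        for handler in handlers:
            handler(record)
        return len(handlers)


hub = CentralPoint()
received: Dict[str, List[Record]] = defaultdict(list)

# Systems B, C, D, G and J all subscribe to what system A publishes.
for consumer in ["B", "C", "D", "G", "J"]:
    hub.subscribe("A.trades", lambda rec, name=consumer: received[name].append(rec))

# System A publishes once; it does not need to know who the consumers are.
delivered = hub.publish("A.trades", {"trade_id": 1, "amount": 100.0})
```

System A makes a single publish call regardless of how many consumers there are, which is where the simplification comes from.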

Apart from simplifying the routes, the other benefit of having a Central Point is a standard mechanism for publishing and consuming data. This standardisation also reduces the cost of development: the mechanism is developed only once, and all systems then use it to consume data.

By having a Central Point, we can monitor the data traffic. We can police the traffic. We can ensure that everybody adheres to the standard format. We have a catalogue of publishers: the data they publish, when it was published, how frequently, what validation was done before publishing, who the consumers are, when the data was last sent out, who the support team is, etc. We have a catalogue of consumers: the data they consume, when it was last consumed, etc. We can build a logging mechanism: number of records sent, kilobytes sent, volume of data from each publisher, etc.
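As a rough sketch of what a catalogue entry and a traffic log could look like, consider the following. The names PublisherEntry and TrafficLog, and every field in them, are assumptions chosen to mirror the list above; they are not taken from any particular product.

```python
import json
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class PublisherEntry:
    """One row in a hypothetical publisher catalogue."""
    system: str
    dataset: str
    frequency: str                      # e.g. "daily"
    support_team: str
    consumers: List[str] = field(default_factory=list)
    last_published: Optional[datetime] = None


@dataclass
class TrafficLog:
    """Hypothetical log of what went through the Central Point."""
    entries: List[dict] = field(default_factory=list)

    def record(self, entry: PublisherEntry, records: List[dict]) -> dict:
        # Size is measured on the JSON-serialised payload (an assumption).
        payload = json.dumps(records).encode("utf-8")
        now = datetime.now(timezone.utc)
        entry.last_published = now
        row = {
            "system": entry.system,
            "dataset": entry.dataset,
            "published_at": now,
            "record_count": len(records),
            "kilobytes": len(payload) / 1024,
        }
        self.entries.append(row)
        return row


entry = PublisherEntry("A", "trades", "daily", "Team A support", ["B", "C", "D"])
log = TrafficLog()
row = log.record(entry, [{"trade_id": 1}, {"trade_id": 2}])
```

Because every publication passes through one place, the catalogue and the log are updated in the same call, and nobody has to collect these figures from each system separately.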

Because all traffic must go through a Central Point, monitoring is a lot easier. The disadvantages, of course, are a single point of failure and bandwidth: we need to provide adequate bandwidth. Considering the benefits (cost saving, simplification, monitoring, compliance, validation) and the disadvantages, the key decision factor is the number of nodes. If you are a small company with only 4 systems, then it’s not worth it. If you are a large corporation with 40 systems, then it’s definitely worth it.
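A back-of-the-envelope count shows why the number of nodes matters so much. The sketch below assumes the worst case, where every system exchanges data with every other system, and counts one link per direction; real landscapes have fewer links, so treat the figures as an upper bound. The function names are made up for illustration.

```python
def point_to_point_links(n: int) -> int:
    """Worst case: each of n systems feeds every other system directly."""
    return n * (n - 1)


def central_point_links(n: int) -> int:
    """Each system has at most one link in and one link out of the Central Point."""
    return 2 * n


for n in (4, 10, 40, 100):
    print(n, point_to_point_links(n), central_point_links(n))
# 4 12 8
# 10 90 20
# 40 1560 80
# 100 9900 200
```

Point-to-point links grow with the square of the number of systems, while Central Point links grow linearly. The count alone does not include the cost of building and running the Central Point itself, which is why 4 systems may not justify it even though the link count is already slightly lower.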

That’s data integration. Not ETL.

As usual I welcome questions and comments at vrainardi@gmail.com. Vincent, 5/3/11.
