Data Warehousing and Data Science

5 April 2011

ETL and Data Integration

Filed under: Data Warehousing — Vincent Rainardi @ 6:06 pm

I realise that “point-to-point” ETL is a costly option if you have many consumers. The alternative is for everybody to come to a Central Point to get data. Consider this “Point to Point” diagram:

System A publishes to 3 systems (B, C, D), indicated by the red arrows. C consumes from 3 sources (A, B, D). Real life is like that too: a transaction system is consumed by many systems, whereas a data warehouse/mart consumes from many sources. In large corporations there could be 100 systems instead of just 4, and that is where it gets really complicated. Imagine for a minute the above diagram, but with 100 nodes. Below is a diagram for just 10 systems; imagine if it were 100.

In the above diagram, systems B, C, D, G and J essentially do the same thing, i.e. they all consume data from system A. Yes, but they consume different things. Fair point, but there is still a cost-saving opportunity here. If A publishes its data, and B, C, D, G and J all subscribe to it, then it is simpler. Now imagine all data sources publishing to a Central Point and all data consumers reading from that Central Point:

From a data integration point of view, that’s a lot simpler. So in large corporations, that’s what happens: the data warehouse does not read directly from many source systems, but consumes data from a Central Point. That’s Data Integration.
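The publish-and-subscribe idea behind the Central Point can be sketched in a few lines of Python. This is just an illustration; a real implementation would use a message broker or integration platform, and the `CentralPoint` class and topic name below are hypothetical:

```python
from collections import defaultdict

class CentralPoint:
    """A toy hub: sources publish to a topic, consumers subscribe to it."""

    def __init__(self):
        self.subscribers = defaultdict(list)  # topic -> list of callbacks

    def subscribe(self, topic, callback):
        """A consuming system registers interest in a topic."""
        self.subscribers[topic].append(callback)

    def publish(self, topic, record):
        """A source system publishes once; every subscriber receives it."""
        for callback in self.subscribers[topic]:
            callback(record)

hub = CentralPoint()
received = []
# Systems B, C, D, G and J all subscribe to system A's feed.
for system in ("B", "C", "D", "G", "J"):
    hub.subscribe("system_a.trades",
                  lambda rec, s=system: received.append((s, rec)))

# System A publishes once; all five consumers receive the record.
hub.publish("system_a.trades", {"trade_id": 101})
print(len(received))  # → 5
```

The point is that A no longer needs to know who its consumers are; adding a sixth consumer is one `subscribe` call, not a new point-to-point interface.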

Apart from simplifying the routes, the other benefit of having a Central Point is a standard mechanism for publishing and consuming data. This standardisation also reduces development cost: the mechanism is developed once, and all systems then use it to publish and consume data.

By having a Central Point, we can monitor the data traffic. We can police it. We can ensure that everybody adheres to the standard format. We have a catalogue of publishers: the data they publish, when it was published, how frequently, what validations were made before publishing, who the consumers are, when the data was last sent out, who the support team is, and so on. We have a catalogue of consumers: the data they consume, when it was last consumed, and so on. And we can build a logging mechanism: the number of records sent, the kilobytes sent, the volume of data from each publisher, and so on.
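As a sketch, one entry in such a publisher catalogue could be modelled like this in Python. The field names below are illustrative assumptions, not a prescribed schema:

```python
from dataclasses import dataclass, field

@dataclass
class PublisherEntry:
    """One row in the catalogue of publishers at the Central Point."""
    system: str
    dataset: str
    frequency: str                      # e.g. "daily", "intraday"
    validations: list = field(default_factory=list)
    consumers: list = field(default_factory=list)
    support_team: str = ""

catalogue = [
    PublisherEntry(
        system="System A", dataset="trades", frequency="daily",
        validations=["schema check", "row count"],
        consumers=["B", "C", "D", "G", "J"],
        support_team="Trading IT",
    ),
]

# With the catalogue in one place, questions like "who consumes
# the trades feed?" become a simple lookup.
trades = next(e for e in catalogue if e.dataset == "trades")
print(sorted(trades.consumers))  # → ['B', 'C', 'D', 'G', 'J']
```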

Because all traffic must go through a central point, monitoring is a lot easier. The disadvantages, of course, are a single point of failure, and bandwidth: we need to provide adequate bandwidth at the Central Point. Weighing the benefits (cost saving, simplification, monitoring, compliance, validation) against the disadvantages, the key decision factor is the number of nodes. If you are a small company with only 4 systems, it’s not worth it. If you are a large corporation with 40 systems, it’s definitely worth it.
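The arithmetic behind this decision factor is straightforward. In the worst case (every system exchanging data with every other system), point-to-point needs a feed between every ordered pair of systems, while a Central Point needs only two links per system: one to publish, one to consume. A quick sketch:

```python
def point_to_point_links(n):
    # Worst case: every system feeds every other system directly.
    return n * (n - 1)

def hub_links(n):
    # Each system has one publish link and one consume link to the hub.
    return 2 * n

for n in (4, 10, 40, 100):
    print(f"{n} systems: {point_to_point_links(n)} point-to-point "
          f"feeds vs {hub_links(n)} hub links")
# At 4 systems it's 12 vs 8 (barely worth it); at 40 systems
# it's 1560 vs 80, which is why large corporations take the hub.
```

In practice not every pair of systems exchanges data, so the point-to-point count is lower than this worst case, but it still grows roughly with the square of the number of systems, while the hub grows linearly.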

That’s data integration. Not ETL.

As usual I welcome questions and comments. Vincent, 5/3/11.


  1. Hi Vincent,

    I have a question about ETL. When designing an ETL process we can do the data transformation steps in the database (procedures or SQL), or we can do these steps in an ETL tool (PowerCenter, Pentaho DI, SSIS…). As far as I know, some steps perform better in an ETL tool and others perform better in the database. For example, some ETL tools do sorts or aggregations faster than database procedures or SQL. I would like to know if there are more steps that perform differently in an ETL tool than in a database, in order to have some criteria for making a design decision.

    Thanks in advance

    Comment by Juan V — 4 December 2016 @ 10:29 pm | Reply

    • Hi Juan, thank you for contacting me. I’ll answer according to my experience but I’m sure there are many people out there who have different experience and can advise you as well.
      An ETL tool’s ability to pin tables into memory is crucial to performance. This enables faster joins, lookups and sorting. Pinning into memory usually gives better performance for small to medium-size tables. The other very important feature is the ability to push work down to the RDBMS engine. This is particularly useful for aggregations and joins, if appropriate indices are available. An ETL tool’s ability to fast-read many big files in “parallel batches” is also crucial for performance. “In-memory looping” is, generally speaking, an area of strength for ETL tools, in comparison to the slow, old cursor technology in an RDBMS. Another performant area for ETL tools is calculations and functions, particularly mathematical or financial ones.
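The in-memory lookup idea can be illustrated with a small Python sketch: the dimension table is held in a hash map, so enriching each incoming row is a constant-time lookup instead of a per-row database round trip (cursor-style). The data below is made up for illustration:

```python
# Small dimension table "pinned" into memory as a hash map.
customers = {1: "Alice", 2: "Bob", 3: "Carol"}

# Incoming fact rows to be enriched.
transactions = [
    {"txn_id": 10, "customer_id": 2, "amount": 50.0},
    {"txn_id": 11, "customer_id": 3, "amount": 75.0},
]

# One O(1) in-memory lookup per row, instead of issuing one SQL
# query per row against the database.
enriched = [
    {**t, "customer_name": customers.get(t["customer_id"], "UNKNOWN")}
    for t in transactions
]
print([r["customer_name"] for r in enriched])  # → ['Bob', 'Carol']
```

This is why the technique pays off for small to medium-size tables: the whole dimension must fit comfortably in memory for the hash lookup to be available.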
      For other features please read my article “Choosing an ETL Tool” on this blog.
      But in my experience, features are not the primary criteria in selecting an ETL tool. The priority is usually: 1) total cost, 2) conformed architecture, 3) vendor financial strength. So don’t spend too much time looking at the features. If the vendor is weak, or the total cost is too high for your project, or the technology architecture does not suit your company, ditch them. We don’t want a vendor who could go bust in 2022, or an ETL tool costing 50% of the project budget. It is the TOTAL cost, so include servers, storage, annual maintenance, training and consultant costs, not just the licence cost.

      Comment by Vincent Rainardi — 5 December 2016 @ 8:39 am | Reply

  2. Thank you very much Vincent, your answer is very useful. In my case the database is Teradata, with parallel features, and that is why I am thinking of pushing down some of the ETL steps, but I will test the performance first.

    Comment by Juan V — 5 December 2016 @ 5:15 pm | Reply
