Data Warehousing and Business Intelligence

5 April 2011

Stored Procedure vs ETL Tool

Filed under: Data Warehousing — Vincent Rainardi @ 7:03 am
Tags:

Some SQL server data warehouses are loaded using stored procedures (SP). Data from the source system(s) is dumped into stage, and then many SPs are executed one after the other to load the data from stage into the warehouse. This batch of SPs resembled what we had 20 years ago, when mainframes or AS/400 were running overnight batches.

If the source system is not a SQL Server, an ETL tool like SSIS or Informatica may be used to import the data into stage. If the source system is a SQL Server, a series of SPs can be used to import the data into stage. An ETL tool can read data from any RDBMS and deliver data to any RDBMS. That is the distinct advantage over SPs, which only works on SQL Server.

A few days ago I talked to a friend who uses SSIS just to dump the source data into stage. He then uses SPs to load the stage data into the warehouse. This scenario is quite common, and usually chosen because of these reasons:

  1. Both the source/operational system and the data warehouse are in SQL Server
  2. The person designing the warehouse was not familiar with ETL tools. Probably he was a SQL developer who was asked to develop a data warehouse. He had written hundreds of SPs in the past, so “when you hold a hammer everything looks like a nail”.
  3. The person responsible for the ETL believed that SPs deliver more performance than an ETL tool.

The question he asked was, if both the source system and the data warehouse are in SQL Server, is there a reason of using an ETL tool to load data from stage into the warehouse? I explained that there were a few main reasons why people took the decision to use an ETL tool:

  1. Using an ETL tool it is easier to maintain the code
  2. An ETL tool is more flexible than SPs (it has more functionalities)
  3. An ETL tool deliver better performance than SPs
  4. Features such as logging/audit and metadata are built in

Using an ETL tool it is easier to maintain the code

An ETL tool is GUI based. You drag and drop. We map columns by “connecting the lines”. We have visuals of what the source is, what the destination is and what the transformations are. SPs on the other hand are text based. True you can use Information Schemas Routine or syscomments to search, but it’s like a jungle of spaghetti code. Even with a decent documentation the production support team would still be struggling to understand it.

An ETL tool is more flexible than SPs

With a decent ETL tool like BODI, SSIS, Informatica, Ab Initio and Data Stage/Web Sphere, everything is built in. Lookup, aggregation, branching based on certain condition, expression, string & date manipulations, are all built in. I’ve seen more than 3 times now, in the case SPs, the developer created SQL functions for string or date manipulation. They are clever and nice, but with an ETL tool it’s easier because they have a library of ready-made functions. When it comes to the advanced features, SPs can’t talk much: fuzzy lookup, debugging, ability to create custom scripts, data matching algorithm, data quality features, data profiling, ability to process XML feed, etc.

An ETL tool delivers better performance than SPs

Some people said to me that they don’t need performance. It’s only a small data mart. OK, but nevertheless, let’s talk about performance, in case in the future you do bigger project. A lot of performance related features are built in such as in-memory lookup, parallel running tasks, ability to utilise 3 different servers to run a massive load, high performant aggregation, high performant join, high availability features (when node 1 stop node 2 is still going).

Logging, Audit and Metadata

If we use SPs we will need to do the logging, auditing and metadata manually. With a decent ETL tool you don’t have too. You can do data lineage analysis, logging into database tables or files, tracing who’s doing what and when, all out-of-the-box standard functionality.

As usual I welcome any comments and questions at vrainardi@gmail.com. Vincent, 5/3/11.

About these ads

10 Comments »

  1. its really very nice

    Comment by sreekanth — 14 June 2011 @ 4:03 pm | Reply

  2. The fact that you have not listed any drawbacks with standard ETL tools tells a lot about the type of projects you are basing your comments on. To claim that SSIS performs better than standards T-SQL is rather absurd to say the least.

    Comment by Tahir Riaz — 15 November 2011 @ 11:15 am | Reply

    • Thank you for your comment Tahir

      Comment by Vincent Rainardi — 15 November 2011 @ 7:52 pm | Reply

    • Vincent, I agree with what you are saying, but I would not include SSIS as an ETL tool, in fact it impairs performance. Yes it moves and transforms data, but very Clumsily. I would consider a real ETL tool to be repository based. I think this is the biggest advantage of a robust ETL tool like Informatica. The main reason companies don’t use a robust ETL tool is the cost.

      Comment by Tim — 20 December 2011 @ 7:37 pm | Reply

  3. Hi Vincent,

    thanks for your interesting post. As I think it is written a bit onesided, I would like to discuss some point from another perspective.

    .) Using an ETL tool it is easier to maintain the code
    It’s true that ETL tools are GUI based, but in my experience, that doesn’t necessarily mean, that it’s easier to maintain. There are much more SQL coders out there than people with experience in a specific ETL tool. I personally find an error much faster in a PL/SQL code, than with debugging mappings of ETL tools.
    I would also argue, that developing PL/SQL can be faster than developing ETL logic with ETL tools, as you can easily write a code generator to automatize the development with dynamic SQL code.

    .)An ETL tool is more flexible than SPs
    lookups, aggregations, branching based inserts, conditions, expressions, string & date manipulations are built in modern database solutions as well, and they are even faster.
    You are right, that some ETL tools may deliver some advanced functions, that you don’t find in a database.

    .)An ETL tool delivers better performance than SPs
    In almost all cases the SPs, which are executed directly in the database must be faster than any ETL tool. Not only, because you don’t have to transfer your data to your ETL server and back to your DB. Also because joins/lookups, aggregations, filters, etc can ran much faster in DB and can be tuned using indexes and other techniques.

    .)Logging, Audit and Metadata
    I have to agree here, that monitoring/logging is usually easier with ETL tools. You have to implement your own logging mechanism in your SPs

    Personally I have used both, ETL tools (Informatica, SSIS, Pentaho, Talend) as well as SPs to develop ETL processes.
    My opinion is, that it totally depends on the project which approach is better. When you have different source systems, web services, files, etc. that you have to bring together, and you want to design your ETL process DB-independent, ETL tools are clearly the choice.
    When the source and DWH systems are on the same DB, and you have a lot of similar transformations, I would use SPs to automatize the development process and get a better performance.

    Comment by Mike — 22 January 2012 @ 7:29 pm | Reply

  4. Hi Vincent,

    This is a very passionate topic both at your blog and at many organizations. I think this problem like many is a focus on the part than on the whole. What I mean by that is professionals needing to build a specific solution look at the tools they are most familar with and can execute on using that tool. Many of the best programmers choose storeed procedures because their ability to execute on it is very high.

    The whole focuses on the bigger picture, what happens after the SQL expert that built that solution leaves and someone else has to support it, or what happens after the small project turns into a mammoth project. The SP/View solution begins to be crushed under its own weight with very few options for tuning.

    Comment by Jonathan Smith — 30 October 2012 @ 2:12 pm | Reply

  5. Hi Author,
    Based on my experience I found SP works faster than ETL. SPs are pre-compiled and stored in SQLServer.

    Comment by momtazulkarim — 4 September 2013 @ 9:07 pm | Reply


RSS feed for comments on this post. TrackBack URI

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

The Rubric Theme. Create a free website or blog at WordPress.com.

Follow

Get every new post delivered to your Inbox.

Join 207 other followers

%d bloggers like this: