Some SQL server data warehouses are loaded using stored procedures (SP). Data from the source system(s) is dumped into stage, and then many SPs are executed one after the other to load the data from stage into the warehouse. This batch of SPs resembled what we had 20 years ago, when mainframes or AS/400 were running overnight batches.
If the source system is not a SQL Server, an ETL tool like SSIS or Informatica may be used to import the data into stage. If the source system is a SQL Server, a series of SPs can be used to import the data into stage. An ETL tool can read data from any RDBMS and deliver data to any RDBMS. That is the distinct advantage over SPs, which only works on SQL Server.
A few days ago I talked to a friend who uses SSIS just to dump the source data into stage. He then uses SPs to load the stage data into the warehouse. This scenario is quite common, and usually chosen because of these reasons:
- Both the source/operational system and the data warehouse are in SQL Server
- The person designing the warehouse was not familiar with ETL tools. Probably he was a SQL developer who was asked to develop a data warehouse. He had written hundreds of SPs in the past, so “when you hold a hammer everything looks like a nail”.
- The person responsible for the ETL believed that SPs deliver more performance than an ETL tool.
The question he asked was, if both the source system and the data warehouse are in SQL Server, is there a reason of using an ETL tool to load data from stage into the warehouse? I explained that there were a few main reasons why people took the decision to use an ETL tool:
- Using an ETL tool it is easier to maintain the code
- An ETL tool is more flexible than SPs (it has more functionalities)
- An ETL tool deliver better performance than SPs
- Features such as logging/audit and metadata are built in
Using an ETL tool it is easier to maintain the code
An ETL tool is GUI based. You drag and drop. We map columns by “connecting the lines”. We have visuals of what the source is, what the destination is and what the transformations are. SPs on the other hand are text based. True you can use Information Schemas Routine or syscomments to search, but it’s like a jungle of spaghetti code. Even with a decent documentation the production support team would still be struggling to understand it.
An ETL tool is more flexible than SPs
With a decent ETL tool like BODI, SSIS, Informatica, Ab Initio and Data Stage/Web Sphere, everything is built in. Lookup, aggregation, branching based on certain condition, expression, string & date manipulations, are all built in. I’ve seen more than 3 times now, in the case SPs, the developer created SQL functions for string or date manipulation. They are clever and nice, but with an ETL tool it’s easier because they have a library of ready-made functions. When it comes to the advanced features, SPs can’t talk much: fuzzy lookup, debugging, ability to create custom scripts, data matching algorithm, data quality features, data profiling, ability to process XML feed, etc.
An ETL tool delivers better performance than SPs
Some people said to me that they don’t need performance. It’s only a small data mart. OK, but nevertheless, let’s talk about performance, in case in the future you do bigger project. A lot of performance related features are built in such as in-memory lookup, parallel running tasks, ability to utilise 3 different servers to run a massive load, high performant aggregation, high performant join, high availability features (when node 1 stop node 2 is still going).
Logging, Audit and Metadata
If we use SPs we will need to do the logging, auditing and metadata manually. With a decent ETL tool you don’t have too. You can do data lineage analysis, logging into database tables or files, tracing who’s doing what and when, all out-of-the-box standard functionality.
As usual I welcome any comments and questions at email@example.com. Vincent, 5/3/11.