The 6 most important features of an ETL tool are:
- Connections
- Tasks
- Workflow
- Execution
- Performance
- Management
Apologies for being old-fashioned and calling it an ETL tool rather than using the modern term, Data Integration tool. "It is not just a data loader!" I can hear some of you saying 🙂 Unfortunately, ETL is the name that people know.
1. Connections
The most important feature of an ETL tool is its Connections. It has to be able to connect to Excel, FTP, SharePoint, Bloomberg, Thomson Reuters, FIX, FpML, Salesforce, SAP, Cloud, Hadoop, MQ, LDAP, and web services (complete list below). If it can't connect to the data source or the target system, then it doesn't matter what processing ability the ETL tool has; it can't be used. Some tools can't even connect to Excel or SharePoint, or to a database located on a different server, let alone interface with web services, Azure or MQ. All tools can connect to a database/RDBMS, but some tools have "native client" drivers, giving far better performance and control compared to ODBC.
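To make the native-driver point concrete, here is a minimal Python sketch, not any particular ETL tool's code, showing the same warehouse reached via generic ODBC and via a native driver (it assumes the pyodbc and psycopg2 packages; the server names and credentials are made up for illustration):

```python
import pyodbc    # generic ODBC route: portable, but fewer tuning options
import psycopg2  # native PostgreSQL driver: better performance and control

# Generic ODBC connection (illustrative DSN-less connection string).
odbc_conn = pyodbc.connect(
    "DRIVER={PostgreSQL Unicode};SERVER=dwh01;DATABASE=staging;"
    "UID=etl_user;PWD=secret"
)

# Native driver connection: exposes driver-specific features such as
# server-side cursors and COPY-based bulk loading.
native_conn = psycopg2.connect(
    host="dwh01", dbname="staging", user="etl_user", password="secret"
)
```

A good ETL tool hides both routes behind its connection manager, but the native option is where the extra speed and control come from.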
2. Tasks
The second most important feature is the richness of the transformations available for day-to-day tasks, such as data conversion, lookup, expression, joining records, splitting data, filtering, ranking, sorting, grouping, looping, and combining data. All ETL tools can do some of these tasks, but some are much better than others.
Doing pivot/unpivot, converting dates, setting variables based on files, merging rows, finding the latest file, and splitting data based on certain conditions, for example, could be a walk in the park for one ETL tool, but a headache for another. Some ETL tools can do advanced tasks out of the box, such as running web methods, watching files, transforming XML documents, rebuilding indexes, sending emails, profiling data, handling arrays, processing unstructured data, and doing custom tasks using .NET scripts.
Some ETL tools have wizards to assist in creating complex packages, for example for loading fact tables or type 2 dimensions. Some ETL tools have built-in Data Quality tasks such as validation, profiling, and statistical analysis of the data. In some ETL tools we can even create the ETL packages programmatically.
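To give a feel for what lookup, filter and pivot look like when written by hand, here is a minimal pandas sketch (the file and column names are made up; a good ETL tool gives you each of these as a ready-made transformation instead):

```python
import pandas as pd

# Illustrative inputs: a sales file and a small product lookup file.
sales = pd.read_csv("daily_sales.csv")        # columns: date, product_id, amount
products = pd.read_csv("product_lookup.csv")  # columns: product_id, category

# Lookup: enrich each sales row with its product category.
enriched = sales.merge(products, on="product_id", how="left")

# Filter: keep only the rows that pass a condition.
filtered = enriched[enriched["amount"] > 0]

# Pivot: one row per date, one column per category, summing the amounts.
pivoted = filtered.pivot_table(
    index="date", columns="category", values="amount", aggfunc="sum"
)
```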
3. Workflow
The third most important feature is the ability to create workflows that arrange and connect these tasks and transformations. This includes constraints (criteria), looping (repeating), branching, and grouping of tasks. Let me give you some examples (the best way to learn is by example); a small sketch in code follows at the end of this section.
- If task A and task B are successful, do task C, otherwise do task D.
- If Criteria1 is true do task A, else if Criteria2 is true do task B.
- Repeat task A 10 times, or until the number of rows is 0.
- Process the Rejected Rows from task A using task B, but the Good Rows using task C.
- Processing files: if one file fails, don't stop; keep processing all the other files.
- Group these 3 tasks together. They always have to run as a unit.
All ETL tools can do some of the above, but some can only do one item, while others can do every item in the list, or even more, such as:
- Pass a variable from task A to task B.
- Derive an instance of this task, and add something to the derived task.
- When an instance of this task runs, make it do different things depending on what parameters are passed.
- Make this group of tasks available to all packages.
(we don’t want to create the same task everywhere, because code duplication causes exponential maintenance effort)
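No tool expresses its workflow in Python, of course, but here is a rough sketch of the constraint, looping and grouping logic described above (the task functions are hypothetical placeholders that return True on success, or a row count):

```python
def branch(task_a, task_b, task_c, task_d):
    """If task A and task B are successful, do task C; otherwise do task D."""
    if task_a() and task_b():
        return task_c()
    return task_d()

def repeat_until_empty(task_a, max_runs=10):
    """Repeat task A up to 10 times, or until the number of rows is 0."""
    for _ in range(max_runs):
        if task_a() == 0:
            break

def run_as_unit(tasks):
    """Group tasks so they always run as one unit, stopping at the first failure."""
    return all(task() for task in tasks)
```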
4. Execution
When an ETL package runs, it is very important to be able to log how it executed, i.e. we need to know how long it took, what time it started and ended, who triggered it, whether it succeeded or failed, what the error message was, etc.
Apart from logging, when an ETL tool runs it also needs to be able to run at scheduled times, re-run when it fails, and limit the execution duration. Let me explain with examples:
- Run / do not run at scheduled times: run at 8pm on weekdays, except on bank holidays (we give it a list of bank holidays).
- Run / do not run on certain conditions: if file A exists, run it. If package B is already running, this package (package A) must not run.
- Rerun when failed: if it fails, re-run it 3 times, waiting 5 minutes between attempts (see the sketch after this list). But if Criteria1 is true, don't rerun it. If the whole ETL package fails, run this command.
- When it fails, rerun from the point of failure not the entire package.
- After running, do this command. Before running, do this command.
- Concurrent running: allow only 2 copies of this package to run simultaneously.
- Overrun / time limit: allow this package to run for only 30 minutes; if it exceeds 30 minutes, stop it. Allow this package to run for no more than 10% above its average duration over the last 7 days.
- Alert/notification: if it fails, send an email to the admins. When it completes, send an email to the admins regardless of the status. If it fails only once, don't send an email, but if it still fails after 3 reruns, send an email. If it overruns, send an email. If it does not run, send an email.
- Escalation: if, after sending the email, no action is taken for 1 hour, send an email to the manager.
- Log the error rows to a table.
- Execute this package based on the parameters in this configuration file.
- Run on grid: run this package on 2 nodes only. Run this package on all available nodes. Run tasks A and C on node 1, and task B on node 2. Run task A on as many nodes as possible, but run task B on 1 node only.
- If another instance of this package is already running, stop it and run a new instance.
- Load balancing: process files 1-100 on node 1, files 101-200 on node 2. Process big files on node 3.
- Data Lineage: provide an audit trail of where this data came from.
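As a rough sketch of the rerun-and-alert behaviour in the list, here is a minimal Python version (run_package and send_alert are hypothetical placeholders, not any tool's API):

```python
import time

def run_with_retries(run_package, send_alert, max_reruns=3, wait_minutes=5):
    """Re-run a failed package up to 3 times, waiting 5 minutes between
    attempts; alert the admins only if it still fails after all the reruns."""
    last_error = None
    for attempt in range(1 + max_reruns):          # first run plus 3 reruns
        try:
            run_package()
            return True                            # success: no alert needed
        except Exception as error:
            last_error = error
            if attempt < max_reruns:               # reruns left: wait, then retry
                time.sleep(wait_minutes * 60)
    send_alert(f"Package failed after {max_reruns} reruns: {last_error}")
    return False
```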
Some of the points I mentioned above are only "dreams" for certain ETL tools, i.e. even if we wait until 2020 their developers will not be able to build those features into the software (due to lack of funding, not lack of skills). But some ETL tools are able to do every single point in the list.
In some projects, if the package fails the company doesn't lose any money at all. These companies prefer to use "free" ETL tools. Well, there is no free lunch, so let's just say $10k. But in some projects, if the package fails they would lose $1 million a day. For these companies, paying $1 million for an ETL tool with excellent Execution features is a no-brainer.
5. Performance
Some projects only process 1 million rows. But some projects process 10 billion rows, and it needs to be done in 1 hour, not 10 hours. For these companies, performance is really important: bulk loading, caching, partitioning, high availability, and so on. Let me give you some examples.
- Use the Bulk Load option, and define a large commit interval to increase performance.
- Run the process in multiple threads. Monitor thread statistics such as idle time, work time and busy time. Allow task A to be done in 10 threads, but task B only in a single thread.
- If the performance is better*, let the database do the joins, filtering, sorting, aggregations, and calculations.
*The ETL tool has the ability to judge whether it would be faster to do it in the ETL tool or in the RDBMS, based on the size and type of the data.
- When doing a lookup, cache the lookup table (1,000 rows) in memory, then do the lookups 1 million times without touching the disk (see the sketch after this list). Automatically decide whether to do a full cache, no cache, or a partial cache, depending on the data volume and type.
- If we read from Oracle or SQL Server, improve the performance by increasing the size of the network packet, or by using IPC protocol. If we read from Teradata, use FastExport reader rather than Relational reader.
- Automatically drop and recreate all indexes, or temporarily disable RI constraints, on the target tables. Automatically disable triggers on the target tables.
- Minimize deadlock occurrence by using different connections to write to the target database.
- Use incremental aggregation, or increase the size of the data cache to hold all data in memory without paging to disk.
- Automatically improve the performance by re-ordering the tasks. For example, perform filtering before aggregation, regardless of the order in which the tasks are written.
- Automatically sort the data if it will improve the performance (known as "sorted output" and "sorted input").
- Improve the join operation by automatically designating the dataset with fewer duplicates as the master leg.
- Automatically force SQL Server to use the optimal join*, for example a hash join rather than a nested loop. *Based on the data we are joining.
- If the workflow contains 10 tasks, split it into 2 partitions, processing the data in different batches simultaneously to increase performance. While rows 1-100 are being processed by partition 2, partition 1 can start processing rows 101-200.
- Control over how many CPUs can be used, the amount of memory used, and paging to disk.
- HA on Grid: By default use both nodes, but if node 1 fails, node 2 automatically picks up node 1’s jobs.
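The lookup-cache bullet above, for example, boils down to something like this minimal sketch (the table, column and function names are made up; a real tool also decides automatically between full, partial and no cache):

```python
def build_lookup_cache(cursor):
    """Read the small lookup table (around 1,000 rows) into memory once,
    using any DB-API cursor."""
    cursor.execute("SELECT product_id, category FROM product_lookup")
    return dict(cursor.fetchall())

def enrich_rows(source_rows, cache):
    """Do the lookups a million times against the in-memory dict,
    without touching the disk again."""
    for product_id, amount in source_rows:
        yield product_id, amount, cache.get(product_id, "UNKNOWN")
```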
6. Management
The last important feature (but not least) is management and administration. For example: source code control and repository, managing the nodes in the grid, and administering and deploying the ETL packages to the Dev, Test and Production environments.
It also covers configuring the packages to run at certain times, setting alerts for when a job fails and the email addresses for notifications, creating users and giving them the right permissions to run jobs or to develop packages, allocating memory and CPUs (virtually) to the ETL instances, minimizing downtime when upgrading, and attaching a database to the ETL instance (it stores its metadata in this database). All these administration and management tools add value to the ETL tool.
All these features may not be important to the developer, but they are important to the Production Support team, and to the Info Sec team.
I hope this article is helpful. I would welcome any feedback at vrainardi@gmail.com, or through comments below.
Complete list of connections:
- Basic Connections: Files, FTP, OLEDB, ODBC/JDBC, XML
- Connections to Databases using Native Connector: SQL Server, Oracle, Teradata, Netezza, Sybase
- Connections to Microsoft Office: Excel, PowerPoint, Word, SharePoint
- Connections to Applications: Salesforce, SAP, Dynamics, Siebel, NetSuite, Marketo, SugarCRM, StrikeIron, Taleo, Lotus Notes
- Connections to Financial Services: Bloomberg, FIX, SWIFT, Thomson Reuters, FpML, ISDA, RiskMeter, SEPA, MISMO, NACHA
- Connections to Cloud, Web Services, Big Data: Cloud DBs, Azure, Redshift, Hadoop, JSON, MongoDB
- Connections to Messaging/Protocols: WebSphere MQ, Microsoft MQ, Tibco, OAGI, OData, REST, LDAP, SOAP
- Public Apps: Twitter, Facebook, LinkedIn, Amazon API, Google API
- BI & Warehouse Connectors: Tableau, SAS, SSAS, Essbase