Data Warehousing, BI and Data Science

9 February 2015

Data Extraction Method

Filed under: Data Warehousing — Vincent Rainardi @ 12:43 am

To populate the data warehouse we need to get data from the source systems. The source system could be an RDBMS, a file-based system (including XML), a spreadsheet, or something else (a web service, a web page, a document, etc). In this article we discuss RDBMS source systems, i.e. the source data is located in a table on SQL Server, Oracle, Informix, DB2, Access, etc. File-based source systems and spreadsheets will be discussed in a separate article.

There are 3 main methods to extract a table out of an RDBMS source system into our data warehouse: get the whole table every time, get it incrementally (only the changes), or get a fixed period. Each of these 3 basic approaches is explained below, one by one.

There are several considerations when choosing which method to use. The first and foremost is normally the time window, i.e. the data extraction process should be completed within a certain amount of time. The less time the data extraction requires, the better it is for a data warehouse system, because it means we can update the data warehouse more frequently, and it generally means we put less load on the source system.

The second consideration is practicality, i.e. what is possible and what is not. Ideally we would like to extract incrementally, but if that is not possible then we will have to use another method. Or the other way around: ideally we would like to extract the whole table (just to be sure), but if there are 100 million records in the table, then we may have to do it incrementally, otherwise the extract could take a very long time.

You can read Part 2 here.

1. Whole Table Every Time

The simplest way to download a table from the source system into the data warehouse is to extract all records every time, e.g. "SELECT * FROM table1". Take a store table, for example. Say a retail group has 500 stores. Each store has a record on this table containing the store name, address and various attributes such as store grade or classification, last inspection date and the ID of the store manager. For 500 records, even if the store table has timestamps for both create and update, it is probably quicker to get the entire table every time.

If we just do a select without a where clause, the extraction begins instantly. If we put something on the where clause, such as a timestamp, the extraction begins a few moments later (it could be a second later, it could be a few minutes later, depending on how many records are on the table and whether the columns used on the where clause are indexed or not). If you are connected to the source system over a LAN, say 100 Mbps, 500 records should take less than a second. So, if the delay caused by the where clause is a few seconds, it is quicker to download the whole table every time.

A quick test using an ETL tool can easily measure how long the delay is. As a guideline, for a table that has fewer than 1,000 rows we should definitely consider getting the whole table every time. If the table has 1,000 to 10,000 rows, we should still probably consider getting the whole table every time. If the table has more than 1 million rows, we should definitely consider downloading it incrementally (unless the record length is very small, such as 10 bytes). One sure way of determining whether downloading the whole table every time is suitable is to do it and measure the time. If that time cannot be accommodated, then we need to extract the data incrementally.

2. Incremental Extract

If the table contains 2 million records, say it is the customer table, it could take 15-20 minutes to download the whole table. The purpose of giving these numbers is only for illustration, so we can get a feel for it; they are not exact and not scientific, and could vary significantly, for example if you are using RAID 10 or a Gigabit network. OK, back to the customer table. If there are timestamp columns for creation and updates, then it could be sensible to try to download this table incrementally, which may decrease the download time to, say, 3-5 minutes. The number of records decreases from 2 million to 2,000 (daily changes only) and the amount of data transferred decreases from 2 GB to 2 MB (assuming the record length is 1,000 bytes).

In OLTP systems, customer and product are probably the biggest master tables, but transaction tables are normally a lot larger. It does depend on the size of the company (multinational or local) and the industry (telecommunications and retail normally have huge tables). A transaction table with 100 million rows is not uncommon. At the early stages of building the data warehouse, one of the key questions to answer is whether it is possible to extract the transaction tables incrementally, because it is often not practical (in terms of time window) to extract the whole table every time. This question is so important that for many projects, if the answer is no, the project cannot proceed.

There are 6 main ways to perform incremental extract:
a. Using timestamp columns
b. Using identity column
c. Using triggers
d. Using transaction date
e. Combination of the above
f. Using off-the-shelf tools

a. Incremental Extract Using Timestamp Columns

If the table is large (>100,000 rows), it is always worth finding out if the table has creation and update timestamp or datestamp columns, whether the timestamps are indexed, and, more importantly, whether the timestamps are updated every time a record is created or updated. The index question is simple and straightforward: we can easily prove whether a column is indexed or not, and your source system DBA would know how. Different RDBMSs have different implementations of indexing (some platforms call it a logical file), but they all speed up the query. Adding the timestamp to the where clause of the select statement can greatly affect the extraction time, especially for tables with millions of rows. Sadly, more often than not, the timestamp columns are not indexed. In this case it may be worth finding out if they could index it for you. But before you put forward the request, consider the impact on the source system: as you know, the more indexes there are on a table, the slower the DML (insert, update and delete) operations become.
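
If the source is SQL Server, for example, the catalog views can show whether the timestamp column is indexed. This is only a sketch with illustrative names (dbo.order_detail, last_updated_date are not from any particular system); other platforms have their own catalogs:

SELECT i.name AS index_name, c.name AS column_name, ic.key_ordinal
FROM sys.indexes i
JOIN sys.index_columns ic
  ON ic.object_id = i.object_id AND ic.index_id = i.index_id
JOIN sys.columns c
  ON c.object_id = ic.object_id AND c.column_id = ic.column_id
WHERE i.object_id = OBJECT_ID('dbo.order_detail')   -- illustrative table name
  AND c.name = 'last_updated_date'                  -- illustrative timestamp column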

The second one is a little bit more difficult to find out: whether the timestamps are updated every time a new record is created and every time an existing record is updated. The supplier of the ERP system may confidently say that the timestamps are always updated on insert and update throughout the application, but your internal IT colleagues may issue direct DML statements against the database. "It is just a normal annual exercise at the beginning of every fiscal year", you may hear from them. Even if the timestamps are maintained by triggers, we cannot be 100% sure that the timestamps are always kept up-to-date. Step 4 of the 'normal annual exercise' could be disabling the triggers on that table!

To be 100% certain, put the table on daily extract based on timestamp and leave it running for a few weeks. The right period depends on how frequently the table is updated; if the table is updated daily then a few weeks is a sensible evaluation period. After this period, compare the table in the data warehouse with the table in the source system using the primary keys. Find out if there is a "leak", i.e. records which exist in the source system but cannot be found in your DW. If you find a leak, check your logic on the date movement, and also whether the date window moved when the download process failed. If there is no leak, we need to check that the updates are working, i.e. that all columns on all rows of the DW table match those in the source system; we do this using a CRC comparison. The third thing we need to check is whether there are any deletions in the source system. This is done using the same means as checking for a "leak" above (in fact both are done in the same process): a deletion is identified by finding records which exist in the data warehouse but not in the source system. If these 3 checks are satisfied, then we can use the timestamp columns for incremental loading.
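
As a hedged illustration of these checks (assuming a staged copy of the source table, and illustrative names src_order_detail / dw_order_detail with key columns order_id and line_no), the leak and deletion checks are two anti-joins:

-- Rows in the source but missing from the warehouse ("leak"):
SELECT s.order_id, s.line_no
FROM src_order_detail s
LEFT JOIN dw_order_detail d
  ON d.order_id = s.order_id AND d.line_no = s.line_no
WHERE d.order_id IS NULL

-- Rows in the warehouse but no longer in the source (hard deletes):
SELECT d.order_id, d.line_no
FROM dw_order_detail d
LEFT JOIN src_order_detail s
  ON s.order_id = d.order_id AND s.line_no = d.line_no
WHERE s.order_id IS NULL

-- The update check can compare the non-key columns row by row, or compare a
-- checksum of them on both sides, e.g. CHECKSUM(col1, col2) on SQL Server.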

Once the timestamp columns are tested to our satisfaction, we can use them for incremental extract. The basic syntax of the select statement is: "SELECT * FROM table1 WHERE date_col > LSET", where LSET is the last successful extract time. Before the extract begins, we store the data warehouse system time in a variable. If the extract is successful, we store this variable in the data warehouse control database as the LSET. If the extract fails, it is important that we don't store this variable in the database. Next time the extract runs it will use the stored LSET, so it will only get the records since the last successful extract.

It is important to set the LSET to the time before the extraction begins, not when the extraction finishes. And it is important to limit the maximum time we want to extract to. For example: if the data extraction routine runs daily at 1 am for 10 minutes, then we want to limit the time window for today's data extraction from 09/03/06 01:00:00 to 10/03/06 01:00:00. We do not want records created at 10/03/06 01:00:01, because those records are for tomorrow's extract. Thanks to Steve Wright from C&C group for making me aware of the maximum limit. So the where clause becomes: "SELECT * FROM table1 WHERE date_col > LSET AND date_col <= CET", where CET is the current extract time, i.e. the time just before the extraction began.

The reason why we don't want records created or updated after the data extraction began is that we cannot guarantee whether the record will be extracted or not. Say for instance we have a record with a timestamp of 10/03/06 01:05:00, and the data extract runs for 10 minutes, from 10/03/06 01:00:00 to 10/03/06 01:10:00. If our select statement is "SELECT * FROM table1 WHERE date_col > 10/03/06 01:00:00" (without the CET), will we get that record? It depends on the ISOLATION LEVEL, or CONCURRENCY CONTROL, of the source system, and different RDBMSs implement isolation levels differently: IBM DB2 on z/OS, IBM DB2 UDB on iSeries, Microsoft SQL Server 2005, Oracle 10g, IBM Informix DS10, and so on.

So in the data warehouse control system, for each table we are extracting, we have 2 timestamps defining the range of data extraction: the Last Successful Extract Time and the Current Extract Time. In the previous example the LSET is 09/03/06 01:00:00 and the CET is 10/03/06 01:00:00. If the data extraction is successful (for this particular table), then we set LSET = CET, so both become 10/03/06 01:00:00. Tomorrow when the extract begins, the CET will be 11/03/06 01:00:00 and it will extract from 10/03/06 01:00:00 to 11/03/06 01:00:00. On the other hand, if the extract fails, we don't set LSET = CET, so the LSET remains 09/03/06 01:00:00. Tomorrow when the extract begins, the CET will be 11/03/06 01:00:00 and it will extract from 09/03/06 01:00:00 to 11/03/06 01:00:00.
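
Putting this together, a minimal sketch of the bookkeeping might look like this (SQL Server syntax; the control table etl_control(table_name, lset) and the column names are illustrative, not from any particular system):

DECLARE @LSET datetime, @CET datetime
SET @CET = GETDATE()   -- current extract time, captured before the extract starts
SELECT @LSET = lset FROM etl_control WHERE table_name = 'order_detail'

-- the extract itself
SELECT *
FROM order_detail
WHERE last_updated_date > @LSET
  AND last_updated_date <= @CET

-- only after the extract (and load) has succeeded:
UPDATE etl_control SET lset = @CET WHERE table_name = 'order_detail'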

If the date column is separate from the time column, for example the date column contains '28/02/2006' or '28-02-2006' or '2006-02-28' and the time column contains '11.03.21' or '11:03:21', the algorithm we could use is:
WHERE (date_col > min_date AND date_col < max_date)  -- first segment
OR (date_col = min_date AND time_col > min_time)     -- second segment
OR (date_col = max_date AND time_col <= max_time)    -- third segment

min_date and min_time are determined from the LSET, whilst max_date and max_time are determined from the CET.
For example: we want to extract records from 07/03/06 01:00:00 to 10/03/06 01:00:00 (because the data extraction routine failed to connect to the source system for 3 days). The first segment above gets records created/updated on 08/03/06 and 09/03/06. The second segment gets records created/updated on 07/03/06 after 1 am. The third segment gets records created/updated on 10/03/06 up to and including 1 am.
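
Substituting the values from that example, purely for illustration (literal formats will depend on the RDBMS):

WHERE (date_col > '2006-03-07' AND date_col < '2006-03-10')  -- full days: 08/03 and 09/03
OR (date_col = '2006-03-07' AND time_col > '01:00:00')       -- 07/03 after 1 am
OR (date_col = '2006-03-10' AND time_col <= '01:00:00')      -- 10/03 up to and including 1 am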

It is very, very important to ensure that we do not update the LSET if the extraction fails, otherwise we will lose data. I cannot stress this enough. If there is only 1 thing you take from this article, I hope it is this: ensure and test that the LSET is not updated when the extraction fails.

One apparent weakness of incremental extract using timestamps is that we cannot identify deletions. If the source system uses soft deletion (records to be deleted are marked), we have nothing to worry about. But if it uses hard deletion (records to be deleted are physically removed), then we need to use another technique to identify deletions, for example triggers or an identity column. Fortunately, most ERP / business systems do not delete from the main transaction table, particularly when it is in header-detail format. Instead, they normally use a status column, e.g. order status, call status or transaction status.

This is because of cascade delete / cascade update on the PK-FK relationship, more widely known as "cascading referential integrity constraints". Basically this means: if you want to delete rows from the primary key table (the parent table, e.g. the order header), you first need to delete all corresponding rows from the foreign key table (the child table, e.g. the order detail), or rely on a cascading delete to do it for you; otherwise you will break the referential integrity constraint. That makes hard deletion in a header-detail structure expensive and risky, which is another reason why a status column is usually preferred.

When the timestamp column is tested for reliability (see the paragraph above that begins with "To be 100% certain …"), we will know whether there are deletions involved. It is good to ask the client / system provider whether there are deletions on the transaction table, but we still have to test it.

One last point on using timestamps for incremental extract: sometimes we need to get the timestamp from other table(s), or from a combination of this table (the one we want to extract) and other tables. For example: order header and order detail, linked on order number. In the early stages, we may think it is enough to extract the header table based on the timestamp columns of the header table, and the detail table based on the timestamp columns of the detail table. But sometimes we find that this is not enough: to extract the header table we may need to use the timestamp columns from both the header and the detail table, and likewise for the detail table. The code looks like this:

SELECT DISTINCT h.* FROM order_header h
LEFT JOIN order_detail d
ON h.order_number = d.order_number
WHERE (h.timestamp > LSET AND h.timestamp <= CET)
OR (d.timestamp > LSET AND d.timestamp <= CET)

b. Incremental Extract Using Identity Column

If there is no timestamp column in the transaction table (this is unlikely), or if the timestamp column is not reliable (this is more likely), then we could use an identity column to extract the data incrementally. An identity column is unique. We can test the data to make sure there are no duplicates as follows (id_col being the identity column):
SELECT id_col, count(*) FROM table1
GROUP BY id_col HAVING count(*) > 1

An identity column is sequential, which enables us to easily identify new records. The basic syntax is WHERE id_col > LSEI, where LSEI is the last successful extract identity, i.e. the value of the identity column of the last record successfully extracted. Again, just like the time window, we want to limit the maximum, so it becomes WHERE id_col > LSEI AND id_col <= CEI, where CEI is the current extract identity.
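
The bookkeeping mirrors the timestamp case. A minimal sketch, assuming the control table also has an lsei column and the identity column is called order_detail_id (both names are illustrative):

DECLARE @LSEI bigint, @CEI bigint
SELECT @LSEI = lsei FROM etl_control WHERE table_name = 'order_detail'
SELECT @CEI = MAX(order_detail_id) FROM order_detail   -- cap the window at extract start

SELECT *
FROM order_detail
WHERE order_detail_id > @LSEI
  AND order_detail_id <= @CEI

-- only after a successful load:
UPDATE etl_control SET lsei = @CEI WHERE table_name = 'order_detail'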

To identify deletions we compare the identity column between the source and the data warehouse. This is done by downloading the identity column values (all records) from the source system. In many cases, especially if the table is a transactional table, the identity column is also the primary key, but not always. Basically we identify the missing ones, i.e. rows that exist in the data warehouse but no longer exist in the source system, and then delete (or move/archive) these rows from our data warehouse.
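
A sketch of this comparison, assuming the source identity values have been landed in a staging table stage_id(order_detail_id) and the warehouse copy is dw_order_detail (names are illustrative):

SELECT dw.order_detail_id
FROM dw_order_detail dw
LEFT JOIN stage_id s
  ON s.order_detail_id = dw.order_detail_id
WHERE s.order_detail_id IS NULL   -- rows that have been deleted at source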

OK, that covers new records and deletions. How about updates? To identify updates we need to download the columns that we want to load into our data warehouse (yes, for all records already extracted, i.e. everything up to the LSEI, unfortunately), and compare these columns with the records in the data warehouse. If they are different, we update the records in the data warehouse.

After the source data is downloaded into a staging table, the basic statement for updating incrementally is as follows (joining on the key column):
UPDATE dw
SET dw.col1 = src.col1, dw.col2 = src.col2
FROM dw_table dw
JOIN stage_table src ON src.id_col = dw.id_col
WHERE dw.col1 <> src.col1 OR dw.col2 <> src.col2

In most cases an archiving or purging mechanism is implemented on the ERP system, so there are not too many rows in the active transaction table. For example, keeping only the last 12 months of data on the order table results in, say, 5 million rows, with a download time of, say, 30 minutes. Sometimes, though, they keep many years of records without ever purging them to an archive database, resulting in (say) 30 million records in the active order detail table. If this is the case, try to find out whether it is possible to impose a time range for identifying updates, say the last 6 months of records. For the purpose of identifying this 6-month date range, if there are no timestamp / datestamp columns (created_date, last_updated_date, etc), then a transaction date column (order date, delivery date, etc) will do. This limits the amount of data downloaded and compared when identifying updates.
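
As a hedged sketch (SQL Server syntax; the table and column names are illustrative), the comparison download limited to the last 6 months might look like this:

SELECT order_id, line_no, product_id, quantity, price
FROM order_detail
WHERE order_date >= DATEADD(month, -6, GETDATE())   -- transaction date used as the cut-off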

We can also use other mechanisms to identify updates, such as an update trigger, an update timestamp or log files. The update timestamp does not always need to come from the same table; it could come from another table, for example in a header-detail relationship.

If the source system is Oracle, the identity column is probably implemented on the primary key column using a trigger and CREATE SEQUENCE, something like this: oracle-base.com. In DB2 and SQL Server, identity columns are built in (DB2 also supports CREATE SEQUENCE). In Informix it is the SERIAL data type.
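
For illustration, the classic sequence-plus-trigger pattern on Oracle looks roughly like this (a sketch only; the sequence, trigger and column names are made up):

CREATE SEQUENCE order_detail_seq START WITH 1 INCREMENT BY 1;

CREATE OR REPLACE TRIGGER order_detail_bi
BEFORE INSERT ON order_detail
FOR EACH ROW
BEGIN
  -- assign the next sequence value to the surrogate key column
  SELECT order_detail_seq.NEXTVAL INTO :NEW.order_detail_id FROM dual;
END;
/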

c. Incremental Extract Using Triggers

Triggers are the most convenient way of doing data extraction. It is the belt and braces approach. There are 3 types of triggers: for insert, for update and for delete. Most RDBMSs also differentiate between before and after triggers, i.e. whether the trigger is executed before or after the insert/update/delete. By creating triggers on the source system tables, we can program the source system to provide us with the records every time a new record is created, an existing record is updated, or a record is deleted. We need to be careful when implementing triggers because they slow down the source system.

One way of implementing triggers in the source system is to keep the primary keys of the changed rows in a specially created table (let's call this the delta table). In the delta table we only keep the primary keys, not the whole record, because different tables have different columns, so it is not possible to create a single delta table which can store whole rows from all tables. In the delta table, the primary key values can be stored in one column, using separators, and the names of the primary key columns in another column, also using separators. The delta table also contains 2 important columns: the creation timestamp column (no need for an updated timestamp column) and the table name column. These enable us to get to the right table and to extract it incrementally. The delta table looks like this:

ID  table_name    PK_columns           PK_values   time_stamp
1   inventory     store_id|product_id  23|5643_G   17/03/2006 10:10:46
2   order_detail  order_id|line_no     454AS1|2    17/03/2006 10:11:01
3   customer      customer_id          343         17/03/2006 10:11:58

The delta table needs to be cleared out when it has been processed. When processing the delta table, it is important to impose a CET (Current Extract Time), i.e. the system time just before the data extraction begins. So the process is: get all records from the delta table where timestamp <= CET, then delete all records from the delta table where timestamp <= CET. Alternatively (and this is preferred, for trace/history reasons and also for performance, as explained below), we can avoid clearing the delta table after extracting from it and simply store the CET as the LSET (Last Successful Extract Time); see the section on timestamps above for details. If we don't clear the delta table, we need 1) a good integer identity(1,1) primary key, such as the ID column in the example above, with a clustered index on it for performance, and 2) a good purging mechanism (say, keeping only the last 3 months of data), otherwise the system will get slower and slower over time; in this case a non-clustered index on the time_stamp column is useful. One note on the time_stamp column if your source system is SQL Server: I would prefer to define this column as the datetime data type rather than the timestamp data type, for compatibility with future versions and with other RDBMSs (portability of the implementation).
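
As a hedged sketch (SQL Server syntax; the table, trigger and column names are illustrative), the trigger that feeds the delta table and the query that consumes one extract window might look like this:

-- trigger on the source table, recording the primary keys of inserted/updated rows
CREATE TRIGGER trg_order_detail_delta
ON order_detail
AFTER INSERT, UPDATE
AS
BEGIN
  SET NOCOUNT ON;
  INSERT INTO delta_table (table_name, PK_columns, PK_values, time_stamp)
  SELECT 'order_detail',
         'order_id|line_no',
         CAST(i.order_id AS varchar(20)) + '|' + CAST(i.line_no AS varchar(10)),
         GETDATE()
  FROM inserted i;
END;
-- a similar AFTER DELETE trigger would read from the deleted pseudo-table

-- on the warehouse side, consume the delta rows for one window
-- (@LSET and @CET maintained in the control table as in the timestamp section):
SELECT table_name, PK_values, time_stamp
FROM delta_table
WHERE table_name = 'order_detail'
  AND time_stamp > @LSET
  AND time_stamp <= @CET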

Depending on the structure and complexity of the source system, sometimes it is better to have a separate delta table for each source table, rather than a single delta table. It is also common to have the delta tables extracted to files at certain times of the day, ready for the data warehouse to pick up.

The main difference between doing data extraction using triggers and the 2 methods mentioned previously is that with triggers we "touch" the source system. The word touch may be a little too soft; the word "change" is probably more appropriate. Yes, we change the source system. This is not a luxury that everyone has. In many cases, the source system is an off-the-shelf ERP system, e.g. SAP, Oracle 11i, Sage, MFG, JD Edwards, PeopleSoft, Axapta, Siebel CRM, Salesforce, etc. In these cases, most probably, we don't have the luxury of modifying the source system as we like. If we ask the supplier to modify it for us, we risk ending up in a position where our ERP is no longer a standard version, and therefore a) not upgradable to the next version, and b) not easily supported by other software when interfacing. Luckily, the big players in the ERP market such as SAP and Oracle have prepared themselves and have their own data warehousing tools. Tools such as Oracle Change Data Capture can extract data incrementally, implemented either using triggers (synchronous) or using the redo log files (asynchronous).

d. Incremental Extract Using Transaction Date

If there are no good create and update datestamp or timestamp columns in the source table, it is not possible to put triggers in the source system, and there is no identity column either, then our next best option for incremental data extraction is to use a transaction date column. By this I mean a column which contains the date the transaction happened.

Generally speaking, there are 2 major categories of tables in an ERP or business system, or as the data warehouse community calls it, an OLTP system (in contrast with the OLAP we have in a data warehouse system). The first category is the 'master' tables: tables such as product, customer, supplier, account, status code, product group, store, and disposal site. In the data warehouse star schema, these become the dimension tables. The second category is the 'transaction' tables: tables such as sales order header, sales order detail, purchase order header, purchase order detail, inventory transaction, account transaction, price review and disposal. In the data warehouse star schema, these transaction tables become the fact tables.

If the table we extract is a transaction table, it normally has one or more transaction date columns. In an order header table, it is common to find the following transaction date columns:

  • order date – the date the order was placed by the customer
  • system order date – the date the order header record was created
  • estimated delivery date – the date the order is supposed to go out of our door
  • actual delivery date – the date the order actually went out of our door

The difference between order date and system order date is that if the order came in by fax yesterday and is typed into the system today, the order date will be entered as yesterday whilst the system order date will be today's date. The operator has no control over the system order date; it is maintained internally by the system. Please note that in better ERP systems, to enable partial delivery, the estimated and actual delivery dates are put on the order detail table rather than the order header table.

In the above example, the system order date is an excellent transaction date: orders created later are guaranteed to have a later (or equal) system order date, which makes this column, in a way, sequential, just like an identity column. The order date column, on the other hand, can be deceiving, because the user can type a past date into it. That causes a problem, as we cannot be sure which orders come after which.

The basic syntax of the ETL select statement is generally the same as the one for datestamp / timestamp columns, e.g. SELECT col1, col2 FROM table1 WHERE trans_date_col >= LSED, where LSED is the last successful extract date. If we only deal with dates (not with times as well), there is little point constraining the maximum.

So that's inserts. How about deletes and updates? Well, simply speaking, we cannot get deletes and updates using transaction dates; we need other methods to get them. This restriction means we often have to combine this method with the fixed period method (see point 3 below), for example: we download the last 6 weeks every time, based on the transaction date column.

e. Combination Of The Above

Sometimes we need to combine the above 4 methods, to suit the condition of the source table, to get all inserts, updates and deletes. Inserts are generally the easiest to get: we can use all four methods (timestamp column, identity column, triggers and transaction date column). For deletes we can use the timestamp, the identity column or triggers. To get updates we can use CRC comparison and triggers. Yes, triggers can be used to get all 3 kinds of changes, incrementally.

f. Using Off-The-Shelf Tools

Nowadays, getting or extracting the data out of a business system is getting easier. Some BI companies have written ETL packages for the most popular business systems such as SAP, Oracle applications, Siebel, PeopleSoft and J.D. Edwards. In the BI industry, these ETL packages are normally called Data Integration (DI) software. We install the DI software, set some parameters, and off we go: it extracts the data from our business systems into our data warehouse. Isn't that wonderful?

Well, in reality it's not as straightforward as that. We still have to do data analysis (profiling) and data quality processing. We still have to customise it a little to suit our business system and our data warehouse. But we most definitely don't start from scratch. Below we try to illustrate the situation by describing a few of the DI products on the market, along with their links.

  1. IBM WebSphere Information Integration (formerly Ascential)
    DataStage extracts, transforms and loads data. ProfileStage analyses source data, creates data model, generate source to target mappings, and generate ETL routine. QualityStage detects data duplication, inconsistencies and missing data. Package Application Connectivity Kits (PACKS) helps connect to SAP R/3 and BW, Oracle, Siebel, PeopleSoft and JD Edwards systems and extract data from them. Supports most RDBMS.
  2. Informatica PowerExchange and PowerCenter
    PowerExchange Changed Data Capture operates within the source environment to detect data updates and to deliver them in real time or batch. PowerCenter transform and load data; it also offers cleansing and profiling. Most RDBMS are supported.
  3. BusinessObject DataIntegrator
    Define data mapping and transformation. Cleanse (integration with FirstLogic), preview and profile data. Has prebuilt transformations. Creates portable ETL jobs. Integrates with SAP R/3 and BW, Oracle, Siebel, PeopleSoft and JD Edwards. Support most RDBMS.
  4. Pervasive Data Integrator (formerly Data Junction)
    Extract, transform and load data. Creates ODS. Real time or batch execution. Integrates with SAP, Goldmine, Solomon, Great Plains, Navision, SAS, Market Force, Salesforce, and many accounting packages: AccountMate, ACT, Champion, Platinum, Macola, Peachtree, RealWorld, CYMA, etc. Support a comprehensive list of RDBMS.
  5. Hummingbird Genio
    Extract, transform and load data. Integrates with SAP.
  6. SAS DataFlux
    Extract, transform and load data. Has data profiling and data quality capabilities. Integrates with SAP R/3, Siebel, Oracle Warehouse Builder, Informatica. Support most RDBMS.
  7. Cognos 8 BI Data Integration (formerly DecisionStream)
    Extract, transform and load data. Integrates with SAP R/3 and Siebel CRM. Support most RDBMS.
  8. DataMirror Transformation Server
    Uses database log files to capture changed data to deliver them real time.
  9. Microsoft SQL Server 2005 Integration Services (formerly DTS)
    Extract, transform and load data. Support most RDBMS.
  10. CA Advantage Data Transformer
    Extract, transform and load data. Integrates with SAP R/3. Support most RDBMS. Platinum InfoPump, InfoHub, and InfoRefiner were acquired by CA.
  11. Oracle Warehouse Builder
    Extract, transform and load data. Integrates with SAP (produces ABAP code). Carleton Passport was acquired by Oracle.
  12. Embarcadero DT/Studio
    Extract, transform and load data. Changed data capture delta agent. Data profiling and metadata import/export. Support most RDBMS.
  13. Sybase Enterprise Connect
    Replication Agents for Informix, Oracle, and Adaptive Server Anywhere. Support most RDBMS.
  14. ETI Solution
    Extract, transform and load data. Integrates with Siebel, WebSphere MQ and ClickStream. Support most RDBMS.

Ascential (now IBM) and Informatica are probably the most popular data extraction tools in the BI market nowadays. BusinessObjects, Genio and DataFlux are decent, but I prefer to use SSIS as a personal choice. Informatica, DataMirror and Oracle CDC (changed data capture) could be life savers in situations where we must extract incrementally. Alkis Simitsis compiled a comprehensive list of ETL tools, which might be worth a look if you want to explore some of the niche players.

Vincent Rainardi
29th March 2006

Note: This article was originally posted on SQL Server Central: link.
