Data Warehousing and Data Science

13 February 2015

Estimating the Size of Dimension and Fact Tables

Filed under: Data Warehousing — Vincent Rainardi @ 11:25 pm

In most cases, when we start designing a data warehouse, the PM needs us, as the data warehouse architect, to tell them what the size of the DW will be, so that the infrastructure team can start the procurement process, capacity planning, etc. The problem is, we do not know the size of the warehouse until we have finished building the warehouse and the ETL.

Below I describe the technique that I use to estimate the size of dimension and fact tables. Once you have that, you can estimate the size of the staging database based on the warehouse size (say 30-50% of DW size), the size of the cubes (say 5-10% of DW size), and the size of the extract files (say 10-20% of DW size).

This way you can estimate the disk requirement, which is one of the most difficult things to do if you don't have the experience. I've done about 10 data warehousing projects, i.e. designing and building data warehouses and BI, and what I wrote here is based on that experience. Of course, other DW architects may have different experiences, and if so, please share them so we can all learn from you as well.

OK let’s start with dimension tables, then the fact tables.

Estimating the Size of Dimension Tables

Categorise dimension tables into small, medium, and large, based on how many rows are in the dimension. Up to 1000 rows is small, 1000 to 100k is medium, and more than 100k rows is a large dimension. For example:

  • Country dimension: small
  • Customer dimension: large
  • Currency dimension: small
  • Date dimension: small
  • Salesperson dimension: small
  • Product dimension: large
  • Industry Sector dimension: small
  • Store dimension: medium

We can roughly estimate the size of a small/medium/large dimension based on the KB per row, like this:

  • Small: 1 KB/row x 1000 rows = 1 MB
  • Medium: 5 KB/row x 100k rows = 500 MB
  • Large: 10 KB/row x 1m rows = 10 GB

Then we multiply by the number of dimension tables in each of the small/medium/large categories, like this:

  • Small: 11 tables x 1 MB = 11 MB
  • Medium: 6 tables x 500 MB = 3 GB
  • Large: 4 tables x 10 GB = 40 GB

Total: 43 GB
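These KB/row figures are only rules of thumb. If a comparable table already exists somewhere (a prototype, a previous warehouse, or the source system), it is worth checking the assumption against it. Below is a minimal sketch on SQL Server; dim_product and its columns are just placeholders for whatever table you have available, not part of the estimation method itself.

-- A rough check of the KB/row assumption against an existing table (hypothetical
-- table and column names). DATALENGTH returns the number of bytes stored per value.
SELECT COUNT(*) AS row_count,
       AVG(DATALENGTH(product_id) + DATALENGTH(product_name)
         + DATALENGTH(description)) AS avg_data_bytes_per_row
FROM dim_product

-- Alternatively, for a table that is already populated:
EXEC sp_spaceused 'dim_product'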

We can then cross-check this 43 GB using the daily volume of dimension data flowing into the data warehouse, to verify that we are not too far off from reality.

  1. Total data volume loaded each day by the ETL into the data warehouse from all source systems: 15 GB, comprised of say 13 GB delta load (incremental extract / transactional) and 2 GB static load (whole-table extract). This 15 GB is the total volume of data extracted from all source systems, which is usually about 5-10% of the size of the source systems if we can do incremental extraction (use 50% if you can't extract the data incrementally).
  2. Of all data going through the ETL process, about 5-10% is dimension/static data, and 90-95% is fact/transaction data. If you are not sure which number to take between 5% and 10%, take 5%, because most cases are 5% and only a few are 10%. So 5% x 15 GB = 0.75 GB. This is the volume of dimension data loaded into the data warehouse every single day.
  3. About 0.1-3% of the rows in the dimension tables are updated or inserted every day: 3% for a very active dimension (like the customer and product dimensions), 0.1% for a very static dimension (like the country and currency dimensions). So if your customer dimension has 10 million rows, about 300k rows (3%) are updated or inserted every day. If your country dimension has 200 rows, about 0.2 rows (0.1%) are updated every day, which means roughly 1 row per week on average.
  4. So based on #2 and #3 we can estimate the size of the dimensions. There is 0.75 GB of dimension data flowing into the DW every day, and that is about 1.5% of the total dimension size (0.1% for a very static dimension, 3% for a very active dimension, so the average is (3% + 0.1%)/2 ≈ 1.5%). So the size of all dimension tables combined is 100%/1.5% = 67 times the daily volume of dimension data = 67 x 0.75 GB = 50.25 GB.

So one method gives us 43 GB and the other gives us 50.25 GB, so we are not too far off. We take the larger of the two, which is 50.25 GB.

I don't like to be vague or unclear, so: what is the tolerance range for "too far off"?

Answer: Based on my experience, if the second method gives us 500 GB (more than 10x the first method), we need to question it and look into it again. But if the second method is 2-3x bigger than the first method (around 129 GB) that's fine. This 3x is the limit; beyond 3x we have to question it.

Estimating the Size of Fact Tables

We estimate the size of the fact tables based on the data volume in the source systems, like this:

  1. Say that we have 3 source systems of 100 GB, 10 GB and 400 GB, a total of 510 GB. If we can do incremental extraction, the volume of the extracted data is about 1-10% of the size of the source system (daily). The bigger the source system, the smaller the percentage, i.e. if the source system is > 1 TB, it is more likely to be 2-3% than 8-10%. The newer the source system, the bigger the percentage, i.e. if the source system is 1 year old it will be closer to 10% than to 1%. If you have no idea, don't take any risk: just take 10%. In this case, as it is less than 1 TB, let's take 9%, assuming the systems are quite new. So the daily volume of extracted data is 9% x 510 GB = 46 GB per day.
  2. If you can't do incremental extraction, use 50%. So if the total volume of the source systems is 510 GB, the size of the daily data extract would be 255 GB. It is extremely unlikely that you can't do incremental extraction at all. In all 10 DW projects I've done, we could do incremental extraction on every single one of them. If it was a good ETL (we could do incremental extraction for almost all tables) then it was something like 5%. But if it was a bad ETL (we could not find ID columns in some tables required for incremental extraction) then it was something like 15%. So before you use 50%, verify that there really isn't anything you can use for incremental extraction.

For more info please read my article about incremental extraction here: link. I described 4 things that we can use for doing incremental extraction: timestamp columns, identity column, triggers, and transaction date. It would be an extremely rare case that there is nothing in any table in the source system that you can use to do incremental data extraction. If that is the case, the source system would be a small system anyway, or very old. Modern, large, complex transaction systems were built with incremental extraction in mind, or, at the very least, with a unique row identifier in every table. Even in Hadoop.

  3. Landing area and staging tables

The landing area is the file system (directories) which is used to store the data dump / export files from the source system, before they are loaded into the data warehouse. Instead of using files, we can keep the data dump from the source system in staging tables. Staging tables are the tables in the staging database. They have no indexes and no primary keys, and they are usually truncated before each load. In modern RDBMS, such as SQL Server 2014, staging tables can reside in memory so that loading and reading them is a lot faster than with disk-based tables.

To estimate the size of the landing area, we multiply the daily data extract volume by the number of days we want to keep the data. In this example, let's assume that we want to store the extracted data, which is 46 GB/day (see #1 above), for a week (5 days). So the volume of the landing area is 5 x 46 GB = 230 GB. The same calculation principles apply to the staging tables.

  4. Intermediate fact tables

Sometimes the fact data requires intermediate tables, for example to perform grouping and decile calculations, or to calculate loyalty points. So the fact data flows into a layer of intermediate tables before it is loaded into the final fact tables. This is because a "one-pass" SQL statement cannot perform that operation: the data needs to be landed in a table first, then a second pass is performed (see the sketch after the list below). The reasons for not doing this in a temp table are:

  1. The data volume is large, for example 500 million rows of weather measurement points per day. Using a temp table would affect the performance of other streams in the warehouse.
  2. A comparison of today’s data with previous days is required. If we use a temp table, we would lose previous days’ data. By having a permanent table we don’t have to rebuild previous days’ data again.
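To make the double-pass idea concrete, here is a minimal sketch of a decile calculation using an intermediate fact table on SQL Server 2005. The table and column names (intermediate_sales, sales_amount, decile) are hypothetical, not from a real project.

-- Pass 1: land the day's fact rows in a permanent intermediate table.
INSERT INTO dw.dbo.intermediate_sales (date_key, customer_key, sales_amount)
SELECT date_key, customer_key, sales_amount
FROM stg.dbo.sales
WHERE load_time BETWEEN @last_run AND @current_run

-- Pass 2: assign each customer to a decile within the day, which cannot be done
-- in the same pass that loads the rows.
UPDATE i
SET i.decile = d.decile
FROM dw.dbo.intermediate_sales i
INNER JOIN
( SELECT date_key, customer_key,
         NTILE(10) OVER (PARTITION BY date_key ORDER BY sales_amount DESC) AS decile
  FROM dw.dbo.intermediate_sales ) d
ON d.date_key = i.date_key AND d.customer_key = i.customer_key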

Without performing data modelling, how do we estimate the size of the intermediate fact tables? By taking a percentage of a day's worth of data, like this: one day's data is 46 GB. Of this, 10% is dimension data and 90% is fact data, so the fact data is 41.4 GB. Of this 41.4 GB of fact data, most does not require intermediate fact tables; only 10-20% requires double passes (or triple passes). Taking the maximum to be safe (20%): 20% x 41.4 GB = 8.3 GB. That is the size of the intermediate tables needed to process fact data.

  5. Final fact tables

The core question is: how do we estimate the size of the fact tables without first designing them? At this point of the project we do not know yet whether we will have FactAccountBalanceSnapshot, FactPremiumTransaction, etc. We don't know how many fact tables there will be, or what they are. Yet.

One approach is to look at the size of the fact data that is coming into the warehouse every day. We know that this is 46 GB (see #1 above). When this data is loaded into a dimensional model, it becomes about 20-40% of its original size, because the textual columns are put into the dimensions, leaving only the numerical columns in the fact tables. Textual columns take up most of the space in a row.

Most of the fact tables are periodic snapshots; only 1-2 are transactional. This is because the business usually needs to analyse the changes to measures over a period of time. Accumulating snapshot fact tables are rare too, usually only 1 or 2 within the whole warehouse. The ratio is about 80-90% by number of tables, and about 95% by data volume (GB): 95% of the fact data in GB is in the snapshot fact tables. Transactional fact tables are very small compared to periodic snapshot fact tables, because they do not repeat the data every day.

In the case above, assuming that we snapshot all of the incoming fact data, the volume of the periodic snapshot fact tables would be 46 GB x 30%* x 5 days x 52 weeks = 3.6 TB per year. But only 95% (3.42 TB) is in periodic snapshot fact tables, and 5% (0.18 TB) is in transactional and accumulating snapshot fact tables.

*30% is the midpoint between 20% and 40%, the ratio when the raw fact data is loaded into the dimensional model.

It is common practice not to store daily snapshots beyond 1 year, but weekly or monthly snapshots instead. So it is something like: daily snapshots for the last 1 year, weekly snapshots for years 2-3, then monthly snapshots for years 4-10. So the number of snapshots is 5 x 52 = 260 for the last 1 year, 2 x 52 = 104 weekly snapshots for years 2-3, and 7 x 12 = 84 monthly snapshots for years 4-10. 260 + 104 + 84 = 448 snapshots.

So to store 10 years of data in the fact tables, the space required would be 46 GB x 30% x 95% x 448 ≈ 5.9 TB.

  6. Infrastructure factors

Depending on the RDBMS chosen, we will then need to add or deduct space for indexes, log files, fill factor and database compression. Exadata can compress data to 20% of the original volume; the reference says 10%, but in reality it is more like 33% (see Ofir's article here: link).

In SQL Server and Oracle (and I believe other non MPP RDBMS such as DB2), the space for indices in a DW environment is typically 20-40% of the data. The log files are only a small percentage of the data, say 1-5%. DB maintenance will keep the log files small.

TempDB is another one to consider, particularly if you are using SQL Server, where a fixed amount of space is usually allocated for it. In some companies it's 20 GB, in others it's 200 GB. It also depends on memory: if the server has 130 GB RAM (say 120 GB usable by SQL), then 20 GB may be enough, but if the server only has 16 GB RAM (say 14 GB usable by SQL), then we may need 200 GB. And crucially it depends on what kind of SQL queries are flowing into the RDBMS: joining 20 tables (3 of them self joins) with grouping and rank functions, with each table being 50 GB in size, could take 100 GB of tempDB space. It also depends on the execution plan, i.e. hash join or nested loop, and on whether recompile is used or not. But that's specific to SQL Server.

We also need to consider high availability requirements, i.e. mirroring will double the space, RAID 5 striping adds about 20%, RAID 1+0 doubles the space, etc. Hot or cold standby, snapshot databases, replication, etc. also require additional space.

9 February 2015

Populating Fact Tables

Filed under: Data Warehousing — Vincent Rainardi @ 12:48 am

Fact tables are normally loaded from transaction tables such as order tables or from transactional files, such as web logs. Hence the number of rows to update or insert in a load is much larger than in dimensions. The core of loading fact tables is to change the natural keys into surrogate keys. Let’s start with the basic steps, then continue with some practical experiences such as loading partitioned fact tables, slim lookup tables, deduplication, loading snapshot and delta fact tables, fact table indexes and reloading history data.

Basic Steps

When loading a transaction table from OLTP into a fact table in the data warehouse, the value columns on the transaction table become fact table measures, the primary keys on the transaction table such as order number become degenerate dimension columns on the fact table, and the alternate primary keys such as date, customer and product become dimensional key columns on the fact table.

As rows of fact tables are loaded, each of the key columns is converted from the natural key. This is why all the dimension tables must be populated first, before we populate the fact tables: we need the surrogate keys in the dimension tables to translate the source table's natural keys. This is the very basis and the very heart of data warehouse loading, so it is important to understand it. Perhaps the best way to describe this concept is with an example. Consider a simplified order detail table with the following columns: order_id (primary key), order_date, product_id, quantity and price. In reality there would be a line number column as well, but to simplify things let's not use line number in this example, i.e. let us assume that each order only contains one line.

Source table: order_detail

order_id order_date product_id quantity price last_update
352 15/07/2006 BGCKQ 1 12.99 30/10/2006
410 30/10/2006 KMSCG 2 7.99 30/10/2006

In the data warehouse let us assume that we have an oversimplified star schema consisting of a date dimension, a product dimension and a sales fact table, as described below.

Date dimension table: dim_date

date_key date day month
2388 15/07/2006 Saturday July
2485 30/10/2006 Monday October

Product dimension table: dim_product

product_key product_id product_name description
53076 BGCKQ Aike U2 IDE HD case USB 2
92184 KMSCG Sony BP71 VIAO CG7 battery

Sales fact table: fact_sales

fact_key date_key product_key order_id quantity price load_time
830923 2388 53076 352 1 11.99 15/07/2006

Notes:

  1. Note that order_id 352 already exists in the data warehouse but the price has changed. When it was loaded on 15/07/2006 the price was 11.99 but now the price is 12.99.
  2. In reality the date dimension would contain many more attributes, such as day of the week (2 columns: name and number), ISO date (YYYY-MM-DD), day of the year (number of elapsed days since the beginning of the year), Julian date (number of elapsed days since 4713 BC), SQL date (in SQL Server datetime format), day of the month, calendar week / month / quarter / year (3 columns for month: full name, short name and number), fiscal week / period / quarter / year, weekday flag, statutory holiday flag, last day of the month flag, etc. Plus day and month names in other languages if your data warehouse is internationalised.
  3. In reality the dimension tables would have standard dimension table columns such as load_time and SCD attributes but for this simplified case these columns are not displayed here.
  4. It is a common practice not to have a fact_key column in the fact table, the argument being that the combination of all the dimension keys makes the record unique. In this case, the primary key of the fact table is a composite key. When loading this kind of fact table we will need to compare all dimensional key columns to identify the correct row when updating (see step 3 below).

The basic steps in loading data warehouse fact tables are described below. It is not that different from the steps in loading dimension tables (see my previous article). In this example, it is assumed that the source table, order_detail on the OLTP, has been loaded incrementally into a stage table called order_detail. The stage database name is stg. It is also assumed that we are using SQL Server 2000 or 2005, hence the SQL statements are in Transact SQL. So what we need to do now is to load the fact table records from the stage into the data warehouse.

Step 1. Create The Temp Table

SELECT * INTO #fact_sales FROM dw.dbo.fact_sales WHERE 1 = 0

The temp table after it is created:

fact_key date_key product_key order_id quantity price load_time

Notes:

  1. By deliberately specifying a false condition, we create an empty temp table, taking the structure of the target fact table.
  2. You may want / need to increase the size of the temp database depending on the size of your load.
  3. SELECT INTO is better than CREATE TABLE because it is more flexible to adapt to structural changes.

Step 2. Populate The Temp Table

SET IDENTITY_INSERT #fact_sales ON

INSERT INTO #fact_sales
(fact_key, date_key, product_key, order_id, quantity, price, load_time)
SELECT
ISNULL(f.fact_key, 0),
ISNULL(d.date_key, 0),
ISNULL(p.product_key, 0),
s.order_id,
ISNULL(s.quantity, 0),
ISNULL(s.price, 0),
@current_load_time
FROM stg.dbo.order_detail s
LEFT JOIN dw.dbo.dim_date d ON s.order_date = d.date
LEFT JOIN dw.dbo.dim_product p ON s.product_id = p.product_id
LEFT JOIN dw.dbo.fact_sales f ON d.date_key = f.date_key
AND p.product_key = f.product_key
WHERE s.load_time BETWEEN @last_run AND @current_run

The temp table after it is populated:

fact_key date_key product_key order_id quantity price load_time
830923 2388 53076 352 1 12.99 30/10/2006
0 2485 92184 410 2 7.99 30/10/2006

Source table: order_detail

order_id order_date product_id quantity price last_update
352 15/07/2006 BGCKQ 1 12.99 30/10/2006
410 30/10/2006 KMSCG 2 7.99 30/10/2006

Date dimension table: dim_date

date_key date day month
2388 15/07/2006 Saturday July
2485 30/10/2006 Monday October

Product dimension table: dim_product

product_key product_id product_name description
53076 BGCKQ Aike U2 IDE HD case USB 2
92184 KMSCG Sony BP71 VIAO CG7 battery

Notes:

  1. Look how the dimensional keys are looked up in a single step, by joining the stage table to the dimension tables on the natural keys.
  2. Because they are LEFT JOINs, stage rows without corresponding dimensional rows will result in NULL dimensional keys. The ISNULL then converts these NULLs to zeros.
  3. Notice that we don’t populate the load_time with getdate(), but with a variable named @current_load_time. This variable is populated with getdate() at the beginning of the loading batch and used by all processes in the batch. This is necessary so that in the event of failure, we know the point in time we have to restart the process from.
  4. Notice that we only load the rows between @last_run and @current_run. This is necessary if we keep the records on the stage table for a few days, i.e. if the stage contains more than 1 day data. If we clear the data from the stage table as soon as we load them into data warehouse, we don’t need to specify this where clause.
  5. The example above uses only 2 dimensions, but in practice we would have to deal with more; 10 to 15 dimension key columns on a fact table are a common occurrence.

Step 3. Update Existing Records

UPDATE f
SET f.date_key = t.date_key,
f.product_key = t.product_key,
f.order_id = t.order_id,
f.quantity = t.quantity,
f.price = t.price,
f.load_time = t.load_time
FROM dw.dbo.fact_sales f
INNER JOIN #fact_sales t ON f.fact_key = t.fact_key
WHERE t.fact_key <> 0 AND
(f.date_key <> t.date_key OR
f.product_key <> t.product_key OR
f.order_id <> t.order_id OR
f.quantity <> t.quantity OR
f.price <> t.price)

Source table: order_detail

order_id order_date product_id quantity price last_update
352 15/07/2006 BGCKQ 1 12.99 30/10/2006
410 30/10/2006 KMSCG 2 7.99 30/10/2006

fact_sales after the update:

fact_key date_key product_key order_id quantity price load_time
830923 2388 53076 352 1 12.99 30/10/2006

Notes:

  1. Here we update the fact table, based on the data on the temp table. In this case the price was updated from 11.99 to 12.99.
  2. We only update the rows where the temp table's fact_key is not 0, i.e. the rows that already exist on the target fact table. The rows where the fact_key is 0 (which do not yet exist on the fact table) will be inserted into the fact table later on.
  3. Notice that when updating rows we update the load time column as well. The last lines of the WHERE clause specify which changes we want to pick up. In most cases we want to pick up changes on all columns, but sometimes there are legitimate business rules specifying that changes on certain columns are to be ignored.

Step 4. Insert New Records

INSERT INTO dw.dbo.fact_sales
(date_key, product_key, order_id, quantity, price, load_time)
SELECT date_key, product_key, order_id, quantity, price, load_time
FROM #fact_sales
WHERE fact_key = 0

Source table: order_detail

order_id order_date product_id quantity price last_update
352 15/07/2006 BGCKQ 1 12.99 30/10/2006
410 30/10/2006 KMSCG 2 7.99 30/10/2006

Date dimension table: dim_date

date_key date day month
2388 15/07/2006 Saturday July
2485 30/10/2006 Monday October

Product dimension table: dim_product

product_key product_id product_name description
53076 BGCKQ Aike U2 IDE HD case USB 2
92184 KMSCG Sony BP71 VIAO CG7 battery

fact_sales after the insert:

fact_key date_key product_key order_id quantity price load_time
830923 2388 53076 352 1 12.99 30/10/2006
916912 2485 92184 410 2 7.99 30/10/2006

Notes:

  1. It is a good practice to always declare the column names. This is important for flexibility and maintenance.
  2. Let the RDBMS maintain the fact_key. When setting up the data warehouse, set this column to IDENTITY(1,1).

Logging and Closing

In every step above we need to do error handling and logging. Error handling is important because if the loading fails at any step, we need to be able to recover from the failure. Logging is important so we know exactly what happened at each step, i.e. how many records were processed, etc.

At the end of the program, we should not forget to clean everything up, i.e. drop the temp table(s), follow control protocol e.g. set the process to complete state, etc.

The above code shows how to do an upsert with native SQL, which is very fast and efficient. But it is worth noting here that good dedicated ETL tools such as Informatica and Data Integrator have facilities to do in-memory lookups, which also perform very well. Disk-based lookups are definitely not the way to go here, as they are slow, mainly because there are a lot of SQL statements to execute, i.e. one for each lookup, and each of these statements 'touches' the disk, which is a costly operation.

Now that we have understood the basic steps in loading fact tables, let's familiarise ourselves with a few practical considerations such as loading partitioned fact tables, slim lookup tables, deduplication, loading snapshot and delta fact tables, and dealing with fact table indexes.

Slim Lookup Tables

When a dimension is very large, sometimes it is a significant performance improvement to provide a separate lookup table for key management. For example, if our customer dimension has 100 columns and 10 million rows, we could create a customer key lookup with only 3 columns: customer_id, customer_key, load_time, which would increase the performance of the dimensional key lookup process in step 2 above. It is also more suitable for performing in-memory lookups. If you are using Oracle 8i and above, physical table structures such as IOT (Index Organised Tables) are useful in terms of performance. On SQL Server it's good practice to put a clustered index on the natural key of slim lookup tables. On Teradata, we could use a Partitioned Primary Index (PPI) on these lookup tables.

Example of Slim Key Lookup table: lookup_customer

customer_id customer_key load_time
493238 30012 02/10/2006
493240 30013 03/10/2006
493241 30014 03/10/2006
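Here is a minimal sketch of creating and maintaining such a lookup table on SQL Server. The names are hypothetical, and for simplicity it assumes one current row per customer_id in the dimension.

-- Run in the data warehouse database. Cluster on the natural key, as suggested above.
CREATE TABLE dbo.lookup_customer
( customer_id varchar(20) NOT NULL,
  customer_key int NOT NULL,
  load_time datetime NOT NULL )

CREATE UNIQUE CLUSTERED INDEX ix_lookup_customer ON dbo.lookup_customer (customer_id)

-- After each dimension load, add the natural keys that are not yet in the lookup.
INSERT INTO dbo.lookup_customer (customer_id, customer_key, load_time)
SELECT d.customer_id, d.customer_key, @current_load_time
FROM dbo.dim_customer d
LEFT JOIN dbo.lookup_customer l ON d.customer_id = l.customer_id
WHERE l.customer_id IS NULL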

The load_time column is useful if we have a very long dimension table, e.g. 25 million rows. This is not uncommon when the data warehouse is utilised to drive CRM systems (Customer Relationship Management), especially in dot com companies dealing with online campaigns, be they subscription based or tactical campaigns. In this case we can specify a where clause to limit the selection on this slim key lookup, for example where load_time is within the last 2 years. This would cover, say, 98% of the lookups. For the remaining 2%, i.e. the ones older than 2 years, we then go to the main customer dimension table to find the customer_key.

Although it takes a bit of time to maintain the lookup table, overall we still save time, as the time saved by querying a slim key lookup table is a lot greater than the time required to maintain the lookup table itself. This is especially true for Master Data Management (MDM) dimensions such as customer, product and account, which are used all over the place in the data warehouse and data marts. The time dimension is not that long – 10 years is only about 3,653 rows – hence we tend not to have a lookup for the time dimension. For those of us who are thinking of setting the grain of the time dimension to hours or minutes, the general advice is: don't. Either put a time stamp column on the fact table or have a time of day dimension. But this discussion (time of day dimension) is for another article.

Natural Key Changes

Sometimes a natural key which we thought was a good solid natural key can change and therefore can no longer be a natural key. Perhaps the best way to explain it is with an example. In an OLTP source system the customer table has a composite primary key: the combination of branch code and customer code. The customer code is only unique within a branch. Occasionally a branch is closed and all the customers in that branch are moved or assigned to another branch. For example, see the customer table below.

branch_code customer_code customer_name other_attributes
1 1 Andrew
1 2 John
2 1 Steve
2 2 Paul

When branch 1 is closed, and its customers are moved to branch 2, the customer table becomes:

branch_code customer_code customer_name other_attributes
2 1 Steve
2 2 Paul
2 3 Andrew
2 4 John

If we use branch_code and customer_code as the natural key in the data warehouse, we need to handle this branch closure event with care. In a previous project we utilised another column in the customer table which could identify a unique record, and we created a special table in the data warehouse to translate the changed natural key to the correct data warehouse key.

Unknown Records

If a fact table record does not have a corresponding dimension record, basically we have 2 choices: either we don't load that record into the fact table, or we load it but we put 0 as the dimension key, referring to the unknown record in the dimension table. An unknown record is a record in the dimension table with a dimension key of 0 and all the attributes populated with a blank string, the number 0 or a low-value date, depending on the data type. Name and description columns are usually populated with the word "Unknown". The load_time column is populated with the date the record was created. This date is normally equal to the date the data warehouse was set up, because the record is created by the data warehouse setup scripts. For example:

Product dimension table: dim_product

product_key product_id product_name description min_level valid_until load_time
0 0 Unknown Unknown 0 1/1/1900 12/10/2004
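As an illustration, below is a minimal sketch of how a setup script could create the unknown record above. The column list follows the simplified dim_product and is otherwise hypothetical.

-- Create the unknown record with dimension key 0 (run once by the DW setup script).
SET IDENTITY_INSERT dw.dbo.dim_product ON

INSERT INTO dw.dbo.dim_product
(product_key, product_id, product_name, description, min_level, valid_until, load_time)
VALUES (0, '0', 'Unknown', 'Unknown', 0, '19000101', GETDATE())

SET IDENTITY_INSERT dw.dbo.dim_product OFF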

Here is an example of a row in the source table with a product_id that does not exist in the dimension table:

order_id order_date product_id quantity price last_update
358 19/08/2006 BGCKZ 3 2.99 31/10/2006

Below is how fact_sales looks after that record is loaded. Notice that the product key is 0, which refers to the unknown record in the product dimension.

fact_key date_key product_key order_id quantity price load_time
830937 2424 0 358 3 2.99 31/10/2006

If we load it into the warehouse, we need to flag it in the data quality system, so that it can be reported and corrected on a subsequent load. If we don't load it, it should still be captured as a data firewall rule. Also, if we don't load it, the measure totals on the fact table will not be accurate. In the example above, the total sales amount for October 2006 would be 3 x £2.99 less than what it should be. Because of this we tend to load fact table records that do not have a corresponding dimension record and set the dimension key to 0. This way the fact table measure totals are correct; it's just that the sales cannot be traced to a valid product record. But all other dimension keys are still valid, e.g. it can be traced to a valid date dimension record, a valid customer dimension record, etc. And above all, the referential integrity between the fact tables and the dimension tables is still valid.

Deletion

There are 2 main reasons why we perform deletion on fact tables: 1. because the data extraction approach is fixed-period extraction, and 2. to accommodate reverse transactions such as cancellations. No 1 is usually physical deletion and no 2 is usually logical deletion. Let's discuss them one by one, using examples to clarify.

No 1, loading approach. In one of the data warehousing projects I've been involved with, we had difficulties extracting a transaction table from the source system incrementally, because the date stamp was not very reliable. We tried a 3-week tolerance, i.e. where the last updated date or created date was within the last 3 weeks, but we still found some leakage, i.e. a few records were updated without the date stamp columns being updated. Please refer to this article for a discussion of incremental extraction and leakage. Luckily, the source system did not allow the user to update records that were more than 6 months old. When a user tried to update a transaction record that was created more than 6 months ago, the system displayed an error, something like "You can not modify this record." So our approach to extracting this transaction table into the stage was to get 6 months of data every time. And consequently, the approach to loading the fact table from the stage was to delete the rows on the fact table that exist in the stage, then reload all the stage records into the fact table. It was effectively the same as updating those records, but we found that it was quicker to delete then reload. We identify the records on the fact table that exist in the stage by comparing all the dimension key columns, i.e. the composite natural keys of the source transaction table.
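A minimal sketch of that delete-then-reload pattern, using the simplified tables from the earlier example (in the real project the join was on the full composite natural key, not just order_id):

-- Delete the fact rows that exist in the 6-month stage extract.
DELETE f
FROM dw.dbo.fact_sales f
INNER JOIN stg.dbo.order_detail s ON f.order_id = s.order_id

-- Then reload the whole stage extract.
INSERT INTO dw.dbo.fact_sales (date_key, product_key, order_id, quantity, price, load_time)
SELECT ISNULL(d.date_key, 0), ISNULL(p.product_key, 0), s.order_id, s.quantity, s.price, @current_load_time
FROM stg.dbo.order_detail s
LEFT JOIN dw.dbo.dim_date d ON s.order_date = d.date
LEFT JOIN dw.dbo.dim_product p ON s.product_id = p.product_id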

No 2, accommodating reverse transactions. Sometimes in the source system we have a main transaction table containing normal transactions and a smaller secondary transaction table containing cancellation records. We have 2 options for loading this kind of table into the fact table in the data warehouse: either we load the cancellation as a new record with negative measures, or we load the cancellation as a logical deletion. Reverse transactions such as refunds and credit notes need to be implemented as negative measures, but for cancellations we have the 2 options.

Each approach has its own advantages and disadvantages. Which one is better depends on the situation, for example whether we will be loading from that fact table into OLAP cubes or not, whether we will be creating a summary fact table or not, whether we need the historical dates or not, and whether the secondary source table contains complete data or not. If we decide to go for logical deletion, then for each cancellation record that exists on the secondary source table, we mark the logical delete column on the corresponding fact table record. All processes further down the stream, such as loading into data marts, creating summary tables or loading into OLAP cubes, need to be aware of this column and handle it properly.

Deduplication

When loading records from stage to fact tables, sometimes we have duplicate records. For example: we declare that the grain of the fact table is 1 day for every product for each store. But we found this on the stage table of the fact table:

date product_id store_id quantity
19/8/2006 BGCKZ 309 30
19/8/2006 BGCKZ 309 1

Do we add them up, take the maximum, or take the minimum, or take the average?
It is very important to understand why it happened in the first place. In this case I have always found it useful to go back to the business users or the source system expert. The second record could be an error, in which case we take the earliest record. Or it could be a correction, in which case we sum them up. Or there can only be 1 measurement per day, so if 2 or more records are found it must be a system error and they want us to take an average. Whatever the business rule is, we need to document it, obtain sign off and then implement it in the code. This process is called deduplication. This normally happens if the source system allows duplication, i.e. it does not have the necessary constraints in the database to make records unique as per the fact table grain criteria.
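Below is a minimal sketch of two of those possible rules, applied in the stage before loading; the stage table name and the load_time ordering column are hypothetical.

-- Rule A: the duplicates are corrections, so sum them up to the fact table grain.
SELECT [date], product_id, store_id, SUM(quantity) AS quantity
FROM stg.dbo.stock_level
GROUP BY [date], product_id, store_id

-- Rule B (SQL Server 2005): take only the earliest record per grain and ignore the rest.
SELECT [date], product_id, store_id, quantity
FROM ( SELECT [date], product_id, store_id, quantity,
              ROW_NUMBER() OVER (PARTITION BY [date], product_id, store_id
                                 ORDER BY load_time) AS rn
       FROM stg.dbo.stock_level ) t
WHERE rn = 1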

Deduplication does not only happen in fact tables, but also in dimension tables, especially MDM dimensions such as customer and product. Deduplication can also occur when the fact table is loaded from 2 or more different source systems. Many data quality tools, such as Trillium, DataFlux and DQ Global, have facilities to deduplicate data.

Fact Table Indexes

Fact tables can be very large. It is not unusual for them to have millions of rows. To improve query performance, fact tables are normally indexed. The clustered index of a fact table is normally a composite of the dimensional keys, or the fact key column if there is one. For the non-clustered indexes, deciding which columns to index depends on how the fact table is used: whether it is queried directly by end users, used to populate OLAP cubes, or used by reports. SQL Profiler and the Index Tuning Wizard are useful for understanding which indexes would help improve query performance.

Indexing in data warehousing is quite different from OLTP. In data warehousing it is normal for tables to be indexed heavily to improve query performance. In OLTP, because the records are frequently updated, the tables are not indexed as heavily. It is normal for OLTP tables with 10 to 30 columns to have 1 to 3 indexes. In data warehousing, it is normal for tables with 10 to 30 columns to have 5 to 15 indexes. I'm talking about Kimball's dimensional model data warehouse here, not Inmon's 3rd normal form data warehouse in the Corporate Information Factory.

Both models of data warehouse are indexed a lot more than OLTP, but generally speaking dimensional model data warehouses are indexed more than the 3NF ones. Because data warehousing tables are heavily indexed, the indexes take much more space than in OLTP. It is normal in a data warehouse for the indexes to take more space than the data. As an example, in a data warehouse implementation I was involved in recently, the 3 main fact tables occupy 100 GB for data, but the indexes for these fact tables are 150 GB. This would never happen in an OLTP system.

If our fact table contains 100,000 rows or less, we just load the fact table with the indexes on. There is no need to worry about dropping indexes when loading. If our fact table contains more than 1 million rows, it may be quicker to drop the indexes before loading the data and recreate them afterwards. If the load is less than 10% of the fact table length (i.e. the number of rows in the fact table), generally speaking we don't need to worry about dropping indexes; chances are we would decrease the loading performance by doing so. But if we are loading 20% or more, we may want to consider dropping and recreating indexes. It is very difficult to generalise this, as the performance differs depending on what indexes we have and what loading operations we perform. Hence we always need to test it in dev and QA to prove that our load is significantly improved by dropping and recreating indexes, before implementing it in production.
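A minimal sketch of that drop-and-recreate pattern on SQL Server 2005, with hypothetical index names:

-- Drop the non-clustered indexes before a large load...
DROP INDEX ix_fact_sales_product ON dw.dbo.fact_sales
DROP INDEX ix_fact_sales_date ON dw.dbo.fact_sales

-- ... load the fact table here ...

-- ... then recreate them afterwards.
CREATE NONCLUSTERED INDEX ix_fact_sales_product ON dw.dbo.fact_sales (product_key)
CREATE NONCLUSTERED INDEX ix_fact_sales_date ON dw.dbo.fact_sales (date_key)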

If you use Oracle, bitmap indexes and Index Organised Tables (IOT) are very useful in data warehousing. If implemented correctly, for example, bitmap indexes on columns with very low cardinality can be 10 times faster than a B-tree index. IOTs are normally used for slim lookup tables. These features don't exist in SQL Server; as far as I'm aware SQL Server only supports B-tree indexes.

Fact Table Partitioning

Table partitioning is new in SQL Server 2005, but has been in Oracle since 1997 in Oracle 8 and was improved in versions 8i and 9i. In SQL Server 2000 we only have partitioned views, not partitioned tables. This article provides a good overview of partitioning in SQL Server 2005. Joy Mundy wrote an article about partitioning on SQL Server 2000. For Oracle, it's in this document and this manual. In DB2, table partitioning was introduced in version 8 in 2002, and was greatly improved in version 9, which was released in July 2006. Paul McInerney describes DB2 partitioning features for data warehousing in this article.

Fact table partitioning is basically dividing the fact table into several physical parts. Each part is called a partition. Each partition is ideally located in a different physical data file, and ideally each file is located on a different disk. For example, we can divide the fact table so that rows for January 2006 are located on partition 1, rows for February on partition 2, and so on. Each partition can be loaded separately in parallel. Partitioning can improve query performance and increase availability. We can add a new partition, drop an existing partition, truncate (empty) a particular partition, move a partition, split a partition, merge several partitions and exchange/switch partitions. All these activities improve maintenance operations and greatly simplify administration tasks.

If our fact table contains more than 1 million rows, we definitely need to consider partitioning it. Even if the fact table contains hundreds of thousands of rows, we should still consider partitioning it. It can improve the loading performance significantly. We are talking 10-20 times quicker here. This is because we load new data into just 1 partition, which is, say, 12 times smaller than the whole fact table, hence quicker. There are also techniques to load data into a new empty table with exactly the same structure as the fact table, then switch/exchange that new table with a partition of the main table. Queries can be a lot quicker too; again we are talking 10-20 times quicker here. We can also partition indexes so that each part of the index serves only one table partition. This allows more processes to run in parallel. Partitioning can really be a life saver for a data warehouse.
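As an illustration, here is a minimal sketch of monthly partitioning on SQL Server 2005 and of switching a pre-loaded table into an empty partition. All names, boundary values and the snapshot_date column are hypothetical.

CREATE PARTITION FUNCTION pf_fact_monthly (datetime)
AS RANGE RIGHT FOR VALUES ('2006-01-01', '2006-02-01', '2006-03-01')

CREATE PARTITION SCHEME ps_fact_monthly
AS PARTITION pf_fact_monthly ALL TO ([PRIMARY])   -- ideally one filegroup per partition

CREATE TABLE dbo.fact_sales_snapshot
( date_key int, product_key int, quantity int, price money, snapshot_date datetime )
ON ps_fact_monthly (snapshot_date)

-- Load the new month into dbo.fact_sales_load (identical structure, same filegroup,
-- with a check constraint limiting snapshot_date to the new month), then switch it in:
ALTER TABLE dbo.fact_sales_load SWITCH TO dbo.fact_sales_snapshot PARTITION 4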

If you use Oracle, always use local partitioned indexes instead of global partitioned index. If you use SQL Server 2005, SQL Server automatically partitions the index by using the same partition scheme and partitioning column as the table, unless you specify a different partition scheme. For more information on partition indexes on SQL Server, please refer to these guidelines.

Internal Data Warehouse Entities

Some data, such as performance targets or budgets, does not exist in any source system but needs to exist in the data warehouse. Hence it is known as internal data warehouse entities. For example, for each store we can calculate the out of stock percentage, i.e. how many times a required product is out of stock. There may be a consensus that it needs to be under 5%. This 5% is not written anywhere in any source system, but it will be required by the reports, so it needs to exist somewhere in the data warehouse.

We have 2 options for getting this performance target into the warehouse:
1) we can build an application which stores or persists the data in its own database, and from there we ETL it into staging and into the warehouse, or 2) we put it in a simple spreadsheet and import it into the warehouse. In either case we should not allow the data to be entered directly into the data warehouse, for example using a direct SQL statement or through SQL Server Enterprise Manager. All data stored in the warehouse must go through the data quality / data firewall rules, so that any exceptions to the rules are recorded and reported.

Loading Snapshot and Delta Fact Tables

A snapshot fact table is a fact table that contains the state of measures or statuses at a specific point in time. For example:

  • a fact table that contains actual inventory level for each product at 9 am every day
  • a fact table that contains balances of every saving account in all branches on the first day of every month
  • a fact table that contains the details of all accounts that each customer has every day

There are 2 kinds of snapshot fact tables: periodic and accumulating. Periodic snapshot fact tables contain the state or status taken at a regular interval, i.e. we take the snapshot of the source table at a regular interval. All 3 examples above are periodic snapshot fact tables. Accumulating snapshot fact tables show the latest status at any given moment; they are useful for tracking items with a certain lifetime, for example the status of order lines. Please refer to this Ralph Kimball article for more details about snapshot fact tables. An example of an accumulating snapshot can be found here.

How do we load periodic snapshot fact tables? We extract all records that satisfy the criteria from the source table at a certain period. For example, take all active rows from the account tables including the balance amount, and do this automatically once a month on the first day. Loading an accumulating snapshot is rather different. We still take all records from the source table that satisfy the criteria, but then we update the fact table. For a purchasing accumulating snapshot, for example, every time there is a new piece of information about a particular purchase, we update the fact table record. We only insert a new record into the fact table when there is a new purchase requisition.

A delta fact table is a fact table that we produce by comparing the condition of a source table at 2 different points in time. For example, the account table. This table in the source system contains all customer accounts. Say on Monday we have 100,000 active accounts and on Tuesday we have 100,001 active accounts, i.e. there were 2 new accounts opened, 1 account closed and 3 accounts changed. Of the 3 accounts changed, 1 changed its interest rate (from 5.75% to 5.50%), 1 changed its credit limit (from £3000 to £3500), and 1 changed its interest payment frequency (from daily to monthly). On the delta fact table there will be 6 new rows today (assume today is 6th Nov 2006). Legend for change_type: NA = new account, CA = closed account, IR = Interest Rate, CL = Credit Limit, IPF = Interest Payment Frequency.

account_key change_date change_type IR_before IR_after CL_before CL_after IPF_before IPF_after
49912 6/11/2006 NA
26077 6/11/2006 CA
32109 6/11/2006 IR 5.75 5.50
19387 6/11/2006 CL 3000 3500
29462 6/11/2006 IPF D M

To populate the delta fact table, we download the source account table every day, compare today's copy with yesterday's copy and enter the differences into the delta fact table. As an illustration of a practical implementation of delta fact tables, in a previous data warehousing project we used them to track customer retention and customer churn. This is useful for Customer Relationship Management and also for calculating the financial impact of customer churn.
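A minimal sketch of that comparison for two of the change types above (new accounts and interest rate changes); the staging copies, the @today variable and the account key lookup are hypothetical names.

-- New accounts: in today's copy but not in yesterday's.
INSERT INTO dw.dbo.fact_account_delta (account_key, change_date, change_type)
SELECT k.account_key, @today, 'NA'
FROM stg.dbo.account_today t
LEFT JOIN stg.dbo.account_yesterday y ON t.account_id = y.account_id
INNER JOIN dw.dbo.lookup_account k ON t.account_id = k.account_id
WHERE y.account_id IS NULL

-- Interest rate changes: in both copies but with a different rate.
INSERT INTO dw.dbo.fact_account_delta
(account_key, change_date, change_type, IR_before, IR_after)
SELECT k.account_key, @today, 'IR', y.interest_rate, t.interest_rate
FROM stg.dbo.account_today t
INNER JOIN stg.dbo.account_yesterday y ON t.account_id = y.account_id
INNER JOIN dw.dbo.lookup_account k ON t.account_id = k.account_id
WHERE t.interest_rate <> y.interest_rate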

Purging Or Pruning A Fact Table

Some people call it purging, some call it pruning. Purging or pruning a fact table is an activity to remove certain rows from the fact table that satisfy certain criteria. To give us an idea, below are some examples of purging criteria:

  1. older than 2 years
  2. older than 2 years and status is not active
  3. keep daily records for the last 4 weeks then Monday only for the last 2 years

Examples 1 and 2 are normally for transaction fact tables and example 3 is normally applicable to periodic snapshots. Purging is important when our data warehouse is a few years old. Purging can improve query and load performance significantly. If the fact table is partitioned, it is a common exercise to archive the oldest partitions (say older than 5 years, partitioned by month) then drop or truncate those partitions. Archiving can be done by exporting the partition (using exp in Oracle) or by backing up the partition.
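A minimal sketch of purging criterion no. 2 above on SQL Server; the status dimension and its columns are hypothetical.

-- Delete fact rows older than 2 years whose status is not active.
DELETE f
FROM dw.dbo.fact_sales f
INNER JOIN dw.dbo.dim_date d ON f.date_key = d.date_key
INNER JOIN dw.dbo.dim_order_status s ON f.order_status_key = s.order_status_key
WHERE d.[date] < DATEADD(year, -2, GETDATE())
AND s.status <> 'Active'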

Reloading Fact Table History

Soon after the data warehouse is in production, users may want us to reload, say, the last 2 years of history from the OLTP source system. Say they want 2 years of history of the order tables. It's not difficult to reload the history. Normally we just need to modify the time window on the ETL package. For example, if you use SSIS or DTS, we could set the time window on a table in the metadata database, where the Last Successful Extraction Time (LSET) and Current Extraction Time (CET) are kept. See this article for the concept of LSET and CET.

The problem with reloading fact table history is that we need to reload all the related dimensions too, and most OLTP systems do not keep this history. All orders in the last 2 years? Yes, no problem, they have them. But all customer, product and store details over the last 2 years? Often the customer and product history is not kept; their details are overwritten with new ones. The SCD concept is not known or implemented in OLTP systems, so the history is lost. The best approach is probably to load the latest state of the product, customer and store tables from the source system, and for all order records for which we can't find a match in the product, customer and store tables, reference them to the unknown records. This way the sum of the measures in the fact table will still be valid and the referential integrity will still be intact.

Conclusions

Fact tables are loaded after the dimension tables are populated. As rows of fact tables are loaded, each of the key columns is converted from the natural key. When loading fact tables sometimes we need to deduplicate the rows. To load fact table record that does not have a corresponding dimension record we set the dimension key to 0, referring to the unknown record in the dimension table. Loading performance could be a lot higher if the fact table is partitioned. Consider removing fact table indexes before loading the data and recreate them afterwards. Purging or pruning fact tables can improve query and load performance significantly.

Vincent Rainardi
12th November 2006

Data Extraction Method

Filed under: Data Warehousing — Vincent Rainardi @ 12:43 am

To populate the data warehouse we need to get data from the source systems. The source system could be an RDBMS, it could be a file based system (including XML), it could be a spreadsheet, or it could be something else (a web service, a web page, a document, etc). In this article we discuss RDBMS source system, i.e. the source data is located in a table on SQL Server, Oracle, Informix, DB2, Access, etc. File based source systems and spreadsheets will be discussed in a separate article.

There are 3 main methods to extract a table out of an RDBMS source system into our data warehouse: get the whole table every time, get it incrementally (only the changes), or using a fixed period. Each of these 3 basic approaches will be explained below one by one.

There are several considerations when choosing which method to use. The first and foremost is normally the time window, i.e. the data extraction process should be completed within a certain amount of time. The less time it takes to do the data extraction, the better it is for a data warehouse system, because this means we can update the data warehouse more frequently, and it generally means that we put less load on the source system.

The second consideration is practicality, i.e. what is possible and what is not. Ideally we would like to extract incrementally, but if that is not possible then we will have to use another method. Or the other way around: ideally we would like to extract the whole table (just to be sure), but if there are 100 million records in the table, then we may have to do it incrementally, otherwise the extract could take a very long time.

You can read Part 2 here.

1. Whole Table Every Time

The simplest way to download a table from the source system into data warehouse is to extract all records every time, e.g. “SELECT * FROM table1“. For example, a store table. Say a retail group has 500 stores. Each store has a record on this table containing the store name, address and various attributes such as store grade or classification, last inspection date and ID of the store manager. For 500 records, even if the store table has timestamp for both create and update, it is probably quicker to get the entire table every time.

If we just do a select without a where clause, the extraction begins instantly. If we put something in the where clause, such as a timestamp, the extraction process begins a few moments later (it could be a second later, it could be a few minutes later, depending on how many records are in the table and on whether the columns used in the where clause are indexed or not). If you are connected to the source system over a LAN, say 100 Mbps, 500 records should take less than a second. So, if the delay caused by the where clause is a few seconds, it is quicker to download the whole table every time.

A quick test using an ETL tool can easily measure how long the delay is. But as a guideline, for a table that has less than 1000 rows, we should definitely consider getting the whole table every time. If the table has 1000 to 10,000 rows, we should still probably consider getting the whole table every time. If the table has more than 1 million rows, we should definitely consider downloading it incrementally (unless the record length is very small, such as 10 bytes). One sure way of determining whether downloading the whole table every time is suitable is to do it and measure the time. If the time it takes cannot be accommodated, then we need to extract the data incrementally.

2. Incremental Extract

If the table contains 2 million records, say it is the customer table, it could take 15-20 minutes to download the whole table. The purpose of giving the stats/numbers here is only for illustration, so we can get a feel for it. It is not exact and it is not scientific. It could vary significantly, for example if you are using RAID 10 or a Gigabit network. OK, back to the customer table. If there are timestamp columns for creation and updates, then it could be a sensible idea to try to download this table incrementally, which may decrease the download time to, say, 3-5 minutes. The number of records decreases from 2 million to 2000 (daily changes only) and the amount of data transferred decreases from 2 GB to 2 MB (assuming the record length is 1000 bytes).

In OLTP systems, customer and product are probably the biggest master tables, but transaction tables are normally a lot larger. It does depend on the size of the company (multinational or local) and the industry (telecommunications and retail normally have huge tables). A transaction table with 100 million rows is not uncommon. At the early stages of building the data warehouse, one of the key questions to answer is whether it is possible to extract the transaction tables incrementally, because it is often not practical (in terms of time window) to extract the whole table every time. This question is so important that for many projects, if the answer is no, the project cannot proceed.

There are 6 main ways to perform incremental extract:
a. Using timestamp columns
b. Using identity column
c. Using triggers
d. Using transaction date
e. Combination of the above
f. Using off-the-shelf tools

a. Incremental Extract Using Timestamp Columns

If the table is large (>100,000 rows), it is always worth finding out if the table has creation and update timestamp or datestamp columns, whether the timestamps are indexed, and, more importantly, whether the timestamps are updated every time a record is created or updated. The index question is simple and straightforward: we can easily prove whether a column is indexed or not, and your source system DBA would know how to. Different RDBMS have different implementations of indexing (some platforms call it a logical file), but they all speed up the query. Adding the timestamp to the where clause of the select statement can make a huge difference to the extraction time, especially for tables with millions of rows. Sadly, more often than not, the timestamp columns are not indexed. In this case it may be worth finding out if they could index it for you. But before you put forward the request, consider the impact on the source system – as you know, the more indexes there are on a table, the slower the DML (insert, update and delete) becomes.

The second one is a little bit more difficult to find out: whether the timestamps are updated every time a new record is created and every time an existing record is updated. The supplier of the ERP system may confidently say that the timestamps are always updated on insert and update throughout the application, but your internal IT colleagues may issue a direct DML statement to the database. “It is just a normal annual exercise every beginning of fiscal year”, so you may hear from them. Even if the timestamps are maintained by triggers, we can not be 100% scientifically sure that the timestamps are always kept up-to-date. Step 4 of the ‘normal annual exercise’ could be disabling the triggers on that table!

To be 100% certain, put the table on a daily extract based on the timestamp and leave it running for a few weeks. This period depends on how frequently the table is updated; if the table is updated daily then a few weeks is a sensible evaluation period. After this period, compare the table in the data warehouse with the table in the source system using the primary keys. Find out if there is a "leak", i.e. records which exist in the source system but cannot be found in your DW. If you find a leak, check your logic for the date movement; also check whether the date window moved when the download process failed. If there is no leak, we need to check that the updates are working, i.e. that all columns on all rows of the DW table match those on the source system. We do this using a CRC comparison. The 3rd thing we need to check is whether there are any deletions on the source system. This is done using the same means as checking for a "leak" above; in fact they are both done in the same process. Deletions are identified by finding out whether there are records which exist in the data warehouse but not in the source system. If these 3 checks are satisfied, then we can use the timestamp columns for incremental loading.
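A minimal sketch of that CRC-style comparison on SQL Server, using CHECKSUM over the non-key columns of the two copies; the table names here are hypothetical (e.g. a full source extract and the DW copy brought side by side).

SELECT s.order_id
FROM stg.dbo.order_detail_full s
INNER JOIN dw.dbo.order_detail_copy d ON s.order_id = d.order_id
WHERE CHECKSUM(s.order_date, s.product_id, s.quantity, s.price)
   <> CHECKSUM(d.order_date, d.product_id, d.quantity, d.price)
-- CHECKSUM can collide; BINARY_CHECKSUM or HASHBYTES give a stronger comparison.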

Once the timestamp columns are tested to our satisfaction, we can use them for incremental extract. The basic syntax for the select statement is: “SELECT * FROM table1 WHERE date_col > LSET“. LSET = Last successful extract time. Before the extract begins, we store the data warehouse system time in a variable. If the extract is successful, we store this variable on data warehouse control database as LSET. If the extract fails, it is important that we don’t store this variable on the database. Next time the extract runs it will use the stored LSET so it will only get the records since the last successful extract.

It is important to set the LSET to the time before the extraction begins, not when the extraction finishes. And it is important to limit the maximum time we want to extract to. For example: if the data extraction routine runs at 1 am for 10 minutes, and runs daily, then we want to limit the time window for today's data extraction to 09/03/06 01:00:00 to 10/03/06 01:00:00. We do not want records created at 10/03/06 01:00:01, because those records are for tomorrow's extract. Thanks to Steve Wright from C&C group for making me aware of the maximum limit. So the where clause becomes: "SELECT * FROM table1 WHERE date_col > LSET AND date_col <= CET", where CET = Current extract time, i.e. the time before the extraction began.

The reason why we don’t want records created or updated after the data extraction began is because we can not guarantee if the record will be extracted or not. Say for instance we have a record with timestamp of 10/03/06 01:05:00. Data extract is running for 10 minutes, from 10/03/06 01:00:00 to 10/03/06 01:10:00. If our select statement is “SELECT * FROM table1 WHERE date_col > 10/03/06 01:00:00” (without the CET), will we get that record? It depends on the ISOLATION LEVEL, or CONCURRENCY CONTROL of the source system. Different RDBMS have different implementations of isolation level: IBM DB2 on z/OS, IBM DB2 UDB on iSeries, Microsoft SQL Server 2005, Oracle 10g, IBM Informix DS10.

So in the data warehouse control system, for each table we extract we keep 2 timestamps defining the range of the extraction: the Last Successful Extract Time and the Current Extract Time. In the previous example the LSET is 09/03/06 01:00:00 and the CET is 10/03/06 01:00:00. If the data extraction is successful (for this particular table), we set LSET = CET, so both become 10/03/06 01:00:00. Tomorrow when the extract begins, the CET will be 11/03/06 01:00:00 and it will extract from 10/03/06 01:00:00 to 11/03/06 01:00:00. On the other hand, if the extract fails, we do not set LSET = CET, so the LSET remains 09/03/06 01:00:00. Tomorrow when the extract begins, the CET will be 11/03/06 01:00:00 and it will extract from 09/03/06 01:00:00 to 11/03/06 01:00:00, picking up the failed day as well.
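
Continuing the sketch above (same assumed etl_control table), the success path is the only one that moves the window forward:

DECLARE @CET datetime;
SET @CET = GETDATE();   -- captured before the extraction begins

-- ... run the extract for the window (LSET, CET] ...

-- only if the extract succeeded:
UPDATE etl_control SET LSET = @CET WHERE table_name = 'table1';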

If the date column is separate from the time column, for example the date column contains '28/02/2006' or '28-02-2006' or '2006-02-28' and the time column contains '11.03.21' or '11:03:21', the algorithm we can use is:
WHERE (date_col > min_date AND date_col < max_date) — first segment
OR (date_col = min_date AND time_col > min_time) — second segment
OR (date_col = max_date AND time_col <= max_time) — third segment

min_date and min_time are determined from the LSET, whilst max_date and max_time are determined from the CET.
For example: we want to extract records from 07/03/06 01:00:00 to 10/03/06 01:00:00 (because the data extraction routine failed to connect to the source system for 3 days). The first segment above gets records created/updated on 08/03/06 and 09/03/06. The second segment gets records created/updated on 07/03/06 after 1 am. The third segment gets records created/updated on 10/03/06 up to 1 am.
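
Putting the example values into the three segments (a sketch; date_col and time_col are the hypothetical separated columns, and the literal formats depend on the RDBMS):

SELECT *
FROM table1
WHERE (date_col > '2006-03-07' AND date_col < '2006-03-10')    -- 08/03 and 09/03 in full
   OR (date_col = '2006-03-07' AND time_col > '01:00:00')      -- 07/03 after 1 am
   OR (date_col = '2006-03-10' AND time_col <= '01:00:00')     -- 10/03 up to 1 am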

It is very, very important to ensure that we do not update the LSET if the extraction fails, otherwise we will lose data. I cannot stress this enough. If there is only one thing you take away from this article, I hope it is this: ensure, and test, that the LSET is not updated when the extraction fails.

One apparent weakness of incremental extract using timestamps is that we cannot identify deletions. If the source system uses soft deletion (deleted records are only marked), we have nothing to worry about. But if it uses hard deletion (deleted records are physically removed), then we need another technique to identify the deletions, for example a trigger or an identity column. Fortunately, most ERP / business systems do not delete from the main transaction tables, particularly when they are in header-detail format. Instead, they normally use a status column, e.g. order status, call status or transaction status.

This is because of cascade delete / cascade update on the PK-FK relationship, more widely known as "cascading referential integrity constraints". Basically this means: if you want to delete rows from the primary key table (the parent table, if you like), you also have to delete all the corresponding rows from the foreign key table (the child table), otherwise you will break the referential integrity constraints. That makes physical deletion on header-detail tables expensive and risky, which is why systems tend to avoid it.

When the timestamp column is tested for reliability (see the paragraph above beginning "To be certain…"), we will find out whether deletions are involved. It is good to ask the client / system provider whether there are deletions on the transaction table, but we still have to test it.

One last point on using timestamps for incremental extract: sometimes we need to get the timestamp from another table, or from a combination of the table we want to extract and another table. For example: order header and order detail, linked on order number. Early on, we may think it is enough to extract the header table based on the timestamp columns of the header table, and the detail table based on the timestamp columns of the detail table. In practice we may find that this is not enough: to extract the header table we may need to use the timestamp columns from both the header and the detail table, and likewise for the detail table. The code looks like this (with the detail-table counterpart sketched after it):

SELECT DISTINCT h.*   -- DISTINCT: the join returns one header row per changed detail row
FROM order_header h
LEFT JOIN order_detail d
  ON h.order_number = d.order_number
WHERE (h.timestamp > LSET and h.timestamp <= CET)   -- the header changed in the window
   OR (d.timestamp > LSET and d.timestamp <= CET)   -- or any of its detail rows changed
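
The detail table is extracted the same way, joining back to the header so that a change on either table pulls the detail rows through (a sketch, following the same pseudo-variable convention):

SELECT DISTINCT d.*
FROM order_detail d
INNER JOIN order_header h
  ON h.order_number = d.order_number
WHERE (d.timestamp > LSET and d.timestamp <= CET)   -- the detail row itself changed
   OR (h.timestamp > LSET and h.timestamp <= CET)   -- or its header changed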

b. Incremental Extract Using Identity Column

If there is no timestamp column in the transaction table (this is unlikely), or if the timestamp column is not reliable (this is more likely), then we can use an identity column to extract the data incrementally. An identity column is unique. We can test the data to make sure there is no duplication as follows (id_col being the identity column):
SELECT id_col, count(*) FROM table1
GROUP BY id_col HAVING count(*) > 1

An identity column is also sequential, which enables us to identify new records easily. The basic syntax is WHERE id_col > LSEI, where LSEI = Last Successful Extract Identity, i.e. the value of the identity column of the last record successfully extracted. Again, just as with the time window, we want to limit the maximum, so it becomes WHERE id_col > LSEI and id_col <= CEI, where CEI = Current Extract Identity. A sketch follows below.
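
A minimal sketch of the identity window, following the same pattern as the time window (etl_control_id is an assumed control table holding the LSEI per source table):

DECLARE @LSEI int, @CEI int;
SELECT @LSEI = LSEI FROM etl_control_id WHERE table_name = 'table1';
SELECT @CEI = MAX(id_col) FROM table1;   -- current extract identity, captured before the extract

SELECT * FROM table1 WHERE id_col > @LSEI AND id_col <= @CEI;

-- only if the extract succeeded:
UPDATE etl_control_id SET LSEI = @CEI WHERE table_name = 'table1';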

To identify deletions we compare the identity column between the source and the data warehouse. This is done by downloading the identity column (all records) from the source system. In many cases, especially on transactional tables, the identity column is also the primary key, but not always. Basically we identify the missing ones, i.e. rows that exist in the data warehouse but no longer exist on the source system, and we then delete (or move/archive) these rows from our data warehouse.
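
A sketch of the deletion check, assuming the downloaded identity values have been staged into a table called source_ids:

-- rows that exist in the warehouse but no longer exist on the source
SELECT dw.id_col
FROM dw_table dw
LEFT JOIN source_ids s ON s.id_col = dw.id_col
WHERE s.id_col IS NULL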

OK, that’s new records and deletion. How about updates? To identify updates we need to download the columns that we want to load to our data warehouse (yes, all records < LSET unfortunately), and compare these columns with the records on the data warehouse. If they are different, we update the records in the data warehouse.

After the source data is downloaded, the basic statement for updating incrementally is as follows:
UPDATE dw
SET dw.col1 = src.col1, dw.col2 = src.col2
FROM dw_table dw
INNER JOIN stage_table src ON dw.id_col = src.id_col   -- join on the identity / primary key
WHERE dw.col1 <> src.col1 OR dw.col2 <> src.col2       -- only rows that actually changed

In most cases an archiving or purging routine is implemented on the ERP system, so the active transaction table does not hold too many rows; for example, keeping only the last 12 months of data on the order table results in, say, 5 million rows with a download time of, say, 30 minutes. Sometimes, however, many years of records are kept without ever being purged to an archive database, resulting in (say) 30 million records on the active order detail table. If this is the case, try to find out whether it is possible to impose a time range for identifying updates, say the last 6 months of records. For identifying this 6-month range, if there are no timestamp / datestamp columns (created_date, last_updated_date, etc.), then a transaction date column (order date, delivery date, etc.) will do. This limits the number of records downloaded and compared when identifying updates; see the sketch below.
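
For example, using the order date to limit what we download for the update comparison (a sketch, assuming SQL Server date functions; order_date stands in for whatever transaction date column is available):

SELECT id_col, col1, col2
FROM order_detail
WHERE order_date >= DATEADD(month, -6, GETDATE())   -- only compare the last 6 months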

We can also use other mechanisms to identify updates, such as an update trigger, an update timestamp or log files. The update timestamp does not always need to be on the same table; it could come from another table, for example the detail table in a header-detail relationship.

If the source system is Oracle, the identity column is probably implemented on the primary key column using a trigger and CREATE SEQUENCE, something like this: oracle-base.com. In DB2 and SQL Server, identity columns are built in (DB2 also supports CREATE SEQUENCE), and in Informix it is the SERIAL data type.
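
The Oracle pattern referred to above looks roughly like this (a sketch only; the sequence, trigger and column names are illustrative):

CREATE SEQUENCE order_seq START WITH 1 INCREMENT BY 1;

CREATE OR REPLACE TRIGGER order_header_bi
BEFORE INSERT ON order_header
FOR EACH ROW
BEGIN
  SELECT order_seq.NEXTVAL INTO :NEW.order_id FROM dual;   -- populate the "identity" column
END;
/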

c. Incremental Extract Using Triggers

Triggers are the most convenient way of doing data extraction; it is the belt-and-braces approach. There are 3 types of trigger: for insert, for update and for delete. Most RDBMS also differentiate between before and after triggers, i.e. whether the trigger is executed before or after the insert/update/delete. By creating triggers on the source system tables, we can program the source system to notify us every time a new record is created, an existing record is updated, or a record is deleted. We need to be careful when implementing triggers because they slow down the source system.

One way of implementing triggers in the source system is to keep the primary keys of the changed rows in a specially created table (let's call this the delta table). In the delta table we only keep the primary keys, not the whole record, because different tables have different columns, so it is not possible to create a single delta table that can store rows from every table. In the delta table, the primary key values are stored in one column, using separators, and the names of the primary key columns are stored in another column, also using separators. The delta table also contains 2 important columns: a creation timestamp column (no updated timestamp column is needed) and a table name column. These enable us to get to the right table and to extract it incrementally. The delta table looks like this:

ID  table_name    PK_columns           PK_values   time_stamp
1   inventory     store_id|product_id  23|5643_G   17/03/2006 10:10:46
2   order_detail  order_id|line_no     454AS1|2    17/03/2006 10:11:01
3   customer      customer_id          343         17/03/2006 10:11:58

The delta table needs to be cleared out once it has been processed. When processing the delta table, it is important to impose a CET (Current Extract Time), i.e. the system time before the data extraction begins. So the process is: get all records from the delta table where timestamp <= CET, then delete all records from the delta table where timestamp <= CET. Alternatively (and preferably, for trace/history reasons and also for performance – see the note on having a good primary key below), we can leave the delta table in place after extracting from it and simply store the CET as the LSET (Last Successful Extract Time); see the section on timestamps above for details. If we do not clear the delta table, we need to implement 1) a good integer identity(1,1) primary key, such as the ID column in the example above, with a clustered index on it for performance, and 2) a good purging mechanism (say, keeping only the last 3 months of data), otherwise the system will get slower and slower over time – in this case a non-clustered index on the time_stamp column is useful. One note on the time_stamp column if your source system is SQL Server: I would prefer to make this column a datetime data type rather than the timestamp data type, for compatibility both with future versions and with other RDBMS (portability of the implementation). A sketch of the trigger that populates the delta table follows below.
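
As an illustration, an insert/update trigger populating the delta table could look like this on SQL Server (table and column names are hypothetical; the real implementation depends on the source RDBMS). A similar trigger reading the deleted pseudo-table would cover deletions.

CREATE TRIGGER trg_order_detail_delta
ON order_detail
AFTER INSERT, UPDATE
AS
INSERT INTO delta_table (table_name, PK_columns, PK_values, time_stamp)
SELECT 'order_detail',
       'order_id|line_no',
       CAST(i.order_id AS varchar(20)) + '|' + CAST(i.line_no AS varchar(10)),
       GETDATE()
FROM inserted i;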

Depending on the structure and complexity of the source system, it is sometimes better to have a separate delta table for each table on the source system, rather than a single shared delta table. It is also common to have the delta tables extracted to files at certain times of the day, ready for the data warehouse to pick up.

The main difference between doing data extraction with triggers and the 2 methods mentioned previously is that with triggers we "touch" the source system. The word touch may be a little too soft; the word "change" is probably more appropriate. Yes, we change the source system. This is not a luxury that everyone has. In many cases the source system is an off-the-shelf ERP system, e.g. SAP, Oracle 11i, Sage, MFG, JD Edwards, PeopleSoft, Axapta, Siebel CRM, SalesForce, etc. In these cases we most probably do not have the luxury of modifying the source system as we like. If we ask the supplier to modify it for us, we risk ending up with an ERP that is no longer a standard version, and therefore a) not upgradable to the next version, and b) not easily supported by other software when interfacing. Luckily, the big players in the ERP market such as SAP and Oracle have prepared for this and have their own data warehousing tools. Tools such as Oracle Change Data Capture can extract data incrementally, implemented either using triggers (synchronous) or using the redo log files (asynchronous).

d. Incremental Extract Using Transaction Date

If there are no good create and update datestamp or timestamp columns in the source table, it is not possible to put triggers in the source system, and there is no identity column either, then our next best option for incremental data extraction is to use a transaction date column. By this I mean a column which contains the date the transaction happened.

Generally speaking, there are 2 major categories of tables in an ERP or business system, or, as the data warehouse community calls it, an OLTP system (in contrast with the OLAP we have in a data warehouse system). The first category is the 'master' tables: tables such as product, customer, supplier, account, status code, product group, store, and disposal site. In the data warehouse star schema, these become the dimension tables. The second category is the 'transaction' tables: tables such as sales order header, sales order detail, purchase order header, purchase order detail, inventory transaction, account transaction, price review and disposal. In the data warehouse star schema, these transaction tables become the fact tables.

If the table we extract is a transaction table, it normally has one or more transaction date columns. In an order header table, it is common to find the following transaction date columns:

  • order date – the date the order was placed by the customer
  • system order date – the date the order header record was created
  • estimated delivery date – the date the order is supposed to leave our door
  • actual delivery date – the date the order actually left our door

The difference between order date and system order date is this: if the order came in by fax yesterday and it is typed into the system today, the order date will be entered as yesterday whilst the system order date will be today's date. The operator has no control over the system order date; it is maintained internally by the system. Please note that in better ERP systems, to enable partial delivery, the estimated and actual delivery dates are put on the order detail table rather than the order header table.

In the above example, the system order date is an excellent transaction date. It is guaranteed to be the date the record was created, so orders that happen later always have a later (or equal) system order date. That makes this column, in a way, sequential, just like an identity column. The order date column, on the other hand, can be deceiving, because the user can type a past date into it. This causes a problem, because we cannot be sure which orders come after which.

The basic syntax of the ETL select statement is generally the same as the one for a datestamp / timestamp column, e.g. SELECT col1, col2 FROM table1 WHERE trans_date_col >= LSED, where LSED = Last Successful Extract Date. If we only deal with dates (not times as well), there is little point in constraining the maximum.

So that’s for inserts. How about deletes and updates? Well, simply speaking, we can’t get deletes and updates using transaction dates. We need to use other methods to get them. This restriction made us often have to combine this method with fixed period method (see point 3 below), for example. we download last 6 weeks every time, based on transaction date column.

e. Combination Of The Above

Sometimes we need to combine the above 4 methods, to suit the condition of the source table, in order to get all inserts, updates and deletes. Inserts are generally the easiest to get: all four methods work, i.e. timestamp column, identity column, triggers and transaction date column. For deletes we can use the identity column or triggers. For updates we can use the timestamp column, CRC comparison or triggers. Yes, triggers can be used to get all 3 kinds of changes incrementally.

f. Using Off-The-Shelf Tools

Nowadays, getting or extracting data out of a business system is getting easier. Some BI companies have written ETL packages for the most popular business systems such as SAP, Oracle applications, Siebel, PeopleSoft and J.D. Edwards. In the BI industry these ETL packages are normally called Data Integration (DI) software. We install the DI software, set some parameters, and off we go: it extracts the data from our business systems into our data warehouse. Isn't that wonderful?

Well in reality it’s not as straight forward as that. We still have to do data analysis (profiling) and data quality process. We still have to customise it a little to suit our business system and to suit our data warehouse. But we don’t start from stratch, most definitely not. Below we try to illustrate the situation by describing a few DI software that are around in the market, along with the links.

  1. IBM WebSphere Information Integration (formerly Ascential)
    DataStage extracts, transforms and loads data. ProfileStage analyses source data, creates data models, generates source-to-target mappings, and generates ETL routines. QualityStage detects data duplication, inconsistencies and missing data. Package Application Connectivity Kits (PACKS) help connect to SAP R/3 and BW, Oracle, Siebel, PeopleSoft and JD Edwards systems and extract data from them. Supports most RDBMS.
  2. Informatica PowerExchange and PowerCenter
    PowerExchange Changed Data Capture operates within the source environment to detect data updates and deliver them in real time or batch. PowerCenter transforms and loads data; it also offers cleansing and profiling. Most RDBMS are supported.
  3. BusinessObjects Data Integrator
    Defines data mappings and transformations. Cleanses (integration with FirstLogic), previews and profiles data. Has prebuilt transformations. Creates portable ETL jobs. Integrates with SAP R/3 and BW, Oracle, Siebel, PeopleSoft and JD Edwards. Supports most RDBMS.
  4. Pervasive Data Integrator (formerly Data Junction)
    Extracts, transforms and loads data. Creates an ODS. Real-time or batch execution. Integrates with SAP, Goldmine, Solomon, Great Plains, Navision, SAS, Market Force, Salesforce, and many accounting packages: AccountMate, ACT, Champion, Platinum, Macola, Peachtree, RealWorld, CYMA, etc. Supports a comprehensive list of RDBMS.
  5. Hummingbird Genio
    Extracts, transforms and loads data. Integrates with SAP.
  6. SAS DataFlux
    Extracts, transforms and loads data. Has data profiling and data quality capabilities. Integrates with SAP R/3, Siebel, Oracle Warehouse Builder and Informatica. Supports most RDBMS.
  7. Cognos 8 BI Data Integration (formerly DecisionStream)
    Extracts, transforms and loads data. Integrates with SAP R/3 and Siebel CRM. Supports most RDBMS.
  8. DataMirror Transformation Server
    Uses database log files to capture changed data and deliver it in real time.
  9. Microsoft SQL Server 2005 Integration Services (formerly DTS)
    Extracts, transforms and loads data. Supports most RDBMS.
  10. CA Advantage Data Transformer
    Extracts, transforms and loads data. Integrates with SAP R/3. Supports most RDBMS. Platinum InfoPump, InfoHub and InfoRefiner were acquired by CA.
  11. Oracle Warehouse Builder
    Extracts, transforms and loads data. Integrates with SAP (produces ABAP code). Carleton Passport was acquired by Oracle.
  12. Embarcadero DT/Studio
    Extracts, transforms and loads data. Changed data capture delta agent. Data profiling and metadata import/export. Supports most RDBMS.
  13. Sybase Enterprise Connect
    Replication Agents for Informix, Oracle, and Adaptive Server Anywhere. Supports most RDBMS.
  14. ETI Solution
    Extracts, transforms and loads data. Integrates with Siebel, WebSphere MQ and ClickStream. Supports most RDBMS.

Ascential (now IBM) and Informatica are probably the most popular data extraction tools in the BI market nowadays. BusinessObjects, Genio and DataFlux are decent, but my personal preference is SSIS. Informatica, DataMirror and Oracle CDC (Changed Data Capture) can be life savers in situations where we must extract incrementally. Alkis Simitsis compiled a comprehensive list of ETL tools, which is worth a look if you want to explore some of the niche players.

Vincent Rainardi
29th March 2006

Note: This article was originally posted on SQL Server Central: link.

7 February 2015

Data Scientist

Filed under: Data Science,Machine Learning — Vincent Rainardi @ 8:57 am

Some people wonder what a data scientist actually is. Put simply, a data scientist is an expert in:

  1. Statistical analysis
  2. Data mining
  3. Big data
  4. Data visualisation

In addition to the above technical skills, a data scientist usually has good business knowledge in one or two sectors, such as banking, insurance, investment, oil or biotech.

In this article, I will explain each of the 4 technical areas, and then pick an example.

Statistical Analysis

I consider the following statistical analysis skills to be important to have for a data scientist:

  1. Generalised Linear Model (Bernoulli distribution, Bayesian methods, Gaussian regressor)
  2. Cluster analysis (k-means, Expectation Maximisation, fuzzy clustering)
  3. Goodness of Fit Tests (Kolmogorov Smirnov, Pearson’s Chi2, Anderson-Darling)
  4. Bayesian Statistics (Naïve Bayes Classifier, Bayesian Network, Hidden Markov Models)
  5. Factor Analysis (Maximum Likelihood Extraction, Principal Component Extraction)
  6. Time Series, Matrix computations, Latent Dirichlet Allocation

For many candidates, this first subject (Statistical Analysis) is the biggest stumbling block. The topics above are advanced statistics, some of them at PhD level. Very few people in IT know them because they are usually taught at university to those who study mathematics, physics or engineering.

As with anything else, statistical analysis requires the use of software tools. The most widely used statistical analysis tools are SPSS and SAS; these two are the de facto industry standard. Matlab has a special place: it is not as user friendly or as comprehensive as SPSS and SAS, but it can be programmed to do anything and is extremely flexible for any kind of statistical processing.

Statistical analysis also requires good programming skills, particularly R and S. Knowledge of parallel processing and multithreading is also useful. R is the de facto standard language for statistical computing, data analysis and data mining. R is derived from S, with lexical scoping semantics added to it, so S provides a good foundation but is practically no longer used. Some of the popular IDEs for writing R are RStudio, Rattle GUI and RWeka.

Data Mining

The reason data mining is required is for predictive analysis and forecasting the future, as well as descriptive analysis (explaining the past). For this you need to be able to build data mining models, train them and use them for forecasting. Data mining requires a strong mathematical foundation, covering techniques such as clustering, regression and neural networks. It also requires knowledge of specific tools such as SAS and Analysis Services.

Data mining requires knowledge of data warehousing and BI, because data mining can only uncover patterns actually present in the data. As the data is usually a very large set, it is commonly stored in a data warehouse and undergoes data warehousing processes such as data integration, denormalised modelling, data quality improvement and data cleansing.

Data mining requires business knowledge, such as Customer Relationship Management, market basket analysis, credit card fraud, insurance claims, marine underwriting, credit risk and FX trading. Without the business knowledge, it is impossible to create a good mining model.

Big Data

A data scientist needs to know about big data because, increasingly, more data is stored in big data architectures, i.e. Hadoop, HBase, MapReduce, EC2/S3. They do not need to know Pig, Oozie, Lucene, Flume or Sqoop in detail, but they do need experience with the platform their company uses, such as Hortonworks, Cloudera, BigInsights (IBM) or HDInsight (Microsoft). These platforms are fully equipped with all the tools needed to load and process data in Hadoop: data access layers, streaming, query languages, security, scheduling and governance, all rolled into an enterprise-ready platform.

A data scientist may not write complex MapReduce transformations in Pig Latin and extend the functions using Python or Ruby (that is for the programmer to do), but they do need to understand the basic concepts. For instance, how the Reduce job combines the output tuples from a Map into a smaller set of tuples, what keys and values are in Hadoop, why you need a Shuffle between a Map and a Reduce, and so on. Whether you use Hortonworks, BigInsights or HDInsight, the implementations differ between companies, but the core concepts are always the same.

Using a big data platform such as BigInsights enables data scientists to do data discovery and visualisation. It comes with advanced text analytics tools, machine learning analytics, large scale indexing, adaptive MapReduce, compression, security and stream analytics. Without knowledge of such a platform, a data scientist limits their capability to process the data.

Data Visualisation

It’s the buzz word for BI tool. Tableau, QlikView and Spotfire are the most popular. IBM has five: Cognos Insight, Cognos Express, SPSS Analytic Catalyst, SPSS Visualisation Designer and Watson Analytics. SAP has Lumira. SAS has Visual Analytics and Visual Statistics. And there are tons of other tools: Dundas, iDashboards, Datazen,  MicroStategy, Birst, Roambi, Klipfolio, Inetsoft, to name a few.

A good data scientist must have experience creating visualisations using one or two of the popular tools above. These tools are not difficult to use compared to programming in Python or Ruby, for example, or even compared to statistical analysis. Within 2 weeks you will be able to grasp the basics, and within a month you would be able to use them fluently. They are very user friendly, with highly visual GUIs (point and click, drag and drop, that sort of thing).

Example

One of the most famous teams of data scientists is AIG's. As the Wall Street Journal reported (link): “AIG analyzes reams of data, from massive databases to handwritten notes from claim adjusters, to identify potentially fraudulent activity. It produces a list of around 100 claims to investigate, ranked highest priority to lowest. It also produces charts and other visualizations, like heat maps, that helps investigators understand why machine learning algorithms make the connections they do.”

Jobs and Pay

I recently saw a job advert for a data scientist at £800 per day. Admittedly this is the highest I have ever seen. I believe the normal range is £400 to £600 per day, depending on the industry sector: banking pays the highest, insurance or pharma probably second, and the public sector including the NHS the lowest.

  • ITJobsWatch (link) reported that the average salary for a permanent position is £57,500, whereas for a contract position it is £450/day.
  • Glassdoor (link) reported that the national average for a data scientist is £40,000.
  • PayScale (link) reported that the median is £39,691.

It is impossible to find all 4 skills in 1 person, so in many companies this role is carved up into two or three positions: a big data guy, a visualisation developer, and a statistical analyst. In some companies you don't need to know any statistics to be called a Data Scientist; you just need to know SQL. They call the SQL database guy a Data Scientist just to make the role more attractive and look good. Ten years ago the "SQL guy" used to be called a "report developer", but nowadays it can be (mis)labelled as "data scientist".

 
