After we migrate the new ETL code into Production, we may need to load historical data, load past transactions, update/insert dimension rows or update some attributes. In data warehousing this is known as initial data load (IDL). IDL doesn’t only happen when the a DW goes live for the first time, but every single time a new ETL code goes to Production, we may need to do IDL.
In data warehousing projects, people tend to underestimate the effort and time involved in the initial load. In this article I am going to explain why Initial Data Load (IDL) could take a significant effort.
How big are we talking? I’ve seen projects where they estimated that the IDL would be taking 2 man-days effort, 3 days elapsed time, but in reality it took 20 man-days (10 times bigger than the estimate), 2 months elapsed time (20 times longer than the estimate). If we are not careful, IDL could blow the project budget and ruin the whole project.
The followings are things which need to be prepared for an IDL:
- Need to Write IDL Scripts
- Need to Prepare the Data to be Loaded
- Need to Test in Production-Like Environment
- Need to Prepare the Required Capacity in Production
- Need to Schedule Production Down Time
- Need to Schedule Who’s Going to Run it in Production
- Need to Schedule Who’s Going to Sign it off in Production
1. Need to Write IDL Scripts
We may need to load fact rows. We may need to load dimension rows. Or we may need to update the dim/fact table(s) with SQL scripts. These scripts need to be designed and confirmed with business, coded, tested in Dev and run in SIT/UAT environment using SIT/UAT data, validated/checked. When it’s all good, we run it again in SIT/UAT env, using real Production data.
You should never run a SQL Script in Production unless that script has been tested/run with real Production data in SIT/UAT environment. It is a suicide and you could be paying with very high price. The more complex the scripts, the more mandatory this “test with production data” becomes.
Because of this “test with production data” task, the development effort of the IDL script could easily increases by 10 times or even 20 times.
Typical IDL scripts are:
a) New attribute: set all historical values to a constant/fixed value
b) New attribute: set all historical values based on a file containing 2 columns: the business key and the attribute value.
c) New attribute: like b) but with certain business rules for type 2 attribute, i.e. only look at the effective date and ignore the expiry date, etc.
d) Change the data type of an attribute/measure: recalculate the past version of the attribute/measure to a higher precision.
e) New measure: Load the past values from files (file name contains dates).
Basically every time we create a new attribute or measure in an existing dim or fact table, we need to ask ourselves the important question: what should be the values of this attribute (measure) for past records?
2. Need to Prepare the Data to be Loaded
In the case of 1b, 1c and 1e above, we need to prepare the data files to be loaded into the data warehouse. The preparation of these data files could take significant effort and/or duration.
a) The person assigned to create the files may not have access to the production data. Getting access could take a week or two, and delay the process. The production data may contain confidential data and the developer may need to be cleared first (for example: in an investment bank where the FSA have public/private separation due to the Chinese Wall). The developer might be a contractor (rather than a permanent employee) who in big financial companies may have restricted access to production data.
b) The data warehouse team may require SIT data files in addition to the Production data files. And they may require the SIT data files to contain “boundary values”, “unknown values”, and “exception values”, which requires more effort to produce.
c) We/you may not have access to the source system and have to request the source system team to produce the data. And the source system team may not have the data for certain time period, forcing you to go back to the drawing board. Or, if they do have the data, they may not have the resource to do it until next month, which will delay your DW project.
d) To extract the data from the Prod source system, you may have to prepare a SQL script or an ETL package. This script/package will need to be tested first in SIT or UAT, before it is run in Production. This takes time. Do not run any Script or Package in Production without testing it successfully first in test env (SIT/UAT). If your company has a good IT organisation, there will be processes in place to ensure that all scripts/packages are tested and signed off first before they are run in Production. And if it causes production issue, then the person who signed it off will be brought responsible.
Because of all these points above, what you thought to be “ah it only takes a day to create that IDL data file” may take weeks of elapsed time and efforts, causing significant project delays.
3. Need to Test in Production-Like Environment
What I meant with this is that we need to test the IDL script or ETL package which updates the production data in Production-Like Environment. Let’s call this environment “IDL Test Env” (ITE). Say your production ETL code is version 3.5 but your SIT is version 3.6. Obviously you cannot run/test the IDL script/package in the SIT env. You need to create an ITE containing prod data and version 3.5 of the ETL code. If you ran it in SIT and it ran successfully, it didn’t mean anything. It might fail in Production because Prod is version 3.5 whereas SIT is version 3.6.
Preparing ITE with version 3.5 of ETL may take weeks, and again this could cause significant delay to the project. The DBA need to prepare DW tables which is exactly like Prod. The infrastructure people need to put the correct version of the ETL code, and they need to “dry run” the ITE first. Then after that they need to reset the ITE, ready for the real test.
4. Need to Prepare the Required Capacity in Production
Say next week you are going live for the first time (say it’s Tuesday 1/8/12), and as part of the business requirement you need to load last 2 years data into all fact tables. You need to calculate how much space is required to do IDL for this 2 years data. Is it 500 GB? Is it 1 TB? You then need to discuss with DBA/infrastructure team about this data volume you are about to load into Production environment.
The infrastructure team may say to you that they need to order more disks and it requires a few weeks. To avoid this disaster, it’s best to actively manage the data warehouse capacity as routine, regular exercise, not just when we do IDL. For example:
As you can see above, the infrastructure or DW manager is expanding the DW capacity every January, to 1400 GB of available space. The average consumption is about 50 GB / month. In Dec 2010, the available DW space was 420 GB and in Dec 2011 it was 610 GB. The go live in Aug 2012 bring down the available DW space down to 600 and it’s still 4 months to go.
The infrastructure / DW manager would not be happy, because this means that by Dec 2012 the available space in the DW would only be 600 – 4×50 = 200 GB. Meaning that a) there’s no room for another initiative, and b) there is a risk of running out of disk space. So the manager would need to order extra disk space asap (mid year) breaking the normal schedule of January.
So, tell whomever managing the DW capacity about your estimate of the data volume (500 GB in the above case), way before you do it for real in Production so that they can be prepare it in advance and manage the capacity properly.
5. Need to Schedule Production Down Time
When we load data into Production DW, often we need to have extra downtime. For example, if we schedule the IDL on Saturday morning to Sunday night, we may need to reserve Monday morning as a precaution in case the IDL overrun. If the IDL is estimated to take 4 days, we may want to break it into 2 weekends.
Some IDLs need to be run mid-week, particularly when the DW is already live. In this case we may need to schedule production down time. For example, you need to load new data on Wed next week, and you send out downtime notification to affected users this Friday. If you use materialised views, it is not downtime, but possibility of downtime. But if you don’t use MVs, then you’ll need to revoke all user access and have DW/BI production downtime.
To that end, it is important to make the IDL script “runnable bit by bit”. We need to run the IDL script bit by bit, not all in one go. For example if you need to load 25 months of data, load just 1 month first and check how long 1 month load takes. If it takes 1 hour, then load 3 months data. This 3 months data can take 5 hours or 2 hours (not linear) and you need to adjust the timing for the subsequent loads accordingly. Loading 25 months’ worth of data in one go is risking for disaster.
6. Need to Schedule Who’s Going to Run it in Production
Who will run the IDL script? Is it the DBA? Is it the Developer? In large corporations it’s usually the Production DBA, with the developer on the stand-by. In small companies it’s usually the developer. Whoever it is, it is usually out of office hours, i.e. week end or at night. The DW manager needs to arrange in lieu or overtime for the DBA/developer.
7. Need to Schedule Who’s Going to Sign it off in Production
In large corporations they usually have a system that monitors the production deployment. The deployment steps, the back out steps, the contact persons are all detailed there. Good corporations also require the proof of testing and the sign off document. Before they can schedule production deployment, they will insist on having the UAT sign off.
This sign off could be politically risky, particularly for large deployment. The person signing of the UAT will be a hot target to blame if any functional error is found in production. “Why did you miss that important piece of functionality?” is the question they need to face. Hence as preparation of the IDL, we need to clearly indicate who will be signing off the UAT.
After the UAT sign off is obtained, production support will need to sign off the production deployment. He or she understands the least risky time for deployment. For example, it is the development team’s wish (and the business’ wish) to run the IDL on Wednesday, but if the Production Support team decides that it is too risky, they could veto it (for the right reasons) and insist of running it on the following weekend instead.