One of the critical activity in a data warehouse project (and in any data project) is data sourcing. The business users have a few reports which need to be produced from the data warehouse. There is no data for that yet in the data warehouse, so we look at the report and ask ourselves: “Where can I get the data from to produce this report?” The plan is to find the data source, and then bring the data into the data warehouse.
There are 6 steps of data sourcing:
- Find the source
- Reproduce the numbers
- Verify the extraction
- Check the coverage
- Check the timing
- Check the cost
Step 1. Finding the source
We find out how the report was created, who created it, and from where. The report could be a PDF which is created from PowerPoint, which was created from an Excel spreadsheet, which was created from a file, which was exported from a system. It is this system that we need to check. Find the tables, the columns and the rows where the data is coming from.
Step 2. Reproduce the numbers
Once we have located the tables, we then try to reproduce the numbers we see on the report. For example, the report could be like this:
Try to reproduce the first number (24.3%) by querying the tables which you think the data is sourced from. In doing this you will need to look at the data processing happening in the Excel spreadsheet. Do they exclude anything? Is it a straight forward sum of the holdings in government bonds? Do they include supranational? Government agencies? Regional? Municipal? Emerging Market Debt?
If you can match 24.3% then great. If not, investigate if the data is manually overridden before it was summed up. For example, in Excel, there might be a particular bond which was manually classified as Government even though the asset class is Corporate, and this is because it was 80% owned by the government, or because it was backed by the government.
We need to particularly careful with regards to the totals. For example, on the “Allocation by Country of Risk”, if the total portfolio value is $200m, but they exclude FX forwards/swaps, or bond futures, or certain derivatives, then the total portfolio value could decrease to $198m, and all the percentages would be incorrect (they are slightly higher).
Understanding the logic behind the report is critical in reproducing the numbers. In-depth industry knowledge will be helpful to understand the logic. For example:
- The credit rating for each debt security is categorised into IG (Investment Grade) if it is BBB- or above, and into HY (High Yield) if it is BB+ or below, or NR.
- The credit rating used for point 1 is the average between S&P, Moody’s an Fitch, except a) for own fund use look-through, b) for outside fund use IMMFA
- The “Allocation by Country of Risk” table excludes cash and FX forwards/swaps.
- Each country is categorised into Developed, Emerging or Frontier market.
- When determining the country of risk in point 4, for derivative use the underlying.
In the above case, if we are not familiar with the investment banking industry, it would take a long time for us to understand the logic. So, yes, when doing Data Sourcing, it is best if it is done by a Business Analyst with good knowledge & experience in that industry sector.
Step 3. Verify the extraction
Once we can reproduce the numbers, we need to verify if we can get the data out. A few gotchas are:
- The numbers are calculated on-the-fly by the system, and not stored anywhere in the database. If this is the case, find out from the vendor if they have an export utility which produces a file after the numbers have been calculated.
- Are we allowed to connect to that database and query it? Do not assume that we can, because I’ve encountered a few cases that we are not allowed to do that. It could be because of the system work load / performance (it is a critical transaction system and they don’t want any big query ruining the front end users), or it could be because they have provided daily extract files which all downstream systems must use (instead of querying the database directly). From the system admin point of view, it makes sense not to allow any external query runs on the database, because we don’t know what kind of damage those external queries can cause, it could block the front end queries and causing a lock.
- Loosing precision, i.e. the data must be exported from the database but during the export process the precision decreases from 8 decimal places to 2 decimal places.
- There is a security restriction because the it is against the “chinese wall” compliance rules (in investment banking, the public-facing departments must not get data from the M&A department)
- The system is being migrated, or rewritten, so it is still in a state of flux and we need to wait a few months.
- The system is not a solid “production quality”, but only a “thrown-away”, which means that within a few months they could be dropping those tables.
Step 4. Check the coverage
This step is often missed by many business analysts. We need to check if all the products that we need is available in that system. If the report we are trying to reproduce is reporting 4000 products from 100 branches, but the source system/tables only covers 3000 products from 70 stores, than we’ve got to find out where the other 1000 products and 30 stores are sourced from. Are they produced from a different system.
Not only product and stores. We need to check the coverage in terms of: customer/clients, portfolios, securities, line of business, underwriting classes, asset classes, data providers. And the most important coverage check is on dates, e.g. does the source system have data from 2011 to present? It is possible that the source system only have data from 2014.
Step 5. Check the timing
After checking the coverage, we need to check if the data is available when we need it. We need to check these 3 things: data is too late, available extraction window, the data is overwritten.
- Data is too late: If our DW load starts at 2.15 am, will the data be available before that? If not, could the business user live with a bit of stale data (data from 2 days ago, i.e. if today is Wednesday, the latest data in the DW would be Monday data).
- Available extraction window: In the production environment, when can we extract the data from that source system? If from 11pm to 6am there is an overnight batch running, and we can’t run our extract during that time, then the ealierst we can run is 7am. If the DW load takes 3 hours, DW users can access it at 10am. Is that too late for the users or not?
- The data is overwritten: the data from the source system can be updated many times during the day and when we extract it at night, we have no trace of these changes. Is that ok? Do we need intraday, push-driven data load into the DW? Or would 10 minutes data extraction frequency (pull-driven) be enough?
Step 6. Check the cost
There is no free lunch. We need to check how much it would cost us to use that source data.
- If the data is valuable (such as prices, yield and rating from Bloomberg, Reuters and Markit) we would have to pay the data providers. We need to check the cost. The cost could be per $5 security, per call, so it could easily be $20-30k per day. The cost is usually shared with other department.
- Check with the data provider, if you use the data only as an input to your calculation, and you don’t publish it / send it on to any external parties (clients, etc.), would it still cost you a lot? Even if you don’t have to pay the data providers, your DW project might still have to share the cost with other departments. Say the data provider is Markit and your company pays $300k/year for prices data and it is currently shared by 5 departments ($60k each). Your project may have to bear the cost of $50k/year ($300/6).
- The cost could be a killer to the whole thing, i.e. even if #1 to #5 above are all ok, if the cost of the data is $50k, it could force you to cancel the whole project.
- Sometimes other department has to create the data for you. Let’s day yield calculation, or risk engine, or OTC pricing engine, and the requirement from the BI/DW is specific so they have to develop it. It could take them 3 months x 3 people and they could cross charge your project $50k (one off). And that could also be a killer to the DW project.
- Developing interface: some systems do not allow external system to pull the data out. They insist to develop an export, and charge the cost to your DW project.
- Standard data interface system: some large companies (such as multinational banks) have standard interface (real time, end of day, etc.), and the central middle ware team might charge your DW project some low amount (say $2000 one off + $50/month) to use that standard data interface system. Say you need FX rate data from FX system, and there is already a standard message queue for FX rates with publication time of 11am, 1pm and 3pm. So you “subscribe” to this publication MQ and pay the cost (project cross charge).