Data Warehousing and Data Science

26 February 2016

Investment Performance

Filed under: Business Knowledge — Vincent Rainardi @ 5:24 am

One of the fundamental functions of a data warehouse in an investment company (wealth management, asset management, brokerage firm, hedge funds, and investment banking) is to explain the performance of the investments.

If in Jan 2015 we invest $100m and a year later it becomes $112m, we need to explain where this $12m came from. Is it because 40% of it was invested in emerging markets? Is it because 30% of it was invested in credit derivatives and 50% in equities? Is it because of three particular stocks? Which period contributed the most to the performance: Q3 or Q4?

Imagine that we invested this $100m in 100 different shares, $1m each. These 100 shares are in 15 different countries, e.g. US, UK, France, Canada, India, China, Indonesia, Mexico, etc. They are in 10 different currencies, e.g. USD, CAD, EUR, CNY, MXN, IDR, etc. They are in 15 different sectors, e.g. pharmaceutical, banking, telecommunication, retail, property, mining, etc. They have different P/E multiples, e.g. 8x, 12x, 15x, 17x, and different market capitalisations, i.e. small cap, mid cap and large cap. And we have 50 portfolios like that, each for a different client; some are for open funds. In the case of a fixed income investment (bonds), there are other attributes such as credit rating (AAA, A, BBB, etc.), maturity profile (0-1 year, 1-3 years, 3-5 years, etc.), and asset class, e.g. FRN, MBS, Gilt, Corporate, CDS, etc.

Every day we value each of these 100 shares, taking the closing prices from the stock exchanges (via Bloomberg EOD data, Thomson Reuters, or other providers) and multiplying them by the number of shares we hold. So we have the value of each holding for every single working day. They are in different currencies of course, but we use the closing FX rate to convert them to the base currency of the portfolio.
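The daily valuation step above can be sketched as follows. This is a minimal illustration, not a real pricing feed; the tickers, quantities, prices and FX rates are all made up.

```python
# Daily valuation sketch: value each holding at its closing price,
# then convert to the portfolio's base currency using the closing FX rate.
# All names and numbers below are illustrative.

def value_position(quantity, close_price, fx_rate_to_base):
    """Market value of one holding, in the portfolio's base currency."""
    return quantity * close_price * fx_rate_to_base

holdings = [
    {"ticker": "ACME US", "qty": 10_000, "close": 52.30, "fx": 1.0},   # already in base ccy
    {"ticker": "BILT LN", "qty": 8_000,  "close": 7.15,  "fx": 1.42},  # GBP converted to base
]

portfolio_value = sum(
    value_position(h["qty"], h["close"], h["fx"]) for h in holdings
)
print(round(portfolio_value, 2))  # 604224.0
```

Run this for every position, every portfolio, every working day, and you have the core position fact table described later in this post.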

A portfolio has a benchmark. The performance of the portfolio is compared to the performance of the benchmark. A portfolio manager is measured against how well they can outperform the benchmark.

The mathematics of performance attribution against the benchmark is explained well on Wikipedia's performance attribution page. That is the basic idea. Now we need to do the same thing, but not just on 2 rows: on 100 rows. Not just on asset allocation and stock selection, but also on all the other factors above. Not only against the benchmark, but also comparing the same portfolios month-to-month, or quarter-to-quarter.
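The attribution calculation generalises from 2 rows to N rows straightforwardly. Below is a sketch of the standard Brinson-Hood-Beebower decomposition (allocation, selection and interaction effects); the weights and returns are made-up figures.

```python
# Brinson-style attribution sketch, for N sectors at once.
# The three per-sector effects always sum to the active return.

def attribution(weights_p, returns_p, weights_b, returns_b):
    """Per-sector (allocation, selection, interaction) effects, BHB model."""
    effects = []
    for wp, rp, wb, rb in zip(weights_p, returns_p, weights_b, returns_b):
        allocation  = (wp - wb) * rb
        selection   = wb * (rp - rb)
        interaction = (wp - wb) * (rp - rb)
        effects.append((allocation, selection, interaction))
    return effects

wp = [0.6, 0.4]; rp = [0.05, 0.02]   # portfolio weights and returns (illustrative)
wb = [0.5, 0.5]; rb = [0.04, 0.03]   # benchmark weights and returns (illustrative)

effects = attribution(wp, rp, wb, rb)
active_return = sum(w * r for w, r in zip(wp, rp)) - sum(w * r for w, r in zip(wb, rb))
# sanity check: the effects reconcile exactly to the active return
assert abs(sum(sum(e) for e in effects) - active_return) < 1e-12
```

In the warehouse this runs not on 2 sectors but on every currency, country, sector and rating bucket, for every period.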

The resulting data is a powerful tool for a portfolio manager, because they can understand what caused the outperformance, and more importantly, what caused the underperformance, against the benchmark and between time points.

This month we beat the benchmark by 1%. That's good, but what caused it? Why? It is important to know. This month we grew by 1%. That is good, but what caused it? This month we are down 2%. We obviously need to know why. Our client would demand an explanation of why their money, which was $100m, is now $98m.

That would be a good reason for having a data warehouse. The value of each and every position*, from each and every portfolio, for each and every working day, is stored in the data warehouse. And then on top of it, we apply mathematical calculations to find out what caused the ups and downs, not only at portfolio/fund level, but for each currency, country, industry sector, etc., for any given day, week, month and quarter. That is worth paying $500k to develop. We in the BI industry may call it a BI tool, but from the portfolio manager's point of view it is an investment analytics system.

*Position: a position is a financial instrument that we purchased and now hold in our portfolio, for example a bond, a share, or a derivative. In addition, we also have cash positions, i.e. the amount of money we have with the broker/custodian, as well as MTM margins, repos, FX forwards, and money market instruments such as cash funds. A position is time-valued, i.e. its value depends on time.

This tool enables the portfolio managers (PMs) in an investment company not only to know the breakdown of their portfolios on any given day, but also how each section of the portfolio moved from month to month and day to day. In addition, if we also put risk measures in there, such as stresses, risk analytics, or SRIs* (I'll explain all 3 in the next paragraphs), the PMs (and their financial analysts) will be able to know the individual risk type for each section of the portfolio, on any given date, and how those risks moved from month to month and day to day.

*Stresses, risk analytics and SRIs: a stress is a scenario that we apply to all positions in the portfolio. For example, what if the interest rate is up by 0.1%? By 0.25%? By 1%? What if the FX rate is down by 0.1%? By 1%? And also other factors, such as the oil price, the inflation rate, equity prices, etc. We can also apply an "event", i.e. on September 11, the S&P moved by X%, the FTSE 100 by X%, Gilts by X%, EMD bonds by X%, and FX rates by X%. There are also other historical dates when the market went south. If we apply those "events" to our portfolios, what happens to the value of each position? To the value of the overall portfolio? To the value of each section of the portfolio, e.g. Asia stocks, EM, or small caps?
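Mechanically, a stress of this kind can be sketched as a set of factor shocks applied to per-position sensitivities. This is a simplified first-order illustration; the factor names, sensitivities and shock sizes are invented.

```python
# Stress scenario sketch: each position carries a sensitivity per risk
# factor, and a scenario is a set of factor shocks. Estimated P&L per
# position is the sum of (sensitivity x shock). Illustrative numbers only.

def apply_scenario(positions, scenario):
    """Return the estimated P&L per position under the given factor shocks."""
    pnl = {}
    for name, pos in positions.items():
        pnl[name] = sum(
            pos["sensitivities"].get(factor, 0.0) * shock
            for factor, shock in scenario.items()
        )
    return pnl

positions = {
    "UKT 4.25 2032": {"sensitivities": {"rates_+1bp": -8_000}},
    "FTSE 100 fund": {"sensitivities": {"equity_-1pct": -120_000}},
}
# "rates up 25bp, equities down 3%", expressed in the factors' own units:
scenario = {"rates_+1bp": 25, "equity_-1pct": 3}

pnl = apply_scenario(positions, scenario)
```

Aggregating `pnl` by currency, country or sector gives the "value of each section of the portfolio" under the scenario.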

Risk analytics are numbers which reflect a certain risk to an investment portfolio. For example, duration reflects how much each position will be impacted by an interest rate rise. For fixed income the risk analytics are: PV01, DV01, IE01, CR01, duration (modified duration, spread modified duration, effective duration), credit spread (spread to Libor, spread to benchmark security, spread to Treasury), yield (yield to maturity, effective yield, yield to call, yield to put, yield to worst, running yield, simple yield), and convexity (how sensitive the duration is to changes in interest rates). For an equity portfolio we have different risk analytics (they are mainly financial ratios of the issuer).
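As a worked illustration of how duration and convexity are used, here is the textbook second-order estimate of a bond's price change for a given yield move. The duration and convexity figures are made up.

```python
# Standard second-order bond price approximation:
# % price change ~ -modified_duration * dy + 0.5 * convexity * dy^2

def price_change_pct(mod_duration, convexity, dy):
    """Approximate fractional price change of a bond for a yield change dy
    (dy as a decimal, e.g. 0.01 for 100bp)."""
    return -mod_duration * dy + 0.5 * convexity * dy**2

# e.g. a bond with modified duration 7.5 and convexity 80,
# for a 100bp rise in yields:
change = price_change_pct(7.5, 80.0, 0.01)  # -0.071, i.e. about -7.1%
```

The convexity term is why the loss (-7.1%) is slightly smaller than duration alone would suggest (-7.5%).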

SRI means socially responsible investing. The theory is that the value of a share (or a bond, or a derivative) is affected by how much the company cares about the environment, how well the company (or group of companies) is governed/managed, how much the company promotes human rights and social justice, and how much it avoids alcohol, gambling and tobacco. Some data providers such as Barclays, MSCI and Verisk Maplecroft provide this SRI data in the form of scores, ratings and indices in each area.

The PMs will be able to know each of the 3 risk categories above (stresses, analytics and SRIs) for each position within their portfolio, on any given day, and how those risks moved from month to month and day to day. That is a very powerful tool, and is worth creating. And that is one reason why we create a data warehouse (DW) for investment companies.

It is not only for managing performance (which is the most important thing in a PM's career) but also for managing risk. Because the DW is able to inform them how every part of their portfolios reacted to each event, each day/week/month (reacted in terms of valuation, performance, attribution and risk), the PMs will be able to tell (or at least predict) what will happen to the value, return and risk of each section of their portfolio if such an event happens again.

24 February 2016

Instrument Dimension

Filed under: Business Knowledge,Data Warehousing — Vincent Rainardi @ 6:15 pm

One of the core dimensions in investment banking is the instrument dimension. It is also known as the security dimension. It contains various types of financial instruments or securities, such as bonds, options, futures, swaps, equities, loans, deposits, and forwards.

The term “securities” used to mean only bonds, stocks and treasuries, but today it means any tradable financial instrument, including derivatives, cash, loans, and currencies. Well, tradable and “contract-able”.

Where it is used

In a data warehouse for an investment bank, a brokerage or an investment company, the instrument dimension is used in three primary places: trade fact tables, position fact tables and P&L fact tables. Trade fact tables store all the transactions made by the bank, either for a client (aka flow trading) or for the prop desk (the bank’s own money), through all their lifecycle stages: initiation, execution, confirmation, clearing, and settlement. Position fact tables store daily values of all instruments that the bank holds (long) or owes (short). P&L (profit and loss) fact tables store the daily impact of all trades and positions on each of the bank’s financial accounts, e.g. under IAS 39.

The secondary usages in an asset management or an investment banking data warehouse are risk fact tables (e.g. credit risk, market risk, counterparty risk), compliance fact tables, regulatory reporting, mark-to-market accounting, pricing, and liquidity fact tables.

The niche usages are ESG-score fact tables (aka SRI, socially responsible investing), rating transition fact tables, benchmark constituent fact tables, netting position fact tables, and collateral fact tables.

Data Structure

The business key of an instrument dimension is usually the bank-wide internal instrument identifier. Every instrument that the bank gets from market data providers such as Bloomberg, Reuters and Markit, from index constituents, and from internal OTC deals, is mastered in a waterfall process. For example, public instruments (debts, equities, ETDs) are identified using the following external instrument identifiers, in order: ISIN, Bloomberg ID (BBGID), Reuters ID (RIC), SEDOL, CUSIP, Exchange Ticker, Markit ID (RED, CLIP), Moody’s ID (Master Issue ID). Then the internal identifiers for OTCs (e.g. CDS, IRS, FX swaps), FX forwards, and cash are added.
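The waterfall can be sketched as a simple first-match lookup over the preference order just listed. The field names and sample identifiers below are illustrative.

```python
# Identifier waterfall sketch: pick the first available external identifier
# in the agreed order of preference, mirroring the order in the text.

WATERFALL = ["ISIN", "BBGID", "RIC", "SEDOL", "CUSIP",
             "TICKER", "MARKIT_ID", "MOODYS_ID"]

def master_key(identifiers):
    """Return (scheme, value) of the best available identifier, else None."""
    for scheme in WATERFALL:
        value = identifiers.get(scheme)
        if value:
            return (scheme, value)
    return None  # in practice an internal ID is minted here for OTCs, cash, etc.

# RIC outranks SEDOL in the waterfall:
key = master_key({"SEDOL": "B02J639", "RIC": "VOD.L"})
```

Instruments that fall through every rung (OTC deals, FX forwards, cash) get the internal identifier instead, as described above.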

The attributes of an instrument dimension can be categorised into 9 groups:

  1. Asset Class
  2. Currency
  3. Country
  4. Sector
  5. Issuer
  6. Rating
  7. Maturity
  8. Instrument Identifier
  9. Asset class specific attributes

1. Asset Class

Asset Class is a classification of financial instruments based on their functions and characteristics, e.g. fixed income, equities, cash, commodities. We also have real assets such as land, buildings, physical gold and oil.

It also covers the hierarchy / groupings of the asset classes, hence we have attributes such as: asset class, asset sub class, asset base class. Or alternatively asset class level 1, level 2, level 3. Or asset class, asset type, asset group.

Good starting points for asset class categorisation are the ISDA product taxonomy, Barclays index guides, Academlib option pages and Wikipedia’s derivatives page. Here is a list of popular asset classes:

FIXED INCOME: Government bond: sovereign, supranational, municipal/regional, index linked, zero coupon, emerging market sovereign, sukuk sovereign. Corporate bond: investment grade, high yield, floating rate note, convertible (including cocos), covered bond, emerging market corporate, sukuk corporate. Bond future: single name bond future, future on bond index. Bond option: single name bond option, option on bond index. Bond forward: single name bond forward, forward on bond index. Credit default swap: single name CDS, CDS index, CDS swaption, structured CDS. Asset backed security (ABS): mortgage backed security (including RMBS and CMBS), ABS (auto, credit card, etc), collateralised debt obligation (CDO), ABS index. Total Return Swap: single name TRS, TRS index. Repurchase agreement: repo, reverse repo.

EQUITY: Cash equity: common shares, preferred shares, warrant, equity index. Equity derivative: equity option (on single name and equity index), equity future (on single name, equity index, and equity basket), equity forward (on single name, equity index, and equity basket), equity swap (on single name and equity index).

CURRENCY: Cash currency: FX spot, FX forward. Currency derivative: cross currency swap.

RATES: Interest rate: interest rate swap, overnight index swap (OIS), interest rate cap, interest rate future, interest rate swaption, forward rate agreement (FRA), asset swap. Inflation rate: inflation swap, inflation swaption, inflation cap, zero-strike floors, inflation protected annuity.

COMMODITY: commodity future: energy (oil, gas, coal, electricity, wind turbine), base metal (copper, iron, aluminium, lead, zinc), precious metal (gold, silver, platinum, palladium), agriculture (grains: corn, wheat, oats, cocoa, soybeans, coffee; softs: cotton, sugar, butter, milk, orange juice; livestock: hogs, cattle, pork bellies). Commodity index (energy, metal, agriculture). Option on commodity future. Commodity forward.

REAL ASSET: Property: Agricultural land, residential property, commercial property. Art: paintings, antique art. Collectibles: fine wine, rare coins, antique cars, jewellery (including watches and precious stones).

FUND: money market fund, equity fund, bond fund, property fund, commodity fund, currency fund, infrastructure fund, multi asset fund, absolute return fund, exchange traded fund.

OTHER: Private equity. Venture capital.

Note on the difference between asset class and asset type: asset class is usually a categorisation based on market, i.e. fixed income, equity, cash, commodity and property; whereas asset type is usually a categorisation based on time and structure, e.g. spot, forward, future, swap, repo, ETD, OTC, etc.

Note on overlapping coverage: when constructing the asset class structure, we need to be careful not to make the asset classes overlap with each other. If we do have a case where an instrument can be put into two asset classes, make sure we have a convention for where to put the instrument. For example, an IRS whose legs are in different currencies is called a CCS (cross currency swap). So either we don’t have a CCS asset class and assign everything to IRS (this seems to be the more popular convention), or we do have CCS and make sure that none of the swaps with different currencies are in IRS.

2. Currency

For single-legged “hard” instruments such as bonds and equities, the currency is straightforward. For multi-legged, multi-currency instruments such as FX forwards and cross currency swaps, we have two currencies for each instrument. In this case, we either have a column called “currency pair” (value = “GBP/USD”), or two columns marked as “buy currency” and “sell currency”.

For cash instruments, the currency is the currency of the cash. For “cash like” or “cash equivalent” instruments such as CP, CoD and T-bills, the currency is straightforward, inherent in the instrument. For a multi-currency CDS index (i.e. a basket of CDSes with different currencies), look at the contractual currency of the index (in which the premium leg and protection leg are settled), not the liquid currency (the currency of the most liquidly traded CDS).

For derivatives of equities or fixed income, the currency is taken from the currency of the underlying instrument.

3. Country

Unlike currency, which is a true property of the instrument, country is a property of the issuer. There can be three different countries in the instrument dimension, particularly for equities: country of incorporation, country of risk (aka country of operation, country of domicile), and country of listing.

Country of risk is the country where, if there are significant business, political or regulatory changes in that country, they will significantly change the operation of the company which issues this security. This is the most popular one, particularly for portfolio management and the trade lifecycle. It is common for a company to operate in more than one country; in this case it is the main country (from a revenue/income point of view), or it is set to “Multi-countries”.

Country of incorporation is the country where the issuer is incorporated, not where the holding company (or the “group”) is incorporated. This is used for regulatory reporting, for example FATCA and FCA reporting.

Country of listing depends on the stock market where the equity instrument is listed. So there can be two different rows for the same instrument, because it is listed on two different stock exchanges.

The country of risk of cash is determined by the currency. In the case of euro instruments (not Eurobonds*) it is usually set to Germany, or to Euroland (not EU). *Eurobond has a different meaning: it is a bond issued in a currency other than the currency of the country where it is issued, e.g. an Indonesian government bond issued in USD.

An FX forward, which has 2 different currencies, has one country of risk, based on the fixed leg (not the floating leg), because that is where the risk is. The country of risk of a cross currency swap is also based on the fixed leg. For a floating-for-floating CCS, the convention is usually to set the country of risk to the least major currency, e.g. for USD/BRL, USD is more major than BRL, so Brazil is the country of risk. For non-deliverable forwards and CCS (meaning the payment is settled in another currency because the ND currency can’t be delivered offshore), the country of risk is set based on the settlement currency (usually USD).
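The CCS conventions in the paragraph above can be sketched as a rule function. The currency ranking and the currency-to-country mapping here are illustrative stand-ins, not complete reference data.

```python
# Country-of-risk sketch for cross currency swaps, following the
# conventions described in the text. Illustrative reference data only.

MAJOR_ORDER = ["USD", "EUR", "JPY", "GBP"]                 # most major first
CURRENCY_COUNTRY = {"USD": "US", "GBP": "UK", "BRL": "Brazil"}

def majorness(ccy):
    """Lower value = more major; unknown currencies rank least major."""
    return MAJOR_ORDER.index(ccy) if ccy in MAJOR_ORDER else len(MAJOR_ORDER)

def ccs_country_of_risk(leg1_ccy, leg2_ccy, fixed_leg_ccy=None):
    """If there is a fixed leg, its currency drives the country of risk;
    for floating-for-floating, take the least major currency."""
    if fixed_leg_ccy is not None:
        return CURRENCY_COUNTRY[fixed_leg_ccy]
    least_major = max((leg1_ccy, leg2_ccy), key=majorness)
    return CURRENCY_COUNTRY[least_major]

# USD/BRL floating-for-floating: BRL is less major, so Brazil
country = ccs_country_of_risk("USD", "BRL")
```

Encoding conventions like this as small, testable rules is how the mastering process keeps them consistent across millions of instruments.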

Like currency, the country of a derivative of an equity or fixed income instrument is taken from the country of the underlying instrument.

4. Sector

These attributes are known by many names: sector, industrial sector, industry sector, or industry. I will use the term sector here.

There can be many sector attributes in the instrument dimension, e.g. Barclays level 1/2/3, MSCI GICS (with S&P), UK SIC, International SIC, FTSE ICB, Moody’s sector classification, Factset’s sector classification, iBoxx, etc. They have different coverage: some are geared more towards equities, some more towards fixed income.

Cash instruments and currency instruments usually have either no sector (blank), or are set to “cash”. Rates instruments, commodity futures and real assets usually have no sector.

The sector of fixed income derivatives, such as options and CDSes, is determined by the sector of the underlying instrument. Ditto equity derivatives.

5. Issuer

All equity and fixed income instruments have issuers. This data is usually taken from Bloomberg, or from the index provider if the position is an index constituent.

All corporate issuers have parents. This data is called legal entity data, which can be obtained from Bloomberg, Thomson Reuters, Avox/FT, etc. From the legal entity structure (parent-child relationships between companies, or ownership/subsidiary to be more precise) we can find the parent issuer, i.e. the parent company of the issuer, and the ultimate parent, i.e. the parent of the parent of the parent (and so on up to the top) of the issuer.

Legal entity data is not only used in the instrument dimension. The main use of LE data within an investment bank is for credit risk and KYC (know your customer), i.e. customer due diligence. PS. LEI means Legal Entity Identifier, e.g. BBG Company ID, FATCA GIIN (Global Intermediary Identification Number), LSE’s IEI. But LEI also means the ROC’s Global LEI (the Regulatory Oversight Committee).

6. Rating

Like sector, there are many ratings. Yes, there are only 3 rating providers (S&P, Moody’s, and Fitch), but combined with an in-house rating, there can be 15 different permutations of them, e.g. the highest of SMF, the lowest of SMF, the second highest of SMF, the average of SMF, the highest of SM, the lowest of SM, the average of SM, the highest of SMFH, the lowest of SMFH, etc., where S = S&P, M = Moody’s, F = Fitch and H = house rating.

Plus we have Rating Watch/Outlook from the 3 providers. Plus, for CDS, we can have an “implied rating” derived from the spread (based on Markit CDS price data).
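Deriving a combined rating such as “lowest of S/M/F” can be sketched by mapping ratings onto a numeric scale. The scale below is abbreviated and the provider keys are illustrative.

```python
# Combined rating sketch. Ratings map to a numeric scale (lower = better),
# so the "lowest" (worst) combined rating is the numerically largest value.

SCALE = {"AAA": 1, "AA": 2, "A": 3, "BBB": 4, "BB": 5, "B": 6}  # abbreviated

def combined_rating(ratings, how="lowest"):
    """ratings: dict like {"S&P": "A", "Moodys": "BBB", "Fitch": "A"}.
    how: "lowest" (worst) or "highest" (best) of the available ratings."""
    known = [r for r in ratings.values() if r in SCALE]
    if not known:
        return None
    pick = max if how == "lowest" else min  # worst = biggest scale number
    return pick(known, key=SCALE.get)

worst = combined_rating({"S&P": "A", "Moodys": "BBB", "Fitch": "A"})
best  = combined_rating({"S&P": "A", "Moodys": "BBB", "Fitch": "A"}, how="highest")
```

The other permutations (second highest, average, with or without the house rating) follow the same pattern over the numeric scale.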

7. Maturity

Almost all fixed income instruments have a maturity date. Maturity is how far that maturity date is from today, stated in years rather than days. We also have effective maturity, which is the distance in time between today and the nearest call date, also in years.
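The maturity calculation can be sketched as below. The 365.25-day year is just one common convention used here for illustration; a real system would use the instrument's day count convention.

```python
# Maturity sketch: distance from the as-of date to the maturity date,
# expressed in years (365.25-day year, an illustrative convention).

from datetime import date

def maturity_in_years(maturity_date, as_of):
    return (maturity_date - as_of).days / 365.25

yrs = maturity_in_years(date(2026, 2, 24), date(2016, 2, 24))
# roughly 10.0, give or take leap-day effects
```

Because this number shrinks every single day, it belongs in a fact table rather than in the dimension, as discussed below.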

8. Instrument Identifier

This is the security identifier as explained earlier, i.e. ISIN, Bloomberg ID, Ticker, Markit ID, SEDOL, CUSIP, Reuters ID, Moody’s ID.

9. Asset Class Specific Attributes

Each asset class has its own specific attributes.

For CDS we have payment frequency (quarterly or semi-annually), standard coupon payment dates (Y/N), curve recovery (Y/N), recovery rate (e.g. 40%), spread type (e.g. conventional), restructuring clause (Old R, Mod R, Mod-Mod R, No R), fixed coupon convention (100 or 500 bps), succession event, auction settlement term, settlement type (cash/physical), trade compression, standard accruals (Y/N), and contract type (e.g. ISDA).

For IRS we have amortising swap flag, day count convention, following convention (Y/N), no adjustment flag (Y/N), cross currency flag (Y/N), buy currency, sell currency, mark-to-market flag, non-deliverable flag, settlement currency.

Debt instruments such as bonds have these specific attributes: coupon type (e.g. fixed, floating), seniority (e.g. senior, subordinated), amortising notional, zero coupon. Funds also have their own specific attributes, such as emerging market flag, launch date, accumulation/income, base currency, trustee, fund type, etc.


Granularity

The granularity of an instrument dimension can be a) one row for each instrument (this is the norm), or b) one row for each leg. The latter deals with multi-leg instruments such as CDS (3 legs) and cross currency swaps (2 legs), where the asset class of each leg is different.

If we opt for one row for each instrument, the asset class of each leg needs to be put in the position fact table (or transaction fact table, compliance fact table, risk fact table, etc.).


Coverage

There are millions of financial instruments in the market, and throughout the life of an investment bank there can be millions of OTCs created in its transactions. For investment companies, there are a lot of instruments which they held in the past, but no longer hold. Coverage of an instrument dimension means: which instruments are we going to maintain in this dimension? Is it a) everything in the market, plus all OTCs, b) only the ones we have ever used, or c) only the ones we held in the last N years?

We could set the coverage of the instrument dimension to all bonds and equities which have ever existed since 1900, but this seems a silly idea (because of the cost, and because they are not used), unless we plan to conduct specific research, e.g. analysing quality changes over a long period. The most popular convention is to store only what we have ever used.


Slowly Changing Dimension

Most instrument dimensions are type 2, but which attributes are type 2 differs from project to project and bank to bank (and non-bank). Most of the sector, rating and country attributes are type 2. Maturity date is type 2, but maturity (which is in years) is either type 1 or kicked out of this dimension into a fact table (this is the more popular option). Next coupon date and effective date are usually type 1.
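A type-2 change to an instrument attribute boils down to: expire the current row, insert a new one. The sketch below uses the system columns listed at the end of this post; the storage (a plain Python list standing in for the dimension table) is purely illustrative.

```python
# SCD type 2 sketch: on an attribute change, expire the active row for
# the instrument and append a new active row effective from the given date.

from datetime import date

def apply_scd2_change(rows, instrument_id, new_attrs, effective_date):
    """rows: list of dicts, each one version of an instrument's dimension row."""
    for row in rows:
        if row["instrument_id"] == instrument_id and row["active_flag"]:
            if all(row.get(k) == v for k, v in new_attrs.items()):
                return rows                        # nothing changed, no new version
            row["active_flag"] = False             # expire the current version
            row["expiry_date"] = effective_date
    rows.append({"instrument_id": instrument_id, **new_attrs,
                 "active_flag": True, "effective_date": effective_date,
                 "expiry_date": None})
    return rows

dim = [{"instrument_id": "XS123", "rating": "A",
        "active_flag": True, "effective_date": date(2015, 1, 1), "expiry_date": None}]
apply_scd2_change(dim, "XS123", {"rating": "BBB"}, date(2016, 2, 17))
# dim now holds two rows: the expired "A" version and the active "BBB" version
```

The same mechanism applies whether the changing attribute is a rating, a sector or a country.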

Numerical attribute

Numerical attributes such as coupon, rating factor, effective maturity, etc. are treated depending on their change frequency. If they change almost every day, they must be kicked out of the instrument dimension into a fact table. Examples: effective maturity, maturity, holding positions.

If they change a few times a year, or almost never change (static attributes), they stay in the instrument dimension, mostly as type 2. An example of a static numerical attribute is the coupon amount (e.g. 6.25%). An example of a slowly changing numerical attribute is the rating factor (e.g. 1 for AAA and 3 for A) and the associated PD (probability of default, in bps, e.g. 0.033 for AAA, 0.54 for AA, and 10 for BBB), which on average changes once every 1 to 3 years.

The difference between a numerical attribute and a fact is that a numerical attribute is a property of the instrument, whereas a fact is a measurement result.

System Columns

The usual system columns in the instrument dimension are: created datetime, last updated datetime, and the SCD system columns e.g. active flag, effective date, expiry date.

17 February 2016

The Problem with Data Quality

Filed under: Analysis Services — Vincent Rainardi @ 8:44 am

The problem with data quality is not the technicality. It is not difficult to design and build a reconciliation system which checks and validates the numbers in the data warehouse and in BI reports/analytics (or non-BI reports/analytics!).
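To illustrate, a reconciliation check of this kind can be as simple as comparing totals per slice (per portfolio, per business date, etc.) between the source system and the warehouse. The fund names and totals below are made up.

```python
# Reconciliation sketch: compare totals per slice between source and DW,
# report any breaks beyond a tolerance. Illustrative data only.

def reconcile(source_totals, dw_totals, tolerance=0.01):
    """Both arguments: dict of slice -> total. Returns a list of breaks
    as (slice, source_value, dw_value, difference) tuples."""
    breaks = []
    for key in sorted(set(source_totals) | set(dw_totals)):
        s = source_totals.get(key, 0.0)
        d = dw_totals.get(key, 0.0)
        if abs(s - d) > tolerance:
            breaks.append((key, s, d, s - d))
    return breaks

breaks = reconcile({"Fund A": 100.0, "Fund B": 250.0},
                   {"Fund A": 100.0, "Fund B": 247.5})
# one break: Fund B differs by 2.5
```

Building this is the easy part. The hard part, as the rest of this post argues, is funding the people who work through the breaks it produces.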

The problem is: once the reconciliation/validation program spits out hundreds or thousands of issues, who will be correcting them? That is the issue!

It requires someone to investigate the issues and, more importantly, fix the issues. This requires funding which can seldom be justified (it is difficult to quantify the benefits), and requires a combination of skills which rarely exists in one person. So that doubles the money, because we need to hire 2 people. The “checker” who checks the DQ reports is largely an operational application support type of person, whilst the “fixer” needs to have a detective mindset and development skills. To make matters worse, these development skills are usually platform specific, i.e. .NET, Oracle, SSIS, Business Objects, etc.

Assuming a £40k salary in the UK, then adding NI, pension, software, desk, training, bonus, insurance, consumables, appraisal, and payroll costs (a total of £15k), and multiplying by 2 people, it is a £110k/year operation. Adding half the time of a manager (£60k salary + £20k costs), it is a £150k/year operation.

It is usually difficult to find a benefit of a data quality program bigger than £100k. The build cost of a DQ program can be included in the application development cost (i.e. data validation, data reconciliation, automated checks, etc.), but the operational cost is an issue.

So the fundamental issue is not actually finding a person, or a team of people. The fundamental issue is getting the funding to pay these people. The old adage in IT is usually true: anything is possible in IT, provided we have the funding and the time.

The benefit can’t come from FTE reduction (full-time equivalent, meaning headcount), because it is not a reduction of workload (unless a manual DQ process is already in place, of course). And it doesn’t come from increased sales or revenue either. Try to find a link between better DQ and increased revenue, and you’ll find that it is hard. And we know that headcount reduction and revenue increase are the two major factors for funding an activity within a company.

3 factors that drive data quality work

But fortunately there are two other factors that we can use: compliance and risk.

Compliance in the financial services industry, healthcare industry, etc. requires reports to the regulators in a timely manner and with good accuracy. That drives data quality work. For example, if we report that the credit derivative loss position is $1.6bn whereas actually it is $2.1bn, we could be risking a penalty/fine of several million dollars.

Risk: there are other risks apart from compliance, namely credit risk, interest rate risk, counterparty risk, etc. Different industries have different risks of course, with financial services probably having the largest monetary amounts, but they all drive data quality work. If the data quality is low, we risk misstating the risk amount, and that could cost the company a fortune.

The 3rd factor to use is the data warehouse itself. If your company stores a lot of data in one place, such as a data warehouse, and the data quality is low, then all that investment is wasted. A £600k DW warrants a £50k DQ programme. And if your DW has been there for 4 years, the actual cost (development + operation) could easily exceed the £1m mark. A general 10% ratio yields a £100k investment in DQ work.

The key argument to use with regard to a DW’s DQ is “DW usage”. A centralised data store such as a DW is likely to be used across many applications/teams/business lines. Each of these applications/businesses is at risk of having operational issues if the data in the DW is incorrect. And if we don’t monitor the DQ, we can be sure that the data will be incorrect. That is quite an argument for a data quality project.

15 February 2016


Filed under: Analysis Services — Vincent Rainardi @ 8:36 am

It’s been about 6 months since I came across an automated data warehouse builder software called Effektor, from Copenhagen. I don’t remember exactly when, but I think I got it from SQLBits. The product is best in class, better than the other 4 DW automation products I know (WhereScape, Kalido, Insource and Dimodelo). Effektor can generate the DW tables and the ETL packages in SSIS, and it can also create MDM, SSRS reports, balanced scorecards (BSC), and SSAS cubes. None of the other 4 products creates MDM, SSRS, BSC or SSAS cubes, as far as I’m aware.

It is 5 years old (2010) and has 40 customers, but it only runs on SQL Server, not any other RDBMS. It runs in Azure, and it runs on the Standard Edition of SQL Server (2008 R2 to 2014), as well as other editions.

All automated DW builder software saves costs, as I explained in my Dimodelo article. But Effektor goes further. It doesn’t only create the ETL packages, but also the cube, the MDM, the scorecard and the reports. The integration with RS amazed me because, unlike SSIS and SSAS, SSRS does not have an API like AMO, so the SSRS XML files need to be created manually.

Now (version 6.3) it also has a Web API, i.e. we can control the DW sync, data import, DW load and OLAP security from PowerShell via the Web API. SSRS usage from the portal is logged, so we know who uses which report and when.

The only negative side I can think of is financial strength. As I said in the Dimodelo article, and in the Choosing an ETL Tool article, there are 3 factors which come above functionality: the price, financial strength and infrastructure. I think Effektor will satisfy the price and infrastructure aspects for most companies (companies who are a SQL Server shop, that is), but it is the financial strength which is a question mark. Will the company still be there in 5 years’ time, or will it be a takeover target for a bigger company, with the product diminished? I.e. they do the takeover to get the customers, not the product.

At the moment I don’t see that, because it only has 40 customers and an excellent product, so it would be crazy for, say, Microsoft to be interested in Effektor just to get its customers. On the contrary, if a big company is interested in buying Effektor, it must be because of its excellent product. So this actually plays out better for the existing customers, because they will get better support from that bigger company, and the product development team will get better funding as the product is marketed to the bigger company’s existing customers. The only drawback is that the bigger company might increase the licensing cost (to increase profitability for more demanding shareholders, and to cover a more aggressive marketing plan).

Disclaimer: I don’t receive any rewards or incentives, financially or otherwise, from Effektor or any of its partners or competitors, in writing this article.


Data Interface (How to Manage a DW Project)

Filed under: Analysis Services — Vincent Rainardi @ 5:54 am

One of the most challenging issues in managing a data warehouse project is that the report/analytics development can’t start before the ETL is complete and has populated the warehouse tables with data.

This is traditionally solved by using a “data interface”, i.e. the fact & dim tables in the warehouse are created and populated manually with minimum data, in order for the report/cube developer to start their work.

But the issues when doing that are:

  • It takes a long time and a large effort to create that data
  • The manually created data does not reflect real situations
  • The manually created data does not cover many scenarios

A data interface is an agreement between the report-building part and the ETL part of a DW project that specifies what the data will look like. A data interface consists of two essential parts:

  • The data structure
  • The data content (aka data values)

The data structure takes a long time to create, because it requires two inputs:

  • The requirements, which determine what fields are required in the DW
  • The source data, which determines the data types of those fields

Hence the project plan for a traditional data warehouse or data mart project looks like this (let's assume the project starts in January and finishes in 12 months, in December, and we have 1 ETL developer, 1 report/analytics developer, 1 data architect, 1 BA and 1 PM):

  • Jan: Inception, who: PM, task: produce business case and get it approved to get funding.
  • Feb-Apr: Requirement Analysis, who: BA, task: create functional requirements.
  • Apr-May: Design, who: DA, task: create data model and ETL specs.
  • Jun-Jul: Design, who: DA, task: report & cube specs
  • Jun-Aug: Build, who: ETL Dev, task: create ETL packages
  • Aug-Oct: Build, who: Report Dev, task: create report & cubes
  • Sep: Test, who: BA & ETL Dev, task: test ETL packages and fix
  • Nov: Test, who: BA & Report Dev, task: test report & cubes and fix
  • Dec: Implementation, who: BA, DA & Dev, task: resolve production issues

The above looks good, but it is actually not good from a resourcing point of view. There are a lot of empty pockets burning budget while people sit idle. There are 27 empty boxes (man-months) in the resource chart below, which is 37.5% of the total 72 man-months.

Original plan

The report developer starts in August. If we could make the report developer start in June, at the same time as the ETL developer, we would be able to shorten the project by 2 months or so. But how can we do that?

The answer is the data interface. The DA creates the DW tables and populates them with real data from the source system. This is quicker than trying to manually create the data, and the data reflects real situations, covering many scenarios. Real scenarios (the ones which are likely to happen) with realistic data, not made-up scenarios with unrealistic data which are unlikely to happen.

Using the populated DW tables (facts and dims), with real data, the report developer will be able to create the reports and the BA will be able to test/verify the numbers in the reports, at the same time as the ETL is being developed to populate the tables. We are removing the dependency between the report development and the ETL development, which is a crucial link in the project that prolongs the project duration.

The resource chart now looks like this, 2 months quicker than the original plan:

10 months

The value of these 2 months is approx.: 2 months x 5 people x $500/person/day x 22 days/month = $110k. It is a significant figure. The DW development cost goes down 16.7% from $660k to $550k.

We notice that there is a big white area on the lower left of the resource chart, i.e. the DA and the 2 developers are not doing anything in the first 3-5 months. This can be tackled using iterative development, i.e. break the project into 3 parts (by functional area) and deliver these 3 parts one after the other. In the chart below, part 1 is yellow, part 2 is green and part 3 is purple.

Split into 3 parts

Part 1 (the yellow boxes) goes live in June, part 2 goes live in Aug, part 3 in Sep. The big system test at the end of the project won’t be required any more, because we go live bit by bit. During April the BA can prepare the test plan.

The resource utilisation is now higher. We now only have 8 white boxes, out of 54. That’s 14.8%.

The project duration is shortened further, to only 9 months rather than 10. That's another $55k saved. The cost is now $495k, 75% of the original cost ($660k).
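The cost arithmetic above can be sketched as a quick calculation, using the article's illustrative rates (5 people at $500/person/day, 22 working days a month):

```python
def project_cost(months, people=5, day_rate=500, days_per_month=22):
    """Total staffing cost of a DW project of the given duration, in dollars."""
    return months * people * day_rate * days_per_month

original = project_cost(12)        # traditional waterfall plan
with_interface = project_cost(10)  # report dev starts early via the data interface
iterative = project_cost(9)        # plus iterative delivery in 3 parts

print(original, with_interface, iterative)  # 660000 550000 495000
print(original - with_interface)            # 110000 saved by the data interface
print(with_interface - iterative)           # 55000 more saved by iterating
```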

13 February 2016


Filed under: Analysis Services — Vincent Rainardi @ 6:28 am

Recently I came across an automated data warehouse builder called Dimodelo, based in Brisbane. Dimodelo is an SSIS plugin on SSDT. It is a new company with unknown financial strength, but their product is excellent. It is the best warehouse builder in the market today, better than the three products already in the market (WhereScape, Insource and Kalido) and as good as Effektor.

Dimodelo creates the data warehouse tables, including staging tables, and creates an SSIS package which does the ETL to populate the DW tables. Yes, we need to point it to where the source tables are, what SCD we'd like, what the business key is, etc. But it is amazingly good and very user friendly. It is more modern and user friendly than Red and Kalido, and it creates SSIS packages, which is a big plus for SQL Server shops. It is as user friendly and as modern as Effektor, which is based in Copenhagen and also built for SQL Server, with an integrated interface to SSRS and SSAS, not just SSIS (so yes, Effektor is more complete than Dimodelo in terms of the BI stack).

Back to the article I just wrote, "Choosing an ETL Tool" (link): we should not consider vendors who can't demonstrate financial strength, no matter how good their software is. But as a good start-up, I think and hope Dimodelo will be able to demonstrate this.

Dimodelo says on their website that their average development time for a fact or dimension table is 6.8 hours. In my experience, doing it manually in SSIS takes 1-3-5 days for a dimension (1 day for a simple dimension, 3 days for medium and 5 days for a complex dimension) and 3-5-7 days for a fact table (3 days for a simple fact table, 5 days for medium and 7 days for a complex fact table). This is development + testing but excluding design. So 6.8 hours is very impressive. But perhaps it was tested on simple dimensions, not complex fact tables?

This is the 5th time I have heard about software which automates the development of a data warehouse. A DW automation tool can build simple DWs in a standard way. For complex DWs, I don't think they will be able to cope. But still, there is a lot of value to be gained by using them. We can jump-start the development in a big way, saving months of initial DW development effort (saving 3 months x 4 people x 22 days/month x $500/day = $132k). When we use these products, the onus shifts to the architecture and design, which now take the longest time, instead of the development itself.

Examples of “complex” things in DW development are: loading data from message queue, a normalised data warehouse (link, link), accumulating snapshots (link, link, link), distinct attribute dimension (link, link), connecting fact tables (link), deduplication (link), initial data load (link), initialising a new attribute in SCD type 2 (link), a data warehouse without surrogate keys (link), flip flopping in dimension tables (link), partitioning fact tables (link), processing past data (link). Most of the above are issues in dimensional warehouses, but apart from dimensional DWs these days we also have data lakes, Hadoop-based warehouses, Inmon’s CIFs (ODSes, EDW, Normalised DW, etc), real time DWs, MPP-based DWs, Cloud-based DWs, EIIs, and the latest incarnation of temporal-based DW in SQL Server 2016, which I don’t think DW Automation software can handle well either.

So in the end, as our DW develops into a more complex one, we will have to alter the ETL code manually (we will have to do some manual development alongside the automated one), but the $100-200k development jump-start saving is very valuable in any DW project. So DW automation software still gives us a lot of value. The bigger the DW project, the more cost saving can be made by using DW automation software.

The value is even greater if your business is consultancy, i.e. you build DWs for companies. Because you will make this cost saving on every single project, for every single client. And you can choose to pass that cost saving on to your client, or to add it to your bottom line (meaning the Profit Before Tax line in your income statement).

Note: I do not receive any incentive or reward, in monetary term or otherwise from Dimodelo or its related partners, in writing this article.

12 February 2016

Choosing an ETL Tool

Filed under: Analysis Services — Vincent Rainardi @ 2:15 pm

Recently I was asked about choosing an ETL tool. That was probably the sixth time somebody has asked me that question. I have written long and wide about all the considerations when choosing an ETL tool, see my article here:

But it is wrong to consider every single factor because most of them will be irrelevant. When choosing an ETL tool, or indeed any other system/software, you should only consider what is relevant to you.

When comparing ETL software, many people look at things which are not relevant to them. We must not fall into this trap. For example, suppose you have to choose between Pentaho, SSIS and Informatica. You should not look at all the features of Pentaho, SSIS and Informatica, but only the features that you need. For example, you should not look at the transformations that you will never use.

In most cases, you should not look at performance either, because most ETL software will be able to satisfy your performance requirement. You won't be loading 1 TB of data in 30 minutes! You are nowhere near the performance limit of most ETL tools. All three I mentioned (Pentaho, SSIS, Informatica) will be able to satisfy your performance requirements, unless you work for a top company doing extreme data movements.

Your primary considerations should not be functionality, but these 3 things:

a) Price

I know both Pentaho and SSIS are free, but there are hidden costs, and actual costs. Prices vary according to usage. The more money you spend, the lower the unit price. So most vendors have tailored prices; very few have standard prices across all usages/volumes. So you have to ask around; most vendors will only be too pleased to answer your enquiries. My point is: know your limits. If an ETL tool is a $1.2 million initial outlay + $200k annually, and your budget is only $200k initial + $20k annually, quickly strike it off your list. So first step: establish which ETL tools are within your budget.

b) Compatibility with the existing technologies in your company

I would say that this is the most important factor. Are you a SQL Server shop? If so, then don't look at Pentaho, Oracle ODI, BODI (or whatever SAP calls it these days), or WebSphere; look at SSIS. If you are an Oracle shop, look at ODI, Informatica, and probably Ab Initio. Don't just look at the RDBMS, but at your middleware/messaging too. Is it WebSphere MQ, MSMQ or Tibco? If your data movement will be dealing with messages, then you need to choose a suitable one. Look at your BI tools too, i.e. if you use SAS BI then you have to consider SAS ETL first, as well as vendor-neutral tools such as Informatica and Ab Initio.

c) Vendor financial strength

If the ETL vendor goes bankrupt, your company will be in trouble. So choose a vendor which will still be there in 10 years' time. This doesn't only apply to ETL, but to choosing any system/software. Choose a vendor with 50 employees and $1m in the bank, not 5 employees with $50k. Choose a vendor with 100 customers, not 5 customers. Asking for the last 5 years' balance sheets is a normal process. You can't risk one vendor bringing your company down.

Your secondary considerations should be:

a) Functionality

Once again, we must not look at the candidates with a "general view", but specifically in your context. For example: SSIS can do A and B, Pentaho can't do C and D. Ask yourself: do you use A, B, C and D? If you don't use them, then they are irrelevant! It does not matter whether SSIS or Pentaho can or cannot do them. Look at them within your context.

b) User friendliness

Download them (or ask for an evaluation copy) and use them both for your daily tasks, and you will find out which one is easier to use. One or two vendors refuse to give an evaluation copy before they see a real opportunity to sell. Again, this is simply a "level mismatch", i.e. if your budget is $200k, the $1.2m vendor would not be interested in talking to you. But neither should you be interested in them!

c) Your skill set

You and your colleagues already have certain skills (and look forward to developing them). Which one are you more capable of? If you are a SQL Server shop, naturally your skill set will be in SSIS, not Pentaho or Ab Initio. The same goes for SAS, IBM and SAP.

Factors you should not care about

Those are the primary and secondary consideration factors. But there are a few factors which you should not spend any time on:

  1. Don't look at what they are made of. You should not care whether they are written in Java or .NET or R. This factor is irrelevant.
  2. Don't think about what other companies think. Don't bother asking for a reference, because the vendor will always refer you to their best client, who will always sing their praises. The vendor will never put you in touch with a failed client who might say negative things about the software.
  3. Don't rely on the Magic Quadrant. Do read the details, but don't look at the quadrant position. The details specify the strengths and weaknesses of each ETL tool, but the position in the quadrant will blind you to your actual needs. Again, only look at the features which are relevant to you, not the whole feature set. The quadrant is constructed to reflect the whole feature set, hence it is irrelevant to you.


In addition to ETL tools, you could consider an EII tool (Enterprise Information Integration). The best in this field is Composite Software. EII does not move the data, but integrates it and feeds it directly to the BI tools. It is probably irrelevant to your company's needs, but in special cases it is relevant. So don't get blinded by ETL, ETL, ETL; also consider the alternatives.

11 February 2016

Six Dimension Types

Filed under: Analysis Services — Vincent Rainardi @ 7:29 pm

In data warehousing there are 6 types of dimension:

  1. Normal dimension
  2. Junk dimension
  3. Split dimension
  4. Text dimension
  5. Stacked dimension
  6. Distinct Attribute dimension

1. Normal Dimension

A normal dimension is one where all the attributes are related (they are all about 1 entity, e.g. Product); it has a business key (natural key) and all attributes are dependent on the surrogate key, like this:

1 Normal Dimension

In the above example, the business key is PRODUCT_CODE.

Sometimes we find a date attribute created as a date column (DATE_PRICE_UPDATED in the above example), and sometimes as a surrogate key (DATE_LAUNCHED_KEY in the above example). Another snowflake example is SUPPLIER_KEY, enabling us to access all the supplier attributes stored in DIM_SUPPLIER.

The above example is type 2, but a normal dimension can be type 0, 1, 2, 3, 4, or 2+1. Type 4 is a temporal table; type 0 is static, e.g. dim_date. Type 2+1 is where, for the same attribute, we have a type 1 column (latest value) and a type 2 column (historical values over many rows).
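To illustrate the type 2 mechanics, here is a minimal sketch of applying a change to such a dimension held as an in-memory list of rows. The column names follow the example above; the helper function itself is hypothetical, not a standard API:

```python
from datetime import date

HIGH_DATE = date(9999, 12, 31)  # conventional "open-ended" expiry date

def apply_type2_change(dim_rows, product_code, new_attrs, change_date, next_key):
    """Expire the current row for the business key, then insert a new version."""
    for row in dim_rows:
        if row["PRODUCT_CODE"] == product_code and row["IS_CURRENT"] == "Y":
            row["IS_CURRENT"] = "N"
            row["EXPIRY_DATE"] = change_date
    new_row = {
        "PRODUCT_KEY": next_key,        # surrogate key
        "PRODUCT_CODE": product_code,   # business (natural) key
        "EFFECTIVE_DATE": change_date,
        "EXPIRY_DATE": HIGH_DATE,
        "IS_CURRENT": "Y",
        **new_attrs,
    }
    dim_rows.append(new_row)
    return dim_rows

dim = [{"PRODUCT_KEY": 1, "PRODUCT_CODE": "P100", "PRODUCT_NAME": "Widget",
        "EFFECTIVE_DATE": date(2015, 1, 1), "EXPIRY_DATE": HIGH_DATE, "IS_CURRENT": "Y"}]
apply_type2_change(dim, "P100", {"PRODUCT_NAME": "Widget v2"}, date(2016, 2, 1), next_key=2)
# dim now has 2 rows: the expired original and the new current version
```

A type 2+1 column would simply be an extra type 1 attribute updated on every version of the same business key.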

2. Junk Dimension

The classic scenario is: the source table in the transaction/source system has 4 to 8 columns, each with 2 to 6 values (some are Yes/No columns). These columns are not related to each other, and they are not attributes of a big dimension. Most of these columns describe the fact table itself.

As an example, let's consider a payment table in a transaction system. The payment table has an order ID column (which can be composite, i.e. a group of several orders), a customer ID column, due date, actual payment date, payment amount and payment method. But this table also has the following columns: processed_indicator, has_deposit, is_partial and recurring_flag. Each of these is a Yes/No column.

So we can create a junk dimension like this:

2 Junk Dimension

Note that a junk dimension is always type 0 (static). The name of the dimension contains the word "junk", usually after "dim" rather than at the end. For the Y/N columns, the column names are standardised with "IS_…" or "…_FLAG". The data type is consistent, i.e. for a Y/N column it is either bit or CHAR(1), but not INT or VARCHAR(N) – my preference is CHAR(1), for clarity of values and the possibility of expanding into 3 or 4 values.

A junk dimension does not have a business key.

Expanding a junk dimension: if we add a column to the junk dimension, here's what happens. The above junk dimension has 17 rows, i.e. 16 for the combinations of the 4 Y/N columns, plus the unknown row. If we add 1 more Y/N column (say IS_REFUNDABLE), the number of rows becomes 33, i.e. on the existing 16 combination rows (not the unknown row) we set IS_REFUNDABLE to N, then copy them as 16 new rows with IS_REFUNDABLE as Y.

A junk dimension can also expand because there are more values in a column. For example: the IS_PROCESSED column was a Y/N column, but now it also has a third value, U (unknown).

In the fact table we have a JUNK_PAYMENT_KEY column, containing values 0 to 16 (0 to 32 after the expansion).

Note that in a junk dimension the attributes are not related to each other. For example, they are not customer attributes or product attributes, because if they were, they should be moved to the customer dimension or product dimension. The attributes in the junk dimension are at the transaction level (or whatever the fact table is about), not at the dimension level.
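Because a junk dimension is static, its rows can be generated up front as the cross product of the flag values plus an unknown row. A minimal sketch, with the column and key names following the payment example above:

```python
from itertools import product

def build_junk_dimension(flag_columns):
    """All Y/N combinations of the flags, plus an 'unknown' row with key 0."""
    rows = [{"JUNK_PAYMENT_KEY": 0, **{c: "U" for c in flag_columns}}]  # unknown row
    for key, combo in enumerate(product("NY", repeat=len(flag_columns)), start=1):
        rows.append({"JUNK_PAYMENT_KEY": key, **dict(zip(flag_columns, combo))})
    return rows

flags = ["PROCESSED_INDICATOR", "HAS_DEPOSIT", "IS_PARTIAL", "RECURRING_FLAG"]
dim = build_junk_dimension(flags)
print(len(dim))  # 17: 2^4 combinations + the unknown row

expanded = build_junk_dimension(flags + ["IS_REFUNDABLE"])
print(len(expanded))  # 33: 2^5 combinations + the unknown row
```

Note that in a real warehouse you would expand in place so existing surrogate keys are preserved; regenerating from scratch, as this sketch does, would reassign them.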

3. Split Dimension

When a dimension is predicted to be very big, say 20 million rows (usually because it has many type 2 attributes), the designer may take a conscious decision to split the dimension into two or three, for performance reasons. A classic scenario is the customer dimension for ecommerce companies with an internet-based customer registration facility.

The split is always vertical, i.e. some columns go to dim 1 and some columns go to dim 2. For example, the customer contact details are put into dim_customer_contact, the customer attributes related to order processing are put into dim_customer_order, and the customer attributes related to marketing, promotions and loyalty programs are put into dim_customer_marketing.

3 Split Dimension

The related rows in each dimension have the same business key (natural key). It is not practical to put in the surrogate key of the other dimension(s); it is much better to put in the business key instead.

When a big type 2 dimension is split into two or three, in most cases all of the resulting dimensions are also type 2. This is because we split by function (business area). It would not be good to split by SCD type of the attributes, i.e. all type 1 columns go to dim 1 and all type 2 columns go to dim 2, because it would be impossible to construct the hierarchies.

Joining split dimensions: we join on the business key, taking the latest version. The best practice is not to join on-the-fly (using a view), but during the ETL, stored physically as a DIM_CUSTOMER table containing the latest version of the customer records and all 3 surrogate keys. Joining on-the-fly usually has performance issues when querying, because of the data volume. During the ETL we can use a procedure which creates 3 temp tables containing the latest versions of all customers (without the system columns, e.g. the effective date columns), then joins them. Then we update only the changed rows, which is best done using hashbytes or checksum (depending on row length and null values), with a clustered index on the business key (CUSTOMER_ID) on DIM_CUSTOMER.
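That ETL step can be sketched with in-memory rows instead of temp tables. A simplified illustration, assuming the dim names from the customer example above (the helper names are hypothetical):

```python
def latest_versions(dim_rows, business_key="CUSTOMER_ID"):
    """Latest type 2 version per business key (highest effective date wins)."""
    latest = {}
    for row in dim_rows:
        k = row[business_key]
        if k not in latest or row["EFFECTIVE_DATE"] > latest[k]["EFFECTIVE_DATE"]:
            latest[k] = row
    return latest

def join_split_dimensions(contact, order, marketing, business_key="CUSTOMER_ID"):
    """Join the three split dimensions on the business key, latest version of each."""
    parts = [latest_versions(d, business_key) for d in (contact, order, marketing)]
    keys = set(parts[0]) & set(parts[1]) & set(parts[2])
    joined = []
    for k in sorted(keys):
        row = {}
        for part in parts:
            # drop the system column, keep the attributes and surrogate keys
            row.update({c: v for c, v in part[k].items() if c != "EFFECTIVE_DATE"})
        joined.append(row)
    return joined

contact = [
    {"CUSTOMER_ID": "C1", "CONTACT_KEY": 1, "EMAIL": "old@example.com", "EFFECTIVE_DATE": 1},
    {"CUSTOMER_ID": "C1", "CONTACT_KEY": 2, "EMAIL": "new@example.com", "EFFECTIVE_DATE": 2},
]
order = [{"CUSTOMER_ID": "C1", "ORDER_KEY": 10, "SEGMENT": "Gold", "EFFECTIVE_DATE": 1}]
marketing = [{"CUSTOMER_ID": "C1", "MARKETING_KEY": 20, "OPT_IN": "Y", "EFFECTIVE_DATE": 1}]
joined = join_split_dimensions(contact, order, marketing)
# one row per customer, carrying all 3 surrogate keys and the latest attributes
```

The change-detection step (hashbytes/checksum against the existing DIM_CUSTOMER) is omitted here for brevity.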

4. Text Dimension

If the source transaction table has a narrow text column, say 10 to 20 characters, and this column is not at the dimension level but at the fact table level, then we leave it in the fact table. For example: order ID, transaction ID, payment ID.

But if the source transaction table has a wide text column, say varchar(255) or varchar(800), we have two design choices. The standard textbook way is to put this varchar column in a dimension, which I call a Text Dimension. The other design choice is to leave it in the fact table. Examples are a free-text comment column, a transaction description column, or an order note column.

4 Text Dimension

There are 2 considerations when choosing the design: the length of the text column, and the repeatability of the values. The longer the text column, the stronger the case for putting it in a dimension, especially if the fact table is large. This is because of space considerations, i.e. if the fact table is more than 1 TB, DBAs might raise concerns about its management (backup, indexing, stats), as it needs to be duplicated into the Dev, UAT and DR environments.

If the values in this varchar column are highly repeatable, say by a factor of 10 (10 million rows reduce to 1m distinct rows), then we would save a lot of space. This approach is good for performance too, as most of the queries accessing the fact table are not concerned with the text column.

Boundary: if the length of the text column is 20-70 characters, and the repeatability is very low (say a factor of 1.2), then it is reasonable to leave this text column in the fact table. But if the length is over 100 characters, it is difficult to argue that the user-friendliness of keeping it in the fact table outweighs the performance hindrance.
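The space trade-off can be approximated with a quick calculation. This is a rough sketch only: it assumes a 4-byte surrogate key and ignores indexes and row overhead:

```python
def text_dimension_saving(fact_rows, distinct_values, avg_text_bytes, key_bytes=4):
    """Approximate bytes saved by moving a text column into a dimension.
    The fact table keeps only a surrogate key; the dimension stores each
    distinct value once."""
    kept_in_fact = fact_rows * avg_text_bytes
    moved_to_dim = fact_rows * key_bytes + distinct_values * avg_text_bytes
    return kept_in_fact - moved_to_dim

# Repeatability factor 10: 10 million fact rows, 1 million distinct 200-byte comments
print(text_dimension_saving(10_000_000, 1_000_000, 200))   # 1760000000, ~1.76 GB saved

# Repeatability factor 1.2 with a short 40-byte column: hardly worth a dimension
print(text_dimension_saving(12_000_000, 10_000_000, 40))   # 32000000, only ~32 MB
```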

SCD: a text dimension is always type 1.

Business key: a text dimension does not have a business key.

Combining several text dimensions: if we have two commentary/notes columns in the source table, it is best to create two text dimensions, rather than combining them into one, because of a) functional clarity – different surrogate keys in the fact table describe what they are, and b) query performance – the dim tables will have fewer rows, avoiding a Cartesian explosion when they are combined.

5. Stacked dimension

A stacked dimension is a dimension where two or more dimensions are combined into one, like this:

5 Stacked Dimension

A stacked dimension only has one or two attributes, and it is always SCD type 0 (no updates).

A stacked dimension is not recommended; it is the wrong way to do it. But they do exist, usually because it was like that in the source system, so it was simply copied into the data warehouse.

We often come across a few type and status columns: customer status, product type, store type, security type, security class, broker type, etc. All of these columns should go to their respective dimensions, because they are truly the property of that dimension.

But there are type and status columns which are the property of the fact table, such as transaction type or transaction status. To combine the transaction type and the transaction status into one dimension, we should create a junk dimension, as in point #2 above, not a stacked dimension like this one.

6. Distinct Attribute Dimension

The last type is a distinct attribute dimension, where all the attributes are the property of the fact table itself. For example, consider a holding table in fund management, where the grain is one row for each fund, security and date. Each security has many attributes, such as sector, rating, country, currency, asset class, etc. Examples of securities are bonds, equities, CDS, IRS, options, futures and swaps. In theory these attributes are consistent across funds. But that is not always the case, because there might be an overwrite.

In a normal situation, sector, rating, country, currency and asset class are all properties of the security dimension. But because these "security attributes" can differ across funds, they are truly the property of the holding fact table, and they become "position attributes". Therefore, to store them properly, we should create a distinct attribute dimension like this:

6 Distinct Attribute Dimension

The above example is over-simplified. In reality there are many different rating columns, e.g. S&P, Moody's, Fitch, internal/house, and combinations of them. There are many different types of industry sector, such as Barclays, ICB, GICS (MSCI), Moody's, etc. There are many different asset class categorisations. So in practice the number of columns in the above dimension is not 6, but closer to 50.

A “distinct attribute” dimension is always SCD type 0. We can’t change existing rows (no update). We can only insert new rows.

As the name implies, the above dimension contains a distinct list of attributes. Not all the possible values, but only values which actually exist in the fact table. So we don’t have a Cartesian explosion here.

A “distinct attribute” dimension does not have a business key. All the attributes are the property of the fact table, not the property of a particular dimension.

A “distinct attribute” dimension can be split into two or three physically. It is a vertical split, as in point #3 above (Split Dimension), based on a logical grouping of the attributes. The purpose of splitting it is to reduce the number of rows, and thus improve query performance.

In a way this becomes like a junk dimension, but covering a wider range of attributes, i.e. 50-60 columns instead of only 3-5 columns.
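Building such a dimension amounts to deduplicating the attribute combinations that actually occur in the holding data. A minimal sketch, with column names following the example above (the helper and key name are hypothetical):

```python
def build_distinct_attribute_dim(holdings, attr_columns):
    """One dimension row per attribute combination that actually occurs, plus a
    lookup from combination to surrogate key for loading the fact table."""
    key_of = {}
    dim_rows = []
    for h in holdings:
        combo = tuple(h[c] for c in attr_columns)
        if combo not in key_of:
            key_of[combo] = len(dim_rows) + 1
            dim_rows.append({"SECURITY_ATTR_KEY": key_of[combo],
                             **dict(zip(attr_columns, combo))})
    return dim_rows, key_of

holdings = [
    {"FUND": "F1", "SECURITY": "S1", "SECTOR": "Banking", "RATING": "AA",  "CURRENCY": "USD"},
    {"FUND": "F2", "SECURITY": "S1", "SECTOR": "Banking", "RATING": "AA-", "CURRENCY": "USD"},  # overwritten rating
    {"FUND": "F3", "SECURITY": "S1", "SECTOR": "Banking", "RATING": "AA",  "CURRENCY": "USD"},
]
dim, key_of = build_distinct_attribute_dim(holdings, ["SECTOR", "RATING", "CURRENCY"])
# only 2 distinct combinations exist, so the dimension has 2 rows, not a Cartesian product
```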

10 February 2016

Alt Key in SSMS

Filed under: Analysis Services — Vincent Rainardi @ 6:39 pm

The Alt key in SQL Server Management Studio functions as a "vertical mode". If we press the Alt key whilst clicking and dragging the mouse, we can select a vertical block of text, like this:

Block KPI

We can then press Tab to insert spaces, or press Delete to delete the word KPI from all 3 lines in one go.

After pressing Delete, we can type “Target” and the word Target will be inserted in all 3 lines in one go, like this:


The vertical blue line is the cursor. When this blue line appears, we can press shift-left or shift-right to select text in "vertical mode". Pressing shift-down or shift-up will extend/reduce the selected text by 1 line, like this:


When some text is highlighted like the above, we can find or replace any text, within just the selected area. To do that, just press ctrl-F or ctrl-H like this:

Find and Replace

Also, when some text is highlighted in vertical mode, we can drag them to the right or to the left (all 3 lines in one go), like this:


When we want to type the same thing on many lines, press alt, highlight 1 character/column over many lines, and start typing, like this:

Typing multi lines

Hope this helps.

3 February 2016

Accumulative Snapshot Fact table

Filed under: Analysis Services — Vincent Rainardi @ 6:37 pm

Many people have seen or implemented transaction fact tables and periodic snapshot fact tables, but not the 3rd type of fact table: the accumulative snapshot fact table. The purpose of this article is to introduce it, with an example. I have written an article about it in the past (link), but this one is a bit different. I try to be more practical, and an email from a reader I received yesterday made me realise that the previous article does not cover some areas. In particular, the business aspect of it, which is rather important.

We use this fact table when we need to analyse a process. A process is a series of business events, which come one after the other in stages.

As an example, consider customers applying for health care insurance. First, the customer fills in an application form, including the bank details for direct debit and the effective date of the policy. The insurer receives the form and sends a medical questionnaire. The insurer then receives the completed medical questionnaire and checks it, confirms the medical details with their internal medical team and, at the same time, checks with the applicant's GP (doctor). Underwriting is the next step: that is when the insurer determines the rate for this applicant (based on their risks), and from the rate the underwriter then calculates the premium and issues a policy along with a membership certificate. The policy then becomes 'live' and the first payment is taken from the applicant's bank account. As per the regulation, the applicant can cancel the policy during a grace period of 21 days and have their money back.

As we can see in the example above, within a process we can have branching, where two or more flows happen at the same time. We can also have alternate paths, like cancellation. I won't go into process flows in detail as that is not the purpose of this article. I will simplify the example above into only 3 stages:

  • Stage 1: Application form received
  • Stage 2: Medical questionnaire received
  • Stage 3: Policy document sent

Next I will discuss the analysis that the business usually wants (the "BI Questions").

Analysis (the “BI Questions”)

The usual BI Questions that the business try to answer are:

  1. Volume of business – Early Stage: How many applications did we receive in Q4 2015? How does this compare to Q4 2014 and Q3 2015? That is quarterly, but it is also common to do monthly comparison (Dec 2015 compared to the previous 3 months), and annual comparison (2015 compared to 2014).
  2. Volume of business – Final Stage: How many policies did we issue in Q4 2015? How does this compare to Q4 2014 and Q3 2015? Again, above is quarterly, but it is also common to do monthly and annual comparison.
  3. Efficiency of the business process – Conversion Rate: Of all applications we received in Q3 2015, how many became policies? How does this compare to Q2 2015? Note that this question can only be answered 3 months later, because it takes a few weeks for an application to become a policy. So at the moment (3rd Feb 2016) we can't ask for Q4 2015, because those applications are still being processed / in progress.
  4. Efficiency of the business process – Speed of Flow: Of all application we received in Q3 2015, on average how many business days (BD) does it take to reach stage two (medical form received)? And how many BDs does it take to reach stage three (policy document sent)? How does this compare to the previous quarter and previous year? Let’s say that the answers are: 7.2 BDs to reach stage 2 and 15.8 BDs to reach stage 3.
  5. Efficiency of the business process – Issues and Anomalies: Show me the applications we received in Q3 2015, which reached stage 3 in more than 20 BDs. How many in total? Why were they late – is it because of stage 2 being slow or stage 3 being slow?
  6. Efficiency of the business process – Best Practices: Show me the applications we received in Q3 2015, which reached stage 3 in less than 12 BDs. How many in total? Why were they early – is it because of stage 2 being fast, or stage 3 being fast?
  7. The same questions as per 1 to 6 above, but specifically for customers in a particular region, e.g. London, North East, Scotland; and a particular customer type, e.g. family (rather than individual), business (rather than personal).
  8. The same questions as per 1 to 6 above, but specifically for a particular product type, e.g. “comprehensive cancer treatment”, “policies which include alternative treatments”


The basic design of an accumulative snapshot fact table which can answer the above questions is:


The fact table consists of 4 groups of columns:

  • Group A: Dimension Keys. The most important ones are product_key and customer_key. These enable us to do #7 and #8 above, i.e. filtering the data on particular product attribute(s) or customer attribute(s).
  • Group B: Dates. This group contains the dates of all stages in the process. In the above example, this group would contain date_app_received, date_medical_received, and date_policy_sent. The purpose of these columns is to specify the time frame of the dataset, i.e. applications received in Q4 2015, or policies issued in Q3 2015.
  • Group C: Durations. This group contains the durations between stages. In the above example, we have 2 columns in group C: days_medical_received (the number of days from the day we received the application until the day we received the medical questionnaire) and days_policy_sent (the number of days from the day we received the medical questionnaire until the day we sent the policy document). The purpose of the columns in group C is to enable us to analyse the business process efficiency (#4 above, the speed of flow), as well as "bottleneck analysis" (#5 and #6 in the example above).
  • Group D: Measures. This group contains numerical measures such as: quantity of applications (hardcoded to 1 for all rows), the premium amount (annualised), the number of people in the policy, the sum insured, etc. The purpose of the Group D columns is to enable us to answer "how many applications did we…", "what is the value of the policies that we…", etc.

The way we query this table is in 3 steps:

  1. First we set the time period of the analysis by filtering the rows on one (or more) of the Group B columns.
  2. Then we filter the rows further on the product or customer dimensions, using the dim keys in Group A.
  3. Once we have the rows that we want to involve, we sum one (or more) of the measure columns (Group D). Or we can sum one (or more) of the duration columns (Group C).
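These 3 steps can be sketched as a single filtered aggregation. A simplified in-memory illustration, with column names following the 3-stage example above (the helper name is hypothetical):

```python
from datetime import date

def sum_over(fact_rows, stage_date_col, start, end, value_col, dim_filters=None):
    """Step 1: filter on a Group B date column; step 2: filter on Group A dim
    keys; step 3: sum a Group C duration or Group D measure column."""
    dim_filters = dim_filters or {}
    return sum(
        r[value_col]
        for r in fact_rows
        if r[stage_date_col] is not None            # stage not yet reached -> excluded
        and start <= r[stage_date_col] <= end
        and all(r[k] == v for k, v in dim_filters.items())
    )

fact = [
    {"customer_key": 1, "product_key": 5, "date_app_received": date(2015, 10, 7),
     "date_policy_sent": date(2015, 10, 30), "days_policy_sent": 9, "qty_application": 1},
    {"customer_key": 2, "product_key": 5, "date_app_received": date(2015, 11, 2),
     "date_policy_sent": None, "days_policy_sent": None, "qty_application": 1},  # in progress
]

# How many applications did we receive in Q4 2015?
q4_apps = sum_over(fact, "date_app_received", date(2015, 10, 1), date(2015, 12, 31),
                   "qty_application")
print(q4_apps)  # 2
```

Averaging a Group C duration (e.g. days to reach stage 3) would divide the summed duration by the count of rows that have reached that stage.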






