Data Warehousing and Business Intelligence

24 April 2017

Choosing between Big Data and Data Warehousing

Filed under: Analysis Services — Vincent Rainardi @ 4:14 am

Suppose we have 100 files, each containing 10 million rows, which we need to load into a repository so that we can analyse the data. What should we do? Do we put them into Hadoop (HDFS), or into a database (RDBMS)?

Last week I defined the difference between Big Data and Data Warehousing as: Big Data is Hadoop, and Data Warehousing is RDBMS (see my article here: link). Today I would like to illustrate, using an example, a case where we need to choose between the two.

There are four factors to consider:

  1. Data Structure
  2. Data Volume
  3. Unstructured Data
  4. Schema on Read

1. Data Structure: Simple vs Complex
If all 100 files have the same structure, all consisting of the same 10 columns, then it is better to put them into Hadoop. We can then use Hive* and R to analyse the data: to find patterns within the data, do statistical analysis, or create forecasts. The development time will be shorter, because there is only one layer.
*or Phoenix, Impala, BigSQL, Stinger, Drill, Spark, depending on your Hadoop platform

If the 100 files contain 100 different tables, it is better to put them into a database, create a data warehouse, and use an Analytics/BI tool such as MicroStrategy or Tableau to analyse the data: to slice and dice the data, find percentages or anomalies, and do time series analysis. Yes, we need to create 3 layers (staging, 3NF, star schema), but it enables us to analyse each measure by different dimensions.

So if the data structure is simple, put the data into Hadoop; if the structure is complex, put it into a data warehouse. This is the general rule, but there are always exceptions. Can data with a simple structure be put into a data warehouse? Of course it can. Can data with a complex structure be put into Hadoop? Of course it can.

Using Hadoop and Hive/R we can also slice and dice, find percentages or anomalies, and do time series analysis. Using a data warehouse we can also do machine learning and data mining to find patterns in the data, do statistical analysis, and create forecasts. So whether we store the data in Hadoop or in a data warehouse, we can still do complete analysis.

The issue here is storing it. Linking 100 tables in Hadoop is difficult and unnatural. An RDBMS such as SQL Server or Oracle is designed precisely for that task: linking and joining tables. Constructing a data model linking 100 tables is very much a job for an RDBMS. Can we design a data model linking 100 files with different structures in Hadoop? Of course we can. But it is much more difficult. For starters, Hadoop is Schema-on-Read, so the columns in the files have no data types. Schema-on-Read means that we do not try to understand the relationships between the files when loading them into Hadoop. So yes, we can load the 100 files into Hadoop, but we keep them as individual files, without relationships between them. A Data Lake works the same way: it also uses Schema-on-Read and HDFS.
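To make Schema-on-Read concrete, here is a minimal Python sketch (illustrative only, not Hive itself): the file is stored as raw text with no schema, and data types are only applied at query time.

```python
import csv
import io

# Raw file stored as-is: no data types, no schema (Schema-on-Read).
raw_file = "trade_id,quantity,price\n1,200000,98.5\n2,50000,101.25\n"

def query(raw, schema):
    """Apply data types only at query time, not at load time."""
    reader = csv.DictReader(io.StringIO(raw))
    return [{col: cast(row[col]) for col, cast in schema.items()}
            for row in reader]

# The schema is supplied by the reader, not stored with the file.
rows = query(raw_file, {"trade_id": int, "quantity": int, "price": float})
print(rows[0])  # {'trade_id': 1, 'quantity': 200000, 'price': 98.5}
```

The same raw file could be queried again tomorrow with a different schema, which is exactly the flexibility (and the burden) of Schema-on-Read.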

2. Data Volume: Small vs Large

100 files containing 10 million rows each is 1 billion rows per day. If all 100 files have the same structure (say they all consist of the same 10 columns), then we will have a performance problem if we load them into an SMP database such as SQL Server or Oracle. Within 3 years, this table will have about 1 trillion rows. Even with partitioning and indexing, it will still be slow to query.
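The arithmetic behind that estimate:

```python
files_per_day = 100
rows_per_file = 10_000_000

rows_per_day = files_per_day * rows_per_file  # 1 billion rows per day
rows_in_3_years = rows_per_day * 365 * 3      # roughly 1.1 trillion rows

print(f"{rows_per_day:,} rows per day")        # 1,000,000,000 rows per day
print(f"{rows_in_3_years:,} rows in 3 years")  # 1,095,000,000,000 rows in 3 years
```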

Hadoop, on the other hand, will have no problem storing and querying 1 trillion rows. It is designed for exactly this task: storing the data in many files and querying them in parallel using Stinger, Drill, Phoenix, Impala or Spark. The file structure is simple (the same 10 columns in each), which lends itself to Hadoop.

Teradata, Greenplum, Netezza, Exadata and Azure SQL Data Warehouse are more than capable of handling this, with excellent query performance. But MPPs are more costly than Hadoop, which is why companies tend to choose Hadoop for this task. Using an MPP here is like killing a fly with a cannon: not only is it expensive and unnecessary, it is also too sluggish and cumbersome for the task.

If the 100 source files have a complex structure (such as an export from an SAP system) then yes, an MPP is a suitable solution, as we need to create relationships between the files/tables. But if the source files have a simple structure and we just need to union them, then Hadoop is more suitable and more economical.

So if the data volume is large, like 1 billion rows per day, and the data structure is simple, put the data into Hadoop. But if the data volume is large and the data structure is complex, put it into an MPP.

3. Unstructured Data

If most of those 100 source files are MP4 (video) or MP3 (music), then Hadoop or a Data Lake is an ideal platform to store them. An RDBMS, be it SMP or MPP, is not designed to store video or music files. It can (as a blob, or as externally-linked files), but it is not really designed for it.

If the source files have varying numbers of columns (such as Facebook or Twitter files), then Hadoop or a Data Lake is an ideal platform to store them. An RDBMS is not really designed for that.

Unstructured data also comes in the form of free-format text files (such as emails) and documents (such as journals and patents). Again, Hadoop or a Data Lake is in a much better position to store them than an RDBMS.

4. Schema-on-Read

One of the advantages of using Hadoop or a Data Lake is that they are Schema-on-Read, meaning that we just store the files without determining whether the first column is a numeric or a string. Only when we want to query the data do we need to specify the data types.

Why is this an advantage? Because it makes things flexible. In Data Warehousing the first thing we need to do is analyse the file structure and design many tables to host the files in a Staging database. Then we design a normalised database to integrate those Staging tables. And then we design a Reporting layer in the form of Fact and Dimension tables and load the normalised tables into them. The whole thing can take a year if we have 100 files. The more files we have, the more complex the process, and the longer it takes to design the databases for the Integration and Reporting layers. It is good for the data architects (it gives them a job) but it is not good for the people who pay for the project.

Hadoop, on the other hand, is Schema-on-Read. After we put the 100 files into Hadoop, we query the first file, and when we query it, we specify the data types of each column. We don't need to touch the other 99 files yet, and we can already get the benefit: we can analyse the data straight away, on day one. If the other 99 files have the same structure, then we can union them, without the extra effort of designing any database, and query them straight away, on day two. It is much simpler: we don't need a team of 10 people designing Staging, Normalised and Reporting layers for many months. We can start analysing the data straight away, and the project can finish in 2 to 3 months, with only 3 or 4 people. A lot less costly, a lot more agile, and a lot more flexible.
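Assuming the files are plain CSVs, the "first file on day one, union the rest on day two" workflow might look like this in Python (a sketch only; on a real Hadoop platform this would be, say, a Hive external table over HDFS):

```python
import csv
import glob
import os
import tempfile

def read_file(path, schema):
    # Schema-on-Read: the data types are declared here, at query time.
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            yield {col: cast(row[col]) for col, cast in schema.items()}

# Create 3 small stand-in files (in reality: 100 files of 10 million rows).
workdir = tempfile.mkdtemp()
for i in range(1, 4):
    with open(os.path.join(workdir, f"file_{i:03d}.csv"), "w") as f:
        f.write("account,amount\nA,100.0\nB,250.5\n")

schema = {"account": str, "amount": float}  # hypothetical 2 of the 10 columns

# Day one: analyse the first file only -- no database design needed.
first = list(read_file(os.path.join(workdir, "file_001.csv"), schema))

# Day two: the other files have the same structure, so simply union them.
union = [row
         for path in sorted(glob.glob(os.path.join(workdir, "file_*.csv")))
         for row in read_file(path, schema)]
print(len(first), len(union))  # 2 6
```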


3 January 2017

Data Warehousing/Business Intelligence in Investment Banking

Filed under: Analysis Services — Vincent Rainardi @ 8:39 am

I've written an overview of an investment bank (IB): link, which would provide useful background to this article, if you want to read it first. This article is intended for readers who have no experience in investment banking, so I will explain all the IB terminology as we go along. I'm not a business expert in IB, so I would welcome corrections from IB practitioners.

Of the 7 business areas in IB, a business intelligence system is highly required in 3: research, fund/asset management, and trading; it is almost not required at all in the other 4. I'll start with trading, as it is IB's main business and the area that uses DW/BI the most.

An investment bank trades on behalf of its clients because it acts as a broker-dealer. It receives orders from many clients to buy and sell securities such as shares, bonds, options, FX forwards, IRS, commodities and CDS, both exchange-traded and over-the-counter (a bilateral agreement with another dealer).

As a broker-dealer, we need to offer our clients good BI, preferably real time, covering all the trades we do for them. That's the minimum. Preferably (in fact it is a necessity) we also offer market data; a more appropriate term would be market intelligence. Clients usually don't use us just because we have thin spreads and clean, speedy executions, but because we win them over with our information systems, which enable them to make decisions early and accurately.

This BI enables the clients to see each and every trade, in its full life cycle. I'm going to explain the mechanics, so please bear with me. Our client sets up the order on their OMS (Order Management System) such as Charles River or thinkFolio. We receive the order via Omgeo CTM or SWIFT, for example to sell 200,000 of bond X. We start the process of origination, execution, validation, confirmation, clearing, settlement and accounting. To get the best price we might (programmatically!) have to split the order 3 ways: 75k to Goldman, 75k to MS and 50k the next day with Citi.

The terms could be different for each counterparty, particularly if it is margin trading such as a repo. The order could be complex, i.e. a delta-neutral trade, a volatility play, or an option spread.

Broker or dealer: we act as a broker when we buy an ETS (exchange-traded security) or ETD (exchange-traded derivative) such as shares, gilts, treasuries, or CP (commercial paper) on behalf of our clients. Here we purchase or sell the security/derivative on an exchange. We also act as a broker when we make an OTC (over the counter) contract on behalf of our clients, for example an IRS, an FX forward or some futures. Here we make a custom "deal" with another bank. We act as a dealer when we trade on our own behalf, with our own money, for example for our Proprietary Desk (I'll explain this shortly) or for our own Treasury department.

Here is an overview of each of the 7 steps:

  1. Origination is capturing the client's order correctly in our trading system: quantity, security details, maturity, price requirements, timing restrictions, counterparty restrictions (exchange restrictions), and terms (such as European or American for options, ISDA/DTCC for CDS).
  2. Execution is sending the order down the wire to different counterparties, or to our internal pool first to match it against other client orders. ETS orders could be combined with other client orders, or with our own prop desk's, and placed externally in one big volume to get a good deal.
  3. Validation is done by the middle office to check the trade against the counterparty, i.e. we send them our trade details, they send us theirs, and we compare them electronically. Unmatched trades are reported back to the counterparty and fixed manually.
  4. Confirmation: after we have agreed with the counterparty, we send the details of the trade to the clients, electronically.
  5. Clearing: the clearing agency, e.g. NSCC, electronically sends the contract to both parties and both formally accept the trade terms.
  6. Settlement: payment to the counterparty (OTC) or exchange (ETS), usually T+1 or T+2 depending on the exchange. Settlement for STP (straight-through processing) is same day.
  7. Accounting: trades are booked to clients' accounts, and the balances of client funds and margin requirements are calculated.
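The seven-step life cycle above is essentially a small state machine. A hypothetical sketch (names are illustrative, not a real trading system):

```python
# The trade life cycle from the steps above, in order.
LIFECYCLE = ["origination", "execution", "validation", "confirmation",
             "clearing", "settlement", "accounting"]

class Trade:
    def __init__(self, trade_id):
        self.trade_id = trade_id
        self.status = LIFECYCLE[0]          # every trade starts at origination

    def advance(self):
        """Move the trade to the next step; statuses never skip or go back."""
        i = LIFECYCLE.index(self.status)
        if i + 1 < len(LIFECYCLE):
            self.status = LIFECYCLE[i + 1]
        return self.status

t = Trade("T-0001")
t.advance()       # origination -> execution
t.advance()       # execution -> validation
print(t.status)   # validation
```

The BI system described below then only needs to record each status transition with a timestamp to show clients where every trade sits in its life cycle.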

The BI system we offer to clients on our extranet needs to reflect every single one of the above steps, on a real-time basis (say 2-5 minutes after the event). If we only offer overnight updates we may be able to win small clients, but the big fish (the buy side) all need real time, as it affects the fund manager's ability to manage their portfolios intraday.

 

Business Intelligence for Trading

The 5 core requirements are: data coverage, external & internal use, position exposure, risk management, and regulatory compliance.

Data Coverage: there are many different asset classes within the IMSes (Investment Management Systems) across the bank, and for OTCs there are several external trade repositories, all of which need to be tapped and have their data extracted into our data warehouse. Our DW/BI must store every single trade we execute, and every single order we receive (both from clients and internal), regardless of status, i.e. we must still store cancelled orders, novations and compressions.

External & Internal Use: the DW/BI is used not only by our clients to view their transactions, positions and risks, but also, importantly, by our traders and managers to understand their activity. In terms of security, a Chinese Wall is an absolute must. No trader should be able to view clients' price-sensitive information, and clients must only be able to see their own positions.

Position Exposure: a "position" is an obligation or asset that a client has on a particular day. A trade to buy a bond, for example, results in that client having a long position in that bond. A position can be long or short, open or closed; it has a book value and a market value, is date-and-security based, and has an associated P&L value, i.e. the potential profit or loss. Clients must be able to see their aggregated positions at various levels, for example the number of CDS positions and the gross and net exposure for each day. Clients should be able to see overall exposure by broker, by clearing house, by asset type, by status, by sector, by currency, by country, and by instrument.

Risk Management: every position/exposure carries a certain market risk, e.g. the EUR-USD rate can move against you; the yield and duration of bonds can increase or decrease. To quantify this market risk, we calculate VaR (Value at Risk), e.g. there is a 1% chance that the market value of all Investment Grade fixed income positions decreases by 1 million in 1 day. We call this 1% chance a "99% confidence level", this 1 day "the risk time horizon", and this 1 million "the VaR". Clients should be able to see VaR by broker, currency, country, maturity and asset class, on a 1-day, 1-week and 1-month horizon, at 95%, 99% and 99.9% confidence levels.
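A minimal illustration of a 1-day historical-simulation VaR at a 99% confidence level, assuming we have a history of daily P&L values (real risk engines use far richer models; the figures below are hypothetical):

```python
import random

# Historical-simulation VaR: the loss that is exceeded only
# (1 - confidence) of the time, read off the sorted history of daily P&L.
def historical_var(daily_pnl, confidence=0.99):
    losses = sorted(-p for p in daily_pnl)   # losses as positive numbers
    index = int(confidence * len(losses))    # cut-off in the loss tail
    return losses[min(index, len(losses) - 1)]

# Hypothetical history: 1,000 days of P&L on a fixed income book.
random.seed(42)
pnl = [random.gauss(0, 100_000) for _ in range(1000)]

var_99 = historical_var(pnl, confidence=0.99)
print(f"1-day 99% VaR: {var_99:,.0f}")
```

The same function applied to each "cut" of the book (per currency, per asset class, etc.) gives the per-dimension VaR views described above.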

Every trade carries certain operational risks. A dashboard displaying daily automatic confirmation and validation across all market positions (compressed and uncompressed) gives our clients confidence about how much risk exposure they have on each asset class, broker and currency, e.g. the volume of mismatched trades. Every trade also carries counterparty risk, so we need to show our clients their exposure to each counterparty, e.g. £4.5m against RBS on 17/12/2016. They can see this daily counterparty exposure on a timeline, for different brokers and different asset classes, with the associated MTM (mark-to-market).

Regulatory Compliance: different regulators demand different reports, which we need to produce simultaneously, e.g. Dodd-Frank in the US, EMIR in the EU, FATCA, MiFID. We need to help our clients fulfil their regulatory reporting requirements, as well as produce our own reports to the authorities, e.g. we must report every OTC derivative contract to a trade repository, implement margining standards, and monitor clearing obligations in each CCP (Central Counterparty). This includes fraud detection, insider trading, sanctions, AML (Anti-Money Laundering), KYC (Know Your Client), and RWA (Basel III). Clients pass us their regulatory reporting requirements, and we create automated reports within our DW/BI to give them this data, securely downloadable from the extranet.

 

Prop Desk and Fund Management

Many investment banks also invest their own money in some assets, most notably fixed income. We call this "Prop", short for Proprietary Desk or Proprietary Trading. There are 2 kinds: it can be short term, i.e. just this trade, or long term, where the trade is part of a bigger plan to manage money, a kind of "fund management".

We want to do prop because we want to make a market. If a client wants to sell their CDS cheaply, or buy an IRS at a good price, why don't we take the opposite side and create a market? If someone is under pressure to sell a UBS CDS at 6% below the market price, it is profitable for us to buy it and sell it to someone else the next day at 2% below the market, making immediate money.

The prop desk also actively looks at the market for opportunities. If they believe that the Colombian bond's recent 32% dive has made the price too low, they will instruct our trader to buy some. Or the prop desk could want a delta-neutral trade, to gain from volatility. Which means it doesn't matter which way the price moves: whether it goes down or up, we still make money. The more volatile the price, the more money we make.

So that's the incidental side. But the bank's money is usually managed from a long-term perspective, not just for day-to-day trading. As an example, we may want to create a long position on Cable because we believe US politics will result in the Fed raising rates next year. Or take advantage of a rising EM market next year, or Japanese equity.

All the big IBs (JPM, Nomura, UBS, Barclays, etc.) have an asset/wealth management business. They take clients' money and invest it on the clients' behalf. This AM business also instructs our traders to buy or sell to create positions for the clients.

The BI required to support the prop money and asset management has the following objectives:

  1. Compliance: each fund has a "boundary", e.g. 80-90% must be invested in EM bonds ("EM" = Emerging Market, i.e. Chile, Russia, India, China, etc.). At least 95% must be in USD. Use of options must be <= 10%. The Latam (Latin America) maximum is 30%. And so on. All these "limits" are coded into the OMS (Order Management System) so that every time a trade is created on the OMS, it is automatically checked before being executed. Afterwards we can plot in our BI the chart of the "EM Bonds" content of the fund (percentage-wise) for each month from 2014 to 2016. And not just EM Bonds: on the same dashboard we should be able to see the regional breakdown of the fund over the last 3 years, and the currency breakdown of the fund, month by month, over the last 3 years. And we can see in that chart whether any non-compliance event happened.
  2. Profitability: we can compare the amount of money we make in each trade to the amount of capital required to do the trade, and the amount of time required. We can aggregate this to client level, fund level, country level, currency level, asset class level, etc., for any time duration we like.
  3. Risk: we can look at the level of risk we take in each trade, in terms of contribution to VaR, and in terms of historical events. Meaning: if a 9/11-type event happened tomorrow, what kinds of risk would we have in this portfolio? Have we hedged it? This is tested not just for 1 event like 9/11, but for hundreds of events. Some of them are real events which happened in the past; some are theoretical events such as rises in interest rates or FX rates.
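The pre-trade limit checking described in point 1 can be sketched as follows (the fund boundaries and portfolio figures are hypothetical, mirroring the examples above):

```python
# Hypothetical fund boundaries, as they might be coded into an OMS.
LIMITS = {
    "em_bonds_min": 0.80, "em_bonds_max": 0.90,  # 80-90% in EM bonds
    "usd_min": 0.95,                              # at least 95% in USD
    "options_max": 0.10,                          # options <= 10%
    "latam_max": 0.30,                            # Latam maximum 30%
}

def check_compliance(fund):
    """Return the list of breached limits (an empty list means compliant)."""
    breaches = []
    if not LIMITS["em_bonds_min"] <= fund["em_bonds"] <= LIMITS["em_bonds_max"]:
        breaches.append("EM bonds outside 80-90%")
    if fund["usd"] < LIMITS["usd_min"]:
        breaches.append("USD below 95%")
    if fund["options"] > LIMITS["options_max"]:
        breaches.append("options above 10%")
    if fund["latam"] > LIMITS["latam_max"]:
        breaches.append("Latam above 30%")
    return breaches

# A hypothetical snapshot of the fund after a proposed trade.
fund = {"em_bonds": 0.85, "usd": 0.97, "options": 0.12, "latam": 0.20}
print(check_compliance(fund))  # ['options above 10%']
```

Storing each check result with a date is what lets the BI chart non-compliance events over time, as described above.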

 

Research

The Research team makes money by selling market analysis, as well as analysis of economic and geo-political risk, both of the market as a whole (macro/overview level) and at the individual company level. We charge clients fees to use our market analysis tool, which enables them to monitor market behaviour and market events, and to get signals and alerts on situations they want to trade on.

So the BI for Research has two purposes:

  1. To support our analysts doing their market and economy analysis.
  2. To support our clients doing their analysis.

The BI provides, for example, various market indicators such as CDS spreads for each corporate bond, historical interest rates and currency rates, and yield curves. From the bond price and maturity we calculate the yield, duration and various spreads on each and every fixed income instrument, and let our analysts and clients feast on this data.

They can filter by asset class, country, currency, region and industry sector, and see the trend today, this week, this month and this quarter. We also let our clients view the forecasts our analysts have created.

 

Part 4. Technical Design

There are 5 architectures which can be used for BI in IB. Three of them are old approaches (#1-3 below), and two of them are new approaches (Data Vault and Hadoop):

  1. DDS only (Dimensional Data Store)
  2. ODS (Operational Data Store) + DDS
  3. NDS (Normalised Data Store) + DDS
  4. DV (Data Vault) + DDS
  5. Hadoop + DDS

A bit on terminology:

  • DDS means dim & fact tables. It's a dimensional model, i.e. a Kimball star schema.
  • ODS means current version only: the historical values of attributes are not stored. Surrogate keys are generated in the DDS.
  • NDS means the historical values of the attributes are stored. The DDS becomes a dummy. Surrogate keys are generated in the NDS.
  • Data Vault stores the attributes separately from the business keys, and the links between business keys are also stored separately. It is superior for capturing time-based relationships between tables.
  • Hadoop stores data in a distributed file system (HDFS) and uses Hive or NoSQL to query the data out.

As you can see above, in all 5 architectures the BI tool (Cognos, BO, Tableau, etc.) faces the DDS. The ODS, NDS, DV and Hadoop are all "back-end engines" which the users never see. Their function is to integrate data from many source systems.

Because of this, the ODS, NDS, DV and Hadoop are called the Integration Layer (IL) or the Enterprise Layer (EL), whereas the DDS is called the Reporting Layer (RL) or the Presentation Layer (PL). The primary benefit of having an Integration Layer separate from the Reporting Layer is to make data integration easier, because the IL has a separate model which breaks the entities down to a more detailed level than the RL. The IL data model is shaped to be "write friendly" (hence a normalised model) whereas the RL data model is shaped to be "read friendly" (hence a denormalised model).

 

Dimensional Model Design

Dimensions: Date, Client, Broker, Branch, Product, Instrument, Desk, Trader, Issuer, Country of Risk, Currency, Rating, Collateral, Fund, Asset Class.

Designing instrument dimension (also called “security” dimension) in IB is a tricky business. First we need to decide whether we want cash and OTC in the instrument dimension. Second, we need to decide whether we want to split the derivatives (IRS & CDS in particular) into a separate instrument dimension because of their rich attributes. Third, we need to decide whether the instrument classification (such as instrument currency, rating, country of risk, asset class, etc.) should be stored inside the instrument dimension or outside.

Because of the last point, sometimes you don't see a rating dimension or a country dimension, because they are amalgamated into the instrument dimension. But issuer and currency dimensions are usually created (more often than not).

When creating Client, Branch and Broker, some companies feel clever and create a table called Party in the Integration Layer. I advise against this and prefer to split them into 3 entities, for clarity and flexibility. In the DDS they are always separate dimensions. Some companies split the address and country in the ODS/NDS into separate tables in the name of normalisation. I advise against this too, and prefer to keep demography and geography attributes (such as address and country) within the client, branch and broker tables for simplicity and flexibility, even if that means data redundancy and breaking Third Normal Form.

There are many, many instrument classifications in IB: asset type, duration band, maturity band, country, country group, region, currency, currency group, broker type, coupon type, interest rate type, settlement type, collateral type, contract type, rating type, clearing house type, issue type, market type, derivative type, interest calculation type, direction, swap type, etc. Each of these can be created as its own table, or they can all be created as 1 common key-attribute-value table. I prefer the latter due to its simplicity and consistency of modelling, as well as ease of use. The primary benefit is that we don't need to change the data structure: using 1 common table, we don't need to keep creating a new table every time we have a new classification, which in IB happens almost every month. Within the RL, of course, all of them go into 1 dimension: instrument.
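The single key-attribute-value classification table might look like this (sketched here with Python's sqlite3; the table and column names are illustrative):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE instrument_classification (
        instrument_id INTEGER,
        class_type    TEXT,   -- e.g. 'asset type', 'rating type'
        class_value   TEXT    -- e.g. 'Corporate Bond', 'Investment Grade'
    )""")

# A new classification is just new rows -- no new table needed.
con.executemany(
    "INSERT INTO instrument_classification VALUES (?, ?, ?)",
    [(1, "asset type",  "Corporate Bond"),
     (1, "rating type", "Investment Grade"),
     (1, "coupon type", "Fixed"),
     (2, "asset type",  "CDS")])

rows = con.execute(
    "SELECT class_type, class_value FROM instrument_classification "
    "WHERE instrument_id = 1 ORDER BY class_type").fetchall()
print(rows)
```

When next month brings, say, a new "swap type" classification, it is just another set of rows in the same table rather than a schema change.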

 

Fact Tables

There are 8 major fact tables in IB:

  1. Position
  2. Transaction
  3. Risk
  4. Performance
  5. Collateral
  6. Deal Pipeline
  7. Client Account
  8. Prop Desk

I'll explain the above 8 one by one; bear in mind that each of them might be implemented as two or more physical tables.

  1. The Position fact table stores the daily value of each position (instruments and cash) in each portfolio. It also stores the analytic values of each instrument, such as spread and yield, and time-based properties such as duration and maturity, along with their bandings.
  2. The Transaction fact table stores the buying and selling activities in each portfolio, and non-trade transactions too, e.g. interest payments, defaults, haircuts, instrument maturities, and corporate actions.
  3. The Risk fact table stores the VaRs, tracking errors and stress test impacts. To be flexible, in some IBs it is designed as a "vertical fact table" where a "risk type" column determines the measure. But for read-efficiency and clarity I prefer to create the measures as individual columns, with the time horizon (1 month, 1 year, etc.) as a separate column, and the "cuts" (by region, currency, asset class, etc.) as separate columns.

Risk is usually calculated not at instrument level but at "cuts" level, including portfolio level. But if it is calculated at instrument level, then we need to create a separate fact table to store the risk numbers at that level.

  4. The Performance fact table stores the growth of a fund over the last 1 month, 3 months, 6 months, 1 year, 3 years, 5 years and 10 years. Each "share class" is stored in its own row.

 

Before continuing to fact table number 5 to 8, let me explain the aggregatability first:

Unlike the Position and Transaction fact tables, the Risk and Performance fact tables are not aggregatable. Every "cut" in the Risk fact table stands on its own, and every share class in the Performance fact table stands on its own.

The Position and Transaction fact tables are only aggregatable up to portfolio or fund level. It is best not to put the share class into the Position or Transaction fact table, because from the position and transaction point of view the share classes are the same. But in the Risk and Performance fact tables we must include the share class, because the risk and performance numbers are different for each share class (they are affected by the portfolio currency and by accumulation/income).

 

  5. The Collateral fact table stores the market value of each individual OTC derivative (the MTM, mark-to-market) and the required value adjustments in collateral against each broker-dealer.
  6. The Deal Pipeline fact table stores the flow of a deal between the bank and lending clients, M&A clients, and transaction banking clients. It records the status from when the client was a prospect until the deal is agreed and closed.
  7. The Client Account fact table stores the clients' money that we manage and invest, including subscriptions (deposits, i.e. the client puts money in), redemptions (withdrawals, i.e. the client takes money out), and buying and selling activities. This fact table needs to store both the movements and the daily balance of each client account (hence it is preferably split into two: one periodic snapshot, one transactional).
  8. The Proprietary Desk fact table stores the bank's own money that we manage and invest, including subscriptions and redemptions, buys and sells, and corporate actions. It is a legal requirement in the US, EU and UK that we separate clients' money from our own money.

 

That’s the technical design of the BI for IB. Hope it has been useful for you.


 

4 July 2016

Historical Portfolio Positioning

Filed under: Analysis Services — Vincent Rainardi @ 7:08 am

One of the most difficult charts to create in the asset management sector is Historical Duration Positioning, such as the one for JPM Strategic Bond here: link.

First we have to get the numbers for each category for each month (at least each quarter), e.g. in April 2014 the portfolio was as follows:

  1. Government 25%, 1.1 years
  2. IG Corporate 20%, 0.7 years
  3. EM Sovereign 10%, 0.4 years
  4. EM Corporate 5%, 0.3 years
  5. High Yield US 10%, 0.2 years
  6. High Yield non US 5%, 0.4 years
  7. Securitised 10%, 0.7 years
  8. Convertibles 10%, 0.9 years
  9. Cash 5%, 0 years

The 2 numbers above are the market value weighting (%) and the duration (years). In the above, the portfolio duration was 4.7 years (the sum of the durations of the 9 categories).
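A quick check of that arithmetic:

```python
# (market value weighting, duration contribution in years) per category,
# taken from the April 2014 list above.
categories = {
    "Government":        (0.25, 1.1),
    "IG Corporate":      (0.20, 0.7),
    "EM Sovereign":      (0.10, 0.4),
    "EM Corporate":      (0.05, 0.3),
    "High Yield US":     (0.10, 0.2),
    "High Yield non US": (0.05, 0.4),
    "Securitised":       (0.10, 0.7),
    "Convertibles":      (0.10, 0.9),
    "Cash":              (0.05, 0.0),
}

total_weight = sum(w for w, d in categories.values())
portfolio_duration = sum(d for w, d in categories.values())
print(round(total_weight, 2), round(portfolio_duration, 1))  # 1.0 4.7
```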

Notice that the categorisation above is cutting across 6 different dimensions:

  1. Government/sovereign bond vs corporate bond
  2. Investment Grade bond vs High Yield bond
  3. US vs non US vs Emerging Market
  4. Securitised vs non-securitised
  5. Convertible bond vs non-convertible bond
  6. Cash

To create this report, first we need to identify cash and cash equivalent positions. Then convertibles and securitised. Then HY vs IG, sovereign/government vs corporate, then Emerging vs Developed Markets.

Because point 3 is EM Sov, the government in point 1 must be DM (Developed Market). Because points 5 and 6 are HY, points 1 to 4 must be IG. The business analysts are expected to know that "sovereign" is equivalent to "government", and that "government" usually means investment grade debt in a developed market.

So the actual meanings of the categories are:

  1. DM IG Government 25%, 1.1 years
  2. DM IG Corporate 20%, 0.7 years
  3. EM IG Sovereign 10%, 0.4 years
  4. EM IG Corporate 5%, 0.3 years
  5. HY US 10%, 0.2 years
  6. HY non-US 5%, 0.4 years
  7. Securitised 10%, 0.7 years
  8. Convertibles 10%, 0.9 years
  9. Cash 5%, 0 years

Total is not 100%

Points 1 to 8 above are debt, which means they exclude cash but include debt derivatives (not just bonds), though not other derivatives. So if the portfolio has an equity index option (which is an equity derivative), it would be excluded from the calculation, and that makes the total not equal to 100%. It could be less than 100%, or more than 100%. For example, if the portfolio has 2% equity, the total would be 98%. But if the portfolio has 2% in equity index options, the total could be 102%.

In this case we need to decide whether to create a 10th category called "Other" and put the non-debt, non-cash assets into it, or to scale the categories up to 100%. I recommend the former over the latter, because scaling up misrepresents the weighting of each asset class, e.g. we would say that Securitised is 11% when it is actually 10%.
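A small sketch of the "Other" category approach (weights hypothetical: here 2% of the portfolio is equity, so the nine categories sum to 98%):

```python
# Hypothetical category weights summing to 98% because 2% is equity,
# which is neither debt nor cash.
weights = {"DM IG Government": 0.25, "DM IG Corporate": 0.20,
           "EM IG Sovereign": 0.10, "EM IG Corporate": 0.05,
           "HY US": 0.10, "HY non US": 0.05, "Securitised": 0.10,
           "Convertibles": 0.08, "Cash": 0.05}

total = sum(weights.values())              # 0.98
weights["Other"] = round(1.0 - total, 10)  # put the remainder in "Other"
# ...rather than scaling the nine categories up, which would
# misrepresent the true weighting of each asset class.
print(weights["Other"])
```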

Usual definitions for cash, securitised and EM/DM

The asset classes usually considered as cash are net proceeds, cash balances, FX forwards (but not FX swaps), certificates of deposit, money market securities, cash funds, commercial paper, etc. The grey areas are treasury bills, short-term government bonds, and marketable securities: for some portfolios, such as global credit, they could be considered cash equivalents; for other portfolios, such as short-maturity gilt funds, they could not.

Securitised and convertible bonds are identified regardless of region. Securitised includes CMBS, RMBS (agency and non-agency), mortgage passthroughs, and other ABS such as car loans.

Definition for “Developed Market” differs from company to company. For example, is Jersey developed or emerging market? How about Cayman Islands? Do we use country of risk, or country of incorporation, or country of domicile? The country of risk might be US, but the country of domicile could be Cayman Islands.

Re-classify the past

Once it is all sorted out, we can calculate the % and duration for each category, for each month, and at that point we can plot the timeline. But the security classification in the company changes from time to time. For example, before 2011 they might not have differentiated between EM Sov and EM Corporate, only EM; or between HY US and HY non-US, i.e. they only labelled it as HY. This means that we need to reclassify the 2010 positions according to the 2016 definitions.

This is one of the uses of a data mart (as opposed to a data warehouse): to reclassify the securities used in the position fact tables. We can create a new attribute in the security dimension called Bond Category, and label each security with one of the categories above. Equity and equity derivatives, and currency derivatives, would have this attribute set to null (or to their correct category). We can then use this attribute for the above reporting.
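As a sketch of the relabelling logic for pre-2011 positions (the field names and values below are illustrative, not a real schema):

```python
# Illustrative sketch: map pre-2011 labels onto the current (2016)
# bond categories. Field names here are made up for the example.
def bond_category(security):
    old = security["old_label"]
    if old == "HY":        # old data didn't split HY by region
        return "HY US" if security["country_of_risk"] == "US" else "HY non-US"
    if old == "EM":        # old data didn't split Sovereign vs Corporate
        return ("EM Sovereign" if security["issuer_type"] == "Government"
                else "EM Corporate")
    return old             # label already matches the current scheme

positions_2010 = [
    {"old_label": "EM", "issuer_type": "Government", "country_of_risk": "BR"},
    {"old_label": "HY", "issuer_type": "Corporate", "country_of_risk": "US"},
]
print([bond_category(s) for s in positions_2010])   # ['EM Sovereign', 'HY US']
```

In the data mart this mapping would be applied once, to populate the Bond Category attribute in the security dimension, rather than at query time.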

Measure: %, duration, spread and DTS

The above example uses % and duration as the measures to present in the chart. Duration is suitable for a global credit portfolio (whether buy and hold or active), particularly if the portfolio leans towards IG or has a significant proportion of DM sovereign debt.

But if it is an emerging market debt portfolio, a more appropriate measure would be Duration Times Spread (DTS, the modified duration times the yield spread to government debt). And if it is a corporate debt portfolio, it is more appropriate to use Spread Duration rather than Duration (particularly if it has a high proportion of high yield).
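As a sketch with made-up numbers, using the definitions above (portfolio measures as weight-averaged position measures, and DTS as duration times spread):

```python
# Made-up positions: (weight, modified_duration, spread_over_govt_in_%)
positions = [
    (0.50, 4.0, 1.2),
    (0.30, 6.0, 3.5),
    (0.20, 2.0, 0.8),
]

# Portfolio duration: weight-averaged modified duration.
portfolio_duration = sum(w * d for w, d, s in positions)

# Portfolio DTS: weight-averaged (duration x spread) per position.
portfolio_dts = sum(w * d * s for w, d, s in positions)

print(round(portfolio_duration, 2))  # 4.2
print(round(portfolio_dts, 2))       # 9.02
```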

Different measures can also be used to monitor historical portfolio positioning, for example risk measures such as PV01 (interest rate risk), IE01 (inflation risk) and DV01 (another interest rate risk). VAR (Value at Risk), portfolio volatility (3Y standard deviation) and TE (Tracking Error) are also common. Tracking these risk measures over the last 3 years (or 10) and comparing them between portfolios (and to the benchmark) is a good demonstration that the desk has incorporated good risk management into managing the portfolio. It is in the clients'/investors' best interest to know that, hence such a report provides good value both to the asset management house and to the investors.

18 June 2016

About NOLOCK

Filed under: Analysis Services — Vincent Rainardi @ 10:13 pm

When we issue a SELECT statement in SQL Server, it locks the table. Many people wonder why SQL Server needs to lock the table.

While the SELECT is being processed, when SQL Server is reading the rows in the table, imagine if someone drops the table. What will happen? The drop statement will have to wait until SQL Server finishes reading all the rows for this select. (1)

What if someone truncates the table while the SELECT is reading the rows? Or deletes the rows that the SELECT was about to read? Same thing, the TRUNCATE or DELETE will have to wait until the SELECT finishes. (2)

What if someone updates the rows that the SELECT was about to read? Same thing, the UPDATE has to wait. (3)

What if someone inserts some rows that fall within the SELECT criteria? Same thing, the INSERT has to wait. (4)

(1), (2), (3) and (4) are the purposes of locking the table when reading it, i.e. to get consistent result from the beginning of the read until the end of the read.

This is the default behaviour of SQL Server. I call it "Clean Read". Formally it is called the READ COMMITTED isolation level.

The opposite of a Clean Read is a Dirty Read, which happens when we read data that another transaction has changed but not yet committed, so at one point we see a row, but a split moment later (if the change is rolled back) that row was never really there. In SQL Server this corresponds to the READ UNCOMMITTED isolation level. (A different anomaly, where re-reading the same row within one transaction returns different values, is what is formally known as a "Non-Repeatable Read".)

SELECT With NOLOCK

What if we do the SELECT with NOLOCK? Does SQL Server prevent someone from dropping the table? Yes, nobody can drop the table while it is being read, even if the read uses NOLOCK. The DROP will be executed after the SELECT has finished.

What about Truncate? Can someone truncate the table, while a SELECT with NOLOCK is reading the table? No, the TRUNCATE will wait until the SELECT finishes.

What about Delete? Can someone delete the rows from the table while those rows are being read by a SELECT with NOLOCK? Yes, that is possible.

If you have a table with 100,000 rows and User1 issues SELECT * FROM TABLE1 WITH (NOLOCK), and at the same time User2 issues DELETE FROM TABLE1, then User1 may only get something like 15,695 rows*, because the other 85k rows were deleted.

*This is the result of my test. Obviously you can get some other numbers depending on various factors.

These 15,695 rows are not sequential, but jumpy, i.e. User1 will get
row 90351-90354, then jump to:
row 90820-90824, then jump to:
row 94100-94103,
and didn’t get row 94104 to 100,000.

Because the missing rows have been deleted by User2.

Error 601

But sometimes, instead of getting the above rows, User1 will get this error message:
Error 601: Could not continue scan with NOLOCK due to data movement

This is when “the page at the current position of the scan” is deleted. See link.

 

 

Domain Knowledge

Filed under: Analysis Services — Vincent Rainardi @ 9:59 pm

In my opinion one can't be a data architect or data warehouse architect without good domain knowledge. For example, in asset management the main dimension (or reference data) is Security (aka Instrument). A security has many attributes, such as:

  • Security IDs: ISIN, Sedol, Cusip, Ticker Code, Bloomberg ID, Markit RED, in house ID, etc.
  • Industry/Sector: Barclays, Markit, JP Morgan, Merrill Lynch, GICS, UK SIC, International SIC, IMF Financial Sub Sector, Moody 35, Moody 11, etc.
  • Credit Ratings: S&P, Moody's, Fitch, house rating, and numerous combinations of them.
  • Stock Exchange (where the security is listed)
  • Issuer, Parent Issuer (for derivative use underlying?)
  • Dates (maturity date, issue date, callable dates, putable dates)
  • Country: country of incorporation, country of domicile, country of risk
  • Currency: denomination currency
  • Asset class: inflation-linked, treasury, sovereign, IG corporate, CDS, IRS, options, futures, FX swap, etc.

And we have many measures such as valuations (market value, exposure, P&L), analytics (yield to maturity, current yield, modified duration, real duration, spread duration, convexity, etc) and risk (value at risk, tracking error, PV01, IE01, DV01, etc).

If we don't understand what they mean, and how they relate to each other, how can we design the data structure to store them properly? We can't. As data architects we need to understand the definition of each of the attributes and measures above.

We also need to understand the meaning of each value. For example, one possible value of the asset class attribute is Equity Index Option. What does Equity Index Option mean? How does it differ from Equity Index Swap and Equity Option? We need to understand that, because by understanding it we will be able to distinguish between Asset Class and Asset Type, see how these two fields relate to each other, and decide whether we need to create an Asset Sub Class attribute or not, and what the hierarchy between Asset Class and Asset Sub Class should be.

Of course a business analyst needs to have domain knowledge, but a data architect also needs it. A data architect who doesn't have the domain knowledge will not be able to do their job properly. A data architect who has worked for a pharmaceutical company for 5 years will not be able to design a database for a Lloyd's insurance company without learning the domain first, and it could be 6 months before they reach the necessary level of understanding of insurance. Like a business analyst's, a data architect's job is industry specific.

I agree, not all industries are as complex as Lloyd's underwriting/claims or investment banking. Retail, mining, manufacturing, distribution and publishing, for example, are fairly easy to understand, perhaps taking only 1-2 months. But healthcare, insurance, banking, finance and investment are quite difficult to understand, perhaps requiring about 6 months, or even a year.

So how do we get that industry knowledge? The best way is formal training in that industry sector, which can be either self study or a class. Fixed-income securities, for example, can be learned from a book; ditto the trade lifecycle. I don't believe there is anything that one cannot learn from a book. Yes, some people prefer to go to a class and have a teacher explain it to them, but some people prefer to read it themselves (myself included).

I agree that a developer doesn't necessarily need industry knowledge; it is nice to have, but not mandatory, be it an application developer, a database developer or a report/BI developer. A project manager also doesn't need industry knowledge and will be able to do their job without it. But a data architect, like a business analyst, needs to have it. A business analyst needs the business knowledge to understand the business requirements; a data architect needs the industry knowledge to design the data model and to analyse where to get the data from.

 

7 May 2016

U-SQL

Filed under: Analysis Services — Vincent Rainardi @ 7:26 pm

U-SQL is a new language used in Azure Data Lake. It is a mix of T-SQL and C#, and is used to load files into the Data Lake.

To use U-SQL, we need to add a new Data Lake Store in our Azure Portal and click Sign Up for Preview (we need to use IE; in Chrome we can't click Sign Up). We will get an email when it is approved (which can take a few hours, or a few weeks).

Here is an example for calculating the duration for each currency in the portfolio:

@rawData = EXTRACT SecurityName string?, Weight decimal?, ModifiedDuration decimal, MaturityDate DateTime, Currency string
FROM "/FixedIncome/EMD.csv" USING Extractors.Csv(silent: true);

@aggData = SELECT Currency, SUM(Weight * ModifiedDuration) AS Duration FROM @rawData GROUP BY Currency;

OUTPUT @aggData TO "/Report/Duration.csv" ORDER BY Currency USING Outputters.Csv();

The above U-SQL code imports a comma delimited file called EMD.csv, calculates the duration for each currency and creates a file called Duration.csv. I'll explain the 3 statements above (EXTRACT, SELECT, OUTPUT) one by one below.
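For readers less familiar with U-SQL: the aggregation the job performs is just a weighted sum grouped by currency. A plain-Python sketch of the same calculation, with made-up sample rows in the same shape as the CSV file:

```python
import csv
import io
from collections import defaultdict

# Made-up sample data in the same column layout as the EMD.csv file above.
raw = io.StringIO(
    "SecurityName,Weight,ModifiedDuration,MaturityDate,Currency\n"
    "BondA,0.6,5.0,2025-01-01,USD\n"
    "BondB,0.4,2.5,2022-06-30,USD\n"
    "BondC,1.0,7.0,2030-12-31,EUR\n"
)

# Equivalent of SUM(Weight * ModifiedDuration) ... GROUP BY Currency
duration = defaultdict(float)
for row in csv.DictReader(raw):
    duration[row["Currency"]] += float(row["Weight"]) * float(row["ModifiedDuration"])

print(dict(duration))   # {'USD': 4.0, 'EUR': 7.0}
```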

EXTRACT

Case sensitive: U-SQL is case sensitive and strongly typed. It is case sensitive because it uses C# syntax. So "EXTRACT" must be in upper case, "string" must be in lower case and "DateTime" must be in mixed case.

EXTRACT is used to import a file into a row set. It can be used to import both structured files (all rows have the same number of columns) and unstructured files (rows have different numbers of columns). Documentation: here.

The data type is C# data type, not T-SQL. Full list of C# built-in data types is here. DateTime is a structure, not a data type. It has Date, Time, Year, Month, Day, Hour, Minute, Second, Today, Now, TimeOfDay and DayOfWeek properties. See here for documentation.

The ? at the end of "string?" and "decimal?" means the column is nullable, i.e. it is allowed to contain NULL (this is C#'s nullable type syntax).

One thing we need to remember is that U-SQL is "schema on read", which means that we define the data types of the file we import as we import it.

When importing a file we need to use an extractor: a program which reads a byte stream in parallel into a row set. U-SQL has 3 built-in extractors: one reads text files with any delimiter (Extractors.Text), one reads tab delimited files (Extractors.Tsv) and one reads comma separated values files (Extractors.Csv). The last two are derived from the first. By default U-SQL splits the imported file into several pieces and imports the pieces in parallel. U-SQL imports unstructured files using user-defined extractors. See here for documentation on extractors.

silent is a parameter: it means that if a row in the CSV file doesn't have 5 columns, the row is ignored (we expect the CSV file to contain 5 columns: SecurityName, Weight, ModifiedDuration, MaturityDate and Currency). There are other parameters, such as: (documentation here)

  • rowDelimiter: the character at the end of the row, e.g. carriage return, linefeed
  • quoting: whether the values are wrapped in double quotes or not
  • nullEscape: what string represents a null value

@rawData is the name of the row set containing data we import from the CSV file. We can then refer to this row set with this name.

SELECT

Once we have this @rawData row set, we can apply normal SQL statements to it, such as SELECT, UPDATE, INSERT and DELETE. But we need to use C# syntax in the expressions, so the where clause is not "where AssetClass = "Options"" but "where AssetClass == "Options"". And SELECT, UPDATE, INSERT and DELETE all need to be in upper case. Other SQL statements supported in U-SQL are: JOIN, UNION, APPLY, FETCH, EXCEPT and INTERSECT.

  • SELECT: U-SQL supports WHERE, FROM, GROUP BY, ORDER BY and FETCH. Documentation here. FETCH means get the first N rows, link. For ALIAS see here.
  • JOIN: U-SQL supports INNER JOIN, LEFT (OUTER) JOIN, RIGHT (OUTER) JOIN, FULL (OUTER) JOIN, CROSS JOIN, SEMIJOIN, and ANTISEMIJOIN. Documentation here. Both the “ON T1.Col1 == T2.Col2” and the old style “WHERE T1.Col1 == T2.Col2” are supported (link).
  • UNION: U-SQL supports both UNION and UNION ALL. Doc: here.
  • APPLY: U-SQL supports both CROSS APPLY and OUTER APPLY. Doc: here.
  • Interaction between row sets (these are hyperlinked): EXCEPT, INTERSECT, REDUCE, COMBINE, PROCESS.
  • Aggregate functions: SUM, COUNT, FIRST, LAST, MAX, MIN, AVG, VAR, STDEV, ARRAY_AGG, MAP_AGG. Doc: here. ARRAY_AGG combines values into an array. Split breaks an array into values (link).
  • Operators: IN, LIKE, BETWEEN. Doc: here.
  • Logical operators: remember that U-SQL uses C# syntax, so AND is &&, OR is ||, = is ==, <> is != and NOT is !
  • “condition ? if_true : if_false” is also supported, see here.

 

OUTPUT

The output of a SQL statement (called an Expression in U-SQL) can be assigned to a row set as in the example above, i.e. "@aggData = ", or we can do one of these 3 things:

  1. Output it to a new file using OUTPUT
  2. Output it to a new table using CREATE TABLE AS
  3. Output it to an existing table using INSERT INTO

The OUTPUT expression saves a row set into a file. There are 3 built-in outputters in U-SQL: Outputters.Tsv for tab delimited files, Outputters.Csv for comma delimited files, and Outputters.Text for any delimiter. They accept these parameters: quoting, nullEscape, rowDelimiter, escapeCharacter, and encoding. Documentation: here.

Documentation

The U-SQL language reference is here (still work in progress), and the U-SQL tutorial is here.

Data Lake documentation is here and the learning guide is here.

Articles on U-SQL from Microsoft:

  • From Microsoft U-SQL team: link
  • Ed Macauley U-SQL tutorial: link, click on Tutorial under Prerequisites.
  • From Mike Rys: link

26 February 2016

Investment Performance

Filed under: Analysis Services — Vincent Rainardi @ 5:24 am

One of the fundamental functions of a data warehouse in an investment company (wealth management, asset management, brokerage firm, hedge funds, and investment banking) is to explain the performance of the investments.

If in Jan 2015 we invest $100m and a year later it becomes $112m, we need to explain where this $12m is from. Is it because 40% of it was invested in emerging market? Is it because 30% of it was invested in credit derivative and 50% in equity? Is it because of three particular stocks? Which period in particular contributed the most to the performance, is it Q3 or Q4?

Imagine that we invested this $100m in 100 different shares, each of them $1m. These 100 shares are in 15 different countries, e.g. US, UK, France, Canada, India, China, Indonesia, Mexico, etc. These 100 shares are in 10 different currencies, e.g. USD, CAD, EUR, CNY, MXN, IDR, etc. These 100 shares are in 15 different sectors, e.g. pharmaceutical, banking, telecommunication, retail, property, mining, etc. These 100 shares have different P/E multiples, e.g. 8x, 12x, 15x, 17x. These 100 shares have different market capitalisations, i.e. small cap, mid cap and large cap. And we have 50 portfolios like that, each for a different client; some are open funds. In the case of a fixed income investment (bonds), there are other attributes, such as credit rating (AAA, A, BBB, etc.), maturity profile (0-1 year, 1-3 years, 3-5 years, etc.) and asset class, e.g. FRN, MBS, Gilt, Corporate, CDS, etc.

Every day we value each of these 100 shares by taking the closing prices from the stock exchanges (via Bloomberg EOD data, Thomson Reuters, or other providers) and multiplying them by the number of shares we hold. So we have the value of each share for every single working day. They are in different currencies of course, but we use the closing FX rate to convert them to the base currency of the portfolio.
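The daily valuation step is simple arithmetic; a sketch with made-up holdings and rates (base currency USD):

```python
# Made-up holdings: (shares_held, closing_price, price_currency)
holdings = [
    (10_000, 25.0, "USD"),
    (5_000, 40.0, "EUR"),
]
fx_to_base = {"USD": 1.0, "EUR": 1.25}   # made-up closing FX rates

# Each position's value in base currency: quantity x price x FX rate.
position_values = [qty * px * fx_to_base[ccy] for qty, px, ccy in holdings]
portfolio_value = sum(position_values)
print(portfolio_value)   # 500000.0
```

In the warehouse this would be a daily-snapshot position fact table, one row per position per working day, with the base-currency value as a measure.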

A portfolio has a benchmark. The performance of the portfolio is compared to the performance of the benchmark. A portfolio manager is measured against how well they can outperform the benchmark.

The mathematics of performance attribution against the benchmark is explained well in this Wikipedia page: link. That is the basic idea. Now we need to do the same thing, but not just on 2 rows: on 100 rows. Not just on asset allocation and stock selection, but on all the other factors above. Not only against the benchmark, but also comparing the same portfolio month-to-month, or quarter-to-quarter.
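A sketch of that basic attribution arithmetic (the two-effect allocation/selection form from the linked page), with made-up sector numbers:

```python
# sector: (portfolio_weight, benchmark_weight, portfolio_return, benchmark_return)
sectors = {
    "Banking": (0.60, 0.50, 0.05, 0.03),
    "Mining":  (0.40, 0.50, 0.01, 0.02),
}

# Total benchmark return: benchmark-weighted sector returns.
rb_total = sum(wb * rb for wp, wb, rp, rb in sectors.values())

for name, (wp, wb, rp, rb) in sectors.items():
    allocation = (wp - wb) * (rb - rb_total)   # effect of over/underweighting the sector
    selection = wb * (rp - rb)                 # effect of stock picking within the sector
    print(name, round(allocation, 4), round(selection, 4))
```

Doing this on 100 rows instead of 2 is the same loop over more sectors (or countries, currencies, ratings), which is exactly what the fact-table grain has to support.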

The resulting data is a powerful tool for a portfolio manager, because they can understand what caused the outperformance. And more importantly, what caused the under performance, against the benchmark, and between time-points.

This month we beat the benchmark by 1%. That's good, but what caused it? Why? It is important to know. This month we grew by 1%. That is good, but what caused it? This month we are down 2%. We obviously need to know why. Our client will demand an explanation of why their money, which was $100m, is now $98m.

That would be a good reason for having a data warehouse. The value of each and every position*, from each and every portfolio, for each and every working day, is stored in the data warehouse. And then, on top of it, we apply mathematical calculations to find out what caused the ups and downs, not only at portfolio/fund level but for each currency, country, industry sector, etc., for any given day, week, month or quarter. That is worth paying $500k for, to develop this analytical tool. We in the BI industry may call it a BI tool, but from the portfolio manager's point of view it is an investment analytics system.

*Position: a position is a financial instrument that we purchased and now hold in our portfolio, for example a bond, a share, or a derivative. In addition, we also have cash positions, i.e. the amount of money we have with the broker/custodian, as well as MTM margins, repos, FX forwards, and money market instruments such as cash funds. A position is time-valued, i.e. its value depends on time.

This tool enables the portfolio managers (PMs) in an investment company not only to know the breakdown of their portfolios at any given day, but how each section of the portfolio moved from month-to-month, day-to-day. In addition, if we also put risk measures in there, such as stresses, risk analytics, or SRIs* (I’ll explain all 3 in the next paragraph), the PMs (and their financial analysts) will be able to know the individual risk type for each section of the portfolio, on any given date, and how those risks moved from month-to-month, day-to-day.

*Stresses, risk analytics and SRIs: a stress is a scenario that we apply to all positions in the portfolio. For example, what if the interest rate goes up by 0.1%? By 0.25%? By 1%? What if the FX rate goes down by 0.1%? By 1%? And similarly for other factors, such as the oil price, inflation rate, equity prices, etc. We can also apply an "event": on September 11, the S&P moved by X%, the FTSE 100 by X%, Gilts by X%, EMD bonds by X%, and FX rates by X%. There are other historical dates when the market went south. If we apply those "events" to our portfolios, what happens to the value of each position? To the value of the overall portfolio? To the value of each section of the portfolio, e.g. Asia stocks, EM, or Small Caps?
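A minimal sketch of applying one such stress, with made-up positions and sensitivities, using the standard first-order bond approximation (P&L is roughly minus value times duration times the rate move):

```python
# Made-up bond-like positions: name -> (market_value, duration)
positions = {
    "Gilt 2030": (1_000_000, 8.0),
    "Corp 2022": (500_000, 3.0),
}
shock_bp = 25   # stress scenario: interest rates up 25 basis points

# First-order P&L estimate per position: -value * duration * rate move.
pnl = {name: -mv * dur * shock_bp / 10_000
       for name, (mv, dur) in positions.items()}

print(pnl)                  # {'Gilt 2030': -20000.0, 'Corp 2022': -3750.0}
print(sum(pnl.values()))    # -23750.0
```

A real stress engine would also shock FX, equity and credit factors and reprice derivatives properly, but the warehouse's job is the same: store the per-position, per-scenario results so they can be sliced by portfolio section and date.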

Risk analytics are numbers which reflect a certain risk to an investment portfolio. For example, duration reflects how much each position will be impacted by an interest rate rise. For fixed income the risk analytics are: PV01, DV01, IE01, CR01, duration (modified duration, spread modified duration, effective duration), credit spread (spread to Libor, spread to benchmark security, spread to Treasury), yield (yield to maturity, effective yield, yield to call, yield to put, yield to worst, running yield, simple yield) and convexity (how sensitive the duration is to changes in interest rates). For an equity portfolio we have different risk analytics (mainly financial ratios of the issuer).

SRI means socially responsible investing. The theory is that the value of a share (or a bond, or a derivative) is affected by how much the company cares about the environment, by how well the company (or group of companies) is governed/managed, by how much the company promotes human rights and social justice, and by how much it avoids alcohol, gambling and tobacco. Some data providers, such as Barclays, MSCI and Verisk Maplecroft, provide this SRI data in the form of scores, ratings and indices in each area.

The PMs will be able to know each of the 3 risk categories above (stresses, analytics and SRIs) for each position within their portfolio, on any given day, and how those risks moved from month to month and day to day. That is a very powerful tool, and is worth creating. And that is one reason why we create a data warehouse (DW) for investment companies.

Not only for managing performance (which is the most important thing in a PM's career) but also for managing the risks. Because the DW is able to show them how every part of their portfolio reacted to each event, each day/week/month (reacted in terms of valuation, performance, attribution and risk), the PMs will be able to tell (or at least predict) what will happen to the value, return and risk of each section of their portfolio if such an event happens again.

17 February 2016

The Problem with Data Quality

Filed under: Analysis Services — Vincent Rainardi @ 8:44 am

The problem with data quality is not the technicality. It is not difficult to design and build a reconciliation system, which checks and validates the numbers in the data warehouse and in BI reports/analytics (or non-BI reports/analytics!).

The problem is, once the reconciliation/validation program spits out hundreds or thousands of issues, who will be correcting them? That is the issue!

It requires someone to investigate the issues, and more importantly to fix them. This requires funding, which can seldom be justified (it is difficult to quantify the benefits), and it requires a combination of skills that rarely exists in one person. So that doubles the money, because we need to hire 2 people: the "checker", who checks the DQ reports, is largely an operational application-support type of person, whilst the "fixer" needs to have a detective mindset and development skills. To make matters worse, these development skills are usually platform specific, i.e. .NET, Oracle, SSIS, Business Objects, etc.

Assuming a £40k salary in the UK, then adding NI, pension, software, desk, training, bonus, insurance, consumables, appraisal and payroll costs (a total of £15k), and multiplying by 2 people, it is a £110k/year operation. Adding half the time of a manager (£60k salary + £20k costs), it is a £150k/year operation.

It is usually difficult to find the benefit of a data quality program bigger than £100k. The build cost of a DQ program can be included in the application development cost (i.e. data validation, data reconciliation, automated checks, etc.), but the operational cost is an issue.

So the fundamental issue is not actually finding a person, or a team of people. The fundamental issue is actually to get the funding to pay these people. The old adage in IT is usually true: anything is possible in IT, provided we have the funding and the time.

The benefit can't come from FTE reduction (full time employee, meaning headcount), because it is not a reduction of workload (unless a manual DQ process is already in place, of course). And it doesn't come from increased sales or revenue either: try to find a link between better DQ and increased revenue, and you'll find that it is hard. And we know that headcount reduction and revenue increase are the two major factors for funding an activity within a company.

3 factors that drive data quality work

But fortunately there are two other factors that we can use: compliance and risk.

Compliance: the financial services industry, healthcare industry, etc. are required to report to their regulators in a timely manner, and with good accuracy. That drives data quality work. For example, if we report that the credit derivative loss position is $1.6bn whereas it is actually $2.1bn, we could be risking a penalty/fine of several million dollars.

Risk: there are other risks apart from compliance, namely credit risk, interest rate risk, counterparty risk, etc. Different industries have different risks of course, with financial services probably having the largest monetary amounts, but they all drive data quality work. If the data quality is low, we risk misstating the risk amount, and that could cost the company a fortune.

The 3rd factor to use is the data warehouse itself. If your company stores a lot of data in one place, such as a data warehouse, and the data quality is low, then all that investment is wasted. A £600k DW warrants a £50k DQ programme. And if your DW has been there for 4 years, the actual cost (development + operation) could easily exceed the £1m mark; a general 10% ratio yields a £100k investment in the DQ work.

The key phrase to use with regard to a DW's DQ is "DW usage". A centralised data store such as a DW is likely to be used across many applications/teams/business lines. Each of these applications/businesses is at risk of operational issues if the data in the DW is incorrect. And if we don't monitor the DQ, we can be sure that the data will be incorrect. That is quite an argument for a data quality project.

15 February 2016

Effektor

Filed under: Analysis Services — Vincent Rainardi @ 8:36 am

It's been about 6 months since I came across an automated data warehouse builder called Effektor, from a company based in Copenhagen. I don't remember exactly when, but I think I got it from SQLBits. The product is the best in its class, better than the other 4 DW automation products I know (WhereScape, Kalido, Insource and Dimodelo). Effektor can generate the DW tables and the ETL packages in SSIS, and it can also create MDM, SSRS reports, a balanced scorecard (BSC), and an SSAS cube. None of the other 4 products creates MDM, SSRS, BSC or SSAS cubes, as far as I'm aware.

It is 5 years old (2010) and has 40 customers, but it only runs on SQL Server, not any other RDBMS. It runs in Azure, and it runs on the Standard Edition of SQL Server (2008 R2 to 2014), as well as other editions.

All DW automation software saves costs, as I explained in my Dimodelo article. But Effektor goes further: it doesn't only create the ETL packages, but also the cube, the MDM, the scorecard and the reports. The integration with Reporting Services amazed me because, unlike SSIS and SSAS, SSRS does not have an API like AMO, so the SSRS XML files normally have to be created manually.

The current version (6.3) also has a Web API, i.e. we can control the DW sync, data import, DW load and OLAP security from PowerShell via the Web API. SSRS usage from the portal is logged, so we know who uses which report and when.

The only negative side I can think of is financial strength. As I said in the Dimodelo article, and in the Choosing an ETL Tool article, there are 3 factors which come above functionality: the price, financial strength and infrastructure. I think Effektor will satisfy the price and infrastructure aspects for most companies (companies which are a SQL Server shop, that is), but its financial strength is a question mark. Will the company still be there in 5 years' time, or will it be a takeover target for a bigger company which then lets the product wither, i.e. a takeover done to get the customers, not the product?

At the moment I don't see that happening, because Effektor only has 40 customers and an excellent product, so it would be crazy for, say, Microsoft, to be interested in Effektor just for its customers. On the contrary, if a big company is interested in buying Effektor, it must be because of the excellent product. That actually plays out better for the existing customers, because they will get better support from that bigger company, and the product development team will get better funding as the product is marketed to the bigger company's existing customers. The only drawback is that the bigger company might increase the licensing cost (to increase profitability for more demanding shareholders, and to cover a more aggressive marketing plan).

Disclaimer: I don’t receive any rewards or incentives, financially or otherwise, from Effektor or any of its partners or competitors, in writing this article.

 

Data Interface (How to Manage a DW Project)

Filed under: Analysis Services — Vincent Rainardi @ 5:54 am

One of the most challenging issues in managing a data warehouse project is that the report/analytics development can't start until the ETL is complete and has populated the warehouse tables with data.

This is traditionally solved by using a "data interface", i.e. the fact and dim tables in the warehouse are created and populated manually with a minimum of data, so that the report/cube developers can start their work.

But the issues when doing that are:

  • It takes a long time and a large effort to create that data
  • The manually handcrafted data does not reflect real situations
  • The manually handcrafted data does not cover many scenarios

A data interface is an agreement between the report-building part and the ETL part of a DW project that specifies what the data will look like. A data interface consists of two essential parts:

  • The data structure
  • The data content (aka data values)

The data structure takes a long time to create, because it requires two inputs:

  • The requirements, which determine what fields are required in the DW
  • The source data, which determines the data types of those fields

Hence the project plan for a traditional data warehouse or data mart project looks like this (let's assume the project starts in January and finishes in 12 months, in December, and we have 1 ETL developer, 1 report/analytics developer, 1 data architect, 1 BA and 1 PM):

  • Jan: Inception, who: PM, task: produce the business case and get it approved for funding.
  • Feb-Apr: Requirement Analysis, who: BA, task: create functional requirements.
  • Apr-May: Design, who: DA, task: create data model and ETL specs.
  • Jun-Jul: Design, who: DA, task: create report & cube specs.
  • Jun-Aug: Build, who: ETL Dev, task: create ETL packages.
  • Aug-Oct: Build, who: Report Dev, task: create reports & cubes.
  • Sep: Test, who: BA & ETL Dev, task: test ETL packages and fix.
  • Nov: Test, who: BA & Report Dev, task: test reports & cubes and fix.
  • Dec: Implementation, who: BA, DA & Devs, task: resolve production issues.

The above looks good, but actually it is not good from a resourcing point of view. There are a lot of empty pockets burning budget while people sit idle. There are 27 empty boxes (man-months) in the resource chart below, which is 37.5% of the chart’s 72 boxes (6 rows × 12 months).
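A quick arithmetic check of these figures. I am assuming the resource chart has 6 rows, since the 9-month chart later in this article has 54 boxes (6 rows × 9 months), and 5 billable people at the day rate used in the cost calculations further down.

```python
# Idle capacity in the original 12-month plan:
total_boxes = 6 * 12          # 72 boxes (man-months) in the chart
empty_boxes = 27              # people sitting idle
print(f"{empty_boxes / total_boxes:.1%}")   # 37.5%

# Total development cost for the 12-month plan
# (5 billable people, $500/person/day, 22 days/month):
cost = 5 * 12 * 22 * 500
print(f"${cost:,}")                         # $660,000
```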

[Resource chart: original plan]

The report developer starts in August. If we could make the report developer start in June, at the same time as the ETL developer, we would shorten the project by about 2 months. But how can we do that?

The answer is a data interface. The DA creates the DW tables and populates them with real data from the source system. This is quicker than crafting the data manually, and the data reflects real situations, covering many scenarios. Real scenarios (the ones which are likely to happen) with realistic data, not made-up scenarios with unrealistic data which are unlikely to happen.
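As a sketch of this approach, here sqlite3 stands in for both the source system and the warehouse, and all table and column names are illustrative. The point is the one-off load of real source rows into the agreed dimension table, done before any ETL exists.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Stand-in for a source system table, with real rows:
cur.execute("CREATE TABLE src_product (code TEXT, name TEXT)")
cur.executemany("INSERT INTO src_product VALUES (?, ?)",
                [("P1", "Widget"), ("P2", "Gadget")])

# The warehouse dimension table, as agreed in the data interface:
cur.execute("""CREATE TABLE dim_product (
    product_key  INTEGER PRIMARY KEY,   -- surrogate key
    product_code TEXT,
    product_name TEXT)""")

# One-off load of real source rows -- quicker than hand-crafting data,
# and the values reflect real scenarios:
cur.execute("""INSERT INTO dim_product (product_code, product_name)
               SELECT code, name FROM src_product""")

print(cur.execute("SELECT COUNT(*) FROM dim_product").fetchone()[0])  # 2
```

The report developer can now build and test against dim_product while the ETL that will eventually maintain it is still being written.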

Using the populated DW tables (facts and dims), with real data, the report developer can create the reports and the BA can test/verify the numbers in the reports, at the same time as the ETL is being developed to populate the tables. We are removing the dependency between the report development and the ETL development, which is a crucial link in the project that prolongs the project duration.

The resource chart now looks like this, 2 months quicker than the original plan:

[Resource chart: 10 months]

The value of these 2 months is approximately: 2 months × 5 people × $500/person/day × 22 days/month = $110k. That is a significant figure: the DW development cost goes down 16.7%, from $660k to $550k.
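The arithmetic can be checked in a few lines, using the rates assumed above:

```python
# Saving from starting report development 2 months earlier:
months_saved = 2
people = 5
rate = 500            # $/person/day
days_per_month = 22

saving = months_saved * people * rate * days_per_month
print(f"${saving:,}")                       # $110,000

original_cost = 12 * people * rate * days_per_month
print(f"{saving / original_cost:.1%}")      # 16.7%
```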

We notice that there is a big white area in the lower left of the resource chart, i.e. the DA and the 2 developers are not doing anything in the first 3-5 months. This can be tackled using iterative development, i.e. breaking the project into 3 parts (by functional area) and delivering these 3 parts one after the other. In the chart below, part 1 is yellow, part 2 is green and part 3 is purple.

[Resource chart: split into 3 parts]

Part 1 (the yellow boxes) goes live in June, part 2 in August, and part 3 in September. The big system test at the end of the project is no longer required, because we go live bit by bit. During April the BA can prepare the test plan.

The resource utilisation is now higher: we now have only 8 white boxes out of 54, i.e. 14.8%.

The project duration is shortened further, to 9 months rather than 10. That’s another $55k saved. The cost is now $495k, 75% of the original $660k.
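Checking the iterative plan’s figures with the same assumed rates ($500/person/day, 22 days/month, 5 billable people):

```python
# One more month saved by iterative delivery:
people, rate, days = 5, 500, 22
extra_saving = 1 * people * rate * days        # $55,000

final_cost = 12 * people * rate * days - 110_000 - extra_saving
print(f"${final_cost:,}")                      # $495,000
print(f"{final_cost / 660_000:.0%}")           # 75%
```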
