Data Warehousing and Data Science

18 October 2017

Hierarchy with Multiple Parents

Filed under: Analysis Services — Vincent Rainardi @ 7:40 am

The left diagram below shows a single-parent hierarchy. The right diagram is a multiple-parent hierarchy, where a child node can have 2 or more parents.

Figure 1. Left: single-parent hierarchy, right: multiple-parent hierarchy

In the right diagram above, leaf node H has 3 parents: E and F are in one branch, and G is in another branch. This shows an important principle: in a multiple-parent hierarchy, a child node can be located in 2 different branches (or more).

Double Counting

If the amount in node H is £10, this £10 is carried forward to nodes E, F and G, then to B and C, and finally to A. This causes double counting. Node A accumulates £40 from node H (£20 via node B and £20 via node C, because F sits under both B and C). Solving double counting is one of the most important points in dealing with multiple-parent hierarchies.

Ragged Hierarchy

In the right diagram above, Node I has 2 direct parents: F and C. F is 1 level up, and C is 2 levels up. This is an important principle: in multiple-parent hierarchies, the parent node can be 2 levels up (or more). This situation is called “ragged”, meaning skipping a level. Ragged can also happen in a single-parent hierarchy. Node I in the left diagram above is an example of a ragged node. A hierarchy that has a ragged node is called a ragged hierarchy.

Terminology

A hierarchy is like a tree. It has a root, branches and leaves. It is useful to picture the hierarchy upside down like below, so that it looks like a tree.

Figure 2. Upside Down Hierarchy, similar to a Tree

Picturing the hierarchy upside down helps us understand the terminology. We say that A is the “Root Node”, and I, H and D are the “Leaf Nodes”. You see, when we turn the tree upside down like this, the terminology matches reality: we have the Root Node at the bottom of the tree, and the Leaf Nodes at the top of the tree.

Unfortunately we should only picture this upside-down tree in our minds, because in every text we will find the tree drawn as in Figure 1 above, with the Root Node at the top and the Leaf Nodes at the bottom. So when we communicate with other people, we should use this “root at the top” convention.

A. Aggregation

In a multiple-parent hierarchy, the aggregation is done by summing the amounts of distinct nodes. Taking the distinct nodes first is very important, to avoid double counting.

Figure 3. Left: double counting aggregation (sum of nodes), right: single counting aggregation (sum of distinct nodes)

In Figure 3 above, each node is worth £10. The red numbers above the nodes are the total of all nodes under it, including itself. As we can see on the left diagram, if we double count, the total for node A (which is at the top of the tree) is £150, which is incorrect. There are only 9 nodes so the total should be £90. The right diagram shows the correct amount at node A, which is £90.
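To illustrate, here is a rough T-SQL sketch of the two aggregations, using the parent_child_hierarchy table that is populated in section C3 below and assuming every node is worth £10 (#node_amount and #descendants are hypothetical working tables, not part of the original design):

-- Hypothetical amount table: every node is worth £10
if object_id('tempdb..#node_amount') is not null drop table #node_amount;
select distinct child_node as node, cast(10 as money) as amount
into #node_amount
from parent_child_hierarchy;

-- All descendants of root node A, walking down the hierarchy.
-- A node reachable via several paths appears several times in this list.
if object_id('tempdb..#descendants') is not null drop table #descendants;
;with descendants (node) as
( select child_node from parent_child_hierarchy where child_node = 'A'
  union all
  select p.child_node
  from parent_child_hierarchy p
  join descendants d on p.parent_node = d.node
)
select node into #descendants from descendants;

-- Double counting (Figure 3 left): duplicated nodes are summed more than once -> £150
select sum(a.amount) as double_counted_total
from #descendants d join #node_amount a on a.node = d.node;

-- Single counting (Figure 3 right): take the distinct nodes first, then sum -> £90
select sum(a.amount) as single_counted_total
from (select distinct node from #descendants) d
join #node_amount a on a.node = d.node;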

B. Relational Data Model

The relational model is the usual 2-column parent-child hierarchy table, preferably with a “level” column showing at what level the child node is. Note that the root node is at Level 1, not Level 0. Also note that the parent of the root node is NULL. This is the parent-child relational hierarchy table for the multiple-parent hierarchy in Figures 1 and 2 above:

Figure 4. Parent-child hierarchy table

C. Flattened Hierarchy Table

C1. Path to Leaf Nodes

To flatten a multiple-parent hierarchy, we walk down from the top-level node, taking the left-most path until we reach the leaf level. We do this for each path in the tree. Each of these paths then becomes a row in the flattened hierarchy table. The business key of this table is the combination of all the nodes in the path.

Figure 5. Left: red paths become the rows, right: flattened hierarchy table

C2. Paths to Intermediate Nodes

Consider path ABFI below. This path reaches the leaf level and has 2 intermediate nodes: node B and node F. So we have 2 “intermediate node paths”: path AB and path ABF (shown as dashed lines below).

Figure 6. Paths to Intermediate Nodes

So we have 2 types of path:

  1. Paths to leaf nodes
  2. Paths to intermediate nodes

To be complete, the flattened hierarchy should contain not only paths to leaf nodes, but also paths to intermediate nodes, like this:

Figure 7. Flatten hierarchy table including paths to intermediate nodes

C3. Recursive CTE

To create a flattened hierarchy table like the one above, in SQL Server we use a recursive CTE. CTE stands for Common Table Expression; it is essentially a SELECT statement defined at the beginning of another SELECT statement. A recursive CTE is a CTE that references itself: an anchor query is UNION-ed with a recursive query that joins back to the CTE, and SQL Server repeats the recursive part, level by level, until it returns no more rows.

We use Recursive CTE to get all the paths in the parent child hierarchy, as per Figure 6 above. The code is like this:

-- Create and populate a parent child hierarchy table
if object_id('dbo.parent_child_hierarchy') is not null
drop table parent_child_hierarchy

create table parent_child_hierarchy
( id int,
  child_node varchar(50),
  parent_node varchar(50),
  child_level int
)

insert into parent_child_hierarchy values
(1, 'A', NULL, 1),
(2, 'B', 'A', 2),
(3, 'C', 'A', 2),
(4, 'D', 'B', 3),
(5, 'E', 'B', 3),
(6, 'F', 'B', 3),
(7, 'F', 'C', 3),
(8, 'G', 'C', 3),
(9, 'H', 'E', 4),
(10, 'H', 'F', 4),
(11, 'H', 'G', 4),
(12, 'I', 'C', 4),
(13, 'I', 'F', 4)

select * from parent_child_hierarchy;

-- Create the "Path Table" containing every path (line) in the tree, including paths to intermediate nodes
if object_id ('tempdb..#path') is not null
drop table #path

;with CTE (child_node, parent_node, path) as
( select child_node, parent_node, cast(child_node as varchar(max)) as path
  from parent_child_hierarchy
  where parent_node is null
  union all
  select p.child_node as child, t.child_node as parent,
  t.path + ' > ' + cast(p.child_node as varchar(50)) as path
  from parent_child_hierarchy p join CTE t on p.parent_node = t.child_node
)
select path into #path from CTE where parent_node is not null;

select * from #path order by path
Output:
A > B
A > B > D
A > B > E
A > B > E > H
A > B > F
A > B > F > H
A > B > F > I
A > C
A > C > F
A > C > F > H
A > C > F > I
A > C > G
A > C > G > H
A > C > I

The recursive CTE code consists of two parts: the root (anchor) part is UNION-ed with the lower-level (recursive) part. The way this code works is: it gets the root node first (level 1, top of the tree), then joins to the level 2 nodes, then to the level 3 nodes, and so on until it reaches the bottom of the tree. Note that a hierarchy tree can have 2 root nodes, or more.

The exact code depends on how the data is arranged in the parent-child hierarchy table. Usually a root node is indicated by parent = NULL, but occasionally a root node is indicated by parent = child, as in the sketch below.
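If the root is stored as its own parent, only the anchor part of the CTE changes, plus a guard to stop the root joining to itself forever. A minimal sketch, assuming the same parent_child_hierarchy table:

;with CTE (child_node, parent_node, path) as
( select child_node, parent_node, cast(child_node as varchar(max)) as path
  from parent_child_hierarchy
  where parent_node = child_node              -- anchor: the root is its own parent
  union all
  select p.child_node, t.child_node,
         t.path + ' > ' + cast(p.child_node as varchar(50))
  from parent_child_hierarchy p
  join CTE t on p.parent_node = t.child_node
  where p.parent_node <> p.child_node         -- prevent the root row recursing into itself
)
select path from CTE
where child_node <> parent_node;              -- exclude the root-only row, as in the code above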

C4. Padding

In the flattened hierarchy table the leaf nodes are not all located at the same level, e.g. some leaf nodes are at level 3, some at level 4, etc. This makes querying difficult: we can’t easily pick out the leaf nodes. So we populate the empty cells in the lower levels with the leaf nodes. This is called padding. Like this:

Figure 8. Flatten Hierarchy Table with empty cells populated

In Figure 8 we can see that the red D and I on rows 1 and 8 are added in the Level 4 column, marked in red. This makes the query easier, i.e. if we want to get the leaf nodes (the bottom-most nodes) we just need to query level 4.
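Here is a rough T-SQL sketch of the flattening and padding, assuming a maximum depth of 4 levels as in Figure 1 (#flat_hierarchy is a hypothetical name; the level columns are filled by path position, and the padding copies the deepest node of each path down to Level 4):

;with flat (child_node, depth, level_1, level_2, level_3, level_4) as
( select child_node, 1,
         child_node, cast(null as varchar(50)), cast(null as varchar(50)), cast(null as varchar(50))
  from parent_child_hierarchy
  where parent_node is null
  union all
  select p.child_node, f.depth + 1,
         f.level_1,
         case when f.depth + 1 = 2 then p.child_node else f.level_2 end,
         case when f.depth + 1 = 3 then p.child_node else f.level_3 end,
         case when f.depth + 1 = 4 then p.child_node else f.level_4 end
  from parent_child_hierarchy p
  join flat f on p.parent_node = f.child_node
)
select level_1,
       coalesce(level_2, level_1) as level_2,                   -- padding starts here:
       coalesce(level_3, level_2, level_1) as level_3,          -- copy the deepest node of the path
       coalesce(level_4, level_3, level_2, level_1) as level_4  -- down to the lowest level
into #flat_hierarchy
from flat;

select * from #flat_hierarchy order by level_1, level_2, level_3, level_4;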

D. Dimensional Data Model

The standard dimensional model is a fact table (F), connected to a dimension table (D), which in turn is connected to a hierarchy dimension table (H) using bridge table (B).

Figure 9. Dimensional Data Model

It is easier to explain by example. I am going to use an example of a sales fact table and a product dimension table:

Figure 10. Dimensional Data Model for Product Dimension

The 4 tables above are populated as follows (let’s assume a sale of 1 product, called product A): a row in the fact table is a sale transaction for Product A on a particular date, with quantity, unit price and sales amount. The product dimension table has 3 rows for product A: 1 row for the active version and 2 rows for the expired versions. The bridge table contains 3 rows for product A, 2 for the Inventory Hierarchy and 1 for the Sales Hierarchy (see below). Each of these 3 rows in the bridge table corresponds to a row in the product hierarchy dimension, via its surrogate key.

Inventory Hierarchy is a method of grouping products for the purpose of inventory management. Sales Hierarchy is a method of grouping/classifying products for the purpose of sales management. The Inventory Hierarchy intertwines with the Sales Hierarchy in one big tree. This tree is called Product Hierarchy Dimension. It is a flattened dimension table.

Let’s go through the fact first, then dimension table, then the hierarchy dimension, then the bridge.

1. Sales Fact Table

The sales fact table is the same as in the single-parent scenario. Each sale is recorded as one row. The facts for each sale event are quantity, unit price and amount.

If a child node has 2 parents in the hierarchy, we do not store it in the fact table as two rows. We store it as one row. This is very important to prevent double counting: the transaction happens only once, so we should store it only once.

2. Product Dimension Table

The Product Dimension table is a slowly changing dimension type 2. It contains many rows for each product. Each version of each product is stored as one row. Product A has 3 versions, two of them are expired, and one is active. Being type 2, the Product Dimension table has an Effective Date column, an Expiry Date column and an Is Current Flag.

3. Product Hierarchy Dimension Table

The Product Hierarchy dimension is flattened; see “Flattened Hierarchy Table” above. The columns are Level 1, Level 2, Level 3, etc. Flattening is necessary to increase the user friendliness and the speed of the queries. Each path (line) on the Product Hierarchy tree becomes a row in this dimension table. Product A above, which has 3 parents, becomes 3 rows in the dimension table: 2 rows for the Inventory Hierarchy and 1 row for the Sales Hierarchy. Each row has a different parent, and each row represents a path (line) in the hierarchy tree.

How do we update this dimension? What is the business key? There is no business key. The unique identifier is the path (line), i.e. the combination of all the nodes from the top level to the bottom level, strung together to represent the path. Because there is no business key, this dimension is type 1: we can’t identify whether a row has changed.

In a single-parent hierarchy, the leaf node is the business key, so the parents (which are attributes) can be updated using that business key. In a multi-parent hierarchy we can’t do that. For a multi-parent hierarchy, if the hierarchy data is updated monthly (say it is 10,000 rows), the usual technique is to stamp the 10,000 rows for this month as “March”, and the 10,000 rows for next month as “April”, so we have a “month” or “version” column. But if the data is updated daily, we should delete this month’s rows and reinsert them from the source every day.
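A minimal sketch of that “version” loading pattern (dim_product_hierarchy, stg_product_hierarchy and the column names are hypothetical, not from a specific project):

declare @this_month char(7) = format(getdate(), 'yyyy-MM');   -- the month/version stamp

-- Daily refresh: throw away this month's rows and reload them from the source/staging
delete from dbo.dim_product_hierarchy
where version_month = @this_month;

insert into dbo.dim_product_hierarchy (version_month, level_1, level_2, level_3, level_4)
select @this_month, level_1, level_2, level_3, level_4
from dbo.stg_product_hierarchy;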

This dimension has a surrogate key, which is a sequential number. This dimension also contains an unknown row with surrogate key zero.

4. Bridge Table

The bridge table contains only 2 columns: the surrogate key of the product hierarchy dimension and the surrogate key of the product dimension. This bridge table can be implemented as a view over the two dimension tables. The number of rows in this bridge table is the same as the number of rows in the product hierarchy dimension, but it only contains 2 columns (3 with its own SK, plus system columns). It contains both the leaf-level node paths and the intermediate-level node paths from the product hierarchy dimension.
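One possible way to resolve the bridge as a view (all table and column names here are hypothetical): match the leaf node of each path to the product dimension’s business key, and map paths ending on an intermediate node to the unknown product row, so the row count stays equal to the hierarchy dimension.

create view dbo.bridge_product_hierarchy
as
select h.product_hierarchy_key,
       coalesce(p.product_key, 0) as product_key     -- 0 = unknown row for intermediate-node paths
from dbo.dim_product_hierarchy h
left join dbo.dim_product p
       on p.product_code = h.leaf_node               -- leaf node of the path = product business key
      and p.is_current_flag = 'Y';                   -- assuming a type 2 current flag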

In the relational world we don’t need this bridge table: the leaf node in the Product Hierarchy Dimension table can be linked directly to the Product Dimension table. But in SSAS we can’t connect a dimension to another dimension; we need a fact table in between. That is why we need this bridge table.

Note: Christopher Adamson offers a design for a bridge table: link. In this design the bridge table contains all children below a parent (not just the children directly below the parent, but from the lower levels as well).

Note2: Margy Ross said that bridge tables are used to represent a ragged or variable depth hierarchical relationship which cannot be reasonably forced into a simpler fixed depth hierarchy of many-to-one attributes in a dimension table: link. I am not sure what she meant, but it is probably what we are implementing here. True, a multiple-parent hierarchy is not the same as a ragged or variable-depth hierarchy, but if it is not many-to-one then it is many-to-many, and a multiple-parent hierarchy is many-to-many.

5. Parent Child Hierarchy table

In some projects, the sales amount in the fact table is aggregated up. At this summary level, the sales amount does not correspond to the bottom level of the product hierarchy. It corresponds to the middle level or the top level of the product hierarchy.

In this case we need to create a new dimension table which contains every single node in the hierarchy, not just the leaf nodes (bottom nodes). The parent column in this table is optional, but it is best practice to have it so we can traverse the hierarchy to find the leaf nodes.

E. SSAS Cube

In an SSAS cube, the design uses a many-to-many relationship. See Marco Russo’s paper here: link.


3 January 2017

Data Warehousing/Business Intelligence in Investment Banking

Filed under: Analysis Services — Vincent Rainardi @ 8:39 am

I’ve written an overview of an investment bank (IB): link, which would provide a useful background to this, if you want to read it first. The intended audience of this article are those who have no experience in investment banking, so I will be explaining all the IB terminologies as we go along.

Of the 7 business areas in IB, a business intelligence system is highly valuable in 3 areas: research, fund/asset management, and trading, and almost not required at all in the other 4 areas. I’ll start with trading as it is IB’s main business, and it is the area that uses DW/BI the most.

An investment bank does trading on behalf of its clients because it acts as a broker dealer. It receives orders from many clients to buy and sell securities such as shares, bonds, options, FX forwards, IRS, commodities, and CDS, both exchange-traded and over-the-counter (bilateral agreements with another dealer).

As a broker dealer, we need to offer our clients good BI, preferably real time, covering all the trades we do for them. That’s the minimum, but preferably we also offer market data / market intelligence. Clients usually don’t use us just because we have thin spreads and clean, speedy executions, but because we win them over with our information systems, which enable them to make decisions early and accurately.

This BI enables the clients to see each and every trade, in full life cycle. I’m going to explain the mechanics so please bear with me. Our clients set up the order on their OMS (Order Management System) such as Charles River or ThinkFolio. We receive the order via Omgeo CTM or Swift, for example to sell 200,000 of bond X. We start the process of origination, execution, validation, confirmation, clearing, settlement and accounting. To get the best price we might (programmatically!) have to split the order 3 ways: 75k to Goldman, 75k to MS and 50k the next day with Citi.

The terms could be different from each counterparty, particularly if it is margin trading such as a Repo. The order could be complex i.e. a delta neutral trade, a volatility play, or an option spread.

Broker or dealer: we become a broker when we buy an ETS (exchanged-traded-security) or ETD (exchange-traded-derivatives) such as shares, gilts, treasury, or CP (commercial paper) on behalf of our clients. Here we purchase or sell the security/derivative on an exchange. We also become a broker if we make an OTC (over the counter) contract on behalf of our clients, for example IRS and FX Forward or some futures. Here we make a custom “deal” with another bank. We become a dealer when we trade on behalf of ourselves, with our own money, for example for our Proprietary Desk (I’ll explain this shortly) or for our own Treasury department.

Here is an overview of each of the 7 steps:

  1. Origination is about making sure that we capture the client orders correctly into our trading system. Quantity, security details, maturity, price requirements, timing restrictions, counterparty restrictions (exchange restriction), and terms (like European or American for options, ISDA / DTCC for CDS).
  2. Execution is about sending the order down the wire to different counterparties, or to our internal pool first to match with other client orders. ETS orders could be combined with other client orders, or with our own prop desk, and placed externally in one big volume to get a good deal.
  3. Validation is done by mid office to check the trade against the counterparty, i.e. we send them our trade details, they send us their trade details, and we compare them electronically. Unmatched trades are reported back to counterparty, and fixed manually.
  4. Confirmation: after we agreed with the counterparty, we send the details of the trade to the clients, electronically.
  5. Clearing: the clearing agency e.g. NSCC electronically sends the contract to both parties and both formally accept the trade terms.
  6. Settlement: payment to counterparty (OTC) or Exchange (ETS), usually T+1 or 2 depending on which exchange. Settlement for STP (straight through processing) is same day.
  7. Accounting: trades are booked to clients’ accounts. Balances of client funds and margin requirements are calculated.

The BI systems we offer to clients on our extranet need to reflect every single one of the above steps/statuses, on a real-time basis (like 2-5 mins after the event). If we only offer overnight updates we may be able to win small clients, but the big fish (buy side) all need real time, as it affects the fund managers’ ability to manage their portfolios intraday.

 

Business Intelligence for Trading

The 5 core requirements are: data coverage, external & internal use, position exposure, risk management, and regulatory compliance.

Data Coverage: There are many different asset classes within the IMSes (Investment Management Systems) across the bank, and for OTCs there are several external trade repositories, all of which need to be tapped and have their data extracted into our data warehouse. Our DW/BI must store every single trade we execute, and every single order we receive (both from clients and internal), regardless of status, i.e. we must still store cancelled orders, novations and compressions.

External & Internal Use: the DW/BI is used by our clients to view their transactions, positions and risks, but also, importantly, by our traders and managers to understand their activity. In terms of security, a Chinese Wall is an absolute must. No trader should be able to view clients’ price-sensitive information, and clients must only be able to see their own positions.

Position Exposure: a “position” is an obligation or asset that a client has on a particular day. A trade to buy a bond, for example, results in that client having a long position in that bond. A position can be long or short, open or closed, has a book value and a market value, is date-and-security based, and has an associated P&L value, i.e. the potential profit or loss. Clients must be able to see their aggregated positions at various levels, for example the number of CDS positions and the gross and net exposure for each day. Clients should be able to see overall exposure by broker, by clearing house, by asset type, by status, by sector, by currency, by country, and by instrument.

Risk Management: every position/exposure carries certain market risk, e.g. the EUR-USD rate can move against you, and the yield and duration of bonds can increase or decrease. To quantify this market risk, we calculate VAR (Value at Risk), e.g. there is a 1% chance that the market value of all Investment Grade fixed income positions decreases by 1 million in 1 day. We call this 1% the “99% confidence level”, this 1 day “the risk time horizon”, and this 1 million “the VAR”. Clients should be able to see VAR by broker, currency, country, maturity and asset class, on a 1-day, 1-week and 1-month horizon, at 95%, 99% and 99.9% confidence levels.

Every trade has certain operational risks. A dashboard displaying daily automatic confirmation and validation across all market positions (compressed and uncompressed) gives our client confidence about how much risk exposure they have on each asset class, broker and currency, e.g. the volume of mismatched trades. Every trade also has certain counterparty risks, therefore we need to provide to our clients their exposure to each counterparty, e.g. £4.5m against RBS on 17/12/2016. They can see this daily counterparty exposure on a timeline, for different brokers and different asset classes, with the associated MTM (mark-to-market).

Regulatory Compliance: Different regulators demand different reports which we need to produce simultaneously, e.g. Dodd Frank in US, EMIR in EU, FATCA, Mifid. We need to help our clients fulfill their regulatory reporting requirements, as well as our own reports to the authorities, e.g. we must report every OTC Derivative contract to a trade repository, implement margining standards, and monitor clearing obligation in each CCP (Central Counterparty). This includes fraud detection, insider trading, sanctions, AML (Anti Money Laundering), KYC (Know Your Clients), RWA (Basel III). Clients pass to us their regulatory reporting requirements, and we create automated reports within our DW/BI to give them this data, securely downloadable on extranet.

 

Prop Desk and Fund Management

Many investment banks also invest their own money into some assets, most notably fixed income. We call this “Prop” – short for Proprietary Desk or Proprietary Trading. There are 2 kinds: it can be short term, i.e. just this trade, or long term, where the trade is part of a bigger plan to manage money, a kind of “fund management”.

We want to do prop because we want to make a market. If a client wants to sell their CDS cheaply, or buy an IRS at a good price, why don’t we take the opposite side and create a market? If someone is under pressure to sell a UBS CDS at 6% below the market price, it is profitable for us to buy it and sell it to someone else the next day at 2% below the market, and we make immediate money.

The prop desk also actively looks at the market for opportunities. So if they believe that the Colombian bond’s recent 32% dive has made the price too low, they will instruct our trader to buy some. Or the prop desk could want a delta neutral trade, to gain from volatility. Which means it doesn’t matter which way the price moves; whether the price moves down or up, we still make money. The more volatile the price, the more money we make.

So that’s the incidental side. But the bank’s money is usually managed from a long-term perspective, not just for day-to-day trading. As an example, we may want to create a long position on Cable because we believe US politics will result in the Fed raising rates next year. Or take advantage of the rising EM market next year, or Japanese equities.

All the big IBs (JPM, Nomura, UBS, Barclays, etc.) have an asset/wealth management business. They take clients’ money and invest it on the clients’ behalf. This AM business also instructs our traders to buy or sell assets to create positions for the clients.

The BI required to support the prop money and asset management has the following objectives:

  1. Compliance: each fund has a “boundary” e.g. 80-90% must be invested in EM bonds. “EM” = Emerging Market, i.e. Chile, Russia, India, China, etc. Must be at least 95% in USD. Use of options must be <= 10%. Latam (Latin America) maximum is 30%. And so on. All these “limits” are coded into the OMS (Order Management System) so that every time a trade is created on the OMS, it is automatically checked before the trade is executed. And therefore, afterwards, we are able to plot in our BI the chart of the “EM Bonds” content of the fund (percentage wise) for each month from 2014 to 2016. And not just EM Bonds: on the same dashboard we should be able to see the regional breakdown of the fund over the last 3 years. We should also see the currency breakdown of the fund, month to month, over the last 3 years. And we could see in that chart if any non-compliance event happened.
  2. Profitability: we can compare the amount of money we make in each trade, to the amount of capital required to do the trade, and the amount of time required. We can aggregate this to client level, fund level, country level, currency level, asset class level etc., for any time duration we like.
  3. Risk: we can look at the level of risk we take in each trade, in terms of contribution to VAR, and in terms of historical events. Meaning that, if the 9/11 event happened tomorrow, what kind of risk would we have in this portfolio? Have we hedged it? This is tested not just for 1 event like 9/11, but hundreds of events. Some of them are real events which happened in the past; some are theoretical events such as rises in interest rates and FX rates.

 

Research

The Research guys make money by selling market analysis, as well as analysis of economic and geo-political risk, both for the market as a whole (macro/overview level) and at individual company level. We charge clients fees to use our market analysis tool, which enables them to monitor market behaviour and market events, and to get signals and alerts on certain situations they want to trade on.

So the BI for Research has two purposes:

  1. To support our analysts doing their market and economy analysis.
  2. To support our clients doing their analysis.

The BI provides for example various market indicators such as CDS spreads for each corp bond, historical interest rates and currency rates, yield curves. From the bond price and maturity we calculate yield, duration, and various spreads on each and every fixed income instrument, and let our analysts and clients feast on this data.

They can filter by asset class, country, currency, region, industry sector. See what is the trend today, this week, this month, and this quarter. We let our clients view our forecasts that our analysts have created.

 

Part 4. Technical Design

There are 5 architectures which can be used for BI for IB. Three of them are old approaches (#1-3 below), and two of them are new approaches (Data Vault & Hadoop)

  1. DDS only (Dimensional Data Store)
  2. ODS (Operational Data Store) + DDS
  3. NDS (Normalised Data Store) + DDS
  4. DV (Data Vault) + DDS
  5. Hadoop + DDS

A bit on terminology:

  • DDS means dim & fact tables. It’s a dimensional model, i.e. Kimball star schema.
  • ODS means current version only. The historical value of attributes are not stored. Surrogate Keys are generated in the DDS.
  • NDS means storing historical values of the attributes. DDS becomes dummy. Surrogate Keys are generated in the NDS.
  • Data Vault stores the attributes separately to the business keys. And the links between business keys are also stored separately. It is superior for capturing time-based relationship between tables.
  • Hadoop stores data in a distributed file system (HDFS, with YARN for resource management), and uses Hive or NoSQL to query the data out.

As you notice above, in all 5 architectures the BI tool (Cognos, BO, Tableau, etc.) is facing the DDS. The ODS, NDS, DV and Hadoop are all “back end engine” which the users never see. Their function is to integrate data from many source systems.

Because of this the ODS, NDS, DV and Hadoop are called the Integration Layer (IL) or the Enterprise Layer (EL), whereas the DDS is called the Reporting Layer (RL) or the Presentation Layer (PL). The primary benefit of having an Integration Layer separate from the Reporting Layer is to make data integration easier, because it has a separate model which breaks down the entities to a more detailed level than the RL. The IL data model is shaped to be “write friendly” (hence a normalised model) whereas the RL data model is shaped to be “read friendly” (hence a denormalised model).

 

Dimensional Model Design

Dimensions: Date, Client, Broker, Branch, Product, Instrument, Desk, Trader, Issuer, Country of Risk, Currency, Rating, Collateral, Fund, Asset Class.

Designing instrument dimension (also called “security” dimension) in IB is a tricky business. First we need to decide whether we want cash and OTC in the instrument dimension. Second, we need to decide whether we want to split the derivatives (IRS & CDS in particular) into a separate instrument dimension because of their rich attributes. Third, we need to decide whether the instrument classification (such as instrument currency, rating, country of risk, asset class, etc.) should be stored inside the instrument dimension or outside.

Because of the last point, sometimes you don’t see the rating dimension and country dimension, because they are amalgamated into the instrument dimension. But issuer and currency are usually created (more often than not).

When creating Client, Branch and Broker, some companies feel clever and create a table called Party in the Integration Layer. I advise against doing this and prefer to split them into 3 entities for clarity and flexibility reasons. In the DDS they are always separate dimensions. Some companies split the address and country in the ODS/NDS into separate tables in the name of normalisation. I advise against doing this too, and prefer to keep demography and geography attributes (such as address & country) within the client, branch & broker tables for simplicity and flexibility, even if that means data redundancy and breaking the 3rd Normal Form.

There are many, many instrument classifications in IB: asset type, duration bands, maturity bands, country, country group, region, currency, currency group, broker type, coupon type, interest rate type, settlement type, collateral type, contract type, rating type, clearing house type, issue type, market type, derivative type, interest calculation type, direction, swap type, etc. All of these can be created as their own tables, or as 1 common key-attribute-value table. I prefer the latter due to its simplicity and consistency in modelling, as well as ease of use (see the sketch below). The primary benefit is that we don’t need to change the data structure: with 1 common table we don’t need to keep creating a new table when we have a new classification, which in IB happens almost every month. Within the RL, of course, all of them go into 1 dimension: instrument.
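As a rough sketch of that common key-attribute-value table (all table and column names here are hypothetical, not from a specific project):

create table dbo.instrument_classification
( instrument_id        int          not null,  -- FK to the instrument table in the IL
  classification_type  varchar(50)  not null,  -- e.g. 'Asset Type', 'Rating Type', 'Swap Type'
  classification_value varchar(100) not null,  -- e.g. 'Corporate Bond', 'Investment Grade', 'IRS'
  effective_date       date         not null,
  constraint pk_instrument_classification
    primary key (instrument_id, classification_type, effective_date)
);

-- A brand new classification type needs no DDL change, just new rows:
insert into dbo.instrument_classification
values (1234, 'Clearing House Type', 'CCP', '2017-01-01');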

 

Fact Tables

There are 8 major fact tables in IB:

  1. Position
  2. Transaction
  3. Risk
  4. Performance
  5. Collateral
  6. Deal Pipeline
  7. Client Account
  8. Prop Desk

I’ll explain the above 8 one by one, bear in mind that each of them might be implemented as two or more physical tables.

  1. The Position fact table stores the daily value of each position (instruments and cash) in each portfolio. It also stores the analytic values of each instrument, such as spread and yield. And time-based properties such as duration and maturity along with its banding.
  2. Transaction fact table stores the buying and selling activities in each portfolio, and non-trade transactions too, e.g. interest payments, defaults, haircuts, instrument maturity, and corporate actions.
  3. The Risk fact table stores the VARs, Tracking Errors and stress test impacts. To be flexible, in some IBs it is designed as a “vertical fact table” where a “risk type” column determines what the measure means. But for read-efficiency and clarity reasons I prefer to create the measures as individual columns, with the time horizon (1 month, 1 year, etc.) as a separate column, and the “cuts” (by region, currency, asset class, etc.) as separate columns (see the sketch after this list).

Risk is usually calculated not at instrument level but at “cuts” level, including at portfolio level. If it is calculated at instrument level, then we need to create a separate fact table to store the risk numbers at that level.

  4. The Performance fact table stores the growth of a fund over the last 1 month, 3 months, 6 months, 1 year, 3 years, 5 years and 10 years. Each “share class” is stored in its own row.

 

Before continuing to fact table number 5 to 8, let me explain the aggregatability first:

Unlike the Position and Transaction fact tables, the Risk and Performance fact tables are not aggregatable. Every “cut” in the Risk fact table stands on its own, and every share class in the Performance fact table stands on its own.

The Position and Transaction fact tables are only aggregatable up to portfolio or fund level. It is best not to put the share class into the position or transaction fact table, because from a position and transaction point of view the share classes are the same. But in the Risk and Performance fact tables we must put the share class, because the Risk and Performance numbers are different for each share class (they are affected by the portfolio currency and accumulation/income treatment).

 

  5. The Collateral fact table stores the market value of individual OTC derivatives (called MTM – mark to market) and the required value adjustments in collateral against each broker-dealer.
  6. The Deal Pipeline fact table stores the flow of a deal between the bank and lending clients, M&A clients, and transaction banking clients. It records the status from when the client was a prospect until the deal is agreed and closed.
  7. The Client Account fact table stores the clients’ money that we manage and invest, including subscriptions (deposits, i.e. the client puts money in) and redemptions (withdrawals, i.e. the client takes money out), and buying and selling activities. This fact table needs to store both the movements and the daily balance of each client account (hence preferably split into two – one periodic snapshot, one transactional).
  8. The Proprietary Desk fact table stores the bank’s own money that we manage and invest, including subscriptions and redemptions, buys and sells, and corporate actions. It is a legal requirement in the US, EU and UK that we must separate clients’ money from our own money.
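To illustrate point 3 above, here is a rough sketch of the two Risk fact table designs (all table and column names are hypothetical, chosen only for illustration):

-- Vertical design: a "risk type" row determines what the measure means
create table dbo.fact_risk_vertical
( date_key int, fund_key int, share_class_key int,
  risk_type varchar(50),                    -- e.g. 'VAR 99% 1-day', 'Tracking Error'
  risk_value decimal(18,6)
);

-- Column design: one column per measure, with the horizon and the "cuts" as their own columns
create table dbo.fact_risk
( date_key int, fund_key int, share_class_key int,
  region varchar(50), currency char(3), asset_class varchar(50),   -- the "cuts"
  time_horizon varchar(20),                 -- '1 day', '1 week', '1 month'
  var_95 decimal(18,6), var_99 decimal(18,6), var_999 decimal(18,6),
  tracking_error decimal(18,6), stress_test_impact decimal(18,6)
);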

 

That’s the technical design of the BI for IB. Hope it has been useful for you.


 

4 July 2016

Historical Portfolio Positioning

Filed under: Analysis Services — Vincent Rainardi @ 7:08 am

One of the most difficult charts to create in the asset management sector is Historical Duration Positioning, like this:

Sample portfolio

First we have to get the numbers for each category for each month (at least each quarter), e.g. in January the portfolio was as follows:

  1. Government 25%, 1.1 years
  2. IG Corporate 20%, 0.7 years
  3. EM Sovereign 10%, 0.4 years
  4. EM Corporate 5%, 0.3 years
  5. High Yield US 10%, 0.2 years
  6. High Yield non US 5%, 0.4 years
  7. Securitised 10%, 0.7 years
  8. Convertibles 10%, 0.9 years
  9. Cash 5%, 0 years

The 2 numbers above are the market value weighting (%) and the contribution to duration (years). In the above, the portfolio duration was 4.7 years (the sum of the duration contributions of the 9 categories).

Notice that the categorisation above is cutting across 6 different dimensions:

  1. Government/sovereign bond vs corporate bond
  2. Investment Grade bond vs High Yield bond
  3. US vs non US vs Emerging Market
  4. Securitised vs non-securitised
  5. Convertible bond vs non-convertible bond
  6. Cash

To create this report, first we need to identify cash and cash equivalent positions. Then convertibles and securitised. Then HY vs IG, sovereign/government vs corporate, then Emerging vs Developed Markets.

Because on point 3 we have EM Sov, it means that on point 1 the government is DM (developed market). Because on points 5 & 6 we have HY, it means that points 1 to 4 are IG. The business analyst is expected to know that “sovereign” is equivalent to “government”, and that “government” usually means investment grade debt in a developed market.

So the actual meanings of the categories are:

  1. DM IG Government 25%, 1.1 years
  2. DM IG Corporate 20%, 0.7 years
  3. EM IG Sovereign 10%, 0.4 years
  4. EM IG Corporate 5%, 0.3 years
  5. HY US 10%, 0.2 years
  6. HY non-US 5%, 0.4 years
  7. Securitised 10%, 0.7 years
  8. Convertibles 10%, 0.9 years
  9. Cash 5%, 0 years

Total is not 100%

Points 1 to 8 above are debt, which means they exclude cash but include debt derivatives (not just bonds), while excluding other derivatives. So if the portfolio has an equity index option (which is an equity derivative), it would be excluded from the calculation, and that makes the total not equal to 100%. It could be less than 100% or more than 100%. For example, if the portfolio has 2% equity, the total would be 98%. But if the portfolio has 2% equity index options, the total would be 102%.

In this case we need to decide whether we prefer to create a 10th category called Other and put the non-debt, non-cash assets into it, or to scale everything up to 100%. I would recommend the former over the latter, because the latter misrepresents the weighting of a particular asset class, e.g. if we say that Securitised is 11%, it is actually 10%.

Usual definitions for cash, securitised and EM/DM

The asset classes usually considered as cash are net proceeds, cash balances, FX forwards (but not FX swaps), certificates of deposit, money market securities, cash funds, commercial paper, etc. The grey areas are treasury bills, short-term government bonds, and marketable securities, i.e. for some portfolios, such as global credit, they could be considered cash equivalents; for other portfolios, such as short maturity gilt funds, they could not.

Securitised and convertible bonds are identified regardless of region. Securitised includes CMBS, RMBS (agency and non agency), mortgage passthrough, and other ABS such as car loan.

Definition for “Developed Market” differs from company to company. For example, is Jersey developed or emerging market? How about Cayman Islands? Do we use country of risk, or country of incorporation, or country of domicile? The country of risk might be US, but the country of domicile could be Cayman Islands.

Re-classify the past

Once it is all sorted out, we can calculate the % and duration for each category, for each month. At this point we can plot the timeline. But the security classification in the company changes from time to time. For example, before 2011 they might not have differentiated between EM Sov and EM Corporate, only EM. Or between HY US and HY non-US, i.e. they only labelled it as HY. This means that we need to reclassify the 2010 positions according to the 2016 definitions.

This is one use of a data mart (as opposed to a data warehouse): to reclassify the securities used in the position fact tables. We can create a new attribute in the security dimension called Bond Category, and label each security with one of the 10 categories above. Equity and equity derivatives, and currency derivatives, would have this attribute set to null (or to their correct category). We can then use this attribute for the above reporting.
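A rough sketch of this reclassification and the report query, assuming a typical position fact table and security dimension (all table and column names here are hypothetical):

-- Add the new attribute and label each security (one update per category rule)
alter table dbo.dim_security add bond_category varchar(50) null;   -- null for equity/FX derivatives
go

update s set bond_category = 'EM IG Sovereign'
from dbo.dim_security s
where s.asset_class = 'Sovereign' and s.rating_band = 'IG' and s.market = 'EM';
go

-- Weight and duration contribution per category, per month-end
select d.month_end_date, s.bond_category,
       sum(f.weight_pct) as weight_pct,
       sum(f.weight_pct * f.modified_duration) / 100.0 as duration_contribution_years
from dbo.fact_position f
join dbo.dim_security s on s.security_key = f.security_key
join dbo.dim_date d on d.date_key = f.date_key
where d.is_month_end = 'Y' and s.bond_category is not null
group by d.month_end_date, s.bond_category
order by d.month_end_date, s.bond_category;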

Measure: %, duration, spread and DTS

The above example uses % and duration as the measure to present in the chart. Duration is suitable to use for a global credit portfolio (whether it is buy and hold or active) particularly if the portfolio is leaning toward IG or has significant proportion of DM sovereign debt.

But if it is an emerging market debt portfolio, a more appropriate measure would be Duration Times Spread (DTS, which is the modified duration x the yield spread to government debt). And if it is a corporate debt it is more appropriate to use Spread Duration rather than Duration (particularly if it has a high proportion of high yield).

Different measures can also be used to monitor historical portfolio positioning, for example risk measures such as PV01 (interest rate risk), IE01 (inflation risk) and DV01 (another interest rate risk). VAR (Value at Risk), portfolio volatility (3Y standard deviation) and TE (Tracking Error) are also common. Tracking these risk measures over the last 3 years (or 10) and comparing them between portfolios (and to the benchmark) is a good demonstration that the desk has incorporated good risk management when managing the portfolio. And it is in the client’s/investor’s best interest to know that, hence such a report provides good value both to the asset management house and to the investors.

18 June 2016

About NOLOCK

Filed under: Analysis Services — Vincent Rainardi @ 10:13 pm

When we issue a SELECT statement in SQL Server, it locks the table. Many people wonder why SQL Server needs to lock the table.

When the SELECT is being processed, when SQL Server is reading the rows in the table, imagine if someone drops the table. What will happen? The DROP statement will have to wait until SQL Server finishes reading all the rows for this SELECT. (1)

What if someone truncates the table when the SELECT is reading the rows? Or delete the rows that the SELECT was about to read? Same thing, the TRUNCATE or DELETE will have to wait until the SELECT finishes. (2)

What if someone updates the rows that the SELECT was about to read? Same thing, the UPDATE has to wait. (3)

What if someone inserts some rows that fall within the SELECT criteria? Same thing, the INSERT has to wait. (4)

(1), (2), (3) and (4) are the purposes of locking the table when reading it, i.e. to get consistent result from the beginning of the read until the end of the read.

This is the default behaviour of SQL Server. I call it “Clean Read”. Formally it is called READ COMMITTED Isolation Level.

The opposite of Clean Read is Dirty Read, which happens when the data changes during the read, so that at one point we see a row, but a split moment later, after the row has been updated or deleted, we can’t find it any more. Strictly speaking, in SQL Server a “dirty read” means reading uncommitted data (the READ UNCOMMITTED isolation level), whereas a row that returns different values when read twice in the same transaction is called a “Non-Repeatable Read”.

SELECT With NOLOCK

What if we do the SELECT with NOLOCK? Does SQL Server prevent someone from dropping the table? Yes, nobody can drop the table when it is being read, even if the reading is using NOLOCK. The DROP will be executed, after the SELECT is finished.

What about Truncate? Can someone truncate the table, while a SELECT with NOLOCK is reading the table? No, the TRUNCATE will wait until the SELECT finishes.

What about Delete? Can someone delete the rows from the table while those rows are being read by a SELECT with NOLOCK? Yes, that is possible.

If you have a table with 100,000 rows and User1 issued SELECT * FROM TABLE1 WITH (NOLOCK), and at the same time User2 issued DELETE FROM TABLE1, then User1 will only get something like 15,695 rows*, because the other 85k rows were deleted.

*This is the result of my test. Obviously you can get some other numbers depending on various factors.

These 15,695 rows are not sequential, but jumpy, i.e. User1 will get
row 90351-90354, then jump to:
row 90820-90824, then jump to:
row 94100-94103,
and didn’t get row 94104 to 100,000.

Because the missing rows have been deleted by User2.
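A rough sketch of that test, if you want to reproduce it yourself (the table is made up, and the exact row counts you get will differ):

-- Setup: a test table with 100,000 rows
select top (100000) identity(int, 1, 1) as id, replicate('x', 100) as payload
into dbo.Table1
from sys.all_objects a cross join sys.all_objects b;

-- Session 1: read with NOLOCK
select * from dbo.Table1 with (nolock);

-- Session 2: run this at the same time as Session 1
delete from dbo.Table1;

-- Session 1 typically returns only part of the table, skipping ranges of rows,
-- or fails with error 601 if the page it is scanning is deallocated mid-scan (see below)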

Error 601

But sometimes, instead of getting the above rows, User1 will get this error message:
Error 601: Could not continue scan with NOLOCK due to data movement

This is when “the page at the current position of the scan” is deleted. See link.

 

 

Domain Knowledge

Filed under: Analysis Services — Vincent Rainardi @ 9:59 pm

In my opinion one can’t be a data architect or data warehouse architect without having good domain knowledge. For example, in asset management, the main dimension or reference data is Security (aka Instrument). A security has many attributes, such as:

  • Security IDs: ISIN, Sedol, Cusip, Ticker Code, Bloomberg ID, Markit RED, in house ID, etc.
  • Industry/Sector: Barclays, Markit, JP Morgan, Merrill Lynch, GICS, UK SIC, International SIC, IMF Financial Sub Sector, Moody 35, Moody 11, etc.
  • Credit Ratings: S&P, Moody’s, Fitch, house rating, and numerous combination between them.
  • Stock Exchange (where the security is listed)
  • Issuer, Parent Issuer (for derivatives, do we use the underlying’s issuer?)
  • Dates (maturity date, issue date, callable dates, putable dates)
  • Country: country of incorporation, country of domicile, country of risk
  • Currency: denomination currency
  • Asset class: inflation-linked, treasury, sovereign, IG corporate, CDS, IRS, options, futures, FX swap, etc.

And we have many measures such as valuations (market value, exposure, P&L), analytics (yield to maturity, current yield, modified duration, real duration, spread duration, convexity, etc) and risk (value at risk, tracking error, PV01, IE01, DV01, etc).

If we don’t understand what they mean, and how they relate to each other, then how could we design the data structure to store them properly? We can’t. As a data architect we need to understand the definition of each of the attributes and measures above.

We also need to understand the meaning of each value. For example, one possible value for the asset class attribute is Equity Index Option. What does Equity Index Option mean? How does it differ from an Equity Index Swap or an Equity Option? We need to understand that, because by understanding it we will be able to see the difference between Asset Class and Asset Type, how these 2 fields relate to each other, whether we need to create an Asset Sub Class attribute or not, and the hierarchy between Asset Class and Asset Sub Class.

Of course a business analyst needs to understand the domain. But a data architect also needs to understand it. A data architect who doesn’t understand the domain will not be able to do their job properly. A data architect who has worked for a pharmaceutical company for 5 years will not be able to design a database for a Lloyd’s insurance company without learning the domain knowledge first. And it could be 6 months before they reach the necessary level of understanding of insurance. Like a business analyst’s, a data architect’s job is industry specific.

I agree, not all industries are as complex as Lloyd’s underwriting/claims or investment banking. Retail, mining, manufacturing, distribution and publishing, for example, are fairly easy to understand, perhaps taking only 1-2 months. But healthcare, insurance, banking, finance and investment are quite difficult to understand, perhaps requiring about 6 months, or even a year.

So how do we get that industry knowledge? The best way is by getting formal training in that industry sector, which can be either self study or a class. Fixed-income securities, for example, can be learned from a book. Ditto the trade lifecycle. I don’t believe there is anything that one cannot learn from a book. Yes, some people prefer to go to a class and have a teacher explain it to them. But some people prefer to read it themselves (myself included).

I agree that a developer doesn’t necessarily have industry knowledge. It is nice to have, but not mandatory. Be it an application developer, a database developer or a report/BI developer. A project manager also doesn’t need to have industry knowledge; it is nice to have, but not mandatory. They will be able to do their job without it. But a data architect, like a business analyst, needs to have it. A business analyst needs the business knowledge to understand the business requirements. A data architect needs the industry knowledge to design the data model and to analyse where to get the data from.

 

7 May 2016

U-SQL

Filed under: Analysis Services — Vincent Rainardi @ 7:26 pm

U-SQL is a new language used in Azure Data Lake. It is a mix between T-SQL and C#. It is used to load files into Data Lake.

To use U-SQL, we need to add a New Data Lake Store in our Azure Portal, click Sign Up for Preview (we need to use IE; in Chrome we can’t click Sign Up). We will get an email when it is approved (can be a few hours, can be a few weeks).

Here is an example for calculating the duration for each currency in the portfolio:

@rawData = EXTRACT SecurityName string?, Weight decimal?, ModifiedDuration decimal, MaturityDate DateTime, Currency string
FROM "/FixedIncome/EMD.csv" USING Extractors.Csv(silent: true);

@aggData = SELECT Currency, SUM(Weight * ModifiedDuration) AS Duration FROM @rawData GROUP BY Currency;

OUTPUT @aggData TO "/Report/Duration.csv" ORDER BY Currency USING Outputters.Csv();

The above U-SQL code imports a comma-delimited file called EMD.csv, calculates the duration for each currency, and creates a file called Duration.csv. I’ll explain the 3 statements above (EXTRACT, SELECT, OUTPUT) one by one below.

EXTRACT

Case sensitive: U-SQL is case sensitive and strongly typed. It is case sensitive because it uses C# syntax. So “EXTRACT” must be in upper case, “string” must be in lower case and “DateTime” must be in mixed case.

EXTRACT is used to import a file into a row set. It can be used to import both a structured file (all rows have the same number of columns) or unstructured files (rows have different number of columns). Documentation: here.

The data type is C# data type, not T-SQL. Full list of C# built-in data types is here. DateTime is a structure, not a data type. It has Date, Time, Year, Month, Day, Hour, Minute, Second, Today, Now, TimeOfDay and DayOfWeek properties. See here for documentation.

The ? at the end of “string?” and “decimal?” means the column is nullable, i.e. it can contain NULL.

One thing we need to remember is that U-SQL is “schema on read”, which means that we define the data types of the file we import as we import it.

When importing a file we need to use an extractor. It is a data interface program which reads a byte stream in parallel into a row set. U-SQL has 3 built-in extractors: one reads text files (Extractors.Text), one reads tab-delimited files (Extractors.Tsv) and one reads comma-separated values files (Extractors.Csv). The last two are derived from the first one. By default U-SQL splits the imported file into several pieces and imports the pieces in parallel. U-SQL imports unstructured files using a user-defined extractor. See here for documentation on extractors.

Silent is a parameter. It means that if a row in the CSV file doesn’t have 5 columns, the row is ignored (we expect the CSV file to contain 5 columns: SecurityName, Weight, ModifiedDuration, MaturityDate, and Currency). There are other parameters, such as: (documentation here)

  • rowDelimiter: the character at the end of the row, e.g. carriage return, linefeed
  • quoting: whether the values are wrapped in double quotes or not
  • nullEscape: what string represents a null value

@rawData is the name of the row set containing data we import from the CSV file. We can then refer to this row set with this name.

SELECT

Once we have this @rawData row set, we can apply normal SQL statements to it, such as SELECT, UPDATE, INSERT and DELETE. But we need to use C# syntax, so the where clause is not WHERE AssetClass = "Options" but WHERE AssetClass == "Options". And SELECT, UPDATE, INSERT, DELETE all need to be in upper case. Other SQL statements supported in U-SQL are: JOIN, UNION, APPLY, FETCH, EXCEPT, and INTERSECT.

  • SELECT: U-SQL supports WHERE, FROM, GROUP BY, ORDER BY and FETCH. Documentation here. FETCH means get the first N rows, link. For ALIAS see here.
  • JOIN: U-SQL supports INNER JOIN, LEFT (OUTER) JOIN, RIGHT (OUTER) JOIN, FULL (OUTER) JOIN, CROSS JOIN, SEMIJOIN, and ANTISEMIJOIN. Documentation here. Both the “ON T1.Col1 == T2.Col2” and the old style “WHERE T1.Col1 == T2.Col2” are supported (link).
  • UNION: U-SQL supports both UNION and UNION ALL. Doc: here.
  • APPLY: U-SQL supports both CROSS APPLY and OUTER APPLY. Doc: here.
  • Interaction between row sets (these are hyperlinked): EXCEPT, INTERSECT, REDUCE, COMBINE, PROCESS.
  • Aggregate functions: SUM, COUNT, FIRST, LAST, MAX, MIN, AVG, VAR, STDEV, ARRAY_AGG, MAP_AGG. Doc: here. ARRAY_AGG combines values into an array. Split breaks an array into values (link).
  • Operators: IN, LIKE, BETWEEN. Doc: here.
  • Logical operators: Remember that U-SQL is using C# syntax so AND is &&, OR is ||, = is ==, <> is != and not is !
  • “condition ? if_true : if_false” is also supported, see here.

 

OUTPUT

The output of a SQL statement (called Expression in U-SQL) can be assigned to a row set like in the example above i.e. “@aggData = “, or we can do these 3 things:

  1. Output it to a new file using OUTPUT
  2. Output it to a new table using CREATE TABLE AS
  3. Output it to an existing table using INSERT INTO

The OUTPUT expression saves the row set into a file. There are 3 built-in outputters in U-SQL: Outputters.Tsv for tab-delimited files, Outputters.Csv for comma-delimited files, and Outputters.Text for any delimiter. They accept these parameters: quoting, nullEscape, rowDelimiter, escapeCharacter, and encoding. Documentation: here.

Documentation

The U-SQL language reference is here (still work in progress), and the U-SQL tutorial is here.

Data Lake documentation is here and the learning guide is here.

Articles on U-SQL from Microsoft:

  • From Microsoft U-SQL team: link
  • Ed Macauley U-SQL tutorial: link, click on Tutorial under Prerequisites.
  • From Mike Rys: link

17 February 2016

The Problem with Data Quality

Filed under: Analysis Services — Vincent Rainardi @ 8:44 am

The problem with data quality is not the technicality. It is not difficult to design and build a reconciliation system, which checks and validates the numbers in the data warehouse and in BI reports/analytics (or non-BI reports/analytics!).

The problem is, once the reconciliation/validation program spits out hundreds or thousands of issues, who will be correcting them? That is the issue!

It requires someone to investigate the issues, and, more importantly, to fix them. This requires funding which can seldom be justified (it is difficult to quantify the benefits), and requires a combination of skills which rarely exists in one person. So that doubles the money, because we need to hire 2 people. The “checker” who checks the DQ reports is largely an operational application-support type of person, whilst the “fixer” needs to have a detective mindset and development skills. To make matters worse, these development skills are usually platform specific, i.e. .NET, Oracle, SSIS, Business Objects, etc.

Assuming a £40k salary in the UK, then adding NI, pension, software, desk, training, bonus, insurance, consumables, appraisal and payroll costs (a total of £15k), and multiplying by 2 people, it is a £110k/year operation. Adding half the time of a manager (£60k salary + £20k costs), it is a £150k/year operation.

It is usually difficult to find the benefit of a data quality program bigger than £100k. The build cost of a DQ program can be included in the application development cost (i.e. data validation, data reconciliation, automated checks, etc.), but the operational cost is an issue.

So the fundamental issue is not actually finding a person, or a team of people. The fundamental issue is actually to get the funding to pay these people. The old adage in IT is usually true: anything is possible in IT, provided we have the funding and the time.

The benefit can’t come from FTE reduction (full time employee, meaning headcount), because it is not a reduction of workload (unless a manual DQ process is already in place, of course). And it doesn’t come from increased sales or revenue either. Try to find a link between better DQ and increased revenue, and you’ll find that it is hard to establish. And we know that headcount reduction and revenue increase are the two major factors for funding an activity within a company.

3 factors that drive data quality work

But fortunately there are 2 other factors that we can use: compliance and risk.

Compliance in the financial services industry, healthcare industry, etc. requires reporting to the regulators in a timely manner and with good accuracy. That drives data quality work. For example, if we report that the credit derivative loss position is $1.6bn whereas it is actually $2.1bn, we could be risking a penalty/fine of several million dollars.

Risk: there are other risks apart from compliance, namely credit risk, interest rate risk, counterparty risk, etc. Different industries have different risks of course, with financial services probably having the largest monetary amounts, but they all drive data quality work. If the data quality is low, we risk misstating the risk amount, and that could cost the company a fortune.

The 3rd factor to use is the data warehouse. If your company stores a lot of data in one place, such as a data warehouse, and the data quality is low, then all that investment is wasted. A £600k DW warrants a £50k DQ effort. And if your DW has been there for 4 years, the actual cost (development + operation) could easily exceed the £1m mark; a general 10% ratio yields a £100k investment in the DQ work.

The key argument to use with regard to a DW’s DQ is “DW usage”. A centralised data store such as a DW is likely to be used across many applications/teams/business lines. Each of these apps/businesses is at risk of operational issues if the data in the DW is incorrect. And if we don’t monitor the DQ, we can be sure that the data will be incorrect. That is quite an argument for a Data Quality project.

15 February 2016

Effektor

Filed under: Analysis Services — Vincent Rainardi @ 8:36 am

It’s been about 6 months since I came across an automated data warehouse builder called Effektor, based in Copenhagen. I don’t exactly remember when, but I think I came across it at SQLBits. The product is the best in its class, better than the other 4 DW automation products I know (WhereScape, Kalido, Insource and Dimodelo). Effektor can generate the DW tables and the ETL packages in SSIS, and it can also create MDM, SSRS reports, a balanced scorecard (BSC) and an SSAS cube. None of the other 4 products creates MDM, SSRS reports, a BSC or an SSAS cube, as far as I’m aware.

It is 5 years old (2010) and has 40 customers, but it only runs on SQL Server, not any other RDBMS. It runs in Azure, and it runs on the Standard Edition of SQL Server (2008 R2 to 2014) as well as other editions.

All DW automation software saves costs, as I explained in my Dimodelo article. But Effektor goes further: it creates not only the ETL packages but also the cube, the MDM, the scorecard and the reports. The integration with SSRS amazed me because, unlike SSIS and SSAS, SSRS does not have an API like AMO, so the SSRS report definition (RDL) files normally have to be created as XML manually.

Now (version 6.3) it also has a Web API, i.e. we can control the DW sync, data import, DW load and OLAP security from PowerShell via the Web API. SSRS usage from the portal is logged, so we know who uses which report and when.

The only negative side I can think of is financial strength. As I said in the Dimodelo article, and in the Choosing an ETL Tool article, there are 3 factors which come above functionality: price, financial strength and infrastructure. I think Effektor will satisfy the price and infrastructure aspects for most companies (companies that are SQL Server shops, that is), but it is financial strength that is the question mark. Will the company still be there in 5 years’ time, or will it become a takeover target for a bigger company and have the product diminished, i.e. a takeover to get the customers, not the product?

At the moment I don’t see that happening, because it only has 40 customers and an excellent product, so it would be crazy for, say, Microsoft to be interested in Effektor just to get its customers. On the contrary, if a big company is interested in buying Effektor, it must be because of the excellent product. So this actually plays out better for the existing customers, because they will get better support from that bigger company, and the product development team will get better funding as the product is marketed to the bigger company’s existing customers. The only drawback is that the bigger company might increase the licensing cost (to increase profitability for more demanding shareholders, and to cover a more aggressive marketing plan).

Disclaimer: I don’t receive any rewards or incentives, financially or otherwise, from Effektor or any of its partners or competitors, in writing this article.

 

Data Interface (How to Manage a DW Project)

Filed under: Analysis Services — Vincent Rainardi @ 5:54 am

One of the most challenging issues in managing a data warehouse project is that the report/analytics development can’t start before the ETL is completed and has populated the warehouse tables with data.

This is traditionally solved by using a “data interface”, i.e. the fact and dimension tables in the warehouse are created and populated manually with a minimum of data, so that the report/cube developer can start their work.

But the issues with doing that are:

  • It takes a long time and a lot of effort to create that data
  • The hand-crafted data does not reflect real situations
  • The hand-crafted data does not cover many scenarios

A data interface is an agreement between the report-building part and the ETL part of a DW project that specifies what the data will look like. A data interface consists of two essential parts:

  • The data structure
  • The data content (aka data values)

The data structure takes a long time to create, because it requires two inputs:

  • The requirements, which determine what fields are required in the DW
  • The source data, which determines the data types of those fields

Hence the project plan for a traditional data warehouse or data mart project looks like this (let’s assume the project starts in January and finishes in 12 months, in December, and we have 1 ETL developer, 1 report/analytics developer, 1 data architect, 1 BA and 1 PM):

  • Jan: Inception, who: PM, task: produce business case and get it approved to get funding.
  • Feb-Apr: Requirement Analysis, who: BA, task: create functional requirements.
  • Apr-May: Design, who: DA, task: create data model and ETL specs.
  • Jun-Jul: Design, who: DA, task: create report & cube specs
  • Jun-Aug: Build, who: ETL Dev, task: create ETL packages
  • Aug-Oct: Build, who: Report Dev, task: create report & cubes
  • Sep: Test, who: BA & ETL Dev, task: test ETL packages and fix
  • Nov: Test, who: BA & Report Dev, task: test report & cubes and fix
  • Dec: Implementation, who: BA, DA & Dev, task: resolve production issues

The above looks good, but actually it is not good from a resourcing point of view. There are a lot of empty pockets burning budget while people sit idle. There are 27 empty boxes (man-months) in the resource chart below, which is 37.5% of the total 60 man-months.

Figure: Original plan (resource chart)

The report developer starts in August. If we could make the report developer start in June, at the same time as the ETL developer, we would be able to shorten the project by 2 months or so. But how can we do that?

The answer is the data interface. The DA creates the DW tables and populates them with real data from the source system. This is quicker than trying to create the data manually, and the data reflects real situations, covering many scenarios: real scenarios (the ones which are likely to happen) with realistic data, not made-up scenarios with unrealistic data which are unlikely to happen.

Using the populated DW tables (facts and dims) with real data, the report developer will be able to create the reports and the BA will be able to test/verify the numbers in the reports at the same time as the ETL is being developed to populate the tables. We are removing the dependency between the report development and the ETL development, which is a crucial link in the project that prolongs the project duration.

The resource chart now looks like this, 2 months quicker than the original plan:

Figure: 10-month plan (resource chart)

The value of these 2 months is approximately: 2 months x 5 people x $500/person/day x 22 days/month = $110k. That is a significant figure. The DW development cost goes down 16.7%, from $660k to $550k.

We notice that there is a big white area on the lower left of the resource chart, i.e. the DA and the 2 developers are not doing anything in the first 3-5 months. This can be tackled using iterative development, i.e. breaking the project into 3 parts (by functional area) and delivering these 3 parts one after the other. In the chart below, part 1 is yellow, part 2 is green and part 3 is purple.

Figure: Plan split into 3 parts (resource chart)

Part 1 (the yellow boxes) goes live in June, part 2 goes live in Aug, part 3 in Sep. The big system test at the end of the project won’t be required any more, because we go live bit by bit. During April the BA can prepare the test plan.

The resource utilisation is now higher. We now only have 8 white boxes, out of 54. That’s 14.8%.

The project duration is shortened further, to only 9 months rather than 10. That’s another $55k of cost saved. The cost is now $495k, 75% of the original cost ($660k).

13 February 2016

Dimodelo

Filed under: Analysis Services — Vincent Rainardi @ 6:28 am

Recently I came across an automated data warehouse builder called Dimodelo, based in Brisbane. Dimodelo is an SSIS plug-in for SSDT. It is a new company with unknown financial strength, but their product is excellent. It is the best warehouse builder in the market today, better than the three products already in the market (WhereScape, Insource and Kalido) and as good as Effektor.

Dimodelo creates the data warehouse tables, including staging tables, and creates an SSIS package which does the ETL to populate the DW tables. Yes, we need to tell it where the source tables are, what SCD type we’d like, what the business key is, etc. But it is amazingly good and very user friendly. It is more modern and user friendly than WhereScape RED and Kalido, and it creates SSIS packages, which is a big plus for SQL Server shops. It is as user friendly and as modern as Effektor, which is based in Copenhagen and is also built for SQL Server, with an integrated interface to SSRS and SSAS, not just SSIS (so yes, Effektor is more complete than Dimodelo in terms of the BI stack).

Going back to the article I just wrote, “Choosing an ETL Tool” (link): we should not consider vendors who can’t demonstrate financial strength, no matter how good their software is. But as a good start-up, I think and hope Dimodelo will be able to demonstrate this.

Dimodelo say on their website that their average development time for a fact or dimension table is 6.8 hours. In my experience, doing it manually in SSIS takes 1-3-5 days for a dimension (1 day for a simple dimension, 3 days for a medium one and 5 days for a complex dimension) and 3-5-7 days for a fact table (3 days for a simple fact table, 5 days for a medium one and 7 days for a complex fact table). This is development + testing, but excluding design. So 6.8 hours is very impressive. But perhaps it was tested on simple dimensions, not complex fact tables?

This is the 5th time I have heard about software which automates the development of a data warehouse. DW automation software can build simple DWs in a standard way, but for complex DWs I don’t think it will be able to cope. Still, there is a lot of value to be gained by using it. We can jump-start the development in a big way, saving months of initial DW development effort (saving 3 months x 4 people x 22 days/month x $500/day equals $132k). When we use these products, the onus shifts to the architecture and design, which now take the longest time, instead of the development itself.

Examples of “complex” things in DW development are: loading data from message queue, a normalised data warehouse (link, link), accumulating snapshots (link, link, link), distinct attribute dimension (link, link), connecting fact tables (link), deduplication (link), initial data load (link), initialising a new attribute in SCD type 2 (link), a data warehouse without surrogate keys (link), flip flopping in dimension tables (link), partitioning fact tables (link), processing past data (link). Most of the above are issues in dimensional warehouses, but apart from dimensional DWs these days we also have data lakes, Hadoop-based warehouses, Inmon’s CIFs (ODSes, EDW, Normalised DW, etc), real time DWs, MPP-based DWs, Cloud-based DWs, EIIs, and the latest incarnation of temporal-based DW in SQL Server 2016, which I don’t think DW Automation software can handle well either.

So in the end, as our DW develops into a more complex one, we will have to alter the ETL code manually (we will have to do some manual development alongside the automated work), but the $100-200k saving from the development jump-start is very valuable in any DW project. So DW automation software still gives us a lot of value. The bigger the DW project, the more cost savings can be made by using DW automation software.

The value is even greater if your business is consultancy, i.e. you build DWs for other companies, because you will make this cost saving on every single project, for every single client. You can choose to pass that cost saving on to your clients, or to add it to your bottom line (the profit before tax line in your P&L).

Note: I do not receive any incentive or reward, in monetary terms or otherwise, from Dimodelo or its related partners, for writing this article.
