Data Warehousing and Business Intelligence

18 September 2016

Rating Dimension

Filed under: Business Knowledge,Data Architecture,Data Warehousing — Vincent Rainardi @ 2:32 am

In an investment banking data warehouse, we have an instrument dimension (aka security dimension). This instrument dimension is used in many fact tables such as trade, risk, valuation, P&L, and performance.

In this instrument dimension we have rating attributes (columns) from the 3 agencies (S&P, Moody's, Fitch), an in-house rating and many combinations of them. We may have the ratings at both the issuer/obligor level and at the instrument/issue level. We could also have a hierarchy of ratings, e.g. AA+, AA and AA- are grouped into AA. There are separate ratings for cash instruments.

Usual Design

The usual design is to put all rating attributes in the instrument dimension. This is because the granularity of rating is at instrument level. In other words, rating is an attribute of an instrument.

Issues with the Usual Design

The issues with doing that are:

  1. Cash and FX may not be created as an instrument but they can have ratings.
  2. Some fact tables require the rating outside the context of an instrument.

An example for point 1 is a USD bank balance (settled cash) in a GBP portfolio, which is given a rating of AAA. In the instrument dimension there is no instrument called USD bank balance. The other example is USD cash proceeds, which are unsettled cash.

An example for point 2 is a fact table used for analysing the average rating of a portfolio across different dates. This is called Portfolio Credit Quality analysis. On any given day, a credit portfolio has 3 attributes: highest rating, lowest rating and average rating. The average rating is calculated by converting the rating letter to a number (usually 1 to 27, or using Moody's rating factor, link), multiplying the rating number by the weight of each position, and summing them up. There can be several different criteria, for example: including or excluding debt derivatives, funds, cash, or equity, and doing lookthrough or not.

For example, portfolio X on 16th Sep could have these different attributes:

  1. Average rating: AA+
  2. Highest rating: AAA
  3. Lowest rating: BB-
  4. Average rating excluding cash: AA
  5. Average rating excluding cash and derivatives: BBB+
  6. Average rating excluding cash, derivatives and equity: BBB-
  7. Average rating including funds: AA-
  8. Average rating with lookthrough: A+

So the grain of the fact table is portfolio and date. We cannot use the instrument dimension in this fact table, and it is therefore necessary to have rating as its own dimension.
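
As an illustration, here is a minimal T-SQL sketch of the weighted-average calculation, assuming hypothetical FactPosition and DimRating tables (the letter-to-number conversion is covered under Rating Value below):

    -- Weighted average, highest and lowest rating per portfolio per day (a sketch).
    -- Lower RatingValue = better rating, e.g. AAA = 1, AA+ = 2, and so on.
    SELECT f.PortfolioKey,
           f.SnapshotDateKey,
           SUM(f.Weight * r.RatingValue) / SUM(f.Weight) AS AvgRatingValue,
           MIN(r.RatingValue) AS HighestRatingValue,
           MAX(r.RatingValue) AS LowestRatingValue
    FROM   FactPosition f
    JOIN   DimRating r ON r.RatingKey = f.RatingKey
    WHERE  r.RatingValue > 0   -- exclude Unknown, NR and WR, which get 0
    GROUP BY f.PortfolioKey, f.SnapshotDateKey;

The variations (excluding cash, derivatives or equity, or applying lookthrough) become extra predicates in the WHERE clause.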

Another example of a fact table which requires rating outside the instrument dimension is sectoral credit quality. A credit portfolio consists of many bonds and credit derivatives. Each of these bonds (or CDSes) belongs to a certain industry sector. A bond issued by BP is in the Energy sector. A bond issued by Santander is in the Financials sector. The average rating for each sector is calculated by converting each rating to a number, multiplying it by the weight, and summing up all positions in that sector.

If we use GICS (link) as an example, plus a Government sector, on any given day a credit portfolio can have a sectoral credit quality like this (usually excluding cash and equity, but including credit derivatives and debt funds):

  1. Consumer Discretionary: AA+
  2. Consumer Staples: AA-
  3. Energy: A+
  4. Financials: AAA
  5. Health Care: BB-
  6. Industrials: BBB-
  7. Information Technology
  8. Materials: BBB+
  9. Telecommunication Services: AA
  10. Utilities: BB+
  11. Real Estate: B+
  12. Government: AAA

Above is the average rating for each sector, for a given day, for a particular portfolio.
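
These two requirements translate into fact tables whose grain is above instrument level. A possible shape, with hypothetical table and column names, is sketched below:

    -- Grain: one row per portfolio per day.
    CREATE TABLE FactPortfolioCreditQuality
    (
        SnapshotDateKey    INT NOT NULL,
        PortfolioKey       INT NOT NULL,
        AvgRatingKey       INT NOT NULL,   -- references the rating dimension discussed below
        HighestRatingKey   INT NOT NULL,
        LowestRatingKey    INT NOT NULL,
        AvgRatingExCashKey INT NOT NULL    -- one column per variation, or use a criteria dimension
    );

    -- Grain: one row per portfolio per sector per day.
    CREATE TABLE FactSectoralCreditQuality
    (
        SnapshotDateKey INT NOT NULL,
        PortfolioKey    INT NOT NULL,
        SectorKey       INT NOT NULL,      -- e.g. GICS sector
        AvgRatingKey    INT NOT NULL
    );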

Design Choices for Rating Dimension

To satisfy the above requirement (fact tables which analyse rating at a level higher than instrument, e.g. at sector level or portfolio level), we can design the rating dimension in several ways:

  1. Rating schemes and hierarchies are created as attributes
  2. Rating schemes are created as rows, with hierarchies as attributes
  3. Rating schemes and hierarchies are created as rows

Before we dive into the details I'd like to emphasize that we should not try to find out which approach is the best one, because different projects will require different approaches. So instead of asking "What is the best approach?" we should ask "What is the most suitable approach for my project?"

And secondly I'd like to remind us that if the rating data is always used at instrument level then it should reside in the instrument dimension, not in a rating dimension of its own.

Approach 1. Rating schemes and hierarchies are created as attributes

[Image: example rating dimension table for Approach 1]

Of course we should always have RatingKey 0 with description = Unknown. This row is distinctly different from NR (No Rating). NR means S&P gives the bond a rating of NR. RatingKey 0 means that we don't have any info about the rating of this bond.

The distinct advantage of this approach is that we have a mapping between different rating schemes. Moody’s Baa3 for example corresponds to S&P’s BBB-.

The Grade column is either Investment Grade (AAA down to BBB-) or High Yield (HY, from BB+ down to D).
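
As a minimal sketch of Approach 1 (the column names are my own, not a prescribed standard):

    -- Approach 1: one row per rating notch; each scheme and hierarchy is a column.
    CREATE TABLE DimRating
    (
        RatingKey   INT NOT NULL PRIMARY KEY,   -- 0 = Unknown
        SPRating    VARCHAR(10),   -- e.g. AA-
        MoodyRating VARCHAR(10),   -- e.g. Aa3
        FitchRating VARCHAR(10),   -- e.g. AA-
        Tier        VARCHAR(10),   -- e.g. AA
        Grade       VARCHAR(20)    -- Investment Grade or High Yield
    );
    -- e.g. RatingKey 4 holds S&P AA-, Moody's Aa3 and Fitch AA- on a single row.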

Approach 2. Rating schemes are created as rows, with hierarchies as attributes

[Image: example rating dimension table for Approach 2]

The distinct advantage of approach 2 over approach 1 is that RatingKey 33 means Aa3; we don't have to specify that we mean Moody's, as in approach 1. In Approach 1, RatingKey 4 can mean Aa3 or AA-, depending on whether we mean Moody's or S&P.

The distinct disadvantage of approach 2 compared to approach 1 is that we lose the mapping between S&P and Moody's. In Approach 1 we know that Aa3 in Moody's corresponds to AA- in S&P, but in Approach 2 we don't. In the majority of circumstances this is not an issue, because an instrument with a Moody's rating of Aa3 may not have an S&P rating of AA- anyway; its S&P rating could instead be AA or BBB+. So the mapping between S&P and Moody's is the textbook, academic assumption, not the reality in the real world. In the above example, nobody cares whether the corresponding S&P rating for Aa3 is AA-. What the fund managers and analysts want to know is what Moody's actually assigned to that bond.

Approach 3. Rating schemes and hierarchies are created as rows

[Image: example rating dimension table for Approach 3]

The distinct advantage of Approach 3 over Approaches 1 and 2 is that RatingKey 31 means the AA tier; it is clear and unambiguous. In Approaches 1 and 2, RatingKeys 2, 3 and 4 all mean the AA tier.

The other advantage of Approach 3 is the ability to create short-term ratings and cash ratings, because they are simply different schemes.

Why do we need to keep the Tier column in Approach 3? Because we still need to convert AA+, AA and AA- to AA.
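
A minimal sketch of Approach 3, again with assumed names:

    -- Approach 3: schemes and hierarchy levels are rows, distinguished by RatingScheme.
    CREATE TABLE DimRating
    (
        RatingKey    INT NOT NULL PRIMARY KEY,   -- 0 = Unknown
        RatingScheme VARCHAR(30),   -- e.g. S&P, Moody's, Fitch, Tier, Cash, Short Term
        RatingCode   VARCHAR(10),   -- e.g. AA-, Aa3, AA
        Tier         VARCHAR(10),   -- e.g. AA, so AA+/AA/AA- can still be rolled up
        Grade        VARCHAR(20)
    );
    -- e.g. one row for S&P AA-, another row for Moody's Aa3, another row for the AA tier itself.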

Rating Value (aka Rating Number)

In some projects we need to convert the rating letters to numbers. For this we can create an attribute in the above rating dimension (any of the 3 approaches) called Rating Value. It is usually AAA = 1, AA+ = 2, AA = 3, AA- = 4, and so on. NR (No Rating) and WR (Withdrawn Rating) get 0 rather than 23.

[Image: example rating value mapping]

Note that unlike the S&P and Moody's schemes, in the Fitch scheme the D rating is differentiated into three: DDD, DD and D. So DDD = 22, DD = 23, D = 24.

Miscellaneous

A common mistake is to assume that there are CC+ and CC- ratings. There aren't. Only CCC is split into three (CCC+, CCC, CCC-); CC and C are not.

Moody's rating of Ca corresponds to 2 ratings in the S&P scheme: CC and C. This is why approach 1 is tricky: the ratings do not have a one-to-one relationship between one scheme and another. Another example, as I mentioned above, is that the D rating in S&P corresponds to 3 ratings in Fitch: DDD, DD and D.

20 June 2016

Data Lake vs Data Warehouse

Filed under: Data Architecture,Data Warehousing — Vincent Rainardi @ 6:22 pm

A Data Lake is a storage and analytic system which stores structured and unstructured data from all source systems in the company in its raw form. The data is queried, combined and analysed to get useful patterns and insights. It is built on a cost-effective Hadoop storage infrastructure, and can be dynamically scaled.

A Data Warehouse is a system that retrieves and consolidates data periodically from source systems into a dimensional or normalised data store. It usually keeps years of history and is queried for business intelligence or other analytical activities. It is typically updated in batches, not every time a transaction happens in the source system.

Data Lake Advantages

Because data from all source systems is there in the data lake, the main advantage of a data lake is that it enables analysts to get analytics and insight across multiple systems. Yes, the data is still in its raw form, so it requires some processing, but that is a lot quicker (1-2 weeks) than building a data warehouse (6-12 months). And that is the second advantage: time to build.

The third advantage of data lake is its ability to store unstructured data such as documents and social media, which we can query and combine with structured data from databases.

The fourth and probably the most powerful business case for a data lake is cost efficiency. In the investment banking world, for example, we can store market data (e.g. prices, yields, spreads, ratings) not only for the securities that we hold but for all securities in the market, more cheaply than if we stored it in Oracle or SQL Server.

The fifth advantage of a data lake is flexibility. Unlike Oracle or SQL Server, a data lake scales up dynamically. If this morning it is 2 TB, in the afternoon it can be 3 TB. As new data arrives and we need new capacity, we can add storage easily. As we require more computing power, we can get it there and then, instantly. There is no need to wait a few days for adding a new node to a SQL Server Always On availability group, adding storage to the SAN, or extending the computing power of an Oracle Grid. This means that we do not need to spend a lot of money upfront paying for 3 TB of data and 16 processors. We can just start with 300 GB and 4 processors, and expand when required.

Data Lake Disadvantages

The first issue is that data lake technology is immature. The language which can query across databases and unstructured files has only very limited features (each vendor has a different language; for Microsoft Azure it is U-SQL). It is probably only 10% of what PL/SQL or T-SQL can do. We can solve this by putting QlikView or Tableau on top: we use U-SQL only to query individual tables, then join the data in QlikView/Tableau and do further processing there.

The second issue is cost (but I don't find that this argument holds water). The issue is: it actually costs a lot of money to store data from all source systems in the company, let alone external market data, which carries a lot of licence cost.

Let's take Microsoft Azure Data Lake as a pricing example. Being in the cloud, the price is only US$0.04 per GB and US$0.07 per 1 million transactions (ref: link). Let's say that "data from ALL systems in the company" is 500 GB, and every day we store this data in the data lake. So 500 GB per day x 365 days = 182,500 GB per year x $0.04/GB = $7,300 per year. Let's say we have 10 million transactions per day, which is 3,650 million transactions per year x $0.07/million transactions = $256 per year. So it only costs us about $7,500 per year. This is a very reasonable price to pay to have "all the data in the company in one place". Even 5 years later, when the volume has grown fivefold, it is only about $37,500 per year. Still a very reasonable price to pay.

The third issue is performance. Because a data lake stores the data in its raw format, a query which joins the different data sets could run like a dog. Luckily a data lake runs on HDFS, a distributed file system, which is very fast. So yes, it is slower than a data warehouse, but it is not too bad. We are not talking about 30 minutes to run a query, but something like 20-30 seconds (compared to a DW query which takes, say, 1-3 seconds).

A data lake being an "exploration tool", I can tolerate a little bit of slowness. After we prove that (for example) the risk data we query is making money, it can pay for creating a Risk data mart specifically for that risk analysis purpose.

The fourth issue is skill. A data lake requires superb business knowledge, because the analyst needs to join tables from different source systems. What is the difference between Real Modified Duration, Effective Real Duration and Modified Duration? But this issue is the same whether we are building a Data Warehouse or a Data Lake. Both require good business knowledge.

I don't find U-SQL difficult to understand for people with SQL knowledge, which is a common skill. But how about Machine Learning? That is difficult to master, right? Yes, that is true, but it is worth paying an expert data scientist to discover the insights in the data, because these insights can be used to save cost or boost our revenue. The potential benefit is 10-100 times the salary of the data scientist.

Conclusion

So considering all the pros and cons above, I am in favour of creating a Data Lake, in addition to having a Data Warehouse of course.

 

21 April 2016

Why do we need a data warehouse?

Filed under: Data Warehousing — Vincent Rainardi @ 4:59 pm

I wrote about this in 2012 (link) but this one is different.

One of the most compelling reasons for creating a data warehouse is to reduce cost. Imagine if you were a large investment bank, trading fixed income, equities, commodities and currencies. Every month you need to produce hundreds of reports for your clients, for regulatory bodies and for management. The reporting team currently spends a lot of time creating these reports. They all have access to various systems: risk systems, trading systems (order management), compliance systems, fixed income systems, finance systems. They copy and paste data from these systems into Excel and do some calculations there, which serve as the data source for producing the reports in PowerPoint and PDF. Let's find out how much this operation costs. Say there are 8 systems, costing $200 per user per month, and there are 10 people in the team. The licence cost is 8 x $200 x 10 = $16k/month, which is $192k/year; call it $200k. The salary is $60k x 10 = $600k per year. Total cost = $800k/year.

If we build a data warehouse, taking 1 year and costing $500k, we will be able to put together the data from those 8 systems into one database, and the reporting team won't need access to those systems any more. That saves $200k/year.

Secondly, we will be able to automate the production of those reports. That will reduce the amount of time required to make those reports, thus reduce the number of people required, freeing them to do other activities. Assuming that half of the reports can be automated, that would save us $300k per year.

So total saving per year is $500k. Project cost is $500k. Surely a project with this level of “cost vs benefit” is well worth pursuing.

1 April 2016

Data Sourcing

Filed under: Data Architecture,Data Warehousing — Vincent Rainardi @ 8:16 pm

One of the critical activities in a data warehouse project (and in any data project) is data sourcing. The business users have a few reports which need to be produced from the data warehouse. There is no data for them yet in the data warehouse, so we look at each report and ask ourselves: "Where can I get the data from to produce this report?" The plan is to find the data source, and then bring the data into the data warehouse.

There are 6 steps of data sourcing:

  1. Find the source
  2. Reproduce the numbers
  3. Verify the extraction
  4. Check the coverage
  5. Check the timing
  6. Check the cost

Step 1. Finding the source

We find out how the report was created, who created it, and from where. The report could be a PDF which is created from PowerPoint, which was created from an Excel spreadsheet, which was created from a file, which was exported from a system. It is this system that we need to check. Find the tables, the columns and the rows where the data is coming from.

Step 2. Reproduce the numbers

Once we have located the tables, we then try to reproduce the numbers we see on the report. For example, the report could be like this:

[Image: Asset Allocation table]

Try to reproduce the first number (24.3%) by querying the tables which you think the data is sourced from. In doing this you will need to look at the data processing happening in the Excel spreadsheet. Do they exclude anything? Is it a straightforward sum of the holdings in government bonds? Do they include supranationals? Government agencies? Regional? Municipal? Emerging market debt?

If you can match 24.3% then great. If not, investigate whether the data was manually overridden before it was summed up. For example, in Excel there might be a particular bond which was manually classified as Government even though its asset class is Corporate, because it is 80% owned by the government, or because it is backed by the government.

We need to be particularly careful with regard to the totals. For example, on the "Allocation by Country of Risk", if the total portfolio value is $200m but they exclude FX forwards/swaps, or bond futures, or certain derivatives, then the total portfolio value could decrease to $198m, and all the percentages would be incorrect (they would be slightly higher).
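
For example, a first attempt to reproduce the 24.3% might look like the sketch below (all table and column names are hypothetical; the point is that the WHERE clause must mirror the spreadsheet's inclusions and exclusions):

    -- Government bond weight as a percentage of the portfolio value.
    SELECT SUM(CASE WHEN i.AssetClass = 'Government Bond' THEN p.MarketValue ELSE 0 END)
           / SUM(p.MarketValue) * 100.0 AS GovernmentBondPct
    FROM   Position p
    JOIN   Instrument i ON i.InstrumentId = p.InstrumentId
    WHERE  p.PortfolioCode = 'X'
      AND  p.ValuationDate = '2016-09-16'
      AND  i.AssetClass NOT IN ('FX Forward', 'FX Swap', 'Bond Future');  -- match the report's exclusions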

Understanding the logic behind the report is critical in reproducing the numbers. In-depth industry knowledge will be helpful to understand the logic. For example:

  1. The credit rating for each debt security is categorised into IG (Investment Grade) if it is BBB- or above, and into HY (High Yield) if it is BB+ or below, or NR (a sketch of this categorisation is given at the end of this step).
  2. The credit rating used for point 1 is the average of S&P, Moody's and Fitch, except a) for own funds use look-through, b) for outside funds use IMMFA.
  3. The "Allocation by Country of Risk" table excludes cash and FX forwards/swaps.
  4. Each country is categorised into Developed, Emerging or Frontier market.
  5. When determining the country of risk in point 4, for derivatives use the underlying.

In the above case, if we are not familiar with the investment banking industry, it would take a long time to understand the logic. So, yes, when doing data sourcing, it is best done by a business analyst with good knowledge and experience in that industry sector.
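
To illustrate point 1 above, the IG/HY categorisation could be expressed as something like this (a sketch; InstrumentAverageRating is an assumed view holding the averaged rating as a number, with AAA = 1 down to BBB- = 10 and NR = 0):

    -- IG if the average rating is BBB- or better; HY if BB+ or below, or NR.
    SELECT InstrumentId,
           CASE WHEN AvgRatingValue BETWEEN 1 AND 10 THEN 'IG'   -- AAA (1) down to BBB- (10)
                ELSE 'HY'                                        -- BB+ and below, or NR (0)
           END AS CreditGrade
    FROM   InstrumentAverageRating;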

Step 3. Verify the extraction

Once we can reproduce the numbers, we need to verify if we can get the data out. A few gotchas are:

  1. The numbers are calculated on-the-fly by the system, and not stored anywhere in the database. If this is the case, find out from the vendor if they have an export utility which produces a file after the numbers have been calculated.
  2. Are we allowed to connect to that database and query it? Do not assume that we can, because I've encountered a few cases where we were not allowed to do that. It could be because of the system workload/performance (it is a critical transaction system and they don't want any big query ruining it for the front-end users), or it could be because they provide daily extract files which all downstream systems must use (instead of querying the database directly). From the system admin's point of view it makes sense not to allow any external query to run on the database, because we don't know what kind of damage those external queries could cause; they could block the front-end queries and cause locking.
  3. Losing precision, i.e. the data must be exported from the database but during the export process the precision decreases from 8 decimal places to 2 decimal places.
  4. There is a security restriction because it is against the "Chinese wall" compliance rules (in investment banking, the public-facing departments must not get data from the M&A department).
  5. The system is being migrated, or rewritten, so it is still in a state of flux and we need to wait a few months.
  6. The system is not solid "production quality" but only a "throwaway", which means that within a few months they could drop those tables.

Step 4. Check the coverage

This step is often missed by business analysts. We need to check whether all the products that we need are available in that system. If the report we are trying to reproduce covers 4000 products from 100 stores, but the source system/tables only cover 3000 products from 70 stores, then we've got to find out where the other 1000 products and 30 stores are sourced from. Are they sourced from a different system?

Not only products and stores. We need to check the coverage in terms of: customers/clients, portfolios, securities, lines of business, underwriting classes, asset classes, and data providers. And the most important coverage check is on dates, e.g. does the source system have data from 2011 to the present? It is possible that the source system only has data from 2014.
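
Assuming we are allowed to query the source, a quick coverage check can be as simple as this (table and column names are illustrative):

    -- How many products and stores, and what date range, does the source actually cover?
    SELECT COUNT(DISTINCT ProductId) AS ProductCount,
           COUNT(DISTINCT StoreId)   AS StoreCount,
           MIN(TransactionDate)      AS EarliestDate,
           MAX(TransactionDate)      AS LatestDate
    FROM   SourceSystem.dbo.SalesTransaction;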

Step 5. Check the timing

After checking the coverage, we need to check whether the data is available when we need it. We need to check these 3 things: the data arriving too late, the available extraction window, and the data being overwritten.

  1. Data is too late: if our DW load starts at 2.15 am, will the data be available before that? If not, could the business users live with slightly stale data (data from 2 days ago, i.e. if today is Wednesday, the latest data in the DW would be Monday's data)?
  2. Available extraction window: in the production environment, when can we extract the data from that source system? If from 11pm to 6am there is an overnight batch running, and we can't run our extract during that time, then the earliest we can run is 7am. If the DW load takes 3 hours, DW users can access it at 10am. Is that too late for the users or not?
  3. The data is overwritten: the data in the source system can be updated many times during the day, and when we extract it at night we have no trace of these changes. Is that OK? Do we need an intraday, push-driven data load into the DW? Or would a 10-minute data extraction frequency (pull-driven) be enough?

Step 6. Check the cost

There is no free lunch. We need to check how much it would cost us to use that source data.

  1. If the data is valuable (such as prices, yields and ratings from Bloomberg, Reuters and Markit) we would have to pay the data providers. We need to check the cost. The cost could be, say, $5 per security per call, so it could easily be $20-30k per day. The cost is usually shared with other departments.
  2. Check with the data provider: if you use the data only as an input to your calculation, and you don't publish it or send it on to any external parties (clients, etc.), would it still cost you a lot? Even if you don't have to pay the data providers, your DW project might still have to share the cost with other departments. Say the data provider is Markit and your company pays $300k/year for prices data, currently shared by 5 departments ($60k each). Your project may have to bear a cost of $50k/year ($300k/6).
  3. The cost could be a killer to the whole thing, i.e. even if steps 1 to 5 above are all OK, a data cost of $50k could force you to cancel the whole project.
  4. Sometimes another department has to create the data for you. Say a yield calculation, or a risk engine, or an OTC pricing engine, and the requirement from the BI/DW is specific so they have to develop it. It could take them 3 months x 3 people and they could cross-charge your project $50k (one-off). That could also be a killer to the DW project.
  5. Developing an interface: some systems do not allow external systems to pull the data out. They insist on developing an export, and charge the cost to your DW project.
  6. Standard data interface system: some large companies (such as multinational banks) have a standard interface (real time, end of day, etc.), and the central middleware team might charge your DW project a small amount (say $2,000 one-off + $50/month) to use that standard data interface system. Say you need FX rate data from the FX system, and there is already a standard message queue for FX rates with publication times of 11am, 1pm and 3pm. So you "subscribe" to this publication MQ and pay the cost (project cross-charge).

16 March 2016

Different Measures for Different Product Types

Filed under: Data Architecture,Data Warehousing,Investment Banking — Vincent Rainardi @ 8:33 am

What I mean by a measure here is a time-variant, numerical property of an entity. It is best to explain this by example. In the investment industry, we have different asset classes: equities, bonds, funds, ETFs, etc. Each asset class has different measures. Equities have opening and closing prices, daily volume, market capitalisation, daily high and low prices, as well as annual and quarterly measures such as turnover, pretax profit and EPS. Bonds have different daily measures: clean and dirty prices, accrued interest, yield and duration. Funds have different daily measures: NAV, alpha, sharpe ratio, and volatility, as well as monthly measures such as 3M return, 1Y return, historic yield, fund size and number of holdings. ETFs have daily bid, mid and offer prices, year high and low, and volume; as well as monthly measures such as performance. The question is: what is an appropriate data model for this situation?

We have three choices:

  1. Put all measures from different product types into a single table.
  2. Separate the measures for each product type into different tables.
  3. Put the common measures into one table, and put the uncommon measures into separate tables.

My preference is approach 2, because we don't need to join across tables for each product type. Yes, we will need to union across different tables to sum up across product types, but a union is much more performant than a join operation. The main weakness of approach 1 is column sparsity.
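
To sum up across product types with approach 2, we union the per-asset-class tables. A sketch, with hypothetical table and column names:

    -- Approach 2: one daily-measure table per asset class, combined with UNION ALL.
    SELECT MeasureDate, InstrumentId, ClosePrice AS Price FROM EquityDailyMeasure
    UNION ALL
    SELECT MeasureDate, InstrumentId, CleanPrice AS Price FROM BondDailyMeasure
    UNION ALL
    SELECT MeasureDate, InstrumentId, NAV        AS Price FROM FundDailyMeasure;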

On top of this, of course, we will need to separate the daily measures and monthly measures into two different tables. Annual and quarterly measures for equities (such as financial statement numbers) can be combined into one table. We need to remember that measures with different time granularity usually come from different groups. For example, prices are daily but performance is monthly.

Static Properties

In addition to different time-variant properties (usually numerical), each asset class also has different static properties (which can be textual, date or numeric). For example, equities have listed exchanges, industry sectors, country of domicile, and dividend dates. Bonds have issuers, call and maturity dates, and credit ratings. Funds have benchmark, trustee, legal structure and inception date. Examples of numerical properties are minimum initial investment and annual charges for funds; outstanding shares and denomination for equities; par and coupon for bonds. Some static properties are common across asset classes, such as ISIN, country of risk and currency.

Static properties from different asset classes are best stored in separate tables. So we have an equity table, a bond table, a fund table and an ETF table. Common properties such as ISIN, country of risk, etc. are best stored in a common table (usually named the security table or instrument table).

Why not store all static properties in a common table? Because the properties are different for each asset class so it is like forcing a square peg into a round hole.

Historical Data

For time-variant properties it is clear that the table already stores historical data in the rows. Different dates are stored as different rows. What we are discussing here is the historical data of the static attributes. Here we have two choices:

  1. Using the SCD approach: store the historical values on different rows (called versions), where each row is only valid for a certain time period. SCD stands for slowly changing dimension, a Kimball approach in data warehousing.
  2. Using the Audit Table approach: store the historical rows in an audit table (also called a history table). This is the traditional approach in normalised modelling. The main advantage is that the main table is lightweight and performant.

When to use them? Approach 1 is suitable for situations where the historical versions are accessed a lot, whereas approach 2 is suitable for situations where the historical versions are very rarely accessed.

The main issue with approach 1 is that we need to use "between" on the validity date columns. In data warehousing we have a surrogate key to resolve this issue, but in normalised modelling we don't. Well, we could and we should. Regardless of whether we use approach 1 or 2, in the time-variant tables we need to store the ID of the historical row for that date. This will make getting historical data a lot faster.
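
The two ways of getting the attribute values as they were on the measurement date can be sketched like this (assumed names; the second form relies on the version ID being stamped on the time-variant row at load time):

    -- Option 1: join on the validity date columns.
    SELECT m.MeasureDate, m.CleanPrice, v.CreditRating
    FROM   BondDailyMeasure m
    JOIN   BondVersion v
      ON   v.BondId = m.BondId
     AND   m.MeasureDate BETWEEN v.ValidFrom AND v.ValidTo;

    -- Option 2: store the version ID on the time-variant row, then join on it directly.
    SELECT m.MeasureDate, m.CleanPrice, v.CreditRating
    FROM   BondDailyMeasure m
    JOIN   BondVersion v ON v.BondVersionId = m.BondVersionId;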

24 February 2016

Instrument Dimension

Filed under: Data Warehousing,Investment Banking — Vincent Rainardi @ 6:15 pm

One of the core dimensions in investment banking is the instrument dimension. It is also known as the security dimension. It contains various types of financial instruments or financial securities, such as bonds, options, futures, swaps, equities, loans, deposits, and forwards.

The term “securities” used to mean only bonds, stocks and treasuries. But today it means any tradable financial instruments including derivatives, cash, loans, and currencies. Well, tradable and “contract-able”.

Where it is used

In a data warehouse for an investment bank, a brokerage or an investment company, an instrument dimension is used in three primary places: trade fact tables, position fact tables and P&L fact tables. Trade fact tables store all the transactions made by the bank, either for a client (aka flow trading) or for the prop desk (the bank's own money), in all their lifecycle stages from initiation, execution, confirmation, clearing, to settlement. Position fact tables store daily values of all instruments that the bank holds (long) or owes (short). P&L (profit and loss) fact tables store the daily impact of all trades and positions on each of the bank's financial accounts, e.g. IAS 39.

The secondary usages in an asset management or an investment banking data warehouse are risk fact tables (e.g. credit risk, market risk, counterparty risk), compliance fact tables, regulatory reporting, mark-to-market accounting, pricing, and liquidity fact tables.

The niche usages are ESG-score fact tables (aka SRI, socially responsible investing), rating transition fact tables, benchmark constituent fact tables, netting position fact tables, and collateral fact tables.

Data Structure

The business key of an instrument dimension is usually the bank-wide internal instrument identifier. Every instrument that the bank gets from market data providers such as Bloomberg, Reuters, Markit, index constituents, and internal OTC deals, is mastered in a waterfall process. For example, public instruments (debts, equities, ETDs) are identified using the following external instrument identifiers, in order: ISIN, Bloomberg ID (BBGID), Reuters ID (RIC), SEDOL, CUSIP, Exchange Ticker, Markit ID (RED, CLIP), Moody's ID (Master Issue ID). Then the internal identifiers for OTCs (e.g. CDS, IRS, FX swaps), FX forwards, and cash are added.

The attributes of an instrument dimension can be categorised into 9 groups:

  1. Asset Class
  2. Currency
  3. Country
  4. Sector
  5. Issuer
  6. Rating
  7. Maturity
  8. Instrument Identifier
  9. Asset class specific attributes

1. Asset Class

Asset Class is a classification of financial instruments based on their function and characteristics, e.g. fixed income, equities, cash, commodity. We also have real assets such as land, buildings, physical gold and oil.

It also covers the hierarchy / groupings of the asset classes, hence we have attributes such as: asset class, asset sub class, asset base class. Or alternatively asset class level 1, level 2, level 3. Or asset class, asset type, asset group.

Good starting points for asset class categorisation are ISDA product taxonomy, Barclays index guides, Academlib option pages and Wikipedia’s derivative page. Here is a list of popular asset classes:

FIXED INCOME: Government bond: sovereign, supranational, municipal/regional, index linked, zero coupon, emerging market sovereign, sukuk sovereign. Corporate bond: investment grade, high yield, floating rate note, convertible (including cocos), covered bond, emerging market corporate, sukuk corporate. Bond future: single name bond future, future on bond index. Bond option: single name bond option, option on bond index. Bond forward: single name bond forward, forward on bond index. Credit default swap: single name CDS, CDS index, CDS swaption, structured CDS. Asset backed security (ABS): mortgage backed security (including RMBS and CMBS), ABS (auto, credit card, etc), collateralised debt obligation (CDO), ABS index. Total Return Swap: single name TRS, TRS index. Repurchase agreement: repo, reverse repo.

EQUITY: Cash equity: common shares, preferred shares, warrant, equity index. Equity derivative: equity option (on single name and equity index), equity future (on single name, equity index, and equity basket), equity forward (on single name, equity index, and equity basket), equity swap (on single name and equity index).

CURRENCY: Cash currency: FX spot, FX forward. Currency derivative: cross currency swap.

RATES: Interest rate: interest rate swap, overnight index swap (OIS), interest rate cap, interest rate future, interest rate swaption, forward rate agreement (FRA), asset swap. Inflation rate: inflation swap, inflation swaption, inflation cap, zero-strike floors, inflation protected annuity.

COMMODITY: commodity future: energy (oil, gas, coal, electricity, wind turbine), base metal (copper, iron, aluminium, lead, zinc), precious metal (gold, silver, platinum, palladium), agriculture (grains: corn, wheat, oats, cocoa, soybeans, coffee; softs: cotton, sugar, butter, milk, orange juice; livestock: hogs, cattle, pork bellies). Commodity index (energy, metal, agriculture). Option on commodity future. Commodity forward.

REAL ASSET: Property: Agricultural land, residential property, commercial property. Art: paintings, antique art. Collectibles: fine wine, rare coins, antique cars, jewellery (including watches and precious stones).

FUND: money market fund, equity fund, bond fund, property fund, commodity fund, currency fund, infrastructure fund, multi asset fund, absolute return fund, exchange traded fund.

OTHER: Private equity. Venture capital.

Note on differences between asset class and asset type: asset class is usually a categorisation based on market, i.e. fixed income, equity, cash, commodity and property; whereas asset type is usually a categorisation based on time and structure, i.e. spot, forward, future, swap, repo, ETD, OTC, etc.

Note on overlapping coverage: when constructing the asset class structure, we need to be careful not to make the asset classes overlap with each other. If we do have an occurrence where an instrument can be put into two asset classes, make sure we have a convention for where to put the instrument. For example, an IRS whose legs are in different currencies is called a CCS (Cross Currency Swap). So either we don't have a CCS asset class and assign everything to IRS (this seems to be the more popular convention), or we do have CCS and make sure that none of the swaps with different currencies is in IRS.

2. Currency

For single-legged "hard" instruments such as bonds and equities, the currency is straightforward. For multi-legged, multi-currency instruments such as FX forwards and cross currency swaps, we have two currencies for each instrument. In this case we either have a column called "currency pair", with a value such as "GBP/USD", or two columns marked as "buy currency" and "sell currency".

For cash instruments, the currency is the currency of the cash. For “cash like” or “cash equivalent” instruments such as CP, CoD, T-bill, the currency is straightforward, inherent in the instrument. For multi-currency CDS Index such as this (i.e. a basket of CDSes with different currencies), look at the contractual currency of the index (in which the premium leg and protection leg are settled), not the liquid currency (the currency of the most liquidly traded CDS).

For derivatives of equities or fixed income, the currency is taken from the currency of the underlying instrument.

3. Country

Unlike currency which is a true property of the instrument, country is a property of the issuer. There can be three different countries in the instrument dimension, particularly for equities, i.e. country of incorporation, country of risk (aka country of operation, country of domicile), country of listing.

Country of risk is the country where, if there are significant business, political or regulatory changes in that country, the operation of the company which issues this security will be significantly affected. This is the most popular one, particularly for portfolio management and the trade lifecycle. It is common for a company to operate in more than one country; in this case it is the main country (from a revenue/income point of view), or set to "Multi-countries".

Country of incorporation is the country where the issuer is incorporated, not where the holding company (or the "group") is incorporated. This is used for regulatory reporting, for example FATCA and FCA reporting.

Country of listing depends on the stock market where the equity instrument is listed. So there can be two different rows for the same instrument, because it is listed on two different stock exchanges.

The country of risk of cash is determined by the currency. In the case of Euro instruments (not Eurobonds*) it is usually set to Germany, or Euroland (not EU). *Eurobond has a different meaning: it is a bond issued in a currency other than the currency of the country where it is issued, e.g. an Indonesian government bond issued in USD.

An FX forward, which has 2 different currencies, has one country of risk, based on the fixed leg (not the floating leg) because that is where the risk is. The country of risk for a cross currency swap is also based on the fixed leg. For a floating-for-floating CCS, the convention is usually to set the country of risk to the less major currency, e.g. for USD/BRL, USD is more major than BRL, so Brazil is the country of risk. For non-deliverable forwards and CCS (meaning the payment is settled in another currency because the ND currency can't be delivered offshore), the country of risk is set based on the settlement currency (usually USD).

Like currency, the country of a derivative of equities or fixed income instrument is taken from the country of the underlying instrument.

4. Sector

These attributes are known by many names: sector, industrial sector, industry sector, or industry. I will use the term sector here.

There can be many sector attributes in the instrument dimension, e.g. Barclays level 1/2/3, MSCI GICS (and S&P’s), UK SIC, International SIC, FTSE ICB, Moody’s sector classification, Factset’s sector classification, Iboxx, etc. They have different coverage. Some are more geared up towards equities, some more towards fixed income.

The cash instruments and currency instruments usually have either no sector (blank) or are set to "cash". Rates instruments, commodity futures and real assets usually have no sector.

The sector of fixed income derivatives, such as options and CDSes, is determined by the sector of the underlying instrument. Ditto equity derivatives.

5. Issuer

All equity and fixed income instruments have issuers. This data is usually taken from Bloomberg, or from the index provider if the position is an index constituent.

All corporate issuers have parents. This data is called Legal Entity data, which can be obtained from Bloomberg, Thomson Reuters, Avox/FT, etc. From the Legal Entity structure (parent-child relationships between companies, or ownership/subsidiary to be more precise) we can find the parent issuer, i.e. the parent company of the issuer, and the ultimate parent, i.e. the parent of the parent of the parent (... up to the top) of the issuer.

Legal entity data is not only used in the instrument dimension. The main use of LE data within an investment bank is for credit risk and KYC (know your customer), i.e. customer due diligence. PS: LEI means Legal Entity Identifier, e.g. BBG Company ID, FATCA GIIN (Global Intermediary Identification Number), LSE's IEI. But LEI also means the ROC's Global LEI (ROC being the Regulatory Oversight Committee).

6. Rating

Like sector, there are many ratings. Yes, there are only 3 rating providers (S&P, Moody's, and Fitch), but combined with an in-house rating there can be 15 different permutations of them, i.e. the highest of SMF, the lowest of SMF, the second highest of SMF, the average of SMF, the highest of SM, the lowest of SM, the average of SM, the highest of SMFH, the lowest of SMFH, etc., with S = S&P, M = Moody's, F = Fitch and H = house rating.

Plus we have Rating Watch/Outlook from the 3 providers. Plus, for CDS, we can have an "implied rating" from the spread (based on Markit CDS prices data).
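
As an illustration, the "lowest of SMF" can be derived from the three agencies' numeric rating values (a sketch with assumed column names; a larger rating value means a worse rating, and NULL handling is omitted):

    -- Lowest (worst) of S&P, Moody's and Fitch, using rating values where AAA = 1, AA+ = 2, ...
    SELECT InstrumentId,
           CASE WHEN SPRatingValue >= MoodyRatingValue AND SPRatingValue >= FitchRatingValue THEN SPRatingValue
                WHEN MoodyRatingValue >= FitchRatingValue THEN MoodyRatingValue
                ELSE FitchRatingValue
           END AS LowestOfSMFValue
    FROM   DimInstrument;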

7. Maturity

Almost all fixed income instruments have a maturity date. Maturity is how far that maturity date is from today, stated in years rather than days. We also have effective maturity, which is the distance in time between today and the nearest call date, also in years.

8. Instrument Identifier

This is the security identifier as explained earlier, i.e. ISIN, Bloomberg ID, Ticker, Markit ID, Sedol, CUSIP, Reuters ID, Moody’s ID.

9. Asset Class Specific Attributes

Each asset class has its own specific attributes.

For CDS we have payment frequency (quarterly or bi-annually), standard coupon payment dates (Y/N), curve recovery (Y/N), recovery rate (e.g. 40%), spread type (e.g. conventional), restructuring clause (Old R, Mod R, Mod-Mod R, No R), fixed coupon convention (100 or 500), succession event, auction settlement term, settlement type (cash/physical), trade compression, standard accruals (Y/N), contract type (e.g. ISDA).

For IRS we have amortising swap flag, day count convention, following convention (Y/N), no adjustment flag (Y/N), cross currency flag (Y/N), buy currency, sell currency, mark-to-market flag, non-deliverable flag, settlement currency.

Debt instruments such as bonds have these specific attributes: coupon type (e.g. fixed, floating), seniority (e.g. senior, subordinated), amortising notional, zero coupon. Funds also have their own specific attributes, such as emerging market flag, launch date, accumulation/income, base currency, trustee, fund type, etc.

Granularity

The granularity of an instrument dimension can be a) one row for each instrument (this is the norm), or b) one row for each leg. This is to deal with multi-leg instruments such as CDS (3 legs) and cross currency swaps (2 legs). The asset class of each leg is different.

If we opt for one row for each instrument, the asset class for each leg needs to be put in the position fact table (or transaction fact table, compliance fact table, risk fact table, etc).

Coverage

There are millions of financial instruments in the market, and throughout the life of an investment bank there can be millions of OTCs created in its transactions. For investment companies, there are a lot of instruments which they held in the past but no longer hold. Coverage of an instrument dimension means: which instruments are we going to maintain in this dimension? Is it a) everything in the market, plus all OTCs, b) only the ones we have ever used, or c) only the ones we have held in the last N years?

We can set the coverage of the instrument dimension to cover all bonds and equities which have ever existed since 1900, but this seems to be a silly idea (because of the cost, and because they are not used), unless we plan to conduct specific research, e.g. analysing credit quality changes over a long period. The most popular convention is to store only what we have ever used.

SCD

Most instrument dimensions are type 2, but which attributes are type 2 differs from project to project and bank to bank (and non-bank). Most of the sector, rating and country attributes are type 2. Maturity date is type 2, but maturity (which is in years) is either type 1 or kicked out of this dimension into a fact table (the more popular option). Next coupon date and effective date are usually type 1.

Numerical attribute

Numerical attributes such as coupon, rating factor, effective maturity, etc. are treated according to their change frequency. If they change almost every day, they must be kicked out of the instrument dimension into a fact table. Examples: effective maturity, maturity, holding positions.

If they change a few times a year, or almost never change (static attributes), they stay in this instrument dimension, mostly as type 2. An example of a static numerical attribute is the coupon amount (e.g. 6.25%). An example of a numerical attribute that does change is the rating factor (e.g. 1 for AAA and 3 for A) and the associated PD (probability of default, in bps, such as 0.033 for AAA, 0.54 for AA, and 10 for BBB), which on average change once every 1 to 3 years.

The difference between a numerical attribute and a fact is that a numerical attribute is a property of the instrument, whereas a fact is a measurement result.

System Columns

The usual system columns in the instrument dimension are: created datetime, last updated datetime, and the SCD system columns e.g. active flag, effective date, expiry date.

19 January 2016

The ABC of Data Warehousing

Filed under: Data Warehousing — Vincent Rainardi @ 8:52 am

In this article I would like to define the terminology used in data warehousing. It is a data warehousing glossary, similar to my investing article: The ABC of Investing (link). I won't cover terms used in business intelligence; I will only cover terms in data warehousing. Terms in bold are defined in this glossary.

This glossary consists of 2 levels. Only the first level is alphabetical. The second level is not. So the best way to use this glossary is by searching the term (Control-F).

People make mistakes and I am sure there are mistakes in this article and I would be grateful if you could correct me, either using the comment below or via vrainardi@gmail.com.

Why I wrote this: many people working in data warehousing do not understand the standard terminology used. Even the simplest of terms, such as "dimension", can be a foreign word to them. My intention is to provide a "quick lookup", enabling them to understand that precise word in about 15 seconds or so.

Why don’t they use internet searches or Wikipedia? Why create another one? Because a) it takes longer to search and find the information, particularly if you are new, b) the pages on search results can be technically incorrect, c) sometimes I have my own view or prefer to emphasise things differently.

  1. Archiving: an approach to remove old data from the fact table and store it in another table (usually in a different database). It is quite common that the old data is simply deleted and not stored anywhere else. The correct term for the latter is purging.
  2. Corporate Information Factory (CIF): a top-down data warehousing approach/architecture created by WH Inmon in 1999. The "Data Acquisition" part loads the data from Operational Systems into a Data Warehouse (DW) and an Operational Data Store (ODS). The "Data Delivery" part loads the data from the DW and ODS into the Exploration Warehouse, Data Marts, and Data Mining Warehouse. The following are the definitions of the terms used in CIF (quoted from the CIF website). Back in 1999 these concepts were revolutionary, hence my admiration and respect for the author, WH Inmon.
    • Operational Systems: the internal and external core systems that support the day-to-day business operations. They are accessed through application program interfaces (APIs) and are the source of data for the data warehouse and operational data store. (Encompasses all operational systems including ERP, relational and legacy.)
    • Data Acquisition: the set of processes that capture, integrate, transform, cleanse, reengineer and load source data into the data warehouse and operational data store. Data reengineering is the process of investigating, standardizing and providing clean consolidated data.
    • Data Warehouse: a subject-oriented, integrated, time-variant, non-volatile collection of data used to support the strategic decision-making process for the enterprise. It is the central point of data integration for business intelligence and is the source of data for the data marts, delivering a common view of enterprise data.
    • Operational Data Store: a subject-oriented, integrated, current, volatile collection of data used to support the tactical decision-making process for the enterprise. It is the central point of data integration for business management, delivering a common view of enterprise data.
    • Data Delivery: the set of processes that enable end users and their supporting IS group to build and manage views of the data warehouse within their data marts. It involves a three-step process consisting of filtering, formatting and delivering data from the data warehouse to the data marts.
    • Exploration Warehouse: a DSS architectural structure whose purpose is to provide a safe haven for exploratory and ad hoc processing. An exploration warehouse utilizes data compression to provide fast response times with the ability to access the entire database.
    • Data Mart: customized and/or summarized data derived from the data warehouse and tailored to support the specific analytical requirements of a business unit or function. It utilizes a common enterprise view of strategic data and provides business units more flexibility, control and responsibility. The data mart may or may not be on the same server or location as the data warehouse.
    • Data Mining Warehouse: an environment created so analysts may test their hypotheses, assertions and assumptions developed in the exploration warehouse. Specialized data mining tools containing intelligent agents are used to perform these tasks.
    • Meta Data Management: the process for managing information needed to promote data legibility, use and administration. Contents are described in terms of data about data, activity and knowledge.
    • Primary Storage Management: the processes that manage data within and across the data warehouse and operational data store. It includes processes for backup and recovery, partitioning, summarization, aggregation, and archival and retrieval of data to and from alternative storage.
    • Alternative Storage: the set of devices used to cost-effectively store data warehouse and exploration warehouse data that is needed but not frequently accessed. These devices are less expensive than disks and still provide adequate performance when the data is needed.
  3. Current Flag: a column in a dimension table which indicates that the row contains the current values (in all attribute columns). Also known as Active Flag.
  4. Data Lake: a data store built on the Hadoop file system, capable of storing structured and unstructured data. It has analytic and query tools which enable users to join the various underlying data types into a single output dataset.
  5. Data Mining: the old meaning: an approach to find patterns in the data using certain algorithms (such as clustering and decision trees), and then use these patterns to predict future values of the data. See also: Machine Learning. The new meaning: an approach to extract certain information from a data warehouse which is valuable to the business. "To mine the warehouse" is a commonly used term.
  6. Data Warehouse: a system that retrieves and consolidates structured data periodically from the source system into a dimensional or a normalised data store. A data warehouse is usually used for business intelligence and reporting.
  7. Data Warehouse vs Data Mart: a data warehouse contains everything for multiple purposes, whereas a data mart contains one data area for one purpose. For example: sales mart, customer mart. A data mart is always in a dimensional model, whereas a data warehouse can be in a dimensional model or a normalised model.
  8. DW 2.0: a follow-up to the Corporate Information Factory (CIF) concept, which stores unstructured data as well as structured data. The DW 2.0 concept was created by WH Inmon in 2008.
  9. Dimensional modelling: a data modelling approach using fact and dimension tables. A dimensional model is also known as dimensional schema. There are 2 approaches in dimensional modelling: star schema and snowflake.
    • Star schema: only one level of dimension.
    • Snowflake schema: more than one level of dimension.
  10. Dimensional Data Warehouse: a data warehouse which consists of fact and dimension tables. The primary key of a dimension table becomes a foreign key in the fact table. A dimensional data warehouse can be a star schema or a snowflake schema.
  • Fact Table: a table that contains business events or transaction data (time-sensitive data), such as orders and account balances. See also: Fact Table types.
  • Dimension Table: a table that contains “static data” (non time-sensitive), such as customer and product.
  • Bridge Table: a table that defines many-to-many relationship between two dimensions.
  • Measure: a column in a fact table which contains the numbers we want to analyse. For example: sales amount per city. Sales amount is a measure.
  • Attribute: a column in a dimension table which we can use to analyse the numbers. For example: sales amount per city. City is an attribute.
  • Surrogate Key: the primary key column of a dimension table. It becomes a foreign key column in the fact table.
  • Business Key: the primary key of the source table in the transaction system. This becomes the identifier in the dimension table.
  11. Enterprise Data Warehouse (EDW): a normalised database containing data from many departments (more than one division), sourced from several source systems. An EDW stores historical transactional data as well as historical attribute values. EDW was pioneered by W. H. Inmon. See also: Normalised Data Warehouse.
  12. ETL (Extract, Transform, Load): the loading of data from the source system into the data warehouse.
  13. Fact Table types: there are 3 types of fact tables.
  • Transaction Fact Table: a fact table that stores the measure value of each business events when the business event happened. In a transaction fact table, each business event is stored only once as a row.
  • Periodic Snapshot Fact Table: a fact table that contains the values of each measure taken at regular interval. In a periodic snapshot fact table, each business event is stored multiple times. A Periodic Snapshot Fact Table is also known as a Snapshot Fact Table. See also: Snapshot Fact Table.
  • Accumulative Snapshot Fact Table: a fact table in which the timings and statuses from different points in time are put as different columns on the same row. The row describes one particular customer (or other dimension member).
  14. Machine Learning: an approach to find patterns in the data using certain algorithms and then use these patterns to predict future values. In my opinion, Machine Learning is the new term for the old meaning of Data Mining.
  15. MPP Server (Massively Parallel Processing): a type of database server where a query is broken down into multiple streams. Each stream is executed concurrently on different nodes. The output from the multiple streams is then combined into one and passed back to the user. This architecture is also known as the Shared Nothing architecture. Examples of MPP servers are: Teradata, Netezza, Azure SQL Data Warehouse, and Greenplum.
  16. Normalised Data Warehouse: a data warehouse which consists of transaction tables, master tables, history tables, and auxiliary tables, in first, second or third normal form.
  • Transaction table: a table that contains business events or transaction data (time-sensitive data), such as orders and account balances.
  • Master table: a table that contains “static data” (non time-sensitive), such as customer and product. It contains today’s values.
  • History table: a table that contains “static data” (non time-sensitive), such as customer and product. It contains historical values.
  • Auxiliary table: a table which contains data that describes the codes in a master table. For example: customer type and currency code.
  • Bridge table: a table which implements many-to-many relationship between a transaction table and a master table. Or between two master tables.
  17. OLAP (Online Analytical Processing): the process of interrogating a multidimensional database to explore the data and find patterns. See: Multidimensional Database.
  2. OLAP Cubes: a compressed, in-memory, multidimensional database.
  3. Operational Data Store (ODS): a normalised database containing transaction data from several source systems (it is an integration point). It is similar to Enterprise Data Warehouse (EDW) but only contains the current values of the attributes. Of course it contains the historical values of the transactions.
  4. Multidimensional Database (quoted from my book, Building a Data Warehouse): a form of database where the data is stored in cells and the position of each cell is defined by a number of hierarchies called dimensions. The structure stores the aggregate values as well as the base values, typically in compressed multidimensional array format.
  5. Slowly Changing Dimension (SCD): an approach to preserve the historical/old values of attributes in the dimension tables. There are 5 types of SCD:
  • SCD Type 0: the attribute is static and the values never change.
  • SCD Type 1: the old attribute value is overwritten by the new attribute value. The old values are not preserved; only the latest value is stored.
  • SCD Type 2: the old attribute values are stored in different rows from the new attribute value. In the dimension table, each version of the attribute values is stored as a different row, and the row containing the latest values is marked with a flag (see the sketch after this glossary).
  • SCD Type 3: the old attribute values are stored in a column in the dimension table. Only the last two or three values are kept. Type 3 is usually used when most of the rows in the dimension table change their values together at the same time.
  • SCD Type 4: (as implemented by SQL Server 2016’s temporal tables, link) the old attribute values are stored in a separate table called the history table. Only the latest value is stored in the dimension table.
  1. Slowly Changing Dimension: a dimension which has type 2 attributes.
  2. Snapshot Fact Table: a fact table that contains the values of each measure at a particular point in time, for example the balance of every account at the end of the day. There are two kinds of snapshot fact table, periodic and accumulative, but when people say “Snapshot Fact Table” they usually mean periodic. See also: Fact Table types.
  • Daily Snapshot: contains the value of each measure at the end of every day.
  • Weekly Snapshot: contains the value of each measure at the end of every week.
  • Monthly Snapshot: contains the value of each measure at the end of every month.
  1. Staging table: a table in a data warehouse which contains the raw data from the source system. This raw data will be processed further and loaded into either a fact table or a dimension table.
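To illustrate several of the terms above (surrogate key, business key, attribute, measure and SCD type 2) in one place, here is a minimal T-SQL sketch. The table and column names (DimCustomer, FactSales, CustomerKey, CustomerId) are only illustrative, not from any particular system:

create table DimCustomer
( CustomerKey int identity(1,1) primary key, -- surrogate key
  CustomerId varchar(20) not null,           -- business key from the source system
  CustomerName varchar(100),                 -- an attribute
  City varchar(50),                          -- a type 2 attribute
  EffectiveDate date not null,               -- start of the period this row's values apply to
  ExpiryDate date not null,                  -- end of that period
  IsCurrent bit not null                     -- flags the row holding the latest values
)

create table FactSales
( DateKey int not null,     -- dim key to DimDate
  CustomerKey int not null, -- dim key to DimCustomer (the surrogate key above)
  SalesAmount decimal(18,2) -- a measure
)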

14 January 2016

Temporal Tables in SQL Server 2016

Filed under: Data Warehousing,SQL Server — Vincent Rainardi @ 8:40 am

The temporal table feature in SQL Server 2016 is a breakthrough for data warehousing, because we no longer need to do anything ourselves to store the historical values. This cuts the development effort of a data warehouse by approximately 15% (half of the ETL development effort, which is about 30% of the total; the other 70% is analysis, design, database development, BI/report development and management).

Usage scenarios: https://msdn.microsoft.com/en-gb/library/mt631669.aspx

  • Data audit/forensic
  • Point-in-time analysis (time travel)
  • Data anomaly detection
  • Slowly changing dimension (type 4!)
  • Repairing data corruption at row level

Up to now we have always had to build a data warehouse which stores a) the historical changes of attribute values from the source system, and b) fact snapshots over time. With temporal tables we no longer need to create a data warehouse: we can do a) and b) within the source system itself, as demonstrated in the article above. This effectively kills the idea of creating a data warehouse, increasing the time we save from 15% to 95%*.

*The remaining 5% of effort reflects the extra time required to set up the temporal tables in the source system’s SQL Server, along with memory-optimized tables to keep the current data in memory and the full history of changes on disk. That arrangement gives us the optimum balance between performance and cost.

The article above (https://msdn.microsoft.com/en-gb/library/mt631669.aspx) demonstrates that using temporal tables we can do SCD, and we can do snapshot fact tables (called Time Travel). The SCD that temporal tables give us is not type 2 but type 4. Type 4 is an old technique where the data changes are kept in a history table; this keeps the original table small and compact, as it only contains the latest version.
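As a minimal sketch (assuming SQL Server 2016; the table and column names are only illustrative), a system-versioned temporal table and a point-in-time query look like this:

create table dbo.Customer
( CustomerId int not null primary key clustered,
  CustomerName varchar(100) not null,
  City varchar(50) null,
  ValidFrom datetime2 generated always as row start not null,
  ValidTo datetime2 generated always as row end not null,
  period for system_time (ValidFrom, ValidTo)
)
with (system_versioning = on (history_table = dbo.CustomerHistory))

-- "time travel": the table as it was at a given point in time
select * from dbo.Customer
for system_time as of '2016-01-01'

SQL Server maintains dbo.CustomerHistory automatically: every update or delete on dbo.Customer moves the old row version into the history table, which is the SCD type 4 behaviour described above.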

 

 

29 December 2015

DimMonth, DimQuarter and DimYear

Filed under: Data Architecture,Data Warehousing — Vincent Rainardi @ 6:21 am
Tags: ,

Sometimes the grains of our fact tables are monthly, quarterly, or yearly. In such cases, how do we create DimMonth, DimQuarter and DimYear? Some of the questions in these cases are:

  1. Why do we need to create Month as a dimension? Can’t the Month column in the fact table remain as a plain Month column, rather than a dimension key?
  2. What does DimMonth look like? What are the attributes? How about DimQuarter?
  3. Should we create DimMonth as a physical table, or as a view on top of DimDate?
  4. What is the surrogate key of DimMonth and DimQuarter? Do we create a new SK, or do we use the SK from DimDate?
  5. Do we need to create a dimension for year? It seems odd because it would only have 1 column (creating a separate dimension for it would go against the “degenerate dimension” concept).

For example, in the source system we have a table which stores the monthly targets for each store:

Or quarterly target like this:

How do we create a fact table for this? Should we create it like this?

Question 1 above: Why do we need to create Month as a dimension? Why can’t we leave it as Month in the fact table, not as a dim key, like this?

Question 2 above: if we decide to keep the column as MonthKey, what should DimMonth look like? What are the attributes? Should DimMonth be like this?

What attributes should we put there?

  • Do we need quarter and year?
  • Do we need half year? e.g. 2015 H1 and 2015 H2
  • For month name, should we use the short name or long name? e.g. Oct 2015 or October 2015?
  • Do we need an attribute for “Month Name without Year”, e.g. October?
  • Do we need a “Month End Date” column? e.g. 2015-10-31
  • Do we need “Month is a Quarter End” indicator column for March, June, Sep and Dec?
  • Do we need “Number of days in a month” column? e.g. 30 or 31, or 28 or 29.

Or should we use “intelligent key” like this?

Question 3 above: Should we create DimMonth as a view of DimDate like this?

create view DimMonth as
select min(DateKey) as MonthKey, MonthNumber, MonthName, Quarter, [Year]
from DimDate
group by MonthNumber, MonthName, Quarter, [Year]
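DimQuarter could be created the same way, for example (a sketch, assuming the Quarter and [Year] columns exist in DimDate):

create view DimQuarter as
select min(DateKey) as QuarterKey, Quarter, [Year]
from DimDate
group by Quarter, [Year]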

 

What are the advantages of creating DimMonth and DimQuarter as views on DimDate, compared to creating them as physical tables? And what are the disadvantages?

 

I think with the above questions and examples we are now clear about what the issue is. Now let’s answer those questions.

 

Q1. Do we need to create Month as a dimension? Can’t the Month column in the fact table remain as a plain Month column, rather than a dimension key, like this?

 

We need the Month column in the fact table to be a Dim Key to a month dimension because we need to access Year and other attributes such as Quarter.

 

Bringing Year into the Sales Target fact table like below is not a good idea, because it makes the fact table inflexible. For example, if we want to add a Quarter column we have to alter the fact table structure.

 

Using a Dimension Key to link the fact table to a Month dimension makes it a flexible structure:
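For example, here is a minimal T-SQL sketch of this structure (FactSalesTarget, DimStore and the column names are illustrative; DimMonth is shown as a physical table purely for illustration, and Q3 below discusses making it a view instead):

create table DimMonth
( MonthKey int primary key, -- e.g. 201510
  MonthNumber int,
  [MonthName] varchar(20),
  Quarter varchar(7),       -- e.g. 2015 Q4
  [Year] int
)

create table FactSalesTarget
( MonthKey int not null references DimMonth (MonthKey),
  StoreKey int not null,    -- dim key to DimStore
  SalesTarget decimal(18,2)
)

If we later need Quarter or Year in a report, we simply join to DimMonth; the fact table structure does not change.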

 

There is an exception to this: the Snapshot Month column. In a monthly periodic snapshot fact table, the first column is Snapshot Month. In this case we do not need to create this column as a dimension key linking it to DimMonth; in fact we do not need a DimMonth at all, because we do not need other attributes (like Year or Quarter). A monthly periodic snapshot fact table stores the measures as of the last day of every month, or within that month. For example: the number of customers, the number of products, the number of orders, the number of orders for each customer, the highest and lowest price within that month for every product, the number of new customers in that month, and so on.
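As an illustration, such a monthly periodic snapshot fact table might look like this (the names are only examples):

create table FactMonthlySnapshot
( SnapshotMonth int not null, -- e.g. 201510; just a column, not a key to DimMonth
  NumberOfCustomers int,
  NumberOfProducts int,
  NumberOfOrders int,
  NumberOfNewCustomers int
)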

 

Q2. What does DimMonth look like? What are the attributes?

 

Obviously, the grain of DimMonth is 1 row per month. So we are clear about what the rows are. But what are the columns? Well it depends on what we need.

I usually put MonthNumber, MonthName, Quarter and Year in DimMonth, because they are frequently used.

I don’t find “Month Name without the Year” to be a useful attribute. I also rarely come across the need for a “Half Year” attribute.

“Month is a Quarter End” column is also rarely used. Instead, we usually use “Quarter” column.

“Month End Date” and “Number of days in a month” are also rarely used. Instead, we usually use “IsMonthEnd” indicator column in the DimDate.

For the month name, should we use the short name (Oct 2015) or the long name (October 2015)? I find that the short name is used more than the long name, but the month number (2015-10) is used even more frequently than the short name.
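As an illustration, if DimMonth were a physical table its rows might look like this (the values are only examples):

insert into DimMonth (MonthKey, MonthNumber, [MonthName], Quarter, [Year])
values (201510, 10, 'Oct 2015', '2015 Q4', 2015),
       (201511, 11, 'Nov 2015', '2015 Q4', 2015),
       (201512, 12, 'Dec 2015', '2015 Q4', 2015)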

Q3. Should we create DimMonth as a physical table, or as a view on top of DimDate?

This is really the core of this article. In my opinion a view on top of DimDate is better, because we avoid maintaining two physical tables, and it makes the list of dimension tables less cluttered.

If we create DimMonth and DimQuarter as physical dimensions, then in SSMS Object Explorer, when we open the table section we would see these:

DimDate

DimMonth

DimQuarter

DimYear

But if we create DimMonth and DimQuarter as views, then we will only see DimDate in the Object Explorer’s table section. The DimMonth and DimQuarter will be in the view section.

The main disadvantage of creating DimMonth as a view on DimDate is that it is less flexible: any attribute column that we want to appear in DimMonth must exist in DimDate. But I find that DimMonth usually needs only two or three attributes, i.e. Month, Quarter and Year, and all of them are available in DimDate. So this is not an issue.

Avoiding the maintenance of two physical tables is quite important, because if we extend the date dimension (adding more years, i.e. more rows) and forget to extend DimMonth and DimQuarter, we will cause an error.

The other consideration is, of course, performance. I do not find the performance of the DimMonth and DimQuarter views to be an issue. This is because DimDate is not too large, and more importantly because the monthly and quarterly fact tables are small, with fewer than 1 million rows. They are much smaller than daily fact tables, which have millions or billions of rows.

 

Q4. What is the surrogate key of DimMonth and DimQuarter? Do we create a new SK, or do we use the SK from DimDate?

If we create DimMonth and DimQuarter as physical tables, then the surrogate key can be either a pure surrogate key (1, 2, 3, …) or an intelligent key (201510, 201511, 201512, etc.).

But if we create them as views on DimDate, then the surrogate key can be either the first day of the month (20151001, 20151101, 20151201, etc.) or the month itself (201510, 201511, 201512, etc.). I prefer the latter to the former because it is more intuitive (an intelligent key) and there is no ambiguity.

The script to create the view for DimMonth with SK = 201510, 201511, 201512, etc. is like this:

create view DimMonth as
select distinct convert(int, left(convert(varchar, SK_Date), 6)) as MonthKey,
  [MonthName], Quarter, [Year]
from DimDate
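As a usage sketch, a monthly fact table (FactSalesTarget is an illustrative name, with a MonthKey column in the same 201510 format) can then be joined to this view like any other dimension:

select d.[Year], d.Quarter, sum(f.SalesTarget) as TotalTarget
from FactSalesTarget f
join DimMonth d on d.MonthKey = f.MonthKey
group by d.[Year], d.Quarter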

 

Q5. Do we need to create a dimension for year?

 

No, we don’t need to create DimYear, because it would only have 1 column.

What should we call the dim key column in the fact table then? Is it YearKey or Year? We should call it YearKey, to be consistent with the other dim key columns.

A dimension which has only 1 column, and is therefore kept in the fact table, is called a Degenerate Dimension. A Degenerate Dimension is usually used to store an identifier from the source table, such as Transaction ID or Order ID. But it is also perfectly valid for dimensions which naturally have only one attribute/column, like the Year dimension. See my article about “A dimension with only one attribute” here: link.
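For example, a yearly sales target fact table could simply carry the year value itself in the dim key column (a sketch, with illustrative names):

create table FactSalesTargetYearly
( YearKey int not null, -- the year itself, e.g. 2015; there is no DimYear table to join to
  StoreKey int not null,
  SalesTarget decimal(18,2)
)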

23 December 2015

Data Dictionary

Filed under: Data Architecture,Data Warehousing — Vincent Rainardi @ 8:15 am
Tags: ,

During my 10 years or so of doing data warehousing projects, I have seen several initiatives to create a data dictionary. A data dictionary is a list of definitions used in the system. Each definition is usually about 10 to 50 words long, and there are about 50 to 200 definitions. What is defined is mostly technical terms, such as the meaning of each field in the database. This is the origin of the data dictionary: it grew from the need to explain the fields, for example delivery date, product category, client reference number, tax amount, and so on. I have seen this from 1998 all the way to today (2015).

This data dictionary was created by the business analyst on the project, who was trying to explain each field to clear up any ambiguity about the terms used in the system. I found that the business people view it differently: to them it is not very useful. The data dictionary is very useful to a new starter on the IT development project, but not so useful for the business. I’ll illustrate with several examples below.

In the database there are terms like CDS, obligor, PD and yield. The BA defines these terms by copying definitions from internet searches, so they end up with something like this:

  • Credit Default Swap (CDS): a financial contract on which the seller will pay the buyer if the bond defaults
  • Obligor: a person or entity who is legally obliged to provide some payment to another
  • Probability of Default (PD): likelihood of a default over a particular time frame
  • Yield: the interest received from a bond

The CDS definition tries to explain a big concept in one sentence. Of course it fails miserably: what bond? What do you mean by “defaults”? After reading the whole sentence, the readers are none the wiser. They would get a much better understanding of CDS by reading the Wikipedia page about it.

The Obligor definition is too generic. It is so high level that it is useless, because it doesn’t provide the context of the project/system. In a credit portfolio project, obligor should be defined as the borrower of the loan, whereas in a fixed income portfolio project, obligor should be defined as the issuer of the bond.

The PD definition contains a common mistake in data dictionaries: the definition uses the word being defined, because the author doesn’t have a good understanding of the topic. What is a default? It is not explained. Which PD is used in this field, Through The Cycle or Point In Time? Stressed or unstressed? Is it calculated using linear regression or discriminant analysis? That is what the business wants to know.

The Yield definition does not explain the data source. What the business users need to know is whether it is Yield to Maturity, Current Yield, Simple Yield, Yield to Call or Yield to Worst. There are different definitions of yield depending on the context: fixed income, equity, property or fund. The author has firmly framed it as fixed income, which is good, but within fixed income there are five different definitions, so the business wants to know which one.

Delivery mechanism: Linking and Web Pages

To make the life of the business users easier, sometimes the system (the BI system, reporting system, trade system, finance system, etc.) is equipped with hyperlinks to the data dictionary, directly from the screen. So on the screen the words CDS, obligor, PD and yield are all clickable hyperlinks. If we click them, the data dictionary page opens, highlighting the definition. The definitions can also be delivered as “balloon text”.

But mostly, data dictionaries are delivered as static web pages. It is part of the system documentation such as architecture diagram, content of the warehouse, support pages, and troubleshooting guide.

It goes without saying that, as web pages, the definitions should link to each other.

I would say that the delivery mechanism alone doesn’t contribute much to the value; it is the content which adds a lot of value.

Accessibility

A data dictionary should be accessible by the business as well as by IT. Therefore it is better if it is delivered as web pages rather than as links directly from the system’s screens, because web pages can be accessed by everybody in the company, and the access can easily be controlled using AD groups.

Data Source

The most important thing in a data dictionary is the in-context, concise definition of each field. The second most important thing is the data source, i.e. where the field comes from. System A’s definition of Duration could be different from System B’s; they may come from different sources. In System A it might be sourced from Modified Duration, whereas in System B it is sourced from Effective Duration.

Because of this, a data dictionary is per system. We should not attempt to build a company-wide data dictionary (like this: link) before we have completed the system-specific data dictionaries.

How deep should we go? If the field is sourced from system A, and system A is sourced from system B, which one should we put as the source, system A or system B? We should put both. This is a common issue: the data dictionary says that the source of a field is the data warehouse. Well, of course! A lot of data in the company ends up in the data warehouse. But where does the data warehouse get it from? That’s what the business wants to know.

Not Just the Fields

A good data dictionary should also define the values in each field. For example, if we have a field called Region which has 3 possible values, EMEA, APAC and Americas, we should explain what APAC means, what EMEA means and what Americas means (does it include the Caribbean countries?).

This doesn’t mean that if we have a field called Currency we then have to define USD, EUR, GBP and 100 other currencies. If the values of a field are self-explanatory, we leave them; but if they are ambiguous, we explain them.

If the field is a classification field, we should explain why the values are classified that way. For example, the values of an Asset Class field could be: equity, fixed income, money market instruments (MMI), CDS, IRS. Many people would argue that CDS and MMI are part of fixed income, so why have separate categories for CDS and MMI? Perhaps because MMI has short durations and the business would like to see its numbers separately. Perhaps because the business views CDS as a hedging mechanism rather than an investment vehicle, so they would like to see its numbers separately.

Summary

So in summary, a data dictionary should:

  1. Contain an in-context, concise definition of every field
  2. Contain where each field is sourced from
  3. Contain the definition of the values in each field
  4. Be system specific
  5. Be delivered as web pages
  6. Be accessible by both the business and IT

A good data dictionary is part of every data warehouse. I would say that a data warehouse project is not finished until we produce a good data dictionary.
