Data Warehousing and Data Science

30 June 2012

Not All Surrogate Keys Define the Fact Table Grain

Filed under: Data Warehousing — Vincent Rainardi @ 5:04 pm

Not all surrogate keys in the fact table define the fact table grain.

For example, if you have a daily account balance fact table, the grain of the fact table is 1 row per account, per day. So there are only 2 surrogate keys that define the grain of the fact table:

  1. Snapshot Date Key
  2. Account Key

If we have Customer Key and Product Type Key in that fact table, they don’t form part of the grain:

  1. Snapshot Date Key – part of grain
  2. Account Key – part of grain
  3. Customer Key – not part of grain
  4. Product Type Key – not part of grain

Say that for 27th June 2012 we loaded 1000 rows, and ditto for 28th June 2012. In the BI tool (say Cognos), if we put Snapshot Date, Account Number and Account Balance on a report, we will get 1000 rows for each snapshot date. If we add Customer Number, we will still get 1000 rows. If we add Product Type, we still get 1000 rows.
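
To make this concrete, here is a rough grain check in SQL. The table name dbo.FactAccountBalanceDaily is just illustrative; if Snapshot Date Key and Account Key really define the grain, this query returns no rows, and adding Customer Key or Product Type Key to the group by would not change the counts.

-- any rows returned here mean the assumed grain (snapshot date + account) is violated
select SnapshotDateKey, AccountKey, count(*) as RowsPerGrain
from dbo.FactAccountBalanceDaily
group by SnapshotDateKey, AccountKey
having count(*) > 1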

So my point here is: not every surrogate key in the fact table is part of the fact table grain.

29 June 2012

Fact or Dimension

Filed under: Data Warehousing — Vincent Rainardi @ 4:51 am

Should I put this column in a Fact table or in a Dimension table? Have you ever come across that question?

An old, simple rule in data warehousing is: if it is a number, it goes to the fact table; if it is text, it goes to the dimension table.

Today somebody asked me: “Is it always like that?”
As with everything else in the world, the answer is “Not always. There are exceptions”.

In general, the principles are:

  1. If the number is a monetary amount (money), it should be in a fact table
  2. If the number is aggregatable, it should be in a fact table
  3. If the number is static (doesn’t change with time), it should be in a dimension
  4. If the number changes every day, it should be in a fact table
  5. If the number is an identifier, it should be in a fact table as a degenerate dimension column

As usual it is better to learn by example rather than by definition. A few interesting cases are below:

1. Number of …

[Number of Purchases], [Number of Customers], [Number of Product Lines], … these columns are aggregatable, and they tend to change over time, so they should be in a fact table. For example: in a customer dimension we have [Number of Accounts]. [Number of Accounts] is ideally put in a fact table, so that it can be aggregated across customers. And it can change: if a customer opens another account or closes an account, then the [Number of Accounts] changes. If we put it in a periodic snapshot fact table, we can see the trend across all demographics, across time, across areas, across products, etc.
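
To make this concrete, here is a rough sketch of that trend query. The table and column names (dbo.FactCustomerMonthlySnapshot, NumberOfAccounts) are just illustrative.

-- total number of accounts per snapshot date, across all customers
select SnapshotDateKey, sum(NumberOfAccounts) as TotalAccounts
from dbo.FactCustomerMonthlySnapshot
group by SnapshotDateKey
order by SnapshotDateKey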

But what if we don’t need to aggregate them? Then put it in the dimension. But think about the above paragraph again.

Any example of a [Number of …] column that should definitely be in a dimension? Yes: [Number of Wheels] in a Fleet dimension. The context is a logistics / delivery company that owns 40-50 trucks and vans, large and small. [Number of Wheels] is a permanent, fixed property of a vehicle. If a vehicle has 6 wheels, it will always have 6 wheels. So in this case it is absolutely appropriate that [Number of Wheels] goes to a dimension, not a fact table. It is alright that we can't aggregate it. Who wants to know the number of wheels across the whole fleet?

[Number of Bed Rooms] in a property dimension. The context is a property agent's marketing data warehouse. If a property has 4 bedrooms, it will always have 4 bedrooms. What if a property is renovated to add 1 extra bedroom? Well, there are always exceptions to the rule. But it is still appropriate to put [Number of Bed Rooms] as an attribute in a dimension, as opposed to a fact table. But if it is in a dimension we can't aggregate it! Bah, who wants to know the number of bedrooms across the whole city anyway? Certainly not the property agent business!

2. Ratio (Percentage) Column

A ratio column contains a number, usually between 0 and 1. It can be expressed as a decimal (e.g. 0.213) or as a percentage (e.g. 21.3%).

In a business-to-business scenario, the customers (which are companies) have various financial ratios. This Wikipedia page explains the various types of financial ratios, and this Yahoo Finance page shows an example for BT: Price/Earnings, Price/Sales, Debt/Equity, etc.

Do we put those ratios in the customer dimension or in a fact table? Some of them change every day, so they should be in a fact table, so that we can understand their trend across time and across other dimensions.

Some of them change only once or twice a year, like Return on Assets and Working Capital Ratio. In this case it is appropriate to put them in the customer dimension, because we are using these financial ratios as static properties of a customer. But if we are in the investment business and we need to evaluate how each company grows every year, then we need to put these financial ratios in a fact table, because we are measuring them against other attributes.

One thing that we have to remember when having a ratio column in a fact table is that, to aggregate it correctly, we need to store it in 2 columns: the numerator and the denominator.
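
For example, here is a rough sketch of aggregating a Debt/Equity ratio from a fact table. The names (dbo.FactCustomerFinancials, DebtAmount, EquityAmount) are just illustrative; the point is that the aggregated ratio is the ratio of the sums, not the sum (or average) of the row-level ratios.

-- store the numerator and denominator, then divide after aggregating
select d.Sector,
       sum(f.DebtAmount) / nullif(sum(f.EquityAmount), 0) as DebtToEquity
from dbo.FactCustomerFinancials f
join dbo.DimCustomer d on d.CustomerKey = f.CustomerKey
group by d.Sector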

3. Rank Column

A rank is a sorting order based on certain value. For example, sorting the products based on their sales. Or sorting the customers based on the income we get from them over the years.

A rank can be a simple 1, 2, 3, … But it can also be a quartile (4 buckets), quintile (5 buckets) or decile (10 buckets). In a decile scenario, we put the rank into 10 buckets. Imagine we have 100 products. We rank them as 1, 2, 3, etc. Then we put numbers 1 to 10 into bucket 1, numbers 11 to 20 into bucket 2, etc. If the decile of a product is bucket 1, we know it is a popular product, because it is in the top bucket. A product with decile = 4 is more popular than a product with decile = 5.
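
In SQL we can calculate both the simple rank and the decile with window functions. A rough sketch, assuming an illustrative table dbo.ProductSales(ProductKey, SalesAmount):

select ProductKey,
       rank() over (order by SalesAmount desc)    as SalesRank,    -- simple 1, 2, 3, ... rank
       ntile(10) over (order by SalesAmount desc) as SalesDecile   -- 10 buckets; 1 = most popular
from dbo.ProductSales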

Should a rank column be in a dimension or in a fact table? Rank is a property of the product / customer / account / etc. If the rank changes every day and we want to track how it changes over time, we put it in a periodic snapshot fact table. If the rank changes only twice a year, we put it as an attribute in a dimension: if we don't need to track its changes, we set it as a type 1 attribute; if we need to track its changes, we set it as a type 2 attribute.

4. Time Duration Column

In a car hire company, we have [Days Car Out], which is the time duration from the time the customer hires the car to the time the customer returns it. The value can be an integer, e.g. 2 days, or a decimal, e.g. 2.21 days.

In a banking warehouse, for each credit facility we have Tenor, which is the time duration from the effective date to the maturity date, expressed in years. The value is a decimal, e.g. 3.45 years.

Do we put these time duration columns in a fact table or in a dimension? Again the questions are the same: do we need to aggregate it? Do we need to slice it across different dimensions? Do we need to analyse it across time?

In the case of the [Days Car Out] column in a car hire company, [Days Car Out] should be put in a fact table, because it will be aggregated up, and will be analysed across different dimensions and across time. In the case of the [Tenor] column in a bank, it should be put as an attribute of the account dimension, because we don't aggregate it across accounts, and we don't analyse it across time or across other dimensions.
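
For illustration, here is a rough sketch of how these durations could be derived. The table and column names are just illustrative, not from an actual model.

-- [Days Car Out] as a decimal number of days, e.g. 2.21
select RentalId,
       datediff(hour, CarOutDateTime, CarReturnDateTime) / 24.0 as DaysCarOut
from dbo.FactCarRental

-- Tenor in years, e.g. 3.45
select FacilityId,
       datediff(day, EffectiveDate, MaturityDate) / 365.25 as TenorYears
from dbo.DimCreditFacility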

22 June 2012

Delete Takes a Long Time

Filed under: Data Warehousing — Vincent Rainardi @ 4:54 pm

Summary

Deleting today’s rows from the fact table takes ages if the table is not indexed. Best to index it (clustered) on the snapshot date key column. Alternatively, partition the fact table so you can truncate the partition (both table and index).

Intro

My daughter has my internet stick today, so I have no access to the internet on the bus or on the train. It's annoying as I can't check email and learn things on the net, but it has its own blessing, i.e. I can focus on writing about what I experienced in the past week, without being distracted by the temptation of the internet.

Periodic Fact Table

I am helping a friend build a data warehouse for a company. One of the things he found yesterday was a performance issue with the delete during fact table loading. To recap, there are 3 types of fact table (in the pure Kimball method): periodic snapshot, accumulating snapshot and transaction. In a periodic snapshot, we insert all the rows in the source table into the fact table every day. If the source table has 1000 rows and it grows by 10 rows a day (so on 1st June it's 1000, on 2nd June it's 1010, etc.), then after 3 days the fact table will contain 1000 rows for 1st June, 1010 rows for 2nd June and 1020 rows for 3rd June.

So every day, about 1000 rows will be inserted into the fact table.

Rerunability

One of the core principles of ETL in data warehousing is rerunability, meaning that if the ETL fails half way through the batch, we should be able to rerun it.

In the above scenario, imagine that it is now 4th June and we have 1030 rows to load. The ETL (SSIS or Informatica, for example) managed to load 900 rows, then it failed because of a network error. We fixed the network issue and we reran this fact table load. But there are already 900 rows in the fact table. So after this second run, we will have 900 + 1030 rows = 1930 rows.

We can't simply make it a "transaction". The concept of a transaction is only practically applicable when dealing with rows one by one. If we insert thousands of rows and make it a transaction (using a row count to indicate whether the transaction is complete), we will hit a performance issue when the ETL fails, because SQL Server or Oracle needs to roll back the inserted rows one by one.

Deleting Today’s Rows

So the traditional way of providing rerunability is to delete today's rows from the fact table. In the example above: delete from Fact1 where SnapshotDateKey = 20120604. If you use SSIS, this delete task would be an Execute SQL Task, and 20120604 is an input parameter to the task, taken from a variable called TodaysDate (package scope).
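
With an OLE DB connection, the statement in that Execute SQL Task would look like the line below, with the ? marker mapped to the TodaysDate variable on the task's Parameter Mapping tab:

delete from Fact1 where SnapshotDateKey = ?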

And this TodaysDate package variable is populated by an Execute SQL Task earlier in the package, which contains this SQL (I'll build it step by step so you can see the process):

select getdate()                                                            -- e.g. 2012-06-04 17:23:45.123
select convert(varchar(10), getdate(), 20)                                  -- '2012-06-04'
select replace(convert(varchar(10), getdate(), 20), '-', '')                -- '20120604'
select convert(int, replace(convert(varchar(10), getdate(), 20), '-', ''))  -- 20120604

The output is an integer like 20120604. And in the Execute SQL Task we put this output into the TodaysDate package variable.

Performance Issue with Delete

The problem is, deleting 1 million rows out of 10 billion is painfully slow. It could take hours. When the data warehouse was first built, there were only a few rows in the fact table, so the delete was performing well. Usually in minutes. But after a year, with 10 billion rows in the fact table, it now takes hours to delete.

Now this is the core of this article, as I explained in the summary above. There are two ways of solving this:

  1. Clustered index on snapshot date key
  2. Partition the fact table

1. Clustered Index on Snapshot Date Key

By adding a clustered index on the Snapshot Date Key column, we force the fact table to be physically ordered on Snapshot Date Key, so that the rows for today's date are located in one place, i.e. at the end of the table's file (group). That is why the delete is so much faster. To give an illustration of the scale of the performance improvement: before the clustered index, a fact table with 10 billion rows could take 1-2 hours to delete 1 million rows, but with the clustered index it takes only 2-5 minutes. A major performance improvement.

The syntax is:

create clustered index Fact1_SnapshotDateKey on Fact1(SnapshotDateKey)

For a fact table, we have only 1 chance of ordering it physically, so if it is a periodic snapshot fact table, there is no argument: the clustered index has to be on the snapshot date key. This principle is NOT applicable to transaction fact tables or accumulating snapshots. I repeat: NOT applicable.

2. Partition the Fact Table

If the clustered index approach above is still too slow, or if you need the clustered index for something else, a better approach is to partition the fact table.

By partitioning it, we can now truncate only a portion of the table, i.e. we can truncate 1 partition. If we partition the fact table on the snapshot date key, we can truncate today's rows in a zip. Unlike a delete operation, a truncate is only minimally logged, so it is lightning fast. Whether it's 1 million rows or 1 billion rows, the truncate only takes a zip (a few seconds, i.e. between 0 and 5 seconds).

One thing to remember here is that the partitioned index (or indexes) must be aligned with the table, so that truncating the table partition will also truncate the index partition. If not, we need to rebuild the index, and when dealing with 10 billion rows in the fact table, we don't want to hear the word "Rebuild" in the same sentence as the word "Index".
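
Here is a rough sketch of this approach on SQL Server. The names (pfSnapshotDate, psSnapshotDate, Fact1_Switch) and boundary values are just illustrative. On SQL Server 2005-2014 the way to empty one partition is to switch it out to an empty, identically structured table and truncate that table; SQL Server 2016 and later can truncate a partition directly.

-- partition the fact table by snapshot date key
create partition function pfSnapshotDate (int)
as range right for values (20120601, 20120602, 20120603, 20120604)

create partition scheme psSnapshotDate
as partition pfSnapshotDate all to ([PRIMARY])

-- Fact1 and all its indexes are created on psSnapshotDate(SnapshotDateKey) so they are aligned

-- empty today's partition: switch it out to an empty staging table, then truncate that table
alter table Fact1 switch partition $PARTITION.pfSnapshotDate(20120604) to Fact1_Switch
truncate table Fact1_Switch

-- on SQL Server 2016 and later, the partition can be truncated directly:
-- truncate table Fact1 with (partitions ($PARTITION.pfSnapshotDate(20120604)))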

After deleting today's rows, the insert is fast. If the insert is slow, look at the constraints. Defaults and identities are fine, but FKs are a nightmare. If your fact table is in the billions of rows, I do not recommend having physical RI like FKs on the fact table: the delete will be slow and, more importantly, the insert will be slow. Remove the FKs. But what about querying it, won't it be slower? Yes, but only a little bit slower, not as much slower as the insert. We have already enforced RI (Referential Integrity) in the ETL anyway, so why put FKs on?

If your fact table is under 1 million rows then by all means put FKs on all the surrogate key columns, but when the fact table is in the billions, every little bit counts.

19 June 2012

SSIS SCD Wizard Performance Issue

Filed under: Data Warehousing,SSIS — Vincent Rainardi @ 7:23 pm

In Microsoft BI, when we do a slowly changing dimension type 2, we instinctively use the SCD Wizard. The problem with the SCD Wizard is performance. For each source row coming into the SCD box, SSIS does a lookup on the target dimension table to find out if the row already exists or not. That is a crazy idea. And SSIS also updates the target dimension table on a row-by-row basis. That is also a crazy idea.

If your dimension is 1000 rows and the source data is 1000 rows, the SCD takes a few minutes. But if your dimension is 500k rows and the source data is 500k rows, it takes 5-6 hours. The whole idea is fundamentally flawed. On the welcome screen of the SCD Wizard there should be a warning: do not use the SSIS SCD Wizard if your dimension is above 1000 rows.

So, if your dimension is 1m rows, what should you do?

  1. Do it on a set basis, not a row-by-row basis.
  2. Filter the incoming source rows to the changed rows.
  3. Don't make everything type 2; only the necessary columns.
  4. Index the dimension table on the seek columns.
  5. Group the updates to the target dimension table.
  6. Archive unused rows in the dimension table.

To update an SCD type 2 dimension in SSIS, we should do this (a SQL sketch of the expire and insert steps follows the notes below):

  1. Obtain Current Time (CT).
  2. Stage the source table. 1m rows (S).
  3. Identify changed rows on the staging table. 50k rows (C).
  4. Put the C rows on a new staging table.
  5. Identify which C rows already exist on the dimension table by comparing the business key.
    Say 40k rows exist (E) and 10k rows don’t exist (D).
  6. Out of the 40k E rows, identify which rows have changed attributes. Say there are 12k rows (CA).
  7. Update dimension table: expire the 12k CA rows (set Expiry Date = CT, IsActive = N).
  8. Insert 22k rows (10k D + 12k CA) into the dimension table (in 1 go, not in 2 goes). Expiry Date = 9999, IsActive = Y.
    D rows: Effective Date = 0000. CA rows: Effective Date = CT.

Note:

  1. In step 2, enable fast load and do not put any constraints on the staging table.
  2. Step 3 is done by hashing or checksum.
  3. Staging table C in step 4 can be used by other workflows, especially the fact load.
  4. To support steps 5 and 7, index the dimension table on the Natural Key column.
  5. To support step 6, index the dim table on the type 2 columns to be compared.
    Do the compare in the order of the indexed columns.
  6. We can use the Merge command to do steps 7 & 8, but we would need to do it twice. So the old upsert (update + insert) works as efficiently as Merge, probably more efficiently, as we don't do a delete.
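
To make the set-based approach concrete, here is a rough T-SQL sketch of steps 5 to 8. The table and column names (StgCustomerChanged, DimCustomer, CustomerCode, Segment, Region, EffectiveDate, ExpiryDate, IsActive) are just illustrative, and the attribute comparison could equally be done with HASHBYTES as per note 2.

declare @CT datetime = getdate()   -- step 1: obtain current time

-- steps 5 & 6: out of the changed rows (C), flag new members (D) and changed attributes (CA)
select s.CustomerCode, s.Segment, s.Region,
       case when d.CustomerCode is null then 'D' else 'CA' end as RowType
into   #C
from   dbo.StgCustomerChanged s
left join dbo.DimCustomer d
       on d.CustomerCode = s.CustomerCode and d.IsActive = 'Y'
where  d.CustomerCode is null
   or  d.Segment <> s.Segment or d.Region <> s.Region        -- the type 2 columns

-- step 7: expire the current versions of the CA rows in one set-based update
update d
set    d.ExpiryDate = @CT, d.IsActive = 'N'
from   dbo.DimCustomer d
join   #C c on c.CustomerCode = d.CustomerCode
where  d.IsActive = 'Y' and c.RowType = 'CA'

-- step 8: insert the D and CA rows in one go
insert into dbo.DimCustomer (CustomerCode, Segment, Region, EffectiveDate, ExpiryDate, IsActive)
select CustomerCode, Segment, Region,
       case RowType when 'D' then '19000101' else @CT end,   -- D: a low "beginning of time" date, CA: current time
       '99991231', 'Y'
from   #C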

This is one major weakness of SSIS compared to Informatica PowerCenter, so I have no doubt Microsoft must be building a set-based SSIS SCD Wizard for the next release 🙂

17 June 2012

Distribution Yield vs Underlying Yield

Filed under: Business Knowledge — Vincent Rainardi @ 2:03 pm

If we look at a fixed income investment such as a high yield bond fund, or a high income equity fund, on the Fidelity, HL or Morningstar websites, sometimes we wonder what the difference is between Distribution Yield, Historic Yield and Underlying Yield. One of the best explanations is from NFU Mutual (link), whereas the authoritative source in the UK is the Investment Management Association (IMA, link, link).

  • Historic Yield is last 12 months, for equity funds
  • Distribution Yield is next 12 months, for bond funds
  • Underlying Yield is next 12 months, for bond funds, including the purchase price and expense.

Distribution Yield is useful for comparing funds. It is what is planned to be given to shareholders as dividends in the next 1 year, calculated based on the fixed interest payments of the bonds that the fund holds.
Underlying Yield is better for measuring the income from the fund, because it takes into account the annual management charge.

Example 1: a bond fund with a distribution yield of 5.8% (this is the total dividend that should be given to shareholders in 1 year) has an annual management charge of 1.5%, so the underlying yield is 4.3%.

Example 2: an equity fund with a historic yield of 3.5% (this is what has been given as dividends to the fund's shareholders in the last 1 year) has an annual management charge of 1.7%, so the "real yield" is 1.8%.

Bond funds have another type of yield: yield to maturity, which is the amount the fund receives from the bonds it holds. It is declared as a 6-month rate.

For example: the fund has only 1 high yield bond, with an 8% coupon. The yield to maturity is 3.923% because (1 + 3.923%)^2 = 1.08.

15 June 2012

Credit Default Swap (CDS)

Filed under: Business Knowledge — Vincent Rainardi @ 7:15 pm

A Credit Default Swap, or CDS, is an instrument which is used to insure an exposure against a default. When a bank lends credit to an obligor and this obligor is unable to make the payments on time, it is a credit event, or default. When a default happens, the CDS counterparty (the protection seller) pays the bank an agreed sum called the notional amount.

For this service the bank pays the counterparty a fee, which is called the CDS spread. This is usually paid upfront and annually. One company that publishes CDS prices is Markit, http://www.markit.com. Markit also publishes the CDS Implied Rating, which is an S&P-style credit rating of an entity, derived from its CDS price. The higher the price, the lower the rating, because it means that it is more likely that the company will default.

The CDS spread is quoted not as an absolute figure, but relative to the notional. For example, if the notional is USD 10 million and the spread is 1%, then the fee that the lender needs to pay is $10m x 1% = $100,000. But a CDS spread is not quoted in percentage; it is written in basis points, or bps. A basis point is 1 percent divided by 100, so 1% is 100 bps, 2% is 200 bps, and so on. The notional amount is the maximum amount we are protected against. For example, if our exposure is $100m and we buy a CDS with a notional of $20m, then the counterparty will only pay us $20m in the event of the obligor defaulting.

There is a difference between the market spread and the traded spread. The market spread is the price of a CDS for that referenced entity that is publicly quoted on that day. Of course the market spread will depend on the tenor. The tenor is the duration of coverage for which the CDS is active. For example, we can buy a CDS which lasts for 1 year, 3 years, 5 years, or 10 years. The longer the tenor, the higher the spread. The tenor is determined from the maturity date minus the effective date. The traded spread is the price that we actually paid when we bought or sold the CDS, which can be above or below the market price on that day.

The obligor which the bank is protected against is called the referenced entity. A referenced entity is usually a single name, and this single name is usually a legal entity, i.e. a company, a partnership (LLP), a government, an individual, or a municipality, which is a department of a local or national government.

In contrast to a single name CDS, we also have the CDS index, which is an amalgamation of several CDSes. Two famous index families are iTraxx and CDX. For example, there is an index for financial services companies in Europe. Rather than buying several CDSes for individual companies, we buy a CDS index for European banks, which should lower the cost. If any of the constituent companies in the index defaults, the seller pays the protection buyer. The pricing of a CDS index can be tricky because different obligors have different credit ratings, and different likelihoods of default.

iTraxx covers the following sectors: financials (senior and sub), non-financials, TMT (telecom, media, tech), industrials, energy, consumers, automotive. Benchmark: the top 125 companies, the top 30 high volatility, and the top 40 crossover (between investment grade and junk/high yield).

CDX covers the following: CDX.NA.IG: investment grade, CDX.NA.IG.HVOL: Investment grade but high volatility, CDX.NA.HY: high yield, CDX.NA.HY.BB, B, XO (crossover), EM (Emerging Market), LCDX (Loan only)

A CDS has a seniority level. The most senior is Senior Unsecured Debt (for a corporate) or, for a government, Foreign Currency Sovereign Debt. Subordinated or Lower Tier 2 Debt has lower seniority. There is also Junior Subordinated or Lower Tier 1, which is even lower in seniority. A CDS with a lower seniority level has a higher price, because the expected recovery is lower.

The recovery rate of a CDS reflects the amount that the lender might get back at a credit event. The recovery rate depends on the seniority: the more senior a CDS is, the higher the recovery rate. ISDA (the international body that governs the standards of CDS; it stands for International Swaps and Derivatives Association) assumes that the recovery rate for senior unsecured (SNRFOR) is 40%, for subordinated (SUBLT2) is 20%, and for emerging markets (both senior and subordinated) is 25%. The final payment from the seller is calculated as the notional minus the amount recovered (which is the recovery rate x the notional). For example, on a $10m notional with a 40% recovery rate, the seller pays $10m minus $4m = $6m.

There are two settlement methods: cash settlement and physical settlement. Physical settlement is where the buyer gives the seller the bonds that they are holding, and the seller pays the notional value (the whole amount/par). Cash settlement is where the seller pays the buyer the notional minus the recovered amount (= recovery rate x notional). These days most CDS are bought/sold not for hedging but to take a position on credit, so cash settlement is preferred.

The recovery rate is determined in an auction, usually run by Markit or Creditex. A primer on CDS auctions is here, which is a must read. And the formal rules of CDS auctions are determined by ISDA, here. There are 2 stages:
Stage 1: All dealers submit their bids and offers, at an agreed maximum spread and quotation size, and at an agreed precision (usually 0.125%). They also submit their physical settlement requests. From these we calculate the Initial Market Midpoint (IMM) and the adjustment amount. And we calculate the open interest, which is the total of the physical settlement requests.

Stage 2: All dealers submit their new bids and offers, which are added to the bids and offers from stage 1. Any bid above IMM + ½ spread is set to IMM + ½ spread.
Then we check how many bids are needed to cover the open interest. Say we need four bids to cover the open interest; then the final price is the lowest of these four. The final price is the recovery rate, the amount investors get back from the defaulted bonds.

Restructuring. When an obligor defaults, it restructures its debt. For the same entity, same tenor, same seniority, and same coupon (see below), we have different restructuring mechanisms which the CDS covers. This is known as the document clause. We have NR (no restructuring), old restructuring, modified restructuring and modified-modified restructuring. The market price of a CDS depends on: entity, date, tenor, coupon, seniority, document clause and currency. A CDS traded in EUR may have a different market spread to one traded in JPY.

CDS big bang (read the document from Markit, link): We have 2 elements of payment: a fixed coupon and an upfront payment. The coupon is the payment that the buyer pays every quarter, stated in bps. The coupon is usually either 100 or 500 bps. On top of the coupon, the buyer also pays an upfront payment, also stated in bps.

The Conventional Spread is the CDS price for a CDS with a coupon of 100 bps, whereas a CDS with a coupon of 500 bps will generally be quoted with an upfront payment. The buyer pays the coupon in full on the first payment date, so we need to calculate an accrual rebate payment to the buyer at the time of the trade.

CDS Curve is a series of CDS spread across different tenors/years.

Traditionally, the purpose of buying or selling a CDS is to remove assets from the balance sheet, to meet regulatory capital/liquidity requirements, and to improve risk capital, ROE and ROEC; basically to offload credit risk from the balance sheet. But nowadays banks trade CDS to take a position on credit risk, not to hedge a portfolio.

DTCC is the clearing house for CDS. It stands for Depository Trust and Clearing Corporation. It covers not just CDS, but also equities and fixed income.

A CDO is a collateralised debt obligation. It is an asset-backed security with multiple tranches (seniority / risk classes), collateralised by loans or bonds. The junior and equity tranches offer higher coupons and interest rates to compensate for the default risk. There are 4 types: CLO (loans), CBO (bonds), CSO (synthetic/CDS) and SFCDO (structured products, i.e. mortgages/MBS, ABS).

Value at Risk is the potential loss on a portfolio, for a given time horizon and probability.

Interest rate risk: the risk of decreasing revenue from a bond or loan because interest rates rise or fall. Basis risk: LIBOR and the US prime rate moving in different directions. Yield curve risk: changes in the difference between short term and long term interest rates. Repricing risk: when interest rates fall, loans with variable rates will generate lower interest income.

Main sources: Wikipedia, Markit web site, ISDA web site.
