Data Warehousing and Business Intelligence

24 April 2017

Choosing between Big Data and Data Warehousing

Filed under: Data Architecture,Data Warehousing — Vincent Rainardi @ 4:14 am

Suppose we have 100 files, each containing 10 million rows, that we need to load into a repository so that we can analyse the data. What should we do? Do we put them into Hadoop (HDFS), or into a database (RDBMS)?

Last week I defined the difference between Big Data and Data Warehousing as: Big Data is Hadoop, and Data Warehousing is RDBMS. See my article here: link. Today I would like to illustrate this with an example where we need to choose between the two.

There are 4 factors to consider:

  1. Data Structure
  2. Data Volume
  3. Unstructured Data
  4. Schema on Read

1. Data Structure: Simple vs Complex
If all 100 files have the same structure, and they all consist of the same 10 columns, then it is better to put them into Hadoop. We can then use Hive* and R to analyse the data, for example to find patterns in the data, do statistical analysis, or create forecasts. The development time will be shorter, because there is only 1 layer.
*or Phoenix, Impala, BigSQL, Stinger, Drill, Spark, depending on your Hadoop platform
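
For illustration, here is a minimal PySpark sketch of this single-layer approach, assuming the 100 files are CSVs landed in HDFS under a hypothetical path with made-up column names:

```python
# A minimal sketch: load 100 identically-structured CSV files from HDFS and run
# a simple statistical analysis. Path and column names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("simple-structure-analysis").getOrCreate()

# One read covers all 100 files because they share the same 10 columns
df = spark.read.csv("hdfs:///landing/daily/*.csv", header=True, inferSchema=True)

# Example analysis: daily statistics per product
# (assumed columns: trade_date, product_code, amount)
summary = (df.groupBy("trade_date", "product_code")
             .agg(F.count("*").alias("row_count"),
                  F.avg("amount").alias("avg_amount"),
                  F.stddev("amount").alias("stddev_amount")))
summary.show()
```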

If the 100 files contain 100 different tables, it is better to put them into a database, create a data warehouse, and use an Analytics/BI tool such as MicroStrategy or Tableau to analyse the data, for example to slice and dice the data, find percentages or anomalies, and do time series analysis. Yes, we need to create 3 layers (staging, 3NF, star schema), but it enables us to analyse each measure by different dimensions.

So if the data structure is simple, put the data into Hadoop, and if the structure is complex, put it into a data warehouse. This is the general rule, but there are always exceptions. Can data with a simple structure be put into a data warehouse? Of course it can. Can data with a complex structure be put into Hadoop? Of course it can.

Using Hadoop and Hive/R we can also slice and dice, find percentages or anomalies, and do time series analysis. Using a data warehouse we can also do machine learning and data mining to find patterns in the data, do statistical analysis, and create forecasts. So basically, whether we store the data in Hadoop or in a data warehouse, we can still do complete analysis.

The issue here is how we store it. Linking 100 tables in Hadoop is difficult and unnatural. An RDBMS such as SQL Server or Oracle is designed precisely for that task: linking and joining tables. Constructing a data model linking 100 tables is very suitable for an RDBMS. Can we design a data model linking 100 files with different structures in Hadoop? Of course we can. But it is much more difficult. For starters, Hadoop is Schema-on-Read, so the columns in the files have no data types. Schema-on-Read means that we don’t try to understand the relationships between the files when loading them into Hadoop. So yes, we can load the 100 files into Hadoop, but we keep them as individual files, without relationships between them. This is the same as a Data Lake, which also uses Schema-on-Read and HDFS.

2. Data Volume: Small vs Large

100 files containing 10 million rows each is 1 billion rows per day. If all 100 files have the same structure (say they all consist of the same 10 columns), then we will have a performance problem if we load them into an SMP database such as SQL Server or Oracle. Within 3 years, this table will have about 1 trillion rows. Even with partitioning and indexing, it will still be slow to query.

Hadoop, on the other hand, will have no problem storing and querying 1 trillion rows. It is designed exactly for this task, storing the data in many files and querying it in parallel using Stinger, Drill, Phoenix, Impala or Spark. The file structure is simple (the same 10 columns in each file), which lends itself to Hadoop.

MPP databases such as Teradata, Greenplum, Netezza, Exadata and Azure SQL Data Warehouse are more than capable of handling this, with excellent query performance. But MPPs are more costly than Hadoop, which is why companies tend to choose Hadoop for this task. Using an MPP for this task is like killing a fly with a cannon. Not only is it expensive and unnecessary, it is also too sluggish and cumbersome for the task.

If the 100 source files have a complex structure (such as an export from an SAP system) then yes, an MPP is a suitable solution, as we need to create relationships between the files/tables. But if the source files have a simple structure and we just need to union them, then Hadoop is more suitable and more economical for this task.

So if the data volume is large, like 1 billion rows per day, and the data structure is simple, put them into Hadoop. But if the data volume is large and the data structure is complex, put them into an MPP.

3. Unstructured Data

If most of those 100 source files are MP4 (video) or MP3 (music), then Hadoop or a Data Lake is an ideal platform to store them. An RDBMS, be it SMP or MPP, is not designed to store video or music files. It can (as blobs, or as externally-linked files), but it is not really designed for it.

If the source files have varying numbers of columns (such as Facebook or Twitter files), then Hadoop or a Data Lake is an ideal platform to store them. An RDBMS is not really designed for that.

Unstructured data can also come in the form of free-format text files (such as emails) and documents (such as journals and patents). Again, Hadoop or a Data Lake is in a much better position to store them than an RDBMS.

4. Schema-on-Read

One of the advantages of using Hadoop or a Data Lake is that they are Schema-on-Read, meaning that we just store the files without determining whether the first column is numeric or a string. Only when we want to query the data do we need to specify the data types.

Why is this an advantage? Because it makes things flexible. In Data Warehousing the first thing we need to do is analyse the file structure and design many tables to host the files in a Staging database. Then we design a normalised database to integrate those Staging tables. And then we design a Reporting layer in the form of Fact and Dimension tables and load those normalised tables into them. The whole thing can take a year if we have 100 files. The more files we have, the more complex the process and the longer it takes to design the databases for the Integration layer and Reporting layer. It is good for the data architect (it gives them a job) but it is not good for the people who pay for the project.

Hadoop, on the other hand, is Schema-on-Read. After we put these 100 files in Hadoop, we query the first file, and when we query this first file, we specify the data types of each column. We don’t need to touch the other 99 files yet, and we can already get the benefit. We can analyse the data straight away. On day one! If the other 99 files have the same structure, then we can union them, without the extra effort of designing any database, and we can query them straight away. On day two! It is much simpler: we don’t need a team of 10 people designing Staging, Normalised or Reporting layers for many, many months. We can start analysing the data straight away and the project can finish in 2 to 3 months, with only 3 or 4 people. A lot less costly, a lot more agile, and a lot more flexible.
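
As a rough illustration of this Schema-on-Read workflow, here is a minimal PySpark sketch; the file paths, column names and data types are made up:

```python
# Schema-on-Read: the data types are declared at query time, not at load time.
# Paths, column names and types below are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DecimalType, DateType

spark = SparkSession.builder.appName("schema-on-read").getOrCreate()

# Day one: declare a schema and query just the first file
schema = StructType([
    StructField("trade_date", DateType()),
    StructField("product_code", StringType()),
    StructField("amount", DecimalType(18, 2)),
])
first_file = spark.read.csv("hdfs:///landing/file_001.csv", header=True, schema=schema)
first_file.createOrReplaceTempView("file_001")
spark.sql("SELECT product_code, SUM(amount) FROM file_001 GROUP BY product_code").show()

# Day two: the other 99 files share the structure, so a wildcard read unions them all
all_files = spark.read.csv("hdfs:///landing/file_*.csv", header=True, schema=schema)
```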


17 April 2017

Definition of Big Data and Data Warehousing

Filed under: Data Architecture,Data Warehousing — Vincent Rainardi @ 5:38 pm

I’m annoyed that many people label the normal Data Warehousing & BI stuff as “Big Data”. It is not. For me, Big Data is the “Hadoop stuff” (HDFS). If it is in SQL Server, Oracle or Teradata, it is just a normal database, not Big Data.

Big Data is stored on HDFS (Hadoop Distributed File System), not in RDBMS. Oracle, IBM, Microsoft, SAP, Teradata, all use Hadoop to store Big Data. Big Data is queried using MapReduce.

The reason Big Data can’t be stored in an RDBMS is that the format is not tabular. Sometimes it is 2 columns, sometimes it is 200 columns, like Twitter data. The second reason is that it is too big. Sensors can make 100 measurements in a second, and in a year that could be petabytes. Web logs are another example. Tracking the ask and offer price of every transaction in every stock market is another example. Yes, we can put petabytes into SQL Server or Oracle, into Netezza or Teradata, but not at this speed. Hadoop, on the other hand, is designed exactly to cope with this kind of speed and volume.

Now the usage. What is Big Data Analytics? Big Data Analytics is when we do analytics on Hadoop Data.

Is Fraud Detection Big Data Analytics? Not always. Fraud Detection can be done on a normal Data Warehouse or a database. Is Machine Learning Big Data? Not always. Machine Learning can be done on a normal Data Warehouse or a database. If the Fraud Detection or the Machine Learning is done on data stored in Hadoop, then it is Big Data Analytics.

Even if it is only 200 GB, if it is stored in Hadoop, it is Big Data. Even if the data is 5 Petabyte, if it is stored in an SQL Server database, it is not Big Data, in my opinion.

Even if the data is in tabular format (i.e. columns and rows), if it is stored in Hadoop, it is Big Data. But if it is stored in an Oracle database, it is not Big Data.

Every Big Data architecture that I know uses Hadoop. No company (or government) implements Big Data on an RDBMS, or on non-HDFS files. Every single company, every single Big Data case I read about implements Big Data on Hadoop. I may be wrong and would be happy to be corrected. If you know a case which implements Big Data on a non-Hadoop system/architecture, I would be grateful if you could let me know, either through the comments or via vrainardi@gmail.com.



18 September 2016

Rating Dimension

Filed under: Business Knowledge,Data Architecture,Data Warehousing — Vincent Rainardi @ 2:32 am

In an investment banking data warehouse, we have an instrument dimension (aka security dimension). This instrument dimension is used in many fact tables such as trade, risk, valuation, P&L, and performance.

In this instrument dimension we have rating attributes (columns) from the 3 agencies (S&P, Moody’s, Fitch), an in-house rating, and many combinations of them. We may have the ratings at both the issuer/obligor level and at the instrument/issue level. We could also have a hierarchy of ratings, e.g. AA+, AA and AA- are grouped into AA. There are separate ratings for cash instruments.

Usual Design

The usual design is to put all rating attributes in the instrument dimension. This is because the granularity of rating is at instrument level. In other words, rating is an attribute of an instrument.

Issues with the Usual Design

The issues with doing that are:

  1. Cash and FX may not be created as an instrument but they can have ratings.
  2. Some fact tables require the rating outside the context of an instrument.

An example for point 1 is a USD bank balance (settled cash) in a GBP portfolio, which is given a rating of AAA. In the instrument dimension there is no instrument called USD bank balance. Another example is USD cash proceeds, which is unsettled cash.

An example for point 2 is a fact table used for analysing the average ratings of a portfolio across different dates. This is called Portfolio Credit Quality analysis. On any given day, a credit portfolio has 3 attributes: highest rating, lowest rating and average rating. The average rating is calculated by converting the rating letter to a number (usually 1 to 27, or using Moody’s rating factor, link), multiplying the rating number by the weight of each position, and summing them up. There can be several different criteria, for example: including or excluding debt derivatives, funds, cash, or equity, and doing lookthrough or not.

For example, portfolio X on 16th Sep could have these different attributes:

  1. Average rating: AA+
  2. Highest rating: AAA
  3. Lowest rating: BB-
  4. Average rating excluding cash: AA
  5. Average rating excluding cash and derivatives: BBB+
  6. Average rating excluding cash, derivatives and equity: BBB-
  7. Average rating including funds: AA-
  8. Average rating with lookthrough: A+

So the fact table grain is at portfolio and date level. We therefore cannot use the instrument dimension in this fact table, so it is necessary to have rating as its own dimension.
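
As a concrete illustration of the average rating calculation described above, here is a minimal Python sketch; the rating-to-number scale and the sample positions are made up:

```python
# Minimal sketch of the portfolio average rating calculation. The scale and the
# sample positions are illustrative only.
RATING_TO_NUMBER = {"AAA": 1, "AA+": 2, "AA": 3, "AA-": 4, "A+": 5, "A": 6, "A-": 7,
                    "BBB+": 8, "BBB": 9, "BBB-": 10, "BB+": 11}
NUMBER_TO_RATING = {v: k for k, v in RATING_TO_NUMBER.items()}

# Each position: (rating, weight in the portfolio); weights sum to 1
positions = [("AAA", 0.40), ("AA", 0.35), ("BBB+", 0.25)]

# Convert each rating to a number, weight it, sum up, then round back to a letter
weighted = sum(RATING_TO_NUMBER[rating] * weight for rating, weight in positions)
average_rating = NUMBER_TO_RATING[round(weighted)]
print(average_rating)   # "AA" for this sample portfolio (weighted value 3.45)
```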

Another example of a fact table which requires rating not at instrument level is sectoral credit quality. A credit portfolio consists of many bonds and credit derivatives. Each of these bonds (or CDSes) is in a certain industry sector. A bond issued by BP is in the Energy sector. A bond issued by Santander is in the Financial sector. The average rating for each sector is calculated by converting each rating to a number, multiplying it by the weight, and summing up all positions in that sector.

If we use GICS (link) plus a Government sector as an example, on any given day a credit portfolio can have a sectoral credit quality like this (usually excluding cash and equity, but including credit derivatives and debt funds):

  1. Consumer Discretionary: AA+
  2. Consumer Staples: AA-
  3. Energy: A+
  4. Financials: AAA
  5. Health Care: BB-
  6. Industrials: BBB-
  7. Information Technology: AA-
  8. Materials: BBB+
  9. Telecommunication Services: AA
  10. Utilities: BB+
  11. Real Estate: B+
  12. Government: AAA

Above is the average rating for each sector, for a given day, for a particular portfolio.

Design Choices for Rating Dimension

To satisfy the above requirement (fact tables which analyse rating at a level higher than instrument, e.g. at sector level or portfolio level), we can design the rating dimension in several ways:

  1. Rating schemes and hierarchies are created as attributes
  2. Rating schemes are created as rows, with hierarchies as attributes
  3. Rating schemes and hierarchies are created as rows

Before we dive into the details I’d like to emphasise that we should not try to find out which approach is the best one, because different projects will require different approaches. So instead of asking “What is the best approach?” we should ask “What is the most suitable approach for my project?”

And secondly I’d like to remind us that if the rating data is always used at instrument level then it should reside in the instrument dimension, not in a rating dimension of its own.

Approach 1. Rating schemes and hierarchies are created as attributes

[Figure: Approach 1 – rating schemes and hierarchies as attributes]

Of course we should always have RatingKey 0 with description = Unknown. This row is distinctly different from NR (No Rating). NR means S&P gives the bond a rating of NR. RatingKey 0 means that we don’t have any info about the rating of this bond.

The distinct advantage of this approach is that we have a mapping between different rating schemes. Moody’s Baa3 for example corresponds to S&P’s BBB-.

The Grade column is either Investment Grade (AAA down to BBB-) or High Yield (HY, from BB+ down to D).

Approach 2. Rating schemes are created as rows, with hierarchies as attributes

[Figure: Approach 2 – rating schemes as rows, with hierarchies as attributes]

The distinct advantage of Approach 2 over Approach 1 is that RatingKey 33 means Aa3; we don’t have to specify that we mean Moody’s, like in Approach 1. In Approach 1, RatingKey 4 can mean Aa3 or AA-, depending on whether we mean Moody’s or S&P.

The distinct disadvantage of Approach 2 compared to Approach 1 is that we lose the mapping between S&P and Moody’s. In Approach 1 we know that Aa3 in Moody’s corresponds to AA- in S&P, but in Approach 2 we don’t know that. In the majority of circumstances this is not an issue, because an instrument with a Moody’s rating of Aa3 may not have an S&P rating of AA- anyway; its S&P rating could instead be AA or BBB+. So the mapping between S&P and Moody’s is a textbook, academic assumption, not the reality in the real world. In the above example, nobody cares whether the corresponding S&P rating for Aa3 is AA-. What the fund managers and analysts want to know is what Moody’s really assigned to that bond.

Approach 3. Rating schemes and hierarchies are created as rows

[Figure: Approach 3 – rating schemes and hierarchies as rows]

The distinct advantage of Approach 3 over Approaches 1 and 2 is that RatingKey 31 means the AA tier; it is clear and unambiguous. In Approaches 1 and 2, RatingKeys 2, 3 and 4 all mean the AA tier.

The other advantage of Approach 3 is the ability to create short-term ratings and cash ratings, because they are simply different schemes.

Why do we need to keep the Tier column in Approach 3? Because we still need to convert AA+, AA and AA- to AA.

Rating Value (aka Rating Number)

In some projects we need to convert the rating letters to numbers. For this we can create an attribute in the above rating dimension (any of the 3 approaches) called Rating Value. It is usually AAA = 1, AA+ = 2, AA = 3, AA- = 4, and so on. NR (No Rating) and WR (Withdrawn Rating) get 0 rather than 23.

[Figure: Rating Value mapping]

Note that unlike the S&P and Moody’s schemes, the Fitch scheme splits the D rating into 3: DDD, DD, and D. So DDD = 22, DD = 23, D = 24.
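
As a small illustrative sketch of this Rating Value attribute (following the numbering described above, with an ellipsis where the scale continues):

```python
# Illustrative rating-value mapping: AAA = 1, AA+ = 2, and so on; NR and WR get 0.
# Only Fitch splits the D rating into DDD/DD/D.
RATING_VALUE = {
    "AAA": 1, "AA+": 2, "AA": 3, "AA-": 4,   # ... and so on down the scale
    "NR": 0, "WR": 0,                        # No Rating / Withdrawn Rating
}
FITCH_D_VALUES = {"DDD": 22, "DD": 23, "D": 24}   # Fitch-specific split of D
```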

Miscellaneous

A common mistake is that people assume there are CC+ and CC- ratings. There aren’t. Only CCC is split into 3 (CCC+, CCC, CCC-); CC and C are not.

Sometimes the relationship between a rating in one scheme and another scheme is not a one-to-one mapping. For example, the Moody’s rating of Ca corresponds to 2 ratings in the S&P scheme: CC and C. This is why Approach 1 is tricky: ratings do not have a one-to-one relationship between one scheme and another. Another example, as mentioned above, is the D rating in S&P, which corresponds to 3 ratings in Fitch: DDD, DD and D.

20 June 2016

Data Lake vs Data Warehouse

Filed under: Data Architecture,Data Warehousing — Vincent Rainardi @ 6:22 pm

A Data Lake is a storage and analytic system which stores structured and unstructured data from all source systems in the company in its raw form. The data is queried, combined and analysed to get useful patterns and insights. It is built on a cost-effective Hadoop storage infrastructure, and can be dynamically scaled.

A Data Warehouse is a system that retrieves and consolidates data periodically from source systems into a dimensional or normalised data store. It usually keeps years of history and is queried for business intelligence or other analytical activities. It is typically updated in batches, not every time a transaction happens in the source system.

Data Lake Advantages

Because data from all source systems is there in the data lake, the main advantage of a data lake is that it enables the analyst to get analytics and insights from multiple systems. Yes, the data is still in its raw form, so it requires some processing, but it is a lot quicker (1-2 weeks) than building a data warehouse (6-12 months). And that is the second advantage: time to build.

The third advantage of data lake is its ability to store unstructured data such as documents and social media, which we can query and combine with structured data from databases.

The fourth and probably the most powerful business case for a data lake is cost efficiency. In the investment banking world, for example, we can store market data (e.g. prices, yields, spreads, ratings) not only for the securities that we hold but for all securities in the market, more cheaply than if we stored it in Oracle or SQL Server.

The fifth advantage of a data lake is flexibility. Unlike Oracle or SQL Server, a data lake scales dynamically. If this morning it is 2 TB, in the afternoon it can be 3 TB. As new data arrives and we need new capacity, we can add storage easily. As we require more computing power, we can get it there and then, instantly. There is no need to wait a few days to add a new node to a SQL Server Always On availability group, add storage to the SAN, or extend the computing power of an Oracle Grid. This means that we do not need to spend a lot of money upfront paying for 3 TB of data and 16 processors. We can just start with 300 GB and 4 processors, and expand when required.

Data Lake Disadvantages

The first issue is that data lake technology is immature. The languages that can query across databases and unstructured files have very limited features (each vendor has a different language; for Microsoft Azure it is U-SQL). It is probably only 10% of what PL/SQL or T-SQL can do. We can work around this by putting QlikView or Tableau on top: we use U-SQL only to query individual tables, and we join the data in QlikView/Tableau and do further processing there.

The second issue is cost (though I don’t find this argument holds water). The argument is that it actually costs a lot of money to store data from all source systems in the company, let alone external market data, which carries a lot of licence cost.

Let’s take Microsoft Azure Data Lake as a pricing example. Being in the cloud, the price is only US$0.04 per GB and US$0.07 per 1 million transactions (ref: link). Let’s say that “data from ALL systems in the company” is 500 GB, and every day we store this data in the data lake. So 500 GB per day x 365 days = 182,500 GB per year x $0.04/GB = $7,300 per year. Let’s say we have 10 million transactions per day, which is 3,650 million transactions per year x $0.07/million transactions = $256 per year. So it only costs us about $7,500 per year. This is a very reasonable price to pay to have “all the data in the company in one place”. Even 5 years later, when the volume grows to 5 times this, it is only about $35,000 per year. Still a very reasonable price to pay.
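
For what it’s worth, the back-of-envelope arithmetic above can be reproduced like this (all figures are this article’s assumptions, not current Azure pricing):

```python
# Reproducing the rough Azure Data Lake cost estimate above.
gb_per_day = 500
price_per_gb = 0.04                  # USD per GB stored
transactions_per_day = 10_000_000
price_per_million_tx = 0.07          # USD per 1 million transactions

storage_cost = gb_per_day * 365 * price_per_gb                             # ~7,300 USD/year
tx_cost = transactions_per_day * 365 / 1_000_000 * price_per_million_tx    # ~256 USD/year
print(round(storage_cost + tx_cost))                                       # ~7,556 USD/year
```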

The third issue is performance. Because a data lake stores the data in its raw format, a query which joins the different data sets could run like a dog. Luckily a Data Lake runs on HDFS, a distributed file system, which is very fast. So yes, it is slower than a data warehouse, but it is not too bad. We are not talking about 30 minutes to run a query, but something like 20-30 seconds (compared to a DW query which takes, say, 1-3 seconds).

A Data Lake being an “exploration tool”, I can tolerate a little bit of slowness. After we prove that (for example) the risk data we query is making money, it can pay for creating a Risk data mart specifically for that risk analysis purpose.

The fourth issue is skill. A Data Lake requires superb business knowledge, because the analyst needs to join tables from different source systems. What is the difference between Real Modified Duration, Effective Real Duration and Modified Duration? But this issue is the same whether we are building a Data Warehouse or a Data Lake: both require good business knowledge.

I don’t find U-SQL difficult to understand for people with SQL knowledge, which is a common skill. But how about Machine Learning? That is difficult to master, right? Yes, that is true, but it is worth paying an expert data scientist to discover the insights in the data, because these insights can be used to save cost or boost our revenue. The potential benefit is 10-100 times the salary of the data scientist.

Conclusion

So considering all the pros and cons above, I am in favour of creating a Data Lake. In addition to having a Data Warehouse, of course.


21 April 2016

Why do we need a data warehouse?

Filed under: Data Warehousing — Vincent Rainardi @ 4:59 pm

I have written about this in 2012 (link) but this one is different.

One of the most compelling reasons for creating a data warehouse is to reduce cost. Imagine you are a large investment bank, trading fixed income, equities, commodities and currencies. Every month you need to produce hundreds of reports for your clients, for regulatory bodies and for management. The reporting team currently spends a lot of time creating the reports. They all have access to various systems: risk systems, trading systems (order management), compliance systems, fixed income systems, finance systems. They copy and paste data from these systems into Excel and do some calculations in Excel, which is then used as the data source for making the reports in PowerPoint and PDF. Let’s find out how much this operation costs. Say there are 8 systems, costing $200 per user per month, and there are 10 people in the team. The licence cost is 8 x $200 x 10 = $16k/month, x 12 months = $192k/year; call it $200k. The salary cost is $60k x 10 = $600k per year. Total cost = $800k/year.

If we build a data warehouse, taking 1 year and costing $500k, we will be able to put together the data from those 8 systems into 1 database, and the reporting team won’t need access to those systems any more. That saves $200k/year.

Secondly, we will be able to automate the production of those reports. That will reduce the amount of time required to make those reports, thus reducing the number of people required and freeing them to do other activities. Assuming that half of the reports can be automated, that would save us $300k per year.

So total saving per year is $500k. Project cost is $500k. Surely a project with this level of “cost vs benefit” is well worth pursuing.
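
The back-of-envelope numbers above can be laid out like this (all figures are the assumptions made in this post):

```python
# Reproducing the cost/benefit estimate above.
systems, licence_per_user_per_month, team_size = 8, 200, 10
salary_per_person = 60_000

licence_cost_per_year = systems * licence_per_user_per_month * team_size * 12  # 192,000, call it 200k
salary_cost_per_year = salary_per_person * team_size                           # 600,000

licence_saving = 200_000                        # no more access to the 8 systems
automation_saving = salary_cost_per_year / 2    # half the reports automated
print(licence_saving + automation_saving)       # 500,000 per year, vs a 500k project cost
```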

1 April 2016

Data Sourcing

Filed under: Data Architecture,Data Warehousing — Vincent Rainardi @ 8:16 pm

One of the critical activities in a data warehouse project (and in any data project) is data sourcing. The business users have a few reports which need to be produced from the data warehouse. There is no data for that yet in the data warehouse, so we look at the report and ask ourselves: “Where can I get the data from to produce this report?” The plan is to find the data source, and then bring the data into the data warehouse.

There are 6 steps of data sourcing:

  1. Find the source
  2. Reproduce the numbers
  3. Verify the extraction
  4. Check the coverage
  5. Check the timing
  6. Check the cost

Step 1. Finding the source

We find out how the report was created, who created it, and from where. The report could be a PDF which is created from PowerPoint, which was created from an Excel spreadsheet, which was created from a file, which was exported from a system. It is this system that we need to check. Find the tables, the columns and the rows where the data is coming from.

Step 2. Reproduce the numbers

Once we have located the tables, we then try to reproduce the numbers we see on the report. For example, the report could be like this:

[Figure: example Asset Allocation report]

Try to reproduce the first number (24.3%) by querying the tables which you think the data is sourced from. In doing this you will need to look at the data processing happening in the Excel spreadsheet. Do they exclude anything? Is it a straightforward sum of the holdings in government bonds? Do they include supranationals? Government agencies? Regional? Municipal? Emerging Market Debt?

If you can match 24.3%, great. If not, investigate whether the data was manually overridden before it was summed up. For example, in Excel there might be a particular bond which was manually classified as Government even though its asset class is Corporate, because it is 80% owned by the government, or because it is backed by the government.

We need to be particularly careful with regard to the totals. For example, on the “Allocation by Country of Risk”, if the total portfolio value is $200m, but they exclude FX forwards/swaps, or bond futures, or certain derivatives, then the total portfolio value could decrease to $198m, and all the percentages would be incorrect (they would be slightly higher).
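
A minimal pandas sketch of this kind of reconciliation might look like this (the holdings, column names and exclusion rule are all made up for illustration):

```python
# Reproducing an asset allocation percentage from a hypothetical holdings extract.
import pandas as pd

holdings = pd.DataFrame({
    "asset_class": ["Government Bond", "Corporate Bond", "FX Forward", "Equity"],
    "market_value": [48.6, 80.0, 2.0, 71.4],   # in millions
})

# Apply the same exclusions the report uses (e.g. FX forwards); otherwise the
# denominator (total portfolio value) will not match and every percentage shifts
included = holdings[holdings["asset_class"] != "FX Forward"]

allocation = (included.groupby("asset_class")["market_value"].sum()
              / included["market_value"].sum() * 100).round(1)
print(allocation)   # Government Bond comes out at 24.3
```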

Understanding the logic behind the report is critical in reproducing the numbers. In-depth industry knowledge will be helpful to understand the logic. For example:

  1. The credit rating for each debt security is categorised into IG (Investment Grade) if it is BBB- or above, and into HY (High Yield) if it is BB+ or below, or NR.
  2. The credit rating used for point 1 is the average of S&P, Moody’s and Fitch, except a) for own funds use look-through, b) for outside funds use IMMFA.
  3. The “Allocation by Country of Risk” table excludes cash and FX forwards/swaps.
  4. Each country is categorised into Developed, Emerging or Frontier market.
  5. When determining the country of risk in point 4, for derivatives use the underlying.

In the above case, if we are not familiar with the investment banking industry, it would take a long time for us to understand the logic. So, yes, when doing Data Sourcing, it is best if it is done by a Business Analyst with good knowledge & experience in that industry sector.

Step 3. Verify the extraction

Once we can reproduce the numbers, we need to verify if we can get the data out. A few gotchas are:

  1. The numbers are calculated on-the-fly by the system, and not stored anywhere in the database. If this is the case, find out from the vendor if they have an export utility which produces a file after the numbers have been calculated.
  2. Are we allowed to connect to that database and query it? Do not assume that we can, because I’ve encountered a few cases where we were not allowed to do that. It could be because of the system workload/performance (it is a critical transaction system and they don’t want any big query affecting the front-end users), or it could be because they have provided daily extract files which all downstream systems must use (instead of querying the database directly). From the system admin point of view, it makes sense not to allow any external query to run on the database, because we don’t know what kind of damage those external queries could cause; they could block the front-end queries and cause locks.
  3. Losing precision, i.e. the data must be exported from the database, but during the export process the precision decreases from 8 decimal places to 2 decimal places.
  4. There is a security restriction because it is against the “Chinese wall” compliance rules (in investment banking, the public-facing departments must not get data from the M&A department).
  5. The system is being migrated, or rewritten, so it is still in a state of flux and we need to wait a few months.
  6. The system is not solid “production quality” but only a “throw-away”, which means that within a few months they could drop those tables.

Step 4. Check the coverage

This step is often missed by many business analysts. We need to check whether all the products that we need are available in that system. If the report we are trying to reproduce covers 4000 products from 100 branches, but the source system/tables only cover 3000 products from 70 branches, then we’ve got to find out where the other 1000 products and 30 branches are sourced from. Are they produced from a different system?

Not only products and branches. We need to check the coverage in terms of customers/clients, portfolios, securities, lines of business, underwriting classes, asset classes, and data providers. And the most important coverage check is on dates, e.g. does the source system have data from 2011 to present? It is possible that the source system only has data from 2014.

Step 5. Check the timing

After checking the coverage, we need to check if the data is available when we need it. We need to check these 3 things: whether the data arrives too late, the available extraction window, and whether the data is overwritten.

  1. Data is too late: if our DW load starts at 2.15 am, will the data be available before that? If not, could the business users live with slightly stale data (data from 2 days ago, i.e. if today is Wednesday, the latest data in the DW would be Monday’s data)?
  2. Available extraction window: in the production environment, when can we extract the data from that source system? If from 11pm to 6am there is an overnight batch running, and we can’t run our extract during that time, then the earliest we can run is 7am. If the DW load takes 3 hours, DW users can access it at 10am. Is that too late for the users or not?
  3. The data is overwritten: the data in the source system can be updated many times during the day, and when we extract it at night we have no trace of these changes. Is that OK? Do we need an intraday, push-driven data load into the DW? Or would a 10-minute data extraction frequency (pull-driven) be enough?

Step 6. Check the cost

There is no free lunch. We need to check how much it would cost us to use that source data.

  1. If the data is valuable (such as prices, yields and ratings from Bloomberg, Reuters and Markit) we would have to pay the data providers. We need to check the cost. The cost could be, say, $5 per security per call, so it could easily be $20-30k per day. The cost is usually shared with other departments.
  2. Check with the data provider: if you use the data only as an input to your calculations, and you don’t publish it / send it on to any external parties (clients, etc.), would it still cost you a lot? Even if you don’t have to pay the data provider, your DW project might still have to share the cost with other departments. Say the data provider is Markit and your company pays $300k/year for prices data which is currently shared by 5 departments ($60k each). Your project may have to bear a cost of $50k/year ($300k/6).
  3. The cost could be a killer to the whole thing, i.e. even if #1 to #5 above are all ok, if the cost of the data is $50k, it could force you to cancel the whole project.
  4. Sometimes another department has to create the data for you. Let’s say yield calculations, or a risk engine, or an OTC pricing engine, and the requirement from BI/DW is specific so they have to develop it. It could take them 3 people x 3 months and they could cross-charge your project $50k (one-off). And that could also be a killer for the DW project.
  5. Developing an interface: some systems do not allow external systems to pull data out. They insist on developing an export, and charge the cost to your DW project.
  6. Standard data interface system: some large companies (such as multinational banks) have a standard interface (real time, end of day, etc.), and the central middleware team might charge your DW project a small amount (say $2,000 one-off + $50/month) to use that standard data interface system. Say you need FX rate data from the FX system, and there is already a standard message queue for FX rates with publication times of 11am, 1pm and 3pm. So you “subscribe” to this publication MQ and pay the cost (project cross-charge).

16 March 2016

Different Measures for Different Product Types

Filed under: Data Architecture,Data Warehousing,Investment Banking — Vincent Rainardi @ 8:33 am

What I mean by a measure here is a time-variant, numerical property of an entity. It is best to explain this by example. In the investment industry, we have different asset classes: equities, bonds, funds, ETFs, etc. Each asset class has different measures. Equities have opening and closing prices, daily volume, market capitalisation, daily high and low prices, as well as annual and quarterly measures such as turnover, pretax profit and EPS. Bonds have different daily measures: clean and dirty prices, accrued interest, yield and duration. Funds have different daily measures: NAV, alpha, Sharpe ratio, and volatility, as well as monthly measures such as 3M return, 1Y return, historic yield, fund size and number of holdings. ETFs have daily bid, mid and offer prices, year high and low, and volume; as well as monthly measures such as performance. The question is: what is an appropriate data model for this situation?

We have three choices:

  1. Put all measures from different product types into a single table.
  2. Separate measures from each product type into different tables.
  3. Put the common measures into one table, and put the uncommon measures into separate tables.

My preference is approach 2, because we don’t need to join across tables for each product type. Yes, we will need to union across different tables to sum up across product types, but a union is much more performant than a join. The main weakness of approach 1 is column sparsity.

On top of this, of course, we will need to separate the daily measures and monthly measures into two different tables. Annual and quarterly measures for equities (such as financial statement numbers) can be combined into one table. We need to remember that measures with different time granularity usually come from different groups. For example, prices are daily but performance is monthly.
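
A tiny sketch of approach 2, with made-up tables and columns, just to show the union on common columns:

```python
# Approach 2: daily measures kept in separate tables per product type, unioned on
# the common columns for a cross-asset view. Tables and columns are illustrative.
import pandas as pd

equity_daily = pd.DataFrame({"security_id": [1], "price_date": ["2016-03-16"],
                             "close_price": [102.5], "volume": [1_200_000]})
bond_daily = pd.DataFrame({"security_id": [2], "price_date": ["2016-03-16"],
                           "clean_price": [99.1], "yield": [3.2], "duration": [6.8]})

# The union only needs the columns the product types share, so no sparse columns
common_cols = ["security_id", "price_date"]
all_daily = pd.concat([equity_daily[common_cols], bond_daily[common_cols]],
                      ignore_index=True)
print(len(all_daily))   # 2 rows, one per security
```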

Static Properties

In addition to different time-variant properties (usually numerical), each asset class also has different static properties (which can be textual, date or numeric). For example, equities have listed exchanges, industry sectors, country of domicile, and dividend dates. Bonds have issuers, call and maturity dates, and credit ratings. Funds have benchmark, trustee, legal structure and inception date. Examples of numerical properties are minimum initial investment and annual charges for funds; outstanding shares and denomination for equities; par and coupon for bonds. Some static properties are common across asset classes, such as ISIN, country of risk, and currency.

Static properties from different asset classes are best stored in separate tables. So we have equity table, bond table, fund table and ETF table. Common properties such as ISIN, country of risk, etc. are best stored in a common table (usually named security table or instrument table).

Why not store all static properties in a common table? Because the properties are different for each asset class so it is like forcing a square peg into a round hole.

Historical Data

For time variant properties it is clear that the table already stores historical data in the rows. Different dates are stored as different rows. What we are discussing here is the historical data of the static attributes. Here we have two choices:

  1. Using the SCD approach: store the historical values on different rows (called versions), where each row is only valid for a certain time period. SCD stands for slowly changing dimension, a Kimball approach in data warehousing.
  2. Using the Audit Table approach: store the historical rows in an audit table (also called a history table). This is the traditional approach in normalised modelling. The main advantage is that the main table is lightweight and performant.

When to use them? Approach 1 is suitable for situations where the historical versions are accessed a lot, whereas Approach 2 is suitable for situations where the historical versions are very rarely accessed.

The main issue with Approach 1 is that we need to use “between” on the validity date columns. In data warehousing we have a surrogate key to resolve this issue, but in normalised modelling we don’t. Well, we could, and we should. Regardless of whether we use Approach 1 or 2, in the time-variant tables we need to store the ID of the historical row for that date. This will make getting historical data a lot faster.
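
A minimal sketch of the “between” lookup on an SCD-style history table (Approach 1), with illustrative data:

```python
# Resolving the version of an instrument that was valid on a given date, using
# BETWEEN-style logic on the validity columns. Data is illustrative.
import pandas as pd

instrument_history = pd.DataFrame({
    "instrument_id": [10, 10, 10],
    "version_id":    [1, 2, 3],
    "sector":        ["Energy", "Energy", "Utilities"],
    "valid_from":    pd.to_datetime(["2014-01-01", "2015-06-01", "2016-02-01"]),
    "valid_to":      pd.to_datetime(["2015-05-31", "2016-01-31", "2099-12-31"]),
})

def version_as_of(history, instrument_id, as_of):
    """Return the historical row(s) valid on the given date."""
    d = pd.Timestamp(as_of)
    mask = ((history["instrument_id"] == instrument_id)
            & (history["valid_from"] <= d) & (d <= history["valid_to"]))
    return history[mask]

print(version_as_of(instrument_history, 10, "2015-12-15"))   # version_id 2
```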

24 February 2016

Instrument Dimension

Filed under: Data Warehousing,Investment Banking — Vincent Rainardi @ 6:15 pm

One of the core dimensions in investment banking is instrument dimension. It is also known as security dimension. It contains various different types of financial instruments or financial securities, such as bonds, options, futures, swaps, equities, loans, deposits, and forwards.

The term “securities” used to mean only bonds, stocks and treasuries. But today it means any tradable financial instruments including derivatives, cash, loans, and currencies. Well, tradable and “contract-able”.

Where it is used

In a data warehouse for an investment bank, a brokerage or an investment company, an instrument dimension is used in three primary places: trade fact tables, position fact tables and P&L fact tables. Trade fact tables store all the transactions made by the bank, either for a client (aka flow trading) or for the prop desk (the bank’s own money), in all their lifecycle stages from initiation, execution, confirmation, clearing, and settlement. Position fact tables store daily values of all instruments that the bank holds (long) or owes (short). P&L (profit and loss) fact tables store the daily impact of all trades and positions on each of the bank’s financial accounts, e.g. IAS 39.

The secondary usages in an asset management or an investment banking data warehouse are risk fact tables (e.g. credit risk, market risk, counterparty risk), compliance fact tables, regulatory reporting, mark-to-market accounting, pricing, and liquidity fact tables.

The niche usages are ESG-score fact tables (aka SRI, socially responsible investing), rating transition fact tables, benchmark constituent fact tables, netting position fact tables, and collateral fact tables.

Data Structure

The business key of an instrument dimension is usually the bank-wide internal instrument identifier. Every instrument that the bank gets from market data providers such as Bloomberg, Reuters and Markit, from index constituents, and from internal OTC deals is mastered in a waterfall process. For example, public instruments (debts, equities, ETDs) are identified using the following external instrument identifiers, in order: ISIN, Bloomberg ID (BBGID), Reuters ID (RIC), SEDOL, CUSIP, Exchange Ticker, Markit ID (RED, CLIP), Moody’s ID (Master Issue ID). Then the internal identifiers for OTCs (e.g. CDS, IRS, FX swaps), FX forwards, and cash are added.
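
A minimal sketch of such a waterfall, with an assumed order of preference (the real order is whatever the bank’s mastering rules say):

```python
# Pick the highest-priority external identifier that is populated; fall back to
# an internal id for OTCs and cash. The order below is an assumption.
WATERFALL = ["isin", "bloomberg_id", "reuters_ric", "sedol", "cusip",
             "exchange_ticker", "markit_id", "moodys_issue_id"]

def master_key(identifiers: dict) -> tuple:
    """Return (scheme, value) of the first identifier present in waterfall order."""
    for scheme in WATERFALL:
        value = identifiers.get(scheme)
        if value:
            return scheme, value
    raise ValueError("No recognised external identifier; assign an internal OTC id")

print(master_key({"sedol": "B0YQ5W0", "cusip": "037833100"}))   # ('sedol', 'B0YQ5W0')
```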

The attributes of an instrument dimension can be categorised into 9:

  1. Asset Class
  2. Currency
  3. Country
  4. Sector
  5. Issuer
  6. Rating
  7. Maturity
  8. Instrument Identifier
  9. Asset class specific attributes

1. Asset Class

Asset Class is a classification of financial instruments based on its functions and characteristics, e.g. fixed income, equities, cash, commodity. We also have real assets such as land, buildings, physical gold and oil.

It also covers the hierarchy / groupings of the asset classes, hence we have attributes such as: asset class, asset sub class, asset base class. Or alternatively asset class level 1, level 2, level 3. Or asset class, asset type, asset group.

Good starting points for asset class categorisation are ISDA product taxonomy, Barclays index guides, Academlib option pages and Wikipedia’s derivative page. Here is a list of popular asset classes:

FIXED INCOME: Government bond: sovereign, supranational, municipal/regional, index linked, zero coupon, emerging market sovereign, sukuk sovereign. Corporate bond: investment grade, high yield, floating rate note, convertible (including cocos), covered bond, emerging market corporate, sukuk corporate. Bond future: single name bond future, future on bond index. Bond option: single name bond option, option on bond index. Bond forward: single name bond forward, forward on bond index. Credit default swap: single name CDS, CDS index, CDS swaption, structured CDS. Asset backed security (ABS): mortgage backed security (including RMBS and CMBS), ABS (auto, credit card, etc), collateralised debt obligation (CDO), ABS index. Total Return Swap: single name TRS, TRS index. Repurchase agreement: repo, reverse repo.

EQUITY: Cash equity: common shares, preferred shares, warrant, equity index. Equity derivative: equity option (on single name and equity index), equity future (on single name, equity index, and equity basket), equity forward (on single name, equity index, and equity basket), equity swap (on single name and equity index).

CURRENCY: Cash currency: FX spot, FX forward. Currency derivative: cross currency swap.

RATES: Interest rate: interest rate swap, overnight index swap (OIS), interest rate cap, interest rate future, interest rate swaption, forward rate agreement (FRA), asset swap. Inflation rate: inflation swap, inflation swaption, inflation cap, zero-strike floors, inflation protected annuity.

COMMODITY: commodity future: energy (oil, gas, coal, electricity, wind turbine), base metal (copper, iron, aluminium, lead, zinc), precious metal (gold, silver, platinum, palladium), agriculture (grains: corn, wheat, oats, cocoa, soybeans, coffee; softs: cotton, sugar, butter, milk, orange juice; livestock: hogs, cattle, pork bellies). Commodity index (energy, metal, agriculture). Option on commodity future. Commodity forward.

REAL ASSET: Property: Agricultural land, residential property, commercial property. Art: paintings, antique art. Collectibles: fine wine, rare coins, antique cars, jewellery (including watches and precious stones).

FUND: money market fund, equity fund, bond fund, property fund, commodity fund, currency fund, infrastructure fund, multi asset fund, absolute return fund, exchange traded fund.

OTHER: Private equity. Venture capital.

Note on differences between asset class and asset type: asset class is usually a categorisation based on market, i.e. fixed income, equity, cash, commodity and property; whereas asset type is usually a categorisation based on time and structure, i.e. spot, forward, future, swap, repo, ETD, OTC, etc.

Note on overlapping coverage: when constructing the asset class structure, we need to be careful not to make the asset classes overlap with each other. If we do have an occurrence where an instrument can be put into two asset classes, make sure we have a convention for where to put the instrument. For example, an IRS whose legs are in different currencies is called a CCS (Cross Currency Swap). So either we don’t have a CCS asset class and assign everything to IRS (this seems to be the more popular convention), or we do have CCS and make sure that none of the swaps with different currencies is classified as IRS.

2. Currency

For single-legged “hard” instruments such as bonds and equities, the currency is straightforward. For multi-legged, multi-currency instruments such as FX forwards and cross currency swaps, we have two currencies for each instrument. In this case, we either have a column called “currency pair” (value = “GBP/USD”), or two columns marked as “buy currency” and “sell currency”.

For cash instruments, the currency is the currency of the cash. For “cash like” or “cash equivalent” instruments such as CP, CoD and T-bills, the currency is straightforward, inherent in the instrument. For a multi-currency CDS index such as this (i.e. a basket of CDSes with different currencies), look at the contractual currency of the index (in which the premium leg and protection leg are settled), not the liquid currency (the currency of the most liquidly traded CDS).

For derivatives of equities or fixed income, the currency is taken from the currency of the underlying instrument.

3. Country

Unlike currency, which is a true property of the instrument, country is a property of the issuer. There can be three different countries in the instrument dimension, particularly for equities: country of incorporation, country of risk (aka country of operation, country of domicile), and country of listing.

Country of risk is the country where significant business, political or regulatory changes would significantly change the operation of the company which issues this security. This is the most popular one, particularly for portfolio management and the trade lifecycle. It is common for a company to operate in more than one country; in this case the country of risk is the main country (from a revenue/income point of view), or it is set to “Multi-countries”.

Country of incorporation is the country where the issuer is incorporated, not where the holding company (or the “group”) is incorporated. This is used for regulatory reporting, for example FATCA and FCA reporting.

Country of listing depends on the stock market where the equity instrument is listed. So there can be two different rows for the same instrument, because it is listed on two different stock exchanges.

The country of risk of cash is determined by the currency. In the case of Euro instruments (not Eurobonds*) it is usually set to Germany, or Euroland (not EU). *A Eurobond has a different meaning: it is a bond issued in a currency other than the currency of the country where it is issued, e.g. an Indonesian government bond issued in USD.

An FX forward which has 2 different currencies has one country of risk, based on the fixed leg (not the floating leg), because that is where the risk is. The country of risk for a cross currency swap is also based on the fixed leg. For a floating-for-floating CCS, the convention is usually to set the country of risk to the least major currency, e.g. for USD/BRL, USD is more major than BRL, so Brazil is the country of risk. For non-deliverable forwards and CCS (meaning the payment is settled in another currency because the ND currency can’t be delivered offshore), the country of risk is set based on the settlement currency (usually USD).

Like currency, the country of a derivative of an equity or fixed income instrument is taken from the country of the underlying instrument.

4. Sector

These attributes are known with many names: sector, industrial sector, industry sector, or industry. I will use the term sector here.

There can be many sector attributes in the instrument dimension, e.g. Barclays level 1/2/3, MSCI GICS (and S&P’s), UK SIC, International SIC, FTSE ICB, Moody’s sector classification, Factset’s sector classification, Iboxx, etc. They have different coverage. Some are more geared up towards equities, some more towards fixed income.

Cash instruments and currency instruments usually either have no sector (blank), or have their sector set to “cash”. Rates instruments, commodity futures and real assets usually have no sector.

The sector of fixed income derivatives, such as options and CDSes, is determined based on the sector of the underlying instrument. Ditto for equity derivatives.

5. Issuer

All equity and fixed income instruments have issuers. This data is usually taken from Bloomberg, or from the index provider if the position is an index constituent.

All corporate issuers have parents. This data is called Legal Entity data, which can be obtained from Bloomberg, Thomson Reuters, Avox/FT, etc. From the Legal Entity structure (parent-child relationships between companies, or ownership/subsidiaries to be more precise) we can find the parent issuer, i.e. the parent company of the issuer, and the ultimate parent, i.e. the parent of the parent of the parent (… until the top) of the issuer.

Legal entity data is not only used in the instrument dimension. The main use of LE data within an investment bank is for credit risk and KYC (know your customer), i.e. customer due diligence. PS: LEI means Legal Entity Identifier, e.g. BBG Company ID, FATCA GIIN (Global Intermediary Identification Number), LSE’s IEI. But LEI also means the ROC’s Global LEI – the Regulatory Oversight Committee.

6. Rating

Like sector, there are many ratings. Yes, there are only 3 rating providers (S&P, Moody’s, and Fitch), but combined with an in-house rating there can be 15 different permutations of them, e.g. the highest of SMF, the lowest of SMF, the second highest of SMF, the average of SMF, the highest of SM, the lowest of SM, the average of SM, the highest of SMFH, the lowest of SMFH, etc., with S = S&P, M = Moody’s, F = Fitch and H = house rating.

Plus we have Rating Watch/Outlook from the 3 providers. Plus, for CDS, we can have an “implied rating” from the spread (based on Markit CDS prices data).
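
A small sketch of deriving these combined ratings on a common numeric scale (simplified to S&P-style letters; the scale and sample values are made up):

```python
# Highest / lowest / average of the agency and house ratings, on one numeric scale.
SCALE = {"AAA": 1, "AA+": 2, "AA": 3, "AA-": 4, "A+": 5, "A": 6, "A-": 7, "BBB+": 8}
LETTER = {v: k for k, v in SCALE.items()}

def combined(ratings: dict, providers: str) -> dict:
    """providers e.g. 'SMF' or 'SMFH'; S=S&P, M=Moody's, F=Fitch, H=house."""
    values = [SCALE[ratings[p]] for p in providers if p in ratings]
    return {
        "highest": LETTER[min(values)],   # lower number = better rating
        "lowest": LETTER[max(values)],
        "average": LETTER[round(sum(values) / len(values))],
    }

print(combined({"S": "AA", "M": "AA-", "F": "A+"}, "SMF"))
# {'highest': 'AA', 'lowest': 'A+', 'average': 'AA-'}
```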

7. Maturity

Almost all fixed income instruments have a maturity date. Maturity is how far that maturity date is from today, stated in years rather than days. We also have effective maturity, which is the distance in time between today and the nearest call date, also in years.

8. Instrument Identifier

This is the security identifier as explained earlier, i.e. ISIN, Bloomberg ID, Ticker, Markit ID, Sedol, CUSIP, Reuters ID, Moody’s ID.

9. Asset Class Specific Attributes

Each asset class has its own specific attributes.

For CDS we have payment frequency (quarterly or bi-annually), standard coupon payment dates (Y/N), curve recovery (Y/N), recovery rate (e.g. 40%), spread type (e.g. conventional), restructuring clause (Old R, Mod R, Mod-Mod R, No R), fixed coupon convention (100 or 500), succession event, auction settlement term, settlement type (cash/physical), trade compression, standard accruals (Y/N), contract type (e.g. ISDA).

For IRS we have amortising swap flag, day count convention, following convention (Y/N), no adjustment flag (Y/N), cross currency flag (Y/N), buy currency, sell currency, mark-to-market flag, non-deliverable flag, settlement currency.

Debt instruments such as bonds have these specific attributes: coupon type (e.g. fixed, floating), seniority (e.g. senior, subordinated), amortising notional, zero coupon. Funds also have their own specific attributes, such as emerging market flag, launch date, accumulation/income, base currency, trustee, fund type, etc.

Granularity

The granularity of an instrument dimension can be a) one row for each instrument (this is the norm), or b) one row for each leg. The latter is to deal with multi-leg instruments such as CDS (3 legs) and cross currency swaps (2 legs), where the asset class of each leg is different.

If we opt for one row for each instrument, the asset class for each leg needs to be put in the position fact table (or transaction fact table, compliance fact table, risk fact table, etc).

Coverage

There are millions of financial instruments in the market, and throughout the life of an investment bank there can be millions of OTCs created in its transactions. For investment companies, there are a lot of instruments which they held in the past but no longer hold. Coverage of an instrument dimension means: which instruments are we going to maintain in this dimension? Is it a) everything in the market, plus all OTCs, b) only the ones we have ever used, or c) only the ones we have held in the last N years?

We could set the coverage of the instrument dimension to cover all bonds and equities which have ever existed since 1900, but this seems to be a silly idea (because of the cost, and because they are not used), unless we plan to conduct specific research, e.g. analysing quality changes over a long period. The most popular convention is to store only what we have ever used.

SCD

Most instrument dimensions are type 2, but which attributes are type 2 differs from project to project, bank to bank (and non-bank). Most of the sector, rating and country attributes are type 2. Maturity date is type 2, but maturity (which is in years) is either type 1 or kicked out of this dimension into a fact table (the more popular option). Next coupon date and effective date are usually type 1.

Numerical attribute

Numerical attributes such as coupon, rating factor, effective maturity, etc. are treated depending on their change frequency. If they change almost every day, they must be kicked out of the instrument dimension, into a fact table. For example: effective maturity, maturity, holding positions.

If they change a few times a year, or almost never change (static attributes), they stay in the instrument dimension, mostly as type 2. An example of a static numerical attribute is the coupon amount (e.g. 6.25%). An example of a slowly changing numerical attribute is the rating factor (e.g. 1 for AAA and 3 for A), and the associated PD (probability of default, in bps, e.g. 0.033 for AAA, 0.54 for AA, and 10 for BBB), which on average changes once every 1 to 3 years.

The difference between a numerical attribute and a fact is that a numerical attribute is a property of the instrument, whereas a fact is a measurement result.

System Columns

The usual system columns in the instrument dimension are: created datetime, last updated datetime, and the SCD system columns e.g. active flag, effective date, expiry date.

19 January 2016

The ABC of Data Warehousing

Filed under: Data Warehousing — Vincent Rainardi @ 8:52 am

In this article I would like to define the terminologies used in data warehousing. It is a data warehousing glossary, similar to my investing article: The ABC of Investing (link). I won’t cover terms used in business intelligence; I will only cover terms in data warehousing. Terms in bold are defined in this glossary.

This glossary consists of 2 levels. Only the first level is alphabetical. The second level is not. So the best way to use this glossary is by searching the term (Control-F).

People make mistakes and I am sure there are mistakes in this article, so I would be grateful if you could correct me, either using the comments below or via vrainardi@gmail.com.

Why I wrote this: many people working in data warehousing do not understand the standard terminology used. Even the simplest of terms, such as “dimension”, can be a foreign word to them. My intention is to provide a “quick lookup”, enabling them to understand that precise word in about 15 seconds or so.

Why don’t they use internet searches or Wikipedia? Why create another one? Because a) it takes longer to search and find the information, particularly if you are new, b) the pages on search results can be technically incorrect, c) sometimes I have my own view or prefer to emphasise things differently.

  1. Archiving: an approach to remove old data from the fact table and store it in another table (usually in a different database). It is quite common that the old data is simply deleted and not stored anywhere else; the correct term for that is purging.
  2. Corporate Information Factory (CIF): a top-down data warehousing approach/architecture created by W.H. Inmon in 1999. The “Data Acquisition” part loads the data from the Operational Systems into a Data Warehouse (DW) and an Operational Data Store (ODS). The “Data Delivery” part loads the data from the DW and ODS into the Exploration Warehouse, Data Mart, and Data Mining Warehouse. The following are the definitions of the terms used in CIF (quoted from the CIF website). Back in 1999 these concepts were revolutionary, hence my admiration and respect for the author, W.H. Inmon.
    • Operational Systems: the internal and external core systems that support the day-to-day business operations. They are accessed through application program interfaces (APIs) and are the source of data for the data warehouse and operational data store. (Encompasses all operational systems including ERP, relational and legacy.)
    • Data Acquisition: the set of processes that capture, integrate, transform, cleanse, reengineer and load source data into the data warehouse and operational data store. Data reengineering is the process of investigating, standardizing and providing clean consolidated data.
    • Data Warehouse: a subject-oriented, integrated, time-variant, non-volatile collection of data used to support the strategic decision-making process for the enterprise. It is the central point of data integration for business intelligence and is the source of data for the data marts, delivering a common view of enterprise data.
    • Operational Data Store: a subject-oriented, integrated, current, volatile collection of data used to support the tactical decision-making process for the enterprise. It is the central point of data integration for business management, delivering a common view of enterprise data.
    • Data Delivery: the set of processes that enable end users and their supporting IS group to build and manage views of the data warehouse within their data marts. It involves a three-step process consisting of filtering, formatting and delivering data from the data warehouse to the data marts.
    • Exploration Warehouse: a DSS architectural structure whose purpose is to provide a safe haven for exploratory and ad hoc processing. An exploration warehouse utilizes data compression to provide fast response times with the ability to access the entire database.
    • Data Mart: customized and/or summarized data derived from the data warehouse and tailored to support the specific analytical requirements of a business unit or function. It utilizes a common enterprise view of strategic data and provides business units more flexibility, control and responsibility. The data mart may or may not be on the same server or location as the data warehouse.
    • Data Mining Warehouse: an environment created so analysts may test their hypotheses, assertions and assumptions developed in the exploration warehouse. Specialized data mining tools containing intelligent agents are used to perform these tasks.
    • Meta Data Management: the process for managing information needed to promote data legibility, use and administration. Contents are described in terms of data about data, activity and knowledge.
    • Primary Storage Management: the processes that manage data within and across the data warehouse and operational data store. It includes processes for backup and recovery, partitioning, summarization, aggregation, and archival and retrieval of data to and from alternative storage.
    • Alternative Storage: the set of devices used to cost-effectively store data warehouse and exploration warehouse data that is needed but not frequently accessed. These devices are less expensive than disks and still provide adequate performance when the data is needed.
  3. Current Flag: a column in a dimension table which indicates that the row contains the current values (in all attribute columns). Also known as Active Flag.
  4. Data Lake: a data store built on the Hadoop file system (HDFS), capable of storing structured and unstructured data. It has analytic and query tools which enable users to join the various underlying data types into a single output dataset.
  5. Data Mining: the old meaning: an approach to find patterns in the data using certain algorithms (such as clustering and decision trees), and then use these patterns to predict future data values. See also: Machine Learning. The new meaning: an approach to extract certain information from a data warehouse which is valuable to the business. “To mine the warehouse” is a commonly used term.
  6. Data Warehouse: a system that retrieves and consolidates structured data periodically from the source system into a dimensional or a normalised data store. A data warehouse is usually used for business intelligence and reporting.
  7. Data Warehouse vs Data Mart: a data warehouse contains everything for multiple purposes, whereas a data mart contains one data area for one purpose, for example a sales mart or a customer mart. A data mart is always in a dimensional model, whereas a data warehouse can be in a dimensional model or a normalised model.
  8. DW 2.0: a follow-up to the Corporate Information Factory (CIF) concept, which stores unstructured data as well as structured data. The DW 2.0 concept was created by W.H. Inmon in 2008.
  9. Dimensional modelling: a data modelling approach using fact and dimension tables. A dimensional model is also known as a dimensional schema. There are 2 approaches in dimensional modelling: star schema and snowflake schema.
    • Star schema: only one level of dimension tables; each dimension joins directly to the fact table.
    • Snowflake schema: more than one level of dimension tables; dimensions are normalised into sub-dimensions.
  10. Dimensional Data Warehouse: a data warehouse which consists of fact and dimension tables. The primary key of a dimension table becomes a foreign key in the fact table. A dimensional data warehouse can be a star schema or a snowflake schema (a simple example is sketched at the end of this glossary).
  • Fact Table: a table that contains business events or transaction data (time-sensitive data), such as orders and account balances. See also: Fact Table types.
  • Dimension Table: a table that contains “static data” (non time-sensitive), such as customer and product.
  • Bridge Table: a table that defines a many-to-many relationship between two dimensions.
  • Measure: a column in a fact table which contains the numbers we want to analyse. For example: sales amount per city. Sales amount is a measure.
  • Attribute: a column in a dimension table which we can use to analyse the numbers. For example: sales amount per city. City is an attribute.
  • Surrogate Key: the primary key column of a dimension table. It becomes a foreign key column in the fact table.
  • Business Key: the primary key of the source table in the transaction system. This becomes the identifier in the dimension table.
  11. Enterprise Data Warehouse (EDW): a normalised database containing data from many departments (more than one division), sourced from several source systems. An EDW stores historical transactional data as well as historical attribute values. The EDW was pioneered by W.H. Inmon. See also: Normalised Data Warehouse.
  12. ETL (Extract, Transform, Load): the process of extracting data from the source systems, transforming it, and loading it into the data warehouse.
  13. Fact Table types: there are 3 types of fact table.
  • Transaction Fact Table: a fact table that stores the measure values of each business event at the time the business event happened. In a transaction fact table, each business event is stored only once, as one row.
  • Periodic Snapshot Fact Table: a fact table that contains the values of each measure taken at a regular interval. In a periodic snapshot fact table, each business event is stored multiple times. A Periodic Snapshot Fact Table is also known as a Snapshot Fact Table. See also: Snapshot Fact Table.
  • Accumulative Snapshot Fact Table: a fact table in which the timings and statuses from different points in time are stored as different columns on the same row. Each row describes one particular customer (or other dimension member) and is updated as the process progresses.
  14. Machine Learning: an approach to find patterns in the data using certain algorithms, and then use these patterns to predict future values. In my opinion, Machine Learning is the new term for the old meaning of Data Mining.
  15. MPP Server (Massively Parallel Processing): a type of database server where a query is broken down into multiple streams. Each stream is executed concurrently on a different node. The outputs from the streams are then combined into one and passed back to the user. This architecture is also known as a Shared Nothing architecture. Examples of MPP servers are: Teradata, Netezza, Azure SQL Data Warehouse, and Greenplum.
  16. Normalised Data Warehouse: a data warehouse which consists of transaction tables, master tables, history tables, and auxiliary tables, in first, second or third normal form.
  • Transaction table: a table that contains business events or transaction data (time-sensitive data), such as orders and account balances.
  • Master table: a table that contains “static data” (non time-sensitive), such as customer and product. It contains today’s values.
  • History table: a table that contains “static data” (non time-sensitive), such as customer and product. It contains historical values.
  • Auxiliary table: a table which contains data that describes the codes in a master table. For example: customer type and currency code.
  • Bridge table: a table which implements a many-to-many relationship between a transaction table and a master table, or between two master tables.
  17. OLAP (Online Analytical Processing): a process of interrogating a multidimensional database to explore the data and find patterns. See: Multidimensional Database.
  18. OLAP Cubes: a compressed, in-memory, multidimensional database.
  19. Operational Data Store (ODS): a normalised database containing transaction data from several source systems (it is an integration point). It is similar to an Enterprise Data Warehouse (EDW) but only contains the current values of the attributes. Of course it contains the historical values of the transactions.
  20. Multidimensional Database (quoted from my book, Building a Data Warehouse): a form of database where the data is stored in cells and the position of each cell is defined by a number of hierarchies called dimensions. The structure stores the aggregate values as well as the base values, typically in compressed multidimensional array format.
  21. Slowly Changing Dimension (SCD): an approach to preserve the historical/old values of attributes in the dimension tables (a type 2 example is sketched at the end of this glossary). There are 5 types of SCD:
  • SCD Type 0: the attribute is static and the values never change.
  • SCD Type 1: the old attribute value is overwritten by the new attribute value. The old values are not preserved; only the latest value is stored.
  • SCD Type 2: the old attribute values are stored in a different row from the new attribute values. In the dimension table, each version of the attribute values is stored as a different row. The row containing the latest values is marked, e.g. with a current flag.
  • SCD Type 3: the old attribute values are stored in additional columns in the dimension table. Only the last two to three values are stored. Type 3 is usually used when most of the rows in the dimension table change their values together at the same time.
  • SCD Type 4: (as implemented by SQL Server 2016 temporal tables, link) the old attribute values are stored in a separate table called the history table. Only the latest value is stored in the dimension table.
  22. Slowly Changing Dimension: a dimension which has type 2 attributes.
  23. Snapshot Fact Table: a fact table that contains the value of each measure at a particular point in time, for example the balance of every account at the end of the day. There are two types of snapshot fact table, periodic and accumulative, but when people say “Snapshot Fact Table” they usually mean periodic. See also: Fact Table types.
  • Daily Snapshot: contains the value of each measure at the end of every day.
  • Weekly Snapshot: contains the value of each measure at the end of every week.
  • Monthly Snapshot: contains the value of each measure at the end of every month.
  24. Staging table: a table in a data warehouse which contains the raw data from the source system. This raw data will be processed further and loaded into either a fact table or a dimension table.
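
To tie several of the terms above together (fact table, dimension table, measure, attribute, surrogate key, business key, current flag, periodic snapshot, SCD type 2), here is a minimal star schema sketch in T-SQL. The table and column names are illustrative only, not part of any standard.

  -- Dimension table with SCD type 2 system columns
  create table dim_customer
  ( customer_key   int identity(1,1) not null primary key, -- surrogate key
    customer_id    varchar(20)  not null, -- business key from the source system
    customer_name  varchar(100) not null,
    city           varchar(50)  not null, -- attribute (type 2)
    is_current     bit          not null, -- current flag
    effective_date date         not null,
    expiry_date    date         not null
  );

  -- Periodic (daily) snapshot fact table; the surrogate key becomes a foreign key here
  create table fact_account_balance_daily
  ( snapshot_date  date          not null,
    customer_key   int           not null references dim_customer(customer_key),
    account_id     varchar(20)   not null,
    balance_amount decimal(18,2) not null -- measure
  );

  -- A type 2 change: expire the current row, then insert a new row with the new value
  update dim_customer
  set    is_current = 0, expiry_date = '2016-01-18'
  where  customer_id = 'C001' and is_current = 1;

  insert into dim_customer
    (customer_id, customer_name, city, is_current, effective_date, expiry_date)
  values
    ('C001', 'Customer One Ltd', 'Manchester', 1, '2016-01-19', '9999-12-31');

In a snowflake schema, city would be moved into its own dimension table referenced from dim_customer; a star schema keeps it denormalised as above.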

14 January 2016

Temporal Tables in SQL Server 2016

Filed under: Data Warehousing,SQL Server — Vincent Rainardi @ 8:40 am

The temporal table feature in SQL Server 2016 is a breakthrough in data warehousing, because we no longer need to do anything to store the historical values. This cuts down the development effort of a data warehouse by approximately 15% (half of the ETL development effort, which is about 30% of the total; the other 70% is analysis, design, database development, BI/report development and management).

Usage scenarios: https://msdn.microsoft.com/en-gb/library/mt631669.aspx

  • Data audit/forensic
  • Point-in-time analysis (time travel)
  • Data anomaly detection
  • Slowly changing dimension (type 4!)
  • Repairing data corruption at row level
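
To give an idea of what this looks like in practice, below is a minimal sketch of enabling system versioning on a table. The dbo.Customer table and its columns are my own illustrative example, not taken from the Microsoft article; the SYSTEM_VERSIONING syntax is SQL Server 2016’s.

  -- A system-versioned (temporal) table: SQL Server maintains the history table for us
  create table dbo.Customer
  ( CustomerId   int          not null primary key,
    CustomerName varchar(100) not null,
    City         varchar(50)  not null,
    ValidFrom    datetime2 generated always as row start not null,
    ValidTo      datetime2 generated always as row end   not null,
    period for system_time (ValidFrom, ValidTo)
  )
  with (system_versioning = on (history_table = dbo.CustomerHistory));

  -- A normal update; the old row version is written to dbo.CustomerHistory automatically
  update dbo.Customer set City = 'Leeds' where CustomerId = 1;

There is no ETL and no trigger to write: the automatic versioning above is what removes the historical-value part of the ETL effort.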

Up to now we have always had to build a data warehouse, which stores a) the historical changes of attribute values from the source system, and b) fact snapshots over time. With temporal tables, we no longer need to create a data warehouse. We can do a) and b) above within the source system itself, as demonstrated by the above article. This effectively kills the idea of creating a data warehouse, increasing the time saved from 15% to 95%*.

*The remaining 5% of effort reflects the extra time required to set up the temporal tables in the source system’s SQL Server, along with memory-optimized tables to store the current data in memory and the full history of changes on disk. That arrangement gives us the optimum balance between performance and cost.

The above article (https://msdn.microsoft.com/en-gb/library/mt631669.aspx) demonstrates that using temporal tables we can do SCD, and we can do snapshot fact tables (called Time Travel). The SCD that temporal tables do is not type 2, but type 4. Type 4 is an old technique where the data changes are kept in a separate history table. This keeps the original table small and compact, as it only contains the latest version.
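
Continuing the illustrative dbo.Customer table above, this is the type 4 and “time travel” behaviour in query form: the main table only ever holds the latest version, the history table holds the old versions, and FOR SYSTEM_TIME reconstructs the data at a point in time.

  -- Point-in-time ("time travel") query: the table as it was on 1 January 2016
  select CustomerId, CustomerName, City
  from   dbo.Customer
         for system_time as of '2016-01-01'
  where  CustomerId = 1;

  -- The type 4 history table can also be queried directly for the old versions
  select CustomerId, City, ValidFrom, ValidTo
  from   dbo.CustomerHistory
  where  CustomerId = 1
  order  by ValidFrom;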

 

 
