Data Warehousing and Data Science

24 January 2023

What is Data Architecture?

Filed under: Data Architecture — Vincent Rainardi @ 10:53 pm

What is data architecture? It is a description of how and where the data is sourced, integrated, stored, arranged, moved, accessed, used and shared.

It includes data models, data standards, data storage, data infrastructure, data sourcing, data ingestion, data processing, data security, data access, data sharing, data availability, data mapping, and data quality.

How the data is used is not part of data architecture. So data queries, data analytics, reports, machine learning models, BI applications and business applications are not part of data architecture.

What is the difference between data architecture and data governance?

Data governance is the set of methodologies and practices for managing data in a company/organisation in order to achieve high data quality, clear data ownership, appropriate data usage/analytics, strong data security, regulatory compliance, efficient processes and lower cost.

It includes data policies, data standards, data access, data calculations, data dictionary/definition, data ownership, data lineage and data practices.
It does not include data models, data ingestion mechanism, data storage, data infrastructure, data analytics, reports or ML models.

The difference between data architecture and data governance is: data architecture is about the systems that host and move the data, whereas data governance is about how the data should be used and managed. Data architecture is about the build/system that hosts the data, whereas data governance is about managing the data itself.

For example, data architecture defines the structure of the client holdings table in the data lake, whereas data governance defines the calculations for the client’s AUM and who should be allowed to access it.
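To make the example concrete, below is a minimal sketch of the kind of thing the data architecture side would define: the structure of a client holdings table (the table and column names here are made up for illustration). Data governance would then define how the AUM is calculated from these columns, and who is allowed to query them.

CREATE TABLE client_holding
( client_id      INT            NOT NULL  -- which client owns the position
, instrument_id  INT            NOT NULL  -- which instrument is held
, holding_date   DATE           NOT NULL  -- valuation date of the position
, quantity       DECIMAL(19,4)  NOT NULL  -- number of units held
, market_value   DECIMAL(19,4)  NOT NULL  -- value in the portfolio currency
, CONSTRAINT pk_client_holding PRIMARY KEY (client_id, instrument_id, holding_date)
);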

13 November 2017

Numerical Attributes

Filed under: Data Architecture,Data Warehousing — Vincent Rainardi @ 7:24 pm

Should we put numerical attributes in the dimension table or in the fact table? Numerical attributes are static attributes of an entity that happen to be numeric rather than textual. For example: the interest rate of a swap, or the weight of a product.

I usually follow the textbook approach and put them in the dimension. But for the numerical attributes of OTC derivative instruments, we should put them in the fact table.

The fact table contains daily (and intraday) valuations, such as price, market value and the “Greeks”. An interest rate swap usually has a fixed leg and a floating leg. For example, receive 0.886% pay EUR 3m + 0.567%. So we will get a fixed interest rate of 0.886% and pay our counterparty the rate of EURIBOR 3m plus 0.567% (currently EURIBOR 3m is -0.329%).

These two attributes (the 0.886% and 0.567%) are part of the interest rate swap contract and are fixed for the duration of the contract (until maturity). When it reaches the maturity date, the contract ends and if we want to extend it, we will normally make a new contract, with a new SEDOL. Hence these 2 attributes are essentially static attributes (they are called receive interest rate and pay interest rate).

Because they are static they should be put into the dimension, which in this case is the security dimension or instrument dimension, rather than the fact table, which in this case is the instrument valuation fact table.

But I recommend putting them into the fact table for 4 reasons:

  1. Incorrect business judgement (a simple “sweep all” rule, and if we are wrong then it’s not too bad)
  2. Bytes
  3. Floating
  4. Calculation

1. Incorrect business judgement

There are many types of OTC derivative instruments, e.g. Interest Rate Swaps, Credit Default Swaps, FX Options, Equity Index Swaps, CDS Options, Asset Swaps, Total Return Swaps, etc. There are about 15 of them, each with different numerical attributes. So in total there are about 50 or so numerical attributes.

We could sit down with the business and categorise these 50 numerical attributes into two:

  • A measure or fact: meaning they are potentially changing every day
  • A dimension attribute: meaning that they are static and never change

But the business judgement could be wrong. Something which they think is static could be changing. Here are some of those 50 attributes and imagine you sit down with the business categorising them into the two categories above:

Strike price interval, put option premium, strike price, intrinsic value, market value, fair value, time value, put call ratio, put call parity, recovery rate, expiration value, underlying asset value, fixed interest rate, floating interest rate margin, nominal amount, notional value, pay duration, pay/receive spread, pay/receive margin, credit spread.

In addition to the input from the business people, it can be useful to check the source system for historical changes. Does the value of that column/field ever change? If yes, how often?

If we miscategorise them, there are two things which can happen:
a) A measure is miscategorised as an attribute
b) An attribute is miscategorised as a measure

Generally speaking, b) is fine but a) has quite serious consequences, because a) means that 1 instrument in the instrument dimension can have 100 rows, even 1,000 rows. And if we have 300,000 instruments (covering both active and non-active positions), the instrument dimension will become very big, up to 300 million rows, making it slow.

The reason an instrument can have 1000 rows is that the attribute is actually a measure. We think it is static, but it actually changes every day. Or even several times a day.

On the other hand if an attribute is miscategorised as a measure, that attribute will be put in the fact table. And because it is numeric, it does not take a lot of bytes, even if it is repeated every day.

So to be safe, if you are not sure, put it as a measure in the fact table. The setup of an instrument can take a few weeks, particularly if it is a swap. During this period the instrument rows are continuously updated. If we have 20 numerical attributes for a swap, during this few-week period we could get 20 dimension rows: on day 1 only 5 attributes are set up, on day 2 another 5, on day 3 another 5, and on day 4 they change the attributes set up on day 1, and so on. So the whole thing can take 3 weeks, producing 20 rows. If these attributes are in the fact table instead, a) we will only get 1 version a day, and b) we prevent the instrument dimension from ballooning into many millions of rows.
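As an illustration of this "sweep all" rule, here is a rough sketch (column names are hypothetical, not a prescribed design) of an instrument valuation fact table where the swap's receive and pay rates sit alongside the daily measures, so a change in them never creates a new dimension row:

CREATE TABLE fact_instrument_valuation
( snapshot_date_key      INT            NOT NULL  -- partitioning date of the fact table
, instrument_key         INT            NOT NULL  -- FK to the instrument dimension
, price                  DECIMAL(19,8)  NULL      -- daily measures
, market_value           DECIMAL(19,4)  NULL
, receive_interest_rate  DECIMAL(9,6)   NULL      -- "static" swap attributes kept here as measures
, pay_interest_margin    DECIMAL(9,6)   NULL      -- so an unexpected change only adds a daily fact row
, CONSTRAINT pk_fact_instrument_valuation PRIMARY KEY (snapshot_date_key, instrument_key)
);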

2. Bytes

Numerical attributes take up far fewer bytes / much less disk space than textual attributes. Therefore putting numerical attributes in the fact table does not slow things down, compared to the risk of miscategorising them into the instrument dimension, which would cause the dimension to "explode".

3. Floating

If the attribute value is 0.00352281 and it is a float in the source system, it can be stored as 0.0035228100000000001. And it is possible that the value on the following day is 0.0035228100000000000, just because of rounding in the floating point processor. Perhaps it is derived from a calculation of A/B. If it changes from ...00001 to ...00000 and then back to ...00001 the next day, we have a phenomenon called "Flip Flop". Flip Flop is where a dimension attribute keeps changing values (usually twice a day), for example because it is updated from 2 different systems. This causes the number of rows in the dimension to "explode".
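If such a floating point attribute does end up in the dimension, one mitigation is to detect changes using a tolerance rather than strict equality, so rounding noise does not create new versions. A rough sketch in SQL Server style syntax (table and column names are hypothetical):

UPDATE dim
SET    dim.receive_interest_rate = src.receive_interest_rate
FROM   dim_instrument AS dim
JOIN   staging_instrument AS src ON src.instrument_id = dim.instrument_id
WHERE  ABS(dim.receive_interest_rate - src.receive_interest_rate) > 0.0000001;  -- ignore rounding noise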

4. Calculation

The last argument is that numerical attributes might be required for calculations, for example column1 divided by column2, or column1 + column2 + column3. If all the columns involved in the calculation are stored in the fact table, we can perform the calculation without joining to any dimension. We just need to filter on the partitioning date of the fact table, which is very fast.
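For example, a spread between two numerical attributes can be computed from the fact table alone, filtering only on the partition date (a sketch reusing the hypothetical fact table above):

SELECT instrument_key,
       receive_interest_rate - pay_interest_margin AS rate_spread  -- both columns live in the fact table
FROM   fact_instrument_valuation
WHERE  snapshot_date_key = 20171113;  -- partition date filter only, no dimension join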

Conclusion

I am not advocating storing all numerical attributes in the fact table in all cases. Only in the case of OTC instruments would I recommend putting the numerical attributes in the fact table, because of the 4 reasons above.

But in most cases, the best approach is usually to put the numerical attributes in the dimension. For example: the dimensions of a product (width, length and height) and its weight. Yes, they do change from time to time, but not daily, hence we want to record these changes as SCD type 2 in the product dimension.

12 November 2017

Relational vs Non Relational Databases

Filed under: Data Architecture — Vincent Rainardi @ 6:35 am

Most of us database professionals are familiar with relational databases like Oracle and SQL Server. This is understandable because most companies use relational databases. But we also have graph databases, object databases, document databases, key-value databases and columnar databases. All of these are known as non relational databases. They are also called NoSQL databases.

So let’s find out what a relational database is, and what a non relational database is. A data architect should understand non relational databases, even though they are not as popular as relational databases.

A. Relational Databases

A relation means a table, which consists of columns (aka attributes) and rows (aka tuples). More formally, a relation is a header with a body: the header is a set of attribute names with their constraints, and the body is a set of tuples whose values are drawn from the attributes' domains.

A table is called a relation because it is a relation between the header and the body, between the attributes and the tuples. For example: the Employee relation consists of a header with 2 attributes: Name and Date of Birth. The Name attribute can contain any alphanumeric value up to 10 characters. The Date of Birth attribute can contain any valid date. So "Andy", "Bob" and "Carol" relate to the Name attribute, and "25 Dec 2000" and "1 Jan 1995" relate to the Date of Birth attribute. That is why a table is called a relation, and the whole thing is called the Relational Model, introduced by Edgar F. Codd in 1969.
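As a minimal sketch, the Employee relation above could be declared and populated like this in SQL:

CREATE TABLE Employee
( Name        VARCHAR(10) NOT NULL  -- alphanumeric values up to 10 characters (the domain)
, DateOfBirth DATE        NOT NULL  -- any valid date
);

INSERT INTO Employee (Name, DateOfBirth)
VALUES ('Andy', '2000-12-25'), ('Bob', '1995-01-01');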

A relational database is a database where the data is stored as values in tables. Data in a relational database is retrieved using a query language called SQL (Structured Query Language). SQL is also used to add and update the data.

A1. Relvar, attribute value, database schema

Strictly speaking, a relation is not quite the same as a table. In relational theory a relation is a value: the set of tuples (the data) at a point in time. A relation variable, called a relvar, is the named variable (what SQL calls the table) whose value is a relation and which changes over time as data is inserted, updated and deleted.

An attribute means a column. An attribute value means the value of an attribute for a particular tuple (row).

A database schema is a set of relations (tables), the header (columns) associated with those relations, and their predicates (constraints).

A2. Constraints

A constraint can take one of several forms:

  • Domain constraint is a restriction on the type of values that an attribute can have, e.g. numbers, dates, or alphanumeric. Today a domain constraint is known as a data type.
  • Data constraint is a restriction on the possible values that an attribute can have, e.g. only 1 to 10. Today a data constraint is known as a check constraint.
  • Key constraint is a way to identify a tuple using the values of certain attributes.

There are several types of key constraints (a SQL sketch of these constraints follows the list below):

  • A superkey is a set of attributes whose values uniquely identify a tuple (row); no two tuples can have the same superkey value.
  • A foreign key is an attribute (or set of attributes) which identifies a tuple in another relation.
  • A candidate key is a minimal superkey, i.e. a set of attributes which identifies a tuple with nothing redundant. The candidate key chosen for implementation is known as the primary key.
  • An alternate key is a candidate key which is not chosen as the primary key.
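Below is a minimal SQL sketch of these constraint types, extending the Employee example (the tables and columns are made up for illustration):

CREATE TABLE Department
( DeptId   INT         NOT NULL PRIMARY KEY
, DeptName VARCHAR(50) NOT NULL
);

CREATE TABLE Employee
( EmpId       INT        NOT NULL PRIMARY KEY                     -- the candidate key chosen as primary key
, NiNumber    VARCHAR(9) NOT NULL UNIQUE                          -- alternate key: a candidate key not chosen
, Grade       INT        NOT NULL CHECK (Grade BETWEEN 1 AND 10)  -- data (check) constraint
, DeptId      INT        NOT NULL REFERENCES Department (DeptId)  -- foreign key to another relation
, DateOfBirth DATE       NULL                                     -- domain constraint = data type
);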

A3. Query

A query means getting data from a set of relations in the database. The output of a query is called an answer, or a result set.

A query can do several operations:

  • Restrict means limiting the rows returned on the result set.
  • Project means limiting the columns returned on the result set.
  • Product means combining every tuple in one relation with every tuple in another relation (the Cartesian product).
  • Join means doing a product and then filtering the result to only the tuples we want.
  • Union means combining two relations with the same structure.
  • Difference means listing the tuples in one relation which are not found in another relation.
  • Intersect means listing the tuples found in both relations.

We also have other operations: outer join, outer union, aggregation (e.g. sum), group and ungroup.
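As a sketch, here is roughly how these operations look in SQL, assuming the hypothetical Employee and Department tables above plus a Contractor table with the same structure as Employee:

SELECT Name FROM Employee WHERE DeptId = 10;     -- restrict (WHERE) and project (column list)

SELECT e.Name, d.DeptName
FROM   Employee e
JOIN   Department d ON d.DeptId = e.DeptId;      -- join = product, then keep only matching tuples

SELECT Name FROM Employee
UNION
SELECT Name FROM Contractor;                     -- union of two relations with the same structure

SELECT Name FROM Employee
EXCEPT
SELECT Name FROM Contractor;                     -- difference (MINUS in Oracle)

SELECT Name FROM Employee
INTERSECT
SELECT Name FROM Contractor;                     -- intersect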

A4. What A Relational Database Is

So a Relational Database is a database in the form of tables (relations), which consist of columns (attributes) and rows (tuples). The database is created, populated and queried using the SQL language. Each column (attribute) has a domain (data type). Each table has an identifier (a primary key) and can have foreign keys. In SQL we have join, product, union, project, restrict, difference and intersect.

Relation means table. The word "Relational" here refers to the relation between the Table Header (the data structure) and the Table Body (the data content), not between one table and another table. It is based on the Relational Model by Edgar F. Codd (1969).

SQL Server, Oracle, MySQL, DB2, Sybase, Teradata, Informix are examples of Relational Databases.

In a relational database, the data domain (data type) of each column must be known before we can store the data, and the data structure must be designed before we can load data into the database.

B. Non Relational Databases (aka NoSQL Databases)

Non Relational Databases are databases where the data is not stored in the form of tables, i.e. with the data attributes as the table header and the data in the table body. The attributes can be stored in the table body, as a node, or as a file; or the data might not have attributes at all. The data might not be numbers, text and dates, but images, videos and documents. The data could be DNA sequences, social networks, or chemical molecules. Storing them as text and numbers might require a lot of space, or might not suit how the data is used/queried.

A Non Relational Database is also known as a NoSQL database, which stands for Not Only SQL (the "No" does not mean no here, it means not only).

There are several types of Non Relational Database. The 5 most famous ones are #1 to #5 below (#6 no longer exists as a stand-alone product and #7 is a combination of #1 to #5):

  1. Key-Value Pair Database
  2. Document Database
  3. Graph Database
  4. Object Database
  5. Columnar Database
  6. Hierarchical Database
  7. Multi Model Database

 

B1. Key-Value Database

The most widely used Non Relational Database is the Key-Value database, aka Key-Value store. It consists of records with just 2 fields: a Key field and a Value field. It is very simple. Because there is no data type, we can store anything in there. One benefit of using a Key-Value database is that we do not need to define the data structure (the schema). Another benefit is that it can be distributed across many nodes (file servers), so the performance is highly scalable. But the winning benefit is the money: it can store a large amount of data at a much lower cost than a relational database.

A Key-Value database is ideal for storing large amounts of unstructured data. In some implementations of the Key-Value database we can store not only strings but also lists of strings. In other implementations the Key-Value database is in-memory, boosting query performance.

The main disadvantage of Key-Value databases is that data consistency is not guaranteed immediately; they are typically eventually consistent, so a read shortly after a write may return stale data.

Examples of Key-Value Databases are Redis, Aerospike, Riak KV, Berkeley DB, LevelDB and Apache Ignite. Amazon DynamoDB, Azure Cosmos DB and Oracle NoSQL are not only Key-Value Databases, they are also Document Databases, which is why they are classified as Multi Model Databases (see point B7 below).

Hadoop essentially stores Key-Value data, and MapReduce operates on Key-Value data stored in Hadoop. But Hadoop, unlike Redis or Aerospike, is not a Key-Value database. Hadoop is not a database system at all; it is merely a file system (HDFS).

B2. Document Database

A Document Database stores JSON and XML documents. It does not store binary files such as videos, PDFs and images. The word "document" in Document Database means JSON and XML, not the files we normally work with in the office. Microsoft Office files such as xlsx, docx, pptx (2007 and above) are XML files, but they are zipped so they become binary. So Excel and Word files are not what we store in a Document Database. This is why some people prefer to call it a "Document-oriented Database".

Why is it important to store JSON and XML files? Because they contain data, such as loan, claim or patient data; it's just that the format is JSON or XML. Data is widely stored as text files, i.e. CSV or tab delimited, but the limitation is that such files are not validated. If a row in a CSV file is missing a field, it causes problems when we try to import it using an ETL tool. So we prefer XML files. Using a Schema Definition (XSD), the maker of the file can automatically validate the file, e.g. what the data type of each field is, what values are permitted, etc. So the consumer of the file is guaranteed that the file is "good".

That was in 2012, when XML was the de facto standard. By 2017 JSON had practically replaced XML and become the de facto standard. People prefer putting data into JSON files rather than XML files because JSON files are smaller and faster and easier to process. And JSON files are easier to read than XML (link, link). You can't win against these 3 arguments.

The power of putting data in JSON documents is the hierarchical navigation. If we put a product hierarchy into JSON, we can query all nodes under product A up to 3 levels deep, or all the siblings of product A under the same grandparent. Hierarchical queries like this are the main advantage of using a Document Database compared to a Relational Database.

In a Document Database, the data structure is contained within the document. Whether we use XML files or JSON files, the data structure is inside the data file. So we don’t need to define the data structure in advance, like in a Relational Database.

A Document Database enables us to update a JSON document (or XML), adding data to it or deleting data from it. If we want to add a new field, we just add it to the JSON document. There is no schema or data structure that we need to modify, unlike in a Relational Database.

Examples of Document Database are: MongoDB, RavenDB, Apache CouchDB, RethinkDB, RaptorDB, JasDB, IBM Cloudant.

B3. Graph Database

In a Graph Database, data is stored as nodes connected to other nodes by lines. For example, in a Graph Database containing social network data, a node can be a person, a company, a city or a place (e.g. a restaurant, a shop, a school). A line shows the relationship between 2 nodes. A line can have a time property, which shows when the relationship occurred.

A line between 2 persons can be: “is a friend of” or “is a colleague of”. A line between a person and a company can be “is an employee of” or “is a customer of”. A line between a place and a person can be: “had dinner in” or “bought something in”. The time property of a line can be “on 3rd March 2016”, “from 4th April 2016 to 5th May 2017” or “every Monday”.

A property of a “person” node can be name, date of birth, address. A property of a “place” node can be name, type of place and rating.

In the case of a hospital, the node can be patient, healthcare professional, medication, ward and illness. In the case of an electricity company, the node can be customer, product, tariff, engineer and bill.

In the case of social media, examples of the queries on a graph database are:

  1. Who has liked "Amari Noodle Bar" in the last 3 months?
  2. The friends of the people who like "Amari Noodle Bar" (up to 3 hops) – where do they live (just street / post code, not the house number)?
  3. What are the education backgrounds of people who have ever worked for both IBM and Microsoft?

As you can see from these queries, a Graph Database is a very powerful tool to have, if it is populated with good data. It is very different to a relational database.

Examples of Graph Database are Neo4J, GraphBase, Gephi, and InfiniteGraph.

B4. Object Database

An Object Database stores data as objects. By objects I mean classes in Java or C#.NET. For example: a customer is an object, with name, date of birth and address as the attributes. The Java code defines this customer object, populates it for 10 customers, then stores these 10 customers in the Object Database (each customer is called an "instance", and the customer itself is a "class"). The Java code can then update the address of customer #5, delete customer #3 and insert customer #11, all in 1 database call. It can also copy customer #6 into a new customer with all attributes having the same values except the name.

To query an Object Database, we can use JPA (the Java Persistence API) and its query language JPQL, the object-oriented counterpart of SQL. Like SQL, JPQL uses SELECT, FROM, WHERE and GROUP BY.

For example: SELECT C.Name, C.Address.City FROM Customer as C WHERE C.DateOfBirth.Month = “December”

About the "C.Address.City" above: because it is an Object Database, Address is an object which has attributes such as Street, City and Post Code. JPQL also has JOIN, ORDER BY and subqueries. It has UPDATE and DELETE (but no INSERT), although this is not the recommended way to update objects.

Object Databases are not popular because object oriented languages such as Java and .NET store their object data in Relational Databases. The OO language has an interface to save objects as tables in Relational Databases. This interface is called Object-Relational Mapping (link). For example, .NET has a TableAdapter (link), which enables us to do this:

CustomerTableAdapter.Insert(Customer.Name, Customer.DateOfBirth, Customer.Address)

And to update or delete, just replace the word Insert above with Update or Delete. To query, we use the standard SQL language, because the data is stored in the Customer table in a Relational Database. This makes querying a lot easier, especially when aggregating data.

Examples of Object Databases are: ObjectDB, GemStone, Ignite ObjectStore, Ekky ObjectDatabase++.

B5. Columnar Database

Columnar Databases store the data in tables. But instead of storing it as rows, the data is stored as columns. For example, suppose we have a table of 1 million orders, with 3 fields: date, product and quantity. The order table in the Columnar Database looks like this (the numbers after each value are the order IDs):

Date: 1/11/2017: 1,2,3,4,5. 2/11/2017: 6,7,8,9. 3/11/2017: …
Product: Product1: 1,3,4,6,7,15. Product2: 2,5,8,9,10. Product3: …
Quantity: 1: 1,2,3,5,6,9,10. 2: 4,7,8,11,12,13. 3: …

Because orders with the same date are stored together, orders with the same products are stored together, and orders with the same quantity are stored together, then when we query how many orders are for Product2 and date = 2/11/2017 and quantity = 1, it will be very fast to return the result.
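As an aside, column-oriented storage is also available inside some familiar RDBMS engines; for example, SQL Server offers columnstore indexes. A rough sketch (not specific to any of the columnar databases listed below):

CREATE TABLE OrderFact
( OrderId   INT  NOT NULL
, OrderDate DATE NOT NULL
, ProductId INT  NOT NULL
, Quantity  INT  NOT NULL
);

-- store the table column by column instead of row by row
CREATE CLUSTERED COLUMNSTORE INDEX ccx_OrderFact ON OrderFact;

-- this kind of filter-and-count query is what columnar storage is good at
SELECT COUNT(*)
FROM   OrderFact
WHERE  OrderDate = '2017-11-02' AND ProductId = 2 AND Quantity = 1;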

Examples of Columnar Databases are: Greenplum, MariaDB AX, MonetDB, InfiniDB, Apache Kudu.

B6. Hierarchical Database

In a Hierarchical Data Model the data is organised like a tree. Today there is no Hierarchical Database as a stand-alone commercial product; hierarchical data models are now mainly implemented in Document Databases, such as MongoDB (see B2 above). In a Document Database we can do hierarchical navigation on the data, i.e. query the data hierarchically. For example, in a product hierarchy we can query all nodes under product A up to 3 levels deep, or all the siblings of product A under the same grandparent.

B7. Multi Model Database

Multi Model Databases can store data as Key-Value, Document, Graph, Object or Columnar. It does not have to be all five formats; two formats are enough to be called Multi Model. Most Multi Model NoSQL databases can store data as Key-Value and Document. AWS arguably has the most complete offering; it can store data as Key-Value, Document, Object and Columnar.

Examples of Multi Model Databases are: Oracle NoSQL, Amazon DynamoDB, Azure Cosmos DB, MUMPS and OrientDB.

Reference:

  1. Paraha Sarathi, A Deep Dive into NoSQL: A complete list of NoSQL databases, link
  2. Josh Wyse, Why JSON is better than XML, link
  3. W3Schools, JSON vs XML, link
  4. Couchbase, Comparing document-oriented and relational data, link
  5. org, Definition of Object DBMS: link
  6. Wikipedia, NoSQL databases, link
  7. Wikipedia, Column-oriented DBMS, link

 

24 April 2017

Choosing between Big Data and Data Warehousing

Filed under: Data Architecture,Data Warehousing — Vincent Rainardi @ 4:14 am

Suppose we have 100 files, each containing 10 million rows, that we need to load into a repository so that we can analyse the data. What should we do? Do we put them into Hadoop (HDFS), or into a database (RDBMS)?

Last week I defined the difference between Big Data and Data Warehousing as: Big Data is Hadoop, and Data Warehousing is RDBMS. See my article here: link. Today I would like to illustrate using an example where we need to choose between the two.

There are 4 consideration factors:

  1. Data Structure
  2. Data Volume
  3. Unstructured Data
  4. Schema on Read

1. Data Structure: Simple vs Complex
If all 100 files have the same structure, e.g. they all consist of the same 10 columns, then it is better to put them into Hadoop. We can then use Hive, Spark, Presto, R or Python* to analyse the data, for example to find patterns within the data, do statistical analysis, or create forecasts. The development time will be shorter, because it is only 1 layer.
*or Phoenix, Impala, BigSQL, Stinger, Drill

If the 100 files contain 100 different tables, it is better to put them into a database, create a data warehouse, and use an Analytics/BI tool such as Tableau or QlikView* to analyse the data, for example to slice and dice the data, find percentages or anomalies, and do time series analysis. Yes, we need to create 3 layers (staging, 3NF, star schema), but it enables us to analyse each measure by different dimensions.
*or Looker, PowerBI, MicroStrategy, BusinessObjects, Cognos BI, Birt, Pentaho, Roambi, SAS, Sisense, or other BI tool

So if the data structure is simple, put it into Hadoop, and if the structure is complex, put it into a data warehouse. This is the general rule, but there are always exceptions. Can data with a simple structure be put into a data warehouse? Of course it can. Can data with a complex structure be put into Hadoop? Of course it can.

Using Hadoop and Hive/Spark/Presto we can also slice and dice, find percentages or anomalies and do time series analysis. Using a data warehouse we can also do machine learning and data mining to find patterns in the data, do statistical analysis, and create forecasts. So basically, whether we store the data in Hadoop or in a data warehouse, we can still do complete analysis.

The issue here is storing it. Linking 100 tables in Hadoop is difficult and not natural. An RDBMS such as SQL Server or Oracle is designed precisely for that task: linking and joining tables. Constructing a data model linking 100 tables is very suitable for an RDBMS. Can we design a data model linking 100 files with different structures in Hadoop? Of course we can. But it is much more difficult. For starters, it is Schema-on-Read, so the columns in the files have no data types. Schema-on-Read means that we don't try to understand the relationships between the files when loading them into Hadoop. So yes, we can load the 100 files into Hadoop, but we keep them as individual files, without relationships between them. This is the same as in a Data Lake, which also uses Schema-on-Read and HDFS.

2. Data Volume: Small vs Large

100 files containing 10 million rows each is 1 billion rows per day. If all 100 files have the same structure (say they all consist of the same 10 columns), then we will have a performance problem if we load them into an SMP database such as SQL Server or Oracle. Within 3 years, this table will have about 1 trillion rows. Even with partitioning and indexing, it will still be slow to query.

Hadoop on the other hand, will have no problem storing and querying 1 trillion rows. It is designed exactly for this task, by storing it in many files and querying it in parallel using Stinger, Drill, Phoenix, Impala or Spark. The file structure is simple (the same 10 columns each) which lends itself to Hadoop.

Redshift, Azure SQL Data Warehouse, Exadata, Teradata, Greenplum and Netezza are more than capable of handling this, with excellent query performance. But MPPs are more costly than Hadoop, which is why companies tend to choose Hadoop for this task. Using an MPP for this task is like killing a fly with a cannon. Not only is it expensive and unnecessary, it is also too sluggish and cumbersome for the task.

If the 100 source files have a complex structure (such as an export from an SAP system) then yes, an MPP is a suitable solution, as we need to create relationships between the files/tables. But if the source files have a simple structure and we just need to union them, then Hadoop is more suitable and more economical for this task.

So if the data volume is large, like 1 billion per day, and the data structure is simple, put them into Hadoop. But if the data volume is large and the data structure is complex, put them into an MPP.

3. Unstructured Data

If most of those 100 source files are MP4 (video) or MP3 (music), then Hadoop or a Data Lake is an ideal platform to store them. An RDBMS, be it SMP or MPP, is not designed to store video or music files. It can (as blobs, or as externally-linked files), but it is not really designed for it.

If the source files have different numbers of attributes (such as Facebook or Twitter files), then Hadoop or a Data Lake is an ideal platform to store them. An RDBMS is not really designed for it.

Unstructured data can also come in the form of free-format text files (such as emails) and documents (such as journals and patents). Again, Hadoop or a Data Lake is in a much better position to store them than an RDBMS. But even better is a document database, such as MongoDB, AWS DynamoDB or Azure Cosmos DB.

4. Schema-on-Read

One of the advantages of using Hadoop or a Data Lake is that they are Schema-on-Read, meaning that we just store the files without determining whether the columns are numeric or string. It is only when we want to query the data that we need to specify the data types.

Why is this an advantage? Because it makes things flexible. In Data Warehousing the first thing we need to do is analyse the file structures and design many tables to host the files in a Staging database. Then we design a normalised database to integrate those Staging tables. And then we design a Reporting layer in the form of Fact and Dimension tables and load those normalised tables into them. The whole thing can take a year if we have 100 files. The more files we have, the more complex the process and the longer it takes to design the databases for the Integration layer and Reporting layer. It is good for the data architect (it gives them a job) but it is not good for the people who pay for the project.

Hadoop on the other hand is Schema-on-Read. After we put these 100 files in Hadoop, we query the first file, and when we query this first file, we specify the data types of each column. We don't need to touch the other 99 files yet, and we can already get the benefit: we can analyse the data straight away. On day one! If the other 99 files have the same structure, then we can union them, without the extra effort of designing any database, and we can query them straight away. On day two! It is much simpler: we don't need a team of 10 people designing a Staging, Normalised or Reporting layer for many months. We can start analysing the data straight away and the project can finish in 2 to 3 months, with only 3 or 4 people. A lot less costly, a lot more agile, and a lot more flexible.
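As a rough sketch of Schema-on-Read in practice, this is approximately how the first file could be exposed in Hive: the files already sit in an HDFS folder, and the data types are declared only when we define the query-time table (paths and column names here are hypothetical):

CREATE EXTERNAL TABLE sensor_readings
( reading_time  TIMESTAMP
, device_id     STRING
, reading_value DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/landing/sensor/';   -- the 100 files can simply be dropped into this folder

SELECT device_id, AVG(reading_value) AS avg_reading
FROM   sensor_readings
GROUP BY device_id;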

Conclusion

So those are the 4 consideration factors when choosing between implementing big data or a data warehouse: data structure, data volume, unstructured data and schema on read.

17 April 2017

Definition of Big Data and Data Warehousing

Filed under: Data Architecture,Data Warehousing — Vincent Rainardi @ 5:38 pm

I’m annoyed that many people label the normal Data Warehousing & BI stuff as “Big Data”. It is not. For me, Big Data is the “Hadoop stuff” (HDFS). If it is in SQL Server, Oracle or Teradata, it is just a normal database, not Big Data.

Big Data is stored on HDFS (Hadoop Distributed File System), not in RDBMS. Oracle, IBM, Microsoft, SAP, Teradata, all use Hadoop to store Big Data. Big Data is queried using MapReduce.

The reason why Big Data can't be stored in an RDBMS is because the format is not tabular. Sometimes it is 2 columns, sometimes it is 200 columns, like Twitter data. The second reason is that it is too big. Sensors can make 100 measurements in a second, and in a year that could be Petabytes. Web logs are another example. Tracking the ask and offer price of every transaction in every stock market is another example. Yes, we can put Petabytes into SQL Server or Oracle, into Netezza or Teradata, but not at this speed (and more importantly not at this price!). Hadoop on the other hand is designed exactly to cope with this kind of speed and volume (and price).

Now the usage. What is Big Data Analytics? Big Data Analytics is when we do analytics on Hadoop Data.

Is Fraud Detection Big Data Analytics? Not always. Fraud Detection can be done on a normal Data Warehouse or a database. Is Machine Learning Big Data? Not always. Machine Learning can be done on a normal Data Warehouse or a database. If the Fraud Detection or the Machine Learning is done on data stored in Hadoop, then it is Big Data Analytics.

Even if it is only 200 GB, if it is stored in Hadoop, it is Big Data. Even if the data is 5 Petabyte, if it is stored in an SQL Server database, it is not Big Data, in my opinion.

Even if the data is in tabular format (i.e. columns and rows), if it is stored in Hadoop, it is Big Data. But if it is stored in an Oracle database, it is not Big Data.

Every Big Data architecture that I know uses Hadoop. No company (or government) implements Big Data on an RDBMS, or on non-HDFS files. Every single company, every single Big Data case I read about, implements Big Data on Hadoop. I may be wrong and would be happy to be corrected. If you know a case which implements Big Data on a non-Hadoop system/architecture, I would be grateful if you could let me know, either through comments, or via vrainardi@gmail.com.

 

 

18 September 2016

Rating Dimension

Filed under: Business Knowledge,Data Architecture,Data Warehousing — Vincent Rainardi @ 2:32 am

In an investment banking data warehouse, we have an instrument dimension (aka security dimension). This instrument dimension is used in many fact tables such as trade, risk, valuation, P&L, and performance.

In this instrument dimension we have rating attributes (columns) from the 3 agencies (S&P, Moody's, Fitch), an in-house rating and many combinations of them. We may have the ratings at both the issuer/obligor level and the instrument/issue level. We could also have a hierarchy of ratings, e.g. AA+, AA and AA- are grouped into AA. There are separate ratings for cash instruments.

Usual Design

The usual design is to put all rating attributes in the instrument dimension. This is because the granularity of rating is at instrument level. In other words, rating is an attribute of an instrument.

Issues with the Usual Design

The issues with doing that are:

  1. Cash and FX may not be created as an instrument but they can have ratings.
  2. Some fact tables require the rating outside the context of an instrument.

An example for point 1 is a USD bank balance (settled cash) in a GBP portfolio, which is given a rating of AAA. In the instrument dimension there is no instrument called USD bank balance. The other example is USD cash proceeds, which is unsettled cash.

An example for point 2 is a fact table used for analysing the average ratings of a portfolio across different dates. This is called Portfolio Credit Quality analysis. On any given day, a credit portfolio has 3 attributes: highest rating, lowest rating and average rating. The average rating is calculated by converting each rating letter to a number (usually 1 to 27, or using Moody's rating factor, link), multiplying the rating number by the weight of each position, and summing them up. There can be several different criteria, for example: including or excluding debt derivatives, funds, cash, or equity, and doing lookthrough or not.

For example, portfolio X on 16th Sep could have these different attributes:

  1. Average rating: AA+
  2. Highest rating: AAA
  3. Lowest rating: BB-
  4. Average rating excluding cash: AA
  5. Average rating excluding cash and derivatives: BBB+
  6. Average rating excluding cash, derivatives and equity: BBB-
  7. Average rating including funds: AA-
  8. Average rating with lookthrough: A+

So the fact table grain is portfolio and date. We therefore cannot use the instrument dimension in this fact table, and it is therefore necessary to have rating as its own dimension.
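As a sketch of the average rating calculation, assuming a position-level fact table with a weight measure and a rating dimension carrying a numeric rating value (all names are hypothetical):

SELECT f.portfolio_key,
       f.snapshot_date_key,
       SUM(r.rating_value * f.weight) AS average_rating_value  -- weights sum to 1 within the portfolio
FROM   fact_position  f
JOIN   dim_rating     r ON r.rating_key = f.rating_key
WHERE  f.snapshot_date_key = 20160916
GROUP  BY f.portfolio_key, f.snapshot_date_key;

The resulting number is then converted back to the nearest rating letter to give the portfolio's average rating.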

Another example of a fact table which requires rating outside the instrument dimension is sectoral credit quality. A credit portfolio consists of many bonds and credit derivatives. Each of these bonds (or CDS) is in a certain industry sector. A bond issued by BP is in the Energy sector. A bond issued by Santander is in the Financials sector. The average rating for each sector is calculated by converting each rating to a number, multiplying it by the weight, and summing up over all positions in that sector.

If we use GICS (link) plus government as the sector scheme, on any given day a credit portfolio can have a sectoral credit quality like this (usually excluding cash and equity, but including credit derivatives and debt funds):

  1. Consumer Discretionary: AA+
  2. Consumer Staples: AA-
  3. Energy: A+
  4. Financials:AAA
  5. Health Care: BB-
  6. Industrials: BBB-
  7. Information Technology: AA-
  8. Materials: BBB+
  9. Telecommunication Services: AA
  10. Utilities: BB+
  11. Real Estate: B+
  12. Government: AAA

Above is the average rating for each sector, for a given day, for a particular portfolio.

Design Choices for Rating Dimension

To satisfy the above requirement (fact tables which analyse rating at a level higher than instrument, e.g. at sector level or portfolio level), we can design the rating dimension in several ways:

  1. Rating schemes and hierarchies are created as attributes
  2. Rating schemes are created as rows, with hierarchies as attributes
  3. Rating schemes and hierarchies are created as rows

Before we dive into the details I'd like to emphasise that we should not try to find out which approach is the best one, because different projects will require different approaches. So instead of asking "What is the best approach?" we should ask "What is the most suitable approach for my project?"

And secondly I'd like to remind us that if the rating data is always used at instrument level, then it should reside in the instrument dimension, not in a rating dimension of its own.

Approach 1. Rating schemes and hierarchies are created as attributes

[Figure: Approach 1 – rating schemes and hierarchies as attributes]

Of course we should always have RatingKey 0 with description = Unknown. This row is distinctly different from NR (No Rating). NR means S&P gave the bond a rating of NR; RatingKey 0 means that we don't have any information about the rating of this bond.

The distinct advantage of this approach is that we have a mapping between different rating schemes. Moody’s Baa3 for example corresponds to S&P’s BBB-.

The Grade column is either Investment Grade (AAA down to BBB-) or High Yield (HY, from BB+ down to D).

Approach 2. Rating schemes are created as rows, with hierarchies as attributes

[Figure: Approach 2 – rating schemes as rows, hierarchies as attributes]

The distinct advantage of approach 2 over approach 1 is that RatingKey 33 means Aa3; we don't have to specify that we mean Moody's, like in approach 1. In approach 1, RatingKey 4 can mean Aa3 or AA-, depending on whether we mean Moody's or S&P.

The distinct disadvantage of approach 2 compared to approach 1 is that we lose the mapping between S&P and Moody's. In approach 1 we know that Aa3 in Moody's corresponds to AA- in S&P, but in approach 2 we don't know that. In the majority of circumstances this is not an issue, because an instrument with a Moody's rating of Aa3 may not have an S&P rating of AA- anyway; its S&P rating could instead be AA or BBB+. So the mapping between S&P and Moody's is the textbook, academic assumption, not the reality in the real world. In the above example, nobody cares whether the corresponding S&P rating for Aa3 is AA-. What the fund managers and analysts want to know is what Moody's really assigned to that bond.

Approach 3. Rating schemes and hierarchies are created as rows

[Figure: Approach 3 – rating schemes and hierarchies as rows]

The distinct advantage of Approach 3 over Approaches 1 and 2 is that RatingKey 31 means the AA tier; it is clear and not ambiguous. In Approaches 1 and 2, RatingKeys 2, 3 and 4 all mean the AA tier.

The other advantage of Approach 3 is the ability to create short term ratings and cash ratings, because they are simply different schemes.

Why do we need to keep the Tier column in Approach 3? Because we still need to convert AA+, AA and AA- to AA.

Rating Value (aka Rating Number)

In some projects we need to convert the rating letters to numbers. For this we can create an attribute in the above rating dimension (any of the 3 approaches) called Rating Value. It is usually AAA = 1, AA+ = 2, AA = 3, AA- = 4, and so on. NR (No Rating) and WR (Withdrawn Rating) get 0 rather than 23.

[Figure: Rating Value mapping]

Note that unlike the S&P and Moody's schemes, the Fitch scheme differentiates the D rating into 3: DDD, DD, and D. So DDD = 22, DD = 23, D = 24.
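Pulling Approach 3 and the Rating Value together, here is a rough sketch of what the rating dimension could look like (illustrative columns and rows only, since the figures above may not show the exact layout):

CREATE TABLE dim_rating
( rating_key   INT         NOT NULL PRIMARY KEY
, scheme       VARCHAR(20) NOT NULL   -- e.g. S&P, Moody's, Fitch, Tier
, rating       VARCHAR(10) NOT NULL   -- e.g. AA-, Aa3
, tier         VARCHAR(10) NOT NULL   -- e.g. AA
, grade        VARCHAR(20) NOT NULL   -- Investment Grade or High Yield
, rating_value INT         NOT NULL   -- AAA = 1, AA+ = 2, AA = 3, AA- = 4, ...
);

INSERT INTO dim_rating (rating_key, scheme, rating, tier, grade, rating_value) VALUES
  (0 , 'Unknown' , 'Unknown', 'Unknown', 'Unknown'         , 0),
  (4 , 'S&P'     , 'AA-'    , 'AA'     , 'Investment Grade', 4),
  (33, 'Moody''s', 'Aa3'    , 'AA'     , 'Investment Grade', 4),
  (31, 'Tier'    , 'AA'     , 'AA'     , 'Investment Grade', 3);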

Miscellaneous

A common mistake is that people assume there are CC+ and CC- ratings. There aren't. Only CCC is split into 3 (CCC+, CCC, CCC-); CC and C are not.

Sometimes the relationship between a rating in one scheme and another scheme is not a one-to-one mapping. For example, the Moody's rating of Ca corresponds to 2 ratings in the S&P scheme: CC and C. This is why approach 1 is tricky: the ratings do not have a one-to-one relationship between one scheme and another. Another example, as mentioned above, is the D rating in S&P, which corresponds to 3 ratings in Fitch: DDD, DD and D.

20 June 2016

Data Lake vs Data Warehouse

Filed under: Data Architecture,Data Warehousing — Vincent Rainardi @ 6:22 pm

A Data Lake is a storage and analytic system which stores structured and unstructured data from all source systems in the company in its raw form. The data is queried, combined and analysed to get useful patterns and insights. It is built on a cost-effective Hadoop storage infrastructure, and can be dynamically scaled.

A Data Warehouse is a system that retrieves and consolidates data periodically from source systems into a dimensional or normalised data store. It usually keeps years of history and is queried for reporting, business intelligence or other analytical activities. It is typically updated in batches, not every time a transaction happens in the source system.

Data Lake Advantages

Because data from all source systems is there in the data lake, the main advantage of a data lake is that it enables the analyst to get analytics and insights across multiple systems. Yes, the data is still in its raw form, so it requires some processing, but it is a lot quicker (1-2 weeks) than building a data warehouse (6-12 months). And that is the second advantage: time to build.

The third advantage of data lake is its ability to store unstructured data such as documents and social media, which we can query and combine with structured data from databases.

The fourth and probably the most powerful business case for a data lake is cost efficiency. In the investment banking world, for example, we can store market data (e.g. prices, yields, spreads, ratings) not only for the securities that we hold but for all securities in the market, more cheaply than if we stored it in Oracle or SQL Server.

The fifth advantage of a data lake is flexibility. Unlike Oracle or SQL Server, a data lake scales dynamically. If this morning it is 2 TB, in the afternoon it can be 3 TB. As new data arrives and we need new capacity, we can add storage easily. As we require more computing power, we can get more computing power there and then, instantly. There is no need to wait days/weeks (or months) to add a new node to a SQL Server Always On availability group, add storage to the SAN, or extend the computing power of an Oracle Grid. This means that we do not need to spend a lot of money upfront paying for 3 TB of data and 16 processors. We can just start with 300 GB and 4 processors, and expand when required.

Data Lake Disadvantages

The first issue is that data lake technology is immature. The language which can query across databases and unstructured files has only very limited features (each vendor has a different language; for Microsoft Azure it is U-SQL). It is probably only 10% of what PL/SQL or T-SQL can do. We can solve this by putting QlikView or Tableau on top: we use U-SQL only to query individual tables/files, and we join the data in QlikView/Tableau and do further processing there.
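As a rough illustration of what this looks like (file paths and column names are made up), a U-SQL script extracts raw files with a schema declared at query time and writes a result set out, which QlikView/Tableau can then pick up:

@holdings =
    EXTRACT portfolio     string,
            market_value  double
    FROM "/raw/holdings/holdings.csv"
    USING Extractors.Csv();

@totals =
    SELECT portfolio,
           SUM(market_value) AS total_value
    FROM @holdings
    GROUP BY portfolio;

OUTPUT @totals
TO "/output/portfolio_totals.csv"
USING Outputters.Csv();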

The second issue is cost (but I don't find this argument holds water). The issue is that it actually costs a lot of money to store data from all source systems in the company, let alone external market data, which carries significant licence costs.

Let's take Microsoft Azure Data Lake as a pricing example. Being in the cloud, the price is only US$0.04 per GB and US$0.07 per 1 million transactions (ref: link). Let's say that "data from ALL systems in the company" is 500 GB, and every day we store this data in the data lake. So 500 GB per day x 365 days = 182,500 GB per year x $0.04/GB = $7,300 per year. Let's say we have 10 million transactions per day, which is 3,650 million transactions per year x $0.07/million transactions = $256 per year. So it only costs us about $7,500 per year. This is a very reasonable price to pay to have "all the data in the company in one place". Even 5 years later, when the volume has grown 5 times, it is only about $35,000 per year. Still a very reasonable price to pay.

The third issue is performance. Because a data lake stores the data in its raw format, a query which joins the different data sets could run like a dog. Luckily a Data Lake runs on HDFS, a distributed file system, which is very fast. So yes, it is slower than a data warehouse, but it is not too bad. We are not talking about 30 minutes to run a query, but something like 20-30 seconds (compared to a DW query which is, say, 1-3 seconds).

A Data Lake being an "exploration tool", I can tolerate a little bit of slowness. After we prove that (for example) the risk data we query is making money, it can pay for creating a Risk data mart specifically for that risk analysis purpose.

The fourth issue is skills. A Data Lake requires superb business knowledge, because the analyst needs to join tables from different source systems. What is the difference between Real Modified Duration, Effective Duration and Modified Duration? But this issue is the same whether we are building a Data Warehouse or a Data Lake. Both require good business knowledge.

I don't find U-SQL difficult to understand for people with SQL knowledge, which is a common skill. But how about Machine Learning? That is difficult to master, right? Yes, that is true, but it is worth paying an expert data scientist to discover the insights in the data, because those insights can be used to save cost or boost revenue. The potential benefit is 10-100 times the salary of the data scientist.

Conclusion

So considering all the pros and cons above, I am in favour of creating a Data Lake. In addition to having a Data Warehouse, of course.

 

1 April 2016

Data Sourcing

Filed under: Data Architecture,Data Warehousing — Vincent Rainardi @ 8:16 pm

One of the critical activities in a data warehouse project (and in any data project) is data sourcing. The business users have a few reports which need to be produced from the data warehouse. There is no data for them yet in the data warehouse, so we look at each report and ask ourselves: "Where can I get the data to produce this report?" The plan is to find the data source, and then bring the data into the data warehouse.

There are 6 steps of data sourcing:

  1. Find the source
  2. Reproduce the numbers
  3. Verify the extraction
  4. Check the coverage
  5. Check the timing
  6. Check the cost

Step 1. Finding the source

We find out how the report was created, who created it, and from where. The report could be a PDF which is created from PowerPoint, which was created from an Excel spreadsheet, which was created from a file, which was exported from a system. It is this system that we need to check. Find the tables, the columns and the rows where the data is coming from.

Step 2. Reproduce the numbers

Once we have located the tables, we then try to reproduce the numbers we see on the report. For example, the report could be like this:

[Figure: Asset Allocation report]

Try to reproduce the first number (24.3%) by querying the tables which you think the data is sourced from. In doing this you will need to look at the data processing happening in the Excel spreadsheet. Do they exclude anything? Is it a straightforward sum of the holdings in government bonds? Do they include supranationals? Government agencies? Regional? Municipal? Emerging Market Debt?

If you can match 24.3% then great. If not, investigate whether the data was manually overridden before it was summed up. For example, in Excel, there might be a particular bond which was manually classified as Government even though its asset class is Corporate, because it is 80% owned by the government, or because it is backed by the government.

We need to be particularly careful with regard to the totals. For example, in the "Allocation by Country of Risk", if the total portfolio value is $200m but they exclude FX forwards/swaps, bond futures, or certain derivatives, then the total portfolio value could decrease to $198m, and all the percentages would be incorrect (they would be slightly higher).
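As a sketch, reproducing the 24.3% might boil down to a query like the one below against the source tables. The table and column names are hypothetical, and the asset class filter is exactly where the inclusion/exclusion questions above come into play:

SELECT SUM(CASE WHEN asset_class = 'Government Bond' THEN market_value ELSE 0 END)
       / SUM(market_value) * 100 AS govt_allocation_pct   -- should come out as 24.3
FROM   position_valuation
WHERE  portfolio_code = 'ABC'            -- hypothetical portfolio
AND    valuation_date = '2016-03-31';    -- the report's as-at date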

Understanding the logic behind the report is critical in reproducing the numbers. In-depth industry knowledge will be helpful to understand the logic. For example:

  1. The credit rating for each debt security is categorised into IG (Investment Grade) if it is BBB- or above, and into HY (High Yield) if it is BB+ or below, or NR.
  2. The credit rating used for point 1 is the average of S&P, Moody's and Fitch, except a) for own funds use look-through, b) for outside funds use IMMFA.
  3. The “Allocation by Country of Risk” table excludes cash and FX forwards/swaps.
  4. Each country is categorised into Developed, Emerging or Frontier market.
  5. When determining the country of risk in point 4, for derivatives use the underlying.

In the above case, if we are not familiar with the investment banking industry, it would take a long time for us to understand the logic. So, yes, when doing Data Sourcing, it is best if it is done by a Business Analyst with good knowledge & experience in that industry sector.

Step 3. Verify the extraction

Once we can reproduce the numbers, we need to verify if we can get the data out. A few gotchas are:

  1. The numbers are calculated on-the-fly by the system, and not stored anywhere in the database. If this is the case, find out from the vendor if they have an export utility which produces a file after the numbers have been calculated.
  2. Are we allowed to connect to that database and query it? Do not assume that we can, because I've encountered a few cases where we were not allowed to do that. It could be because of the system workload/performance (it is a critical transaction system and they don't want any big query ruining performance for the front end users), or it could be because they provide daily extract files which all downstream systems must use (instead of querying the database directly). From the system admin point of view, it makes sense not to allow external queries to run on the database, because we don't know what kind of damage those external queries could cause; they could block the front end queries and cause locking.
  3. Losing precision, i.e. the data must be exported from the database but during the export process the precision decreases from 8 decimal places to 2 decimal places.
  4. There is a security restriction because it is against the "Chinese wall" compliance rules (in investment banking, the public-facing departments must not get data from the M&A department).
  5. The system is being migrated, or rewritten, so it is still in a state of flux and we need to wait a few months.
  6. The system is not of solid "production quality", but only a "throwaway", which means that within a few months they could be dropping those tables.

Step 4. Check the coverage

This step is often missed by many business analysts. We need to check whether all the products that we need are available in that system. If the report we are trying to reproduce covers 4,000 products from 100 stores, but the source system/tables only cover 3,000 products from 70 stores, then we've got to find out where the other 1,000 products and 30 stores are sourced from. Are they produced from a different system?

Not only products and stores. We need to check the coverage in terms of: customers/clients, portfolios, securities, lines of business, underwriting classes, asset classes and data providers. And the most important coverage check is on dates, e.g. does the source system have data from 2011 to present? It is possible that the source system only has data from 2014.

Step 5. Check the timing

After checking the coverage, we need to check if the data is available when we need it. We need to check these 3 things: whether the data arrives too late, the available extraction window, and whether the data is overwritten.

  1. Data is too late: If our DW load starts at 2.15 am, will the data be available before that? If not, could the business users live with slightly stale data (data from 2 days ago, i.e. if today is Wednesday, the latest data in the DW would be Monday's data)?
  2. Available extraction window: In the production environment, when can we extract the data from that source system? If there is an overnight batch running from 11pm to 6am and we can't run our extract during that time, then the earliest we can run is 7am. If the DW load takes 3 hours, DW users can access the data at 10am. Is that too late for the users or not?
  3. The data is overwritten: the data in the source system can be updated many times during the day, and when we extract it at night we have no trace of these changes. Is that ok? Do we need an intraday, push-driven data load into the DW? Or would a 10-minute data extraction frequency (pull-driven) be enough?

Step 6. Check the cost

There is no free lunch. We need to check how much it would cost us to use that source data.

  1. If the data is valuable (such as prices, yield and rating from Bloomberg, Reuters and Markit) we would have to pay the data providers. We need to check the cost. The cost could be per $5 security, per call, so it could easily be $20-30k per day. The cost is usually shared with other department.
  2. Check with the data provider, if you use the data only as an input to your calculation, and you don’t publish it / send it on to any external parties (clients, etc.), would it still cost you a lot? Even if you don’t have to pay the data providers, your DW project might still have to share the cost with other departments. Say the data provider is Markit and your company pays $300k/year for prices data and it is currently shared by 5 departments ($60k each). Your project may have to bear the cost of $50k/year ($300/6).
  3. The cost could be a killer to the whole thing, i.e. even if #1 to #5 above are all ok, if the cost of the data is $50k, it could force you to cancel the whole project.
  4. Sometimes other department has to create the data for you. Let’s day yield calculation, or risk engine, or OTC pricing engine, and the requirement from the BI/DW is specific so they have to develop it. It could take them 3 months x 3 people and they could cross charge your project $50k (one off). And that could also be a killer to the DW project.
  5. Developing an interface: some systems do not allow an external system to pull data out. They insist on developing an export themselves, and charge the cost to your DW project.
  6. Standard data interface system: some large companies (such as multinational banks) have a standard interface (real time, end of day, etc.), and the central middleware team might charge your DW project a small amount (say $2000 one-off + $50/month) to use that standard data interface system. Say you need FX rate data from the FX system, and there is already a standard message queue for FX rates with publication times of 11am, 1pm and 3pm. So you “subscribe” to this publication MQ and pay the cost (project cross-charge).

16 March 2016

Different Measures for Different Product Types

Filed under: Business Knowledge,Data Architecture,Data Warehousing — Vincent Rainardi @ 8:33 am

What I mean by a measure here is a time-variant, numerical property of an entity. It is best to explain this by example. In the investment industry we have different asset classes: equities, bonds, funds, ETFs, etc. Each asset class has different measures. Equities have opening and closing prices, daily volume, market capitalisation, and daily high and low prices, as well as annual and quarterly measures such as turnover, pretax profit and EPS. Bonds have different daily measures: clean and dirty prices, accrued interest, yield and duration. Funds have different daily measures again: NAV, alpha, Sharpe ratio and volatility, as well as monthly measures such as 3M return, 1Y return, historic yield, fund size and number of holdings. ETFs have daily bid, mid and offer prices, year high and low, and volume, as well as monthly measures such as performance. The question is: what is an appropriate data model for this situation?

We have three choices:

  1. Put all measures from different product types into a single table.
  2. Separate the measures for each product type into different tables.
  3. Put the common measures into one table, and put the uncommon measures into separate tables.

My preference is approach 2, because we don’t need to join across tables for each product type. Yes, we will need to union across the different tables to sum up across product types, but a union is much more performant than a join. The main weakness of approach 1 is column sparsity.
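For example, to sum market value across product types under approach 2 we union the per-asset-class measure tables rather than join one wide table. A sketch, assuming hypothetical daily measure tables that each have MeasureDate and MarketValue columns:

select MeasureDate, sum(MarketValue) as TotalMarketValue
from
( select MeasureDate, MarketValue from EquityDailyMeasure
union all
select MeasureDate, MarketValue from BondDailyMeasure
union all
select MeasureDate, MarketValue from FundDailyMeasure
union all
select MeasureDate, MarketValue from EtfDailyMeasure
) m
group by MeasureDate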

On top of this, of course, we will need to separate the daily measures and the monthly measures into two different tables. Annual and quarterly measures for equities (such as financial statement numbers) can be combined into one table. We need to remember that measures with different time granularity usually come from different groups. For example, prices are daily but performance is monthly.

Static Properties

In addition to different time-variant properties (usually numerical), each asset class also has different static properties (which can be textual, date or numeric). For example, equities have listed exchanges, industry sectors, country of domicile and dividend dates. Bonds have issuers, call and maturity dates, and credit ratings. Funds have a benchmark, trustee, legal structure and inception date. Examples of numerical properties are minimum initial investment and annual charges for funds; outstanding shares and denomination for equities; par and coupon for bonds. Some static properties are common across asset classes, such as ISIN, country of risk and currency.

Static properties from different asset classes are best stored in separate tables. So we have an equity table, a bond table, a fund table and an ETF table. Common properties such as ISIN, country of risk, etc. are best stored in a common table (usually named the security table or instrument table).

Why not store all static properties in one common table? Because the properties are different for each asset class, so it would be like forcing a square peg into a round hole.
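As an illustration of this layout, the common table plus two of the asset-class tables might look like the sketch below (hypothetical column names, not a definitive design):

-- Common properties shared by all asset classes
create table Security
( SecurityId int primary key,
Isin char(12),
CountryOfRisk char(2),
Currency char(3) )

-- Asset-class-specific static properties, one table per asset class
create table Equity
( SecurityId int primary key references Security(SecurityId),
ListedExchange varchar(50),
IndustrySector varchar(50),
CountryOfDomicile char(2) )

create table Bond
( SecurityId int primary key references Security(SecurityId),
Issuer varchar(100),
MaturityDate date,
Coupon decimal(9,4),
CreditRating varchar(10) )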

Historical Data

For time-variant properties it is clear that the table already stores historical data in its rows: different dates are stored as different rows. What we are discussing here is the historical data of the static attributes. Here we have two choices:

  1. Using the SCD approach: store the historical values on different rows (called versions), where each row is only valid for a certain time period. SCD stands for slowly changing dimension, a Kimball approach in data warehousing.
  2. Using the Audit Table approach: store the historical rows in an audit table (also called a history table). This is the traditional approach in normalised modelling. The main advantage is that the main table stays lightweight and performant.

When do we use them? Approach 1 is suitable for situations where the historical versions are accessed a lot, whereas approach 2 is suitable for situations where the historical versions are very rarely accessed.

The main issue with approach 1 is that we need to use “between” on the validity date columns. In data warehousing we have a surrogate key to resolve this issue, but in normalised modelling we don’t. Well, we could, and we should. Regardless of whether we are using approach 1 or 2, in the time-variant tables we need to store the ID of the historical row for that date. This makes getting historical data a lot faster.
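A minimal sketch of this, assuming hypothetical names: the fund table is versioned SCD-style, and the daily (time-variant) table stores the ID of the version that was valid on that date, so queries for historical data do not need a “between” on the validity dates.

-- One row per historical version of the fund's static attributes
create table FundVersion
( FundVersionId int primary key,
FundCode varchar(20),
Benchmark varchar(100),
ValidFrom date,
ValidTo date )

-- The time-variant table stores the version ID valid on each date
create table FundDailyMeasure
( MeasureDate date,
FundVersionId int references FundVersion(FundVersionId),
Nav decimal(18,6) )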

29 December 2015

DimMonth, DimQuarter and DimYear

Filed under: Data Architecture,Data Warehousing — Vincent Rainardi @ 6:21 am

Sometimes the grains of our fact tables are monthly, quarterly, or yearly. In such cases, how do we create DimMonth, DimQuarter and DimYear? Some of the questions in these cases are:

  1. Why do we need to create Month as a dimension? Why can’t the Month column in the fact table remain as Month, rather than a dimension key?
  2. What does DimMonth look like? What are the attributes? How about DimQuarter?
  3. Should we create DimMonth as a physical table, or as a view on top of DimDate?
  4. What is the surrogate key of DimMonth and DimQuarter? Do we create a new SK, or do we use the SK from DimDate?
  5. Do we need to create a dimension for year? It seems weird because it would only have 1 column (which brings us to the “degenerate dimension” concept).

For example, in the source system we have a table which stores the monthly targets for each store:
[Figure: Sales Target table]

Or quarterly target like this:
[Figure: Quarterly Target]

How do we create a fact table for this? Should we create it like this?
[Figure: FactSalesTarget]

Question 1 above: Why do we need to create Month as a dimension? Why can’t we leave it as Month in the fact table, not as a dim key, like this?
[Figure: FactSalesTarget - MonthNotAsDimKey]

Question 2 above: if we decide to keep the column as MonthKey, what should DimMonth look like? What are the attributes? Should DimMonth be like this?
[Figure: DimMonth]

What attributes should we put there?

  • Do we need quarter and year?
  • Do we need half year? e.g. 2015 H1 and 2015 H2
  • For month name, should we use the short name or long name? e.g. Oct 2015 or October 2015?
  • Do we need an attribute for “Month Name without Year”, e.g. October?
  • Do we need “Month End Date” column? e.g. 2015-10-30
  • Do we need “Month is a Quarter End” indicator column for March, June, Sep and Dec?
  • Do we need “Number of days in a month” column? e.g. 30 or 31, or 28 or 29.

Or should we use “intelligent key” like this?
[Figure: DimMonth using Intelligent Key]

Question 3 above: Should we create DimMonth as a view of DimDate like this?

create view DimMonth as
select min(DateKey) as MonthKey,
MonthNumber, MonthName, Quarter, Year
from DimDate
group by MonthNumber, MonthName, Quarter, Year

[Figure: DimMonth as a view]

What are the advantages of creating DimMonth and DimQuarter as views on DimDate, compared to creating them as physical tables? And what are the disadvantages?

I think with the above questions and examples we are now clear about what the issue is. Now let’s answer those questions.

Q1. Do we need to create Month as a dimension? Why can’t the Month column in the fact table remain as Month, not as a dimension key, like this?
[Figure: FactSalesTarget - MonthNotAsDimKey]

We need the Month column in the fact table to be a Dim Key to a month dimension because we need to access Year and other attributes such as Quarter.

Bringing Year into the Sales Target fact table like below is not a good idea, because it makes the fact table inflexible. For example, if we want to add a Quarter column we have to alter the fact table structure.
[Figure: Bring year into fact table]

Using a Dimension Key to link the fact table to a Month dimension makes it a flexible structure:
[Figure: DimKey Link]
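As a sketch of this structure (hypothetical column names), the fact table carries only the MonthKey, and Quarter and Year come from DimMonth at query time:

select m.[Year], m.Quarter, sum(f.SalesTarget) as SalesTarget
from FactSalesTarget f
join DimMonth m on m.MonthKey = f.MonthKey
group by m.[Year], m.Quarter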

There is an exception to this: the Snapshot Month column. In a monthly periodic snapshot fact table, the first column is the Snapshot Month. In this case we do not need to create this column as a dimension key linking to DimMonth, and we do not need a DimMonth at all, because we do not need other attributes (such as Year or Quarter). A monthly periodic snapshot fact table stores the measures as of the last day of every month, or within that month. For example: the number of customers, the number of products, the number of orders, the number of orders for each customer, the highest and lowest prices within that month for every product, the number of new customers for that month, etc.
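A minimal sketch of such a monthly periodic snapshot fact table, with Snapshot Month stored as a plain column rather than as a key to DimMonth (hypothetical names):

create table FactMonthlySnapshot
( SnapshotMonth int,  -- e.g. 201510, not a dim key
CustomerCount int,
ProductCount int,
OrderCount int,
NewCustomerCount int )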

Q2. What does DimMonth look like? What are the attributes?

Obviously, the grain of DimMonth is 1 row per month. So we are clear about what the rows are. But what are the columns? Well it depends on what we need.

I usually put MonthNumber, MonthName, Quarter and Year in DimMonth, because they are frequently used.

I don’t find “Month Name without the Year” to be a useful attribute. I also rarely come across the need for a “Half Year” attribute.

“Month is a Quarter End” column is also rarely used. Instead, we usually use the “Quarter” column.

“Month End Date” and “Number of days in a month” are also rarely used. Instead, we usually use the “IsMonthEnd” indicator column in DimDate.

For the month name, should we use the short name (Oct 2015) or the long name (October 2015)? I find that the short name is used more than the long name, but the month number (2015-10) is used even more frequently than the short name.

Q3. Should we create DimMonth as a physical table, or as a view on top of DimDate?

This is really the core of this article. A view on top of DimDate is better in my opinion, because we avoid maintaining two physical tables, and it makes the dimension list less cluttered.

If we create DimMonth and DimQuarter as physical dimensions, then in the SSMS Object Explorer, when we open the table section, we would see these:
DimDate
DimMonth
DimQuarter
DimYear

But if we create DimMonth and DimQuarter as views, then we will only see DimDate in the Object Explorer’s table section. The DimMonth and DimQuarter will be in the view section.

The main disadvantage of creating DimMonth as a view on DimDate is that it is less flexible: any attribute column that we want to appear in DimMonth must exist in DimDate. But I find that DimMonth usually only needs 2 or 3 attributes, i.e. Month, Quarter and Year, and all of them are available in DimDate. So this is not an issue.

Avoiding the maintenance of two physical tables is quite important, because if we extend the date dimension (adding more years, i.e. more rows) and forget to extend DimMonth and DimQuarter, we will cause errors.

The other consideration is of course performance. I do not find the performance of DimMonth and DimQuarter to be an issue. This is because DimDate is not too large, and more importantly because the monthly and quarterly fact tables are small, less than 1 million rows. They are much smaller than daily fact tables, which have millions or billions of rows.

Q4. What is the surrogate key of DimMonth and DimQuarter? Do we create a new SK, or do we use the SK from DimDate?

If we create DimMonth and DimQuarter as physical tables, then the surrogate key can be either a pure surrogate key (1, 2, 3, …) or an intelligent key (201510, 201511, 201512, etc.).

But if we create them as views on DimDate, then the surrogate key can be either the first day of the month (20151001, 20151101, 20151201, etc.) or the month itself (201510, 201511, 201512, etc.). I prefer the latter because it is more intuitive (an intelligent key) and there is no ambiguity.

The script to create the view for DimMonth with SK = 201510, 201511, 201512, etc. is like this:

create view DimMonth as
select distinct convert(int, left(convert(varchar, SK_Date), 6)) as MonthKey,
[MonthName], Quarter, [Year]
from DimDate

Q5. Do we need to create a dimension for year?

No, we don’t need to create DimYear, because it would only have one column.

What should we call the dim key column in the fact table then? Is it YearKey or Year? We should call it YearKey, to be consistent with the other dim key columns.

A dimension which only has one column, and is therefore kept in the fact table, is called a degenerate dimension. A degenerate dimension is usually used to store the identifier of the source table, such as a Transaction ID or Order ID. But it is also perfectly valid for dimensions which naturally have only one attribute/column, like the Year dimension. See my article about “A dimension with only one attribute” here: link.
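For example, a yearly target fact table could simply carry the year value itself in the dim key column, as in this sketch with hypothetical names:

create table FactAnnualSalesTarget
( YearKey int,  -- e.g. 2015; no DimYear table to join to
StoreKey int,
SalesTarget money )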
