Data Warehousing, BI and Data Science

28 October 2017

What is Data Science?

Filed under: Data Science — Vincent Rainardi @ 7:21 am

What is the difference between Data Science, Data Mining, Statistics, Machine Learning and Artificial Intelligence? Data Science is one of those buzzwords which are very popular today, and it therefore tends to be used to spice up news and marketing materials. In this article I will explain what Data Science is, and how it differs from AI, BI, Big Data, Computer Science, Data Analysis, Data Management, Data Mining, Data Warehousing, Machine Learning, Predictive Analytics, Robotics and Statistics.

Data Science and Data Scientist

Data Science is the scientific approach to managing and analysing data using statistics, machine learning and visualisation.

Unlike a business analyst, a Data Scientist’s job is to manage and analyse data, focusing on the data itself rather than on the business functionality. For example, finding a pattern in the data, or forecasting future values.

Although a Data Scientist does manage the data, unlike a Big Data engineer a Data Scientist does not set up the Hadoop infrastructure, such as configuring the nodes. A Data Scientist is an expert in using big data, such as the video feed in a self-driving car, the millions of images in a character recognition system, streaming voice in speech recognition, or petabytes of website traffic data for classifying ecommerce customers.

A data scientist cleans the data, reformats the data, and manages how the data is stored in and retrieved from files and databases. For example, splitting the data into many files and combining the calculation results back together. A data scientist also creates new data (usually artificial data, based on data which already exists).

Unlike a data warehouse architect, a Data Scientist is not an expert in ETL technology (such as Informatica or SSIS), parallel technology (such as UPI in Teradata, or partitioning in SQL Server) or database/file/storage technology (such as in-memory, cubes or SAN).

A Data Scientist is an expert in clustering algorithms, deep learning and artificial neural networks. Unlike a brain doctor or a neurosurgeon, they are not an expert in neurons, the nervous system or the human (or animal) brain. A Data Scientist knows the architecture of a neural network, such as the number of layers or nodes, and they know about Long Short-Term Memory networks, Deep Belief Networks and Reinforcement Learning. Unlike a psychologist or a psychiatrist, a Data Scientist does not know how humans perceive events, remember and recall things, or make decisions, and they do not study human behaviour or mental illnesses.

A Data Scientist is an expert in visualising data using Python, Jupyter and R, such as producing 3D graphs, charts and trend lines using ggplot2. They should also be able to use a BI tool such as Tableau or QlikView to create visualisations, but only at a basic level. Unlike a Tableau, BusinessObjects, Cognos, Microsoft BI, TM1 or QlikView developer, they are not an expert in the technicalities of BI software. For example, they would not know about setting access restrictions on BO Universes, using Hierarchize in MDX, TurboIntegrator in TM1, or loading security tables in QlikView.

A Data Scientist knows how to write SQL queries, including grouping and joining tables and converting data types. But they are not a SQL developer who knows how to write stored procedures containing recursive queries using CTEs, ranking functions and cursors, or how to force a query plan to use a hash join or a bitmap filter.
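To give a flavour of the level of SQL meant here, a minimal sketch (the table and column names are hypothetical):

-- grouping, joining and data type conversion: typical Data Scientist SQL
select c.segment, count(*) as number_of_orders,
  sum(cast(o.amount as decimal(18,2))) as total_amount
from orders o
join customers c on c.customer_id = o.customer_id
group by c.segment;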

A Data Scientist knows about big data technologies and how to use them. They should be able to write MapReduce in Java to read text files from HDFS and output a JSON file. But I would not expect them to understand why their MapReduce code gives the warning “Use GenericOptionsParser for parsing the arguments” (because we need to use the getConf() method). When using Cassandra or Hortonworks, a Data Scientist needs a Big Data Engineer to set up the platform for them, and a Big Data Developer to help them with the coding.

A Data Scientist should have a good background in data architecture, and they should be able to design data structures such as tables and HDFS/JSON files, but only at a basic level. They should not be expected to understand detailed Kimball modelling (such as implementing a bridge table to solve multi-valued attributes), detailed graph modelling, or detailed Data Vault modelling. For these we still need a Data Architect.

A glaring gap is the business knowledge. Unlike a Data Architect or a Business Analyst, a Data Scientist does not have good business knowledge. FX Options, IRSs, CDSs and other swaps are not in their vocabulary, whereas a Data Architect or a Business Analyst working in Investment Banking or Investment Management would know these OTC derivatives well. A Data Scientist will need to be taught the business knowledge. Whether it is cancer, lending, airline, retail or telecom data, a Business Analyst or a business person in that sector will need to explain the numbers and the data to the Data Scientist before they can do their work.

So to recap, a Data Scientist is:

  • An expert in Machine Learning, statistics, AI and mathematical modelling
  • An expert in using various types of data, such as numeric, video, images and voice
  • An expert in statistics and analytics tools such as SPSS, Statistica, Weka and KNIME
  • An expert in data processing, data quality and data management
  • Has good knowledge of programming, particularly Python, R and Matlab
  • Has good knowledge of querying Big Data platforms such as Spark and Cloudera
  • Has basic knowledge of SQL, data modelling and databases
  • Has basic knowledge of some BI tools and visualisation
  • Has no business knowledge, such as treasury, lending or legal
  • Has no knowledge of technical infrastructure, such as Big Data platforms, database engines, network infrastructure and storage technologies
  • Typically has a degree or PhD in Math, Physics, Computer Science or Engineering

Data Science, then, is the scientific approach to managing and analysing data using machine learning, statistics and visualisation. The data is already there, stored in databases or file systems, and data science is about analysing this data using scientific techniques (not business techniques).

Data Science is not about collecting and storing data in databases or file systems. It is about analysing the data which is already stored in databases/files. But this analysis is not business analysis; it is scientific (mathematical) analysis. We try to find patterns in the data.

Statistics

Statistics is a set of mathematical methods for analysing numbers and sets. It is about linear regression, Gaussian distributions, correlation, sample sizes and probability. Advanced statistics involves advanced mathematics such as differential equations (calculus) and stochastic processes.

Statistics is not about analysing multimedia or big data, such as images, video feeds, text and voice (natural language processing). It is about analysing numbers and sets.

Statistics is not about neural networks, deep learning or clustering algorithms either. Those are machine learning.

Machine Learning

Machine Learning is about creating machines/computers that do tasks without being explicitly programmed (link). After learning, the computer will be able to predict future values or classify future data.

In the last 50 years we have been giving computers instructions or rules on how to do things. For example, credit card fraud might be defined by, say, 25 rules, such as a transaction with an unusually big amount, or one in an unusual location.

In contrast, in machine learning we don’t give the rules to the computer. Instead we give the computer 1,000 transactions and tell it: this one is a fraud, this one is not, this one is a fraud, this one is not, … and so on until 1,000. The computer learns, creates its own rules (or patterns) and stores them. Then we give it a new transaction and the computer can tell us whether it is a fraud or not.

We can use the same method to identify cancer from thousands of scan images, or to analyse millions of images of nebulae and stars to find black holes. We can use it to read handwriting (like the postcodes on envelopes at the Royal Mail), for speech recognition like Siri and Alexa, for image recognition like in self-driving cars, and for face recognition in banking and payment applications.

Machine Learning is about creating machines/computers which can do tasks without being explicitly programmed. We don’t give them the rules; they create the rules themselves.

Machine Learning uses various algorithms such as Linear Regression, Logistic Regression, Decision Tree, K-means Clustering, Support Vector Machines (SVM), Principal Component Analysis (PCA), Anomaly Detection, and Neural Network.

Machine Learning uses mathematics (statistics, algebra, calculus) to derive and calculate those algorithms. It uses computer programming to make the computer “learn” the parameters or weightings from the training data (creating the “rules”), and to predict the results for new data.
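As an illustration of what “learning the parameters” means, take Linear Regression from the list above (this is the standard textbook formulation, not tied to any particular library). The computer starts with arbitrary parameters θ and repeatedly adjusts them to reduce the cost function J, a process called gradient descent:

h(x) = θᵀx   (the prediction for input x)

J(θ) = 1/(2m) × Σᵢ ( h(x⁽ⁱ⁾) − y⁽ⁱ⁾ )²   (the average squared error over the m training examples)

θⱼ := θⱼ − (α/m) × Σᵢ ( h(x⁽ⁱ⁾) − y⁽ⁱ⁾ ) xⱼ⁽ⁱ⁾   (repeated until J stops decreasing; α is the learning rate)

The θ values that come out of this process are the “rules” the machine has created for itself.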

The most popular programming language for Machine Learning today is Python. R comes second, and very few use Matlab. This is firstly because Python has SciPy and scikit-learn, a comprehensive Machine Learning library. Today hardly anybody implements those ML algorithms manually (why reinvent the wheel?); everybody just uses the scikit-learn library. Secondly, Python and R are free whereas Matlab is expensive (£1800, link). And thirdly, Python is a richer language than R: Python is a general-purpose language like Java and C#, whereas R is a specialist language, used primarily for statistics.

Data Mining

Data Mining is about extracting and processing data, and finding patterns and insights in large amounts of data. So data mining is not about collecting and storing data, or designing databases / file systems to store the data. It is about analysing the data to find patterns and insights.

So it is the same as Machine Learning then? No, it is not. Data Mining is wider than Machine Learning. Data Mining uses Machine Learning algorithms, such as clustering and classification, but it also uses non-Machine-Learning techniques, such as finding recurring relationships, market basket analysis and frequent itemsets.

Data Mining also does the following:

  • Data Mining uses SQL queries simply to query relational databases, to find the answer to a specific question for business insight (not to find patterns).
  • Data Mining also explores OLAP cubes to find specific business insights in the data
  • Data Mining also queries text and documents (this is called Text Mining, or Text Analytics)
  • Data Mining also queries graph databases, object databases and document databases
  • Data Mining also queries file systems, such as data lakes and Hadoop
  • Data Mining also extracts information from images, sounds and videos (i.e. multimedia), and from maps (spatial data)
  • Data Mining also extracts information from social media such as Twitter, Facebook and Instagram
  • Data Mining also extracts information from streaming data, such as weather and stock market feeds
  • Data Mining also queries DNA or protein sequences for certain genetic patterns, e.g. using DNAQL

Data Mining is different to Reporting in the sense that Data Mining explores the data (a flexible exercise where we query the data repeatedly), whereas Reporting queries the data and outputs it in a rigid, specific format.

Data Mining is different to Data Science because when Data Mining analyses the data it uses not only scientific (mathematical) methods but also business methods (business analysis). For example: analysing floating rates in Interest Rate Swaps, or rating migration patterns in corporate bonds. So Data Mining covers Business Intelligence and business analysis, in addition to scientifically analysing the data using statistics and Machine Learning algorithms. Data Mining also tries to find business insights, in addition to finding statistical patterns.

Artificial Intelligence (AI)

Artificial Intelligence is the ability of a machine/computer to learn, think, solve problems and make decisions. The machine does not have to be able to see, hear, talk or communicate well (that’s robotics); a basic input-output will do.

At the moment Machine Learning is used a lot in AI (especially Neural Networks), but AI also uses non-Machine-Learning methods such as Bayesian networks, Kalman filters, fuzzy logic, automated reasoning, solution searching and evolutionary algorithms.

So an AI expert can be a Machine Learning expert or an evolutionary algorithm expert, which are very different things. But generally an AI expert is an expert in mathematics who is able to do basic programming in some languages (usually Python or Java).

An “AI expert” does not necessarily work in IT. They could be a psychologist or a psychiatrist who defines what AI is and isn’t, and who analyses the cognitive behaviour of an AI computer, comparing it to human behaviour and thinking.

Some people include robotics, sensors and motion in AI, such as the ability to move, pick up objects and understand conversation. In my opinion the mechanics of a robot are not AI; AI is only the thinking bit. But that’s just my opinion.

The ability to understand language is part of AI. This is called NLP, Natural Language Processing. The mechanics of hearing and speaking are not AI (that’s electronic and mechanical engineering), but the ability to understand the meaning of the words is AI.

Business Intelligence (BI)

Business Intelligence is about analysing business data to get business insight, which is used to make business decisions. So Business Intelligence is not for scientific purposes; it is for improving business performance, typically by reducing costs or increasing revenues.

Data Analysis can be for many different purposes, including academic research and personal interest, but if it is not for business, it is not BI. Examples of BI are: analysing customer profitability, analysing sales across different products, analysing patients’ risk of illness, and analysing the risk of losses in investment or lending.

BI is done by humans, not machines. A BI analyst explores the data in the database using a BI tool such as Tableau or MicroStrategy, using OLAP cubes such as SSAS or QlikView, or using a reporting tool such as Board or SAP Crystal Reports.

Based on the tools used, there are 6 categories of BI applications: reporting, analytic, data mining, dashboard, alert and portal. Quoting from my book: “Reporting applications query the data warehouse and present the data in static tabular format. Analytic applications query the data warehouse repeatedly and interactively, and present the data in flexible formats that users can slice and dice. Data mining applications explore the data warehouse to find patterns and relationships that describe the data. Reporting applications are usually used to perform lightweight analysis. Analytic applications are used to perform deeper analysis. Data mining applications are used for pattern finding.”

Dashboards are a category of BI applications that give a quick, high-level summary of business performance in graphical gadgets, typically gauges, charts, indicators, and color-coded maps. By clicking these gadgets, we can drill down to lower-level details. Alerts are notifications to the users when certain events or conditions happen. A BI portal is an application that functions as a gateway to access and manage business intelligence reports, analytics, data mining and dashboard applications, as well as alert subscriptions.

We now have a new category of BI: streaming analytics, where the BI tool gives a second-by-second, real-time summary of streaming data.

That’s my understanding. If I’m wrong please correct me, via comments below, thanks. I hope this article is useful for you.

References:

  1. Data Mining: Concepts and Techniques, Jiawei Han and Micheline Kamber, link
  2. Artificial Intelligence: A Modern Approach, Stuart Russell and Peter Norvig, link

25 October 2017

Andrew Ng’s Machine Learning course

Filed under: Data Science — Vincent Rainardi @ 6:00 pm

I have just completed Andrew Ng’s Machine Learning course on Coursera, and in this article I would like to share my experience.

The course covers various machine learning algorithms, such as neural networks, K-means clustering, linear regression, logistic regression, support vector machines, principal component analysis and anomaly detection.

More importantly for me, the course covers various real-world applications of machine learning, such as self-driving cars, character recognition, image recognition, movie recommendation, image compression, cancer detection and property prices.

What makes this course different to other Machine Learning materials and sessions I have seen is that it is technical. Usually when people talk about machine learning, they don’t talk about the mechanics and the mathematics. They explain what clustering does, but they don’t explain how exactly clustering is done. In this course Andrew Ng explains how it is done, in great detail.

It is amazing how people can get away with it, i.e. explaining the edges of Data Science / Machine Learning without diving into the core. But I do understand that in reality not many people are able to understand the mathematics. The amount of matrix and vector algebra in this course is mind-boggling. I am lucky that I studied physics as an undergraduate at university, so I have a good grounding in calculus and linear algebra. But those who have never worked with matrices before might have difficulty understanding the math.

Unfortunately, we do need to understand the math in order to complete the programming assignments. The programming is in Octave, which is very similar to Matlab. When I started this course I had not heard of Octave, but luckily I had used Matlab at uni. Yes, it was 25 years ago, but it gave me some background. And in the last 20 years I have been coding in various programming languages, from C++, VB, C#, Pascal, Cobol and SQL to Java, R and Python. It does help if you have strong programming experience.

The programming assignments do take a lot of time. It took me about 3-5 hours per assignment, and there are 9 assignments (weeks 1 to 9; there are no assignments for the last 2 weeks). For me the problem with these programming assignments is the vectorisation. I can roughly figure out the answer using loops, which takes many lines; but converting that into a vectorised solution (which is only 1 line) takes a long time. Secondly, it takes time to translate the mathematical formulas into Octave, at least in the first 3 weeks. And thirdly, it takes time to test the program and correct the mistakes I make.
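To illustrate what vectorisation means here, take gradient descent for linear regression (a standard example of the kind used in the course, not the actual assignment solution). The loop version updates one parameter at a time:

for j = 1 to n:   θⱼ := θⱼ − (α/m) × Σᵢ ( θᵀx⁽ⁱ⁾ − y⁽ⁱ⁾ ) xⱼ⁽ⁱ⁾

The vectorised version computes the same thing as a single matrix expression:

θ := θ − (α/m) × Xᵀ(Xθ − y)

which is one line of Octave. Seeing that the loop collapses into that one line is exactly the part that takes a long time.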

The quizzes (the tests) are not bad. They are relatively easy, far easier than the programming. Each quiz comprises 5 questions and we need to get 4 of the 5 correct. We are given 3 goes at each quiz; if we still fail after 3 goes, we can try again the next day. Out of the 11 or so quizzes (or maybe 12) there is one which is difficult; I failed it twice but passed the third time. But that is the only difficult one. The others are relatively straightforward.

The most enjoyable thing for me is the math. I had not had a chance to use the math I learned at uni since I graduated 25 years ago (well, except for helping my children with their GCSE homework). The practical, real-world examples are also enjoyable, and so is the programming in some modules. Overall it was fun, interesting, and very useful at the same time. I have been on many IT courses: .NET, SQL BI, Teradata, TDWI and Big Data, but none of them was as enjoyable as this one.

This course made me realise that Machine Learning is very useful and fun, so I decided to apply for an MSc in Data Science (covering big data and machine learning). Thank you, Andrew Ng, for creating this course and for patiently explaining the concepts, video by video.

21 October 2017

Swaps and Options

Filed under: Analysis Services — Vincent Rainardi @ 6:12 am

Working in the investment world often requires us to understand various asset classes. Sometimes we come across simple asset classes, such as shares, bonds, currencies and commodities. But sometimes we come across complex asset classes and derivatives, such as swaps and options. In this article I’d like to give a flavour of swaps and options. There are other derivatives, such as futures and forwards, but I won’t be covering them today.

Swaps

Swaps are essentially an exchange of two assets. The assets produce income, so what is really being exchanged is the income (the cash flow). In fact, we don’t even have to have an asset; we just need a cash flow. For example, the fixed income from a bond can be exchanged for a floating interest rate.

Swaps come in many different variations, but always in the form of A exchanged for B. Typically A is a floating price and B is a fixed price. Here are some of them, alphabetically:

  • Asset Swap: a bond (fixed coupon) in exchange for a floating interest rate (LIBOR plus a spread). Link
  • Commodity Swap: a commodity (floating price, usually oil) in exchange for a fixed price. Link
  • Credit Default Swap (CDS): credit protection (against a bond defaulting) in exchange for a fixed payment. Link
  • Cross Currency Swap (CCS): a loan in currency A (fixed interest) in exchange for a loan in currency B (fixed interest). Link
  • FX Swap: a spot FX trade (today) in exchange for the same amount in a forward FX trade (later). Link
  • Inflation Swap: the inflation rate (such as RPI) in exchange for a fixed rate. Link
    A Zero Coupon Inflation Swap has one lump-sum payment at the end. Link
  • Interest Rate Swap (IRS): a floating interest rate (LIBOR + spread) in exchange for a fixed interest rate (see the worked example after this list). Link
  • Total Return Swap: the income of an asset, plus the capital gain, in exchange for a fixed or variable payment. Link
  • Variance Swap: the variance of a price (the volatility) in exchange for a fixed payment (the strike). Link
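To make the cash flows concrete, here is a simplified Interest Rate Swap illustration with hypothetical numbers (ignoring day count conventions and discounting):

Net payment per period = Notional × (fixed rate − floating rate) × period fraction

With a notional of £10m, a fixed rate of 2%, the floating rate (LIBOR + spread) at 1.5% and semi-annual payments (period fraction 0.5): the fixed leg pays £10,000,000 × 2% × 0.5 = £100,000, the floating leg pays £10,000,000 × 1.5% × 0.5 = £75,000, so the fixed payer pays a net £25,000 for that period. When the floating rate rises above 2%, the payments reverse direction.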

Options

An option is the right to buy or sell at a certain price on or before a certain date.
The right to buy is a call; the right to sell is a put.
American options can be exercised at any time up to the expiry date; European options only on the expiry date.
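The payoff at exercise follows directly from this definition. Writing S for the price of the underlying at exercise and K for the agreed price (the strike):

Call payoff = max(S − K, 0)
Put payoff = max(K − S, 0)

For example, a call with a strike of £100 on a share trading at £120 pays £20; if the share trades at £90, the call expires worthless. The main option types are: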

  • Bond Option: the right to buy/sell a bond at a certain price before a certain date. Link
  • Commodity Option: the right to buy/sell a commodity (e.g. oil or metal) at a certain price before a certain date. Link
  • Equity Index Option: the right to buy/sell an equity index (such as S&P 500) at a certain price before a certain date. Link
  • FX Option (or currency option): the right to exchange a fixed amount of one currency into another currency at a certain rate before a certain date. Link
  • Interest Rate Option: the right to buy/sell a time deposit (EuroDollar, CD, T-Bill, EuroSterling) with a certain rate before a certain date. Link
  • Swaption: the right to enter a swap (typically Interest Rate Swap). Link

18 October 2017

Hierarchy with Multiple Parents

Filed under: Analysis Services — Vincent Rainardi @ 7:40 am

The left diagram below shows a single-parent hierarchy. The right diagram is a multiple-parent hierarchy, where a child node can have 2 or more parents.

Figure 1. Left: single-parent hierarchy; right: multiple-parent hierarchy

In the right diagram above, leaf node H has 3 parents: E and F are in one branch, and G is in another branch. Nodes E and F illustrate an important principle: in a multiple-parent hierarchy, a child node can be located in 2 (or more) different branches.

Double Counting

If the amount in node H is £10, this £10 is carried forward to nodes E, F and G, then to B and C, and finally to A. This causes double counting. Node A accumulates £40 from node H: £20 via node B (through E and F) and £20 via node C (through F and G, because F sits under both B and C). Solving double counting is one of the most important points in dealing with multiple-parent hierarchies.

Ragged Hierarchy

In the right diagram above, node I has 2 parents: F, which is 1 level up, and C, which is 2 levels up. This is another important principle: in a multiple-parent hierarchy, a parent node can be 2 levels up (or more). This situation is called “ragged”, meaning a level is skipped. Raggedness can also happen in a single-parent hierarchy: node I in the left diagram above is an example of a ragged node. A hierarchy that has a ragged node is called a ragged hierarchy.

Terminology

A hierarchy is like a tree. It has a root, branches and leaves. It is useful to picture the hierarchy upside down like below, so that it looks like a tree.

Figure 2. The hierarchy drawn upside down, like a tree

Picturing the hierarchy upside down helps us understand the terminology. We say that A is the “Root Node”, and I, H and D are the “Leaf Nodes”. When we turn the tree upside down like this, the terminology matches reality: the Root Node is at the bottom of the tree, and the Leaf Nodes are at the top.

Unfortunately, we should keep this upside-down tree only in our minds, because in every text we will find the tree drawn as in Figure 1 above, with the Root Node at the top and the Leaf Nodes at the bottom. So when we communicate with other people, we should use the “root at the top” convention.

A. Aggregation

In a multiple-parent hierarchy, the aggregation is done by summing the amounts of the distinct nodes under each node. Taking the distinct set is very important, to avoid double counting.

Figure 3. Left: double-counting aggregation (sum of all nodes); right: single-counting aggregation (sum of distinct nodes)

In Figure 3 above, each node is worth £10. The red numbers above the nodes are the totals of all nodes under them, including themselves. As the left diagram shows, if we double count, the total for node A (at the top of the tree) is £150, which is incorrect: there are only 9 nodes, so the total should be £90. The right diagram shows the correct amount at node A, which is £90.
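A minimal SQL sketch of this single-counting aggregation (the table names are hypothetical: node_amount holds the £10 for each node, and node_ancestor holds one row per ancestor-descendant pair, where each node also counts as its own descendant):

-- sum each node's descendants once, however many paths connect them
select a.ancestor_node, sum(n.amount) as total
from (select distinct ancestor_node, descendant_node from node_ancestor) a
join node_amount n on n.node = a.descendant_node
group by a.ancestor_node;

The distinct is what prevents double counting: H is summed only once under A, even though four paths lead from A to H.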

B. Relational Data Model

The relational model is the usual 2-column parent-child hierarchy table, preferably with a “level” column showing at what level the child node sits. Note that the root node is at Level 1, not Level 0, and that the parent of the root node is NULL. This is the parent-child hierarchy table for the multiple-parent hierarchy in Figures 1 and 2 above:

Figure 4. Parent-child hierarchy table

C. Flattened Hierarchy Table

C1. Path to Leaf Nodes

To flatten a multiple-parent hierarchy, we walk down from the top-level node, taking the left-most path first until we reach the leaf level, and we repeat this for each path in the tree. Each of these paths then becomes a row in the flattened hierarchy table. The business key of this table is the combination of all the nodes in the path.

Figure 5. Left: each red path becomes a row; right: flattened hierarchy table

C2. Paths to Intermediate Nodes

Consider path ABFI below. This path reaches the leaf level and has 2 intermediate nodes: node B and node F. So we have 2 “intermediate node paths”: path AB and path ABF (shown as dashed lines below).

Figure 6. Paths to intermediate nodes

So we have 2 types of path:

  1. Paths to leaf nodes
  2. Paths to intermediate nodes

To be complete, the flattened hierarchy table should contain not only the paths to leaf nodes but also the paths to intermediate nodes, like this:

Figure 7. Flattened hierarchy table including paths to intermediate nodes

C3. Recursive CTE

To create a flattened hierarchy table like the one above, in SQL Server we use a recursive CTE. CTE stands for Common Table Expression; it is essentially a select statement defined at the beginning of another select statement. A recursive CTE is a CTE that references itself. It consists of an anchor query, which returns the starting rows (here the root node), and a recursive query, which repeatedly joins the previous iteration’s rows back to the table, level by level, until no new rows are produced.

We use Recursive CTE to get all the paths in the parent child hierarchy, as per Figure 6 above. The code is like this:

-- Create and populate a parent child hierarchy table
if object_id('dbo.parent_child_hierarchy') is not null
drop table parent_child_hierarchy

create table parent_child_hierarchy
( id int,
  child_node varchar(50),
  parent_node varchar(50),
  child_level int
)

insert into parent_child_hierarchy values
(1, 'A', NULL, 1),
(2, 'B', 'A', 2),
(3, 'C', 'A', 2),
(4, 'D', 'B', 3),
(5, 'E', 'B', 3),
(6, 'F', 'B', 3),
(7, 'F', 'C', 3),
(8, 'G', 'C', 3),
(9, 'H', 'E', 4),
(10, 'H', 'F', 4),
(11, 'H', 'G', 4),
(12, 'I', 'C', 4),
(13, 'I', 'F', 4)

select * from parent_child_hierarchy;

-- Create the "Path Table" containing every path (line) in the tree, including paths to intermediate nodes
if object_id ('tempdb..#path') is not null
drop table #path

;with CTE (child_node, parent_node, path) as
( -- anchor query: start from the root node (the row whose parent is NULL)
  select child_node, parent_node, cast(child_node as varchar(max)) as path
  from parent_child_hierarchy
  where parent_node is null
  union all
  -- recursive query: append each child to its parent's path
  select p.child_node as child, t.child_node as parent,
  t.path + ' > ' + cast(p.child_node as varchar(50)) as path
  from parent_child_hierarchy p join CTE t on p.parent_node = t.child_node
)
-- exclude the anchor row (the root on its own) so that every row is a real path
select path into #path from CTE where parent_node is not null;

select * from #path order by path
Output:
A > B
A > B > D
A > B > E
A > B > E > H
A > B > F
A > B > F > H
A > B > F > I
A > C
A > C > F
A > C > F > H
A > C > F > I
A > C > G
A > C > G > H
A > C > I

The recursive CTE code consists of two parts: the anchor part (the root) union-ed with the recursive part (the lower levels). The way this code works is: it gets the root node first (level 1, the top of the tree), then joins to the level 2 nodes, then to the level 3 nodes, and so on until it reaches the bottom of the tree. Note that a hierarchy tree can have 2 root nodes, or more.

The exact code depends on how the data is arranged in the parent-child hierarchy table. Usually a root node is indicated by parent = NULL, but occasionally a root node is indicated by parent = child.
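In the latter case, a sketch of how the CTE above changes (only the filters differ; the rest is identical):

;with CTE (child_node, parent_node, path) as
( select child_node, parent_node, cast(child_node as varchar(max)) as path
  from parent_child_hierarchy
  where parent_node = child_node  -- root rows are stored as parent = child
  union all
  select p.child_node, t.child_node,
  t.path + ' > ' + cast(p.child_node as varchar(50))
  from parent_child_hierarchy p join CTE t on p.parent_node = t.child_node
  where p.parent_node <> p.child_node  -- skip the self-referencing root row, otherwise the recursion never stops
)
select path into #path from CTE where parent_node <> child_node;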

C4. Padding

In the flattened hierarchy table the leaf nodes are not all at the same level: some leaf nodes are at level 3, some are at level 4, and so on. This makes querying difficult, because we can’t easily select the leaf nodes. So we populate the empty cells at the lower levels with the leaf node. This is called padding. Like this:

Figure 8. Flattened hierarchy table with empty cells populated

In Figure 8 we can see that the red D and I on rows 1 and 8 have been added in the Level 4 column, marked in red. This makes querying easier: if we want the leaf nodes (the bottom-most nodes), we just query level 4.
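A sketch of the padding in SQL, assuming the flattened table contains only the paths to leaf nodes (as in Figure 5) in a hypothetical table called flatten_hierarchy with columns level1 to level4:

-- copy the deepest populated node down into the empty lower-level cells
update flatten_hierarchy set level3 = level2 where level3 is null;
update flatten_hierarchy set level4 = level3 where level4 is null;

-- after padding, the leaf nodes are simply:
select distinct level4 from flatten_hierarchy;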

D. Dimensional Data Model

The standard dimensional model is a fact table (F) connected to a dimension table (D), which in turn is connected to a hierarchy dimension table (H) via a bridge table (B).

Figure 9. Dimensional data model

It is easier to explain by example. I am going to use an example of a sales fact table and a product dimension table:

Figure 10. Dimensional data model for the product dimension

The 4 tables above are populated as follows (let’s assume a sale of 1 product, called Product A). A row in the fact table is a sale transaction for Product A on a particular date, with quantity, unit price and sales amount. The product dimension table has 3 rows for Product A: 1 row for the active version and 2 rows for the expired versions. The bridge table contains 3 rows for Product A: 2 for the Inventory Hierarchy and 1 for the Sales Hierarchy (see below). Each of these 3 rows in the bridge table corresponds to a row in the product hierarchy dimension, with the matching surrogate key.

Inventory Hierarchy is a method of grouping products for the purpose of inventory management; Sales Hierarchy is a method of grouping/classifying products for the purpose of sales management. The Inventory Hierarchy intertwines with the Sales Hierarchy in one big tree. This tree is called the Product Hierarchy Dimension, and it is a flattened dimension table.

Let’s go through the fact table first, then the dimension table, then the hierarchy dimension, then the bridge.

1. Sales Fact Table

The sales fact table is the same as in the single-parent scenario. Each sale is recorded as one row. The facts for each sale event are quantity, unit price and amount.

If a child node has 2 parents in the hierarchy, we do not store the sale in the fact table as two rows. We store it as one row. This is very important to prevent double counting: the transaction happened only once, so we should store it only once.

2. Product Dimension Table

The Product Dimension table is a slowly changing dimension type 2. It contains many rows for each product: each version of each product is stored as one row. Product A has 3 versions, two of them expired and one active. Being type 2, the Product Dimension table has an Effective Date column, an Expiry Date column and an Is Current flag.

3. Product Hierarchy Dimension Table

The Product Hierarchy dimension is flattened (see “Flattened Hierarchy Table” above). The columns are Level 1, Level 2, Level 3, and so on. Flattening is necessary to increase the user-friendliness and the speed of the queries. Each path (line) on the Product Hierarchy tree becomes a row in this dimension table. Product A above, which has 3 parents, becomes 3 rows in the dimension table: 2 rows for the Inventory Hierarchy and 1 row for the Sales Hierarchy. Each row has a different parent; each row represents a path (line) in the hierarchy tree.

How do we update this dimension, and what is the business key? There is no business key. The unique identifier is the path (line), i.e. the combination of all the nodes from the top level to the bottom level, strung together. Because there is no business key, this dimension is type 1: we can’t identify whether a row has changed.

In a single-parent hierarchy, the leaf node is the business key, so the parents (which are attributes) can be updated using that business key. In a multiple-parent hierarchy we can’t do that. For a multiple-parent hierarchy, if the hierarchy data is updated monthly (say it is 10,000 rows), the usual technique is to stamp this month’s 10,000 rows as “March” and next month’s 10,000 rows as “April”, i.e. to have a “month” or “version” column. If the data is updated daily, we should delete this month’s rows and reinsert them from the source every day.
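A sketch of the daily full refresh described above (the table and column names are hypothetical):

-- replace this month's version of the hierarchy with today's copy from the source
delete from dim_product_hierarchy where version_month = '2017-10';

insert into dim_product_hierarchy (version_month, level1, level2, level3, level4)
select '2017-10', level1, level2, level3, level4
from staging_product_hierarchy;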

This dimension has a surrogate key, which is a sequential number. It also contains an unknown row, with surrogate key zero.

4. Bridge Table

The bridge table contains only 2 columns: the surrogate key of the product hierarchy dimension and the surrogate key of the product dimension. This bridge table can be implemented as a view over the product dimension. The number of rows in the bridge table is the same as the number of rows in the product hierarchy dimension, but it only has those 2 columns (3 with its own surrogate key, plus system columns). It contains both the leaf-level nodes and the intermediate-level nodes from the product hierarchy dimension.
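A sketch of the bridge implemented as a view (hypothetical names; it assumes the product hierarchy dimension carries the natural key of its leaf node, so that it can be joined to the current row of the product dimension):

create view product_hierarchy_bridge as
select h.product_hierarchy_key, p.product_key
from dim_product_hierarchy h
join dim_product p
  on p.product_code = h.leaf_node_code  -- natural key of the leaf node
  and p.is_current = 1;                 -- only the current row of the type 2 dimension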

In the relational world we don’t need this bridge table: the leaf id in the Product Hierarchy Dimension table can be linked directly to the Product Dimension table. But in SSAS we can’t connect a dimension to another dimension; we need a fact table in between. That is why we need this bridge table.

Note: Christopher Adamson offers a design for a bridge table: link. In his design the bridge table contains all the children below a parent (not just the children directly below the parent, but those from the lower levels as well).

Note 2: Margy Ross says that bridge tables are used to represent a ragged or variable-depth hierarchical relationship which cannot reasonably be forced into a simpler fixed-depth hierarchy of many-to-one attributes in a dimension table: link. I am not sure exactly what she means, but it is probably what we are implementing here. True, a multiple-parent hierarchy is not the same as a ragged or variable-depth hierarchy, but if it is not many-to-one then it is many-to-many, and a multiple-parent hierarchy is many-to-many.

5. Parent Child Hierarchy table

In some projects the sales amount in the fact table is aggregated up. At this summary level the sales amount does not correspond to the bottom level of the product hierarchy; it corresponds to the middle or top levels.

In this case we need to create a new dimension table which contains every single node in the hierarchy, not just the leaf (bottom) nodes. The parent column in this table is not mandatory but optional; however, it is best practice to have it, so we can traverse the hierarchy to find the leaf nodes.

E. SSAS Cube

In an SSAS cube, the design uses a many-to-many relationship. See Marco Russo’s paper here: link.

Reference:

  1. Five Best Ways to Split Strings, by Aaron Bertrand, 26 July 2012, SQLPerformance.com: link
