Data Warehousing, BI and Data Science

18 October 2017

Hierarchy with Multiple Parents

Filed under: Analysis Services — Vincent Rainardi @ 7:40 am

The left diagram below shows a single-parent hierarchy. The right diagram is a multiple-parent hierarchy, where a child node can have 2 or more parents.

Figure 1 Single and Multiple parent hierarchy
Figure 1. Left: single-parent hierarchy, right: multiple-parent hierarchy

In the right diagram above, leaf node H has 3 parents: E and F are in one branch, and G is in another branch. Node E and F show an important principle: in a multiple-parent hierarchy, a child node can be located in 2 different branches (or more).

Double Counting

If the amount in node H is £10, this £10 is carried-forward in node E, F, G then to B and C and finally to A. This causes double counting. Node A accumulates £30 from node H (£20 via node B, and £10 via node C). Solving double counting is one of the most important points in dealing with multiple-parent hierarchies.

Ragged Hierarchy

In the right diagram above, Node I has 2 parents: F is 1 level up, and C is 2 levels up. This is an important principle: in multiple-parent hierarchies, the parent node can be 2 levels up (or more). This situation is called “ragged”, meaning skipping a level. Ragged can also happen in a single-parent hierarchy. Node I in the left diagram above is an example of a ragged node. A hierarchy that has a ragged node is called a ragged hierarchy.

Terminology

A hierarchy is like a tree. It has a root, branches and leaves. It is useful to picture the hierarchy upside down like below, so that it looks like a tree.

Figure 2. Upside Down Tree
Figure 2. Upside Down Hierarchy, similar to a Tree

Picturing the hierarchy upside down helps us understanding the terminology. We say that A is the “Root Node”. I, H and D are the “Leaf Nodes”. You see, when we turn the tree upside down like this, the terminology matches the reality. We have the Root Node at the bottom of the tree, and the Leaf Nodes at the top of the tree.

Unfortunately we should only depicted this upside down tree in our mind. Because in every text we will find that people drawing the tree like in Figure 1 above, with the Root Node at the top and the Leaf Nodes at the bottom. So when we communicates with other people, we should use this “root at the top” diagram/convention.

A. Aggregation

In a multiple-parent hierarchy, the aggregation is done by summing the amount for distinct leaf nodes. Taking a distinct is very important, to avoid double counting.

Figure 3 Left. Double counting aggregation   Figure 3 Right. Single counting aggregation
Figure 3. Left: double counting aggregation (sum of nodes), right: single counting aggregation (sum of distinct nodes)

In figure 2 above, each node is worth £10. The red numbers above the nodes are the total of all nodes under it, including itself. As we can see on the left diagram, if we double count, the total for node A (which is at the top of the tree) is £150, which is incorrect. There are only 9 nodes so the total should be £90. The right diagram shows the correct amount at node A, which is £90.

B. Relational Data Model

The relational model is the usual 2 columns parent-child hierarchy table, preferably with a “level” column showing at what level the child node is. Note that the root node is at Level 1, not Level 0. Also note that the parent of the root node is NULL. This is the parent-child relational hierarchy table for the multiple-parent hierarchy in Figure 1 and 2 above:

Figure 4. Parent Child Hierarchy Table
Figure 4. Parent-child hierarchy table

C. Flattened Hierarchy Table

C1. Path to Leaf Nodes

To flatten a multiple-parent hierarchy, we walk down from the top level node, taking the left most path until we reach the leaf level. We do this for each path in the tree. Each of these paths then becomes a row in the flatten hierarchy table. The business key of this table is the combination of all nodes.

Figure 5 Left. Red paths becomes the rows    Figure 5 Right. Flatten Hierarchy Table
Figure 5. Left: red paths become the rows, right: flatten hierarchy table

C2. Paths to Intermediate Nodes

Consider path ABFI below. This path reaches the leaf level. This path has 2 intermediate nodes. Node B and node F. So we have 2 “Intermediate Node Paths”: path AB and path ABF (shown as dash lines below).

Figure 6. Paths to Intermediate Nodes
Figure 6. Paths to Intermediate Nodes

So we have 2 types of path:

  1. Paths to leaf nodes
  2. Paths to intermediate nodes

To be complete, the flattened hierarchy should contain not only paths to leaf nodes, but also paths to intermediate nodes, like this:

Figure 7. Flatten Hierarchy with Paths to Intermediate Nodes
Figure 7. Flatten hierarchy table including paths to intermediate nodes

C3. Recursive CTE

To create a flatten hierarchy table like above, in SQL Server we use a recursive CTE. CTE stands for Common Table Expression. It is essentially a select statement which is defined at the beginning of another select statement. A recursive CTE is a CTE which joins to itself repeatedly, each time joining on a different row. It typically start by joining to first row, then joining to the second row, and so on until it reaches the last row.

We use Recursive CTE to get all the paths in the parent child hierarchy, as per Figure 6 above. The code is like this:

-- Create and populate a parent child hierarchy table
if object_id('dbo.parent_child_hierarchy') is not null
drop table parent_child_hierarchy

create table parent_child_hierarchy
( id int,
  child_node varchar(50),
  parent_node varchar(50),
  child_level int
)

insert into parent_child_hierarchy values
(1, 'A', NULL, 1),
(2, 'B', 'A', 2),
(3, 'C', 'A', 2),
(4, 'D', 'B', 3),
(5, 'E', 'B', 3),
(6, 'F', 'B', 3),
(7, 'F', 'C', 3),
(8, 'G', 'C', 3),
(9, 'H', 'E', 4),
(10, 'H', 'F', 4),
(11, 'H', 'G', 4),
(12, 'I', 'C', 4),
(13, 'I', 'F', 4)

select * from parent_child_hierarchy;

-- Create the "Path Table" containing every path (line) in the tree, including paths to intermediate nodes
if object_id ('tempdb..#path') is not null
drop table #path

;with CTE (child_node, parent_node, path) as
( select child_node, parent_node, cast(child_node as varchar(max)) as path
  from parent_child_hierarchy
  where parent_node is null
  union all
  select p.child_node as child, t.child_node as parent,
  t.path + ' > ' + cast(p.child_node as varchar(50)) as path
  from parent_child_hierarchy p join CTE t on p.parent_node = t.child_node
)
select path into #path from CTE where parent_node is not null;

select * from #path order by path
Output:
A > B
A > B > D
A > B > E
A > B > E > H
A > B > F
A > B > F > H
A > B > F > I
A > C
A > C > F
A > C > F > H
A > C > F > I
A > C > G
A > C > G > H
A > C > I

The recursive CTE code consists of two parts: the root part is union-ed with the low level part. The way this code work is: it gets the root node first (level 1, top of the tree), then join to the level 2 nodes, then join to the level 3 nodes, and so on until it reaches the bottom of the tree. Note that a hierarchy tree can have 2 tops (2 root nodes), or more.

The exact code depends on how data is arranged in the parent child hierarchy table. Usually a root node is indicated by parent = null. But occasionally a root node is indicated by parent = child.

C4. Padding

In the flatten hierarchy table the leaf nodes are not located at the same level, e.g. some leaf nodes are located at level 3, some leaf nodes are at level 4, etc. This makes the query difficult. We can’t easily query the leaf nodes. So we populate the empty cells in the lower levels with the leaf nodes. This is called Padding. Like this:

Figure 8. Flatten Hierarchy Table with empty cells populated
Figure 8. Flatten Hierarchy Table with empty cells populated

In figure 5 we can see that the red D and I on row 1 and 8 are added on the Level 4 column, marked in red. This makes the query easier, i.e. if we want to get the leaf nodes (the most bottom nodes) we just need to query level 4.

D. Dimensional Data Model

The standard dimensional model is a fact table (F), connected to a dimension table (D), which in turn is connected to a hierarchy dimension table (H) using bridge table (B).

Figure 9. Dimensional Design - General
Figure 9. Dimensional Data Model

It is easier to explain by example. I am going to use an example of a sales fact table and a product dimension table:

Figure 10. Dimensional Design - Product Hierarchy
Figure 10. Dimensional Data Model for Product Dimension

The 4 tables above are populated as follows (let’s assume a sale of 1 product, called product A): A row in the fact table is sale transaction for Product A on a particular date, with quantity, unit price and sales amount. The product dimension table has 3 row for product A, 1 row for the active version and 2 rows for the expired version. The bridge table contains 3 rows for product A, 2 for the Inventory Hierarchy and 1 for the Sales Hierarchy (see below). Each of these 3 rows in the bridge table corresponds to a row in the product hierarchy dimension, with a similar surrogate key.

Inventory Hierarchy is a method of grouping products for the purpose of inventory management. Sales Hierarchy is a method of grouping/classifying products for the purpose of sales management. The Inventory Hierarchy intertwines with the Sales Hierarchy in one big tree. This tree is called Product Hierarchy Dimension. It is a flattened dimension table.

Let’s go through the fact first, then dimension table, then the hierarchy dimension, then the bridge.

1. Sales Fact Table

The sales fact table is the same as single-parent scenario. Each sale is recorded as one row. The fact for each sale event are quantity, unit price and amount.

If a child node has 2 parents in the hierarchy, we do not store it in the fact as two rows. We store it as one row. This is very important to prevent double counting. Because the transaction happens only once, so we should store it only once.

2. Product Dimension Table

The Product Dimension table is a slowly changing dimension type 2. It contains many rows for each product. Each version of each product is stored as one row. Product A has 3 versions, two of them are expired, and one is active. Being type 2, the Product Dimension table has an Effective Date column, an Expiry Date column and an Is Current Flag.

3. Product Hierarchy Dimension Table

The Product Hierarchy dimension is flattened. See “Flatten Table” above. The columns are: Level 1, Level 2, Level 3, etc. Flattening is necessary to increase the user friendliness and speed of the queries. Each path (line) on the Product Hierarchy tree becomes a row in this dimension table. For Product A above, which has 3 parents, becomes 3 rows in the dimension table. 2 rows for Inventory Hierarchy and 1 row for Sales Hierarchy. Each row has different parent. Each row represent a path (line) in the hierarchy tree.

How do you update this dimension? What is the business key? There is no business key. The unique identifier is the path (line), i.e. the combination of all the nodes from the top level to the bottom level, strung together to represent the path. Because there is no business key this dimension is type 1. We can’t identify if a row has changed.

In a single-parent hierarchy, the leaf node is the business key. So the parents (which are attributes) can be updated using that business key. But in a multi-parent hierarchy, we can’t do that. For a multi-parent hierarchy, if the hierarchy data is updated monthly (say it is 10,000 rows), the usual technique is to stamp the 10,000 rows for this month as “March”. And the 10,000 rows for next month as “April”. So we have a “month” or “version” column. But if the data is updated daily, we should delete this month’s rows and reinsert them from source every day.

This dimension has a surrogate key, which is a sequential number. This dimension also contains an unknown row with surrogate key zero.

4. Bridge Table

The bridge table contains only 2 columns: the surrogate key of the product hierarchy dimension, and surrogate key of the product dimension. This bridge table can be implemented as a view over the product dimension. The number of rows in this bridge table is the same as the number of rows in the product hierarchy dimension, but it only contains 2 columns (3 with the SK, plus system columns). It contains both the leaf level nodes, and the intermediate level nodes from the product hierarchy dimension.

In relational world we don’t need this bridge table. The leaf id in Product Hierarchy Dimension table can directly be linked to the Product Dimension table. But in SSAS we can’t connect a dimension to another dimension. We will need a fact table in between. That is why we need this bridge table.

Note: Christopher Adamson offers a design for a bridge table: link. In this design the bridge table contains all children below a parent (not just the children directly below the parent, but from the lower levels as well).

Note2: Margy Ross said that bridge tables are used to represent a ragged or variable depth hierarchical relationship which cannot be reasonably forced into a simpler fixed depth hierarchy of many-to-one attributes in a dimension table: link. I am not sure what she meant, but it is probably what we are implementing here. True, a multiple-parent hierarchy is not the same as a ragged or variable-depth hierarchy, but if it is not many-to-one then it is many-to-many, and a multiple-parent hierarchy is many-to-many.

5. Parent Child Hierarchy table

In some projects, the sales amount in the fact table is aggregated up. At this summary level, the sales amount does not correspond to the bottom level of the product hierarchy. It corresponds to the middle level or the top level of the product hierarchy.

In this case we need to create a new dimension table which contains every single node in the hierarchy, not just the leaf nodes (bottom nodes). The parent column in this table is not mandatory, it is optional. But it is best practice to have it so we can traverse the hierarchy to find leaf nodes.

E. SSAS Cube

In SSAS cube, the design is using many-to-many relationship. See Marco Russo’s paper here: link.

Reference:

  1. Five best ways to split strings, by Aaron Bertrand, 26 July 2012, SQL Performance.com: link
Advertisements

Leave a Comment »

No comments yet.

RSS feed for comments on this post. TrackBack URI

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

w

Connecting to %s

Create a free website or blog at WordPress.com.

%d bloggers like this: