The left diagram below shows a single-parent hierarchy. The right diagram is a multiple-parent hierarchy, where a child node can have 2 or more parents.
Figure 1. Left: single-parent hierarchy, right: multiple-parent hierarchy
In the right diagram above, leaf node H has 3 parents: E and F are in one branch, and G is in another branch. This shows an important principle: in a multiple-parent hierarchy, a child node can be located in 2 different branches (or more).
Double Counting
If the amount in node H is £10, this £10 is carried-forward in node E, F, G then to B and C and finally to A. This causes double counting. Node A accumulates £30 from node H (£20 via node B, and £10 via node C). Solving double counting is one of the most important points in dealing with multiple-parent hierarchies.
Ragged Hierarchy
In the right diagram above, Node I has 2 direct parents: F and C. F is 1 level up, and C is 2 levels up. This is an important principle: in multiple-parent hierarchies, the parent node can be 2 levels up (or more). This situation is called “ragged”, meaning skipping a level. Ragged can also happen in a single-parent hierarchy. Node I in the left diagram above is an example of a ragged node. A hierarchy that has a ragged node is called a ragged hierarchy.
Terminology
A hierarchy is like a tree. It has a root, branches and leaves. It is useful to picture the hierarchy upside down like below, so that it looks like a tree.
Figure 2. Upside Down Hierarchy, similar to a Tree
Picturing the hierarchy upside down helps us understanding the terminology. We say that A is the “Root Node”. I, H and D are the “Leaf Nodes”. You see, when we turn the tree upside down like this, the terminology matches the reality. We have the Root Node at the bottom of the tree, and the Leaf Nodes at the top of the tree.
Unfortunately we should only depicted this upside down tree in our mind. Because in every text we will find that people drawing the tree like in Figure 1 above, with the Root Node at the top and the Leaf Nodes at the bottom. So when we communicates with other people, we should use this “root at the top” diagram/convention.
A. Aggregation
In a multiple-parent hierarchy, the aggregation is done by summing the amount for distinct leaf nodes. Taking a distinct is very important, to avoid double counting.
Figure 3. Left: double counting aggregation (sum of nodes), right: single counting aggregation (sum of distinct nodes)
In figure 2 above, each node is worth £10. The red numbers above the nodes are the total of all nodes under it, including itself. As we can see on the left diagram, if we double count, the total for node A (which is at the top of the tree) is £150, which is incorrect. There are only 9 nodes so the total should be £90. The right diagram shows the correct amount at node A, which is £90.
B. Relational Data Model
The relational model is the usual 2 columns parent-child hierarchy table, preferably with a “level” column showing at what level the child node is. Note that the root node is at Level 1, not Level 0. Also note that the parent of the root node is NULL. This is the parent-child relational hierarchy table for the multiple-parent hierarchy in Figure 1 and 2 above:
Figure 4. Parent-child hierarchy table
C. Flattened Hierarchy Table
C1. Path to Leaf Nodes
To flatten a multiple-parent hierarchy, we walk down from the top level node, taking the left most path until we reach the leaf level. We do this for each path in the tree. Each of these paths then becomes a row in the flatten hierarchy table. The business key of this table is the combination of all nodes.
Figure 5. Left: red paths become the rows, right: flatten hierarchy table
C2. Paths to Intermediate Nodes
Consider path ABFI below. This path reaches the leaf level. This path has 2 intermediate nodes. Node B and node F. So we have 2 “Intermediate Node Paths”: path AB and path ABF (shown as dash lines below).
Figure 6. Paths to Intermediate Nodes
So we have 2 types of path:
- Paths to leaf nodes
- Paths to intermediate nodes
To be complete, the flattened hierarchy should contain not only paths to leaf nodes, but also paths to intermediate nodes, like this:
Figure 7. Flatten hierarchy table including paths to intermediate nodes
C3. Recursive CTE
To create a flatten hierarchy table like above, in SQL Server we use a recursive CTE. CTE stands for Common Table Expression. It is essentially a select statement which is defined at the beginning of another select statement. A recursive CTE is a CTE which joins to itself repeatedly, each time joining on a different row. It typically start by joining to first row, then joining to the second row, and so on until it reaches the last row.
We use Recursive CTE to get all the paths in the parent child hierarchy, as per Figure 6 above. The code is like this:
-- Create and populate a parent child hierarchy table if object_id('dbo.parent_child_hierarchy') is not null drop table parent_child_hierarchy create table parent_child_hierarchy ( id int, child_node varchar(50), parent_node varchar(50), child_level int ) insert into parent_child_hierarchy values (1, 'A', NULL, 1), (2, 'B', 'A', 2), (3, 'C', 'A', 2), (4, 'D', 'B', 3), (5, 'E', 'B', 3), (6, 'F', 'B', 3), (7, 'F', 'C', 3), (8, 'G', 'C', 3), (9, 'H', 'E', 4), (10, 'H', 'F', 4), (11, 'H', 'G', 4), (12, 'I', 'C', 4), (13, 'I', 'F', 4) select * from parent_child_hierarchy; -- Create the "Path Table" containing every path (line) in the tree, including paths to intermediate nodes if object_id ('tempdb..#path') is not null drop table #path ;with CTE (child_node, parent_node, path) as ( select child_node, parent_node, cast(child_node as varchar(max)) as path from parent_child_hierarchy where parent_node is null union all select p.child_node as child, t.child_node as parent, t.path + ' > ' + cast(p.child_node as varchar(50)) as path from parent_child_hierarchy p join CTE t on p.parent_node = t.child_node ) select path into #path from CTE where parent_node is not null; select * from #path order by path Output: A > B A > B > D A > B > E A > B > E > H A > B > F A > B > F > H A > B > F > I A > C A > C > F A > C > F > H A > C > F > I A > C > G A > C > G > H A > C > I
The recursive CTE code consists of two parts: the root part is union-ed with the low level part. The way this code work is: it gets the root node first (level 1, top of the tree), then join to the level 2 nodes, then join to the level 3 nodes, and so on until it reaches the bottom of the tree. Note that a hierarchy tree can have 2 tops (2 root nodes), or more.
The exact code depends on how data is arranged in the parent child hierarchy table. Usually a root node is indicated by parent = null. But occasionally a root node is indicated by parent = child.
C4. Padding
In the flatten hierarchy table the leaf nodes are not located at the same level, e.g. some leaf nodes are located at level 3, some leaf nodes are at level 4, etc. This makes the query difficult. We can’t easily query the leaf nodes. So we populate the empty cells in the lower levels with the leaf nodes. This is called Padding. Like this:
Figure 8. Flatten Hierarchy Table with empty cells populated
In figure 5 we can see that the red D and I on row 1 and 8 are added on the Level 4 column, marked in red. This makes the query easier, i.e. if we want to get the leaf nodes (the most bottom nodes) we just need to query level 4.
D. Dimensional Data Model
The standard dimensional model is a fact table (F), connected to a dimension table (D), which in turn is connected to a hierarchy dimension table (H) using bridge table (B).
Figure 9. Dimensional Data Model
It is easier to explain by example. I am going to use an example of a sales fact table and a product dimension table:
Figure 10. Dimensional Data Model for Product Dimension
The 4 tables above are populated as follows (let’s assume a sale of 1 product, called product A): A row in the fact table is sale transaction for Product A on a particular date, with quantity, unit price and sales amount. The product dimension table has 3 row for product A, 1 row for the active version and 2 rows for the expired versions. The bridge table contains 3 rows for product A, 2 for the Inventory Hierarchy and 1 for the Sales Hierarchy (see below). Each of these 3 rows in the bridge table corresponds to a row in the product hierarchy dimension, with a similar surrogate key.
Inventory Hierarchy is a method of grouping products for the purpose of inventory management. Sales Hierarchy is a method of grouping/classifying products for the purpose of sales management. The Inventory Hierarchy intertwines with the Sales Hierarchy in one big tree. This tree is called Product Hierarchy Dimension. It is a flattened dimension table.
Let’s go through the fact first, then dimension table, then the hierarchy dimension, then the bridge.
1. Sales Fact Table
The sales fact table is the same as single-parent scenario. Each sale is recorded as one row. The fact for each sale event are quantity, unit price and amount.
If a child node has 2 parents in the hierarchy, we do not store it in the fact as two rows. We store it as one row. This is very important to prevent double counting. Because the transaction happens only once, so we should store it only once.
2. Product Dimension Table
The Product Dimension table is a slowly changing dimension type 2. It contains many rows for each product. Each version of each product is stored as one row. Product A has 3 versions, two of them are expired, and one is active. Being type 2, the Product Dimension table has an Effective Date column, an Expiry Date column and an Is Current Flag.
3. Product Hierarchy Dimension Table
The Product Hierarchy dimension is flattened. See “Flatten Table” above. The columns are: Level 1, Level 2, Level 3, etc. Flattening is necessary to increase the user friendliness and speed of the queries. Each path (line) on the Product Hierarchy tree becomes a row in this dimension table. For Product A above, which has 3 parents, becomes 3 rows in the dimension table. 2 rows for Inventory Hierarchy and 1 row for Sales Hierarchy. Each row has different parent. Each row represent a path (line) in the hierarchy tree.
How do you update this dimension? What is the business key? There is no business key. The unique identifier is the path (line), i.e. the combination of all the nodes from the top level to the bottom level, strung together to represent the path. Because there is no business key this dimension is type 1. We can’t identify if a row has changed.
In a single-parent hierarchy, the leaf node is the business key. So the parents (which are attributes) can be updated using that business key. But in a multi-parent hierarchy, we can’t do that. For a multi-parent hierarchy, if the hierarchy data is updated monthly (say it is 10,000 rows), the usual technique is to stamp the 10,000 rows for this month as “March”. And the 10,000 rows for next month as “April”. So we have a “month” or “version” column. But if the data is updated daily, we should delete this month’s rows and reinsert them from source every day.
This dimension has a surrogate key, which is a sequential number. This dimension also contains an unknown row with surrogate key zero.
4. Bridge Table
The bridge table contains only 2 columns: the surrogate key of the product hierarchy dimension, and surrogate key of the product dimension. This bridge table can be implemented as a view over the product dimension. The number of rows in this bridge table is the same as the number of rows in the product hierarchy dimension, but it only contains 2 columns (3 with the SK, plus system columns). It contains both the leaf level nodes, and the intermediate level nodes from the product hierarchy dimension.
In relational world we don’t need this bridge table. The leaf id in Product Hierarchy Dimension table can directly be linked to the Product Dimension table. But in SSAS we can’t connect a dimension to another dimension. We will need a fact table in between. That is why we need this bridge table.
Note: Christopher Adamson offers a design for a bridge table: link. In this design the bridge table contains all children below a parent (not just the children directly below the parent, but from the lower levels as well).
Note2: Margy Ross said that bridge tables are used to represent a ragged or variable depth hierarchical relationship which cannot be reasonably forced into a simpler fixed depth hierarchy of many-to-one attributes in a dimension table: link. I am not sure what she meant, but it is probably what we are implementing here. True, a multiple-parent hierarchy is not the same as a ragged or variable-depth hierarchy, but if it is not many-to-one then it is many-to-many, and a multiple-parent hierarchy is many-to-many.
5. Parent Child Hierarchy table
In some projects, the sales amount in the fact table is aggregated up. At this summary level, the sales amount does not correspond to the bottom level of the product hierarchy. It corresponds to the middle level or the top level of the product hierarchy.
In this case we need to create a new dimension table which contains every single node in the hierarchy, not just the leaf nodes (bottom nodes). The parent column in this table is not mandatory, it is optional. But it is best practice to have it so we can traverse the hierarchy to find leaf nodes.
E. SSAS Cube
In SSAS cube, the design is using many-to-many relationship. See Marco Russo’s paper here: link.
Reference:
- Five best ways to split strings, by Aaron Bertrand, 26 July 2012, SQL Performance.com: link