Data Warehousing and Data Science

30 July 2011

Bridge Table with Date

Filed under: Data Warehousing — Vincent Rainardi @ 3:06 am
Tags:

A bridge table is a fact table connecting 2 dimensions. A bridge table is usually used in data warehousing for multi valued attributes, see here. A bridge table has no measure (factless). Usually a bridge table contains only 2 columns, both are dim key columns, connecting the 2 dimensions (see my previous article about Bridge Table here*). But there are some cases where we have the 3rd column, which points to the 3rd dim. And this 3rd dimension contains the date. Hence the title of this article Bridge Table with Date.

*Also see my SQLBits Brighton presentation here, page 35 about bridge table.

As usual, it is easier to understand using an example. It’s a bank. A retail bank. Every 3 months or so they conducted a cross sell opportunity review. In the review they identify (programmatically) which products might be suitable for a customer. Products which the customer does not currently have. Hence the name “cross selling” opportunity, or CSO. The results of this CSO identification software are then reviewed manually by the customer service officers and they adjusted/overwritten the results. The final output is then sent to the bank’s data warehouse for storage and reporting.

Let’s have a look at the table structure:

create table dim_customer (customer_key int, customer_name varchar(50))
create table dim_cso (cso_key int, cso_product varchar(50)) --cross selling opportunity
create table dim_review (review_key int, review_number int, review_date date) --a review is conducted across all customers in the bank, about once a quarter
create table bridge_customer_review (customer_key int, review_key int, cso_key int)

insert into dim_customer values (1,'Customer1'), (2,'Customer2')
insert into dim_cso values (1,'Mortgage'), (2,'Investment'), (3,'Mortgage, Investment'), (4,'Credit Card'), (5,'Mortgages, Credit Card'), (6,'Investment, Credit Card'), (7, 'Mortgage, Investment, Credit Card')
insert into dim_review values (1, 1, '2011-04-29'), (2, 2, '2011-07-29')
insert into bridge_customer_review values (1,1,1), (1,1,2), (1,2,1), (1,2,2), (1,2,4), (2,1,1),(2,2,1),(2,2,2)

select * from dim_customer
select * from dim_cso
select * from dim_review
select * from bridge_customer_review

And these are the questions that the table structure was designed to answer:

-- Q1. How many customers have Investment cross selling opportunity?
select count( distinct customer_key) from bridge_customer_review where cso_key = 2
-- Output: 2

-- Q2. List of customers having Investment cross selling opportunity:
select distinct C.customer_name
from bridge_customer_review B
join dim_customer C on C.customer_key = B.customer_key
join dim_cso O on O.cso_key = B.cso_key
where O.cso_product = 'Investment'
/* Output:
Customer1
Customer2
*/

-- Q3. What are the CSOs for Customer1?
select B.review_key, O.cso_product from
( select customer_key, review_key, sum(cso_key) as cso_key from bridge_customer_review
  group by customer_key, review_key
) B
join dim_cso O on O.cso_key = B.cso_key
join dim_customer C on C.customer_key = B.customer_key
where C.customer_name = 'Customer1'
/* Output:
instance_dd cso_product
1  Mortgage, Investment
2  Mortgage, Investment, Credit Card
*/

-- Q4. For Customer1 how many reviews are Mortgage, Investment, Credit Card?
select count(*) from
( select customer_key, review_key, sum(cso_key) as cso_key from bridge_customer_review
  group by customer_key, review_key
) A
join dim_cso O on O.cso_key = A.cso_key
join dim_customer C on C.customer_key = A.customer_key
where C.customer_name = 'Customer1' and O.cso_product = 'Mortgage, Investment, Credit Card'
-- Output: 1

--Q5. In the July review, which customers get cross selling recommendation of Mortgage and Investment?
select count(*) from
( select customer_key, review_key, sum(cso_key) as cso_key from bridge_customer_review
  group by customer_key, review_key
) A
join dim_cso O on O.cso_key = A.cso_key
join dim_customer C on C.customer_key = A.customer_key
join dim_review R on R.review_key = A.review_key
where R.review_date = '2011-07-29' and C.customer_name = 'Customer1'
and O.cso_product = 'Mortgage, Investment'
-- Output: 1

In the bridge table above, we have a customer key column and a CSO key column. The customer key column points to the customer dimension as usual, whereas the CSO key column points to the Cross Selling Opportunity dimension, which contains recommended cross sell products. The bridge table essentially is a factless fact table connecting these 2 dimensions. It contains which products are recommended for each customer for cross selling.

But in the bridge table, we also have the review key column, which points to the review dimension. This review dimension has a date, enabling us to find out in which review the CSO product recommendation originated. Hence the title of the article, Bridge Table with Date.

Also notice that in the CSO dimension, I applied the “1,2,4,8 keys” technique*, which I used here. The purpose of using “1,2,4,8 key” technique is so that the total is unique, for example:

  • 1+2 is 3 and 3 is not there in the set (we only have 1,2,4,8)
  • 1+4 is 5 and 5 is not there in the set

In the CSO dim it is optional to have rows for CSO key = 3,5,6,7. Without these rows, we can still decode the combination, like this:

To decode 3 (the answer is 1+2), we find out first what is the largest number in the (1,2,4,8) set which is <= 3. The answer is 2. The we find the largest number in the set which is <= 3-2. The answer is 1. So 3 is 2+1.

To decode 5 (the answer is 1+4), we find out first what is the largest number in the (1,2,4,8) set which is <= 5. The answer is 4. Then we find out what is the largest number in the set which is <= 5-4. The answer is 1. So 5 is 4+1.

But, it is a lot easier, if we have the rows for CSO key = 3,5,6,7 because then we can run queries like Q4 where we have this clause: “and O.cso_product = ‘Mortgage, Investment, Credit Card’” rather than trying to decode the product combination like above (mortgage + investment + credit card 1 + 2+ 4 = 7).

*binary keys is probably a better name than “1,2,4,8 keys”.

Vincent Rainardi, 30th July 2011

28 July 2011

Bridge Table with Multiple Instances

Filed under: Data Warehousing — Vincent Rainardi @ 5:20 am
Tags:

A bridge table is a fact table connecting 2 dimensions. A bridge table is usually used to solve “dimensions with multi valued attributes” as I showed in my previous article, “Dimensions with Multi Valued Attributes” here. Below is an example of a bridge table:

create table dim_product (product_key int, name varchar(50))
create table dim_colour (colour_key int, colour varchar(50))
create table bridge_product_colour (product_key int, colour_key int)

insert into dim_product values (1,'Product1'), (2,'Product2')
insert into dim_colour values (1,'Red'), (2,'Yellow'), (3,'Green')
insert into bridge_product_colour values (1,1), (1,2), (2,2), (2,3)

select * from dim_product
select * from dim_colour
select * from bridge_product_colour

-- Q1. How many products contain colour Yellow?
select count(*) from bridge_product_colour F
join dim_colour C on F.colour_key = C.colour_key
where C.colour = 'Red'
Output: 1

-- Q2. How many product for each colour?
-- (total is more than 3 because 1 product can have multiple colours)
select C.colour, count(*)
from bridge_product_colour B
join dim_colour C on C.colour_key = B.colour_key
group by C.colour
Output:
Red 1
Green 1
Yellow 2

Above is an example of a simple bridge table. It only connects the product and colour.

But sometimes, we have “instances”. Each item can have multiple instances. Each instance has different colour scheme, for example instance 1 of product 1 is red and yellow, whereas instance 2 is yellow and green.

In this case we need to create an “instance degenerate dimension” column in the bridge table, and we need to set the colour key to 1, 2, 4, 8, etc, so that when we sum the colour key, we will get a unique number.

An example of this is film. A film has multiple genres. Film1 is classified as “Action, Drama”. But Film1 can also be classified as “Action, Drama, Comedy”.

The example below illustrates this case:

create table dim_film (film_key int, title varchar(50))
create table dim_genre (genre_key int, genre varchar(50))
create table bridge_film_genre (film_key int, instance_dd int, genre_key int)

insert into dim_film values (1,'Film1'), (2,'Film2')
insert into dim_genre values (1,'Action'), (2,'Drama'), (3,'Action, Drama'), (4,'Comedy'), (5,'Action, Comedy'), (6,'Drama, Comedy'), (7, 'Action, Drama, Comedy')
insert into bridge_film_genre values (1,1,1), (1,1,2), (1,2,1), (1,2,2), (1,2,4), (2,1,1),(2,2,1),(2,4,2)

select * from dim_film
select * from dim_genre
select * from bridge_film_genre
film_key instance_dd genre_key
1	1	1
1	1	2
1	2	1
1	2	2
1	2	4
2	1	1
2	2	1
2	4	2
-- Q1. How many films contain Drama genre?
select count( distinct film_key) from bridge_film_genre where genre_key = 2
Output: 2

-- Q2. List of films containing Drama genre:
select distinct F.title
from bridge_film_genre B
join dim_film F on F.film_key = B.film_key
join dim_genre G on G.genre_key = B.genre_key
where G.genre = 'Drama'
Output:
Film1
Film2

-- Q3. What are the genres for each instance of Film1?
select B.instance_dd, G.genre from
( select film_key, instance_dd, sum(genre_key) as genre_code from bridge_film_genre
  group by film_key, instance_dd
) B
join dim_genre G on G.genre_key = B.genre_code
join dim_film F on F.film_key = B.film_key
where F.title = 'Film1'
Output
1  Action, Drama
2  Action, Drama, Comedy

-- Q4. For Film1 how many are Action, Crime, and Comedy?
select count(*) from
( select film_key, instance_dd, sum(genre_key) as genre_code from bridge_film_genre
  group by film_key, instance_dd
) A
join dim_genre B on B.genre_key = A.genre_code
join dim_film C on C.film_key = A.film_key
where C.title = 'Film1' and genre = 'Action, Drama, Comedy'
Output: 1

Notice how value genre_key is set to 1,2,4, etc so that its sum is unique. But we created rows in the dim_genre for genre_key = 3,5,6, and 7 to so that we can join back to get the description.

I would like to thank Rohinton Tarapore for highlighting the problem to me.

Vincent Rainardi, 27th July 2011

12 July 2011

Slowly Changing Dimension (SCD) Type 0

Filed under: Data Warehousing — Vincent Rainardi @ 10:08 pm
Tags:

I first noticed SCD type 0 in Wikipedia about 1.5 years ago (link). We recently had a debate/discussion in the office with regard to SCD type 0:

  • Some said that SCD type 0 was static, i.e. once we load it we don’t change it.
  • Some said that SCD type 0 was static, i.e. once we load a row into the dim, we don’t change that row. But we can load it again with a different Surrogate Key! Wow what a revolution!
  • Some wanted to find the definition of SCD type 0 in the Kimball’s Toolkit book. I said to him that type 0 is not there but he still insisted in searching the book. Cover to cover. And after sometime he gave up. Yup, it’s not there. Ralph Kimball and Margy Ross never wrote about type 0. About type 1+2+3 = 6 yes, they had, but not SCD type 0.

In my opinion, with SCD type 0 we don’t change the dim row and we don’t reload it under new Surrogate Key (SK). The question here is: how do we identify the dim row? Using the natural key of course (NK) – some people call it Business Key (BK).

What if the business key changes? Then it’s a new dim row of course, and we load it in. Didn’t we just say that the way we identify a dim row is by the NK?

Do we load new rows? Of course we do. How do we know they are new rows? Well, as I said before, we use the NK to identify a row.

The Idea of SCD Type 0 Was Inspired by SSIS SCD Transform?

I think this type 0 thing was probably inspired by SSIS SCD Transformation. In SSIS SCD Wizard we have 3 different types of attributes: fixed attribute, changing attribute and historical attribute, see below:

 

This is probably what triggered SCD type 0 idea. In 2009 it’s been 4 year since SSIS 2005 (5 years if you include the CTP) hence in 2009 the SSIS SCD Wizard became a common thing. If we think about it, it is natural when we see the “fixed attribute” in the SCD Wizard, we think “oh that’s probably type 0”. See here for example.

What’s Type 0 For?

OK. Enough with the etymology or the history of it. The correct question is this: what’s that for? What is SCD type 0 for? Because if it’s never get used, why bother talking about it right?

Why do we want to ignore any changes that happen in that dim? What are the examples/cases?

To answer that question first we need to differentiate between a type 0 attribute and a type 0 dimension. Not many type 0 dims around (although I’ll try to give 1-2 examples) but there are many type 0 attributes around. In a Type 2 dimension, we often find some columns which are fixed. We don’t change their value. For example: created by and created date. For these columns if their values change, we don’t want to create a new row. And we don’t want to change their values either. Let’s define what a type 0 attribute is, then go through an example. After that, let’s define what a type 0 dimension is, followed by an example.

Type 0 Attribute

A type 0 Attribute is: a column with fixed value in type 2 dimension. It is of course possible to have a type 0 attribute in a type 1 dimension.

An example of a type 0 attribute in a type 2 dimension: in the source system we have a customer table with 10 columns, plus 1 column: Created Date. One day, for customer number 1, the Created Date column has a different value. For 11 years it has always been 31/10/2000 but today it is 6/7/2011. We don’t want to change the value of the Created Date column to 6/7/2011. We want it to still be 31/10/2000. We want to ignore this change, because it is a Created Date column, i.e. the date the customer was created. We know that customer number 1 was created 11 years ago and its record has been there since 2000. Setting its Created Date to 2011 is incorrect. Why in the source system the Created Date became 6/7/2011? Probably because of data migration, i.e. the row for customer1 was inserted into a new table hence the “system created date” was set to the today’s date. In this case we want to deliberately ignore the new value. Created Date is a type 0 column.

In that customer table we also have Last Modified Date column. Yesterday the value for this column for customer 1 was 24/8/2007. Today the value of that column changes to 6/7/2011. We do want to change the value of that column in our customer dimension table to 6/7/2011, but we don’t want to create a new row. Last Modified Date column is a type 1 column. We are not interested in keeping its historical values.

So that’s an example of a type 0 column (Created Date), as opposed to a type 1 column (Last Modified Date).

When defining a type 2 dimension, often we don’t set every single column to be “historical”/type 2. Some columns are type 1 attributes, some columns are type 2 attributes, and some columns are type 0 attributes. These type 0 attributes are usually the system column, i.e. when the row was created, name of the application that created the row (or IP address), etc.

Please note that in some data warehouses, the Last Modified Date and Last Modified User are type 2 columns. We do track their history because we want to know, who changed the row and when. For example: credit card application in a bank. We want to track who changed the status of the application and when. So the “last modified date” becomes a very important column, by which the Data Warehouse tracks the status of each application, using an “Accumulated Snapshot Fact Table” like this:

PS. I’ll blog about this “Using Accumulated Snapshot Fact Table for tracking status” later on. Promise. But now let me finish this “type 0 dimension” first.

Type 0 Dimension

OK so that’s SCD type 0 attribute. How about SCD type 0 dimension? Now that we have understood what a type 0 attribute is, we can define what a type 0 dimension is.

A SCD Type 0 Dimension is: a dimension of which all attributes are type 0.

In a type 0 dimension, we can add new rows (with new natural keys), but we don’t want the existing rows in dimension to change. An example of a type 0 dimension is an event dimension, and a “derived” dimension. I’ll explain the “derived” dimension first, then the event dimension.

Derived Dimension

SCD Type 0 dimension is often found in “derived” dimensions. Let me explain what a “Derived” dimension is using an example. Say we have a transaction table. In this table there are 2 columns called “Grouping Type ID” and “Grouping Type”. The value of this “Grouping Type” column is either “Multiple” or “Single”. Whereas the Grouping Type ID column has M and S value. We don’t have the source table for Grouping Type. So we derived/created a dimension called Grouping Type dimension in your mart, and created 3 columns: Surrogate Key, Grouping Type ID and Grouping Type Description. We populated the dim with select distinct type from that transaction table, resulting in 2 rows, like below:

Then we maintain the dimension by comparing today’s “select distinct type” with the current content of dimension table. We only insert if there is a new grouping type. But we don’t want the rows which are already in the dim to be touched. If there’s an “S” row coming today with description = “Syndication”, we ignore this new row. We can add new rows (with new natural keys), but we don’t want the existing rows in dimension to change. Because the source is a transaction table, not a master table.  The dimension is “derived” from a transaction table.

Event Dimension

The second case of a type 0, which is quite common also, is an “Event” dimension. An “event” dimension is created over time. The rows in an event dimension always “go forward”, i.e. incrementally moving forward in time. There is no “going back” in time. Every event that occur in the business will create a row in the dimension (in some cases more than 1 row). In banking, an example of event is “credit event”. A credit event happens when a company, or a government, goes default (unable to meet its financial obligation, for example a coupon on a bond).

In scientific DW, it is more common to find an event dimension. Weather measurements, electricity measurements, all have event dimensions. The event is the measurement that happens automatically every period of time. An event dimension is characterised with timestamp for its natural key. Which means that what makes it unique is when the measurement happens. For example, a “Differential Pressure Transmitter” (DPT) measures the pressure difference between 2 points and transmits the measurement electronically continuously (analogue output). If we do a sampling say every 1 second, then we have a measurement every second. The time the measurement was taken is the natural key to this Measurement Event Dimension.

To give us an idea about what the attributes of an event dimension are: the attributes of an event dimension are usually the input parameters to the event. In the event of the event being a “production run” (as in in a factory/manufacturer), the input parameters are the settings of the machines in that run. Some DW architect put the single value outputs of the event/measurement into the event dimension. I disagree with this approach. In my opinion the outputs of the event (as they are numerical values) needs to be put in a fact table.

Summary

So a type 0 attribute is a column with fixed value in type-2 dimension or type-1 dimension. The values of a type 0 attribute column never change. The values remain the same forever. An example of a type 0 attribute is created date column.

A type 0 dimension is a dimension of which all attributes are type 0. In a type 0 dimension we can add new rows (with new natural keys) but we don’t want existing rows to change. Examples of a type 0 dimension are derived dimensions and event dimensions.

For those of you seeking what a type 0 dimension is, I hope this helps. There are many respectable data warehouse architects out there and some of them may not necessarily agree with my opinion above. In that case I would like to hear from them, and I will be glad to learn from them.

Vincent Rainardi, 12/7/2011

1 July 2011

SQLBits 9 Liverpool

Filed under: Event — Vincent Rainardi @ 5:33 am

The next SQLBits will be in Liverpool, on 29th September (Thursday) to 1st October (Saturday).

It will be at The Adelphi Hotel, Ranelagh Place, L3 5UL, very near Liverpool Central Rail station.

The sessions are here. The speakers are here.

Steve Jones will be presenting (if chosen). I was in contact with him when I was writing articles for SQL Server Central, of which he was (still is) the editor. SQL Server Central is the largest SQL Server forum in the world. He submitted 2 sessions: “Branding yourself for a dream job”, and “Preparation for disaster”.

Duncan Sutcliffe submitted 3 sessions: “Using  Data Capture in ETL Routines”, “Kerberos – All you need to know in 1 hour”, “Making Full Use of SharePoint for Business Intelligence”. That Kerberos session will be very interesting, as a lot of people implementing SSAS client tool / SharePoint / .NET front end app in corporations are caught up in the “Configuring Kerberos” business. After a few weeks configuring Kerberos in a client, a Kerberos expert said to me “If we can package our Kerberos knowledge  into a product, we’ll be rich!”

Marco RussoAlberto FerrariAlex Whittles and Jennifer Stirrup also submitted their sessions.

SQLBits 4 Manchester

Filed under: Event — Vincent Rainardi @ 5:24 am

In this session I will explain what a dimensional data model is, why do we use it for data warehousing, when to use and when not to use, what are the advantages and disadvantages and what are the alternatives. We will put dimensional modelling into practice by designing a data mart using a case study. That way we can get first hand experience about how a 3rd normal form transactional database is translated into fact and dimension tables in a dimensional model. Time permitting, we will also discuss some particular points in dimensional modelling such as smart date key, choosing dimensional grain, and real time fact table.

The PowerPoint presentation is here.

The SQLBits page is here.

Blog at WordPress.com.