Data Warehousing and Data Science

29 December 2015

DimMonth, DimQuarter and DimYear

Filed under: Data Architecture,Data Warehousing — Vincent Rainardi @ 6:21 am
Tags: ,

Sometimes the grains of our fact tables are monthly, quarterly, or yearly. In such cases, how do we create DimMonth, DimQuarter and DimYear? Some of the questions in these cases are:

  1. Why do we need to create Month as a dimension? We can’t Month column in the fact table remain as Month, not as a dimension key?
  2. What does DimMonth look like? What are the attributes? How about DimQuarter?
  3. Should we create DimMonth as a physical table, or as a view on top of DimDate?
  4. What is the surrogate key of DimMonth and DimQuarter? Do we create a new SK, or do we use the SK from DimDate?
  5. Do we need to create a dimension for year? It seems weird because it only has 1 column (so that would be against the “degenerate dimension” concept)

For example, in the source system we have a table which stores the monthly targets for each store:
Sales Target table

Or quarterly target like this:
Quarterly Target

How do we create a fact table for this? Should we create it like this?
FactSalesTarget

Question 1 above: Why do we need to create Month as a dimension? Why can’t we leave it as Month in the fact table, not as a dim key, like this?
FactSalesTarget - MonthNotAsDimKey

Question 2 above: if we decided to keep the column as MonthKey, how should DimMonth look like? What are the attributes? Should DimMonth be like this?
DimMonth

What attributes should we put there?

  • Do we need quarter and year?
  • Do we need half year? e.g. 2015 H1 and 2015 H2
  • For month name, should we use the short name or long name? e.g. Oct 2015 or October 2015?
  • Do we need an attribute for “Month Name without Year”, e.g. October?
  • Do we need “Month End Date” column? e.g. 2015-10-30
  • Do we need “Month is a Quarter End” indicator column for March, June, Sep and Dec?
  • Do we need “Number of days in a month” column? e.g. 30 or 31, or 28 or 29.

Or should we use “intelligent key” like this?
DimMonth using Intelligent Key

Question 3 above: Should we create DimMonth as a view of DimDate like this?

create view DimMonth as
select min(DateKey) as MonthKey,
MonthNumber, MonthName, Quarter, Year
from DimDate
group by MonthNumber, MonthName, Quarter, Year

DimMonth as a view

What are the advantages of creating DimMonth and DimQuarter as a view of DimDate? (compared to creating it as a physical table) What are the disadvantages?

I think with the above questions and examples we are now clear about what the issue is. Now let’s answer those questions.

Q1. Do we need to create Month as dimension? We can’t Month column in the fact table remain as Month, not as a dimension key, like this?
FactSalesTarget - MonthNotAsDimKey

We need the Month column in the fact table to be a Dim Key to a month dimension because we need to access Year and other attributes such as Quarter.

Bringing Year into the Sales Target fact table like below is not a good idea, because it makes it inflexible. For example, if we want to add Quarter column we have to alter the fact table structure.
Bring year into fact table

Using a Dimension Key to link the fact table to a Month dimension makes it a flexible structure:
DimKey Link

There is an exception to this: Snapshot Month column. In a monthly periodic snapshot fact table, the first column is Snapshot Month. In this case, we do not need to create this column as a dimension key, linking it to DimMonth. In this case, we do not need a DimMonth. Because we do not need other attributes (like Year or Quarter). A monthly periodic snapshot fact table stores the measures as of the last day of every month, or within that month. For example: number of customers, number of products, number of orders, number of orders for each customer, the highest price and lowest price within that month for every product, the number of new customers for that month, etc.

Q2. What does DimMonth look like? What are the attributes?

Obviously, the grain of DimMonth is 1 row per month. So we are clear about what the rows are. But what are the columns? Well it depends on what we need.

I usually put MonthNumber, MonthName, Quarter and Year in DimMonth, because they are frequently used.

I don’t find “Month Name without the Year” as a useful attribute. I rarely come across the need for “Half Year” attribute.

“Month is a Quarter End” column is also rarely used. Instead, we usually use “Quarter” column.

“Month End Date” and “Number of days in a month” are also rarely used. Instead, we usually use “IsMonthEnd” indicator column in the DimDate.

For month name, should we use the short name (Oct 2015) or the long name (October 2015)? I found that the short name is more used that the long name. But the month number (2015-10) is even more frequently used that the short name

Q3. Should we create DimMonth as a physical table, or as a view on top of DimDate?

This is really the core of this article. A view on top of DimDate is better in my opinion, because we avoid maintaining two physical tables. And it makes the dimensions less cluttered.

If we make DimMonth and DimQuarter as a physical dimensions, in SSMS Object Explorer, when we open the table section we would see these:
DimDate
DimMonth
DimQuarter
DimYear

But if we create DimMonth and DimQuarter as views, then we will only see DimDate in the Object Explorer’s table section. The DimMonth and DimQuarter will be in the view section.

The main disadvantage of creating DimMonth as a view from DimDate is that it is less flexible. The attribute column that we want to appear in DimMonth should exist in DimDate. But I found that DimMonth usually only need 2 or 3 attributes i.e. Month, Quarter, Year; and all of them are available in the DimDate table. So this is not an issue.

Avoiding maintaining 2 physical tables is quite important because when we extend the date dimension (adding more years i.e. more rows) and we forget to extend DimMonth and DimQuarter, then we will cause an error.

The other consideration is of course the performance. I do not find the performance of DimMonth and DimQuarter to be an issue. This is because DimDate is not too large, and more importantly because the monthly and quarterly fact tables are small, less than 1 million rows. They are much smaller than daily fact tables which have millions or billions of rows.

Q4. What is the surrogate key of DimMonth and DimQuarter? Do we create a new SK, or do we use the SK from DimDate?

If we create DimMonth and DimQuarter as physical tables, then the surrogate key can either be pure surrogate (1, 2, 3, …) or intelligent key (201510, 201511, 201512, etc.)

But if we create them as a view of DimDate, then the surrogate key can be either the first day of the month (20151001, 20151101, 20151201, etc.) or the month itself (201510, 201511, 201512, etc.). I prefer the latter than the former because it is more intuitive (intelligent key) and there is no ambiguity like the former.

The script to create the view for DimMonth with SK = 201510, 201511, 201512, etc. is like this:

create view DimMonth as
select distinct convert(int,left(convert(varchar,SK_Date),6)) as MonthKey,
[MonthName] , Quarter, [Year]
from DimDate

Q5. Do we need to create a dimension for year?

No we don’t need to create DimYear, because it would only have 1 column.

What should we call the dim key column in the fact table then? Is it YearKey or Year? We should call it YearKey, to be consistent with the other dim key columns.

A dimension which only has 1 column, and therefore be kept in the fact table is called a Degenerate Dimension. A Degenerate Dimension is usually used to store identifier of the source table, such as Transaction ID and Order ID. But it is also perfectly valid for dimensions which naturally only have one attribute/column, like Year dimension. See my article about “A dimension with only one attribute” here: link.

23 December 2015

Data Dictionary

Filed under: Data Architecture,Data Warehousing — Vincent Rainardi @ 8:15 am
Tags: ,

During my 10 years or so doing data warehousing projects, I have seen several initiatives doing data dictionary. A data dictionary is a list of definitions used in the system. Each definition is about usually about 10 to 50 words long. And there are about 50 to 200 definitions. What being defined is mostly technical terms, such as the meaning of each field in the database. This is the origin of data dictionary, grew from the need to explain the fields. For example, delivery date, product category, client reference number, tax amount, and so on. I saw this in 1998 all the way to today (2015).

This data dictionary was created by the business analyst in the project, who was trying to explain each field to clear any ambiguity about the terms used in the system. I found that the business people view it differently. They found that it is not too useful for them. The data dictionary is very useful to the new starter in the IT development project, but not too useful for the business. I’ll illustrate with several examples below.

In the database there are terms like: CDS, obligor, PD, yield. The BA defines these terms by copying a definition from internet searches, so they ended up with something like this:

  • Credit Default Swap (CDS): a financial contract on which the seller will pay the buyer if the bond defaults
  • Obligor: a person or entity who is legally obliged to provide some payment to another
  • Probability of Default (PD): likelihood of a default over a particular time frame
  • Yield: the interest received from a bond

The CDS definition is trying to explain a big concept in one sentence. Of course it fails miserably: what bond? What do you mean by “defaults”? After reading the whole sentence, the readers are none of the wiser. The readers will get much better understanding about CDS if they read the Wikipedia page about CDS.

The Obligor definition is too generic. It is too high level so that it is useless because it doesn’t provide the context to the project/system. In a credit portfolio project, obligor should be defined as the borrower of the loan, whereas in a fixed income portfolio project, obligor should be defined as the issuer of the bond.

The PD definition contains a common mistake in data dictionary: the definition contains the word being defined, because the author doesn’t have a good understanding about the topic. What is a default? It is not explained. Which PD is used in this field, Through The Cycle or Point In Time? Stressed or Unstressed? Is it calculated using linear regression or discriminant analysis? That is what the business wants to know.

The Yield definition does not explain the data source. What the business users need is whether it is Yield To Maturity, Current Yield, Simple Yield, Yield to Call or Yield to Worse. There are different definitions of yield depending on the context: fixed income, equity, property or fund. The author has firmly frame it for fixed income, which is good, but within fixed income there are 5 different definitions so the business want to know which one.

Delivery mechanism: Linking and Web Pages

To make the life of the business users easier, sometimes the system (the BI system, reporting system, trade system, finance system, etc.) is equipped with hyperlinks to the data dictionary, directly from the screen. So from the screen we can see these words: CDS, obligor, PD, yield are all hyperlinks and clickable. If we click them, the data dictionary page opens, highlighting the definition. Or it can also be delivered as a “balloon text”.

But mostly, data dictionaries are delivered as static web pages. It is part of the system documentation such as architecture diagram, content of the warehouse, support pages, and troubleshooting guide.

It goes without saying that as web pages, it should have links between definitions.

I would say that the mechanism of delivery only doesn’t contribute much to the value. It is the content which adds a lot of value.

Accessibility

A data dictionary should be accessible by the business, as well as by IT. Therefore it is better if it is delivered as web pages rather than links direct from the system’s screens. This is because web pages can be accessed by everybody in the company, and the access can be easily controlled using AD groups.

Data Source

The most important thing in data dictionary is in-context, concise definition of the field. The second most important thing is the data source, i.e. where is this field coming from. System A’s definition of Duration could be different from system B’s. They may be coming from different sources. In System A it might be sourced from Modified Duration, whereas in System B it is sourced from Effective Duration.

Because of this a data dictionary is per system. We should not attempt building a company-wide data dictionary (like this: link) before we completed the system-specific data dictionaries.

How deep? If the field is sourced from system A, and system A is sourced from system B, which one should we put as the source, system A or system B? We should put both. This is a common issue. The data dictionary says that the source of this field is the data warehouse. Well of course! A lot of data in the company ended up being in the data warehouse! But where does the data warehouse get it from? That’s what the business what to know.

Not Just the Fields

A good data dictionary should also define the values in the field. For example, if we have a field called Region which has 3 possible values: EMEA, APAC, Americas, we should explain what APAC means, what EMEA means and what Americas means (does it include the Caribbean countries?)

This doesn’t mean that if we have a field called currency we then have to define USD, EUR, GBP and 100 other currencies.  If the value of a field is self-explanatory, we leave it. But if it is ambiguous, we explain it.

If the field is a classification field, we should explain why the values are classified that way. For example: the value of Asset Class field could be: equity, fixed income, money market instruments (MMI), CDS, IRS. Many people would argue that CDS and MMI are included in fixed income, so why having separate category for CDS and MMI? Perhaps because MMI has short durations and the business would like to see its numbers separately. Perhaps because the business views CDS as hedging mechanism rather than investment vehicle so they would like to see its numbers separately.

Summary

So in summary, a data dictionary should:

  1. Contain in-context, concise definition of every field
  2. Contain where the field is sourced from
  3. Contain the definition of the values in the field
  4. It should be system specific
  5. It should be delivered as web pages
  6. It should be accessible by both the business and IT

A good data dictionary is part of every data warehouse. I would say that a data warehouse project is not finished until we produce a good data dictionary.

18 December 2015

SQLBits XV Liverpool 4-7th May 2016

Filed under: Data Warehousing,Event,SQL Server — Vincent Rainardi @ 10:17 pm

I’ve posted 8 sessions for SQLBits XV in Liverpool. Hopefully one of them got chosen so I can to present it.

  1. Populating a Fact Table using SSIS
  2. Testing a SQL Server Data Warehouse
  3. Indexing Fact Tables in SQL Server
  4. SQL Server 2016 Data Warehousing & BI Features (deleted as it’s duplicate with another speaker)
  5. Using a SQL Server Data Warehouse for CRM
  6. Using a SQL Server Data Warehouse for BPM
  7. Loading a Dimension Table in SSIS
  8. T-SQL: Back to Basic

1. Populating a Fact Table using SSIS

In this session I will be doing a demo on my article “Populating a Fact Table using SSIS”: https://dwbi1.wordpress.com/2012/05/26/how-to-populate-a-fact-table-using-ssis-part1/ by going through it step by step on the screen. Giving the audience plenty of time to understand the mechanics and to ask questions.

When populating a fact table, people come across issues like these:

  • Where do I populate my fact table from?
  • How do I get the dimension keys to put into my fact table?
  • Where can I get the data for the measures columns?
  • With what do I populate the snapshot date column?
  • What is the primary key of my fact table?
  • The source table doesn’t have a primary key of the source table. What should I do?
  • I can’t find the row in the dimension table, what should I do?

As always, the best way to explain is by using an example. So in this session I’m going to do the following, and hopefully by the end of the session the above issues / questions in the audience’s mind will be answered.

  • Describe the background on the company and the data warehouse
  • Create the source tables and populate them
  • Create the dimension tables and populate them
  • Create the fact table (empty)
  • Build an SSIS package to populate the fact table, step by step

2. Testing a SQL Server Data Warehouse

The usual issue when we test a data warehouse is the sheer volume of data, which makes the normal testing method doesn’t work. Over the years, various ways have been developed to provide solutions to this problem, from a manual process to automated process.

In this session I will be demo-ing 3 approaches for testing SQL Server-based data warehouses:

  1. Using Manual Comparison
  2. Using Excel
  3. Using SSIS – SQL Server – SSAS

Along the way I will show the challenges, the mechanics, and the solutions.

Using Excel we will need to sort the data first. And then we use different formula for compare string columns, date column, logical column and numeric columns. We also need to incorporate the tolerance levels. Finally we can present the data quality for each column.

To verify the data in the data warehouse using the 3rd approach, first we need to match the rows to the source system. Then we verify the attribute columns. Then we verify the measure columns. SSIS facilitates the flow of data from both the source system and the warehouse into the test area, which is in SQL Server. A series of stored procedures then do the matching process in SQL Server, and compare the attribute columns and measure columns. The SSAS enables the testers to dissect the compared data, to find out which specific data areas are causing the issue.

3. Indexing Fact Tables in SQL Server

Index Fact Tables in SQL Server are different to Oracle or DB2 because of clustered index. Some people say that it is better to create a clustered index on the fact key, then non-clustered on each dimension key. Some say that it is better to use the clustered index on the snapshot date column. Of course the considerations are different between periodic snapshot fact tables, accumulative snapshot fact tables and transaction fact tables.

In this session I will go through the principles in indexing the 3 types of fact tables, including physical ordering, slimness of the clustered index, multi-column clustered index, a PK doesn’t have to be a clustered index, which dimension key column to index, when to include a column in an index, and of course partitioning, i.e. indexing partitioned fact tables. As always, it is better to explain by example than theory so I will give an example on each principles so that we can understand how it is applied in practice. For example: the performance comparison. I will also add my own “lessons learned”, i.e. mistakes I’ve made in the past so you can avoid making the same mistakes.

The title is “in SQL Server” because principles I will be explaining in this session are only applicable specifically to SQL Server data warehouses. It does not apply to Oracle, DB2 or Teradata data warehouses.

4. SQL Server 2016 Data Warehousing & BI Features

In this session I would like to go through the new features in SQL Server 2016 which are applicable to data warehousing and BI such as DB Engine: Stretch database, JSON, column-store indexes, memory-optimized tables, live query statistics, query store, temporal tables; SSAS: DBCC for SSAS, SSAS Extended event GUI, new DAX functions, tabular model partitions, roles in tabular; SSIS: custom logging in SSIS, Excel 2013 data sources, OData V4 data source, R in Execute SQL Task, HDFS connections, Auto Adjust Buffer Size, reusable control flow template; SSRS: Pin Report Items, project template for SSRS, PPTX format, PDF based printing, custom parameter pane, HMTL 5, Report Builder UI, Tree Map, Sunburst Charts, High DPI, Subscription; MDS: data compression at entity level, super user function, CRUD permissions, transaction log.

I may not be able to fit all of the above points into the session, but will try to cover as many as possible. It is essentially a session which goes through the new features in SQL 2016. Wherever possible I will try to demo the feature rather than talking about it on PowerPoint. Because by seeing we learn a lot more than by discussing them in theory. Also for each feature I will mention how they can be used in data warehousing or BI development. If a feature already exist in 2014, but it is enhanced in 2016 I will show only the improvement, and not going through the fundamentals of that feature.

5. Using a SQL Server Data Warehouse for CRM

This session is taken from my article “Using a Data Warehouse for CRM”, which I wrote with Amar Gaddam: https://dwbi1.wordpress.com/2010/01/28/using-data-warehouse-for-crm/

A data warehouse is not only for BI. It can also be used from Customer Relationship Management (CRM). In fact, a DW is the best platform for doing CRM. In this session I will show how to use a data warehouse built in SQL Server for doing core CRM activities such as: 1. Single Customer View, 2. Permission management, 3. Campaign segmentation, 4. Campaign Results, 5. Customer Analysis, 6. Personalisation, 7. Customer Insight, 8. Customer Loyalty Scheme.

If you don’t work in CRM and not familiar with the above concepts, don’t worry. I will explain them one by one during this sessions. Although it would be help a lot if you read my article above, before coming to this session.

For each point I will try to show the table design in SQL Server DW so we can all see how they are implemented, not just the theory. Due to time limit I may not be able to cover all the above points, but I promise I will try to cover as much as possible.

6. Using a SQL Server Data Warehouse for BPM

BPM = Business Performance Management. A Data Warehouse is not only used for BI. It is also used for CRM and BPM. In this session I will show how to use a Data Warehouse for BPM using Balanced Scorecard. Before the session I will have built a Data Warehouse on SQL Server, for BPM, and during the session I will show the design of this DW.

The Data Warehouse will contain Sales data, Financial data, Customer Service data, and Order Processing data. Each of this data will form part of the Balanced Scorecard. In addition to Fact and Dimension tables, a Data Warehouse used for BPM contains one additional area which stores the KPI scores. I will show how this area is designed, and how the KPI values and score are calculated.

As a take away, I hope the audience can learn how a DW is used outside BI, how the additional area are designed and built, and how it is all implemented on a SQL Server platform. I will also show how the SSIS packages that populate the Data Warehouse from the source system, and the SSRS reports which shows the KPIs, the Balanced Scorecard and Performance Scoring calculation.

7. Loading a Dimension Table in SSIS

We can load a dimension table in SSIS using a) SCD Transformation, b) Merge Join + Conditional Split, c) Change Data Capture, d) Merge command, and e) Upsert using Execute SQL. In this session I will be showing/demo-ing these 5 approaches on the screen one by one, then compare them in terms of efficiency/performance, code clarity/maintenance, and ease of build. It is based on my article: https://dwbi1.wordpress.com/2015/09/09/loading-a-dimension-table-using-ssis/

SCD Transformation and Merge Join + Conditional Split are both using row-by-row operation hence not efficient compared to Upsert. CDC is a mechanism to extract the data changes. To load a dimension table we need to read the CDC output table, and update or insert into the dimension table based on the _$Operation column. The Merge command is buggy, has concurrency issues, requires an index to support performance, and does the Insert twice.

In every data warehouse project, we need to load many dimension tables. So this is a fundamental knowledge to know, for those of us who uses SQL Server and SSIS for your warehouse.

8. T-SQL: Back to Basic

In SQL Server we usually look forward to working with the latest features, such as SQL 2016, Power BI, ML, Azure, etc. So much so that we often forget the basics such as creating/modifying a constraint, PK, FK, index, trigger, partitioned table, synonym; joins, correlated subquery, update from, output an SP into a table, cast, NULLIF, variables, temp tables, table variables, cross apply, while, case; string/date functions, row number, transaction, except, rank, find duplicate rows, etc.

This session is intended to be a refresher, i.e. we know all the above but we forget. We will go back to basic. I will cover 3 sections: a) creating database objects, b) database development, c) database administration. This session is based on my article: https://dwbi1.wordpress.com/2014/12/27/sql-server-scripts/. So this session will consists of many short SQL scripts (T-SQL) and only T-SQL, i.e. I won’t be using the GUI. I won’t be able to cover every single script in that article (there are over 250 in total!), but I will pick the important ones, and avoid the ones which are similar to the others.

Measure or Attribute

Filed under: Data Architecture,Data Warehousing — Vincent Rainardi @ 10:00 pm

Last week two people asked me about this subject so I thought it would be beneficial for others if I write it. It is about whether a number should be a measure (a fact) or an attribute (of a dimension). Apologies in advance I can’t reveal the real case as it is a client’s case. But I will present several similar cases to illustrate this issue. As always, it is better to learn by example (cases) than by theory.

First case: worker rate

The first case is the timesheet system in the construction industry. The daily rate of the worker is the number in question. The hourly rate is recorded as an attribute of the worker. The fact table recorded the hours each worker worked in the project, every day. The data mart is used to analyse the cost the project (by work area), and to plan for the next 6-12 months. The cost for each worker for each day is simply the rate multiplied by the hours worked.

There are 2 issues here. One, when the rate entered last month is incorrect, how do we correct it. Second, how to plan for next year using the current rate + 5%, and current work hours -5%.

In this case, it is better to treat the rate as a measure, located in the fact table for actual costs as well as in the fact table for planning. This way we can correct the past rates and recalculate last month’s costs. We can also change the future rates (in the planning fact table) and recalculate the future costs. Yes we will occupy more disk space because we are repeating the rate on every row, but that is a very good price to pay for the advantage we get: more flexibility, ability to support the required functionality and higher performance.

The need to put the rate in the worker dimension arises when we need to know the historical rates for each worker. But this need can also be satisfied if we store the rate as a measure. Granted it would be less performant as we have to plough through lots of fact rows to distil the data to get distinct historical rate per worker. So if this is a mandatory requirement, it is a good idea if we put rate an type 2 attribute, in addition to putting the rate as a measure.

Second case: product price

This is a classic example from the retail industry. But it is also applicable to banking and insurance, pharma, energy and telecom sectors. In a shop, each product has a price. The same product can have different price in different stores. The price can change at any time. The price reviews can happen weekly or monthly, or ad-hoc. It can be done at the store level, city level or regional level. The actual price paid depends on discount or promotion applied for each transaction, at the time the customer purchase the product.

The issue here is, whether we should store the price of the product in a sales fact table, or as an attribute in the product dimension. I maintain an opinion that the price must be store in the sales fact table, for each transaction. In addition, the price should also be stored in the price list fact table. The price on the price list fact table is the official price for each store, for that day. Whereas the price on the sales fact table is the actual price paid, which could be different from price list, because of the offer or promotion applied to that transaction. For example, buy one get one free, or three for the price of two. Or, because the product is in a bundle such as cheese and wine, pyjamas and sleeper, a dictionary and thesaurus, etc. Or, because the product is simply discounted on that day, perhaps due to expiry date, or because the manager want to get rid of the stock quickly.

Having a price in the product dimension, in my opinion is not right. Product attributes such as size and colour, model and optional extras, are true product properties and therefore should be in the product dimension as attributes. These attributes are the same across stores, they are truly the property of a product. But prices can be different across stores, so they are not sole property of a product.

The price may be attached not to an individual product code, but to several product codes. For example, regardless of colour, the prices of the products are the same. Prices can be dynamic, e.g. prices of tickets (entertainment industry or transport industry) of the same show or flight can be different from one seat to the next, from one booking to the next, from one time slot to the next. Prices can be in bands, for example in utility industry (electricity and gas) and in telecommunication industry (mobile calls) the first 100 kWh can have different price to the second 100 kWh, Friday price can be different to Sunday calls.

The concept of price list fact table is more accommodative to satisfy the above demanding requirements, compared to putting price as a type 2 attribute in the product dimension.

Blog at WordPress.com.