Data Warehousing and Data Science

17 August 2011

How to Choose the Right BI Technology to Suit Your Style

Filed under: Business Intelligence — Vincent Rainardi @ 5:42 pm

By: Tejas Goradia

I want to see reports. Microsoft offers Excel Services / PowerPivot for Excel, SQL Server Reporting Services (including Report Builder), and PerformancePoint Services; which one should I use?

This post will help you choose a Microsoft data visualization tool based on your reporting requirements, which are driven by the business process and by analysing stakeholder expectations and needs. Reporting styles typically fall into five categories: Self-Service Analysis, Business Reporting, Parameterized & Operational Reporting, Performance Monitoring, and Scorecarding.

Figure 1 – Choosing the Right Microsoft BI Tool (Microsoft Excel, PowerPivot for Excel, SQL Server Reporting Services (including Report Builder), and PerformancePoint Services in SharePoint)

Description and Key Characteristics

  • Self-Service Analysis – The best analysts (non-technical users), and often the best assets, in most organizations are those users who love to dig into the data and really understand why the business is performing the way it is. They want to perform self-service analysis (reports, graphs, dashboards, and so forth) without reliance on IT. They want to be able to easily drill down, pivot, filter and format data, and integrate information from a variety of sources without the need for specialized technical skills such as SQL, MDX, or other query languages, using familiar business analysis skills and tools such as Excel. Because Self-Service BI tools empower users to gather data from multiple sources, the need to build a structured corporate system, such as a data warehouse, is reduced. Tool Options: Excel, PowerPivot for Excel
  • Business Reporting – While typically not as fluid, fun, or “sexy” as the self-service creations discussed above, Business Reporting is just as important to organizations because of the need for power users in business functions to create highly formatted, distributable reports delivered in multiple formats such as PDF, Excel, HTML, and so on. Reports are typically based upon approved corporate data, and then shared more broadly with managers, teams, or departments. Specialized technical skills such as SQL, MDX, or other query languages may not be required to author reports. IT involvement is moderate, usually overseeing the distribution and monitoring of the reporting environment and building the structured data layer upon which the reports are built. A data warehouse, data mart and/or online-analytical-processing (OLAP) environment is often in place and used as the data source for business user reporting. Tool Options: Excel, PowerPivot for Excel, Report Builder
  • Parameterized & Operational Reporting – Similar to the Business Reporting style, Parameterized & Operational Reporting is also characterized by fixed-format reports that need to be distributed to the responsible parties for each store, region, or other organizational entity in a timely manner and without requiring user involvement to locate and run the report. The reports, however, are authored and managed centrally by IT instead of business users and usually follow a pixel-perfect format and rendering style. Consistency, scalability, manageability, auto-refresh and automated distribution are some of the key characteristics of this style. Data used to create reports comes from corporate-sanctioned and IT-managed data sources. Reports may be complex and require special technical skills such as advanced SQL, MDX, or other query languages to build. The reports may be delivered via portal, file share, email, or even sent directly to a printer, and may be generated in PDF, Excel, XML, comma-delimited text, TIFF image, HTML or Microsoft Word formats.
    Tool Options: SQL Server Reporting Services (including Report Builder)
  • Performance Monitoring – Just as a dashboard in a car helps a driver quickly understand the real-time operational performance of the automobile, a dashboard report helps business decision makers understand the current health of their business. A dashboard often combines key performance indicators from various business functions on a single page to help provide an at-a-glance view of performance and the ability to drill down further when something is off track or performing extremely well. This style is called “Performance Monitoring (Dashboard)”. It is generally found in more mature BI environments where data points of interest have been defined, key determinants of business success have been identified, and a comprehensive BI strategy exists. Tool Options: Excel Services / PowerPivot for Excel, SQL Server Reporting Services (including Report Builder), PerformancePoint Services.
  • Scorecarding – Scorecarding is a style that describes highly summarized views with Key Performance Indicators (KPIs) measured and scored against predefined targets, such as a balanced scorecard. This style is generally part of a performance management program, though it can also be used to measure operational performance. While all of the Microsoft BI tools permit “scorecard-like” reporting, PerformancePoint Services is the Microsoft tool that enables Scorecards using a methodology such as Balanced Scorecard, or a methodology of your own. PerformancePoint Services facilitates weighted score rollups, measurement against targets, and pre-defined KPIs, in addition to providing the broader set of PerformancePoint functionality that allows integrating the Scorecard into a dashboard alongside other reporting and analysis capabilities. Tool Options: Excel Services / PowerPivot for Excel, SQL Server Reporting Services (including Report Builder), PerformancePoint Services.

Choose the right business intelligence technology to suit your style (white paper)

9 August 2011

Using Accumulating Snapshot Fact table to Monitor Status

Filed under: Data Warehousing — Vincent Rainardi @ 4:32 am

Status is a specific stage in a workflow. This is an example of a workflow:

In the above workflow, we have 6 statuses. The workflow begins with Status 1, and ends with Status 5 or 6. The normal workflow is: Status 1, Status 2, Status 4, Status 6 (1,2,4,6). And we have 3 alternate flows:

  1. 1, 3, 5
  2. 1, 2, 3, 5
  3. 1, 2, 4, 3, 5
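As a quick illustration, the normal flow and the three alternate flows above can be captured as a set of allowed paths, against which a completed application's status history can be validated. This is a hypothetical Python sketch; only the status numbers come from the workflow above:

```python
# Allowed status paths in the workflow: the normal flow plus 3 alternate flows.
ALLOWED_PATHS = [
    (1, 2, 4, 6),    # normal flow
    (1, 3, 5),       # alternate flow 1
    (1, 2, 3, 5),    # alternate flow 2
    (1, 2, 4, 3, 5), # alternate flow 3
]

def is_valid_path(statuses):
    """Check whether a completed application's status history is one of the allowed flows."""
    return tuple(statuses) in ALLOWED_PATHS

print(is_valid_path([1, 2, 4, 6]))  # True  - the normal flow
print(is_valid_path([1, 4, 6]))     # False - skips Status 2
```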

An example of a workflow is a mortgage application. A mortgage application needs to go through various stages such as: credit score check, collateral check, application approved, application declined, etc.

When the workflow is in action, we have a date for each of those stages:

In the above example, the application arrived at Status 1 on date1, at Status 2 on date2, at Status 4 on date3, and at Status 6 on date4.

In the transaction/source system (e.g. mortgage application), the questions are about individual applications, such as:
SS1. What is the current status of Application 1?
SS2. When did Application 2 reach Status 4?

But in the data warehouse, the questions are less about individual applications. The questions are more about the status of many or all applications, such as:
DW1. How many applications are currently on Status 4?
DW2. What is the average duration (in days) between Status 2 and Status 4?

In many data warehouse implementations, Status is created as an attribute on the main dimension, e.g. customer dimension, product dimension, or application dimension. The dimension is set as Slowly Changing Dimension type 2 (SCD2), like this:

This SCD2 structure is well suited to answering transaction/source system questions such as SS1 and SS2 above, but it makes data warehouse questions such as DW1 and DW2 difficult. Using this structure, the query to get “How many on Status 4” is:

select count(*) from DimApp where IsCurrent = 1 and Status = 'Status4'

Output: 2

And the query to answer “What is the average duration (in days) between Status 2 and Status 4?” is

with x as
(select DateDiff(d, EffectiveDate, ExpiryDate) as Days from DimApp where Status = 'Status2')
select avg(Days*1.0) from x

Output: 1.5

The “*1.0” converts the Days column from integer to decimal so that the average is not truncated to an integer.

But, from Status 2 an application doesn’t always go to Status 4. It can go to Status 3 instead! So how do we find out which application went from Status 2 to Status 4 (App2) and which went from Status 2 to Status 3 (App3)? With this structure, we can’t!

Because we cannot identify which row is the previous state, we are forced to add a “Previous App Key” column:

Now we can join the table to itself using the PrevKey and AppKey columns. So the query to answer “What is the average duration (in days) between Status 2 and Status 4?” becomes:

select avg(datediff(d, StatusFromDate, StatusToDate)*1.0) from
( select A.AppId, B.Status as StatusFrom, B.EffectiveDate as StatusFromDate,
  A.Status as StatusTo, A.EffectiveDate as StatusToDate
  from DimApp A
  join DimApp B on A.PrevKey = B.AppKey
) X where StatusFrom = 'Status2' and StatusTo = 'Status4'

Output: 2.5
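The self-join above can be tried out end to end. This is a sketch using SQLite via Python's sqlite3 in place of SQL Server, with made-up sample rows (two applications: one going 1→2→4, one going 1→2→3), so the average here differs from the figures in the post; julianday() stands in for DATEDIFF:

```python
import sqlite3

# A small SCD2 dimension with a PrevKey column (hypothetical sample data).
con = sqlite3.connect(":memory:")
con.executescript("""
create table DimApp (
    AppKey int, AppId int, Status text,
    EffectiveDate text, ExpiryDate text, IsCurrent int, PrevKey int
);
insert into DimApp values
 (1, 1, 'Status1', '2011-08-01', '2011-08-02', 0, null),
 (2, 1, 'Status2', '2011-08-02', '2011-08-05', 0, 1),
 (3, 1, 'Status4', '2011-08-05', '9999-12-31', 1, 2),
 (4, 2, 'Status1', '2011-08-01', '2011-08-03', 0, null),
 (5, 2, 'Status2', '2011-08-03', '2011-08-04', 0, 4),
 (6, 2, 'Status3', '2011-08-04', '9999-12-31', 1, 5);
""")

# Self-join on PrevKey to find the Status2 -> Status4 transitions
# and average their duration in days.
row = con.execute("""
    select avg(julianday(A.EffectiveDate) - julianday(B.EffectiveDate))
    from DimApp A
    join DimApp B on A.PrevKey = B.AppKey
    where B.Status = 'Status2' and A.Status = 'Status4'
""").fetchone()
print(row[0])  # only App 1 went Status2 -> Status4, taking 3 days: 3.0
```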


Using Accumulating Snapshot Fact Table

To answer question DW1 and DW2 above, it is more appropriate to use Accumulating Snapshot Fact Table, like this:

The “Ind” columns are indicator columns: they show which status the application is currently on.

For background on the Accumulating Snapshot, see chapter 3, page 75 of the Kimball Toolkit book and Kimball’s article on the subject.

Using this structure the query to get “how many on status 4” becomes:

select sum(S4Ind) from FactApp

Output: 2

Whereas the query to answer “What is the average duration (in days) between Status 2 and Status 4?” is:

select avg(datediff(d, S2Date, S4Date)*1.0) from FactApp
where S2Date is not null and S4Date is not null

As you can see, it is easier to use Accumulating Snapshot Fact Table to answer Data Warehousing questions such as:
DW1. How many applications are currently on Status 4?
DW2. What is the average duration (in days) between Status 2 and Status 4?
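The two queries above can likewise be tried against a small accumulating snapshot table. This is a sketch using SQLite via Python's sqlite3, with hypothetical sample rows (three applications, two of which have reached Status 4); julianday() stands in for DATEDIFF:

```python
import sqlite3

# An accumulating snapshot fact table: one row per application, with one date
# column and one indicator column per status (made-up data, statuses 1-4 only).
con = sqlite3.connect(":memory:")
con.executescript("""
create table FactApp (
    AppId int,
    S1Date text, S2Date text, S3Date text, S4Date text,
    S1Ind int, S2Ind int, S3Ind int, S4Ind int
);
insert into FactApp values
 (1, '2011-08-01', '2011-08-02', null, '2011-08-05', 0, 0, 0, 1),
 (2, '2011-08-01', '2011-08-03', '2011-08-04', null, 0, 0, 1, 0),
 (3, '2011-08-02', '2011-08-03', null, '2011-08-05', 0, 0, 0, 1);
""")

# DW1: how many applications are currently on Status 4?
count_s4 = con.execute("select sum(S4Ind) from FactApp").fetchone()[0]

# DW2: average duration in days between Status 2 and Status 4.
avg_days = con.execute("""
    select avg(julianday(S4Date) - julianday(S2Date)) from FactApp
    where S2Date is not null and S4Date is not null
""").fetchone()[0]

print(count_s4, avg_days)  # 2 applications on Status 4; average 2.5 days
```

Note how both questions become single-table queries with no self-join, which is exactly why the accumulating snapshot suits this kind of analysis.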

Vincent Rainardi, 8th August 2011

Expiry Date column in SCD Type 2 Dimension

Filed under: Data Warehousing — Vincent Rainardi @ 4:19 am

There are three parts to this article, all about the Expiry Date column in an SCD type 2 dimension.

  1. Date only Expiry Date (no time component), set to Effective Date or a day before?
  2. Expiry Date: Should We Have the Millisecond?
  3. Expiry Date of the current row: NULL or max date?

A. Expiry Date without the Time Component

If the Effective Date and Expiry Date in our SCD type 2 dimension are date only (no time component), should we set the Expiry Date to the Effective Date of the new row, or to a day before?

As always it is easier to explain by example. This is a typical SCD type 2 dimension:

The question is: should the expiry date on the first row be 05/08/2011 instead?

Let’s see what happens if it’s 04/08/2011, as in the above table. If the transaction date in the source system has no time component, e.g. 03/08/2011, then it fits nicely: the fact row will be mapped to exactly one row in the above type 2 dimension.

But if the transaction date in the source system has a time component, e.g. 04/08/2011 11:23:09, then it will “fall between the cracks”, i.e. fall between the expiry of the first row (04/08/2011) and the effective date of the 2nd row (05/08/2011), so it will be mapped to neither the 1st row nor the 2nd row of the dimension. This is because 04/08/2011 means 04/08/2011 00:00:00 and 05/08/2011 means 05/08/2011 00:00:00, so 04/08/2011 11:23:09 belongs to neither row.

So in this case we will need to set the Expiry Date of the first row to 05/08/2011, so that 04/08/2011 11:23:09 will belong to the 1st row.

We will also need to adopt a “convention” in the ETL that “midnight” dates (i.e. with time component = 00:00:00) fall to the 2nd row, e.g. if the transaction date is 05/08/2011 00:00:00 then the ETL process maps it to the 2nd row.
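This convention amounts to a half-open interval check: a transaction belongs to a row when EffectiveDate <= transaction date < ExpiryDate, so a midnight transaction falls on the new row. A minimal Python sketch, with hypothetical dates:

```python
from datetime import datetime

# SCD2 rows where each row covers [EffectiveDate, ExpiryDate) - a half-open
# interval (hypothetical example data; the change happens on 05/08/2011).
dim_rows = [
    {"key": 1, "effective": datetime(2011, 1, 1), "expiry": datetime(2011, 8, 5)},
    {"key": 2, "effective": datetime(2011, 8, 5), "expiry": datetime(9999, 12, 31)},
]

def lookup_dim_key(txn_date):
    """Map a transaction date to the SCD2 row whose interval contains it."""
    for row in dim_rows:
        if row["effective"] <= txn_date < row["expiry"]:
            return row["key"]
    return None  # cannot happen if the intervals are contiguous

print(lookup_dim_key(datetime(2011, 8, 4, 11, 23, 9)))  # 1: before the change
print(lookup_dim_key(datetime(2011, 8, 5)))             # 2: midnight maps to the new row
```

Because the intervals are contiguous, no transaction date can fall between the cracks, whatever its time precision.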

The best option, of course, is to set the data type of the Effective Date and Expiry Date columns to datetime instead of date, like below:

With datetime data type like above, we can be sure that a transaction date with time component will never “fall between the cracks”. It will always be mapped to a dim row.

B. Should We Have the Millisecond?

But what if the transaction time is 04/08/2011 23:59:59.100, i.e. 100 milliseconds after the expiry date? In a global data warehouse this is possible. Should we have the millisecond like below?

This way, 04/08/2011 23:59:59.100 will belong to the first row.

What if the transaction time is 04/08/2011 23:59:59.1234567? With the arrival of the datetime2 data type in SQL Server 2008, it is now possible to have this time value in the source system. Datetime2 has a precision of 100 nanoseconds, i.e. 7 digits after the second.

Should we set the data type of the Effective and Expiry Dates to datetime2 instead of datetime? If you use SQL Server 2008 then yes, in my opinion, you should, so that the SCD2 dim would look like this:

But if you use SQL Server 2005, then in the ETL you need to have a convention. For example: anything that is less than 05/08/2011 (like 04/08/2011 23:59:59.1234567) will be mapped to the first row.

C. Expiry Date of the Current Row: NULL or max date

In all the above examples, I set the Expiry Date of the current row as a max date, e.g. 31/12/9999 23:59:59.9999999. In some data warehouses, the Expiry Date of the current row is set as NULL. Which one is better and why?

As always there are pluses and minuses to everything.

If we use a specific date, e.g. 31st December 2999 or 9999, we have to hardcode it in the ETL, and not just in one place but in many: in the insert script of every table’s ETL. It can also cause different perceptions. Developer 1 thinks it’s year 2999 whereas Developer 2 thinks it’s 3999; Tester 1 thinks it’s 1st Jan 9999 while Tester 2 thinks it’s 31st Dec 9999; one document says 2999, another says 9999, and so on.

On the other hand, if we put NULL it’s simpler, because it’s only one value. But with NULL we can’t compare against a date. If we want to query rows whose ExpiryDate is greater than today, for example, rows with a NULL ExpiryDate will be missed. So we can’t just do:

select * from t1 where ExpiryDate >= getdate()

In order to get rows with ExpiryDate greater than today, we have to add “or is null” like this:

select * from t1 where ExpiryDate >= getdate() or ExpiryDate is null

Kind regards,

Vincent Rainardi, 9th August 2011

4 August 2011

Populating Unknown Measure with Zero or NULL

Filed under: Data Warehousing — Vincent Rainardi @ 12:26 am

If a measure column in a fact table does not have a value, should we fill it with 0 or null? Some people are concerned that number 0 may have a different meaning to “unknown” or “nothing”, as they have been influenced by the Kimball concept of “surrogate key column = 0”. Also they worry that if they populate it with null, they might have an issue when the column is being totalled or queried, because null can’t be compared.

Unknown Measure Can Be Replaced by Zero

99%* of the time an unknown measure can be replaced by 0, because the measure is additive, so totalling it with 0 will not disturb the result. Even if the measure is not additive, e.g. (account) balance, we can still replace it with zero, because balance = 0 means we don’t have any money in the account. A zero in the balance column means nothing: it is less than 1 penny and more than -1 penny. A negative number in a balance column means credit for Asset and Expense accounts, and debit for Liabilities, Equity and Income accounts.

*not in its scientific meaning, but rather means “most”.

Even if the measure column is a percentage figure, we can still fill in an unknown measure with zero, because 0% means nothing. A zero in a percentage column means “nothing” or “none”: it is less than 1% and it is more than -1%.

Zero Can Have a Different Meaning to Nothing

So in what cases does a zero have different meaning to nothing? In a probability column. If a probability column has a value of 0, it means “it is impossible to happen”. If a probability column has a value of 1, it means “it will certainly happen”. In investment banking, in Basel III, there is a measure called Probability of Default. It is the possibility that a loan (or other forms of credit*) is not paid. See the Wikipedia definition here. *or other instruments such as bond and CDS.

If the Probability of Default is 0, that means Default will never happen. The word “Default” means the event where a company* that borrows money is unable to pay its debt to the lender. Default is both a noun and a verb.

*known as “Obligor”, see Investopedia definition here. An Obligor is not necessarily a company. As long as it borrows money it is an obligor. It can be an individual, other form of entity (such as LLP) or the government of a country.

So if the Probability of Default is 0, it means that we are certain, even if the earth no longer exists*, that the Obligor will pay its debt. So it is impossible for the Probability of Default column to be 0, because there is always a possibility that an Obligor can’t pay its debt. Even a company with an Acid-Test ratio of 4:1 could default. Even the UK or US government could default.

*Not literally, because if that happened the words “money” and “debt” would have no meaning; think more of a catastrophic worldwide event, e.g. a huge meteor impact wiping out half of the earth’s population, World War III, or a Wall Street crash.

So that’s an example of a measure where a zero has different meaning to nothing: probability.

Issue of Totalling Measure Column Containing NULL

There is no issue at all when a measure column containing NULL is summed or totalled, as you can see from this little SQL:

create table fact1 (dim1key int, measure1 int)
insert into fact1 values (1,10),(2,null),(3,20)
select * from fact1 

1              10
2              NULL
3              20

select sum(measure1) from fact1

The output is 30, as we expected.

There is no issue when we take the max and min either:

select max(measure1) as [max], min(measure1) as [min] from fact1;

Output: 20, 10

There is no issue when comparing the measure using “more than”:

select measure1 from fact1 where measure1 > 10

Output: 20

As we expect the row with NULL measure is not included in the output, because NULL is not > 10.

Issue With Measure Column Containing NULL

So what’s the issue with a measure column containing NULL?

The issue is when we want to take an average of the measure:

select avg(measure1) as [avg] from fact1;

The output is 15, not 10, because the denominator is 2 not 3 (the NULL doesn’t count).

select count(measure1) from fact1;

The output is 2 not 3 (the NULL doesn’t count).

But if we do count(*) it’s 3:

select count(*) from fact1;

Output: 3

In most cases* we want the average to be 10 (30/3), i.e. the business functionality requires us to average over all rows, not just the “not null” rows. Hence it is better not to use the AVG function, but instead to have two measures in the front end (BO, SSAS, OBIEE, etc.), one for the sum and one for the row count, and do sum/count at run time.
*Occasionally the business functionality requires us to average over only the “not null” rows. In that case the AVG function gives exactly what we need.
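The difference between the two averages can be seen by rebuilding the little fact1 table. This is a sketch using SQLite via Python's sqlite3 in place of SQL Server:

```python
import sqlite3

# Rebuild the fact1 example and compare AVG with sum/count(*).
con = sqlite3.connect(":memory:")
con.executescript("""
create table fact1 (dim1key int, measure1 int);
insert into fact1 values (1, 10), (2, null), (3, 20);
""")

# AVG ignores the NULL row: 30 / 2 = 15.
avg_builtin = con.execute("select avg(measure1) from fact1").fetchone()[0]

# Averaging over all rows "at run time": sum / count(*) = 30 / 3 = 10.
avg_all_rows = con.execute(
    "select sum(measure1) * 1.0 / count(*) from fact1"
).fetchone()[0]

print(avg_builtin, avg_all_rows)  # 15.0 10.0
```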

The other issue with NULL is when it is compared using “less than”:

select * from fact1 where measure1 < 10

Output: no rows.

Had we populated the measure with 0 instead of NULL, we would have had 1 row in the output. This is actually the main reason why in Data Warehousing we tend to populate the measure column with 0 instead of NULL.

Hope this helps,

Vincent Rainardi, 4th August 2011
