Data Warehousing and Data Science

4 August 2011

Populating Unknown Measure with Zero or NULL

Filed under: Data Warehousing — Vincent Rainardi @ 12:26 am

If a measure column in a fact table does not have a value, should we fill it with 0 or null? Some people are concerned that number 0 may have a different meaning to “unknown” or “nothing”, as they have been influenced by the Kimball concept of “surrogate key column = 0”. Also they worry that if they populate it with null, they might have an issue when the column is being totalled or queried, because null can’t be compared.

Unknown Measure Can Be Replaced by Zero

99%* of the time an unknown measure can be replaced by 0, because the measure is additive. So totalling something with 0 will not disturb the result. Even if the measure is not additive, e.g. (account) balance, we can still replace it with zero, because balance = 0 means we don’t have any money in the account. Zero in balance column means nothing. It is less than 1 penny and it is more than -1 penny. A negative number in a balance column means credit for Asset and Expense accounts and it means debit for Liabilities, Equity and Income accounts.

*not in its scientific meaning, but rather means “most”.

Even if the measure column is a percentage figure, we can still fill in an unknown measure with zero, because 0% means nothing. A zero in a percentage column means “nothing” or “none”: it is less than 1% and it is more than -1%.

Zero Can Have a Different Meaning to Nothing

So in what cases does a zero have different meaning to nothing? In a probability column. If a probability column has a value of 0, it means “it is impossible to happen”. If a probability column has a value of 1, it means “it will certainly happen”. In investment banking, in Basel III, there is a measure called Probability of Default. It is the possibility that a loan (or other forms of credit*) is not paid. See the Wikipedia definition here. *or other instruments such as bond and CDS.

If the Probability of Default is 0 that means Default will never happen. The word “Default” means the event a company* that borrows the money is unable to pay its debt to the lender. Default is both a noun and a verb.

*known as “Obligor”, see Investopedia definition here. An Obligor is not necessarily a company. As long as it borrows money it is an obligor. It can be an individual, other form of entity (such as LLP) or the government of a country.

So if Probability of Default is 0, it means that we are certain, even if the earth no longer exists*, that the Obligor will pay its debt.  So it is impossible for the Probability of Default column to be 0, because there is always a possibility that an Obligor can’t pay its debt. Even the company with Acid-Test ratio of 4:1 could default. Even the UK or US government could default.

*Not literally because if that happens the word “money” and “debt” has no meaning. But more like catastrophic worldwide event, e.g. a huge meteor impact wiping out half of the earth population, world war III, or a Wall Street crash.

So that’s an example of a measure where a zero has different meaning to nothing: probability.

Issue of Totalling Measure Column Containing NULL

There is no issue at all when a measure column containing NULL is summed or totalled, as you can see from this little SQL:

```create table fact1 (dim1key int, measure1 int)
insert into fact1 values (1,10),(2,null),(3,20)
select * from fact1 ```

Output:
1              10
2              NULL
3              20

`select sum(measure1) from fact1`

The output is 30, as we expected.

There is no issue when we take the max and min either:

`select max(measure1) as [max], min(measure1) as [min] from fact1;`

Output: 20, 10, 15

There is no issue when comparing the measure using “more than”:

`select measure1 from fact1 where measure1 > 10`

Output: 20

As we expect the row with NULL measure is not included in the output, because NULL is not > 10.

Issue With Measure Column Containing NULL

So what’s the issue with a measure column containing NULL?

The issue is when we want to take an average of the measure:

`select avg(measure1) as [avg] from fact1;`

The output is 15, not 10, because the denominator is 2 not 3 (the NULL doesn’t count).

`select count(measure1) from fact1;`

The output is 2 not 3 (the NULL doesn’t count).

But if we do count(*) it’s 3:

`select count(*) from fact1;`

Output: 3

In most cases* we want the average to be 10 (30/3), i.e. the business functionality requires us to average over all rows, not just the “not null rows”. Hence it is better to not use AVG function, but we have a measure in the front end (BO, SSAS, OBIEE, etc) for both the sum and the count, then we do the sum/count at run time.
*Occasionally the business functionality requires us to average over only the “not null rows”. In this case the average need.

The other issue with NULL is when it is compared using “less than”:

`select * from fact1 where measure1 < 10`

Output: no rows.

Had we populated the measure with 0 instead of NULL, we would have had 1 row in the output. This is actually the main reason why in Data Warehousing we tend to populate the measure column with 0 instead of NULL.

Hope this helps,

Vincent Rainardi, 4th August 2011