Data Warehousing and Business Intelligence

27 September 2011

UNK in Data Warehousing

Filed under: Data Warehousing — Vincent Rainardi @ 5:31 pm

The word “UNK” is quite popular in data warehousing. It is the short form of “Unknown”. The word “Unknown” is used to populate if an attribute is NULL in a dimension.

Approach I

For example, say you have an attribute called Country in your customer dimension. In the source system, for a particular customer you don’t have a country. The country code column is NULL in the source system. So in the customer dimension you set the country attribute as “Unknown”.  (a)

What if the country code is ‘’ (blank string)? Then we could set it as “Unknown” too. (b)

Approach II

Other people/projects have different convention:

a)   In the customer dimension, for Surrogate Key 0, they set the country as “Unknown”.

b)   But for a customer with proper SK, they set it as whatever is in the source system, be it blank string or NULL.

Now here’s the difference between I and II above.

For I, when you browse the country attribute in the BI tool (to list down its value) you will see:
Afghanistan
Aland Islands
Albania
Algeria
Unknown

Whereas for II when you browse the country attribute in the BI tool you will see:
Afghanistan
Aland Islands
Albania
Algeria
[blank]
Unknown
[blank]

The first blank represents the blank string member.
The second blank represents the NULL member.

It is generally perceived that I is better than II. But different companies have different standards. I’ve seen both approaches implemented in the DW projects I’ve worked on.

UNK

Now back to the title of this article: UNK.

For attributes of which the data type is less than 7 characters, we can’t use Unknown. Most DW practitioner use UNK. This was how UNK was born.

For example: country code: either 2 char ISO or 3 char ISO. Currency code: 3 char ISO. Yes/No attribute. For all these 3 char-wide attributes we initialise the SK 0 row with UNK. And, if you use approach I you will also see UNK in the non SK 0 rows in the dim tables.

2 char wide attributes

How about 2 char wide attributes such as 2 char ISO country code?  Well, I don’t think “UN” is a good idea as it might be used in the future by ISO standard (currently not, only UM, UZ, UY, US). Probably a dash symbol is better (-). Or double dash (–).

UNK: Good or bad?

I think UNK is not a good idea, because if it’s in the middle of 3 char values, users could misinterpret it:

UAH, UGX, USD, USN, USS, UNK, UYI, UYU

Of all the currency codes above, only UNK is not valid. Who would know that?

I prefer triple dash than UNK.

Advertisements

4 Comments »

  1. Hey, very nice site. I came across this on Google, and I am stoked that I did. I will definately be coming back here more often. Wish I could add to the conversation and bring a bit more to the table, but am just taking in as much info as I can at the moment.

    iso 9000

    Comment by ISO 9000 — 29 September 2011 @ 4:58 am | Reply

  2. Hi Vincent – I never thought about dashes instead of the Unk/Unknown string. Good food for thought.

    Thought you may find this interesting….I have a requirement at my current client to never show the users an unknown member. So we let the relational DW do its data load, during which we assign unknown members a -1 key if applicable, but if any unknown members exist in any fact table, the cube is not allowed to process. An e-mail gets generated w/ the unknown member information so someone can resolve it before the cube is processed. After it’s fixed they kick off cube processing (or, if the unknown is some teensy little thing, they may kick off processing the cube anyway.) It’s the first time I’ve had a client be less concerned with aged data than they are with avoiding the presentation of unknown members. Interesting, huh?

    Melissa

    Comment by Melissa Coates — 1 October 2011 @ 10:07 pm | Reply

    • Hi Melissa, thank you for sharing it. I’ve experienced it as well, but when I said “Sir, this measure, the total is £5m for today. But if we exclude unknown, the total is £4.9m” they said “we need to include unknown rows, we need the total to be £5m”. Kind regards, Vincent

      Comment by Vincent Rainardi — 2 October 2011 @ 12:12 pm | Reply

  3. Very good post, I was really searching for this topic, as I wanted this topic to understand completely and it is also very rare in internet, that is why it was very difficult to understand.

    Thank you for sharing this.

    regards:
    ISO 9001

    Comment by ISO 9001 — 19 October 2011 @ 10:06 am | Reply


RSS feed for comments on this post. TrackBack URI

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Blog at WordPress.com.

%d bloggers like this: