The word “UNK” is quite popular in data warehousing. It is the short form of “Unknown”. The word “Unknown” is used to populate if an attribute is NULL in a dimension.
For example, say you have an attribute called Country in your customer dimension. In the source system, for a particular customer you don’t have a country. The country code column is NULL in the source system. So in the customer dimension you set the country attribute as “Unknown”. (a)
What if the country code is ‘’ (blank string)? Then we could set it as “Unknown” too. (b)
Other people/projects have different convention:
a) In the customer dimension, for Surrogate Key 0, they set the country as “Unknown”.
b) But for a customer with proper SK, they set it as whatever is in the source system, be it blank string or NULL.
Now here’s the difference between I and II above.
For I, when you browse the country attribute in the BI tool (to list down its value) you will see:
Whereas for II when you browse the country attribute in the BI tool you will see:
The first blank represents the blank string member.
The second blank represents the NULL member.
It is generally perceived that I is better than II. But different companies have different standards. I’ve seen both approaches implemented in the DW projects I’ve worked on.
Now back to the title of this article: UNK.
For attributes of which the data type is less than 7 characters, we can’t use Unknown. Most DW practitioner use UNK. This was how UNK was born.
For example: country code: either 2 char ISO or 3 char ISO. Currency code: 3 char ISO. Yes/No attribute. For all these 3 char-wide attributes we initialise the SK 0 row with UNK. And, if you use approach I you will also see UNK in the non SK 0 rows in the dim tables.
2 char wide attributes
How about 2 char wide attributes such as 2 char ISO country code? Well, I don’t think “UN” is a good idea as it might be used in the future by ISO standard (currently not, only UM, UZ, UY, US). Probably a dash symbol is better (-). Or double dash (–).
UNK: Good or bad?
I think UNK is not a good idea, because if it’s in the middle of 3 char values, users could misinterpret it:
UAH, UGX, USD, USN, USS, UNK, UYI, UYU
Of all the currency codes above, only UNK is not valid. Who would know that?
I prefer triple dash than UNK.