Data Warehousing and Data Science

5 October 2018

Junk Dimension Revisited

Filed under: Data Warehousing — Vincent Rainardi @ 5:27 am

If the data we get from the source system consists of many columns, and 11 of those columns are like this:

  • Small, medium, large (2 columns)
  • High, medium, low (2 columns)
  • Red, green, blue, brown (1 column)
  • Red, amber, green (2 columns)
  • Yes, no (2 columns)
  • Long, short (2 columns)

[Figure 1: Source Row]

How should we model this in a dimensional model?

The answer is, of course, a junk dimension. If the above data had not 5 rows but 5,000 rows, the number of distinct combinations of values would probably still be only around 200. That is the power of the junk dimension: it puts all these attributes into one table, in a compact form.
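
As a rough sketch of what that looks like (the table and column names, such as DimOrderJunk and StagingOrder, are hypothetical and cover six of the eleven columns):

    -- Hypothetical junk dimension covering six of the low-cardinality columns
    CREATE TABLE DimOrderJunk
    ( OrderJunkKey int IDENTITY(1,1) NOT NULL PRIMARY KEY  -- surrogate key
    , Size      varchar(10)   -- small, medium, large
    , Priority  varchar(10)   -- high, medium, low
    , Colour    varchar(10)   -- red, green, blue, brown
    , RagStatus varchar(10)   -- red, amber, green
    , IsActive  varchar(3)    -- yes, no
    , Duration  varchar(10)   -- long, short
    );

    -- Populate it with the distinct combinations found in the source data
    INSERT INTO DimOrderJunk (Size, Priority, Colour, RagStatus, IsActive, Duration)
    SELECT DISTINCT Size, Priority, Colour, RagStatus, IsActive, Duration
    FROM StagingOrder;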

Hashbytes

One of the challenges of implementing a junk dimension is the lookup: checking whether a row coming from the source data already exists in the dimension table. If there are only 11 columns, as above, we can compare the source row and the target row directly by comparing all the columns. But if the number of columns is 100, it would be cumbersome to type them all in.

One common solution is to use hashing. We hash the 100 source columns and compare the result with the hash column in the dimension, which was computed previously. If we find a match, the row already exists in the dimension table.
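
A minimal sketch of that lookup, assuming a hypothetical RowHash column has already been computed on both the staging table and the dimension:

    -- Source rows whose hash is not yet in the junk dimension are new rows
    SELECT s.*
    FROM StagingOrder s
    LEFT JOIN DimOrderJunk d ON d.RowHash = s.RowHash
    WHERE d.RowHash IS NULL;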

There are a few functions in SQL Server which can do this. One of them is Checksum, but it is quite possible for two different inputs to generate the same output. A better function is Hashbytes, which has a much lower chance of producing the same output for different inputs, particularly if we use it with the MD5 algorithm, as we can read on Books Online and on SQL Server Central (link).

With Hashbytes we can choose different algorithms: MD2, MD4, MD5, SHA, SHA1, SHA2_256 and SHA2_512. Note that from SQL Server 2016 only SHA2_256 and SHA2_512 are recommended; all other algorithms are deprecated. So I recommend using SHA2_512 rather than MD5, as it has a much lower chance of producing the same output value for different input values.
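
For example, a sketch using the same hypothetical columns as above; the '|' delimiter stops combinations such as 'ab' + 'c' and 'a' + 'bc' from producing the same concatenated string:

    -- Compute the row hash with the SHA2_512 algorithm
    SELECT HASHBYTES('SHA2_512',
             CONCAT(Size, '|', Priority, '|', Colour, '|',
                    RagStatus, '|', IsActive, '|', Duration)) AS RowHash
    FROM StagingOrder;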

Impossible To Have Duplicates

Having the same output value for different input values is called a collision. MD5 is very weak: it takes only a few seconds on a normal computer to produce a collision (link).

SHA1 is the second strongest. We need to try about 10^19 different inputs before a collision occurs, which takes about 100 years on a server with 64 cores. Even so, last year Google and CWI Amsterdam managed to produce a collision with SHA1 (link).

With SHA2 using 64 bytes (512 bits), it is practically impossible to have a duplicate/collision: we would need to try about 10^77 different inputs before a collision occurs. Even on a server with 64 cores, that would take about 5,800 years.

Warning: Functions Are Slow

One warning when using functions in SQL Server (any function, including hashing functions): a function is evaluated once for every row, or RBAR as Jeff Moden calls it, which stands for Row By Agonising Row. In relational databases, be it Oracle or SQL Server or anything else, when dealing with a high number of rows we should always prefer a set operation over a row-by-row operation. If the incoming data from the source is over 1 million rows, hashing it will take time, particularly when we use a complex algorithm such as MD5.
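
The mitigation is to make sure the hash is still computed in one set-based statement over the whole staging table, rather than in a cursor or a while loop. A sketch, again with a hypothetical RowHash column on the staging table:

    -- One set-based UPDATE over all staging rows, not a row-by-row loop
    UPDATE StagingOrder
    SET RowHash = HASHBYTES('SHA2_512',
          CONCAT(Size, '|', Priority, '|', Colour, '|',
                 RagStatus, '|', IsActive, '|', Duration));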

First Character Technique

One alternative is to take the first character of every column and use it as the business key for the lookup, like this:

[Figure 2: First Character]

By comparing the First Character column between the incoming source row and the one stored in the dimension table, we can check if the row already exists or not.
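
A sketch of how the First Character column could be built (column names hypothetical, as before):

    -- Concatenate the first character of each attribute column into one lookup key
    SELECT CONCAT(LEFT(Size, 1), LEFT(Priority, 1), LEFT(Colour, 1),
                  LEFT(RagStatus, 1), LEFT(IsActive, 1), LEFT(Duration, 1)) AS FirstCharKey
    FROM StagingOrder;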

The advantage of this technique is that, provided the values in each column start with different characters, two different inputs can never produce the same output.

But (as always, there is a but) if one column has two values with the same first character, as below, we need to differentiate them:

[Figure 3: Same First Character]

In the above example the Status column has two values whose first character is O, i.e. Outstanding and On Hold. So we need to differentiate them, for example by mapping On Hold to H.

To implement that we will need either a case statement or a lookup table. For example: CASE WHEN Status = 'On Hold' THEN 'H' ELSE LEFT(Status, 1) END.

If the upper and lower case in the source data is consistent, we don't need to convert the first character to uppercase. But if the upper/lower case in the source data is inconsistent, then we do. For example, if the Status column contains both "Active" and "active", we need to convert both to "A" by taking the first character and applying the UPPER() function.
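
Putting the two together, a sketch of the expression for one hypothetical Status column:

    -- Map 'On Hold' to H so it does not clash with 'Outstanding',
    -- and uppercase the result so 'Active' and 'active' both become A
    SELECT UPPER(CASE WHEN Status = 'On Hold' THEN 'H'
                      ELSE LEFT(Status, 1) END) AS StatusChar
    FROM StagingOrder;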

Dealing With Null

If we use Hashbytes, NULL values need to be converted to something like "Unknown", "1/1/1900" or -1, depending on the data type, before the columns are concatenated and fed into the Hashbytes function. Otherwise it won't work: passing NULL to Hashbytes, or concatenating a NULL into the input string with +, produces NULL.
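
A sketch of that conversion, with hypothetical columns of different data types:

    -- Replace NULLs before concatenating, so the hash input is never NULL
    SELECT HASHBYTES('SHA2_512',
             CONCAT(ISNULL(Size, 'Unknown'), '|',
                    ISNULL(CONVERT(varchar(8), MaturityDate, 112), '19000101'), '|',
                    ISNULL(CONVERT(varchar(12), ClientId), '-1'))) AS RowHash
    FROM StagingOrder;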

The advantage of Checksum is that it can deal with NULLs: we don't have to convert NULL values to something else. The other advantage of Checksum is that it is faster than Hashbytes with SHA2_512.
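
For example, Checksum takes a column list directly and tolerates NULLs (same hypothetical columns as before); the trade-off is the much higher chance of collisions described earlier:

    SELECT CHECKSUM(Size, Priority, Colour, RagStatus, IsActive, Duration) AS RowChecksum
    FROM StagingOrder;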

Data Type

Varchar and Nvarchar produce different hash output even though the input text is the same, because the underlying bytes are different. Similarly, Char and Nchar produce different output.

Varchar(100) and Varchar(200) produce the same output if the input text is the same. So the declared length of the varchar does not affect the output.

If the collations are different, the outputs can also be different, because for varchar the collation determines the code page, which affects the bytes being hashed.

We need to remember all the above points if later on we want to change the data type of an attribute column, either in staging or in the dimension, or if we want to change the collation of the database.
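
A quick way to see the data type behaviour for ourselves is to hash the same text cast to different types:

    -- Different types mean different underlying bytes, therefore different hashes
    SELECT HASHBYTES('SHA2_512', CAST('Active' AS varchar(100)))  AS VarcharHash,
           HASHBYTES('SHA2_512', CAST('Active' AS nvarchar(100))) AS NvarcharHash;

    -- Different declared lengths but the same bytes, therefore the same hash
    SELECT HASHBYTES('SHA2_512', CAST('Active' AS varchar(100))) AS Hash100,
           HASHBYTES('SHA2_512', CAST('Active' AS varchar(200))) AS Hash200;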

8000 Bytes Limitation

The main limitation of the Hashbytes function is that, in SQL Server 2014 and earlier, it can only accept a maximum of 8000 bytes of input (this limit was removed in SQL Server 2016). So if the total bytes of all the input columns is more than 8000, we need to split the input columns into two or three groups and hash them separately.
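
A sketch of that split, with hypothetical column names, storing one hash per group of columns:

    -- Two hashes, each over a subset of the columns, keeping each input under 8000 bytes
    SELECT HASHBYTES('SHA2_512', CONCAT(Col1, '|', Col2, '|', Col3))    AS RowHash1,
           HASHBYTES('SHA2_512', CONCAT(Col51, '|', Col52, '|', Col53)) AS RowHash2
    FROM StagingOrder;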

Hashbytes Performance

So how fast or slow is the Hashbytes function? A table with 49 columns and 1.5 million rows takes about 44 seconds with MD5 (link).

A table with one varchar(10) column and 11 million rows takes about 17 seconds with SHA2_256 (link).

Obviously it depends on how many CPUs are in the server.

Back to Junk Dimension

So if there are many distinct values in a column (say 100), then we can't use the first character technique, because there will be many duplicates. In this case Hashbytes is better, because it can handle thousands of distinct values, and it can take many columns too.

Yes/No Columns

If the source data contains many boolean columns (say 100 columns, all True/False or Yes/No), then we can either put them in the fact table as degenerate dimension columns, or create a junk dimension. I would advise creating a junk dimension, because it declutters the fact table, because those columns are attributes and therefore belong in a dimension table, and because from a space point of view the junk dimension solution is much more compact, particularly if the fact table is a snapshot fact table.
