Data Warehousing, BI and Data Science

15 February 2019

Data Type

Filed under: Data Warehousing — Vincent Rainardi @ 6:53 pm

In the data warehouse, should we make the data type the same as the source system? If in the source system the data type of the currency code column is varchar(50), in the our data warehouse should we set the data type to varchar(50) or to varchar(3)? If an ISIN column is defined as varchar(100) in the source system should you set the data type to varchar(100) or varchar(12)?

The short answer is: set it to whatever the source system specifies.

The long answer is:

  1. If the source system specifies a ridiculously long data type such as varchar(500) for currency code or varchar(1000) for ISIN, or varchar(8000) for a description column, and the actual data in those columns are much shorter than the data type (e.g. 3, 12 and 200 characters respectively for the above 3 examples), then set the data type to fit the actual data length, plus spare.
  2. Have a standard in the warehouse for length. For example: 1, 5, 10, 20, 50, 100, 200 and pick the next one up. For example if the actual data length is 12 set the data type for that column to 20, and if the actual data length is 30 set the data type to 50.
  3. In the ETL, if the actual data exceed the column data type, cut the data. For example, if the data type of the attribute column in the dimension is 100 and the actual data is 101 characters, then cut it to 100 characters and load it into the dimension but flag this in the ETL error log.
  4. Talk to the source system team. Perhaps there is a reason why the designer of that source column set it to wider than what it should be. Perhaps when there is no ISIN, they put the other security identifier such as currency + SEDOL in there? Or internal security identifier, which is quite long?
  5. If the source system team confirmed that the data they put into that column will not exceed 12 characters (even though the column is defined as 100 characters) then set the column in the warehouse to 12. But again, in the ETL cut it to 12 if it exceed 12, and log it.

 

Advertisements

3 Comments »

  1. Hi,

    Any data that is loaded based on a particular period can be considered as a periodic snapshot fact?

    Comment by sandy353 — 21 February 2019 @ 1:19 am | Reply

    • In my opinion, no. Depends on what “based on a particular period” means. Does it mean the loading frequency? Also depends on the structure of the fact table, does it have a snapshot date/month? Does it repeat the whole source table or only partial?

      Comment by Vincent Rainardi — 21 February 2019 @ 8:20 am | Reply

    • I just wrote about this last night: https://dwbi1.wordpress.com/2019/02/20/transactional-fact-tables/. See the definition of periodic snapshot fact table on this article, also see case 2 for comparison/deciding between periodic snapshot and transactional fact tables.

      Comment by Vincent Rainardi — 21 February 2019 @ 8:21 am | Reply


RSS feed for comments on this post. TrackBack URI

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

Create a free website or blog at WordPress.com.

%d bloggers like this: