Data Warehousing and Business Intelligence

13 March 2010

Data type of “Is Current Flag” column

Filed under: Data Warehousing,SQL Server — Vincent Rainardi @ 1:00 pm
Tags: ,

In data warehousing we have an Is Current Flag column in SCD type 2 dimension table. This flag indicates whether the row is the current version or not.

Some DW data modellers believe that in SQL Server platform this column should be created as bit rather than int or Y/N. Because it saves space. Int is 4 bytes. Y/N is char(1) which is 1 byte (2 if nchar/unicode). Bit, they say, is 1/8 byte. That’s the reasoning.

But, in SQL Server, if you have 8 or less bit column in the table, it is stored as 1 byte. So if Is Current Flag is the only bit column in your dimension table, you don’t save space by going to bit data type.

Contrary to popular believe that 1/0 (numbers) takes less space that Y/N (letters), it is not. By default people think that the data type for integer is int, they don’t think about small int or tiny int. In that case (int) the storage is 4 bytes. If the data modeller is platform aware and they know about small and tiny int, then it’s 2 or 1 byte.

Char(1) is 1 byte. But varchar(1) is 3 bytes, not 1 byte. The additional 2 bytes in varchar column is to specify the length of the column.

So the question then become:

which data type is faster for “where is_current_flag = 1” query, is it char(1) Y/N or bit 1/0?  We know both take 1 byte storage, but which one is gives better query performance?

It’s subject to testing but I think comparing bit is faster than char(1). That’s because bit can only have 1,0 and null values where as char(1) can have many possible values. So for Is Current Flag column I’d go for bit data type.

Now let’s test it.

Dim1 is a dimension table with 2 attributes. It has 1000 rows. Fact1 is a fact table with 2 measures. It has 1 million rows.

— Add Is_Current column as bit
alter table dim1 add is_current bit
update dim1 set is_current = 0 where dimkey1%4 = 0
update dim1 set is_current = 1 where dimkey1%4 <> 0
select * from dim1

— Measure the query performance
set statistics time on
declare @i int
set @i = 1
while @i <= 3
begin
select d.attr2, sum(measure1) as m1, sum(measure2) as m2
from fact1 f join dim1 d on d.dimkey1 = f.dimkey1
where d.is_current = 1 group by d.attr2
print @i
set @i += 1
end
–8885 ms (avg of 8201,9871,9159,8875,8319)

— Change Is_Current column as char(1)
set statistics time off
alter table dim1 alter column is_current char(1)
update dim1 set is_current = ‘N’ where dimkey1%4 = 0
update dim1 set is_current = ‘Y’ where dimkey1%4 <> 0

— Measure the query performance
set statistics time on
declare @i int
set @i = 1
while @i <= 3
begin
select d.attr2, sum(measure1) as m1, sum(measure2) as m2
from fact1 f join dim1 d on d.dimkey1 = f.dimkey1
where d.is_current = ‘Y’ group by d.attr2
print @i
set @i += 1
end

— 5353.5 ms (avg of 5248,5299,5053,5254,5764,5503)

So, quite surprisingly, char(1) is faster than bit. 40% faster. Hmm, in that case I’ll set the Is_Current column as char(1) Y/N. And Y/N more descriptive than 1/0.

4 Comments »

  1. If true or false is what I want to store, I would always go for bit. If I want to store something like true, false, maybe then also I wont go for char(1). I would take tinyint instead and have the column as a primary key in another table with a little description of the values.

    Comment by Amit Singh — 13 March 2010 @ 3:35 pm | Reply

  2. […] Use Is Current Flag column instead of date validity columns (it’s faster). A discussion about the data type of the Is Current Flag column is here. […]

    Pingback by Creating a Data Mart from a Data Warehouse: Four Questions « Data Warehousing and Business Intelligence — 13 March 2010 @ 11:44 pm | Reply

  3. […] Use Is Current Flag column instead of date validity columns (it’s faster). A discussion about the data type of the Is Current Flag column is here. […]

    Pingback by Creating a Data Mart from a Data Warehouse: Four Questions « Business Intelligence — 18 July 2011 @ 9:54 am | Reply

  4. Bit is faster for me because it takes 2 fewer keystrokes to type:
    where is_something = 1
    as compared to
    where is_something = ‘y’

    That’s how I look at it.

    Comment by Eric Sabine — 9 December 2011 @ 2:08 pm | Reply


RSS feed for comments on this post. TrackBack URI

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Create a free website or blog at WordPress.com.

%d bloggers like this: