Data Warehousing and Business Intelligence

28 March 2010

Join brings more rows

Filed under: Data Warehousing — Vincent Rainardi @ 12:59 pm
Tags:

When populating a fact table from the source system we do a lookup on the dimension tables to get the dimension surrogate keys. We join the dimension tables to the fact stage table on the natural keys:

select D1.SK, D2.SK
from fact_stage_table S
left join dim1 D1 on S.D1NK = D1.NK
left join dim2 D2 on S.D2NK = D2.NK

(SK = Surrogate Key, NK = Natural Key)

Say the query returns 200,000 rows.
But when we bring dim3 the query returns say 230,000 rows:
select D1.SK, D2.SK, D3.SK
from fact_stage_table S
left join dim1 D1 on S.D1NK = D1.NK
left join dim2 D2 on S.D2NK = D2.NK
left join dim3 D3 on S.D3NK = D3.NK

This is because 1 row in the fact stage table corresponds to 2 rows (or more) in dim3.
In this case we need to find out which column in the fact stage table causing the split.
To find out the column causing the split, strip out dim1 and dim2 from the query, and group by D3.SK:
select S.D3NK from fact_stage_table S
left join dim3 D3 on S.D3NK = D3.NK
group by S.D3NK having count(*) > 1

Say the above query returns 4 rows: NK1 to NK4. This means that in the fact stage table, there are more than 1 row for NK1.
So now we query the fact stage table to see which column is causing the issue:
select * from fact_stage_table where D3NK = ‘NK1’

The column(s) causing the issue have different values across the rows. All other columns in the last query should have the same value across the rows, except these 1 or 2 columns that cause the issue.

Common causes are:

  • It’s a type 2 dim, so there are several versions for each NK. One possible solution is to use the latest version, i.e. where Is_Current = ‘Y’ (or use validity dates, e.g. valid from and valid to).
  • It has an outrigger hence there are multiple rows in the dim table for each NK. One possible solution is to take the first version, i.e. where outrigger_id = 1
  • The dim table is populated multiple source systems. It is possible that the NK1 exists in source system 1, but also exists in source system 2.

Then apply those additional conditions to the investigation query:
select S.D3NK from fact_stage_table S
left join dim3 D3 on S.D3NK = D3.NK and D3.Is_Current = ‘Y’ and D3.Outrigger_ID = 1
group by S.D3NK having count(*) > 1

If this returns nothing, then apply that condition to the main query:
select D1.SK, D2.SK, D3.SK
from fact_stage_table S
left join dim1 D1 on S.D1NK = D1.NK
left join dim2 D2 on S.D2NK = D2.NK
left join dim3 D3 on S.D3NK = D3.NK and D3.Is_Current = ‘Y’ and D3.Outrigger_ID = 1

This should now returns 200,000 rows. If the fact stage table has 200,000 rows, after we join with the dimension tables it should still return 200,000 rows. If less return less than this (say 170,000), check if you use inner join — you should use left outer join. If it return more than this (say 230,000), check if split occurs (as I explain above). If it does, find out which column causes the split and put the additional where clause for that column.

1 Comment »

  1. […] is the original post: Join brings more rows « Data Warehousing and Business Intelligence Share this […]

    Pingback by Join brings more rows « Data Warehousing and Business Intelligence | Information Mining R&D — 28 March 2010 @ 2:41 pm | Reply


RSS feed for comments on this post. TrackBack URI

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Create a free website or blog at WordPress.com.

%d bloggers like this: