Data Warehousing and Machine Learning

27 December 2011

Filed under: Data Warehousing — Vincent Rainardi @ 6:36 am

Now that your data warehouse is built, populated, and ready, you need to test it. How do you test a data warehouse? Where do you start? There are billions of numbers and strings in the data warehouse; how do you verify them all? Can you just verify a few rows and then declare that the data warehouse is fit for purpose? Of course not. So how do we do it?

Testing the data warehouse is the easy bit; the difficult bit is finding out where the errors are. We test the data warehouse by looking at the grand totals. Let's say you have this fact table:

Figure 1. A fact table with 2 measures and 2 dim key columns

Step 1. You ignore the dim key columns and just get the sum of measure1 and the sum of measure2. Say the total is 5,000,000 for measure1 and 4,000 for measure2. You then verify these 2 numbers against the source system. If you don't get this step done, don't move to the next step: if the grand total is wrong, there's no point getting the breakdown.
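As a sketch of this reconciliation, here is step 1 in Python against an in-memory SQLite database. The table name, column names and all the figures are illustrative; in a real test the warehouse total and the source-system total come from two different systems:

```python
import sqlite3

# Illustrative fact table: 2 dim key columns and 2 measures,
# as in figure 1. The rows are made up.
conn = sqlite3.connect(":memory:")
conn.execute(
    "create table fact1 "
    "(dim1key int, dim2key int, measure1 real, measure2 real)"
)
conn.executemany(
    "insert into fact1 values (?, ?, ?, ?)",
    [(1, 1, 2_000_000, 1_500), (1, 2, 1_500_000, 1_000), (2, 1, 1_500_000, 1_500)],
)

# Step 1: sum the measures, ignoring the dim key columns entirely.
wh_total1, wh_total2 = conn.execute(
    "select sum(measure1), sum(measure2) from fact1"
).fetchone()

# Grand totals obtained independently from the source system (illustrative).
src_total1, src_total2 = 5_000_000, 4_000

print(wh_total1 == src_total1, wh_total2 == src_total2)  # True True
```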

Step 2. You then check the breakdown by each dimension attribute. Let's say your dim1 is like this:

Figure 2. A dim table with 1 dim key column and 3 attribute columns

You do something like this:

```
select d.attribute1, sum(f.measure1), sum(f.measure2)
from fact1 f
inner join dim1 d on d.dim1key = f.dim1key
group by d.attribute1
```

Say the output is like this:

Figure 3. Sum of measure1 and measure2 by attribute1

Of course the SQL doesn’t give you the total; you calculate the total yourself. You then verify this output against the source system. a) Is there a missing member, or an extra member? b) Are the numbers for measure1 and measure2 all correct?

Step 2 is actually quite a big task. The number of queries is the same as the number of attributes. For a data warehouse with 10 dimensions containing 10 attributes each, you will have to produce 100 queries.
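Because the number of breakdown queries grows with the attribute count, it can help to generate and run them in a loop rather than writing them by hand. A minimal sketch in Python with an in-memory SQLite database; the schema, attribute names and data are all illustrative:

```python
import sqlite3

# Illustrative schema: one dimension with two attributes. A real warehouse
# would have many dimensions with many attributes each.
conn = sqlite3.connect(":memory:")
conn.execute("create table dim1 (dim1key int, attribute1 text, attribute2 text)")
conn.execute("create table fact1 (dim1key int, measure1 real, measure2 real)")
conn.executemany("insert into dim1 values (?, ?, ?)",
                 [(1, "a", "x"), (2, "b", "y")])
conn.executemany("insert into fact1 values (?, ?, ?)",
                 [(1, 3_000_000, 2_500), (2, 2_000_000, 1_500)])

# One breakdown query per attribute: with 10 dimensions of 10 attributes
# each, this loop would emit the 100 queries of step 2.
dims = {"dim1": ["attribute1", "attribute2"]}
results = {}
for dim, attributes in dims.items():
    for attr in attributes:
        sql = (f"select d.{attr}, sum(f.measure1), sum(f.measure2) "
               f"from fact1 f inner join {dim} d on d.{dim}key = f.{dim}key "
               f"group by d.{attr} order by d.{attr}")
        results[(dim, attr)] = conn.execute(sql).fetchall()

# Each result set is then verified against the source system.
print(results[("dim1", "attribute1")])
```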

Step 3. You verify the bottom level (the leaf level). Obviously we can't verify all the rows in the warehouse, because of the volume: a fact table can contain a billion rows. So here we are forced to take samples. Pick a set of attribute values, e.g. dim1.attribute1 = value1a, dim2.attribute2 = value2a, etc. When you apply this filter to the fact table, you get a small number of rows, like this:

```
select f.measure1, f.measure2
from fact1 f
join dim1 d1 on d1.dim1key = f.dim1key
join dim2 d2 on d2.dim2key = f.dim2key
where d1.attribute1 = 'value1a' and d2.attribute2 = 'value2a'
```

So instead of 1 billion rows we only get, say, 1,000 rows. From this (value1a, value2a) set of 1,000 rows we choose 3 rows, and we then verify all the measures in those 3 rows. We do the same thing with fact2, fact3, etc. until we have covered all the fact tables in the data warehouse.
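The sampling in step 3 can be sketched like this, again with Python and an in-memory SQLite database (all tables and rows made up). The point is that the attribute filter cuts the fact table down to a set small enough to sample a few rows from:

```python
import random
import sqlite3

# Illustrative star schema matching the query above.
conn = sqlite3.connect(":memory:")
conn.execute("create table dim1 (dim1key int, attribute1 text)")
conn.execute("create table dim2 (dim2key int, attribute2 text)")
conn.execute(
    "create table fact1 "
    "(dim1key int, dim2key int, measure1 real, measure2 real)"
)
conn.executemany("insert into dim1 values (?, ?)", [(1, "value1a"), (2, "value1b")])
conn.executemany("insert into dim2 values (?, ?)", [(1, "value2a"), (2, "value2b")])
conn.executemany(
    "insert into fact1 values (?, ?, ?, ?)",
    [(1, 1, 100, 1), (1, 1, 200, 2), (1, 1, 300, 3), (1, 2, 400, 4), (2, 1, 500, 5)],
)

# Apply the attribute filter to cut the fact table down to a small row set.
rows = conn.execute(
    "select f.measure1, f.measure2 from fact1 f "
    "join dim1 d1 on d1.dim1key = f.dim1key "
    "join dim2 d2 on d2.dim2key = f.dim2key "
    "where d1.attribute1 = ? and d2.attribute2 = ?",
    ("value1a", "value2a"),
).fetchall()

# From the filtered set, pick 3 rows at random to verify against the source.
sample = random.sample(rows, k=min(3, len(rows)))
print(len(rows), len(sample))
```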

Finding the error rows

Suppose that in step 1 you found that the total doesn't match the source systems. How do you find the incorrect rows? The best way is to use a cube. You could use reports, but that would be a tedious process, whereas using a cube is very easy. I describe the process here.

I have been involved in a data warehouse project where they never tested the totals. The System Integration Test only tested 1 or 2 rows, and the User Acceptance Test only tested 1 or 2 rows too. Whilst those 2 rows might be correct, other rows can be wrong. So it is extremely important to test the totals, for all measures. Getting the totals for each measure is not difficult; finding the error rows is the difficult task. But with a cube, imagine that the total error in measure1 is £18,755, i.e. the total difference between the source systems and the data warehouse for measure1. You can then drill down by any dimension/attribute you like. Say you drill down by cost centre. You will find something like this:

Figure 4. Drill down of measure1 discrepancy by cost centre.

Figure 4 makes it clear that the cause of this £18,755 difference is cost centres 4 and 6. All the other cost centres are spot on: they fully reconcile, i.e. the source equals the target. Only in cost centres 4 and 6 are the totals not the same.
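What the cube shows in figure 4 amounts to comparing per-cost-centre totals of measure1 between the two systems. A minimal sketch: the £18,755 total and the 6,700 for cost centre 4 come from the text; all the other figures are made up to match them:

```python
# Per-cost-centre totals of measure1 from the source systems and the
# warehouse (illustrative, chosen so the differences add up to £18,755).
source = {1: 10_000, 2: 8_000, 3: 5_000, 4: 20_000, 5: 7_000, 6: 15_000}
warehouse = {1: 10_000, 2: 8_000, 3: 5_000, 4: 13_300, 5: 7_000, 6: 2_945}

# Difference per cost centre; only the non-zero ones need investigating.
diff = {cc: source[cc] - warehouse[cc] for cc in source}
suspects = {cc: d for cc, d in diff.items() if d != 0}

print(suspects)            # {4: 6700, 6: 12055}
print(sum(diff.values()))  # 18755
```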

You can then drill down on cost centre 4 by another attribute, say country, like this:

Figure 5. Drill down of measure1 discrepancy by country

So you know that the 6,700 difference in cost centre 4 is caused by countries 5 and 8.

If you keep drilling down, eventually you will arrive at the leaf level, the most detailed level (say it's the transaction level), which is like this:

Figure 6. Detail/transaction level

Once we have found which transactions are causing the difference, we can investigate them in the source system and compare them to the data warehouse. Is it because of a currency conversion? A business rule? An incorrect business or surrogate key?

Once we have found the cause of the issue, we can correct the data integration (ETL) program, reload the warehouse, and refresh the cube. Hopefully this time the total discrepancy for measure1 at the highest level will be 0, and we can then move on to the next measure.

I hope this provides a little guidance on how to test a data warehouse. If you have any questions, a different opinion, or comments to add, please don't hesitate to contact me at vrainardi@gmail.com or via the comment button.

Vincent Rainardi, 19/12/2011

