Data Warehousing and Data Science

27 December 2011

SQLBits 10 London

Filed under: Event — Vincent Rainardi @ 7:20 am

The 10th SQLBits conference will be on 29th to 31st March 2012 (Thursday to Saturday) in Novotel London West, London W6 8DR.

This time, the SQLBits will not be just a SQL Server Conference. It will also be the place where Microsoft will announce SQL Server 2012, which is known as Denali. See here [link] for new features of SQL Server 2012.

I have submitted a session for SQLBits 10, titled “Data warehousing bits from my experience” (link).

In the previous SQLBits sessions I’ve shared pretty much the theory of dimensional modelling. See my previous sessions here, here and here.

Enough with the theory now I’d like to share my observations and experience during various data warehousing projects I’ve been and from helping lots of people with issues on their DW projects who contacted me via my blog. And from consulting experience (sometimes between contracts I do consulting, 1 day here and there, advising clients on their DWs) It is very useful to learn what went wrong, how we could better it next time, etc. Be it design decision, testing methods, dimensional modelling, loading mechanism, or cube/BI there are various aspects that I’ve been through that I can share. Of course, there will be no company names, for client confidentiality reason. All that will be I presented will be real cases, but I will have to annonymise them. True that I can’t share any of their data, but I can share the principles and technique of how we addressed those issues and situations.

I take on board the feedback from previous session in Brighton that you don’t like me to “read the presentation”. You would like me to be shorter on the text on PowerPoint slides and tell you more about the slides on the microphone. OK, I will do that.

Testing Your Data Warehouse

Filed under: Data Warehousing — Vincent Rainardi @ 6:36 am
Tags:

Now that your data warehouse is built, populated, and ready, you need to test it. How do you test a data warehouse? Where do you start? There are a billion numbers and strings in the data warehouse and how do you verify them all? Can you just verify a few rows and then declare that the data warehouse is fit for purpose? Of course not. So how do we do it?

Testing the data warehouse is the easy bit. The difficult bit is to find out where the errors are. We test the data warehouse by looking at the grand total. Let’s say you have this fact table:

Figure 1. A fact table with 2 measures and 2 dim key columns

Step 1. You ignore the dim key column and just get the sum of measure1 and sum of measure2. Say the total is 5,000,000 for measure1 and 4,000 for measure2. You then verify these 2 numbers against the source system. If you don’t get this step done don’t move to the next step. If the grand total is wrong there’s no point getting the break down.

Step 2. You then check the breakdown of the dimension attribute. Let say your dim1 is like this:

Figure 2. A dim table with 1 dim key column and 3 attribute columns

You do something like this:


select d.attribute1, sum(f.measure1), sum(f.measure2)
 from fact1 f
 inner join dim1 d on d.dim1key = f.dim1key
 group by d.attribute1

Say the output is like this:

Figure 3. Sum of measure1 and measure2 by attribute1

Of course the SQL doesn’t give you the total; you calculate the total yourself. You then verify this output against the source system. a) Is there a missing member, or an extra member? b) Are the numbers for measure1 and measure2 all correct?

Step 2 is actually quite a big task. The number of queries is the same as the number of attributes. For a data warehouse with 10 dimensions containing 10 attributes each, you will have to produce 100 queries.

Step 3. You verify the bottom levels (the leaf level). Obviously we can’t verify all rows in the warehouse, because of the volume. A fact table can contain a billion rows. So here we are forced to take sample. So get a set of attribute value, i.e. dim1.attribute1 = value1, dim2.attribute2 = value2, etc. When you apply this filter to the fact table, you get a small number of rows, like this:


select f.measure1, f.measure2
 from fact1
 join dim1 d1 on d1.dim1key = f.dim1key
 join dim2 d2 on d2.dim2key = f.dim2key
 where d1.attribute1 = ‘value1a’ and d2.attribute2 = ‘value2a’

So instead of 1 billion rows we only get say 1000 rows. So from this (value1a, value2a) set which contains 1000 rows we chose 3 rows. We then verify all the measures in these 3 rows. Then we do the same thing with fact2, fact3, etc until we have done all the fact tables in the data warehouse.

Finding the error rows

When you did step 1 for example, you found that the total doesn’t match with the source systems. So how do you find out the incorrect rows? The best way is using a cube. You can use reports, but it would be a tedious process. Whereas using a cube is very easy. I describe the process here.

I have been involved in a data warehouse project where they never tested the total. The System Integration Test only tested 1 or 2 rows. And the User Acceptance Test only tested 1 or 2 rows too. Whist those 2 rows might be correct, other rows can be wrong. So it is extremely important to test the total. For all measures. It is not difficult to get the totals for each measure. Finding the error rows is the difficult task. But with a cube, imagine if you have the total of the error in measure1 is £18,755. This is the total difference between the source systems and the data warehouse for measure1. You can then drill down by any dimension/attribute you like. Say you drill down by cost centre. You will find something like this:

Figure 4. Drill down of measure1 discrepancy by cost centre.

In figure 4 we are clear that the cause of this £18,755 difference is cost centre 4 and 6. All the other cost centres are spot on. They fully reconcile, i.e. the source is equal to the target. In cost centre 4 and 6 that the total are not the same.

You can then drill down on cost centre 4 by another attribute, say country, like this:

Figure 5. Drill down of measure1 discrepancy by country

So you know that the 6700 difference in cost centre 4 is caused by country 5 and country 8.

If you keep drilling down, eventually you will arrive at leaf level, the most detail level (say it’s transaction level), which is like this:

Figure 6. Detail/transaction level

Once we found which transactions are causing the difference, we can investigate that transaction in the source system, and comparing it to the data warehouse. Is it because of currency conversion? Is it because of business rule? Is the because of incorrect business / surrogate key?

Once we found the cause of the issue, we can correct the data integration program (ETL) and reload the warehouse again. And refresh the cube. Hopefully this time the total discrepancy for measure1 at the highest level will be 0. And we can then move on to the next measure.

I hope this provide a little bit of guidance of how to test a data warehouse. If you have any question or different opinion, or would like to add comments please don’t hesitate to contact me on vrainardi@gmail.com or via the comment button.

Vincent Rainardi, 19/12/2011

19 December 2011

Half Year To Date

Filed under: Analysis Services — Vincent Rainardi @ 7:50 am

Half Year To Date

In MDX we have Year To Date, Month To Date and Week To Date functions, but we don’t have Half Year To Date. This article is about how to do Half Year To Date.

YTD (Year To Date)

We use the YTD function to can get all dates/months from 1st Jan to a certain date. For example:

with member x as
( Aggregate
  ( YTD
    ( [Dim Date].[YMD Hierarchy].[Date].[2011-03-05]
    ), [Measures].[Measure1]
  )
)
select x on columns
from Cube;

The above MDX sums up measure1 from 1/1/2011 to 5/3/2011.

Note: in order for YTD to work,

  1. The type of the date dim must be “time”.
  2. The type of the year attribute must be “years”.

Another example:

with member x as
( Aggregate
  ( YTD
    ( [Dim Date].[YMD Hierarchy].[Month].[201103]
    ), [Measures].[Measure1]
  )
)
select x on columns
from Cube;

The above MDX sums measure1 from 1st Jan to end of March 2011.

with member x as
Aggregate
( YTD
  ( [Dim Date].[YHMD].CurrentMember
  ), [Measures].[Measure1]
)
select x on columns,
[Dim Date].[YHMD Hierarchy].[Date] on rows
from Cube;

The above lists all the dates and the sum of measure1 from 1st Jan to that date.

MTD (Month To Date) and WTD (Week To Date)

We use MTD function to get all the dates from the beginning of the month to a particular date. For example:

with member x as
( Aggregate
  ( MTD
    ( [Dim Date].[YHMD].[Date].[2011-03-01]
    ), [Measures].[Measure1]
  )
)
select x on columns
from Cube;

The type of the “month” attribute must be “months” otherwise MTD won’t work.

WTD works the same way as YTD and MTD, i.e. we get all the dates from the beginning of the week to the date we specified. Again the type of the “week” attribute must be “weeks”.

Half Year to Date (HTYD)

To do HYTD we can do this:

with member x as
Aggregate
(
  ( Ancestor
    ( [Dim Date].[YHMD Hierarchy].CurrentMember,
      [Dim Date].[YHMD Hierarchy].[Half Year]
    ).FirstChild.FirstChild:
    [Dim Date].[YHMD Hierarchy].[Date].[2011-08-17]
  ), [Measures].[Measure1])
select x on columns
from Cube;

This code sums measure1 from 1/7/2011 to 17/8/2011.

Here is how it works: from 17/8/2011 we climb up 2 levels to get the half year which is 2011H2. Then we get the first child of 2011H2 which is July 2011. Then we get the first child of July 2011 which is 1/7/2011. We then sum measure1 from 1/7/2011 to 17/8/2011.

Of course in order for this to work you must have Half Year attribute in your date dimension. The type of this Half Year attribute doesn’t have to be “half year”.

Another way of doing HTYD

If you don’t have Half Year level on your date dimension, you can you HYTD like this:

with member a as membertostr([Dim Date].[Date].CurrentMember)
member b as len(a) member c as left(a,25) member d as mid(a,28,100)
member e1 as c+"01"+d member e2 as c+"07"+d
member x as
iif
( mid([Dim Date].[Date].CurrentMember.Name,6,2) <= "06",
  Aggregate
  ( ( strtomember(e1):[Dim Date].[Date].CurrentMember ),
    [Measures].[Measure1]
  ),
  Aggregate
  ( ( strtomember(e2):[Dim Date].[Date].CurrentMember ),
    [Measures].[Measure1]
  )
)
select x on columns,
[Dim Date].[Date].[Date] on rows
from Cube;

Here is how it works: first we convert the current member to string, i.e. for 17/8/2011 we get [Dim Date].[Date].&[2011-08-17T00:00:00]

0        1         2         3         4
1234567890123456789012345678901234567890
[Dim Date].[Date].&[2011-08-17T00:00:00]

We then get the first 25 characters: “[Dim Date].[Date].&[2011-“ —– (c)

Then get character 31 onwards: “T00:00:00]” —– (d)

And we replace the month and date with 01-01 and 07-01 which are the beginning of the half year:

[Dim Date].[Date].&[2011-01-01T00:00:00] —– e1

[Dim Date].[Date].&[2011-07-01T00:00:00] —– e2

Now we can sum measure1 from the beginning of half year to the current member:

(strtomember(e1):[Dim Date].[Date].CurrentMember), or

(strtomember(e2):[Dim Date].[Date].CurrentMember)

But first we need to find out if the current member is on H1 or H2:

iif( mid([Dim Date].[Date].CurrentMember.Name,6,2) <= “06”, …

Find Function

If you don’t like hardcoding the “25” and “31” above you can use the “instr” function to find string within a string in MDX:

with member a as
instr(membertostr([Dim Date].[Date].CurrentMember),'T')
select a on columns,
  [Dim Date].[Date].[Date] on rows
from Cube;

Scratch Page

Below is my scratch page when doing the work. I’ll share it as some of you might find it useful for doing string parsing and date parsing in MDX.

select [Measures].[Measure1] on columns,
[Dim Date].[Date].[Date] on rows
from HYTD;

with member x as
Aggregate
( ([Dim Date].[Date].FirstChild:[Dim Date].[Date].CurrentMember),
[Measures].[Measure1])
select x on columns,
[Dim Date].[Date].[Date] on rows
from HYTD;

with member x as
iif(1=1, Aggregate
( ([Dim Date].[Date].FirstChild:[Dim Date].[Date].CurrentMember),
[Measures].[Measure1]), 2 )
select x on columns,
[Dim Date].[Date].[Date] on rows
from HYTD;

with member x as
iif([Dim Date].[Date].CurrentMember.Name = "2011-01-01", 1, 2 )
select x on columns,
[Dim Date].[Date].[Date] on rows
from HYTD;

with member x as
iif(mid([Dim Date].[Date].CurrentMember.Name,6,2) <= "06", 1, 2 )
select x on columns,
[Dim Date].[Date].[Date] on rows
from HYTD;

with member x as
iif
( mid([Dim Date].[Date].CurrentMember.Name,6,2) <= "06",
Aggregate
( ([Dim Date].[Date].FirstChild:[Dim Date].[Date].CurrentMember),
[Measures].[Measure1]
),
2
)
select x on columns,
[Dim Date].[Date].[Date] on rows
from HYTD;

select {} on columns,
[Dim Date].[Date].[Date] on rows
from HYTD;

with member x as
len(membertostr([Dim Date].[Date].CurrentMember))
select x on columns,
[Dim Date].[Date].[Date] on rows
from HYTD;

with member x as
len(membertostr([Dim Date].[Date].CurrentMember))
select x on columns,
[Dim Date].[Date].[Date] on rows
from HYTD;

with member x as
membertostr([Dim Date].[Date].CurrentMember)
member y as len(x)
select x on columns,
[Dim Date].[Date].[Date] on rows
from HYTD;

with member a as
membertostr([Dim Date].[Date].CurrentMember)
member b as len(a)
member c as left(a,25)
member d as mid(a,28,100)
member e as c+"06"+d
select e on columns,
[Dim Date].[Date].[Date] on rows
from HYTD;

with member a as
membertostr([Dim Date].[Date].CurrentMember)
member b as len(a)
member c as left(a,25)
member d as mid(a,28,100)
member e as c+"06"+d
member f as strtomember('[Dim Date].[Date].&[2011-01-01T00:00:00]')
select {strtomember('[Dim Date].[Date].&[2011-01-01T00:00:00]')} on columns,
[Measures].[Measure1] on rows
from HYTD;

with member a as
membertostr([Dim Date].[Date].CurrentMember)
member b as len(a)
member c as left(a,25)
member d as mid(a,28,100)
member e as c+"06"+d
member f as strtomember(e)
select {strtomember('[Dim Date].[Date].&[2011-06-01T00:00:00]')} on columns,
[Measures].[Measure1] on rows
from HYTD;

with member x as
Aggregate
(
( Ancestor ( [Dim Date].[YHMD].CurrentMember, [Dim Date].[YHMD].[Half Year] ).FirstChild.FirstChild:
[Dim Date].[YHMD].CurrentMember
),
[Measures].[Measure1])
select x on columns,
[Dim Date].[YHMD].[Date] on rows
from HYTD;

with member x as 1
select x on columns,
[Dim Date].[YHMD].[Date] on rows
from HYTD;

with member x as
membertostr
( Ancestor ( [Dim Date].[YHMD].CurrentMember, [Dim Date].[YHMD].[Half Year] ).FirstChild.FirstChild)
select x on columns,
[Dim Date].[YHMD].[Date] on rows
from HYTD;

with member x as
Aggregate
(
( Ancestor
( [Dim Date].[YHMD].CurrentMember,
[Dim Date].[YHMD].[Half Year]
).FirstChild.FirstChild:
[Dim Date].[YHMD].[Date].[2011-03-01]
),
[Measures].[Measure1])
select x on columns
from HYTD;

with member x as
Aggregate
(
( Ancestor
( [Dim Date].[YHMD].CurrentMember,
[Dim Date].[YHMD].[Half Year]
).FirstChild.FirstChild:
[Dim Date].[YHMD].CurrentMember
),
[Measures].[Measure1])
select x on columns,
[Dim Date].[YHMD].[Date] on rows
from HYTD;

with member x as
(Aggregate(YTD([Dim Date].[YHMD].[Date].[2011-03-01]), [Measures].[Measure1]))
select x on columns
from HYTD;

with member x as
Aggregate(YTD([Dim Date].[YHMD].[Date].[2011-03-01]), [Measures].[Measure1])
select x on columns,
[Dim Date].[YHMD].[Date] on rows
from HYTD;

with member x as
Aggregate(YTD([Dim Date].[YHMD].CurrentMember), [Measures].[Measure1])
select x on columns,
[Dim Date].[YHMD].[Date] on rows
from HYTD;

with member x as
(Aggregate(YTD([Dim Date].[YHMD].[Month].[201103]), [Measures].[Measure1]))
select x on columns
from HYTD;

with member x as
(Aggregate(MTD([Dim Date].[YHMD].[Date].[2011-03-01]), [Measures].[Measure1]))
select x on columns
from HYTD;

with member x as
(Aggregate(WTD([Dim Date].[YHMD].[Date].[2011-03-01]), [Measures].[Measure1]))
select x on columns
from HYTD;

-- If you don't have HY level
with member a as membertostr([Dim Date].[Date].CurrentMember)
member b as len(a) member c as left(a,25) member d as mid(a,31,100)
member e1 as c+"01-01"+d member e2 as c+"07-01"+d
member x as
iif
( mid([Dim Date].[Date].CurrentMember.Name,6,2) <= "06",
Aggregate
( (strtomember(e1):[Dim Date].[Date].CurrentMember),
[Measures].[Measure1]
),
Aggregate
( (strtomember(e2):[Dim Date].[Date].CurrentMember),
[Measures].[Measure1]
)
)
select x on columns,
[Dim Date].[Date].[Date] on rows
from HYTD;

with member a as
membertostr([Dim Date].[Date].CurrentMember)
member b as len(a)
member c as left(a,25)
member d as mid(a,31,100)
member e as c+"07-01"+d
select e on columns,
[Dim Date].[Date].[Date] on rows
from HYTD;

--[Dim Date].[Date].&[2010-06-01T00:00:00]

-- research find function
with member a as
instr(membertostr([Dim Date].[Date].CurrentMember),'T')
select a on columns,
[Dim Date].[Date].[Date] on rows
from HYTD;

And here are the SSAS project (link) and the SQL table DDL+population script (link).

Vincent Rainardi 19/12/2011

Blog at WordPress.com.