Data Warehousing and Data Science

24 April 2011

Reviewing a Dimensional Model

Filed under: Data Warehousing — Vincent Rainardi @ 8:03 am
Tags:

Sometimes in a data warehousing or BI project we need to review an existing dimensional model. In this post I will try to write down what to look for when reviewing one.

FUNCTIONALITY/BUSINESS

The most important thing to look out for when reviewing a dimensional model is the functionality. The technical side is the easy bit. You can read the Kimball books, read all the Kimball Group articles, read my blog, and you might consider yourself a technical expert on dimensional modelling. But if you don’t understand insurance, you won’t be able to review the dimensional model at a Lloyd’s underwriter. You could be asking lots of basic questions: what is a “line”, what does “to underwrite” mean? During the first month of the project you will have to spend a lot of hours reviewing the business requirements of the source systems. You need to understand the source systems. That’s the key.

After you understand the source systems, you can ask yourself these questions (I’m going to use an investment bank as an example, and abbreviate Dimensional Model as DM):

  1. Pick a business area (e.g. Repo, Credit Risk, FX, Fixed Income). Don’t try to analyse the business in its entirety.
  2. In one area, say Repo, list all the sub areas that need analysis (for example position analysis, exposure analysis, collateral analysis, etc.) and check whether all of them are covered in the DM. It is possible that you have 6 sub areas for analysis and the DM only covers 5.
  3. Then pick one sub area (say exposure analysis) and ask yourself: what are the business questions that need answering when analysing exposure? For example: daily re-pricing/marking-to-market, exposure by counterparty, rating, liquidity analysis, etc. For instance: “what is the value of three days’ worth of the underlying security’s turnover?” To manage liquidity the business needs to know this.
    A good business analyst has experience of Repo at other banks, hence he or she can guide the business user on what to do and expect in the analysis. But what I observe is that the data warehouse architect/designer often doesn’t understand the business, hence he/she can’t design the DM.
  4. Now that you know all the questions, you can check whether the DM can be used to answer them. Are there missing attributes or measures? Are they at the right grain? Is a fact table missing? Is a dimension missing? Is history captured? Does the fact table contain all the necessary data? (For example: we need 5 product types but the fact table only contains 4.)
    I must emphasise that there is no point going straight to point 4 without doing points 1 to 3.

TECHNICAL

As I said, this is the easy bit. Just read Kimball’s DW Toolkit book and you will be able to evaluate the design of the DM. If you are the business user (rather than the dimensional modeller), avoid going straight into the technical details without discussing the functionality. You need to sort out the functionality first. There is no point talking about SCDs, grain, etc. if the functionality is wrong.

I’ll list some technical things to look for:

  1. Measures stored in dimensions
  2. Incorrect grain (see the grain-check sketch below)
  3. DD (degenerate dimensions) not captured
  4. SCD type 2 overused on all attributes
  5. Incorrect data types
  6. Some data required but not available
  7. Daily snapshot causing performance issues
  8. Aggregate fact tables not defined
  9. Multi-valued attributes

As I said, the Kimball Toolkit book will give you an exhaustive list. You can also refer to my Advanced Dimensional Modelling presentation here to get some ideas.
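
As a small illustration of point 2 above (incorrect grain), a query like the sketch below can confirm whether a fact table really is at its declared grain. The fact table and column names here are made up for the example; substitute the columns that define the declared grain.

-- Sketch: verify that a fact table is at its declared grain
-- (here assumed to be one row per date, account and security;
-- FactPosition and the key columns are hypothetical names).
select snapshot_date_key, account_key, security_key, count(*) as row_count
from FactPosition
group by snapshot_date_key, account_key, security_key
having count(*) > 1;   -- any rows returned mean the declared grain is violated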

Where To Put an Attribute (SSAS)

Filed under: Analysis Services — Vincent Rainardi @ 6:28 am
Tags:

Suppose you have a cube and you need to add an attribute to it. Say the cube structure is: 1 measure group (MG) containing 1 measure, and 1 dimension containing 1 attribute.
The physical table structure is: MG1 is built from table fact1, and dim1 is from table dim1.
You need to add attribute2 to this cube. Attribute2 is located on table dim2. Fact2 links Dim2 to Dim1.

Here we have 2 options:
1. Create Dim2 in AS, create MG2 and link MG2 to both Dim1 & Dim2.
2. Join dim2 to dim1 in the DSV (using Fact2), and put attribute2 on dim1 (see the sketch below).
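
To make option 2 concrete, the DSV named query for dim1 could look something like this sketch. The key and attribute column names are assumptions for illustration, and note that if a dim1 member can map to more than one dim2 member, the dimension key (and grain) would have to change accordingly:

-- Sketch of a DSV named query for dim1 under option 2
-- (dim1_key, dim2_key, attribute1 and attribute2 are assumed column names):
select d1.dim1_key,
       d1.attribute1,
       d2.attribute2
from dim1 d1
left join fact2 f2 on f2.dim1_key = d1.dim1_key
left join dim2 d2 on d2.dim2_key = f2.dim2_key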

I found that, even though the processing takes slightly longer, the query on option 2 is quicker than on option 1. The query I mean is: list all attribute2 members for a particular attribute1 member. Something like:

select measures.measure1 on columns,
 dim1.attribute2.members on rows
 from cube
 where dim1.attribute1.&[member]

I also remember that on one occasion we got a performance benefit by moving an attribute to another dimension. Attrib1 was in dim1 and attrib2 was in dim2. The query “list all possible combinations of attrib1 and attrib2” was slow. Moving attrib1 to dim2 made the query quicker.

Please refer here for a discussion about this by Mosha. My previous article about Autoexist is here.

So, it’s better to put attributes which are frequently used together in the same dimension.

I should add a note to that last sentence: as long as the grain of the dimension doesn’t change a great deal. What I mean is: if, after moving an attribute into the dimension, the number of members increases a lot (say from a million to a billion), it is probably better not to move the attribute.

23 April 2011

IIS in Data Warehousing

Filed under: Data Warehousing — Vincent Rainardi @ 10:44 pm
Tags:

When your colleague mentions that they are using IIS for their data warehouse load, it’s not Microsoft Internet Information Server that they mean, but IBM InfoSphere Information Server, widely known as IIS.

When people talk about IIS, usually what they mean is IIS Workgroup, not IIS for DW. IBM has 2 IISes:

  • InfoSphere Information Server Workgroup Edition (IIS WE), which is for ETL. It contains DataStage, QualityStage, Information Analyzer, and Metadata Workbench.
  • InfoSphere Information Server for Data Warehousing (IIS DW), which contains the same things as IIS WE, but is for the IBM Smart Analytics System.

IIS is quite a comprehensive ETL / data integration tool. It consists of these components:

  • DataStage: IBM’s main ETL & data integration tool
  • QualityStage: IBM’s main data quality tool
  • FastTrack: writes the mapping specs used to generate DataStage & QualityStage jobs and reports
  • Business Glossary: for editing business meanings / the data dictionary
  • Blueprint Director: links everything together: metadata, ETL, data models
  • Information Analyzer: for understanding the content, structure and quality of the data
  • Information Services Director: for deploying DataStage/QualityStage jobs as web services or EJBs
  • Metadata Workbench: creates data lineage between databases, files and BI models
  • Metadata Server: stores operational metadata, such as how many rows were written


22 April 2011

T-SQL: Exists versus In – Performance Comparison

Filed under: SQL Server — Vincent Rainardi @ 4:25 pm
Tags:

In this test I’m going to query which numbers exist in Table1 but not in Table2, using “Exists” and “In”.

Exists Query:

select * from Table1 A where not exists
( select * from Table2 B where B.col1 = A.col1 )

In Query:

select * from Table1 where col1 not in
( select col1 from Table2 )

There will be 2 tests:

  • Test 1: Table2 is large
  • Test 2: Table2 is small

I believe that in Test 1 “Exists” will be a lot faster, but in Test 2 “In” will be slightly faster than “Exists”.
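
One simple way to measure the timings in SSMS is with SET STATISTICS TIME (just a sketch; the figures will vary from machine to machine):

-- Measure CPU and elapsed time for each query
-- (the timings appear on the Messages tab in SSMS):
set statistics time on;

select * from Table1 A where not exists
( select * from Table2 B where B.col1 = A.col1 );

select * from Table1 where col1 not in
( select col1 from Table2 );

set statistics time off;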

Test 1: Table 2 is large

Data Setup:
Table1 contains numbers from 1 to 100,000.
Table2 contains numbers from 10 to 100,000.

Expected result: Exists is faster than In.

Script to populate both tables with data (this assumes Table1 and Table2 already exist, each with two int columns, the first being col1):

truncate table table1;

declare @i int
set @i = 1
while @i <= 100000
begin
  insert into Table1 values (@i, @i)
  set @i = @i + 1
end;
go -- batch separator so @i can be re-declared below

truncate table table2;

declare @i int
set @i = 10
while @i <= 100000
begin
  insert into Table2 values (@i, @i)
  set @i = @i + 1
end;

Results:

  • Exists Query: average of 167 ms.
  • In Query: average of 1,892 ms.

This is as I predicted.

Test 2: Table 2 is small

Data Setup:
Table1 contains 100,000 rows. Col1 contains numbers from 0 to 49 (@i % 50).
Table2 contains 49 rows. Col1 contains numbers from 2 to 50.

Expected result: In is slightly faster than Exists.

Script to populate both tables with data (again assuming both tables already exist):

truncate table table1;

declare @i int
set @i = 1
while @i <= 100000
begin
  insert into Table1 values (@i % 50, @i)
  set @i = @i + 1
end;

select * from Table1;
go -- batch separator so @i can be re-declared below

truncate table table2;

declare @i int
set @i = 2
while @i <= 50
begin
  insert into Table2 values (@i, @i)
  set @i = @i + 1
end;

select * from Table2;

Results:

  • Exists Query: average of 219 ms.
  • In Query: average of 253 ms.

So I was wrong. Even when Table2 is small, “Exists” is still slightly faster than “In”. And when Table2 is large, “Exists” is a lot faster than “In”. So it is better to use Exists. Always.

Blogging Using Windows Live Writer

Filed under: Other — Vincent Rainardi @ 7:33 am

Advantages:

  • Simple, easy to use. Simpler than Microsoft Word.
  • You can set up multiple blog accounts and it is very easy to switch between them, even if you have the same user name for different blogs.
  • You can add pictures. There is no need to upload them separately to WordPress.
  • Font, colour, bold, italic, underline
  • You can set the tags as well as the categories. And you can set the post date too.
  • You can open a post from WordPress, edit it and update it.
  • You can edit the source (HTML) code/script.
  • You have a Preview facility (showing how the post would look on WordPress).
  • You can see the stats of your blog in WordPress.

Disadvantages:

  • You can’t change the location of the local draft*. Windows Live Writer lets us save our posts on our computer, but they always go to My Document\My Weblog Post\Draft. I’m not able to point this elsewhere. I’ve checked File|Options and Blog Account|Blog Options but there’s no option to set the directory for the local draft.
    (*There are 2 drafts: other than the local draft, which is on your PC, you can also post a draft to your blog.)
  • After you publish your post to WordPress, if you then modify it in WordPress, you can’t edit it again in WLW. Well, you can, but you get an old copy.
  • You can’t use

Null in Date Column

Filed under: SQL Server — Vincent Rainardi @ 6:14 am
Tags:

I had one of those blunder moments a few days ago when I said that a date column is not nullable. I was in SSIS, in a Derived Column transform, writing one of those conditional expressions: “a ? b : c”. I put “null” in c and got an error, hence my silly conclusion that a date column is not nullable. But of course it is nullable; any column is. It’s just that in SSIS it’s not written as “null” but with one of the NULL functions, e.g. “NULL(DT_DBDATETIME)”.
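
Just to double-check the database side of that statement, here is a trivial T-SQL test (plain SQL Server, nothing to do with SSIS) showing a date column happily accepting NULL:

-- A date column is nullable like any other column:
create table #t ( d date null );
insert into #t (d) values (null);   -- succeeds
select d from #t;                   -- returns one row containing NULL
drop table #t;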

21 April 2011

Blogging Using Microsoft Word

Filed under: Other — Vincent Rainardi @ 10:23 pm
Tags:

In the last couple of months I have written probably 30 blog posts using Microsoft Word (2010) and found it so much easier than typing directly into WordPress. I have 3 blogs, all of them on WordPress. One is about data warehousing and BI (this blog), and the other two are about chess and life thoughts. The main advantages of using Microsoft Word to blog are:

  1. It is quick and simple.
  2. Using Microsoft Word, I don’t have to upload the pictures one by one into WordPress.
  3. All the formatting is preserved, for example: italics, bold, underline, fonts, colours.
  4. You can set the category from within Word.
  5. You can use the Format Painter which is very useful.
  6. You have the thesaurus & dictionary nearby.
  7. You can insert symbols, shapes and screenshots easily.

Whereas the disadvantages are:

  1. When you use the blog template in Word, you can’t set the View to a larger font. (I used to use 140% zoom in Word.)
  2. Writing code (e.g. SQL scripts) is not so user friendly. I can copy and paste from SSMS into Word, and Word preserves all the blue, red, green and grey colours, which is fantastic, but when I publish the post they get messed up.
  3. You can’t set the tags. You can set the category, but not the tags, so you need to go into WordPress to add your tags.
  4. If you have several blogs in WordPress under 1 account, you can’t switch between them. You need to manually edit the blog URL every time you switch. Manage Accounts can only set up several accounts if they have different user names.

20 April 2011

T-SQL: Exists versus In

Filed under: SQL Server — Vincent Rainardi @ 6:31 am
Tags:

I keep forgetting the syntax for “Not Exists” so I thought I’d write a post for it.

select * from Table1 A where not exists
( select * from Table2 B where A.Col1 = B.Col1 ) 

The above query is known as a “correlated” query: the subquery refers back to the outer table.

You need the “A” and “B” aliases in the code above; otherwise you will have to type in the whole table name:

select * from Table1 where not exists
( select * from Table2 where Table2.col1 = Table1.col1 ) 

Well, not really. You can get away with not specifying the Table2 alias. But you have to specify Table1’s:

select * from Table1 where not exists
( select * from Table2 where col1 = Table1.col1 ) 

If you take away the Table1 prefix, it will return nothing, because there is no longer any correlation between the two tables (col1 = col1 just compares Table2’s column with itself):

select * from Table1 where not exists
( select * from Table2 where col1 = col1 ) 

Exists versus In

And while I’m at it I might as well explain the benefit of “Exists” compared to “In”. And the disadvantage.

Here’s “In”: (it’s always easier to explain by example)

select * from Table1 where col1 not in
(select col1 from Table2) 

If we want to compare 2 columns:

select * from Table1
where col1 not in (select col1 from Table2)
  and col2 not in (select col2 from table2) 

Whereas using “Exists” to compare 2 columns:

select * from Table1 A where not exists
( select * from Table2 B where A.Col1 = B.Col1 and A.Col2 = B.Col2 ) 

So comparing 2 columns using “In” is not the same as using “Exists”. The “In” version is not really comparing the 2 columns as a pair: each subquery checks its column independently, so a row can match even when its col1 and col2 values come from different rows of Table2.

To make the “In” really comparing 2 columns we need to do this:

select * from Table1
where convert(varchar,col1) + '|' + convert(varchar,col2) not in
(select convert(varchar,col1) + '|' + convert(varchar,col2) from Table2)

Above, we concatenate the two columns and compare the concatenated strings (the “|” separator stops values such as 12 & 3 colliding with 1 & 23).

Query Performance

Due to time limitations I thought I’d cover the performance in another post. But considering that performance is the most important factor for some people, I will set out the principles here and leave proving them for another time.

It is my belief that if the result set being compared against is small, then “In” is faster; but if it is large, then the correlated “Exists” (which is implemented as a join) is faster.

For example, compare these 2 SQL Statements, one for Exists, one for In:

 select * from Table1 A where not exists
( select * from Table2 B where A.Col1 = B.Col1 ) -- Query1

select * from Table1 where col1 not in
( select col1 from Table2 ) -- Query2 

If Table2 is small (say 10 rows), the subquery in Query 2 will run fast; SQL Server will keep the result in memory and the lookup happens in memory, hence it is fast. But if Table2 is large (say 4 million rows), SQL Server will spill the result to disk and the lookup will have to search the disk, hence it is slow.

On the other hand, the correlation is implemented as a join, which is very efficient at dealing with two large sets. Say both Table1 and Table2 contain 4 million rows. If (and that’s a big if) Col1 is indexed on both tables, Query 1 will run fast because Col1 is the join criterion.

The question I had for years was the efficiency of Query 2 when dealing with a small data set. Say Table1 has 4 million rows and Table2 has 10 rows. Can Query 2 beat Query 1? In SQL Server 7 and 2000 I believe it could, but in 2008 I don’t believe that is the case. It depends on the indexes, and the logic for hash joins in SQL Server 2008 has improved a lot. We have 3 types of hash join in SQL Server: in-memory hash join, grace hash join and recursive hash join (see here). Quote: “SQL Server starts by using an in-memory hash join and gradually transitions to grace hash join, and recursive hash join, depending on the size of the build input.” Note that SQL Server doesn’t know the size of a table if its statistics are out of date, hence it is important to keep the statistics up to date.
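
For reference, indexing the correlation column and refreshing the statistics before comparing the two queries could look like the sketch below (the index names are made up, and it assumes Col1 exists on both tables):

-- Index the column used in the correlation / IN list on both tables:
create index IX_Table1_Col1 on Table1 (Col1);
create index IX_Table2_Col1 on Table2 (Col1);

-- Refresh statistics so the optimiser knows the current sizes of the
-- build and probe inputs when choosing a hash join strategy:
update statistics Table1;
update statistics Table2;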

17 April 2011

Inserting Multiple Rows in One SQL Statement

Filed under: Analysis Services — Vincent Rainardi @ 6:42 am
Tags:

In SQL Server 2008 we can insert multiple rows in one SQL statement:

create table table1
( col1 int, col2 int );

insert into table1 (col1, col2)
values (1, 1), (2, 2), (3, 3);

Don’t forget the semicolon. This will be required in SQL Server 2011.

5 April 2011

ETL and Data Integration

Filed under: Data Warehousing — Vincent Rainardi @ 6:06 pm
Tags:

I realise that a “point to point” concept like ETL is a costly option if you have many consumers. The alternative is for everybody to come to a Central Point to get data. Consider this “Point to Point” diagram:

System A publishes to 3 systems (B, C and D), indicated by the red arrows, while C consumes from 3 sources (A, B and D). Real life is like that too: a transaction system is consumed by many systems, whereas a data warehouse/mart consumes from many systems. In large corporations there could be 100 systems instead of just 4, and that is where it gets really complicated. Imagine the above diagram for a minute, but with 100 nodes. Below is a diagram for just 10 systems; imagine if it were 100.

In the above diagram, systems B, C, D, G and J essentially do the same thing, i.e. they all consume data from system A. Yes, but they consume different things. OK, fair point, but still, there is a cost-saving opportunity here. If A publishes its data and B, C, D, G and J all subscribe to it, then it would be simpler. Imagine all data sources publishing to a Central Point and all data consumers reading from that Central Point:

From a data integration point of view, that’s a lot simpler. So in large corporations, that’s what happens: the data warehouse does not read directly from many source systems, but consumes data from a Central Point. That’s Data Integration.

Apart from simplifying the routes, the other benefit of having a Central Point is a standardised mechanism for publishing and consuming data. This standardisation also reduces the cost of development, because the mechanism is developed once and all systems then use it to consume data.

By having a Central Point, we can monitor the data traffic. We can police the traffic. We can ensure that everybody adheres to the standard format. We can have a catalogue of publishers: the data they publish, when it was published, what the frequencies are, what validation was done before publishing, who the consumers are, when the data was last sent out, who the support team is, etc. We can have a catalogue of consumers: the data they consume, when it was last consumed, etc. And we can build a logging mechanism: number of records sent, kilobytes sent, volume of data from each publisher, etc.
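
As a rough illustration only (not a prescribed design), such a catalogue and traffic log could start with a handful of tables like this sketch; all table and column names are made up:

-- Minimal sketch of a publisher/consumer catalogue and a traffic log
-- for the Central Point (illustrative names only):
create table publisher
( publisher_id  int identity primary key,
  system_name   varchar(100),
  data_set      varchar(100),   -- what it publishes
  frequency     varchar(50),    -- e.g. daily, intraday
  validation    varchar(200),   -- validation done before publishing
  support_team  varchar(100)
);

create table consumer
( consumer_id   int identity primary key,
  system_name   varchar(100),
  data_set      varchar(100),   -- what it consumes
  last_consumed datetime
);

create table traffic_log
( log_id         int identity primary key,
  publisher_id   int references publisher(publisher_id),
  consumer_id    int references consumer(consumer_id),
  sent_datetime  datetime,
  row_count      int,
  kilobytes_sent int
);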

Because all traffic must go through a central point, monitoring is a lot easier. The disadvantages, of course, are a single point of failure, and bandwidth: we need to provide adequate bandwidth. Weighing the benefits (cost saving, simplification, monitoring, compliance, validation) against the disadvantages, the key decision factor is the number of nodes. If you are a small company with only 4 systems, it’s not worth it. If you are a large corporation with 40 systems, it’s definitely worth it.

That’s data integration. Not ETL.

As usual I welcome questions and comments at vrainardi@gmail.com. Vincent, 5/3/11.
