Data Platform and Data Science

27 September 2011

Top 2 things I missed in SQL Server and Oracle

Filed under: Data Warehousing,Oracle,SQL Server — Vincent Rainardi @ 6:08 pm

In the last few months I have been using both SQL Server and Oracle every day, because the source system is in SQL Server and the data warehouse is in Oracle. I began to miss little things from SQL Server that I don't get in Oracle, and vice versa. I thought I'd write this experience down as it could be useful for others.

In this article, when I say Oracle I mean using Toad, and when I say SQL Server I mean using SSMS. You can use Toad with SQL Server, and you can use other UIs for Oracle, but that's for another article. Versions? SQL Server 2008 and Oracle 10g R2 (it doesn't really matter whether it's 10 or 11, as I use Toad as the UI).

Oracle: Control Enter

First is Control Enter, to run the SQL at the cursor. In SQL Server you need to select the code you want to run before pressing F5. In Oracle, if you press Control Enter, you don't need to select anything. Very useful; I use this every day.

Oracle: No need for “top 1000”

You don't need to type "top 1000" when selecting from a large table; it automatically brings back only 500 rows. That can be useful or annoying. Strangely, I found it very useful.

Oracle: Column Selection

It is called IntelliSense in SSMS 2005/2008, but Toad's version is better: it lets you select the column you want. When you type a table alias followed by a dot, it shows the column names. You do have to alias the table first, i.e. type "select from fact_interest_rate_swap S", then go back, put your cursor between "select" and "from", type "S." (S dot), and it will list all the columns of the fact_interest_rate_swap table.

Oracle: Count(*) Works

For the last 6 months or so I have been working on a project where we have a column called country-something, and in SSMS every time I type count(*) IntelliSense autocompletes it to this country column instead. It is so irritating that I have to carefully press Escape after typing "count" but before typing "(". In Toad this doesn't happen.

SQL Server: Copy including Header

This SSMS feature is very useful for putting data into Excel. In Oracle/Toad we need to use "Export to clipboard" to get the header.

SQL Server: Object Explorer

The saying is true: you don't know how valuable something is until you don't have it. In Toad we have the Object Palette, but it's not as good as SSMS's Object Explorer, where you can browse the columns, stored procedures, etc. In Toad's Object Palette you can only browse the tables, not the columns.

SQL Server: Top N

We don't know how useful top N is until we don't have it. In Oracle, a SQL Server person will wonder why "select top 100 * from table1" doesn't work. It's "select * from table1 where rownum <= 100".

In Informix/Ingres it's FIRST: select first 100 * from table1. And Informix's SKIP clause is very handy: select skip 10 first 100 * from table1.

In DB2 we have FETCH FIRST: select * from table1 fetch first 100 rows only.
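One caveat worth adding (it isn't in the comparison above): in Oracle, ROWNUM is assigned before the ORDER BY is applied, so to get the top 100 by a column you have to sort in a subquery first. A minimal sketch, using a hypothetical table1 with a col1 column:

-- SQL Server
select top 100 * from table1 order by col1 desc

-- Oracle: filter on rownum only after the subquery has sorted the rows
select *
from   ( select * from table1 order by col1 desc )
where  rownum <= 100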

Ending

Well, it's not 2 things. It's more like 11 things, I know, but who's counting anyway? 🙂

UNK in Data Warehousing

Filed under: Data Warehousing — Vincent Rainardi @ 5:31 pm

The word "UNK" is quite popular in data warehousing. It is the short form of "Unknown". The word "Unknown" is used to populate a dimension attribute when it is NULL in the source system.

Approach I

For example, say you have an attribute called Country in your customer dimension. In the source system, for a particular customer, you don't have a country: the country code column is NULL. So in the customer dimension you set the country attribute to "Unknown". (a)

What if the country code is '' (blank string)? Then we could set it to "Unknown" too. (b)
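In the ETL this is usually a one-line column mapping. A minimal sketch, assuming a hypothetical source table and column names; NULLIF turns the blank string into NULL, and COALESCE then substitutes "Unknown":

-- map both NULL and '' to 'Unknown' (in Oracle a blank string is already NULL)
select coalesce(nullif(s.country, ''), 'Unknown') as country
from   source_customer s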

Approach II

Other people/projects have a different convention:

a)   In the customer dimension, for Surrogate Key 0, they set the country as “Unknown”.

b)   But for a customer with a proper SK, they set it to whatever is in the source system, be it blank string or NULL.

Now here’s the difference between I and II above.

For I, when you browse the country attribute in the BI tool (to list its values) you will see:
Afghanistan
Aland Islands
Albania
Algeria
Unknown

Whereas for II when you browse the country attribute in the BI tool you will see:
Afghanistan
Aland Islands
Albania
Algeria
[blank]
Unknown
[blank]

The first blank represents the blank string member.
The second blank represents the NULL member.

It is generally perceived that I is better than II. But different companies have different standards. I’ve seen both approaches implemented in the DW projects I’ve worked on.

UNK

Now back to the title of this article: UNK.

For attributes whose data type is narrower than 7 characters, we can't use "Unknown". Most DW practitioners use UNK. That is how UNK was born.

For example: country codes (2-char or 3-char ISO), currency codes (3-char ISO), and Yes/No attributes. For these 3-character-wide attributes we initialise the SK 0 row with UNK. And if you use Approach I, you will also see UNK in the non-SK-0 rows of the dimension tables.
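The SK 0 row itself is normally seeded once by the ETL. A minimal sketch, with hypothetical column names for dim_customer:

-- seed the unknown member: UNK for the narrow code columns, Unknown elsewhere
insert into dim_customer (customer_key, customer_name, country, currency_code)
values (0, 'Unknown', 'Unknown', 'UNK');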

2 char wide attributes

How about 2-character-wide attributes, such as the 2-char ISO country code? I don't think "UN" is a good idea, as it might be assigned by the ISO standard in the future (currently it isn't; the only assigned U-codes are UA, UG, UM, US, UY and UZ). A dash (-) is probably better, or a double dash (–).

UNK: Good or bad?

I don't think UNK is a good idea, because when it sits in the middle of other 3-character values, users could misinterpret it:

UAH, UGX, USD, USN, USS, UNK, UYI, UYU

Of all the currency codes above, only UNK is not valid. Who would know that?

I prefer a triple dash to UNK.

24 September 2011

A Data Warehouse Must Always Reflect the Values in the Source System

Filed under: Data Warehousing — Vincent Rainardi @ 8:51 pm

All these years I have assumed that the data warehouse is the place for a single version of the truth. It is the place for integration. If department A has data that differs from department B's, we integrate it in the DW. If the DW is for a group of companies, and the companies have different data, we integrate it there. And I have done this (integration) many times.

1. Integrating Rows: (as always it is easier to explain by example)

For example: it is a group DW for 2 companies. Company A has Customer1 to Customer100. Company B has Customer81 to Customer140. They have 20 common customers. We integrate this in the DW, so our DW customer dimension has 140 rows (plus SCD2 rows).

2. Integrating Attributes: If company A has 3 customer attributes (A, B, C) and company B has 4 attributes (A, B, D, E), then the DW customer dim has 5 attributes (A, B, C, D, E). We populate attributes A and B for all 140* rows. We populate attribute C only for 100* rows (Customer1 to Customer100). We populate attributes D and E for 60* rows (Customer81 to Customer140). *Plus SCD2 rows.
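One way to build that integrated dimension is a full outer join on the shared business key. A minimal sketch, assuming hypothetical staging tables company_a_customer and company_b_customer keyed on customer_id, with the attributes named attr_a to attr_e (SCD2 handling left out):

-- 140 distinct customers; C comes only from company A, D and E only from company B
select coalesce(a.customer_id, b.customer_id) as customer_id,
       coalesce(a.attr_a, b.attr_a)           as attr_a,
       coalesce(a.attr_b, b.attr_b)           as attr_b,
       a.attr_c,
       b.attr_d,
       b.attr_e
from            company_a_customer a
full outer join company_b_customer b on b.customer_id = a.customer_id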

The whole point of the "DW is the point of integration" theory is so that we can report group-wide numbers.

3. Multiple Versions of the Truth

But this week I learned something contradictory: different departments/projects may want different versions. Again, it is easier to explain by example:

Application1 creates and stores customer data into a database. This customer data is then fed into the company DW.

Application2 takes the customer data from App1's database. App2 then modifies the customer attributes and uses them for its calculations. App2 has rules like this: if attribute A is value1, replace it with value2. App2 stores the modified customer attributes in its own database. Clearly, App2's customer table contains different values from App1's.

Say the attribute is Class. In App1, Class contains values from A to Z. In App2 there is a rule which says: if the Class is X, Y or Z, change it to L. So in App1 the value of Class ranges from A to Z, whereas in App2 it ranges from A to W. Say in App1 we have 10,000 customers. In App2 we have 2,000 customers, all of which exist in App1's list of 10,000.

Now, how do we build the DW’s customer dim?

We need to have two Class attributes: “App1 Class” and “App2 Class”. We populate App1 Class for 10,000 rows. We populate App2 Class for 2,000 rows.

Can we have just one Class attribute in the DW? No we can't. Why? Because the two apps want to see different things. App2's rule clearly says: if the Class is X, Y or Z, change it to L.
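In the dimension load this simply means carrying both columns side by side. A minimal sketch, with hypothetical table and column names; note that the DW takes App2's Class as stored in App2's database rather than re-deriving the rule:

select c1.class as app1_class,   -- populated for all 10,000 customers
       c2.class as app2_class    -- populated only for App2's 2,000 customers,
                                 -- already containing the X/Y/Z -> L substitution
from      app1_customer c1
left join app2_customer c2 on c2.customer_id = c1.customer_id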

The basic principle of data warehousing is: what you see in the source system must be what you get from the DW/BI. It does not make sense if Customer1 has Class = L in App2 but Class = Y in the BI report. If in App2 the Class of Customer1 is L, then in the DW/BI it must be L.

This "the DW must report what is in the source system" principle is the most important principle of data warehousing. It takes precedence over any other principle, including the principle of integration.

4. Measures

Say in App1 we have a column called CFI, with values ranging from 0.0 to 1.0. In the data warehouse this becomes a measure in a fact table. App2 reads App1's CFI and adds a rule: if CFI >= 0.1, use it as-is; but if CFI < 0.1, use a specific formula to calculate CFI instead.

So in the DW we should have 2 measures: App1 CFI and App2 CFI. The App1 CFI can just be called CFI, because that's what everyone thinks CFI is, but the CFI from App2 must be clearly labelled in the DW as "App2 CFI".
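The fact load follows the same pattern as the Class attribute above. A minimal sketch, with hypothetical table names; App2's value is taken as App2 has already calculated it (the post does not spell out App2's formula, so it is not reproduced here):

select f1.customer_id,
       f1.cfi as cfi,         -- App1's CFI, 0.0 to 1.0
       f2.cfi as app2_cfi     -- App2's CFI: App1's value where >= 0.1,
                              -- otherwise App2's own calculation
from      app1_cfi f1
left join app2_cfi f2 on f2.customer_id = f1.customer_id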

5. The Importance of Timing

App1 and App2 data may arrive in the data warehouse at different times. App1 data arrives in the DW a day after it is created in App1. App2's data arrives in the DW 2 days after the underlying data is created in App1. Why? Because App2 only picks up App1's data overnight, and the DW then loads App2 the following night.

When creating App1 CFI and App2 CFI, the DW/BI must be aware that App2 CFI is 1 day behind App1 CFI, so it must not simply add them together. Tuesday's App1 CFI can be summed with Wednesday's App2 CFI, but Tuesday's App1 CFI cannot be summed with Tuesday's App2 CFI. This applies both in the DW's ETL and in the BI reports. You can't add data from different timings.

Similarly, on the dimension side, App1 Class has a different timing from App2 Class, so when a user wants to compare them they need to take Tuesday's App1 Class and compare it with Wednesday's App2 Class. They can't compare Tuesday with Tuesday.
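In the ETL or the report this shows up as a one-day offset in the date join. A minimal sketch, assuming hypothetical daily snapshot tables keyed on snapshot_date (Oracle date arithmetic; in SQL Server you would use dateadd(day, 1, ...)):

-- align Tuesday's App1 figures with Wednesday's App2 figures
select a1.snapshot_date as app1_date,
       a1.cfi           as app1_cfi,
       a2.cfi           as app2_cfi
from app1_snapshot a1
join app2_snapshot a2
  on  a2.customer_key  = a1.customer_key
  and a2.snapshot_date = a1.snapshot_date + 1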

6. The downstream principle

If App2 is getting data from App1, and App3 is getting data from App2, then

a)    There’s an increased chance of getting it wrong (Chinese whispers)

b)    We are creating a dependency on App2

It is better for App3 to read directly from App1. In large corporations this is known as the downstream principle: cascading data from one downstream app to another is not allowed. All downstream apps must get the data directly from the app that creates it.

This is not too bad in small companies, but if a group has many companies it becomes a big problem. In big corporations there is the concept of "middleware": software that publishes data from one app to another. Middleware handles the messaging, queuing, resilience, etc. between apps. The middleware prevents direct ODBC connections between the consuming app and the source app.

Yes, the consuming app must get data directly from the source app (the downstream principle), but it must not use direct ETL, i.e. direct DB-to-DB connectivity. The consuming app must read the data from the Message Queuing System (MQS) middleware, such as MSMQ or WebSphere MQ, and the source app must send the data to the MQS.

This is because there are likely to be many consumers of that data, not only the DW but also other downstream apps. Hence: "publish once, consume many times". This principle is enabled by the MQS, aka the message broker. If all consuming apps read directly from the source app, the source app can't handle it: there will be a bottleneck on the source app.

Conclusion

If Customer1 in App2 has Class = L, in the DW it must be L. A data warehouse must always reflect the values in the source system. An MQS middleware enables many downstream apps, including the DW, to consume the same data from the source app without causing a bottleneck, because the source app publishes it only once.

18 September 2011

Slowly Changing Dimension Type 2 – Initialising a New Attribute

Filed under: Data Warehousing — Vincent Rainardi @ 8:26 pm

Intro: you have an existing type 2 dimension, which is already populated, and you would like to add a new type 2 attribute. How do you populate this new column? What value should you populate it with, particularly with regard to the different versions of the same dimension row? Thank you to Jacob for pointing this out.

Note: the date format in this article, and in this web site, is European (dd/mm/yyyy). Apologies to the American folks who use mm/dd/yyyy.

————————————————————————————————————————————————————————-

Before I begin I would like to thank my colleague Simon Jensen, who explained the concept to me.

Situation: we have a type 2 dimension. We added a new column. We want to initialise its value. As always it is easier to learn by example. This is a type 2 dimension:

On 1/8/2011 the customer changed tariff from A9 to B2. Then on 1/9/2011 the customer changed the meter from digital to dual dials.
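As a rough sketch of those three rows (the column names, the surrogate keys 1 to 3, the start date of the first row and the 31/12/9999 end date are all assumptions for illustration):

SK  Customer   Tariff  Meter Type  Valid From  Valid To
1   Customer1  A9      Digital     1/1/2011    31/7/2011
2   Customer1  B2      Digital     1/8/2011    31/8/2011
3   Customer1  B2      Dual Dials  1/9/2011    31/12/9999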

Now we have a new attribute: Payment Type

Customer1’s payment type is “Quarterly” until 14th August, then Direct Debit from 15th August.

Issue: how do we fill in the Payment Type column? We only have 3 rows, and the dates of those rows do not correspond to 15/8/2011.

Answer: we expire row 2 and insert a new row for 15/8/2011 to 31/8/2011:

Of course, for this new row we need to use a new surrogate key (4).
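A minimal SQL sketch of that expire-and-insert for Customer1, using assumed column names (customer_sk, customer_nk, valid_from, valid_to, etc.) consistent with the sketch above:

-- expire row 2: it now ends the day before the payment type change
update dim_customer
set    valid_to = to_date('14/08/2011', 'dd/mm/yyyy')
where  customer_sk = 2;

-- insert row 4 for 15/8/2011 to 31/8/2011, same attributes as row 2 except the new column
insert into dim_customer
  (customer_sk, customer_nk, tariff, meter_type, payment_type, valid_from, valid_to)
values
  (4, 'Customer1', 'B2', 'Digital', 'Direct Debit',
   to_date('15/08/2011', 'dd/mm/yyyy'), to_date('31/08/2011', 'dd/mm/yyyy'));

-- rows 1 and 2 then get Payment Type 'Quarterly', and row 3 'Direct Debit'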

Now we can clearly see the 3 changes for Customer1:
a) On 1/8/2011 he changed tariff from A9 to B2
b) On 15/8/2011 he changed payment type from Quarterly to Direct Debit
c) On 1/9/2011 he changed meter type from Digital to Dual Dials.

Now we can extrapolate this logic to all customers and write ETL logic to initialise the new attribute (Payment Type) for all rows in the table.

It does not mean that we will always have to create a new row. If Payment Type for a customer never changes, then obviously we don’t need to create a new row in the table. In the above example, if the Payment Type for Customer1 is always Direct Debit, then we just populate the existing 3 rows with “Direct Debit”.

If the dimension has a lot of SCD type 2 attributes (like 50 of them), it could be quite complex ETL logic to write:

For each customer:

  • If the new attribute doesn’t change value then update existing rows.
  • If the new attribute changes value then a) insert a new row for that date range, b) expire the old row, and c) update the Payment Type for existing rows.

The source table for Payment Type is like this:
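Roughly like this (again, the column names and the open-ended dates are assumptions for illustration):

Customer NK  Payment Type  Valid From  Valid To
Customer1    Quarterly     1/1/2011    14/8/2011
Customer1    Direct Debit  15/8/2011   31/12/9999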

To be performant, we should not write the code as a cursor looping over a million customers, but as SQL that handles all customers in one go. For large data sets, set-based operations are always faster than cursor-based ones. First we split the customers into 2 groups:

a) The customers for which the Payment Type changes value
b) The customers for which the Payment Type doesn't change value: it stays the same throughout

Step 1. For b), we perform a global update. Not customer by customer, looping in a cursor, but all rows in the table in one go ("update from").

Step 2. For a), we perform a global insert. First we identify the rows we need to expire. Before we expire them, we use their expiry dates to insert the new rows. All attributes are the same as in the old row, except the SK and the Payment Type.

Step 3. For a), we also perform a global update: we update the value of the Payment Type. When joining the customer dimension table to the Payment Type source table, we join on the NK column as well as on the valid_from/valid_to columns.

This is essentially what we are joining (left: target, right: source). All four of these columns need indexing, of course.

Remember that on the target (left) there are only 3 rows, not 4, because 1 row already has a Payment Type (we inserted it in Step 2). The "date join" works on the basis that the target date range sits entirely within the source date range, i.e. the join criteria are:
on target.valid_from >= source.valid_from and target.valid_to <= source.valid_to
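A minimal sketch of Step 3 as a set-based update, using the assumed table and column names from the sketches above (Oracle-style correlated update; in SQL Server the same thing is usually written as an update ... from join):

update dim_customer d
set    d.payment_type =
       ( select s.payment_type
         from   payment_type_source s
         where  s.customer_nk = d.customer_nk
         and    d.valid_from >= s.valid_from
         and    d.valid_to   <= s.valid_to )
where  d.payment_type is null;   -- skip the rows already populated in Step 2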

Vincent Rainardi, 18/9/2011

1 September 2011

Tomislav’s MDX Book

Filed under: Analysis Services,MDX — Vincent Rainardi @ 11:15 pm

I met Tomislav Piasevoli for the first time in Manchester in 2009, at the SQLBits conference. I attended his Advanced MDX session and told him afterwards that his MDX skills were among the highest in the world.

Last month his new MDX book was published, titled MDX with Microsoft SQL Server 2008 R2 Analysis Services: Cookbook. I am amazed by this book, both by how it is structured and by the quality of its content. It contains 80 problems, each described in a problem-solution-explanation format.

To me this format is ideal. This is how people should learn: it is always easier to learn by example than by theory. I think all IT forums should adopt this format, i.e. after 2 years of people asking and answering questions, the moderators should package up the knowledge in a problem-solution-explanation format. I'm sure such an FAQ section would answer 90% of the questions in that forum.

We are all very lucky that Tomislav has done this for MDX. I wish somebody would do the same for AS, IS, RS, SQL and all the other areas of DWBI. The "cookbook" format is spot on!

The book contains extremely useful material. You only need to look at the table of contents to see how useful it is. From basic, everyday problems to very advanced ones, they are all there: date calculations, formatting, navigation, DMVs, context-aware calculations. This is a book that provides a guiding hand to SSAS developers. I think every SSAS developer needs to have this book on their laptop, iPad or mobile phone*. It will help us solve most of the MDX problems in our work. Essentially, this book contains the code that you need to type into your cubes.

*Or on their desks, for those who still use the paper version. For this book the electronic version is better, because you can search for your problem and quickly find the solution. Very handy when you have to finish the cube by tomorrow!

I need to emphasise that we are very lucky Tomislav spent so much of his time writing this book. Not many people in the world have the required level of skill to do it. As I said at the beginning, Tomislav's MDX skill is among the highest in the world, along with people like Chris Webb, Greg Galloway, Deepak Puri, Darren Gosbell and Marco Russo, all of whom reviewed this book. If top people like them spend their precious time writing and reviewing it, those of us who work with MDX should read it! It is a real treasure.

This is my favourite MDX book. Thank you Tomislav for writing it.

Vincent Rainardi, 2/9/2011.

Link to the book’s page on publisher’s web site: http://link.packtpub.com/UUrJM6
