Data Platform and Data Science

11 August 2010

Cube Translation in Excel 2007

Filed under: Analysis Services — Vincent Rainardi @ 9:52 pm

In Strategy Companion Analyzer, there is a "Show/Hide Language Panel" icon on the toolbar, i.e. the one with "abc" and Chinese characters below it:

If we click it, it shows the Language Panel:

Once we choose a language, the dimension names, attribute names and measure names will be displayed using the cube translation for that language:

For comparison, this is the Translations tab in BIDS:

But in Excel 2007 we don't have a language selection icon. So how do we change the language?

The answer is not in Regional and Language Options:

It's not in Microsoft Office Language Settings (which can be accessed via Excel Options) either:

It's in the data source (.odc) file:

Open Adventure Works.odc in Notepad and you'll see the connection string:

<odc:ConnectionString>Provider=MSOLAP.4;Integrated Security=SSPI;Persist Security Info=True;Data Source=localhost;Initial Catalog=Adventure Works DW 2008</odc:ConnectionString>

Then add "Locale Identifier = 1036" to it, like this:

<odc:ConnectionString>Provider=MSOLAP.4;Integrated Security=SSPI;Persist Security Info=True;Data Source=localhost;Initial Catalog=Adventure Works DW 2008; Locale Identifier = 1036</odc:ConnectionString>

For other languages: 1031 = German, 1033 = English, 1034 = Spanish, 1036 = French, 1041 = Japanese. For a complete list of locale identifiers see MS TechNet. Note that it is the Locale ID you need, not the Language ID.

Save the .odc file and open Excel 2007 again: Data menu, Existing Connections, choose Adventure Works, click Open:

And now we see the dimension names, attribute names and measure names in French:

Vincent Rainardi, 11/8/2010, vrainardi@gmail.com, www.datawarehouse.org.uk

Layering in Insurance

Filed under: Business Knowledge — Vincent Rainardi @ 4:21 pm

One of the most difficult things in data warehousing is actually not the technicalities, but understanding the business. There is no escaping the fact that we have to understand the business in order to model the data warehouse correctly. For example, I came across a case the other day where somebody asked me about the concept of layering in insurance, specifically in the Lloyd's market. He needed to understand the layering concept in order to model the data warehouse correctly. He searched the web and found the following:

Layering: “The building of a program of insurance coverage using the excess of loss approach. Layered programs involve a series of insurers writing coverage, each one in excess of lower limits written by other insurers. Umbrella liability coverage is frequently structured in this manner, whereby a number of umbrella insurers write coverage at various levels, on an excess of loss basis, ultimately providing an insured with a high total limit of coverage”

After reading that definition he still didn't quite understand it, so he asked me to explain. I explained the following, which I'm sharing with you in this post.

In the reinsurance industry nobody wants to cover the whole thing. If a large US retail firm, XYZ, looks for coverage against fire, tornado, wind storm, hurricane and flood for all of its 3,500 stores in the US, the broker or managing agent will probably arrange it as follows:

Syndicate1 $100m x $0
Syndicate2 $400m x $100m
Syndicate3 $500m x $500m
Syndicate4 $800m x $1b

"x" means excess. For example: you take out house insurance with Churchill for £120k x £250, meaning that if the house burns down and it costs £120k to rebuild, Churchill will give you £119,750 and you pay the £250 yourself. That's the excess.

In the above case Syndicate1 is taking $100 million excess of zero, meaning no excess.
Syndicate2 is covering $400m in excess of $100m, i.e. losses from $100m up to $500m.
Syndicate3 is covering $500m in excess of $500m, i.e. losses from $500m up to $1b.
Syndicate4 is covering $800m in excess of $1b, i.e. losses from $1b up to $1.8b.

So if the XYZ store in Buffalo burns down and the loss estimate is $5m, Syndicate1 pays the whole $5m.
But if a disaster like Hurricane Ike hits and XYZ loses 70 stores, costing them $150m, then Syndicate1 pays $100m and Syndicate2 pays $50m.
And when a catastrophe like Katrina hits and XYZ loses 200 stores, costing them $700m, Syndicate1 pays $100m, Syndicate2 pays $400m and Syndicate3 pays $200m.
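
If it helps to see this allocation in data terms, here is a minimal T-SQL sketch of the same arithmetic. The table and column names are made up purely for illustration, not taken from any particular warehouse model:

-- Hypothetical layer table: one row per syndicate/layer
declare @layer table
( syndicate varchar(20), limit_amount money, excess_amount money );

insert into @layer values
('Syndicate1', 100000000, 0),            -- $100m x $0
('Syndicate2', 400000000, 100000000),    -- $400m x $100m
('Syndicate3', 500000000, 500000000),    -- $500m x $500m
('Syndicate4', 800000000, 1000000000);   -- $800m x $1b

declare @loss money = 700000000;          -- e.g. the $700m catastrophe scenario above

-- Each layer pays the part of the loss that falls between its
-- attachment point (the excess) and its exhaustion point (excess + limit)
select syndicate,
       case when @loss <= excess_amount then 0
            when @loss >= excess_amount + limit_amount then limit_amount
            else @loss - excess_amount
       end as amount_paid
from @layer;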

Obviously the premiums are different. Syndicate1 bears the biggest risk, because the chance of a small event occurring is much bigger than that of a big event. Statistically speaking, that is, and this is where the actuarial calculations and matrices come in. Hence Syndicate1 gets the biggest premium.

Syndicate4 will get the smallest premium (percentage-wise), but it is very profitable business, because it is very likely that the year will pass without a single event hitting the $1 billion mark, which means Syndicate4's loss ratio will be very healthy.

Reinsurance companies like high-layer business because of its profitability. Unfortunately, the capital adequacy standards from the FSA (and, from next year, Solvency II) require them to hold a large enough amount of capital to be able to write a good amount of that profitable business.

As I said at the beginning of this post, one of the most difficult things in data warehousing is the business knowledge, and insurance is one of the most difficult industries there is. As usual I welcome any corrections, discussions and questions at vrainardi@gmail.com.

Using Pivot for 2 Value Columns

Filed under: SQL Server — Vincent Rainardi @ 3:08 pm

Say you need to pivot this table:

into this:

I usually use the old "case when" approach for this, but it being 2010, I'd prefer to use the pivot statement. We can use pivot for 1 value column, but unfortunately we can't use it for 2 value columns. Meaning, in the above source table, if we had sales amount only, or profit amount only, we could use pivot.
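
For reference, a source table matching the queries in this post could look something like this. This is just a minimal sketch: the table and column names come from the queries below, and the sample rows are made up:

create table source_table
( product       varchar(50),
  trans_type    varchar(10),   -- 'type1' or 'type2'
  sales_amount  money,
  profit_amount money );

insert into source_table values
('Product A', 'type1', 100, 20),
('Product A', 'type2', 200, 50),
('Product B', 'type1', 300, 60),
('Product B', 'type2', 400, 90);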

Using “case when” it’s like this:

select product,
sum(case when trans_type = 'type1' then sales_amount end) as sales_type1,
sum(case when trans_type = 'type2' then sales_amount end) as sales_type2,
sum(case when trans_type = 'type1' then profit_amount end) as profit_type1,
sum(case when trans_type = 'type2' then profit_amount end) as profit_type2
from source_table
group by product

Using pivot it’s like this:

select * from source_table
pivot
( sum(sales_amount)
for trans_type in (type1, type2)
) as p

But how about the profit_amount column? It needs to be pivoted for type1 and type2 too. As I said, you can’t put 2 measures in pivot, like this:

select * from source_table
pivot
( sum(sales_amount)
for trans_type in (type1, type2)
) as p1
pivot
( sum(profit_amount)
for trans_type in (type1, type2)
) as p2

We can't specify type1 twice like that, because it would mean having 2 columns with the same name, [type1]. When we run the above pivot statement, SQL Server gives these 4 errors:

Msg 207, Level 16, State 1, Line 11
Invalid column name 'trans_type'.
Msg 265, Level 16, State 1, Line 11
The column name "type1" specified in the PIVOT operator conflicts with the existing column name in the PIVOT argument.
Msg 265, Level 16, State 1, Line 11
The column name "type2" specified in the PIVOT operator conflicts with the existing column name in the PIVOT argument.
Msg 8156, Level 16, State 1, Line 12
The column 'type1' was specified multiple times for 'p2'.

SQL Server 2008 Books Online gives this syntax for the FROM clause:

[ FROM { <table_source> } [ ,...n ] ]
<table_source> ::=
{ …
| <pivoted_table>

}

<pivoted_table> ::=
table_source
PIVOT <pivot_clause>
table_alias

<pivot_clause> ::=
( aggregate_function ( value_column [ [ , ]…n ])
FOR pivot_column
IN ( <column_list> )
)

<column_list> ::=
column_name [ ,…n]

Notice that in the value_column part there is "[ , ]…n", which suggests that we can specify more than 1 value column. So let's try it:

select * from source_table
pivot
( sum(sales_amount, profit_amount)
for trans_type in (type1, type2)
) as p

But SQL Server gives this error message:

Msg 8144, Level 16, State 2, Line 1
Procedure or function sum has too many arguments specified.

Aaarrggghhh…

So what’s the solution?

We need to provide 2 trans type columns with different values, like this:

That is done by selecting the trans type column twice, adding a "p" prefix the second time:

select product, trans_type as trans_type1, 'p' + trans_type as trans_type2, sales_amount, profit_amount
from source_table

Using the above as a derived table, we pivot it twice, like this:

select * from
( select product, trans_type as trans_type1, 'p' + trans_type as trans_type2,
sales_amount, profit_amount
from source_table
) s
pivot
( sum(sales_amount)
for trans_type1 in (type1, type2)
) as p1
pivot
( sum(profit_amount)
for trans_type2 in (ptype1, ptype2)
) as p2

Then we group them on product, like this:

select product, sum(type1) as sales_type1, sum(type2) as sales_type2,
sum(ptype1) as profit_type1, sum(ptype2) as profit_type2
from
( select product, trans_type as trans_type1, 'p' + trans_type as trans_type2,
sales_amount, profit_amount
from source_table
) s
pivot
( sum(sales_amount)
for trans_type1 in (type1, type2)
) as p1
pivot
( sum(profit_amount)
for trans_type2 in (ptype1, ptype2)
) as p2
group by product

So we can have 2 value columns: we just pivot the table twice. The key things are:
a) the column for the second pivot must be different from the one for the first pivot (e.g. trans_type1 and trans_type2), otherwise SQL Server will complain about duplicate columns, and
b) the values in the 2nd column must be different from those in the 1st column, e.g. 'type1' and 'ptype1'.

If we have a lot of value columns to be pivoted, then it's worth writing a bit of dynamic SQL, for example along the lines of the sketch below.
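
This is only a minimal sketch, and it uses the "case when" form because that is the easiest to generate. It reads the trans_type values from the source table; the two value columns (sales_amount and profit_amount) are hard-coded into the string here, but they could equally be read from sys.columns:

declare @sql nvarchar(max), @cols nvarchar(max) = N'';

-- Build one "sum(case when ...)" column per trans_type value, per value column
select @cols = @cols
    + N', sum(case when trans_type = ''' + trans_type + N''' then sales_amount end) as sales_' + trans_type
    + N', sum(case when trans_type = ''' + trans_type + N''' then profit_amount end) as profit_' + trans_type
from (select distinct trans_type from source_table) t;

set @sql = N'select product' + @cols + N' from source_table group by product';

exec sp_executesql @sql;

If the trans_type values could contain characters that are not valid in a column name, wrap the generated aliases in quotename().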

Hope this helps. As usual I welcome any discussions, corrections and questions at vrainardi@gmail.com.

10 August 2010

General performance considerations for “insert select”

Filed under: Analysis Services — Vincent Rainardi @ 3:26 pm

“Insert select” is an “old school” way of populating a target table:

insert into target_table (col1, col2, ..., colN)
select col1, col2, ..., colN from source_table

"insert select" works not only in SQL Server but also in most RDBMSs, including Oracle, DB2, Postgres, Sybase, Ingres, MySQL, Teradata and Netezza.

Performance can be very good (in comparison to the "workflow" approach), particularly if you have a very powerful DB engine (think Exadata, PDW, Netezza or Teradata).

The main disadvantage is obviously that both source and target tables need to be on the same server, unless you create a linked server.
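
With a linked server the same pattern still works, via four-part naming. A minimal sketch (SRCSERVER, SourceDB and the table names are made-up placeholders; the linked server must already exist, e.g. created with sp_addlinkedserver):

-- Pull rows from a table on a linked server into a local target table
insert into target_table (col1, col2)
select col1, col2
from SRCSERVER.SourceDB.dbo.source_table;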

Source tables:

  1. Index on join columns *
  2. Partition on join column *

Target table:

  1. Empty it (truncate)
  2. No index on target table *
  3. No FK on target table

Database:

  1. Simple recovery mode *
  2. Enough usable space on the data files (don't let the files grow while the insert is happening) *
  3. Put file groups on separate spindles

Server:

  1. If joining a large table with another large table: have large memory (64 or 128 GB)
  2. Put file groups on RAID 10 disks instead of RAID 5
  3. Use 2008 instead of 2005 (insert select can be minimally logged in 2008)

* means significant impact

I know that not everything is always possible, e.g. sometimes we can't empty the target table. But the above list might give you some ideas about what can be done to increase your loading speed, say from 1 hour to 10 minutes. I've seen a 10-fold increase from implementing some of the starred items above.
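
As an illustration, here is a minimal sketch of a few of the starred items for a SQL Server 2008 load. The database, table and index names are made up; adjust them to your own environment:

-- Simple recovery mode (starred item under "Database")
alter database StagingDB set recovery simple;

-- Empty the target table and drop its index before the load
-- (starred items under "Target table")
truncate table dbo.target_table;
drop index ix_target_col1 on dbo.target_table;

-- The load itself; the tablock hint allows minimal logging in 2008
insert into dbo.target_table with (tablock) (col1, col2)
select s.col1, s.col2
from dbo.source_table s;

-- Recreate the index after the load
create index ix_target_col1 on dbo.target_table (col1);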

As usual, I welcome discussions and questions at vrainardi@gmail.com or via comments

Addition, 3/9/2010: in 2008, insert select can be minimally logged (in 2005 we can't do this). This is done by simply adding the tablock query hint, i.e. "insert into table1 with (tablock) select col1 from table2". Sorry I forgot to mention this in the original post; for details please see "Minimally Logged Insert" in my "SQL Server 2008 Data Warehousing Features" presentation.
