Data Warehousing and Data Science

21 June 2015

Varchar(255)

Filed under: Analysis Services — Vincent Rainardi @ 6:45 am

I’ve always wondered why people put 255 as the length of a varchar, not 200, 250 or 300. I know it comes from 2^8, but why use 2^8? One of the reasons people argue is "to match the binary nature of the data store", so we can save space / not waste space.

SQL Server 2005, 2008, 2012 and 2014 all store a varchar value as the length of the data entered + 2 bytes. So whether we create the column as varchar(255) or varchar(258), when the column contains "406 Middleton Street" (20 characters) it uses 22 bytes, and when it contains "10 High Street" (14 characters) it uses 16 bytes. It is that simple: bytes = length + 2.
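
A quick way to check this yourself (a minimal sketch; note that DATALENGTH returns the data bytes only, excluding the 2-byte length prefix):

declare @t table (a varchar(255), b varchar(258));
insert into @t values ('406 Middleton Street', '406 Middleton Street');
select datalength(a) as bytes_a, datalength(b) as bytes_b from @t;
-- both return 20; on disk each value takes 20 + 2 = 22 bytes, regardless of the declared max length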

Now that I’m taking on a role as a data architect, I feel obliged to know this. I need to know which is the right data type to use: varchar(200) or varchar(255)? And the answer to this question is very simple: use varchar(200). Or 100, or 300. Just go with what is required, and we don’t need to pay attention to 255 (or 511, or 127). These 2^n-1 numbers play no role in the storage of a SQL Server database. Just ignore them.

But where does this 255 convention come from? Yes, it is from history, but what is it? Yes, it is 2^8-1, but why? What was the historical reason? What happened in the past, and when, and where?

Proposal #1: because of UTF-8 (link)
No, that doesn’t make sense. The 8 in UTF-8 (link) refers to bits, not bytes.

Proposal #2: because in SQL Server 6.5 the maximum length for varchar was 255 (link, same page as above)
Bingo! This makes sense. And it is true (link). In SQL Server up to 6.5 the max length was 255, both for char and varchar.

#2 is why we don’t see people use the other 2^n-1 numbers like 127 and 511. It makes sense.

I know it is an old story, but it is good to remember that up to version 6.5, the SQL Server code was still Sybase (link), i.e. Microsoft SQL Server 1.0 = Sybase SQL Server 3.0. The brand was Microsoft but the code was Sybase. It was in Microsoft SQL Server 7.0 that this Sybase code base was replaced (link).

18 June 2015

Holding History – MDX For “From Date” and “To Date”

Filed under: Analysis Services — Vincent Rainardi @ 5:16 am

Summary
This article shows how to create these calculated measures:
1. How many securities were purchased between date1 and date2
2. How many securities were sold between date1 and date2
3. How many securities were “in holding” between date1 and date2

I will first write the business background, then create the tables and populate them (only 3 tables: FactHoldingHistory, DimDate and DimDateInput). Then I’ll create the SSAS BIDS solution, including the calculated measures, and deploy the cube to an AS server. And finally I will browse the cube using an Excel pivot table, demonstrating the 3 calculated measures above using some test data.
When explaining the business situation, I will try to make it as detailed as possible, so that we can also learn about the investment business, which is the second purpose of this article. The first purpose is to provide the MDX for the "From Date" and "To Date" technique, which is why I will also take time to explain the concept of this technique as clearly as possible.

Business Background

An investment bank, an asset management firm, a wealth management firm, or a pension fund needs to maintain its "holdings", i.e. a collection of securities which it purchases and maintains for investment or risk management. These securities could be bonds, equities, commodities, IRS, CDS, bond futures, FX forwards and cash. Each of these asset classes has a different function in the portfolio/fund.
Holding History is a table which contains the securities that were bought and sold across many years, i.e. the securities (e.g. bonds, equities) that the firm was "holding" at each particular moment. There are 2 common designs for holding history: one as a daily snapshot table, and the other as a transaction table.
Suppose you have a portfolio which consists of 50-60 equities and every month you make 2-4 trades. In the daily snapshot approach, every day all 50-60 securities are inserted into the Holding History table, so after one year the table holds roughly 55 rows x 365 days, about 20,000 rows. In the transaction approach, we maintain only 50-60 rows in the Holding History table, and we update these rows as trades happen.
This article is about the second approach (transactional). The Holding History table contains the following information:
1. What security was purchased/sold
2. In which account* (see below) the security was being held in
3. Which broker was used in purchasing/selling the security
4. The date the security was purchased
5. The date the security was sold (null if never sold)
6. How many days in holding (#5 minus #4, null if never sold, 0 on the day of purchase)
7. Quantity purchased
8. Purchase price (to 6 DPs, not just to cent/penny)
9. Purchase Value (this is usually gross, before purchase cost, e.g. broker charges)
10. Selling price (in 6 DPs)
11. Sale Amount (gross, before sales cost, e.g. broker charges)
12. Holding Count (1 if currently held, 0 if not currently held)

*An account could be a client mandate, an open fund, or a closed investment trust. A client mandate or a fund could be implemented as several accounts, for example as different asset classes to differentiate the fund charges, or to separate the "protection" part from the "growth" part.
If the same security is purchased for the second time, we create a second row for the second purchase.

The Holding History is per account. So if Account A needs 300 shares of Security1 and Account B needs 200 shares of Security1, and 500 shares are purchased and then allocated, we create 2 rows: 1 for Account A and 1 for Account B.
The Holding History table is different from the "Active Holding" table and from the "Transaction" table (aka Trade table). In a Transaction table we record the status of each trade, i.e. created, approved, executed, confirmed, allocated, settled, cancelled.

The Active Holding table contains only the securities currently held, and each security in an account has only 1 row. So if a security is purchased twice, it still has only 1 row. It has more detailed information (more columns), such as:
1. Quantity held
2. Date first purchased
3. Average age (this is to calculate the growth; it is weighted not on qty but on value, i.e. qty x price)
4. Cost (this is average cost across all purchases, gross)
5. Net cost (after purchase cost / broker’s fees)
6. Market price (this is yesterday’s closing market price)
7. Intraday price (this is today’s price)
8. Current value
9. Potential gain/loss
10. the usual holding information such as security ID, broker ID, account ID
11. Security Analytics* (see below) – at yesterday’s COB prices

The Active Holding table enables the portfolio manager to see, very quickly, what the portfolio consists of, including the gain/growth.

*Security Analytics are credit risk analytics which are dependent only on the security. The quantity we hold in the portfolio doesn’t affect these analytics. Examples are: yield to maturity, current yield, effective yield, yield to worst, modified duration, effective duration, spread duration, convexity, effective convexity, option adjusted spread. Security Analytics operates at security level, as opposed to at portfolio level.
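
For illustration only, here is a minimal DDL sketch of what an Active Holding table might look like (the table and column names are hypothetical, following the list above):

create table dbo.ActiveHolding
( AccountKey int,
  SecurityKey int,
  BrokerKey int,
  QtyHeld decimal(18,8),
  DateFirstPurchased date,
  AverageAgeDays decimal(9,2),  -- weighted not on qty but on value (qty x price)
  AvgGrossCost decimal(21,4),   -- average cost across all purchases, gross
  AvgNetCost decimal(21,4),     -- after purchase cost / broker's fees
  MarketPrice decimal(19,6),    -- yesterday's closing market price
  IntradayPrice decimal(19,6),  -- today's price
  CurrentValue decimal(21,4),
  PotentialGainLoss decimal(21,4),
  constraint PK_ActiveHolding primary key (AccountKey, SecurityKey) -- one row per security per account
);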

In this article we are going to create the Holding History table, not the Active Holding table.

DDL and Population


Let’s create the tables as follows:

create database Investment
go

use Investment
go

if object_id('dbo.FactHoldingHistory') is not null drop table dbo.FactHoldingHistory
create table dbo.FactHoldingHistory
( HoldingHistoryKey int,
AccountKey int,
SecurityKey int,
BrokerKey int,
DatePurchasedKey int,
DateSoldKey int,
DaysInHolding int,
QtyPurchased decimal(18,8),
PurchaseCost decimal(21,4),
SaleValue decimal(21,4),
AmountGained decimal(21,4),
HoldingCount int
)

insert into dbo.FactHoldingHistory
( HoldingHistoryKey, AccountKey, SecurityKey, BrokerKey, DatePurchasedKey, DateSoldKey, DaysInHolding,
QtyPurchased, PurchaseCost, SaleValue, AmountGained, HoldingCount)
values
(1, 3, 34, 2, 20140506, 20150107, 246, 349.321, 4000, 4300, 300, 1),
(2, 3, 96, 2, 20140709, 20150331, 265, 29.4156625, 2500, 2750, 250, 1),
(3, 3, 72, 2, 20140915, 20150219, 157, 81.0429, 3600, 3650, 50, 1)

select * from dbo.FactHoldingHistory

-- Create and Populate DimDate
create table DimDate
( DateKey int,
[Date] date
)

declare @d date
set @d = '1 Jan 2014'
while DateDiff(day, @d, '31 Dec 2015') >= 0
begin
insert into DimDate values (convert(varchar(8), @d, 112), @d)
set @d = DateAdd(day, 1, @d)
end

select * from DimDate
go

-- Create and Populate DimDateInput (for DimDateFrom and DimDateTo)
create table DimDateInput
( DateKey int,
[Date] date
)

declare @d date
set @d = '1 Jan 2014'
while DateDiff(day, @d, '31 Dec 2015') >= 0
begin
insert into DimDateInput values (convert(varchar(8), @d, 112), @d)
set @d = DateAdd(month, 1, @d)
end

select * from DimDateInput

Create SSAS Cube

Now let’s create the SSAS cube as follows:
DSV: [figure]

FactHoldingHistory named query:
SELECT HoldingHistoryKey, AccountKey, SecurityKey, BrokerKey, DatePurchasedKey, DateSoldKey, DaysInHolding, QtyPurchased, PurchaseCost, SaleValue, AmountGained, HoldingCount FROM FactHoldingHistory

Date From named query:
SELECT DateKey, [Date] AS [From Date] FROM DimDateInput

Date To named query:
SELECT DateKey, [Date] AS [To Date] FROM DimDateInput

Dimension Usage: [figure]

Cube Structure: [figure]

In the cube structure we create 2 "input dimensions", i.e. Date From and Date To. These 2 dimensions are for taking input from the users, so that users can specify, at run time, the date range they want to use.
The Date From and Date To dimensions are only populated with the first date of every month, i.e. 1st Jan 2014, 1st Feb 2014, etc. until 1st Dec 2015. This is so that the users are not overwhelmed with 730 days from the 2 years (to choose one, they would have to scroll down a lot).

The DimDate dimension, on the other hand, contains every date in the two-year period. This is so that Date Purchased and Date Sold can have any value. I limit it from 1st Jan 2014 to 31st Dec 2015, but in practice this would be something like a 15-year range (from 2005 to 2020).

In the Dimension Usage, the Date From and the Date To are not connected to any measure group. This way the users are free to choose any member from these Input Dimensions without affecting the measures (hence the name "Input Dimension").

And this is how we use the Input Dimensions in the Calculations tab:

Calculations:

CALCULATE;

CREATE MEMBER CURRENTCUBE.[Measures].[Holdings Purchased] AS
Sum
( { StrToMember
( "[Date Purchased].[Date].&" +
Right(MemberToStr([Date From].[From Date].CurrentMember), 21)
):
StrToMember
( "[Date Purchased].[Date].&" +
Right(MemberToStr([Date To].[To Date].CurrentMember), 21)
)
}, [Measures].[Holding Count]
), VISIBLE = 1;

CREATE MEMBER CURRENTCUBE.[Measures].[Holdings Sold] AS
Sum
( { StrToMember
( " [Date Sold].[Date].&" +
Right(MemberToStr([Date From].[From Date].CurrentMember), 21)
):
StrToMember
("[Date Sold].[Date].&" +
Right(MemberToStr([Date To].[To Date].CurrentMember), 21)
)
}, [Measures].[Holding Count]
), VISIBLE = 1;

CREATE MEMBER CURRENTCUBE.[Measures].[Number of Holdings] AS
Sum
( ( { NULL:StrToMember
( "[Date Purchased].[Date].&" +
Right(MemberToStr([Date From].[From Date].CurrentMember), 21)
)
},
{ StrToMember
( "[Date Sold].[Date].&" +
Right(MemberToStr([Date From].[From Date].CurrentMember), 21)
):NULL
}
), [Measures].[Holding Count]
), VISIBLE = 1;

Calculated Measure – Peel By Layer

Here’s the expression used by the Holdings Purchased calculated measure:

CREATE MEMBER CURRENTCUBE.[Measures].[Holdings Purchased] AS
Sum
( {StrToMember
("[Date Purchased].[Date].&" +
Right(MemberToStr([Date From].[From Date].CurrentMember), 21)
):
StrToMember
("[Date Purchased].[Date].&" +
Right(MemberToStr([Date To].[To Date].CurrentMember), 21)
)
}, [Measures].[Holding Count]
)

Let’s peel the above expression layer by layer.

From the outermost layer, it is simply: sum( {A:B}, M ),
where A:B is a range in the Date Purchased dimension, and M is the Holding Count.
So in essence it is something like "sum of Holding Count where the Purchased Date is between A and B".

And this A & B is From Date and To Date. So if the user chooses From Date = 1st May 2014 and To Date = 1st Nov 2014, it will become: sum of Holding Count where the Purchase Date is between 1st May 2014 and 1st Nov 2014.

Let’s dive a little bit deeper. The "A" in the "A:B" above is this expression:


StrToMember
("[Date Purchased].[Date].&" +
Right(MemberToStr([Date From].[From Date].CurrentMember), 21)
)

This works as follows:
1. Get the current member of the From Date
> [Date From].[From Date].&[2014-05-01T00:00:00]
2. Convert #1 to String
> [Date From].[From Date].&[2014-05-01T00:00:00]
3. Take 21 characters from the right of #2
> [2014-05-01T00:00:00]
4. Add “[Date Purchased].[Date].&” in the front of #3
> [Date Purchased].[Date].&[2014-05-01T00:00:00]
5. Convert #4 to Member
> [Date Purchased].[Date].&[2014-05-01T00:00:00]

In essence, it converts the 1st May 2014 member of the Date From dimension into the corresponding member of the Date Purchased dimension. Why do we need to do that? Because we want to constrain the Holding Count by the Date Purchased dimension, but we want to take the range of this constraint from the Date From dimension.
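
As an aside, assuming the From Date, To Date and Date Purchased attributes share the same key values, the same conversion can also be done with MDX’s LinkMember function instead of string manipulation. A sketch (the measure name here is hypothetical; it is not what the cube above uses):

CREATE MEMBER CURRENTCUBE.[Measures].[Holdings Purchased Alt] AS
Sum
( { LinkMember([Date From].[From Date].CurrentMember, [Date Purchased].[Date]) :
    LinkMember([Date To].[To Date].CurrentMember, [Date Purchased].[Date])
  }, [Measures].[Holding Count]
), VISIBLE = 1;

LinkMember matches members across hierarchies by key, so this only works if the two date dimensions are keyed the same way.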

And the B part of the “A:B” works the same way:


StrToMember
("[Date Purchased].[Date].&" +
Right(MemberToStr([Date To].[To Date].CurrentMember), 21)
)

Assuming that the user selects 1st Nov 2014 for the Date To dimension, the above expression becomes: [Date Purchased].[Date].&[2014-11-01T00:00:00]
So the A:B becomes:
[Date Purchased].[Date].&[2014-05-01T00:00:00] : [Date Purchased].[Date].&[2014-11-01T00:00:00]

So the whole thing becomes:
Sum({[Date Purchased].[Date].&[2014-05-01T00:00:00] : [Date Purchased].[Date].&[2014-11-01T00:00:00]}, [Measures].[Holding Count])
Which is roughly: sum of Holding Count where the Purchased Date is between 1st May 2014 and 1st Nov 2014.
Or, making it variable: sum of Holding Count where the Purchased Date is between From Date and To Date.

The same thing applies to the Holdings Sold calculated measure; it just uses Date Sold instead of Date Purchased, that’s the only difference.
So Holdings Sold is roughly something like this: sum of Holding Count where the Sold Date is between 1st May 2014 and 1st Nov 2014.
Or, making it variable: sum of Holding Count where the Sold Date is between From Date and To Date.

Number of Holdings Calculated Measure

Number of Holdings is roughly like this: sum of Holding Count where the From Date is between the Purchased Date and the Sold Date.
Number of Holdings on 1st May is the sum of Holding Count where the Purchased Date is before 1st May and the Sold Date is after 1st May.

So the MDX for the calculated measure is something like this:

Sum
( ( {NULL:A},{B:NULL}),
[Measures].[Holding Count]
)
Where A is the Purchased Date and B is the Sold Date.

Filling in From Date = 1st May 2014 (remember, this measure only uses the From Date), it becomes:
Sum
( ( {NULL:[Date Purchased].[Date].&[2014-05-01T00:00:00]},
{[Date Sold].[Date].&[2014-05-01T00:00:00]:NULL}
), [Measures].[Holding Count]
)

Number of holdings is for a particular date, for example 1st May 2014. That’s why we only use 1 input dimension (From Date). We don’t use To Date.

Browse The Cube

[Figure: Browse the Cube 1]

When we browse the cube using an Excel pivot table, the measures and dimensions look like the above. We put From Date and To Date in the Report Filter box, and put Holdings Purchased and Holdings Sold in the Values box. Then we set the From Date to 1st July 2014 and the To Date to 1st Feb 2015. The result is below left, i.e. Holdings Purchased = 2 and Holdings Sold = 1.

[Figure: Browse the Cube 2]

The Holdings Purchased = 2 because only the Purchase Dates 9th July 2014 (row 2) and 15th Sep 2014 (row 3) are between 1st July 2014 and 1st Feb 2015:
(1, 3, 34, 2, 20140506, 20150107, 246, 349.321, 4000, 4300, 300, 1),
(2, 3, 96, 2, 20140709, 20150331, 265, 29.4156625, 2500, 2750, 250, 1),
(3, 3, 72, 2, 20140915, 20150219, 157, 81.0429, 3600, 3650, 50, 1)
(these 3 lines are taken from the DDL & Population above)

The Holdings Sold = 1 because only the Sold Date 7th Jan 2015 (row 1) is between 1st July 2014 and 1st Feb 2015:
(1, 3, 34, 2, 20140506, 20150107, 246, 349.321, 4000, 4300, 300, 1),
(2, 3, 96, 2, 20140709, 20150331, 265, 29.4156625, 2500, 2750, 250, 1),
(3, 3, 72, 2, 20140915, 20150219, 157, 81.0429, 3600, 3650, 50, 1)

Number of Holdings

[Figure: Browse the Cube 3]

For the Number of Holdings, remove Holdings Purchased and Holdings Sold from the Values box and put the Number of Holdings measure into the Values box instead. Then remove the To Date from the Filter box, and set the From Date to 1st Aug 2014. The Number of Holdings = 2, as displayed in the above right.
The Number of Holdings = 2 because the first and second rows below have a Purchase Date before 1st Aug 2014 and a Sold Date after 1st Aug 2014:
(1, 3, 34, 2, 20140506, 20150107, 246, 349.321, 4000, 4300, 300, 1),
(2, 3, 96, 2, 20140709, 20150331, 265, 29.4156625, 2500, 2750, 250, 1),
(3, 3, 72, 2, 20140915, 20150219, 157, 81.0429, 3600, 3650, 50, 1)

Another Method

Another method is to use the MDX Filter function to filter the Holdings Purchased, like this:

CREATE MEMBER CURRENTCUBE.[Measures].[Holdings Purchased] AS
Sum
( FILTER
  ( [Date Purchased].[Date].[Date].Members,
    -- comparing member names as strings assumes the names sort chronologically
    [Date Purchased].[Date].CURRENTMEMBER.Name >= [Date From].[From Date].CURRENTMEMBER.Name
    AND [Date Purchased].[Date].CURRENTMEMBER.Name <= [Date To].[To Date].CURRENTMEMBER.Name
  ), [Measures].[Holding Count]
)

And for the Number of Holdings measure we need to use the Filter function twice with a cross join, like this:
Sum(FILTER(A) * FILTER(B), M)
where A is a filter on the Purchase Date and B is a filter on the Sold Date, and M is the Holding Count.
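
Spelled out, a sketch of this Filter-based Number of Holdings could look like the following (using the dimension and hierarchy names from the cube above; again, comparing member names as strings assumes the names sort chronologically):

CREATE MEMBER CURRENTCUBE.[Measures].[Number of Holdings Alt] AS
Sum
( Filter
  ( [Date Purchased].[Date].[Date].Members,
    [Date Purchased].[Date].CurrentMember.Name <= [Date From].[From Date].CurrentMember.Name
  )
  *
  Filter
  ( [Date Sold].[Date].[Date].Members,
    [Date Sold].[Date].CurrentMember.Name >= [Date From].[From Date].CurrentMember.Name
  ), [Measures].[Holding Count]
), VISIBLE = 1;

As with the StrToMember version, only the From Date input dimension is used.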

5 June 2015

Indexing Fact Tables in SQL Server

Filed under: Analysis Services — Vincent Rainardi @ 5:12 pm

This article applies only to SQL Server as it is using clustered and nonclustered indices. It doesn’t apply to Oracle or Teradata or DB2.

Apologies, I’m going straight to the matter, without explaining what a clustered index key is, what a dimension key column is, etc. For an introduction to indexing, here are Microsoft’s articles: Clustered and NonClustered Indexes Described, Using Clustered Indexes, Designing An Index.

A common way of indexing a fact table is to put a clustered index on the fact key column, then a nonclustered index on each of the dimension key columns. This allows each of the nonclustered indexes to be used efficiently, whilst ensuring that they are slim.

They are slim because in SQL Server, a nonclustered index uses the clustered index key as the row locator. The clustered index key, as I mentioned above, is the fact key which is a bigint.

They are used efficiently because each is built on a single dimension key, with that key as the first (and only) index key column, not as a second or third column. This ensures that they will be hit by queries using that dimension key, because it is the leading column in the index.
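
As a sketch of this common pattern (the table and column names here are hypothetical):

create table FactPortfolioValuation
( FactKey bigint identity(1,1),
  SnapshotDateKey int,
  PortfolioKey int,
  SecurityKey int,
  MarketValueBase decimal(21,4),
  constraint PK_FactPortfolioValuation primary key clustered (FactKey)
);

-- one slim, single-column nonclustered index per dimension key;
-- each uses the bigint clustered key (FactKey) as its row locator
create nonclustered index NI_SnapshotDateKey on FactPortfolioValuation (SnapshotDateKey);
create nonclustered index NI_PortfolioKey on FactPortfolioValuation (PortfolioKey);
create nonclustered index NI_SecurityKey on FactPortfolioValuation (SecurityKey);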

But there is a major weakness. Most fact tables are periodic snapshots, meaning that each query will hit the fact table on the snapshot date key column, either directly or via the date dimension. In the latter case the query specifies the snapshot date (2015-03-31), hits the date dimension, and then hits the fact table on the snapshot date key column (20150331).

Because periodic snapshot fact tables are primarily queried on the snapshot date key, it would be a lot better to put the clustered index on the snapshot date key column, not the fact key column. The primary key of the fact table should still be the fact key column, but the clustered index should be on the snapshot date key column.

A Primary Key doesn’t have to be a clustered index

It is a common misconception in the SQL Server world that a PK must be a clustered index. That is not so. In the create table statement, if we just do "create table X (column1 int primary key …" then SQL Server will implement the PK as a clustered index. But if we put the PK in a constraint clause, we can specify nonclustered.

Yes, SQL Server needs to create an index for a PK, to enforce that the PK is unique. But that index does not have to be a clustered index; it can very well be a nonclustered index. See the example below:

create table FactHoldings
( FactKey bigint,
  DateKey int,
  PortfolioKey int,
  SecurityKey int,
  CurrencyKey int,
  GrossValueBase decimal(21,4),
  constraint PK_FactKey primary key nonclustered (FactKey)
);

create clustered index CI_DateKey on FactHoldings (DateKey);
create nonclustered index NI_PortfolioKey on FactHoldings (PortfolioKey);
create nonclustered index NI_SecurityKey on FactHoldings (SecurityKey);
create nonclustered index NI_CurrencyKey on FactHoldings (CurrencyKey);

Yes, this will take more storage space for the indexes, because the row locator of NI_PortfolioKey, NI_SecurityKey and NI_CurrencyKey will be the DateKey column (plus a uniquifier, as the clustered index is not unique), and we have an extra nonclustered index for the PK. But it will be more performant on almost every fact table query, because the table is physically ordered on DateKey, so SQL Server will be able to locate the rows for the specified date quicker.
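
For example, a typical snapshot query like the sketch below (the query itself is hypothetical, against the FactHoldings table above) becomes a single seek plus a scan of one contiguous range of the clustered index:

-- all rows for one snapshot date are physically adjacent when clustered on DateKey
select PortfolioKey, sum(GrossValueBase) as GrossValue
from FactHoldings
where DateKey = 20150331
group by PortfolioKey;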

Which Dimension Key Column To Index

I don’t believe that we need to index every single dimension key column. Some dimensions are rarely used in the queries. We should only index the dimensions which are frequently used when that fact table is queried.

Multi-Column Clustered Index

Would it be a good idea to add an additional key column to the clustered index? I.e. in the example above, should we make DateKey + PortfolioKey the clustered index key? The argument for this is that almost every query uses PortfolioKey, so it would be quicker for the query to locate the rows in the fact table.

No, I don’t believe this is a good idea, because it will make the clustered index key much wider, which in turn increases the size of the row locator in every single nonclustered index, making all nonclustered indexes less performant.

Multi-Column NonClustered Index

Should we make the nonclustered indexes multi-column? I.e. in the example above, should we add SecurityKey to the PortfolioKey index?

Generally speaking this is not a good idea. Queries which use the security dimension without the portfolio dimension will not be able to seek on the index, because SecurityKey is the second index key column. It is better to keep each dimension key column in a separate nonclustered index.
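
To illustrate on the FactHoldings table above (a sketch of the composite index being discussed):

-- this index can only be seeked on when the query filters on PortfolioKey,
-- the leading column; a query filtering only on SecurityKey cannot seek on it
create nonclustered index NI_Portfolio_Security on FactHoldings (PortfolioKey, SecurityKey);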

Indexing Accumulative Snapshot Fact Tables

In an accumulative snapshot fact table, there is one (and only one) dimension key which is the Primary Key of the table. This is usually the main dimension in that industry sector. For retail, this is CustomerKey, for Pharma it is the ProductKey, and for Investment it is the SecurityKey.

It is best to put a clustered index on this main dimension key column.
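
A minimal sketch for the investment example (the table and column names are hypothetical):

create table FactSecurityLifecycle
( SecurityKey int not null,
  DateIssuedKey int,
  DateFirstPurchasedKey int,
  DateMaturedKey int,
  -- the PK and the clustered index coincide on the main dimension key,
  -- so queries on SecurityKey go straight to the physically ordered rows
  constraint PK_FactSecurityLifecycle primary key clustered (SecurityKey)
);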

Indexing Transaction Fact Tables

A transaction fact table doesn’t have a Snapshot Date Key column. Instead it has a Transaction Date Key column. The other columns are more or less the same, i.e. for Retail industry: ProductKey, StoreKey, and optionally CustomerKey. For Retail Banking it’s CustomerAccountKey, BranchKey, TransactionTypeKey and optionally ProductLineKey.

In my opinion, it is best to put a clustered index on the FactKey column, with a nonclustered index on each of the major dimension key columns (major meaning the ones which are heavily used, not the ones after the "optionally" above).

Partitioning the Fact Tables

Generally speaking, from a performance point of view, partitioning is better than indexing; from a maintenance point of view, it is much worse. The Sliding Window method of maintaining the partitions requires a lot of work. To overcome this, define a large number of static partitions up front, e.g. 10,000 daily partitions (1 partition for each day), covering more years than we will ever need. This way we don’t need to do any maintenance on any partition.
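
A minimal sketch of such static daily partitions (the names are hypothetical; in practice the boundary list would be generated programmatically for the whole horizon rather than typed by hand):

create partition function pfDailySnapshot (int)
as range right for values (20150101, 20150102, 20150103); -- ... one boundary per day

create partition scheme psDailySnapshot
as partition pfDailySnapshot all to ([PRIMARY]);

create table FactHoldingSnapshot
( FactKey bigint,
  SnapshotDateKey int,
  PortfolioKey int,
  GrossValueBase decimal(21,4)
) on psDailySnapshot (SnapshotDateKey);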

Partitioning gives a particularly big performance improvement for Periodic Snapshot fact tables (PS). For Accumulative Snapshot (AS) and Transaction fact tables the performance gain is not significant. This is because Transaction fact tables are not queried using one particular column, and an AS is (almost always) queried using the main dimension key, but the number of rows involved is not significant.

If a PS is partitioned on the Snapshot Date Key, what should the clustering key be? The Fact Key column should be the clustering key, and the other dimension key columns should be in nonclustered indexes.
