Data Warehousing and Data Science

17 July 2015

Painful-to-Retrieve Data Structures

Filed under: Analysis Services — Vincent Rainardi @ 6:11 pm

There are several data structures which are good from a storage point of view, but painful to read or retrieve data from. I believe that the best way to learn is by example, so in this article I'm going to give two examples. The first one is StartDate-EndDate, and the second one is Product-Attribute-Value.

StartDate-EndDate

This table stores the customer level using start date and end date columns:

StartDate-EndDate

The data is produced by a CRM system in an ecommerce company which evaluates each customer's shopping behaviour every day and updates the customer level accordingly. There are 10 levels, categorised based on the customer's likelihood to make a purchase. The shopping behaviour is evaluated based on the order of pages the customer visited on the website.

What we are trying to find out are:
1. On 30th March 2010, how many customers were on level 1, level 2 and level 3?
2. In the 3-month period ending 30th March 2010, how many customers moved from level 2 to either level 3 or level 1?
3. What kind of customers are in #2 above? (by joining to the customer table)

The purpose of the exercise is to understand the customer movement tendencies for each customer segment. A segment is a combination of region, age group, gender, product interest and occupation. Once the tendencies are found, we can make predictions about future behaviour.

The above questions are very difficult to answer using the above structure. The queries would be a lot easier if we restructured the table into a monthly snapshot:

Monthly Snapshot

Yes, we are losing the date granularity, but the business evaluates customer behaviour on a month-by-month basis, as of the end of the month. Using this structure, it is a lot simpler to answer the above 3 questions.
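As a sketch (the table and column names below are assumptions, since the example tables above are shown as images), question 1 against the StartDate-EndDate structure needs a between-dates predicate and careful handling of open-ended rows, whereas against the monthly snapshot it is a simple equality filter, and question 2 becomes a self-join of two snapshot months:

-- StartDate-EndDate structure: level in force on 30 Mar 2010
select CustomerLevel, count(distinct CustomerId) as NumberOfCustomers
from CustomerLevelHistory
where StartDate <= '2010-03-30'
  and (EndDate >= '2010-03-30' or EndDate is null)   -- null end date = still current
  and CustomerLevel in (1, 2, 3)
group by CustomerLevel;

-- Monthly snapshot structure: the same question is a simple filter on the snapshot month
select CustomerLevel, count(*) as NumberOfCustomers
from CustomerLevelMonthlySnapshot
where SnapshotDate = '2010-03-31'                    -- end-of-month snapshot
  and CustomerLevel in (1, 2, 3)
group by CustomerLevel;

-- Question 2: customers who moved from level 2 (end of Dec 2009) to level 1 or 3 (end of Mar 2010)
select count(*) as MovedCustomers
from CustomerLevelMonthlySnapshot dec09
join CustomerLevelMonthlySnapshot mar10 on mar10.CustomerId = dec09.CustomerId
where dec09.SnapshotDate = '2009-12-31' and dec09.CustomerLevel = 2
  and mar10.SnapshotDate = '2010-03-31' and mar10.CustomerLevel in (1, 3);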

This is one of the strengths of the Periodic Snapshot concept from Ralph Kimball.

Product-Attribute-Value

Product table

The above structure stores product attributes as columns. That is the original structure. The above example shows 6 attributes of a dress. In reality there are many more attributes for dresses, e.g. country of origin, wash instructions, ironing setting, sleeve style, etc. And there are a lot of products other than dresses which can have completely different attributes, e.g. shoes, handbags, jewellery, etc.

Because of the number of product attributes which need to be created every month (assume it is an in-house built EPOS, Electronic Point of Sale), the data architect decided to convert it into a Product-Attribute-Value structure (PAV) as follows:

Product-Attribute-Value

For this table to work, we will need to group the attribute names and define the attribute values in the Group-Attribute-Value table (GAV) as follows.

Group-Attribute-Value

If we take it one step further, we can give each row in the GAV table an ID and use this ID in the PAV table. This structure is very flexible: using it, the application can add a new attribute on the fly (on the maintenance screen). It is a brilliant structure from the EPOS application's point of view, but it is difficult to use for reporting. If we replace the attribute values with their IDs, we end up joining PAV to GAV many times in the same query.
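As an illustration of the reporting pain (table and column names here are assumptions, not the actual schema), getting one row per product back out of the PAV table means pivoting, or joining it once per attribute; if the values are replaced with GAV IDs, each attribute also needs its own join to GAV to resolve the value:

-- ProductAttributeValue (ProductId, AttributeName, AttributeValue) pivoted back to one row per product
select p.ProductId,
       max(case when pav.AttributeName = 'Colour' then pav.AttributeValue end) as Colour,
       max(case when pav.AttributeName = 'Size'   then pav.AttributeValue end) as Size,
       max(case when pav.AttributeName = 'Fabric' then pav.AttributeValue end) as Fabric
from Product p
join ProductAttributeValue pav on pav.ProductId = p.ProductId
group by p.ProductId;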

Still, given the choice, a data architect should choose the PAV structure because the flexibility outweighs the disadvantages. Because the EPOS application displays one product at a time, there is no performance issue.

Similar to the PAV structure which is used to store attribute values, the Product-Measure-Value structure (PMV) is used to store measure/numerical values. For example:

Product-Measure-Value

We can see above that there are 3 different types of measures:

  1. Weight, Length, Height, Depth, Volume, Number of doors, and Number of drawers are product characteristics (aka product attributes or product properties). Yes, product attributes can be numerical, not just textual. Product characteristics don't change over the life span of the product.
  2. Average Margin, Average (Sale) Price, Age (days), and Average Cost are calculated daily and stored against a date, so this table is not the right home for them. But a similar concept applies; we just need to add a date column.
  3. The third type is actually not a property of the product itself, but of how the stock is managed. Supply Ratio, Minimum Order Quantity, and Order Trigger Level tell the purchasing department when and how much to order.

Regardless of the 3 types, the MeasureValue column is of a single type, e.g. numeric(20,4), even though some of the measures are integer, some decimal and some float! This is the limitation of this design, but the advantages far outweigh the disadvantages. The structure is so flexible that it enables the Sales or EPOS application to create measures on the fly (on the maintenance screen).
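A minimal DDL sketch of the PMV table (hypothetical names), showing the single numeric(20,4) column that every measure shares:

-- Product-Measure-Value: all measures, whatever their natural type, land in one numeric column
create table ProductMeasureValue
( ProductId    int           not null
, MeasureName  varchar(50)   not null   -- e.g. 'Weight', 'Minimum Order Quantity', 'Supply Ratio'
, MeasureValue numeric(20,4) not null   -- integers, decimals and floats are all stored here
, constraint PK_ProductMeasureValue primary key (ProductId, MeasureName)
);
-- For the daily-calculated measures (type 2 above) we would add a date column and include it in the PK.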

Again this Product-Measure-Value structure is not reporting-friendly, but to support a data entry application such as EPOS it works brilliantly.

This is not just for the retail industry, by the way. It is also applicable to the banking, insurance, manufacturing and health care industries. Instead of a ProductId column, we can have CustomerId, PolicyId, PositionId, and PatientId, for example.

12 July 2015

Physical Data Modelling (PDM)

Filed under: Analysis Services — Vincent Rainardi @ 8:00 am

When designing a physical table, a data architect will face a number of typical issues. The purpose of this short article is to list those issues and give my opinion on them, which hopefully will be useful for others.

I will go through 6 aspects of physical data modelling: tables, PKs, columns, indexing, views and maintenance. Unlike LDM, PDM is something that is done in practice all the time but rarely discussed. If we Google "data modelling", we'll find a lot of LDM but only a little PDM. One reason is that Data Architects tend to leave PDM to the DBA. Another reason is that the Data Architect is not an expert in that particular RDBMS, e.g. they might understand insurance and data modelling, but not USI/UPI in Teradata, tablespaces in Oracle, or partitioned indexes in SQL Server.

If you have just started in the company (3 months or less) and are not familiar with its data and databases, then you need to query those databases and tables one by one and really familiarise yourself with the data. There is a huge difference between a Data Architect who knows the data structure inside out and one who doesn't.

Last point before we start: when we design an enhancement to an existing database, it is usually more difficult than a green-field design. We are not free to apply best practice within that DB; instead we have to adhere to the existing conventions. If the current convention is not best practice, or is not efficient, tough luck, we can't change it. Most of the time we will have to "play along" with the existing conventions used in that DB even though they are not good. But be practical: if it doesn't make sense, don't do it.

 

WHAT IS PHYSICAL DATA MODELLING?

Logical Data Modelling is about designing a storage structure inside which the data will be stored (persisted). We do this by understanding the business process, understanding the data to be stored, and understanding how the stored data will be used (retrieved). Once we understand these 3 things, we can design the data store. The formal deliverables are an ERD (Entity Relationship Diagram), which specifies the entities, the identifier and attributes of each entity, and the relationships between entities, and a data mapping document, which specifies where each column will be populated from.

Physical Data Modelling is about creating the DDL (Data Definition Language), which contains the SQL to create the schemas, tables and indices, including constraints, default values, and partitioning functions. A PDM is not complete if we don't specify the estimated table sizes and index sizes.

 
CREATING TABLES

1. Table names

We need to adhere to the convention for table names within that database, or probably within the company. Some companies use camel case (SupplierInvoice), some use underscores (Supplier_Invoice). Find out whether the company you work for has a data governance group and database standards (search the intranet for "database standard"). If it does, then you will need to adhere to those standards.

2. Table space

Or file group in SQL Server. We should place the table in a file group which is not likely to be accessed when a related table is being accessed; that is good for performance. A common practical consideration is to place the table in the file group/table space which has the most free space. This approach is not right: if the table space we want to put the table in has not got enough space, then we need to expand it.

3. Partition or not

Should we partition the table? If it is a reference table, most probably not. If it is a transaction table but not periodic, most probably not. But if it is a transaction table and it is periodic (a date-based table) and the volume is large (i.e. > 5k rows per day) then we should partition it. If we do partition it, the simplest way is using static/constant partitions, rather than sliding-window.

4. Do we need to create a new table?

Suppose this new table will be used to store prices. We have to check if there is any other table in this database containing prices. If there is, should we merge into it? Check the granularity, and check whether the new data is different from or the same as the existing data (check the values).

Take your time when doing this because it is very important. If you are new to the company, take a really good look at the database, even if it takes 2 days to really understand the structure, even if it takes a week. If you create the wrong structure (because you are not familiar with the DB), you will ruin the whole project. Data structure is the foundation of any data warehousing or reporting project.

5. Schema

In SQL Server, it is good practice to use the schema name, e.g. dbo.Table1 instead of just Table1, because it is more performant: without the schema name, SQL Server first has to resolve the default schema and then the table, whereas with the schema name it resolves the object directly. This is important when writing DDL, whether it is for creating a table, adding a column to an existing table, or creating/altering a view or stored procedure.

6. Singular or Plural

Should the table name be CustomerOrders or CustomerOrder? OrderLines or OrderLine? ClaimTransactions or ClaimTransaction? ProductTypes or ProductType? I think we should use singular, because it is simpler, less annoying and more consistent. If we use plural for one table, then we should use plural for all tables. But in reality this never happens (and it would be very annoying if it did); people only use plural on some tables and not the rest, which is inconsistent. So we should simply name them CustomerOrder, OrderLine, ClaimTransaction and ProductType.

 
PRIMARY KEY

1. Having or not having a PK

Every reference table (aka master table) must have a PK. A reference table is a table which contains master data or reference data, for example: customer, product, invoice, security, flight. For a non-periodic transaction table it is usually good to have a PK, but not always. If there is a transaction identifier in the source, we should make it the PK. If the source doesn't have a transaction identifier, but it has a combination of columns which uniquely identifies a row (see point 3 below), consider creating an identity column and making it the PK.

A periodic transaction table (i.e. date-based, such as an account balance table in Retail Banking, a position table in Asset Management, etc.): there isn't normally a PK in the source, so we would have to create it in the target DB, usually using an identity column. A PK is optional in this case, i.e. we may have to allow duplicate rows because that is the nature of the data. If the source has a transaction ID column, then make the PK a combination of that column and the date.

2. Doesn’t have to be clustered index (SQL Server)

In SQL Server, we don't have to make the PK the clustered index. For a reference table, making the PK the clustered index is the best choice. But for a transaction table, it might not be: the clustered index might be more useful on the business date, so that the table is physically ordered on date, because every query specifies the date.

3. Which column as a PK? (doesn’t have to be the unique ID)

In a transaction table, we do not have to take the unique row identifier as the PK. In retail, a transaction row can be identified using a combination of 4 columns: time of transaction, store, product, and till number. Together, these 4 columns are the unique row identifier of the retail transaction table. Do we have to take them as the PK? No, we don't. We have the option of creating a new identity column and making it the PK.

4. No natural row identifier

Most reference/master tables have a natural row identifier: country code, customer ID, product SKU, store ID, etc. Some reference tables don't have one, for example a status table with the values Active and Inactive. In this case we create the ID ourselves, e.g. A and I, call it Status Code, and make it the PK.

Some transaction tables don't have a natural row identifier. For example, not all rows in a retail transaction table consist of products; some rows are discounts, promotional items, etc. A transaction may consist of 3 lines, 2 of which are products and 1 of which is a discount line. In this case there is no unique row identifier and the table may not naturally have a PK.

5. Identity or Manual Increment

It is a common approach on the SQL Server and Sybase platforms to use an identity column as the PK (especially for reference tables). In Oracle (up to Oracle 12) this is done manually using a combination of a Sequence and a Trigger, see Brent Ozar's article here. Oracle 12c release 1 has identity columns, finally! (link). An identity column simplifies PK maintenance by automatically incrementing the value by 1 (or 2, or any other increment).

It is also common to manually maintain the PK values, i.e. max + 1. This is very common in Oracle and Informatica. The advantage of manual maintenance is that we can block any other code from accessing the table (typically done using an application lock in SQL Server) while a new row is being created with a new PK, and more importantly, while an existing PK is being modified.
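A sketch of both approaches in SQL Server (table and column names are illustrative). The manual approach is serialised with an application lock so that two sessions cannot compute the same max + 1:

-- Approach 1: identity column maintains the PK automatically
create table Customer
( CustomerId   int identity(1,1) not null
, CustomerName varchar(100)      not null
, constraint PK_Customer primary key (CustomerId)
);

-- Approach 2: manual max + 1 (here CustomerId would be a plain int, not an identity)
begin tran;
exec sp_getapplock @Resource = 'Customer_PK', @LockMode = 'Exclusive';
declare @NewId int = (select isnull(max(CustomerId), 0) + 1 from Customer);
insert into Customer (CustomerId, CustomerName) values (@NewId, 'New Customer');
commit;   -- the application lock is released when the transaction commits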

6. Name the PK

We often fall into the trap of creating the primary key inline, using the Primary Key keyword on the column definition when creating the table, like this:

create table T (ID int not null Primary Key)

Not only does this result in the PK being given a system-generated name rather than a proper one, but we also can't specify multiple columns. It is better to use the Constraint keyword to create the PK, like this:

 create table T (ID int not null, Constraint PKName Primary Key (ID))

The usual convention for PK name is PK_TableName. Alternatively PK_TableName_ColumnName.
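For example, a named multi-column PK (illustrative table) following the PK_TableName convention:

create table OrderLine
( OrderId    int not null
, LineNumber int not null
, ProductId  int not null
, Quantity   int not null
, constraint PK_OrderLine primary key (OrderId, LineNumber)
);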

7. PK Name: ID or TableNameID

In some companies, the convention is to name the primary key as “ID”. But in other companies, the convention is to name the primary key as “TableNameID”, e.g. CustomerID, ProductID, CampaignID. The latter is better, because it is clearer. We know what ID column it is, just by its name. The second reason is: some tools, such as SSAS and data modelling tools, have a facility to “join on the same column name”.

8. Not Null

A primary key must not be nullable. This is the main difference between a unique clustered index and a clustered PK, and between a unique nonclustered index and a nonclustered PK: a PK must be not null, whereas a unique index can allow null. Both of them are row identifiers.

9. Bigint or int, varchar or decimal

Int in SQL Server (Integer in Oracle) can go up to about 2.1 billion, whereas bigint (LongInteger in Oracle) can go up to about 9 billion billion (18 zeros). There is no issue for reference tables here; use int for the PK. For a transaction table with more than 5k rows per day, use bigint for the PK. For a small reference table (less than 1000 rows and more or less constant) it is appropriate to use smallint for the PK. For a very small reference table, like a status table with only 5 rows, use tinyint for the PK.

It is of course totally acceptable to have a varchar PK. An identifier can be a number or a string, and it is natural to have a string column as the unique identifier of a table, e.g. Store Code, Product Code, ISIN, etc. I don't recommend using decimal/numeric (Number in Oracle) or float as a primary key, because it makes things complicated. The usual reason for using decimal instead of integer is so that we can insert 11.5 in between 11 and 12 (when the PK is manually maintained), in order to keep it sequential.

10. Primary Key or Unique Index?

Which one is more performant, a PK or a unique index? I'm talking about nonclustered versions of both here, and about transaction tables, not reference tables. If you have a transaction table with a clustered index on Order Date, and you have a row identifier called Transaction ID, should you make this Transaction ID a nonclustered PK or a nonclustered unique index? Or neither?

This could be a long discussion and I should write it up as a separate article, so I'll try to be brief here. The short answer is: the performance is the same, because if the column is not nullable they are practically the same thing. The problem with a table without a PK is that we are unable to uniquely identify a row, hence there is no way to check whether that row already exists. But in a transaction table, such a PK is not a natural row identifier; it is an artificial one, e.g. just a sequentially increasing number, meaning we can't use it to check whether a row already exists. So it is pointless having it as a PK. Again, this is for a transaction table, not a reference table.

 
COLUMNS

1. Naming convention

It's not just about camel case versus underscores (CountryDescription or country_description), but also: abbreviations (CountryDesc), punctuation ([S&P Rating] or SAndPRating, [Moody's Rating] or MoodysRating, [Sales%] or SalesPct), and whether to keep the column names the same as the source (making them different from the source can be confusing for developers).

It is a good idea to name the flag columns starting with "Is…" consistently, so that people can tell it is a flag column just by the name. System columns, such as created and updated time and user, should be named consistently, e.g. CreatedTime, UpdatedTime, CreatedBy, UpdatedBy. Don't mix and match between "updated" and "modified", between "time", "datetime" and "date", or between "by", "user" and "UserID".

Consistency is the holy grail here. If you use "ID" in one place, don't use "Id" (lower case d) in another place. If you shorten Description to "Desc", do it in all places, not just some. Generally speaking, I don't like spaces and punctuation in column names, because we then have to use [] when specifying them, but this is old school. In the modern view, we should put clarity above convenience, so [Sales%] not SalesPct. But don't put spaces in every column name, e.g. [Country Description] – that would be very annoying (not just an inconvenience).

2. Data dictionary

Yes, you will need to document each column in the data dictionary, i.e. what data it stores, what it will be used for, where it is populated/sourced from, and what its data type is. SQL Server has extended properties at column level which can store this information, which can later be collected programmatically into a web-based data dictionary. The data types come from the information schema of course, not from the extended properties.
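A sketch of how this can be done in SQL Server (the table, column and description text are illustrative): sp_addextendedproperty stores the description against the column, and sys.extended_properties joined to the information schema gives the raw material for a web-based data dictionary.

-- Store the column description as an extended property
exec sp_addextendedproperty
     @name = N'Description'
   , @value = N'Customer level (1-10) based on shopping behaviour, sourced from the CRM feed'
   , @level0type = N'SCHEMA', @level0name = N'dbo'
   , @level1type = N'TABLE',  @level1name = N'Customer'
   , @level2type = N'COLUMN', @level2name = N'CustomerLevel';

-- Collect descriptions and data types programmatically
select c.TABLE_NAME, c.COLUMN_NAME, c.DATA_TYPE, ep.value as ColumnDescription
from INFORMATION_SCHEMA.COLUMNS c
left join sys.extended_properties ep
       on ep.class = 1   -- object/column level
      and ep.major_id = object_id(c.TABLE_SCHEMA + '.' + c.TABLE_NAME)
      and ep.minor_id = columnproperty(object_id(c.TABLE_SCHEMA + '.' + c.TABLE_NAME), c.COLUMN_NAME, 'ColumnId')
      and ep.name = 'Description';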

3. Business knowledge

To be able to specify the correct columns on the table, we need domain knowledge. For example, in investment banking: should Coupon be in the Security table or in the Position table? Which attributes are at instrument level and which are at position level? What is the difference between spread duration and spread duration contribution? Are they in % or in bps? We can pass the responsibility to the BA, who should do the first pass, but we need to understand it as well. I don't think we can call ourselves data architects if we don't have the domain knowledge.

4. Types of columns

There are 2 types of columns in a reference table: key and attributes. There are 4 types of columns in a transaction table: key columns (the ones which define the grain), transaction-level attributes, lower-level attributes and measures. For example, in a retail transaction table, the key columns are time of transaction, store ID, product ID, and till number. The transaction-level attributes are transaction type, order ID, payment type, cashier ID. The product-level attributes are product type, product weight, special offer (they are stored in the transaction table because there are rows which don't have a product ID but still have a product type).

5. Storing in 2 places

It is a basic principle of normalised database design that data should only be stored in one place. But sometimes I come across the same data stored in 2 places. For example, in the Product table there is an attribute (column) called Size, but there is also a table called Product Specification, a 3-column table which stores all properties of every product (product ID, property name, value), including Size. This is not right; one of them should be removed. If we maintain the same data in 2 places, it is confusing, it takes extra effort to keep them in sync, and we end up with dual versions of the truth.

6. Enforcing FK for RI

Not all FKs need to be enforced with RI. We don't have to create physical foreign keys for every ID column which refers to another table. It is a trade-off, so we need to keep it in balance. If a transaction table has 15 ID columns and we put RI on all of them, inserts and updates on the transaction table will be slow. But more crucially, any deletion on a referenced table will be significantly slower, because it has to scan this big transaction table to check whether the referenced row is used. Put RI on the IDs pointing to the main entities only, and leave it to the ETL to check the rest.

If it is a data entry system, insist that all inserts into the transaction table from the front end/GUI are done via a stored procedure, not direct inserts. This is so that we can perform lookups into the reference tables. Inserts into reference tables don't need to be procedurised; the GUI can insert into them directly.
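A sketch under assumed table names: enforce RI only on the main entity IDs, and route front-end inserts through a procedure that validates the minor IDs by lookup.

-- RI on the main entities only
alter table SalesTransaction
  add constraint FK_SalesTransaction_Customer foreign key (CustomerId) references Customer (CustomerId);
alter table SalesTransaction
  add constraint FK_SalesTransaction_Product foreign key (ProductId) references Product (ProductId);

-- Front-end inserts go through a procedure so the remaining IDs can be checked by lookup
create procedure dbo.InsertSalesTransaction
  @CustomerId int, @ProductId int, @PaymentTypeId int, @Amount decimal(19,4)
as
begin
  if not exists (select 1 from PaymentType where PaymentTypeId = @PaymentTypeId)
  begin
    raiserror('Unknown payment type', 16, 1);
    return;
  end;
  insert into SalesTransaction (CustomerId, ProductId, PaymentTypeId, Amount)
  values (@CustomerId, @ProductId, @PaymentTypeId, @Amount);
end;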

7. Order of columns

Put the key columns first, then FK columns, then measures, then current-level attributes, then lower-level attributes, then flags, then system columns. The key and FK columns are specified first so that at a glance we can see how this table relates to other tables, and we can understand what indices or partitioning might potentially be created.

The best practice is to insist that all inserts specify column names, and that no select uses "*" for the column list. This way the sequence of columns doesn't matter. But still, it is better to order the columns logically.

For periodic transaction tables (date-based tables), the periodic date column should be the first column. If the table is partitioned, the partitioning key should be the first column. System columns, such as created and updated timestamp and created and updated user, should be located at the end of the table so we can identify them easily.

It is very annoying when, once the order is set, people add new columns at the end of the table the following year, effectively destroying the order. We need to prevent this by putting a change control process around any structural change to any production database, with a database review activity/task inside it.

8. Not Null

Key columns should be made not null. System columns too. Unlike RI (FK), not null doesn’t have much impact on insert/delete/update performance.

9. Default

The system columns such as created/modified date should have a default of getdate(), and created/modified by should have a default of SYSTEM_USER.
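For example (illustrative table), in SQL Server:

create table dbo.Claim
( ClaimId     bigint        not null
, ClaimAmount decimal(19,4) not null
, CreatedTime datetime2     not null constraint DF_Claim_CreatedTime default (getdate())
, CreatedBy   sysname       not null constraint DF_Claim_CreatedBy   default (system_user)
, UpdatedTime datetime2     not null constraint DF_Claim_UpdatedTime default (getdate())
, UpdatedBy   sysname       not null constraint DF_Claim_UpdatedBy   default (system_user)
, constraint PK_Claim primary key (ClaimId)
);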

10. ID or Id

I prefer Id rather than ID, e.g. OrderId, ClaimId, TradeId, ProductId, CustomerId rather than OrderID, ClaimID, TradeID, ProductID, CustomerID. This is because the convention is camel case; why break your own convention? Because ID is an acronym? No, it isn't. An acronym is formed from the first character of every word in a phrase. For example, CDS is the acronym of Credit Default Swap and FRN is the acronym of Floating Rate Note, but ID is not an acronym of Identity; it is just a shortened word.

11. Punctuations

Some column names contain punctuation such as & (ampersand) and ' (apostrophe). For example: [S&PRatingDate], [S&P Rating Date], SAndPRatingDate or SandPRatingDate? [D&B Customer Name], DAndBCustomerName or DandBCustomerName? [Weight%], [Weight %], WeightPct or WeightPercentage? [Moody's Rating], [Moody'sRating], MoodysRating or MoodyRating? It depends on the company standards. My preference is to avoid having to use [] when querying, because I find it quite annoying being told by the RDBMS, over and over, that it is an invalid column name when I forget the []. So I would name them SAndPRatingDate, DAndBCustomerName, WeightPercentage, and MoodysRating.

12. Abbreviations

Should we shorten column names by abbreviating a few words? For example: ProportionOfClaimWrittenOff as PropClaimWrtnOff, GrossPremiumWritten as GPW, GrossPremiumEarned as GPE, NumberOfClaims as NoOfClaims, CountryOfIncorporation as CountryOfInc, CountryOfDomicile as CountryOfDom, DeliveryDate as DelDate, PercentageOfProductDelivered as PctOfPrdDelvd? I find that abbreviations are annoying because they make things unclear and confusing. We should not abbreviate column names; the disadvantages outweigh the advantage (which is brevity).

 
DATA TYPE

1. Same as the source

The best practice is to set the data type in the target database the same as the data type in the source. This way we can make sure there is no loss of precision or truncation. For example, if the source data is decimal(20,8) we should not use decimal(25,4) because we will lose 4 digits of decimal places. If the source data is UTF-16 containing European characters then we should be using nvarchar rather than varchar.

2. Date or DateTime

An exception to this rule is datetime: if the source data is in SQL Server, its data type is datetime, and we know it is only used to store the date without the time, e.g. "2015-07-14 00:00:00", then we should set the target data type to Date. Not only because Microsoft recommends not using datetime for new work (link), but also because it is quite annoying that every time we query we get the "00:00:00" portion, which is not required.

3. Float or Decimal

If the source data is Float then we need to use Float, because if we use decimal we may lose precision or range. Float can represent values with exponents up to 308 (link), whereas decimal is limited to 38 digits of precision (link). Float values with more than 17 significant digits, when converted to decimal, can lose digits or even be converted to 0.

A multiplication of two decimals can result in an Arithmetic Overflow error, whereas if we use Float we don't have this issue.
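A quick illustration (the numbers are arbitrary): the decimal multiplication below needs more than 38 digits and fails with an arithmetic overflow error, whereas the float version simply returns 1E+40.

-- Decimal: result precision would be 38 + 38 + 1 digits, which does not fit into 38
select cast(1e20 as decimal(38,10)) * cast(1e20 as decimal(38,10));   -- arithmetic overflow error

-- Float: no overflow, the result is 1E+40
select cast(1e20 as float) * cast(1e20 as float);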

4. Money or decimal

The Money data type only has 4 decimal places. For this reason it is better to use the decimal data type to store monetary values, particularly those with many decimal places, such as base-currency amounts after FX rate conversions. FX rates are stored with a minimum of 6 decimal places, but more usually 8, and sometimes 12.

5. Flags: Char(1) or Bit

For flags such as IsActive, IsCallable, etc., I prefer to use char(1) rather than bit, for the sake of clarity. Bit uses 0 and 1, which is more ambiguous to people than Y and N. The storage argument is not significant because the difference is only 7 bits (1 bit vs 1 byte).

6. Bigint or int

Int only goes up to 2.1 billion, so for the PK of a periodic transaction table we had better use bigint. For the FKs of a (periodic) transaction table, we should use int, as the referenced IDs are almost always below 2.1 billion.

7. UDT

Sometimes I see people creating custom data types (UDT = User Defined Type). I don't see the point of doing this. What I mean here is creating data types called MyInt, MyVarchar20, MyDateTime, MyDecimal21-6. Not only is this potentially slower than the native SQL data types (I have not tested it, but I think it could be), it is also more confusing, e.g. what is MyInt? We should simply use Int.

 

INDEXING & PARTITIONING

  1. For reference tables (customer, product, security, country, currency, etc.), set the clustered index and the PK to CustomerId, ProductId, SecurityId, CountryId, CurrencyId, etc.

  2. For transaction tables (order, delivery, purchase, registration, trade, claim, premium, etc.), set the clustered index to OrderDate, DeliveryDate, PurchaseDate, RegistrationDate, TradeDate, ClaimDate, PremiumDate, etc. This is more useful than setting the clustered index to the transaction ID, i.e. OrderId, DeliveryId, PurchaseId, RegistrationId, TradeId, ClaimId, PremiumId. Why? Because most queries specify the date rather than the transaction ID.

  3. There are cases for point #2 above where the transaction ID is a better clustered index.
    The first case is when there are 5 or more nonclustered indices and those indices are used more often than the date. For example, a Claim table is indexed on Claim Date, Customer, Business Line, Policy Number and Sales Rep, and users query the Claim table by Customer, Policy Number or Sales Rep (in combination with the Claim Date) more often than by the Claim Date alone. In this case, make the Claim ID the clustered index (a single integer column as the indexing key); all the other nonclustered indices will benefit from this slim indexing key because their row locator is the clustered index key (link).

    The second case is when the transaction ID is often used to join to another table. For example, consider the Order table and the Delivery table, which are often joined on the Order ID column. In this case it is worth considering making OrderId the clustered index of the Order table. But if users query the Order table by Order Date more often than they join the Order table to the Delivery table, then we should make Order Date the clustered index.

  4. For periodic/snapshot transaction tables, use the periodic date column as the cluster index. There are no ifs or buts in this case.

  5. For transaction and periodic transaction tables, we should index the IDs of the main reference tables, e.g. CustomerId, ProductId, SecurityId. These IDs are usually among the first 5 columns. We should make these indices nonclustered.

  6. Remember that if the table is partitioned, the partitioning key should be part of every index key (in practice it is often the leading column). A transaction table (be it snapshot or not) is mostly partitioned by date. So for the Order table, the partitioning key should be OrderDate, and all indices should include OrderDate.

  7. When specifying the partition scheme, use as many file groups as you possibly can. Work with the DBA to create the file groups. We should put the file groups on different disks and name them accordingly, i.e. not FG1, FG2, FG3, etc. but FG1A, FG1B, FG2A, FG2B, etc., with 1 and 2 as the disk numbers.
  8. It is much better to use a static, permanent partitioning function than the "sliding window partition maintenance" technique. Simply divide the date range into months, quarters or years (see the sketch after this list). For example:
    – Partition 1 is Jan 2010, partition 2 is Feb 2010, and so on.
    – Partition 1 is Jan and Feb 2010, partition 2 is Mar and Apr 2010, and so on
    – Partition 1 is 2010 Q1, partition 2 is 2010 Q2, and so on
    – Partition 1 is 2010, partition 2 is 2011, and so on.
    – Partition 1 is 2010 Jan to June, Partition 2 is 2010 July to Dec, and so on.

  9. Even when the table is partitioned, we still need to put the date column in the clustered index. This is because a partition contains a whole month's worth of dates (or 2 months, or 1 quarter, or even 1 year). By specifying the date column as the clustered index, the table is physically ordered by the date column within each partition, so a query can quickly locate a particular date.
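A sketch of a static, quarterly partition function and scheme for an Order table (file group and table names are illustrative), with the date column as the partitioning key and the leading clustered index column:

-- Static partition boundaries: 2010 Q1 to 2011 Q1 (extend the list as new years are added)
create partition function PF_OrderDate (date)
as range right for values ('2010-01-01', '2010-04-01', '2010-07-01', '2010-10-01', '2011-01-01');

-- 5 boundary values give 6 partitions, so the scheme needs 6 file groups
create partition scheme PS_OrderDate
as partition PF_OrderDate to (FG1A, FG1B, FG2A, FG2B, FG3A, FG3B);

create table dbo.OrderHeader
( OrderDate   date          not null
, OrderId     bigint        not null
, CustomerId  int           not null
, OrderAmount decimal(19,4) not null
) on PS_OrderDate (OrderDate);

-- Clustered index leads on the partitioning key (the date column)
create clustered index IX_OrderHeader_OrderDate
on dbo.OrderHeader (OrderDate, OrderId) on PS_OrderDate (OrderDate);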

6 July 2015

Securitising Cash Positions

Filed under: Business Knowledge — Vincent Rainardi @ 7:17 am

It is an age-old issue in the asset management industry that not all positions consist of securities. Some are cash or cash-like positions, e.g. settled and unsettled cash balances, FX forward/swap positions, IRS positions, and transaction proceeds. One of the solutions is to securitise these positions into instruments.

So we'll have instruments called USD Settled Cash, EUR Unsettled Cash, Buy USD Sell GBP 01/04/15, and Pay 3% GBP Receive LIBOR3m+1% JPY. And here is where the issue lies, outlined in the next 3 paragraphs. When does the cash become settled? It depends on when the settlement message from the broker is processed. Do we create a separate instrument for repo cash (repurchase agreements)? Do we create a separate instrument for collateral cash (margin requirements)?

An FX forward has 2 legs. In the above example, the Buy USD date is today/spot (say 26th March 2015) and the Sell GBP date is in the future (1st April 2015). Do we create 2 instruments, one for each leg, or 1 instrument?

An IRS (Interest Rate Swap) can be float-for-fixed or float-for-float. It can be in the same currency or different currencies. To close out the exposure (but not the accounting values), an IRS usually has a contra: if in the IRS we pay fixed, in the contra we pay float. So how do we securitise this? Do we create the contra as a separate instrument? An IRS has 2 legs (usually a fixed leg and a float leg, but both could be float) – do we create a separate instrument for each leg? Do we create a separate instrument for each of the rates? Do we create a separate instrument for each of the fixing dates?

Attributes of cash-like instruments

What is the country of risk of “USD Settled Cash” instrument? United States of course. What is its country of domicile? N/A. What is its currency? It’s obvious, USD. What’s the issuer? Also obvious, US Government.

Now try for Pay 3% GBP Receive LIBOR3m+1% JPY. What is the country of risk? Umm… UK? Japan? Well, the risk is on the fixed leg, so the country of risk is UK. What is the issuer? Hmm…. blank? What is the currency?

The most common attributes of an instrument are: country, currency, issuer, asset class/type, asset subclass/subtype, sector/industry/subsector, rating, (effective) maturity date, maturity bucket, and coupon frequency. All of these need to be populated. Oh, and the ID fields, e.g. Ticker, Sedol, ISIN, Cusip, most of which will be blank for cash or FX lines. The description, however, is crucial: it is used for looking up whether the instrument already exists. So it is very important to have consistency, e.g. "USD Settled Cash" or "USD Cash Balance"? "EUR Unsettled Cash" or "EUR Cash (Unsettled)"? "GBP Collateral Cash" or "Collateral Cash GBP"?

Analytics for cash-like instruments

Analytics are measures which are dependent on price, time to maturity, and interest rates. The most common analytics for fixed income are yield to maturity, current yield, modified duration, spread, spread duration, option adjusted spread, z-spread, convexity, gamma. Some of these will be zero for cash, but some of them have values (like yield for example).

These analytics will need to be recalculated if we combine several positions into one instrument. Some of them are not simply additive, e.g. they need to be weighted by % contribution when summing up. Some of them don't work with a weighted sum at all.
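For example (a sketch with assumed table and column names), the modified duration of a combined cash-like line can be weighted by each position's market value contribution, whereas some measures cannot be combined with a weighted sum at all and have to be recalculated:

-- Market-value-weighted modified duration of the combined line
select sum(p.ModifiedDuration * p.MarketValue) / nullif(sum(p.MarketValue), 0) as CombinedModifiedDuration
from Position p
where p.PortfolioId = 123                -- example portfolio
  and p.InstrumentType = 'Cash';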

The other solution: not securitising them

The other option is not to securitise cash and FX positions, so they become positions without instruments. If we take this route, we will need to store all the security attributes in the holding table.
