Data Warehousing and Business Intelligence

12 July 2015

Physical Data Modelling (PDM)

Filed under: Analysis Services — Vincent Rainardi @ 8:00 am

When designing a physical table, a data architect faces a number of recurring issues. The purpose of this short article is to list those issues and give my opinion on them, which I hope will be useful for others.

I will go through 6 aspects of physical data modelling: table, PK, column, indexing, views, maintenance. Unlike LDM, PDM is something that is often discussed in theory, but rarely done in practice. If we Google “data modelling”, we’ll find a lot of LDM, but only a little PDM. One reason is that Data Architects tend to leave PDM to the DBA. Another reason is that the Data Architect is not an expert in that particular RDBMS, e.g. they might understand insurance and data modelling, but not USI/UPI in Teradata, table spaces in Oracle, or partitioned indexes in SQL Server.

If you have just started at a company (3 months or less) and are not familiar with its data and databases, then you need to query those databases and tables one by one, and really familiarise yourself with the data. It makes a huge difference whether a Data Architect knows the data structure inside out or not.

One last point before we start: designing an enhancement to an existing database is usually more difficult than working on a green field. We are not free to apply best practice within that DB; instead we have to adhere to the existing convention. If the current convention is not best practice, or is not efficient, tough luck, we can’t change it. Most of the time we will have to “play along” with the existing convention in that DB even though it’s not good. But be practical: if it doesn’t make sense, don’t do it.

 

WHAT IS PHYSICAL DATA MODELLING?

Logical Data Modelling is about designing a storage structure inside which the data will be stored (persisted). We do this by understanding the business process, understanding the data to be stored, and understanding how the stored data will be used (retrieved). Once we understand these 3 things, we can design the data store. The formal procedure is to write an ERD (Entity Relationship Diagram), which specifies the entities, the identifier and attributes of each entity, and the relationships between entities, and a data mapping document, which specifies where each column will be populated from.

Physical Data Modelling is about creating the DDL (Data Definition Language), which contains the SQL to create the schemas, tables and indices, including constraints, default values, and partitioning functions. A PDM is not complete if we don’t specify the estimated table and index sizes.

 
CREATING TABLES

1. Table names

We need to adhere to the convention on table names within that database, or probably within the company. Some companies use camel case (SupplierInvoice), some use underscores (Supplier_Invoice). Find out whether the company you work for has a data governance group and database standards (search the intranet for “database standard”). If it does, then you will need to adhere to those standards.

2. Table space

Or file group in SQL Server. We need to place the table in a file group which is not likely to be accessed when a related table is being accessed. That is good for performance. A common practical shortcut is to place the table in whichever file group/table space has the most free space, but that approach is not right. If the table space we want to put the table in has not got enough space, then we need to expand it.

3. Partition or not

Should we partition the table? If it is a reference table, most probably not. If it is a transaction table but not periodic, most probably not. But if it is a transaction table, it is periodic (a date-based table) and the volume is large (e.g. > 5,000 rows per day), then we should partition it. If we do partition it, the simplest way is to use static/constant partitions, rather than a sliding window.

4. Do we need to create a new table?

Suppose this new table will be used to store prices. We have to check whether any other table in this database contains prices. If there is one, should you merge into it? Check the granularity, and check whether the new data is different from or the same as the existing data (check the values).

Take your time when doing this because it is very important. If you are new to the company, take a really good look at the database. Even if it takes 2 days to really understand the structure. Even if it takes a week. Because if you create something with the wrong structure (because you are not familiar with the DB), you will ruin the whole project. Data structure is the foundation of any data warehousing or reporting project.

5. Schema

In SQL Server, it is good practice to use the schema name, e.g. dbo.Table1 instead of just Table1, because it performs slightly better. Without the schema name, name resolution takes 2 steps (SQL Server checks the user’s default schema first, then falls back to dbo); with the schema name, it is resolved in one. This applies whenever we write DDL, whether it is creating a table, adding a column to an existing table, or creating/altering a view or a stored procedure.

6. Singular or Plural

Should the table name be CustomerOrders or CustomerOrder? OrderLines or OrderLine? ClaimTransactions or ClaimTransaction? ProductTypes or ProductType? I think we should use the singular, because it is simpler, less annoying and more consistent. If we use the plural for one table, then we should use the plural for all tables. But in reality this never happens (and it would be very annoying if it did). People use the plural for some tables and not the rest, which is inconsistent. So we should simply name them CustomerOrder, OrderLine, ClaimTransaction and ProductType.

 
PRIMARY KEY

1. Having or not having a PK

Every reference table (aka master table) must have a PK. A reference table is a table which contains master data or reference data. Examples of reference data are: customer, product, invoice, security, flight. For a non-periodic transaction table it is usually good to have a PK, but not always. If there is a transaction identifier in the source, we should make it the PK. If the source doesn’t have a transaction identifier, but it has a combination of columns which uniquely identifies a row (see point 3 below), consider creating an identity column and making that the PK.

A periodic transaction table (i.e. date-based, such as an account balance table in Retail Banking, or a position table in Asset Management): there isn’t normally a PK in the source, so we would have to create it in the target DB, usually using an identity column. A PK is optional in this case, i.e. we may have to allow duplicate rows because that is the nature of the data. If the source has a transaction ID column, then make a PK from the combination of that column and the date.

2. Doesn’t have to be clustered index (SQL Server)

In SQL Server, we don’t have to make the PK the clustered index. For a reference table, making the PK the clustered index is the best choice. But for a transaction table, it might not be. The clustered index might be more useful on the business date, so that the table is physically ordered by date, because every query specifies the date.
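As a sketch, a nonclustered PK combined with a clustered index on the business date looks like this (the table and column names are illustrative, not from a real system):

```sql
-- Sketch: nonclustered PK on the identity column, clustered index
-- on the business date (names are illustrative).
create table dbo.Trade
(
    TradeId    bigint identity(1,1) not null,
    TradeDate  date not null,
    SecurityId int not null,
    Quantity   decimal(19,4) not null,
    constraint PK_Trade primary key nonclustered (TradeId)
)

-- The clustered index physically orders the table by date.
create clustered index IX_Trade_TradeDate on dbo.Trade (TradeDate)
```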

3. Which column as a PK? (doesn’t have to be the unique ID)

In a transaction table, we do not have to take the unique row identifier as the PK. In retail, a transaction row can be identified using a combination of 4 columns: time of transaction, store, product, and till number. Together, these 4 columns are the unique row identifier of the retail transaction table. Do we have to take them as the PK? No, we don’t. We have the option of creating a new identity column and making that the PK.

4. No natural row identifier

Most reference/master tables have a natural row identifier: country code, customer ID, product SKU, store ID, etc. Some reference tables don’t have one, for example a status table with the values Active and Inactive. In this case we create the ID ourselves, e.g. A and I, call it Status Code, and make it the PK.

Some transaction tables don’t have a natural row identifier either. For example, not all rows in a retail transaction table are products; some rows are discounts, promotional items, etc. A transaction may consist of 3 lines, 2 of which are products and 1 of which is a discount line. In this case there is no unique row identifier and the table may not naturally have a PK.

5. Identity or Manual Increment

It is a common approach on SQL Server and Sybase platforms to use an identity column as the PK (especially for reference tables). In Oracle (up to Oracle 12) this is done manually using a combination of a Sequence and a Trigger, see Brent Ozar’s article here. Oracle 12c release 1 has identity columns, finally! (link). An identity column simplifies PK maintenance by automatically incrementing the value by 1 (or 2, or any other number).

It is also common to maintain the PK values manually, i.e. max + 1. This is very common in Oracle and Informatica. The advantage of manual maintenance is that we can block any other code from accessing the table (typically done using an app lock in SQL Server) while a new row is being created with a new PK and, more importantly, while an existing PK is being modified.

6. Name the PK

We often fall into the trap of creating the primary key with the Primary Key keyword on the first column when creating the table, like this:

create table T (ID int not null Primary Key)

Not only does this result in the PK not being named properly, but we also can’t specify multiple columns. It is better to use the Constraint keyword to create the PK, like this:

create table T (ID int not null, Constraint PKName Primary Key (ID))

The usual convention for the PK name is PK_TableName, or alternatively PK_TableName_ColumnName.
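The Constraint keyword also covers the multi-column case; a sketch, with illustrative table and column names:

```sql
-- Sketch: a named composite PK via the Constraint keyword
-- (table and column names are illustrative).
create table dbo.AccountBalance
(
    BalanceDate date not null,
    AccountId   int  not null,
    Balance     decimal(19,4) not null,
    constraint PK_AccountBalance primary key (BalanceDate, AccountId)
)
```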

7. PK Name: ID or TableNameID

In some companies, the convention is to name the primary key column “ID”. In others, the convention is to name it “TableNameID”, e.g. CustomerID, ProductID, CampaignID. The latter is better, because it is clearer: we know which ID column it is just from its name. The second reason is that some tools, such as SSAS and data modelling tools, have a facility to “join on the same column name”, which only works when the PK and FK columns share the same name.

8. Not Null

A primary key must be not nullable. This is the main difference between a unique clustered index and a clustered PK, and between a unique nonclustered index and a nonclustered PK: a PK must be not null, whereas a unique index can be null. Both of them are row identifiers.

9. Bigint or int, varchar or decimal

Int in SQL Server (Integer in Oracle) can go up to about 2 billion, whereas bigint can go up to about 9 billion billion (18 zeros). There is no issue for reference tables here: use int for the PK. For a transaction table with more than 5,000 rows per day, use bigint for the PK. For a small reference table (less than 1,000 rows and more or less constant) it is appropriate to use smallint for the PK. For a very small reference table, like a status table with only 5 rows, use tinyint.

It is of course totally acceptable to have a varchar PK. An identifier can be a number or a string, and it is natural to have a string column as the unique identifier of a table, e.g. Store Code, Product Code, ISIN, etc. I don’t recommend using decimal/numeric (Number in Oracle) or float as a primary key, because it makes things complicated. The usual reason for using decimal instead of integer is so that we can insert 11.5 in between 11 and 12 (when the PK is manually maintained), in order to keep it sequential.

10. Primary Key or Unique Index?

Which is more performant, a PK or a unique index? I’m talking about nonclustered versions of both here, and about transaction tables, not reference tables. If you have a transaction table with a clustered index on Order Date, and you have a row identifier called Transaction ID, should you make this Transaction ID a nonclustered PK, a nonclustered unique index, or neither?

This could be a long discussion and I should write it up as a separate article, so I’ll keep it short here. The short answer is: the performance is the same, because if the column is not nullable they are practically the same thing. The problem with a table without a PK is that we are unable to uniquely identify a row, hence there is no way to check whether that row already exists. But in a transaction table, the PK is not a natural row identifier; it is an artificial one, e.g. just a sequentially increasing number. That means we can’t use it to check whether a row already exists, so it is pointless having it as a PK. Again, this is in a transaction table, not in a reference table.

 
COLUMNS

1. Naming convention

It’s not just about camel case versus underscores (CountryDescription or country_description), but also: abbreviations (CountryDesc), punctuation ([S&P Rating] or SAndPRating, [Moody’s Rating] or MoodysRating, [Sales%] or SalesPct), and whether to keep the column names the same as the source (making them different from the source can be confusing for the developers).

It is a good idea to name flag columns starting with “Is…” consistently, so that people can tell it is a flag column just from the name. System columns, such as created and updated time and user, should be named consistently, e.g. CreatedTime, UpdatedTime, CreatedBy, UpdatedBy. Don’t mix and match between “updated” and “modified”; between “time”, “datetime” and “date”; or between “by”, “user” and “UserID”.

Consistency is the holy grail here. If you use “ID” in one place, don’t use “Id” (lower case d) in another. If you shorten Description to “Desc”, do it in all places, not just some. Generally speaking, I don’t like spaces and punctuation in column names, because we then have to use [] when specifying them, but this is old school. In the modern view, we should put clarity above convenience, so [Sales%] rather than SalesPct. But don’t put spaces in every column name, e.g. [Country Description] – that would be very annoying (not just inconvenient)!

2. Data dictionary

Yes, you will need to document each column in the data dictionary: what data it stores, what it will be used for, where it is populated/sourced from, and what its data type is. SQL Server has extended properties at column level which can store this information, which later on can be collected programmatically into a web-based data dictionary. The data types come from the information schema of course, not from the extended properties.
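A minimal sketch of this, assuming an illustrative dbo.Premium table with a NetPremium column (the table, column and property names are my own, not from a real system):

```sql
-- Sketch: store a column description as an extended property
-- (table, column and property names are illustrative).
exec sys.sp_addextendedproperty
    @name = N'Description',
    @value = N'Net premium written, in policy currency',
    @level0type = N'SCHEMA', @level0name = N'dbo',
    @level1type = N'TABLE',  @level1name = N'Premium',
    @level2type = N'COLUMN', @level2name = N'NetPremium'

-- Read it back, with the data type taken from the information schema:
select c.COLUMN_NAME, c.DATA_TYPE, ep.value as ColumnDescription
from INFORMATION_SCHEMA.COLUMNS c
left join sys.extended_properties ep
    on  ep.major_id = object_id(c.TABLE_SCHEMA + '.' + c.TABLE_NAME)
    and ep.minor_id = columnproperty(ep.major_id, c.COLUMN_NAME, 'ColumnId')
    and ep.name = 'Description'
where c.TABLE_NAME = 'Premium'
```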

3. Business knowledge

To be able to specify the correct columns on the table, we need domain knowledge. For example, in Investment Banking: should Coupon be in the Security table, or in the Position table? Which attributes are at instrument level and which are at position level? What is the difference between spread duration and spread duration contribution? Are they in % or in bps? We can pass the responsibility to the BA, who should do the first pass, but we need to understand it as well. I don’t think we can call ourselves data architects if we don’t have the domain knowledge.

4. Types of columns

There are 2 types of columns in a reference table: key and attributes. There are 4 types of columns in a transaction table: key columns (the ones which define the grain), transaction-level attributes, lower-level attributes, and measures. For example, in a retail transaction table, the key columns are time of transaction, store ID, product ID, and till number. The transaction-level attributes are transaction type, order ID, payment type and cashier ID. The product-level attributes are product type, product weight and special offer (they are stored in the transaction table because there are rows which don’t have a product ID but still have a product type).

5. Storing in 2 places

It is a basic principle of normalised database design that data should only be stored in one place. But sometimes I come across the same data stored in 2 places. For example, in the Product table there is an attribute (column) called Size, but there is also a table called Product Specification, a 3-column table which stores all properties of every product (product ID, property name, value), including Size. This is not right; one of them should be removed. If we maintain the same data in 2 places, it is confusing, it takes extra effort to keep them in sync, and it creates a dual version of the truth.

6. Enforcing FK for RI

Not every FK needs to be enforced as RI. We don’t have to create a physical foreign key for every ID column which refers to another table. It is a trade-off, so we need to keep it in balance. If a transaction table has 15 ID columns and we put RI on all of them, inserts and updates on the transaction table will be slow. More crucially, any deletion on a referenced table will be significantly slower, because it has to scan this big transaction table to check whether the referenced row is used. Put RI on the IDs pointing to the main entities only, and leave it to the ETL to check the rest.

If it is a data entry system, insist that all inserts into the transaction table from the front end/GUI are done via a stored procedure, not direct inserts. This is so that we can perform lookups into the reference tables. Inserts into reference tables don’t need to be procedurised; the GUI can insert directly.
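A sketch of such a procedure, assuming illustrative dbo.OrderLine and dbo.Product tables (all names here are my own, for illustration):

```sql
-- Sketch: an insert procedure that validates a reference ID
-- before inserting into the transaction table (names are illustrative).
create procedure dbo.InsertOrderLine
    @OrderId bigint, @ProductId int, @Quantity int
as
begin
    set nocount on;

    -- Lookup into the reference table, instead of a physical FK:
    if not exists (select 1 from dbo.Product where ProductId = @ProductId)
    begin
        raiserror('Unknown ProductId', 16, 1);
        return;
    end;

    insert into dbo.OrderLine (OrderId, ProductId, Quantity)
    values (@OrderId, @ProductId, @Quantity);
end
```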

7. Order of columns

Put the key columns first, then FK columns, then measures, then current-level attributes, then lower-level attributes, then flags, then system columns. The key and FK columns are specified first so that at a glance we can see how this table relates to other tables, and we can see what indices or partitioning might potentially be created.

The best practice is to insist that all inserts must specify column names, and that no select uses “*” for the column names. This way the sequence of columns doesn’t matter. But still, it is better to order the columns logically.

For periodic transaction tables (date-based tables), the periodic date column should be the first column. If the table is partitioned, the partitioning key should be the first column. System columns, such as created and updated timestamp and created and updated user, should be located at the end of the table so we can identify them easily.

It is very annoying when, once the order is set, the following year people add new columns at the end of the table, effectively destroying the order. We need to prevent this by putting a change control process around any structural change to any database in production, inside which there is a database review activity/task.

8. Not Null

Key columns should be made not null. System columns too. Unlike RI (FK), not null doesn’t have much impact on insert/delete/update performance.

9. Default

System columns such as created/modified date should have a default of getdate(), and created/modified by should have a default of SYSTEM_USER.
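For example, with named default constraints (the table and constraint names are illustrative):

```sql
-- Sketch: named default constraints on the system columns
-- (table and constraint names are illustrative).
create table dbo.Claim
(
    ClaimId     bigint identity(1,1) not null,
    ClaimDate   date not null,
    CreatedTime datetime not null
        constraint DF_Claim_CreatedTime default (getdate()),
    CreatedBy   nvarchar(128) not null
        constraint DF_Claim_CreatedBy default (system_user),
    constraint PK_Claim primary key nonclustered (ClaimId)
)
```

Naming the constraints (DF_Table_Column) makes them easy to find and drop later, following the same convention as PK_TableName above.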

10. ID or Id

I prefer Id to ID, e.g. OrderId, ClaimId, TradeId, ProductId, CustomerId rather than OrderID, ClaimID, TradeID, ProductID, CustomerID. This is because the convention is camel case, so why break your own convention? Because ID is an acronym? No, it isn’t. An acronym is formed from the first character of every word in a phrase. For example, CDS is the acronym for Credit Default Swap and FRN is the acronym for Floating Rate Note, but ID is not an acronym of Identity; it is just an abbreviation.

11. Punctuations

Some column names contain punctuation such as & (ampersand) and ’ (apostrophe). For example: [S&PRatingDate], [S&P Rating Date], SAndPRatingDate or SandPRatingDate? [D&B Customer Name], DAndBCustomerName or DandBCustomerName? [Weight%], [Weight %], WeightPct or WeightPercentage? [Moody’s Rating], [Moody’sRating], MoodysRating or MoodyRating? It depends on the company standards. My preference is not having to use [] when querying, because I find it quite annoying to be told by the RDBMS many times that it is an invalid column name when I don’t use []. So I would name them SAndPRatingDate, DAndBCustomerName, WeightPercentage, and MoodysRating.

12. Abbreviations

Should we shorten column names by abbreviating a few words? For example: ProportionOfClaimWrittenOff as PropClaimWrtnOff, GrossPremiumWritten as GPW, GrossPremiumEarned as GPE, NumberOfClaims as NoOfClaims, CountryOfIncorporation as CountryOfInc, CountryOfDomicile as CountryOfDom, DeliveryDate as DelDate, PercentageOfProductDelivered as PctOfPrdDelvd? I find abbreviations annoying because they make things unclear and confusing. We should not abbreviate column names; the disadvantages outweigh the advantage (brevity).

 
DATA TYPE

1. Same as the source

The best practice is to set the data type in the target database the same as the data type in the source. This way we can make sure there is no loss of precision or truncation. For example, if the source data is decimal(20,8) we should not use decimal(25,4), because we would lose 4 decimal places. If the source data is UTF-16 containing European characters, then we should use nvarchar rather than varchar.

2. Date or DateTime

An exception to this rule is datetime: if the source data is in SQL Server, its data type is datetime, and we know it is only used to store a date without a time, e.g. “2015-07-14 00:00:00”, then we should set the target data type to Date. Not only because Microsoft recommends not using datetime for new work (link), but also because it is quite annoying that every time we query we get the “00:00:00” portion, which is not required.

3. Float or Decimal

If the source data is Float then we need to use Float, because if we use decimal we may lose precision. Float has a huge range (magnitudes up to about 10^308, link) but only 15–17 significant digits, whereas decimal can store up to 38 digits exactly (link). Converting a float value with more than 17 significant digits to decimal can lose the extra digits.

Also, a multiplication of two decimals can result in an Arithmetic Overflow, because the result precision can exceed the 38-digit maximum; with Float we don’t have this issue.
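A quick illustration of the decimal overflow (the values are arbitrary):

```sql
-- 10^19 stored as decimal(38,0); squaring it would need 39 digits,
-- which exceeds the 38-digit maximum, so SQL Server raises an
-- "Arithmetic overflow error" here.
declare @a decimal(38,0) = 10000000000000000000  -- 10^19
select @a * @a

-- The same calculation as float succeeds (with limited precision):
declare @f float = 1e19
select @f * @f
```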

4. Money or decimal

The Money data type only has 4 decimal places. For this reason it is better to use the decimal data type to store monetary values, particularly those with many decimal places, such as base currency amounts after FX rate conversions. FX rates are stored with a minimum of 6 decimal places, more usually 8, and sometimes 12.

5. Flags: Char(1) or Bit

For flags such as IsActive, IsCallable, etc. I prefer char(1) to bit, for the sake of clarity. Bit uses 0 and 1, which is more ambiguous to people than Y and N. The storage argument is not significant because the difference is only 7 bits (1 bit vs 1 byte).

6. Bigint or int

Int only goes up to about 2.1 billion, so for the PK of a periodic transaction table we should use bigint. For the FKs of a (periodic) transaction table, we should use int, as reference tables almost always have fewer than 2.1 billion rows.

7. UDT

Sometimes I see people creating custom data types (UDT = User Defined Type). I don’t see the point of doing this. What I mean here is creating data types called MyInt, MyVarchar20, MyDateTime, MyDecimal21-6. Not only is this potentially slower than the native SQL data types (I have not tested it, but I suspect it is), it is also more confusing, e.g. what is MyInt? We should simply use Int.

 

INDEXING & PARTITIONING

  1. For reference tables (customer, product, security, country, currency, etc.), set the clustered index and the PK to CustomerId, ProductId, SecurityId, CountryId, CurrencyId, etc.

  2. For transaction tables (order, delivery, purchase, registration, trade, claim, premium, etc.), set the clustered index to OrderDate, DeliveryDate, PurchaseDate, RegistrationDate, TradeDate, ClaimDate, PremiumDate, etc. This is more useful than setting the clustered index to the transaction ID, i.e. OrderId, DeliveryId, PurchaseId, RegistrationId, TradeId, ClaimId, PremiumId. Why? Because most queries specify the date rather than the transaction ID.

  3. There are cases for point #2 above where the transaction ID is a better clustered index.
    The first case is when there are 5 or more nonclustered indices and these indices are used more often than the date. For example, a Claim table is indexed on Claim Date, Customer, Business Line, Policy Number and Sales Rep, and users query the Claim table by Customer, Policy Number or Sales Rep (in combination with the Claim Date) more often than by the Claim Date alone. In this case, make Claim ID the clustered index (a single integer column as the indexing key); all the other nonclustered indices will benefit from this slim key, because their row locator is the clustered index key (link).

    The second case is when the transaction ID is often used to join to another table. For example, consider the Order table and the Delivery table, which are often joined on the Order ID column. In this case it is worth considering making OrderId the clustered index of the Order table. But if users query the Order table by Order Date more often than they join the Order table to the Delivery table, then we should make Order Date the clustered index.

  4. For periodic/snapshot transaction tables, use the periodic date column as the clustered index. There are no ifs or buts in this case.

  5. For transaction and periodic transaction tables, we should index the IDs of the main reference tables, e.g. CustomerId, ProductId, SecurityId. These IDs are usually among the first 5 columns. We should make these indices nonclustered.

  6. Remember that if the table is partitioned, every aligned index must include the partitioning key, and it is simplest to make it the first indexing key. A transaction table (be it snapshot or not) is mostly partitioned by date. So for the Order table, the partitioning key should be OrderDate, and all indices should use OrderDate as their first column.

  7. When specifying partition scheme, use as many file groups as you possibly can. Work with the DBA to create the file groups. We should put the file groups in different disks, and name them accordingly, i.e. not FG1, FG2, FG3, etc. but FG1A, FG1B, FG2A, FG2B, etc. with 1 and 2 as the disk numbers.
  8. It is much better to use a static, permanent partitioning function than the “sliding window partition maintenance” technique. Simply divide the dates into months, quarters or years. For example:
    – Partition 1 is Jan 2010, partition 2 is Feb 2010, and so on.
    – Partition 1 is Jan and Feb 2010, partition 2 is Mar and Apr 2010, and so on
    – Partition 1 is 2010 Q1, partition 2 is 2010 Q2, and so on
    – Partition 1 is 2010, partition 2 is 2011, and so on.
    – Partition 1 is 2010 Jan to June, Partition 2 is 2010 July to Dec, and so on.

  9. Even when the table is partitioned, we still need to make the date column the clustered index. This is because one partition contains a whole month’s worth of data (or 2 months, or a quarter, or even a year). By specifying the date column as the clustered index, the table is physically ordered by the date column, so a query can quickly locate a particular date.
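Points 7 and 8 can be sketched as follows (the boundary dates, file group names and table names are illustrative):

```sql
-- Sketch: a static yearly partition function and scheme, spread over
-- file groups on two disks following the FG1A/FG1B/FG2A/FG2B naming
-- (boundary dates and names are illustrative).
create partition function pfOrderDate (date)
as range right for values ('2011-01-01', '2012-01-01', '2013-01-01', '2014-01-01')

-- 4 boundary values create 5 partitions, so the scheme maps 5 file groups.
create partition scheme psOrderDate
as partition pfOrderDate
to (FG1A, FG2A, FG1B, FG2B, FG1A)

-- The table is created on the scheme, partitioned by OrderDate.
create table dbo.[Order]
(
    OrderDate  date   not null,
    OrderId    bigint identity(1,1) not null,
    CustomerId int    not null
) on psOrderDate (OrderDate)
```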

3 Comments »

  1. Hi,

    I would like to know in which cases it is recommended to use a data warehouse appliance solution like IBM Netezza or Oracle Exadata.
    I know that these kinds of solutions are very expensive and used in very large data warehouses, but I would like to know
    if there are more criteria to assess whether you need a solution like that or can go with common databases, servers, etc.

    Thanks,

    Juan

    Comment by Juan V. — 21 March 2016 @ 10:56 pm | Reply

    • Hi Juan, the main difference between the MPP and SMP architectures is their linearity in scaling up. MPPs such as Exadata, Teradata, APS, and Netezza are linear, whereas SMPs such as SQL Server, Oracle and DB2 are not. Linear means that the computing power of 20 nodes is 10x the computing power of 2 nodes. This is because MPP has a “share nothing” architecture: each node has its own processor, memory and storage, whereas SMP shares a lot of things, such as disk, memory, etc. That said, MPP is not truly linear (a straight line) because of the inefficiency caused by the connectivity between nodes.
      The second main difference between MPP and SMP is the capacity limit. In theory, MPP has no limit on its capacity, meaning that we could create an MPP with 20,000 nodes, a system with huge computing power, for example each node having 100 GB RAM and 1 TB storage. But in SMP we can’t do that; the limit on SMP is very obvious. We can’t put 50 PB of RAM or disk in an SMP. Of course, this “MPP has no limit” principle only holds in theory, because again it is restricted by the network interconnect; we can’t put 20 million nodes into it even if we have a $500 billion budget.

      Comment by Vincent Rainardi — 22 March 2016 @ 8:31 am | Reply

  2. Thank you very much Vincent, your answer was very helpful.

    Comment by Juan V. — 22 March 2016 @ 9:53 am | Reply

