There are several data structures which are good from storage point of view, but it is painful to read/retrieve the data. I believe that the best way to learn is by example. So in this article I’m going to give two examples. The first one is StartDate-EndDate, and the second one is Product-Attribute-Value.
This table stores the customer level using start date and end date columns:
The data is produced by a CRM system in an ecommerce company which evaluates the customer shopping behaviour every day, and updates the customer level accordingly. There are 10 levels, categorised based on their likelihood to make a purchase. The shopping behaviour is evaluated based on the order of pages the customer visited in the web site.
What we are trying to find out are:
1. On 30th March 2010, how many customers are on level 1, level 2 and level 3?
2. In the 3 months period ending 30th March 2010, how many customers moved from level 2 to either level 3 or level 1?
3. What kind of customers are in #2 above? (using join to the customer table)
The purpose of the exercise is to understand the customer movement tendencies, for each customer segment. A segment is a combination of region, age group, gender, product interest, occupation. Once the tendencies are found, then we can make predictions for the future behaviour.
The above questions are very difficult to answer using the above structure. The query would be a lot easier, if we restructure the above table to:
Yes we are losing the date granularity, but the business evaluates the customer behaviour on month-by-month basis, as of the end of the month. Using this structure, it is a lot simpler to answer the above 3 questions.
This is one of the powers of using Periodic Snapshot concept from Ralph Kimball.
The above structure stores product attributes as columns. That is the original structure. The above example shows 6 attributes of a dress. In reality, there are many more attributes for dresses, e.g. country of origin, wash instruction, ironing setting, sleeve style, etc. And there are a lot of products other than dress which can have completely different attributes, e.g. shoes, hand bags, jewellery, etc.
Because of the number of product attributes which need to be created every month (assume it is a in-house built EPOS, Electronic Point of Sale), the data architect decided to convert it into a Product-Attribute-Value structure (PAV) as follows:
For this table to work, we will need to group the attribute names and define the attribute values in the Group-Attribute-Value table (GAV) as follows.
If we take it one step further, we can give each row in the GAV table an ID, and use this ID in the PAV table. This structure is very flexible. Using this structure the application can add a new attribute on the fly (on the maintenance screen). It is a brilliant structure from EPOS Application point of view, but it is difficult to use it for reporting. If we replace the Attribute Value with their IDs, we will end up joining PAV to GAV a lot of times simultaneously.
Still, given the choice, a data architect should choose the PAV structure because the flexibility outweigh the disadvantages. Because the EPOS application displays one product at the time, there is no performance issue.
Similar to the PAV structure which is used to store attribute values, the Product-Measure-Value structure (PMV) is used to store measure/numerical values. For example:
We can see above there are 3 different types of measures:
- Weight, Length, Height, Depth, Volume, Number of doors, and Number of drawers are product characteristics (aka product attributes or product property). Yes, product attributes can be numerical, not just textual. Product characteristic doesn’t change for the life span of the product.
- Average Margin, Average (Sale) Price, Age (days), and Average Cost are calculated daily and stored against date, so this table is not the right home for them. But similar concept applies, we just need to add a date column.
- The third type is actually not a property of the product itself, but how the stock is managed. Supply Ratio, Minimum Order Quantity, and Order Trigger Level tells the purchasing department when and how much to order.
Regardless of the 3 types, the MeasureValue column is of the same type e.g. numeric(20,4), even though some of the measures are integer, some decimal and some are float! This is the limitation of this design, but the advantages far outweigh the disadvantages. The structure is so flexible that it enables the Sales or EPOS application to create measures on the fly (on the maintenance screen).
Again this Product-Measure-Value structure is not reporting-friendly, but to support a data entry application such as EPOS it works brilliantly.
This is not just for retail industry by the way. It is also applicable for banking, insurance, manufacturing and health care industries. Instead of ProductId column, we can have CustomerId, PolicyId, PositionId, and PatientId, for example.