Data Warehousing and Data Science

28 January 2010

Using a Data Warehouse for CRM

Filed under: Data Warehousing — Vincent Rainardi @ 8:13 am

Usually a data warehouse is used for Business Intelligence. In this article we describe another use of a data warehouse: Customer Relationship Management (CRM).

Customer Relationship Management is a set of activities performed by a company or organisation (business or non-business) to manage and analyse information about their customers; to keep in contact and communicate with their customers; to attract and win new customers; to market products/services and conduct transactions with their customers; and to service and support their customers. For non-business organisations the word “customers” can be substituted with “citizens”, “users”, “stakeholders”, “applicants”, “visitors”, “constituents”, “interested parties”, “students” or similar, as long as the term describes the people that the organisation serves.

A CRM system is a package of applications that supports the above activities. Among the various functionalities of a CRM system, below are those that are ideally supported by a data warehouse, or that utilise data from a data warehouse. Other functionalities may be best served by an Operational Data Store or by front-end applications. Please see the table at the end of this article for details about which tool is best for which functionality.

  • Single customer view
  • Permission management
  • Campaign segmentation
  • Manage deliverability
  • Customer services/support
  • Customer analysis
  • Personalisation
  • Customer loyalty scheme

Let’s discuss these functionalities one by one.

Single Customer View

One of the most important things in CRM data management is the concept of the Single Customer View. This concept arose because customers can be defined differently depending on the context and on which department we are talking to. For example, a customer could be defined as follows:

  • A customer is anybody who has purchased from us.
  • A customer is anybody who has registered with us.
  • A customer is anybody who has paid us.
  • A customer is anybody who is currently subscribed to our newsletters.

And on top of that we have to deal with variations and complications such as:

  • Order cancellations: what if the customer has ordered (purchased) from us but cancelled the order before we delivered the goods? Do we count him as a customer? Perhaps not.
  • Contract termination: what if a customer signed a contract for a particular service from us for a year then the following year she did not renew the contract? Do we still count her as a customer? Perhaps not.
  • Unsubscriptions: the customer subscribed to our newsletter and then unsubscribed from it. Do we count him as a customer? Perhaps not.
  • Order life cycle: the order fulfilment process consists of many stages: quotation produced, contract signed, account opened, order placed, order dispatched (for physical goods), order fulfilled/delivered, consumed (for services), invoiced, paid, returned, credit note raised, refunded, account closed. So at what stage do we recognise them as a customer? Different industries have different order stages.
  • Prospects: even though a prospect is not technically a customer yet, some departments such as marketing need us to store prospect data.
  • When does a customer stop being a customer? When they no longer consume our products or services? In some industries, such as waste management, the process of ‘discharging’ a customer is done in stages, for example: stopping the collections, removing the bins, settling the account, deactivating the customer status. In some industries there is no concept of ceasing to be a customer: once they purchase something they are customers forever.

So, practically speaking, we may need to store subscribers, registered users, purchasers and prospects as customers. What we do in the data warehouse is build the customer dimension from several different sources in the operational systems: order data, subscription and permission data, registration data and marketing data. We can use the overlay technique I described in this article: Upsert Dimension Table. Yes, it is true that this means we will have many attributes on the customer dimension. And yes, we do need to deduplicate the data, for example based on customer name, date of birth and address, or email address.

Single customer view simply means that we need to build a customer dimension that is solid, i.e. with no duplication of data, complete (no missing customers) and correct. Deduplication is not always easy: for example, names can change when women get married, addresses change when people move house, and email addresses change over time too (a hard bounce could be an indication). Hence we need to use other measures such as a time frame or data age (e.g. we can use the data if it is no more than 1 year old), or other criteria such as Social Security Number, date of birth, etc. MDM and CDI vendors have a lot of experience in this area, as do data quality and data profiling software such as Trillium.
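As an illustration of the matching described above (not the author's implementation), here is a minimal deduplication sketch in Python: records sharing a normalised email address, or sharing a name plus date of birth, are merged under one surviving record. All field names and the merge policy are hypothetical.

```python
# Minimal customer deduplication sketch: records matching on normalised
# email, or on (name, date_of_birth), are grouped and merged.

def match_keys(record):
    """Return candidate match keys for a customer record."""
    keys = []
    email = (record.get("email") or "").strip().lower()
    if email:
        keys.append(("email", email))
    name = (record.get("name") or "").strip().lower()
    dob = record.get("date_of_birth")
    if name and dob:
        keys.append(("name_dob", name, dob))
    return keys

def deduplicate(records):
    """Group records that share any match key; merge each group into one record."""
    key_to_group = {}
    groups = []
    for rec in records:
        group = None
        for key in match_keys(rec):
            if key in key_to_group:
                group = key_to_group[key]
                break
        if group is None:
            group = []
            groups.append(group)
        group.append(rec)
        for key in match_keys(rec):
            key_to_group[key] = group
    # survivor: first record of each group, enriched with any missing fields
    survivors = []
    for group in groups:
        merged = {}
        for rec in group:
            for field, value in rec.items():
                merged.setdefault(field, value)
        survivors.append(merged)
    return survivors

customers = [
    {"id": 1, "name": "Ann Lee", "email": "ann@example.com", "date_of_birth": "1980-02-01"},
    {"id": 2, "name": "Ann Lee", "email": "ANN@example.com ", "date_of_birth": None},
    {"id": 3, "name": "Bob Kim", "email": "bob@example.com", "date_of_birth": "1975-07-09"},
]
result = deduplicate(customers)
print(len(result))  # 2: records 1 and 2 merged on normalised email
```

A production implementation would add fuzzy name matching and the data-age rules mentioned above; this sketch only shows the exact-key grouping idea.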

Permission Management

Regulations differ from country to country, but the basic principle is that we can only send campaigns to customers who have already given us their permission to do so. Based on the scope, there are 2 kinds of permission: subscription-based and general.

In a subscription-based scenario, we receive requests from customers asking us to send them a particular communication. For example, say we have 3 communications: a weekly newsletter, a low price alert and special offers. Customers can subscribe to 1, 2 or all 3 of these communications. In this case, if a customer subscribes to the low price alert, we can only send them the low price alert; we cannot send them the other communications. If we launch a new communication, we cannot send it to subscribers of the other communications. Instead, we need to build a new subscriber base for the new communication. A subscriber base is created by getting end users to subscribe to a particular communication through the website or through promotions.

In the general permission type, customers do not subscribe to a particular communication. Instead, they give us permission to send them a particular type of communication. For example, say we have 2 communication types: promotional items and subscription items. In this case the subscription items cover everything that is regular and the promotional items cover everything that is ad hoc. Another example: we could have 3 communication types: transactional (such as order confirmation emails and e-tickets), marketing (such as promotional campaigns) and third party (such as campaigns from our sister companies).

Permission is also categorised based on the communication channel or media, for example: by email, by SMS/text, by post, by telephone and by RSS feed. There is also a permission covering any kind of communication: for example, a customer could call or email us asking to be excluded from all forms of communication.

For a multinational company, permissions could be per locale. A locale is not necessarily country-based: for example, Benelux, the Nordics and Scandinavia are often treated as single locales, even though they consist of several countries. In those cases one office serves more than one country, and each locale may have more than one language. We could also have several brands or names: for example, we could operate in a certain country under 3 different company names, each having its own monthly newsletter or promotional campaign. Permissions can also have a validity period, e.g. only for 1 year. We need to take locale, language, brand and validity period into account when constructing the permission fact table and the communication dimension.

Let us discuss the design. Permission data is ideally stored in a fact table, with all the above items as dimensions, plus the customer key, the date the permission was given, the source key, and the permission ID as a degenerate dimension if applicable. The measures or facts are the number of subscriptions and the subscription price. There are 2 possible fact tables: a subscription fact table if you use subscription-based permission, or a general permission fact table if you use the general permission approach described above. As an example, let us discuss the design for the subscription-based type.

Fact table name: fact_subscription
Grain: each time a customer subscribes or unsubscribes to a communication.
Type: accumulating snapshot
Table creation script (ignoring partition for the time being):
create table fact_subscription (
customer_key int not null, -- who the customer is
communication_key int not null, -- what communication the customer is (un)subscribing to
channel_key int not null, -- what media will be used (email, post, text, RSS)
promotion_source_key int not null, -- which promotion is the source or cause of this subscription
brand_key int not null, -- which brand manages this (un)subscription
locale_key int not null, -- which locale manages this (un)subscription
language_key int not null, -- which language this (un)subscription event originated from
expiry_date_key int not null, -- the date until which this subscription is valid
subscription_period_key int not null, -- how long the subscription is valid for, e.g. 3 months, 1 year
permission_id varchar(20) not null, -- degenerate dimension from the front end CRM system if applicable
subscription_price money null, -- how much this subscription costs
number_of_subscriptions int null, -- 1 for a normal subscription, 0 for dummy
subscribed_dt datetime null, -- date and time the customer subscribed
unsubscribed_dt datetime null, -- date and time the customer unsubscribed, NULL if still subscribed
is_active_flag tinyint not null, -- 1 if the subscription is active, 0 if it is expired or unsubscribed
created_dt datetime not null, -- system date & time when this fact table record was created
last_updated_dt datetime not null, -- system date & time when this record was last updated
constraint pk_fact_subscription primary key clustered (customer_key, communication_key, channel_key, subscribed_dt))

To get the email address and customer name of subscribers to the Ivory Weekly email campaign:

select email_address, customer_name from fact_subscription sub
join dim_customer cus on cus.customer_key = sub.customer_key
join dim_communication com on com.communication_key = sub.communication_key
join dim_channel ch on ch.channel_key = sub.channel_key
where com.communication_name = 'Ivory Weekly'
and ch.channel_name = 'Email'
and sub.is_active_flag = 1

We could store the date the permission was given as a dimension, but we would recommend storing the timestamp on the fact table for 2 reasons: a) we don’t lose the time-of-day element, and b) it is easier to retrieve the timestamp data for campaign segmentation. It is not advisable to store the permission in the customer dimension because that would limit the grain to a per-customer basis, rather than per customer, communication and date.

So permission management is the capability of a CRM data warehouse to store permissions, based on all of the items described above, and to keep them always up to date. The permission data needs to be made available to the campaign management system to support the campaign segmentation process. It is used frequently, i.e. every time the users create a campaign.
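To make the permission check concrete, here is an illustrative sketch (not from the article) of the rule that a campaign may only go to customers holding an active permission for that exact communication and channel. The field names loosely mirror the fact_subscription design above but are simplified and hypothetical.

```python
# Permission check sketch: a send is allowed only if the customer holds
# an active subscription for this exact communication and channel.
# The records and field names are illustrative, not a real schema.

subscriptions = [
    {"customer": "C1", "communication": "Ivory Weekly", "channel": "Email", "active": True},
    {"customer": "C2", "communication": "Ivory Weekly", "channel": "Email", "active": False},
    {"customer": "C3", "communication": "Low Price Alert", "channel": "SMS", "active": True},
]

def may_send(customer, communication, channel):
    """True only if an active permission exists; no permission means no send."""
    return any(
        s["customer"] == customer
        and s["communication"] == communication
        and s["channel"] == channel
        and s["active"]
        for s in subscriptions
    )

print(may_send("C1", "Ivory Weekly", "Email"))  # True
print(may_send("C2", "Ivory Weekly", "Email"))  # False: unsubscribed
print(may_send("C3", "Ivory Weekly", "Email"))  # False: subscribed to a different communication
```

Note that the default is to refuse: a customer with no matching row gets nothing, which matches the subscription-based principle described above.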

Campaign Segmentation

When creating a campaign, we need a list of customers to whom we are going to send it. These end users are known as the campaign target audience, and the campaign segmentation process produces this list. Most CRM software has this capability. This is where the SCV plays an important role: the richer the customer dimension, the more flexibly we can create the segmentation. Segmentation criteria that are commonly used are:

  • Permission
  • Demographics
  • Order data
  • Campaign delivery
  • Campaign response
  • Customer loyalty score
  • Customer profitability

We will give an example of each of the above items so we are clear about what they are. Permission: all customers who subscribed to the Norwegian weekly newsletter in the last 3 months. Demographics: all female customers aged 20 to 40 who live in Milan. Order data: the top 1000 customers (by order value, excluding VAT) who have purchased electronic products from us in the last 12 months. Campaign delivery: exclude customers who had more than 3 hard bounces in the last 8 weeks. Campaign response: include all customers who opened the last campaign. Customer loyalty score: include customers from the top tier with more than 500 loyalty points. Customer profitability: include all customers from band A with annual order value > $30,000.
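The criteria above combine with simple set logic. As an illustration only (the customer attributes and thresholds here are hypothetical), a segmentation sketch that mixes permission, demographic and delivery criteria:

```python
# Campaign segmentation sketch: build a target list by combining
# permission, demographic and campaign-delivery criteria.
# All customer fields and thresholds are illustrative.

customers = [
    {"id": 1, "gender": "F", "age": 25, "city": "Milan", "subscribed": True,
     "hard_bounces_8w": 0, "order_value_12m": 1200},
    {"id": 2, "gender": "F", "age": 35, "city": "Milan", "subscribed": True,
     "hard_bounces_8w": 4, "order_value_12m": 900},   # excluded: too many hard bounces
    {"id": 3, "gender": "M", "age": 30, "city": "Milan", "subscribed": True,
     "hard_bounces_8w": 0, "order_value_12m": 2000},  # excluded: demographics
]

def segment(customers):
    """Female customers aged 20-40 in Milan, with permission, at most
    3 hard bounces in the last 8 weeks, ranked by 12-month order value."""
    target = [
        c for c in customers
        if c["subscribed"]
        and c["gender"] == "F"
        and 20 <= c["age"] <= 40
        and c["city"] == "Milan"
        and c["hard_bounces_8w"] <= 3
    ]
    return sorted(target, key=lambda c: c["order_value_12m"], reverse=True)

print([c["id"] for c in segment(customers)])  # [1]
```

In practice this logic would run as a query against the customer dimension and the delivery fact table; the sketch only shows how the criteria compose.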

Campaign Results

What we mean by campaign results is:

  • Campaign delivery data, i.e. whether the campaign successfully reached the target audience. For example, say we have an email campaign with 100,000 target recipients. Because of invalid email addresses, we only sent 99k and did not send 1k. Out of the 99k that went out, 96k were delivered to the target recipients’ mail boxes and 3k bounced. All this information is called campaign delivery data.
  • Campaign response data, i.e. reactions from the customers receiving the campaign, perhaps by clicking on a link if it is an email campaign, or calling customer service center if it is a postal campaign.
  • Orders resulting from the campaign, i.e. out of the customers who responded to the campaign, how many actually placed orders, what they purchased and what their order values were.

Let’s go through these 3 points one by one.

Once the campaign segmentation is ready, the CRM system executes the campaign and sends it to the target audience. Data about which customers the campaign was successfully delivered to and which it was not, along with the reason why it was not delivered, should be fed back by the CRM system to the data warehouse. We are not only talking about email campaigns here, but also campaigns by post, by telephone, by text message and by home page customisation. This campaign delivery data (i.e. sent, not sent, bounced and delivered) will be useful for future campaigns. One possible design for storing campaign delivery data in the data warehouse is a factless fact table, with customer, communication, date, channel, delivery status and reason as the dimensional keys.

Specific to email campaigns, when the campaign reaches the target audience’s mail box, the end user may open the campaign email and perhaps click on a particular offer in it. These open and click-through events are fed back to the underlying data warehouse. No mechanism is 100% reliable, but one mechanism for logging open events is a transparent 1-pixel image inside the body of the email, with a query string containing a customer identifier on the image tag. The web log for this image is then processed daily, and each hit on the image, along with the customer identifier and the timestamp, is stored in the campaign response database. A mechanism for logging click-through events is redirection, i.e. the link in the campaign email hits a landing page with the destination page URL and a customer identifier in the query string. A script behind the landing page then records the time of the event, the URL of the requested page and the customer identifier in the campaign response database before redirecting the user to the real page. The campaign response database is then fed back by the ETL to the data warehouse, to be used in future campaigns as additional criteria when doing segmentation.
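Both mechanisms above reduce to parsing the customer identifier (and, for clicks, the destination URL) out of a request query string. A sketch of that parsing step, with hypothetical parameter names (`cid`, `dest`) and URLs:

```python
# Sketch of processing a tracking hit: the 1x1 image (open event) or
# landing page (click-through) is requested with the customer id, and
# for clicks the destination URL, in the query string.
# Parameter names and hosts are illustrative, not a real system.

from urllib.parse import urlparse, parse_qs

def parse_hit(request_url):
    """Extract event type, customer id and optional destination from a tracking URL."""
    params = parse_qs(urlparse(request_url).query)
    return {
        "event": "click" if "dest" in params else "open",
        "customer_id": params.get("cid", [None])[0],
        "destination": params.get("dest", [None])[0],
    }

open_hit = parse_hit("http://tracker.example.com/pixel.gif?cid=12345")
click_hit = parse_hit(
    "http://tracker.example.com/redirect?cid=12345&dest=http%3A%2F%2Fshop.example.com%2Foffer1")

print(open_hit["event"], open_hit["customer_id"])    # open 12345
print(click_hit["event"], click_hit["destination"])  # click http://shop.example.com/offer1
```

The daily log-processing job described above would apply this parsing to each log line and load the results into the campaign response database.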

Open events can be stored in the same fact table as the campaign delivery data. An example design for a campaign delivery fact table that includes the open data is below.

Fact table name: campaign delivery
Fact table type: accumulating snapshot
Grain: 1 row for each communication sent to each customer
Creation script:
create table fact_campaign_delivery (
customer_key int not null, -- who the customer is
communication_key int not null, -- what communication is delivered to the customer
channel_key int not null, -- what media is used (email, post, text, RSS)
delivery_status_key tinyint not null, -- 1 if successfully delivered, 0 if failed
reason_key int not null, -- a positive integer containing the failure reason, 0 if successful
open_status tinyint not null, -- 1 if opened and 0 if not opened
number_of_opens int not null, -- normally 1 if opened but can be more than 1
sent_dt datetime not null, -- date and time the communication was sent to this customer
delivered_dt datetime null, -- date and time the delivery was logged
bounced_dt datetime null, -- date and time the bounce event was logged
first_opened_dt datetime null, -- date and time the message was first opened, NULL if not opened
created_dt datetime not null, -- system date & time when this fact table record was created
last_updated_dt datetime not null, -- system date & time when this record was last updated
constraint pk_fact_campaign_delivery primary key clustered (customer_key, communication_key, channel_key, sent_dt))

Because the number of opens can be more than 1, if you want to record the date and time of the second and subsequent opens you will need to put opens in their own fact table, separate from campaign delivery. Normally it does not really matter that we don’t get the timestamps of the 2nd and subsequent opens.

Click-throughs cannot be stored in the above fact table because the grain is different. The grain of click-through is 1 row for each link/URL on a communication. An example design is given below.

Fact table name: clickthrough
Fact table type: accumulating snapshot
Grain: 1 row for each link/URL on a communication, whether clicked or not clicked.
Creation script:
create table fact_clickthrough (
customer_key int not null, -- who the customer is
communication_key int not null, -- what communication is delivered to the customer
channel_key int not null, -- what media is used (email, post, text, RSS)
URL_key int not null, -- which URL on the communication
click_status tinyint not null, -- 1 if it is clicked, 0 if not
number_of_clicks int not null, -- how many times the URL was clicked by the same customer
first_clicked_dt datetime null, -- date and time the URL was first clicked, NULL if not clicked
created_dt datetime not null, -- system date & time when this fact table record was created
last_updated_dt datetime not null, -- system date & time when this record was last updated
constraint pk_fact_clickthrough primary key clustered (customer_key, communication_key, channel_key, URL_key))

If you want to record the time the 2nd click happened, put each click on a separate row and make the fact table transactional. But normally it does not really matter, as long as we know the number of clicks. It is preferable to put each URL (whether it was clicked or not) in 1 row as above, with the number of clicks as a measure.

Some of the customers who responded to the campaign might place orders. These orders are tracked using a promotional code if it is a postal campaign, using an identifier on the offer link if it is an email campaign, or using a standard software package such as Omniture SiteCatalyst if the order is placed online. The data normally fed back into the data warehouse, to be used in future campaigns, are who the customer is, which campaign the order resulted from, and the usual order attributes such as product type, quantity and value. This enables us to analyse campaign effectiveness, analyse customer behaviour, and monitor how much of the company’s revenue is generated from CRM activities, which can be used for ROI calculations or for backing the proposal for future projects.

Customer Analysis

Various types of customer analysis could be performed in the data warehouse. To give you some ideas, below are some examples.

  • Purchase pattern analysis, i.e. what kind of products or services a particular group of customers purchases. The groupings could be based on demographics or campaign results. Based on the patterns we could try to forecast future purchases and relate them to inventory management, in particular the reorder level and purchasing lead time.
  • Price sensitivity analysis, i.e. identifying changes in shopping and purchasing behaviour if the price changes. In this case we also group the customers for analysis, not individual customers. We try to identify if there are certain patterns which would be useful for setting future marketing strategies and operational directives.
  • Shopping behaviour (especially for online businesses), i.e. identifying the factors associated with site traffic to measure the effectiveness of the site design and the checkout process design, and to help increase conversion rates. Shopping behaviour analysis is also conducted to gather customer interests (which pages on the online store individuals are most interested in), to be included as a factor in personalisation exercises such as site personalisation and personalised offers.
  • Customer attrition analysis, or customer churn analysis, i.e. answering questions such as: how many customers defected from us each week or month, how many new customers we gained each week or month, what kind of customers we are losing (in terms of profitability and demographics) and what kind of new customers we are gaining (in terms of product or service range and demographics). Also included in this kind of analysis are changes in the type of service or product that the customer holds (this does not apply to supermarkets but is applicable to health care and financial services, for example the type of account).
  • Customer profitability analysis, i.e. the revenue that we receive from the customer minus the costs associated with that customer, over a certain period (say weekly or annually). We want to know which customers we are losing money on, and which customers we are making money on. The formula for the revenue side is normally not difficult, but allocating the cost to each customer activity is not easy, technically or politically.
  • Fraud detection, e.g. a large increase in credit card purchases that deviates significantly from the individual's or group's normal pattern (for the financial services industry); unusual returns of goods by the same customer (identified by name, post code and customer card number) within a short period of time, compared with the daily and seasonal behaviour of the product line of the suspected product code (for the retail industry); spiky account balances and unusual withdrawals/deposits (for banks); drops in recent invoice values not accompanied by lower usage activity (for the telecoms industry). Another method is to use 2 (or more) groups of samples, one containing fraudulent transactions and the other representing good transactions. These groups are then fed into the mining model for training.

Each type of analysis requires a different data model, and different industries require different data models. For example, a customer profitability fact table in the utilities sector (gas and electricity in particular) could be a periodic snapshot type, containing monthly snapshots of each account's revenue (calculated based on service types, rates and usage) and the proportionate cost structure for the same period. The revenue may be per kWh but the base cost may be by weight (tons of coal), which makes the equation non-linear; hence for some customers we could be making a loss while for others we are making a handsome profit.
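To illustrate the profitability calculation above with a toy example (all figures and the cost-allocation rule are hypothetical): monthly revenue is rate times usage per account, and a shared base cost is allocated to accounts in proportion to usage.

```python
# Monthly customer profitability sketch for a utility: revenue is
# rate x usage per account; a shared monthly base cost is allocated
# proportionally to usage. All figures are illustrative.

accounts = [
    {"id": "A", "rate_per_kwh": 0.15, "usage_kwh": 1000},
    {"id": "B", "rate_per_kwh": 0.10, "usage_kwh": 4000},
]
base_cost = 500.0  # shared monthly cost to allocate across accounts

total_usage = sum(a["usage_kwh"] for a in accounts)

def profitability(account):
    """Revenue minus the usage-proportional share of the base cost."""
    revenue = account["rate_per_kwh"] * account["usage_kwh"]
    allocated_cost = base_cost * account["usage_kwh"] / total_usage
    return revenue - allocated_cost

for a in accounts:
    print(a["id"], round(profitability(a), 2))  # A 50.0, B 0.0
```

Even this toy version shows the point made above: with a proportional allocation, the high-rate, low-usage account A is profitable while the discounted heavy user B only breaks even; a weight-based (non-linear) cost would shift these results further.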

Although a dimensional model can support a lot of analysis, in some cases we have to use multidimensional models, i.e. cubes. Many types of customer analysis, especially those that involve predictive analytics, behaviour recognition, statistical analysis, non-linear estimation, cluster analysis and pattern finding, require data mining running on a multidimensional database, sometimes on more than 1 MDB/cube. Some analyses require building applications that run multidimensional queries on cubes.


Personalisation

What we mean by personalisation is tailoring our web site, products, services, campaigns and offers to a particular customer or group of customers. There are 2 broad categories of personalisation: 1) we ask the customers what their preferences are, or 2) we guess their preferences based on their shopping behaviour, purchase history and demographic attributes. Once we know (or think we know) the customer's preferences, we offer them the products and services which we think suit those preferences. Examples of personalisation are:

  • Price and product alerts, i.e. we let the customer know when something they like appears in our data warehouse. Price alerts are notifications to customers when there are special offers (lower prices) on certain products or services that satisfy their criteria, for example if they would like to fly to certain cities or purchase a certain type of digital camera. Product alerts are notifications to customers when a certain product appears in our database. For example: they declare their favourite singer or musical preferences, and we notify them when an album or single that suits those preferences appears in our database. The basic working principle is matching: on the one hand we have many suppliers supplying us with thousands of products and services every day, and on the other hand we have a lot of customers with certain preferences. All we have to do is match them automatically.
  • Personalised offers, i.e. we offer our customers certain products or services that we think match their needs or profile. There are 2 broad categories for choosing the products or services: a) based on their past purchases (or shopping/browsing if it is an online store), and b) based on the customer attributes. Example of past purchases: because a customer purchased a Canon S300 ink jet printer 3 months ago, they may need a BCI-24 colour ink cartridge today. Example of customer demographic attributes: the customer has a 3-month-old baby, so she may need baby products. For online stores and online services, customers can be identified using a cookie or by asking them to log in, and once identified we can track their shopping behaviour, i.e. which product or service categories they spend a lot of time on, etc.
  • Recommendations, which are basically the same as personalised offers. But this term is normally used when the customer is still shopping on the web site (for online businesses), unlike the term ‘offers’, which is normally used when they are not shopping, i.e. via email or post. A recommendation tends to be targeted at one customer, whereas personalised offers can be targeted at a group of customers that satisfies certain criteria, for example those in a certain age range or living in certain cities.
  • Site personalisation (specific to online businesses), i.e. the web site shows different products and services (and prices) depending on who the customer is. There are 2 methods widely used to identify the customer: login and cookie. Login is the most (if not the only) certain way of identifying who the customer is, i.e. by supplying credentials such as a user ID and password. Serving the same purpose as login are: biometric ID (such as a fingerprint), a challenge-response device (a device that displays a different response number every time it is activated, based on a seed number planted in the device) and a security token (such as a security card). Using cookies is probably 50-60% reliable at best, and never achieves 80% certainty: some people disable cookies in their browser, some install browser plug-ins which block cookies, some regularly clean their temporary Internet files including cookies, and of course some people don’t use their own permanent computer, e.g. an Internet café, a friend’s house, a shared home computer, an office or campus computer, a library PC, etc.

The content for site personalisation may be generated by a CRM system (as XML), by setting up a campaign that is executed once a day. The logic behind this campaign performs data mining on a multidimensional data warehouse or, if we prefer a simpler way, runs rule-based logic stored as metadata against the dimensional data warehouse. These rules are conditional rules, similar to IF … THEN … statements but with a lot of IFs. Price and product alerts do not need a dimensional data warehouse; they can run on a 3rd normal form ODS, or even on the front end CRM system.
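The rules-as-metadata idea above can be sketched as data-driven IF … THEN pairs evaluated against customer attributes. The rule contents and attribute names below are entirely hypothetical:

```python
# Rule-based personalisation sketch: rules are stored as data
# (conditions -> offer) rather than hard-coded, and evaluated
# against a customer's attributes. Rule contents are illustrative.

rules = [
    {"if": {"has_baby": True}, "then": "baby_products_banner"},
    {"if": {"bought_category": "printer"}, "then": "ink_cartridge_offer"},
    {"if": {"segment": "frequent_flyer", "home_city": "Oslo"}, "then": "oslo_flight_deals"},
]

def personalise(customer):
    """Return every offer whose conditions are all satisfied by the customer."""
    offers = []
    for rule in rules:
        if all(customer.get(attr) == value for attr, value in rule["if"].items()):
            offers.append(rule["then"])
    return offers

print(personalise({"has_baby": True, "bought_category": "printer"}))
# ['baby_products_banner', 'ink_cartridge_offer']
```

Because the rules are metadata, marketers can add or retire them without changing code, which is the appeal of this simpler alternative to data mining.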

One piece of logic behind personalised offers (and recommendations) is ‘what similar people are interested in’. ‘Similar people’ can be quite a challenging term to implement. Some of the most popular classification techniques are nearest neighbour, neural networks and classification trees. Nearest neighbour classifies customers based on their position in multidimensional space. Imagine that each dimensional attribute, or each distinguishing factor that contributes to the grouping, is a vector or arrow. The direction of the arrow is determined by the value of the attribute. A customer is defined by joining all the arrows, putting the beginning of the next arrow at the end of the previous arrow. This way a customer is made up of all their dimensional attributes. Customers that are close to each other are classified as “similar”. Close or far is defined by multidimensional (Euclidean) distance, i.e. the square root of the sum of the squares of all components. The difficult thing here is assigning numeric values to the dimensional attributes; as we all know, dimensional attributes are mostly non-numeric. If the dimension has a hierarchy (such as city or location), then the numeric score depends on whether two values share the same parent or grandparent.
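The distance calculation described above is straightforward once the attributes are numerically encoded. A sketch, with entirely hypothetical encodings for the attribute vectors:

```python
# Nearest-neighbour sketch: each customer is a point whose coordinates
# are numerically encoded dimensional attributes; "similar" customers
# are those at small Euclidean distance. Encodings are illustrative.

import math

def distance(a, b):
    """Euclidean distance: square root of the sum of squared component differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# hypothetical encoded attributes: (age / 10, income band, region code)
customers = {
    "C1": (2.5, 3.0, 1.0),
    "C2": (2.6, 3.0, 1.0),
    "C3": (6.0, 1.0, 4.0),
}

def nearest(target, candidates):
    """The candidate closest to the target customer."""
    return min(candidates, key=lambda name: distance(customers[target], customers[name]))

print(nearest("C1", ["C2", "C3"]))  # C2
```

As the paragraph notes, the hard part is not this formula but the encoding: mapping non-numeric attributes (and hierarchy levels such as city/region) onto comparable numeric scales.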

The classification tree method uses a diagram where each branch has 2 sub-branches. At each branch, whether we go to sub-branch 1 or sub-branch 2 depends on the value of an attribute, which is compared to a certain criterion (normally a constant). Starting at the trunk, after following all the branches and sub-branches we arrive at the leaves. Now if we send, say, 1 million customers through these paths, some of them will end up at leaf 1, some at leaf 2, some at leaf 3, etc. The leaves are what we call classes. Customers are said to be “similar” if they are in the same class, or a nearby class, which is defined by the number of levels between them.
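A hand-built two-level tree makes the branch-to-leaf walk concrete. The attributes, thresholds and class names here are hypothetical; in practice the tree would be learned by a mining algorithm rather than written by hand:

```python
# Classification tree sketch: at each branch a single attribute is
# compared against a constant, and the path ends at a leaf class.
# Attributes, thresholds and leaf names are illustrative.

def classify(customer):
    """Walk a hand-built tree: annual spend first, then age or frequency."""
    if customer["annual_spend"] > 1000:
        if customer["age"] < 35:
            return "young_high_spender"
        return "mature_high_spender"
    if customer["orders_per_year"] > 12:
        return "frequent_low_spender"
    return "occasional"

print(classify({"annual_spend": 1500, "age": 28, "orders_per_year": 5}))
# young_high_spender
print(classify({"annual_spend": 200, "age": 50, "orders_per_year": 20}))
# frequent_low_spender
```

Customers landing on the same leaf are treated as "similar" for offer selection, exactly as described above.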

Customer Insight

Customer information is not useful without intelligent analysis. Analysis is always evolving, finding new ways to increase revenue through customer insight. Customer data plays a vital role in building customer insight. Customer insight is a model for viewing the available customer data and analysing customer behaviour over a period of time.

Using a data warehouse one can create a rich customer dimension and use it to create customer insight. A business analyst can analyse the complete customer data set in the following ways:

Customer shopping analysis:
Using historic order data, customer shopping behaviour is analysed. For example, a business analyst can find answers to all of the following questions by analysing customer data:

  • How many times has the customer purchased from us?
  • What is the time gap between two consecutive purchases?
  • What is the purchase pattern?
  • Which product has the customer purchased most?

By answering the above questions, a business user can understand the customer's shopping behaviour and design a future marketing strategy to retain that customer.

Customer permissions analysis:
Permissions play a vital part in defining customer data. Every enterprise has a different set of rules to define these permissions, and they are stored in the data warehouse for future analysis. For example, as described in the permission management section above, a customer/subscriber can subscribe to different communication types over a period of time, and all these historical subscription events are stored in the subscription fact table. A business user can use CRM tools to analyse subscriptions for different subscribers over a period of time.

Customer Loyalty Scheme

A customer loyalty scheme is a way to reward high-value customers and build loyalty among the customer base. Many enterprises use a customer scoring/points-based system to build the loyalty programme. Customers are scored based on their previous shopping behaviour and points are calculated accordingly. Customer scores can be stored in the customer dimension. The CRM system uses these scores to design campaigns and to group customers according to their loyalty points. Different customers are offered different promotions according to their scores.
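A minimal sketch of such a points-based scheme (the point rules and tier thresholds here are invented for illustration; a real scheme would use the enterprise's own rules):

```python
def loyalty_points(orders):
    """1 point per whole currency unit spent, plus a 50-point bonus
    for each order over 100 (hypothetical rules)."""
    return sum(int(amount) + (50 if amount > 100 else 0) for amount in orders)

def tier(points):
    """Group customers by their loyalty points (hypothetical thresholds)."""
    if points >= 1000:
        return "gold"
    if points >= 500:
        return "silver"
    return "bronze"

order_history = {"alice": [600.0, 120.0], "bob": [40.0, 60.0]}
scores = {cust: loyalty_points(orders) for cust, orders in order_history.items()}
print({cust: (pts, tier(pts)) for cust, pts in scores.items()})
# {'alice': (820, 'silver'), 'bob': (100, 'bronze')}
```

The computed score and tier would then be stored as attributes on the customer dimension, ready for the CRM system to use in campaign segmentation.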

Customer Support

Customer support is one of the most important aspects of CRM. Many companies use various CRM tools to build customer support systems. A support system helps solve customer queries, provide promotional updates, and so on; for example, a customer calls the call centre to update their billing address or phone number.

Many companies use an ODS to store the latest customer data, to provide a quick and efficient search capability for fetching customer information. The underlying ODS database can be populated from the OLTP databases or (in rare cases) from the data warehouse for the latest customer information. Many CRM vendors provide tools and techniques to transfer data between the ODS and the data warehouse. The ODS can be populated from the data warehouse/data marts for customer-specific data which is not persisted in the OLTP databases.

Below we illustrate which tool is best for serving the various CRM functionalities: the data warehouse (DW), the online transaction processing system (OLTP, i.e. the front office application) or the Operational Data Store (ODS, the back office integrated operational database in third normal form). If a cell does not have a ‘Yes’ in it, that does not mean the tool cannot perform the functionality. It may be possible, but it is not the best tool for that purpose.

Functionality OLTP DW ODS
Single Customer View
– Subscribers
– Bookers
– Registered users
– Customer matching
Yes Yes
Permission management
– Subscription based
– Tactical campaigns
– ISP feedback loop
– Communication preferences
Yes Yes Yes
Campaign Segmentation
– Order data
– Demographic data
– Campaign delivery
– Campaign response
– Customer loyalty score
– Customer profitability
Yes Yes
Campaign Content
– Promotional Offers
– Routine Newsletter
– Purchaser Lifecycle
– Subscriber Lifecycle
– Cross-selling
Campaign Results
– Delivery rates
– Open rates
– Click through rates
– Conversion rates
Yes Yes
Customer Support
– Complaint Handling
– Cross selling
– Pre-consumption support
– Consumption support
– Emergency support
– Post consumption
Yes Yes
Customer Analysis
– Purchase pattern
– Price sensitivity analysis
– Shopping behaviour
– Customer attrition analysis
– Customer profitability analysis
– Fraud detection
– Alerts/Notification
– Special Offers
– Recommendations
Yes Yes
Customer Loyalty Scheme
– Scheme Administration
– Customer scoring
– Classification
– Satisfaction survey
Yes Yes
Order processing
– Quotation
– Registration
– Custom pricing
– Placing orders
– Contract management
– Order confirmation
– Invoicing
– Payments
– Refunds
– Arrears
– Account management

Vincent Rainardi and Amar Gaddam
28th December 2006

(This is a repost from SQLServerCentral)

27 January 2010

Parallel Data Warehouse (what a clever name)

Filed under: Data Warehousing,SQL Server — Vincent Rainardi @ 8:20 pm

Parallel Data Warehouse is the formal name given by Microsoft to Project Madison. I have been waiting for this for a long time. After doing SMP warehousing in SQL Server, I did MPP warehousing on Teradata for a while, and it wowed me. The performance, I mean: both loading performance and query performance. It’s incomparable to SQL Server. I also admired the way it worked. I got infected by two colleagues who worked for Travelocity at the time. They were so enthusiastic about MPP technology and about data warehousing. At that time I was thinking that Oracle and Microsoft must be planning to get into this MPP DW market. Then, when I was doing SQL Server warehousing again, I came across Netezza. It didn’t really wow me this time, as I already knew how MPP worked. Basically Netezza works the same way, just a different variant. At a lower price.

Then about 2 years ago I heard that Microsoft had acquired Datallegro. And then I read about Professor David Dewitt leading the work on the parallel query optimisation, under the name “Project Madison”, as the main work is at the Jim Gray lab in Madison, Wisconsin. In the database world, David Dewitt is well known; see this work with Jim Gray, this and this. He has implemented three parallel database systems: DIRECT, Gamma and Paradise. At that time I read that the project was essentially replacing Ingres with SQL Server, and Linux with Windows. But after learning that Prof Dewitt was involved, I had high hopes for Project Madison. I was hoping that:

  1. The build is solid, especially the internal network between nodes and the parsing engine (data distribution mechanism/hashing, query/workload distribution, distinct count handling, etc). By the word “solid” I mean the fault tolerance capability e.g. if the primary node fails then the data is read from or written to the secondary node, etc.
  2. The performance is linear, and in the same league as Teradata. The query & loading performance of an MPP DW depends on the performance of its nodes. But to a large degree, the way the nodes collaborate (shredding and distributing queries and combining the results) affects the linearity of the overall system.

Last month at SQLBits I attended Thomas Kejser’s session on Project Madison/R2’s Parallel Data Warehouse, and I was disappointed. I thought I would be able to see the details of how it works. But no, sadly it was not revealed. I was looking for a) the DDL syntax, the ‘create table’ statement, how and where we place data and indexes, and b) how it works technically. I know how MPP works in general, so I was looking for how Microsoft’s version works. I admire Thomas for his technical knowledge and his work at SQL CAT. I have learned many things from him, DW loading being one of them, SSAS processing phases being another.

After Thomas’ session I met Anthony Howcroft, who spoke at the next session (on the Fast Track). He was with Datallegro prior to the acquisition. I wish him the very best of luck marketing PDW in the European MPP market. I believe that at Microsoft’s price point (which I expect to be lower than Netezza’s) it may be able to take off. I believe the price barrier is one determining factor why Teradata has only 30-35 customer companies in the UK (that was what I heard 3 years ago; not sure how many now). I believe that many companies (banks, insurers, healthcare, retail) need MPP/PDW, and there could be hundreds of them. Definitely not 35.

I must admit PDW is a good name. Outside the computer science world, the term “MPP” is not that popular. Say them slowly and hear it for yourself: Parallel Data Warehouse. Massively Parallel Processing. The former sounds familiar and is easily understood. The latter sounds laboratory-like and academic. Naming a product after what the product does, I must admit, is a very clever idea. “Parallel Data Warehouse”, what a clever name.

If the PDW market is hundreds in the UK (say 500) and thousands in Europe (say 4,000), then we have hope that it is an attractive market. The skills market, I mean; the job market. Worldwide we could be looking at 20,000 companies using PDW. Not SQL Server PDW specifically; I mean MPP DW. I’m not going to use the term ‘MPP DW’ any more; I’m going to refer to MPP DW as PDW, and I will refer to the Microsoft version as SQL Server PDW. See, it is very clever marketing, isn’t it, calling the new product Parallel Data Warehouse? So, yes, 20,000 worldwide. A good job market? Not sure if it’s good, but it’s certainly not a narrow one. Teradata was marketed as ‘no DBA required’, ‘save money on human resources’. I wonder what Microsoft’s strategy is. Is it the same as Teradata’s: SQL Server PDW is easy, you don’t need a DBA? That is inherently a dangerous statement, isn’t it? To the SQL DBA market, I mean. And not getting support from the SQL DBA market worldwide would be suicide for SQL Server PDW. So no, that wouldn’t be a good strategy.

I am very much hoping (and predicting) that Microsoft will go the other way: market SQL Server PDW as ‘a step up for SQL DBAs’ kind of thing. Make PDW the best friend of SQL DBAs. Create the exam. Be certified. This is the daddy of all databases. Come this way and get better pay, etc. Data warehouse designers and developers, too, will find it attractive. With the staggered approach into the big DW market (the Fast Track on level 1, SQL Server PDW on level 2), supported by the concept of hub-and-spoke, SQL Server R2 Parallel Data Warehouse is in a very good position to gain significant market share in the parallel data warehouse market, competing against Exadata, Netezza, Teradata, Greenplum and Neoview.

6 January 2010

Taking a cube offline

Filed under: Analysis Services — Vincent Rainardi @ 8:32 pm

How do you hide a cube? How do you take a cube offline, making it hidden or invisible so users can’t access it? I had never come across this question before. My cubes are always accessible to the users. Why would I want to take one offline? Well, read on.

Sometimes a data warehouse may contain wrong data, for example after a release is deployed to production. On that particular day, rather than letting the users consume incorrect data, it is better to close the BI systems, meaning that we don’t allow any users to access the warehouse database, any SSRS report, or any cube.

But how do you restrict access to a cube (SSAS DB)? The answer, of course, is by modifying the security. For each role we can specify whether the role can access a particular cube or not. But we could have many roles, for example because of dimensional or cell security. If you have, say, 100 roles, it could take 2 hours to change them all. So how do you prevent the users from accessing the cube?

Of course you could just stop the SSAS service. But if you have several SSAS DBs on that server, you don’t want to stop the service, because you want users to be able to access the other cubes.

So, what could you do to take the cube offline? Delete the SSAS DB? But how do you put the DB back after the data warehouse is fixed? Yes, you could back up the DB and restore it later. This way nobody can access the cube, because the SSAS DB is physically deleted. But if the SSAS DB is very large, it takes a long time to back up. Backing up a 10 GB SSAS DB could take 15 minutes. If the cube produces wrong numbers, you don’t want users consuming incorrect data; it could be fatal for the business. You want to take the cube offline immediately. You don’t have 15 minutes. You need to do it in 1 minute.

But how? Is there a property called ‘visible’ for a cube? A property that you can easily set using SSMS? Yes, we do have a ‘visible’ property for a cube, but it’s in BIDS, not in SSMS. So you can’t make a cube hidden without deploying it first. To use this ‘make a cube visible or hidden’ property you need to either a) import the DB into a BIDS project, modify the attribute and redeploy to the server, or b) script the DB to XMLA, modify the cube property, execute the modified script and reprocess the cube. But do you need to reprocess?

No, you don’t need to reprocess the cube. You don’t want anybody to access the cube, so just delete the cube. But before deleting the cube, import it into a BIDS project first (or script it to XMLA). This is quick, probably about 1 minute. When the DW is corrected, you deploy from BIDS to the server (or execute the modified XMLA) and reprocess the cube.

There is another option: detach. It works like in a relational DB. An SSAS DB is physically a folder on the server called DatabaseName.0.db. Inside this folder there are many files: dimensions, partitions, etc. This folder is located under the \data folder. ‘Detaching’ an SSAS DB means disconnecting this folder from the SSAS engine, so it’s a quick operation: close all open files, flush the memory and away you go. Detaching a 5 GB cube only takes a few seconds.

So how do you put it back? Attach, of course. ‘Attaching’ an SSAS DB means presenting/connecting this folder to the SSAS engine. Does it take a long time? No. Attaching a 5 GB cube takes about 30 seconds.

Still, it would be better if we had a “hidden mode” or “visible mode” in SSAS, like the “single user mode” and “multiple user mode” in SQL Server.
SSAS 2011 that is 🙂

5 January 2010

Many-to-many in attribute relationship

Filed under: Analysis Services — Vincent Rainardi @ 7:58 am

In an SSAS cube there is a potential for hierarchies not to work correctly because of attribute relationships. Say in the cube we have a hierarchy consisting of:

  • Top level: attribute1
  • 2nd level: attribute2

We also have an attribute relationship that links attribute1 to attribute2. If you click + on attribute2, you can see attribute1 appears ‘under’ attribute2.

And the data in the dim table is as follows:
select attribute2, attribute1, count(*) from dim1 group by attribute2, attribute1

Attribute2 Attribute1 Count
Member1 Member A 600
Member1 Member B 500
Member2 Member A 200
Member2 Member B 300

Because of the attribute relationship, the value of attribute1 is determined based on attribute2, like this:

Attribute2 Attribute1 Count
Member1 Member A 1100
Member2 Member B 500

Basically, if we create an attribute relationship from attribute2 to attribute1, we tell SSAS to determine the value of attribute1 based on the value of attribute2. I’m not sure how SSAS determines the value of Attribute1 for Member1 and Member2, i.e. whether it’s Member A or Member B. In my experience it can be either (but not both).

I experience this most when cubes are upgraded from 2005 to 2008. But this is actually not an upgrade problem; it’s an attribute relationship issue: we shouldn’t create the attribute relationship if it’s many-to-many. Member1 can belong to Member A, and it can also belong to Member B. That’s many-to-many. We should create an attribute relationship only when it’s many-to-one: Member1 will always belong to Member A. And this principle applies to both 2005 and 2008.

Flexible or rigid doesn’t come into it. This many-to-many versus many-to-one issue has nothing to do with the attribute relationship being flexible or rigid. If Member1 will always belong to Member A (like January belongs to Q1), then it’s rigid. But if Member1 belongs to Member A now but can later belong to Member B, then it’s flexible.

But how can we create a hierarchy if it’s many-to-many? The simple answer is: you can’t. Without a hierarchy, users can still list attribute1 and attribute2 side by side on the rows, which will reveal their relationship. So on the attribute relationship tab (in 2008), if you have a many-to-many relationship, just link both attributes to the key attribute.
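Before creating the relationship, it is worth checking the data. Here is a quick illustrative check (a sketch, not an SSAS feature) of whether attribute2 to attribute1 really is many-to-one, run against (attribute2, attribute1) pairs extracted from the dimension table:

```python
def is_many_to_one(rows):
    """rows: (attribute2, attribute1) pairs from the dimension table.
    The attribute relationship is safe to create only if every
    attribute2 member maps to exactly one attribute1 member."""
    mapping = {}
    for attr2, attr1 in rows:
        mapping.setdefault(attr2, set()).add(attr1)
    return all(len(parents) == 1 for parents in mapping.values())

# The table from the article: Member1 and Member2 each map to both
# Member A and Member B, so this is many-to-many.
rows = [("Member1", "Member A"), ("Member1", "Member B"),
        ("Member2", "Member A"), ("Member2", "Member B")]
print(is_many_to_one(rows))  # False

# January always rolls up to Q1: many-to-one, safe for the relationship.
print(is_many_to_one([("January", "Q1"), ("February", "Q1")]))  # True
```

The same check can of course be done in SQL by grouping on attribute2 and counting distinct attribute1 values, as in the article's `group by` query above.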

4 January 2010

Comparing Excel 2007 and ProClarity

Filed under: Business Intelligence — Vincent Rainardi @ 9:03 pm

I use Excel 2007 and ProClarity to browse SSAS cubes. This is what I experienced.

  1. In ProClarity you can put measures on the background so the user can pick which measure they want to use. In Excel 2007 you can’t; you can only put measures on the column or row.
  2. If an attribute has 5 members, in ProClarity you can select [member3, member4 and All]. In Excel 2007 you can’t; you can only select All, or select several members, but not both.
  3. If an attribute has 100,000 members, in ProClarity you can search for a member, to display only a certain member in the row. In Excel 2007 you can’t; you need to display all 100,000 members, then deselect All and select the one you want to display. Or, highlight several members, right click, select Filter, choose Keep Only Selected Item. But you can’t search without first displaying every single member, which can take a long time.
  4. In ProClarity you can create a set of members. You can either pick the members manually, or define criteria for inclusion/exclusion using Selector. In Excel 2007 you can create a group containing several members. You can only pick the members manually. You can’t define criteria for inclusion/exclusion. Note on performance: creating a set containing 5 members out of 50 by picking them manually takes 1 second in ProClarity. In Excel 2007 it takes 55 seconds.
  5. In both ProClarity and Excel 2007 you can link to a URL (action). However, I found that if we need to log in to reach that URL, Excel 2007 failed to pass the parameter to the URL, so it landed on the front page of the app. Whereas in ProClarity, after logging in it opens the appropriate page.
    Note: after using ProClarity, if we use Excel 2007 again, Excel will open the correct page.
  6. In SSAS we can attach an action to a measure: target type = Hierarchy members, target object = measures. For example, the action expression could point to a URL explaining the meaning of the measure. In ProClarity, we can right-click the measure name and select the action. In Excel 2007 you can’t do this; the action is greyed out.
    Note: If the action is attached to Dimension members, Attribute members, or Hierarchy members, it works on both Excel 2007 and ProClarity.
  7. In ProClarity you can drilldown: right click on a member, select Drill Down, and then select the attribute. In Excel you can’t. You need to manually move the attribute to background, set the background attribute to a certain member, and then put the attribute you want to drill down on the row or column.
  8. In ProClarity you can create a calculated measure, such as ratio and percentage difference (A/B and (A-B)/A). Date-based calculation is also available, such as year-on-year growth, rolling average and year-to-date totals. We can then use this measure just like a normal measure, e.g. we can drill down on any attribute. In Excel 2007 we can’t define a calculated measure. If you format the area as a table, and define a new column on the right of the pivot table, the column will just be calculating the cells. It is not a measure. You can’t slice it.
  9. In ProClarity there are no row headers. If you put 5 attributes on the Rows: order number, date ordered, date delivered, date invoiced, date paid, you will have no idea which dates are in which columns. Viewing it in ProClarity is not too bad, as you can see the attribute names on the Rows. But when you export the page to Excel, there is no indication at all of which attributes were put on the Rows. Column attributes are fine, by the way.

    In ProClarity, ‘Flattened Row Headers’ (in the Grid Properties) has nothing to do with row headers. It is for when you put a hierarchy (rather than an attribute) on the Rows and you want the values displayed not as a tree, but left-justified.
    In Excel we do have row headers, so we can clearly see which attribute is in which column.

1 January 2010

SSAS DMV (Dynamic Management View)

Filed under: Analysis Services — Vincent Rainardi @ 12:16 am

What are SSAS DMVs?

SSAS Dynamic Management Views are Analysis Services schema rowsets exposed as tables that we can query using a T-SQL SELECT statement. For example, to query the discover_connections DMV, we can write:
select * from $system.discover_connections

DMVs are more convenient to use than schema rowsets because:
a)    we use SELECT statement rather than XMLA
b)    the output is a table rather than XML

Not all DMVs are schema rowsets. The DMVs which expose cube and dimension data are not schema rowsets. For example, to get the members of product dimension we can write:
select * from AW.[$Dim Product].[$Dim Product] where Color = 'Red'

Even though they are T-SQL statements, in the Management Studio we write these SSAS DMV SELECT statements in the DMX query window, not in the SQL query window. This is because we need to connect to SSAS, rather than SQL Server relational engine. The SSAS DMVs are available in SSAS 2008 and 2008 R2; they are not available in SSAS 2005 and 2000.

What are they for?

SSAS DMVs can be used to monitor server resources, such as:

  • who are connecting to the SSAS server
  • which database/cube they are using
  • what MDX they are running
  • how much memory and CPU is used by each command and each user
  • how long they took to execute, and when did they run
  • which aggregates are used/hit and which are missed
  • which objects are locked and by whom
  • how much memory and CPU is used by each dimension and each measure
  • how much memory is used by each data cache and objects of SSAS engine
  • which cubes, dimensions, partitions and measures are used most and which are rarely used (how many times they are read, how many rows returned, how many KB read)

SSAS DMVs can also be used to find out the structure of SSAS databases in the server:

  • List of dimensions, hierarchies, measures, measure groups, data sources, cubes, actions and KPIs in each SSAS database
  • Members of every hierarchy and level in every cube
  • List of MDX functions, with their descriptions and parameters
  • List of mining structures, models, and algorithms.

How do I use them?

dbschema_tables lists all the DMVs:
select * from $system.dbschema_tables

discover_connections lists who’s connecting to the server, since when, for how long, from which PC (IP) and using what client:
select * from $system.discover_connections

discover_sessions and discover_commands list the MDX/commands used by each user, which SSAS database/cube they are using, and how much CPU and memory is used by each command:
select * from $system.discover_sessions
select * from $system.discover_commands

We can join discover_connections and discover_sessions on Connection_ID column. And we can join discover_sessions and discover_commands on Session_SPID column. To join them we can:

  • Create a linked server from a SQL Server to the SSAS server, then use OpenQuery. See Midas Matelis post here.
  • Use SSIS to query the 3 DMVs separately and then join them in SSIS using the Merge Join transformation. See Chris Webb post here.
  • Use ADOMD to query the DMV on the SSAS server, and then join them using DataSet. See my post here.
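As a sketch of the third option (the ADOMD call itself is omitted; the rows below are hypothetical result sets already fetched from the three DMVs, and the exact column names are illustrative assumptions), the two joins can be done in plain code like this:

```python
# Hypothetical rows already fetched from the three DMVs.
connections = [{"Connection_ID": 1, "Connection_User_Name": "alice"}]
sessions = [{"Session_Connection_ID": 1, "Session_SPID": 51,
             "Session_Current_Database": "AW"}]
commands = [{"Session_SPID": 51, "Command_Text": "SELECT ..."}]

def join_dmvs(connections, sessions, commands):
    """Join connections to sessions on Connection_ID, then sessions
    to commands on Session_SPID (both as inner joins)."""
    by_conn = {c["Connection_ID"]: c for c in connections}
    by_spid = {s["Session_SPID"]: s for s in sessions}
    joined = []
    for cmd in commands:
        sess = by_spid.get(cmd["Session_SPID"])
        if sess is None:
            continue
        conn = by_conn.get(sess["Session_Connection_ID"])
        if conn is None:
            continue
        joined.append({**conn, **sess, **cmd})
    return joined

for row in join_dmvs(connections, sessions, commands):
    print(row["Connection_User_Name"], row["Session_Current_Database"])
```

This is essentially what the linked-server and SSIS Merge Join options do for you; doing it in code is only worthwhile if you are already fetching the rowsets programmatically.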

discover_object_memory_usage and discover_object_activity list the memory usage and CPU time for each object, i.e. cube, dimension, cache, measure, partition, etc. They also show which aggregations were hit or missed, how many times these objects were read, and how many rows were returned by them:
select * from $system.discover_object_memory_usage
select * from $system.discover_object_activity

The following DMVs provide information on the cube structure:
select * from $system.mdschema_cubes
select * from $system.mdschema_dimensions
select * from $system.mdschema_hierarchies
select * from $system.mdschema_levels
select * from $system.mdschema_measuregroups
select * from $system.mdschema_measuregroup_dimensions
select * from $system.mdschema_measures
select * from $system.mdschema_properties
select * from $system.mdschema_members
select * from $system.mdschema_sets
select * from $system.mdschema_kpis
select * from $system.mdschema_input_datasources

We can use where clause and order by:
select * from $system.discover_object_memory_usage where object_parent_path = 'Global.ObjectPool' order by object_id desc

We can use ‘and’, ‘or’ and ‘not’. But there is no ‘like’, ‘<>’ (not equal to), ‘in’ or ‘join’:
select * from AW.[$Dim Product].[$Dim Product] where Color = 'Blue' or Color = 'Silver'
select * from $system.discover_object_activity where not object_rows_returned > 0

In 2008 R2 we can use ‘<>’ (not equal to):
select * from $system.dbschema_tables where table_catalog <> 'Adventure Works DW'

We don’t have ‘like’, so use left or right instead. Remember we can use ‘not’:
select table_schema from $system.dbschema_tables where table_catalog = 'Adventure Works DW' and left(table_schema,1) = '$'
select * from … where … and not left(table_schema,1) = '$'

We can use distinct:
select distinct object_parent_path from $system.discover_object_activity

On some DMVs we need to use SystemRestrictSchema. For example, if we write:
select * from $system.discover_partition_stat
SSAS will respond with:
The 'CUBE_NAME' restriction is required but is missing from the request. Consider using SYSTEMRESTRICTSCHEMA to provide restrictions.

This is because discover_partition_stat requires 4 restrictions: database_name, cube_name, measure_group_name and partition_name. In other words, we need to specify these 4 parameters. So we need to write:
select * from SystemRestrictSchema($system.discover_partition_stat, CUBE_NAME = 'Adventure Works', DATABASE_NAME = 'Adventure Works DW', MEASURE_GROUP_NAME = 'Internet Orders', PARTITION_NAME = 'Internet_Orders_2004')

Another example on restriction:
select * from SystemRestrictSchema($system.discover_performance_counters, PERF_COUNTER_NAME = '\MSAS 2008:Connection\Current connections', PERF_COUNTER_NAME = '\MSAS 2008:MDX\Total Cells Calculated')

If we have a SSAS database with 1 cube, 1 dimension and 1 measure group, when we execute: “select * from $system.dbschema_tables”, at the top of the list we will see 4 rows with table_catalog = ‘DatabaseName’:

No table_catalog table_schema table_name table_type
1 DB Name $Dim1 Dim1MG system_table
2 DB Name $Dim1 $Dim1 table
3 DB Name Cube1 MG1 system_table
4 DB Name Cube1 $Dim1 table

Each of these 4 tables is a DMV that we can query using SELECT statement.

  • No 2 is a database dimension
  • No 4 is cube dimension (a database dimension that is used in a cube).
  • No 3 is the measure group.
  • No 1 is the ‘dimension cube’. For every dimension, SSAS creates a cube, called the ‘dimension cube’. It contains only 1 dimension and no measures. This cube can’t be accessed from the client; it is only available internally within the SSAS engine.

So to query these 4 DMVs, we need to fully qualify the table name, i.e. table_catalog.table_schema.table_name:

#1 select * from [DB Name].[$Dim1].[$Dim1]  --database dimension
#2 select * from [DB Name].[Cube1].[$Dim1]  --cube dimension
#3 select * from [DB Name].[Cube1].[MG1]    --measure group
#4 select * from [DB Name].[$Dim1].[Dim1MG] --dimension cube

#1 returns the members of Dim1.
#2 also returns the members of Dim1.
#3 returns the fact rows for measure group1.
#4 returns nothing

DMV on Books Online

The only mention of DMVs in SQL Server Books Online is on the Data Mining Schema Rowsets page of the November 2009 edition:

“In SQL Server 2008, the data mining schema rowsets are also exposed as tables in the Transact-SQL language, in the $SYSTEM schema. For example, the following query on an Analysis Services instance returns a list of the schemas that are available on the current instance: SELECT * FROM [$system].[DBSCHEMA_TABLES]”

The Books Online explains every column of the 46 schema rowsets in SSAS: 16 XMLA, 4 OLE DB, 16 OLE DB for OLAP and 10 Data Mining. However, there are 54 DMVs in 2008. The following 12 discover_* DMVs are not mentioned in Nov 2009 SQL Server Books Online’s schema rowsets documentation: traces, trace_definition_providerinfo, trace_columns, trace_event_categories, memoryusage, memorygrant, transactions, master_key, performance_counters, partition_dimension_stat, partition_stat, dimension_stat. But, they are mentioned in the SSAS Protocol Specification section, about Discover Request Types.

There are 3 schema rowsets which are not exposed as DMVs: mdschema_actions, discover_datasources and discover_xml_metadata. To use them we still need to use XMLA. For those of you wondering why the numbers don’t tally (54 <> 46 + 12 − 3): that’s because there is 1 duplicate: discover_keyword is available both as XMLA and as OLE DB for OLAP.

In 2008 R2 (I’m using the November CTP), there are 3 new DMVs: discover_storage_tables, discover_storage_table_columns, and discover_storage_tables_column_segments. Again, they are not mentioned in the SQL Server Books Online schema rowsets documentation, but they are mentioned in the SSAS Protocol Specification section.
