Data Warehousing, BI and Data Science

5 October 2018

Junk Dimension Revisited

Filed under: Data Warehousing — Vincent Rainardi @ 5:27 am

If the data we get from the source system consists of many columns, and 11 of those columns are like this:

  • Small, medium, large (2 columns)
  • High, medium, low (2 columns)
  • Red, green, blue, brown (1 column)
  • Red, amber, green (2 columns)
  • Yes, no (2 columns)
  • Long, short (2 columns)

1. Source Row

How should we model them in a dimensional model?

The answer is of course a junk dimension. If the above data had not 5 rows but 5000 rows, the number of distinct combinations of values would probably be only about 200. That is the power of the junk dimension: it puts all these attributes into one table, in a compact form.


One of the challenges of implementing a junk dimension is the lookup: checking whether the row coming from the source data already exists in the dimension table. If it is only 11 columns like above, we can compare the source row and the dimension row directly, column by column. But if the number of columns is 100, typing them all out becomes cumbersome.

One common solution is to use hashing. We hash the 100 source columns and compare the result with the hash column in the dimension, which was computed when the row was inserted. If we find a match, the row already exists in the dimension table.

There are a few functions in SQL Server which can do this. One of them is Checksum, but it is possible for two different inputs to generate the same checksum output. A better function is Hashbytes, which has a much lower probability of producing the same output for different inputs, particularly if we use it with the MD5 algorithm, as we can read on Books Online and on SQL Server Central (link).

With Hashbytes we can choose different algorithms: MD2, MD4, MD5, SHA, SHA1 and SHA2. Note that from SQL Server 2016 only SHA2_256 and SHA2_512 are recommended; all other algorithms are deprecated. So I recommend using SHA2_512 rather than MD5, as it has a far lower chance of producing the same output value for different input values.
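As a minimal sketch of the lookup, with hypothetical table and column names (StagingSource, DimJunk, RowHash), the comparison could look like this. The '|' delimiter prevents two different column combinations from concatenating to the same string.

```sql
-- Hash the incoming attributes with SHA2_512 and compare against the hash
-- stored in the junk dimension to find rows that are not there yet.
SELECT s.*
FROM StagingSource s
LEFT JOIN DimJunk d
    ON d.RowHash = HASHBYTES('SHA2_512',
         CONCAT(s.Size, '|', s.Priority, '|', s.Colour, '|', s.Status))
WHERE d.RowHash IS NULL;  -- source rows not yet in the dimension
```

The rows returned are the new attribute combinations that need to be inserted into the dimension, together with their computed hash.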

Impossible To Have Duplicates

Having the same output value for two different input values is called a collision. MD5 is very weak: it takes only a few seconds on a normal computer to produce a collision (link).

SHA1 is the second strongest: we need to try about 10^19 different inputs before a collision occurs, which takes about 100 years on a server with 64 cores. Even so, last year Google and CWI Amsterdam managed to produce a SHA1 collision (link).

With SHA2 at 64 bytes (512 bits) it is practically impossible to produce a duplicate/collision: we would need to try about 10^77 different inputs before a collision occurs, which on a server with 64 cores would take about 5800 years.

Warning: Functions Are Slow

One warning when using functions in SQL Server (any function, including hashing functions): a function always performs a row-by-row operation, or RBAR as Jeff Moden calls it, which stands for Row By Agonising Row. In relational databases, be it Oracle, SQL Server or anything else, when dealing with a high number of rows we should always perform a set-based operation, not a row-by-row operation. If the incoming data from the source is over 1 million rows, hashing them will take time, particularly if we use a complex algorithm such as MD5.

First Character Technique

One alternative is to take the first character of every column and use the result as the business key for the lookup, like this:

2. First Character

By comparing the First Character column between the incoming source row and the one stored in the dimension table, we can check if the row already exists or not.

The advantage of this technique is that the mapping is transparent and under our control: unlike a hash, we can see at a glance which value each character represents, and we can guarantee that two different inputs never produce the same output.

But (as always, there is a but) if one column has two values with the same first character, like below, we need to differentiate them:

3. Same first character

In the above example the Status column has two values whose first character is O, i.e. Outstanding and On Hold. So we need to differentiate them, for example by mapping On Hold to H.

To implement that we will need either a CASE expression or a lookup table. For example: case when Status = ‘On Hold’ then ‘H’ else Left(Status,1) end.

If the upper and lower case in the source data is consistent, we don’t need to convert the first character to upper case. But if it is inconsistent then we do. For example, if the Status column contains both “Active” and “active”, we need to convert both to “A” using the UPPER() function.
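Putting those pieces together, a sketch of the first character key (with hypothetical column names) could look like this:

```sql
-- 'On Hold' is mapped to H so it does not clash with 'Outstanding', and
-- UPPER() guards against inconsistent casing in the source data.
SELECT CONCAT(
         UPPER(LEFT(Size, 1)),
         UPPER(LEFT(Priority, 1)),
         CASE WHEN Status = 'On Hold' THEN 'H'
              ELSE UPPER(LEFT(Status, 1))
         END
       ) AS FirstCharKey
FROM StagingSource;
```

The same expression is used on both the incoming source rows and the dimension load, so the keys always match.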

Dealing With Null

If we use Hashbytes, NULL values need to be converted to something like “Unknown”, “1/1/1900” or -1, depending on the data type. Otherwise it won’t work: concatenating a NULL column makes the whole input NULL, and the Hashbytes of NULL is NULL. Alternatively, we can convert each NULL to a default value of the appropriate data type before feeding it into the Hashbytes function.
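A sketch of the NULL handling, with hypothetical column names:

```sql
-- Without ISNULL, one NULL column makes the whole '+' concatenation NULL,
-- and Hashbytes of NULL is NULL. Each default matches the column's type.
SELECT HASHBYTES('SHA2_512',
         ISNULL(Colour, 'Unknown') + '|' +
         ISNULL(CONVERT(varchar(8), EffectiveDate, 112), '19000101') + '|' +
         ISNULL(CONVERT(varchar(10), Quantity), '-1')
       ) AS RowHash
FROM StagingSource;
```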

The advantage of Checksum is that it can deal with NULL: we don’t have to convert NULL values to something else. The other advantage is that Checksum is faster than Hashbytes with SHA2_512.

Data Type

Varchar and Nvarchar produce different outputs, even though the input text is the same, because the underlying byte streams differ. Similarly, Char and Nchar produce different outputs.

Varchar(100) and Varchar(200) produce the same output if the input text is the same, so the declared length of the varchar does not affect the output.

If the collations are different, the outputs are different.
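For example, the same text hashed as varchar and as nvarchar gives two different hash values, because nvarchar is stored as UTF-16 so the byte stream differs:

```sql
-- Same input text, different data types, different hash outputs.
SELECT HASHBYTES('SHA2_512', CAST('Active' AS varchar(20)))  AS VarcharHash,
       HASHBYTES('SHA2_512', CAST('Active' AS nvarchar(20))) AS NvarcharHash;
```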

We need to remember all the above points if, later on, we want to change the data type of an attribute column, either in staging or in the dimension, or change the collation of the database: the stored hashes will no longer match the newly computed ones.

8000 bytes limitation

The main limitation of the Hashbytes function is that it can only accept a maximum of 8000 bytes (this input limit was removed in SQL Server 2016). So on earlier versions, if the total size of all the input columns is more than 8000 bytes, we need to split the input columns into two or three groups and hash each group separately.
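A sketch of the splitting, with hypothetical column names (in practice list all the columns in each group):

```sql
-- Two hash columns, each covering roughly half of the attribute columns,
-- keep every Hashbytes input under the 8000 byte limit.
SELECT HASHBYTES('SHA2_512', Col1  + '|' + Col2  /* ... + Col50  */) AS RowHash1,
       HASHBYTES('SHA2_512', Col51 + '|' + Col52 /* ... + Col100 */) AS RowHash2
FROM StagingSource;
```

The lookup then compares both hash columns; a row matches only if both hashes match.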

Hashbytes Performance

So how fast or slow is the Hashbytes function? A table with 49 columns and 1.5 million rows takes about 44 seconds with MD5 (link).

A table with one varchar(10) column and 11 million rows takes about 17 seconds with SHA2_256 (link).

Obviously it also depends on how many CPUs the server has.

Back to Junk Dimension

So if a column has many distinct values (say 100) then we can’t use the first character technique, because there will be many duplicates. In this case Hashbytes is better, because it can take thousands of distinct values, and many columns too.

Yes/No Columns

If the source data contains many boolean columns (say 100 columns, all True/False or Yes/No), then we can either put them in the fact table as degenerate dimension columns, or create a junk dimension. I would advise creating a junk dimension, because it declutters the fact table, because those columns are attributes and so belong in a dimension table, and because from a space point of view the junk dimension solution is much more compact, particularly if the fact table is a snapshot fact table.


25 September 2018

Data Warehouse for Asset Management

Filed under: Data Warehousing — Vincent Rainardi @ 6:50 am

A. Brief Summary

A.1. Fact Tables

The main fact tables in an asset management warehouse are position, performance, risk and transaction. In the next 4 paragraphs I’ll briefly explain what they are. Later in the article I will explain them in more detail (or in a separate article).

The position fact table (aka valuation fact table) contains the daily holdings and benchmark constituents for every portfolio, both security/instrument positions and cash positions. The main issue with this fact table is the matching between holding rows and benchmark rows. We have the option of not attaching the benchmark, which makes it flexible, particularly for portfolios having more than one benchmark. The position fact table is mainly used to provide portfolio breakdowns and top N holdings. The other issue is lookthrough.

The performance fact table contains the performance returns of every portfolio compared with its benchmarks, for many different periods e.g. 3M, 6M, 1Y, 3Y. There is also another type of performance fact table, which contains the performance attribution to different factors such as yield curve, sector allocation and security selection. Performance can also be allocated to different sectors, currencies and asset classes, each compared to the benchmark.

Because of the non-additive nature of risk numbers, there are many risk fact tables. They mainly hold the risk sensitivities such as interest rate sensitivities (PV01, DV01), inflation sensitivities (IE01), credit rate sensitivities (CR01) and exchange rate sensitivities. The risk fact table also stores the VaR (value at risk) data at portfolio level and at “carved-up” level (for example by industry sector, currency and asset class).

The transaction fact table stores every buy, sell and corporate action in every portfolio, including the fees and charges. The transaction status, which changes at every stage on different dates, can be stored in an accumulated snapshot fact table, whereas the trades themselves can be stored in a transaction fact table.

A.2. Dimensions

In an asset/investment management DW, the following dimensions are commonly found:

  1. Security dimension (aka Instrument dimension or Asset dimension) which contains 3 types of securities: held securities, benchmark securities and watchlist securities.
  2. Client dimension (aka customer dimension) containing a list of clients from the CRM such as Salesforce, with their usual attributes such as client type, Mifid category, industry sector, and address.
  3. Portfolio dimension (aka account dimension, product dimension or fund dimension) which contains a list of open funds, segregated account/mandate and trust funds that the company runs.
  4. Benchmark dimension, containing a list of benchmarks which the portfolios are measured against, including blended benchmarks. The “pairing” between portfolios and benchmarks is not done in this dimension.
  5. Country dimension, covering two country code schemes (ISO 2-char and ISO 3-char) and 2 to 3 country name columns (one for each OMS/IMS, plus ISO).
  6. Currency dimension is pretty static. A country column is not useful here, as the currency name already includes the country and some currencies are used by several countries. It is usually a good practice to update country and currency dimensions manually.
  7. Rating dimension. Covers Fitch, Moody’s, and S&P. A good rating dimension has the 3 agencies as columns so we get the translation between them (instead of stacking them up as rows with a “type” column). Also covers “major” rating and “notched” rating as two different attributes/columns (AA is a major rating and AA+ is a notched rating). Also covers cash ratings (Moody’s), which are quite different from credit ratings. Don’t put the “watch” rating and “outlook” here; they should be a DD on the security dimension.
  8. Industry Sector dimension. Covers at least ICB and GICS, along with major providers such as ML, Barclays, Bloomberg. Some are 4 levels deep, some 7 levels. Unlike rating, don’t try to do translation between providers here, so stack them up with a “provider” column, then level 1 to level 7 columns.
  9. Asset class dimension (aka asset type, instrument type, asset group) which contains a list like: equity, fixed income, cash, property, commodity as level 1. Level 2 contains the breakdown for example for fixed income the level 2s are bond, floating rate note, CDS, bond future. For bond the level 3s are callable/puttable or government/corporate.
  10. Dimensions related to performance return attributes: return type (gross or net of fees), accounting method, cumulative or annualised, arithmetic or geometric. We could model them as a junk dimension, because most attributes only have 2-3 values, if we don’t mind losing the descriptions. I don’t recommend putting the performance period here (e.g. 3 months, 1 year, 5 years); it is best to put the periods as columns in the performance return fact table.
  11. For the sake of completeness, date dimension.

The usual dilemmas with designing the dimensions are:

  1. Do we include country of risk/incorporation, currency, industry sector, etc. in the security dimension? Absolutely, as type 2 attributes. Plus we also have them as separate dimensions in their own right.
  2. How do you model derivative instruments, which have very different attributes? By creating Options, Swaps, Futures and CDS as separate dimensions. Don’t separate the OTC (Over The Counter) derivatives (e.g. options, swaps, CDS) from the ETDs (Exchange Traded Derivatives, e.g. futures, forwards). Don’t create too many dimensions for derivatives; group them up, 4 or 5 dimensions sounds about right. Don’t worry too much if certain attributes are duplicated between the main security dimension and the derivative instrument dimensions; it’s ok.
  3. Do we put cash in the security dimension or not? Don’t. Only real financial instruments should be in the security dimension (including CD, Repo, OTC and ETD), but leave settled and unsettled cash out (including collateral cash for margin requirements). See my article here for more info: link. Cash in different currencies should be put directly in the position fact table, clearly labelled as settled or unsettled, as many portfolio management decisions hang on this label.


B. Fact Tables

B.1. Position Fact Tables

Position fact tables store the valuation of each position from the back office accounting system, including adjustments. The valuation is daily, but the month-end valuation is continuously adjusted/restated until a few weeks later.

For example, every business day we could be receiving:

  1. Daily position valuation from the OMS system
  2. Daily position valuation from the accounting system
  3. Today’s version of the last month end position valuation from the accounting system

The valuation covers not only the securities that we hold, but also the cash. There are a few things we need to be aware of here with regards to the validity of our position fact table:

  1. The cash amount is dependent on the settlement status of the transactions for that day.
    For example, this morning we had £1m cash, and this afternoon we bought £300k worth of equities and bonds. So what’s our cash position at the end of today? Answer: Settled cash = £700k, unsettled cash = £300k.
  2. The securities we hold are affected by the trade status.
    For example, today the portfolio manager put an order to buy £200k of bond X. If the status of the order is “executed by the broker” (it’s called “filled orders”) then we will have that security in today’s position. But if the status of the order is “not yet executed by the broker” then we don’t have that security in today’s position. There are many stages of an order/trade including “draft/created”, “verified”, “authorised”, “compliance checked”, “placed with/sent to the broker”, “confirmed by the broker”, “filled/executed by the broker”, “allocated”, “settled”, and “closed”. The position is created when the order is “executed”.
  3. The value of an equity position is the number of shares we hold times the previous day’s price of that equity (the close of business price, i.e. when the exchange stops trading; the exact hour and minute obviously depends on the country the exchange is located in). Likewise for bonds and any exchange-traded securities. But for OTC derivative positions like options and swaps, the valuation requires some calculation to determine the “fair” market value, e.g. the Black-Scholes model for options (this discipline is generally called “derivative pricing”). This is because OTC derivatives are not traded on exchanges, but are contracts between two parties.
  4. Some of the positions could be valued in-house, and some positions could be valued by an external company to whom we outsource our back office operation. In this case we may need to merge two sources to get the holistic view.

Apologies, I will have to stop here as I need to write about Power BI and DAX. I will continue this article afterwards (in about a month’s time).

6 September 2018

Modern Data Warehousing

Filed under: Data Warehousing — Vincent Rainardi @ 7:09 pm

What should 2018 Data Warehousing look like (as opposed to in 2005)? That’s what the word “Modern” refers to. We now have streaming data, big data, data lakes and data scientists. A modern DW in 2018 incorporates these 4 factors; an “old” DW in 2005 did not.

The first attempt to define a Modern DW was Bill Inmon’s DW 2.0 concept in 2006, where he added unstructured data and different user communities (whitepaper: link, book: link). There was no big data or NoSQL at that time, but we already had unstructured data in data warehousing. I would argue that metadata and different archiving lines were not new; they had been in DW before 2005.

Then there was an attempt to define a Modern DW as Lambda, i.e. streaming plus batch. The best definition is probably from Melissa Coates: link (search “characteristics of a modern data warehouse” on that page). Here she defines other important characteristics such as Hadoop, data lake, NoSQL and bimodal (sandbox). I would argue that data virtualisation, near real-time, governance, agile delivery, self-service BI, large volumes, MDM and high variety are not new; they had been in DW before 2005.

The Azure community is quite adamant about the Lambda concept of a modern DW. Notably Adatis, in particular Simon Whiteley in April 2017 (link), and at SQL Saturday. Microsoft supports this Lambda idea in Azure (concept: link, diagram: link).

Then there is the concept of “Data Lake side-by-side with DW”, notably proposed by Barry Devlin, supported by Hortonworks (link) and also proposed by BlueGranite (link), where Melissa works (her SQL Saturday presentation is here). The diagram on page 5 of the BlueGranite eBook, which is essentially the same as page 5 of Melissa’s SQL Saturday deck, is in my opinion the best diagram describing a modern DW, as we can see below:

BlueGranite diagram

Here the data lake is the staging area for both the streaming data and the data warehouse. It is also the staging area for NoSQL data stores such as graph databases, and it is accessed by machine learning. So the data lake serves 4 purposes.

Arguably it’s Microsoft-centric (i.e. Azure), but in my opinion they are the best vendor at the moment, with the most complete components for a modern DW. A close second is AWS, with Redshift for the DW and S3 for the data lake (catalogue & ETL on Glue); Hadoop and Spark are on EMR (Elastic MapReduce). GCP is in third place, with BigQuery as the DW, Cloud Storage as the data lake, and Cloud Dataflow for the streaming data; Hadoop and Spark are on Cloud Dataproc.

15 July 2018

Power BI Q&A

Filed under: Data Warehousing — Vincent Rainardi @ 5:59 pm
  1. Q: Can Power BI read SSAS cube?
    A: Yes, both tabular (link) and multidimensional (link)
  2. Q: Can Power BI display SSRS reports?
    A: Yes but on-premise only, by installing Power BI Report Server (link)
    SSRS (paginated) reports are not available in the Power BI Cloud/Service, but we can pin SSRS (paginated) reports on Power BI Dashboards (link)
  3. Q: Can I use natural language to query the data in Power BI, e.g. “What were the total sales last year?”
    A: Yes, but only in the cloud version and only in English. On a Power BI dashboard, we can type our question in the “Ask a question about your data” box. Power BI Q&A uses the names of the tables, columns, and calculated fields (link).
  4. Q: What is the difference between a Power BI Dashboard and a Power BI Report?
    A: Dashboard is 1 page, whereas Report is multiple pages. Dashboard is only available in the cloud/service version, whereas reports are available on both cloud and on-prem/desktop version. You can subscribe to a report, but not to a dashboard (link).
  5. Q: What is the difference between the “free” version and the “pay” version of Power BI?
    A: On the “free” version we can’t publish and consume app, can’t share dashboard or subscribe to dashboard (link).
  6. Q: Can we view Power BI reports on Android, or just iPads?
    A: We can view Power BI reports on iPad and iPhone (link) as well as on Android tablets and phones (link) and Windows 10 devices (link)
  7. Q: How can Power BI in the cloud connect securely to an SQL Server on-prem?
    A: By installing a data gateway on-prem, which protects the on-prem SQL Server.
  8. Q: What’s the best way of inputting simple data into a Power BI report?
    A: Using embedded PowerApps (link)
  9. Q: Can I run R scripts in Power BI?
    A: Yes, Get Data > More, Other > R script. You can then use the output (data frame) in the Power BI visualisation (link).
  10. Q: How much is Power BI?
    A: Power BI Desktop is free. Power BI Pro is $9.99/month per user. Power BI Premium is $5000 to $6000/month for 1 to 1000 users, if only 100 users are authoring reports (link), i.e. the rest are just consuming the reports/dashboards.


11 May 2018

Big Volume Data Warehouse

Filed under: Data Warehousing — Vincent Rainardi @ 7:42 pm

Some people are worried about the volume of a snapshot fact table. If we insert half a million rows a day, after a year we would have about 180 million rows, and after 5 years close to 1 billion rows. For example, the position valuations of a large asset manager, or the account balances of a large retail bank.

The volume issue also occurs in transaction fact tables, i.e. we could be inserting a few million rows a day, resulting in a few billion rows a year. For example, stock exchange bids and offers, search engine searches, the browsing traffic of a large ecommerce site, or the good old supermarket checkout case.

If we haven’t dealt with a billion-row fact table like the above, we would be worried about query performance. How long will it take to query a fact table with 10 billion rows? A minute? 10 minutes? Well, it could, if the database engine has to scan the whole table. But if the table is partitioned, it can take just a second.

So the question now becomes: great, a query on a 10-billion-row fact table can take a second, but how do I partition that big fact table? Well, it depends on what fact table it is. If it is a snapshot fact table, partition on the snapshot date. If it is a transaction fact table, partition on the transaction date.
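A minimal sketch in SQL Server, with hypothetical names: a daily snapshot fact table partitioned on the snapshot date key, so that a query filtering on SnapshotDateKey reads only the relevant partitions instead of scanning the whole table.

```sql
-- One partition boundary per month (extend the list as months are added).
CREATE PARTITION FUNCTION pfSnapshotDate (int)
    AS RANGE RIGHT FOR VALUES (20180101, 20180201, 20180301);

-- Map every partition to a filegroup (all to PRIMARY in this sketch).
CREATE PARTITION SCHEME psSnapshotDate
    AS PARTITION pfSnapshotDate ALL TO ([PRIMARY]);

-- The fact table is created on the partition scheme, partitioned by the
-- snapshot date key column.
CREATE TABLE FactPositionSnapshot
( SnapshotDateKey int NOT NULL,   -- e.g. 20180510
  PortfolioKey    int NOT NULL,
  SecurityKey     int NOT NULL,
  MarketValue     decimal(19,4) NULL
) ON psSnapshotDate (SnapshotDateKey);
```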

Ok, that’s great. But what if we need to query across dates? For example, in a stock exchange data warehouse, to find the top 10 most traded shares by volume (or by value) in the last 3 months. That query would take many minutes. To handle this, we prepare a fact table which stores the traded volume and value of every share on every date. We populate this daily snapshot fact table every day, immediately after the main transaction fact table is populated.

For example, today is 11th May 2018. Right after the 10th May data is loaded into the share transaction table, we calculate the volume and value of every share traded on 10th May and insert these rows into the daily snapshot fact table. Querying this table to find the top 10 most traded shares in the last 3 months then takes only a second.
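The cross-date query then runs against the small aggregate table, one row per share per day, rather than the billions of trade rows. A sketch, with hypothetical names:

```sql
-- Top 10 most traded shares by volume over the 3 months to 10 May 2018.
SELECT TOP 10 ShareKey, SUM(TradedVolume) AS TotalVolume
FROM FactDailyShareTrading
WHERE TradeDateKey BETWEEN 20180211 AND 20180510
GROUP BY ShareKey
ORDER BY TotalVolume DESC;
```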

15 April 2018

BI vs Data Science

Filed under: Data Warehousing — Vincent Rainardi @ 5:57 am

Last night there was a talk about Business Intelligence (BI) vs Data Science at Dataiku London. Unfortunately I was not able to attend, but the event inspired me to write about this topic, as it has always interested me. And I think it fits this blog well, as the blog is about Data Warehousing and BI, and recently Data Science too.


Before we go too much further let’s remind ourselves what BI and Data Science are. Data Science is about scientific approaches to manage and analyse data using statistics, machine learning and visualisation (link). BI (Business Intelligence) is also about managing and analysing data and visualisation, but using business approaches, not scientific approaches (link, link).

I have been creating business intelligence and analytics applications across many industries including travel, banking, pharmaceutical, manufacturing, insurance. Both with a data warehouse/mart, and without. They are used for analysing the business, for reporting, and for dashboarding.

I have done Big Data development for a few years, including Hadoop, Hive, Data Lake, Spark, graph and document databases. In the last 6 months I have been learning Machine Learning and Neural Network, using Jupyter, Matlab and R, and recently Tensorflow. I understand how they are used for visual recognition, predicting prices, network security, marketing and playing games.

This Article

I have written a few articles about “Choosing between Big Data and Data Warehousing” (link, link, link) which differentiate the back-end part of BI and Data Science. So in this article I would like to focus on the front-end part, i.e. how they are used.

One of the best ways to compare the usage of BI vs Data Science is probably through use cases from a particular industry: within that industry we compare what BI can do and what Data Science can do. For example, in insurance BI is used for analysing customers, earnings, risks, financials and claims, and for reporting (operational, management and regulatory reports). Whereas Data Science is used to forecast losses, income and profitability (both account and product), adjust rates, classify customers for renewal, and group potential customers.

If we dive into another industry, say travel, we will understand the differences between BI and Data Science a bit more. The more industries we study, the more we understand how differently BI and Data Science are used: one uses a business approach, the other a scientific approach.

With the above background I think we are ready to dive into the core of this article, so let’s start with the first industry: insurance. I will then do one more industry: travel. I am a beginner in data science, and particularly in machine learning, so suggestions and advice from readers, on both the approach and the content, are most welcome via the comment box below. Once again, in this article I’m not discussing the how or the technical aspects, but the usage and business advantages, comparing BI vs Data Science.


General insurance is about insuring buildings, ships and businesses from risks like storms, sinking and frauds. There are about 20 classes in general insurance including property, aviation, marine, construction, oil & gas, professional indemnity, public liability and riots. It is a little bit different from life and pension business, and from retail business, i.e. home, car, health, pet and travel insurance.

I guess in this example we need to pick one line out of the above 3 main insurance businesses, each of which branches into many lines. Let us use car insurance, because it is a retail business that many people can relate to, and it’s easier to understand. It has many attributes which provide a good illustration of BI and Data Science usage.

The main usage of BI in car insurance is to analyse premium income and claim costs. These two numbers (and many of their derivatives, i.e. net, gross, subs, etc.) are analysed with regards to vehicle, time, office and customer attributes. Two of the most common methods for this analysis are OLAP cubes and dashboards. The dashboards present the most important findings, such as income growth by vehicle type and profitability by customer group, with guided drilldown along chosen pathways. The cubes provide a free-to-roam exploration platform, enabling any number to be drilled down to the lowest detail along any available pathway.

The business value of these activities comes from knowing when and where exactly the revenues and costs arise and what factors influence them. This knowledge gives the business greater control and the ability to grow profitability across all business segments. Understanding the claims and premium profile of each type of vehicle (age, van/SUV, fuel, gearbox, engine size, parking, usage, mileage) and each type of policy holder/driver (location, years of licence, profession, accident history, health, penalty points) enables the business to target the sales and marketing budgets at the more promising customers, as well as to price each policy at the correct level. It also enables more precise filtering when following up leads from aggregators such as GoCompare and MoneySupermarket, and from brokers.

Data Science is used to analyse customer churn, establish rates, and analyse black box telematics with regard to risks and their impact on premium levels. Standard actuarial rates give the business no competitive advantage, as they use standard factors such as driver age, engine size, mileage and parking location. Data Science enables insurers to factor in new attributes such as driving behaviour, driving routes and the driver’s social media, calculating the underwriting rates more specifically and more precisely for each policy. This enables the business to win new business, both commercial and retail, as the quote/price is the most important factor influencing new customers. Machine learning is also used to improve the efficiency of online advertising, by predicting how many users (and of which types) would click through on different types of ads, and by optimising bidding.

So BI and Data Science give different business advantages. It is not true that Data Science will replace BI in the car insurance business, or in insurance in general. On the contrary, Data Science will complement the business advantages that BI currently delivers.


Selling flights, hotels, holidays and cruises is the core business of travel companies. Car hire, travel insurance, excursions, airport transfers, airport parking, tours, restaurants, attractions, shows, villas, currency exchange and rail passes are the secondary businesses. Travel companies have retail and commercial business lines, both heavily utilising the internet as a marketing platform, because it reaches many countries and is very cost-efficient.

BI is used for analysing web traffic (such as funnels and conversion rates), revenue breakdowns, customer segmentation, customer loyalty programmes, and marketing campaigns such as email alerts and newsletters. Funnels enable the business to understand how much traffic filters through each step of the purchasing process, at which page each customer stops, the sequence of pages they viewed and how long they spent on each. This ultimately enables us to improve the website content and the products, resulting in higher conversion rates. Revenue is broken down by product, customer type, country, platform, campaign and time. Marketing campaigns enable the business to drive more sales, with the content of each email/newsletter tailored to each individual customer. Each image in the newsletter is uniquely tagged, enabling us to track which click from which customer triggered each sale.

The business value of these activities is increased revenue, a wider customer base, and increased customer loyalty. Revenue is higher because of the higher conversion rates on web traffic and because the marketing campaigns drive more sales to the websites; also because, as a result of the breakdown analysis, we can focus our budget and time on the more profitable and promising product types and customer segments. The customer base is wider because of market penetration from the campaigns. Customer loyalty increases because the offer is tailored specifically to each customer segment, sometimes down to the individual customer.

Data Science can be used for customer segmentation, predicting campaign results, and analysing web traffic. This seems to overlap with the BI activities above, but at the detail level they actually complement each other. By associating real-time web traffic data with past purchase data using machine learning, travel companies can find out which customers have the best potential for purchasing products (using cookies to identify customers). The learned parameters are then used to display offers on the web pages the customers are currently on, tailored to the products they are most likely to be interested in, based on their browsing behaviour (current and past sessions).

Predicting campaign results is done by first tracking back each sale to find its origin and contributing factors. The origin can be a click on an image on a particular email sent by a particular campaign. This can be identified by the tag on that image, which has the customer ID, campaign ID and product ID. The origin can also be a Google or social media advert from a particular campaign that the travel company runs, giving us the ID of the ad and possibly the user/audience ID. The contributing factors of a successful campaign can be the product, creative, layout, timing, pricing, audience and origin. Knowing which factors contribute the most to the success or failure of each campaign can be very revealing, from which we can then improve the subsequent campaigns. Various machine learning algorithms can be used for this including support vector machine, random forest and neural network.

7 February 2018

Building a Neural Network

Filed under: Data Warehousing — Vincent Rainardi @ 6:03 am

This is a diagram of a neural network:

1. Diagram of a Neural Network

Each node in the above diagram receives input from all the nodes in the previous layer. For example, node 1 in hidden layer 1 receives input from the 4 nodes in the input layer: x1, x2, x3 and x4, like this:

2. Calculate Output of a Single Node

Each of the lines has a weight, so the inputs do not all have the same magnitude. In the above example, the input from x1 has a weight of 0.1, whereas the input from x2 has a weight of 0.2.

In addition to the weighted inputs, there is one special input going into the node, called the “bias” and notated as b. It is a constant (a number). So node n1 in hidden layer 1 receives 4 weighted inputs plus a bias. This total input is called z. In the above example it is 13.3.

The output of each node is a function of its input. This function is called the “activation function”. One of the most common is the “rectified linear unit”, abbreviated “ReLU”, which is simply the maximum of the input and zero. In the above example, the output of node n1 is 13.3. This output is called a, which stands for “activation”.
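The computation of a single node can be sketched in a few lines of Python. Note the inputs, the last two weights and the bias below are made up for illustration (only the 0.1 and 0.2 weights appear in the diagram above):

```python
# One node: z = weighted sum of inputs plus bias, a = ReLU(z)
def relu(z):
    # rectified linear unit: the maximum of the input and zero
    return max(z, 0.0)

def node_output(inputs, weights, bias):
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return relu(z)

# 4 inputs as in the diagram; values are hypothetical
a = node_output([10, 20, 30, 40], [0.1, 0.2, 0.3, 0.4], 0.3)  # about 30.3
```

Because z is positive here, the ReLU passes it through unchanged; a negative z would give an activation of 0.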

Forward Propagation

So, if we have the values of the input layer (x1, x2, x3, x4), the values of the weights on each line between the input layer and hidden layer 1, and the values of the biases for all 3 nodes in hidden layer 1, we can calculate the output of all 3 nodes in hidden layer 1 (a1, a2, a3), like step 1 in this diagram:

3. Forward Propagation

Once we have calculated the output of hidden layer 1, we can use it to calculate the output of the next layer, i.e. hidden layer 2. This is marked as step 2 in the above diagram. We can then calculate the output of hidden layer 3 (step 3) and finally the output of the entire network (step 4).
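The layer-by-layer steps above can be sketched as repeated application of one function. The weights and biases below are made up; the point is only the mechanics of propagating values forward:

```python
def relu(z):
    return max(z, 0.0)

def layer_forward(inputs, weights, biases):
    # weights[j] is the list of weights on the lines into node j of this layer
    return [relu(sum(x * w for x, w in zip(inputs, ws)) + b)
            for ws, b in zip(weights, biases)]

x = [1.0, 2.0, 3.0, 4.0]                                              # input layer
h1 = layer_forward(x,  [[0.1]*4, [0.2]*4, [0.3]*4], [1.0, 1.0, 1.0])  # step 1
h2 = layer_forward(h1, [[0.5]*3, [0.4]*3],          [0.0, 0.0])       # step 2
# ...and so on for hidden layer 3 and the output layer
```

Each call consumes the previous layer's activations, which is exactly the forward propagation described above.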

Sigmoid Function

If the neural network is used to predict a binary condition (e.g. whether an image is a car or not), the activation function used on the output layer is usually not a relu function, but a sigmoid function, which looks like this: (source: wikipedia, link)

4. Sigmoid Function

We can see that for most values of x (the input), the value of y (the output) is close to either 0 or 1, which fits the binary condition.

So in Step 4, we put 51.877 x 0.7 + 41.099 x 0.3 + 6 = 54.6436 as the x (input) of the sigmoid function, and get 1 as the output of the sigmoid (a).

We then predict the output of the neural network (ŷ) as follows: if a is more than 0.5, set ŷ to 1, otherwise set ŷ to 0.
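Step 4 and the thresholding rule above can be written directly, reusing the numbers from the example:

```python
import math

# Sigmoid output layer for a binary prediction
def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

z = 51.877 * 0.7 + 41.099 * 0.3 + 6   # the step-4 input from the example (54.6436)
a = sigmoid(z)                        # so close to 1 it rounds to 1
y_hat = 1 if a > 0.5 else 0           # the predicted output ŷ
```

For an input this large the sigmoid saturates at 1, which is why the example simply reports 1 as the output.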

Calculate The Cost

We do this forward propagation process for every data point that we have. In Machine Learning, one data point is called “an example”. So for every example (input) we will have a predicted output (ŷ). We then compare these predicted outputs with the actual outputs; the difference is called the “loss”. The average of the loss over all examples is called the “cost”.

There are many loss functions (link). For a binary classification where the output is a probability between 0 and 1, like above, the appropriate loss function is “cross entropy”, which looks like below: (source: “ML Cheatsheet” from “Read the Docs”, link)

5. Graph of Cross Entropy Loss

So if the output should be 1 and the predicted output is 1, there is no loss (loss = 0). If the predicted output is very wrong, e.g. a small number like 0.1, it is penalised heavily. This “heavy penalty” comes from taking the log of the predicted probability, so the penalty is not linear. The formula is as follows (p = predicted output, y = actual output):

Cross Entropy Loss = -(y log(p) + (1-y) log(1-p))

It is derived from the probability of the observed label: if y = 1 the probability is p, and if y = 0 it is 1-p. The two cases can be written as one expression: p^y (1-p)^(1-y).
Taking the log, it becomes: y log(p) + (1-y) log(1-p).
Taking the minus of that gives the formula above. We take the minus because log(x) looks like the left graph below, whereas minus log(x) looks like the right graph below:

6. Graph of Log and Minus Log

Note: “cross entropy” is the average number of bits needed to identify an event, see: Wikipedia: link

That is the loss for 1 example (data point). The cost is the average of the loss over all examples, i.e. the sum of the above divided by m, where m is the number of examples, like this:

7. Calculating Cost

That average loss is the cost of the entire neural network for this particular weighting (the weights on every line, plus the biases). This is not the best weighting; there are better weightings which result in lower costs. If we can find the best weighting, we get the lowest cost, which means the smallest gap between the predictions and the actual outputs (across all examples), i.e. the most accurate prediction. To find the best weighting we need to go backward from the output layer towards the input layer. This is called “back propagation”.
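The loss formula and the averaging step can be combined in a few lines. The probabilities below are made-up examples:

```python
import math

# Cross entropy loss for one example: -(y log(p) + (1-y) log(1-p))
def cross_entropy_loss(p, y):
    # p = predicted probability, y = actual label (0 or 1)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

# Cost = average of the loss over all m examples
def cost(predictions, labels):
    m = len(predictions)
    return sum(cross_entropy_loss(p, y) for p, y in zip(predictions, labels)) / m

low = cross_entropy_loss(0.99, 1)   # a confident correct prediction: tiny loss
high = cross_entropy_loss(0.1, 1)   # a confident wrong prediction: heavy penalty
```

The asymmetry between `low` and `high` is the non-linear “heavy penalty” described above: as p moves toward the wrong extreme, -log(p) grows without bound.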

Back Propagation


Update The Parameters



  1. Michael A. Nielsen, “Neural Networks and Deep Learning” book: link
  2. Denny Britz, “Implementing a Neural Network from scratch in Python”: link
  3. Sunil Ray, “Understanding and Coding Neural Network from scratch in Python and R”: link
  4. Matt Mazur, “Step by Step Back Propagation Example”: link.


27 December 2017

What Machine Learning Can Be Used For

Filed under: Data Warehousing — Vincent Rainardi @ 7:10 am

What Machine Learning Can Be Used For

Machine Learning is a field of computer science that gives computers the ability to learn without being explicitly programmed (Wikipedia). That is a very good definition but isn’t worth much without knowing concrete examples, so in this article I will focus on giving examples of what machine learning is currently used for, and can be used for.

Two Types of Usage

Broadly speaking Machine Learning can be used to do 2 things:

  1. If we have a set of data, Machine Learning can be used to classify that data into groups, or find associations between groups. This is called Unsupervised Learning.
    For example: grouping galaxies using astronomy images, grouping customers using purchase data, users who like news A also like news B.
  2. If we have pairs of inputs and outputs, Machine Learning can be used to predict the output for a new input. This is called Supervised Learning.
    For example: recognising symptoms of skin cancer based on the skin lesion images, identifying hacking attempts in cyber security based on network data, predicting the price movements of equity indices, forecasting the weather.

A. Image Recognition

By far the most widespread use of machine learning is image recognition: for example, recognising handwritten letters or numbers, vehicle plate numbers, and faces.

A1. Handwriting Recognition

Machine Learning can be used to recognise handwritten letters or numbers. This article (link) from Adam Geitley explains how Machine Learning can recognise a handwritten number 8. An image of the number 8 is a series of numbers which represent how dark each pixel is. Using a machine learning algorithm called Convolutional Neural Network (CNN), computers can be trained to recognise a number 8, and other numbers, and letters. A Convolutional Neural Network breaks the image into small tiles and passes the tiles into a small neural network. We then sample the output and pass the sample to another neural network.
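The “small tiles” idea is essentially a convolution: slide a small filter over the image and take a weighted sum at each position. Here is a toy sketch in plain Python; the image and filter values are made up, and a real CNN would learn the filter values rather than hard-code them:

```python
# Slide a k x k filter over an n x n grayscale image (values = pixel darkness)
def convolve2d(image, kernel):
    n, k = len(image), len(kernel)
    out_size = n - k + 1
    out = [[0] * out_size for _ in range(out_size)]
    for i in range(out_size):
        for j in range(out_size):
            # weighted sum of the k x k tile starting at (i, j)
            out[i][j] = sum(image[i + a][j + b] * kernel[a][b]
                            for a in range(k) for b in range(k))
    return out

image = [[0, 0, 1, 1, 0],
         [0, 1, 1, 1, 0],
         [0, 0, 1, 0, 0],
         [0, 1, 1, 1, 0],
         [0, 0, 1, 1, 0]]
edge = [[-1, 0, 1],
        [-1, 0, 1],
        [-1, 0, 1]]          # a simple vertical-edge filter
feature_map = convolve2d(image, edge)   # 3x3 map of edge responses
```

Each entry of `feature_map` is one tile's response; stacking many learned filters and feeding their outputs into further layers is what the CNN described above does.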

We use handwriting recognition in post offices to sort thousands of letters into postcode areas by reading the addresses written on the envelopes. It is also used to read the amounts and recipient names on handwritten bank cheques.

A2. Plate Number Recognition

The same technique (CNN) is used for character recognition: not handwritten characters, but the ones on images, for example reading the licence plate numbers in petrol stations. This is called ANPR (Automatic Number Plate Recognition). It is used at toll gates like the Dartford Crossing, so drivers do not have to queue to pay; instead, they can pay from their mobiles later (within 24 hours). It is also used in apartments, i.e. only recognised cars are allowed in (and out!). It is also used to reduce congestion in big cities like London, where we have to pay £11.50 to go into central London. Thousands of cameras at the borders of central London detect which cars come in and out, so the system can match the payments to the cars and issue penalties to those who didn't pay.

A3. Facial Recognition

We can use Machine Learning to recognise faces. This article (link, also from Adam Geitley) explains how. It uses a series of 3 images to find 128 measurements for each person using the HOG algorithm (Histogram of Oriented Gradients); 2 of the 3 images are photos of that person, while the third is a photo of a different person. This way, each face is converted into a set of 128 numbers, which is called an encoding.
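Once each face is an encoding of 128 numbers, recognition reduces to comparing distances between encodings. A minimal sketch, where the encodings and the 0.6 threshold are made-up illustrations rather than values from any real system:

```python
import math

def euclidean(a, b):
    # straight-line distance between two 128-number encodings
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def same_person(enc1, enc2, threshold=0.6):
    # below the threshold: the two encodings likely belong to the same face
    return euclidean(enc1, enc2) < threshold

known           = [0.10] * 128   # encoding stored for a known person
candidate_same  = [0.11] * 128   # a new photo with a very similar encoding
candidate_other = [0.30] * 128   # a clearly different face
```

Training with triplets (two photos of the same person, one of another) pushes encodings of the same face close together and different faces far apart, which is what makes this simple distance test work.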

It is used in Facebook. We used to tag photos in Facebook and the tagged person got notified: “hey, you are tagged in such and such photo”. But now we don't need to tag the faces and the person still gets notified, because Facebook recognises who the faces belong to. The notified person can then ask for the photo to be taken down, or like the photo, or tag it.

Facial recognition is also used for payment. Faces can be converted into a set of 128 numbers, which uniquely identifies each person. Well, almost. Alipay uses face recognition for payment in KFC in China (link, link).

Banks now use face recognition to identify customers. Instead of passwords or PINs, to log in to their bank accounts customers use facial recognition via the camera on their laptop, tablet or phone. For example, Lloyds Bank and HSBC do this (link), and so do First Direct and Tesco Bank (link). In some banks, customers can also use their voice or fingerprint (on their mobile) to log in.

Facial recognition is also used by the police. At a beer festival in China, security cameras were installed at the entrance gates and sounded an alarm when a face was recognised from the police database (link). At the Notting Hill Carnival in London, the police used it to match faces in the crowd to a database of people previously arrested or under bail conditions. In Germany the police use facial recognition to catch terrorists (link).

A4. Searching Photos

What I mean here is not searching the text/label that accompanies the photos, but searching inside the pixels of the photos to find, for example, whether there is a flower in them.

  • Shutterstock allows us to search photos based on the composition (link).
  • Facebook Lumos will (but does not yet) enable us to search photos (link).
  • The best I think at the moment is Google Photos. We can now search, without labelling our photos first (link). I tried searching “curry”, “flower” and “snow” and it correctly displayed photos containing curry, flower and snow. Amazing!

The basic application is what Google Photos does above. Shutterstock, the BBC, Facebook and every other media company should have this technology, but do not yet. Media companies like newspapers and TV stations have millions of photos, which take a lot of time to label. The cost saving would be enormous if employees could search inside those photo collections without labelling them.

The reason it takes time is that we need to train the Machine Learning behind the photo search engine to recognise millions of words. It is probably a mini project to train a certain word such as “flower”, but to train millions of words would take years.

A5. Astronomy

Machine Learning can be used to group galaxies according to their shapes, i.e. spiral, elliptical, lenticular and irregular (link). Why don't we group them manually? Because there are billions of them. We can also use Machine Learning to detect exoplanets, using the S4 algorithm (link) and CNN (link) rather than random forest (link), and to analyse galaxy spectra (link, link).

Most of the applications of Machine Learning in Astronomy analyse thousands of galaxy images using image recognition algorithms such as the Convolutional Neural Network (CNN). Astronomy has a long history of using machine learning, going back to the 90s, but it used “old” algorithms such as decision trees and random forest. Since 2015 most of the image analysis in Astronomy has used deep learning algorithms such as CNN.

A6. Manufacturing

In manufacturing, Machine Learning is used to automatically detect overheating in motors and defects in glass products (link). To detect overheating, the thermal images produced by infrared cameras are analysed using CNN. Glass defects are detected using normal-light photographs.

There are many Machine Learning applications in manufacturing outside image recognition. Unlike Astronomy, manufacturing uses many different sensors, not just visual/images. Temperature, pressure, flow, weight, and many other measurements are analysed using Machine Learning for predictive maintenance and other purposes; see below for details.

A7. Health Care

Machine Learning is used to detect cancer by classifying FNA (fine needle aspiration) images. It is used to detect breast cancer (link), salivary gland cancer, liver cancer and thyroid cancer.

Machine Learning is used to identify tuberculosis from chest x-ray images (link). They used Deep Convolutional Neural Network models to learn TB positive and TB negative x-rays. This is particularly useful in areas where there is no cardiothoracic radiologist.

A8. Retail

Lowe’s Home Improvement uses a robot to answer customers’ questions, such as where item X is, and take the customer to the aisle where the item is located (link). It uses computer vision powered by Machine Learning to navigate the aisles in the store autonomously, using cameras and odometry. It also does inventory monitoring in real time by scanning the shelves in the aisles.

When we find a pair of jeans, glasses or a handbag that we like, we can take a photo of it (or a screenshot), then use an app called SnapShop from GoFIND to find the product in thousands of shops across many brands (link).

When we shop for clothes online, we can see how various garments fit us by trying them on online using MeModel from Metail (link). We input our height, weight, bra size, waist, hips, hair style, gender, age and build so MeModel can predict and recommend various dresses, tops, skirts and trousers that fit and suit us. Retailers can see the demographic and body data of their customers, enabling them to provide suitable fashion that their customers like. The recommendation is computer vision based, powered by Machine Learning algorithms.

B. Other Media Data

Non-image media data are video and audio. There are a lot of Machine Learning applications for video and audio data, e.g. natural language processing and self-driving cars. In this section I will cover just video and audio data, and will cover text and numeric data in section C.

B1. Self Driving Car

This is probably the one receiving the biggest media attention. It is not just the video feed that the deep learning algorithm needs to analyse, but also data from radar, lidar, GPS, and the controllers and actuators in the car itself (link). A Kalman Filter is used to combine data from different sensors to understand what is around the car, including pedestrians, traffic signs and other vehicles. PID and MPC controllers are used with deep learning algorithms to regulate the engine RPM, brakes, gears, accelerator and wheels of the car so that it moves at the correct speed and direction, with self-tuning. Finally there is an element of routing as well, partly using a neural network (link).

B2. Natural Language Processing (NLP)

NLP is about understanding human speech, both oral and written, and also about producing speech, both oral and written. Speech recognition (link) is a major part of NLP, translating voice into text. Stemming algorithms (link) are used to find the base form of each word. Word2Vec (link, link) is used for understanding the meaning of words (link). Dependency parsing is used to understand how the words in a sentence relate to each other (link). The Machine Learning algorithms used for NLP include Deep Bi-directional RNNs, Gated Recurrent Units and Long Short Term Memories (link). Convolutional Neural Networks are used for sentence classification (link) and Neural Machine Translation is used for large-scale translation tasks (link).
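To make the stemming step above concrete, here is a deliberately naive suffix-stripping stemmer. Real stemmers (such as the Porter stemmer) use far more careful rules; the suffix list and length check here are made up purely for illustration:

```python
# Toy stemmer: strip a common suffix if enough of the word remains.
def naive_stem(word):
    for suffix in ("ing", "ed", "es", "s"):
        # only strip when at least 3 characters of stem would remain
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word
```

Even this crude rule maps “parsing”, “parsed” and “parses” to the same stem (“pars”), which is the point of stemming: variants of a word collapse to one base form before further analysis.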

Text Analytics (or Text Mining) is an important part of NLP. In the paragraph above, all items from stemming onwards are in the Text Analytics domain. Basically, once the voices are translated into text, we are in Text Analytics territory.

B3. Speech Processing

Rather than trying to understand the full meaning of human speech, Speech Processing focuses on just one particular aspect of it. For example:

  • Converting Speech to Text: like a Dictaphone crossed with an OCR scanner, it can translate what we say into a document.
  • Speaker Identification: like a fingerprint, our voice can be used to identify us. It can be used in telephone banking to identify the customer (HSBC, Barclays), and in forensics to identify the criminal in offences such as kidnapping, extortion, blackmail threats, terrorist calls, ransom calls, etc. (link).
  • Speech Generation (or Speech Synthesis): the concatenation of many recorded voicings of different words read by a human. If a word in the text does not exist in the “recorded word database”, the word is converted into phonemes (units of sound), such as /k/ in car, cool and kit. Each phoneme is converted into voice and concatenated.
  • Noise Reduction: using a deep recurrent auto encoder neural network, we can reduce the noise on an input audio (link). This is done by training the network on stereo (noisy and clean) audio features to predict the clean features given noisy input.
  • Emotional Recognition: given voices as input, a deep recurrent neural network can determine the emotion the speaker is in, for example: excited, angry, sad or neutral (link). This can also be done using facial image processing.
  • Stress and Lie Detector: given a speech, using artificial neural network we can detect if the speaker is under stress or is lying (link). Similar result can also be achieved using facial image processing.

B4. Music

Machine Learning can be used to automatically detect the genre of a piece of music. The best approach is probably using a Convolutional Neural Network and a Bi-RNN block in parallel, which focus on extracting spatial features and temporal frame orders respectively (link).

Other applications of Machine Learning in music include predicting the region of origin of a piece of folk/traditional music (link) and making music (link), usually using RNN and CNN algorithms.

B5. Video

Machine Learning can be used to recognise objects in video (link), using CNN. As a team from Google and Stanford demonstrated (link), Machine Learning can be used to automatically classify a large set of videos into many classes. Using CNN they classified 1 million YouTube videos into 487 classes.

When we watch streaming videos on mobile phones or computers over 4G or WiFi, we sometimes experience buffering (long waits or breaks before the video continues) or pixelation. The MIT CSAIL team used machine learning to switch between rate-based and buffer-based Adaptive Bit Rate, to minimise buffering (link).

Machine Learning can also be used for behaviour analysis in animal research. For example, using a video feed we can determine whether Drosophila (fruit flies) were on or off the egg-laying substrates for each frame of the videos (link).

C. Numerical Data

C1. Health Care

Last year Edward Choi et al from Georgia Institute of Technology and Sutter Health used Machine Learning to process data from Sutter-PAMF primary care patients to predict heart failure (link). They used Regularised Logistic Regression, Multilayer Perceptron, Support Vector Machine and K-Nearest Neighbour, as well as a Recurrent Neural Network called Gated Recurrent Unit. They used Theano (a deep learning library for Python) as well as Python's Scikit-Learn.

Machine Learning was also used for predicting HIV/AIDS (link).

There are a lot of ML applications with numerical data, such as predictive maintenance, monitoring efficiency, CRM and robotics (link, link), as well as in financial services. But due to time limitations (I have just started Andrew Ng's Deep Learning course) I will have to write about them another time.


20 December 2017

Late Data Warehouse Loading

Filed under: Data Warehousing — Vincent Rainardi @ 8:26 am

If your source systems are running late, how do you ensure that your data warehouse is still loaded on time?

In a retail scenario, it is possible that the customer mastering or product mastering system gets delayed. Say they usually finish loading at 2am, but today it is 1pm and they haven't run yet. What do you do? You have the sales data and other fact data waiting to be loaded since 5am. The answer is: load the fact data using yesterday's customer data and yesterday's product data.

In a Kimball DW (dimensional model), the data warehouse designer often puts in ETL constraints so that the fact tables must be loaded after all the dimensions are loaded. If some dimensions are late, none of the fact tables are loaded.

There are two things we can do in this case:

  1. Make the dependencies fact-table specific. A fact table should only depend on the dimensions it uses. So fact table A, which only uses dimensions 1 to 4, should still be loaded even though dimension 5 is late.
  2. The second thing we can do is to have an “emergency ETL button” which enables us to load today's facts using yesterday's dimension data. Physically, this is what the button does: in the enterprise scheduler (such as Active Batch, Autosys or Control-M), the button removes the dependencies on the fact table loads so that they don't wait for the dimensions to be loaded.

For #2 above, the data warehouse load is happening as normal, but using whatever dimension data available at that time.

When the product dimension data (or whichever dimension was late) is available, it automatically triggers the normal warehouse load, which will update the Product SK in those fact tables.

The fact row for a new product would have a Product SK of 0. How do we know which product this fact row is for? The answer is to use double keys, i.e. we have 2 key columns: one for the Surrogate Key (SK) and one for the Business Key (BK). The fact row for a new product would have the BK column populated. So when the real load happens, we can populate the SK column by looking up the BK column in the product dimension.

Let me repeat: to enable us to correct the SK when the real load starts, we need “double keys” in the fact table: both the SK column and the BK column. Thanks and credit to Troy Stevenson for this “double key” idea.
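The double-key fix-up can be sketched in a few lines. This is an illustration only: the table and column names are made up, and in a real warehouse this would be an UPDATE joining the fact table to the dimension on the BK:

```python
# Dimension lookup (BK -> SK), available only after the late dimension loads
product_dim = {"P123": 55}

# Fact rows loaded early: the new product got SK = 0 but kept its BK
fact_rows = [
    {"product_bk": "P123", "product_sk": 0,  "amount": 81.0},  # late product
    {"product_bk": "P456", "product_sk": 12, "amount": 45.0},  # already resolved
]

def fix_late_skeys(fact_rows, dim):
    # resolve SK = 0 rows by looking up their business key in the dimension
    for row in fact_rows:
        if row["product_sk"] == 0 and row["product_bk"] in dim:
            row["product_sk"] = dim[row["product_bk"]]

fix_late_skeys(fact_rows, product_dim)
```

Because the BK travels with the fact row, no information is lost while the SK is still 0; the correction is a straightforward lookup once the dimension row arrives.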

13 December 2017

Loading Late Arriving Dimension Rows

Filed under: Data Warehousing — Vincent Rainardi @ 8:19 am

One of the most common things in “Kimball warehousing” (i.e. the dimensional model) is the late arriving dimension. And yet it is still not obvious to many people what to do when the expected dimension row finally comes, particularly in the case of a transactional fact table. In this article I will explain what a late arriving dimension row is, and what to do when it arrives.

Late Arriving Dimension Row

It is probably easier if I explain using an example. Suppose we have an internet shop. On Monday 11th Dec 2017 at 19:40 a new customer (customer A) registered on the shop, then purchased a product. Both the sales record (order 213) and the customer record were stored in the sales system. On Tuesday at 2am the warehouse load started, reading the sales data from the sales system, but reading the customer data from the customer master. Customer A was not sent from the sales system to the customer master until Tuesday at 8pm (because it is a daily load at 8pm). So when the data warehouse read the customer master on Tuesday at 2am, it did not get the data for customer A. The following day, on Wed 13th Dec at 2am, when the warehouse load started, it received the customer A data from the customer master.

Periodic Snapshot Fact Table

There is an issue here: at 19:40 on Monday the sales system should have pushed the customer A record to the customer master, but it didn't, because the customer master only loads from its several source systems (one of them being the sales system) once a day at 8pm. Because of this, on Tue 2am when the sales fact table was loaded in the warehouse, there was no row for customer A in Dim Customer, and therefore the Customer SK column on the sales fact row for Sales Order 213 is 0. Let me repeat: the customer surrogate key column for the order 213 fact row for business date Mon 11th Dec is 0, meaning “unknown customer”. See the yellow cell below.

That is what happened in the data warehouse load on Tuesday at 2am. What happened in the Wed 2am load? At that time the warehouse was populating the sales fact table for the 12th Dec business date (snapshot date), and this time the Customer SK column for order 213 was populated with the surrogate key of Customer A; see the green cell below:

1. Periodic Snapshot Fact Table
2. Customer Dimension

We can see above that Order 213 started to exist on 11th Dec (it didn’t exist on 10th Dec). The customer SK for Sales Order 213 was 0 on 11th Dec and 74 on 12th Dec.

Note: because there is a fact row on 11th Dec for Order 213, the total amount for that date is correct (£81), but the breakdown by City is incorrect: on 11th Dec Manchester has 0 sales. Only on 12th Dec does Manchester have the correct amount of sales, i.e. £45.

Transaction Fact Table

That was a Periodic Snapshot Fact Table. How about a Transaction Fact Table?

It is like this:
3. Transaction Fact Table

Order 213 started to exist in the transaction fact table on 11th Dec with Customer SK = 0. On 12th Dec the Customer SK was updated to 74.

How do we update this Customer SK from 0 to 74 on 12th Dec? More specifically, how do we know which fact row to update on each day?

The answer is by using the staging table. As part of the customer dimension loading process, we update each transaction fact table for every new customer. In the above example, when we loaded sales order 213 on the 11th Dec, we stored this in the sales staging table:

4. Sales Staging Table

On the 11th Dec the customer SK was 0. On 12th Dec when the data warehouse load ran, it checked this sales staging table for Customer SK = 0 and found that Order 213 had Customer SK = 0. It then looked at the Sales Order fact table, found the row for order 213, and updated the Customer SK column on the fact table to 74. It then updated the customer SK column on the staging table to 74 too.

It is often useful to put an “incomplete” indicator column (Y/N) on this staging table, so that when looking for any SK column = 0 we first filter the rows by Incomplete = Y, giving us far fewer rows to scan.
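The staging-driven fix-up above can be sketched as follows. Names and values are illustrative (in a real warehouse these would be UPDATE statements against the staging and fact tables):

```python
# Staging rows carry the BK, the (possibly 0) SK, and the Incomplete flag
staging = [{"order_id": 213, "customer_bk": "CustA",
            "customer_sk": 0, "incomplete": "Y"}]
fact = {213: {"customer_sk": 0, "amount": 45.0}}   # fact table keyed by order id
customer_dim = {"CustA": 74}                       # BK -> SK, newly arrived row

for row in staging:
    # the Incomplete = 'Y' filter keeps the scan small
    if row["incomplete"] == "Y" and row["customer_sk"] == 0:
        sk = customer_dim.get(row["customer_bk"])
        if sk is not None:
            fact[row["order_id"]]["customer_sk"] = sk  # fix the fact row
            row["customer_sk"] = sk                    # fix the staging row
            row["incomplete"] = "N"                    # mark as resolved
```

Flipping the flag to N as soon as the SK is resolved is what keeps subsequent daily scans cheap.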

So let me repeat, as part of the dimension loading process, we update each Transaction fact table for every new customer. This is the customer staging table:

5. Customer Staging Table

We can see above that Customer A started to exist in the data warehouse on 12th Dec, when Customer A was inserted as a new row into this customer staging table. As part of inserting a new customer, we correct the Customer SK in every transaction fact table for that customer.


  1. If the Customer Dimension is SCD type 2, then we need to update the customer SK in every transaction fact table, not only for new customers but for updated customers too. Every time a customer is updated, the old row is expired and a new row is created. Right after that, the customer SK in all corresponding fact rows needs to be updated. For this reason it is not practical to have a type 2 dimension for a transaction fact table.
    You can argue that there is no need to update the customer SK (for a type 2 customer dim) because that was the condition of the customer when the sale occurred 3 years ago. Yes, but in reality most businesses need to see the latest customer data (as of today), not the 3-year-old version. Some solutions create 2 customer SK columns in the transaction fact table: as it was when the sale occurred, and as it is today.
  2. The most important feature of an MDM (Master Data Management) system is to create new customers and products the moment they are created in any source system. If this timing requirement can't be met, there is no point spending any money on the MDM system. Even if the MDM integrates customer/product data from 10 different systems, if the creation of a new customer/product is one day late the MDM is practically unusable. For an MDM to be real time like this, there are 2 techniques: a) use a push mechanism, not a pull mechanism, and b) create a “skeleton” customer/product, i.e. the row for a new customer only contains the customer IDs, with the other columns left blank. This skeleton row is then updated later on. This is a key criterion for any MDM system: the row is gradually built up over time, not in one go. The updating can be a day late, but the creation of a new customer/product must be real time.
