Data Warehousing and Data Science

27 December 2017

What Machine Learning Can Be Used For

Filed under: Data Science,Machine Learning — Vincent Rainardi @ 7:10 am

What Machine Learning Can Be Used For

Machine Learning is a field of computer science that gives computers the ability to learn without being explicitly programmed (Wikipedia). That is a very good definition but isn’t worth much without knowing concrete examples, so in this article I will focus on giving examples of what machine learning is currently used for, and can be used for.

Two Types of Usage

Broadly speaking Machine Learning can be used to do 2 things:

  1. If we have a set of data, Machine Learning can be used to classify that data into groups, or find associations between groups. This is called Unsupervised Learning.
    For example: grouping galaxies using astronomy images, grouping customers using purchase data, users who like news A also like news B.
  2. We have pairs of input and output Machine Learning can be used to predict the output of a new input. This is called Supervised Learning.
    For example: recognising symptoms of skin cancer based on the skin lesion images, identifying hacking attempts in cyber security based on network data, predicting the price movements of equity indices, forecasting the weather.

A. Image Recognition

By far the most widely use of machine learning is image recognition. For example, recognising hand written letters or numbers, vehicle plate numbers, and faces.

A1. Handwriting Recognition

Machine Learning can be used to recognise hand written letters or numbers. This article (link) from Adam Geitley explains the how. How Machine Learning can recognise a hand written number 8. An image of number 8 is a series of numbers which represents how dark each pixel is. Using a machine learning algorithm called Convolutional Neural Network (CNN), computers can be trained to recognise a number 8. And other numbers. And letters. Convolutional Neural Network breaks the image into small tiles and pass the tiles into a small neural network. We then sample the output and pass the sample to another neural network.

We use handwriting recognition in the post office to sort thousands of letters into post code areas by reading the address written on the envelopes. It is also used to read the amounts and recipient names on the handwritten bank cheques.

A2. Plate Number Recognition

The same technique (CNN) is used for character recognition. Not handwritten character but the ones on images. For example, to read the licence plate numbers in the petrol stations. It is called ANPR (Automatic Number Plate Recognition). It is used at toll gates like Dartford Crossing, so drivers does not have to queue to pay. Instead, they can pay from their mobiles later (within 24 hours). It is also used in apartments, i.e. only recognised cars are allowed in (and out!). It is also used to reduce congestions in big cities like London, i.e. we have to pay £11.50 to go into central London. Thousands of cameras at the borders of central London detect which cars are coming in and out so the system can match the payments and the cars, and issue penalties to those who didn’t pay.

A3. Facial Recognition

We can use Machine Learning to recognise faces. This article (link, also from Adam Geitley) explains the how. It uses a series of 3 images to find 128 measurements for each person using HOG algorithm (Histogram of Oriented Gradients). 2 of the 3 images are the photos of that person, the third one is a photo of a different person. This way, each face is converted into a set of 128 numbers, which called an encoding.

It is used in Facebook. We used to tag photos in Facebook and the tagged person get notified “hey you are tagged in such and such photo”. But now we don’t need to tag the faces and the person still get notified. This is because Facebook recognises who the faces belong to. The notified person can then ask the photo to be taken down, or like the photo, or tag it.

Facial recognition is also used for payment. Faces can be converted into a set of 128 numbers, which uniquely identify each person. Well, almost. Alipay uses face recognition for payment in KFC in China (link, link).

Banks now uses face recognition identify customers. Instead of passwords or PIN, to login to their bank accounts customers use facial recognition using the camera on their laptop, tablet or phone. For example, Lloyds bank and HSBC do this (link), and so do First Direct and Tesco bank (link). In some banks, customer can also use their voice or finger print (on their mobile) to login.

Facial recognition is also used by the police. In a beer festival in China security cameras were installed at entrance gates and sounded an alarm when it recognised a face is found in the police database (link). At the Notting Hill Carnival in London, the police used it to match the faces in the crowd to the database of people previously arrested or under bail conditions. In Germany the police uses facial recognition to catch terrorists (link).

A4. Searching Photos

What I mean here is not search the text/label that accompanies the photos, but searching inside the pixels of the photos to find for example if there is a flower in those photos.

  • Shutterstock allows us to search photos based on the composition (link).
  • Facebook Lumos will (not yet) enable us to search photos (link).
  • The best I think at the moment is Google Photos. We can now search, without labelling our photos first (link). I tried searching “curry”, “flower” and “snow” and it correctly displayed photos containing curry, flower and snow. Amazing!

The basic application is what Google Photos does above. Shutterstock, BBC, FB and every media company from should have this technology, but not yet. Media companies like newspapers and TV have millions of photos which take a lot of time to label them all. The cost saving would be enormous if employees can search inside those collection of photos, without labelling them.

The reason it takes the time is that we need to train the Machine Learning behind the photo search engine to recognised millions of words, which takes time. It’s probably a mini project to train a certain word such as flower, but to train millions of word would take years.

A5. Astronomy

Machine Learning can be used to group galaxies according to their shapes, i.e. spiral, elliptical, lenticular and irregular (link). Why don’t we group them manually? Because there are billions of them. We can also use Machine Learning to detect exoplanets using S4 algorithm (link) and CNN (link), rather than random forest (link). We can also use it to analyse galaxy spectrum (link, link).

Most of the application of Machine Learning in Astronomy is to analyse thousands of galaxy images using image recognition algorithms such as Convolutional Neural Network (CNN). Astronomy has a long history of using machine learning from the 90s. But it was using the “old” algorithm such as decision trees and random forest. From 2015 most of the image analysis in Astronomy is using deep learning algorithm such as CNN.

A6. Manufacturing

In manufacturing, Machine Learning is used to automatically detect overheating in motors, and defect in glass products (link). To detect overheating, the thermal images produced by infrared cameras are analysed using CNN. The glass defect is detected using normal-light photographs.

There are many Machine Learning application in manufacturing outside Image Recognition. Unlike Astronomy, manufacturing uses many different sensors, not just visual/images. Temperature, pressure, flow, weight, and many other measurements are analysed using Machine Learning for predictive maintenance and other purposes, see below for details.

A7. Health Care

Machine Learning is used to detect cancer by classifying the FNA images. It is used to detect breast cancer (link), salivary gland cancer, liver cancer and thyroid cancer.

Machine Learning is used to identify tuberculosis from chest x-ray images (link). They used Deep Convolutional Neural Network models to learn TB positive and TB negative x-rays. This is particularly useful in areas where there is no cardiothoracic radiologist.

A8. Retail

Lowe’s Home Improvement uses a robot to answer customer’s questions such as where is item X, and take the customer to the aisle where the item is located (link). It uses computer vision powered by Machine Learning to navigate the aisle in the store autonomously using camera and odometry. It also does inventory monitoring in real time by scanning the shelves in the aisle.

When we find a pair of jeans, glasses or handbag that we like, we can take a photo of it (or a screenshot), then uses an app called SnapShop from GoFIND to find the product in thousands of shops across many brands (link).

When we shop for clothes online, we can see how various garments fits us by trying it online using MeModel from Metail (link). We input our height, weight, bra size, waist, hips, hair style, gender, age and build so MeModel can predict and recommend various dresses, tops, skirts, trousers that fit and suit us. Retailers can see the demographic and body data of their customers, enabling them to provide suitable fashion that their customer like. The recommendation is a computer vision based, powered by Machine Learning algorithms.

B. Other Media Data

Non-image media data are video and audio. There are a lot of Machine Learning application for video data and audio data, e.g. natural language processing, self driving car. In this section I will list just the video and audio data, and will list the text and numeric data in section C.

B1. Self Driving Car

This is probably the one receiving the biggest media attention. It is not just the video feed that the deep learning algorithm needs to analyse, but also data from Radar, Lidar, GPS, and the controllers-actuators in the car itself (link). Kalman Filter is used to do combine data from different sensors to understand what’s around the car, including pedestrian, traffic signs and other vehicles. PID and MPC controllers are used with deep learning algorithm to regulate the engine RPM, breaks, gear, accelerator and wheels of the car so that it moves at the correct speed and direction, with self-tuning. Finally there is an element of routing as well, partly using neural network (link).

B2. Natural Language Processing (NLP)

NLP is about understanding human speech, both oral and written. It is also about producing speech, both oral and written. Speech recognition (link) is a major part of NLP, to translate voice into text. Stemming algorithms (link) are used the base form of each word. Word2Vec (link, link) is used for understanding the meaning of words (link). Dependency parsing is used to understand how each word in a sentence relates to each other (link). The Machine Learning algorithms used for NLP includes Deep Bi-directional RNN, Gated Recurrent Units and Long Short Term Memories (link). Convolutional Neural Networks is used for sentence classification (link) and Neural Machine Translation is used for large scale translation tasks (link).

Text Analytic (or Text Mining) is an important part of NLP. In the paragraph above, all items from Stemming onwards are in the Text Analytic domain. Basically, once the voices are translated into text, we are entering Text Analytic.

B3. Speech Processing

Rather than trying to understand the full meaning of human speech, Speech Processing focuses on just one particular aspect of it. For example:

  • Converting Speech to Text: it is like a Dictaphone crossed with OCR scanner. It can translate what we said to it into a document.
  • Speaker Identification: like a finger print, our voice can be used to identify us. It can be used in telephone banking to identify the customer (HSBC, Barclays), and in forensic to identify the criminal in offences such as kidnapping, extortion, blackmail threats, terrorist calls, ransom calls, etc (link).
  • Speech Generation (or Speech Synthesis): it is concatenation of many recorded voice of different words read by human. If the word in the text does not exist in the “recorded word database” then the word is converted into phoneme (one unit of sound), such as /k/ in car, cool and kit. Each phoneme is converted into voice and concatenated.
  • Noise Reduction: using a deep recurrent auto encoder neural network, we can reduce the noise on an input audio (link). This is done by training the network on stereo (noisy and clean) audio features to predict the clean features given noisy input.
  • Emotional Recognition: given voices as input, a deep recurrent neural network can determine the emotion the speaker is in, for example: excited, angry, sad or neutral (link). This can also be done using facial image processing.
  • Stress and Lie Detector: given a speech, using artificial neural network we can detect if the speaker is under stress or is lying (link). Similar result can also be achieved using facial image processing.

B4. Music

Machine Learning can be used to automatically detect the genre of a piece of music. The best approach is probably using Convolutional Neural Network and Bi-RNN block in parallel, which focus on the spatial features and temporal frame orders extraction (link).

Other application of Machine Learning in music include predicting the region of origin of a piece of folk/traditional music (link) and making music (link), usually using RNN and CNN algorithms.

B5. Video

Machine Learning can be used to recognise objects in video (link). It is done using CNN. As a team from Google and Stanford demonstrated (link) Machine Learning can be used to automatically classify a large set of videos into many classes. Using CNN they classified 1 million YouTube videos into 487 classes.

When we watch streaming videos on mobile phones or computers over 4G or WiFi, sometimes we experience buffering, i.e. long wait or break times before the video continues, or pixelation. The MIT CSAIL team used machine learning to switch between rate and buffer-based Adaptive Bit Rate, to minimize buffering (link).

Machine Learning can also be used for behaviour analysis in animal research. For example, using a video feed we can determine whether Drosophila (fruit flies) were on or off the egg-laying substrates for each frame of the videos (link).

C. Numerical Data

C1. Health Care

Last year Edward Choi et al from Georgia Institute of Technology and Sutter Health used Machine Learning to process data from Sutter-PAMF primary care patients to predict heart failure (link). They used 4 Machine Learning methods: Regularised Logistic Regression, Multilayer Perceptron, Support Vector Machine, K-Nearest Neighbour, and a Recurrent Neural Network called Gated Recurrent Unit. They used Theano (a deep learning software for Python) as well as Python Scikit-Learn.

Machine Learning was also used for predicting HIV/AIDS (link).

There are a lot of ML application with numerical data, such as Predictive Maintenance, Monitoring Efficiency, CRM, Robotic (link, link) as well as in Financial Services. But due to time limitation (I just started doing Andrew Ng’s Deep Learning course) I will have to write them another time.


20 December 2017

Late Data Warehouse Loading

Filed under: Data Warehousing — Vincent Rainardi @ 8:26 am

If your source systems are running late, how do you ensure that your data warehouse is still loaded on time?

In a retail scenario, it is possible that the customer mastering or product mastering system got delayed. Say they usually finish loading at 2am, but today it’s 1pm and they haven’t run yet. What do you? You have the sales data and other fact data waiting to be loaded since 5am. The answer is: load the fact data using yesterday’s customer data and yesterday’s product data.

In a Kimball DW (dimensional model), the data warehouse designer often put ETL constraints that the fact table must be loaded after all the dimensions are loaded. If some dimensions are late, all the fact tables are not loaded.

There are two things we can do in this case:

  1. Make the dependency fact table specific. A fact table should only be dependent on the dimension that it uses. So fact table A which is only uses dimension 1 to 4, should be loaded even though dimension 5 is late.
  2. The second thing we can do is to have an “emergency ETL button” which enables us to load today’s facts using yesterday’s dimension data. Physically, this is what the button does: in the enterprise scheduler (such as Active Batch, Autosys or Control-M), the button removes the dependency on the fact table loads so that they don’t wait for the dimensions to be loaded.

For #2 above, the data warehouse load is happening as normal, but using whatever dimension data available at that time.

When the product dimension data (or whatever dimension which was late) is available, it automatically trigger the normal warehouse load, which will update the Product SK in those fact tables.

The fact row for a new product would have 0 Product SK. How do we know what product this fact row is for? The answer for this is to use double keys, i.e. we have 2 key columns. One for Surrogate Key (SK) and one for the Business Key (BK). That fact row for a new product would have the BK column populated. So when the real load happens, we can populate the SK column by looking up the BK column in the product dimension.

Let me repeat: to enable us to correct the SK when the real load starts, we need to have “double keys” in the fact table: both the SK column and the BK columns. Thanks and credit to Troy Stevenson for this “double key” idea.

13 December 2017

Loading Late Arriving Dimension Rows

Filed under: Data Warehousing — Vincent Rainardi @ 8:19 am

One of very common thing in “Kimball warehousing” (i.e. dimensional model) is late arriving dimension. And yet it is still not obvious for many people what to do when the expected dimension row finally comes, particularly in the case of transactional fact table. In this article I will try to explain what a late arriving dimension row is, and what to do when it comes.

Late Arriving Dimension Row

It is probably easier if I explain using an example. Suppose we had an internet shop. On Monday 11th Dec 2017 at 19:40 a new customer (customer A) registered on the shop, then purchased a product. Both the sales record (order 213) and the customer record were stored in the sales system. On Tuesday at 2am the warehouse load started, reading the sales data from the sales system, but reading the customer data from the customer master. Customer A was not sent from the sales system to the customer master until Tuesday at 8pm (because it’s daily load at 8pm). So when the data warehouse on Tuesday 2am reads the customer master, it did not get the data for customer A. The following day, on Wed 13th Dec at 2am, when the warehouse load started, it received the customer A data from the customer master.

Periodic Snapshot Fact Table

There is an issue here, i.e. at 19:40 on Monday the sales system should have pushed customer A record to the customer master. But it didn’t because the customer master only loads from several source systems (one of them is the sales system) once a day at 8pm. Because of the issue, on Tue 2am when the sales fact table was loaded in the warehouse, there is no row for customer A in Dim Customer, and therefore the Customer SK column for that sales fact row for Sales Order 213 is 0. Let me repeat: the customer surrogate key column for order 213 fact row for business date Mon 11th Dec is 0, meaning “unknown customer”. See yellow cell below.

That’s what happened in the data warehouse load on Tuesday at 2am. What happened on Wed 2am load? At that time, the warehouse is populating the sales fact table for the 12th Dec business date (snapshot date), and this time, the Customer SK column for order 213 is populated with the surrogate key of Customer A, see the green cell below:

1. Periodic Snapshot Fact Table
2. Customer Dimension

We can see above that Order 213 started to exist on 11th Dec (it didn’t exist on 10th Dec). The customer SK for Sales Order 213 was 0 on 11th Dec and 74 on 12th Dec.

Note: because there is a fact row on 11th Dec for Order 213, the total amount for that date is correct (£81) but the breakdown on City is in correct. On 11th Dec Manchester has 0 sales. Only on 12th Dec Manchester has the correct amount of sales, i.e. £45.

Transaction Fact Table

That was a Periodic Snapshot Fact Table. How about a Transaction Fact Table?

It is like this:
3. Transaction Fact Table

Order 213 started to exist in the transaction fact table on 11th Dec with Customer SK = 0. On 12th Dec the Customer SK was updated to 74.

How do we update this Customer SK from 0 to 74 on 12th Dec? More specifically, how do we know which fact row to update on each day?

The answer is by using the staging table. As part of customer dimension loading process, we update each Transaction fact table for every new customer. On the above example, when we loaded sales order 213 on the 11th Dec, we store this in the sales staging table:

4. Sales Staging Table

On the 11th Dec the customer SK is 0. On 12th Dec when the data warehouse load ran, it checks this sales staging for Customer SK = 0 and fund that Order 213 has Customer SK = 0. It then looked at the Sales Order fact table and find the row for order 213 and update the Customer SK column on the fact table to 74. It then updated the customer SK column on the staging table with 74 too.

It is often useful to put an “incomplete” indicator column (Y/N) on this staging table, so that when looking for any SK column = 0, we first filter the rows by Incomplete = Y so we get much less rows.

So let me repeat, as part of the dimension loading process, we update each Transaction fact table for every new customer. This is the customer staging table:

5. Customer Staging Table.JPG

We can see above that Customer A started to exist in the data warehouse on 12th Dec. On 12th Dec, Customer A was a new row inserted into this customer staging table. As part of inserting a new customer, we corrected the Customer SK in every Transaction fact table for these new customers.


  1. If the Customer Dimension is SCD type 2, then we need to update the customer SK in every transaction fact table, not only for new customers, but for updated customers too. Every time a customer is updated, the old row is expired and a new row is created. Right after that, the customer SK in all corresponding fact rows need to be updated. For this reason it is not practical to have a type 2 dimension for a transaction fact table.
    You can argue that there’s no need to update the customer SK (for a type 2 customer dim) because that was the condition of that customer when the sale occurred 3 years ago. Yes but in reality most businesses need to see the latest customer data (as of today), not the 3 year old one. Some solutions create 2 customer SK columns in the transaction fat table: as it was when the sale occurred and as it is today.
  2. The most important feature of an MDM system (Master Data Management) is to create new customers and products the moment they are created in any source system. If this timing requirement can’t be met, there is no point spending any money on the MDM system. Even if the MDM integrates customer/product data from 10 different systems, if the creation of a new customer/product is one day late the MDM is practically not usable. In order for an MDM to be real time like this, there are 2 techniques: a) use a push mechanism, not a pull mechanism, and b) create a “skeleton” customer / product, i.e. the row for a new customer only contains customer IDs, the other columns are left blank. This “skeleton row” is then updated later on. This is a key criteria for any MDM system: the row is gradually built up over time, not in one go. If we don’t do this we can’t meet the fundamental criteria of an MDM system, which is this: the creation of a new customer/product must be real time. The updating is real time, it can be a day late, but the creation of a new customer/product must be real time.

Blog at