Data Warehousing, BI and Data Science

15 July 2018

Power BI Q&A

Filed under: Data Warehousing — Vincent Rainardi @ 5:59 pm
  1. Q: Can Power BI read SSAS cubes?
    A: Yes, both tabular (link) and multidimensional (link)
  2. Q: Can Power BI display SSRS reports?
    A: Yes, but on-premises only, by installing Power BI Report Server (link).
    SSRS (paginated) reports are not available in the Power BI cloud service, but we can pin them to Power BI dashboards (link)
  3. Q: Can I use natural language to query the data in Power BI, e.g. “What were the total sales last year?”
    A: Yes, but only in the cloud version and only in English. On a Power BI dashboard, we can type our question in the “Ask a question about your data” box. Power BI Q&A uses the names of the tables, columns, and calculated fields (link).
  4. Q: What is the difference between a Power BI Dashboard and a Power BI Report?
    A: A dashboard is a single page, whereas a report can have multiple pages. Dashboards are only available in the cloud/service version, whereas reports are available in both the cloud and the on-prem/desktop versions. You can subscribe to a report, but not to a dashboard (link).
  5. Q: What is the difference between the “free” version and the “pay” version of Power BI?
    A: In the “free” version we can’t publish or consume apps, share dashboards, or subscribe to dashboards (link).
  6. Q: Can we view Power BI reports on Android, or just iPads?
    A: We can view Power BI reports on iPad and iPhone (link) as well as on Android tablets and phones (link) and Windows 10 devices (link)
  7. Q: How can Power BI in the cloud connect securely to an SQL Server on-prem?
    A: By installing a data gateway on-prem, which protects the on-prem SQL Server.
  8. Q: What’s the best way to input simple data into a Power BI report?
    A: Using embedded PowerApps (link)
  9. Q: Can I run R scripts in Power BI?
    A: Yes, Get Data > More, Other > R script. You can then use the output (data frame) in the Power BI visualisation (link).
  10. Q: How much is Power BI?
    A: Power BI Desktop is free. Power BI Pro is $9.99/month per user. Power BI Premium is $5000 to $6000/month for 1 to 1000 users, if only 100 users are authoring reports (link), i.e. the rest are just consuming the reports/dashboards.

 


11 May 2018

Big Volume Data Warehouse

Filed under: Data Warehousing — Vincent Rainardi @ 7:42 pm

Some people are worried about the volume of a snapshot fact table. If we insert half a million rows a day, after a year we would have about 180 million rows, and after 5 years close to 1 billion rows. Examples are the position valuations of a large asset manager, or the account balances of a large retail bank.

The volume issue also occurs with transaction fact tables, where we could be inserting a few million rows a day, resulting in a few billion rows a year. For example: stock exchange bids and offers, search engine queries, the browsing traffic of a large ecommerce site, or the good old supermarket checkout case.

If we haven’t dealt with fact tables of billions of rows like the above, we might worry about query performance. How long will it take to query a fact table with 10 billion rows? Would it take a minute? 10 minutes? Well, it could, if the database engine has to scan the whole table. But if the table is partitioned, it could take only a second.

So the question now becomes: OK, great to know that it can take a second to query a 10-billion-row fact table, but how do I partition that big fact table? Well, it depends on what kind of fact table it is. If it is a snapshot fact table, the partition should be on the snapshot date. If it is a transaction fact table, the partition should be on the transaction date.
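
As an illustration (not from the post), here is a minimal sketch of partitioning by snapshot date using pandas and Parquet (the pyarrow package is assumed to be installed, and the table, column and path names are made up):

```python
import pandas as pd

# Hypothetical daily account balance snapshot rows (names are made up for illustration)
snapshot = pd.DataFrame({
    "snapshot_date": ["2018-05-10"] * 3,
    "account_id":    [101, 102, 103],
    "balance":       [2500.00, 13200.50, 780.25],
})

# Writing the fact data partitioned on snapshot_date means a query that filters
# on one date only touches that date's partition (partition pruning), instead of
# scanning the whole multi-billion-row table.
snapshot.to_parquet(
    "fact_account_balance_snapshot",    # hypothetical output folder
    partition_cols=["snapshot_date"],   # partition key = snapshot date
    index=False,
)

# Reading back with a filter on the partition column only reads one partition
one_day = pd.read_parquet(
    "fact_account_balance_snapshot",
    filters=[("snapshot_date", "==", "2018-05-10")],
)
print(one_day)
```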

OK, that’s great. But what if we need to query across dates? For example, in a stock exchange data warehouse, to find the top 10 most traded shares by volume (or by value) in the last 3 months. That query could take many minutes. To answer it quickly, we prepare a fact table which stores the transaction volume and value of every share on every date. We populate this daily snapshot fact table every day, immediately after the main transaction fact table is populated.

For example, today is 11th May 2018. Right after the 10th May data is loaded into the shares transaction table, we calculate the volume and value of every share traded on 10th May and insert these rows into the daily snapshot fact table. Querying this table, it takes only a second to find the top 10 most traded shares in the last 3 months.
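
A minimal sketch of that daily aggregation step and the top-10 query in pandas; the column names are assumptions, not from the post:

```python
import pandas as pd

# Hypothetical slice of the shares transaction fact table for one business date
trades = pd.DataFrame({
    "trade_date": pd.to_datetime(["2018-05-10"] * 4),
    "share_code": ["ABC", "ABC", "XYZ", "DEF"],
    "volume":     [1000, 500, 2000, 750],
    "value":      [12000.0, 6100.0, 9500.0, 3300.0],
})

# Step 1 (run right after the 10th May load): one row per share per date
daily_share_snapshot = (
    trades.groupby(["trade_date", "share_code"], as_index=False)[["volume", "value"]]
          .sum()
)
# In the real warehouse these rows would be appended to the daily snapshot fact table.

# Step 2 (at query time): top 10 most traded shares by volume over the last 3 months,
# answered from the small snapshot table instead of the billions-row transaction table
cutoff = pd.Timestamp("2018-05-11") - pd.DateOffset(months=3)
top10 = (
    daily_share_snapshot[daily_share_snapshot["trade_date"] >= cutoff]
    .groupby("share_code")["volume"].sum()
    .nlargest(10)
)
print(top10)
```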

15 April 2018

BI vs Data Science

Filed under: Data Warehousing — Vincent Rainardi @ 5:57 am

Last night there was a talk about Business Intelligence (BI) vs Data Science at Dataiku London. Unfortunately I was not able to attend, but that event inspired me to write about this topic, as it has always been an interesting one for me. And I think it fits this blog well, as the blog is about Data Warehousing and BI, and recently I have added Data Science too.

Definition

Before we go too much further let’s remind ourselves what BI and Data Science are. Data Science is about scientific approaches to manage and analyse data using statistics, machine learning and visualisation (link). BI (Business Intelligence) is also about managing and analysing data and visualisation, but using business approaches, not scientific approaches (link, link).

I have been creating business intelligence and analytics applications across many industries, including travel, banking, pharmaceuticals, manufacturing and insurance; both with a data warehouse/mart and without. They are used for analysing the business, for reporting and for dashboarding.

I have done Big Data development for a few years, including Hadoop, Hive, Data Lake, Spark, and graph and document databases. In the last 6 months I have been learning Machine Learning and Neural Networks, using Jupyter, Matlab and R, and recently TensorFlow. I understand how they are used for visual recognition, predicting prices, network security, marketing and playing games.

This Article

I have written a few articles about “Choosing between Big Data and Data Warehousing” (link, link, link) which differentiate the back-end part of BI and Data Science. So in this article I would like to focus on the front-end part, i.e. how they are used.

One of the best approaches to compare the usage of BI and Data Science is probably to use cases from a particular industry: within that industry we compare what BI can do and what Data Science can do. For example, in insurance BI is used for analysing customers, earnings, risks, financials and claims, and for reporting (operational, management and regulatory reports), whereas Data Science is used to forecast losses, income and profitability (both account and product), to adjust rates, to classify customers for renewal, and to group potential customers.

If we dive into another industry, say travel, we will understand the differences between BI and Data Science a bit more. The more industries we study, the more we understand the differences in how BI and Data Science are used. One uses a business approach, and the other a scientific approach.

With the above background I think we are ready to dive into the core of this article, so let’s start with the first industry: insurance. I will then do one more industry: travel. I am a beginner in data science and particularly in machine learning, so suggestions and advice from readers are most welcome, regarding both the approach and the content, at vrainardi@gmail.com or via the comment box below. Once again, in this article I’m not discussing the how or the technical aspects, but the usage and business advantages, comparing BI and Data Science.

Insurance

General insurance is about insuring buildings, ships and businesses against risks like storms, sinking and fraud. There are about 20 classes in general insurance, including property, aviation, marine, construction, oil & gas, professional indemnity, public liability and riots. It is a little bit different from the life and pension business, and from the retail business, i.e. home, car, health, pet and travel insurance.

I guess in this example we need to pick one line out of the above 3 main insurance businesses, each of which branches into many lines. Let us use car insurance for this example, because it is a retail business that many people can relate to, and it’s easier to understand. It has many attributes which provide a good illustration of BI and Data Science usage.

The main usage of BI in car insurance is to analyse premium income and claim costs. These two numbers (and many of their derivatives, i.e. net, gross, subs, etc.) are analysed with regard to vehicle, time, office and customer attributes. Two of the most common methods for this analysis are OLAP cubes and dashboards. The dashboards present the most important findings, such as income growth by vehicle type and profitability by customer group, with guided drilldown along chosen pathways. The cubes provide a free-to-roam exploration platform, enabling users to drill any number down to the lowest detail along any available pathway.

The business value of these activities comes from knowing exactly when and where the revenues and costs arise and what factors influence them. This knowledge gives the business greater control and a greater ability to grow profitability across all business segments. Understanding the claims and premium profiles of each type of vehicle (age, van/SUV, fuel, gearbox, engine size, parking, usage, mileage) and each type of policy holder/driver (location, years of licence, profession, accident history, health, penalty points) enables the business to target the sales and marketing budgets on the more promising customers, as well as to price each policy at the correct level. It also enables more precise filtering when following up leads from aggregators such as GoCompare and MoneySupermarket, and from brokers.

Data Science is used to analyse customer churn, establish rates, and analyse black box telematics with regard to risks and their impact on premium levels. Standard actuarial rates give the business no competitive advantage, as they use standard factors such as driver age, engine size, mileage and parking location. Data Science enables insurers to factor in new attributes such as driving behaviour, driving routes and the driver’s social media, calculating the underwriting rates more specifically and more precisely for each policy. This enables the business to win new business, both commercial and retail, as the quote/price is the most important factor influencing new customers. Machine learning is also used to improve the efficiency of online advertising, by predicting how many users (and of which types) would click through on different types of ads and by optimising bidding.

So BI and Data Science give different business advantages. It is not true that Data Science will replace BI in the car insurance business, or in insurance in general. On the contrary, Data Science complements the business advantages that BI currently delivers.

Travel

Selling flights, hotels, holidays and cruises is the core business of travel companies. Car hire, travel insurance, excursions, airport transfers, airport parking, tours, restaurants, attractions, shows, villas, currency exchange and rail passes are the secondary businesses. Travel companies have retail and commercial business lines, both heavily utilising the internet as a marketing platform because it reaches many countries and is very cost efficient.

BI is used for analysing web traffic (such as funnels and conversion rates), revenue breakdowns, customer segmentation, customer loyalty programmes, and marketing campaigns such as email alerts and newsletters. Funnels enable the business to understand how much traffic filters through each step of the purchasing process, at which page each customer stops, the sequence of pages they viewed and for how long. This ultimately enables us to improve the website content and the products, resulting in higher conversion rates. Revenue is broken down by product, customer type, country, platform, campaign and time. Marketing campaigns enable the business to drive more sales, with the content of each email/newsletter tailored differently to each individual customer. Each image in the newsletter is uniquely tagged, enabling us to track which click from which customer triggered each sale.

The business values of these activities are increased revenue, a wider customer base and increased customer loyalty. Revenue is higher because of the higher conversion rates on web traffic and because the marketing campaigns drive more sales to the websites, and also because, as a result of the breakdown analysis, we can focus our budget and time on the more profitable and promising product types and customer segments. The customer base is wider because of the market penetration from the campaigns. Customer loyalty increases because the offers are tailored specifically to each customer segment, sometimes down to the individual customer.

Data Science can be used for customer segmentation, predicting campaign results, and analysing web traffic. This seems to overlap with the BI activities above, but if we dive into the details they actually complement each other. By associating real-time web traffic data with past purchase data using machine learning, travel companies can find out which customers have the best potential for purchasing products (using cookies to identify customers). The learned parameters are then used to display offers on the web pages the customers are currently on. The offers are specifically tailored to the products they are most likely to be interested in, based on their browsing behaviour (current and past sessions).

Predicting campaign results is done by first tracking back each sale to find its origin and contributing factors. The origin can be a click on an image in a particular email sent by a particular campaign. This can be identified by the tag on that image, which carries the customer ID, campaign ID and product ID. The origin can also be a Google or social media advert from a particular campaign that the travel company runs, giving us the ID of the ad and possibly the user/audience ID. The contributing factors of a successful campaign can be the product, creative, layout, timing, pricing, audience and origin. Knowing which factors contribute most to the success or failure of each campaign can be very revealing, and from that we can improve subsequent campaigns. Various machine learning algorithms can be used for this, including support vector machines, random forests and neural networks.

7 February 2018

Building a Neural Network

Filed under: Data Warehousing — Vincent Rainardi @ 6:03 am

This is a diagram of a neural network:

1. Diagram of a Neural Network

Each node in the above diagram receives input from all the nodes in the previous layer. For example, node 1 in hidden layer 1 receives input from the 4 nodes in the input layer: x1, x2, x3 and x4, like this:

2. Calculate Output of a Single Node

Each of the lines has a weight, so the inputs do not all have the same magnitude. In the above example, the input from x1 has a weight of 0.1, whereas the input from x2 has a weight of 0.2.

In addition to the weighted inputs, there is one special input going into the node, called the “bias” and notated as b. It is a constant (a number). So node n1 in hidden layer 1 receives 4 weighted inputs plus a bias. This total input is called z. In the above example it is 13.3.

The output of each node is a function of the input. This function is called the “activation function”. One of the common functions used is the “rectified linear unit”, abbreviated as “ReLU”. It is simply the maximum of the input and zero. In the above example, the output of node n1 is 13.3. This output is called a, which stands for “activation”.
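
As an illustration (not from the post), here is a minimal sketch of that single-node calculation in Python. The input values, weights and bias below are made up; they are only chosen so that the total comes out as 13.3, matching the post’s example, since the actual diagram values are not reproduced here.

```python
import numpy as np

def relu(z):
    """Rectified linear unit: the maximum of the input and zero."""
    return np.maximum(z, 0.0)

# Hypothetical values for node n1 in hidden layer 1 (made up for illustration)
x = np.array([10.0, 20.0, 30.0, 40.0])  # inputs x1..x4 from the input layer
w = np.array([0.1, 0.2, 0.1, 0.1])      # weight on each incoming line
b = 1.3                                  # bias of node n1

z = np.dot(w, x) + b   # total weighted input into the node: 13.3
a = relu(z)            # activation (output) of the node: 13.3
print(z, a)
```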

Forward Propagation

So, if we have the values of the input layer (x1, x2, x3, x4), the weights on each line between the input layer and hidden layer 1, and the biases for all 3 nodes in hidden layer 1, we can calculate the output of all 3 nodes in hidden layer 1 (a1, a2, a3), like step 1 in this diagram:

3. Forward Propagation

Once we have calculated the outputs of hidden layer 1, we can use them to calculate the outputs of the next layer, i.e. hidden layer 2. This is marked as Step 2 in the above diagram. We can then calculate the output of hidden layer 3 (step 3) and finally calculate the output of the entire network (step 4).
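
A rough sketch of those four steps in Python is below. The layer sizes and the random weights are assumptions (the post’s diagram values are not reproduced); only the mechanics of forward propagation are the point. The activation applied at the output node is discussed in the next section.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def forward(x, params):
    """Forward propagation: steps 1-3 through the hidden layers (ReLU),
    step 4 computes the weighted input z of the single output node."""
    a = x
    for W, b in params[:-1]:   # hidden layers 1, 2, 3
        a = relu(W @ a + b)
    W_out, b_out = params[-1]  # output layer
    return W_out @ a + b_out   # z of the output node (activation applied later)

# Assumed layer sizes: 4 inputs, three hidden layers of 3 nodes, 1 output node
layer_sizes = [4, 3, 3, 3, 1]
rng = np.random.default_rng(0)
params = [
    (rng.normal(size=(n_out, n_in)), rng.normal(size=n_out))
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
]

x = np.array([10.0, 20.0, 30.0, 40.0])  # the input layer values x1..x4
print(forward(x, params))
```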

Sigmoid Function

If the neural network is used to predict a binary condition (e.g. whether an image is a car or not), the activation function used on the output layer is usually not a ReLU function but a sigmoid function, which looks like this: (source: Wikipedia, link)

4. Sigmoid Function

We can see that for the majority of x values (input), the y value (output) is close to 0 or close to 1, which fits the binary condition.

So in Step 4, we put 51.877 x 0.7 + 41.099 x 0.3 + 6 = 54.6436 as the x (input) of the sigmoid function, and get a value very close to 1 as the output of the sigmoid (a).

We then predict the output of the neural network (ŷ) as follows: if a is more than 0.5, set ŷ to 1, otherwise set ŷ to 0.
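
A small sketch of step 4 and the thresholding rule in Python, using the numbers quoted in the post:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Step 4: weighted sum of the last hidden layer's outputs plus the bias
z = 51.877 * 0.7 + 41.099 * 0.3 + 6   # = 54.6436
a = sigmoid(z)                         # practically 1.0 for such a large input

# Prediction rule: y_hat = 1 if a > 0.5, otherwise 0
y_hat = 1 if a > 0.5 else 0
print(z, a, y_hat)
```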

Calculate The Cost

We do this forward propagation process for every data point that we have. In Machine Learning a data point (an input together with its actual output) is called “an example”. So for every example (input) we will have a predicted output (ŷ). We then compare these predicted outputs with the actual outputs, and the difference is called the “loss”. The average of the losses over all examples is called the “cost”.

There are many loss functions (link). For a binary classification where the output is a probability between 0 and 1 like above, the appropriate loss function is “cross entropy”, which looks like below: (source: ”ML Cheatsheet” from “Read the Docs”, link)

5. Graph of Cross Entropy Loss

So if the output should be 1 and the predicted output is 1, there is no loss (loss = 0). If the predicted output is very wrong, e.g. a small number like 0.1, it is penalised heavily. This “heavy penalty” comes from taking the log of the predicted probability, so the loss is not linear. The formula is like this: (p = predicted output, y = actual output)

Cross Entropy Loss = -(y log(p) + (1-y) log(1-p))

Which is derived from: if y = 1 the probability is p, else 1-p (the probability is p if y is 1, and 1-p if y is 0). Written as a single expression this becomes: P = p^y × (1-p)^(1-y)
Taking the log it becomes: y log(p) + (1-y) log(1-p).
Taking the minus of it, it becomes the formula above. We take the minus because log(x) is like the left graph below, whereas minus log(x) is like the right graph below:

6. Graph of Log and Minus Log

Note: “cross entropy” is the average number of bits needed to identify an event; see Wikipedia: link

That is the loss for 1 example. The cost is the average of the loss over all examples, i.e. the sum of the above divided by m, where m is the number of examples, like this:

7. Calculating Cost

That average loss is the cost of the entire neural network for this set of weights (the weight on every line, plus the biases). This is not the best set of weights; there are better ones which result in lower costs. If we can find the best weights, we get the lowest cost, which means the smallest gap between the predictions and the actual outputs (across all examples), i.e. the most accurate predictions. To find the best weights we need to go backward from the output layer towards the input layer. This is called “back propagation”.
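
A small sketch of the loss and cost calculation in Python, with made-up predictions and actual outputs:

```python
import numpy as np

def cross_entropy_loss(y, p, eps=1e-12):
    """Loss for one example: -(y*log(p) + (1-y)*log(1-p)).
    eps avoids log(0) when a prediction is exactly 0 or 1."""
    p = np.clip(p, eps, 1 - eps)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

# Hypothetical predicted probabilities (a) and actual outputs (y) for m = 4 examples
p = np.array([0.9, 0.1, 0.8, 0.3])
y = np.array([1,   0,   1,   1  ])

losses = cross_entropy_loss(y, p)  # one loss per example
cost = losses.mean()               # cost = sum of the losses divided by m
print(losses, cost)
```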

Back Propagation

 

Update The Parameters

 

References:

  1. Michael A. Nielsen, “Neural Networks and Deep Learning” book: link
  2. Denny Britz, “Implementing a Neural Network from scratch in Python”: link
  3. Sunil Ray, “Understanding and Coding Neural Network from scratch in Python and R”: link
  4. Matt Mazur, “Step by Step Back Propagation Example”: link.

 

27 December 2017

What Machine Learning Can Be Used For

Filed under: Data Warehousing — Vincent Rainardi @ 7:10 am

Machine Learning is a field of computer science that gives computers the ability to learn without being explicitly programmed (Wikipedia). That is a very good definition, but it isn’t worth much without concrete examples, so in this article I will focus on giving examples of what machine learning is currently used for, and can be used for.

Two Types of Usage

Broadly speaking Machine Learning can be used to do 2 things:

  1. If we have a set of data, Machine Learning can be used to classify that data into groups, or find associations between groups. This is called Unsupervised Learning.
    For example: grouping galaxies using astronomy images, grouping customers using purchase data, users who like news A also like news B.
  2. If we have pairs of inputs and outputs, Machine Learning can be used to predict the output for a new input. This is called Supervised Learning.
    For example: recognising symptoms of skin cancer based on the skin lesion images, identifying hacking attempts in cyber security based on network data, predicting the price movements of equity indices, forecasting the weather.

A. Image Recognition

By far the most widespread use of machine learning is image recognition. For example, recognising handwritten letters or numbers, vehicle plate numbers, and faces.

A1. Handwriting Recognition

Machine Learning can be used to recognise handwritten letters or numbers. This article (link) from Adam Geitgey explains how Machine Learning can recognise a handwritten number 8. An image of the number 8 is a series of numbers representing how dark each pixel is. Using a machine learning algorithm called a Convolutional Neural Network (CNN), computers can be trained to recognise a number 8. And other numbers. And letters. A Convolutional Neural Network breaks the image into small tiles and passes each tile into a small neural network. We then downsample the output and pass the sample to another neural network.
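
As a rough, self-contained sketch (not from the article), a minimal CNN for recognising handwritten digits could look like the following, using Keras and the MNIST dataset; the layer sizes and training settings are arbitrary choices:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# MNIST: 28x28 grayscale images of handwritten digits 0-9
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train = x_train[..., None] / 255.0   # add a channel dimension, scale pixels to 0-1
x_test = x_test[..., None] / 255.0

model = models.Sequential([
    # convolution: slide small 3x3 tiles over the image to extract local features
    layers.Conv2D(32, (3, 3), activation="relu", input_shape=(28, 28, 1)),
    # downsample the feature maps
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    # small dense network that maps the features to the 10 digit classes
    layers.Dense(64, activation="relu"),
    layers.Dense(10, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, epochs=1, batch_size=128, validation_split=0.1)
print(model.evaluate(x_test, y_test))
```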

We use handwriting recognition in post offices to sort thousands of letters into postcode areas by reading the addresses written on the envelopes. It is also used to read the amounts and recipient names on handwritten bank cheques.

A2. Plate Number Recognition

The same technique (CNN) is used for character recognition, not of handwritten characters but of the ones in images, for example to read the licence plate numbers at petrol stations. This is called ANPR (Automatic Number Plate Recognition). It is used at toll gates like the Dartford Crossing, so drivers do not have to queue to pay; instead, they can pay from their mobiles later (within 24 hours). It is also used in apartment blocks, i.e. only recognised cars are allowed in (and out!). It is also used to reduce congestion in big cities like London, where we have to pay £11.50 to drive into central London. Thousands of cameras at the borders of central London detect which cars come in and out, so the system can match the payments and the cars, and issue penalties to those who didn’t pay.

A3. Facial Recognition

We can use Machine Learning to recognise faces. This article (link, also from Adam Geitgey) explains how. It uses sets of 3 images to find 128 measurements for each person, using the HOG algorithm (Histogram of Oriented Gradients). 2 of the 3 images are photos of that person; the third is a photo of a different person. This way, each face is converted into a set of 128 numbers, which is called an encoding.

It is used by Facebook. We used to tag photos in Facebook and the tagged person got notified: “hey, you are tagged in such and such photo”. But now we don’t need to tag the faces and the person still gets notified, because Facebook recognises who the faces belong to. The notified person can then ask for the photo to be taken down, or like the photo, or tag it.

Facial recognition is also used for payment. Faces can be converted into a set of 128 numbers, which uniquely identify each person. Well, almost. Alipay uses face recognition for payment in KFC in China (link, link).

Banks now use face recognition to identify customers. Instead of a password or PIN, customers log in to their bank accounts using facial recognition via the camera on their laptop, tablet or phone. For example, Lloyds Bank and HSBC do this (link), and so do First Direct and Tesco Bank (link). In some banks, customers can also use their voice or fingerprint (on their mobile) to log in.

Facial recognition is also used by the police. At a beer festival in China, security cameras were installed at the entrance gates and sounded an alarm when a face was recognised as being in the police database (link). At the Notting Hill Carnival in London, the police used it to match faces in the crowd to a database of people previously arrested or under bail conditions. In Germany the police use facial recognition to catch terrorists (link).

A4. Searching Photos

What I mean here is not searching the text/labels that accompany the photos, but searching inside the pixels of the photos to find, for example, whether there is a flower in them.

  • Shutterstock allows us to search photos based on the composition (link).
  • Facebook Lumos will enable us to search photos, but does not yet (link).
  • The best I think at the moment is Google Photos. We can now search, without labelling our photos first (link). I tried searching “curry”, “flower” and “snow” and it correctly displayed photos containing curry, flower and snow. Amazing!

The basic application is what Google Photos does above. Shutterstock, the BBC, Facebook and every media company should have this technology, but they don’t yet. Media companies like newspapers and TV stations have millions of photos, which take a lot of time to label. The cost saving would be enormous if employees could search inside those photo collections without labelling them.

The reason it takes time is that we need to train the Machine Learning model behind the photo search engine to recognise millions of words. It is probably a mini project to train a single word such as “flower”, but training millions of words would take years.

A5. Astronomy

Machine Learning can be used to group galaxies according to their shapes, i.e. spiral, elliptical, lenticular and irregular (link). Why don’t we group them manually? Because there are billions of them. We can also use Machine Learning to detect exoplanets using the S4 algorithm (link) and CNNs (link), rather than random forests (link). We can also use it to analyse galaxy spectra (link, link).

Most applications of Machine Learning in Astronomy analyse thousands of galaxy images using image recognition algorithms such as Convolutional Neural Networks (CNN). Astronomy has a long history of using machine learning, going back to the 90s, but it used “older” algorithms such as decision trees and random forests. Since 2015 most image analysis in Astronomy has used deep learning algorithms such as CNNs.

A6. Manufacturing

In manufacturing, Machine Learning is used to automatically detect overheating in motors and defects in glass products (link). To detect overheating, the thermal images produced by infrared cameras are analysed using a CNN. Glass defects are detected using normal-light photographs.

There are many Machine Learning applications in manufacturing outside image recognition. Unlike Astronomy, manufacturing uses many different sensors, not just visual/image ones. Temperature, pressure, flow, weight, and many other measurements are analysed using Machine Learning for predictive maintenance and other purposes; see below for details.

A7. Health Care

Machine Learning is used to detect cancer by classifying FNA (fine needle aspiration) images. It is used to detect breast cancer (link), salivary gland cancer, liver cancer and thyroid cancer.

Machine Learning is used to identify tuberculosis from chest x-ray images (link). They used Deep Convolutional Neural Network models to learn TB positive and TB negative x-rays. This is particularly useful in areas where there is no cardiothoracic radiologist.

A8. Retail

Lowe’s Home Improvement uses a robot to answer customers’ questions such as “where is item X”, and to take the customer to the aisle where the item is located (link). It uses computer vision powered by Machine Learning to navigate the aisles of the store autonomously using cameras and odometry. It also does inventory monitoring in real time by scanning the shelves in the aisles.

When we find a pair of jeans, glasses or a handbag that we like, we can take a photo of it (or a screenshot), then use an app called SnapShop from GoFIND to find the product in thousands of shops across many brands (link).

When we shop for clothes online, we can see how various garments fit us by trying them on online using MeModel from Metail (link). We input our height, weight, bra size, waist, hips, hair style, gender, age and build so MeModel can predict and recommend various dresses, tops, skirts and trousers that fit and suit us. Retailers can see the demographic and body data of their customers, enabling them to provide suitable fashion that their customers like. The recommendation is computer vision based, powered by Machine Learning algorithms.

B. Other Media Data

Non-image media data are video and audio. There are a lot of Machine Learning applications for video and audio data, e.g. natural language processing and self-driving cars. In this section I will cover just the video and audio data, and will cover the text and numeric data in section C.

B1. Self Driving Car

This is probably the one receiving the biggest media attention. It is not just the video feed that the deep learning algorithm needs to analyse, but also data from radar, lidar, GPS, and the controllers and actuators in the car itself (link). A Kalman Filter is used to combine data from the different sensors to understand what is around the car, including pedestrians, traffic signs and other vehicles. PID and MPC controllers are used with deep learning algorithms to regulate the engine RPM, brakes, gears, accelerator and wheels of the car so that it moves at the correct speed and in the correct direction, with self-tuning. Finally there is an element of routing as well, partly using neural networks (link).

B2. Natural Language Processing (NLP)

NLP is about understanding human speech, both oral and written. It is also about producing speech, both oral and written. Speech recognition (link) is a major part of NLP, translating voice into text. Stemming algorithms (link) are used to find the base form of each word. Word2Vec (link, link) is used for understanding the meaning of words (link). Dependency parsing is used to understand how the words in a sentence relate to each other (link). The Machine Learning algorithms used for NLP include deep bi-directional RNNs, Gated Recurrent Units and Long Short-Term Memory networks (link). Convolutional Neural Networks are used for sentence classification (link) and Neural Machine Translation is used for large-scale translation tasks (link).

Text Analytics (or Text Mining) is an important part of NLP. In the paragraph above, all items from stemming onwards are in the Text Analytics domain. Basically, once voice has been translated into text, we are in Text Analytics territory.

B3. Speech Processing

Rather than trying to understand the full meaning of human speech, Speech Processing focuses on just one particular aspect of it. For example:

  • Converting Speech to Text: it is like a Dictaphone crossed with an OCR scanner. It can translate what we say to it into a document.
  • Speaker Identification: like a fingerprint, our voice can be used to identify us. It can be used in telephone banking to identify the customer (HSBC, Barclays), and in forensics to identify criminals in offences such as kidnapping, extortion, blackmail threats, terrorist calls, ransom calls, etc. (link).
  • Speech Generation (or Speech Synthesis): it concatenates many recorded voice clips of different words read by a human. If a word in the text does not exist in the “recorded word database”, the word is converted into phonemes (units of sound), such as /k/ in car, cool and kit. Each phoneme is converted into voice and concatenated.
  • Noise Reduction: using a deep recurrent auto encoder neural network, we can reduce the noise on an input audio (link). This is done by training the network on stereo (noisy and clean) audio features to predict the clean features given noisy input.
  • Emotional Recognition: given voices as input, a deep recurrent neural network can determine the emotion the speaker is in, for example: excited, angry, sad or neutral (link). This can also be done using facial image processing.
  • Stress and Lie Detector: given a speech, using artificial neural network we can detect if the speaker is under stress or is lying (link). Similar result can also be achieved using facial image processing.

B4. Music

Machine Learning can be used to automatically detect the genre of a piece of music. The best approach is probably using a Convolutional Neural Network and a Bi-RNN block in parallel, which focus on extracting spatial features and temporal frame order (link).

Other applications of Machine Learning in music include predicting the region of origin of a piece of folk/traditional music (link) and making music (link), usually using RNN and CNN algorithms.

B5. Video

Machine Learning can be used to recognise objects in video (link). This is done using CNNs. As a team from Google and Stanford demonstrated (link), Machine Learning can be used to automatically classify a large set of videos into many classes. Using CNNs they classified 1 million YouTube videos into 487 classes.

When we watch streaming videos on mobile phones or computers over 4G or WiFi, sometimes we experience buffering, i.e. long waits or breaks before the video continues, or pixelation. The MIT CSAIL team used machine learning to switch between rate-based and buffer-based Adaptive Bit Rate, to minimise buffering (link).

Machine Learning can also be used for behaviour analysis in animal research. For example, using a video feed we can determine whether Drosophila (fruit flies) were on or off the egg-laying substrates for each frame of the videos (link).

C. Numerical Data

C1. Health Care

Last year Edward Choi et al from the Georgia Institute of Technology and Sutter Health used Machine Learning to process data from Sutter-PAMF primary care patients to predict heart failure (link). They compared 4 Machine Learning methods (Regularised Logistic Regression, Multilayer Perceptron, Support Vector Machine and K-Nearest Neighbour) with a Recurrent Neural Network called a Gated Recurrent Unit. They used Theano (a deep learning library for Python) as well as Python scikit-learn.

Machine Learning was also used for predicting HIV/AIDS (link).

There are a lot of ML applications with numerical data, such as predictive maintenance, efficiency monitoring, CRM and robotics (link, link), as well as in financial services. But due to time limitations (I have just started Andrew Ng’s Deep Learning course) I will have to write about them another time.

 

20 December 2017

Late Data Warehouse Loading

Filed under: Data Warehousing — Vincent Rainardi @ 8:26 am

If your source systems are running late, how do you ensure that your data warehouse is still loaded on time?

In a retail scenario, it is possible that the customer mastering or product mastering system gets delayed. Say they usually finish loading at 2am, but today it’s 1pm and they haven’t run yet. What do you do? You have the sales data and other fact data waiting to be loaded since 5am. The answer is: load the fact data using yesterday’s customer data and yesterday’s product data.

In a Kimball DW (dimensional model), the data warehouse designer often puts ETL constraints in place so that the fact tables must be loaded after all the dimensions are loaded. If some dimensions are late, none of the fact tables are loaded.

There are two things we can do in this case:

  1. Make the dependencies fact-table specific. A fact table should only depend on the dimensions it uses. So fact table A, which only uses dimensions 1 to 4, should be loaded even though dimension 5 is late.
  2. The second thing we can do is to have an “emergency ETL button” which enables us to load today’s facts using yesterday’s dimension data. Physically, this is what the button does: in the enterprise scheduler (such as Active Batch, Autosys or Control-M), the button removes the dependencies on the fact table loads so that they don’t wait for the dimensions to be loaded.

For #2 above, the data warehouse load happens as normal, but uses whatever dimension data is available at that time.

When the product dimension data (or whichever dimension was late) is available, it automatically triggers the normal warehouse load, which will update the Product SK in those fact tables.

The fact row for a new product would have a Product SK of 0. How do we know which product this fact row is for? The answer is to use double keys, i.e. we have 2 key columns: one for the Surrogate Key (SK) and one for the Business Key (BK). The fact row for a new product would have the BK column populated. So when the real load happens, we can populate the SK column by looking up the BK in the product dimension.

Let me repeat: to enable us to correct the SK when the real load starts, we need to have “double keys” in the fact table: both the SK column and the BK columns. Thanks and credit to Troy Stevenson for this “double key” idea.
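
A minimal sketch of that SK correction in pandas; the table and column names below are assumptions for illustration, not from the post:

```python
import pandas as pd

# Fact rows loaded early using yesterday's dimension: unknown products get SK = 0
fact_sales = pd.DataFrame({
    "product_bk": ["P-100", "P-200", "P-300"],  # business key, always populated
    "product_sk": [15, 0, 0],                   # 0 = product not yet in the dimension
    "amount":     [80.0, 45.0, 30.0],
})

# Product dimension after the late dimension data has finally arrived
dim_product = pd.DataFrame({
    "product_bk": ["P-100", "P-200", "P-300"],
    "product_sk": [15, 74, 75],
})

# Correct the SK on the fact rows that were loaded with SK = 0,
# by looking up the business key in the now-complete dimension
bk_to_sk = dim_product.set_index("product_bk")["product_sk"]
unknown = fact_sales["product_sk"] == 0
fact_sales.loc[unknown, "product_sk"] = (
    fact_sales.loc[unknown, "product_bk"].map(bk_to_sk)
              .fillna(0)        # keep 0 if the product is still missing
              .astype(int)
)
print(fact_sales)
```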

13 December 2017

Loading Late Arriving Dimension Rows

Filed under: Data Warehousing — Vincent Rainardi @ 8:19 am

One very common thing in “Kimball warehousing” (i.e. the dimensional model) is the late arriving dimension. And yet it is still not obvious to many people what to do when the expected dimension row finally arrives, particularly in the case of a transaction fact table. In this article I will explain what a late arriving dimension row is, and what to do when it comes.

Late Arriving Dimension Row

It is probably easier if I explain using an example. Suppose we have an internet shop. On Monday 11th Dec 2017 at 19:40 a new customer (customer A) registered on the shop, then purchased a product. Both the sales record (order 213) and the customer record were stored in the sales system. On Tuesday at 2am the warehouse load started, reading the sales data from the sales system, but reading the customer data from the customer master. Customer A was not sent from the sales system to the customer master until Tuesday at 8pm (because the customer master’s daily load runs at 8pm). So when the data warehouse read the customer master on Tuesday at 2am, it did not get the data for customer A. The following day, on Wed 13th Dec at 2am, when the warehouse load started, it received customer A’s data from the customer master.

Periodic Snapshot Fact Table

There is an issue here: at 19:40 on Monday the sales system should have pushed customer A’s record to the customer master. But it didn’t, because the customer master only loads from its several source systems (one of which is the sales system) once a day, at 8pm. Because of this, when the sales fact table was loaded in the warehouse at 2am on Tuesday, there was no row for customer A in Dim Customer, and therefore the Customer SK column for the sales fact row for Sales Order 213 is 0. Let me repeat: the customer surrogate key column of the order 213 fact row for business date Mon 11th Dec is 0, meaning “unknown customer”. See the yellow cell below.

That’s what happened in the data warehouse load on Tuesday at 2am. What happened in the Wed 2am load? At that time the warehouse was populating the sales fact table for the 12th Dec business date (snapshot date), and this time the Customer SK column for order 213 was populated with the surrogate key of customer A; see the green cell below:

1. Periodic Snapshot Fact Table
2. Customer Dimension

We can see above that Order 213 started to exist on 11th Dec (it didn’t exist on 10th Dec). The customer SK for Sales Order 213 was 0 on 11th Dec and 74 on 12th Dec.

Note: because there is a fact row on 11th Dec for Order 213, the total amount for that date is correct (£81), but the breakdown by City is incorrect. On 11th Dec Manchester has 0 sales; only on 12th Dec does Manchester have the correct amount of sales, i.e. £45.

Transaction Fact Table

That was a Periodic Snapshot Fact Table. How about a Transaction Fact Table?

It is like this:
3. Transaction Fact Table

Order 213 started to exist in the transaction fact table on 11th Dec with Customer SK = 0. On 12th Dec the Customer SK was updated to 74.

How do we update this Customer SK from 0 to 74 on 12th Dec? More specifically, how do we know which fact row to update on each day?

The answer is by using a staging table. As part of the customer dimension loading process, we update each transaction fact table for every new customer. In the above example, when we loaded sales order 213 on the 11th Dec, we stored this in the sales staging table:

4. Sales Staging Table

On the 11th Dec the customer SK is 0. On the 12th Dec, when the data warehouse load ran, it checked this sales staging table for Customer SK = 0 and found that order 213 had Customer SK = 0. It then looked at the Sales Order fact table, found the row for order 213 and updated the Customer SK column on the fact table to 74. It then updated the customer SK column on the staging table to 74 too.

It is often useful to put an “incomplete” indicator column (Y/N) on this staging table, so that when looking for any SK column = 0 we first filter the rows by Incomplete = Y, which gives us far fewer rows to check.
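
A small pandas sketch of that staging-driven correction, including the “incomplete” flag; all names below are illustrative assumptions:

```python
import pandas as pd

# Sales staging table as it looked after the 11th Dec load
staging = pd.DataFrame({
    "order_id":    [213],
    "customer_bk": ["CUST-A"],
    "customer_sk": [0],        # customer A not in the dimension yet
    "incomplete":  ["Y"],
})

# Transaction fact table, loaded on the 11th with customer SK = 0
fact_sales = pd.DataFrame({
    "order_id":    [213],
    "customer_sk": [0],
    "amount":      [45.0],
})

# Customer dimension after the 12th Dec load (customer A now has SK 74)
dim_customer = pd.DataFrame({
    "customer_bk": ["CUST-A"],
    "customer_sk": [74],
})

# Only look at staging rows still flagged as incomplete
todo = staging[staging["incomplete"] == "Y"]
bk_to_sk = dim_customer.set_index("customer_bk")["customer_sk"]
resolved = todo.assign(new_sk=todo["customer_bk"].map(bk_to_sk)).dropna(subset=["new_sk"])

# Update the fact table, then mark the staging rows as complete
for _, row in resolved.iterrows():
    fact_sales.loc[fact_sales["order_id"] == row["order_id"], "customer_sk"] = int(row["new_sk"])
    staging.loc[staging["order_id"] == row["order_id"], ["customer_sk", "incomplete"]] = [int(row["new_sk"]), "N"]

print(fact_sales)
print(staging)
```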

So let me repeat, as part of the dimension loading process, we update each Transaction fact table for every new customer. This is the customer staging table:

5. Customer Staging Table

We can see above that Customer A started to exist in the data warehouse on 12th Dec. On 12th Dec, Customer A was a new row inserted into this customer staging table. As part of inserting a new customer, we corrected the Customer SK in every Transaction fact table for these new customers.

Note:

  1. If the Customer Dimension is SCD type 2, then we need to update the customer SK in every transaction fact table, not only for new customers but for updated customers too. Every time a customer is updated, the old row is expired and a new row is created. Right after that, the customer SK in all corresponding fact rows needs to be updated. For this reason it is not practical to have a type 2 dimension for a transaction fact table.
    You can argue that there’s no need to update the customer SK (for a type 2 customer dim) because that was the condition of the customer when the sale occurred 3 years ago. Yes, but in reality most businesses need to see the latest customer data (as of today), not the 3-year-old version. Some solutions create 2 customer SK columns in the transaction fact table: one as it was when the sale occurred and one as it is today.
  2. The most important feature of an MDM system (Master Data Management) is to create new customers and products the moment they are created in any source system. If this timing requirement can’t be met, there is no point spending any money on the MDM system. Even if the MDM integrates customer/product data from 10 different systems, if the creation of a new customer/product is one day late the MDM is practically not usable. In order for an MDM to be real time like this, there are 2 techniques: a) use a push mechanism, not a pull mechanism, and b) create a “skeleton” customer/product, i.e. the row for a new customer only contains the customer IDs, with the other columns left blank. This “skeleton row” is then updated later on. This is a key criterion for any MDM system: the row is gradually built up over time, not in one go. If we don’t do this we can’t meet the fundamental criterion of an MDM system, which is this: the creation of a new customer/product must be real time. The updating does not have to be real time, it can be a day late, but the creation of a new customer/product must be real time.

29 November 2017

What’s in Azure?

Filed under: Business Intelligence,Data Science,Data Warehousing — Vincent Rainardi @ 5:31 am

As of 28th Nov 2017 we can create the following in Microsoft Azure:

  • Data: SQL Database, SQL Server, SQL Data Warehouse, SQL Elastic Database Pool, MySQL, PostgreSQL, Data Lake, CosmosDB, MongoDB, Redis, Teradata Database, Oracle Database, IBM DB2, Aerospike, XtremeData, HazelCast, ArangoDB.
  • Data Integration: Informatica, Data Factory, Data Catalog, File Sync, Profisee Maestro Integration Server, Information Builder iWay Big Data Integrator, BizTalk Service, IBM WebSphere Application Server, IBM WebSphere MQ, IBM MQ, Datameer, QuerySurge.
  • Big Data: HDInsight, Hadoop, HDFS, HBase, Hive (Interactive Query), Hive Streaming, Spark, Kafka, Storm, Hortonworks, Cloudera, Cassandra, GridGain, MapR, F5 BIG-IP, Syncfusion, Informatica Big Data Mgt, Kyligence, AtScale.
  • Data Science: ML Experimentation, ML Model Management, R Server, Data Science VM, Bot Service, Computer Vision API, Face API, Language Understanding Intelligent Service, Translation Speech API, Text Analytics API, ML Studio Web Service, Sparkling Water H2O.
  • Analytics: Databricks, Stream Analytics, Analysis Services, Data Lake Analytics, Time Series Insights, Tableau Server, Qlik Sense Server, Pyramid Analytics, Nutanix Analytics, Real World Analytics, Exasol Analytic, HP Vertica Analytics, Teradata Aster Analytics, Bing Search, Dundas BI, Power BI, Panorama Necto, Spago BI, Targit, KNIME, SAP Hana, Kepion, Jedox, Diagramics.
  • Internet of Things: IoT Hub, Event Hub, Notification Hub, Pilot Things IoT VPN, TrendMicro IoT Security.
  • Developer Tools: Visual Studio, VS Anywhere, DevOps Project, Team Project, DevTest Labs, Application Insights, API Management, Operational Insights, Jenkins.
  • Infrastructure: Virtual Network, Load Balancer, Network Security Group, Security & Audit, Security & Compliance, Event Tracker Security Centre, Log Analytics, Automation, Active Directory, Scheduler, CloudAMQP, Cradeon.
  • Web: Website + SQL, WebApp, WordPress, Drupal, Joomla, Django, API App, Logic App, CDN, Media Services, VoIP, SiouxApp, App Dynamics, CakePHP, BlogEngine.NET, MVC Forum, Better CMS, Node JS
  • Compute: Windows Server, Red Hat Enterprise Linux, Oracle Linux, Ubuntu Server, VM Instances, Service Fabric Cluster, Web App for Container, Function App, Batch Service, Cloud Service.
  • Blockchain: Ethereum, Hyperledger, Corda, Quorum, STRATO, Chain Core, Stratis.

13 November 2017

Numerical Attributes

Filed under: Data Architecture,Data Warehousing — Vincent Rainardi @ 7:24 pm

Should we put numerical attributes in the dimension table or in the fact table? Numerical attributes are static attributes of an entity that are numeric rather than string, for example the interest rate of a swap or the weight of a product.

I usually follow the textbook approach and put them in the dimension. But for measures and attributes of OTC derivative instruments, we should be putting them in the fact table.

The fact table contains daily (and intraday) valuations, such as price, market value and the “Greeks”. An interest rate swap usually has a fixed leg and a floating leg. For example: receive 0.886%, pay EUR 3M + 0.567%. So we receive a fixed interest rate of 0.886% and pay our counterparty EURIBOR 3M plus 0.567% (at the time of writing, EURIBOR 3M is -0.329%).

These two attributes (the 0.886% and 0.567%) are part of the interest rate swap contract and are fixed for the duration of the contract (until maturity). When it reaches the maturity date, the contract ends and if we want to extend it, we will normally make a new contract, with a new SEDOL. Hence these 2 attributes are essentially static attributes (they are called receive interest rate and pay interest rate).

Because they are static they should be put into the dimension, which in this case is the security dimension or instrument dimension, rather than the fact table, which in this case is the instrument valuation fact table.

But I recommend putting them into the fact table for 4 reasons:

  1. Incorrect business judgement (a simple “sweep all” rule, and if we are wrong then it’s not too bad)
  2. Bytes
  3. Floating
  4. Calculation

1. Incorrect business judgement

There are many types of OTC derivative instruments, e.g. Interest Rate Swaps, Credit Default Swaps, FX Options, Equity Index Swaps, CDS Options, Asset Swaps, Total Return Swaps, etc. There are about 15 of them, each with different numerical attributes. So in total there are about 50 or so numerical attributes.

We could sit down with the business and categorise these 50 numerical attributes into two:

  • A measure or fact: meaning it potentially changes every day
  • A dimension attribute: meaning it is static and never changes

But the business judgement could be wrong. Something they think is static could actually be changing. Here are some of those 50 attributes; imagine sitting down with the business to categorise them into the two categories above:

Strike price interval, put option premium, strike price, intrinsic value, market value, fair value, time value, put call ratio, put call parity, recovery rate, expiration value, underlying asset value, fixed interest rate, floating interest rate margin, nominal amount, notional value, pay duration, pay/receive spread, pay/receive margin, credit spread.

In addition to the input from the business people, it can be useful to check the source system for historical changes. Does the value of that column/field ever change? If yes, how often?

If we miscategorise them, there are two things which can happen:
a) A measure is miscategorised as an attribute
b) An attribute is miscategorised as a measure

Generally speaking, b) is fine but a) has quite serious consequences, because a) means that 1 instrument in the instrument dimension can have 100 rows, even 1000 rows. And if we have 300,000 instruments (covering both active positions, and non-active positions), the size of the instrument dimension will be very big, 300 million rows, making it slow.

The reason an instrument can have 1000 rows is that the attribute is actually a measure. We think it is static, but it actually changes every day. Or even several times a day.

On the other hand if an attribute is miscategorised as a measure, that attribute will be put in the fact table. And because it is numeric, it does not take a lot of bytes, even if it is repeated every day.

So to be safe, if you are not sure, put it as a measure in the fact table. The setup of an instrument can take a few weeks, particularly if it is a swap. During this period there are continuous updates to the instrument rows. If we have 20 numerical attributes for a swap, during these few weeks we could end up with 20 rows, because on day 1 only 5 attributes are set up, on day 2 another 5 are set up, on day 3 another 5, and on day 4 they change the attributes set up on day 1, and so on. So the whole thing can take 3 weeks, producing 20 rows. If the attributes are in the fact table, a) we only get 1 version a day, and b) we prevent the instrument dimension from ballooning into many millions of rows.

2. Bytes

Numerical attributes take up far fewer bytes of disk space than textual attributes. Therefore, putting numerical attributes in the fact table does not slow things down much, compared with the risk of miscategorising them into the instrument dimension, which would cause the dimension to “explode”.

3. Floating

If the attribute value is 0.00352281 and it is a float in the source system, it can be stored as 0.0035228100000000001. And it is possible that the value the following day is 0.0035228100000000000, just because of rounding in the floating point processor; perhaps it is derived from a calculation of A/B. And if it changes from …00001 to …00000 and then to …00001 again the next day, we have a phenomenon called “flip flop”. Flip flop is where a dimension attribute keeps changing value (usually twice a day) because it is updated from 2 different systems. This causes the number of rows in the dimension to “explode”.
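
A tiny Python illustration of the floating point behaviour described above (the rate values and the two-system scenario are made up; only the rounding behaviour is the point):

```python
import math

# Classic example: binary floating point cannot represent most decimals exactly
x = 0.1 + 0.2
print(repr(x))                     # 0.30000000000000004
print(x == 0.3)                    # False

# The same effect on a derived attribute: two systems computing "the same" rate
# via different formulas can disagree in the last digits
rate_system_1 = 352281 / 100000000          # 0.00352281
rate_system_2 = 0.352281 / 100              # also "0.00352281", possibly a different float
print(repr(rate_system_1), repr(rate_system_2))
print(rate_system_1 == rate_system_2)       # may be False

# If such a value feeds an SCD type 2 attribute, every tiny difference creates a
# new dimension row (the flip flop). Comparing with a tolerance avoids that:
print(math.isclose(rate_system_1, rate_system_2, rel_tol=1e-9))   # True
```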

4. Calculation

The last argument is that numerical attributes might be required for calculation, for example column1 divided by column2, or column1 + column2 + column3. If all columns involved in the calculation are stored in the fact table, we will be able to perform the calculation without joining to any dimension. We just need to filter on the partitioning date of the fact table, which is very fast.

Conclusion

I am not advocating storing all numerical attributes in the fact table in all cases. Only in the case of OTC instruments would I recommend putting the numerical attributes in the fact table, because of the 4 reasons above.

But in most cases, the best approach is usually to put the numerical attributes in the dimension. For example: the dimensions of a product (the width, length and height) and the weight of a product. Yes, they do change from time to time, but not daily, hence we want to record these changes as SCD type 2 in the product dimension.

24 April 2017

Choosing between Big Data and Data Warehousing

Filed under: Data Architecture,Data Warehousing — Vincent Rainardi @ 4:14 am

Suppose we have 100 files, each containing 10 million rows, that we need to load into a repository so that we can analyse the data. What should we do? Do we put them into Hadoop (HDFS), or into a database (RDBMS)?

Last week I defined the difference between Big Data and Data Warehousing as: Big Data is Hadoop, and Data Warehousing is RDBMS. See my article here: link. Today I would like to illustrate using an example where we need to choose between the two.

There are 4 consideration factors:

  1. Data Structure
  2. Data Volume
  3. Unstructured Data
  4. Schema on Read

1. Data Structure: Simple vs Complex
If all 100 files have the same structure, e.g. they all consist of the same 10 columns, then it is better to put them into Hadoop. We can then use Hive, Spark, Presto, R or Python* to analyse the data, for example to find patterns within the data, do statistical analysis, or create forecasts. The development time will be shorter, because there is only 1 layer.
*or Phoenix, Impala, BigSQL, Stinger, Drill

If the 100 files contain 100 different tables, it is better to put them into a database, create a data warehouse, and use an Analytics/BI tool such as Tableau or QlikView* to analyse the data, for example to slice and dice the data, find percentages or anomalies, and do time series analysis. Yes, we need to create 3 layers (staging, 3NF, star schema), but it enables us to analyse each measure by different dimensions.
*or Looker, PowerBI, MicroStrategy, BusinessObjects, Cognos BI, Birt, Pentaho, Roambi, SAS, Sisense, or other BI tool

So if the data structure is simple, put the data into Hadoop, and if the structure is complex, put it into a data warehouse. This is the general rule, but there are always exceptions. Can data with a simple structure be put into a data warehouse? Of course it can. Can data with a complex structure be put into Hadoop? Of course it can.

Using Hadoop and Hive/Spark/Presto we can also slice and dice, find percentages or anomalies and do time series analysis. Using a data warehouse we can also do machine learning and data mining to find patterns in the data, do statistical analysis, and create forecasts. So basically, whether we store the data in Hadoop or in a data warehouse, we can still do complete analysis.

The issue here is storing the data. Linking 100 tables in Hadoop is difficult and not natural. An RDBMS such as SQL Server or Oracle is designed precisely for that task: linking and joining tables. Constructing a data model linking 100 tables is very suitable for an RDBMS. Can we design a data model linking 100 files with different structures in Hadoop? Of course we can, but it is much more difficult. For starters, it is Schema-on-Read, so the columns in the files have no data types. Schema-on-Read means that we don’t try to understand the relationships between the files when loading them into Hadoop. So yes, we can load the 100 files into Hadoop, but we keep them as individual files, without relationships between them. This is the same as a Data Lake, which also uses Schema-on-Read, also on HDFS.

2. Data Volume: Small vs Large

100 files containing 10 million rows each is 1 billion rows per day. If all 100 files have the same structure (say they all consist of the same 10 columns), then we will have a performance problem if we load them into an SMP database such as SQL Server or Oracle. Within 3 years, this table will have about 1 trillion rows. Even with partitioning and indexing, it will still be slow to query.

Hadoop, on the other hand, will have no problem storing and querying 1 trillion rows. It is designed exactly for this task, storing the data in many files and querying them in parallel using Stinger, Drill, Phoenix, Impala or Spark. The file structure is simple (the same 10 columns in each file), which lends itself to Hadoop.

Redshift, Azure SQL Data Warehouse, Exadata, Teradata, Greenplum and Netezza are more than capable of handling this, with excellent query performance. But MPPs are more costly than Hadoop, which is why companies tend to choose Hadoop for this task. Using an MPP for this task is like killing a fly with a cannon. Not only is it expensive and unnecessary, it is also too sluggish and cumbersome for the task.

If the 100 source files have a complex structure (such as an export from an SAP system) then yes, an MPP is a suitable solution, as we need to create relationships between the files/tables. But if the source files have a simple structure and we just need to union them, then Hadoop is more suitable and more economical for this task.

So if the data volume is large, like 1 billion rows per day, and the data structure is simple, put the data into Hadoop. But if the data volume is large and the data structure is complex, put it into an MPP.

3. Unstructured Data

If most of those 100 source files are MP4 (video) or MP3 (music), then Hadoop or a Data Lake is an ideal platform to store them. An RDBMS, be it SMP or MPP, is not designed to store video or music files. It can (as blobs, or as externally-linked files), but it is not really designed for it.

If the source files have different numbers of attributes (such as Facebook or Twitter files), then Hadoop or a Data Lake is an ideal platform to store them. An RDBMS is not really designed for it.

Unstructured data can also come in the form of free-format text files (such as emails) and documents (such as journals and patents). Again, Hadoop or a Data Lake is in a much better position to store them than an RDBMS. But even better is a document database, such as MongoDB, AWS DynamoDB or Azure Cosmos DB.

4. Schema-on-Read

One of the advantages of using Hadoop or a Data Lake is that they are Schema-on-Read, meaning that we just store the files without determining whether the columns are numeric or string. It is only when we want to query them that we need to specify the data types.

Why is this an advantage? Because it makes things flexible. In Data Warehousing the first thing we need to do is analyse the file structure and design many tables to host the files in a Staging database. Then we design a normalised database to integrate those Staging tables. And then we design a Reporting layer in the form of fact and dimension tables and load the normalised tables into them. The whole thing can take a year if we have 100 files. The more files we have, the more complex the process and the longer it takes to design the databases for the Integration and Reporting layers. It is good for the data architect (it gives them a job), but it is not good for the people who pay for the project.

Hadoop, on the other hand, is Schema-on-Read. After we put these 100 files in Hadoop, we query the first file, and when we query this first file, we specify the data types of each column. We don’t need to touch the other 99 files yet, and we can already get the benefit: we can analyse the data straight away. On day one! If the other 99 files have the same structure, then we can union them, without the extra effort of designing any database, and we can query them straight away. On day two! It is much simpler: we don’t need a team of 10 people designing Staging, Normalised or Reporting layers for many, many months. We can start analysing the data straight away and the project can finish in 2 to 3 months, with only 3 or 4 people. A lot less costly, a lot more agile, and a lot more flexible.
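
A rough sketch of schema-on-read using PySpark; the HDFS paths and column names are assumptions, and nothing is modelled in a database first: the schema is only declared at read/query time.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, DateType

spark = SparkSession.builder.appName("schema_on_read_demo").getOrCreate()

# The raw files just sit in HDFS as-is; the schema is only declared here, at read time
schema = StructType([
    StructField("trade_date", DateType()),
    StructField("share_code", StringType()),
    StructField("volume",     DoubleType()),
    StructField("value",      DoubleType()),
])

# Query the first file on day one...
first_file = spark.read.csv("hdfs:///landing/trades/file_001.csv", schema=schema, header=True)

# ...and union the other 99 later with no extra modelling, if they share the structure
all_files = spark.read.csv("hdfs:///landing/trades/*.csv", schema=schema, header=True)

all_files.createOrReplaceTempView("trades")
spark.sql("""
    SELECT share_code, SUM(volume) AS total_volume
    FROM trades
    GROUP BY share_code
    ORDER BY total_volume DESC
    LIMIT 10
""").show()
```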

Conclusion

So those are the 4 consideration factors when choosing between implementing big data and a data warehouse: data structure, data volume, unstructured data and schema-on-read.
