Data Warehousing, BI and Data Science

27 December 2017

What Machine Learning Can Be Used For

Filed under: Data Warehousing — Vincent Rainardi @ 7:10 am


Machine Learning is a field of computer science that gives computers the ability to learn without being explicitly programmed (Wikipedia). That is a good definition, but it isn't worth much without concrete examples, so in this article I will focus on what machine learning is currently used for, and what it can be used for.

Two Types of Usage

Broadly speaking, Machine Learning can be used to do 2 things:

  1. If we have a set of data, Machine Learning can be used to classify that data into groups, or to find associations between groups. This is called Unsupervised Learning.
    For example: grouping galaxies using astronomy images, grouping customers using purchase data, or finding that users who like news A also like news B.
  2. If we have pairs of inputs and outputs, Machine Learning can be used to predict the output for a new input. This is called Supervised Learning.
    For example: recognising symptoms of skin cancer from skin lesion images, identifying hacking attempts in cyber security from network data, predicting the price movements of equity indices, and forecasting the weather.

A. Image Recognition

By far the most widespread use of machine learning is image recognition, for example recognising handwritten letters or numbers, vehicle registration plates, and faces.

A1. Handwriting Recognition

Machine Learning can be used to recognise handwritten letters or numbers. This article (link) from Adam Geitgey explains how machine learning can recognise a handwritten number 8. An image of the number 8 is just a series of numbers representing how dark each pixel is. Using a machine learning algorithm called a Convolutional Neural Network (CNN), computers can be trained to recognise the number 8, and other numbers, and letters. A CNN breaks the image into small tiles and passes each tile into a small neural network. We then sample the outputs and pass the sample to another neural network.
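
To make this concrete, here is a minimal sketch of such a CNN in Python using Keras and the public MNIST handwritten-digit dataset. The library choice and the tiny architecture are my own for illustration; they are not taken from the article.

```python
# A minimal CNN digit recogniser (illustrative sketch, my own choices of
# library and architecture, not the article's).
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Load 28x28 greyscale images; each pixel is a number describing how dark it is.
(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
x_train = x_train[..., np.newaxis] / 255.0   # scale pixel values to 0-1
x_test = x_test[..., np.newaxis] / 255.0

model = keras.Sequential([
    layers.Input(shape=(28, 28, 1)),
    layers.Conv2D(32, kernel_size=3, activation="relu"),  # small tiles -> small network
    layers.MaxPooling2D(pool_size=2),                     # downsample (the "sampling" step)
    layers.Conv2D(64, kernel_size=3, activation="relu"),
    layers.MaxPooling2D(pool_size=2),
    layers.Flatten(),
    layers.Dense(10, activation="softmax"),               # one output per digit 0-9
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, epochs=1, batch_size=128, validation_split=0.1)
print("test accuracy:", model.evaluate(x_test, y_test, verbose=0)[1])
```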

Handwriting recognition is used in post offices to sort thousands of letters into postcode areas by reading the addresses written on the envelopes. It is also used to read the amounts and recipient names on handwritten bank cheques.

A2. Plate Number Recognition

The same technique (CNN) is used for character recognition: not handwritten characters, but the ones in photographs, for example reading licence plate numbers at petrol stations. This is called ANPR (Automatic Number Plate Recognition). It is used at toll gates such as the Dartford Crossing, so drivers do not have to queue to pay; instead, they can pay from their mobiles later (within 24 hours). It is also used in apartment blocks, i.e. only recognised cars are allowed in (and out!). And it is used to reduce congestion in big cities like London, where we have to pay £11.50 to drive into central London. Thousands of cameras at the boundary of central London detect which cars come in and out, so the system can match the payments to the cars and issue penalties to those who didn't pay.

A3. Facial Recognition

We can use Machine Learning to recognise faces. This article (link, also from Adam Geitgey) explains how. It uses sets of 3 images to find 128 measurements for each person using the HOG algorithm (Histogram of Oriented Gradients): 2 of the 3 images are photos of that person, and the third is a photo of a different person. This way, each face is converted into a set of 128 numbers, which is called an encoding.
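
As a rough illustration of the encoding idea, below is a short sketch using the open-source face_recognition Python library, which exposes exactly this kind of 128-number face encoding. The library choice and the file names are my assumptions, not something the article prescribes.

```python
# A sketch of the "128-number encoding" idea using the face_recognition
# package. File names are hypothetical.
import face_recognition

known_image = face_recognition.load_image_file("person_a.jpg")
unknown_image = face_recognition.load_image_file("who_is_this.jpg")

# Each detected face is converted into a list of 128 measurements (the encoding).
known_encoding = face_recognition.face_encodings(known_image)[0]
unknown_encoding = face_recognition.face_encodings(unknown_image)[0]

# Two encodings that are close together are (almost certainly) the same person.
match = face_recognition.compare_faces([known_encoding], unknown_encoding, tolerance=0.6)
distance = face_recognition.face_distance([known_encoding], unknown_encoding)
print("same person?", match[0], "distance:", distance[0])
```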

It is used by Facebook. We used to tag photos on Facebook, and the tagged person got notified: "hey, you are tagged in such and such photo". Now we don't need to tag the faces and the person still gets notified, because Facebook recognises who the faces belong to. The notified person can then ask for the photo to be taken down, or like the photo, or tag it.

Facial recognition is also used for payment. Faces can be converted into a set of 128 numbers which uniquely identifies each person. Well, almost. Alipay uses face recognition for payment at KFC in China (link, link).

Banks now use facial recognition to identify customers. Instead of passwords or PINs, customers log in to their bank accounts using facial recognition via the camera on their laptop, tablet or phone. For example, Lloyds Bank and HSBC do this (link), and so do First Direct and Tesco Bank (link). In some banks, customers can also use their voice or fingerprint (on their mobile) to log in.

Facial recognition is also used by the police. At a beer festival in China, security cameras were installed at the entrance gates and sounded an alarm when they recognised a face in the police database (link). At the Notting Hill Carnival in London, the police used it to match faces in the crowd against a database of people previously arrested or under bail conditions. In Germany the police use facial recognition to catch terrorists (link).

A4. Searching Photos

What I mean here is not searching the text/labels that accompany the photos, but searching inside the pixels of the photos to find, for example, whether there is a flower in them.

  • Shutterstock allows us to search photos based on the composition (link).
  • Facebook Lumos will (not yet) enable us to search photos (link).
  • The best at the moment, I think, is Google Photos. We can now search without labelling our photos first (link). I tried searching "curry", "flower" and "snow" and it correctly displayed photos containing curry, flowers and snow. Amazing!

The basic application is what Google Photos does above. Shutterstock, the BBC, Facebook and every other media company should have this technology, but they don't yet. Media companies such as newspapers and TV stations have millions of photos, which take a lot of time to label. The cost saving would be enormous if employees could search inside those photo collections without labelling them.

The reason it takes so long is that we need to train the machine learning model behind the photo search engine to recognise millions of words. Training it on a single word such as "flower" is probably a mini project, but training it on millions of words would take years.

A5. Astronomy

Machine Learning can be used to group galaxies according to their shapes: spiral, elliptical, lenticular and irregular (link). Why don't we group them manually? Because there are billions of them. We can also use Machine Learning to detect exoplanets, using the S4 algorithm (link) and CNNs (link) rather than random forests (link), and to analyse galaxy spectra (link, link).

Most applications of Machine Learning in Astronomy analyse thousands of galaxy images using image recognition algorithms such as Convolutional Neural Networks (CNN). Astronomy has a long history of using machine learning, going back to the 90s, but with "old" algorithms such as decision trees and random forests. Since 2015 most image analysis in Astronomy has used deep learning algorithms such as CNNs.

A6. Manufacturing

In manufacturing, Machine Learning is used to automatically detect overheating in motors and defects in glass products (link). To detect overheating, the thermal images produced by infrared cameras are analysed using a CNN. Glass defects are detected from normal-light photographs.

There are many Machine Learning applications in manufacturing outside image recognition. Unlike Astronomy, manufacturing uses many different sensors, not just visual/image ones. Temperature, pressure, flow, weight and many other measurements are analysed using Machine Learning for predictive maintenance and other purposes; see below for details.

A7. Health Care

Machine Learning is used to detect cancer by classifying FNA (fine needle aspiration) images. It is used to detect breast cancer (link), salivary gland cancer, liver cancer and thyroid cancer.

Machine Learning is used to identify tuberculosis from chest x-ray images (link). The researchers used Deep Convolutional Neural Network models to learn TB-positive and TB-negative x-rays. This is particularly useful in areas where there is no cardiothoracic radiologist.

A8. Retail

Lowe's Home Improvement uses a robot to answer customers' questions such as where item X is, and to take the customer to the aisle where the item is located (link). It uses computer vision powered by Machine Learning to navigate the store aisles autonomously using cameras and odometry. It also monitors inventory in real time by scanning the shelves in the aisles.

When we find a pair of jeans, glasses or a handbag that we like, we can take a photo of it (or a screenshot), then use an app called SnapShop from GoFIND to find the product in thousands of shops across many brands (link).

When we shop for clothes online, we can see how various garments fit us by trying them on online using MeModel from Metail (link). We input our height, weight, bra size, waist, hips, hair style, gender, age and build so MeModel can predict and recommend various dresses, tops, skirts and trousers that fit and suit us. Retailers can see the demographic and body data of their customers, enabling them to offer fashion their customers like. The recommendation is computer vision based, powered by Machine Learning algorithms.

B. Other Media Data

Non-image media data are video and audio. There are a lot of Machine Learning applications for video and audio data, e.g. natural language processing and self-driving cars. In this section I will cover just video and audio data; text and numeric data are in section C.

B1. Self Driving Car

This is probably the application receiving the biggest media attention. It is not just the video feed that the deep learning algorithm needs to analyse, but also data from radar, lidar, GPS, and the controllers and actuators in the car itself (link). A Kalman Filter is used to combine data from the different sensors to understand what is around the car, including pedestrians, traffic signs and other vehicles. PID and MPC controllers are used with deep learning algorithms to regulate the engine RPM, brakes, gears, accelerator and wheels of the car so that it moves at the correct speed and in the correct direction, with self-tuning. Finally there is an element of routing as well, partly using neural networks (link).
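
To give a flavour of the sensor-fusion part, here is a toy one-dimensional Kalman filter in Python. It is only a sketch of the idea under simplified assumptions; real self-driving stacks use multi-dimensional, extended or unscented variants, and the numbers below are invented.

```python
# A toy 1-D Kalman filter: blend noisy sensor readings into a smoother estimate.
def kalman_1d(measurements, meas_var=4.0, process_var=1.0):
    x, p = 0.0, 1000.0          # initial estimate and its (large) uncertainty
    estimates = []
    for z in measurements:
        # Predict: the state may have drifted, so uncertainty grows.
        p += process_var
        # Update: blend prediction and measurement, weighted by their uncertainties.
        k = p / (p + meas_var)  # Kalman gain
        x += k * (z - x)
        p *= (1 - k)
        estimates.append(x)
    return estimates

# Noisy readings of an object that is really at position 10.
print(kalman_1d([9.2, 10.7, 9.8, 10.3, 10.1]))
```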

B2. Natural Language Processing (NLP)

NLP is about understanding human speech, both spoken and written, and also about producing speech, both spoken and written. Speech recognition (link) is a major part of NLP, translating voice into text. Stemming algorithms (link) are used to find the base form of each word. Word2Vec (link, link) is used for understanding the meaning of words (link). Dependency parsing is used to understand how the words in a sentence relate to each other (link). The Machine Learning algorithms used for NLP include Deep Bi-directional RNNs, Gated Recurrent Units and Long Short-Term Memory networks (link). Convolutional Neural Networks are used for sentence classification (link) and Neural Machine Translation is used for large-scale translation tasks (link).
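
Here is a small sketch of two of the building blocks mentioned above: stemming (with NLTK) and word vectors (with gensim's Word2Vec). The library choices and the toy corpus are mine, purely for illustration.

```python
# Stemming and Word2Vec in a few lines (gensim 4.x API assumed).
from nltk.stem import PorterStemmer
from gensim.models import Word2Vec

stemmer = PorterStemmer()
print([stemmer.stem(w) for w in ["recognising", "recognised", "recognition"]])

# Word2Vec learns a vector for each word from the contexts it appears in.
corpus = [
    ["machine", "learning", "recognises", "images"],
    ["machine", "learning", "recognises", "speech"],
    ["deep", "learning", "translates", "speech"],
]
model = Word2Vec(sentences=corpus, vector_size=20, window=2, min_count=1, epochs=50)
print(model.wv.most_similar("speech", topn=2))  # words used in similar contexts
```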

Text Analytics (or Text Mining) is an important part of NLP. In the paragraph above, everything from stemming onwards is in the Text Analytics domain. Basically, once voice has been translated into text, we are in Text Analytics territory.

B3. Speech Processing

Rather than trying to understand the full meaning of human speech, Speech Processing focuses on just one particular aspect of it. For example:

  • Converting Speech to Text: like a Dictaphone crossed with an OCR scanner, it transcribes what we say into a document (see the sketch after this list).
  • Speaker Identification: like a fingerprint, our voice can be used to identify us. It can be used in telephone banking to identify the customer (HSBC, Barclays), and in forensics to identify criminals in offences such as kidnapping, extortion, blackmail threats, terrorist calls, ransom calls, etc (link).
  • Speech Generation (or Speech Synthesis): a concatenation of many recorded voices of different words read by a human. If a word in the text does not exist in the "recorded word database", the word is converted into phonemes (units of sound), such as /k/ in car, cool and kit. Each phoneme is converted into voice and concatenated.
  • Noise Reduction: using a deep recurrent autoencoder neural network, we can reduce the noise in an input audio signal (link). This is done by training the network on stereo (noisy and clean) audio features to predict the clean features given the noisy input.
  • Emotion Recognition: given voice as input, a deep recurrent neural network can determine the emotion the speaker is in, for example excited, angry, sad or neutral (link). This can also be done using facial image processing.
  • Stress and Lie Detection: given a speech sample, an artificial neural network can detect whether the speaker is under stress or lying (link). A similar result can also be achieved using facial image processing.
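
For the speech-to-text item, a minimal sketch using the SpeechRecognition Python package (which wraps several recognition engines) might look like this. The package choice and the file name are assumptions on my part; the article does not prescribe a tool.

```python
# Minimal speech-to-text sketch; the audio file name is hypothetical.
import speech_recognition as sr

recognizer = sr.Recognizer()
with sr.AudioFile("meeting_notes.wav") as source:
    audio = recognizer.record(source)   # read the whole file

try:
    # Uses Google's free web speech API under the hood.
    print(recognizer.recognize_google(audio))
except sr.UnknownValueError:
    print("Could not understand the audio")
```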

B4. Music

Machine Learning can be used to automatically detect the genre of a piece of music. The best approach is probably to use a Convolutional Neural Network and a Bi-RNN block in parallel, which focus on extracting spatial features and temporal frame order respectively (link).

Other applications of Machine Learning in music include predicting the region of origin of a piece of folk/traditional music (link) and composing music (link), usually using RNN and CNN algorithms.
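
As a sketch of the feature-extraction step that typically sits in front of such a genre classifier, here is an example using librosa. The library choice and the file name are mine; the linked papers use their own pipelines. The mel-spectrogram matrix is the kind of "image" a CNN like the one described would consume.

```python
# Extract spectrogram-style features from an audio file (hypothetical file name).
import librosa

y, sr = librosa.load("some_track.mp3", duration=30)          # 30-second clip
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)  # CNN-friendly "image"
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)            # compact timbre features
print("mel spectrogram shape:", mel.shape)   # (n_mels, time frames)
print("mfcc shape:", mfcc.shape)
```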

B5. Video

Machine Learning can be used to recognise objects in video (link), again using CNNs. As a team from Google and Stanford demonstrated (link), Machine Learning can be used to automatically classify a large set of videos into many classes: using CNNs they classified 1 million YouTube videos into 487 classes.

When we watch streaming video on mobile phones or computers over 4G or WiFi, we sometimes experience buffering (long waits or breaks before the video continues) or pixelation. The MIT CSAIL team used machine learning to switch between rate-based and buffer-based Adaptive Bit Rate to minimise buffering (link).

Machine Learning can also be used for behaviour analysis in animal research. For example, using a video feed we can determine whether Drosophila (fruit flies) were on or off the egg-laying substrate in each frame of the video (link).

C. Numerical Data

C1. Health Care

Last year Edward Choi et al. from the Georgia Institute of Technology and Sutter Health used Machine Learning to process data from Sutter-PAMF primary care patients to predict heart failure (link). They compared four baseline Machine Learning methods (Regularised Logistic Regression, Multilayer Perceptron, Support Vector Machine, K-Nearest Neighbour) with a Recurrent Neural Network called a Gated Recurrent Unit. They used Theano (a deep learning library for Python) as well as Python's scikit-learn.
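
Below is a hedged sketch of comparing those kinds of baseline classifiers with scikit-learn on tabular patient features. The data here is randomly generated; it is not the Sutter-PAMF data, and the GRU model from the paper is not reproduced.

```python
# Compare several classifiers on synthetic tabular data (illustrative only).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))               # 500 "patients", 20 numeric features
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=500) > 0).astype(int)

models = {
    "logistic regression": LogisticRegression(C=0.1, max_iter=1000),
    "multilayer perceptron": MLPClassifier(hidden_layer_sizes=(32,), max_iter=500),
    "support vector machine": SVC(),
    "k-nearest neighbour": KNeighborsClassifier(n_neighbors=5),
}
for name, model in models.items():
    auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    print(f"{name}: mean AUC = {auc:.3f}")
```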

Machine Learning was also used for predicting HIV/AIDS (link).

There are a lot of ML applications with numerical data, such as Predictive Maintenance, Monitoring Efficiency, CRM and Robotics (link, link), as well as in Financial Services. But due to time limitations (I have just started Andrew Ng's Deep Learning course) I will have to write about them another time.

 


20 December 2017

Late Data Warehouse Loading

Filed under: Data Warehousing — Vincent Rainardi @ 8:26 am

If your source systems are running late, how do you ensure that your data warehouse is still loaded on time?

In a retail scenario, it is possible that the customer mastering or product mastering system gets delayed. Say they usually finish loading at 2am, but today it's 1pm and they still haven't run. What do you do? You have the sales data and other fact data waiting to be loaded since 5am. The answer is: load the fact data using yesterday's customer data and yesterday's product data.

In a Kimball DW (dimensional model), the data warehouse designer often puts an ETL constraint that the fact tables must be loaded after all the dimensions are loaded. If some dimensions are late, none of the fact tables get loaded.

There are two things we can do in this case:

  1. Make the dependencies fact-table specific. A fact table should only depend on the dimensions it uses. So fact table A, which only uses dimensions 1 to 4, should be loaded even though dimension 5 is late.
  2. Have an "emergency ETL button" which enables us to load today's facts using yesterday's dimension data. Physically, this is what the button does: in the enterprise scheduler (such as ActiveBatch, Autosys or Control-M), the button removes the dependencies on the fact table loads so that they don't wait for the dimensions to be loaded.

For #2 above, the data warehouse load happens as normal, but using whatever dimension data is available at the time.

When the product dimension data (or whichever dimension was late) becomes available, it automatically triggers the normal warehouse load, which will update the Product SK in those fact tables.

The fact row for a new product would have Product SK = 0. How do we know which product this fact row is for? The answer is to use double keys, i.e. we have 2 key columns: one for the Surrogate Key (SK) and one for the Business Key (BK). The fact row for a new product would have the BK column populated, so when the real load happens we can populate the SK column by looking up the BK in the product dimension.

Let me repeat: to enable us to correct the SK when the real load runs, we need "double keys" in the fact table, i.e. both the SK column and the BK column. Thanks and credit to Troy Stevenson for this "double key" idea.

13 December 2017

Loading Late Arriving Dimension Rows

Filed under: Data Warehousing — Vincent Rainardi @ 8:19 am

A very common thing in "Kimball warehousing" (i.e. the dimensional model) is the late arriving dimension. And yet it is still not obvious to many people what to do when the expected dimension row finally arrives, particularly in the case of a transactional fact table. In this article I will explain what a late arriving dimension row is, and what to do when it arrives.

Late Arriving Dimension Row

It is probably easier to explain with an example. Suppose we have an internet shop. On Monday 11th Dec 2017 at 19:40 a new customer (customer A) registered on the shop, then purchased a product. Both the sales record (order 213) and the customer record were stored in the sales system. On Tuesday at 2am the warehouse load started, reading the sales data from the sales system but the customer data from the customer master. Customer A was not sent from the sales system to the customer master until Tuesday at 8pm (because it is a daily load at 8pm). So when the data warehouse load read the customer master on Tuesday at 2am, it did not get the data for customer A. The following day, Wed 13th Dec at 2am, when the warehouse load started, it received customer A's data from the customer master.

Periodic Snapshot Fact Table

There is an issue here: at 19:40 on Monday the sales system should have pushed the customer A record to the customer master, but it didn't, because the customer master only loads from its several source systems (one of which is the sales system) once a day at 8pm. Because of this issue, when the sales fact table was loaded into the warehouse on Tuesday at 2am there was no row for customer A in Dim Customer, and therefore the Customer SK column of the sales fact row for Sales Order 213 is 0. Let me repeat: the customer surrogate key column of the order 213 fact row for business date Mon 11th Dec is 0, meaning "unknown customer". See the yellow cell below.

That is what happened in the data warehouse load on Tuesday at 2am. What happened in the Wed 2am load? At that point the warehouse is populating the sales fact table for the 12th Dec business date (snapshot date), and this time the Customer SK column for order 213 is populated with the surrogate key of customer A; see the green cell below:

1. Periodic Snapshot Fact Table
2. Customer Dimension

We can see above that Order 213 started to exist on 11th Dec (it didn’t exist on 10th Dec). The customer SK for Sales Order 213 was 0 on 11th Dec and 74 on 12th Dec.

Note: because there is a fact row on 11th Dec for Order 213, the total amount for that date is correct (£81), but the breakdown by City is incorrect. On 11th Dec Manchester shows 0 sales; only on 12th Dec does Manchester show the correct sales amount, i.e. £45.

Transaction Fact Table

That was a Periodic Snapshot Fact Table. How about a Transaction Fact Table?

It is like this:
3. Transaction Fact Table

Order 213 started to exist in the transaction fact table on 11th Dec with Customer SK = 0. On 12th Dec the Customer SK was updated to 74.

How do we update this Customer SK from 0 to 74 on 12th Dec? More specifically, how do we know which fact row to update on each day?

The answer is: by using the staging table. As part of the customer dimension loading process, we update each transaction fact table for every new customer. In the above example, when we loaded sales order 213 on 11th Dec, we stored this in the sales staging table:

4. Sales Staging Table

On 11th Dec the customer SK is 0. On 12th Dec, when the data warehouse load ran, it checked this sales staging table for rows with Customer SK = 0 and found that Order 213 had Customer SK = 0. It then looked in the Sales Order fact table, found the row for order 213 and updated the Customer SK column on the fact table to 74. It then updated the customer SK column on the staging table to 74 too.

It is often useful to put an "incomplete" indicator column (Y/N) on this staging table, so that when looking for rows with any SK column = 0 we first filter by Incomplete = Y and get far fewer rows.
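
To make the correction step concrete, here is a small pandas sketch of the logic just described: find staging rows whose Customer SK is still 0, resolve the surrogate key by business key against the customer dimension, then update both the fact table and the staging table. Column names are illustrative; a real implementation would usually do this in set-based SQL inside the ETL tool.

```python
# Late-arriving dimension correction, sketched with pandas (toy data).
import pandas as pd

dim_customer = pd.DataFrame({"customer_sk": [74], "customer_bk": ["CUST-A"]})
fact_sales = pd.DataFrame({"order_id": [213], "customer_sk": [0], "amount": [45.0]})
stg_sales = pd.DataFrame({"order_id": [213], "customer_bk": ["CUST-A"],
                          "customer_sk": [0], "incomplete": ["Y"]})

# 1. Filter the staging table to incomplete rows with an unknown customer SK.
todo = stg_sales[(stg_sales["incomplete"] == "Y") & (stg_sales["customer_sk"] == 0)]

# 2. Resolve each business key to the newly arrived surrogate key.
resolved = todo.merge(dim_customer, on="customer_bk", suffixes=("", "_new"))
sk_map = dict(zip(resolved["order_id"], resolved["customer_sk_new"]))

# 3. Apply the resolved SK to the fact table and mark the staging rows complete.
fact_sales.loc[fact_sales["order_id"].isin(sk_map), "customer_sk"] = \
    fact_sales["order_id"].map(sk_map)
stg_sales.loc[stg_sales["order_id"].isin(sk_map), "customer_sk"] = \
    stg_sales["order_id"].map(sk_map)
stg_sales.loc[stg_sales["order_id"].isin(sk_map), "incomplete"] = "N"

print(fact_sales)
print(stg_sales)
```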

So let me repeat: as part of the dimension loading process, we update each transaction fact table for every new customer. This is the customer staging table:

5. Customer Staging Table

We can see above that customer A started to exist in the data warehouse on 12th Dec. On 12th Dec, customer A was inserted as a new row into this customer staging table. As part of inserting a new customer, we corrected the Customer SK in every transaction fact table for these new customers.

Note:

  1. If the Customer Dimension is SCD type 2, then we need to update the customer SK in every transaction fact table not only for new customers, but for updated customers too. Every time a customer is updated, the old row is expired and a new row is created. Right after that, the customer SK in all corresponding fact rows needs to be updated. For this reason it is not practical to have a type 2 dimension feeding a transaction fact table.
    You could argue that there is no need to update the customer SK (for a type 2 customer dimension) because that was the condition of the customer when the sale occurred 3 years ago. Yes, but in reality most businesses need to see the latest customer data (as of today), not the 3-year-old version. Some solutions create 2 customer SK columns in the transaction fact table: one as it was when the sale occurred and one as it is today.
  2. The most important feature of an MDM (Master Data Management) system is to create new customers and products the moment they are created in any source system. If this timing requirement can't be met, there is no point spending any money on the MDM system. Even if the MDM integrates customer/product data from 10 different systems, if the creation of a new customer/product is a day late the MDM is practically unusable. For an MDM to be real time like this, there are 2 techniques: a) use a push mechanism, not a pull mechanism, and b) create a "skeleton" customer/product, i.e. the row for a new customer contains only the customer IDs, with the other columns left blank. This skeleton row is then updated later on. This is a key criterion for any MDM system: the row is built up gradually over time, not in one go. The updating does not have to be real time (it can be a day late), but the creation of a new customer/product must be real time.

29 November 2017

What’s in Azure?

Filed under: Business Intelligence,Data Science,Data Warehousing — Vincent Rainardi @ 5:31 am

As of 28th Nov 2017 we can create the following in Microsoft Azure:

  • Data: SQL Database, SQL Server, SQL Data Warehouse, SQL Elastic Database Pool, MySQL, PostgreSQL, Data Lake, CosmosDB, MongoDB, Redis, Teradata Database, Oracle Database, IBM DB2, Aerospike, XtremeData, HazelCast, ArangoDB.
  • Data Integration: Informatica, Data Factory, Data Catalog, File Sync, Profisee Maestro Integration Server, Information Builder iWay Big Data Integrator, BizTalk Service, IBM WebSphere Application Server, IBM WebSphere MQ, IBM MQ, Datameer, QuerySurge.
  • Big Data: HDInsight, Hadoop, HDFS, HBase, Hive (Interactive Query), Hive Streaming, Spark, Kafka, Storm, Hortonworks, Cloudera, Cassandra, GridGain, MapR, F5 BIG-IP, Syncfusion, Informatica Big Data Mgt, Kyligence, AtScale.
  • Data Science: ML Experimentation, ML Model Management, R Server, Data Science VM, Bot Service, Computer Vision API, Face API, Language Understanding Intelligent Service, Translation Speech API, Text Analytics API, ML Studio Web Service, Sparkling Water H2O.
  • Analytics: Databricks, Stream Analytics, Analysis Services, Data Lake Analytics, Time Series Insights, Tableau Server, Qlik Sense Server, Pyramid Analytics, Nutanix Analytics, Real World Analytics, Exasol Analytic, HP Vertica Analytics, Teradata Aster Analytics, Bing Search, Dundas BI, Power BI, Panorama Necto, Spago BI, Targit, KNIME, SAP Hana, Kepion, Jedox, Diagramics.
  • Internet of Things: IoT Hub, Event Hub, Notification Hub, Pilot Things IoT VPN, TrendMicro IoT Security.
  • Developer Tools: Visual Studio, VS Anywhere, DevOps Project, Team Project, DevTest Labs, Application Insights, API Management, Operational Insights, Jenkins,
  • Infrastructure: Virtual Network, Load Balancer, Network Security Group, Security & Audit, Security & Compliance, Event Tracker Security Centre, Log Analytics, Automation, Active Directory, Scheduler, CloudAMQP, Cradeon.
  • Web: Website + SQL, WebApp, WordPress, Drupal, Joomla, Django, API App, Logic App, CDN, Media Services, VoIP, SiouxApp, App Dynamics, CakePHP, BlogEngine.NET, MVC Forum, Better CMS, Node JS
  • Compute: Windows Server, Red Hat Enterprise Linux, Oracle Linux, Ubuntu Server, VM Instances, Service Fabric Cluster, Web App for Container, Function App, Batch Service, Cloud Service.
  • Blockchain: Ethereum, Hyperledger, Corda, Quorum, STRATO, Chain Core, Stratis.

13 November 2017

Numerical Attributes

Filed under: Data Architecture,Data Warehousing — Vincent Rainardi @ 7:24 pm

Should we put numerical attributes in the dimension table or in the fact table? Numerical attributes are static attributes of an entity that are numeric rather than string, for example the interest rate of a swap or the weight of a product.

I usually follow the textbook approach and put them in the dimension. But for the measures and attributes of OTC derivative instruments, we should put them in the fact table.

The fact table contains daily (and intraday) valuations, such as price, market value and the "Greeks". An interest rate swap usually has a fixed leg and a floating leg, for example: receive 0.886%, pay EUR 3M + 0.567%. So we receive a fixed interest rate of 0.886% and pay our counterparty EURIBOR 3M plus 0.567% (currently EURIBOR 3M is -0.329%).

These two attributes (the 0.886% and the 0.567%) are part of the interest rate swap contract and are fixed for the duration of the contract (until maturity). When it reaches the maturity date the contract ends, and if we want to extend it we normally make a new contract, with a new SEDOL. Hence these 2 attributes are essentially static attributes (they are called the receive interest rate and the pay interest rate).

Because they are static they should be put into the dimension, which in this case is the security dimension or instrument dimension, rather than the fact table, which in this case is the instrument valuation fact table.

But I recommend putting them into the fact table for 4 reasons:

  1. Incorrect business judgement (a simple “sweep all” rule, and if we are wrong then it’s not too bad)
  2. Bytes
  3. Floating
  4. Calculation

1. Incorrect business judgement

There are many types of OTC derivative instruments, e.g. Interest Rate Swaps, Credit Default Swaps, FX Options, Equity Index Swaps, CDS Options, Asset Swaps, Total Return Swaps, etc. There are about 15 of them, each with different numerical attributes, so in total there are about 50 or so numerical attributes.

We could sit down with the business and categorise these 50 numerical attributes into two groups:

  • A measure or fact: meaning they are potentially changing every day
  • A dimension attribute: meaning that they are static and never change

But the business judgement could be wrong: something they think is static could actually change. Here are some of those 50 attributes; imagine sitting down with the business and categorising them into the two categories above:

Strike price interval, put option premium, strike price, intrinsic value, market value, fair value, time value, put call ratio, put call parity, recovery rate, expiration value, underlying asset value, fixed interest rate, floating interest rate margin, nominal amount, notional value, pay duration, pay/receive spread, pay/receive margin, credit spread.

In addition to the input from the business people, it can be useful to check the source system for historical changes. Does the value of that column/field ever change? If so, how often?

If we miscategorise them, there are two things which can happen:
a) A measure is miscategorised as an attribute
b) An attribute is miscategorised as a measure

Generally speaking, b) is fine but a) has quite serious consequences, because a) means that 1 instrument in the instrument dimension can have 100 rows, even 1,000 rows. And if we have 300,000 instruments (covering both active and non-active positions), the instrument dimension will be very big, say 300 million rows, making it slow.

The reason an instrument can have 1,000 rows is that the attribute is actually a measure. We think it is static, but it actually changes every day, or even several times a day.

On the other hand, if an attribute is miscategorised as a measure, that attribute will be put in the fact table. And because it is numeric, it does not take up many bytes, even if it is repeated every day.

So to be safe, if you are not sure, treat it as a measure in the fact table. The setup of an instrument can take a few weeks, particularly if it is a swap, and during this period there are continuous updates to the instrument rows. If we have 20 numerical attributes for a swap, during this few-week period we could end up with 20 dimension rows: on day 1 only 5 attributes are set up, on day 2 another 5, on day 3 another 5, and on day 4 they change the attributes set up on day 1, and so on. If the attributes are in the fact table instead, a) we only get 1 version a day, and b) we prevent the instrument dimension from ballooning into many millions of rows.

2. Bytes

Numerical attributes take up far fewer bytes of disk space than textual attributes. Therefore putting numerical attributes in the fact table does not slow things down much, compared to the risk of miscategorising them into the instrument dimension, which would cause the dimension to "explode".

3. Floating

If the attribute value is 0.00352281 and it is a float in the source system, it can be stored as 0.0035228100000000001. And it is possible that the value the following day is 0.0035228100000000000, purely because of rounding in the floating point processor; perhaps it is derived from a calculation of A/B. If it keeps changing from ...00001 to ...00000 and back to ...00001 the next day, we get a phenomenon called "flip flop": a dimension attribute keeps changing value (usually twice a day) because it is updated from 2 different systems. This causes the number of rows in the dimension to "explode".
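
A tiny Python demonstration of the underlying floating point behaviour (the classic example below is illustrative; the exact trailing digits in a real feed depend on the precision and the calculation involved):

```python
import math

# Two routes to what is conceptually the same number:
a = 0.1 + 0.2
b = 0.3
print(repr(a), repr(b))          # 0.30000000000000004 vs 0.3
print("bitwise equal?", a == b)  # False

# Comparing with a tolerance (or rounding to a fixed number of decimals before
# loading the dimension) prevents the spurious "flip flop" rows described above.
print("close enough?", math.isclose(a, b, rel_tol=1e-9))  # True
```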

4. Calculation

The last argument is that numerical attributes might be required for calculations, for example column1 divided by column2, or column1 + column2 + column3. If all the columns involved in the calculation are stored in the fact table, we can perform the calculation without joining to any dimension. We just need to filter on the partitioning date of the fact table, which is very fast.

Conclusion

I am not advocating storing all numerical attributes in the fact table in all cases. Only in the case of OTC instruments would I recommend putting the numerical attributes in the fact table, for the 4 reasons above.

But in most cases the best approach is to put the numerical attributes in the dimension. For example: the dimensions of a product (width, length and height) and the weight of a product. Yes, they do change from time to time, but not daily, hence we want to record these changes as SCD type 2 in the product dimension.

24 April 2017

Choosing between Big Data and Data Warehousing

Filed under: Data Architecture,Data Warehousing — Vincent Rainardi @ 4:14 am

If we have 100 files, each containing 10 million rows, that we need to load into a repository so that we can analyse the data, what should we do? Do we put them into Hadoop (HDFS), or into a database (RDBMS)?

Last week I defined the difference between Big Data and Data Warehousing as: Big Data is Hadoop, and Data Warehousing is RDBMS. See my article here: link. Today I would like to illustrate using an example where we need to choose between the two.

There are 4 factors to consider:

  1. Data Structure
  2. Data Volume
  3. Unstructured Data
  4. Schema on Read

1. Data Structure: Simple vs Complex
If all 100 files have the same structure, and they all consist of the same 10 columns, then it is better to put them into Hadoop. We can then use Hive* and R to analyse the data, for example to find patterns in the data, do statistical analysis, or create forecasts. The development time will be shorter, because there is only 1 layer.
*or Phoenix, Impala, BigSQL, Stinger, Drill or Spark, depending on your Hadoop platform

If the 100 files contain 100 different tables, it is better to put them into a database, create a data warehouse, and use an Analytics/BI tool such as MicroStrategy or Tableau to analyse the data, for example to slice and dice the data, find percentages or anomalies, and do time series analysis. Yes, we need to create 3 layers (staging, 3NF, star schema), but it enables us to analyse each measure by different dimensions.

So if the data structure is simple, put the data into Hadoop, and if the structure is complex, put it into a data warehouse. This is the general rule, but there are always exceptions. Can data with a simple structure be put into a data warehouse? Of course it can. Can data with a complex structure be put into Hadoop? Of course it can.

Using Hadoop and Hive/R we can also slice and dice, find percentages or anomalies, and do time series analysis. Using a data warehouse we can also do machine learning and data mining to find patterns in the data, do statistical analysis, and create forecasts. So basically, whether we store the data in Hadoop or in a data warehouse, we can still do complete analysis.

The issue here is storing it. Linking 100 tables in Hadoop is difficult and not natural. An RDBMS such as SQL Server or Oracle is designed precisely for that task: linking and joining tables. Constructing a data model linking 100 tables is very suitable for an RDBMS. Can we design a data model linking 100 files with different structures in Hadoop? Of course we can, but it is much more difficult. For starters, it is Schema-on-Read, so the columns in the files have no data types. Schema-on-Read means that we don't try to understand the relationships between the files when loading them into Hadoop. So yes, we can load the 100 files into Hadoop, but we keep them as individual files, without relationships between them. This is the same as in a Data Lake, which also uses Schema-on-Read, also on HDFS.

2. Data Volume: Small vs Large

100 files containing 10 million rows each is 1 billion rows per day. If all 100 files have the same structure (say they all consist of the same 10 columns), then we will have a performance problem if we load them into an SMP database such as SQL Server or Oracle. Within 3 years this table will have about 1 trillion rows; even with partitioning and indexing, it will be slow to query.

Hadoop, on the other hand, will have no problem storing and querying 1 trillion rows. It is designed exactly for this task: storing the data in many files and querying it in parallel using Stinger, Drill, Phoenix, Impala or Spark. The file structure is simple (the same 10 columns in each), which lends itself to Hadoop.

Teradata, Greenplum, Netezza, Exadata and Azure SQL Data Warehouse are more than capable of handling this, with excellent query performance. But MPPs are more costly than Hadoop, which is why companies tend to choose Hadoop for this task. Using an MPP for this task is like killing a fly with a cannon: not only is it expensive and unnecessary, it is also too sluggish and cumbersome for the task.

If the 100 source files have a complex structure (such as an export from an SAP system) then yes, an MPP is a suitable solution, as we need to create relationships between the files/tables. But if the source files have a simple structure and we just need to union them, then Hadoop is more suitable and more economical for the task.

So if the data volume is large, like 1 billion rows per day, and the data structure is simple, put the data into Hadoop. But if the data volume is large and the data structure is complex, put it into an MPP.

3. Unstructured Data

If most of those 100 source files are MP4 (video) or MP3 (music), then Hadoop or a Data Lake is an ideal platform to store them. An RDBMS, be it SMP or MPP, is not designed to store video or music files. It can (as blobs, or as externally-linked files), but it is not really designed for it.

If the source files have varying numbers of columns (such as Facebook or Twitter files), then Hadoop or a Data Lake is an ideal platform to store them. An RDBMS is not really designed for it.

Unstructured data also comes in the form of free-format text files (such as emails) and documents (such as journals and patents). Again, Hadoop or a Data Lake is in a much better position to store them than an RDBMS.

4. Schema-on-Read

One of the advantages of using Hadoop or a Data Lake is that they are Schema-on-Read, meaning that we just store the files without determining whether the first column is numeric or a string. Only when we want to query the data do we need to specify the data types.

Why is this an advantage? Because it makes things flexible. In Data Warehousing the first thing we need to do is analyse the file structures and design many tables to host the files in a Staging database. Then we design a normalised database to integrate those Staging tables. And then we design a Reporting layer in the form of fact and dimension tables and load the normalised tables into them. The whole thing can take a year if we have 100 files. The more files we have, the more complex the process and the longer it takes to design the databases for the Integration layer and the Reporting layer. It is good for the data architect (it gives them a job) but it is not good for the people who pay for the project.

Hadoop, on the other hand, is Schema-on-Read. After we put these 100 files in Hadoop, we query the first file, and only when we query it do we specify the data types of each column. We don't need to touch the other 99 files yet, and we already get the benefit: we can analyse the data straight away, on day one! If the other 99 files have the same structure, we can union them, without the extra effort of designing any database, and query them straight away, on day two! It is much simpler; we don't need a team of 10 people designing Staging, Normalised and Reporting layers for many months. We can start analysing the data straight away, and the project can finish in 2 to 3 months with only 3 or 4 people. A lot less costly, a lot more agile, and a lot more flexible.
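
Here is what that schema-on-read workflow might look like with PySpark, one of the query engines listed earlier. The HDFS paths and column names are hypothetical; data types are only decided when the files are read and queried, not when they are loaded into HDFS.

```python
# Schema-on-read sketch with PySpark (paths and columns are hypothetical).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("schema-on-read-demo").getOrCreate()

# Day one: query just the first file, letting Spark infer the data types now.
first = spark.read.csv("hdfs:///landing/file_001.csv", header=True, inferSchema=True)
first.createOrReplaceTempView("file_001")
spark.sql("SELECT col1, COUNT(*) AS row_count FROM file_001 GROUP BY col1").show()

# Day two: if the other 99 files share the structure, union them with a wildcard.
all_files = spark.read.csv("hdfs:///landing/file_*.csv", header=True, inferSchema=True)
print("total rows:", all_files.count())
```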


17 April 2017

Definition of Big Data and Data Warehousing

Filed under: Data Architecture,Data Warehousing — Vincent Rainardi @ 5:38 pm

I'm annoyed that many people label normal Data Warehousing & BI work as "Big Data". It is not. For me, Big Data is the "Hadoop stuff" (HDFS). If it is in SQL Server, Oracle or Teradata, it is just a normal database, not Big Data.

Big Data is stored on HDFS (Hadoop Distributed File System), not in RDBMS. Oracle, IBM, Microsoft, SAP, Teradata, all use Hadoop to store Big Data. Big Data is queried using MapReduce.

The reason Big Data can't be stored in an RDBMS is that the format is not tabular: sometimes it is 2 columns, sometimes 200 columns, like Twitter data. The second reason is that it is too big. Sensors can take 100 measurements a second, and in a year that can be Petabytes. Web logs are another example. Tracking the bid and offer price of every transaction in every stock market is another. Yes, we can put Petabytes into SQL Server or Oracle, into Netezza or Teradata, but not at this speed (and, more importantly, not at this price!). Hadoop, on the other hand, is designed exactly to cope with this kind of speed and volume (and price).

Now the usage. What is Big Data Analytics? Big Data Analytics is when we do analytics on Hadoop Data.

Is Fraud Detection Big Data Analytics? Not always. Fraud Detection can be done on a normal Data Warehouse or a database. Is Machine Learning Big Data? Not always. Machine Learning can be done on a normal Data Warehouse or a database. If the Fraud Detection or the Machine Learning is done on data stored in Hadoop, then it is Big Data Analytics.

Even if it is only 200 GB, if it is stored in Hadoop, it is Big Data. Even if the data is 5 Petabytes, if it is stored in a SQL Server database, it is not Big Data, in my opinion.

Even if the data is in tabular format (i.e. columns and rows), if it is stored in Hadoop, it is Big Data. But if it is stored in an Oracle database, it is not Big Data.

Every Big Data architecture that I know of uses Hadoop. No company (or government) implements Big Data on an RDBMS, or on non-HDFS files. Every single company, every single Big Data case I read about, implements Big Data on Hadoop. I may be wrong and would be happy to be corrected. If you know of a case which implements Big Data on a non-Hadoop system/architecture, I would be grateful if you could let me know, either through the comments or via vrainardi@gmail.com.

 

 

18 September 2016

Rating Dimension

Filed under: Business Knowledge,Data Architecture,Data Warehousing — Vincent Rainardi @ 2:32 am

In an investment banking data warehouse, we have an instrument dimension (aka security dimension). This instrument dimension is used in many fact tables such as trade, risk, valuation, P&L, and performance.

In this instrument dimension we have rating attributes (columns): the ratings from the 3 agencies (S&P, Moody's, Fitch), an in-house rating, and many combinations of them. We may have ratings at both the issuer/obligor level and the instrument/issue level. We could also have a hierarchy of ratings, e.g. AA+, AA and AA- grouped into AA. There are separate ratings for cash instruments.

Usual Design

The usual design is to put all rating attributes in the instrument dimension. This is because the granularity of rating is at instrument level. In other words, rating is an attribute of an instrument.

Issues with the Usual Design

The issues with doing that are:

  1. Cash and FX may not be created as instruments, but they can have ratings.
  2. Some fact tables require the rating outside the context of an instrument.

An example of point 1 is a USD bank balance (settled cash) in a GBP portfolio, which is given a rating of AAA. In the instrument dimension there is no instrument called "USD bank balance". Another example is USD cash proceeds, which is unsettled cash.

An example of point 2 is a fact table used for analysing the average rating of a portfolio across different dates. This is called Portfolio Credit Quality analysis. On any given day, a credit portfolio has 3 attributes: highest rating, lowest rating and average rating. The average rating is calculated by converting each rating letter to a number (usually 1 to 27, or using Moody's rating factor, link), multiplying the rating number by the weight of each position, and summing them up. There can be several different criteria, for example: including or excluding debt derivatives, funds, cash or equity, and doing lookthrough or not.
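
A small Python sketch of that average-rating calculation: map rating letters to numbers, weight them by position size, and convert the weighted average back to a letter. The scale and the example holdings below are made up for illustration; they are not a real portfolio or an official agency scale.

```python
# Weighted average portfolio rating (illustrative scale and holdings).
RATING_TO_NUMBER = {"AAA": 1, "AA+": 2, "AA": 3, "AA-": 4, "A+": 5, "A": 6,
                    "A-": 7, "BBB+": 8, "BBB": 9, "BBB-": 10, "BB+": 11}
NUMBER_TO_RATING = {v: k for k, v in RATING_TO_NUMBER.items()}

# (rating, weight) of each position in the portfolio; weights sum to 1.
positions = [("AAA", 0.40), ("AA-", 0.35), ("BBB+", 0.25)]

weighted = sum(RATING_TO_NUMBER[r] * w for r, w in positions)
average_rating = NUMBER_TO_RATING[round(weighted)]
print(f"average rating number = {weighted:.2f} -> {average_rating}")
```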

For example, portfolio X on 16th Sep could have these different attributes:

  1. Average rating: AA+
  2. Highest rating: AAA
  3. Lowest rating: BB-
  4. Average rating excluding cash: AA
  5. Average rating excluding cash and derivatives: BBB+
  6. Average rating excluding cash, derivatives and equity: BBB-
  7. Average rating including funds: AA-
  8. Average rating with lookthrough: A+

So the grain of this fact table is portfolio and date. We therefore cannot use the instrument dimension in this fact table, so it is necessary to have rating as its own dimension.

Another example of a fact table which requires rating not at instrument level is sectoral credit quality. A credit portfolio consists of many bonds and credit derivatives, and each of these bonds (or CDSs) belongs to a certain industry sector. A bond issued by BP is in the Energy sector; a bond issued by Santander is in the Financials sector. The average rating for each sector is calculated by converting each rating to a number, multiplying it by the weight, and summing up all positions in that sector.

If we use GICS (link) as an example, plus government, on any given day a credit portfolio can have a sectoral credit quality like this (usually excluding cash and equity, but including credit derivatives and debt funds):

  1. Consumer Discretionary: AA+
  2. Consumer Staples: AA-
  3. Energy: A+
  4. Financials: AAA
  5. Health Care: BB-
  6. Industrials: BBB-
  7. Information Technology: AA-
  8. Materials: BBB+
  9. Telecommunication Services: AA
  10. Utilities: BB+
  11. Real Estate: B+
  12. Government: AAA

Above is the average rating for each sector, for a given day, for a particular portfolio.

Design Choices for Rating Dimension

To satisfy the above requirements (fact tables which analyse rating at a level higher than instrument, e.g. at sector or portfolio level), we can design the rating dimension in several ways:

  1. Rating schemes and hierarchies are created as attributes
  2. Rating schemes are created as rows, with hierarchies as attributes
  3. Rating schemes and hierarchies are created as rows

Before we dive into the details, I'd like to emphasise that we should not try to find out which approach is the best one, because different projects require different approaches. So instead of asking "What is the best approach?" we should ask "What is the most suitable approach for my project?"

And secondly, I'd like to remind us that if the rating data is always used at instrument level, then it should reside in the instrument dimension, not in a rating dimension of its own.

Approach 1. Rating schemes and hierarchies are created as attributes

[Figure: Approach 1 rating dimension]

Of course we should always have RatingKey 0 with description = Unknown. This row is distinctly different from NR (No Rating). NR means S&P gave the bond a rating of NR; RatingKey 0 means we have no information about the rating of this bond.

The distinct advantage of this approach is that we have a mapping between the different rating schemes. Moody's Baa3, for example, corresponds to S&P's BBB-.

The Grade column is either Investment Grade (AAA down to BBB-) or High Yield (HY, from BB+ down to D).

Approach 2. Rating schemes are created as rows, with hierarchies as attributes

[Figure: Approach 2 rating dimension]

The distinct advantage of approach 2 over approach 1 is that RatingKey 33 means Aa3; we don't have to specify that we mean Moody's, as in approach 1. In approach 1, RatingKey 4 can mean Aa3 or AA-, depending on whether we mean Moody's or S&P.

The distinct disadvantage of approach 2 compared to approach 1 is that we lose the mapping between S&P and Moody's. In approach 1 we know that Aa3 in Moody's corresponds to AA- in S&P; in approach 2 we don't. In the majority of circumstances this is not an issue, because an instrument with a Moody's rating of Aa3 may not have an S&P rating of AA- anyway; its S&P rating could instead be AA or BBB+. So the mapping between S&P and Moody's is the textbook, academic assumption, not the reality in the real world. In the above example, nobody cares whether the corresponding S&P rating for Aa3 is AA-. What the fund managers and analysts want to know is what Moody's actually assigned to that bond.

Approach 3. Rating schemes and hierarchies are created as rows

[Figure: Approach 3 rating dimension]

The distinct advantage of approach 3 over approaches 1 and 2 is that RatingKey 31 means the AA tier; it is clear and unambiguous. In approaches 1 and 2, RatingKeys 2, 3 and 4 all mean the AA tier.

The other advantage of approach 3 is the ability to create short term ratings and cash ratings, because they are simply different schemes.

Why do we need to keep the Tier column in Approach 3? Because we still need to convert AA+, AA and AA- to AA.

Rating Value (aka Rating Number)

In some projects we need to convert the rating letters to numbers. For this we can create an attribute in the above rating dimension (in any of the 3 approaches) called Rating Value. It is usually AAA = 1, AA+ = 2, AA = 3, AA- = 4, and so on. NR (No Rating) and WR (Withdrawn Rating) get 0 rather than 23.

[Figure: Rating Value mapping]

Note that unlike the S&P and Moody's schemes, the Fitch scheme differentiates the D rating into 3: DDD, DD and D. So DDD = 22, DD = 23, D = 24.

Miscellaneous

A common mistake is to assume there are CC+ and CC- ratings. There aren't. Only CCC is split into 3 (CCC+, CCC, CCC-); CC and C are not.

Sometimes the relationship between a rating in one scheme and another scheme is not a one-to-one mapping. For example, the Moody's rating of Ca corresponds to 2 ratings in the S&P scheme: CC and C. This is why approach 1 is tricky: ratings do not have a one-to-one relationship between schemes. Another example, as mentioned above, is the D rating in S&P, which corresponds to 3 ratings in Fitch: DDD, DD and D.

20 June 2016

Data Lake vs Data Warehouse

Filed under: Data Architecture,Data Warehousing — Vincent Rainardi @ 6:22 pm

A Data Lake is a storage and analytics system which stores structured and unstructured data from all source systems in the company in its raw form. The data is queried, combined and analysed to get useful patterns and insights. It is built on cost-effective Hadoop storage infrastructure, and can be scaled dynamically.

A Data Warehouse is a system that retrieves and consolidates data periodically from source systems into a dimensional or normalised data store. It usually keeps years of history and is queried for business intelligence or other analytical activities. It is typically updated in batches, not every time a transaction happens in the source systems.

Data Lake Advantages

Because data from all source systems is there in the data lake, the main advantage of a data lake is that it enables analysts to get analytics and insight from multiple systems. Yes, the data is still in its raw form, so it requires some processing, but it is a lot quicker (1-2 weeks) than building a data warehouse (6-12 months). And that is the second advantage: time to build.

The third advantage of a data lake is its ability to store unstructured data such as documents and social media, which we can query and combine with structured data from databases.

The fourth, and probably the most powerful, business case for a data lake is cost efficiency. In the investment banking world, for example, we can store market data (e.g. prices, yields, spreads, ratings) not only for the securities we hold but for all securities in the market, more cheaply than if we stored it in Oracle or SQL Server.

The fifth advantage of a data lake is flexibility. Unlike Oracle or SQL Server, a data lake scales up dynamically. If this morning it is 2 TB, in the afternoon it can be 3 TB. As new data arrives and we need new capacity, we can add storage easily. As we require more computing power, we can get it there and then, instantly. There is no need to wait a few days to add a node to a SQL Server Always On availability group, add storage to the SAN, or extend the computing power of an Oracle Grid. This means we do not need to spend a lot of money upfront paying for 3 TB of storage and 16 processors; we can start with 300 GB and 4 processors and expand when required.

Data Lake Disadvantages

The first issue is that data lake technology is immature. The language which can query across databases and unstructured files has only very limited features (each vendor has a different language; for Microsoft Azure it is U-SQL). It is probably only 10% of what PL/SQL or T-SQL can do. We can solve this by putting QlikView or Tableau on top: we use U-SQL only to query individual tables, and we join the data in QlikView/Tableau and do further processing there.

The second issue is cost (but I don't find this argument holds water). The argument is that it actually costs a lot of money to store data from all source systems in the company, let alone external market data, which requires a lot of licence cost.

Let's take Microsoft Azure Data Lake as a pricing example. Being in the cloud, the price is only US$0.04 per GB and US$0.07 per 1 million transactions (ref: link). Let's say that "data from ALL systems in the company" is 500 GB, and every day we store this data in the data lake. So 500 GB per day x 365 days = 182,500 GB per year x $0.04/GB = $7,300 per year. Let's say we have 10 million transactions per day, which is 3,650 million transactions per year x $0.07/million transactions = $256 per year. So it only costs us about $7,500 per year, which is a very reasonable price to pay to have "all the data in the company in one place". Even 5 years later, when the volume has grown 5 times, it is only about $35,000 per year. Still a very reasonable price to pay.
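
The same arithmetic as a small script, using the prices quoted above (illustrative assumptions, not a quote):

```python
# Back-of-the-envelope data lake cost, using the assumed prices above.
gb_per_day = 500
price_per_gb = 0.04            # USD
transactions_per_day = 10_000_000
price_per_million_tx = 0.07    # USD

storage_per_year = gb_per_day * 365 * price_per_gb                           # $7,300
tx_per_year = transactions_per_day * 365 / 1_000_000 * price_per_million_tx  # ~$256
print(f"storage: ${storage_per_year:,.0f}, transactions: ${tx_per_year:,.0f}, "
      f"total: ${storage_per_year + tx_per_year:,.0f} per year")
```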

The third issue is performance. Because a data lake stores the data in its raw format, a query which joins the different data sets can run like a dog. Luckily a Data Lake runs on HDFS, a distributed file system, which is very fast. So yes, it is slower than a data warehouse, but it is not too bad. We are not talking about 30 minutes to run a query, but something like 20-30 seconds (compared to a DW query of, say, 1-3 seconds).

A Data Lake being an "exploration tool", I can tolerate a little slowness. Once we prove that (for example) the risk data we query is making money, that can pay for creating a Risk data mart specifically for that risk analysis purpose.

The fourth issue is skill. A Data Lake requires superb business knowledge, because the analyst needs to join tables from different source systems. What is the difference between Real Modified Duration, Effective Real Duration and Modified Duration? But this issue is the same whether we are building a Data Warehouse or a Data Lake: both require good business knowledge.

I don't find U-SQL difficult to understand for people with SQL knowledge, which is a common skill. But how about Machine Learning? That is difficult to master, right? Yes, that is true, but it is worth paying an expert data scientist to discover the insight in the data, because this insight can be used to save costs or boost revenue. The potential benefit is 10-100 times the salary of the data scientist.

Conclusion

So considering all the pros and cons above, I am in favour of creating a Data Lake. In addition to having a Data Warehouse, of course.

 

21 April 2016

Why do we need a data warehouse?

Filed under: Data Warehousing — Vincent Rainardi @ 4:59 pm

I have written about this in 2012 (link) but this one is different.

One of the most compelling reasons for creating a data warehouse is to reduce cost. Imagine you were a large investment bank, trading fixed income, equities, commodities and currencies. Every month you need to produce hundreds of reports for your clients, for regulatory bodies and for management. The reporting team currently spends a lot of time creating the reports. They all have access to various systems: risk systems, trading systems (order management), compliance systems, fixed income systems, finance systems. They copy and paste data from these systems into Excel and do some calculations in Excel, which is then used as the data source for producing the reports in PowerPoint and PDF. Let's find out how much this operation costs. Say there are 8 systems, costing $200 per user per month, and there are 10 people in the team. The licence cost is 8 x $200 x 10 = $16k/month, which is $192k/year, call it $200k. The salary cost is $60k x 10 = $600k per year. Total cost = $800k/year.

If we build a data warehouse, taking 1 year and costing $500k, we will be able to put together the data from those 8 systems into 1 database, and the reporting team won't need access to those systems any more. That saves $200k/year.

Secondly, we will be able to automate the production of those reports. That will reduce the amount of time required to make them, thus reducing the number of people required and freeing them up for other activities. Assuming that half of the reports can be automated, that would save us $300k per year.

So the total saving per year is $500k and the project cost is $500k. Surely a project with this level of cost vs benefit is well worth pursuing.
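
The cost-benefit arithmetic above as a small script, using the same assumptions as the text:

```python
# Reporting cost, saving and payback (same illustrative figures as the text).
systems, licence_per_user_month, team_size, salary = 8, 200, 10, 60_000

licence_per_year = systems * licence_per_user_month * team_size * 12   # $192,000
salary_per_year = salary * team_size                                   # $600,000
current_cost = licence_per_year + salary_per_year                      # ~$800,000

licence_saving = licence_per_year          # no more direct access to the 8 systems
automation_saving = salary_per_year * 0.5  # half the reports automated
project_cost = 500_000

print(f"current cost: ${current_cost:,.0f} per year")
print(f"saving per year: ${licence_saving + automation_saving:,.0f}, "
      f"payback in {project_cost / (licence_saving + automation_saving):.1f} years")
```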

