Data Warehousing and Machine Learning

8 June 2021

Which machine learning algorithms should I use?

Filed under: Data Warehousing — Vincent Rainardi @ 5:06 am

Every month I learn a new machine learning algorithm. So far I've learned about ten, and whenever I try to solve a machine learning problem the question is always "Which algorithm should I use?"

Almost every machine learning practitioner knows that the answer depends on whether the problem is supervised or unsupervised, and then whether it is classification or regression. So choosing an algorithm is quite straightforward, right?

Well, no. Take classification for example. We can use Logistic Regression, Naive Bayes, Support Vector Machine, Decision Tree, Random Forest, Gradient Boosting or Neural Network. Which one should we use?

"Well, it depends" is the answer we often hear. "Depends on what?" That is the question! It would be helpful to know what factors to consider, right?

So in this article I would like to try to answer those questions. First I'm going to address the general question "Which machine learning algorithm should I use?" This is useful when you are new to machine learning and have never heard of classification and regression, let alone ensembles and boosting. There are many good articles already written about this, so I'm going to point you to them.

Then, as an example, I'm going to dive specifically into classification algorithms. I'll give a brief outline of the factors we need to consider when deciding, such as linearity, interpretability, multiclass support and accuracy, along with the strengths and weaknesses of each algorithm.

General guide on which ML algorithms to use

I would recommend that you start with Hui Li's diagram: link. She categorised ML algorithms into 4 groups: clustering, regression, classification and dimensionality reduction:

It is very easy to follow, and it is detailed enough. She wrote it in 2017 but by and large it is still relevant today.

The second one I'd recommend is Microsoft's guide: link, which is newer (2019) and more comprehensive. They categorise ML algorithms into 8 groups: clustering, regression, classification (two-class and multiclass), text analytics, image classification, recommenders, and anomaly detection:

So now you know roughly which algorithm to use for each case, using the combination of Hui Li's and Microsoft's diagrams. In addition, it is helpful to read Danny Varghese's comparative study of machine learning algorithms: link. For every algorithm Danny outlines the advantages and disadvantages against the other algorithms in the same category. So once you have chosen an algorithm based on Hui Li's and Microsoft's diagrams, check it against the alternatives on Danny's list and make sure the advantages outweigh the disadvantages.

Classification algorithms: which one should I use?

For classification we can use Logistic Regression, Naive Bayes, Support Vector Machine, Decision Tree, Random Forest, Gradient Boosting Machine (GBM), Perceptron, Linear Discriminant Analysis (LDA), K Nearest Neighbours (KNN), Learning Vector Quantisation (LVQ) or Neural Network. What factors do we need to consider when deciding? And what are the strengths and weaknesses of each algorithm?

The factors we need to consider are: linearity, interpretability and multiclass.

The first consideration is the linearity of the data. The data is linear if, when we plot the data points against the predictors, the classes can be separated by a straight line, like below.

Note that the plots above are over-simplified: in reality there are not just 2 dimensions but many (e.g. we may have 8 predictors, i.e. 8 axes), so the separator is not a line but a hyperplane.

  1. If the data is linear, we can use (link): Logistic Regression, Naive Bayes, Support Vector Machine, Perceptron, Linear Discriminant Analysis.
  2. If the data is not linear, we can use (link): Decision Tree, Random Forest, Gradient Boosting Machine, K Nearest Neighbours, Neural Network, Support Vector Machine using Kernel, Learning Vector Quantisation.

Can we use the algorithms in #2 for linear classification? Yes we can, but the ones in #1 are more suitable.

Can we use #1 for non-linear classification? No we can't, not without modification. But there are ways to transform data from a non-linear space to a linear space, known as the "kernel trick"; see my article here: link.
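To get a feel for the difference, here is a minimal scikit-learn sketch (the toy dataset and parameters are mine, just for illustration): a linear model struggles on data that is not linearly separable, while an SVM with an RBF kernel handles it.

from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

# non-linear toy data: two interleaving half-moons
X, y = make_moons(n_samples=500, noise=0.2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

linear_model = LogisticRegression().fit(X_train, y_train)
kernel_model = SVC(kernel="rbf").fit(X_train, y_train)

print("Logistic Regression:", linear_model.score(X_test, y_test))  # lower accuracy
print("SVM with RBF kernel:", kernel_model.score(X_test, y_test))  # follows the curved boundary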

The second factor that we need to consider is interpretability, i.e. the ability to explain why a data point is classified into a certain class. Christoph Molnar explains interpretability in great detail: link.

  • If we need to be able to explain the predictions, we can use Logistic Regression, Naive Bayes, Decision Tree or Linear Support Vector Machine.
  • If we don't need to be able to explain, we can use Random Forest, Support Vector Machine with Kernel (see Hugo Dolan's article: link), Gradient Boosting Machine, K Nearest Neighbours, Neural Network, Perceptron, Linear Discriminant Analysis or Learning Vector Quantisation.

The third factor we need to consider is whether we are classifying into two classes (binary classification) or more than two classes (multi-class). Support Vector Machine (SVM), Linear Discriminant Analysis (LDA) and Perceptron are binary classifiers, but everything else can be used for both binary and multi-class problems. We can make LDA multi-class, see here: link. Ditto SVM: link.
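As an illustration (a minimal scikit-learn sketch; the dataset is just an example), a binary-only classifier such as a linear SVM can be wrapped in a one-vs-rest scheme to handle more than two classes:

from sklearn.datasets import load_iris
from sklearn.svm import LinearSVC
from sklearn.multiclass import OneVsRestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)  # 3 classes

# OneVsRestClassifier trains one binary LinearSVC per class
multiclass_svm = OneVsRestClassifier(LinearSVC(max_iter=10000))
print(cross_val_score(multiclass_svm, X, y, cv=5).mean())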

1. Logistic Regression

Strengths: good accuracy on small amounts of data, easy to interpret (we get feature importance), easy to implement, efficient to train (doesn't need high compute power), can do multi-class.

Weaknesses: tends to overfit on high-dimensional data (use regularisation), can't do non-linear classification (or complex relationships), not good with multicollinearity, sensitive to outliers, requires a linear relationship between the log odds and the predictors.

2. Naive Bayes

Strengths: good accuracy on small amounts of data, efficient to train (doesn't need high compute power), easy to implement, highly scalable, can do multi-class, can handle both continuous and discrete data, not sensitive to irrelevant features.

Weaknesses: assumes the features are independent, and a category which exists in the test dataset but not in the training dataset gets zero probability (the zero frequency problem).

3. Decision Tree

Strengths: easy to interpret (intuitive, shows interactions between variables), can classify non-linear data, data doesn't need to be normalised or scaled, not affected by missing values, not affected by outliers, performs well with unbalanced data (the nature of the data distribution does not matter), can do both classification and regression, can handle both numerical and categorical data, provides feature importance (calculated from the decrease in node impurity), good with large datasets, able to handle multicollinearity.

Weaknesses: has a tendency to overfit (biased towards the training set, requires pruning), not robust (high variance: a small change in the training data results in a major change in the model and output), not good with continuous variables, requires a longer time to train the model (resource intensive).

4. Random Forest

Strengths: high accuracy, doesn't need pruning, low risk of overfitting, low bias with quite low/moderate variance (because of bootstrapping), can do both classification and regression, can handle numerical and categorical data, can classify non-linear data, data doesn't need to be normalised or scaled, not affected by missing values, not affected by outliers, performs well with unbalanced data (the nature of the data distribution does not matter), can be parallelised (can use multiple CPUs in parallel), good with high dimensionality.

Weaknesses: long training time, requires large memory, not interpretable (because there are hundreds of trees).

5. Support Vector Machine (Linear Vanilla)

Strengths: scales well with high-dimensional data, stable (low variance), less risk of overfitting, doesn't rely on the entire data (not affected by missing values), works well with noise.

Weaknesses: long training time for large data, requires feature scaling.

6. Support Vector Machine (with Kernel)

Strengths: scales well with high-dimensional data, stable (low variance), handles non-linear data very well, less risk of overfitting (because of regularisation), good with outliers (has gamma and C to control), can detect outliers in anomaly detection, works well with noise.

Weaknesses: long training time for large data, tricky to find an appropriate kernel, needs large memory, requires feature scaling, difficult to interpret.

7. Gradient Boosting

Strengths: high accuracy, flexible with various loss functions, minimal pre-processing, not affected by missing values, works well with unbalanced data, can do both classification and regression.

Weaknesses: tendency to overfit (because it keeps minimising errors), sensitive to outliers, large memory requirement (thousands of trees), long training time, large grid search for hyperparameters, not good with noise, difficult to interpret.

8. K Nearest Neighbours

Strengths: simple to understand (intuitive), simple to implement (both binary and multi-class), handles non-linear data well, non-parametric (no requirements on the data distribution), responds quickly to data changes in real-time implementations, can do both classification and regression.

Weaknesses: long prediction time (it is a lazy learner, so the work happens at query time), doesn't work well with high-dimensional data, requires scaling, doesn't work well with imbalanced data, sensitive to outliers and noise, affected by missing values.

9. Neural Network

Strengths: high accuracy, handles non-linear data well, generalises well on unseen data (low variance), non-parametric (no requirements on the data distribution), works with heteroskedastic data (non-constant variance), works with highly volatile data (time series), works with incomplete data (not affected by missing values), fault tolerant.

Weaknesses: requires a large amount of data, computationally expensive (requires parallel processors/GPU and large memory), not interpretable, tricky to get the architecture right (#layers, #neurons, activation functions, etc.).

5 June 2021

The Trick in Understanding Human Language

Filed under: Data Warehousing — Vincent Rainardi @ 10:08 am

I started learning Natural Language Processing (NLP) with such enthusiasm. There are 3 stages in NLP. The first stage is lexical analysis, where the root words and phrases are identified, dealing with stop words and misspelling. The second stage is syntactic analysis, where the nouns, verbs, etc. are identified and the grammar is analysed. The third stage is semantic analysis, which is about understanding the meanings of the words.

So I thought, this is amazing! I knew computers now understand human languages, for example Alexa and chatbots. And I would be diving into that wonderful world, learning how it’s done. At the end of this process I would be able to create a chatbot that could understand human language. Cool!

I did build a chatbot that could “understand” human language, but disappointingly it doesn’t really understand it. A chatbot uses a “trick” to guess the meaning of our sentences, identifying the most probable intention. It outputs prepared responses and we do need to define which response for which input. So no it does not understand human language in the way I initially thought. We are still far away from having clever robots like in “I Robot” and “Ex Machina”.

In this article I'm writing about that learning experience, hoping that it will enlighten those who have not yet entered the NLP world.

Lexical Analysis

Lexical analysis is about identifying words and phrases, and dealing with stop words and misspelling. I learned how to identify the base form of words, such as "play" in the words "playing" and "player". This process is called stemming, where we apply rules such as removing "ing" and "ion" suffixes. For this we use regular expressions.

The base form of "best" is "good", which can't be identified using stemming. For this we use lemmatisation, which is done using a combination of lookups and rules. Both are widely implemented using NLTK in Python; see Ivo Bernardo's article on stemming (link) and Selva Prabhakaran's article on lemmatisation (link).
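Here is a minimal NLTK sketch of both (the example words are mine; the WordNet data needs to be downloaded once for the lemmatiser):

import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("wordnet", quiet=True)  # needed once for lemmatisation

stemmer = PorterStemmer()
lemmatiser = WordNetLemmatizer()

# stemming chops suffixes using rules, e.g. "playing" becomes "play"
print([stemmer.stem(w) for w in ["playing", "played", "plays"]])

# lemmatisation uses dictionary lookups plus rules; the part of speech helps it
print(lemmatiser.lemmatize("playing", pos="v"))  # play
print(lemmatiser.lemmatize("better", pos="a"))   # good (via WordNet's exception list)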

But before that we need to break the text into paragraphs, sentences and words. This is called tokenisation. We deal with "she'd" and "didn't", which are actually "she would" and "did not". We deal with tokens which are not words, like dates, times e.g. "3:15", symbols, email addresses, numbers, years and brackets. See my article on tokenisation here: link.

Then we need to deal with misspelling, and for this we need to know how similar two words are, using edit distance. The edit distance is the number of operations (like deleting a letter, inserting a letter, etc.) required to change one word into another.
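NLTK has a ready-made Levenshtein edit distance, for example (the words below are mine):

from nltk.metrics.distance import edit_distance

print(edit_distance("aquire", "acquire"))  # 1: insert one letter "c"
print(edit_distance("monney", "money"))    # 1: delete one letter "n"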

A crude way of representing a text is using a "bag of words". First we remove the stop words such as "the", "in", "a", "is", etc., because stop words exist in every text so they don't provide useful information. Then we construct a dictionary from the distinct list of words in the text. For every sentence we mark each word according to whether it exists in the dictionary or not. The result is that a sentence is now converted into a series of 1s and 0s. A more sophisticated version uses the word frequency instead of 1s and 0s, see: link. Either way, in the end the sentences are converted into numbers.
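A minimal sketch of a bag of words using scikit-learn's CountVectorizer (the sentences are made up; binary=True gives the 1/0 version, binary=False gives word counts):

from sklearn.feature_extraction.text import CountVectorizer

sentences = ["The cat sat on the mat", "The dog sat on the log"]

vectoriser = CountVectorizer(stop_words="english", binary=True)
bag_of_words = vectoriser.fit_transform(sentences)

print(vectoriser.get_feature_names_out())  # the dictionary of distinct words
print(bag_of_words.toarray())              # each sentence as a row of 1s and 0s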

Once a document is converted into numbers, we can run machine learning algorithms on it, such as classification. For example, we can classify whether an email or text message is spam or not.

That is, in 1 minute, Lexical Analysis 🙂 We can (crudely) represent a document as numbers and use this numerical representation to classify documents. But at this stage the machine doesn’t understand the documents, at all.

Syntactic Analysis

Syntactic analysis is about breaking (or parsing) a sentence into phrases such as noun phrases, verb phrases, etc. and recognising them. We do this because the meaning of a word (e.g. "play") depends on whether it is a noun or a verb. These labels, "noun", "verb", "adjective", "preposition", etc., are called "parts of speech", or POS for short. So the first step is to identify the POS tag for each word.

There are many different approaches to POS tagging: supervised, unsupervised, rule based, stochastic, Conditional Random Fields, Hidden Markov Models, memory based learning, etc. Fahim Muhammad Hassan catalogued them in his thesis: link.

  • The Hidden Markov Model (HMM) is arguably the most popular, where the POS tag is determined based not only on the word, but also the POS tag of the previous word. Many have written about HMM and its implementation in Python. For introduction I recommend Raymond Kwok’s article (link) and for a formal lecture Ramon van Handel from Princeton University (link).
  • The best approach in terms of accuracy is the Recurrent Neural Network (RNN). An RNN uses a deep learning approach where the output of each step is fed back into the next step. The most popular implementations are LSTM and GRU. Tanya Dayanand wrote a good short explanation here: link (notebook here).
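Whichever approach is used, NLTK ships with a pre-trained tagger (an averaged perceptron, not an HMM or RNN) that gives a quick feel for POS tagging. A minimal sketch (the tagger data must be downloaded first):

import nltk
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

words = nltk.word_tokenize("The sun shines brightly")
print(nltk.pos_tag(words))
# e.g. [('The', 'DT'), ('sun', 'NN'), ('shines', 'VBZ'), ('brightly', 'RB')]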

Once we know the POS tags for each word, we can now parse or break a sentence into phrases (e.g. noun phrase, verb phrase, etc.) or into subject, modifier, object, etc. in order to understand them. The former is called constituency grammar and the latter is called dependency grammar.

  • Constituency grammar: the most popular method is Context Free Grammar (CFG, link), which specifies the rules of how words are grouped into phrases. For example, a noun phrase (e.g. “the sun”) may consist of a determinant (the) and a noun (sun). A sentence can consist of a noun phrase and a verb phrase, e.g. the sun shines brightly.
  • In dependency grammar we first identify the root verb, followed by the subject and object of that verb, then the modifiers, which are adjectives, nouns or prepositions that modify the subject or the object. The most popular framework is Universal Dependencies (link). Two of the most popular Universal Dependency English parsers are from Georgetown University (link) and Stanford (link).

In addition to parsing sentences into phrases, we need to identify named entities such as city names, person names, etc. In general this subject is called Information Extraction (link), covering the whole pipeline from pre-processing, entity recognition, relation recognition and record linkage to knowledge generation. Recognising named entities is vital for chatbots in order to understand the intention. There are many approaches to Named Entity Recognition (NER), such as Naive Bayes, Decision Trees and Conditional Random Fields (see Sidharth Macherla's article: link). There are many good libraries that we can use, such as NLTK, spaCy and Stanford. Susan Li wrote a good article on an NER implementation: link.
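For example, a minimal spaCy sketch (assuming the small English model en_core_web_sm is installed; the sentence is made up):

import spacy

nlp = spacy.load("en_core_web_sm")  # small pre-trained English pipeline
doc = nlp("Vincent flew from London to Mumbai on 5 June 2021")

for ent in doc.ents:
    print(ent.text, ent.label_)
# e.g. Vincent PERSON, London GPE, Mumbai GPE, 5 June 2021 DATE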

That is syntactic analysis in 2 minutes, in which we break sentences into phrases and words, and recognise each word as a verb, noun, etc. or a named entity. At this stage the machine still doesn't understand the meaning of the sentence!

Semantic Analysis – The Trick

Now that we have parsed sentences into words and identified the named entities, the final step is to understand those words. This was the biggest learning point for me in NLP. Machines don't understand the text word by word like humans do, but by converting each word into a numerical representation (called a vector) and then extracting the topic. The topic is the centre of those word vectors.

And that is the big "trick" in NLP. We can do all the lexical analysis and syntactic analysis we want, but in the end we need to convert the words into vectors, and the centre of those vectors is the meaning of those words (the topic). So the meaning is also a vector!

In the real world the vector representations have hundreds of dimensions; in the diagrams below I only use 2 dimensions, so they are massively over-simplified, but I hope they convey the point. In diagram A we have a sentence, "Running is a sport". Each word is a blue circle and the centre (the centroid) is the solid blue circle. The vector representing this centre is the blue arrow. This blue arrow is the "meaning", which is just the bunch of numbers that make up that vector (in reality it's hundreds of numbers).

In diagram B we have another sentence, "He walks as an exercise". Each word is a brown circle and the centre is the solid brown circle. The vector representing this centre is the brown arrow. That brown arrow is the "meaning" of that sentence. So the meaning is just the bunch of numbers that make up that brown arrow.

In diagram C we superimpose diagram A and diagram B, and in diagram D we remove the word vectors, leaving just the 2 meaning vectors. Now we can find out how close the meanings of the 2 sentences are, just by looking at how close these 2 vectors are.

Remember that in reality it's not 2 dimensions but hundreds of dimensions. But you can clearly see the mechanism here. We convert sentences into numbers (vectors) and we compare the numbers. So the computer still doesn't understand the sentences, but it can compare sentences.

Say we have a collection of sentences about cooking. We can represent each of these sentences as numbers/vectors. See the left diagram below. The blue circles are the sentences and the solid blue circle is the centre.

If we have a collection of sentences about banking, we can do the same. Ditto with holiday. Now we have 3 blue dots (or 3 blue arrows), each representing a different topic. One for cooking, one for banking, one for holiday. See the right diagram above.

Now if we have an input like "I went to Paris and saw the Eiffel Tower", the NLP model will be able to determine whether this input is about holiday, cooking or banking, without even knowing what holiday, cooking or banking are! Without even knowing what the Eiffel Tower and Paris are, or even what "went" and "saw" are. All it knows is that the vector for "I went to Paris and saw the Eiffel Tower" is closer to the holiday vector than to the cooking vector or the banking vector. Very smart!

That is the trick in understanding human languages. Convert the sentences into numbers and compare them!

Semantic Analysis – The Steps

Semantic means meaning. Semantic analysis is about understanding the meaning. Now that we have an idea of how the semantic analysis trick is done, let's go through the steps.

First we convert the words into vectors. There are 2 approaches for doing this:

  1. Frequency based
  2. Prediction based

In the frequency-based approach the basic assumption is: words which are used and occur in the same context (e.g. a document or a sentence) tend to have similar meanings. This principle is called Distributional Semantics (link). First we create a matrix containing the word counts per document, i.e. the occurrence frequency of each word. This matrix is called the occurrence matrix; the rows are the words and the columns are the documents. Then we reduce the number of rows in this matrix using Singular Value Decomposition (link). Each row in this final matrix is a word vector; it represents how that word is distributed across the various documents. Examples of the frequency-based approach are Latent Semantic Analysis (link) and Explicit Semantic Analysis (link).
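A minimal sketch of the frequency-based route with scikit-learn (the documents are made up; note that scikit-learn's vectoriser puts documents in rows and words in columns, the transpose of the occurrence matrix described above, and weights the counts with TF-IDF):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

documents = ["cooking pasta with tomato sauce",
             "baking bread and cooking soup",
             "opening a bank account",
             "transferring money between bank accounts"]

# TF-IDF weighted occurrence matrix: rows = documents, columns = words
tfidf = TfidfVectorizer()
occurrence = tfidf.fit_transform(documents)

# SVD reduces the dimensionality; each document becomes a short dense vector
lsa = TruncatedSVD(n_components=2, random_state=0)
document_vectors = lsa.fit_transform(occurrence)
print(document_vectors)  # cooking documents end up close together, banking ones too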

The prediction-based approach uses a neural network to learn how words are related to each other. The input of the neural network is the word, represented as a one-hot vector (link), which means that all numbers are zero except one. There is only 1 hidden layer in the neural network, with hundreds of neurons. The output of the neural network is the context words, i.e. the words closest to the input word. For example: if the input word is "car", the outputs are like below left, with the vector representing the word "car" on the right (source: link).

Examples of the prediction approach are Word2Vec from Google (link), GloVe from Stanford (link) and fastText from Facebook (link).

Once the words become vectors, we use cosine similarity (link) to find out how close the vectors are to each other. And that is how computers "understand" human language, i.e. by converting it into vectors and comparing them with other vectors.
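Cosine similarity itself is only a few lines of numpy. A small sketch with made-up vectors:

import numpy as np

def cosine_similarity(a, b):
    # cosine of the angle between two vectors: close to 1 means similar direction
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

sentence_vector = np.array([0.9, 0.1, 0.2])  # e.g. "I went to Paris..."
holiday_topic   = np.array([0.8, 0.2, 0.1])
banking_topic   = np.array([0.1, 0.9, 0.3])

print(cosine_similarity(sentence_vector, holiday_topic))  # high: same topic
print(cosine_similarity(sentence_vector, banking_topic))  # low: different topic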

Chatbot

I'm going to end this article with chatbots. A chatbot is a conversational engine/bot which we can use to order tickets, book a hotel, talk to customer service, etc. We can build a chatbot using Rasa (link), IBM Watson (link), Amazon Lex (link) or Google Chat (link).

A chatbot has 2 components:

  • Natural Language Processing (NLP)
  • Dialogue Management

The Natural Language Processing part does Named Entity Recognition (NER) and intention classification. The NER part identifies named entities such as city names, person names, etc. The intention classification part detects the intention in the input sentence. For example, for a hotel booking chatbot the intention can be greeting, finding hotels, specifying a location, specifying a price range, making a booking, etc.

The Dialogue Management part determines the response and the next step for each intent. For example, if the intention is greeting, the response is saying "Hi, how can I help you?" and then waiting for an input. If the intention is "finding hotels", the response is asking "In which location?" and then waiting for an input.

And that's it. The intention classification uses the "trick" I explained in this article to understand human language. It converts the sentence into a vector and compares it with the list of intentions (which have been converted into vectors too). That's how a chatbot "understands" what we are typing. And then it uses a series of "if-then-else" rules to output the correct response for each intention. Easy, isn't it?

No, from my experience it's not easy. We need to prepare lots of examples to train the NLP. For each intention we need to supply many examples. For each location we need to specify the other possible names, for example Madras for Chennai, Bengaluru for Bangalore and Delhi for New Delhi. And we need to provide a list of cities that we are operating in. And we need to cover so many possible dialogue flows in the conversation. And then we need to run it so many times over and over again (and it could take 15-30 minutes per run!), each time correcting the mistakes.

It was very time consuming but it was fun and very illuminating. Now I understand what's going on behind the scenes when I'm talking to a chatbot on the internet, or talking to Alexa in my kitchen.

31 May 2021

Why Linear Regression is so hard

Filed under: Data Warehousing — Vincent Rainardi @ 8:36 am

2 years ago I thought linear regression was the easiest algorithm. But it turns out that it is quite difficult to do well, because the X and the Y must have a linear relationship, and the errors must be normally distributed, independent and have equal variance. Data that satisfies all of this is much rarer in practice than I initially thought. And if these 4 criteria are not satisfied, we can't use linear regression. In addition we also face multicollinearity, overfitting and extrapolation when doing linear regression. In this article I would like to explain these issues and how to solve them.

Criteria 1. X and Y must have a linear relationship

The first issue is that the relationship between the X (independent variables) and the Y (the predicted variable) might not be linear. For example, below is a classic case of a "lower tail", where below x1 the data is lower than the linear values (the red points).

Criteria 2. Error terms must be distributed normally

The second issue is that the errors might not be distributed normally. Below left is an example where the error terms are distributed normally. Error terms are the differences between the actual values and the predicted values, aka the residuals. Remember that normally distributed means that 1 standard deviation must cover 68.2% of the data, 2 SD 95.4% and 3 SD 99.7%. Secondly, the centre must be 0. Note: the image on the right is from Wikipedia (link).

Below are 3 examples where the error terms are not distributed normally:

On the left the distribution is almost flat. In the middle, the centre is 2, not 0. On the right, the red bars are too low, so 3 SD covers less than 99.7%. Unless the error terms are distributed normally, we cannot use the linear regression model that we created.

Criteria 3. Error terms must be independent

Error terms must be independent of what? Independent of three things:

  1. of the independent variables (the X1, X2, etc)
  2. of the predicted variable (the Y)
  3. of the previous error terms (see: Robert Nau’s explanation here)

See below for 3 illustrations where the error terms are not independent:

  • Left image: the error terms are correlated to one of the independent variables. In this example the higher the X the lower the error terms.
  • Middle image: the error terms are correlated to the predicted variable. In this example the higher the Y the higher the error terms.
    Note: in linear regression "the predicted variable" can mean two things: the actual values and the predicted values. In the context of error term independence the convention is the predicted values (y hat), because that is what the model represents and we want to know whether we can use the model or not. That said, the plot would be similar if we used the actual values rather than the predicted values, because the error terms are the difference between the actual values and the predicted values.
  • Right image: the error terms are correlated to the previous value of the error terms. This one is also called autocorrelation or serial correlation; it usually happens on time series data.

The reason why we cannot use the model if the error terms are not independent is that the model is biased and therefore not accurate. For example, in the left and middle plots above we can see that the error term (which is the difference between the actual value and the predicted value, and which reflects the model's accuracy) changes depending on the independent variable and the dependent variable.

Independent error terms means that the error terms are randomly scattered around 0 (with regards to the predicted values), like this:

Notice that this chart is between the error terms and the predicted values (y hat), not the actual values.
There are 3 things that we should check on the scatter chart above:

  1. That positive and negative error terms are roughly distributed equally. Meaning that the number of data points above and below the x axis are roughly equal.
  2. That there are no outliers. Meaning that there are no data points which are far away from everything else. For example: all data points are within -2 to +2 range but there is a data point at +4.
  3. Most of the error terms are around zero. Meaning that the further away we move vertically from the x axis, the less crowded the data points are. This is to satisfy the "error terms should be distributed normally" criterion, which is centred on zero.

Criteria 4. Error terms must have equal variance

It means that the data points are scattered equally around zero, no matter what the predicted values are. In the image below the variance of the error terms is not the same across the predicted values (Y hat). Around Y hat = a the error terms have a small variance, at Y hat = b the error terms have a large variance and at Y hat = c the error terms have a small variance.
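These residual checks are straightforward to produce in Python. A minimal sketch with statsmodels and matplotlib (X and y are placeholders for your predictors and target):

import statsmodels.api as sm
import matplotlib.pyplot as plt
from scipy import stats

# X and y are assumed to already exist (predictors and target)
model = sm.OLS(y, sm.add_constant(X)).fit()
residuals = model.resid
fitted = model.fittedvalues

# residuals vs predicted values: should be a random cloud around 0 with constant spread
plt.scatter(fitted, residuals)
plt.axhline(0, color="red")
plt.xlabel("Predicted values (y hat)")
plt.ylabel("Error terms (residuals)")
plt.show()

# histogram and Q-Q plot to check the residuals are roughly normal
plt.hist(residuals, bins=30)
plt.show()
stats.probplot(residuals, plot=plt)
plt.show()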

What should we do?

If the X-Y plot or the residual plots indicate a problem with the data (i.e. one of the 4 criteria above is violated), there are four things we can do:

  1. We can transform the independent variables or the predicted variable.
  2. We can use polynomial regression
  3. We can do non-linear regression
  4. We can do segmented regression

The first thing is transforming one or more of the independent variables (X) into ln(X), e^X, e^-X, the square root of X, etc.:

  • First, we need to find out which independent variable is not linear. This is done by plotting each independent variable against the predicted variable (one by one).
  • Then we choose a suitable transformation based on the chart from the first step above, for example: (graphs from fooplot.com)
  • Then we transform the non-linear independent variable, for example we transform X to ln(X), and we use this ln(X) as the independent variable in the linear regression.

The second one is using polynomial regression instead of linear regression, like this:

We can read about polynomial regression on Wikipedia (link) and in Towards Data Science (link, by Animesh Agarwal). As we can read in Animesh's article, the degree of the polynomial that we choose affects the overfitting, so it's a trade-off between bias and variance.
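A minimal scikit-learn sketch of polynomial regression (the synthetic data and degree are mine, just for illustration):

import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# synthetic non-linear data
rng = np.random.default_rng(0)
X = np.linspace(0, 10, 100).reshape(-1, 1)
y = 0.5 * X.ravel() ** 2 - 2 * X.ravel() + rng.normal(0, 3, 100)

# the degree controls the bias/variance trade-off: too high and the model overfits
poly_model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
poly_model.fit(X, y)
print(poly_model.score(X, y))  # R squared on the training data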

The third thing we can do is non-linear regression. By non-linear I mean non-linear in the model parameters/coefficients (the betas), not in the independent variables (the X). Meaning that it is not in the form of "y = beta1 something + beta2 something + beta3 something + …" For example, this is a non-linear regression:

In non-linear regression we approximate the model using a first-order Taylor series. We can read about non-linear regression on Wikipedia (link).

The last one is segmented regression, where we partition the independent variables into several segments, and for each segment we use linear regression. So instead of 1 long line, the linear regression is several "broken lines". That is why this technique is also known as "broken-stick regression", which we can read about on Wikipedia: link. It is also known as "piecewise regression", and a Python implementation can use the numpy.piecewise() function, which we can read about on Stack Overflow: link.
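A rough sketch of the broken-stick idea, using numpy.piecewise() inside scipy's curve_fit (the data and the breakpoint are made up):

import numpy as np
from scipy.optimize import curve_fit

def broken_stick(x, x0, y0, k1, k2):
    # two straight lines that meet at the breakpoint (x0, y0)
    return np.piecewise(x, [x < x0],
                        [lambda x: k1 * (x - x0) + y0,
                         lambda x: k2 * (x - x0) + y0])

# synthetic data whose slope changes around x = 5
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 100)
y = np.where(x < 5, 2 * x, 10 + 0.5 * (x - 5)) + rng.normal(0, 0.5, x.size)

params, _ = curve_fit(broken_stick, x, y, p0=[5, 10, 2, 0.5])
print(params)  # estimated breakpoint (x0, y0) and the two slopes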

Multicollinearity, overfitting, extrapolation

At the beginning of this article I also mentioned these 3 issues with linear regression. What are they and how do we solve them?

Multicollinearity means that one of the independent variables is highly correlated with another independent variable. This is a problem because it causes the model to have high variance, i.e. the model coefficients change erratically when there are small changes in the data, making the model unstable.

The solution is to drop one of the multicollinear variables. We can read more about multicollinearity in Wikipedia, including a few other solutions: link.

Overfitting happens when we use high-degree polynomial regression. We detect overfitting by comparing the accuracy on the training and test data sets. If the accuracy on the training data set is very high (>90%) and the accuracy on the test data set is much lower (a difference of 10% or more) then the model is overfitting (see: link).

The solution is to use regularisation such as Lasso or Ridge (link), to use feature selection (link), or to use cross validation (link).
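For example, Ridge and Lasso in scikit-learn with cross-validation (a generic sketch; X and y are placeholders and the alpha values are just examples):

from sklearn.linear_model import Ridge, Lasso
from sklearn.model_selection import cross_val_score

# X and y are assumed to already exist (features and target)
for model in [Ridge(alpha=1.0), Lasso(alpha=0.1)]:
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    print(type(model).__name__, scores.mean())
# a larger alpha means stronger regularisation, which reduces overfitting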

Extrapolation is about using the linear, polynomial or non-linear regression model beyond the range of the training and test data. The considerations and real world examples are given in this Medium article by Dennish Ash: link.

The solution is to review the linearity of the relationship between the independent variable and the predicted variable in the data range where we want to extrapolate. We review this using business sense (not data), checking whether the relationship is still linear outside the data range that we have.

One consideration is that the further we go from the training and test data range, the riskier the extrapolation. For example, if in the training and test data the independent variable is between 20 and 140, predicting the output for 180 is riskier than predicting the output for 145.

Note on plots in machine learning

Machine learning is a science about data, and as such, when making plots/graphs we must always make the meaning of each axis clear. And yet, bizarrely, during my 2 years in machine learning I have encountered so many graphs with the axes not labelled! This irritates me so much. We must label the axes properly, because depending on what the axes are, the graph could mean an entirely different thing.

For example: the graph below says heteroscedastic but has no label on either the y axis or the x axis. So how could we know what those data points are? Is it the independent variable against the dependent variable? It turns out that the x axis is the predicted variable and the y axis is the error term.

27 May 2021

Ensembles – Odd and Even

Filed under: Data Warehousing — Vincent Rainardi @ 7:01 am

In machine learning we have a technique called ensembles, i.e. combining multiple models. The more models we use, the higher the chance of getting it right. That is understandable and expected. But whether the number of models is odd or even has a significant effect too. I didn't expect that, and in this short article I would like to share it.

I'll start from the end. Below are the probabilities when using 2 to 7 models to predict a binary output, i.e. right or wrong. Each model has a 75% chance of getting it right, i.e. correctly predicting the output.

If we look at the top row (2, 4 and 6 models) the probability of the ensemble getting it right increases, i.e. 56%, 74%, 83%. If we look at the bottom row (3, 5 and 7 models) it also increases, i.e. 84%, 90%, 93%.

But from 3 models to 4 models it goes down from 84% to 74%, because we have a 21% chance of "not sure". This 21% is when 2 models are right and 2 models are wrong, so the output is "not sure". Therefore we would rather use 3 models than 4, in terms of the chance of getting it right (correctly predicting the output).

The same thing happens between 5 and 6 models. The probability of the ensemble getting it right decreases from 90% to 83% because we have a 13% chance of "not sure". This is where 3 models are right and 3 models are wrong, so the output is "not sure".

So when using ensembles to predict a binary output we need to use an odd number of models, because then there is no "not sure" case where an equal number of models are right and wrong.

We also need to remember that each model must have a >50% chance of predicting the correct result, because if not, the ensemble is weaker than the individual models. For example, if each model has only a 40% chance of predicting the correct output, then using 3 models gives us 35%, 5 models 32% and 7 models 29% (see below).
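These numbers come straight from the binomial distribution, so they are easy to reproduce (a small scipy sketch for an odd number of models, where there are no ties):

from scipy.stats import binom

def ensemble_accuracy(n_models, p_right):
    # probability that a strict majority of the models are right
    majority = n_models // 2 + 1
    return 1 - binom.cdf(majority - 1, n_models, p_right)

for n in [3, 5, 7]:
    print(n, round(ensemble_accuracy(n, 0.75), 2), round(ensemble_accuracy(n, 0.40), 2))
# at 75% per model: 0.84, 0.9, 0.93 -- at 40% per model: 0.35, 0.32, 0.29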

The second thing that we need to remember when making an ensemble of models is that the models need to be independent, meaning that they have different areas of strength.

We can see that this "independence" principle is reflected in the calculation for each ensemble. For example: for 3 models, when all 3 models get it right, the probability is 75% x 75% x 75% (see below). This 75% x 75% x 75% assumes that the 3 models are completely independent of each other.

This "completely independent" situation is a perfect condition and it doesn't happen in reality. So in practice the probability of all 3 models getting it right is lower than the 42% (0.75 x 0.75 x 0.75) above. But we have to try to make the models as independent as we can, meaning that we need to make them as different as possible, with each model having its own areas of speciality, its own areas of strength.

24 May 2021

SVM with RBF Kernel

Filed under: Data Warehousing — Vincent Rainardi @ 5:20 pm

Out of all machine learning algorithms, SVM with RBF Kernel is the one that fascinates me the most. So in this article I am going to try to explain what it is, and why it works wonders.

I will begin by explaining a problem, and how this algorithm solves that problem.

Then I will explain what it is. SVM = Support Vector Machine, and RBF = Radial Basis Function. So I'll explain what a support vector is, what a support vector machine is, what a kernel is and what a radial basis function is. Then I'll combine them all and give an overall picture of what SVM with an RBF kernel is.

After we understand what it is, I’m going to briefly explain how it works.

Ok let’s start.

The Problem

We need to classify 1000 PET scan images into cancer and benign. Whether a scan is cancer or benign is affected by two variables, X and Y. Fortunately the cancer and benign scans are linearly separable, like this:

Figure 1. Linearly separable cancer and benign scans

We call this space a "linear space". In this case we can find the equation of a line* which separates cancer and benign scans. Because the data is linearly separable we can use linear machine learning algorithms.

The problem is when the data set is not linearly separable, like this:

Figure 2. Cancer and benign scans which are not linearly separable

In this case the data is separable by an ellipse. We call this space a non-linear space. We can find the equation for the ellipse but it won't work with linear machine learning algorithms.

The Solution

The solution to this problem is to transform the non linear space into a linear space, like this:

Figure 3. Transforming a non-linear space into a linear space

Once it is in a linear space, we can use linear machine learning algorithms.

Why is it important to be able to use linear ML algorithms? Because there are many popular linear ML algorithms which work well.

What is a Support Vector?

The 4 data points A, B, C, D in figure 4 below are called support vectors. They are the data points located nearest to the separator line.

Figure 4. Support Vectors

They are called support vectors because they are the ones which determine where the separator line is located. The other data points don't matter; they don't affect where the separator line is located. Even if we remove all the other data points the separator line will still be the same, as illustrated below:

Figure 5. Support Vectors affect the separator line

What is a Support Vector Machine?

Support Vector Machine is a machine learning algorithm which uses the support vector concept above to classify data. One of the main features of SVM is that it allows some data points to be deliberately misclassified, in order to achieve a higher overall accuracy.

Figure 6. SVM deliberately allows misclassification

In figure 6, data point A is deliberately misclassified. The SVM algorithm ignores data point A so it can better classify all the other data points. As a result it achieves better overall accuracy than if it tried to include data point A. This principle makes SVM work well when the data is partially intermingled.

What is a Kernel?

A kernel is a transformation from one space to another. For example, in figure 7 we transform the data points from variable X and Y to variable R and T. We can take variable R for example as “the distance from point A”.

Figure 7 Kernel – transforming data from one space to another

To be more precise, a transformation like this is called a "kernel function", not just a kernel.

What is a Radial Function?

A radial function is a function whose value depends only on the distance from the point of origin (x = 0 and y = 0).

For example, if the distance to the origin is constant, the plot on the X and Y axes is a circle. In figure 8 below we can see a circle with r = 2, where r is the distance from the origin. In this case r is the radius of the circle and point O is the point of origin.

Figure 8 Radial Function

What is a Radial Basis Function (RBF)?

A Radial Basis Function (RBF) is a radial function where the reference point is not necessarily the origin. For example, a distance of 3 from the point (5,5) looks like this:

Figure 9. Radial Basis Function

We can sum multiple RBFs to get shapes with multiple centres, like this:
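In code, a radial basis function is just a function of the distance to a centre, and summing several of them gives those multi-centred shapes. A small numpy sketch (gamma and the centres are made up):

import numpy as np

def rbf(x, centre, gamma=1.0):
    # Gaussian radial basis function: the value depends only on the distance to the centre
    return np.exp(-gamma * np.sum((x - centre) ** 2))

point = np.array([4.0, 4.0])
centres = [np.array([5.0, 5.0]), np.array([1.0, 1.0])]

# summing two RBFs gives a surface with two "bumps", one around each centre
print(sum(rbf(point, c) for c in centres))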

So what is SVM with RBF Kernel?

SVM with an RBF kernel is a machine learning algorithm which is capable of classifying data points separated by radial shapes, like this:

Figure 11 SVM with RBF kernel
(source: http://qingkaikong.blogspot.com/2016/12/machine-learning-8-support-vector.html)

And that ability in machine learning is amazing because the decision boundary can "hug" the data points closely, precisely separating them out.
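A minimal scikit-learn sketch of this on circular data (the dataset and the gamma and C values are just for illustration):

from sklearn.datasets import make_circles
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

# two concentric rings, similar to the ellipse-separated scans in figure 2
X, y = make_circles(n_samples=500, factor=0.3, noise=0.1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# gamma controls how tightly the boundary "hugs" the data, C controls the misclassification penalty
model = SVC(kernel="rbf", gamma=1.0, C=1.0).fit(X_train, y_train)
print(model.score(X_test, y_test))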

References

  1. SVM: https://en.wikipedia.org/wiki/Support-vector_machine
  2. Kernel: https://en.wikipedia.org/wiki/Positive-definite_kernel
  3. RBF: https://en.wikipedia.org/wiki/Radial_basis_function

22 May 2021

Tokenisation

Filed under: Data Warehousing — Vincent Rainardi @ 7:38 am

One of my teachers once said to me: the best way to learn is by writing about it. It's been 30 years and his words still ring true in my head.

One of the exciting subjects in machine learning is natural language (NL). There are 2 main subjects in NL: natural language processing (NLP) and natural language generation (NLG).

  • NLP is about processing and understanding human languages in the form of text or voice. For example: reading a book, an email or a tweet, or listening to people talking, singing, or the radio.
  • NLG is about creating text or voice in human languages. For example: writing a poem or a news article, generating a voice which says some sentences, singing a song or making a radio broadcast.

My article today is about NLP. One specific part of NLP. In NLP we have 3 levels of processing: lexical processing, syntactic processing and semantic processing.

  • Lexical processing is looking at a text without thinking about the grammar. We don't differentiate whether a word is a noun or a verb. In other words, we don't consider the role or position of that word in a sentence. For example, we break a text into paragraphs, paragraphs into sentences and sentences into words. We change each word to its root form, e.g. we change "talking", "talked", "talks" to "talk".
  • Syntactic processing is looking at a text to understand the role or function of each word. The meaning of a word depends on its role in the sentence. For example: subject, predicate or object; a noun, a verb, an adverb or an adjective; present tense, past tense or future tense.
  • Semantic processing is trying to understand the meaning of the text. We try to understand the meaning of each word, each sentence, each paragraph and eventually the whole text.

My article today is about lexical processing. One specific part of lexical processing, called tokenisation.

Tokenisation is the process of breaking a text into smaller pieces. For example: breaking sentences into words. The sentence: “Are you ok?” she asked, can be tokenised into 5 words: are, you, ok, she, asked.

We can tokenise text in various different ways: (source: link)

  • characters
  • words
  • sentences
  • lines
  • paragraphs
  • N-grams

N-gram tokenisation is about breaking text into tokens with N characters in each token.
So a 3-gram means 3 characters in each token. (source: link)

For example: the word “learning” can be tokenised into 3-gram like this: lea, ear, arn, rni, nin, ing.
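For example, using the ngrams helper from the NLTK library (introduced below), a small sketch:

from nltk.util import ngrams

# character 3-grams of the word "learning"
print(["".join(g) for g in ngrams("learning", 3)])
# ['lea', 'ear', 'arn', 'rni', 'nin', 'ing']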

One of the most popular libraries in NLP is the Natural Language Toolkit (NLTK). In the NLTK library we have a few tokenisers: a word tokeniser, a sentence tokeniser, a tweet tokeniser and a regular expression tokeniser. Let's go through them one by one.

Word Tokenizer

In NLTK we have a word tokeniser called word_tokenize. This tokeniser breaks text into words not only on spaces but also on apostrophes, greater than, less than and brackets. Periods, commas and colons are tokenised separately.

Python code – print the text:

document = "I'll do it don't you worry. O'Connor'd go at 3 o'clock, can't go wrong. " \
         + "Amazon's delivery at 3:15, but it's nice'. A+B>5 but #2 is {red}, (green) and [blue], email: a@b.com" 
print(document)  
I'll do it don't you worry. O'Connor'd go at 3 o'clock, can't go wrong. Amazon's delivery at 3:15, but it's nice'. A+B>5 but #2 is {red}, (green) and [blue], email: a@b.com

Tokenise using a space:

words = document.split()
print(words)
["I'll", 'do', 'it', "don't", 'you', 'worry.', "O'Connor'd", 'go', 'at', '3', "o'clock,", "can't", 'go', 'wrong.', "Amazon's", 'delivery', 'at', '3:15,', 'but', "it's", "nice'.", 'A+B>5', 'but', '#2', 'is', '{red},', '(green)', 'and', '[blue],', 'email:', 'a@b.com']

Tokenise using word_tokenize from NLTK:

from nltk.tokenize import word_tokenize
words = word_tokenize(document)
print(words)
['I', "'ll", 'do', 'it', 'do', "n't", 'you', 'worry', '.', "O'Connor", "'d", 'go', 'at', '3', "o'clock", ',', 'ca', "n't", 'go', 'wrong', '.', 'Amazon', "'s", 'delivery', 'at', '3:15', ',', 'but', 'it', "'s", 'nice', "'", '.', 'A+B', '>', '5', 'but', '#', '2', 'is', '{', 'red', '}', ',', '(', 'green', ')', 'and', '[', 'blue', ']', ',', 'email', ':', 'a', '@', 'b.com']

We can see above that using spaces we get these:

I’ll   don’t   worry.   O’Connor’d   o’clock,   can’t   Amazon’s   3:15   it’s   A+B>5   #2   {red},   (green)   [blue],   email:   a@b.com

Whereas using word_tokenise from NLTK we get these:

I   ‘ll   n’t   worry   .   O’Connor   ‘d   o’clock   ,   ca   n’t   Amazon   ‘s   3:15   it   ‘s   A+B   >   5   #   2   {  red  }  ,  (  green  )  [  blue  ]  ,   email   :   a   @   b.com

Notice that using NLTK these become separate tokens whereas using spaces they are not:

‘ll  n’t  .  ca  O’Connor  ‘d  o’clock  ‘s  A+B  >  #  {}  ()  []  ,  :   @

Sentence Tokenizer

In NLTK we have a sentence tokeniser called sent_tokenize. This tokeniser breaks text into sentences not only on periods but also on ellipses, question marks and exclamation marks.

Python code – split on period:

document = "Oh... do you mind? Sit please. She said {...} go! So go."
words = document.split(".")
print(words)
['Oh', '', '', ' do you mind? Sit please', ' She said {', '', '', '} go! So go', '']

Using sent_tokenize from NLTK:

from nltk.tokenize import sent_tokenize
sentences = sent_tokenize(document)
print(sentences)
['Oh... do you mind?', 'Sit please.', 'She said {...} go!', 'So go.']

Notice that NLTK breaks the text on periods (.), ellipsis (…),  question mark (?) and exclamation mark (!).

Also notice that if we split on the period we get a space at the beginning of each sentence. Using NLTK we don't.

Tweet Tokenizer

In NLTK we have a tweet tokeniser. We can use this tokeniser to break a tweet into tokens, considering the smileys, emojis and hashtags.

Python code – using NLTK word tokeniser:

document = "I watched it :) It was gr8 <3 😍 #bingewatching"
words = word_tokenize(document)
print(words)
['I', 'watched', 'it', ':', ')', 'It', 'was', 'gr8', '<', '3', '😍', '#', 'Netflix']

Using NLTK tweet tokeniser:

from nltk.tokenize import TweetTokenizer
tknzr = TweetTokenizer()
tknzr.tokenize(document)
['I', 'watched', 'it', ':)', 'It', 'was', 'gr8', '<3', '😍', '#Netflix']

Notice that using tweet tokeniser we get smileys like <3 and hashtags as a token, whereas using word tokeniser the < and # are split up.

Regular Expression Tokenizer

In NLTK we have a regular expression tokeniser. We can use this tokeniser to extract the tokens that match a regular expression pattern, for example hashtags or numbers.

Python code:

from nltk.tokenize import regexp_tokenize
document = "Watched it 3x in 2 weeks!! 10 episodes #TheCrown #Netflix"
hashtags = "#[\w]+"
numbers  = "[0-9]+"

regexp_tokenize(document, hashtags)
['#TheCrown', '#Netflix']

regexp_tokenize(document, numbers)
['3', '2', '10']

Notice that using the regular expression tokeniser we can extract hashtags and numbers. We can also use it to extract dates, email addresses or monetary amounts.

 

16 April 2021

Logistic Regression with PCA in Python

Filed under: Data Warehousing — Vincent Rainardi @ 8:31 pm

Logistic Regression is about predicting a categorical variable, for example whether a client will invest or not. JavaTPoint provides a good, short overview of Logistic Regression: link. Jurafsky & Martin from Stanford provide a more detailed view, along with the mathematics: link. Wikipedia provides a comprehensive view, as always: link.

In this article I will be writing about how to do Logistic Regression in Python. I won't be explaining what it is, only how to do it in Python.

PCA means Principal Component Analysis. When we have a lot of variables, we can reduce them using PCA without losing too much information. Matt Berns provides a good overview and resources: link. Lindsay Smith from Otago provides a good academic overview: link. And as always, Wikipedia provides a comprehensive explanation: link.

I think it would be good to kill two birds with one stone, so in this article I will build 2 Logistic Regression models, one with PCA and one without. This way it provides examples for both cases.

One of the weaknesses of PCA is that we won't know which variables are the top predictors. To find the top predictors we have to build the Logistic Regression model without PCA. As we don't use PCA there, to reduce the number of variables I will use RFE + manual elimination (see here for an example of reducing variables using RFE + manual elimination on Linear Regression). One of the advantages of PCA is that we don't need to worry about multicollinearity in the data (highly correlated features). So in the second model, where I don't use PCA, I have to handle the multicollinearity, i.e. remove the highly correlated features using VIF (Variance Inflation Factor).

There are 5 steps:

  1. Data preparation
    • Load and understand the data
    • Fix data quality issues
    • Data conversion
    • Create derived variables
    • Visualise the data
    • Check highly correlated variables
    • Check outliers
    • Handle class imbalance (see here)
    • Scaling the data
  2. Model 1: Logistic Regression Model with PCA
    • Split the data into X and y
    • Split the data into training and test data set
    • Decide the number of PCA components based on the explained variance
    • Train the PCA model
    • Check the correlations between components
    • Apply PCA model to the test data
    • Train the Logistic Regression model
  3. Model evaluation for Model 1
    • Calculate the Area Under the Curve (AUC)
    • Calculate accuracy, sensitivity & specificity for different cut off points
    • Choose a cut off point
  4. Model 2: Logistic Regression Model without PCA
    • Drop highly correlated columns
    • Split the data into X and y
    • Train the Logistic Regression model
    • Reduce the variables using RFE
    • Remove one variable manually based on the P-value and VIF
    • Rebuild the model
    • Repeat the last 2 steps until P value < 0.05 and VIF < 5
  5. Model evaluation for Model 2
    • Calculate the Area Under the Curve (AUC)
    • Calculate accuracy, sensitivity & specificity for different cut off points and choose a cut off point
    • Identify the most important predictors

Step 1 is long and is not the core of this article, so I will skip it and go directly to Step 2. Step 1 is common to various ML scenarios, so I will write it up in a separate article and put the link here so you can refer to it. One part of step 1 is about handling class imbalance, which I've written about here: link.

Let’s start.

Step 2. Model 1: Logistic Regression Model with PCA

# Split the data into X and y
y = high_value_balanced.pop("target_variable")
X = high_value_balanced

# Split the data into training and test data set
from sklearn.model_selection import train_test_split 
X_train, X_test, y_train, y_test = train_test_split(X,y, train_size=0.7,test_size=0.3,random_state=42)

# Decide the number of PCA components based on the retained information
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

pca = PCA(random_state=88)
pca.fit(X_train)
explained_variance = np.cumsum(pca.explained_variance_ratio_)
plt.vlines(x=80, ymax=1, ymin=0, colors="r", linestyles="--")
plt.hlines(y=0.95, xmax=120, xmin=0, colors="g", linestyles="--")
plt.plot(explained_variance)

We can see above that to retain 95% explained variance (meaning we retain 95% of the information) we need to use 80 PCA components. So we build the PCA model with 80 components.

# Train the PCA model 
from sklearn.decomposition import IncrementalPCA
pca_final = IncrementalPCA(n_components=80)
df_train_pca = pca_final.fit_transform(X_train)

# Note that the above can be automated like this (without using the plot):
pca_final = PCA(0.95)
df_train_pca = pca_final.fit_transform(X_train)

# Check the correlations between components
import seaborn as sns
corr_mat = np.corrcoef(df_train_pca.transpose())
plt.figure(figsize=[15,8])
sns.heatmap(corr_mat)
plt.show()

As we can see in the heatmap above, all of the correlations are near zero (black). This is one of the key features of PCA, i.e. the transformed features are not correlated to one another, i.e. their vectors are orthogonal to each other.

# Apply PCA model to the test data
df_test_pca = pca_final.transform(X_test)

# Train the Logistic Regression model
from sklearn.linear_model import LogisticRegression
LR_PCA_Learner = LogisticRegression()
LR_PCA_Model = LR_PCA_Learner.fit(df_train_pca, y_train)

Step 3. Model evaluation for Model 1

# Calculate the Area Under the Curve (AUC)
from sklearn import metrics
pred_test = LR_PCA_Model.predict_proba(df_test_pca)
"{:2.2}".format(metrics.roc_auc_score(y_test, pred_test[:,1]))

# Calculate the predicted probabilities and convert to dataframe
import pandas as pd
y_pred = LR_PCA_Model.predict_proba(df_test_pca)
y_pred_df = pd.DataFrame(y_pred)
y_pred_1 = y_pred_df.iloc[:,[1]]
y_test_df = pd.DataFrame(y_test)

# Put the index as ID column, remove index from both dataframes and combine them
y_test_df["ID"] = y_test_df.index
y_pred_1.reset_index(drop=True, inplace=True)
y_test_df.reset_index(drop=True, inplace=True)
y_pred_final = pd.concat([y_test_df,y_pred_1],axis=1)
y_pred_final = y_pred_final.rename(columns = { 1 : "Yes_Prob", "target_variable" : "Yes" } )
y_pred_final = y_pred_final.reindex(["ID", "Yes", "Yes_Prob"], axis=1)

# Create columns with different probability cutoffs 
numbers = [float(x)/10 for x in range(10)]
for i in numbers:
    y_pred_final[i]= y_pred_final.Yes_Prob.map(lambda x: 1 if x > i else 0)

# Calculate accuracy, sensitivity & specificity for different cut off points
Probability = pd.DataFrame( columns = ['Probability', 'Accuracy', 'Sensitivity', 'Specificity'])
for i in numbers:
    CM = metrics.confusion_matrix(y_pred_final.Yes, y_pred_final[i] )
    Total = sum(sum(CM))
    Accuracy    = (CM[0,0]+CM[1,1])/Total
    Sensitivity = CM[1,1]/(CM[1,1]+CM[1,0])
    Specificity = CM[0,0]/(CM[0,0]+CM[0,1])
    Probability.loc[i] =[ i, Accuracy, Sensitivity, Specificity]
Probability.plot.line(x='Probability', y=['Accuracy','Sensitivity','Specificity'])

Choose a cut off point

Different applications have different priorities when choosing the cut off point. For some applications, the true positives is more important and the true negative doesn’t matter. In this case we should use sensitivity as the evaluation criteria, and choose the cut off point to make the sensitivity as high as possible e.g. probability = 0.1 which gives sensitivity almost 100%.

For some applications, the true negative is more important and the true positive doesn't matter. In this case we should use specificity as the evaluation criterion, and choose the cut off point to make the specificity as high as possible, e.g. probability = 0.9 which gives specificity of almost 100%.

For most applications the true positive and the true negative are equally important. In this case we should use accuracy as the evaluation criteria, and choose the cut off point to make the accuracy as high as possible, e.g. probability = 0.5 which gives accuracy about 82%.

In most cases it is not the above 3 extreme, but somewhere in the middle, i.e. the true positive is very important but the true negative also matters, even though not as important as the true positive. In this case we should choose the cut off point to make sensitivity high but the specificity not too low. For example, probability = 0.3 which gives sensitivity about 90%, specificity about 65% and accuracy about 80%.

So let’s apply that last option, with cut off point = 0.3:

y_pred_final['predicted'] = y_pred_final.Yes_Prob.map( lambda x: 1 if x > 0.3 else 0)
confusion_matrix = metrics.confusion_matrix( y_pred_final.Yes, y_pred_final.predicted )
Probability[Probability["Probability"]==0.3]

We get sensitivity = 91.8%, specificity = 65.9%, accuracy = 78.9%.

Step 4. Model 2: Logistic Regression Model without PCA

# Check the correlations with the target variable (to decide which columns to keep or drop)
data_corr = pd.DataFrame(data.corr()["target_variable"])
data_corr = data_corr[data_corr["target_variable"] != 1]
data_corr["abs_corr"] = data_corr["target_variable"].abs()
data_corr.sort_values(by=["abs_corr"], ascending=False).head(5)

# Split the data into X and y, and normalise the data 
y = data.pop("target_variable")
normalised_data = (data - data.mean())/data.std()
X = normalised_data

# Train the Logistic Regression model
X_train, X_test, y_train, y_test = train_test_split(X,y, train_size=0.7,test_size=0.3,random_state=88)
LR_model = LogisticRegression(max_iter = 200)
LR_model.fit(X_train, y_train)

# Reduce the variables using RFE
RFE_model = RFE(LR_model, n_features_to_select = 15)
RFE_model = RFE_model.fit(X_train, y_train)
selected_columns = X_train.columns[RFE_model.support_]

# Rebuild the model using statsmodels GLM
import statsmodels.api as sm
X_train_RFE = X_train[selected_columns]
LR_model.fit(X_train_RFE, y_train)
LR2 = sm.GLM(y_train,(sm.add_constant(X_train_RFE)), family = sm.families.Binomial())
LR_Model2 = LR2.fit()
LR_Model2.summary()

# Check the VIF
from statsmodels.stats.outliers_influence import variance_inflation_factor
Model_VIF = pd.DataFrame()
Model_VIF["Variable"] = X_train_RFE.columns
number_of_variables = X_train_RFE.shape[1]
Model_VIF["VIF"] = [variance_inflation_factor(X_train_RFE.values, i) for i in range(number_of_variables)]
Model_VIF.sort_values(by="VIF", ascending=False)

# Remove one variable manually based on the P-value and VIF
X_train_RFE = X_train_RFE.drop(columns=["column8"])
LR2 = sm.GLM(y_train,(sm.add_constant(X_train_RFE)), family = sm.families.Binomial())
LR_Model2 = LR2.fit()
LR_Model2.summary()

Repeat the last 2 steps until all remaining variables have a P-value < 0.05 and a VIF < 5.
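If you prefer, this manual repetition can be scripted. Below is a rough sketch (my own addition, using the hypothetical names X_current and final_model) that keeps dropping the worst variable until both thresholds are met:

import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

X_current = X_train_RFE.copy()
while True:
    final_model = sm.GLM(y_train, sm.add_constant(X_current),
                         family=sm.families.Binomial()).fit()
    p_values = final_model.pvalues.drop("const")
    vif = pd.Series([variance_inflation_factor(X_current.values, i)
                     for i in range(X_current.shape[1])],
                    index=X_current.columns)
    if (p_values < 0.05).all() and (vif < 5).all():
        break
    # Drop the variable with the highest VIF if any VIF >= 5,
    # otherwise the variable with the highest P-value
    worst = vif.idxmax() if (vif >= 5).any() else p_values.idxmax()
    X_current = X_current.drop(columns=[worst])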

Step 5. Model evaluation for Model 2

# Calculate the Area Under the Curve (AUC)
# For a statsmodels Binomial GLM, predict() returns the probability of class 1
X_test_RFE = sm.add_constant(X_test[X_train_RFE.columns])
pred_test = LR_Model2.predict(X_test_RFE)
"{:2.2}".format(metrics.roc_auc_score(y_test, pred_test))

Calculate accuracy, sensitivity & specificity for different cut off points and choose a cut off point:

See “choose cut off point” section above

Identify the most important predictors:

From the model output above, i.e. “LR_Model2.summary()” we can see the most important predictors.
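Because the predictors were normalised, the coefficient magnitudes are comparable, so one way to list the strongest predictors (a small sketch added here, not in the original code) is to sort the coefficients by absolute value:

# Sort the GLM coefficients by absolute value, largest first
LR_Model2.params.drop("const").abs().sort_values(ascending=False)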

15 April 2021

Linear Regression in Python

Filed under: Data Warehousing — Vincent Rainardi @ 6:53 am

Linear Regression is about predicting a numerical variable. There are 5 steps when we do it in Python:

  1. Prepare the data
    • Load and understand the data
    • Fix data quality issues
    • Remove non required columns
    • Visualise and analyse the data
    • Identify highly correlated columns and remove them
    • Create derived variables
    • Create dummy variables for categorical variables
  2. Build the model
    • Split the data into training data and test data
    • Scale the numerical variables in the training data
    • Split the data into y and X
    • Automatically choose top 15 features using RFE (Recursive Feature Elimination)
    • Manually drop features based on P-value and VIF (Variance Inflation Factor)
    • Rebuild the model using OLS (Ordinary Least Squares)
    • Repeat the last 2 steps until all variables have P-value < 0.05 and VIF < 5
  3. Check the distribution of the error terms
  4. Make predictions
    • Scale the numerical variables in the test data
    • Remove the dropped features in the test data
    • Make predictions based on the test data
  5. Model evaluation
    • Plot the predicted vs actual values
    • Calculate R2, Adjusted R2 and F statistics
    • Create the linear equation for the best fitted line
    • Identify top predictors

Below is the Python code for the above steps. I will skip step 1 (preparing the data) and directly go to step 2 because step 1 is common to all ML models (not just linear regression) so I will write it in a separate article.

2. Build the model

# Split the data into training data and test data 
from sklearn.model_selection import train_test_split
np.random.seed(0)
df_train, df_test = train_test_split(df_data, train_size = 0.7, test_size = 0.3, random_state = 100)

# Scale the numerical variables in the training data
from sklearn.preprocessing import MinMaxScaler
minmax_scaler = MinMaxScaler()
continuous_columns = ["column1", "column2", "column3", "column4"]
df_train[continuous_columns] = minmax_scaler.fit_transform(df_train[continuous_columns])

# Split the data into y and X
y_train = df_train.pop("count")
x_train = df_train

# Automatically choose top 15 features using RFE
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
data_LR = LinearRegression()
data_LR.fit(x_train, y_train)
data_RFE = RFE(data_LR, n_features_to_select=15)
data_RFE = data_RFE.fit(x_train, y_train)

# Check which columns are selected by RFE and which are not
list(zip(x_train.columns,data_RFE.support_,data_RFE.ranking_))
selected_columns = x_train.columns[data_RFE.support_]
unselected_columns = x_train.columns[~data_RFE.support_]

# Train the model based on the columns selected by RFE
# and check the coefficients, R2, F statistics and P values
x_train = x_train[selected_columns] 
import statsmodels.api as data_stat_model  
x_train = data_stat_model.add_constant(x_train) 
data_OLS_result = data_stat_model.OLS(y_train, x_train).fit() 
data_OLS_result.params.sort_values(ascending=False) 
print(data_OLS_result.summary()) 

# Calculate the VIF (Variance Inflation Factor) 
from statsmodels.stats.outliers_influence import variance_inflation_factor
data_VIF = pd.DataFrame()
data_VIF['variable'] = x_train.columns
number_of_variables = x_train.shape[1]
data_VIF['VIF'] = [variance_inflation_factor(x_train.values, i) for i in range(number_of_variables)]
data_VIF.sort_values(by="VIF", ascending=False) 

# Drop one column and rebuild the model
# And check the coefficients, R-squared, F statistics and P values
x_train = x_train.drop(columns=["column5"])
x_train = data_stat_model.add_constant(x_train)
data_OLS_result = data_stat_model.OLS(y_train, x_train).fit()
print(data_OLS_result.summary())

# Check the VIF again
data_VIF = pd.DataFrame()
data_VIF['variable'] = x_train.columns
number_of_variables = x_train.shape[1]
data_VIF['VIF'] = [variance_inflation_factor(x_train.values, i) for i in range(number_of_variables)]
data_VIF.sort_values(by="VIF", ascending=False) 

Keep dropping one column at a time and rebuilding the model until all variables have a P-value < 0.05 and a VIF < 5.

The result from print(data_OLS_result.summary()) is something like this, where we can see the R2 and Adjusted R2 of the training data:

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  count   R-squared:                       0.841
Model:                            OLS   Adj. R-squared:                  0.838
Method:                 Least Squares   F-statistic:                     219.8
Date:                Tue, 29 Dec 2020   Prob (F-statistic):          6.03e-190
Time:                        09:07:14   Log-Likelihood:                 508.17
No. Observations:                 510   AIC:                            -990.3
Df Residuals:                     497   BIC:                            -935.3
Df Model:                          12                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.2444      0.028      8.658      0.000       0.189       0.300
column1        0.2289      0.008     28.108      0.000       0.213       0.245
column2        0.1258      0.011     10.986      0.000       0.103       0.148
column3        0.5717      0.025     22.422      0.000       0.522       0.622
column4       -0.1764      0.038     -4.672      0.000      -0.251      -0.102
column5       -0.1945      0.026     -7.541      0.000      -0.245      -0.144
column6       -0.2362      0.026     -8.946      0.000      -0.288      -0.184

3. Check the distribution of the error terms

In linear regression we assume that the error term follows a normal distribution. So we have to check this assumption before we can use the model for making predictions. We check this by looking at the histogram of the error terms visually, making sure that they are normally distributed around zero and that the left and right sides are broadly similar.

fig = plt.figure()
y_predicted = data_OLS_result.predict(x_train)
sns.histplot((y_train - y_predicted), bins = 20, kde = True)   # histplot replaces the deprecated distplot
fig.suptitle('Error Terms', fontsize = 16)
plt.show()

4. Making predictions

# Scale the numerical variables in the test data (just transform, no need to fit)
df_test[continuous_columns] = minmax_scaler.transform(df_test[continuous_columns])

# Split the test data into X and y
y_test = df_test.pop('count')
x_test = df_test

# Remove the features dropped by RFE and manual process
x_test = x_test[selected_columns]
x_test = x_test.drop(["column5", "column6", "column7"], axis = 1)

# Add the constant variable to test data (because by default stats model line goes through the origin)
x_test = data_stat_model.add_constant(x_test)

# Make predictions based on the test data
y_predicted = data_OLS_result.predict(x_test)

5. Model Evaluation

Now that we have built the model and used it to make predictions, we need to evaluate the performance of the model, i.e. how close the predictions are to the actual values.

# Compare the actual and predicted values
fig = plt.figure()
plt.scatter(y_test, y_predicted)
fig.suptitle('Compare actual (Y Test) vs Y predicted', fontsize = 16)
plt.xlabel('Y Test', fontsize = 14)
plt.ylabel('Y Predicted', fontsize = 14)      
plt.show()
  • We can see here that Y Predicted and Y Test have a linear relationship, which is what we expect.
  • There are a few data points which deviate from the line, for example the one in the lower left corner.

We can now calculate the R2 score on the test data like this:

from sklearn.metrics import r2_score
r2_score(y_test, y_predicted)

We can also calculate the Adjusted R2 from the R2. Scikit-learn does not provide it directly, but the standard formula is easy to apply; below is a minimal sketch (assuming x_test still contains the constant column added earlier):
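# Adjusted R2 = 1 - (1 - R2) * (n - 1) / (n - p - 1)
# where n = number of observations and p = number of predictors
n = x_test.shape[0]
p = x_test.shape[1] - 1          # minus 1 to exclude the constant column
r2 = r2_score(y_test, y_predicted)
adjusted_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(adjusted_r2)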

Based on the coefficient values from the OLS regression result we construct the linear equation for the best fitted line, starting from the top predictors like this:

                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.2444      0.028      8.658      0.000       0.189       0.300
column1        0.2289      0.008     28.108      0.000       0.213       0.245
column2        0.1258      0.011     10.986      0.000       0.103       0.148
column3        0.5717      0.025     22.422      0.000       0.522       0.622
column4       -0.1764      0.038     -4.672      0.000      -0.251      -0.102
column5       -0.1945      0.026     -7.541      0.000      -0.245      -0.144
column6       -0.2362      0.026     -8.946      0.000      -0.288      -0.184

y = 0.2444 + 0.5717 column3 – 0.2362 column6  + 0.2289 column1 – 0.1945 column5 – 0.1764 column4 + …

Based on the absolute value of the coefficients we can see that the top 3 predictors are column3, column6 and column1. It is very useful in any machine learning project to know the top predictors, i.e. the most influential features, because then we can take business action to ensure that those features are maximised (or minimised if the coefficient is negative).

Now that we have the linear regression equation, we can use it to predict the target variable for any given input values.

7 April 2021

Handling Class Imbalance

Filed under: Data Warehousing — Vincent Rainardi @ 7:17 am

In this article I will explain a few ways to treat class imbalance in machine learning. I will also give some examples in Python.

What is class imbalance?

Imagine you have a data set containing 2 classes: 100 class A and 100 class B. This is called a balanced data set. But if those 2 classes are 5000 class A and 100 class B, that is an imbalanced data set. This is not limited to 2 classes; it can also happen with more than 2 classes. For example: class A and B both have 5000 members, whereas class C and D both have 100 members.

In an imbalanced data set, the class with fewer members is called the minority class. The class with many more members is called the majority class. So if class A has 5000 members and class B has 100 members, class A is the majority class and class B is the minority class.

Note that the “class” here is the target variable, not the independent variable. So the target variable is a categorical variable, not a continuous variable. A case where the target variable has 2 classes like above is called “binary classification” and it is quite common in machine learning.

At what ratio is it called class imbalance?

There is no exact definition of the ratio. If class A has 20% as many members as class B I would call it imbalanced, whereas at 70% I would call it balanced; around 50% is a reasonable dividing line. It is not worth dwelling on finding a precise ratio range because every data set and every ML algorithm is different: some cases give bad results at 40%, some cases are fine with 40%.
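In practice, a quick way to see the ratio is to look at the class proportions of the target variable, for example (df and the “fraud” column here are hypothetical, matching the fraud example later in this article):

# Proportion of each class in the target variable
df["fraud"].value_counts(normalize=True)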

Why class imbalance occurs

Some data is naturally imbalanced, because one class happens rarely in nature whereas the other happens frequently, for example: cancer, fraud, spam, accidents. The number of people with cancer is naturally much smaller than the number without. The number of fraudulent credit card payments is naturally much smaller than the number of good payments. The number of spam emails is much smaller than the number of good emails. The number of flights having accidents is naturally much smaller than the number of good flights.

Why class imbalance needs to be treated

Some machine learning algorithms don’t work well if the target variable is imbalanced, because during training the majority class is favoured. As a result the model would be skewed towards the majority class. This is an issue because in most cases what we are interested in is predicting the minority class. For example, predicting that a transaction is fraudulent, or that an email is spam, is more important than predicting the majority class.

That is the reason why class imbalance needs to be treated: the model would be skewed towards the majority class, while it is the minority class that we need to predict.

How to treat class imbalance

We resolve this situation by oversampling the minority class or by undersampling the majority class.

Oversampling the minority class means we randomly draw samples from the minority class many times (sampling with replacement), whereas we leave the majority class as it is.

For example, if class A has 5000 members and class B has 100 members, we resample class B with replacement until it also has 5000 members, i.e. we draw 4,900 additional samples. Effectively it is like duplicating the class B data 50 times.

Undersampling the majority class means we randomly select from the majority class only as many samples as the minority class has. In the above example we randomly pick 100 samples from class A, so that both class A and class B have 100 members.
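Below is a minimal sketch of random undersampling, using the same resample function as the oversampling example in the Python section further down (majority_df and minority_df are the hypothetical dataframes defined there):

from sklearn.utils import resample

# Undersample the majority class randomly, without replacement,
# down to the size of the minority class
new_majority_df = resample( majority_df, replace = False,
                            n_samples = len(minority_df),
                            random_state = 0 )

# Combine the reduced majority class with the minority class
balanced_df = pd.concat([new_majority_df, minority_df])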

Apart from randomly selecting data there are many other techniques, including:

  • Creating new samples (called synthetic data)
  • Selecting samples not randomly but favouring samples which are misclassified
  • Selecting samples not randomly but favouring samples which resemble the other class

Jason Brownlee explained several other techniques such as SMOTE, Borderline Oversampling, CNN, ENN, OSS in this article: link.

Python examples

1. Random Oversampling

# Import resample from the Scikit Learn library
from sklearn.utils import resample

# Put the majority class and minority class on separate dataframes
majority_df = df[df["fraud"]==0]
minority_df = df[df["fraud"]==1] 

# Oversampling the minority class randomly
new_minority_df = resample( minority_df, replace = True, 
                            n_samples = len(majority_df), 
                            random_state = 0 )

# Combine the new minority class with the majority class
balanced_df = pd.concat([majority_df, new_minority_df])

2. Synthetic Minority Oversampling Technique (SMOTE)

# Import SMOTE from the Imbalance Learn library
from imblearn.over_sampling import SMOTE

# Oversampling the minority class using SMOTE
s = SMOTE()
X_new, y_new = s.fit_resample(X, y)

Jason Brownlee illustrates very well which part of the minority class gets oversampled by SMOTE in this article: link. Notice how the minority class differs in the first 3 plots of his article. We can see clearly how SMOTE combined with random undersampling is better than SMOTE alone or random undersampling alone.
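For a sketch of that combination, imblearn’s RandomUnderSampler can be chained with SMOTE in an imblearn Pipeline (the sampling ratios below are illustrative assumptions, not values from his article):

from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline

# Oversample the minority class up to 10% of the majority class,
# then undersample the majority class down to twice the minority class
steps = [("over", SMOTE(sampling_strategy=0.1)),
         ("under", RandomUnderSampler(sampling_strategy=0.5))]
X_new, y_new = Pipeline(steps=steps).fit_resample(X, y)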

6 April 2021

Natural Language Processing (NLP)

Filed under: Data Warehousing — Vincent Rainardi @ 8:15 am

NLP is different to all other machine learning areas. Machine learning usually deals with mathematics, with numbers. It is about finding a pattern in the numbers and making a prediction. The roots of the analysis are mathematical: matrices, vectors, statistics, probability and calculus. But NLP is about words and sentences, which is very different.

We are now used to Alexa, Siri and Google being able to understand us and answer us back in a conversation (5 years ago it wasn’t like that). When we type a reply to an email in Gmail or a message in LinkedIn, we are now used to receiving suggestions about what we are going to type. And when we log in to British Gas, online banking or an online retail shop, we now find chat bots with whom we can have a useful conversation. Much better than 5 years ago. There is no doubt there have been significant advancements in this area.

The processing of language, be it voice or text, is done at 3 levels. The bottom level is lexical analysis, where ML deals with each word in isolation. The middle level is syntax analysis, where ML analyses the words within the context of the sentence and the grammar. The top level is semantic analysis, where ML tries to understand the meaning of the sentence.

To do lexical analysis we start with regular expressions. We use regular expressions to find words within a text, and to replace them with other words. Then we learn how to identify and remove stop words such as “and”, “the” and “a”, which occur frequently but don’t provide useful information during lexical analysis. The third step is learning how to break the text into sentences and into words. And finally, for each word we try to find the base word using stemming, lemmatisation or soundex.

Stemming is a process of removing prefixes and suffixes like “ing” and “er” from “learning” and “learner” to get the base word, which is “learn”. Lemmatisation is a process of changing a word to its root, e.g. from “went” to “go”, and from “better”, “well” and “best” to “good”. Soundex is a 4-character code that represents the pronunciation of a word, rather than its spelling.
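To make these lexical steps concrete, here is a minimal sketch using NLTK (my choice of library for illustration, assuming the required NLTK data packages such as punkt, stopwords and wordnet have been downloaded):

import re
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

text = "The learners went home and are learning regular expressions!"
text = re.sub(r"[^a-z ]", "", text.lower())                          # keep only letters and spaces
tokens = word_tokenize(text)                                         # break the text into words
tokens = [t for t in tokens if t not in stopwords.words("english")]  # remove stop words
print([PorterStemmer().stem(t) for t in tokens])                     # stemming, e.g. "learning" -> "learn"
print([WordNetLemmatizer().lemmatize(t, pos="v") for t in tokens])   # lemmatisation, e.g. "went" -> "go"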

Syntax analysis is done by tagging each word as a noun, verb, adjective, etc. (called “part of speech” tagging). The tagging is done by parsing (breaking up) the sentences into groups of words (phrases), analysing the grammatical patterns, and considering the dependencies between words.
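As a small illustration (again using NLTK, and assuming its part-of-speech tagger data package has been downloaded), tagging a sentence can be as simple as:

import nltk
from nltk.tokenize import word_tokenize

sentence = "The quick brown fox jumps over the lazy dog"
print(nltk.pos_tag(word_tokenize(sentence)))
# Returns a list of (word, tag) pairs, e.g. ('fox', 'NN') for a noun and ('jumps', 'VBZ') for a verb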

Semantic analysis is about understanding the meaning of the words and sentences by looking at the structure of the sentence and the word tagging. A word such as “Orange” can mean a colour, a fruit or an area, and “Apple” can mean a fruit or a company, depending on the sentence. In semantic analysis we either assign predefined categories to a text (for example for sentiment analysis, for classifying messages, or for chat bots) or pull out specific information from a text (for example extracting certain terms from IRS contracts or other documents).
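As an illustration of assigning predefined categories to a text, below is a minimal sentiment classification sketch using scikit-learn’s TF-IDF vectoriser and Logistic Regression (the tiny training set is made up purely for illustration):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts  = ["I love this product", "Excellent service, thank you",
          "Terrible experience", "I hate the new update"]
labels = ["positive", "positive", "negative", "negative"]

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)
print(model.predict(["The service was excellent"]))   # expected: ['positive']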
