Data Warehousing and Data Science

31 May 2021

Why Linear Regression is so hard

Filed under: Machine Learning — Vincent Rainardi @ 8:36 am

2 years ago I thought linear regression was the easiest algorithm. But it turns out that it is quite difficult to do, because the X and the Y must have a linear relationship, and the errors must be normally distributed, independent and have equal variance. That kind of data in reality is much more unlikely to happen in nature than I initially thought. And if these 4 criteria are not satisfied, we can’t use linear regression. In addition we also face multicollinearity, overfitting and extrapolation when doing linear regression. In this article I would like to explain these issues, and how to solve them.

Criteria 1. X and Y must have a linear relationship

The first issue is the relationship between the X (independent variables) and the Y (the predicted variable) might not be linear. For example, below is a classic case of a “lower tail” where below x1 the data is lower than the linear values (the red points).

Criteria 2. Error terms must be distributed normally

The second issue is that the errors might not be distributed normally. Below left is an example where the error terms are distributed normally. Error terms are the difference between the actual values and the predicted values, aka the residuals. Remember that normally distributed means that 1 standard deviation must cover 68.2% of the data and 2 SD 95.4% and 2 SD 99.7%. Secondly the centre must be 0. Note: The image on the right is from Wikipedia (link).

Below are 3 examples where the error terms are not distributed normally:

On the left the distribution is almost flat. In the middle, the centre is 2 not 0. On the right, the red bars are too low so that the 3 SD is lower than 99.7%. Unless the error terms are distributed normally, we cannot use the linear regression model that we created.

Criteria 3. Error terms must be independent

Error terms must be independent of what? Independent of three things:

  1. of the independent variables (the X1, X2, etc)
  2. of the predicted variable (the Y)
  3. of the previous error terms (see: Robert Nau’s explanation here)

See below for 3 illustrations where the error terms are not independent:

  • Left image: the error terms are correlated to one of the independent variables. In this example the higher the X the lower the error terms.
  • Middle image: the error terms are correlated to the predicted variable. In this example the higher the Y the higher the error terms.
    Note: in linear regression “the predicted variable” can means two things: the actual values and the predicted values. In the context of error terms independence the convention is the predicted values (y hat) because that is what the model represents and we want to know if we can use the model or not. Saying that, the plot would be similar if we use the actual values rather than the predicted values, because error terms are the difference between the actual values and the predicted values.
  • Right image: the error terms are correlated to the previous value of the error terms. This one is also called autocorrelation or serial correlation; it usually happens on time series data.

The reason why we cannot use the model if the error terms is not independent is because the model is bias and therefore not accurate. For example on the left and middle plots above we can see that the error term (which is the difference between the the actual value and the predicted value, which reflects the model’s accuracy) changes depending on the independent variable and dependent variable.

Independent error terms means that the error terms are randomly scattered around 0 (with regards to the predicted values), like this:

Notice that this chart is between the error terms and the predicted values (y hat), not the actual values.
There are 3 things that we should check on the above scattered chart:

  1. That positive and negative error terms are roughly distributed equally. Meaning that the number of data points above and below the x axis are roughly equal.
  2. That there are no outliers. Meaning that there are no data points which are far away from everything else. For example: all data points are within -2 to +2 range but there is a data point at +4.
  3. Most of the error terms are around zero. Meaning that the further away we move vertically from the x axis, the less crowded the data points are. This is to satisfy the “error terms should be distributed normally” criteria which is centered on zero.

Criteria 4. Error terms must have equal variance

It means that the data points are scattered equally around zero, no matter what the predicted values are. In the image below the error terms are not the same across the predicted values (Y hat). Around Y hat = a the error terms have a small variance, at Y hat = b the error terms have a large variance and at Y hat = c the error terms have a small variance.

What should we do?

If the X-Y plot or the residual plot indicates that there is a non-linear relationship in the data (i.e. the 4 points above), there are four things we can do:

  1. We can transform the independent variables or the predicted variable.
  2. We can use polynomial regression
  3. We can do non-linear regression
  4. We can do segmented regression

The first thing is transforming one or more of the independent variables (X) into ln(X), e^X, e^-X, square root of X, etc:

  • First, we need to find out which independent variable is not linear. This is done by plotting each independent variable against the predicted variable (one by one).
  • Then we choose a suitable transformation based on the chart from the first step above, for example: (graphs from
  • Then we transform the non-linear independent variable, for example we transform X to ln(X), and we use this ln(X) as the independent variable in the linear regression.

The second one is using polynomial regression instead of linear regression, like this:

We can read about polynomial regression in Wikipedia (link) and in Towards Data Science (link, by Animesh Agarwal). As we can read in Animesh’ article, the degree of the polynomial that we choose affects the overfitting, so it’s a trade off between the bias and variance.

The third one that we can do is non-linear regression. By non-linear I mean the model parameters/coefficients (the betas), not the independent variables (the X). Meaning that it is not in the form of “y = beta1 something + beta2 something + beta3 something + …” For example, this is a non-linear regression:

In non-linear regression we approximate the model using first order Taylor series. We can read about polynomial regression in Wikipedia (link).

The last one is segmented regression, where we partition the independent variables into several segments, and for each segment we use linear regression. So instead of 1 long line, the linear regression is several “broken lines”. That is why this technique is known as “broken-stick regression” which we can read in Wikipedia: link. It is also known as “piecewise regression” as the Python implementation is using numpy.piecewise() function, which we can read in Stack Overflow: link.

Multicollinearity, overfitting, extrapolation

At the beginning of this article I also mentioned about these 3 issues when doing linear regression. What are these issues and how do we solve them?

Multicollinearlity means that one of the independent variables is highly correlated to another independent variable. This is a problem because it causes the model to have high variance, i.e. the model coefficients change erratically when there are small changes in the data, causing the model to be unstable.

The solution is to drop one of the multicollinear variables. We can read more about multicollinearity in Wikipedia, including a few other solutions: link.

Overfitting happens when we use high degree polynomial regression. We detect overfitting by comparing the accuracy in the training and test data set. If the accuracy on the training data set is very high (>90%) and the accuracy on the test data set is much lower (a difference of 10% or more) then the model is overfitting (see: link).

The solution is to use regularisation such as Lasso or Ridge (link), using feature selection (link), or using cross validation (link).

Extrapolation is about using the linear, polynomial or non-linear regression model beyond the range of the training and test data. The considerations and real world examples are given in this Medium article by Dennish Ash: link.

The solution is to review the linearity relationship between the independent variable and the predicted variable in the data range where we want to do extrapolation. We review using business sense (not using data), checking if the relationship is still linear outside the data range that we have.

One consideration is that the further the distance to the training and test data range, the more risky the extrapolation. For example, if in the training and test data the independent variable is between 20 and 140, predicting the output for 180 is more risky than predicting the output for 145.

Note on plots in machine learning

Machine learning is a science about data and as such when making plots/graphs we must always make it clear the meaning of each axis. And yet bizarrely during my 2 years in machine learning I encountered so many graphs with the axis not labelled! This irritates me so much. We must label the axis properly, because depending on what the axis are the graph could mean an entirely different thing.

For example: the graph below says heteroscedastic but has no label on either the y axis nor the x axis. So how could we know what those data points are? Is it independent variable against the dependent variable? It turns out that the x axis is the predicted variable and y axis is the error term.

27 May 2021

Ensembles – Odd and Even

Filed under: Machine Learning — Vincent Rainardi @ 7:01 am

In machine learning we have a technique called Ensembles, i.e. we combine multiple models. The more models we use, the higher the chance of getting right. That is understandable and expected. But the number of models being odd or even has a significant effect too. I didn’t expect that and in this short article I would like to share it.

I’ll start from the end. Below is the probability of using 2 to 7 models to predict a binary output, i.e. right or wrong. Each model has 75% chance of getting it right i.e. correctly predicting the output.

If we look at the top row (2, 4 and 6 models) the probability of the ensemble getting it right increases, i.e. 56%, 74%, 83%. If we look at the bottom row (3, 5 and 7 models) it also increases, i.e. 84%, 90%, 93%.

But from 3 models to 4 models it is down from 84% to 74% because we have 21% of “Not sure”. This 21% is when 2 models are right and 2 models are wrong and therefore the output is “Not sure”. Therefore we would rather use 3 models than 4 models because 3 models is better than 4 models, in terms of the chance of getting it right (correctly predicting the output).

The same thing happen between 5 and 6 models. The probability of the ensemble getting it right decreases from 90% to 83% because we have 13% of “Not sure”. This is where 3 models are right and and 3 model are wrong so the output is “Not sure”.

So when using ensembles to predict binary output we need to use odd number of models, because they don’t have “Not sure” where equal number of models are right and wrong.

We also need to remember that each model must have >50% chance of predicting the correct result. Because if not the model ensemble is weaker than the individual model. For example, if each model has only 40% of predicting the correct output, then using 3 models gives us 35%, 5 models 32% and 7 models 29% (see below).

The second thing that we need to remember when making an ensemble of models is that the models need to be independent, meaning that they have different areas of strength.

We can see that this “independent” principle is reflected in the calculation of each ensemble. For example: for 3 models, when all 3 models get it right, the probability is 75% x 75% x 75% (see below). This 75% x 75% x 75% means that the 3 models are completely independent to each other.

This “completely independent” is a prefect condition and it doesn’t happen in reality. So in the above case the probability of each of the 3 models getting it right is lower than 42%. But we have to try to get the models independent the best we can. Meaning that we need to get them as different as possible, with each model should have their own areas of speciality, their own areas of strength.

24 May 2021

SVM with RBF Kernel

Filed under: Machine Learning — Vincent Rainardi @ 5:20 pm

Out of all machine learning algorithms, SVM with RBF Kernel is the one that fascinates me the most. So in this article I am going to try to explain what it is, and why it works wonders.

I will begin by explaining a problem, and how this algorithm solves that problem.

Then I will explain what it is. SVM = Support Vector Machine, and RBF = Radial Basis Function. So I’ll explain what a support vector is, what a support vector machine is, what a kernel is and what a radial basis function is. Then I’ll combine them all and give a overall picture of what SVM with RBF Kernel is.

After we understand what it is, I’m going to briefly explain how it works.

Ok let’s start.

The Problem

We need to clasify 1000 PET scan images into cancer and benign. Whether it is cancer or benign is affected by two variable, X and Y. Fortunately the cancer and benign scans are linearly separable like this:

Figure 1. Linearly separable cancer and benign scans

We call this space a “linear space”. In this case we can find the equation of a line* which separate cancer and benign scans. Because it is linearly separable we can use linear machine learning algorithms.

The problem is when the data set is not linearly separable, like this:

Figure 2. Cancer and benign scans which are not linearly separable

In this case it is separable by an ellipse. We call this space a non linear space. We can find the equation for the ellipse but it won’t work with linear machine learning algorithms.

The Solution

The solution to this problem is to transform the non linear space into a linear space, like this:

Figure 3. Transforming a non-linear space into a linear space

Once it is in a linear space, we can use linear machine learning algorithms.

Why is it important to be able to use a linear ML algorithms? Because there are many popular linear ML algorithms which work well.

What is a Support Vector?

The 4 data points A, B, C, D in figure 4 below are called support vectors. They are the data points located nearest to the separator line.

Figure 4. Support Vectors

They are called support vector because they are the ones which determine where the separator line is located. The other data points don’t matter, they don’t affect where the spearator line is located. Even if we remove all the other data points the separator line will still be the same, as illustrated below:

Figure 5. Support Vectors affect the separator line

What is a Support Vector Machine?

Support Vector Machine is a machine learning algorithm which uses the support vector concept above to classify data. One of the main features of SVM is that it allows some data points to be deliberately misclassified, in order to achieve a higher overall accuracy.

Figure 6. SVM deliberately allows misclassifiction

In figure 6, data point A is deliberately misclassified. The SVM algorithm ignores data point A, so it can better classify all the other data points. As a result it achieves better overall accuracy, compared to if it tries to include data point A. This principle makes SVM work well when the data is partially intermingled.

What is a Kernel?

A kernel is a transformation from one space to another. For example, in figure 7 we transform the data points from variable X and Y to variable R and T. We can take variable R for example as “the distance from point A”.

Figure 7 Kernel – transforming data from one space to another

To be more precise, transformation like this is called a “Kernel Function” not just a Kernel.

What is a Radial Function?

Radial function is a function whose value depends on the distance from the point of origin (x = 0 and y = 0).

For example if the distance to origin is constant, then it is a circle if we plot it on X and Y axes. In figure 8 below we can see a circle with r = 2, where r is the distance from origin. In this case r is the radius of the circle and point O is the point of origin.

Figure 8 Radial Function

What is a Radial Basis Function (RBF)?

A Radial Basis Function (RBF) is a radial function where the reference point is not the origin. For example, distance of 3 from point (5,5) is like this:

Figure 9. Radial Basis Function

We can sum multiple RBFs to get shapes with multiple centres like this:

So what is SVM with RBF Kernel?

SVM with RBF Kernel is a machine learning algorithm which is capable to classify data points separated with radial based shapes like this:

Figure 11 SVM with RBF kernel

And that ability in machine learning is amazing because it can “hug” the data points closely, precisely separating them out.


  1. SVM:
  2. Kernel:
  3. RBF:

22 May 2021


Filed under: Machine Learning — Vincent Rainardi @ 7:38 am

One my teachers said to me once: the best way to learn is by writing it. It’s been 30 years and his words still rings true in my head.

One of the exciting subjects in machine learning is natural language (NL). There are 2 main subjects in NL: natural language processing (NLP) and natural language generation (NLG).

  • NLP is about processing and understanding human languages in the form of a text or voice. For example: reading a book, an email or a tweet, listening to people talking, singing, radio, etc.
  • NLG is about creating a text or voice in human languages. For example: creating a poetry or a news article, generating a voice which says some sentences, singing a song or a radio broadcast.

My article today is about NLP. One specific part of NLP. In NLP we have 3 levels of processing: lexical processing, syntactic processing and semantic processing.

  • Lexical processing is looking at a text without thinking about the grammar. We don’t differentiate if a word is a noun or a verb. In other words we don’t consider the role or position of that word in a sentence. For example, we break a text into paragraphs, paragraph into sentences and sentence into words. We change each word to their root form, e.g. we change “talking”, “talked”, “talks” to “talk”.
  • Syntactic processing is looking at a text to understand the role or function of each word. The meaning of a word depends on its role in the sentence. For example: subject, predicate or object. A noun, a verb, an adverb or an adjective. Present tense, past tense or in the future.
  • Semantic processing is trying to understand the meaning of the text. We try to understand the meaning of each word, each sentence, each paragraph and eventually the whole text.

My article today is about lexical processing. One specific part of lexical processing, called tokenisation.

Tokenisation is the process of breaking a text into smaller pieces. For example: breaking sentences into words. The sentence: “Are you ok?” she asked, can be tokenised into 5 words: are, you, ok, she, asked.

We can tokenise text in various different ways: (source: link)

  • characters
  • words
  • sentences
  • lines
  • paragraphs
  • N-grams

N-gram tokenisation is about breaking text into tokens with N number of characters in each token.
So 3-gram means 3 characters in each token. (source: link)

For example: the word “learning” can be tokenised into 3-gram like this: lea, ear, arn, rni, nin, ing.

One of the most popular libraries in NLP is the Natural Language Toolkit (NLTK). In NLTK library we have a few tokenisers: word tokeniser, sentence tokeniser, tweet tokerniser and regular expression tokeniser. Let’s go through them one by one.

Word Tokenizer

In NLTK we have a word tokeniser called word_tokenize. This tokeniser breaks text into word not only on spaces but also on apostrophy, greater than, less than and brackets. Periods, commas and colons are tokenised separately.

Python code – print the text:

document = "I'll do it don't you worry. O'Connor'd go at 3 o'clock, can't go wrong. " \
         + "Amazon's delivery at 3:15, but it's nice'. A+B>5 but #2 is {red}, (green) and [blue], email:" 
I'll do it don't you worry. O'Connor'd go at 3 o'clock, can't go wrong. Amazon's delivery at 3:15, but it's nice'. A+B>5 but #2 is {red}, (green) and [blue], email:

Tokenise using a space:

words = document.split()
["I'll", 'do', 'it', "don't", 'you', 'worry.', "O'Connor'd", 'go', 'at', '3', "o'clock,", "can't", 'go', 'wrong.', "Amazon's", 'delivery', 'at', '3:15,', 'but', "it's", "nice'.", 'A+B>5', 'but', '#2', 'is', '{red},', '(green)', 'and', '[blue],', 'email:', '']

Tokenise using word_tokenize from NLTK:

from nltk.tokenize import word_tokenize
words = word_tokenize(document)
['I', "'ll", 'do', 'it', 'do', "n't", 'you', 'worry', '.', "O'Connor", "'d", 'go', 'at', '3', "o'clock", ',', 'ca', "n't", 'go', 'wrong', '.', 'Amazon', "'s", 'delivery', 'at', '3:15', ',', 'but', 'it', "'s", 'nice', "'", '.', 'A+B', '>', '5', 'but', '#', '2', 'is', '{', 'red', '}', ',', '(', 'green', ')', 'and', '[', 'blue', ']', ',', 'email', ':', 'a', '@', '']

We can see above that using spaces we get these:

I’ll   don’t   worry.   O’Connor’d   o’clock,   can’t   Amazon’s   3:15   it’s   A+B>5   #2   {red},   (green)   [blue],   email:

Whereas using word_tokenise from NLTK we get these:

I   ‘ll   n’t   worry   .   O’Connor   ‘d   o’clock   ,   ca   n’t   Amazon   ‘s   3:15   it   ‘s   A+B   >   5   #   2   {  red  }  ,  (  green  )  [  blue  ]  ,   email   :   a   @

Notice that using NLTK these become separate tokens whereas using spaces they are not:

‘ll  n’t  .  ca  O’Connor  ‘d  o’clock  ‘s  A+B  >  #  {}  ()  []  ,  :   @

Sentence Tokenizer

In NLTK we have a sentence tokeniser called sent_tokenize. This tokeniser breaks text into sentences not only on periods but also on ellipsis, question marks and exclamation mark.

Python code – split on period:

document = "Oh... do you mind? Sit please. She said {...} go! So go."
words = document.split(".")
['Oh', '', '', ' do you mind? Sit please', ' She said {', '', '', '} go! So go', '']

Using sent_tokenize from NLTK:

from nltk.tokenize import sent_tokenize
sentences = sent_tokenize(document)
['Oh... do you mind?', 'Sit please.', 'She said {...} go!', 'So go.']

Notice that NLTK breaks the text on periods (.), ellipsis (…),  question mark (?) and exclamation mark (!).

Also notice that if we use period we get a space in the beginning for the sentence. Using NLTK we don’t.

Tweet Tokenizer

In NLTK we have a tweet tokeniser. We can use this tokeniser to break a tweet into tokens, considering the smileys, emojis and hashtags.

Python code – using NLTK word tokeniser:

document = "I watched it :) It was gr8 <3 😍 #bingewatching"
words = word_tokenize(document)
['I', 'watched', 'it', ':', ')', 'It', 'was', 'gr8', '<', '3', '😍', '#', 'Netflix']

Using NLTK tweet tokeniser:

from nltk.tokenize import TweetTokenizer
tknzr = TweetTokenizer()
['I', 'watched', 'it', ':)', 'It', 'was', 'gr8', '<3', '😍', '#Netflix']

Notice that using tweet tokeniser we get smileys like <3 and hashtags as a token, whereas using word tokeniser the < and # are split up.

Regular Expression Tokenizer

In NLTK we have a regular expression tokeniser. We can use this tokeniser to break a tweet into tokens, considering the smileys, emojis and hash tags.

Python code:

from nltk.tokenize import regexp_tokenize
document = "Watched it 3x in 2 weeks!! 10 episodes #TheCrown #Netflix"
hashtags = "#[\w]+"
numbers  = "[0-9]+"

regexp_tokenize(document, hashtags)
['#TheCrown', '#Netflix']

regexp_tokenize(document, numbers)
['3', '2', '10']

Notice that using regular expression tokeniser we can extract hash tags and numbers. We can also use it to extract dates, email address, monetary amount.


Blog at