Two years ago I thought linear regression was the easiest algorithm. But it turns out to be quite demanding, because X and Y must have a linear relationship, and the errors must be normally distributed, independent and have equal variance. Data like that is much rarer in reality than I initially thought. And if these 4 criteria are not satisfied, we can’t use linear regression. In addition, we also face multicollinearity, overfitting and extrapolation when doing linear regression. In this article I would like to explain these issues and how to solve them.

__Criteria 1. X and Y must have a linear relationship__

The first issue is that the relationship between X (the independent variables) and Y (the predicted variable) might not be linear. For example, below is a classic case of a “lower tail”, where below x1 the data (the red points) falls below the straight line.
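One rough way to spot this without a chart is to fit a straight line and look at the average residual in different parts of the X range. The sketch below uses made-up data that follows a square root curve (my assumption, not the data behind the image) and NumPy's `polyfit`. If the relationship were linear, the three averages would all be close to zero.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: y grows like sqrt(x), so the true relationship is curved.
x = np.linspace(0.0, 100.0, 300)
y = np.sqrt(x) + rng.normal(0.0, 0.1, size=x.size)

# Fit a straight line y = slope * x + intercept by least squares.
slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (slope * x + intercept)

# Average residual in the lower, middle and upper third of the x range
# (x is sorted, so splitting the residuals in order splits the x range).
low, mid, high = (float(part.mean()) for part in np.array_split(residuals, 3))
print(f"mean residual: low={low:.3f}, mid={mid:.3f}, high={high:.3f}")
```

For a curve like this one the line overshoots at both ends and undershoots in the middle, so the sign of the average residual changes across the range.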

__Criteria 2. Error terms must be distributed normally__

The second issue is that the errors might not be distributed normally. Error terms are the difference between the actual values and the predicted values, aka the residuals. Below left is an example where the error terms are distributed normally. Remember that normally distributed means that 1 standard deviation (SD) either side of the centre covers about 68.3% of the data, 2 SD about 95.4% and 3 SD about 99.7%. Secondly, the centre must be 0. Note: The image on the right is from Wikipedia (link).

Below are 3 examples where the error terms are not distributed normally:

On the left the distribution is almost flat. In the middle, the centre is 2, not 0. On the right, the red bars are too low, so that 3 SD covers less than 99.7% of the data. Unless the error terms are distributed normally, we cannot use the linear regression model that we created.
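The two conditions (centre at 0, and the 1, 2 and 3 SD coverage) can be checked numerically. This is a minimal sketch with simulated residuals. The function name `coverage` and the sample data are my own, and it is a rough check rather than a formal normality test.

```python
import numpy as np

def coverage(residuals):
    """Return the mean and the share of residuals within 1, 2 and 3 SD of zero."""
    residuals = np.asarray(residuals, dtype=float)
    sd = residuals.std(ddof=1)
    shares = [float(np.mean(np.abs(residuals) <= k * sd)) for k in (1, 2, 3)]
    return float(residuals.mean()), shares

rng = np.random.default_rng(1)
normal_res = rng.normal(0.0, 1.0, size=10_000)   # bell-shaped residuals
flat_res = rng.uniform(-1.0, 1.0, size=10_000)   # almost flat residuals

mean_normal, shares_normal = coverage(normal_res)
mean_flat, shares_flat = coverage(flat_res)
print("normal:", mean_normal, shares_normal)
print("flat:  ", mean_flat, shares_flat)
```

The normal sample should land near 68%, 95% and 99.7%, while the flat sample covers noticeably less than 68% within 1 SD and already 100% within 2 SD.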

__Criteria 3. Error terms must be independent__

Error terms must be independent of what? Independent of three things:

- of the independent variables (X1, X2, etc.)
- of the predicted variable (Y)
- of the previous error terms (see: Robert Nau’s explanation here)

See below for 3 illustrations where the error terms are not independent:

- Left image: the error terms are correlated to one of the independent variables. In this example the higher the X the lower the error terms.
- Middle image: the error terms are correlated to the predicted variable. In this example the higher the Y the higher the error terms.
- Right image: the error terms are correlated to the previous value of the error terms. This one is also called autocorrelation or serial correlation; it usually happens in time series data.

Note: in linear regression “the predicted variable” can mean two things: the actual values and the predicted values. In the context of error term independence the convention is the predicted values (y hat), because that is what the model represents and we want to know if we can use the model or not. The two plots are not interchangeable: because error terms are the difference between the actual values and the predicted values, they are always somewhat correlated with the actual values, even for a well-behaved model.

The reason we cannot use the model if the error terms are not independent is that the model is biased and therefore not accurate. For example, in the left and middle plots above we can see that the error term (the difference between the actual value and the predicted value, which reflects the model’s accuracy) changes depending on the independent variable and the predicted variable.
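For the third kind of dependence (autocorrelation), a common numerical check is the Durbin-Watson statistic. A value near 2 suggests no correlation between consecutive error terms, and a value well below 2 suggests positive autocorrelation. The sketch below computes it with NumPy on simulated residuals; the AR(1) series with a coefficient of 0.8 is a made-up example.

```python
import numpy as np

def durbin_watson(residuals):
    """Sum of squared successive differences divided by the sum of squares."""
    residuals = np.asarray(residuals, dtype=float)
    return float(np.sum(np.diff(residuals) ** 2) / np.sum(residuals ** 2))

rng = np.random.default_rng(2)

# Independent error terms.
independent = rng.normal(size=500)

# Autocorrelated error terms: each one carries over 80% of the previous one.
shocks = rng.normal(size=500)
correlated = np.zeros(500)
for t in range(1, 500):
    correlated[t] = 0.8 * correlated[t - 1] + shocks[t]

dw_independent = durbin_watson(independent)
dw_correlated = durbin_watson(correlated)
print(dw_independent, dw_correlated)
```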

Independent error terms means that the error terms are randomly scattered around 0 (with regards to the predicted values), like this:

Notice that this chart is between the error terms and the predicted values (y hat), not the actual values.

There are 3 things that we should check on the above scatter chart:

- That positive and negative error terms are roughly distributed equally. Meaning that the number of data points above and below the x axis is roughly equal.
- That there are no outliers. Meaning that there are no data points which are far away from everything else. For example: all data points are within -2 to +2 range but there is a data point at +4.
- Most of the error terms are around zero. Meaning that the further away we move vertically from the x axis, the less crowded the data points are. This is to satisfy the “error terms should be distributed normally” criterion, which is centered on zero.
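The three checks above can also be expressed as numbers. This is a minimal sketch. The function name `residual_checks` and the cut-off of 3 standard deviations for an outlier are my assumptions, and the residuals are simulated to mimic the "-2 to +2 with one point at +4" example.

```python
import numpy as np

def residual_checks(residuals, outlier_z=3.0):
    """Summarise sign balance, outliers and concentration around zero."""
    residuals = np.asarray(residuals, dtype=float)
    z = residuals / residuals.std(ddof=1)  # distance from zero in SD units
    return {
        "share_positive": float(np.mean(residuals > 0)),
        "n_outliers": int(np.sum(np.abs(z) > outlier_z)),
        "share_within_1sd": float(np.mean(np.abs(z) <= 1.0)),
    }

rng = np.random.default_rng(3)

# Simulated residuals kept within -2..+2, plus one stray point at +4.
bounded = np.clip(rng.normal(0.0, 0.8, size=400), -2.0, 2.0)
checks = residual_checks(np.append(bounded, 4.0))
print(checks)
```

A healthy plot would give a share of positive residuals near 0.5, no outliers, and most points within 1 SD of zero. Here the stray point at +4 is the only one flagged.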

__Criteria 4. Error terms must have equal variance__

It means that the data points are scattered equally around zero, no matter what the predicted values are. In the image below the variance of the error terms is not the same across the predicted values (Y hat). Around Y hat = a the error terms have a small variance, at Y hat = b the error terms have a large variance and at Y hat = c the error terms have a small variance.
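A simple way to check this without a chart is to sort the error terms by predicted value, split them into a few groups and compare the variance of each group. The sketch below simulates the small, large, small pattern described above. The bin count of 3 and the simulated spreads are my assumptions.

```python
import numpy as np

def variance_by_bin(y_hat, residuals, n_bins=3):
    """Variance of the residuals in consecutive groups of predicted values."""
    order = np.argsort(y_hat)
    groups = np.array_split(np.asarray(residuals, dtype=float)[order], n_bins)
    return [float(g.var(ddof=1)) for g in groups]

rng = np.random.default_rng(4)
y_hat = np.linspace(0.0, 30.0, 600)

# Hypothetical spread: small near the ends, large in the middle.
spread = np.where((y_hat > 10.0) & (y_hat < 20.0), 3.0, 0.5)
residuals = rng.normal(0.0, 1.0, size=y_hat.size) * spread

variances = variance_by_bin(y_hat, residuals)
print(variances)
```

With equal variance the three numbers would be similar. Here the middle group is far larger than the other two.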

__What should we do?__

If the X-Y plot or the residual plot indicates that there is a non-linear relationship in the data (i.e. the 4 criteria above are not met), there are four things we can do:

- We can transform the independent variables or the predicted variable.
- We can use polynomial regression
- We can do non-linear regression
- We can do segmented regression

The first thing is transforming one or more of the independent variables (X) into ln(X), e^X, e^-X, square root of X, etc:

- First, we need to find out which independent variable is not linear. This is done by plotting each independent variable against the predicted variable (one by one).
- Then we choose a suitable transformation based on the chart from the first step above, for example: (graphs from fooplot.com)

- Then we transform the non-linear independent variable, for example we transform X to ln(X), and we use this ln(X) as the independent variable in the linear regression.
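The steps above can be sketched as follows, assuming made-up data where Y depends on ln(X). We fit a straight line once against X and once against ln(X), and compare R squared. The helper name `r_squared` is my own.

```python
import numpy as np

def r_squared(x, y):
    """R squared of a straight-line least squares fit of y on x."""
    slope, intercept = np.polyfit(x, y, deg=1)
    residuals = y - (slope * x + intercept)
    return float(1.0 - residuals.var() / y.var())

rng = np.random.default_rng(5)

# Hypothetical data: y is linear in ln(x), not in x.
x = np.linspace(1.0, 200.0, 300)
y = 3.0 + 2.0 * np.log(x) + rng.normal(0.0, 0.2, size=x.size)

r2_raw = r_squared(x, y)          # X used as it is
r2_log = r_squared(np.log(x), y)  # X transformed to ln(X)
print(r2_raw, r2_log)
```

The transformed variable gives a clearly better straight-line fit, which is the signal that ln(X) is the more suitable input here.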

The second one is using **polynomial regression** instead of linear regression, like this:

We can read about polynomial regression in Wikipedia (link) and in Towards Data Science (link, by Animesh Agarwal). As we can read in Animesh’s article, the degree of the polynomial that we choose affects how much the model overfits, so it’s a trade-off between bias and variance.
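As a minimal sketch, a degree 2 polynomial regression (y = b0 + b1 x + b2 x^2) can be fitted with NumPy's `polyfit`. The data and the true coefficients below are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(6)

# Hypothetical data generated from y = 1 + 0.5 x + 2 x^2 plus noise.
x = np.linspace(-3.0, 3.0, 200)
y = 1.0 + 0.5 * x + 2.0 * x**2 + rng.normal(0.0, 0.5, size=x.size)

# polyfit returns the coefficients with the highest power first: [b2, b1, b0].
coeffs = np.polyfit(x, y, deg=2)
fitted = np.polyval(coeffs, x)
print(coeffs)
```

The model is still linear in the coefficients, which is why ordinary least squares can fit it.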

The third thing that we can do is **non-linear regression**. By non-linear I mean non-linear in the model parameters/coefficients (the betas), not in the independent variables (the X). Meaning that it is not in the form of “y = beta1 something + beta2 something + beta3 something + …” For example, this is a non-linear regression:

Many fitting methods for non-linear regression work iteratively, approximating the model with a first order Taylor series around the current parameter estimates. We can read about non-linear regression in Wikipedia (link).
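To make this concrete, here is a small sketch of Gauss-Newton iteration, one method that linearises the model with a first order Taylor expansion at each step. The exponential model y = a * exp(b * x) is my own choice for illustration and not necessarily the example in the original figure. The function names and the fixed number of iterations are also assumptions.

```python
import numpy as np

def model(x, a, b):
    """Hypothetical model that is non-linear in the parameter b."""
    return a * np.exp(b * x)

def gauss_newton(x, y, a, b, n_iter=20):
    """Repeatedly solve a linear least squares problem for the parameter step."""
    for _ in range(n_iter):
        residuals = y - model(x, a, b)
        # Partial derivatives of the model with respect to a and b:
        # these are the first order Taylor terms around the current (a, b).
        jac = np.column_stack([np.exp(b * x), a * x * np.exp(b * x)])
        step, *_ = np.linalg.lstsq(jac, residuals, rcond=None)
        a, b = a + step[0], b + step[1]
    return float(a), float(b)

rng = np.random.default_rng(7)
x = np.linspace(0.0, 2.0, 100)
y = model(x, 2.0, 1.5) + rng.normal(0.0, 0.05, size=x.size)

# Starting values from a straight-line fit to ln(y), valid here because y > 0.
b0, ln_a0 = np.polyfit(x, np.log(y), deg=1)
a_hat, b_hat = gauss_newton(x, y, float(np.exp(ln_a0)), float(b0))
print(a_hat, b_hat)
```

In practice a library routine with safeguards (step damping, convergence checks) is a better choice than this bare loop, which can diverge from a poor starting point.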

The last one is **segmented regression**, where we partition the independent variable into several segments, and for each segment we use linear regression. So instead of 1 long line, the regression is made of several “broken lines”. That is why this technique is known as “broken-stick regression”, which we can read about in Wikipedia: link. It is also known as “piecewise regression”, and a Python implementation can use the numpy.piecewise() function, which we can read about in Stack Overflow: link.
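Here is a minimal sketch of a two-segment regression with one breakpoint. It is not the Stack Overflow implementation: the breakpoint is found by a simple grid search over candidate values (my assumption), each candidate is fitted with `numpy.linalg.lstsq`, and `numpy.piecewise` is used for the prediction. The function names and the data are made up.

```python
import numpy as np

def fit_segmented(x, y, candidates):
    """Try each candidate breakpoint and keep the one with the smallest error."""
    best = None
    for bp in candidates:
        # Columns: intercept, slope, and the change in slope after the breakpoint.
        design = np.column_stack([np.ones_like(x), x, np.maximum(x - bp, 0.0)])
        coef, *_ = np.linalg.lstsq(design, y, rcond=None)
        sse = float(np.sum((design @ coef - y) ** 2))
        if best is None or sse < best[0]:
            best = (sse, float(bp), coef)
    return best[1], best[2]

def predict_segmented(x, bp, coef):
    """Evaluate the two connected line segments with numpy.piecewise."""
    b0, b1, b2 = coef
    return np.piecewise(
        x,
        [x < bp, x >= bp],
        [lambda v: b0 + b1 * v, lambda v: b0 + b1 * v + b2 * (v - bp)],
    )

rng = np.random.default_rng(8)
x = np.linspace(0.0, 10.0, 200)
# Hypothetical data: slope 1 up to x = 4, slope 3 afterwards.
y = np.where(x < 4.0, 1.0 + x, 5.0 + 3.0 * (x - 4.0))
y = y + rng.normal(0.0, 0.3, size=x.size)

bp_hat, coef_hat = fit_segmented(x, y, candidates=np.linspace(1.0, 9.0, 81))
y_fit = predict_segmented(x, bp_hat, coef_hat)
print(bp_hat, coef_hat[1], coef_hat[1] + coef_hat[2])
```

The design keeps the two lines connected at the breakpoint, so the result is a "broken stick" rather than two unrelated lines.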

__Multicollinearity, overfitting, extrapolation__

At the beginning of this article I also mentioned these 3 issues when doing linear regression. What are these issues and how do we solve them?

Multicollinearity means that one of the independent variables is highly correlated to another independent variable. This is a problem because it causes the model to have high variance, i.e. the model coefficients change erratically when there are small changes in the data, causing the model to be unstable.

The solution is to drop one of the multicollinear variables. We can read more about multicollinearity in Wikipedia, including a few other solutions: link.
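To decide which variable to drop, one widely used measure is the variance inflation factor (VIF): regress each independent variable on all the others and compute 1 / (1 - R squared). The sketch below does this with NumPy on made-up data. Treating a VIF above roughly 5 or 10 as a warning sign is a common rule of thumb, not something from the sources linked above.

```python
import numpy as np

def vif(X):
    """Variance inflation factor of each column of X."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = []
    for j in range(p):
        # Regress column j on an intercept plus all the other columns.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
        resid = X[:, j] - others @ coef
        r2 = 1.0 - resid.var() / X[:, j].var()
        out.append(float(1.0 / (1.0 - r2)))
    return out

rng = np.random.default_rng(9)
x1 = rng.normal(size=500)
x2 = x1 + rng.normal(scale=0.1, size=500)  # nearly a copy of x1
x3 = rng.normal(size=500)                  # unrelated to the others

vifs = vif(np.column_stack([x1, x2, x3]))
print(vifs)
```

Here x1 and x2 get a very high VIF, so dropping one of them is the natural fix, while x3 stays close to 1.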

Overfitting happens when the model is too flexible for the data, for example when we use high degree polynomial regression. We detect overfitting by comparing the accuracy (for regression, a score such as R squared) on the training and test data sets. As a rule of thumb, if the accuracy on the training data set is very high (>90%) and the accuracy on the test data set is much lower (a difference of 10% or more) then the model is overfitting (see: link).
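The train versus test comparison can be sketched like this. I use mean squared error instead of an accuracy percentage (my choice, lower is better), and the data, the degrees 1 and 9, and the sample sizes are made up.

```python
import numpy as np

def mse(coeffs, x, y):
    """Mean squared error of a fitted polynomial on the given data."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

rng = np.random.default_rng(10)

def sample(x):
    # Hypothetical truth: a straight line plus noise.
    return 1.0 + 2.0 * x + rng.normal(0.0, 0.3, size=x.size)

x_train = np.linspace(-1.0, 1.0, 12)
y_train = sample(x_train)
x_test = rng.uniform(-1.0, 1.0, size=200)
y_test = sample(x_test)

results = {}
for degree in (1, 9):
    coeffs = np.polyfit(x_train, y_train, deg=degree)
    results[degree] = (mse(coeffs, x_train, y_train), mse(coeffs, x_test, y_test))
    print(degree, results[degree])
```

The degree 9 model always fits the 12 training points at least as well as the straight line, but its test error should be clearly worse than its training error. That gap is the sign of overfitting.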

The solution is to use regularisation such as Lasso or Ridge (link), feature selection (link), or cross validation (link).
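To show what Ridge does, here is a small sketch using its closed-form solution on polynomial features. The helper `ridge_fit`, the penalty value of 1.0 and the data are my assumptions. Lasso has no closed form like this, so a library implementation is normally used for it.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Solve (X'X + lam * P) beta = X'y, where P leaves the intercept unpenalised."""
    penalty = np.eye(X.shape[1])
    penalty[0, 0] = 0.0
    return np.linalg.solve(X.T @ X + lam * penalty, X.T @ y)

rng = np.random.default_rng(11)
x = np.linspace(-1.0, 1.0, 12)
y = 1.0 + 2.0 * x + rng.normal(0.0, 0.3, size=x.size)

# Polynomial features with columns 1, x, x^2, ..., x^9.
X = np.vander(x, N=10, increasing=True)

beta_ols = ridge_fit(X, y, lam=0.0)    # no penalty: ordinary least squares
beta_ridge = ridge_fit(X, y, lam=1.0)  # penalised coefficients

print(np.linalg.norm(beta_ols[1:]), np.linalg.norm(beta_ridge[1:]))
```

The penalty shrinks the coefficients, which tames the wild swings of a high degree polynomial. How large the penalty should be is usually chosen with cross validation.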

Extrapolation is about using the linear, polynomial or non-linear regression model beyond the range of the training and test data. The considerations and real world examples are given in this Medium article by Dennish Ash: link.

The solution is to review the linear relationship between the independent variable and the predicted variable in the data range where we want to do extrapolation. We review using business sense (not using data), checking if the relationship is still linear outside the data range that we have.

One consideration is that the further the distance from the training and test data range, the riskier the extrapolation. For example, if in the training and test data the independent variable is between 20 and 140, predicting the output for 180 is riskier than predicting the output for 145.
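The 20 to 140 example can be illustrated with a small simulation. The saturating curve used as the "true" relationship is purely hypothetical. It is there to show what happens when the relationship stops being linear outside the observed range.

```python
import numpy as np

def truth(x):
    # Hypothetical relationship that flattens out for large x.
    return 100.0 * (1.0 - np.exp(-x / 60.0))

# Fit a straight line using only the observed range 20..140.
x_obs = np.linspace(20.0, 140.0, 100)
slope, intercept = np.polyfit(x_obs, truth(x_obs), deg=1)

def abs_error(x_new):
    """Absolute gap between the straight-line prediction and the true value."""
    return float(abs((slope * x_new + intercept) - truth(x_new)))

err_145 = abs_error(145.0)
err_180 = abs_error(180.0)
print(err_145, err_180)
```

The line keeps climbing while the true curve flattens, so the error at 180 is larger than the error at 145.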

__Note on plots in machine learning__

Machine learning is a science about data, and as such when making plots/graphs we must always make the meaning of each axis clear. And yet bizarrely during my 2 years in machine learning I encountered so many graphs with the axes not labelled! This irritates me so much. We must label the axes properly, because depending on what the axes are the graph could mean an entirely different thing.

For example: the graph below says heteroscedastic but has no label on either the y axis or the x axis. So how could we know what those data points are? Is it the independent variable against the dependent variable? It turns out that the x axis is the predicted variable and the y axis is the error term.
