Linear regression is about predicting a numerical variable. There are 5 steps when we do it in Python:

- Prepare the data
  - Load and understand the data
  - Fix data quality issues
  - Remove unneeded columns
  - Visualise and analyse the data
  - Identify highly correlated columns and remove them
  - Create derived variables
  - Create dummy variables for categorical variables

- Build the model
  - Split the data into training data and test data
  - Scale the numerical variables in the training data
  - Split the data into y and X
  - Automatically choose the top 15 features using RFE (Recursive Feature Elimination)
  - Manually drop features based on p-value and VIF (Variance Inflation Factor)
  - Rebuild the model using OLS (Ordinary Least Squares)
  - Repeat the last 2 steps until all variables have p-value < 0.05 and VIF < 5

- Check the distribution of the error terms

- Make predictions
  - Scale the numerical variables in the test data
  - Remove the dropped features from the test data
  - Make predictions based on the test data

- Model evaluation
  - Plot the predicted vs actual values
  - Calculate R², Adjusted R² and the F-statistic
  - Create the linear equation for the best-fitted line
  - Identify the top predictors

Below is the Python code for the above steps. I will skip step 1 (preparing the data) and go straight to step 2, because step 1 is common to all ML models (not just linear regression), so I will cover it in a separate article.

**2. Build the model**

```
# Split the data into training data and test data
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
np.random.seed(0)
df_train, df_test = train_test_split(df_data, train_size=0.7, test_size=0.3, random_state=100)

# Scale the numerical variables in the training data
from sklearn.preprocessing import MinMaxScaler
minmax_scaler = MinMaxScaler()
continuous_columns = ["column1", "column2", "column3", "column4"]
df_train[continuous_columns] = minmax_scaler.fit_transform(df_train[continuous_columns])

# Split the data into y and X
y_train = df_train.pop("count")
x_train = df_train

# Automatically choose the top 15 features using RFE
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
data_LR = LinearRegression()
data_LR.fit(x_train, y_train)
data_RFE = RFE(data_LR, n_features_to_select=15)
data_RFE = data_RFE.fit(x_train, y_train)

# Check which columns are selected by RFE and which are not
list(zip(x_train.columns, data_RFE.support_, data_RFE.ranking_))
selected_columns = x_train.columns[data_RFE.support_]
unselected_columns = x_train.columns[~data_RFE.support_]

# Train the model based on the columns selected by RFE
# and check the coefficients, R-squared, F-statistic and p-values
x_train = x_train[selected_columns]
import statsmodels.api as data_stat_model
x_train = data_stat_model.add_constant(x_train)
data_OLS_result = data_stat_model.OLS(y_train, x_train).fit()
data_OLS_result.params.sort_values(ascending=False)
print(data_OLS_result.summary())

# Calculate the VIF (Variance Inflation Factor)
from statsmodels.stats.outliers_influence import variance_inflation_factor
data_VIF = pd.DataFrame()
data_VIF['variable'] = x_train.columns
number_of_variables = x_train.shape[1]
data_VIF['VIF'] = [variance_inflation_factor(x_train.values, i) for i in range(number_of_variables)]
data_VIF.sort_values(by="VIF", ascending=False)

# Drop one column and rebuild the model (the constant is already in x_train),
# then check the coefficients, R-squared, F-statistic and p-values again
x_train.drop(columns=["column5"], inplace=True)
data_OLS_result = data_stat_model.OLS(y_train, x_train).fit()
print(data_OLS_result.summary())

# Check the VIF again
data_VIF = pd.DataFrame()
data_VIF['variable'] = x_train.columns
number_of_variables = x_train.shape[1]
data_VIF['VIF'] = [variance_inflation_factor(x_train.values, i) for i in range(number_of_variables)]
data_VIF.sort_values(by="VIF", ascending=False)
```

Keep dropping one column at a time and rebuilding the model until all variables have p-value < 0.05 and VIF < 5.

The result from `print(data_OLS_result.summary())` is something like this, where we can see the R² and Adjusted R² on the training data:

```
                            OLS Regression Results
==============================================================================
Dep. Variable:                  count   R-squared:                       0.841
Model:                            OLS   Adj. R-squared:                  0.838
Method:                 Least Squares   F-statistic:                     219.8
Date:                Tue, 29 Dec 2020   Prob (F-statistic):          6.03e-190
Time:                        09:07:14   Log-Likelihood:                 508.17
No. Observations:                 510   AIC:                            -990.3
Df Residuals:                     497   BIC:                            -935.3
Df Model:                          12
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.2444      0.028      8.658      0.000       0.189       0.300
column1        0.2289      0.008     28.108      0.000       0.213       0.245
column2        0.1258      0.011     10.986      0.000       0.103       0.148
column3        0.5717      0.025     22.422      0.000       0.522       0.622
column4       -0.1764      0.038     -4.672      0.000      -0.251      -0.102
column5       -0.1945      0.026     -7.541      0.000      -0.245      -0.144
column6       -0.2362      0.026     -8.946      0.000      -0.288      -0.184
```

**3. Check the distribution of the error terms**

In linear regression we assume that the error terms follow a normal distribution, so we have to check this assumption before using the model to make predictions. We check it visually with a histogram of the error terms, making sure they are distributed around zero and that the left and right sides are broadly symmetric.

```
import matplotlib.pyplot as plt
import seaborn as sns

fig = plt.figure()
y_predicted = data_OLS_result.predict(x_train)
# sns.distplot is deprecated in recent seaborn versions; histplot is its replacement
sns.histplot(y_train - y_predicted, bins=20, kde=True)
fig.suptitle('Error Terms', fontsize=16)
plt.show()
```
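The visual check can be complemented with a quantitative one. Below is a sketch using SciPy's Shapiro-Wilk normality test on stand-in residuals (the synthetic array is an assumption for illustration; in the pipeline it would be `y_train - y_predicted`):

```python
import numpy as np
from scipy import stats

# Stand-in residuals for illustration (in the pipeline: y_train - y_predicted)
rng = np.random.default_rng(42)
residuals = rng.normal(loc=0.0, scale=0.1, size=500)

# The mean of the residuals should be close to zero
print(round(float(residuals.mean()), 4))

# Shapiro-Wilk test: a large p-value means normality is not rejected
stat, p_value = stats.shapiro(residuals)
print(p_value > 0.05)
```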

**4. Making predictions**

```
# Scale the numerical variables in the test data (just transform; no need to fit again)
df_test[continuous_columns] = minmax_scaler.transform(df_test[continuous_columns])
# Split the test data into X and y
y_test = df_test.pop('count')
x_test = df_test
# Remove the features dropped by RFE and manual process
x_test = x_test[selected_columns]
x_test = x_test.drop(["column5", "column6", "column7"], axis = 1)
# Add the constant to the test data (by default statsmodels fits a line through the origin)
x_test = data_stat_model.add_constant(x_test)
# Make predictions based on the test data
y_predicted = data_OLS_result.predict(x_test)
```

**5. Model Evaluation**

Now that we have built the model and used it to make predictions, we need to evaluate its performance, i.e. how close the predictions are to the actual values.

```
# Compare the actual and predicted values
fig = plt.figure()
plt.scatter(y_test, y_predicted)
fig.suptitle('Compare actual (Y Test) vs Y predicted', fontsize = 16)
plt.xlabel('Y Test', fontsize = 14)
plt.ylabel('Y Predicted', fontsize = 14)
plt.show()
```

- We can see here that Y Predicted and Y Test have a linear relationship, which is what we expect.
- There are a few data points which deviate from the line, for example the one in the lower left corner.

We can now calculate the R² score on the test data like this:

```
from sklearn.metrics import r2_score
r2_score(y_test, y_predicted)
```

We can also calculate the Adjusted R², which penalises R² for the number of predictors in the model.
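scikit-learn does not provide Adjusted R² directly, so we compute it from R² with the standard formula, Adjusted R² = 1 − (1 − R²)(n − 1)/(n − p − 1). The arrays below are placeholders standing in for the pipeline's `y_test` and `y_predicted`, and `p = 3` is an assumed predictor count:

```python
import numpy as np
from sklearn.metrics import r2_score

# Placeholder arrays standing in for y_test and y_predicted from step 4
y_test = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y_predicted = np.array([1.1, 1.9, 3.2, 3.8, 5.1, 5.9])

r2 = r2_score(y_test, y_predicted)
n = len(y_test)   # number of observations
p = 3             # number of predictors in the final model (assumed here)
adjusted_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(round(adjusted_r2, 4))
```

Adjusted R² is always at most R², and the gap grows as you add predictors that do not improve the fit.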

Based on the coefficient values from the OLS regression result, we construct the linear equation for the best-fitted line, starting from the top predictors:

```
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.2444      0.028      8.658      0.000       0.189       0.300
column1        0.2289      0.008     28.108      0.000       0.213       0.245
column2        0.1258      0.011     10.986      0.000       0.103       0.148
column3        0.5717      0.025     22.422      0.000       0.522       0.622
column4       -0.1764      0.038     -4.672      0.000      -0.251      -0.102
column5       -0.1945      0.026     -7.541      0.000      -0.245      -0.144
column6       -0.2362      0.026     -8.946      0.000      -0.288      -0.184
```

**y = 0.2444 + 0.5717 × column3 − 0.2362 × column6 + 0.2289 × column1 − 0.1945 × column5 − 0.1764 × column4 + …**

Based on the absolute values of the coefficients we can see that the top 3 predictors are column3, column6 and column1. In any machine learning project it is very useful to know the top predictors, i.e. the most influential features, because we can then take business action to maximise those features (or minimise them if the coefficient is negative).
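Rather than reading the summary table by eye, the top predictors can be pulled straight from the fitted model's coefficients. This sketch uses a hand-built pandas Series standing in for `data_OLS_result.params`:

```python
import pandas as pd

# Stand-in for data_OLS_result.params, using the coefficients shown above
params = pd.Series({
    "const": 0.2444, "column1": 0.2289, "column2": 0.1258,
    "column3": 0.5717, "column4": -0.1764, "column5": -0.1945,
    "column6": -0.2362,
})

# Rank predictors by absolute coefficient, excluding the intercept
top_predictors = params.drop("const").abs().sort_values(ascending=False)
print(top_predictors.head(3).index.tolist())  # → ['column3', 'column6', 'column1']
```

Note that ranking by coefficient size is only meaningful here because the numerical variables were MinMax-scaled to a common range in step 2.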

Now that we have the linear regression equation, we can use it to predict the target variable for any given input values.
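For example, plugging a hypothetical input row into the equation above (a sketch; the input values are made up, and each must be on the same MinMax scale as the training data):

```python
# Coefficients from the fitted model above (column2 omitted, as in the equation)
intercept = 0.2444
coefs = {"column1": 0.2289, "column3": 0.5717, "column4": -0.1764,
         "column5": -0.1945, "column6": -0.2362}

# A hypothetical scaled input row (values in [0, 1] after MinMax scaling)
row = {"column1": 0.5, "column3": 0.8, "column4": 0.2, "column5": 0.1, "column6": 0.0}

prediction = intercept + sum(coefs[c] * row[c] for c in coefs)
print(round(prediction, 4))  # → 0.7615
```

Remember that this prediction is on the scaled axis; to report it in original units you would invert the scaling applied to the target, if any.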
