# Data Warehousing and Data Science

## 15 April 2021

### Linear Regression in Python

Filed under: Data Science,Machine Learning — Vincent Rainardi @ 6:53 am

Linear Regression is about predicting a numerical variable. There are 5 steps when we do it in Python:

1. Prepare the data
• Load and understand the data
• Fix data quality issues
• Remove non required columns
• Visualise and analyse the data
• Identify highly correlated columns and remove them
• Create derived variables
• Create dummy variables for categorical variables
2. Build the model
• Split the data into training data and test data
• Scale the numerical variables in the training data
• Split the data into y and X
• Automatically choose top 15 features using RFE (Recursive Feature Elimination)
• Manually drop features based on P-value and VIF (Variance Inflation Factor)
• Rebuild the model using OLS (Ordinary Least Squares)
• Repeat the last 2 steps until all variables have P-value < 0.05 and VIF < 5
3. Check the distribution of the error terms
4. Make predictions
• Scale the numberical variables in the test data
• Remove the dropped features in the test data
• Make predictions based on the test data
5. Model evaluation
• Plot the predicted vs actual values
• Calculate R2, Adjusted R2 and F statistics
• Create the linear equation for the best fitted line
• Identify top predictors

Below is the Python code for the above steps. I will skip step 1 (preparing the data) and directly go to step 2 because step 1 is common to all ML models (not just linear regression) so I will write it in a separate article.

2. Build the model

```# Split the data into training data and test data
from sklearn.model_selection import train_test_split
np.random.seed(0)
df_train, df_test = train_test_split(df_data, train_size = 0.7, test_size = 0.3, random_state = 100)

# Scale the numerical variables in the training data
from sklearn.preprocessing import MinMaxScaler
minmax_scaler = MinMaxScaler()
continuous_columns = ["column1", " column2", " column3", " column4"]
df_train[continuous_columns] = minmax_scaler.fit_transform(df_train[continuous_columns])

# Split the data into y and X
y_train = df_train.pop("count")
x_train = df_train

# Automatically choose top 15 features using RFE
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
data_LR = LinearRegression()
data_LR.fit(x_train, y_train)
data_RFE = RFE(data_LR, 15)
data_RFE = data_RFE.fit(x_train, y_train)

# Check which columns are selected by RFE and which are not
list(zip(x_train.columns,data_RFE.support_,data_RFE.ranking_))
selected_columns = x_train.columns[data_RFE.support_]
unselected_columns = x_train.columns[~data_RFE.support_]

# Train the model based on the columns selected by RFE
# and check the coefficients, R2, F statistics and P values
x_train = x_train[selected_columns]
import statsmodels.api as data_stat_model
data_OLS_result = data_stat_model.OLS(y_train, x_train).fit()
data_OLS_result.params.sort_values(ascending=False)
print(data_OLS_result.summary())

# Calculate the VIF (Variance Importance Factor)
from statsmodels.stats.outliers_influence import variance_inflation_factor
data_VIF = pd.DataFrame()
data_VIF['variable'] = x_train.columns
number_of_variables = x_train.shape[1]
data_VIF['VIF'] = [variance_inflation_factor(x_train.values, i) for i in range(number_of_variables)]
data_VIF.sort_values(by="VIF", ascending=False)

# Drop one column and rebuild the model
# And check the coefficients, R-squared, F statistics and P values
x_train.drop(columns=["column5"], axis=1, inplace=True)
data_OLS_result = data_stat_model.OLS(y_train, x_train).fit()
print(data_OLS_result.summary())

# Check the VIF again
data_VIF = pd.DataFrame()
data_VIF['variable'] = x_train.columns
number_of_variables = x_train.shape[1]
data_VIF['VIF'] = [variance_inflation_factor(x_train.values, i) for i in range(number_of_variables)]
data_VIF.sort_values(by="VIF", ascending=False)
```

Keep dropping one column at a time and rebuild the model until all variables have P value < 0.05 and VIF < 5.

The result from print(data_OLS_result.summary()) is something like this, where we can see the R2 and Adjusted R2 of the training data:

```                            OLS Regression Results
==============================================================================
Dep. Variable:                  count   R-squared:                       0.841
Method:                 Least Squares   F-statistic:                     219.8
Date:                Tue, 29 Dec 2020   Prob (F-statistic):          6.03e-190
Time:                        09:07:14   Log-Likelihood:                 508.17
No. Observations:                 510   AIC:                            -990.3
Df Residuals:                     497   BIC:                            -935.3
Df Model:                          12
Covariance Type:            nonrobust
==============================================================================
coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.2444      0.028      8.658      0.000       0.189       0.300
column1        0.2289      0.008     28.108      0.000       0.213       0.245
column2        0.1258      0.011     10.986      0.000       0.103       0.148
column3        0.5717      0.025     22.422      0.000       0.522       0.622
column4       -0.1764      0.038     -4.672      0.000      -0.251      -0.102
column5       -0.1945      0.026     -7.541      0.000      -0.245      -0.144
column6       -0.2362      0.026     -8.946      0.000      -0.288      -0.184
```

3. Check the distribution of the error terms

In linear regression we assume that the error term follows normal distribution. So we have to check this assumption before we can use the model for making predictions. We check this by looking at the histogram of the error term visually, making sure that the error terms are normally distributed around zero and that the left and right side are broadly similar.

```fig = plt.figure()
y_predicted = data_OLS_result.predict(x_train)
sns.distplot((y_train - y_predicted), bins = 20)
fig.suptitle('Error Terms', fontsize = 16)
plt.show()
```

4. Making predictions

```# Scale the numberical variables in the test data (just transform, no need to fit)
df_test[continuous_columns] = minmax_scaler.transform(df_test[continuous_columns])

# Split the test data into X and y
y_test = df_test.pop('count')
x_test = df_test

# Remove the features dropped by RFE and manual process
x_test = x_test[selected_columns]
x_test = x_test.drop(["column5", "column6", "column7"], axis = 1)

# Add the constant variable to test data (because by default stats model line goes through the origin)

# Make predictions based on the test data
y_predicted = data_OLS_result.predict(x_test)
```

5. Model Evaluation

Now that we have built the model, and use the model to make prediction, we need to evaluate the performance of the model, i.e. how close the predictions are to the actual values.

```# Compare the actual and predicted values
fig = plt.figure()
plt.scatter(y_test, y_predicted)
fig.suptitle('Compare actual (Y Test) vs Y predicted', fontsize = 16)
plt.xlabel('Y Test', fontsize = 14)
plt.ylabel('Y Predicted', fontsize = 14)
plt.show()
```
• We can see here that the Y Predicted and the Y Test have linear relation, which is what we expect.
• There are a few data points which deviates from the line, for example the one on the lower left corner.

We can now calculate the R2 score on the test data like this:

```from sklearn.metrics import r2_score
r2_score(y_test, y_predicted)
```

We can also calculate the Adjusted R2 like this:

Based on the coefficient values from the OLS regression result we construct the linear equation for the best fitted line, starting from the top predictors like this:

```                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.2444      0.028      8.658      0.000       0.189       0.300
column1        0.2289      0.008     28.108      0.000       0.213       0.245
column2        0.1258      0.011     10.986      0.000       0.103       0.148
column3        0.5717      0.025     22.422      0.000       0.522       0.622
column4       -0.1764      0.038     -4.672      0.000      -0.251      -0.102
column5       -0.1945      0.026     -7.541      0.000      -0.245      -0.144
column6       -0.2362      0.026     -8.946      0.000      -0.288      -0.184
```

y = 0.2444 + 0.5717 column3 – 0.2362 column6 Â + 0.2289 column1 – 0.1945 column5 – 0.1764 column4 + …

Based on the absolute value of the coefficients we can see that the top 3 predictors are column3, column6 and column1. It is very useful in any machine learning project to know the top predictors, i.e. the most influencing features because then we can take business action to ensure that those features are maximised (or minimised if the coefficient is negative).

Now we have the linear regression equation we can use this equation for predict the target variable for any given input values.