Data Warehousing, BI and Data Science

16 April 2021

Logistic Regression with PCA in Python

Filed under: Data Warehousing — Vincent Rainardi @ 8:31 pm

Logistic Regression means predicting a catagorical variable, without losing too much information. For example, whether a client will invest or not. JavaTPoint provies a good, short overview on Logistic Regression: link. Jurafsky & Martin from Stanford provide a more detailed view, along with the mathematics: link. Wikipedia provides a comprehensive view, as always: link.

In this article I will be writing how to do Linear Regression in Python. I won’t be explaining what it is, but only how to do it in Python.

PCA means Principal Component Analysis. When we have a lot of variables, we can reduce them using PCA. Matt Berns provide a good overview and resources: link. Lindsay Smith from Otago provides a good academic overview: link. And as always, Wikipedia provies a comprehensive explanation: link.

I think it would be good to kill two birds with on stone. So in this article I will build 2 Logistic Regression models, one with PCA and one without PCA. This way it will provide examples for both cases.

One of the weaknesses of PCA is that we won’t know which variables are the top predictors. To know the top predictors we will have to build the Linear Regression model without PCA. As we don’t use PCA, to reduce the number of variables I will use RFE + manual (see here for an example on reducing variables using RFE + manual on Linear Regression). One of the advantages of PCA is that we don’t need to worry about multicollinearity in the data (highly correlated features). So on the second model where I don’t use PCA, I have to handle the multicollinearity, i.e. remove the highly correlated features using VIF (Variance Inflation Factor).

There are 5 steps:

  1. Data preparation
    • Load and understand the data
    • Fix data quality issues
    • Data conversion
    • Create derived variables
    • Visualise the data
    • Check highly correlated variables
    • Check outliers
    • Handle class imbalance (see here)
    • Scaling the data
  2. Model 1: Logistic Regression Model with PCA
    • Split the data into X and y
    • Split the data into training and test data set
    • Decide the number of PCA components based on the explained variance
    • Train the PCA model
    • Check the correlations between components
    • Apply PCA model to the test data
    • Train the Logistic Regression model
  3. Model evaluation for Model 1
    • Calculate the Area Under the Curve (AUC)
    • Calculate accuracy, sensitivity & specificity for different cut off points
    • Choose a cut off point
  4. Model 2: Logistic Regression Model without PCA
    • Drop highly correlated columns
    • Split the data into X and y
    • Train the Logistic Regression model
    • Reduce the variables using RFE
    • Remove one variable manually based on the P-value and VIF
    • Rebuild the model
    • Repeat the last 2 steps until P value < 0.05 and VIF < 5
  5. Model evaluation for Model 2
    • Calculate the Area Under the Curve (AUC)
    • Calculate accuracy, sensitivity & specificity for different cut off points and choose a cut off point
    • Identify the most important predictors

Step 1 is long and is not the core of this article so I will be skipping Step 1 and go directly into Step 2. Step 1 is common to various ML scenario so I will be writing it in a separate article and put the link in here so you can refer to it. One part in step 1 is about handling class imbalance, which I’ve written here: link.

Let’s start.

Step 2. Model 1: Logistic Regression Model with PCA

# Split the data into X and y
y = high_value_balanced.pop("target_variable")
X = high_value_balanced

# Split the data into training and test data set
from sklearn.model_selection import train_test_split 
X_train, X_test, y_train, y_test = train_test_split(X,y, train_size=0.7,test_size=0.3,random_state=42)

# Decide the number of PCA components based on the retained information
pca = PCA(random_state=88)
pca.fit(X_train)
explained_variance = np.cumsum(pca.explained_variance_ratio_)
plt.vlines(x=80, ymax=1, ymin=0, colors="r", linestyles="--")
plt.hlines(y=0.95, xmax=120, xmin=0, colors="g", linestyles="--")
plt.plot(explained_variance)

We can see above that to retain 95% explained variance (meaning we retain 95% of the information) we need to use 80 PCA components. So we build the PCA model with 80 components.

# Train the PCA model 
pca_final = IncrementalPCA(n_components=80)
df_train_pca = pca_final.fit_transform(X_train)

# Note that the above can be automated like this: (without using plot)
pca_final = PCA(0.95)
df_train_pca = pca_again.fit_transform(X_train)

# Check the correlations between components
corr_mat = np.corrcoef(df_train_pca.transpose())
plt.figure(figsize=[15,8])
sns.heatmap(corr_mat)
plt.show()

As we can see in the heatmap above, all of the correlations are near zero (black). This one of the key features of PCA, i.e. the transformed features are not correlated to one another, i.e. their vectors are orthogonal to each other.

# Apply PCA model to the test data
df_test_pca = pca_final.transform(X_test)

# Train the Logistic Regression model
LR_PCA_Learner = LogisticRegression()
LR_PCA_Model = LR_PCA_Learner.fit(df_train_pca, y_train)

Step 3. Model evaluation for Model 1

# Calculate the Area Under the Curve (AUC)
pred_test = LR_PCA_Model.predict_proba(df_test_pca)
"{:2.2}".format(metrics.roc_auc_score(y_test, pred_test[:,1]))

# Calculate the predicted probabilities and convert to dataframe
y_pred = LR_PCA_Model.predict_proba(df_test_pca)
y_pred_df = pd.DataFrame(y_pred)
y_pred_1 = y_pred_df.iloc[:,[1]]
y_test_df = pd.DataFrame(y_test)

# Put the index as ID column, remove index from both dataframes and combine them
y_test_df["ID"] = y_test_df.index
y_pred_1.reset_index(drop=True, inplace=True)
y_test_df.reset_index(drop=True, inplace=True)
y_pred_final = pd.concat([y_test_df,y_pred_1],axis=1)
y_pred_final = y_pred_final.rename(columns = { 1 : "Yes_Prob", "target_variable" : "Yes" } )
y_pred_final = y_pred_final.reindex(["ID", "Yes", "Yes_Prob"], axis=1)

# Create columns with different probability cutoffs 
numbers = [float(x)/10 for x in range(10)]
for i in numbers:
    y_pred_final[i]= y_pred_final.Yes_Prob.map(lambda x: 1 if x > i else 0)

# Calculate accuracy, sensitivity & specificity for different cut off points
Probability = pd.DataFrame( columns = ['Probability', 'Accuracy', 'Sensitivity', 'Specificity'])
for i in numbers:
    CM = metrics.confusion_matrix(y_pred_final.Yes, y_pred_final[i] )
    Total = sum(sum(CM))
    Accuracy    = (CM[0,0]+CM[1,1])/Total
    Sensitivity = CM[1,1]/(CM[1,1]+CM[1,0])
    Specificity = CM[0,0]/(CM[0,0]+CM[0,1])
    Probability.loc[i] =[ i, Accuracy, Sensitivity, Specificity]
Probability.plot.line(x='Probability', y=['Accuracy','Sensitivity','Specificity'])

Choose a cut off point

Different applications have different priorities when choosing the cut off point. For some applications, the true positives is more important and the true negative doesn’t matter. In this case we should use sensitivity as the evaluation criteria, and choose the cut off point to make the sensitivity as high as possible e.g. probability = 0.1 which gives sensitivity almost 100%.

For some applications, the true negative is more important and the true positive doesn’t matter. In these case we should use specificity as the evaluation criteria, and choose the cut off point to make the sensitivity as high as possible, e.g. probability = 0.9 which gives specificity almost 100%.

For most applications the true positive and the true negative are equally important. In this case we should use accuracy as the evaluation criteria, and choose the cut off point to make the accuracy as high as possible, e.g. probability = 0.5 which gives accuracy about 82%.

In most cases it is not the above 3 extreme, but somewhere in the middle, i.e. the true positive is very important but the true negative also matters, even though not as important as the true positive. In this case we should choose the cut off point to make sensitivity high but the specificity not too low. For example, probability = 0.3 which gives sensitivity about 90%, specificity about 65% and accuracy about 80%.

So let’s do the the last paragraph, cut off point = 0.3

y_pred_final['predicted'] = y_pred_final.Churn_Prob.map( lambda x: 1 if x > 0.3 else 0)
confusion_matrix = metrics.confusion_matrix( y_pred_final.Yes, y_pred_final.predicted )
Probability[Probability["Probability"]==0.3]

We get sensitivity = 91.8%, specificifity = 65.9%, accuracy = 78.9%

Step 4. Model 2: Logistic Regression Model without PCA

# Drop highly correlated columns
data_corr = pd.DataFrame(data.corr()["target_variable"])
data_corr = data_corr[data_corr["target_variable"] != 1]
churn_corr.sort_values(by=["abs_corr"], ascending=False).head(5)

# Split the data into X and y, and normalise the data 
y = data.pop("target_variable")
normalised_data = (data - data.mean())/data.std()
X = normalised_data

# Train the Logistic Regression model
X_train, X_test, y_train, y_test = train_test_split(X,y, train_size=0.7,test_size=0.3,random_state=88)
LR_model = LogisticRegression(max_iter = 200)
LR_model.fit(X_train, y_train)

# Reduce the variables using RFE
RFE_model = RFE(LR_model, n_features_to_select = 15)
RFE_model = RFE_model.fit(X_train, y_train)
selected_columns = X_train.columns[RFE_model.support_]

# Rebuild the model
X_train_RFE = X_train[selected_columns]
LR_model.fit(X_train_RFE, y_train)
LR2 = sm.GLM(y_train,(sm.add_constant(X_train_RFE)), family = sm.families.Binomial())
LR_model2 = LR2.fit()
LR_model2.summary()

# Check the VIF
from statsmodels.stats.outliers_influence import variance_inflation_factor
Model_VIF = pd.DataFrame()
Model_VIF["Variable"] = X_train_RFE.columns
number_of_variables = X_train_RFE.shape[1]
Model_VIF["VIF"] = [variance_inflation_factor(X_train_RFE.values, i) for i in range(number_of_variables)]
Model_VIF.sort_values(by="VIF", ascending=False)

# Remove one variable manually based on the P-value and VIF
X_train_RFE.drop(columns=["column8"], axis=1, inplace=True)
LR2 = sm.GLM(y_train,(sm.add_constant(X_train_RFE)), family = sm.families.Binomial())
LR_Model2 = LR2.fit()
LR_Model2.summary()

Repeat the last 2 steps until P value < 0.05 and VIF < 5.

Step 5. Model evaluation for Model 2

# Calculate the Area Under the Curve (AUC)
df_test = LR_Model2.transform(X_test)
pred_test = LR_Model2.predict_proba(df_test)
"{:2.2}".format(metrics.roc_auc_score(y_test, pred_test[:,1]))

Calculate accuracy, sensitivity & specificifity for different cut off points and choose a cut off point:

See “choose cut off point” section above

Identify the most important predictors:

From the model output above, i.e. “LR_Model2.summary()” we can see the most important predictors.

Leave a Comment »

No comments yet.

RSS feed for comments on this post. TrackBack URI

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

Blog at WordPress.com.

%d bloggers like this: