I was building a classifier on the fraudulent credit card transaction data from Kaggle (link) and found that XGBoost gave the highest AUC of all the algorithms I tried (99.18%). It is a little tricky to tune though, so in this article I’d like to share my experience in tuning it.

**What is XGBoost?**

XGBoost stands for Extreme Gradient Boosting. So before you read about XGBoost, you first need to understand what Gradient Boosting is, and what Boosting is. Here are good introductions to this topic: link, link. The base algorithm for XGBoost is the Decision Tree. Many trees are then used together in a technique called Ensemble (for example Random Forest). So a complete journey to understanding XGBoost from the ground up is:

- Decision Tree
- Ensemble
- Stacking, Bagging, Boosting (link)
- Random Forest
- Gradient Boosting
- Extreme Gradient Boosting

**Higgs Boson**

The original paper by Tianqi Chen and Carlos Guestrin, who created XGBoost, is here: link.

XGBoost was used to solve the Higgs boson classification problem, again by Tianqi Chen, together with Tong He: link. The Higgs boson is the most recently discovered elementary particle. It was discovered in 2012 at the Large Hadron Collider at CERN. The particle was predicted by Peter Higgs in 1964.

**Reference**

A good reference for tuning an XGBoost model is the guide from Prashant Banerjee: link (search for “typical value”). Another good one is from Aarshay Jain: link (again, search for “typical value”). The guide from the developers is here: link, and the list of hyperparameters is here: link.

**Python Code**

Here’s the code in its entirety:

```
# Import required libraries
import numpy as np
import pandas as pd
from sklearn import preprocessing
# Load the data from Google drive
from google.colab import drive
drive.mount('/content/gdrive')
df = pd.read_csv('/content/gdrive/My Drive/Colab Notebooks/creditcard.csv')
# Drop time column as fraudulent transactions can happen at any time
df = df.drop("Time", axis = 1)
# Get the class variable and put into y and the rest into X
y = df["Class"]
X = df.drop("Class", axis = 1)
# Stratified split into train & test data
from sklearn import model_selection
X_train, X_test, y_train, y_test = model_selection.train_test_split( X, y, test_size = 0.2, stratify = y, random_state = 42 )
# Fix data skewness
from sklearn.preprocessing import PowerTransformer
pt = PowerTransformer(copy=False)
train_return = pt.fit_transform(X_train)
test_return = pt.fit_transform(X_test)
# Balance the train and test data using SMOTE
from imblearn.over_sampling import SMOTE
smote = SMOTE(random_state=42)  # lower-case name so the SMOTE class is not overwritten
X_smote_train, y_smote_train = smote.fit_resample(X_train, y_train)
X_smote_test, y_smote_test = smote.fit_resample(X_test, y_test)
# Sample training data for tuning models (use full training data for final run)
tuning_sample = 20000
idx = np.random.choice(len(X_smote_train), size=tuning_sample)
X_smote_tuning = X_smote_train.iloc[idx]
y_smote_tuning = y_smote_train.iloc[idx]
# Import libraries from Scikit Learn
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import roc_auc_score
from xgboost import XGBClassifier
# Create a function to calculate AUC (as a percentage) using predict_proba
def Get_AUC(Model, X, y):
    prob = Model.predict_proba(X)[:, 1]  # predicted probability of class 1 (fraud)
    return roc_auc_score(y, prob) * 100
# Perform grid search cross validation with different parameters
parameters = {'n_estimators':[90], 'max_depth':[6], 'learning_rate':[0.2],
              'subsample':[0.5], 'colsample_bytree':[0.3], 'min_child_weight': [1],
              'gamma':[0], 'alpha':[0.001], 'reg_lambda':[0.001]}
XGB = XGBClassifier()
CV = GridSearchCV(XGB, parameters, cv=3, scoring='roc_auc', n_jobs=-1)
# Hyperparameter tuning to find the best parameters
CV.fit(X_smote_tuning, y_smote_tuning)
print("The best parameters are:", CV.best_params_)
# Output: The best parameters are: {'alpha': 0.001, 'colsample_bytree': 0.3, 'gamma': 0, 'learning_rate': 0.2, 'max_depth': 6, 'min_child_weight': 1, 'n_estimators': 90, 'reg_lambda': 0.001, 'subsample': 0.5}
# Fit the model with the best parameters and get the AUC
XGB = XGBClassifier(n_estimators = CV.best_params_["n_estimators"], max_depth = CV.best_params_["max_depth"],
                    learning_rate = CV.best_params_["learning_rate"], colsample_bytree = CV.best_params_["colsample_bytree"],
                    subsample = CV.best_params_["subsample"], min_child_weight = CV.best_params_["min_child_weight"],
                    gamma = CV.best_params_["gamma"], alpha = CV.best_params_["alpha"],
                    reg_lambda = CV.best_params_["reg_lambda"])
Model = XGB.fit(X_smote_train, y_smote_train)
AUC = Get_AUC(Model, X_smote_test, y_smote_test)
print("AUC =", '{:.2f}%'.format(AUC))
# Output: AUC = 99.18%
```
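The code above scores every model by AUC via `roc_auc_score`. As a reminder of what that number measures, here is a minimal pure-Python sketch. The helper name `pairwise_auc` and the toy labels and scores are made up for illustration. AUC is the fraction of (fraud, non-fraud) pairs in which the fraud gets the higher predicted probability, with ties counting as half.

```python
def pairwise_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    # A pair counts 1 if the positive scores higher, 0.5 on a tie, 0 otherwise
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


# Toy data, for illustration only
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = pairwise_auc(toy_labels, toy_scores)
print("AUC = {:.2f}%".format(toy_auc * 100))
```

This version compares every pair, so it is quadratic in the number of rows. It is only meant for intuition, not as a replacement for `roc_auc_score`.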

**Tuning Process**

So here is the tuning process that I followed for the XGBoost model, for the above data, using the above code.

**Step 1. Broad ranges on the top 3 parameters**

First, I read the expected values for the parameters from the guides (see the Reference section above).

Note: In this article when I say parameters I mean hyperparameters.

Then, using the Grid Search cross validation I set the parameters in very broad ranges as follows:

- n_estimators: 10, 100, 500
- max_depth: 3, 10, 30
- learning_rate: 0.01, 0.1, 1

I used only 20k rows out of the 284,807 transactions so that the cross validation took minutes rather than hours. I tried 10k, 20k and 50k samples and found that the 10k results didn’t represent the whole training data and that 50k and above were very slow, but 20k was fast enough and still representative.

I would recommend trying only 3 values for each parameter, and only the 3 parameters above, to begin with. This way it takes about 10 minutes. These 3 parameters are the most influential ones, so we need to nail them down first. They are mentioned in the guides in the Reference section above.
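To see why three values for three parameters is a manageable starting point, here is a small sketch that counts the model fits involved. It only enumerates the combinations and does not train anything. `cv_folds = 3` matches the `cv=3` used in the code above, and the five-values-per-parameter grid at the end is a hypothetical comparison, not something I ran.

```python
from itertools import product

# The broad grid from Step 1
broad_grid = {
    "n_estimators": [10, 100, 500],
    "max_depth": [3, 10, 30],
    "learning_rate": [0.01, 0.1, 1],
}
cv_folds = 3  # same as cv=3 in the grid search above

# Every combination of one value per parameter
combinations = [
    dict(zip(broad_grid, values)) for values in product(*broad_grid.values())
]
total_fits = len(combinations) * cv_folds

# Hypothetical comparison: five values for each of the three parameters
fits_with_five_values = 5 ** 3 * cv_folds

print(len(combinations), "combinations,", total_fits, "fits")
print("With 5 values per parameter:", fits_with_five_values, "fits")
```

The number of fits grows multiplicatively with every extra value or parameter, which is why it pays to keep each grid small.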

**Step 2. Narrow down the top 3 parameters**

I then narrowed down the range of these 3 parameters. For example, for n_estimators, out of 10, 100, 500 the grid search showed that the best value was 100. So I changed the grid to 80, 100, 120. I still got 100 as the best value, so I did a grid search with 90, 100, 110 and got 90. Finally I did the grid search with 85, 90, 95 and it still gave 90 as the best n_estimators, so that was my final value for this parameter.

But I understood that there is interaction between the parameters, so when tuning n_estimators I kept the max_depth of 3, 10, 30 and the learning_rate of 0.01, 0.1, 1 in the grid. Once n_estimators was settled at 90, I started narrowing down max_depth (for which the best value had been 10) to 7, 10, 14. The result was 7, so I narrowed it down to 6, 7, 8. The result was 6 and that was the final value for max_depth.

For the learning_rate I started with 0.01, 0.1, 1 and the best was 0.1. Then I tried 0.05, 0.1, 0.2 and the best was 0.2. Then I tried 0.15, 0.2, 0.25 and the best was still 0.2, so that was the final value for the learning_rate.

So the top 3 parameters are: n_estimators = 90, max_depth = 6, learning_rate = 0.2. The max_depth = 6 is the same as the default value, so I could have left this parameter out if I wanted to.
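The narrowing of n_estimators described above can be sketched as a small loop. This is only an illustration of the idea: `narrowed_grid`, `tune_one_parameter` and `toy_score` are hypothetical names, and `toy_score` is a made-up stand-in for the cross-validated AUC that the real grid search returns. The step sizes 20, 10 and 5 are the ones I used.

```python
def narrowed_grid(best, step):
    """Three candidate values centred on the current best value."""
    return [best - step, best, best + step]


def tune_one_parameter(score, first_grid, steps):
    """Pick the best value from a broad grid, then re-centre with smaller steps."""
    best = max(first_grid, key=score)
    for step in steps:
        best = max(narrowed_grid(best, step), key=score)
    return best


def toy_score(n_estimators):
    # Made-up score with a single peak; in practice this would be
    # the cross-validated AUC for that value of n_estimators
    return -(n_estimators - 92) ** 2


best_n = tune_one_parameter(toy_score, [10, 100, 500], steps=[20, 10, 5])
print("Best n_estimators:", best_n)
```

In the real process each call to `score` is a full grid search that also includes the other parameters, so this loop captures only the re-centring logic, not the interaction between parameters.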

**Note:**

Note that I didn’t put all the possible ranges/values for all 3 parameters into a grid search CV and let it run for the whole night. It was all manual and I nailed down the parameters one by one, which only took about an hour. Manual is a lot quicker because from the previous run I knew the optimum range of the parameters, so I could narrow it down further. It’s a very controlled and targeted process, which is why it’s quick.

Also note that I used only 20k rows for tuning, but for getting the AUC I fitted the model on the full training data and predicted on the full test data.

**Step 3. The next 3 parameters**

With the top 3 parameters fixed, I tried the next 3 parameters as follows:

- colsample_bytree: 0.1, 0.5, 0.9
- subsample: 0.1, 0.5, 0.9
- min_child_weight: 1, 5, 10

I picked these 3 parameters based on the guidelines given by the XGBoost developers and the blog posts in the Reference section above.

The results are as follows. The optimum parameters are colsample_bytree = 0.3, subsample = 0.5, min_child_weight = 1. This gives an AUC of 98.69%.

For the explanation about what these parameters are, please refer to the XGBoost documentation here: link.

It is possible that the AUC is lower than the AUC from the previous step. In this case I tried the values for that parameter manually using the full training data. For example, with the tuning data (20k) the best min_child_weight was 0, but this gave an AUC of 98.09%, which was lower than the AUC before using min_child_weight (98.69%). So I tried the values 0, 1 and 2 for min_child_weight using the full training data. In other words, the tuning data (20k) is good for narrowing down from a broad range to a narrow range, but once the range is narrow we might need to use the full training data. To do this I replaced the “XGB = …” in the last cell with this:

```
XGB = XGBClassifier(n_estimators = 90, max_depth = 6, learning_rate = 0.2,
                    colsample_bytree = 0.3, subsample = 0.5, min_child_weight = 1)
```

**Step 4. Three regularisation parameters**

Reading from the guides from the Reference section above, it seems that the next 3 most important parameters are gamma, alpha and lambda. They are the regularisation parameters and their value ranges are in the XGBoost documentation (link).

- gamma: initial values 0, 1, 10. Optimum value: 0.
- alpha: initial values 0, 10, 1000. Optimum value after narrowing: 0.001.
- reg_lambda: initial values 0.1, 0.5, 0.9. Optimum value after narrowing: 0.001.

After tuning these 3 parameters, the AUC increased to 99.18%.

I confirmed the result by replacing the XGB = … in the last cell with this:

```
XGB = XGBClassifier(n_estimators = 90, max_depth = 6, learning_rate = 0.2,
                    colsample_bytree = 0.3, subsample = 0.5, min_child_weight = 1,
                    gamma = 0, alpha = 0.001, reg_lambda = 0.001)
```

**Note on the imbalanced data**

XGBoost has 2 parameters to deal with imbalanced data: scale_pos_weight and max_delta_step. You can read how to use them in the XGBoost documentation: link.

I did use them, trying scale_pos_weight values of 1, 10 and 100. The optimum value was 10, but it only gave an AUC of 96.83%.
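As a starting point for scale_pos_weight, the XGBoost documentation suggests the ratio of negative to positive instances as a typical value to consider. Here is a minimal sketch of that calculation, using made-up labels rather than the real data. The function name `suggested_scale_pos_weight` is hypothetical.

```python
def suggested_scale_pos_weight(labels):
    """Ratio of negative (0) to positive (1) instances."""
    negatives = sum(1 for y in labels if y == 0)
    positives = sum(1 for y in labels if y == 1)
    return negatives / positives


# Toy labels, for illustration only: 990 normal transactions and 10 frauds
toy_labels = [0] * 990 + [1] * 10
ratio = suggested_scale_pos_weight(toy_labels)
print("Suggested scale_pos_weight:", ratio)
```

The suggested ratio is only a reference point. The values I tried were the grid of 1, 10 and 100 mentioned above.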

So I tried different approaches for handling imbalanced data, i.e. random oversampling, SMOTE and ADASYN. SMOTE gave the best result, i.e. the AUC of 99.18% above.
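To give an idea of what SMOTE does, here is a simplified pure-Python sketch of its core step: pick a minority-class point, pick one of its nearest minority-class neighbours, and create a synthetic point somewhere on the line between them. This is an illustration under my own simplifications, not the imblearn implementation. The function name `smote_sketch`, the value `k=2` and the toy points are assumptions.

```python
import math
import random


def smote_sketch(minority, n_new, k=2, seed=42):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority-class neighbours of the base point (excluding itself)
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        u = rng.random()  # position along the segment from base to neighbour
        synthetic.append(tuple(b + u * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


# Toy minority-class points with 2 features, for illustration only
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_sketch(minority, n_new=6)
print(len(new_points), "synthetic points created")
```

Because every synthetic point lies between two existing minority points, the new samples stay inside the region already occupied by the minority class, unlike random oversampling, which only duplicates existing rows.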

**Note on the data skewness**

The credit card fraud data is skewed, meaning it is not normally distributed: the distribution is asymmetric around the mean. This is particularly so with the Amount feature, which is distributed very differently to the left of the mean and to the right of the mean. A few other features such as V3 are also like that.

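I used PowerTransformer to fix the data skewness, as you can see in the above code. I split the data first, and then fixed the skewness separately on the training data and the test data. I preferred this to fixing the skewness before the split, because after a transformation fitted on the whole data set, the individual splits could still end up skewed. A more common convention is to fit the transformer on the training data only and then apply that same fitted transformation to the test data, so that the test data does not influence the preprocessing.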

**Note on the stratified sampling**

Because the data is very imbalanced, I used stratified sampling so that the ratio between the 2 classes is kept the same in the training data and the test data. I used an 80-20% split rather than a 70-30% split to give the model more data to learn from, and because 20% (one fifth) is still a large enough set of unseen data to test the trained model against.
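The idea behind a stratified split can be sketched from scratch in a few lines: split each class separately with the same test fraction, then combine the pieces. This is a simplified illustration, not how `train_test_split` is implemented internally. The function name `stratified_split` and the toy labels are made up.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx) with each class split in the same proportion."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)  # same fraction for every class
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels, for illustration only: 1% of the rows are frauds
toy_labels = [0] * 990 + [1] * 10
train_idx, test_idx = stratified_split(toy_labels)

frauds_in_train = sum(toy_labels[i] for i in train_idx)
frauds_in_test = sum(toy_labels[i] for i in test_idx)
print(len(train_idx), "train rows with", frauds_in_train, "frauds")
print(len(test_idx), "test rows with", frauds_in_test, "frauds")
```

With a plain random split of such imbalanced data, the test set could by chance receive very few frauds or none at all. Splitting per class rules that out.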

I don’t believe 10% test data makes for a fair test. In my opinion 20% is the minimum and we should not go lower than that, not even 15%. I checked this on Kaggle: most notebooks there use test data of 20%, 25% or 30%, and I didn’t see anyone using less than 20% or more than 30%.

**Note on deleting the time column**

The time column is not the time of day as in 8am or 9pm. It is the number of seconds elapsed between this transaction and the first transaction in the dataset (link). The distribution of the time column on class 0 and class 1 shows that the frauds can happen at any time:

And there is no correlation between time and class:

And by the way, the credit card transaction data covers only 2 days. So there is not enough time to form a pattern for the time of day.

So those are my reasons for deleting the time column.

But what Nimrod said on LinkedIn made me try again. He said: “Great read Vincent. I wonder though have you checked the time column before dropping it? I get that fraud can happen at any time, but perhaps some times of the day are more densely packed with fraudulent transaction?” (link)

So I downloaded the time column and the class column into Excel. I divided the time column by 3600 x 24 (the number of seconds in an hour times the number of hours in a day) to get it in “day units”. This “day unit” ranges from 0 to 1.9716 because there are only 2 days’ worth of transactions.

I then took the decimal part of the day unit, which is when the fraud happened during the day (a value between 0 and 1). Multiplying it by 24 gives the hour in the day. And it looks like this when I graph the number of frauds against the hour in the day:

Note that in the above chart 8 does not mean 8am and 21 does not mean 9pm. 8 means 8 hours after the time of day of the first transaction, and 21 means 21 hours after it. But we can see clearly that fraud is high in the 2nd hour and the 11th hour. We need to remember though that the data covers only 2 days of transactions. Still, it clearly shows that some times of the day are more densely packed with fraudulent transactions, just as Nimrod says (link). So I shouldn’t actually delete the time column, but convert it to the time of day.
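The Excel steps above (seconds to day units, decimal part, hour in the day, count of frauds per hour) can be sketched in Python as follows. The function name `hour_of_day` and the toy rows are made up for illustration. The function uses a modulo on the seconds, which is equivalent to dividing by 3600 x 24, taking the decimal part and multiplying by 24, but avoids floating-point rounding.

```python
from collections import Counter

SECONDS_PER_HOUR = 3600
SECONDS_PER_DAY = SECONDS_PER_HOUR * 24


def hour_of_day(seconds_elapsed):
    """Hours since the time of day of the first transaction, in [0, 24)."""
    seconds_into_day = seconds_elapsed % SECONDS_PER_DAY  # drop the whole days
    return seconds_into_day / SECONDS_PER_HOUR


# Made-up (time, class) rows, for illustration only
toy_rows = [(0, 0), (7200, 1), (40000, 0), (93600, 1), (126000, 1)]

# Number of frauds (class 1) in each whole hour of the day
frauds_per_hour = Counter(
    int(hour_of_day(t)) for t, label in toy_rows if label == 1
)
for hour in sorted(frauds_per_hour):
    print("hour", hour, "->", frauds_per_hour[hour], "frauds")
```

In a model, the converted hour could replace the original time column as a feature instead of dropping the column altogether.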
