Data Warehousing, BI and Data Science

7 April 2021

Handling Class Imbalance

Filed under: Data Warehousing — Vincent Rainardi @ 7:17 am

In this article I will explain a few ways to treat class imbalance in machine learning. I will also give some examples in Python.

What is class imbalance?

Imagine you have a data set containing 2 classes: 100 of class A and 100 of class B. This is called a balanced data set. But if those 2 classes have 5000 and 100 members respectively, that is an imbalanced data set. This is not limited to 2 classes; it can also happen with more than 2. For example: class A and B both have 5000 members, whereas class C and D both have 100 members.

In an imbalanced data set, the class with fewer members is called the minority class. The class with many more members is called the majority class. So if class A has 5000 members and class B 100 members, class A is the majority class and class B is the minority class.

Note that the “class” here is the target variable, not an independent variable. So the target variable is a categorical variable, not a continuous variable. A case where the target variable has 2 classes like the above is called “binary classification”, and it is quite common in machine learning.

At what ratio is it called class imbalance?

There is no exact definition of the ratio. If class A is 20% the size of class B I would call it imbalanced, whereas if class A is 70% of class B I would call it balanced. Around 50% is a reasonable rule of thumb. It is wrong to dwell on finding the precise ratio range, because each data set and each ML algorithm is different. Some cases give bad results at 40%, some cases are fine at 40%.
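To get a feel for the ratio, here is a quick check using pandas (the data frame and column name below are made up for illustration):

```python
import pandas as pd

# A made-up data set: 5000 members of class A, 100 of class B
df = pd.DataFrame({"target": ["A"] * 5000 + ["B"] * 100})

# Count the members of each class
counts = df["target"].value_counts()

# Ratio of the minority class to the majority class
ratio = counts.min() / counts.max()
print(ratio)   # 0.02, i.e. class B is 2% of class A: clearly imbalanced
```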

Why class imbalance occurs

Some data is naturally imbalanced, because one class happens rarely in nature whereas the other happens frequently. For example: cancer, fraud, spam, accidents. The number of people with cancer is naturally much lower than the number without. The number of fraudulent credit card payments is naturally much lower than the number of genuine payments. The number of spam emails is much lower than the number of good emails. The number of flights having accidents is naturally much lower than the number of good flights.

Why class imbalance needs to be treated

Some machine learning algorithms don’t work well if the target variable is imbalanced, because during training the majority class is favoured. As a result the model would be skewed towards the majority class. This is an issue because in most cases what we are interested in is predicting the minority class. For example: predicting that a transaction is fraudulent, or that an email is spam, is more important than predicting the majority class.

That is the reason why class imbalance needs to be treated: the model would be skewed towards the majority class, when it is the minority class that we need to predict.
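As a small sketch of why this matters (made-up numbers, mirroring the 5000/100 example above), a model that always predicts the majority class scores about 98% accuracy while catching no frauds at all:

```python
import numpy as np

# Made-up target: 5000 genuine transactions (0) and 100 frauds (1)
y_true = np.array([0] * 5000 + [1] * 100)

# A model skewed towards the majority class predicts 0 for everything
y_pred = np.zeros_like(y_true)

accuracy = (y_pred == y_true).mean()              # fraction predicted correctly
fraud_recall = (y_pred[y_true == 1] == 1).mean()  # fraction of frauds caught

print(accuracy)      # about 0.98: looks impressive
print(fraud_recall)  # 0.0: completely useless for what we care about
```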

How to treat class imbalance

We resolve this situation by oversampling the minority class or by undersampling the majority class.

Oversampling the minority class means we randomly choose sample data from the minority class many times, whereas we leave the majority class as it is.

For example, if class A has 5000 members and class B has 100 members, we randomly pick data from class B, with replacement, until we have 5000 samples of class B. Effectively it is like duplicating the class B data 50 times.

Undersampling the majority class means that we randomly select data from the majority class, only as many samples as the minority class has. In the above example we randomly pick 100 samples from class A, so that both class A and class B have 100 members.

Apart from randomly selecting data there are many other techniques, including:

  • Creating new samples (called synthetic data)
  • Selecting samples not randomly but favouring samples which are misclassified
  • Selecting samples not randomly but favouring samples which resemble the other class

Jason Brownlee explained several other techniques such as SMOTE, Borderline Oversampling, CNN, ENN, OSS in this article: link.

Python examples

1. Random Oversampling

# Import pandas and the resample function from the Scikit Learn library
import pandas as pd
from sklearn.utils import resample

# Put the majority class and minority class in separate dataframes
majority_df = df[df["fraud"]==0]
minority_df = df[df["fraud"]==1] 

# Oversample the minority class randomly, with replacement,
# until it has as many rows as the majority class
new_minority_df = resample( minority_df, replace = True, 
                            n_samples = len(majority_df), 
                            random_state = 0 )

# Combine the new minority class with the majority class
balanced_df = pd.concat([majority_df, new_minority_df])

2. Synthetic Minority Oversampling Technique (SMOTE)

# Import SMOTE from the Imbalance Learn library
from imblearn.over_sampling import SMOTE

# Oversampling the minority class using SMOTE
s = SMOTE()
X_new, y_new = s.fit_resample(X, y)
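
3. Random Undersampling

This mirrors the first example but in the other direction; the data frame and the “fraud” column here are made up for illustration.

```python
import pandas as pd
from sklearn.utils import resample

# A made-up data set: 5000 genuine rows (fraud=0) and 100 fraud rows (fraud=1)
df = pd.DataFrame({"amount": range(5100),
                   "fraud":  [0] * 5000 + [1] * 100})

# Put the majority class and minority class in separate dataframes
majority_df = df[df["fraud"] == 0]
minority_df = df[df["fraud"] == 1]

# Undersample the majority class randomly, without replacement,
# down to the size of the minority class
new_majority_df = resample(majority_df, replace=False,
                           n_samples=len(minority_df),
                           random_state=0)

# Combine the reduced majority class with the minority class
balanced_df = pd.concat([new_majority_df, minority_df])
print(balanced_df["fraud"].value_counts())   # 100 of each class
```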

Jason Brownlee illustrates very well which part of the minority class gets oversampled by SMOTE in this article: link. Please notice how the minority class differs in the first 3 plots in his article. We can see clearly how SMOTE with random undersampling is better than SMOTE alone or random undersampling alone.

6 April 2021

Natural Language Processing (NLP)

Filed under: Data Warehousing — Vincent Rainardi @ 8:15 am

NLP is different from all other machine learning areas. Machine learning usually deals with mathematics, with numbers. It is about finding a pattern in the numbers and making a prediction. The roots of the analysis are mathematical, such as matrices, vectors, statistics, probability and calculus. But NLP is about words and sentences, which is very different.

We are now used to Alexa, Siri and Google being able to understand us and answer us back in a conversation (5 years ago it wasn’t like that). When we type a reply to an email in Gmail or a message in LinkedIn, we are now used to receiving suggestions about what we are going to type. And when we log in to British Gas, online banking or an online retail shop, we now find chat bots with whom we can have a useful conversation. Much better than 5 years ago. There is no doubt there have been significant advancements in this area.

The processing of language, be it voice or text, is done at 3 levels. The bottom level is lexical analysis, where ML deals with each word in isolation. The middle level is syntax analysis, where ML analyses the words within the context of the sentence and the grammar. The top level is semantic analysis, where ML tries to understand the meaning of the sentence.

To do lexical analysis we start with regular expressions. We use regular expressions to find words within a text, and to replace them with other words. Then we learn how to identify and remove stop words such as “and”, “the” and “a”, which occur frequently but don’t provide useful information during lexical analysis. The third step is learning how to break the text into sentences and into words. And finally, for each word we try to find the base word, using either stemming, lemmatisation or soundex.
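These lexical steps can be sketched in plain Python. The stop word list and the suffix-stripping “stemmer” below are deliberately tiny toys, just to show the idea; libraries such as NLTK provide proper implementations.

```python
import re

text = "The learners are learning. The teachers taught the lessons."

# Step 1: use a regular expression to break the text into lowercase words
words = re.findall(r"[a-z]+", text.lower())

# Step 2: remove stop words (a tiny illustrative list, not a real one)
stop_words = {"the", "a", "and", "are"}
words = [w for w in words if w not in stop_words]

# Step 3: a naive stemmer that strips common suffixes
def stem(word):
    for suffix in ("ers", "ing", "ed", "er", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

stems = [stem(w) for w in words]
print(stems)   # ['learn', 'learn', 'teach', 'taught', 'lesson']
```

Notice that “taught” is untouched: irregular forms are exactly where stemming fails and lemmatisation is needed.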

Stemming is a process of removing prefixes and suffixes like “ing” and “er” from “learning” and “learner” to get the base word, which is “learn”. Lemmatisation is a process of changing a word to its root, e.g. from “went” to “go”, and from “better”, “well” and “best” to “good”. Soundex is a 4-character code that represents the pronunciation of a word, rather than its spelling.
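Here is a minimal Soundex sketch in Python. It is a simplified version of the classic algorithm (it ignores some edge cases, such as the special treatment of “h” and “w”), but it shows how similar-sounding names get the same code.

```python
def soundex(word):
    # Map consonants to digits; vowels, y, h and w are simply dropped
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    word = word.lower()
    result = word[0].upper()            # keep the first letter as-is
    prev = codes.get(word[0], "")
    for ch in word[1:]:
        code = codes.get(ch, "")
        if code and code != prev:       # skip dropped letters and repeats
            result += code
        prev = code
    return (result + "000")[:4]         # pad or truncate to 4 characters

print(soundex("Robert"), soundex("Rupert"))  # R163 R163: they sound alike
```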

The syntax analysis is done by tagging each word as a noun, verb, adjective, etc. (called the “part of speech”). The tagging is done by parsing (breaking up) the sentences into groups of words (phrases), analysing the grammatical patterns, and considering the dependencies between words.
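As a toy illustration of part-of-speech tagging (real taggers are trained statistical models, e.g. in NLTK or spaCy; the lexicon and suffix rules below are made up):

```python
# A tiny made-up lexicon of known words
LEXICON = {"the": "DET", "a": "DET", "is": "VERB"}

def toy_tag(word):
    # Look the word up first, then fall back on crude suffix rules
    w = word.lower()
    if w in LEXICON:
        return LEXICON[w]
    if w.endswith("ly"):
        return "ADV"
    if w.endswith(("ing", "ed")):
        return "VERB"
    return "NOUN"   # default guess: nouns are the most common open class

tags = [(w, toy_tag(w)) for w in "The dog barked loudly".split()]
print(tags)  # [('The', 'DET'), ('dog', 'NOUN'), ('barked', 'VERB'), ('loudly', 'ADV')]
```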

Semantic analysis is about understanding the meaning of the words and sentences by looking at the structure of the sentence and the word tagging. Words such as “Orange” can mean a colour, a fruit or an area, and “Apple” can mean a fruit or a company, depending on the sentence. In semantic analysis we either assign predefined categories to a text (for example for sentiment analysis, for classifying messages, or for chat bots) or pull out specific information from a text (for example extracting certain terms from IRS contracts, or other documents).
