Data Warehousing and Data Science

16 February 2022

Forecasting time series: using statistics vs machine learning

Filed under: Data Science,Machine Learning — Vincent Rainardi @ 6:59 am

This article outlines how ARIMA and LSTM are used for forecasting time series, and which one is better.
A lot of references are available at the end of this article for those who would like to find out further.


In ML, we use regression, to predict the values of a variable (y) based on the values of other variables (x1, x2, x3, …). For example, we predict the stock price of a company, based on its financial ratios, fundamentals and ESG factors.

In time series forecasting, we predict the values of a variable in the future, based on the values of that variable in the past. For example, we predict the stock price of a company, based on the past prices.

A time series is a sequence of numbers, each collected at a fixed period of time.

How do we forecast a time series? There are 2 ways: a) using statistics, b) using machine learning. In this article I’ll give a brief explanation of both. But before that let’s clear out one thing first: is “time series” plural or singular?

Time Series: plural or singular?

A time series is a sequence of numbers like this 1, 2, 3, 4, 5, … This is one time series, not one time serie.

We can have two time series like this: 1, 2, 3, 4, 5, … and 6, 7, 8, 9, 10, … These are two time series, not two time serieses.

So the singular form is “series” and the plural form is also “series”, not “serieses”. The word “series” is both singular and plural. See Merriam-Webster dictionary explanation in Ref #1 below.

Forecasting a time series means to find out what the next numbers in one series (1, 2, 3, 4, 5, …)

Forecasting two time series means to find out what the next numbers in two series (1, 2, 3, 4, 5, … and 6, 7, 8, 9, 10, …)

Forecasting time series using statistics

We can use regression to forecast a time series. We can also use Moving Average to forecast a time series.

Auto-Regressive model (AR)

Using regression, we use the past values of the forecast variable as the input variables. Which is why this method is called Auto-Regressive model. It is called auto because the input variables are the forecast variable itself, but the past values of it.

where yt-1, yt-2, yt-3are the past values of y, and c, c1, c2, c3 are constants.

ϵt = white noise. It is a sequence of random numbers, with the average of zero and the standard deviation is the same over time.

Moving Average model (MA)

Using Moving Average model the forecast variable is the mean of the series plus the error terms.

where ϵt = yt – yt-1 (white noise error term), μ is the mean and a1, a2, a3 are constants.

It is called moving average because we start with the average (mean), then keep moving/shifting the average by a factor of epsilon (the error term).

I need to emphasise here that the Moving Average model is not the Moving Average analysis that we use for the stock price, where we simply calculate the average of stock prices in the last 20 days.

ARMA model

ARMA model is the combination of the Auto-Regressive model and Moving Average. That is why it is called ARMA, the AR bit means Auto-Regressive, whereas the MA bit means Moving Average. So we forecast using the previous values of the forecast variable (Auto-Regressive model), and using the mean plus the error terms (Moving Average model).

ARIMA has 2 parameters i.e. ARMA(p,q)
where p = order of the autoregressive and q = order of the moving average.
Whereas AR and MA has 1 parameter i.e. AR(p) and MA(q).

ARIMA model

The ARIMA model is ARMA model plus differencing. Differencing means creating a new series by taking difference between the value at t and at (t-1).

For example, from this series: 0, 1, 3, 2, 3, 3, … (call it y)
We can make a new series by taking the difference between the numbers: 1, 2, -1, 1, 0, … (call it y’)
We can take the difference again (called second order differencing): 1, -3, 2, -1, … (call it y’’)

The I in ARIMA stands for Integrated. Integrated here means Differencing.

So the difference between the ARMA model and the ARIMA is: in ARMA we use y, whereas in ARIMA we use y’ or y’’.

In the ARIMA model use AR model and MA model on y’ or y’’, like this:

ARIMA has 3 parameters i.e. ARIMA(p,d,q)
where p = order of the autoregressive, d = degree of the first order differencing, and q = order of the moving average.


The S here means Seasonal and the X here means Exogenous.

Seasonal means that it has a repeating pattern from season to season. For example, the series on top line below consists of the trend part, the seasonal part and the random part. The seasonal part has a repeating pattern. Source: Ref #5.

The SARIMAX model include the seasonal part as well as the non-seasonal part.

SARIMAX has 7 parameters i.e. SARIMAX(p,d,q)x(P,D,Q,s)

Where p, d, q are as defined above, and P, D, Q are the seasonal terms of the p, d, q parameters, and s is the number seasons per year, e.g. for monthly s = 12, for quarterly s = 4.

In timer series, a exogenous variable means parallel time series which is used as a weighted input to the model (Ref #6)

Exogenous variable is one of the parameter in SARIMAX. In Python (statsmodels library), the parameters for SARIMAX are:

SARIMAX (y, X, order=(p, d, q), seasonal_order=(P, D, Q, s))

where y is the time series, X is the Exogenous variable/factor, and the others are as described before.

Forecasting time series using machine learning

The area of machine learning which deals with temporal sequence. is called Recurrent Neural Network (RNN). Temporal sequence means anything which has time element (a series of things happening one after the other), such as speech, handwriting, images, video. And that includes time series of course.

RNN is an neural network which has an internal memory. Which is it able to recognise patterns in time series. There are many RNN models, such as Elman network, Jordan network, Hopfield network, LSTM, GRU.

The most widely used method for predicting a time series is LSTM. An LSTM cell has 3 gates: an input gate, an output gate and a forget gate:

The horizontal line at the top (from ct-1 to ct) is the cell state. It is the memory of the cell. Along this line, there are 3 things happening: the cell state is multiplied by the “forget gate”, increased/reduced by the “input gate” and finally the value is taken to the “output gate”.

  • The forget gate removes unwanted information from the cell state (c), based on the previous input (ht-1) and the current input (xt).
  • The input gate adds new information to the cell state. The current input (xt) and the previous output (ht-1) pass through a σ and a tanh, multiplied then added to the cell memory line.
  • The output gate calculates the output from the cell state (c), the previous input (ht-1) and the current input (xt).

Architecturally, there are different ways we can use to forecast time series using LSTM: (Ref #7)

  • Fully Connected LSTM: a neural network with several layers of LSTM units with each layer fully connected to the next layer.
  • Bidirectional LSTM: the LSTM model learns the time series in backward direction in addition to the forward direction.
  • CNN LSTM: the time series is processed by a CNN first (1 dimensional), then processed by LSTM.
  • ConvLSTM: the convolutional structure is in inside the LSTM cell (in both the input-to-state and state-to-state transitions), see Ref #13 and #16.
  • Encoder-Decoder LSTM: for forecasting several time steps. The Encoder maps the time series into a fixed length vector, and decoder maps this vector back to a variable-length output sequence.

Which one is better, ARIMA or LSTM?

Well that is a million dollar question! Some research suggests that LSTM is better (Ref #17, #20, #24), some suggests that ARIMA is better (Ref #19) and some says that XGB is better than LSTM and ARIMA (#23). So it depends on the cases, but generally speaking LSTM is better in terms of accuracy (RMSE, MAPE).

It is an interesting topic for research. Plus other approaches such as Facebook’s Prophet, GRU, GAN and their combinations (Ref #25, #26, #27). It is possible to get better accuracy by combining the above approaches. I’m still searching a topic for my MSc dissertation, and it looks that this could be the one!


  1. Merriam-Webster dictionary explanation on “series” plurality: link
  2. Forecasting: Principles and Practice, by Rob J. Hyndman and George Athanasopoulos: link
  3. Wikipedia on ARIMA: link
  4. ARIMA model on Statsmodel: link
  5. Penn State Eberly College of Science: link
  6. Quick Adviser: link
  7. How to Develop LSTM Model for Time Series Forecasting by Jason Brownlee: link
  8. Time Series Prediction with LSTM RNN in Python with Keras: link
  9. Time Series Forecasting: Predicting Stock Prices Using An ARIMA Model by Serafeim Loukas: link
  10. Time Series Forecasting: Predicting Stock Prices Using An LSTM Model by Serafeim Loukas: link
  11. Wikipedia on RNN: link
  12. RNN and LSTM by Vincent Rainardi: link
  13. Convolutional LSTM Network: A Machine Learning Approach for Precipitation Nowcasting, by Xingjian Shi, Zhourong Chen, Hao Wang, Dit-Yan Yeung, Wai-kin Wong, Wang-chun Woo: link
  14. Exploiting the ConvLSTM: Human Action Recognition using Raw Depth Video-Based RNN, by Adrian Sanchez-Caballero, David Fuentes-Jimenez, Cristina Losada-Guti´errez: link
  15. Convolutional LSTM for spatial forecasting, by Sigrid Keydana: link
  16. Very Deep Convolutional Networks for End-to-End Speech Recognition, by Yu Zhang, William Chan, Navdeep Jaitly: link
  17. A Comparison of ARIMA and LSTM in Forecasting Time Series, by Sima Siami-Namini, Neda Tavakoli, Akbar Siami Namin: link
  18. ARIMA vs Prophet vs LSTM for Time Series Prediction, by Konstantin Kutzkov: link
  19. A Comparative Analysis of the ARIMA and LSTM Predictive Models and Their Effectiveness for Predicting Wind Speed, by Meftah Elsaraiti, Adel Merabet: link
  20. Weather Forecasting Using Merged LSTM and ARIMA Model, by Afan Galih Salman, Yaya Heryadi, Edi Abdurahman, Wayan Suparta: link
  21. Comparing ARIMA Model and LSTM RNN Model in Time-Series Forecasting, by Vaibhav Kumar: link
  22. A Comparison between ARIMA, LSTM, and GRU for Time Series Forecasting, by Peter Yamak, Li Yujian, Pius Kwao Gadosey: link
  23. Machine Learning Outperforms Classical Forecasting on Horticultural Sales Predictions by Florian Haselbeck, Jennifer Killinger, Klaus Menrad, Thomas Hannus, Dominik G. Grimm: link
  24. Forecasting Covid-19 Transmission with ARIMA and LSTM Techniques in Morocco by Mohamed Amine Rguibi, Najem Moussa, Abdellah Madani, Abdessadak Aaroud, Khalid Zine-dine: link
  25. Time Series Forecasting papers on Research Gate: link
  26. Stock Price Forecasting by a Deep Convolutional Generative Adversarial Network by Alessio Staffini: link
  27. A novel approach based on combining deep learning models with statistical methods for COVID-19 time series forecasting by Hossein Abbasimehr, Reza Paki, Aram Bahrini: link

10 February 2022

Using CNN for Stock Prediction

Filed under: Data Science,Data Warehousing — Vincent Rainardi @ 7:23 am

It really puzzled me when people talked about using CNN for stock market prediction. CNN is for processing images. How can CNN be used for predicting the stock market? Surely we need LSTM for that, because it is a time series?

The key here is to recognise that time series can be viewed in polar coordinate, like this: (Ref #2 and #3)

Stock charts are time series, which is basically the price of the stock across time. This is in Cartesian coordinate, i.e. X and Y coordinate. A point in X and Y coordinate can be converted into polar coordinate and vice versa, like this: (Ref #4)

This way, the line in the x and y chart (such as time series) can be converted into a Polar Coordinate chart, like above. The reason we are converting it to Polar Coordinate is to make it easier to detect patterns or anomalies (Ref #5).

To identify the temporal correlation in different time intervals we look at the cosine of sum between each pair of points (Ref #6 and #7), like this:

The above matrix of cosine is called the Gramian Matrix.

To make it easier to visualise, we convert the Gramian Matrix into an image, like below left: (Ref #3)

 The image on the left is a Gramian Angular Summation Field (GASF). If the Gramian Matrix uses sin instead of cos, and the operative is minus instead of plus, then the image is called a Gramian Angular Difference Field (GADF), like above right. Together, GASF and GADF are called GAF (Gramian Angular Field).

So why do we use GAF? What are the advantages? The first advantage is that GAF preserves temporal dependency. Time increases from the top left corner of the GAF image to the bottom right corner. Each element of the Gramian Matrix is a superposition or difference of directions with respect to the time difference between 2 points. Therefore GAF contains temporal correlations.

Second, the main diagonal in the Gramian Matrix contains the values with no time difference between 2 points. Meaning that the main diagonal contains the actual angular values. Using this main diagonal we can construct the time series of the features learned by the neural network (Ref 6# and #7).

Different colour schemes are used when creating GAF chart, but the common one is from blue to green to yellow to red. Blue for -1, green for 0, yellow for 0.3 and red for 1, like this: (Ref 6)

Once the time series become images, we can process them using CNN. But how do we use CNN to predict the stock prices? The idea is to take time series of thousands stocks and indices from various time periods, convert them into GAF images, and label each image with the percentage up or down the next day, like below.

We then train a CNN to classify those GAF images to predict which series will be up or down the next day and by how much (a regression exercise).

Secondly, for each of the stock chart we produce various different indicators such as Bollinger Bands, Exponential Moving Average, Ichimoku Cloud, etc. like below (Ref 15), and convert all of them to GAF images.

We put all these overlays together with the stock/index, forming a 3D image (dimensions: x = time, y = values, z = indicators. We use the same labels, which is the percentage up or down the next day, to train a CNN network using those 3D images and those labels. Now, we don’t only use the stock prices to predict the next day movement, but also the indicators.

Of course, we can use the conventional LSTM to do the same task, without converting them to the Polar Coordinate. But the point of this article is to use CNN for stock prediction.

I’ll finish with 2 more points:

  1. Apart from GAF there are other method for converting time series into images, for example Markov Transition Field (MTF, Ref 9) and Reference Plot (RP, Ref 10). We can use MTF and RP images (both the prices and the indicators) to predict the next day prices.
  2. There are other methods for using CNN to predict stock prices without involving images. The stock prices (and their indicators) remain as time series. See Ref 11 first, then 12 and 13. The time series is cut at different points and converted into matrix, like below.

If you Google “using CNN for predicting stock prices”, the chances are it is this matrix method that you will find, rather than using images. Because this matrix method uses X and y (input variable and target variable) then we can also use other ML algorithm including classical algorithms such as Linear Regression, Decision Trees, Random Forest, Support Vector Machines and Extreme Gradient Boosting.

The third method is using CNN-LSTM. In this method the local perception and weight sharing of CNN is used to reduce the number of parameters in the data, before the data is processed by LSTM (Ref 14).

So there you go, there are 3 ways of using CNN for predicting stock prices. The first one is using images (GAF, MTF, RP, etc), the second one is converting the time series into X and y matrix and the third one is by putting CNN in front of LSTM.


  1. Encoding time series as images, Louis de Vitry (link).
  2. Convolutional Neural Network for stock trading using techincal indicators, Kumar Chandar S (link)
  3. Imaging time series for classification of EMI discharge sources, Imene Mitiche, Gordon Morison, Alan Nesbitt, Michael Hughes-Narborough, Brian G. Stewart, Philip Boreha (link)
  4. Keisan online calculator (link)
  5. Sensor classification using Convolutional Neural Network by encoding multivariate time series as two dimensional colored images, Chao-Lung Yang, Zhi-Xuan Chen, Chen-Yi Yang (link)
  6. Spatially Encoding Temporal Correlations to Classify Temporal Data Using Convolutional Neural Networks, Zhiguang Wang, Tim Oates (link)
  7. Imaging Time-Series to Improve Classification and Imputation, ZhiguangWang and Tim Oates (link)
  8. How to encode Time-Series into Images for Financial Forecasting using Convolutional Neural Networks, Claudio Mazzoni (link)
  9. Advanced visualization techniques for time series analysis, Michaël Hoarau (link)
  10. Financial Market Prediction Using Recurrence Plot and Convolutional Neural Network, Tameru Hailesilassie (link)
  11. How to Develop Convolutional Neural Network Models for Time Series Forecasting, Jason Brownlee (link)
  12. Using CNN for financial time series prediction, Jason Brownlee (link)
  13. CNNPred: CNN-based stock market prediction using several data sources, Ehsan Hoseinzade, Saman Haratizadeh (link)
  14. A CNN-LSTM-Based Model to Forecast Stock Prices Wenjie Lu, Jiazheng Li, Yifan Li, Aijun Sun, Jingyang Wang (link)
  15. (link)

18 January 2022

How to do AI without Machine Learning?

Filed under: Data Science,Machine Learning — Vincent Rainardi @ 8:40 am

I’m doing a master’s degree titled ML and AI, and all this time I’ve been wondering what the difference between AI and ML is. I know AI is a superset of ML, but what is in AI but not in ML? Is it possible to do AI without ML? If so, how?

The Old Days of AI: rule-based

In 1990s there was no machine learning. To be clear, here machine learning includes classical algorithms like Decision Trees, Naive Bayes and SVM, as well as Deep Learning (neural network). In the 1990s there was no machine learning, but there was a lot of news about Artificial Intelligence. Deep Blue was the culmination of that.

So we know there was AI when there was no ML. There was AI without ML. But what was it? What was that AI without ML? Well rule-based of course. The technology that Deep Blue used is called the “Expert System”, which is based on rules defined and tuned by chess masters. You can read about the software behind Deep Blue here: link.

A rule-based system is essentially IF-THEN. There are many different types of rules so I need to clarify which one. It is the IF-THEN rule that makes up an Expert System. There are 2 main components of an Expert System (ES): the Inference Engine and the Knowledge Base. You can read the software architecture of an Expert System here: link.

Search and Fuzzy Logic

Besides ML and ES, another way to do AI is using Search. There are various ways to do search, such as Heuristic Search (Informed Search), Iterative Search and Adversarial Search. You can read the details in an excellent book by Crina Grosan and Ajith Abraham: link, page 13 to 129.

In the Expert System world, the IF-THEN rule-based is not the only way to do Expert System. There is another way: using fuzzy logic. In an IF-THEN rule-based expert system, the truth value is either 0 or 1. In a fuzzy logic system, the truth value is any real number between 0 and 1 (link). There are several fuzzy logic systems, such as Mandani and TSK (you can read the details here: link)

Evolutionary Algorithm and Swarm Intelligence

Another way for doing AI is using Evolutionary Algorithm (EA). EA uses concepts in evolution/biology such as reproduction, natural selection and mutation, in order to develop a solution/AI: link.

And finally, another way for doing AI is Swarm Intelligence: link. Swarm Intelligence (SI) is inspired by the behaviour of a group of animals, such as birds and ants. SI-based AI system consists of a group of agents interacting with each another, and with the environment (similar to Reinforcement Learning but using many agents).


So there you have it, there are a few other ways for doing AI:

  • Machine Learning
  • Expert System (Rule-Based)
  • Fuzzy Logic
  • Search
  • Evolutionary Algorithm
  • Swarm Intelligence

So just because we study ML we should not think that we are the only one, the only way to do AI. There are other ways, which might be better. Which may produce a better AI. Who knows, you haven’t studied them right? Well I know for sure now, that AI is not just ML. I hope this article is useful for you.


  1. Expert System, Wikipedia: link
  2. History of AI, Wikipedia: link
  3. AI without ML, Teradata: link
  4. AI without ML, Claudia Pohlink: link
  5. Rule-based AI vs ML, We Are Brain: link
  6. Intelligence Systems, Crina Grosan & Ajith Abraham: link

17 January 2022

Machine Learning or Data Science?

Filed under: Data Science,Machine Learning — Vincent Rainardi @ 8:07 am

I’ve just got my post grad diploma in machine learning and all this time I was wondering what data science was. I have written an article about what data science is: link, but now that I understand a bit more about machine learning, I understand there is a lot of overlap between the two (ML and DS).

Last night when I read a Data Science book by Andrew Vermeulen (link) I was wondering which of the things I’ve learned in ML is actually DS. I list the items and label them ML or DS:

Yes, machine learning is definitely part of data science. Strictly speaking, the data cleansing, data analysis, statistics and visualisation are data science but not machine learning. We can see this in this proceedings: link.

So Data Science consists of the followings:

  • Data Cleansing
  • Data Analysis
  • Statistics (including probability, central limit theorem, hypothesis testing)
  • Data Visualisation
  • Machine Learning (including all ML models)

But in my opinion one cannot learn ML without studying statistics, visualisation, data loading, data cleansing and data analysis. In order to understand ML models properly, one must understand all the above fields.

Berkeley School of Information argues that the followings are also included in data science: link

  • Data Warehousing
  • Data Acquisition
  • Data Processing
  • Data Architecture
  • Business Intelligence
  • Data Reporting

I disagree with this opinion. From what I see many companies, Data Warehousing, acquisition/ processing and Data Architecture are part of a role called Data Engineer. A Data Engineer prepare and stores the data, including designing the data models and data ingestion process.

Because Data Visualisation is part of data science, it is tempted to think that Business Intelligence and Data Reporting are part of Data Science. But this is not true. The data visualisation in the data science is more on the data behaviour, such as clustering and statistical analysis, whereas BI is more on the business side, such as portfolio performance or risk reporting. This is only my opinion though, I’m sure other people have different opinions.

So there are 2 fields/roles in the data industry these days:

  • Data Science: data cleansing, data analysis, statistics, machine learning, data visualisation.
  • Data Engineering: data acquisition, data loading/processing, data quality, data architecture.

Whereas in the old days the roles are: business/data analyst, data architect, BI developer, ETL developer.

24 December 2021

Using Reinforcement Learning to Manage Portfolio Allocation

Filed under: Data Science,Data Warehousing — Vincent Rainardi @ 9:03 am

I am a firm believer on the iterative approach, rather than big bang. To create something complex we need to build it incrementally. For my masters dissertation I would like to use machine learning to manage investment portfolios. Last week I described the “layout of the land” of this topic in this article: link. That made me realise 3 things:

  1. That the topic is very large, from predicting stock prices to managing risk, from managing stock composition to crypto currencies.
  2. That what I’d like to do is managing the portfolio allocation. In terms of assets I would prefer stocks, rather than fixed income or crypto currencies.
  3. That the best approach for this is using Reinforcement Learning (Q network).

Problem Statement

So as the first step, I would like to simply use a Q network to decide which portfolio allocation would be best in terms of maximising return. So the reward is the return (current market price minus the purchase cost). The environment is 4 stocks in different industry sectors:

  • 1 in financial industry: JP Morgan, symbol: JPM
  • 1 in retail industry: Home Depot, symbol: HD
  • 1 in commodity industry: Exxon Mobile, symbol: XOM
  • 1 in healthcare industry: Pfizer, symbol: PFE

All from the same country i.e. US. The action is to choose the composition of these 4 stocks in the portfolio, plus cash. To simplify things the composition must be: 1 stock = 40% weight, and the other 3 stocks and cash = 15% weight. So there are only 5 possible actions to take:

Every working day, the agent must decide which action it wants to take, i.e. which composition it wants to use. Then the reward is calculated by comparing the valuation of the portfolio at the end of the day to the previous day, minus the transaction cost and the holding cost. The portfolio valuation is obtained by summing the valuation of the 4 stocks (held quantity x today closing price) plus the cash. The transaction cost is $10 per trade. The holding cost is 0.005% of the portfolio value per day, including weekend.

I will be using use 4 years of price data from 19th Dec 2016 to 18th Dec 2020 to train the model, and 19th Jan 2021 to 18th Dec of 2021 to test it. Note that stock prices are only available on Monday to Friday, and when it’s not a public holiday in the US. All 5 prices will be fed into the Q model (open, close, high, low, adjusted close) plus the daily volume too.

The Environment

In Reinforcement Learning we have an environment and an agent. The environment consists of a state space and an action space. In Reinforcement Learning we need to define 6 things in the environment:

  1. The state space i.e. a list of all the possible states
  2. The action space i.e. a list of all possible actions
  3. The reward for doing an action from a particular state
  4. The next state after doing an action
  5. An episode, and how many time steps in an episode
  6. How the environment will be reset at the beginning of an episode

State Space: The state is the current state of the portfolio on a particular date, i.e. composition 1 to 5 (C1 to C5). In the beginning, before any trade is made, the portfolio consists of 100% cash (trade means buying or selling a stock). This beginning state is called C0.

Action Space: The action that the agent can take is buying or selling stocks so that the portfolio is in a certain composition (from composition 1 to 5). So there are 5 actions. Let’s name these 5 actions as A1 to A5. If the agent does action A2, then the next state is C2, because the portfolio will be in composition 2. If the agent does action A3, then the next state will be C3. And so on. Of course the agent can also do nothing (let’s call it A0), in this case the next state is the same as the previous state.

Episode: One episode in this case is 30 trading days. So at the beginning of an episode, a date within the training data will be randomly chosen as the start date. For example, the start date = 17st June 2018. Then every trading day the agent would take an action. A “trading day” means a day when the US stock markets are open. 17th June is a Sunday, not a trading day, so it starts on 18th June 2018, like this:

2018-06-17 Sunday No action
2018-06-18 Monday, Action = A2
2018-06-19 Tuesday, Action = A3
2018-06-20 Wednesday, Action = A5
… and so on until
2018-07-26 Thursday, Action = A4
2018-07-27 Friday, Action = A5
2018-07-30 Monday, Action = A1

In the above, the actions are just examples. Every day the agent determines which action to take, between A1 to A5. The agent can only make 1 action per day, i.e. at the beginning of every day.

Note that US public holidays are not trading days. So for example, 25th Dec 2018 (Tuesday) is Christmas day, so no action.

Reward: The portfolio valuation is calculated as the valuation of all the stocks, plus cash. The reward is calculated based on the profit for that day, i.e. the portfolio value at the end of the day, minus the portfolio value at the start of the day.

Beginning of an episode: At the beginning of an episode the portfolio consist entirely of cash. This is an additional state, in addition to C1 to C5 defined above. So we have 6 states in total: C0 to C6

Portfolio Valuation

At this initial state we need to define how much cash is in the portfolio. Let’s define that as USD $1 million. So on that first day in the episode (say Sunday 17th June 2018), the value of the portfolio was $1 million.

Let’s say that the next day, Monday 18th June 2018, the agent decided to take action C1, which brings the portfolio to state C1. So on that Monday morning the portfolio consisted of: 40% cash, 15% JPM, 15% HD, 15% XOM and 15% PFE. The value of the portfolio in the beginning of that Monday 18th June 2018 was the sum of the 40% cash and the initial value of the holdings (i.e. the 4 stocks):

  • The value of 40% cash = $400,000
  • 15% ($150,000) to buy JPM stock. Opening price: 107.260002. Quantity: 1398.470979
  • 15% ($150,000) to buy HD stock. Opening price: 198.940002. Quantity: 753.9961722
  • 15% ($150,000) to buy XOM stock. Opening price: 80.400002. Quantity: 1865.671595
  • 15% ($150,000) to buy PFE stock. Opening price: 198.940002. Quantity: 753.9961722

In the above, the prices are from the stock market data. The quantity held is simply calculated as the money to buy the stock divided by the opening price. The value of the portfolio at the end of that Monday 18th June 2018 is the sum of the 40% cash and the value of the 4 stocks (based on the quantity held):

  • 40% cash = $400,000
  • JPM: Closing price: 108.180000. Value = 151,286.59
  • HD: Closing price: 200.690002. Value = 151,319.49
  • XOM: Closing price: 80.82. Value = 150,783.58
  • PFE: Closing price: 34.278938. Value = 150,124.55

We also need to subtract the transaction cost, which is $10 per trade, and the holding cost (the cost we pay to the investment platform, for keeping our holdings), which is 0.01% of the portfolio value, per day:

So after 1 day of trading, 18th June 2018, after the agent decided to take action C1, the portfolio value is 1,003,424.03. So the profit is $3,424.03.

The reward

The reward is the profit for that day, i.e. the portfolio value at the end of the day, minus the portfolio value at the start of the day. So in this case the reward is $3,474.21. Note that the reward can be negative, i.e. if on that day the value of the portfolio at the end of the day is lower than at the start of the day.

For each trading day there will be a reward, which is added to the previous day reward to make the cumulative reward. Every day we calculate the cumulative reward.

Episodes and Total Score

An episode is 30 trading days. At the end of the episode (30th July 2018) the cumulative reward is the “total score” for the agent.

Then another episode is chosen and the environment is reset. The 30 days trading begins and the reward and cumulative reward is calculated every day. And at the end of the episode, the total score is obtained.

And so the agent keep learning, episode by episode, each time adjusting the weights of the neural network within the agent. In the early episodes, we expect the total score to be low because the agent is still learning. But after many episodes, we expect the total score to be consistently high, because the agent has learned the pattern in the stock prices. So it knows which stock would be the most profitable to invest in.

Generating Experience

So far so good. But the state space is not actually just the current portfolio holdings/composition. The state space also include the current prices, and historical prices. Not only the prices of the holdings, but also the stocks not in holding (because they also determine what should be held).

And, in reality, the stocks in the holdings are not just 4. There are 40 to 100 stocks, depending on the size and the policy of the fund. And the investment universe (out of which we choose the stock to hold) is about 500 to 1000 stocks.

So obviously, we can’t have the mapping of all the state and actions, to the “value” (the net profit for today). Because there are so many combinations of the state and actions (millions).

In Reinforcement Learning this problem is solved by approximating the value using a neural network. We don’t record all those combinations of historical prices (states) and stock allocations (actions) and their values (today’s profit). Instead, we train a neural network to learn the relationship between the states, action and values. Then use it to approximate the value, for a given state and action.

In Reinforcement Learning, we generate the experience using a Q network. Generating an experience means that the system will choose to either to do exploration or exploitation. This is called “Epsilon Greedy” algorithm.

  1. Set the epsilon (ε), which is the boundary between exploration and exploitation.
  2. Generate a random number.
  3. If the number is less than epsilon, choose a random action (exploration).
  4. If the number is more than epsilon (or equal), choose the best action (exploitation).
    The best action is the action with the highest reward.
  5. Calculate the reward.
  6. Determine the next state.
  7. Store the state, the action, the reward and the next state.

So that’s the topic for the next article, i.e. how to use neural network to approximate the value, for a given state and action. The state in this case is the historical prices, and the action here is the portfolio composition (or stock allocation) The value here is the profit or gain on a particular day.

So the problem statement becomes: given all the historical prices, what is the best portfolio composition for today? And the value of that portfolio composition is: the net profit we make today.

Once we are able to create a neural network which can answer the above question, then we’ll create the second neural network to do the Reinforcement Learning, using Action and Reward, using Environment and Agent. This second NN will be learning how to optimise a portfolio, i.e. what stocks should be held in order to maximise the return during a 30-day period (one episode).

14 December 2021

Managing Investment Portfolios Using Machine Learning

Filed under: Data Science,Machine Learning — Vincent Rainardi @ 8:28 am

In investment management, machine learning can be used on different areas of portfolio management including portfolio construction, signal generation, trade execution, asset allocation, security selection, position sizing, strategy testing, alpha factor design and risk management. Portfolio management is first a prediction problem for the vector of expected returns and covariance matrix, and then an optimization problem for returns, risk, and market impact (link). We can use various ML algorithms for managing investment portfolios, including reinforcement learning, Elastic Net, RNN (LSTM), CNN and Random Forest.

In this article I would like to give the “overview of the land” on how machine learning is used to manage investment portfolios. So I’m going to list down various methods that people use, but not going to explain the mechanism of each method. For example: I could be mentioning that people predict the direction of stock prices using random forest, but I’m not going to explain the why nor how. For each case I’m going to provide a link, so that you can read more about them if you are interested. The idea is that people who wish to do a research about portfolio management using ML can understand the current landscape, i.e. what have been done recently.

In the next article (link), I will be describing the problem statement of using Reinforcement Learning to manage portfolio allocation. So out of so many things about portfolio management I describe in this article, I choose only 1 (which is portfolio allocation). And out of so many ML approaches I describe in this article, I choose only 1 (which is RL).

But first, a brief overview on what portfolio management is. This is important as some of us are not from the investment sector. We know about the stock markets, but have no idea how a portfolio is managed.

What is Portfolio Management?

Portfolio management is the art and science of making optimal investment decisions to minimise the risks and maximizing the returns, to meet the investor’s financial objectives and risk tolerance. Active portfolio management means strategically buying and selling securities (stocks, bonds, options, commodity, property, cash, etc) in an effort to beat the market. Whereas passive portfolio management means matching the returns of the market by replicating some indexes.

Portfolio management involves the following stages:

  1. Objectives and constraints
  2. Portfolio strategies
  3. Asset allocation
  4. Portfolio construction
  5. Portfolio monitoring
  6. Day to day operations

Let’s examine those 6 stages one by one.

1. Define the investment objectives and constraints

First the investors or portfolio managers need to define the short term and long term investment goals, and how much and which types of risks the investor is willing to take. Other constraints include the capital amount, the time constraints, the asset types, the liquidity, the geographical regions, the ESG factors (environment, social, governance), the “duration” (sensitivity to interest rate changes) and the currency. Whether hedging FX or credit risks is allowed or not (using FX forwards and CDS), having more than 10% of cash is allowed or not, investing in developed markets is allowed or not, how much market volatility is allowed, whether investing in companies with market cap less than $1 billion is allowed, whether investing in coal or oil companies is allowed or not, whether investing in currencies is allowed or not, etc. – those are all portfolio constraints too.

2. Define the portfolio strategies

Based on the objectives and constraints, the investors or portfolio managers define the portfolio strategies, i.e. active or passive, top down or bottom up, growth or value investing, income investing, contrarian investing, buy and hold, momentum trading, long short strategy, indexing, pairs trading, dollar cost averaging (see here for more details). Hedging strategies, diversification strategies, duration strategies, currency strategies, risk strategies, stop loss strategies, liquidity strategy (to deal with redemptions and subscriptions), cash strategies – these are all strategies in managing portfolios.

3. Define the asset allocations

Based on the objectives and constraints, the investors or portfolio managers define what types of assets they should be investing. For example, if the objective is to make a difference in climate change, then the investment universe would be low carbon companies, clean energy companies and green transport companies. If one of the investment constraints is the invest in Asia, but not in Japan, and only in fixed income (not equity) then the investment universe would be the bonds issued by companies based in China, India, Korea, Singapore, etc. The asset types could be commodity (like oil or gold), property (like offices or houses), cash-like assets (like treasury bonds), government bonds, corporate bonds, futures, ETFs, crypto currencies, options, CDS (credit default swaps), MBS (mortgage based securities), ABS (asset based securities), time deposits, etc.

4. Portfolio construction

Once the strategies and the asset allocations are defined, the investors or portfolio managers begin building the portfolio by buying assets in the stock/bond markets and by entering contracts (e.g. CDS contracts, IRS contracts, property contracts, forward exchange contracts). Every company that they are going to buy is evaluated. The financial performance is evaluated (financial ratios, balance sheet, cash flow, etc), the board of directors are evaluated (independence, diversity, board size, directors skills, ages and background, etc), the stock prices are evaluated (company value, historical momentum, etc), the controversies are evaluated (incidents, health & safety record, employee disputes, law breaking records & penalties, etc), the environmental factors are evaluated (pollutions, climate change policies, carbon and plastic records, etc). So it is not just financial, but a lot more than that.

5. Portfolio monitoring

Then they need put in place a risk monitoring system, compliance monitoring system, performance monitoring system, portfolio reporting system and asset allocation monitoring system. Every trade is monitored (market abuse, trade transparency, capital requirements, post trade reporting, authorisation requirements, derivative reporting), and every day each portfolio holding are monitored. Cash level and portfolio breakdown are monitored every day. Early warnings are detected and reported (for threshold breach), market movement effect are monitored, operational risks are monitored & reported. Client reporting are in place (e.g. when investment values drop more than 10% the client must be notified), and audits are put in place (data security audit, IT systems audit, legal audit, anti money laundering audit, KYC/on-boarding process, insider trading).

6. Day-to-day operations

On the day-to-day operation, the investors or portfolio managers basically identify the potential trades to make money (to enhance the return). Trade means buying or selling securities. For this they screen potential companies (based on financial ratios, technical indicators, ESG factors, etc) to come up with a short list of companies that they will buy. They research these companies in depth and finally come up one a company they are going to buy (company A). They calculate which holding in the portfolio they will need to sell (company B) to buy this new company. They calculate the ideal size of holding for company A (in a global portfolio, each holding is about 1-2% of the portfolio), and it depends on the other properties of this company as well (sector, country, benchmark comparison, etc). Then they make 2 trades: buy A and sell B.

Apart from trades to make money, there are trades for other purposes: trades to mitigate risks, trade for compliance, trade for rebalancing, trade for benchmark adjustments, trades to improve liquidity, etc.

What are not included in portfolio management are the sales and marketing operation, the business development, the product development. These activities are also directly impacting the portfolio management though, because subscriptions and redemptions change the AUM (asset under management), but they are not considered part of portfolio management.

Machine Learning Methods Used in Portfolio Management

Below are various research papers which use various machine learning models and algorithms to manage investment portfolios, including predicting stock prices and minimising the risks.

Part 1. Using Reinforcement Learning

  • A deep Q-learning portfolio management framework for the crypto currency market (Sep 2020, link)
    A deep Q-learning portfolio management framework consisting of 2 elements: a set of local agents that learn assets behaviours and a global agent that describes the global reward function. Implemented on a crypto portfolio composed by four crypto currencies. Data: Bitcoin (BTC), Litecoin (LTC), Ethereum (ETH) and Riple (XRP) July 2017 to January 2019.
  • RL based Portfolio Management with Augmented Asset Movement Prediction States (Feb 2020, link)
    Using State-Augmented RL framework (SARL) to augment the asset price information with their price movement prediction (derived from news), evaluated on accumulated profits and risk-adjusted profits. Datasets: Bitcoin and high tech stock market, and 7 year Reuters news articles. Using LSTM for predicting the asset movement and NLP (Glove) to embed the news then feed into HAN to predict asset movement.
  • Adversarial Deep RL in Portfolio Management (Nov 2018, link)
    Using 3 RL algorithms: Deep Deterministic Policy Gradient (DDPG), Proximal Policy Optimization (PPO) and Policy Gradient (PG). China stock market data. Using Adversarial Training method to improve the training efficiency and promote average daily return and Sharpe ratio.
  • Financial Portfolio Management using Reinforcement Learning (Jan 2020, link)
    Using 3 RL strategies are used to train the models to maximize the returns and minimize the risks: DQN, T-DQN, and D-DQN. Indian stock market from Yahoo finance data, from 2008 to 2020.
  • Using RL for risk-return balanced portfolio management with market conditions embedding (Feb 2021, link)
    A deep RL method to tackle the risk-return balancing problem by using macro market conditions as indicators to adjust the proportion between long and short funds to lower the risk of market fluctuations, using the negative maximum drawdown as the reward function.
  • Enhancing Q-Learning for Optimal Asset Allocation (Dec 1997, link)
    To enhance the Q-Iearning algorithm for optimal asset allocation using only one value-function for many assets and allows model-free policy-iteration.
  • Portfolio Optimization using Reinforcement Learning (Apr 2021, link)
    Experimenting with RL for building optimal portfolio of 3 cryptocurrencies (Dash, Litecoin, Staker) and comparing it with Markowitz’ Efficient Frontier approach. Given the price history, to allocate a fixed amount of money between the 3 currencies every day to maximize the returns.

Part 2. Using Recurrent Neural Network (RNN)

  • Mutual Fund Portfolio Management Using LSTM (Oct 2020, link)
    Predicting the company stock prices on 31/12/2019 in IT, banking and pharmaceutical sectors based on Bombay stock prices from 1/1/2012 to 31/12/2015. Mutual funds are created from stocks in each sector, and across sectors.
  • Stock Portfolio Optimization Using a Deep Learning LSTM Model (Nov 2021, link)
    Time series analysis of the top 5 stocks historical prices from the nine different sectors in the Indian stock market from 1/1/2016 to 31/12/2020. Optimum portfolios are built for each of these sectors. The predicted returns and risks of each portfolio are computed using LSTM.
  • Deep RL for Asset Allocation in US Equities (Oct 2020, link)
    A model-free solution to the asset allocation problem, learning to solve the problem using time series and deep NN. Daily data for the top 24 stocks in the US equities universe with daily rebalancing. Compare LSTM, CNN, and RNN with traditional portfolio management approaches like mean-variance, minimum variance, risk parity, and equally weighted.
  • Portfolio Management with LSTM (Dec 2018, link)
    Predicting short term and long term stock price movements using LSTM model. 15 stocks, 17 years of daily Philippine Stock Exchange price data. Simple portfolio management algorithm which buys and sells stocks based on the predicted prices.
  • Anomaly detection for portfolio risk management (June 2018, link)
    ARMA-GARCH and EWMA econometric models, and LSTM and HTM machine learning algorithms, were evaluated for the task of performing unsupervised anomaly detection on the streaming time series of portfolio risk measures. Datasets: returns and VAR (value at risk).

Part 3. Using Random Forest

  • Forecasting directional movements of stock prices for intraday trading using LSTM and random forests (June 2021, link)
    Using random forests and CuDNNLSTM to forecast the directional movements of S&P 500 constituent stocks from January 1993 to December 2018 for intraday trading (closing and opening prices returns and intraday returns). On each trading day, buy the 10 stocks with the highest probability and sell short the 10 stocks with the lowest probability to outperform the market in terms of intraday returns.
  • Stock Selection with Random Forest in the Chinese stock market (Aug 2019, link)
    Evaluates the robustness of the random forest model for stock selection. Fundamental/technical feature space and pure momentum feature space are adopted to forecast the price trend in the short and long term. Data: all companies on the Chinese stock market from 8/2/2013 to 8/8/2017. Stocks are divided into N classes based on the forward excess returns of each stock. RF model is used in the subsequent trading period to predict the probability for each stock that belongs to the category with the largest excess return. The selected stocks constituting the portfolio are held for a certain period, and the portfolio constituents are then renewed based on the new probability ranking.
  • Predicting clean energy stock price using random forests approach (Jan 2021, link)
    Using random forests to predict the stock price direction of clean energy exchange traded funds. For a 20-day forecast horizon, tree bagging and random forests methods produce 85% to 90% accuracy rates while logistic regression models are 55% to 60%.
  • Stock Market Prices Prediction using Random Forest and Extra Tree Regression (Sep 2019, link)
    Comparing Linear Regression, Decision Tree and Random Forest models. Using the last 5 years historical stock prices for all companies on S&P 500 index. From these the price of the stock for the sixth day are predicted.

Part 4. Using Gradient Boosting

  • A Machine Learning Integrated Portfolio Rebalance Framework with Risk-Aversion Adjustment (July 2021, link)
    A portfolio rebalance framework that integrates ML models into the mean-risk portfolios in multi-period settings with risk-aversion adjustment. In each period, the risk-aversion coefficient is adjusted automatically according to market trend movements predicted by ML models. The XGBoost model provides the best prediction of market movement, while the proposed portfolio rebalance strategy generates portfolios with superior out-of-sample performances compared to the benchmarks. Data: 25 US stocks, 13-week Treasury Bill and S&P 500 index from 01/09/1995 to 12/31/2018 with 1252 weekly returns.
  • The Success of AdaBoost and Its Application in Portfolio Management (Mar 2021, link)
    A novel approach to explain why AdaBoost is a successful classifier introducing a measure of the influence of the noise points. Applying AdaBoost in portfolio management via empirical studies in the Chinese stock market:
    1. Selecting an optimal portfolio management strategy based on AdaBoost
    2. Good performance of the equal-weighted strategy based on AdaBoost
    Data: June 2002 and ends in June 2017, 181 months, Chinese A-share market. 60 fundamentals & technical factors.
  • Moving Forward from Predictive Regressions: Boosting Asset Allocation Decisions (Jan 2021, link)
    A flexible utility-based empirical approach to directly determine asset allocation decisions between risky and risk-free assets. Single-step customized gradient boosting method specifically designed to find optimal portfolio weights in a direct utility maximization. Empirical results of the monthly U.S. data show the superiority of boosted portfolio weights over several benchmarks, generating interpretable results and profitable asset allocation decisions. Data: The Welch-Goyal dataset, containing macroeconomic variables and the S&P 500 index from December 1950 to December 2018.
  • Understanding Machine Learning for Diversified Portfolio Construction by Explainable AI (Feb 2020, link)
    A pipeline to investigate heuristic diversification strategies in asset allocation. Use explainable AI to compare the robustness of different strategies and back out implicit rules for decision making. Augment the asset universe with scenarios generated with a block bootstrap from the empirical dataset. The empirical dataset consists of 17 equity index, government bond, and commodity futures markets across 20 years. The two strategies are back tested for the empirical dataset and for about 100,000 bootstrapped datasets. XGBoost is used to regress the Calmar ratio spread between the two strategies against features of the bootstrapped datasets.

22 November 2021

Tuning XGBoost Models

Filed under: Data Science,Machine Learning — Vincent Rainardi @ 7:15 am

I was tuning fraudulent credit card transaction data from Kaggle (link) and found that for classifier, XGBoost provides the highest AUC compared to other algorithms (99.18%). It is a little tricky to tune though, so in this article I’d like to share my experience in tuning it.

What is XGBoost?

XGBoost stands for Extreme Gradient Boosting. So before you read about XGBoost, you need to understand first what is Gradient Boosting, and what is Boosting. Here are good introductions to this topic: link, link. The basis algorithm for XGBoost is Decision Tree. Then many trees are used together in a technique called Ensemble (for example Random Forest). So a complete journey to understanding XGboost from the ground up is:

  1. Decision Tree
  2. Ensemble
  3. Stacking, Bagging, Boosting (link)
  4. Random Forest
  5. Gradient Boosting
  6. Extreme Gradient Boosting

Higgs Boson

The original paper by Tianqi Chen and Carlos Guestrin who created XGBoost is here: link.
XGBoost was used to solve Higgs Boson classification problem, again by Tianqi Chen, and Tong He: link. Higgs Boson is the last elementary particle discovered. It was discovered in 2012 at the Large Hadron Collider at CERN. The particle was predicted by Peter Higgs in 1964.


A good reference for tuning XGBoost model is a guide from Prashant Banerjee: link (search for “typical value”). Another good one is from Aarshay Jain: link (again, search for “typical value”). The guide from the developers: link and the list of hyperparameters are here: link.

Python Code

Here’s the code in its entirety:

# Import required libraries
import numpy as np
import pandas as pd
from sklearn import preprocessing

# Load the data from Google drive
from google.colab import drive
df = pd.read_csv('/content/gdrive/My Drive/Colab Notebooks/creditcard.csv')

# Drop time column as fraudulent transactions can happen at any time
df = df.drop("Time", axis = 1)

# Get the class variable and put into y and the rest into X
y = df["Class"]
X = df.drop("Class", axis = 1)

# Stratified split into train & test data
from sklearn import model_selection
X_train, X_test, y_train, y_test = model_selection.train_test_split( X, y, test_size = 0.2, stratify = y, random_state = 42 )

# Fix data skewness
from sklearn.preprocessing import PowerTransformer
pt = PowerTransformer(copy=False)
train_return = pt.fit_transform(X_train)
test_return  = pt.fit_transform(X_test)

# Balance the train and test data using SMOTE
from imblearn.over_sampling import SMOTE
SMOTE = SMOTE(random_state=42)
X_smote_train, y_smote_train = SMOTE.fit_resample(X_train, y_train)
X_smote_test, y_smote_test = SMOTE.fit_resample(X_test, y_test)

# Sample training data for tuning models (use full training data for final run)
tuning_sample = 20000
idx = np.random.choice(len(X_smote_train), size=tuning_sample)
X_smote_tuning = X_smote_train.iloc[idx]
y_smote_tuning = y_smote_train.iloc[idx]

# Import libraries from Scikit Learn
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import roc_auc_score
from xgboost import XGBClassifier

# Create a function to calculate AUC using predict proba
def Get_AUC(Model, X, y):
    prob = Model.predict_proba(X)[:, 1] 
    return roc_auc_score(y, prob) * 100

# Perform grid search cross validation with different parameters
parameters = {'n_estimators':[90], 'max_depth':[6], 'learning_rate':[0.2], 
              'subsample':[0.5], 'colsample_bytree':[0.3], 'min_child_width': [1],
              'gamma':[0], 'alpha':[0.001], 'reg_lambda':[0.001]}
XGB = XGBClassifier()
CV = GridSearchCV(XGB, parameters, cv=3, scoring='roc_auc', n_jobs=-1)

# Hyperparameter tuning to find the best parameters, y_smote_tuning)
print("The best parameters are:", CV.best_params_)

Output: The best parameters are: {'alpha': 0.001, 'colsample_bytree': 0.3, 'gamma': 0, 'learning_rate': 0.2, 'max_depth': 6, 'min_child_width': 1, 'n_estimators': 90, 'reg_lambda': 0.001, 'subsample': 0.5}

# Fit the model with the best parameters and get the AUC
XGB = XGBClassifier(n_estimators = CV.best_params_["n_estimators"], max_depth = CV.best_params_["max_depth"], 
                    learning_rate = CV.best_params_["learning_rate"], colsample_bytree = CV.best_params_["colsample_bytree"], 
                    subsample = CV.best_params_["subsample"], min_child_width = CV.best_params_["min_child_width"],
                    gamma = CV.best_params_["gamma"], alpha = CV.best_params_["alpha"], 
                    reg_lambda = CV.best_params_["reg_lambda"])
Model =, y_smote_train)
AUC = Get_AUC(Model, X_smote_test, y_smote_test)
print("AUC =", '{:.2f}%'.format(AUC))

Output: 99.18

Tuning Process

So here is the tuning process that I did for XG Boost model, for the above data, using the above code.

Step 1. Broad ranges on the top 3 parameters

First, I read the expected values for the parameters from the guides (see the Reference section above).
Note: In this article when I say parameters I mean hyperparameters.

Then, using the Grid Search cross validation I set the parameters in very broad ranges as follows:

  • n_estimators: 10, 100, 500
  • max_depth: 3, 10, 30
  • learning_rate: 0.01, 0.1, 1

I used only 20k data out of 284,807 transactions so the cross validation process didn’t take hours but only minutes. I tried with 10k, 20k, 50k samples and found that 10k results didn’t represent the whole training data (284k), 50k and above were very slow, but 20k is fast enough and yet it is representative.

I would recommend trying only 3 values for each parameter and only the 3 parameters above to begin with. This way it would take 10 minutes. These 3 parameters are the most influencing factors, we need to nail them down first. They are mentioned in the Reference section above.

Step 2. Narrow down the top 3 parameters

I then narrow down the range of these 3 parameters. For example, for n_estimators out of 10, 100, 500, the Grid Search shows that the best value was 100. So I changed the grid search with 80, 100, 120. Still getting 100 as the best parameter so I did a grid search with 90, 100, 110 and got 90. Finally I did the grid search with 85, 90, 95 and it still gave out 90 as the best n_estimators so that was my final value for this parameter.

But I understood there was interaction between the parameter so when tuning n_estimators I included the max_depth of 3, 10, 30 and learning_rate of 0.01, 0.1, 1. And when the n_estimator was settled at 90, I started narrowing down the max_depth (which was giving out 10) to 7, 10, 14. The result was 7 so I narrowed it down to 6, 7, 8. The result was 6 and that was the final value for this max_depth.

For the learning_rate I started with 0.01, 0.1, 1 and the best was 0.1. Then 0.05, 0.1, 0.2 and the best was 0.2. Tried 0.15, 0.2, 0.25 and the best was 0.2 so that was the final value for the learning_rate.

So the top 3 parameters are: n_estimators = 90, max_depth = 6, learning_rate = 0.2. The max_depth = 6 was the same as the default value, so I could have not used this parameter if I wanted to.


Note that I didn’t put all the possible ranges/values for all 3 parameters into a grid search CV and let it run for the whole night. It’s all manual and I nailed down the parameters one by one, which only took about an hour. Manual is a lot quicker because from the prevous run I knew the optimum range of parameters, so I could narrow it down further. It’s a very controlled and targetted process, that’s why it’s quick.

Also note that I used only 20k data for tuning, but for getting AUC I fit the full training data and predicted using the full test data.

Step 3. The next 3 parameters

With the top 3 parameters fixed, I tried the next 3 parameters as follows:

  • colsample_bytree: 0.1, 0.5, 0.9
  • subsample: 0.1, 0.5, 0.9
  • min_child_width: 1, 5, 10

I picked these 3 parameters were based on the guidelines given by the XGBoost developers and the blog posts which are in the Reference section above.

The results are as follows: the optimum parameters = colsample_bytree = 0.3, subsample = 0.5, min_child_width = 1. This gives an AUC of 98.69%.

For the explanation about what these parameters are, please refer to the XGBoost documentation here: link.

It is possible that the AUC is lower than the AUC from the previous step. In this case I tried the values for that parameters manually using the full training data. For example, with tuning data (20k) the best min_child_width was 0 but this gives AUC of 98.09% which was lower than the previous AUC value before using min_child_width (98.69%). So I tried 0, 1 and 2 values of min_child_width using the full training data. In other words, the tuning data (20k) is good for narrowing down from the broad range to narrow range, but when it’s narrow range we might need to use the full training data. To do this I replaced the “XGB = … “ in last cell with this:

XGB = XGBClassifier(n_estimators = 90, max_depth = 6, learning_rate = 0.2, 
                    colsample_bytree = 0.3, subsample = 0.5, min_child_width = 1)

Step 4. Three regularisation parameters

Reading from the guides from the Reference section above, it seems that the next 3 most important parameters are gamma, alpha and lambda. They are the regularisation parameters and their value ranges are in the XGBoost documentation (link).

  • gamma: 0, 1, 10. Optimum value: 0.
  • alpha: 0, 10, 1000. Optimum value: 0.001
  • reg_lambda: 0.1, 0.5, 0.9. Optimum value: 0.001

After tuning with these 3 paramters, the AUC increased to 99.18%.

I confirmed the result by replacing the XGB = … in the last cell with this:

XGB = XGBClassifier(n_estimators = 90, max_depth = 6, learning_rate = 0.2, 
                    colsample_bytree = 0.3, subsample = 0.5, min_child_width = 1,
                    gamma = 0, alpha = 0.001, reg_lambda = 0.001)

Note on the imbalanced data

XGBoost has 2 parameters to deal with imbalanced data: scale_pos_weight and max_delta_step. You can read how to implement them in the XG Boost documentation: link.

I did use them, trying the scale_pos_weight values of 1, 10,100 and the optimum value was 10, but it only gave AUC of 96.83%.

So I tried different approaches for handling imbalance data, i.e. random oversampling, SMOTE and ADASYN. SMOTE gave the best result, i.e. the AUC of 99.18% above.

Note on the data skewness

The credit card fraud data is skewed, meaning it is not distributed normally. This is particularly so with the amount feature, which is distributed differently between the left of the mean and the right of the mean. A few other features such as V3 are also like that.

I used PowerTransformer to fix the data skewness, as you can see in the above code. I fixed the skewness separately between the training data and test data. So I split the data first, and then fix the skewness. This is better than fixing the skewness first because when afterwards the data is split, then it would become skewed.

Note on the stratified sampling

Because the data is very imbalanced, I use stratified sampling so that the ratio between the 2 classes are kept the same between the training data and the test data. I use 80-20% split rather than 70-30% split to give the model more data to learn, and because 20% is one fifth which is large enough unseen data to test the trained model against.

I don’t believe 10% test data is fair enough testing, in my opinion 20% is the minimum we should not go lower than that, not even 15%. I verified this in Kaggle i.e. that most practices in Kaggle are using test data of 20%, 25% or 30%. I didn’t see any one uses test data lower than 20% or higher than 30%.

Note on deleting the time column

The time column is not the time of day as in 8am or 9pm. It is the number of seconds elapsed between this transaction and the first transaction in the dataset (link). The distribution of the time column on class 0 and class 1 shows that the frauds can happen at any time:

And there is no correlation between time and class:

And by the way, the credit card transaction data is only for 2 days. So there is no enough time to form a pattern for the time of day.

So those are my reasons for deleting the time column.

But what Nimrod said on Linked In made me tried again. He said: Great read Vincent. I wonder though have you checked the time column before dropping it? I get that fraud can happen at any time, but perhaps some times of the day are more densely packed with fraudulent transaction? (link)

So I downloaded the time column and the class column into Excel. Divide the time column by (3600 x 24) which is the number of seconds in an hour and the number of hours in a day to get it in “day unit”. This “day unit” ranges from 0 to 1.9716 because there are only 2 days worth of transactions.

I then took the decimal part of the day unit, which is when the fraud happen during the day (value between 0 and 1). Multiplied by 24 I get the hour in the day. And it looks like this when I graph the number of frauds happened against the hour in the day:

Note that in the above chart 8 does not mean 8am and 21 does not mean 9pm. 8 means 8 hours from the first transaction, 21 means 21 hours from the first transaction. But we can see clearly that the fraud is high on 2nd hour and 11th hour. We need to remember though that the data is only 2 days worth of transactions. But still, it clearly shows that some times of the day are more densely packed with fraudulent transaction, just as Nimrod says (link). So I shouldn’t delete the time column actually, but convert it to the time of day.

16 April 2021

Logistic Regression with PCA in Python

Filed under: Data Science,Machine Learning — Vincent Rainardi @ 8:31 pm

Logistic Regression means predicting a catagorical variable, without losing too much information. For example, whether a client will invest or not. JavaTPoint provies a good, short overview on Logistic Regression: link. Jurafsky & Martin from Stanford provide a more detailed view, along with the mathematics: link. Wikipedia provides a comprehensive view, as always: link.

In this article I will be writing how to do Linear Regression in Python. I won’t be explaining what it is, but only how to do it in Python.

PCA means Principal Component Analysis. When we have a lot of variables, we can reduce them using PCA. Matt Berns provide a good overview and resources: link. Lindsay Smith from Otago provides a good academic overview: link. And as always, Wikipedia provies a comprehensive explanation: link.

I think it would be good to kill two birds with on stone. So in this article I will build 2 Logistic Regression models, one with PCA and one without PCA. This way it will provide examples for both cases.

One of the weaknesses of PCA is that we won’t know which variables are the top predictors. To know the top predictors we will have to build the Linear Regression model without PCA. As we don’t use PCA, to reduce the number of variables I will use RFE + manual (see here for an example on reducing variables using RFE + manual on Linear Regression). One of the advantages of PCA is that we don’t need to worry about multicollinearity in the data (highly correlated features). So on the second model where I don’t use PCA, I have to handle the multicollinearity, i.e. remove the highly correlated features using VIF (Variance Inflation Factor).

There are 5 steps:

  1. Data preparation
    • Load and understand the data
    • Fix data quality issues
    • Data conversion
    • Create derived variables
    • Visualise the data
    • Check highly correlated variables
    • Check outliers
    • Handle class imbalance (see here)
    • Scaling the data
  2. Model 1: Logistic Regression Model with PCA
    • Split the data into X and y
    • Split the data into training and test data set
    • Decide the number of PCA components based on the explained variance
    • Train the PCA model
    • Check the correlations between components
    • Apply PCA model to the test data
    • Train the Logistic Regression model
  3. Model evaluation for Model 1
    • Calculate the Area Under the Curve (AUC)
    • Calculate accuracy, sensitivity & specificity for different cut off points
    • Choose a cut off point
  4. Model 2: Logistic Regression Model without PCA
    • Drop highly correlated columns
    • Split the data into X and y
    • Train the Logistic Regression model
    • Reduce the variables using RFE
    • Remove one variable manually based on the P-value and VIF
    • Rebuild the model
    • Repeat the last 2 steps until P value < 0.05 and VIF < 5
  5. Model evaluation for Model 2
    • Calculate the Area Under the Curve (AUC)
    • Calculate accuracy, sensitivity & specificity for different cut off points and choose a cut off point
    • Identify the most important predictors

Step 1 is long and is not the core of this article so I will be skipping Step 1 and go directly into Step 2. Step 1 is common to various ML scenario so I will be writing it in a separate article and put the link in here so you can refer to it. One part in step 1 is about handling class imbalance, which I’ve written here: link.

Let’s start.

Step 2. Model 1: Logistic Regression Model with PCA

# Split the data into X and y
y = high_value_balanced.pop("target_variable")
X = high_value_balanced

# Split the data into training and test data set
from sklearn.model_selection import train_test_split 
X_train, X_test, y_train, y_test = train_test_split(X,y, train_size=0.7,test_size=0.3,random_state=42)

# Decide the number of PCA components based on the retained information
pca = PCA(random_state=88)
explained_variance = np.cumsum(pca.explained_variance_ratio_)
plt.vlines(x=80, ymax=1, ymin=0, colors="r", linestyles="--")
plt.hlines(y=0.95, xmax=120, xmin=0, colors="g", linestyles="--")

We can see above that to retain 95% explained variance (meaning we retain 95% of the information) we need to use 80 PCA components. So we build the PCA model with 80 components.

# Train the PCA model 
pca_final = IncrementalPCA(n_components=80)
df_train_pca = pca_final.fit_transform(X_train)

# Note that the above can be automated like this: (without using plot)
pca_final = PCA(0.95)
df_train_pca = pca_again.fit_transform(X_train)

# Check the correlations between components
corr_mat = np.corrcoef(df_train_pca.transpose())

As we can see in the heatmap above, all of the correlations are near zero (black). This one of the key features of PCA, i.e. the transformed features are not correlated to one another, i.e. their vectors are orthogonal to each other.

# Apply PCA model to the test data
df_test_pca = pca_final.transform(X_test)

# Train the Logistic Regression model
LR_PCA_Learner = LogisticRegression()
LR_PCA_Model =, y_train)

Step 3. Model evaluation for Model 1

# Calculate the Area Under the Curve (AUC)
pred_test = LR_PCA_Model.predict_proba(df_test_pca)
"{:2.2}".format(metrics.roc_auc_score(y_test, pred_test[:,1]))

# Calculate the predicted probabilities and convert to dataframe
y_pred = LR_PCA_Model.predict_proba(df_test_pca)
y_pred_df = pd.DataFrame(y_pred)
y_pred_1 = y_pred_df.iloc[:,[1]]
y_test_df = pd.DataFrame(y_test)

# Put the index as ID column, remove index from both dataframes and combine them
y_test_df["ID"] = y_test_df.index
y_pred_1.reset_index(drop=True, inplace=True)
y_test_df.reset_index(drop=True, inplace=True)
y_pred_final = pd.concat([y_test_df,y_pred_1],axis=1)
y_pred_final = y_pred_final.rename(columns = { 1 : "Yes_Prob", "target_variable" : "Yes" } )
y_pred_final = y_pred_final.reindex(["ID", "Yes", "Yes_Prob"], axis=1)

# Create columns with different probability cutoffs 
numbers = [float(x)/10 for x in range(10)]
for i in numbers:
    y_pred_final[i]= x: 1 if x > i else 0)

# Calculate accuracy, sensitivity & specificity for different cut off points
Probability = pd.DataFrame( columns = ['Probability', 'Accuracy', 'Sensitivity', 'Specificity'])
for i in numbers:
    CM = metrics.confusion_matrix(y_pred_final.Yes, y_pred_final[i] )
    Total = sum(sum(CM))
    Accuracy    = (CM[0,0]+CM[1,1])/Total
    Sensitivity = CM[1,1]/(CM[1,1]+CM[1,0])
    Specificity = CM[0,0]/(CM[0,0]+CM[0,1])
    Probability.loc[i] =[ i, Accuracy, Sensitivity, Specificity]
Probability.plot.line(x='Probability', y=['Accuracy','Sensitivity','Specificity'])

Choose a cut off point

Different applications have different priorities when choosing the cut off point. For some applications, the true positives is more important and the true negative doesn’t matter. In this case we should use sensitivity as the evaluation criteria, and choose the cut off point to make the sensitivity as high as possible e.g. probability = 0.1 which gives sensitivity almost 100%.

For some applications, the true negative is more important and the true positive doesn’t matter. In these case we should use specificity as the evaluation criteria, and choose the cut off point to make the sensitivity as high as possible, e.g. probability = 0.9 which gives specificity almost 100%.

For most applications the true positive and the true negative are equally important. In this case we should use accuracy as the evaluation criteria, and choose the cut off point to make the accuracy as high as possible, e.g. probability = 0.5 which gives accuracy about 82%.

In most cases it is not the above 3 extreme, but somewhere in the middle, i.e. the true positive is very important but the true negative also matters, even though not as important as the true positive. In this case we should choose the cut off point to make sensitivity high but the specificity not too low. For example, probability = 0.3 which gives sensitivity about 90%, specificity about 65% and accuracy about 80%.

So let’s do the the last paragraph, cut off point = 0.3

y_pred_final['predicted'] = lambda x: 1 if x > 0.3 else 0)
confusion_matrix = metrics.confusion_matrix( y_pred_final.Yes, y_pred_final.predicted )

We get sensitivity = 91.8%, specificifity = 65.9%, accuracy = 78.9%

Step 4. Model 2: Logistic Regression Model without PCA

# Drop highly correlated columns
data_corr = pd.DataFrame(data.corr()["target_variable"])
data_corr = data_corr[data_corr["target_variable"] != 1]
churn_corr.sort_values(by=["abs_corr"], ascending=False).head(5)

# Split the data into X and y, and normalise the data 
y = data.pop("target_variable")
normalised_data = (data - data.mean())/data.std()
X = normalised_data

# Train the Logistic Regression model
X_train, X_test, y_train, y_test = train_test_split(X,y, train_size=0.7,test_size=0.3,random_state=88)
LR_model = LogisticRegression(max_iter = 200), y_train)

# Reduce the variables using RFE
RFE_model = RFE(LR_model, n_features_to_select = 15)
RFE_model =, y_train)
selected_columns = X_train.columns[RFE_model.support_]

# Rebuild the model
X_train_RFE = X_train[selected_columns], y_train)
LR2 = sm.GLM(y_train,(sm.add_constant(X_train_RFE)), family = sm.families.Binomial())
LR_model2 =

# Check the VIF
from statsmodels.stats.outliers_influence import variance_inflation_factor
Model_VIF = pd.DataFrame()
Model_VIF["Variable"] = X_train_RFE.columns
number_of_variables = X_train_RFE.shape[1]
Model_VIF["VIF"] = [variance_inflation_factor(X_train_RFE.values, i) for i in range(number_of_variables)]
Model_VIF.sort_values(by="VIF", ascending=False)

# Remove one variable manually based on the P-value and VIF
X_train_RFE.drop(columns=["column8"], axis=1, inplace=True)
LR2 = sm.GLM(y_train,(sm.add_constant(X_train_RFE)), family = sm.families.Binomial())
LR_Model2 =

Repeat the last 2 steps until P value < 0.05 and VIF < 5.

Step 5. Model evaluation for Model 2

# Calculate the Area Under the Curve (AUC)
df_test = LR_Model2.transform(X_test)
pred_test = LR_Model2.predict_proba(df_test)
"{:2.2}".format(metrics.roc_auc_score(y_test, pred_test[:,1]))

Calculate accuracy, sensitivity & specificifity for different cut off points and choose a cut off point:

See “choose cut off point” section above

Identify the most important predictors:

From the model output above, i.e. “LR_Model2.summary()” we can see the most important predictors.

15 April 2021

Linear Regression in Python

Filed under: Data Science,Machine Learning — Vincent Rainardi @ 6:53 am

Linear Regression is about predicting a numerical variable. There are 5 steps when we do it in Python:

  1. Prepare the data
    • Load and understand the data
    • Fix data quality issues
    • Remove non required columns
    • Visualise and analyse the data
    • Identify highly correlated columns and remove them
    • Create derived variables
    • Create dummy variables for categorical variables
  2. Build the model
    • Split the data into training data and test data
    • Scale the numerical variables in the training data
    • Split the data into y and X
    • Automatically choose top 15 features using RFE (Recursive Feature Elimination)
    • Manually drop features based on P-value and VIF (Variance Inflation Factor)
    • Rebuild the model using OLS (Ordinary Least Squares)
    • Repeat the last 2 steps until all variables have P-value < 0.05 and VIF < 5
  3. Check the distribution of the error terms
  4. Make predictions
    • Scale the numberical variables in the test data
    • Remove the dropped features in the test data
    • Make predictions based on the test data
  5. Model evaluation
    • Plot the predicted vs actual values
    • Calculate R2, Adjusted R2 and F statistics
    • Create the linear equation for the best fitted line
    • Identify top predictors

Below is the Python code for the above steps. I will skip step 1 (preparing the data) and directly go to step 2 because step 1 is common to all ML models (not just linear regression) so I will write it in a separate article.

2. Build the model

# Split the data into training data and test data 
from sklearn.model_selection import train_test_split
df_train, df_test = train_test_split(df_data, train_size = 0.7, test_size = 0.3, random_state = 100)

# Scale the numerical variables in the training data
from sklearn.preprocessing import MinMaxScaler
minmax_scaler = MinMaxScaler()
continuous_columns = ["column1", " column2", " column3", " column4"]
df_train[continuous_columns] = minmax_scaler.fit_transform(df_train[continuous_columns])

# Split the data into y and X
y_train = df_train.pop("count")
x_train = df_train

# Automatically choose top 15 features using RFE
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
data_LR = LinearRegression(), y_train)
data_RFE = RFE(data_LR, 15)             
data_RFE =, y_train)

# Check which columns are selected by RFE and which are not
selected_columns = x_train.columns[data_RFE.support_]
unselected_columns = x_train.columns[~data_RFE.support_]

# Train the model based on the columns selected by RFE
# and check the coefficients, R2, F statistics and P values
x_train = x_train[selected_columns] 
import statsmodels.api as data_stat_model  
x_train = data_stat_model.add_constant(x_train) 
data_OLS_result = data_stat_model.OLS(y_train, x_train).fit() 

# Calculate the VIF (Variance Importance Factor) 
from statsmodels.stats.outliers_influence import variance_inflation_factor
data_VIF = pd.DataFrame()
data_VIF['variable'] = x_train.columns
number_of_variables = x_train.shape[1]
data_VIF['VIF'] = [variance_inflation_factor(x_train.values, i) for i in range(number_of_variables)]
data_VIF.sort_values(by="VIF", ascending=False) 

# Drop one column and rebuild the model
# And check the coefficients, R-squared, F statistics and P values
x_train.drop(columns=["column5"], axis=1, inplace=True)
x_train = bike_stat_model.add_constant(x_train)
data_OLS_result = data_stat_model.OLS(y_train, x_train).fit()

# Check the VIF again
data_VIF = pd.DataFrame()
data_VIF['variable'] = x_train.columns
number_of_variables = x_train.shape[1]
data_VIF['VIF'] = [variance_inflation_factor(x_train.values, i) for i in range(number_of_variables)]
data_VIF.sort_values(by="VIF", ascending=False) 

Keep dropping one column at a time and rebuild the model until all variables have P value < 0.05 and VIF < 5.

The result from print(data_OLS_result.summary()) is something like this, where we can see the R2 and Adjusted R2 of the training data:

                            OLS Regression Results                            
Dep. Variable:                  count   R-squared:                       0.841
Model:                            OLS   Adj. R-squared:                  0.838
Method:                 Least Squares   F-statistic:                     219.8
Date:                Tue, 29 Dec 2020   Prob (F-statistic):          6.03e-190
Time:                        09:07:14   Log-Likelihood:                 508.17
No. Observations:                 510   AIC:                            -990.3
Df Residuals:                     497   BIC:                            -935.3
Df Model:                          12                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
const          0.2444      0.028      8.658      0.000       0.189       0.300
column1        0.2289      0.008     28.108      0.000       0.213       0.245
column2        0.1258      0.011     10.986      0.000       0.103       0.148
column3        0.5717      0.025     22.422      0.000       0.522       0.622
column4       -0.1764      0.038     -4.672      0.000      -0.251      -0.102
column5       -0.1945      0.026     -7.541      0.000      -0.245      -0.144
column6       -0.2362      0.026     -8.946      0.000      -0.288      -0.184

3. Check the distribution of the error terms

In linear regression we assume that the error term follows normal distribution. So we have to check this assumption before we can use the model for making predictions. We check this by looking at the histogram of the error term visually, making sure that the error terms are normally distributed around zero and that the left and right side are broadly similar.

fig = plt.figure()
y_predicted = data_OLS_result.predict(x_train)
sns.distplot((y_train - y_predicted), bins = 20)
fig.suptitle('Error Terms', fontsize = 16)

4. Making predictions

# Scale the numberical variables in the test data (just transform, no need to fit)
df_test[continuous_columns] = minmax_scaler.transform(df_test[continuous_columns])

# Split the test data into X and y
y_test = df_test.pop('count')
x_test = df_test

# Remove the features dropped by RFE and manual process
x_test = x_test[selected_columns]
x_test = x_test.drop(["column5", "column6", "column7"], axis = 1)

# Add the constant variable to test data (because by default stats model line goes through the origin)
x_test = data_stat_model.add_constant(x_test)

# Make predictions based on the test data
y_predicted = data_OLS_result.predict(x_test)

5. Model Evaluation

Now that we have built the model, and use the model to make prediction, we need to evaluate the performance of the model, i.e. how close the predictions are to the actual values.

# Compare the actual and predicted values
fig = plt.figure()
plt.scatter(y_test, y_predicted)
fig.suptitle('Compare actual (Y Test) vs Y predicted', fontsize = 16)
plt.xlabel('Y Test', fontsize = 14)
plt.ylabel('Y Predicted', fontsize = 14)
  • We can see here that the Y Predicted and the Y Test have linear relation, which is what we expect.
  • There are a few data points which deviates from the line, for example the one on the lower left corner.

We can now calculate the R2 score on the test data like this:

from sklearn.metrics import r2_score
r2_score(y_test, y_predicted)

We can also calculate the Adjusted R2 like this:

Based on the coefficient values from the OLS regression result we construct the linear equation for the best fitted line, starting from the top predictors like this:

                 coef    std err          t      P>|t|      [0.025      0.975]
const          0.2444      0.028      8.658      0.000       0.189       0.300
column1        0.2289      0.008     28.108      0.000       0.213       0.245
column2        0.1258      0.011     10.986      0.000       0.103       0.148
column3        0.5717      0.025     22.422      0.000       0.522       0.622
column4       -0.1764      0.038     -4.672      0.000      -0.251      -0.102
column5       -0.1945      0.026     -7.541      0.000      -0.245      -0.144
column6       -0.2362      0.026     -8.946      0.000      -0.288      -0.184

y = 0.2444 + 0.5717 column3 – 0.2362 column6  + 0.2289 column1 – 0.1945 column5 – 0.1764 column4 + …

Based on the absolute value of the coefficients we can see that the top 3 predictors are column3, column6 and column1. It is very useful in any machine learning project to know the top predictors, i.e. the most influencing features because then we can take business action to ensure that those features are maximised (or minimised if the coefficient is negative).

Now we have the linear regression equation we can use this equation for predict the target variable for any given input values.

7 April 2021

Handling Class Imbalance

Filed under: Data Science,Machine Learning — Vincent Rainardi @ 7:17 am

In this article I will explain a few ways to treat class imbalance in machine learning. I will also give some examples in Python.

What is class imbalance?

Imagine if you have a data set containing 2 classes: 100 class A and 100 class B. This is called a balanced data set. But if those 2 classes are 5000 class A and 100 class B that is an imbalanced data set. This is not limited to 2 classes, but can happen on more than 2 classes. For example: class A and B both have 5000 members, whereas class C and D both have 100 members.

In an imbalance data set, the class with fewer members is called the minority class. The class with much more members is called the majority class. So if class A has 5000 members and class B 100 members, class A is the majority class and class B is the minority class.

Note that the “class” here is the target variable, not the independent variable. So the target variable is a categorical variable, not a continuous variable. A case where the target data set has 2 classes like above is called “binary classification” and it is quite common in machine learning.

At what ratio it is called class imbalance?

There is no exact definition on the ratio. If class A is 20% of class B I would call it imbalance. Whereas if class A is 70% of class B I would call it balance. 50% I would say is a good bet. It is wrong to dwell on finding the precise ratio range because each data set and each ML algorithm is diffferent. Some cases have bad results at 40%, some cases are ok with 40%.

Why class imbalance occurs

Some data is naturally imbalance, because one class happens rarely in nature, whereas the other happens frequently. For example: cancer, fraud, spam, accidents. The number of people with cancer are naturally much less than those without. The number of fraudulant credit card payments are naturally much less than good payments. The number of spam emails are much less than good emails. The number of flight having accidents are naturally much less than good flights.

Why class imbalance needs to be treated

Some machine learning algorithms don’t work well if the target variable is imbalanced, because during training the majority class would be favoured. As a result the model would be skewed toward the majority class. This situation is an issue because in most cases what we are interested in is predicting the minority class. For example: predicting that a transaction is a fraud, or that an email is a spam, is more important than predicting the majority class.

That is the reason why class imbalance needs to be treated. Because the model would be skewed towards the majority class, and we need to predict the minority class.

How to treat class imbalance

We resolve this situation by oversampling the minority class or by undersampling the majority class.

Oversampling the minority class means we randomly choose sample data from the minority class many times, whereas on the majority class we don’t do anything.

For example if class A has 5000 members and class B has 100 members, we resample class B 4950 times. Meaning that we pick data randomly from class B 4950 times. Effectively it is like duplicating class B data 50 times.

Undersampling the minority class means that we randomly selecting data from the majority class as many times as the minority class. In the above example we randomly pick 100 samples from class A, so that both class A and class B have 100 members.

Apart from randomly selecting data there are many other techniques, including:

  • Creating a new samples (called synthetic data)
  • Selecting samples not randomly but favouring samples which are misclassified
  • Selecting samples not randomly but favouring samples which resembles the other class

Jason Brownlee explained several other techniques such as SMOTE, Borderline Oversampling, CNN, ENN, OSS in this article: link.

Python examples

1. Random Oversampling

# Import resample from the Scikit Learn library
from sklearn.utils import resample

# Put the majority class and minority class on separate dataframes
majority_df = df[df["fraud"]==0]
minority_df = df[df["fraud"]==1] 

# Oversampling the minority class randomly
new_minority_df = resample( minority_df, replace = True, 
                            n_samples = len(majority_df), 
                            random_state = 0 )

# Combine the new minority class with the majority class
balanced_df = pd.concat([majority_df, new_minority_df])

2. Synthetic Minority Oversampling Technique (SMOTE)

# Import SMOTE from the Imbalance Learn library
from imblearn.over_sampling import SMOTE

# Oversampling the minority class using SMOTE
s = SMOTE()
X_new, y_new = s.fit_resample(X, y)

Jason Brownlee illustrates very well which part of the minority class got oversampled by SMOTE in this article: link. Please notice how the minority class differs on the first 3 plots in his article. We can see clearly how SMOTE with random undersampling is better than SMOTE alone or random undersampling alone.

Next Page »

Blog at