Data Warehousing and Data Science

16 February 2022

Forecasting time series: using statistics vs machine learning

Filed under: Data Science,Machine Learning — Vincent Rainardi @ 6:59 am

This article outlines how ARIMA and LSTM are used for forecasting time series, and which one is better.
A lot of references are available at the end of this article for those who would like to read further.


In ML, we use regression to predict the values of a variable (y) based on the values of other variables (x1, x2, x3, …). For example, we predict the stock price of a company based on its financial ratios, fundamentals and ESG factors.

In time series forecasting, we predict the values of a variable in the future based on the values of that variable in the past. For example, we predict the stock price of a company based on its past prices.

A time series is a sequence of numbers, each collected at a regular time interval.

How do we forecast a time series? There are 2 ways: a) using statistics, b) using machine learning. In this article I’ll give a brief explanation of both. But before that let’s clear up one thing first: is “time series” plural or singular?

Time Series: plural or singular?

A time series is a sequence of numbers like this 1, 2, 3, 4, 5, … This is one time series, not one time serie.

We can have two time series like this: 1, 2, 3, 4, 5, … and 6, 7, 8, 9, 10, … These are two time series, not two time serieses.

So the singular form is “series” and the plural form is also “series”, not “serieses”. The word “series” is both singular and plural. See Merriam-Webster dictionary explanation in Ref #1 below.

Forecasting a time series means finding out what the next numbers in one series are (1, 2, 3, 4, 5, …)

Forecasting two time series means finding out what the next numbers in two series are (1, 2, 3, 4, 5, … and 6, 7, 8, 9, 10, …)

Forecasting time series using statistics

We can use regression to forecast a time series. We can also use Moving Average to forecast a time series.

Auto-Regressive model (AR)

Using regression, we use the past values of the forecast variable as the input variables, which is why this method is called the Auto-Regressive (AR) model. It is called “auto” because the input variables are the past values of the forecast variable itself.

yt = c + c1yt-1 + c2yt-2 + c3yt-3 + ϵt

where yt-1, yt-2, yt-3 are the past values of y, and c, c1, c2, c3 are constants.

ϵt = white noise. It is a sequence of random numbers, with a mean of zero and a constant standard deviation over time.
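To make the AR idea concrete, here is a minimal numpy-only sketch (not the full statsmodels machinery): we simulate an AR(2) series with known constants and then recover them by ordinary least squares, regressing yt on its own past values. All the numbers and variable names here are made up for illustration.

```python
import numpy as np

# Simulate an AR(2) series: y_t = c + c1*y_{t-1} + c2*y_{t-2} + eps_t
rng = np.random.default_rng(42)
c, c1, c2 = 1.0, 0.5, -0.3
y = np.zeros(2000)
for t in range(2, len(y)):
    y[t] = c + c1 * y[t-1] + c2 * y[t-2] + rng.normal(0, 1)  # eps_t = white noise

# Auto-regression: predict y_t from [1, y_{t-1}, y_{t-2}] by least squares
X = np.column_stack([np.ones(len(y) - 2), y[1:-1], y[:-2]])
target = y[2:]
coef, *_ = np.linalg.lstsq(X, target, rcond=None)
print(coef)  # estimates of c, c1, c2: roughly [1.0, 0.5, -0.3]
```

The fitted coefficients land close to the true constants, which is the whole point of the “auto” in auto-regressive: the series explains itself.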

Moving Average model (MA)

Using the Moving Average model, the forecast variable is the mean of the series plus a combination of error terms.

yt = μ + ϵt + a1ϵt-1 + a2ϵt-2 + a3ϵt-3

where ϵt is the white noise error term, μ is the mean and a1, a2, a3 are constants.

It is called moving average because we start with the average (mean), then keep moving/shifting the average by the error terms (the epsilons).

I need to emphasise here that the Moving Average model is not the Moving Average analysis that we use for stock prices, where we simply calculate the average of the stock price over the last 20 days.
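A quick numpy sketch of the MA idea (illustrative, with made-up constants): an MA(1) series is the mean plus the current and one lagged white-noise error term, so the series fluctuates around μ.

```python
import numpy as np

# Simulate an MA(1) series: y_t = mu + eps_t + a1*eps_{t-1}
rng = np.random.default_rng(0)
mu, a1 = 10.0, 0.6
eps = rng.normal(0, 1, 500)        # white noise error terms
y = mu + eps[1:] + a1 * eps[:-1]   # each y_t uses eps_t and eps_{t-1}
print(y.mean())  # close to mu = 10
```

Note the contrast with the stock-chart “20-day moving average”: here nothing is averaged over a window; the model is the mean shifted by error terms.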

ARMA model

The ARMA model is the combination of the Auto-Regressive model and the Moving Average model; that is why it is called ARMA, where the AR bit means Auto-Regressive and the MA bit means Moving Average. So we forecast using the previous values of the forecast variable (the Auto-Regressive part), plus the mean and the error terms (the Moving Average part).

ARMA has 2 parameters, i.e. ARMA(p,q)
where p = order of the autoregressive part and q = order of the moving average part.
Whereas AR and MA each have 1 parameter, i.e. AR(p) and MA(q).

ARIMA model

The ARIMA model is the ARMA model plus differencing. Differencing means creating a new series by taking the difference between the value at t and the value at (t-1).

For example, from this series: 0, 1, 3, 2, 3, 3, … (call it y)
We can make a new series by taking the difference between the numbers: 1, 2, -1, 1, 0, … (call it y’)
We can take the difference again (called second order differencing): 1, -3, 2, -1, … (call it y’’)
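The differencing example above can be reproduced with numpy’s diff function:

```python
import numpy as np

# The series from the text, and its first and second order differences
y = np.array([0, 1, 3, 2, 3, 3])
y1 = np.diff(y)        # first order differencing: y'
y2 = np.diff(y, n=2)   # second order differencing: y''
print(y1.tolist())  # [1, 2, -1, 1, 0]
print(y2.tolist())  # [1, -3, 2, -1]
```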

The I in ARIMA stands for Integrated. Integrated here means Differencing.

So the difference between the ARMA model and the ARIMA is: in ARMA we use y, whereas in ARIMA we use y’ or y’’.

In the ARIMA model we use the AR model and the MA model on y’ or y’’.

ARIMA has 3 parameters, i.e. ARIMA(p,d,q)
where p = order of the autoregressive part, d = degree of differencing, and q = order of the moving average part.


SARIMAX model

The S in SARIMAX means Seasonal and the X means Exogenous.

Seasonal means that the series has a repeating pattern from season to season. For example, a series can be decomposed into a trend part, a seasonal part and a random part, where the seasonal part has a repeating pattern. Source: Ref #5.

The SARIMAX model includes the seasonal part as well as the non-seasonal part.

SARIMAX has 7 parameters, i.e. SARIMAX(p,d,q)x(P,D,Q,s)

where p, d, q are as defined above, P, D, Q are the seasonal counterparts of p, d, q, and s is the number of seasons per year, e.g. s = 12 for monthly data and s = 4 for quarterly data.

In time series, an exogenous variable is a parallel time series which is used as a weighted input to the model (Ref #6).

The exogenous variable is one of the parameters in SARIMAX. In Python (statsmodels library), the parameters for SARIMAX are:

SARIMAX (y, X, order=(p, d, q), seasonal_order=(P, D, Q, s))

where y is the time series, X is the Exogenous variable/factor, and the others are as described before.

Forecasting time series using machine learning

The area of machine learning which deals with temporal sequences is called Recurrent Neural Networks (RNN). A temporal sequence is anything which has a time element (a series of things happening one after the other), such as speech, handwriting, images, video. And that includes time series, of course.

An RNN is a neural network which has an internal memory, which is why it is able to recognise patterns in time series. There are many RNN models, such as the Elman network, Jordan network, Hopfield network, LSTM and GRU.

The most widely used method for forecasting a time series is LSTM. An LSTM cell has 3 gates: an input gate, an output gate and a forget gate.

In the standard LSTM cell diagram, the horizontal line at the top (from ct-1 to ct) is the cell state. It is the memory of the cell. Along this line, 3 things happen: the cell state is multiplied by the “forget gate”, increased/reduced by the “input gate”, and finally the value is passed to the “output gate”.

  • The forget gate removes unwanted information from the cell state (c), based on the previous output (ht-1) and the current input (xt).
  • The input gate adds new information to the cell state. The current input (xt) and the previous output (ht-1) pass through a σ and a tanh, are multiplied, then added to the cell memory line.
  • The output gate calculates the output from the cell state (c), the previous output (ht-1) and the current input (xt).
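The three gates above can be sketched in plain numpy as a single LSTM step. This is illustrative only: the weights here are random (in practice they are learned), and the function and variable names are my own, not a library API.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step. W, U, b hold weights for the forget gate (f),
    input gate (i), candidate values (g) and output gate (o)."""
    f = sigmoid(W["f"] @ x_t + U["f"] @ h_prev + b["f"])  # forget gate
    i = sigmoid(W["i"] @ x_t + U["i"] @ h_prev + b["i"])  # input gate
    g = np.tanh(W["g"] @ x_t + U["g"] @ h_prev + b["g"])  # candidate values
    o = sigmoid(W["o"] @ x_t + U["o"] @ h_prev + b["o"])  # output gate
    c_t = f * c_prev + i * g   # cell state: forget old info, add new info
    h_t = o * np.tanh(c_t)     # output computed from the cell state
    return h_t, c_t

# Tiny demo with random weights: input size 1, hidden size 4
rng = np.random.default_rng(0)
W = {k: rng.normal(size=(4, 1)) for k in "figo"}
U = {k: rng.normal(size=(4, 4)) for k in "figo"}
b = {k: np.zeros(4) for k in "figo"}
h, c = np.zeros(4), np.zeros(4)
for x in [0.1, 0.2, 0.3]:   # feed a short time series one step at a time
    h, c = lstm_step(np.array([x]), h, c, W, U, b)
print(h.shape)
```

The cell state c is the memory line: the forget gate multiplies it, the input gate adds to it, and the output gate reads from it, exactly as described above.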

Architecturally, there are several ways to forecast time series using LSTM (Ref #7):

  • Fully Connected LSTM: a neural network with several layers of LSTM units with each layer fully connected to the next layer.
  • Bidirectional LSTM: the LSTM model learns the time series in backward direction in addition to the forward direction.
  • CNN LSTM: the time series is processed by a CNN first (1-dimensional), then processed by an LSTM.
  • ConvLSTM: the convolutional structure is inside the LSTM cell (in both the input-to-state and state-to-state transitions), see Ref #13 and #16.
  • Encoder-Decoder LSTM: for forecasting several time steps. The Encoder maps the time series into a fixed-length vector, and the Decoder maps this vector back to a variable-length output sequence.

Which one is better, ARIMA or LSTM?

Well, that is a million dollar question! Some research suggests that LSTM is better (Ref #17, #20, #24), some suggests that ARIMA is better (Ref #19), and some says that XGBoost is better than both LSTM and ARIMA (Ref #23). So it depends on the case, but generally speaking LSTM is better in terms of accuracy (RMSE, MAPE).

It is an interesting topic for research, along with other approaches such as Facebook’s Prophet, GRU, GAN and their combinations (Ref #25, #26, #27). It is possible to get better accuracy by combining the above approaches. I’m still searching for a topic for my MSc dissertation, and it looks like this could be the one!


  1. Merriam-Webster dictionary explanation on “series” plurality: link
  2. Forecasting: Principles and Practice, by Rob J. Hyndman and George Athanasopoulos: link
  3. Wikipedia on ARIMA: link
  4. ARIMA model on Statsmodel: link
  5. Penn State Eberly College of Science: link
  6. Quick Adviser: link
  7. How to Develop LSTM Model for Time Series Forecasting by Jason Brownlee: link
  8. Time Series Prediction with LSTM RNN in Python with Keras: link
  9. Time Series Forecasting: Predicting Stock Prices Using An ARIMA Model by Serafeim Loukas: link
  10. Time Series Forecasting: Predicting Stock Prices Using An LSTM Model by Serafeim Loukas: link
  11. Wikipedia on RNN: link
  12. RNN and LSTM by Vincent Rainardi: link
  13. Convolutional LSTM Network: A Machine Learning Approach for Precipitation Nowcasting, by Xingjian Shi, Zhourong Chen, Hao Wang, Dit-Yan Yeung, Wai-kin Wong, Wang-chun Woo: link
  14. Exploiting the ConvLSTM: Human Action Recognition using Raw Depth Video-Based RNN, by Adrian Sanchez-Caballero, David Fuentes-Jimenez, Cristina Losada-Guti´errez: link
  15. Convolutional LSTM for spatial forecasting, by Sigrid Keydana: link
  16. Very Deep Convolutional Networks for End-to-End Speech Recognition, by Yu Zhang, William Chan, Navdeep Jaitly: link
  17. A Comparison of ARIMA and LSTM in Forecasting Time Series, by Sima Siami-Namini, Neda Tavakoli, Akbar Siami Namin: link
  18. ARIMA vs Prophet vs LSTM for Time Series Prediction, by Konstantin Kutzkov: link
  19. A Comparative Analysis of the ARIMA and LSTM Predictive Models and Their Effectiveness for Predicting Wind Speed, by Meftah Elsaraiti, Adel Merabet: link
  20. Weather Forecasting Using Merged LSTM and ARIMA Model, by Afan Galih Salman, Yaya Heryadi, Edi Abdurahman, Wayan Suparta: link
  21. Comparing ARIMA Model and LSTM RNN Model in Time-Series Forecasting, by Vaibhav Kumar: link
  22. A Comparison between ARIMA, LSTM, and GRU for Time Series Forecasting, by Peter Yamak, Li Yujian, Pius Kwao Gadosey: link
  23. Machine Learning Outperforms Classical Forecasting on Horticultural Sales Predictions by Florian Haselbeck, Jennifer Killinger, Klaus Menrad, Thomas Hannus, Dominik G. Grimm: link
  24. Forecasting Covid-19 Transmission with ARIMA and LSTM Techniques in Morocco by Mohamed Amine Rguibi, Najem Moussa, Abdellah Madani, Abdessadak Aaroud, Khalid Zine-dine: link
  25. Time Series Forecasting papers on Research Gate: link
  26. Stock Price Forecasting by a Deep Convolutional Generative Adversarial Network by Alessio Staffini: link
  27. A novel approach based on combining deep learning models with statistical methods for COVID-19 time series forecasting by Hossein Abbasimehr, Reza Paki, Aram Bahrini: link

18 January 2022

How to do AI without Machine Learning?

Filed under: Data Science,Machine Learning — Vincent Rainardi @ 8:40 am

I’m doing a master’s degree titled ML and AI, and all this time I’ve been wondering what the difference between AI and ML is. I know AI is a superset of ML, but what is in AI but not in ML? Is it possible to do AI without ML? If so, how?

The Old Days of AI: rule-based

In the 1990s there was no machine learning. To be clear, machine learning here includes classical algorithms like Decision Trees, Naive Bayes and SVM, as well as Deep Learning (neural networks). There was no machine learning, but there was a lot of news about Artificial Intelligence. Deep Blue was the culmination of that.

So we know there was AI when there was no ML. There was AI without ML. But what was it? Rule-based systems, of course. The technology that Deep Blue used is called an “Expert System”, which is based on rules defined and tuned by chess masters. You can read about the software behind Deep Blue here: link.

A rule-based system is essentially a set of IF-THEN rules. There are many different types of rules, so I need to clarify which one: it is the IF-THEN rule that makes up an Expert System. There are 2 main components of an Expert System (ES): the Inference Engine and the Knowledge Base. You can read about the software architecture of an Expert System here: link.
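To illustrate the two components, here is a toy forward-chaining inference engine in Python: the knowledge base is a list of IF-THEN rules, and the inference engine keeps firing rules against the known facts until nothing new can be derived. The rules and fact names are made up for illustration; this is not how Deep Blue itself worked.

```python
# Knowledge base: each rule is (IF these conditions hold, THEN conclude this)
rules = [
    ({"has_feathers"}, "is_bird"),
    ({"is_bird", "can_fly"}, "nests_in_trees"),
]

def infer(facts, rules):
    """Inference engine: forward chaining until no rule adds a new fact."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for conditions, conclusion in rules:
            if conditions <= facts and conclusion not in facts:
                facts.add(conclusion)   # fire the rule
                changed = True
    return facts

result = infer({"has_feathers", "can_fly"}, rules)
print(sorted(result))
```

Note how the first rule's conclusion ("is_bird") enables the second rule: chaining rules like this is what separates an inference engine from a simple lookup.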

Search and Fuzzy Logic

Besides ML and ES, another way to do AI is using Search. There are various ways to do search, such as Heuristic Search (Informed Search), Iterative Search and Adversarial Search. You can read the details in an excellent book by Crina Grosan and Ajith Abraham: link, page 13 to 129.

In the Expert System world, IF-THEN rules are not the only way to build an Expert System. There is another way: using fuzzy logic. In an IF-THEN rule-based expert system, the truth value is either 0 or 1. In a fuzzy logic system, the truth value is any real number between 0 and 1 (link). There are several fuzzy logic systems, such as Mamdani and TSK (you can read the details here: link).

Evolutionary Algorithm and Swarm Intelligence

Another way of doing AI is using Evolutionary Algorithms (EA). EAs use concepts from evolution/biology, such as reproduction, natural selection and mutation, in order to develop a solution: link.

And finally, another way of doing AI is Swarm Intelligence: link. Swarm Intelligence (SI) is inspired by the behaviour of groups of animals, such as birds and ants. An SI-based AI system consists of a group of agents interacting with one another and with the environment (similar to Reinforcement Learning but with many agents).


So there you have it, there are a few ways of doing AI:

  • Machine Learning
  • Expert System (Rule-Based)
  • Fuzzy Logic
  • Search
  • Evolutionary Algorithm
  • Swarm Intelligence

So just because we study ML, we should not think that it is the only way to do AI. There are other ways, which might be better and may produce a better AI. Who knows? You haven’t studied them, right? Well, I know for sure now that AI is not just ML. I hope this article is useful for you.


  1. Expert System, Wikipedia: link
  2. History of AI, Wikipedia: link
  3. AI without ML, Teradata: link
  4. AI without ML, Claudia Pohlink: link
  5. Rule-based AI vs ML, We Are Brain: link
  6. Intelligence Systems, Crina Grosan & Ajith Abraham: link

17 January 2022

Machine Learning or Data Science?

Filed under: Data Science,Machine Learning — Vincent Rainardi @ 8:07 am

I’ve just got my postgraduate diploma in machine learning, and all this time I was wondering what data science is. I have written an article about what data science is: link. But now that I understand a bit more about machine learning, I can see there is a lot of overlap between the two (ML and DS).

Last night, when I was reading a Data Science book by Andrew Vermeulen (link), I wondered which of the things I’ve learned in ML are actually DS, so I listed the items and labelled them ML or DS.

Yes, machine learning is definitely part of data science. Strictly speaking, data cleansing, data analysis, statistics and visualisation are data science but not machine learning. We can see this in these proceedings: link.

So Data Science consists of the following:

  • Data Cleansing
  • Data Analysis
  • Statistics (including probability, central limit theorem, hypothesis testing)
  • Data Visualisation
  • Machine Learning (including all ML models)

But in my opinion one cannot learn ML without studying statistics, visualisation, data loading, data cleansing and data analysis. In order to understand ML models properly, one must understand all the above fields.

Berkeley School of Information argues that the following are also included in data science: link

  • Data Warehousing
  • Data Acquisition
  • Data Processing
  • Data Architecture
  • Business Intelligence
  • Data Reporting

I disagree with this opinion. From what I see in many companies, Data Warehousing, data acquisition/processing and Data Architecture are part of a role called Data Engineer. A Data Engineer prepares and stores the data, including designing the data models and the data ingestion process.

Because Data Visualisation is part of data science, it is tempting to think that Business Intelligence and Data Reporting are part of Data Science too. But this is not true. The data visualisation in data science is more about data behaviour, such as clustering and statistical analysis, whereas BI is more about the business side, such as portfolio performance or risk reporting. This is only my opinion though; I’m sure other people have different opinions.

So there are 2 fields/roles in the data industry these days:

  • Data Science: data cleansing, data analysis, statistics, machine learning, data visualisation.
  • Data Engineering: data acquisition, data loading/processing, data quality, data architecture.

Whereas in the old days the roles were: business/data analyst, data architect, BI developer, ETL developer.

14 December 2021

Managing Investment Portfolios Using Machine Learning

Filed under: Data Science,Machine Learning — Vincent Rainardi @ 8:28 am

In investment management, machine learning can be used in different areas of portfolio management, including portfolio construction, signal generation, trade execution, asset allocation, security selection, position sizing, strategy testing, alpha factor design and risk management. Portfolio management is first a prediction problem for the vector of expected returns and the covariance matrix, and then an optimization problem for returns, risk, and market impact (link). We can use various ML algorithms for managing investment portfolios, including reinforcement learning, Elastic Net, RNN (LSTM), CNN and Random Forest.
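The “predict, then optimise” framing above can be sketched in a few lines of numpy: given predicted expected returns mu and a predicted covariance matrix Sigma (both made up here for illustration), the unconstrained mean-variance optimal weights are proportional to inv(Sigma) @ mu, normalised to sum to 1. Real portfolio construction adds constraints (long-only, position limits, market impact), so treat this as a sketch only.

```python
import numpy as np

# Hypothetical predictions for a 3-asset portfolio
mu = np.array([0.08, 0.05, 0.03])          # predicted expected returns
Sigma = np.array([[0.04, 0.01, 0.00],
                  [0.01, 0.02, 0.00],
                  [0.00, 0.00, 0.01]])     # predicted covariance matrix

raw = np.linalg.solve(Sigma, mu)           # inv(Sigma) @ mu
w = raw / raw.sum()                        # fully-invested portfolio weights
print(w, w.sum())
```

The quality of the weights depends entirely on the predicted mu and Sigma, which is exactly where the ML methods listed in this article come in.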

In this article I would like to give an “overview of the land” of how machine learning is used to manage investment portfolios. So I’m going to list various methods that people use, but I am not going to explain the mechanism of each method. For example, I might mention that people predict the direction of stock prices using random forests, but I’m not going to explain why or how. For each case I’m going to provide a link, so that you can read more if you are interested. The idea is that people who wish to do research on portfolio management using ML can understand the current landscape, i.e. what has been done recently.

In the next article (link), I will describe the problem statement of using Reinforcement Learning to manage portfolio allocation. So out of the many aspects of portfolio management described in this article, I choose only one (portfolio allocation), and out of the many ML approaches described in this article, I choose only one (RL).

But first, a brief overview on what portfolio management is. This is important as some of us are not from the investment sector. We know about the stock markets, but have no idea how a portfolio is managed.

What is Portfolio Management?

Portfolio management is the art and science of making optimal investment decisions to minimise the risks and maximise the returns, in order to meet the investor’s financial objectives and risk tolerance. Active portfolio management means strategically buying and selling securities (stocks, bonds, options, commodities, property, cash, etc.) in an effort to beat the market, whereas passive portfolio management means matching the returns of the market by replicating an index.

Portfolio management involves the following stages:

  1. Objectives and constraints
  2. Portfolio strategies
  3. Asset allocation
  4. Portfolio construction
  5. Portfolio monitoring
  6. Day to day operations

Let’s examine those 6 stages one by one.

1. Define the investment objectives and constraints

First the investors or portfolio managers need to define the short term and long term investment goals, and how much and which types of risks the investor is willing to take. Other constraints include the capital amount, the time constraints, the asset types, the liquidity, the geographical regions, the ESG factors (environment, social, governance), the “duration” (sensitivity to interest rate changes) and the currency. Whether hedging FX or credit risks is allowed or not (using FX forwards and CDS), having more than 10% of cash is allowed or not, investing in developed markets is allowed or not, how much market volatility is allowed, whether investing in companies with market cap less than $1 billion is allowed, whether investing in coal or oil companies is allowed or not, whether investing in currencies is allowed or not, etc. – those are all portfolio constraints too.

2. Define the portfolio strategies

Based on the objectives and constraints, the investors or portfolio managers define the portfolio strategies, i.e. active or passive, top down or bottom up, growth or value investing, income investing, contrarian investing, buy and hold, momentum trading, long short strategy, indexing, pairs trading, dollar cost averaging (see here for more details). Hedging strategies, diversification strategies, duration strategies, currency strategies, risk strategies, stop loss strategies, liquidity strategy (to deal with redemptions and subscriptions), cash strategies – these are all strategies in managing portfolios.

3. Define the asset allocations

Based on the objectives and constraints, the investors or portfolio managers define what types of assets they should be investing in. For example, if the objective is to make a difference in climate change, then the investment universe would be low carbon companies, clean energy companies and green transport companies. If one of the investment constraints is to invest in Asia but not in Japan, and only in fixed income (not equity), then the investment universe would be the bonds issued by companies based in China, India, Korea, Singapore, etc. The asset types could be commodities (like oil or gold), property (like offices or houses), cash-like assets (like treasury bonds), government bonds, corporate bonds, futures, ETFs, cryptocurrencies, options, CDS (credit default swaps), MBS (mortgage-backed securities), ABS (asset-backed securities), time deposits, etc.

4. Portfolio construction

Once the strategies and asset allocations are defined, the investors or portfolio managers begin building the portfolio, by buying assets on the stock/bond markets and by entering into contracts (e.g. CDS contracts, IRS contracts, property contracts, forward exchange contracts). Every company they are going to buy is evaluated: the financial performance (financial ratios, balance sheet, cash flow, etc.), the board of directors (independence, diversity, board size, directors’ skills, ages and backgrounds, etc.), the stock prices (company value, historical momentum, etc.), the controversies (incidents, health & safety record, employee disputes, law-breaking records & penalties, etc.) and the environmental factors (pollution, climate change policies, carbon and plastic records, etc.). So it is not just financial, but a lot more than that.

5. Portfolio monitoring

Then they need to put in place a risk monitoring system, compliance monitoring system, performance monitoring system, portfolio reporting system and asset allocation monitoring system. Every trade is monitored (market abuse, trade transparency, capital requirements, post-trade reporting, authorisation requirements, derivative reporting), and every day each portfolio holding is monitored. The cash level and portfolio breakdown are monitored every day. Early warnings are detected and reported (for threshold breaches), market movement effects are monitored, and operational risks are monitored and reported. Client reporting is in place (e.g. when investment values drop by more than 10% the client must be notified), and audits are put in place (data security audit, IT systems audit, legal audit, anti money laundering audit, KYC/on-boarding process, insider trading).

6. Day-to-day operations

In the day-to-day operation, the investors or portfolio managers basically identify potential trades to make money (to enhance the return). A trade means buying or selling securities. For this they screen potential companies (based on financial ratios, technical indicators, ESG factors, etc.) to come up with a short list of companies to buy. They research these companies in depth and finally come up with one company they are going to buy (company A). They calculate which holding in the portfolio they will need to sell (company B) in order to buy this new company. They calculate the ideal holding size for company A (in a global portfolio, each holding is about 1-2% of the portfolio), which also depends on the other properties of the company (sector, country, benchmark comparison, etc.). Then they make 2 trades: buy A and sell B.

Apart from trades to make money, there are trades for other purposes: trades to mitigate risks, trades for compliance, trades for rebalancing, trades for benchmark adjustments, trades to improve liquidity, etc.

What is not included in portfolio management are the sales and marketing operations, business development and product development. These activities also directly impact portfolio management, because subscriptions and redemptions change the AUM (assets under management), but they are not considered part of it.

Machine Learning Methods Used in Portfolio Management

Below are various research papers which use various machine learning models and algorithms to manage investment portfolios, including predicting stock prices and minimising the risks.

Part 1. Using Reinforcement Learning

  • A deep Q-learning portfolio management framework for the crypto currency market (Sep 2020, link)
    A deep Q-learning portfolio management framework consisting of 2 elements: a set of local agents that learn asset behaviours and a global agent that describes the global reward function. Implemented on a crypto portfolio composed of four cryptocurrencies. Data: Bitcoin (BTC), Litecoin (LTC), Ethereum (ETH) and Ripple (XRP), July 2017 to January 2019.
  • RL based Portfolio Management with Augmented Asset Movement Prediction States (Feb 2020, link)
    Using State-Augmented RL framework (SARL) to augment the asset price information with their price movement prediction (derived from news), evaluated on accumulated profits and risk-adjusted profits. Datasets: Bitcoin and high tech stock market, and 7 year Reuters news articles. Using LSTM for predicting the asset movement and NLP (Glove) to embed the news then feed into HAN to predict asset movement.
  • Adversarial Deep RL in Portfolio Management (Nov 2018, link)
    Using 3 RL algorithms: Deep Deterministic Policy Gradient (DDPG), Proximal Policy Optimization (PPO) and Policy Gradient (PG). China stock market data. Using Adversarial Training method to improve the training efficiency and promote average daily return and Sharpe ratio.
  • Financial Portfolio Management using Reinforcement Learning (Jan 2020, link)
    3 RL strategies are used to train the models to maximise the returns and minimise the risks: DQN, T-DQN and D-DQN. Indian stock market data from Yahoo Finance, from 2008 to 2020.
  • Using RL for risk-return balanced portfolio management with market conditions embedding (Feb 2021, link)
    A deep RL method to tackle the risk-return balancing problem by using macro market conditions as indicators to adjust the proportion between long and short funds to lower the risk of market fluctuations, using the negative maximum drawdown as the reward function.
  • Enhancing Q-Learning for Optimal Asset Allocation (Dec 1997, link)
    Enhancing the Q-learning algorithm for optimal asset allocation, using only one value function for many assets and allowing model-free policy iteration.
  • Portfolio Optimization using Reinforcement Learning (Apr 2021, link)
    Experimenting with RL for building optimal portfolio of 3 cryptocurrencies (Dash, Litecoin, Staker) and comparing it with Markowitz’ Efficient Frontier approach. Given the price history, to allocate a fixed amount of money between the 3 currencies every day to maximize the returns.

Part 2. Using Recurrent Neural Network (RNN)

  • Mutual Fund Portfolio Management Using LSTM (Oct 2020, link)
    Predicting the company stock prices on 31/12/2019 in IT, banking and pharmaceutical sectors based on Bombay stock prices from 1/1/2012 to 31/12/2015. Mutual funds are created from stocks in each sector, and across sectors.
  • Stock Portfolio Optimization Using a Deep Learning LSTM Model (Nov 2021, link)
    Time series analysis of the top 5 stocks historical prices from the nine different sectors in the Indian stock market from 1/1/2016 to 31/12/2020. Optimum portfolios are built for each of these sectors. The predicted returns and risks of each portfolio are computed using LSTM.
  • Deep RL for Asset Allocation in US Equities (Oct 2020, link)
    A model-free solution to the asset allocation problem, learning to solve the problem using time series and deep NN. Daily data for the top 24 stocks in the US equities universe with daily rebalancing. Compare LSTM, CNN, and RNN with traditional portfolio management approaches like mean-variance, minimum variance, risk parity, and equally weighted.
  • Portfolio Management with LSTM (Dec 2018, link)
    Predicting short term and long term stock price movements using LSTM model. 15 stocks, 17 years of daily Philippine Stock Exchange price data. Simple portfolio management algorithm which buys and sells stocks based on the predicted prices.
  • Anomaly detection for portfolio risk management (June 2018, link)
    ARMA-GARCH and EWMA econometric models, and LSTM and HTM machine learning algorithms, were evaluated for the task of performing unsupervised anomaly detection on the streaming time series of portfolio risk measures. Datasets: returns and VAR (value at risk).

Part 3. Using Random Forest

  • Forecasting directional movements of stock prices for intraday trading using LSTM and random forests (June 2021, link)
    Using random forests and CuDNNLSTM to forecast the directional movements of S&P 500 constituent stocks from January 1993 to December 2018 for intraday trading (closing and opening prices returns and intraday returns). On each trading day, buy the 10 stocks with the highest probability and sell short the 10 stocks with the lowest probability to outperform the market in terms of intraday returns.
  • Stock Selection with Random Forest in the Chinese stock market (Aug 2019, link)
    Evaluates the robustness of the random forest model for stock selection. Fundamental/technical feature space and pure momentum feature space are adopted to forecast the price trend in the short and long term. Data: all companies on the Chinese stock market from 8/2/2013 to 8/8/2017. Stocks are divided into N classes based on the forward excess returns of each stock. RF model is used in the subsequent trading period to predict the probability for each stock that belongs to the category with the largest excess return. The selected stocks constituting the portfolio are held for a certain period, and the portfolio constituents are then renewed based on the new probability ranking.
  • Predicting clean energy stock price using random forests approach (Jan 2021, link)
    Using random forests to predict the stock price direction of clean energy exchange traded funds. For a 20-day forecast horizon, tree bagging and random forests methods produce 85% to 90% accuracy rates while logistic regression models are 55% to 60%.
  • Stock Market Prices Prediction using Random Forest and Extra Tree Regression (Sep 2019, link)
    Comparing Linear Regression, Decision Tree and Random Forest models, using the last 5 years of historical stock prices for all companies on the S&P 500 index. From these, the price of each stock on the sixth day is predicted.

Part 4. Using Gradient Boosting

  • A Machine Learning Integrated Portfolio Rebalance Framework with Risk-Aversion Adjustment (July 2021, link)
    A portfolio rebalance framework that integrates ML models into the mean-risk portfolios in multi-period settings with risk-aversion adjustment. In each period, the risk-aversion coefficient is adjusted automatically according to market trend movements predicted by ML models. The XGBoost model provides the best prediction of market movement, while the proposed portfolio rebalance strategy generates portfolios with superior out-of-sample performances compared to the benchmarks. Data: 25 US stocks, 13-week Treasury Bill and S&P 500 index from 01/09/1995 to 12/31/2018 with 1252 weekly returns.
  • The Success of AdaBoost and Its Application in Portfolio Management (Mar 2021, link)
    A novel approach to explaining why AdaBoost is a successful classifier, by introducing a measure of the influence of noise points. Applying AdaBoost in portfolio management via empirical studies in the Chinese stock market:
    1. Selecting an optimal portfolio management strategy based on AdaBoost
    2. Good performance of the equal-weighted strategy based on AdaBoost
    Data: from June 2002 to June 2017, 181 months, Chinese A-share market; 60 fundamental & technical factors.
  • Moving Forward from Predictive Regressions: Boosting Asset Allocation Decisions (Jan 2021, link)
    A flexible utility-based empirical approach to directly determine asset allocation decisions between risky and risk-free assets. Single-step customized gradient boosting method specifically designed to find optimal portfolio weights in a direct utility maximization. Empirical results of the monthly U.S. data show the superiority of boosted portfolio weights over several benchmarks, generating interpretable results and profitable asset allocation decisions. Data: The Welch-Goyal dataset, containing macroeconomic variables and the S&P 500 index from December 1950 to December 2018.
  • Understanding Machine Learning for Diversified Portfolio Construction by Explainable AI (Feb 2020, link)
    A pipeline to investigate heuristic diversification strategies in asset allocation. Use explainable AI to compare the robustness of different strategies and back out implicit rules for decision making. Augment the asset universe with scenarios generated with a block bootstrap from the empirical dataset. The empirical dataset consists of 17 equity index, government bond, and commodity futures markets across 20 years. The two strategies are back tested for the empirical dataset and for about 100,000 bootstrapped datasets. XGBoost is used to regress the Calmar ratio spread between the two strategies against features of the bootstrapped datasets.

22 November 2021

Tuning XGBoost Models

Filed under: Data Science,Machine Learning — Vincent Rainardi @ 7:15 am

I was tuning models on the fraudulent credit card transaction data from Kaggle (link) and found that, for this classification task, XGBoost provides the highest AUC compared to other algorithms (99.18%). It is a little tricky to tune though, so in this article I’d like to share my experience tuning it.

What is XGBoost?

XGBoost stands for Extreme Gradient Boosting. So before you read about XGBoost, you need to understand first what Gradient Boosting is, and what Boosting is. Here are good introductions to the topic: link, link. The base algorithm for XGBoost is the Decision Tree. Many trees are then used together in a technique called an Ensemble (for example, Random Forest). So a complete journey to understanding XGBoost from the ground up is:

  1. Decision Tree
  2. Ensemble
  3. Stacking, Bagging, Boosting (link)
  4. Random Forest
  5. Gradient Boosting
  6. Extreme Gradient Boosting
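To make step 5 concrete, here is a minimal sketch of gradient boosting for regression (my illustration, not part of the original article): each new tree is fitted to the residuals of the ensemble built so far.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy data: a curve that a single shallow tree cannot fit well
X = np.linspace(-3, 3, 200).reshape(-1, 1)
y = X.ravel() ** 2

learning_rate = 0.1
baseline = y.mean()                       # start from a constant prediction
prediction = np.full_like(y, baseline)
trees = []

for _ in range(100):
    residuals = y - prediction            # negative gradient of squared loss
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)

def boosted_predict(X_new):
    out = np.full(len(X_new), baseline)
    for tree in trees:
        out += learning_rate * tree.predict(X_new)
    return out
```

XGBoost follows the same idea but adds regularisation, second-order gradients and many engineering optimisations on top.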

Higgs Boson

The original paper by Tianqi Chen and Carlos Guestrin, who created XGBoost, is here: link.
XGBoost was used to solve the Higgs Boson classification problem, again by Tianqi Chen, with Tong He: link. The Higgs Boson is the most recently discovered elementary particle. It was discovered in 2012 at the Large Hadron Collider at CERN, having been predicted by Peter Higgs in 1964.


References

A good reference for tuning an XGBoost model is this guide from Prashant Banerjee: link (search for “typical value”). Another good one is from Aarshay Jain: link (again, search for “typical value”). The guide from the developers is here: link, and the list of hyperparameters is here: link.

Python Code

Here’s the code in its entirety:

# Import required libraries
import numpy as np
import pandas as pd
from sklearn import preprocessing

# Load the data from Google Drive (mount it first)
from google.colab import drive
df = pd.read_csv('/content/gdrive/My Drive/Colab Notebooks/creditcard.csv')

# Drop time column as fraudulent transactions can happen at any time
df = df.drop("Time", axis = 1)

# Get the class variable and put into y and the rest into X
y = df["Class"]
X = df.drop("Class", axis = 1)

# Stratified split into train & test data
from sklearn import model_selection
X_train, X_test, y_train, y_test = model_selection.train_test_split( X, y, test_size = 0.2, stratify = y, random_state = 42 )

# Fix data skewness (fit the transformer on the training data only)
from sklearn.preprocessing import PowerTransformer
pt = PowerTransformer(copy=False)
train_return = pt.fit_transform(X_train)
test_return  = pt.transform(X_test)  # transform only, to avoid fitting on the test data

# Balance the train and test data using SMOTE
from imblearn.over_sampling import SMOTE
smote = SMOTE(random_state=42)
X_smote_train, y_smote_train = smote.fit_resample(X_train, y_train)
X_smote_test, y_smote_test = smote.fit_resample(X_test, y_test)

# Sample training data for tuning models (use full training data for final run)
tuning_sample = 20000
idx = np.random.choice(len(X_smote_train), size=tuning_sample, replace=False)
X_smote_tuning = X_smote_train.iloc[idx]
y_smote_tuning = y_smote_train.iloc[idx]

# Import libraries from Scikit Learn
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import roc_auc_score
from xgboost import XGBClassifier

# Create a function to calculate AUC using predict proba
def Get_AUC(Model, X, y):
    prob = Model.predict_proba(X)[:, 1] 
    return roc_auc_score(y, prob) * 100

# Perform grid search cross validation with different parameters
parameters = {'n_estimators':[90], 'max_depth':[6], 'learning_rate':[0.2], 
              'subsample':[0.5], 'colsample_bytree':[0.3], 'min_child_weight': [1],
              'gamma':[0], 'alpha':[0.001], 'reg_lambda':[0.001]}
XGB = XGBClassifier()
CV = GridSearchCV(XGB, parameters, cv=3, scoring='roc_auc', n_jobs=-1)

# Hyperparameter tuning to find the best parameters, y_smote_tuning)
print("The best parameters are:", CV.best_params_)

Output: The best parameters are: {'alpha': 0.001, 'colsample_bytree': 0.3, 'gamma': 0, 'learning_rate': 0.2, 'max_depth': 6, 'min_child_weight': 1, 'n_estimators': 90, 'reg_lambda': 0.001, 'subsample': 0.5}

# Fit the model with the best parameters and get the AUC
XGB = XGBClassifier(n_estimators = CV.best_params_["n_estimators"], max_depth = CV.best_params_["max_depth"], 
                    learning_rate = CV.best_params_["learning_rate"], colsample_bytree = CV.best_params_["colsample_bytree"], 
                    subsample = CV.best_params_["subsample"], min_child_weight = CV.best_params_["min_child_weight"],
                    gamma = CV.best_params_["gamma"], alpha = CV.best_params_["alpha"], 
                    reg_lambda = CV.best_params_["reg_lambda"])
Model =, y_smote_train)
AUC = Get_AUC(Model, X_smote_test, y_smote_test)
print("AUC =", '{:.2f}%'.format(AUC))

Output: AUC = 99.18%

Tuning Process

So here is the tuning process I went through for the XGBoost model, on the above data, using the above code.

Step 1. Broad ranges on the top 3 parameters

First, I read the expected values for the parameters from the guides (see the Reference section above).
Note: In this article when I say parameters I mean hyperparameters.

Then, using the Grid Search cross validation I set the parameters in very broad ranges as follows:

  • n_estimators: 10, 100, 500
  • max_depth: 3, 10, 30
  • learning_rate: 0.01, 0.1, 1

I used only 20k rows out of the 284,807 transactions so the cross validation didn’t take hours, only minutes. I tried 10k, 20k and 50k samples and found that 10k didn’t represent the whole training data (284k), and 50k and above was very slow, but 20k was fast enough and still representative.

I would recommend starting with only these 3 parameters and only 3 values for each. That way it takes about 10 minutes. These 3 parameters are the most influential factors, so we need to nail them down first. They are mentioned in the References section above.

Step 2. Narrow down the top 3 parameters

I then narrowed down the range of these 3 parameters. For example, for n_estimators, out of 10, 100 and 500 the grid search showed that the best value was 100. So I changed the grid search to 80, 100, 120. Still getting 100 as the best value, I did a grid search with 90, 100, 110 and got 90. Finally I did a grid search with 85, 90, 95 and it still gave 90 as the best n_estimators, so that was my final value for this parameter.

But I understood there were interactions between the parameters, so when tuning n_estimators I included max_depth values of 3, 10, 30 and learning_rate values of 0.01, 0.1, 1. And when n_estimators was settled at 90, I started narrowing down max_depth (which was giving 10) to 7, 10, 14. The result was 7, so I narrowed it down to 6, 7, 8. The result was 6, and that was the final value for max_depth.

For the learning_rate I started with 0.01, 0.1, 1 and the best was 0.1. Then 0.05, 0.1, 0.2, and the best was 0.2. I tried 0.15, 0.2, 0.25 and the best was still 0.2, so that was the final value for the learning_rate.

So the top 3 parameters are: n_estimators = 90, max_depth = 6, learning_rate = 0.2. The max_depth of 6 is the same as the default value, so I could have left that parameter out if I wanted to.


Note that I didn’t put all the possible ranges/values for all 3 parameters into one grid search CV and let it run overnight. It was all manual, and I nailed down the parameters one by one, which only took about an hour. Manual is a lot quicker because from the previous run I knew the optimum range of each parameter, so I could narrow it down further. It’s a very controlled and targeted process, which is why it’s quick.

Also note that I used only 20k rows for tuning, but for the final AUC I fitted the model on the full training data and predicted using the full test data.

Step 3. The next 3 parameters

With the top 3 parameters fixed, I tried the next 3 parameters as follows:

  • colsample_bytree: 0.1, 0.5, 0.9
  • subsample: 0.1, 0.5, 0.9
  • min_child_weight: 1, 5, 10

I picked these 3 parameters based on the guidelines given by the XGBoost developers and the blog posts in the References section above.

The results are as follows: the optimum values were colsample_bytree = 0.3, subsample = 0.5 and min_child_weight = 1. This gives an AUC of 98.69%.

For the explanation about what these parameters are, please refer to the XGBoost documentation here: link.

It is possible that the AUC comes out lower than the AUC from the previous step. In that case I tried the values for that parameter manually using the full training data. For example, with the tuning data (20k) the best min_child_weight was 0, but this gave an AUC of 98.09%, which was lower than the AUC before introducing min_child_weight (98.69%). So I tried min_child_weight values of 0, 1 and 2 using the full training data. In other words, the tuning data (20k) is good for narrowing down from a broad range to a narrow range, but once the range is narrow we might need to use the full training data. To do this I replaced the “XGB = …“ in the last cell with this:

XGB = XGBClassifier(n_estimators = 90, max_depth = 6, learning_rate = 0.2, 
                    colsample_bytree = 0.3, subsample = 0.5, min_child_weight = 1)

Step 4. Three regularisation parameters

Reading the guides in the References section above, it seems the next 3 most important parameters are gamma, alpha and lambda. They are the regularisation parameters, and their value ranges are in the XGBoost documentation (link).

  • gamma: 0, 1, 10. Optimum value: 0.
  • alpha: 0, 10, 1000. Optimum value: 0.001
  • reg_lambda: 0.1, 0.5, 0.9. Optimum value: 0.001

After tuning these 3 parameters, the AUC increased to 99.18%.

I confirmed the result by replacing the XGB = … in the last cell with this:

XGB = XGBClassifier(n_estimators = 90, max_depth = 6, learning_rate = 0.2, 
                    colsample_bytree = 0.3, subsample = 0.5, min_child_weight = 1,
                    gamma = 0, alpha = 0.001, reg_lambda = 0.001)

Note on the imbalanced data

XGBoost has 2 parameters to deal with imbalanced data: scale_pos_weight and max_delta_step. You can read how to use them in the XGBoost documentation: link.
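As a reference point, the XGBoost documentation suggests starting scale_pos_weight at the ratio of negative to positive examples. A quick sketch (the class counts here are illustrative, not the Kaggle data):

```python
import numpy as np

# Illustrative class column: 990 genuine transactions, 10 frauds
y = np.array([0] * 990 + [1] * 10)

neg, pos = np.sum(y == 0), np.sum(y == 1)
scale_pos_weight = neg / pos    # ratio of negative to positive examples
print(scale_pos_weight)         # 99.0 for this toy data
```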

I did try them, using scale_pos_weight values of 1, 10 and 100; the optimum value was 10, but it only gave an AUC of 96.83%.

So I tried different approaches for handling the imbalanced data, i.e. random oversampling, SMOTE and ADASYN. SMOTE gave the best result, i.e. the 99.18% AUC above.

Note on the data skewness

The credit card fraud data is skewed, i.e. not normally distributed. This is particularly so for the amount feature, which is distributed differently to the left of the mean and to the right of it. A few other features, such as V3, are like that too.

I used PowerTransformer to fix the data skewness, as you can see in the code above. I fixed the skewness after splitting, i.e. I split the data first and then fixed the skewness on the training and test data. This is better than fixing the skewness first, because the data could become skewed again once it is split.
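The split-first order can be sketched like this: the transformer’s parameters are learned from the training data only and then reused on the test data (toy log-normal data, my illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PowerTransformer

# A skewed toy feature (log-normal), standing in for the amount column
rng = np.random.default_rng(0)
X = rng.lognormal(mean=0.0, sigma=1.0, size=(1000, 1))

# Split first...
X_tr, X_te = train_test_split(X, test_size=0.2, random_state=42)

# ...then fix the skewness: fit on the training split, reuse on the test split
pt = PowerTransformer()
X_tr_t = pt.fit_transform(X_tr)
X_te_t = pt.transform(X_te)
```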

Note on the stratified sampling

Because the data is very imbalanced, I used stratified sampling so that the ratio between the 2 classes is kept the same between the training data and the test data. I used an 80-20 split rather than 70-30 to give the model more data to learn from, and because 20% is one fifth, which is a large enough unseen sample to test the trained model against.

I don’t believe 10% test data is a fair test; in my opinion 20% is the minimum and we should not go lower than that, not even 15%. I verified this on Kaggle: most notebooks there use test data of 20%, 25% or 30%. I didn’t see anyone use less than 20% or more than 30%.

Note on deleting the time column

The time column is not the time of day, as in 8am or 9pm. It is the number of seconds elapsed between each transaction and the first transaction in the dataset (link). The distribution of the time column for class 0 and class 1 shows that frauds can happen at any time:

And there is no correlation between time and class:

And by the way, the credit card transaction data covers only 2 days, so there is not enough time for a time-of-day pattern to form.

So those are my reasons for deleting the time column.

But what Nimrod said on LinkedIn made me try again. He said: “Great read Vincent. I wonder though, have you checked the time column before dropping it? I get that fraud can happen at any time, but perhaps some times of the day are more densely packed with fraudulent transactions?” (link)

So I downloaded the time column and the class column into Excel and divided the time column by (3600 × 24), the number of seconds in a day, to get it in “day units”. This day unit ranges from 0 to 1.9716, because there are only 2 days’ worth of transactions.

I then took the decimal part of the day unit, which tells us when during the day the fraud happened (a value between 0 and 1). Multiplied by 24, it gives the hour in the day. This is what it looks like when I plot the number of frauds against the hour in the day:

Note that in the chart above, 8 does not mean 8am and 21 does not mean 9pm; 8 means 8 hours from the first transaction and 21 means 21 hours from the first transaction. But we can clearly see that fraud is high in the 2nd hour and the 11th hour. We need to remember that the data is only 2 days’ worth of transactions, but still, it clearly shows that some times of the day are more densely packed with fraudulent transactions, just as Nimrod said (link). So I shouldn’t have deleted the time column; I should have converted it to the time of day instead.
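The Excel steps above can be done directly in pandas; a sketch with a tiny stand-in for the Kaggle data (Time is seconds since the first transaction):

```python
import pandas as pd

df = pd.DataFrame({"Time":  [0, 3600, 7200, 90000],
                   "Class": [0, 1,    0,    1]})

# "Day units": Time / (3600 * 24); ranges 0 .. ~1.97 in the real data
df["DayUnit"] = df["Time"] / (3600 * 24)

# Hour of day: equivalent to (decimal part of the day unit) * 24,
# computed with integer arithmetic to avoid floating point edge cases
df["Hour"] = (df["Time"] // 3600) % 24

# Number of frauds per hour-of-day bucket
frauds_per_hour = df[df["Class"] == 1].groupby("Hour").size()
```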

13 November 2021

Google Colab

Filed under: Machine Learning — Vincent Rainardi @ 9:05 am

I started using Google Colab about 6 months ago, as my laptop only has 1 core and 4 GB of RAM. I’m now on Colab Pro (£9.72 a month), which is fantastic. So in this article I’d like to share my experience of doing machine learning on Colab.

The Look and Feel

I love the table of contents in the left panel! A local Jupyter Notebook install doesn’t have it. Well, it does (link), but it’s complicated and very manual. On Google Colab it is available out of the box, without us doing anything (it detects the headings in the markdown, i.e. #, ##, ###, etc.). We use it to jump to different sections in our notebook:

Google Colab also supports a dark theme, as you can see above. In the left-most column (marked A) we have search, code snippets, variables and files. They are very useful during development. Psst, if you click the Colab icon at the top left (B), you’ll see all your notebooks in Google Drive!

At the bottom left (C) there are 2 very useful icons. The first is the Command Palette, where every single Jupyter command and shortcut is listed! The second is the Terminal, where you can run shell commands (Colab Pro only).

Google Drive

When starting on Colab, almost everyone asks “How do I upload files?” and spends time searching.

There you go! Saves you from searching 🙂

Tip: don’t use a GPU when you connect to Google Drive for the first time. You could get a “403 error daily limit exceeded” (link). Turn off the GPU in the Notebook Settings (see below) and you’ll be OK.


The main reason we use Colab is that it’s free and it’s fast. Yes, it’s free. And yes, you get a GPU and a TPU! In my case it’s 10-20x faster than my laptop. So why pay £9.72/month (including VAT) for Pro? Because the free tier disconnects after 1.5 hours, that’s why. To prevent that we need to keep typing! With Pro we also get more power and more RAM.

When doing neural networks (RNN, CNN) or RL, Colab is an absolute godsend. I can’t run those NN or RL models on my laptop (well, it’s 6 years old, no wonder 🙂). Colab is very good with anything to do with Keras, because of its TPU. And with scikit-learn too.

Tip: unless you are training network layers, a TPU doesn’t make things faster than a GPU or None. I was using classical models (Logistic Regression, Random Forest, XGBoost, etc.) and found that a TPU or GPU doesn’t make them faster. And there were moments when I could not connect using a TPU, but with None I could.

Also, most of the time I don’t need the High-RAM setting. Only once did I run out of memory, and that was when processing a lot of images (augmentation).

Executing Code

After executing code, you can tell how long each cell took by hovering over the play button (or the green check box). Below you can see “29 seconds” under the green checkbox, so why bother hovering? Because once you edit the cell the 29s is gone! But if you hover over it you can still see it 🙂

A lot of the time when I run code it fails. After fixing that particular cell and testing it, I used to run everything from the top again (using “Run before”). Now, once I’ve fixed that cell, I use “Run after” to execute the rest of the code, because if you are like me, there will be other cells failing further down 🙂

Sometimes a model runs for a long time, and when you click Stop (the rectangle on the left of the cell) it doesn’t stop. In that case I use “Restart runtime” (the last blue arrow above). It’s an absolute godsend! Whatever the code is doing, it will stop. Psst, you will also see “Restart runtime” if you click the Stop button on the left of the cell 3 times.


At the top right of the screen, next to your initials, you can see a wheel icon. This is for Settings.

The dark theme is set here, along with the font, etc. Ever wondered why the tab is 2 spaces, not 4? You can set it here. Do you want to see line numbers? Set it here. Mind you, it’s the line number within a cell, not the cell number.


When editing, the first button (A) makes the cell a header, i.e. it begins with a “#”. Keep clicking that icon and it changes to level 2 and level 3. Button B formats whatever we are highlighting as code, and button C inserts a link. You can even insert an image (D), a line (E) or an emoji (G). Button F is for LaTeX (a typesetting system, link) and H puts the preview below rather than on the right.

You can get the link to a particular cell (button I), and you can add a comment on an individual cell (J). I use button K often; it closes the editor (you can also press Escape). The last button (L) opens up the cell.

File Menu

In the File menu you can open or upload a notebook, rename it or save a copy, and download the notebook. I press Ctrl-S (save) often, but it’s actually not necessary because the notebook automatically saves itself very often.

When working on a project for a few days or weeks, I download the notebook to my PC with a different file name each time (file1, file2, file3, etc.), so I can see any previous code I’ve written and copy it back into my current code if required.

If the automatic save fails and it says something like “auto sync fails because of another copy”, click the Runtime menu, then Manage sessions. You’ll see two sessions open for the same notebook file. Terminate one of them and the automatic saving is fixed.

There you go, I hope it was useful. Any comments, corrections or advice, you can contact me on
Happy coding in machine learning!

5 October 2021

Reinforcement Learning

Filed under: Machine Learning — Vincent Rainardi @ 7:28 am

Reinforcement Learning (RL) is a branch of machine learning where we have an agent and an environment, and the agent learns from the environment by trying out different actions in each state to find out which action gives the best reward.

Examples of RL are self-driving cars, games (e.g. chess, Go, DOTA 2), portfolio management, chemical plants and traffic control systems. In portfolio management, the agent is the algo-trading robot, the state is the holdings in the portfolio, the actions are buying and selling stocks, and the reward is the profit made.

It is probably best to illustrate this using the Grid World game (link). Imagine a robot walking from the start box to the finish box, avoiding the X boxes:

Here are the rules of the game:

  1. In this case the agent is the robot. Let’s call it R. The state is R’s position, i.e. A1, B3, etc. R’s initial state is A1. The goal is for R to get to D4. The episode finishes when R gets to D4, or when R hits the X1 or X2 box.
  2. The reward is: every time R moves it gets -1, if R reaches D4 it gets +20 and if R hits X1 or X2 it gets -10. The action is R’s movement. From A1 if R moves to the right the next state is A2, if R moves down the next state is B1.
  3. If R hits the perimeter wall the next state is the same as the current state. For example from A1 if R moves to the left or upwards the next state is A1.
  4. There are 4 possible actions from each state: move right, move left, move up, move down.
  5. There are 13 possible states (R possible locations), i.e. 4×4 = 16 minus X1, X2 and Finish.
  6. Example of an episode: A1, A2, A3, B2, C2, D2, Finish. The reward is 6 × (-1) + 20 = 14.
    Another example: A1, B1, X1. The reward is 2 × (-1) − 10 = −12.

These are the 2 basic equations in RL (they are called the Bellman equations, link):

  • The first one means: the value of a particular state is the sum of the values of all possible actions in that state, weighted by the probability of those actions.
    v is the value of state s, π is the policy and q is the value of action a in state s.
  • The second one means: the value of a particular action in a particular state is the sum of the immediate reward and the discounted value of the next state, taken over the model probabilities.
    p is the environment model, r is the immediate reward, γ is the discount factor (how much we value future reward), and v is the value of the next state.
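Written out (reconstructed here in standard notation, since the equation images are not reproduced), the two Bellman equations are:

```latex
v_\pi(s) = \sum_{a} \pi(a \mid s)\, q_\pi(s,a)
\qquad
q_\pi(s,a) = \sum_{s',\,r} p(s',r \mid s,a)\,\bigl[\, r + \gamma\, v_\pi(s') \,\bigr]
```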

The environment model, denoted by p(s’,r|s,a) in the second equation above, takes in a state and an action, and gives back the reward and the next state. For example, we put in state = A1 and action = go right (that’s the s and the a), and the output of the environment model is reward = -1 and next state = A2 (that’s the r and the s’).

A policy is the probability of choosing each action in a state. For example, in A1, policy 1 says that the probability of going right = 40%, left = 10%, up = 10%, down = 40%, whereas policy 2 assigns different percentages to those 4 actions in A1. A policy defines the action probabilities for every single state, not just 1 state.

The goal is to choose the best policy, the one with the best action for every state. The best action is the one with the highest total reward, not just the immediate reward but all future rewards as well (over an episode). If an action gives us a high reward now, but over the next few steps it gives low rewards, then the total reward will be low.


Training an RL model means:

Step 1. First we initialise every state with an action. Say we initialise all states with action = Up, like this:

Step 2. We start from A1. We calculate the reward for all 4 actions in A1, then choose the best action (say right). So the state is now A2. We calculate the reward for all 4 actions in A2, and choose the best action again. We do this until the episode finishes, i.e. we arrive at Finish or we hit x1 or x2.

For example, this is what we end up with in Episode 1, i.e. we took the yellow path. The reward is 5 × (-1) − 10 = −15.

Step 3. We do another episode. It is important that we explore other possibilities, so we don’t always choose the best action but sometimes deliberately take a random action. For example, in this episode we go down from A1, then left from B1 (so the next state is still B1), then right from B1 and hit the X1 box. The reward is 3 × (-1) − 10 = −13.

Step 4. We don’t want to keep exploring forever, so as time goes by we explore less and less, and exploit more and more. Explore means we choose an action randomly. Whereas exploit means we take the best action for that state.

We use a hyperparameter called epsilon (ε) to do this. We start with ε = 1 and slowly decrease it to 0. At every move we generate a random number. If this random number is lower than ε we explore, but if it is higher than ε we exploit.

So in the initial episodes our score is low, but after a while it will be high. The score is the total reward per episode. The maximum score is 14, i.e. when the robot goes directly to Finish by the shortest possible route. The worst score is a big negative number, i.e. if the robot keeps going around in circles endlessly. Remember that every move gets -1; this motivates the robot to reach the Finish box as soon as possible.
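The loop in Steps 1–4 can be sketched as tabular Q-learning (my sketch; the positions of the X boxes are assumptions, since the grid figure is not reproduced here):

```python
import numpy as np

ROWS, COLS = 4, 4
START, GOAL = (0, 0), (3, 3)                  # A1 and D4
OBSTACLES = {(1, 1), (2, 2)}                  # assumed positions of X1 and X2
ACTIONS = [(0, 1), (0, -1), (-1, 0), (1, 0)]  # right, left, up, down

def step(state, a):
    r, c = state[0] + ACTIONS[a][0], state[1] + ACTIONS[a][1]
    if not (0 <= r < ROWS and 0 <= c < COLS):
        r, c = state                          # hitting the wall: stay put
    if (r, c) == GOAL:
        return (r, c), -1 + 20, True          # -1 per move, +20 for finishing
    if (r, c) in OBSTACLES:
        return (r, c), -1 - 10, True          # -1 per move, -10 for hitting an X box
    return (r, c), -1, False

rng = np.random.default_rng(42)
Q = np.zeros((ROWS, COLS, len(ACTIONS)))      # one Q value per state-action pair
alpha, gamma, epsilon = 0.5, 0.9, 1.0

for episode in range(1000):
    state, done = START, False
    while not done:
        # epsilon-greedy: explore with probability epsilon, otherwise exploit
        if rng.random() < epsilon:
            a = int(rng.integers(len(ACTIONS)))
        else:
            a = int(np.argmax(Q[state]))
        nxt, reward, done = step(state, a)
        # Q-learning update: immediate reward plus discounted best future value
        target = reward + gamma * np.max(Q[nxt]) * (not done)
        Q[state][a] += alpha * (target - Q[state][a])
        state = nxt
    epsilon = max(0.0, epsilon - 1 / 200)     # epsilon decay: fully greedy after 200 episodes
```

After training, following the greedy action (argmax over Q) from the start should walk the robot to the Finish box.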

So if you play say 1000 episodes, the score would be something like this:

We can see that from the beginning until episode 200 the score is constantly increasing. This is because at the start we set ε to 1 (fully exploring) and we taper it slowly to 0.9, 0.8, 0.7, etc. until it reaches 0 at episode 200. This is called epsilon decay.

From episode 200 the score stays “in a band”, i.e. within a certain range. We say the score has “converged”. In this case it is between 8 and 14. From episode 300 the band narrows to 9-14, from episode 500 it is 10-14, and from episode 700 to 1000 the score is 11-14. The band gets smaller and smaller because after episode 200 the RL model doesn’t explore any more; it only exploits, i.e. takes the best possible actions. And it is still learning, which is why the score keeps increasing over time.

Model Free

One of the most important things to remember in RL is that in most cases we don’t have a model of the environment. So we need to use a neural network to estimate the value of q (the value of an action in a state). This neural network is called the Q network, and because it consists of many layers, it is called a Deep Q Network, or DQN. Like this:

The inputs of the DQN are a state vector (one-hot encoded) and an action vector (also one-hot encoded).
For example, assuming we have 3 states (boxes A1, A2, A3) and 2 actions (left and right):

  • For state A1 the state vector is [1  0  0]
  • For state A2 the state vector is [0  1  0]
  • For state A3 the state vector is [0  0  1]
  • For action Go Left the action vector is [1  0]
  • For action Go Right the action vector is [0  1]

Because we don’t have the environment model, we generate the data (called “experiences”) using the DQN. We put in the state and the action, and get the Q value (the reward). We do this many times and save the experiences into memory (called the Replay Buffer). Once we have many experiences in the Replay Buffer (say 30,000), we pick a batch of, say, 100 experiences at random and use that data to train the network.
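The Replay Buffer itself is just a bounded queue that we sample uniformly at random; a minimal sketch (the capacity and batch size follow the numbers above):

```python
import random
from collections import deque

buffer = deque(maxlen=30_000)   # old experiences are dropped once the buffer is full

def store(state, action, reward, next_state, done):
    buffer.append((state, action, reward, next_state, done))

def sample_batch(batch_size=100):
    # Uniform random sampling breaks the correlation between consecutive steps
    return random.sample(buffer, batch_size)

# Fill with dummy experiences, then draw a training batch
for i in range(200):
    store(i, 0, -1.0, i + 1, False)
batch = sample_batch()
```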

Three Architectures of DQN

There are 3 architectures of DQN. The first is the one in the previous diagram (drawn here again for clarity). It takes the state and action vectors as input, and the output is the Q value (the reward) for that state and action. For example: from state A1 [1  0  0] take action = Go Right [0  1], and the reward is -1. So we feed in [1  0  0  0  1], i.e. the concatenation of the state and action vectors, and get -1 as the output.

The second architecture takes the state vector as input, and the output is the Q value (the reward) for each action. For example, the input is A1 and the output is -1 for Go Left and -1 for Go Right. The advantage of this architecture is that we only need to feed each state into the network once, whereas with the first architecture we need to feed in each state-action combination.

The problem with both of the above architectures is that we use the same network for calculating the expected Q values and for predicting the Q values. This makes the system unstable, because as the network weights are updated at each step, both the expected Q and the predicted Q change.

This problem is solved by the third DQN architecture, called Double DQN. In this architecture we use 2 networks, one for predicting the Q value and one for calculating the expected Q value. The former is called the Main network and the latter the Target network. The weights of the Main network are updated at every step, whereas the weights of the Target network are updated at every episode. This keeps the expected Q values (the target Q values) stable throughout the episode.

We only train the Main network. The weights of the Target network are not updated using backpropagation, but by copying the weights of the Main network. This way, at the beginning of every episode the Target network is the same as the Main network, and we use it to calculate the target/expected Q values.
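This weight copy is a “hard update”. In Keras it is one line, target_model.set_weights(main_model.get_weights()); here is a framework-free sketch with the networks’ weights represented as lists of arrays (my illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the Main and Target networks' weights
main_weights   = [rng.normal(size=(3, 4)), rng.normal(size=(4,))]
target_weights = [np.zeros((3, 4)), np.zeros(4)]

def sync_target(main, target):
    # Hard update: at the start of each episode the Target network simply
    # copies the Main network's weights (no backpropagation on the Target)
    for i, w in enumerate(main):
        target[i] = w.copy()

sync_target(main_weights, target_weights)
```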

Portfolio Management

In portfolio management, if we have 50 holdings in the portfolio and the investment universe is 500 stocks, then there are 501 possible actions: sell any of the 50 holdings, buy any of the 450 stocks we are not currently holding, or do nothing. And what are the states? The 50 holdings could be any possible combination of the 500 stocks in the investment universe – that’s a lot of states!

And we are not limited to buying or selling just 1 stock. We can buy several stocks: we can buy 2 stocks or sell 2 stocks, or 3 or 4!

And we have not factored in the price. In real life the action is not “buy stock X” but “buy stock X at price P”. In this case the state is every possible combination of 50 out of 500 stocks, at various different prices. That’s really a lot of states! So it is very resource intensive.
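To get a feel for the numbers, a quick back-of-the-envelope calculation for the 50-of-500 setup above (prices ignored):

```python
import math

# Actions in the example above: sell one of the 50 holdings,
# buy one of the 450 stocks we don't hold, or do nothing
num_actions = 50 + 450 + 1   # = 501

# States (ignoring prices): every possible choice of 50 holdings out of 500
num_portfolios = math.comb(500, 50)
print(num_actions)
print(f"{num_portfolios:.2e}")   # roughly 2.3e+69 holding combinations
```

Even before prices enter the picture, the state space is astronomically large, which is why tabular Q-learning is hopeless here and a function approximator (the neural network) is needed.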


Reinforcement Learning (RL) is probably the most complicated machine learning approach. It uses deep neural networks, we have to generate the data (the experience) ourselves, and it is resource intensive, especially when we have many states and many actions.

But it is the one we use for many things these days. We use it for robotics (check out Boston Dynamics), for self-driving cars, for playing computer games, and for algo trading (portfolio management). Check out OpenAI playing Dota 2: link. Guess how many CPUs they used? 128,000 CPU cores plus 256 GPUs!

12 August 2021

RNN Applications

Filed under: Machine Learning — Vincent Rainardi @ 6:23 am

A Recurrent Neural Network is a machine learning architecture for processing sequential data; see my article here: link. The applications of this architecture are amazing. For example, we can generate a song from a single note, or generate poetry, a story or even C code!

Here is a list of various amazing applications of RNN:

  1. Video classification: link
  2. Image classification: link
  3. Image captioning: link
  4. Sentiment analysis: link
  5. Language translation: link
  6. Making music: link
  7. Writing poem: link
  8. Writing code: link
  9. Generating text: link
  10. FX trading: link
  11. Stock market prediction: link
  12. Speech recognition: link
  13. Text to speech: link

An RNN application is about a sequence of data. That sequence can be the input or it can be the output, and it can be a sequence of numbers, musical notes, words or images.

If the sequence of data is the output, then it becomes a creation. For example:

  • If the output is a sequence of notes, then the RNN is “writing music”.
  • If the output is a sequence of words, then the RNN is “writing a story”.
  • If the output is a sequence of share prices, then the RNN is “predicting share prices”.
  • If the output is a sequence of voices, then the RNN is “speaking”.
  • If the output is a sequence of colours, then the RNN is “painting”.

That is very powerful, right? This is why AI has really taken off in the last few years: finally AI can create a song, a speech, a painting, a story, a poem, an article. Finally AI can predict a sequence of numbers, not just one number but a series of numbers. That has very, very serious consequences. Imagine if that series of numbers is the temperature every hour for the next few days.

Imagine if that series of numbers is stock prices over the next few weeks. Imagine if the prediction is accurate. It would turn the financial world upside down!

Three Categories of RNN

RNN applications can be categorised into 3:

  1. Classification
  2. Generation
  3. Encoder Decoder

Classification is about categorising a sequence of images or data into categories, for example:

  • Classifying films into action, drama or documentary
  • Classifying stock market movements into positive or negative trend
  • Classifying text into news, scientific or story

In classification the output is a single number.

Generation is about making a sequence of data based on another data, for example:

  • Making a sequence of musical notes e.g. a song.
  • Making a sequence of words e.g. a poem, Python code or a story.
  • Making a sequence of numbers e.g. predicting stock market.

For generation we need a “seed”, i.e. data on which the creation is based. For example, when generating a sequence of words we need a starting word; when generating a sequence of musical notes we need a starting note.
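As a toy illustration of the “seed” idea, here `next_note` is a hypothetical stand-in for a trained RNN’s one-step prediction (a real model would output a probability distribution over notes):

```python
# Toy "generation from a seed": each new note is predicted from the
# sequence so far, then appended and fed back in for the next step.
NOTES = ["C", "D", "E", "F", "G", "A", "B"]

def next_note(history):
    # Hypothetical rule standing in for a trained model:
    # step up one note from the last one, wrapping around
    return NOTES[(NOTES.index(history[-1]) + 1) % len(NOTES)]

seed = ["C"]            # the seed the creation is based on
melody = list(seed)
for _ in range(6):
    melody.append(next_note(melody))
print(melody)  # ['C', 'D', 'E', 'F', 'G', 'A', 'B']
```

The loop structure (predict one step, append, repeat) is exactly how RNN generation works; only the prediction rule here is fake.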

Encoder Decoder consists of 2 parts. The first part (encoder) encodes the data into a vector. The second part (decoder) uses this vector to generate a sequence of data. For example: (image source: Greg Corrado)

The words in the incoming email are fed one by one as a sequence into an LSTM network and encoded into a vector representation. This vector (called the thought vector) is then used to generate the reply, one word at a time. In the example above, “Yes,” was generated first, and then that word was used to generate the second word, “what’s”. Then these 2 words were used to generate the third word, and so on.
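A toy sketch of that encode-then-decode flow. Everything here (the tiny vocabulary, the random weights, the fake "fold a word into the state" step) is made up for illustration; a real model would use trained embeddings and an LSTM:

```python
import numpy as np

rng = np.random.default_rng(42)
VOCAB = ["yes,", "what's", "up?", "<eos>"]

def encode(words, dim=8):
    # Encoder: fold the incoming words into one "thought vector" (toy recurrence)
    h = np.zeros(dim)
    for w in words:
        x = rng.normal(size=dim)   # stand-in for a word embedding
        h = np.tanh(h + x)
    return h

W = rng.normal(size=(8, len(VOCAB)))

def decode(thought, max_len=4):
    # Decoder: generate the reply one word at a time, feeding each word back in
    reply, h = [], thought
    for _ in range(max_len):
        word = VOCAB[int(np.argmax(h @ W))]   # greedy pick of the next word
        if word == "<eos>":
            break
        reply.append(word)
        h = np.tanh(h + rng.normal(size=8))   # fold the chosen word back in (toy)
    return reply

print(decode(encode(["are", "you", "free", "tomorrow?"])))
```

The key point is the two phases: the whole input is compressed into one vector before any output is produced, and each generated word conditions the next.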

Using RNN to forecast stock prices

One of the popular techniques to forecast stock prices is RNN-based (the other is Reinforcement Learning). We can see the trend in Ref 1 below.

In a paper by Wenjie Lu, Jiazheng Li, Yifan Li, Aijun Sun and Jingyang Wang (see Ref 2 below), we can see that among the RNN-based techniques the most accurate one is a combination of CNN and LSTM. The CNN is used to extract features from the stock price history, and the LSTM is used to predict future stock prices. The result is like this (Ref 2):

The stock price in this case is the Shanghai Composite Index from 1/7/1991 to 31/8/2020. The last 500 days are used as test data, the rest as training data. They compared 6 methods: MLP, CNN, RNN, LSTM, CNN-RNN and CNN-LSTM, and the results in terms of Mean Absolute Error (MAE) are as follows:


  1. Hu, Z.; Zhao, Y.; Khushi, M. A Survey of Forex and Stock Price Prediction Using Deep Learning. Appl. Syst. Innov. 2021, 4, 9.
  2. Wenjie Lu, Jiazheng Li, Yifan Li, Aijun Sun, Jingyang Wang, “A CNN-LSTM-Based Model to Forecast Stock Prices”, Complexity, vol. 2020, Article ID 6622927, 10 pages, 2020.

7 August 2021

Recurrent Neural Network (RNN) and LSTM

Filed under: Machine Learning — Vincent Rainardi @ 4:16 pm

In machine learning, a Recurrent Neural Network (RNN) is used to predict data that happens one after another, i.e. data with a time element, a sequence/order. For example, a video: a video is a series of images arranged in a particular order. Stock prices are also data of this kind; they happen day after day (or second after second), in sequence. A document is also sequential data: the words are arranged in a particular sequence.

Let’s say we have a video of a dog running, and we try to classify whether the dog in the video jumps or not. The input is, say, 100 frames of images, and the output is a binary number: 1 means jump and 0 means not jump, like below (image source: link).

The Recurrent Neural Network receives the 100 images as input, one image at a time, in a particular order, and the output is a binary number, 1 or 0. So it’s a binary classification.

So that’s the input and output of RNN. The input is a sequence of images or numbers (or words), and the output is … well, there are a few different kinds actually:

  1. One output (like above), i.e. we just take the last output.
  2. Many outputs, i.e. we take the output at many different times.
  3. Generator, e.g. based on 1 note we generate a song.
  4. #1 above followed by #3 (called encoder-decoder), e.g. Gmail Smart Compose (link).


In the early days the RNN architecture was similar to the normal neural network architecture. See below: input x1 (the first image or number) was fed through a series of neural network layers until we got output y1, and input x2 (the second image or number) was fed through a series of neural network layers until we got output y2, like this:

The difference from a normal neural network was the red arrows above, i.e. the values of the hidden layers from time slot 1 were fed into time slot 2. So each layer in time slot 2 received two inputs: x2 and the values of the hidden layers in time slot 1 (multiplied by some weights).

If we take just 1 node in a layer, we can show what happens in this node across time (below left). The node (s) receives input (x), producing output (h). The state of the node (s) is multiplied by a weight (w) and sent back to itself.

The left diagram is simplified into the right diagram above, i.e. we only draw 1 copy, with a circular w arrow from s pointing back to itself. The right diagram is called the “rolled” version, and the left one the “unrolled” version.

Note that in the diagram above the output is h not y, because it is the output of a node in a layer, not the final output of the last layer.
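To make the recurrence concrete, here is a tiny numeric sketch of that single node across time. The weights are arbitrary, chosen purely for illustration:

```python
import math

# Unrolled view of one recurrent node: the node's state s is multiplied by a
# weight and fed back into itself at the next time slot.
w_x, w_s = 0.5, 0.8          # illustrative weights: input weight and recurrent weight
xs = [1.0, 0.0, -1.0, 0.5]   # the input sequence x1..x4

s = 0.0
outputs = []
for x in xs:
    s = math.tanh(w_x * x + w_s * s)   # new state from current input + previous state
    outputs.append(s)                  # h: this node's output at each time slot
print([round(h, 3) for h in outputs])
```

Notice that even when the input is 0 (the second time slot), the node still produces an output, because its previous state feeds back in through w_s.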

I saw the rolled version of the RNN diagram above for the first time about 2 years ago and I had no idea what it was. I hope you can understand it now.

Long Short Term Memory (LSTM)

These days no one uses this original RNN architecture any more. Today everyone uses some variant of LSTM, which looks like this:

This architecture is called Long Short Term Memory because it uses many short term memory cells to create a long term memory (link), meaning it is able to remember a long sequence of input, e.g. 5 years of historical stock data. The old RNN inherently has a problem with long sequences because of the “vanishing gradient” problem (link). The “exploding gradient” problem (link) is not as big an issue because we can cap the gradient (called “gradient clipping”).
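Gradient clipping itself is simple; here is a minimal sketch (the cap of 5.0 is an arbitrary illustrative value):

```python
import numpy as np

def clip_gradient(grad, max_norm=5.0):
    # Gradient clipping: rescale the gradient if its norm exceeds a cap,
    # which tames the exploding-gradient problem in RNN training
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad

g = np.array([30.0, 40.0])   # norm 50, way above the cap
print(clip_gradient(g))      # → [3. 4.]  (norm capped at 5)
```

Clipping keeps the gradient's direction and only shrinks its length, which is why it is a safe fix for exploding gradients but does nothing for vanishing ones.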

Cell Memory

On the LSTM diagram, the horizontal line at the top (from ct-1 to ct) is the cell state.
It is the memory of the cell, i.e. the short term memory. Along this line, 3 things happen: the cell state is multiplied by the “forget gate”, increased or reduced by the “input gate”, and finally its value is read by the “output gate”.

So what are these 3 gates? Let’s go through them one by one.

Forget Gate

The forget gate removes unwanted information from the cell state.
The value of σ is between 0 and 1. By varying the value of σ we can adjust how much information is removed from the cell state. The current input (xt) and the previous output (ht-1) are passed through the σ function.

So the impact of this forget gate on the cell state is:

where bf is the bias and Wf and Uf are the weights (link). The blue circle with a cross is element-wise multiplication.

Bear in mind that t is the current time slot and t-1 is the previous time slot. Notice that h and x have their own weights.

Input Gate

The input gate adds new information into the cell state. As we can see below, the current input (xt) and the previous output (ht-1) pass through a sigma gate and a tanh gate, are multiplied together, and then added to the cell memory line.

Here i controls how much a influences c. The value of tanh is between -1 and +1, so a can decrease or increase c, and the amount of a’s influence on c is controlled by i.

So the impact of this input gate on the cell state is: (link)

Notice that h and x have their own weights, both for i and a.

Output Gate

The output (h) is taken from the cell state (c) using the tanh function. The value of tanh is between -1 and +1, so it can make the output positive or negative. The amount of influence this tanh(c) has on h is controlled by o. o is calculated from the previous output (ht-1) and the current input (xt), each with its own weights, using a sigma function.

So the output (h) is calculated like this:

The complete equation is on Wikipedia: link, which is from Hochreiter and Schmidhuber’s original LSTM paper (link) and Gers, Schmidhuber and Cummins’ paper Learning to Forget (link).
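Putting the three gates together, here is a minimal numpy sketch of one LSTM step. The variable names mirror the f, i, a and o gates above; the weights are random illustrative values, not a trained model:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    # One LSTM time step, following the three gates described above.
    # p holds the weights W* (for x), U* (for h) and biases b* per gate.
    f = sigmoid(p["Wf"] @ x + p["Uf"] @ h_prev + p["bf"])   # forget gate
    i = sigmoid(p["Wi"] @ x + p["Ui"] @ h_prev + p["bi"])   # input gate
    a = np.tanh(p["Wa"] @ x + p["Ua"] @ h_prev + p["ba"])   # candidate values
    c = f * c_prev + i * a                                  # new cell state
    o = sigmoid(p["Wo"] @ x + p["Uo"] @ h_prev + p["bo"])   # output gate
    h = o * np.tanh(c)                                      # new output
    return h, c

# Tiny example: 3-dim input, 2-dim hidden state, random illustrative weights
rng = np.random.default_rng(1)
dim_x, dim_h = 3, 2
p = {}
for g in "fiao":
    p[f"W{g}"] = rng.normal(size=(dim_h, dim_x))
    p[f"U{g}"] = rng.normal(size=(dim_h, dim_h))
    p[f"b{g}"] = np.zeros(dim_h)

h, c = np.zeros(dim_h), np.zeros(dim_h)
for x in rng.normal(size=(4, dim_x)):   # a sequence of 4 inputs
    h, c = lstm_step(x, h, c, p)
print(h.shape, c.shape)
```

Note that h and x each have their own weight matrices (W and U), exactly as pointed out for each gate above, and the cell state c flows through the step modified only by the forget and input gates.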

A variant of LSTM is the Gated Recurrent Unit (GRU). A GRU does not have an output gate like LSTM; instead it has a reset gate and an update gate: link.


  1. Wikipedia, RNN: link
  2. Wikipedia, LSTM: link
  3. Wikipedia, GRU: link
  4. Andrej Karpathy, The Unreasonable Effectiveness of RNN: link
  5. Michael Phi, Illustrated Guide to LSTM and GRU: link
  6. Christopher Olah, Understanding LSTM: link
  7. Gursewak Singh, Demystifying LSTM weights and bias dimensions, link
  8. Shipra Saxena, Introduction to LSTM: link
  9. Gu, Gulcehre, Paine, Hoffman, Pascanu, Improving Gating Mechanism in RNN: link
  10. Hochreiter and Schmidhuber, LSTM: link

13 July 2021

Learning Machine Learning with Upgrad

Filed under: Machine Learning — Vincent Rainardi @ 7:46 am

In the last 10 months I’ve been doing a master’s degree in machine learning with Upgrad (link). It has been a very good journey, very enjoyable. I really like it a lot. The opening webinar back in October 2020 was fantastic. They talked about various applications of AI such as image recognition for blind people, chest X-ray diagnosis, NFL video advert analysis, Makoto Koike’s cucumber sorting, Alpha Go Zero and Volvo recruiting car. Everyone was assigned a student mentor who guides us through the journey and answers our non-academic questions. We have technical assistants who answer our academic questions (we have a discussion forum too). We learn primarily through videos (which suits me well, as I’m in the UK with different working hours to India) and their learning platform is very good. Every week we have optional doubt resolution sessions where we can ask questions to real teachers (their teachers are very good at explaining difficult concepts so they are easy to understand). There are also a lot of webinars where industry experts share their real-world experience of AI.

The thing I like best is the small group coaching where we learn in a group of eight, coached by an industry expert. My coach is from Paypal, the same industry as me (I work in asset management in London). The session is interactive: our coach explains things and we can ask questions, and it is always practical, often discussing the “notebook” (meaning the Python code, for those not familiar with Jupyter). My coach is an expert in ML and a very good teacher. We are really lucky that he’s willing to spend time coaching us. Sometimes we had one-to-one discussions with our coach. At one time (just once) we students taught each other, and we learned from one another. Everyone was also assigned an industry mentor, with whom I discuss my job in the real world, my blog, and my aspirations/ideas in ML. Most students are looking for a job in ML and received a lot of guidance from their mentor. I’m not looking for a new job, but I’m very grateful to have a very experienced mentor. My mentor is from Cap Gemini, an industry leader in AI, with 25 years of experience (13 of which were with Microsoft). I’m really lucky that he’s willing to spend time mentoring me.

In the first month I was learning Python and SQL, covering data structures, control structures, pandas, numpy, data loading, visualisation, etc., all in Jupyter notebooks. I’m a SQL and BI veteran but I rarely do coding at work. I mean real coding, not SQL, ETL or BI tools. The last time I did real coding was 10 years ago (Java), and before that 20 years ago (C#). When I was young I really liked coding (Basic, C++, Pascal), and the Python coding with Upgrad really took me back to my childhood hobby. I really enjoy coding in Python as part of this course.

Then I learned about statistics and data exploration. I did Physics Engineering at uni, so I had done statistics before, and learning it again was enjoyable. The teacher was really good (from Gramener, link) and gave us real-world examples like restaurant sales, securities correlation and electricity meter readings. I also learned about probability, the central limit theorem and hypothesis testing. All of these turned out to be very useful when applying machine learning algorithms. The assignments were real-world cases, such as investment analysis and loans, and the fact that they were in finance made me enjoy them more.

Then for a few months I learned various ML algorithms such as linear regression, logistic regression, Naive Bayes, SVM, Decision Tree, Random Forest, Gradient Boosting, clustering and PCA, as well as various important techniques such as regularisation (Ridge, Lasso), model selection, accuracy and precision. Again the assignments were real-world cases, such as predicting house prices, how weather affects sales, and cases from the telecommunications industry.

Then I learned about natural language processing (NLP), which was very different. All the other algorithms were based on mathematics, but this one is based on languages. It was such an eye opener for me to learn how computers understand human languages (I wrote an article about it: link). And now I’m learning neural networks, which is the topic I like most because they are the most powerful algorithms in machine learning. We started with computer vision (CNN, convolutional neural network: link) and now I’m studying RNNs (Recurrent Neural Networks: link), which are widely used for stock market analysis and other sequential data.

I feel lucky I studied Physics Engineering at uni, because it helped me a lot in understanding the mathematics behind the algorithms, especially the calculus in neural networks. I’ve done a few ML courses on Coursera (see my articles on this: link, link) but this Upgrad one is way, way better. It is a real eye opener. I can now read various machine learning papers. I mean real academic research papers, written by PhDs! A few years ago I was attending a machine learning “meetup” in London. Meetup is an app where people with similar interests gather to meet. Usually the ML meetups were in the form of a lecture, i.e. a 1.5-hour session in the evening where two speakers explained machine learning. But this time it was different. It was a discussion forum of 10 people and there was no speaker. Everyone had to read a paper (it was the Capsule Neural Network paper by Geoffrey Hinton) and in this meetup we discussed it. I didn’t understand a thing! I did understand neural networks a bit, but I had no background in CNN so I could not understand the paper. But now I understand. I can read research papers! I didn’t know I would be this happy to be able to read machine learning papers. It is really important to be able to read ML papers, because ML progresses so fast and the research papers are a superb source on the latest inventions in ML.
