Data Warehousing and Data Science

20 October 2022

Forecasting stock prices using LSTM

Filed under: Data Warehousing — Vincent Rainardi @ 7:53 am


Long Short-Term Memory (LSTM) is a machine learning algorithm that can recognise patterns in sequential data. For example: a series of words in an email, a series of spoken words (voice), a series of pictures (a video), daily rainfall, daily solar radiation, or monthly sales. LSTM has the ability to recognise the pattern in the data and make predictions about future values. It has been implemented widely in many areas, including the financial markets.


LSTM belongs to a family of algorithms called Recurrent Neural Networks (RNN). An RNN is a machine learning algorithm that has the ability to learn the pattern in sequential data. Apart from LSTM, there are other algorithms in the RNN family, for example the Gated Recurrent Unit (GRU). An RNN, as the name implies, is a neural network: an architecture made of a series of layers, where each layer consists of nodes. Each node in an LSTM contains 3 gates: the forget gate, the input gate and the output gate.

Stock Markets

There are many different forms of financial markets: stock markets, bond markets, commodity markets (trading the future price of oil, gold, aluminium, wheat, etc.), currency markets (trading the exchange rates of fiat currency pairs and cryptocurrencies), money markets (trading very liquid securities such as US Treasuries) and derivative markets (trading fixed income and equity derivatives such as options, futures and swaps). These markets have different characteristics. Some are volatile, like stock markets and currency markets; some are quite stable, like bond markets and money markets. For individual investors, the stock market is arguably the most popular one, much more popular than the bond, commodity, derivative or money markets.

Forecasting vs prediction

Before we get into stock price forecasting, let’s understand the difference between prediction and forecasting. In machine learning the word “prediction” means trying to guess the value of a variable. If you have the number of windows in a house and you try to guess the price of the house, that’s prediction. Or if you have people’s heights and you try to guess their weights, that’s prediction. Of course you can use many variables for prediction, for example using the house’s location, floor area, number of bedrooms and age to predict its price.

In machine learning, forecasting deals specifically with time: it means predicting future values. For example, how many washing machines will be sold in our store next month, or whether it is going to rain tomorrow. We can base our forecast on past values of the same metric; for example, we can forecast next month’s inflation based on the inflation of the last 2 years. Or we can base our forecast on something else entirely; for example, we can forecast traffic jams based on the scheduled events in that city. And of course, forecasting is a form of prediction, because predicting means trying to guess the value of a variable, including its value in the future.

Stock Price Forecast

In the stock market there are a lot of variables that we can forecast (i.e. guess the future value of). For example, we can try to predict which companies will leave the S&P 500 next year. We can predict the number of transactions tomorrow, or the volume, or the direction of the price tomorrow (up or down), or the market sentiment. Or we can try to correlate one variable with another. Out of all these variables, the stock price is the most popular one. That is not surprising, because if you knew what a stock’s price would be in the future, you could make a huge fortune.

There are many stock price forecasting articles. Most of them use only the last few weeks of data, and they forecast only the next day’s price. That is not what happens in the real world. Many use the closing price or the opening price; again, that is not what happens in the real world. When investors sell, the price they get is not the opening or closing price, nor the highest or lowest price; they get something between the highest and the lowest price. It is not called the mid price either, because the mid price is the midpoint between the bid price and the offer (ask) price for a particular trade, whereas high, low, open and close summarise a whole day’s trades for a particular stock or bond (or options, futures, any other derivative, commodity or currency for that matter).

Principles of Forecasting

Many forecasting methods are based on the assumption that the future will behave like the past. That is the fundamental reason to believe that we can predict the future at all, and it is the foundation of traditional time series forecasting methods such as ARIMA and Exponential Smoothing.

The technical term for this is autoregression, meaning that the future values of a time series are related to a linear combination of its past values (or autocorrelation if the relationship is not linear). The other technical term is the moving average model, meaning that the future values of a time series are related to past forecast errors. Autoregression and the moving average model are the two fundamental principles used in ARIMA.

The Exponential Smoothing technique, on the other hand, assumes that the future values of a time series are related to a weighted average of its past values, where recent observations get larger weights than observations from the distant past.

The Art of Stock Forecasting

Like any prediction, the art of stock forecasting is choosing the right variables to use. The stock’s own time series might not be enough for forecasting its future; in that case we need to include other variables, such as the volume, other stocks, and the industry average. That is the most important ingredient: the input variables. The first step, before choosing the algorithm or method, is to establish a set of variables which are highly correlated with the future price of the stock.

The second most important ingredient in stock forecasting is choosing the right algorithm. It does not have to be machine learning. Time series forecasting is more than a hundred years old, so ignore the non-ML methods at your peril. Combine the traditional methods and the ML methods: learn about Facebook’s Prophet, GARCH, ES, ARIMA, Theta and other non-ML methods, and combine them with popular ML methods commonly used for forecasting time series such as LSTM and CNN. But don’t stop there; also try unconventional ML algorithms such as XGBoost and Random Forest. There are tons of papers on each method, which provide a good starting point.

That is the art of stock forecasting: finding the right combination of variables and algorithms that gives us the best results. And then there is the third ingredient: tuning those combinations, trying out different regularisations, different hyperparameters, different optimisers and different evaluation criteria, all in an attempt to get the best results.

Different types of LSTM

When it comes to stock price forecasting using LSTM, there are 5 different architectures that we can use: vanilla LSTM, stacked LSTM, convolutional LSTM, bidirectional LSTM and CNN LSTM.

  • Vanilla LSTM is a single LSTM layer, possibly with a regularisation measure such as a dropout layer, and optionally with one or two dense layers for converting the output into classes.
  • Stacked LSTM means multiple LSTM layers stacked one after another.
  • Bidirectional LSTM learns the time series both forwards and backwards.
  • Convolutional LSTM reads the time series using the convolution method (see my article here for what convolution means).
  • CNN LSTM also reads the time series using the convolution method.

The difference between CNN LSTM and Convolutional LSTM is that CNN LSTM uses a separate convolution layer to read the input and then passes the output to an LSTM layer, whereas Convolutional LSTM does not use a separate convolution layer; it uses a Keras layer called ConvLSTM which already has the convolution reading built in.


Implementation examples of these 5 architectures can be seen on Jason Brownlee’s website.

Here is an example for Vanilla LSTM: (my own implementation)
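A minimal sketch of a vanilla LSTM along these lines, using Keras. The window length, the number of LSTM units and the dropout rate here are illustrative assumptions, not a definitive implementation; the output size of 71 days matches the example discussed below.

```python
# A vanilla LSTM sketch: one LSTM layer, a dropout layer for regularisation,
# and a dense layer producing one output per forecast day.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

WINDOW = 60       # number of past trading days fed into the model (assumed)
N_FEATURES = 1    # e.g. the normalised price only
OUTPUT_DAYS = 71  # 66-day gap + 5 forecast days, as described below

model = keras.Sequential([
    keras.Input(shape=(WINDOW, N_FEATURES)),
    layers.LSTM(50),           # a single LSTM layer = "vanilla" LSTM
    layers.Dropout(0.2),       # regularisation
    layers.Dense(OUTPUT_DAYS)  # one output per forecast day
])
model.compile(optimizer="adam", loss="mse")
```

Given a batch of input windows of shape (batch, 60, 1), the model produces one forecast vector of 71 values per sample.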

In the above example, the Y Gap is 66, i.e. the number of trading days between the last trading day in the stock time series and the start of the forecasting period. For example: the time series runs from 1st Jan 2016 to 31st Dec 2021, the Y Gap covers Jan, Feb and March 2022, and the forecasting period is the first week of April 2022.

That is why, in the example above, the LSTM output is set to 71 days: 66 days (the number of trading days in the 3-month gap, i.e. Jan, Feb, March 2022) plus 5 days (the number of trading days in the first week of April).

The number of trading days corresponds to the days when the stock market is open. This is usually Monday to Friday, minus the public holidays.
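The trading-day arithmetic above can be approximated in code. numpy’s busday_count counts weekdays by default; real exchange holidays would have to be passed in via its holidays argument, which is why the weekday-only count below comes out slightly differently from the figures in the text.

```python
# Counting trading days between two dates (weekdays only; pass a holidays
# list to busday_count for an exact exchange calendar).
import numpy as np

# Gap period: January to March 2022 (end date exclusive)
gap_days = np.busday_count("2022-01-01", "2022-04-01")

# Forecast period: the five weekdays of the first full week of April 2022
forecast_days = np.busday_count("2022-04-04", "2022-04-09")

lstm_output_size = gap_days + forecast_days
print(gap_days, forecast_days, lstm_output_size)  # 64 5 69
```

The weekday-only gap is 64 days; the exact number of trading days depends on the market’s holiday calendar.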

Running the model is quite simple:

, Y_train, epochs = 50, batch_size = 32)

I do hope this article inspired you to take a stab at stock price forecasting. If you have any questions or would like to discuss any aspect of stock forecasting, I’m available at

14 October 2022

Data Warehouse on Data Lake

Filed under: Data Warehousing — Vincent Rainardi @ 7:04 am

Some people argue that the best way to build a data warehouse on a data lake is to create a SQL Server (or Oracle) database populated from a Databricks data lake, like the diagram below. It is certainly a good way of doing it, but yesterday I saw a better way: build the data warehouse on the Databricks SQL Warehouse feature (formerly called SQL endpoint).

The SQL Warehouse in Databricks makes the tables in Databricks queryable using SQL commands as if they were tables in a SQL database. More importantly, it enables BI and reporting tools to connect to it as if it were a SQL database, and we can control which users are allowed to access which schemas, tables and views. Yes, views! That is something we could not do previously on a Databricks cluster, but now we can on the SQL Warehouse. See the diagram below.

Similarly, for the data scientists and ML engineers, instead of creating a SQL database populated from the data lake, it is better to create the data science area on Databricks itself, as a SQL Warehouse.

Here’s how to do it. We build the data lake as normal on Databricks, with one Databricks database per data source. As usual, in each database we have a bronze area, a gold area, etc. We bring the data in from the data sources (say from Charles River, or from SAP) into the bronze area using Azure Data Factory (or Glue + Transform if you use AWS). We create Python notebooks to process data from bronze to silver to gold, validating, cleaning and transforming it along the way.

At this point the usual approach is to create an Azure SQL database (or Redshift if you use AWS, or Snowflake) and create a Kimball dimensional model on it (a star schema with fact and dimension tables). To populate it we can use Azure Data Factory to read the lake. To get the data out of the lake we can either a) create APIs on top of Databricks (so the ADF pipelines consume these APIs) or b) read the Databricks cluster directly using Spark SQL, Scala or Python.

Using SQL Warehouse we don’t create this Azure SQL database. Instead, we create a database in SQL Warehouse (a database in SQL Warehouse is formally called a schema). Then we create the Kimball dimensional model (fact and dimension tables) on this schema, putting the location path of those tables in the data lake ADLS (or S3 if you use AWS), preferably in Delta format (you can also choose JSON, CSV, Parquet or Avro). For large fact tables we can partition the tables.

Then we use COPY INTO to load data from a file location in the data lake into staging tables on the SQL Warehouse, so we don’t need to create the data lake tables as external tables. These staging tables are created as Delta tables. Then we populate the fact and dimension tables using the INSERT INTO statement, e.g. INSERT INTO FactTable1 TABLE StagingTable1, i.e. we select from the staging tables and insert into the fact and dimension tables.
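Put together, the loading steps above might look something like this in Databricks SQL. This is a sketch only: all schema, table, column and path names are hypothetical, and COPY INTO is shown reading Parquet files from the lake.

```sql
-- All schema, table, column and path names here are hypothetical.
CREATE SCHEMA IF NOT EXISTS warehouse;

-- Staging table, created as a Delta table:
CREATE TABLE IF NOT EXISTS warehouse.StagingTable1 (
  CustomerKey INT,
  DateKey     INT,
  Amount      DECIMAL(18, 2)
);

-- Load from a file location in the lake into staging
-- (COPY INTO reads formats such as Parquet, CSV, JSON or Avro):
COPY INTO warehouse.StagingTable1
FROM 'abfss://lake@mystorageaccount.dfs.core.windows.net/gold/table1/'
FILEFORMAT = PARQUET;

-- Populate the fact table from the staging table:
INSERT INTO warehouse.FactTable1 TABLE warehouse.StagingTable1;
```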

And that’s how we create a data warehouse on Databricks SQL Warehouse. And populate it.

As a bonus, using Databricks SQL Warehouse we can capture the data lineage out of the box. See the Reference section below.


5 October 2022

MCC formula for multiclass classification

Filed under: Data Warehousing — Vincent Rainardi @ 7:51 am

Problem: you produced a multiclass classification result like the one below, and need to evaluate it using MCC:

Note: MCC (Matthews Correlation Coefficient), also known as the Phi Coefficient (link), is a measure of how closely related 2 variables are. For multiclass classification, MCC is a better evaluation measure than accuracy, precision, recall or the F1 score. MCC and AUC measure different things: MCC measures statistical accuracy, whereas AUC measures robustness (link).


from sklearn.metrics import matthews_corrcoef
matthews_corrcoef(Y_Actual, Y_Predicted)
Output: 0.035662657

Where Y_Actual and Y_Predicted are lists containing class IDs, like [3, 8, 2, 5, …, 9]. In the above example both lists contain 1000 items.

How does it work?

But how does it actually work? What is the formula to calculate that 0.035662657?

Well, Scikit Learn specifies the formula here. For binary classification it is:

MCC = (TP x TN - FP x FN) / sqrt((TP + FP) x (TP + FN) x (TN + FP) x (TN + FN))

Where TP, TN, FP and FN are the number of true positives, true negatives, false positives and false negatives, respectively.

But that’s for binary classification. How about multiclass classification?

Well, Scikit Learn specifies the formula too, on the same page as above:

MCC = (c x s - sum over k of (tk x pk)) / sqrt((s^2 - sum of pk^2) x (s^2 - sum of tk^2))

where:
k = class, from 1 to K (in the above example: 10 classes)
s = number of samples (in the above example: 1000 samples)
c = number of samples correctly predicted (in the above example: 157)
tk = number of times class k truly occurred (in the above example: 67, 42, 67, etc.)
pk = number of times class k was predicted (in the above example: 17, 8, 20, etc.)

When I looked at this formula, I got confused. So let’s pick it apart and do it in Excel.


The numerator is c x s minus the sum of tk x pk over all classes, see below:

So it is 157 x 1000 minus the sum of (column t) x (column p), i.e. 67 x 17 + 42 x 8 + … and so on.

In Excel we can use sumproduct like this:
(refer to the column and row numbers in Excel screenshot above)

= B12*D12-(SUMPRODUCT(B2:B11,C2:C11))
= 28918


The denominator is sqrt((s^2 - sum of pk^2) x (s^2 - sum of tk^2)).

In Excel it is:
= SQRT((B12^2-SUMPRODUCT(C2:C11,C2:C11))*(B12^2-SUMPRODUCT(B2:B11,B2:B11)))
= 810876.198255

Dividing the numerator by the denominator gives us:
28918 / 810876.198255 = 0.035662657
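The same calculation can be scripted directly from the formula, without Excel or Scikit Learn. A sketch in plain Python, using the same s, c, tk and pk quantities defined above (a perfect prediction gives MCC = 1, as a sanity check):

```python
# Multiclass MCC computed straight from the formula:
# (c*s - sum(tk*pk)) / sqrt((s^2 - sum(pk^2)) * (s^2 - sum(tk^2)))
from collections import Counter
from math import sqrt

def multiclass_mcc(y_actual, y_predicted):
    s = len(y_actual)                                       # number of samples
    c = sum(a == p for a, p in zip(y_actual, y_predicted))  # correctly predicted
    t = Counter(y_actual)                                   # tk: true occurrences of class k
    p = Counter(y_predicted)                                # pk: predicted occurrences of class k
    classes = set(t) | set(p)
    numerator = c * s - sum(t[k] * p[k] for k in classes)
    denominator = sqrt((s ** 2 - sum(p[k] ** 2 for k in classes))
                       * (s ** 2 - sum(t[k] ** 2 for k in classes)))
    return numerator / denominator

print(multiclass_mcc([0, 0, 1, 1, 2, 2], [0, 0, 1, 1, 2, 2]))  # 1.0
```

The result should agree with Scikit Learn’s matthews_corrcoef on the same inputs (the sketch does not guard against a zero denominator, which occurs in degenerate cases such as a single class).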

There we go. That’s how you calculate MCC in Excel. I hope this is useful.

1 October 2022

Stock Price Forecasting using XGBoost

Filed under: Data Warehousing — Vincent Rainardi @ 7:37 am

This article is about forecasting stock prices using the XGBoost algorithm. For those who are not familiar with XGBoost, Jason Brownlee wrote a good introduction: link. For those who are not familiar with stock price forecasting, Chai Chon Lung provides a good literature review: link (chapter 2), while Rouf et al. provide a good recap of the latest developments: link.

Predicted Variable

The XGBoost algorithm is capable of predicting a categorical variable based on numerical independent variables. In other words, the inputs are numbers and the output is a class. Therefore, to forecast stock prices using the XGBoost algorithm, we need to categorise the predicted stock prices into classes.

This can be done by comparing the future stock prices with either today’s price or the average of the last few days. This comparison is something like “up 6%” or “down 3%”, i.e. a percentage increase or decrease, aka percentage change. This percentage change is still numeric, but it can be categorised into bands, such as 0 to 3%, 3% to 6%, 6% to 9%, and so on, and similarly for the negative changes. Each band can then be given a name, such as P1, P2, P3, and so on for the positive changes, and N1, N2, N3, and so on for the negative changes.

What about zero? 0% can be put into either P1 or N1; it doesn’t make a difference. What about a large change? How many classes should we have? Do we stop at P10? P15? It depends on the distribution of the data. I would recommend covering 95% of the sample with the normal 3%-increment P and N classes, and putting the remaining 5% into the last P and N classes. You could also label these last P and N classes differently, such as PR and NR (R stands for remaining).

So that is the output (the predicted variable): P1 to P10 plus PR, and N1 to N10 plus NR, so 22 classes in total. Here 10 is just an example; the actual number depends on the data distribution.
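As a sketch, the banding described above could be implemented like this (the function name, the 3% band width and the cut-off at band 10 are illustrative assumptions):

```python
# Map a percentage change to a P/N band label: 3% bands, with PR/NR
# catching anything beyond the last regular band.
def price_band(pct_change, band_width=3.0, max_band=10):
    """Return a band label such as 'P1', 'N3' or 'PR' for a % change."""
    sign = "P" if pct_change >= 0 else "N"   # 0% goes into P1, as discussed
    band = int(abs(pct_change) // band_width) + 1
    return f"{sign}R" if band > max_band else f"{sign}{band}"

print(price_band(2.0))    # P1  (0% to 3%)
print(price_band(-7.5))   # N3  (-6% to -9%)
print(price_band(45.0))   # PR  (beyond the last band)
```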

Wait, this 3%, 6%, etc. is measured over what period? You can take a week from today, or a month from today, etc., whichever satisfies the business requirements. Of course you will need to translate these calendar days, weeks and months into “trading days”, because stock price data only exists when the stock markets are trading. The stock markets close on weekends and national holidays, which are different for each country. For example, instead of saying a week we say 5 trading days, and instead of saying a month we say 20 trading days.

Independent Variables

What about the input, i.e. the independent variables? That is of course the stock prices in the last few months or years. The more recent a date is, the bigger its impact on the future price; therefore, the more recent the period, the higher the sampling frequency should be. For example, if you have 5 years of data, do something like this: 1 to 5 years ago = month-end dates only, 3 to 12 months ago = weekly (Mondays), and the last 3 months = daily prices. The number of independent variables = (12 x 4 dates for the 4 years) + (39-40 dates for the 9 months) + (64-68 days for the 3 months, depending on the country) = about 151 to 156 days.

That’s the input: prices from 151 to 156 dates. Of course the input needs to be the same size for every sample, so you’ll need to take 64 dates for the last 3-month period and 39 dates for the 9-month period. This way you’ll always have 151 inputs, or independent variables, into your XGBoost algorithm.
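A sketch of this mixed-frequency sampling using pandas. The as-of date and the synthetic price series below are assumptions; the age thresholds follow the scheme described above (daily for the last ~3 months, Mondays for 3-12 months ago, last trading day of the month for anything older):

```python
# Mixed-frequency sampling of a 5-year daily price history into ~150 inputs.
import numpy as np
import pandas as pd

as_of = pd.Timestamp("2022-09-30")
dates = pd.bdate_range(end=as_of, periods=5 * 261)   # ~5 years of weekdays
prices = pd.Series(np.linspace(100, 150, len(dates)), index=dates)

age = (as_of - prices.index).days                    # age of each date in days

recent = prices[age <= 91]                           # daily, last ~3 months
mid = prices[(age > 91) & (age <= 365)]
mid = mid[mid.index.dayofweek == 0]                  # Mondays only
old = prices[age > 365]
old = old.groupby([old.index.year, old.index.month]).tail(1)  # month-last dates

features = pd.concat([old, mid, recent]).sort_index()
print(len(features))   # roughly 150 input dates
```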


The X dataset consists of 5 columns: Date, HL_Avg, Volume, HL_Avg_Norm, Volume_Norm like below. Below is X[0], i.e. just the first set.

For all intents and purposes, in this article we can consider that HL_Avg is the stock price and HL_Avg_Norm is the normalised stock price. Only the HL_Avg_Norm column will be used.

And the Y dataset contains Y_Avg, X_Last, Y_Diff, Y_Band, Y_Gap, Y_Days, Y_Avg_Norm, X_Last_Norm, Y_Band_Norm. It looks like this (this is Y[0]):

In this article we will only be using Y_Band, which is U1 in the above.

Y_Band is obtained by categorising Y_Diff into 5% bands.

Tuning XGBoost Algorithm

The best parameters for the XGBoost model are obtained by trying out different values for each parameter. It is best to change only 1 parameter whilst keeping all the other parameters the same; this way we know the effect of that 1 parameter.

The following parameters need to be configured: number of estimators, number of days in the data, Gamma, maximum depth, minimum child weight, subsample, learning rate, lambda, alpha.

Note that the values of the best parameters depend on the number of dates you have in the independent variables. Therefore, you should decide the number of dates first, factoring in a higher frequency for the more recent period (see the Independent Variables section above). We need to be mindful that increasing the number of days only increases the model performance/accuracy up to a certain point; afterwards it starts to make it worse. Therefore it is very important to find the right amount of data to feed into the model.

By far the most influential parameter is Gamma. Gamma is a regularisation parameter, ranging from zero to infinity. Increasing Gamma makes the model performance (in terms of accuracy) better only up to a certain point; afterwards it starts to make it worse. You should try very small numbers first, gradually increasing to bigger numbers at a “stepped” distance (at different orders of magnitude), for example: 0, 0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5, 10, 50.

The second most influential parameter is the number of estimators. Like Gamma, increasing the number of estimators only increases the model performance/accuracy up to a certain point, and afterwards it decreases.

So the order is: data first (the number of dates, with recent dates getting a higher priority/frequency), then Gamma, then the number of estimators. Check out my article on tuning the XGBoost model here, and Laurae’s article specifically on Gamma here. All the XGBoost parameters are described here.

Forecasting the price band

Forecasting the price band is as easy as calling the predict method on the XGBoost model, giving it the test data. The output is the class ID (an integer indicating which class), so we will need to convert it to price bands like P1 to P10 in the “Predicted Variable” section above. In the code I use U1 to U5 and D1 to D5 (U stands for Up and D stands for Down, instead of P and N), and I have 10 classes instead of 22.

Evaluation using MCC

For multiclass classification I would recommend using the Matthews Correlation Coefficient. You can read the Wiki here, Boaz Shmueli’s article here and the Scikit Learn reference here.

The formula is (source: link):

MCC = (c x s - sum over k of (tk x pk)) / sqrt((s^2 - sum of pk^2) x (s^2 - sum of tk^2))

where:
s = number of samples
c = number of samples correctly predicted
pk = number of times class k was predicted
tk = number of times class k really/truly occurred
K = number of classes (k runs from 1 to K)

To use it, nothing could be simpler: we just need to import it from Scikit Learn and call it.
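Something along these lines (the variable names and the short example lists are assumptions):

```python
# Evaluating multiclass predictions with MCC via Scikit Learn.
from sklearn.metrics import matthews_corrcoef

y_expected  = [3, 8, 2, 5, 9, 1]   # actual class IDs from the test data
y_predicted = [3, 8, 2, 4, 9, 1]   # class IDs predicted by the model

mcc = matthews_corrcoef(y_expected, y_predicted)
print(mcc)
```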

Save the result for analysis

I recommend you save the output of the model (the predicted classes) and the expected result (from the test data), along with the MCC, for further analysis.

Saving it is as easy as this:
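For example (the file name, the short example lists and the choice to repeat the overall MCC on every row are assumptions):

```python
# Save actual vs predicted classes plus the overall MCC to a CSV file
# for later analysis in Excel, Seaborn or Matplotlib.
import pandas as pd
from sklearn.metrics import matthews_corrcoef

y_actual    = [1, 3, 0, 2, 1]   # expected class IDs from the test data
y_predicted = [1, 2, 0, 2, 1]   # class IDs from model.predict(...)

mcc = matthews_corrcoef(y_actual, y_predicted)
df = pd.DataFrame({"Actual": y_actual, "Predicted": y_predicted})
df["MCC"] = mcc                  # the overall MCC, repeated on every row
df.to_csv("forecast_results.csv", index=False)
```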

You can then open the CSV file in Excel and do analysis like this (or use Seaborn / Matplotlib):
