Data Warehousing and Machine Learning

12 August 2021

RNN Applications

Filed under: Data Warehousing — Vincent Rainardi @ 6:23 am

A Recurrent Neural Network is a machine learning architecture for processing sequential data, see my article here: link. The applications of this architecture are amazing, for example we can generate a song from a note, generate a poetry, a story or even a C code!

Here is a list of various amazing applications of RNN:

  1. Video classification: link
  2. Image classification: link
  3. Image captioning: link
  4. Sentiment analysis: link
  5. Language translation: link
  6. Making music: link
  7. Writing poem: link
  8. Writing code: link
  9. Generating text: link
  10. FX trading: link
  11. Stock market prediction: link
  12. Speech recognition: link
  13. Text to speech: link

RNN application is about a sequence of data. That sequence of data can be the input or it can be the output. That sequence of data can be a sequence of numbers, a sequence of musical notes, a sequence of words, or a sequence of images.

If the sequence of data is the output, then it becomes a creation. For example:

  • If the output is a sequence of notes, then the RNN is “writing music”.
  • If the output is a sequence of words, then the RNN is “writing a story”.
  • If the output is a sequence of share prices, then the RNN is “predicting share prices”.
  • If the output is a sequence of voices, then the RNN is “speaking”.
  • If the output is a sequence of colours, then the RNN is “painting”.

That is very powerful right? This is why AI in the last few years is really taking off. Because finally AI can create a song, a speech, a painting, a story, a poem, an article. Finally AI can predict a sequence of numbers. Not just one number but a series of numbers. That has very, very serious consequences. Imagine if that series of numbers is the temperature every hour in the next few days.

Imagine, if that series of numbers is stock prices in the next few weeks. Imagine if the prediction is accurate. It would turn the financial world upside down!

Three Categories of RNN

RNN applications can be categorised into 3:

  1. Classification
  2. Generation
  3. Encoder Decoder

Classification is about categorising a sequence of images or data into categories, for example:

  • Classifying films into action, drama or documentary
  • Classifying stock market movements into positive or negative trend
  • Classifying text into news, scientific or story

In classification the output is a single number.

Generation is about making a sequence of data based on another data, for example:

  • Making a sequence of musical notes e.g. a song.
  • Making a sequence of words e.g. a poem, Python code or a story.
  • Making a sequence of numbers e.g. predicting stock market.

For generation we need to have a “seed” i.e. data which the creation is based on. For example, when generating a sequence of words we need a word. When generating a sequence of musical notes we need a note.

Encoder Decoder consists of 2 parts. The first part (encoder) encodes the data into a vector. The second part (decoder) uses this vector to generate a sequence of data. For example: (image source: Greg Corrado)

The words in the incoming email is fed one by one as a sequence into an LSTM network, and encoded into a word vector representation. This vector (called thought vector) is then used to generate the reply, one word at a time. In the above example “Yes,” was generated first, and then that word is used to generate the second word “what’s”. Then these 2 words are used to generate the third word, and so on.

Using RNN to forecast stock prices

One of the popular techniques to forecast stock prices is RNN-based (the other is Reinforcement Learning). We can see the trend in Ref 1 below.

In a paper by Wenjie Lu, Jiazheng Li, Yifan Li, Aijun Sun, Jingyang Wang (see Ref 2 below), we can see that amongs the RNN based technique, the most accurate one is a combination of CNN and LSTM. The CNN is used to extract features in the stock price history, and LSTM is used to predict the future stock prices. The result is like this (Ref 2):

The stock price in this case is Shanghai Composite Index from 1/7/1991 to 31/8/2020. The last 500 days are used for test data, the rest as training data. They compared 6 methods: MLP, CNN, RNN, LSTM, CNN-RNN, CNN-LSTM and the result is as follows in terms of Mean Absolute Error (MAE):


  1. Hu, Z.; Zhao, Y.; Khushi, M. A Survey of Forex and Stock Price Prediction Using Deep Learning. Appl. Syst. Innov. 2021, 4, 9.
  2. Wenjie Lu, Jiazheng Li, Yifan Li, Aijun Sun, Jingyang Wang, “A CNN-LSTM-Based Model to Forecast Stock Prices”, Complexity, vol. 2020, Article ID 6622927, 10 pages, 2020.

7 August 2021

Recurrent Neural Network (RNN) and LSTM

Filed under: Data Warehousing — Vincent Rainardi @ 4:16 pm

In machine learning, Recurrent Neural Network (RNN) is used to predict data that happens one after another. The data has a time element i.e. it has a sequence/order. For example: a video. A video is a series of images arranged in a particular order. Stock prices is also data of this kind, it happens day after day, in sequence (or second after second). Document is also a sequential data, the words are arranged in a particular sequence.

Let’s say we have a video of a dog running, and we try to classify whether the dog in the video jumps or not. So the output is say 100 frames of images, and the output is a binary number, 1 means jump and 0 means not jump, like below (image source: link).

The Recurrent Neural Network receives 100 images as input, one image at a time, in a particular order. And the output is binary number 1 or 0. So it’s a binary classification.

So that’s the input and output of RNN. The input is a sequence of images or numbers (or words), and the output is … well, there are a few different kinds actually:

  1. One output (like above) i.e. we just take the last output.
  2. Many output, i.e. we take the output on many different times.
  3. Generator, e.g. based on 1 note we generate a song.
  4. #1 above followed by #3 (called encoder-decoder), e.g. Gmail Smart Compose (link).


In the early days the RNN architecture was similar to the neural network architecture. See below: input x1 (the first image or number) was fed through a series of neural network layers until we get output y1. Input x2 (the second image or number) was fed through a series of neural network layers until we get output y2, like this:

The difference to a normal neural network was the red arrows above, i.e. the values on the hidden layers from time slot 1 was fed to time slot 2. So each layers on time slot 2 received two inputs: x2 and the values of the hidden layers in time slot 1 (multiplied by some weights).

If we take just 1 node in a layer, we can show what happens in this node across time (below left). The node (s) receives input (x) producing output (h). The state of the node (s) is multiplied by a weight ( w ) and sent to itself (s).

The left diagram is simplified into the right diagram above, i.e. we only draw 1 copy, with a circular w from s pointing to itself. The right diagram is called the “rolled” version, and the left one is the “unrolled” version.

Note that in the diagram above the output is h not y, because it is the output of a node in a layer, not the final output of the last layer.

I saw the rolled versions of the RNN diagram above for the first time about 2 years ago and I no idea what it was. I hope you can understand it now.

Long Short Term Memory (LSTM)

These days no one use this original RNN architecture any more. Today everyone uses some variant of LSTM, which looks like this:

This architecture is called Long Short Term Memory because it is using many short term memory cells to create a long term memory (link), meaning: able to remember a long sequence of input, e.g. 5 years of historical stock data. The old RNN inherently has a problem with long sequences because of “vanishing gradient” problem (link). “Exploding gradient” problem (link) is not as big an issue because we can cap it (called “gradient clipping”).

Cell Memory

On the LSTM diagram, the horizontal line at the top (from ct-1 to ct) is the cell state.
It is the memory of the cell i.e. the short term memory. Along this line, there are 3 things happening: the cell state is multiplied by the “forget gate”, increased/reduced by the “input gate” and finally the value is taken to the “output gate”.

So what are these 3 gates? Let’s go through them one by one.

Forget Gate

The forget gate removes unwanted information from the cell state.
The value of σ is from 0 to 1. By varying the value of σ we can adjust how much information is removed from the cell state. The current input (xt) and the previous output (ht-1) are multiplied by σ.

So the impact of this forget gate to the cell state is:

where bf is the bias and Wf and Uf are the weights (link). The blue circle with cross is element wise multiplication.

Bear in mind that t is the current time slot and t-1 is the previous time slot. Notice that h and x have their own weights.

Input Gate

The input gate adds new information into the cell state. As we can see below, the current input (xt) and the previous output (ht-1) pass through a sigma gate and a tanh gate, multiplied then added to the cell memory line.

Here i controls the how much a influences c. The value of tanh is between -1 and +1 so a can decrease or increase c. And the amount of a’s influence to c is controlled by i.

So the impact of this input gate to the cell state is: (link)

Notice that h and x have their own weights, both for i and a.

Output Gate

The output (h) is taken from the cell state (c) using tanh function. The value of tanh is from -1 to +1 so it can make the cell state positive or negative. The amount of influence this tanh(c) has on h is controlled by o. O is calculated from the previous output (ht-1) and the current input (xt), each having different weights, using a sigma function.

So the output (h) is calculated like this:

The complete equation is on Wikipedia: link, which is from Hochreiter and Schidhuber’s original LSTM paper (link) and Gers, Schmidhuber, Cummins’ paper: Learning to forget (link).

A variant of LSTM is Gated Recurrent Unit (GRU). GRU does not have an output gate like LSTM. It has a reset gate and an update gate: link.


  1. Wikipedia, RNN: link
  2. Wikipedia, LSTM: link
  3. Wikipedia, GRU: link
  4. Andrej Karpathy, The Unreasonable Effectiveness of RNN: link
  5. Michael Phi, Illustrated Guide to LSTM and GRU: link
  6. Christopher Olah, Understanding LSTM: link
  7. Gursewak Singh, Demystifying LSTM weights and bias dimensions, link
  8. Shipra Saxena, Introduction to LSTM: link
  9. Gu, Gulcehre, Paine, Hoffman, Pascanu, Improving Gating Mechanism in RNN: link
  10. Hochreiter and Schmidhuber, LSTM: link

13 July 2021

Learning Machine Learning with Upgrad

Filed under: Data Warehousing — Vincent Rainardi @ 7:46 am

In the last 10 months I’ve been doing a master’s degree on machine learning with Upgrad (link). It has been a very good journey, very enjoyable. I really like it a lot. The opening webinar back in October 2020 was fantastic. They talked about various applications of AI such as image recognition for blind person, chest X-ray diagnosis, NFL video advert analysis, Makoto Koike cucumber, Alpha Go Zero and Volvo recruiting car. Everyone was assigned a student mentor who guides us through our journey and answer our non-academic questions. We have technical assistants who answer our academic questions (we have a discussion forum too). We learn primarily through videos (which suit me a lot as I’m in the UK with different working hours to India) and their learning platform is very good. Every week we have doubt resolution sessions (optional) where we can ask questions to real teachers (their teachers are very good in explaining difficult concepts so they are easy to understand). A lot of webinars where industry experts share their real world experiences on AI.

The thing I like best is the small group coaching where we learn in a group of eight, coached by an industry expert. My coach is from Paypal, the same industry as me (I work in asset management in London). The sesson is interactive where our coach explains things and we can ask questions, and it is always practical, often discussing the “notebook” (meaning the Python code for those who are not familiar with Jupyter). My mentor is an expert in ML and a very good teacher. We are really lucky to he’s willing to spend time coaching us. Sometimes we had a one-to-one discussion with our coach. At one time (just once) we students thaught each other, we learned from one another. But everyone was also assigned an industry mentor, with whom I discuss my job in the real world and my blog, and my aspirations/ideas in ML. Most students are looking for a job in ML and received a lot of guidance from their mentor. I’m not looking for a new job, but I’m very grateful to have a very experienced mentor. My mentor is from Cap Gemini, an industry leader in AI with 25 years of experience (13 of which were with Microsoft). Really lucky that he’s willing to spend time mentoring me.

In the first month I was learning Python and SQL, covering data structures, control structures, pandas, numpy, data loading, visualisation, etc. all on Jupyter notebook. I’m a SQL and BI veteran but I rarely do coding at work. I mean real coding, not SQL, ETL or BI tools. The last time I did real coding was 10 years ago (Java) and before that it was 20 years ago (C#). When I was young I really liked coding (Basic, C++, Pascal) and this Python coding with Upgrad really took me back to my childhood hobby. I really enjoy coding in Python as part of this course.

Then I learned about statistics and data exploration. I did Physics Engineering at uni so I did statistics and learning it again was enjoyable. The teacher was really good (from Gramener, link) and gave us real world examples like restaurant sales, securities correlation and electricity meter reading. Also learned about probability, central limit theorem and hypothesis testing. All these turned out to be come very useful when applying machine learning algorithms. The assignment was real world cases, such as investment analysis and loan, and the fact that they were in finance made me enjoyed them more.

Then for a few months I learned with various ML algorithms such as linear regression, logistic regression, Naive Bayes, SVM, Decision Tree, Random Forest, Gradient Boosting, clustering and PCA. Also various important technique such as regularisation (Ridge, Lasso), model selection, accuracy, precision. Again the assignments were real world cases such as predicting house prices, how weather affects sales, and telecommunication industry.

Then I learned about natural language processing (NLP) which was very different. All the other algorithms were based on mathematics, but this one is based on languages. It was such as an eye opener for me to learn how computer understand human languages (I wrote an article about it: link). And now I’m learning neural network, which is the topic I like most because it is the most powerful algorithm in machine learning. We started with computer vision (CNN, convolutional neural network, link) and now I’m studying RNN (Recurrent Neural Network, link) which is widely used for stock market analysis and any other sequential data.

I feel lucky I studied Physics Engineering in uni, because it helped me a lot in understanding the mathematics behind the algorithms, especially the calculus in neural network. I’ve done a few ML courses on Coursera (see my article on this: link, link) but this Upgrad one is way way better. It is a real eye opener. I can now read various machine learning papers. I mean real academic research papers, written by PhDs! A few years ago I was attending a machine learning “meetup” in London. Meetup is an app where people with similar interest gather together to meet. Usually the ML meetups were in the form of lecture, i.e. 1.5 hour session in the evening where two speakers explained about machine learning. But this time it was different. It was a discussion forum of 10 people and there was no speaker. Everyone must read a paper (it was Capsule Neural Network paper by Geoffrey Hinton) and in this meetup we discuss it. I didn’t understand a thing! I did understand neural network a bit, but I had no background in CNN so I could not understand the paper. But now I understand. I can read research papers! I didn’t know that I would be this happy to be able to read machine learning papers. It is really important to be able to read ML papers because ML progresses so fast, and the research papers provide superb sources on the latest invention is on ML.

7 July 2021

What is CNN Part 2

Filed under: Data Warehousing — Vincent Rainardi @ 4:54 am

In the first part (link), after trying 28 different models the conclusion was that the best models are model #26 and #28. Model #28 has more validation fluctuation, but it has half the number of parameters.

But as we can see above both model #26 and #28 suffer from overfitting. Meaning that the training accuracy is very high (about 90%) but very low validation accuracy (about 50%). This big gap of 40% is a clear indication of overfitting. To solve this we need to do image augmentation, i.e. we need to rotate, flip, zoom out, zoom in, and shift the image, like this:

The top left image is the original image. The other 11 images are generated using random rotation, random flip, random zoom and random contrast. I put 3 sets so we can understand the interplay between these 4 transformations on the augmented images: rotate, flip, zoom, contrast (combined). Jason Brownlee gave a good tutorial on this: link.

After doing image augmentation the result is as follows:

The gap between is closing but they are still low! The best one with the narrowest gap and the highest validation accuracy is A3. It has training accuracy = 55% and validation accuracy = 54%.

In this situation like this (i.e. after doing image augmentation) if the accuracy is still low, we need to check the number of training images in each class. If one class has only a few images, and other class has lots of images, then the model training will suffer from “class imbalance” problem. Shubrashankh Chatterjee’s explained this very well on his article: link.

Basically we auto generate additional images using image augmentation (rotate, flip, zoom, contrast, shift, etc) so that each class has the same number of images. After doing this, the result is like this: (note that it’s 30 epochs not 20)

So both models still suffer from validation fluctuation, even after 30 epochs. Even with batch normalisation and dropout. Even with dropout on the dense layer! I’m still finding out why, but I think it might be because of the type of augmentation, for example I didn’t change the colours of the images. To troubleshoot this we need to find out which class causing the low accuracy, is it just some particular classes or all classes. But that’s for another time and another article. Happy learning!

16 April 2021

Logistic Regression with PCA in Python

Filed under: Data Warehousing — Vincent Rainardi @ 8:31 pm

Logistic Regression means predicting a catagorical variable, without losing too much information. For example, whether a client will invest or not. JavaTPoint provies a good, short overview on Logistic Regression: link. Jurafsky & Martin from Stanford provide a more detailed view, along with the mathematics: link. Wikipedia provides a comprehensive view, as always: link.

In this article I will be writing how to do Linear Regression in Python. I won’t be explaining what it is, but only how to do it in Python.

PCA means Principal Component Analysis. When we have a lot of variables, we can reduce them using PCA. Matt Berns provide a good overview and resources: link. Lindsay Smith from Otago provides a good academic overview: link. And as always, Wikipedia provies a comprehensive explanation: link.

I think it would be good to kill two birds with on stone. So in this article I will build 2 Logistic Regression models, one with PCA and one without PCA. This way it will provide examples for both cases.

One of the weaknesses of PCA is that we won’t know which variables are the top predictors. To know the top predictors we will have to build the Linear Regression model without PCA. As we don’t use PCA, to reduce the number of variables I will use RFE + manual (see here for an example on reducing variables using RFE + manual on Linear Regression). One of the advantages of PCA is that we don’t need to worry about multicollinearity in the data (highly correlated features). So on the second model where I don’t use PCA, I have to handle the multicollinearity, i.e. remove the highly correlated features using VIF (Variance Inflation Factor).

There are 5 steps:

  1. Data preparation
    • Load and understand the data
    • Fix data quality issues
    • Data conversion
    • Create derived variables
    • Visualise the data
    • Check highly correlated variables
    • Check outliers
    • Handle class imbalance (see here)
    • Scaling the data
  2. Model 1: Logistic Regression Model with PCA
    • Split the data into X and y
    • Split the data into training and test data set
    • Decide the number of PCA components based on the explained variance
    • Train the PCA model
    • Check the correlations between components
    • Apply PCA model to the test data
    • Train the Logistic Regression model
  3. Model evaluation for Model 1
    • Calculate the Area Under the Curve (AUC)
    • Calculate accuracy, sensitivity & specificity for different cut off points
    • Choose a cut off point
  4. Model 2: Logistic Regression Model without PCA
    • Drop highly correlated columns
    • Split the data into X and y
    • Train the Logistic Regression model
    • Reduce the variables using RFE
    • Remove one variable manually based on the P-value and VIF
    • Rebuild the model
    • Repeat the last 2 steps until P value < 0.05 and VIF < 5
  5. Model evaluation for Model 2
    • Calculate the Area Under the Curve (AUC)
    • Calculate accuracy, sensitivity & specificity for different cut off points and choose a cut off point
    • Identify the most important predictors

Step 1 is long and is not the core of this article so I will be skipping Step 1 and go directly into Step 2. Step 1 is common to various ML scenario so I will be writing it in a separate article and put the link in here so you can refer to it. One part in step 1 is about handling class imbalance, which I’ve written here: link.

Let’s start.

Step 2. Model 1: Logistic Regression Model with PCA

# Split the data into X and y
y = high_value_balanced.pop("target_variable")
X = high_value_balanced

# Split the data into training and test data set
from sklearn.model_selection import train_test_split 
X_train, X_test, y_train, y_test = train_test_split(X,y, train_size=0.7,test_size=0.3,random_state=42)

# Decide the number of PCA components based on the retained information
pca = PCA(random_state=88)
explained_variance = np.cumsum(pca.explained_variance_ratio_)
plt.vlines(x=80, ymax=1, ymin=0, colors="r", linestyles="--")
plt.hlines(y=0.95, xmax=120, xmin=0, colors="g", linestyles="--")

We can see above that to retain 95% explained variance (meaning we retain 95% of the information) we need to use 80 PCA components. So we build the PCA model with 80 components.

# Train the PCA model 
pca_final = IncrementalPCA(n_components=80)
df_train_pca = pca_final.fit_transform(X_train)

# Note that the above can be automated like this: (without using plot)
pca_final = PCA(0.95)
df_train_pca = pca_again.fit_transform(X_train)

# Check the correlations between components
corr_mat = np.corrcoef(df_train_pca.transpose())

As we can see in the heatmap above, all of the correlations are near zero (black). This one of the key features of PCA, i.e. the transformed features are not correlated to one another, i.e. their vectors are orthogonal to each other.

# Apply PCA model to the test data
df_test_pca = pca_final.transform(X_test)

# Train the Logistic Regression model
LR_PCA_Learner = LogisticRegression()
LR_PCA_Model =, y_train)

Step 3. Model evaluation for Model 1

# Calculate the Area Under the Curve (AUC)
pred_test = LR_PCA_Model.predict_proba(df_test_pca)
"{:2.2}".format(metrics.roc_auc_score(y_test, pred_test[:,1]))

# Calculate the predicted probabilities and convert to dataframe
y_pred = LR_PCA_Model.predict_proba(df_test_pca)
y_pred_df = pd.DataFrame(y_pred)
y_pred_1 = y_pred_df.iloc[:,[1]]
y_test_df = pd.DataFrame(y_test)

# Put the index as ID column, remove index from both dataframes and combine them
y_test_df["ID"] = y_test_df.index
y_pred_1.reset_index(drop=True, inplace=True)
y_test_df.reset_index(drop=True, inplace=True)
y_pred_final = pd.concat([y_test_df,y_pred_1],axis=1)
y_pred_final = y_pred_final.rename(columns = { 1 : "Yes_Prob", "target_variable" : "Yes" } )
y_pred_final = y_pred_final.reindex(["ID", "Yes", "Yes_Prob"], axis=1)

# Create columns with different probability cutoffs 
numbers = [float(x)/10 for x in range(10)]
for i in numbers:
    y_pred_final[i]= x: 1 if x > i else 0)

# Calculate accuracy, sensitivity & specificity for different cut off points
Probability = pd.DataFrame( columns = ['Probability', 'Accuracy', 'Sensitivity', 'Specificity'])
for i in numbers:
    CM = metrics.confusion_matrix(y_pred_final.Yes, y_pred_final[i] )
    Total = sum(sum(CM))
    Accuracy    = (CM[0,0]+CM[1,1])/Total
    Sensitivity = CM[1,1]/(CM[1,1]+CM[1,0])
    Specificity = CM[0,0]/(CM[0,0]+CM[0,1])
    Probability.loc[i] =[ i, Accuracy, Sensitivity, Specificity]
Probability.plot.line(x='Probability', y=['Accuracy','Sensitivity','Specificity'])

Choose a cut off point

Different applications have different priorities when choosing the cut off point. For some applications, the true positives is more important and the true negative doesn’t matter. In this case we should use sensitivity as the evaluation criteria, and choose the cut off point to make the sensitivity as high as possible e.g. probability = 0.1 which gives sensitivity almost 100%.

For some applications, the true negative is more important and the true positive doesn’t matter. In these case we should use specificity as the evaluation criteria, and choose the cut off point to make the sensitivity as high as possible, e.g. probability = 0.9 which gives specificity almost 100%.

For most applications the true positive and the true negative are equally important. In this case we should use accuracy as the evaluation criteria, and choose the cut off point to make the accuracy as high as possible, e.g. probability = 0.5 which gives accuracy about 82%.

In most cases it is not the above 3 extreme, but somewhere in the middle, i.e. the true positive is very important but the true negative also matters, even though not as important as the true positive. In this case we should choose the cut off point to make sensitivity high but the specificity not too low. For example, probability = 0.3 which gives sensitivity about 90%, specificity about 65% and accuracy about 80%.

So let’s do the the last paragraph, cut off point = 0.3

y_pred_final['predicted'] = lambda x: 1 if x > 0.3 else 0)
confusion_matrix = metrics.confusion_matrix( y_pred_final.Yes, y_pred_final.predicted )

We get sensitivity = 91.8%, specificifity = 65.9%, accuracy = 78.9%

Step 4. Model 2: Logistic Regression Model without PCA

# Drop highly correlated columns
data_corr = pd.DataFrame(data.corr()["target_variable"])
data_corr = data_corr[data_corr["target_variable"] != 1]
churn_corr.sort_values(by=["abs_corr"], ascending=False).head(5)

# Split the data into X and y, and normalise the data 
y = data.pop("target_variable")
normalised_data = (data - data.mean())/data.std()
X = normalised_data

# Train the Logistic Regression model
X_train, X_test, y_train, y_test = train_test_split(X,y, train_size=0.7,test_size=0.3,random_state=88)
LR_model = LogisticRegression(max_iter = 200), y_train)

# Reduce the variables using RFE
RFE_model = RFE(LR_model, n_features_to_select = 15)
RFE_model =, y_train)
selected_columns = X_train.columns[RFE_model.support_]

# Rebuild the model
X_train_RFE = X_train[selected_columns], y_train)
LR2 = sm.GLM(y_train,(sm.add_constant(X_train_RFE)), family = sm.families.Binomial())
LR_model2 =

# Check the VIF
from statsmodels.stats.outliers_influence import variance_inflation_factor
Model_VIF = pd.DataFrame()
Model_VIF["Variable"] = X_train_RFE.columns
number_of_variables = X_train_RFE.shape[1]
Model_VIF["VIF"] = [variance_inflation_factor(X_train_RFE.values, i) for i in range(number_of_variables)]
Model_VIF.sort_values(by="VIF", ascending=False)

# Remove one variable manually based on the P-value and VIF
X_train_RFE.drop(columns=["column8"], axis=1, inplace=True)
LR2 = sm.GLM(y_train,(sm.add_constant(X_train_RFE)), family = sm.families.Binomial())
LR_Model2 =

Repeat the last 2 steps until P value < 0.05 and VIF < 5.

Step 5. Model evaluation for Model 2

# Calculate the Area Under the Curve (AUC)
df_test = LR_Model2.transform(X_test)
pred_test = LR_Model2.predict_proba(df_test)
"{:2.2}".format(metrics.roc_auc_score(y_test, pred_test[:,1]))

Calculate accuracy, sensitivity & specificifity for different cut off points and choose a cut off point:

See “choose cut off point” section above

Identify the most important predictors:

From the model output above, i.e. “LR_Model2.summary()” we can see the most important predictors.

15 April 2021

Linear Regression in Python

Filed under: Data Warehousing — Vincent Rainardi @ 6:53 am

Linear Regression is about predicting a numerical variable. There are 5 steps when we do it in Python:

  1. Prepare the data
    • Load and understand the data
    • Fix data quality issues
    • Remove non required columns
    • Visualise and analyse the data
    • Identify highly correlated columns and remove them
    • Create derived variables
    • Create dummy variables for categorical variables
  2. Build the model
    • Split the data into training data and test data
    • Scale the numerical variables in the training data
    • Split the data into y and X
    • Automatically choose top 15 features using RFE (Recursive Feature Elimination)
    • Manually drop features based on P-value and VIF (Variance Inflation Factor)
    • Rebuild the model using OLS (Ordinary Least Squares)
    • Repeat the last 2 steps until all variables have P-value < 0.05 and VIF < 5
  3. Check the distribution of the error terms
  4. Make predictions
    • Scale the numberical variables in the test data
    • Remove the dropped features in the test data
    • Make predictions based on the test data
  5. Model evaluation
    • Plot the predicted vs actual values
    • Calculate R2, Adjusted R2 and F statistics
    • Create the linear equation for the best fitted line
    • Identify top predictors

Below is the Python code for the above steps. I will skip step 1 (preparing the data) and directly go to step 2 because step 1 is common to all ML models (not just linear regression) so I will write it in a separate article.

2. Build the model

# Split the data into training data and test data 
from sklearn.model_selection import train_test_split
df_train, df_test = train_test_split(df_data, train_size = 0.7, test_size = 0.3, random_state = 100)

# Scale the numerical variables in the training data
from sklearn.preprocessing import MinMaxScaler
minmax_scaler = MinMaxScaler()
continuous_columns = ["column1", " column2", " column3", " column4"]
df_train[continuous_columns] = minmax_scaler.fit_transform(df_train[continuous_columns])

# Split the data into y and X
y_train = df_train.pop("count")
x_train = df_train

# Automatically choose top 15 features using RFE
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
data_LR = LinearRegression(), y_train)
data_RFE = RFE(data_LR, 15)             
data_RFE =, y_train)

# Check which columns are selected by RFE and which are not
selected_columns = x_train.columns[data_RFE.support_]
unselected_columns = x_train.columns[~data_RFE.support_]

# Train the model based on the columns selected by RFE
# and check the coefficients, R2, F statistics and P values
x_train = x_train[selected_columns] 
import statsmodels.api as data_stat_model  
x_train = data_stat_model.add_constant(x_train) 
data_OLS_result = data_stat_model.OLS(y_train, x_train).fit() 

# Calculate the VIF (Variance Importance Factor) 
from statsmodels.stats.outliers_influence import variance_inflation_factor
data_VIF = pd.DataFrame()
data_VIF['variable'] = x_train.columns
number_of_variables = x_train.shape[1]
data_VIF['VIF'] = [variance_inflation_factor(x_train.values, i) for i in range(number_of_variables)]
data_VIF.sort_values(by="VIF", ascending=False) 

# Drop one column and rebuild the model
# And check the coefficients, R-squared, F statistics and P values
x_train.drop(columns=["column5"], axis=1, inplace=True)
x_train = bike_stat_model.add_constant(x_train)
data_OLS_result = data_stat_model.OLS(y_train, x_train).fit()

# Check the VIF again
data_VIF = pd.DataFrame()
data_VIF['variable'] = x_train.columns
number_of_variables = x_train.shape[1]
data_VIF['VIF'] = [variance_inflation_factor(x_train.values, i) for i in range(number_of_variables)]
data_VIF.sort_values(by="VIF", ascending=False) 

Keep dropping one column at a time and rebuild the model until all variables have P value < 0.05 and VIF < 5.

The result from print(data_OLS_result.summary()) is something like this, where we can see the R2 and Adjusted R2 of the training data:

                            OLS Regression Results                            
Dep. Variable:                  count   R-squared:                       0.841
Model:                            OLS   Adj. R-squared:                  0.838
Method:                 Least Squares   F-statistic:                     219.8
Date:                Tue, 29 Dec 2020   Prob (F-statistic):          6.03e-190
Time:                        09:07:14   Log-Likelihood:                 508.17
No. Observations:                 510   AIC:                            -990.3
Df Residuals:                     497   BIC:                            -935.3
Df Model:                          12                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
const          0.2444      0.028      8.658      0.000       0.189       0.300
column1        0.2289      0.008     28.108      0.000       0.213       0.245
column2        0.1258      0.011     10.986      0.000       0.103       0.148
column3        0.5717      0.025     22.422      0.000       0.522       0.622
column4       -0.1764      0.038     -4.672      0.000      -0.251      -0.102
column5       -0.1945      0.026     -7.541      0.000      -0.245      -0.144
column6       -0.2362      0.026     -8.946      0.000      -0.288      -0.184

3. Check the distribution of the error terms

In linear regression we assume that the error term follows normal distribution. So we have to check this assumption before we can use the model for making predictions. We check this by looking at the histogram of the error term visually, making sure that the error terms are normally distributed around zero and that the left and right side are broadly similar.

fig = plt.figure()
y_predicted = data_OLS_result.predict(x_train)
sns.distplot((y_train - y_predicted), bins = 20)
fig.suptitle('Error Terms', fontsize = 16)

4. Making predictions

# Scale the numberical variables in the test data (just transform, no need to fit)
df_test[continuous_columns] = minmax_scaler.transform(df_test[continuous_columns])

# Split the test data into X and y
y_test = df_test.pop('count')
x_test = df_test

# Remove the features dropped by RFE and manual process
x_test = x_test[selected_columns]
x_test = x_test.drop(["column5", "column6", "column7"], axis = 1)

# Add the constant variable to test data (because by default stats model line goes through the origin)
x_test = data_stat_model.add_constant(x_test)

# Make predictions based on the test data
y_predicted = data_OLS_result.predict(x_test)

5. Model Evaluation

Now that we have built the model, and use the model to make prediction, we need to evaluate the performance of the model, i.e. how close the predictions are to the actual values.

# Compare the actual and predicted values
fig = plt.figure()
plt.scatter(y_test, y_predicted)
fig.suptitle('Compare actual (Y Test) vs Y predicted', fontsize = 16)
plt.xlabel('Y Test', fontsize = 14)
plt.ylabel('Y Predicted', fontsize = 14)
  • We can see here that the Y Predicted and the Y Test have linear relation, which is what we expect.
  • There are a few data points which deviates from the line, for example the one on the lower left corner.

We can now calculate the R2 score on the test data like this:

from sklearn.metrics import r2_score
r2_score(y_test, y_predicted)

We can also calculate the Adjusted R2 like this:

Based on the coefficient values from the OLS regression result we construct the linear equation for the best fitted line, starting from the top predictors like this:

                 coef    std err          t      P>|t|      [0.025      0.975]
const          0.2444      0.028      8.658      0.000       0.189       0.300
column1        0.2289      0.008     28.108      0.000       0.213       0.245
column2        0.1258      0.011     10.986      0.000       0.103       0.148
column3        0.5717      0.025     22.422      0.000       0.522       0.622
column4       -0.1764      0.038     -4.672      0.000      -0.251      -0.102
column5       -0.1945      0.026     -7.541      0.000      -0.245      -0.144
column6       -0.2362      0.026     -8.946      0.000      -0.288      -0.184

y = 0.2444 + 0.5717 column3 – 0.2362 column6  + 0.2289 column1 – 0.1945 column5 – 0.1764 column4 + …

Based on the absolute value of the coefficients we can see that the top 3 predictors are column3, column6 and column1. It is very useful in any machine learning project to know the top predictors, i.e. the most influencing features because then we can take business action to ensure that those features are maximised (or minimised if the coefficient is negative).

Now we have the linear regression equation we can use this equation for predict the target variable for any given input values.

7 April 2021

Handling Class Imbalance

Filed under: Data Warehousing — Vincent Rainardi @ 7:17 am

In this article I will explain a few ways to treat class imbalance in machine learning. I will also give some examples in Python.

What is class imbalance?

Imagine if you have a data set containing 2 classes: 100 class A and 100 class B. This is called a balanced data set. But if those 2 classes are 5000 class A and 100 class B that is an imbalanced data set. This is not limited to 2 classes, but can happen on more than 2 classes. For example: class A and B both have 5000 members, whereas class C and D both have 100 members.

In an imbalance data set, the class with fewer members is called the minority class. The class with much more members is called the majority class. So if class A has 5000 members and class B 100 members, class A is the majority class and class B is the minority class.

Note that the “class” here is the target variable, not the independent variable. So the target variable is a categorical variable, not a continuous variable. A case where the target data set has 2 classes like above is called “binary classification” and it is quite common in machine learning.

At what ratio it is called class imbalance?

There is no exact definition on the ratio. If class A is 20% of class B I would call it imbalance. Whereas if class A is 70% of class B I would call it balance. 50% I would say is a good bet. It is wrong to dwell on finding the precise ratio range because each data set and each ML algorithm is diffferent. Some cases have bad results at 40%, some cases are ok with 40%.

Why class imbalance occurs

Some data is naturally imbalance, because one class happens rarely in nature, whereas the other happens frequently. For example: cancer, fraud, spam, accidents. The number of people with cancer are naturally much less than those without. The number of fraudulant credit card payments are naturally much less than good payments. The number of spam emails are much less than good emails. The number of flight having accidents are naturally much less than good flights.

Why class imbalance needs to be treated

Some machine learning algorithms don’t work well if the target variable is imbalanced, because during training the majority class would be favoured. As a result the model would be skewed toward the majority class. This situation is an issue because in most cases what we are interested in is predicting the minority class. For example: predicting that a transaction is a fraud, or that an email is a spam, is more important than predicting the majority class.

That is the reason why class imbalance needs to be treated. Because the model would be skewed towards the majority class, and we need to predict the minority class.

How to treat class imbalance

We resolve this situation by oversampling the minority class or by undersampling the majority class.

Oversampling the minority class means we randomly choose sample data from the minority class many times, whereas on the majority class we don’t do anything.

For example if class A has 5000 members and class B has 100 members, we resample class B 4950 times. Meaning that we pick data randomly from class B 4950 times. Effectively it is like duplicating class B data 50 times.

Undersampling the minority class means that we randomly selecting data from the majority class as many times as the minority class. In the above example we randomly pick 100 samples from class A, so that both class A and class B have 100 members.

Apart from randomly selecting data there are many other techniques, including:

  • Creating a new samples (called synthetic data)
  • Selecting samples not randomly but favouring samples which are misclassified
  • Selecting samples not randomly but favouring samples which resembles the other class

Jason Brownlee explained several other techniques such as SMOTE, Borderline Oversampling, CNN, ENN, OSS in this article: link.

Python examples

1. Random Oversampling

# Import resample from the Scikit Learn library
from sklearn.utils import resample

# Put the majority class and minority class on separate dataframes
majority_df = df[df["fraud"]==0]
minority_df = df[df["fraud"]==1] 

# Oversampling the minority class randomly
new_minority_df = resample( minority_df, replace = True, 
                            n_samples = len(majority_df), 
                            random_state = 0 )

# Combine the new minority class with the majority class
balanced_df = pd.concat([majority_df, new_minority_df])

2. Synthetic Minority Oversampling Technique (SMOTE)

# Import SMOTE from the Imbalance Learn library
from imblearn.over_sampling import SMOTE

# Oversampling the minority class using SMOTE
s = SMOTE()
X_new, y_new = s.fit_resample(X, y)

Jason Brownlee illustrates very well which part of the minority class got oversampled by SMOTE in this article: link. Please notice how the minority class differs on the first 3 plots in his article. We can see clearly how SMOTE with random undersampling is better than SMOTE alone or random undersampling alone.

6 April 2021

Natural Language Processing (NLP)

Filed under: Data Warehousing — Vincent Rainardi @ 8:15 am

NLP is different to all other machine learning areas. Machine learning usually deals with mathematics, with numbers. It is about finding a pattern in the numbers, and make a prediction. The root of analysis is mathematical such as matrix, vectors, statistics, probability and calculus. But NLP is about words and sentences which is is very different.

We are now used to Alexa, Siri and Google able to understand us and answer us back in a conversation (5 years ago it wasn’t like that). When we type a reply to an email in Gmail or a message in Linked In we are now used to receiving suggestions about what we are going to type. And when we login to British Gas, or online banking or online retail shop we now find chat bots with whom we will be able have a useful conversation. Much better than 5 years ago. There is no doubt there has been a significant advancements in this area.

The processing of language, be it voice or text, are done in 3 levels. The bottom level is lexical analysis, where ML deals with each word in isolation. The middle level is syntax analysis, where ML analyses the words within the context of the sentence and the grammar. The top level is semantic analysis where ML tries to understand the meaning of the sentence.

To do lexical analysis we start with regular expression. We use regular expression to find words within a text, and to replace them with another words. Then we learn how to identify and remove stop words such as and, the, a which occur frequently but don’t provide useful information during lexical analysis. The third step is learning how to break the text into sentences and into words. And finally for each word we try to find the base word either using stemming, lemmatisation or soundex.

Stemming is a process of removing prefixes and suffixes like “ing” and “er” from “learning” and “learner” to get the base word which is learn. Lemmatisation is a processes of changing a word to its root, e.g. from “went” to “go”, and from “better”, “well”, “best” to “good”. Soundex is a 4-character code that represents the pronounciation of a word, rather than its spelling.

The syntax analysis is done by tagging each word as noun, verb, adjective, etc. (called “part of speech”). The tagging is done by parsing (breaking up) the sentences into groups of words (phrases), analysing the grammatical patterns, and considering the dependencies between words.

Semantic analysis is about understanding the meaning of the words and sentences by looking at the structure of the sentence and the word tagging. Words such as “Orange” can mean colour, a fruit or a area, and “Apple” can mean a fruit or a company, depending on the sentence. In semantic analysis we either assign predefined categories to a text (for example for sentiment analysis, for classifying messages or for chat bots) or pull out a specific information from a text (for example for extracting certain terms from IRS contracts, or other documents).

21 February 2021

One or Two Fact Tables

Filed under: Data Warehousing — Vincent Rainardi @ 7:34 pm

One of the data I receive from my data sources is like this:

In this case, should we make it into one fact table like this: (one row per company per year, with M1, M2, M3 repeated) – let’s call this option 1

Or should we make it into two fact tables like this: (the first one is 1 row per company and the second one is 1 row per company per year) – let’s call it option 2

Kimball theory says we should do option 2, because the grains are different. A fact table is determined by its grain. If the data is at different grain, then it does not belong in this fact table, it belongs to another fact table. In option 1, M1 to M3 do not belong there, because their grains are different. They belong to another fact table, hence option 2 is the correct way.

So that was what I did when I first came into this this situation. Stick to the book, I said to myself, and you’ll be safe.

But then the same situation came up again, on another data source. I should model it the same way I thought. But this one is a bit “rugged”, notice M4 and M5 below which are only applicable to some years:

And M1A is closely related to M1 (they are used together), M2A is closely related to M2, M3A is closely related to M3, so the two fact table almost always have to be joined together.

Also notice that there are 15 measures which are annual, and only 3 which are not.

So I thought rather than the users having to join those two fact tables every time because the first 3 measures are used together, I think the right thing to do is to put them in one fact table like this:

So I did. And there are two more data sources like that afterwards. I still feel it is the right thing to do despite it breaks the normalisation rules (who’s doing normalisation in data warehousing anyway?) and breaks a Kimball principle on fact tables.

It’s not because there are only 3 measures which are repeated (even one is too many) but because otherwise to use it users will have to join the two fact tables.

What’s wrong with joining two fact tables? Nothing. Joining fact tables was done since the dawn of data warehousing, with shipments and orders as the classic example in 1990s.

Oh well, both options have the good side and bad side, the positives and negatives, and I think we could present good arguments on both options. What do you think?

And that is the title of this article: One or Two Fact Tables.

Vincent Rainardi, 21st Feb 2021 (gosh it’s been a looong time since I last wrote about data warehousing. That’s because I’m currently doing a master degree on machine learning, which is a lot of fun! Can’t have enough of it! Doing Python and math every day 🙂 At first I thought I would be converting to ML. But now that I’ve done quite a lot of it, I’m sure I don’t want to let DWBI and Investment Management go. I will bring ML into my DWBI work in financial sector. And keep BA as my main work, not ML engineer 🙂 It’s a very good feeling to be able to know what we want to do in life!

8 November 2020

Day Measures

Filed under: Data Warehousing — Vincent Rainardi @ 3:04 am

Sometimes the measures we want to store is the number of days. For example:

  • the number of days from when we received an order until we shipped it
  • the number of days from when a booking was made until the engineer visited
  • the number of days from today until a bond matures
  • the number of days since the policy was incepted (began) until today

I call these measures Day Measures.
When a day measure is “from today” or “until today”, then the measure changes everyday.
For example: the maturity of a bond is the number of years from today until a bond matures. If the bond matures in 31st December 2030, and today is 9th Nov 2020, then the maturity is 10.1451 years (10 + 52/365.25).
Tomorrow it is 10.1396. The maturity of a bond is important because the bigger the maturity, the higher the risk that the bond value can be impacted by the interest rate.

Days to shipment (the number of days from when the order is received until shipment) does not change if it was a last year order. But if the order was placed yesterday and it usually takes about 3 weeks for us to fulfill an order, then in the next 3 weeks the “days to shipment” changes every day.

If we have a daily periodic snapshot fact table, and we have a day measure in this fact table, then we need to calculate this measures every day.

The question is: is that right?
Couldn’t we just store the maturity date of the bond, rather than calculating the Maturity daily?
Couldn’t we just store the shipment date and the order date, rather than calculating “days to shipment” daily?

We can and we should. But “in addition to” not “instead of”.
It would be silly to store just the date because everytime we want to use it we need to calculate it. The point of having a data warehouse is not to save storage, but to make it easy to use. Easy to query.
When we calculate the Maturity or Days To Shipment, we don’t need to calculate anything. It is there ready for us to use.

So yes it is worth calculating Days Measures every day, and store them in the daily periodic snapshot fact table.

Next Page »

Blog at