13 July 2021

Learning Machine Learning with Upgrad

Filed under: Machine Learning — Vincent Rainardi @ 7:46 am

For the last 10 months I’ve been doing a master’s degree in machine learning with Upgrad (link). It has been a very good and very enjoyable journey. The opening webinar back in October 2020 was fantastic. They talked about various applications of AI such as image recognition for blind people, chest X-ray diagnosis, NFL video advert analysis, Makoto Koike’s cucumber sorter, AlphaGo Zero and Volvo’s recruiting car. Everyone was assigned a student mentor who guides us through our journey and answers our non-academic questions. We have technical assistants who answer our academic questions (we have a discussion forum too). We learn primarily through videos (which suits me well as I’m in the UK, with different working hours to India) and their learning platform is very good. Every week we have optional doubt resolution sessions where we can ask questions to real teachers (their teachers are very good at explaining difficult concepts so that they are easy to understand). There are also a lot of webinars where industry experts share their real world experiences of AI.

The thing I like best is the small group coaching where we learn in a group of eight, coached by an industry expert. My coach is from Paypal, the same industry as me (I work in asset management in London). The session is interactive: our coach explains things and we can ask questions, and it is always practical, often discussing the “notebook” (meaning the Python code, for those who are not familiar with Jupyter). Our coach is an expert in ML and a very good teacher. We are really lucky that he’s willing to spend time coaching us. Sometimes we had a one-to-one discussion with our coach. At one time (just once) we students taught each other and learned from one another. Everyone was also assigned an industry mentor, with whom I discuss my job in the real world, my blog, and my aspirations/ideas in ML. Most students are looking for a job in ML and receive a lot of guidance from their mentor. I’m not looking for a new job, but I’m very grateful to have a very experienced mentor. My mentor is from Cap Gemini, an industry leader in AI with 25 years of experience (13 of which were with Microsoft). I’m really lucky that he’s willing to spend time mentoring me.

In the first month I was learning Python and SQL, covering data structures, control structures, pandas, numpy, data loading, visualisation, etc. all on Jupyter notebook. I’m a SQL and BI veteran but I rarely do coding at work. I mean real coding, not SQL, ETL or BI tools. The last time I did real coding was 10 years ago (Java) and before that it was 20 years ago (C#). When I was young I really liked coding (Basic, C++, Pascal) and this Python coding with Upgrad really took me back to my childhood hobby. I really enjoy coding in Python as part of this course.

Then I learned about statistics and data exploration. I did Physics Engineering at uni so I had done statistics before, and learning it again was enjoyable. The teacher was really good (from Gramener, link) and gave us real world examples like restaurant sales, securities correlation and electricity meter readings. I also learned about probability, the central limit theorem and hypothesis testing. All these turned out to be very useful when applying machine learning algorithms. The assignments were real world cases, such as investment analysis and loans, and the fact that they were in finance made me enjoy them more.

Then for a few months I learned various ML algorithms such as linear regression, logistic regression, Naive Bayes, SVM, Decision Tree, Random Forest, Gradient Boosting, clustering and PCA. Also various important techniques and metrics such as regularisation (Ridge, Lasso), model selection, accuracy and precision. Again the assignments were real world cases such as predicting house prices, how weather affects sales, and the telecommunication industry.

Then I learned about natural language processing (NLP) which was very different. All the other algorithms were based on mathematics, but this one is based on languages. It was such an eye opener for me to learn how computers understand human languages (I wrote an article about it: link). And now I’m learning neural networks, which is the topic I like most because it is the most powerful algorithm in machine learning. We started with computer vision (CNN, convolutional neural network, link) and now I’m studying RNN (Recurrent Neural Network, link) which is widely used for stock market analysis and other sequential data.

I feel lucky I studied Physics Engineering at uni, because it helped me a lot in understanding the mathematics behind the algorithms, especially the calculus in neural networks. I’ve done a few ML courses on Coursera (see my articles on this: link, link) but this Upgrad one is way, way better. It is a real eye opener. I can now read various machine learning papers. I mean real academic research papers, written by PhDs! A few years ago I attended a machine learning “meetup” in London. Meetup is an app where people with similar interests arrange to meet. Usually the ML meetups were in the form of a lecture, i.e. a 1.5 hour session in the evening where two speakers explained something about machine learning. But this time it was different. It was a discussion forum of 10 people and there was no speaker. Everyone had to read a paper (it was the Capsule Neural Network paper by Geoffrey Hinton) and in the meetup we discussed it. I didn’t understand a thing! I did understand neural networks a bit, but I had no background in CNN so I could not understand the paper. But now I understand. I can read research papers! I didn’t know that I would be this happy to be able to read machine learning papers. It is really important to be able to read ML papers because ML progresses so fast, and the research papers are superb sources on the latest inventions in ML.

7 July 2021

What is CNN Part 2

Filed under: Machine Learning — Vincent Rainardi @ 4:54 am

In the first part (link), after trying 28 different models the conclusion was that the best models are model #26 and #28. Model #28 has more validation fluctuation, but it has half the number of parameters.

But as we can see above both model #26 and #28 suffer from overfitting, meaning that the training accuracy is very high (about 90%) but the validation accuracy is very low (about 50%). This big gap of 40 percentage points is a clear indication of overfitting. To solve this we need to do image augmentation, i.e. we need to rotate, flip, zoom out, zoom in, and shift the image, like this:

The top left image is the original image. The other 11 images are generated using random rotation, random flip, random zoom and random contrast. I put 3 sets so we can understand the interplay between these 4 transformations on the augmented images: rotate, flip, zoom, contrast (combined). Jason Brownlee gave a good tutorial on this: link.

After doing image augmentation the result is as follows:

The gap between the training and validation accuracy is closing but both are still low! The best one, with the narrowest gap and the highest validation accuracy, is A3. It has training accuracy = 55% and validation accuracy = 54%.

In a situation like this (i.e. the accuracy is still low after doing image augmentation) we need to check the number of training images in each class. If one class has only a few images and another class has lots of images, then the model training will suffer from the “class imbalance” problem. Shubrashankh Chatterjee explained this very well in his article: link.

Basically we auto generate additional images using image augmentation (rotate, flip, zoom, contrast, shift, etc) so that each class has the same number of images. After doing this, the result is like this: (note that it’s 30 epochs not 20)
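The balancing arithmetic can be sketched as follows. This is only a minimal illustration: the class names and image counts are hypothetical (they are not from the dataset used in this article), and the helper images_to_generate is a name made up for this sketch. For each class it works out how many augmented images we need to generate so that every class ends up with as many images as the largest class.

```python
def images_to_generate(counts):
    """Return, per class, how many augmented images are needed to match the largest class."""
    target = max(counts.values())
    return {name: target - n for name, n in counts.items()}


# Hypothetical image counts per class (for illustration only).
class_counts = {"class_a": 500, "class_b": 120, "class_c": 350}

extra = images_to_generate(class_counts)
balanced = {name: class_counts[name] + extra[name] for name in class_counts}

print(extra)     # number of augmented images to create for each class
print(balanced)  # every class now has the same number of images
```

The smaller a class is, the more of its images are augmented copies, so it is worth checking the accuracy of those classes separately.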

So both models still suffer from validation fluctuation, even after 30 epochs. Even with batch normalisation and dropout. Even with dropout on the dense layer! I’m still finding out why, but I think it might be because of the type of augmentation; for example I didn’t change the colours of the images. To troubleshoot this we need to find out which classes are causing the low accuracy: is it just some particular classes or all classes? But that’s for another time and another article. Happy learning!


4 July 2021

What is Convolutional Neural Network (CNN)?

Filed under: Machine Learning — Vincent Rainardi @ 7:52 am
Tags: Machine Learning

In the previous article I explained what convolution was (link). We use convolution in image classification/recognition, to power a special type of neural network called CNN (Convolutional Neural Network). In this article I’ll explain what CNN is and how we use it for image classification.

What is CNN?

CNN is a neural network consisting of convolutional layers and pooling layers like this:

In the above architecture, the CNN classifies images into 10 different classes.

The dimension of the images is 200 x 200 pixels, in colour i.e. 3 layers (RGB), and the output is 10 classes.
The convolutional layers extract features, such as detecting edges (see my last article here). We set the dimension of the first convolutional layer to be the same as the image, i.e. 200 x 200. The 32 is the number of features (filters) we are extracting; usually we start with 32 or 64 and then double it on the next group. When an image passes through a convolutional layer we try not to change the dimension, by using “padding”.
The pooling layers summarise the features, either by taking the average or the maximum. When an image passes through a pooling layer, the dimension is typically reduced by half.
The flatten layer changes the shape into 1 dimension so we can do normal neural network operations (configuring the weights). In the above example the last pooling layer is 25 x 25 x 128. When this is flattened the 1 dimensional shape is 25 x 25 x 128 = 1 x 80,000.
The fully connected layers (also known as dense layers) are fully connected layers of neurons (multilayer perceptron = MLP). We tend to set the number of neurons of the first fully connected layer to 4x or 8x the number of features of the last pooling layer. The second layer can be reduced to half of the first layer, e.g. in the above example the first layer is 1024 (8x of 128) and the second layer is 512 (half of 1024).
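To see how the dimensions change from layer to layer, here is a small sketch in plain Python (no Keras needed). It assumes three groups of convolution + pooling with 32, 64 and 128 filters, “same” padding on the convolutions and 2 x 2 pooling. That grouping is an assumption chosen to match the 200 x 200 input and the 25 x 25 x 128 last pooling layer described above, and the helper names conv_same and pool are made up for this illustration.

```python
def conv_same(shape, filters):
    """Convolution with 'same' padding: height and width unchanged, depth = number of filters."""
    height, width, _ = shape
    return (height, width, filters)


def pool(shape, size=2):
    """Pooling with a size x size window and stride = size: height and width are divided by size."""
    height, width, depth = shape
    return (height // size, width // size, depth)


shape = (200, 200, 3)            # colour input image (RGB)
for filters in (32, 64, 128):    # assumed: three conv + pool groups
    shape = pool(conv_same(shape, filters))
    print(shape)

flattened = shape[0] * shape[1] * shape[2]
print(flattened)                 # length of the 1 dimensional vector after the flatten layer
```

Each pooling step halves the height and width, while the depth is set by the number of filters of the convolution before it.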

Python Code

So it is about composing many layers of neural network one by one. The tool that most people use for CNN nowadays is Keras (the high-level API of TensorFlow). CNN requires a lot of computing resources, i.e. GPU, disk and memory, which is why we usually use Google Colab or Kaggle. Both of them provide a GPU environment, which can speed up our code 10x (link) or even 100x. A CNN epoch which took 15-20 minutes on my CPU laptop only took 3-4 seconds in Colab (an epoch is one full pass through the training data).

As for disk space, CNN can take a lot of disk space. For CNN we need to do image augmentation, meaning we randomly rotate the source images so that the model is not overfitting. Not only rotating, but also flipping, zooming, shifting, changing the brightness, etc. all can be done very simply in Keras (link). The rotated/shifted/flipped images can take a lot of disk space, and for that we can use Google Drive. Luckily in Keras we can generate the augmented images when we fit the model, without having to generate them and store them on the disk! (link)

First we use image_dataset_from_directory to load the images from a directory into a TF dataset:

import tensorflow as tf
training_dataset = tf.keras.preprocessing.image_dataset_from_directory(data_directory, validation_split=0.25, subset="training", seed=100, image_size=(200,200), batch_size=30)

Note: the output classes are automatically generated from the sub directory names and they are stored in the TF dataset as “class_names”.
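To illustrate the idea of getting the classes from the sub directory names, here is a small sketch using only the Python standard library. It mimics the idea, it is not how Keras implements it: the helper infer_class_names and the folder names are hypothetical, and the sorted order reflects my understanding that Keras uses alphanumerical order unless the class names are passed explicitly.

```python
import tempfile
from pathlib import Path


def infer_class_names(data_directory):
    """Return the sub directory names in sorted order, one per class."""
    return sorted(p.name for p in Path(data_directory).iterdir() if p.is_dir())


# Build a throwaway directory tree with hypothetical class folders.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("dog", "cat", "bird"):
        (Path(tmp) / name).mkdir()
    class_names = infer_class_names(tmp)

print(class_names)
```

The position of a name in this list is the integer label that the images in that folder would receive.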

Then we build the CNN model like this:

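from tensorflow import keras
from tensorflow.keras import Sequential, layers
from tensorflow.keras.layers import BatchNormalization, Conv2D, Dense, Dropout, Flatten, MaxPooling2D

# Data augmentation: random flip, rotation, zoom and contrast
# (newer TensorFlow versions also expose these layers directly, e.g. layers.RandomFlip)
augment = keras.Sequential([
    layers.experimental.preprocessing.RandomFlip(mode="horizontal_and_vertical", input_shape=(200,200,3)),
    layers.experimental.preprocessing.RandomRotation((0.1, 1.3), fill_mode='reflect'),
    layers.experimental.preprocessing.RandomZoom(height_factor=(-0.15, 0.15), fill_mode='reflect'),
    layers.experimental.preprocessing.RandomContrast(0.1)
])

# Augment first, then divide by 255 to scale the pixel values to the 0-1 range
model = Sequential([augment, layers.experimental.preprocessing.Rescaling(1./255, input_shape = (200, 200, 3))])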

# Two 32 convolution layers with batch normalisation, then max pooling with dropout
model.add(Conv2D(32, (3,3), padding="same", activation="relu"))
model.add(BatchNormalization())
model.add(Conv2D(32, (3,3), padding="same", activation="relu"))
model.add(BatchNormalization())
model.add(MaxPooling2D(pool_size=(2,2)))
model.add(Dropout(0.2))

# Two 64 convolution layers with BN, then max pooling with dropout
model.add(Conv2D(64, (5,5), padding="same", activation="relu"))
model.add(BatchNormalization())
model.add(Conv2D(64, (5,5), padding="same", activation="relu"))
model.add(BatchNormalization())
model.add(MaxPooling2D(pool_size=(2,2)))
model.add(Dropout(0.2))

# Flatten then 3 fully connected layers with dropout
model.add(Flatten())
model.add(Dense(128, activation="relu"))
model.add(Dropout(0.2))
model.add(Dense(64, activation="relu"))
model.add(Dropout(0.2))
model.add(Dense(10, activation="softmax"))

For a complete notebook on CIFAR 10 dataset (images with 10 classes) please refer to Jason Brownlee’s article (link) and Abhijeet Kumar’s article (link). The arguments for Random Flip/Rotation/Zoom, etc. are at Keras documentation here: link.

As we notice above there are a few things that we need to set when doing CNN:

Number of filters for Conv2D: for simple images such as MNIST start with 8 or 16, doubling on the next group. For complex images such as CIFAR start with 32 or 64, doubling on the next group. Sometimes doubling doesn’t increase the accuracy, in this case keep it the same on the next group, or even decrease. For instance: 32, 32, 16, 16, 128, 64, 10 rather than 32, 32, 64, 64, 128, 64, 10.
Number of Conv2D layers in each group: in the above example I use 2 layers. Sometimes it is enough to use 1 layer as using 2 layers doesn’t increase the accuracy. For instance: 32, 64, 128, 64, 10 rather than 32, 32, 64, 64, 128, 64, 10. Or two layers on the first group but one layer on the second group, like this: 32, 32, 64, 128, 64, 10.
Number of nodes on the dense layer: in the above example I use 128 and 64 (the last layer is dictated by the output classes, for example: the CIFAR 10 dataset has 10 classes so the last layer is 10). For simple images like MNIST, or even complex images like CIFAR 10, sometimes we don’t need to use >100 for the first layer; 16 or even 8 is enough. So we can use 8, 8, 10 instead of 128, 64, 10. Try with 8 first and see what accuracy we get, then increase it to 16, 32, etc. to see if the accuracy improves. Of course it doesn’t have to be a power of 2. It can be 10, 25, 50, etc.
Number of dense layers: in the above example I used two layers, i.e. 128 and 64. Depending on the data, sometimes we need more than 2 layers or we could only need 1 layer. So try with 1 and 3 layers, see if the accuracy increases. For classification we rarely need more than 3 layers (with the right number of neurons).
Pooling layer: in the above example I use max pooling with a pool size of (2,2), which in Keras also means a stride of 2 by default. This is commonly used (in VGGNet for example), effectively reducing the data shape by half on each dimension. But we should try (3,3) and (4,4) as well; if the accuracy is not dropping then the bigger one is better because training would be faster, the model would be simpler and there would be a smaller chance of the model overfitting the data.
With all the above we need to be careful with overfitting. For example, the accuracy on the training data can be 90% but the accuracy on the validation data only 50%. To avoid this we need to use dropout, i.e. randomly switching off some of the units (and therefore their connections to the next layer) during training. In the above example I use dropout of 20%. We need to try 30% and 50% as well. If the accuracy (of the validation data) is not dropping then the higher the dropout the better, because the model would effectively be simpler and there is a smaller chance of it overfitting the data. We add the dropout after the pooling layer, not after the convolution layer.
And finally after every convolution layer we need to add batch normalisation (BN), to make the back propagation converge faster (this is true for all deep neural networks; the problem BN addresses is called internal covariate shift, link). Sometimes if we use BN we don’t need to use dropout (using both can make it worse: link, or make it better: link), so check the accuracy when the dropout is removed. If the accuracy stays the same, remove the dropout.
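To make the dropout idea concrete, here is a minimal sketch in plain Python of what a dropout layer does during training. It is an illustration under stated assumptions, not the Keras implementation: the function name dropout, the seed and the list of activations are made up for this example. Each value is set to zero with probability equal to the dropout rate, and the surviving values are scaled up by 1 / (1 - rate) so that the expected total stays the same.

```python
import random


def dropout(values, rate, rng):
    """Zero each value with probability `rate`; scale the kept values by 1 / (1 - rate)."""
    keep = 1.0 - rate
    return [v / keep if rng.random() >= rate else 0.0 for v in values]


rng = random.Random(100)        # fixed seed so the run is repeatable
activations = [1.0] * 10        # hypothetical outputs of a layer
dropped = dropout(activations, 0.2, rng)
print(dropped)
```

With a rate of 0.2 roughly one value in five is zeroed on each pass, and a different set is zeroed every time. At prediction time nothing is dropped.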

Below are the results of trying different number of filters and layers (assume filter size is constant at 3×3), both on the conv layers and dense layers, as well as trying different batch normalisation and dropout.

Legend:

C: means convolutional layer. The numbers represent the filters in that layer, for example “C: 16 16 32 32” means four convolutional layers, 16 filters on the first and second layers, 32 filters on the third and fourth layers.
D: means dense layer (fully connected layer), which is after the flatten layer. The numbers represent the nodes in the layer. For example “D: 16” means after the flatten layer there is one dense layer with 16 nodes (neurons).
Lower case n following a convolutional layer means batch normalisation. For example: “C: 16 16n 32 32n” means there are 4 convolutional layers, and after the second and fourth layers there is a batch normalisation layer.
Lower case d means dropout layer. The number after d is the dropout rate, i.e. ½ means 50%, ¼ means 25%. For example: “C: 16n 16n d¼ 32n 32n d¼” means four convolutional layers all with batch normalisation, with 25% dropout after the second and fourth layers.
Upper case R on the dense layer means ReLU activation layer.
L2 on the dense layer means the kernel regularizer is using L2 regularization penalty, with L2 factor kept default at 0.01.
The yellow numbers in circle are the model numbers.

In the above case we should choose model #26 because the training accuracy reached 90% at epoch 10 (93% at epoch 20), and the validation accuracy reached 51% at epoch 13 (50% at epoch 20). The goal here is for both the training accuracy and validation accuracy to be as high as possible, using the minimum number of filters and layers.

We also look for stability. For example, model #27 had a big validation drop at epoch 18, and we want to avoid things like that. We want the validation accuracy to be stable, because if it fluctuates a lot it could unexpectedly drop when we run it for 50 epochs.

In terms of resources, what we are looking for is actually not the number of filters and layers, but the number of parameters. The “model summary” for number #26 looks like below. It displays the number of parameters for each layer.

As we can see above, the dense 16 layer (second line from bottom) has 946,704 parameters. This is because all 16 nodes in the dense layer are connected to the previous layer, which is the flattened layer with 59,168 nodes. So the number of weights = 59,168 x 16 = 946,688. Adding 16 biases, one for each node in the dense layer, gives 946,704.

Whereas for number #28 the “model summary” looks like this:

We can see that the first dense 8 layer (third line from bottom) has 473,352 parameters. This is because it has 8 nodes and those nodes are connected to the previous layer, which is a flattened layer with 59,168 nodes. So the number of weights = 59,168 x 8 = 473,344. Adding 8 biases, one for each node in the dense layer, gives 473,352.

Comparing the total number of parameters, model #26 has 963k parameters whereas model #28 has 490k parameters, only about half of model #26. Because of this model #28 is a strong contender to model #26. Yes, model #28 has more validation fluctuation, but it has half the number of parameters.
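The parameter counts of the dense layers can be checked with a few lines of Python. This is just the arithmetic from the model summaries, wrapped in a small helper (dense_parameters is a name made up for this sketch): one weight per input per node, plus one bias per node.

```python
def dense_parameters(inputs, nodes):
    """Weights (one per input per node) plus one bias per node."""
    return inputs * nodes + nodes


flattened = 59_168                        # nodes in the flatten layer, from the model summaries

print(dense_parameters(flattened, 16))    # dense 16 layer of model #26 -> 946704
print(dense_parameters(flattened, 8))     # first dense 8 layer of model #28 -> 473352
```

Because the flatten layer is so large, the first dense layer dominates the total, which is why halving its nodes roughly halves the size of the model.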

We can see that the validation accuracy is still low. In part 2 of this article (link) I’m going to address that issue.

As we can see above, configuring CNN is more of an art than science. But after a few projects we should get some understanding about how each hyperparameter influences the result. Happy learning!


1 July 2021

What is Convolution?

Filed under: Machine Learning — Vincent Rainardi @ 7:32 am
Tags: Machine Learning

For me image classification/recognition is one of the most exciting topics in machine learning (ML). Today all good image classifiers use neural networks. For image classification we use a specific type of neural network called Convolutional Neural Network (CNN).

I have heard this term so many times in the last 3 years but I never understood what convolution means. So in this article I would like to explain what it means.

Image Classification

Since 2015 ML has been better than humans at classifying images. A lot faster no doubt, but also more accurate. Here are a few ML algorithms which became historical landmarks in the ImageNet image classification competition (source: Gordon Cooper, Semiconductor Engineering, link)

The competition is about classifying 1.2m training images into 1000 categories (link). All the dark purple deep learning architectures above are convolutional neural networks (CNN). Over the years, the number of layers gradually increases as the available computing power increases.

AlexNet started the deep learning revolution in 2012 by using CNN and graphics processing units (GPU), achieving a massive improvement over the previous year’s result (link).
ResNet (stands for Residual Neural Network) can bypass 2-3 layers if those layers are not useful (link). This concept was inspired by the pyramidal cells in the cerebral cortex.
SENet (stands for Squeeze and Excitation Network) can adaptively recalibrate channel-wise feature responses (link).

The applications are massive and life changing, from detecting cancer in medical images to self driving cars, from product search to face recognition (link).

What is Convolution?

Convolution is a mathematical operation between two functions, as follows: reverse and shift one function, then take the product of both functions, then take the integral (link).

But in image processing, convolution is a process of applying a filter on an image. This is because:

1. Mathematically speaking, a “convolution” in the time domain becomes a “multiplication” in the frequency domain (link).
2. Applying a filter on an image is a multiplication process.
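Written out, the definition of convolution and the frequency-domain property look like this. The symbols are chosen here for illustration and are not taken from the linked sources: f and g are the two functions, τ is the integration variable, and F denotes the Fourier transform.

```latex
(f * g)(t) = \int_{-\infty}^{\infty} f(\tau)\, g(t - \tau)\, d\tau
\qquad\qquad
\mathcal{F}\{f * g\} = \mathcal{F}\{f\} \cdot \mathcal{F}\{g\}
```

The g(t − τ) term is the “reverse and shift” step, the product inside the integral is the “take the product” step, and the integral adds everything up.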

Let’s go through some examples so point 2 above becomes clear.

If we have this image and this filter, this is the convolution:

We get the yellow 5 on the convolution by multiplying the yellow area on the image by the filter:

So we multiply the green cell on the image with the green cell on the filter (1×1), the blue cell on the image with the blue cell on the filter (0x1), etc. and then add them up to get 5 on the convolution:

Similarly to get the yellow 4 on the convolution, we multiply the yellow area on the image by the filter:
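The multiply-and-add procedure described above can be sketched in a few lines of plain Python. The 5 x 5 image and 3 x 3 filter below are hypothetical values chosen for this illustration (they are not the ones from the pictures in this article), and the function assumes a square filter. Note that, like CNN libraries, it slides the filter over the image without reversing it first.

```python
def convolve(image, kernel):
    """Slide the filter over the image; at each position multiply cell by cell and add up."""
    k = len(kernel)
    rows = len(image) - k + 1
    cols = len(image[0]) - k + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j] for i in range(k) for j in range(k))
            for c in range(cols)
        ]
        for r in range(rows)
    ]


# Hypothetical 5 x 5 image and 3 x 3 filter (for illustration only).
image = [
    [1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
]
kernel = [
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
]

result = convolve(image, kernel)
for row in result:
    print(row)
```

Each number in the result comes from one 3 x 3 area of the image, so a 5 x 5 image with a 3 x 3 filter gives a 3 x 3 convolution.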

Why do convolution?

The purpose of doing convolution is to detect a pattern on the image.

If we want to detect if there is a horizontal line on the image then we apply this filter:

If we want to detect if there is a vertical line on the image then we apply this filter:

And if we want to detect if there is a diagonal line on the image then we apply this filter:

This is called “feature extraction”. We use convolution to detect “lines” on the image.

The same 3 filters above not only detect “lines” on the image but they also detect “area”. It is probably easier to see if we don’t have the numbers on the cell, see below right:

Note: we call the border of such a line or area an “edge”. The 3 filters above detect “edges”.
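Here is a small sketch of a filter responding to one kind of line and not another. The filter values and the two tiny images are assumptions made up for this illustration (they are not necessarily the filters shown in the pictures above): the filter rewards a bright row with darker rows above and below it, so it reacts to a horizontal line and gives zero for a vertical one.

```python
def convolve(image, kernel):
    """Slide the filter over the image; at each position multiply cell by cell and add up."""
    k = len(kernel)
    rows = len(image) - k + 1
    cols = len(image[0]) - k + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j] for i in range(k) for j in range(k))
            for c in range(cols)
        ]
        for r in range(rows)
    ]


# Assumed horizontal-line filter: positive middle row, negative rows above and below.
horizontal_filter = [
    [-1, -1, -1],
    [2, 2, 2],
    [-1, -1, -1],
]

# Two hypothetical 5 x 5 images: one horizontal line, one vertical line.
horizontal_line = [[0] * 5, [0] * 5, [1] * 5, [0] * 5, [0] * 5]
vertical_line = [[0, 0, 1, 0, 0] for _ in range(5)]

on_horizontal = convolve(horizontal_line, horizontal_filter)
on_vertical = convolve(vertical_line, horizontal_filter)
print(on_horizontal)   # strong response (6) in the row where the line is centred
print(on_vertical)     # all zeros: the filter does not react to a vertical line
```

Transposing the filter (swapping rows and columns) gives the matching vertical-line detector.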

So next time people say Convolutional Neural Network, you know what Convolution means 🙂

In this article (link) I explain what CNN is.
