In the previous article I explained what convolution is (link). We use convolution in image classification/recognition to power a special type of neural network called a CNN (Convolutional Neural Network). In this article I'll explain what a CNN is and how we use it for image classification.
What is a CNN?
A CNN is a neural network consisting of convolutional layers and pooling layers, like this:

In the above architecture, the CNN classifies images into 10 different classes.
- The images are 200 x 200 pixels, in colour, i.e. 3 channels (RGB), and the output is 10 classes.
- The convolutional layers extract features such as edges (see my last article here). We set the dimensions of the first convolutional layer to be the same as the image, i.e. 200 x 200. The 32 is the number of features we are extracting; usually we start with 32 or 64 and double it on the next group. When an image passes through a convolutional layer, we try not to change its dimensions, which we do by using "padding".
- The pooling layers summarise the features, either by taking the average or the maximum. When an image passes through a pooling layer, each dimension is typically reduced by half.
- The flatten layer changes the shape into 1 dimension so we can do normal neural network operations (adjusting the weights). In the above example the output of the last pooling layer is 25 x 25 x 128. When this is flattened, the 1-dimensional shape is 25 x 25 x 128 = 80,000.
- The fully connected layers (also known as dense layers) are fully connected layers of neurons (a multilayer perceptron, MLP). We tend to set the number of neurons in the first fully connected layer to 4x or 8x the number of filters in the last pooling layer. The second layer can be reduced to half of the first layer, e.g. in the above example the first layer is 1024 (8x of 128) and the second layer is 512 (half of 1024).
Python Code
So building a CNN is about composing many layers of the neural network one by one. The tool that most people use for CNNs nowadays is Keras (part of TensorFlow). CNNs require a lot of computing resources, i.e. GPU, disk and memory, which is why we usually use Google Colab or Kaggle. Both of them provide a GPU environment, which can speed up our code 10x (link) or even 100x. A CNN epoch which took 15-20 minutes on my CPU laptop took only 3-4 seconds in Colab (an epoch is one complete pass through the training data).
As for disk space, a CNN can take a lot of it. For a CNN we need to do image augmentation, meaning we randomly rotate the source images so that the model does not overfit. Not only rotating, but also flipping, zooming, shifting, changing the brightness, etc., all of which can be done very simply in Keras (link). The rotated/shifted/flipped images can take a lot of disk space, and for that we can use Google Drive. Luckily, in Keras we can generate the augmented images on the fly when we fit the model, without having to generate them and store them on disk! (link)
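For illustration only, below is a minimal sketch of this kind of on-the-fly augmentation using Keras' older ImageDataGenerator API (the model later in this article uses the newer preprocessing layers instead); the parameter values are just example assumptions, and data_directory is the same image folder used further down:
from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(
    rotation_range=20,            # rotate randomly by up to 20 degrees
    horizontal_flip=True,         # flip left-right at random
    zoom_range=0.15,              # zoom in/out by up to 15%
    width_shift_range=0.1,        # shift horizontally by up to 10%
    height_shift_range=0.1,       # shift vertically by up to 10%
    brightness_range=(0.8, 1.2),  # vary the brightness
    rescale=1./255)               # scale pixel values to 0-1
# Batches are augmented in memory during training; nothing extra is written to disk
train_generator = datagen.flow_from_directory(
    data_directory, target_size=(200, 200), batch_size=30, class_mode='sparse')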
First we use image_dataset_from_directory to load the images from a directory into a TF dataset:
import tensorflow as tf

training_dataset = tf.keras.preprocessing.image_dataset_from_directory(
    data_directory, validation_split=0.25, subset="training", seed=100,
    image_size=(200, 200), batch_size=30)
Note: the output classes are automatically inferred from the subdirectory names and are stored in the TF dataset as "class_names".
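The validation subset can be loaded the same way (same directory, split and seed), and the inferred class names can be checked, for example:
validation_dataset = tf.keras.preprocessing.image_dataset_from_directory(
    data_directory, validation_split=0.25, subset="validation", seed=100,
    image_size=(200, 200), batch_size=30)
print(training_dataset.class_names)   # one class per subdirectory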
Then we build the CNN model like this:
# Imports used below
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, BatchNormalization, MaxPooling2D, Dropout, Flatten, Dense
# Data augmentation: random flip, rotation, zoom and contrast
augment = keras.Sequential([
    layers.experimental.preprocessing.RandomFlip(mode="horizontal_and_vertical", input_shape=(200, 200, 3)),
    layers.experimental.preprocessing.RandomRotation((0.1, 1.3), fill_mode='reflect'),
    layers.experimental.preprocessing.RandomZoom(height_factor=(-0.15, 0.15), fill_mode='reflect'),
    layers.experimental.preprocessing.RandomContrast(0.1)])
# Rescale the input (divide by 255) and start the model with the augmentation block
model = Sequential([augment, layers.experimental.preprocessing.Rescaling(1./255, input_shape=(200, 200, 3))])
# Two 32 convolution layers with batch normalisation, then max pooling with dropout
model.add(Conv2D(32, (3,3), padding="same", activation="relu"))
model.add(BatchNormalization())
model.add(Conv2D(32, (3,3), padding="same", activation="relu"))
model.add(BatchNormalization())
model.add(MaxPooling2D(pool_size=(2,2)))
model.add(Dropout(0.2))
# Two 64 convolution layers with BN, then max pooling with dropout
model.add(Conv2D(64, (5,5), padding="same", activation="relu"))
model.add(BatchNormalization())
model.add(Conv2D(64, (5,5), padding="same", activation="relu"))
model.add(BatchNormalization())
model.add(MaxPooling2D(pool_size=(2,2)))
model.add(Dropout(0.2))
# Flatten then 3 fully connected layers with dropout
model.add(Flatten())
model.add(Dense(128, activation="relu"))
model.add(Dropout(0.2))
model.add(Dense(64, activation="relu"))
model.add(Dropout(0.2))
model.add(Dense(10, activation="softmax"))
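After building the model we compile and train it. A minimal sketch, assuming the validation subset loaded earlier and the default integer labels from image_dataset_from_directory:
# Integer labels -> sparse categorical crossentropy; track accuracy per epoch
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
history = model.fit(training_dataset, validation_data=validation_dataset, epochs=20)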
For a complete notebook on the CIFAR-10 dataset (images with 10 classes), please refer to Jason Brownlee's article (link) and Abhijeet Kumar's article (link). The arguments for RandomFlip, RandomRotation, RandomZoom, etc. are in the Keras documentation here: link.
As we can see above, there are a few things that we need to set when building a CNN (a sketch of a lighter variant follows this list):
- Number of filters for Conv2D: for simple images such as MNIST, start with 8 or 16 and double on the next group. For complex images such as CIFAR, start with 32 or 64 and double on the next group. Sometimes doubling doesn't increase the accuracy; in this case keep it the same on the next group, or even decrease it. For instance: 32, 32, 16, 16, 128, 64, 10 rather than 32, 32, 64, 64, 128, 64, 10.
- Number of Conv2D layers in each group: in the above example I use 2 layers. Sometimes it is enough to use 1 layer as using 2 layers doesn’t increase the accuracy. For instance: 32, 64, 128, 64, 10 rather than 32, 32, 64, 64, 128, 64, 10. Or two layers on the first group but one layer on the second group, like this: 32, 32, 64, 128, 64, 10.
- Number of nodes in the dense layers: in the above example I use 128 and 64 (the last layer is dictated by the number of output classes, for example the CIFAR-10 dataset has 10 classes so the last layer is 10). For simple images like MNIST, or even complex images like CIFAR-10, sometimes we don't need more than 100 nodes in the first layer; 16 or even 8 can be enough. So we can use 8, 8, 10 instead of 128, 64, 10. Try 8 first and see how the accuracy looks, then increase it to 16, 32, etc. Of course it doesn't have to be a power of 2; it can be 10, 25, 50, etc.
- Number of dense layers: in the above example I used two layers, i.e. 128 and 64. Depending on the data, sometimes we need more than 2 layers, or we may need only 1 layer. So try 1 and 3 layers and see if the accuracy increases. For classification we rarely need more than 3 layers (with the right number of neurons).
- Pooling layer: in the above example I use max pooling with a 2 x 2 pool size (which also implies a stride of 2). This is commonly used (in VGGNet for example), effectively reducing the data shape by half in each dimension. But we should try (3,3) and (4,4) as well; if the accuracy does not drop, then the bigger one is better because the training performance would be better, the model would be simpler and there would be a smaller chance of the model overfitting the data.
- With all of the above we need to be careful about overfitting. For example, the accuracy on the training data can be 90% while the accuracy on the validation data is only 50%. To avoid this we use dropout, i.e. randomly dropping some neurons (and their connections) during training. In the above example I use a dropout of 20%. We should try 30% and 50% as well. If the validation accuracy does not drop, then the higher the dropout the better, because the training performance would be better, the model would be simpler and there would be a smaller chance of overfitting. We add the dropout after the pooling layer, not after the convolution layer.
- And finally, after every convolution layer we should add batch normalisation (BN) to make back propagation converge faster (this is true for all deep neural networks; it addresses what is called internal covariate shift, link). Sometimes if we use BN we don't need dropout (using both can make things worse: link, or better: link), so check the accuracy with the dropout removed. If the accuracy stays the same, remove the dropout.
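To illustrate the points in this list, a lighter variant (one convolution layer per group, an 8-node dense layer, dropout after each pooling layer) might look like the sketch below; it is only an illustration to experiment with, not a recommendation:
# Lighter sketch: one conv layer per group, batch normalisation after each conv,
# dropout after each pooling layer, and a small 8-node dense layer
small_model = Sequential()
small_model.add(layers.experimental.preprocessing.Rescaling(1./255, input_shape=(200, 200, 3)))
small_model.add(Conv2D(32, (3, 3), padding="same", activation="relu"))
small_model.add(BatchNormalization())
small_model.add(MaxPooling2D(pool_size=(2, 2)))
small_model.add(Dropout(0.2))
small_model.add(Conv2D(64, (3, 3), padding="same", activation="relu"))
small_model.add(BatchNormalization())
small_model.add(MaxPooling2D(pool_size=(2, 2)))
small_model.add(Dropout(0.2))
small_model.add(Flatten())
small_model.add(Dense(8, activation="relu"))
small_model.add(Dense(10, activation="softmax"))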
Below are the results of trying different numbers of filters and layers (with the filter size kept constant at 3×3), both for the conv layers and the dense layers, as well as trying different batch normalisation and dropout settings.
Legend:
- C: means convolutional layer. The numbers represent the filters in that layer, for example “C: 16 16 32 32” means four convolutional layers, 16 filters on the first and second layers, 32 filters on the third and fourth layers.
- D: means dense layer (fully connected layer), which is after the flatten layer. The numbers represent the nodes in the layer. For example “D: 16” means after the flatten layer there is one dense layer with 16 nodes (neurons).
- Lower case n following a convolutional layer means batch normalisation. For example: “C: 16 16n 32 32n” means there are 4 convolutional layers, and after the second and fourth layers there is a batch normalisation layer.
- Lower case d means dropout layer. The number after d is the dropout rate, i.e. ½ means 50%, ¼ means 25%. For example: "C: 16n 16n d¼ 32n 32n d¼" means four convolutional layers all with batch normalisation, with 25% dropout after the second and fourth layers.
- Upper case R on the dense layer means ReLU activation layer.
- L2 on the dense layer means the kernel regularizer uses an L2 regularization penalty, with the L2 factor kept at the default of 0.01.
- The yellow circled numbers are the model numbers (a short code sketch of this notation follows the legend).
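To make the notation concrete, a fragment like "C: 16n 16n d¼" with "D: 16 R" corresponds roughly to the following layers (pooling layers are not spelled out in the notation, so they are left out of this sketch):
# C: 16n 16n d¼ -> two 16-filter conv layers, each followed by batch normalisation, then 25% dropout
conv_block = [Conv2D(16, (3, 3), padding="same", activation="relu"),
              BatchNormalization(),
              Conv2D(16, (3, 3), padding="same", activation="relu"),
              BatchNormalization(),
              Dropout(0.25)]
# D: 16 R -> after the flatten layer, one 16-node dense layer with ReLU activation
dense_block = [Flatten(),
               Dense(16, activation="relu")]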




In the above case we should choose model #26 because the training accuracy reached 90% at epoch 10 (93% at epoch 20), and the validation accuracy reached 51% at epoch 13 (50% at epoch 20). The goal here is for both the training accuracy and validation accuracy to be as high as possible, using the minimum number of filters and layers.
We also look for stability. For example, model #27 had a big validation drop at epoch 18; we want to avoid things like that. We want the validation accuracy to be stable, because if it fluctuates a lot it could unexpectedly drop when we run the model for 50 epochs.
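One way to check this stability is to plot the per-epoch training and validation accuracy from the fit history; a minimal sketch, assuming the history object returned by model.fit earlier:
import matplotlib.pyplot as plt

plt.plot(history.history["accuracy"], label="training accuracy")
plt.plot(history.history["val_accuracy"], label="validation accuracy")
plt.xlabel("epoch")
plt.ylabel("accuracy")
plt.legend()
plt.show()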
In terms of resources, what matters is actually not the number of filters and layers, but the number of parameters. The "model summary" for model #26 looks like the one below. It displays the number of parameters in each layer.
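In Keras this table is printed by calling:
model.summary()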

As we can see above, the dense 16 layer (second line from the bottom) has 946,704 parameters. This is because all 16 nodes in the dense layer are connected to the previous layer, which is the flatten layer with 59,168 nodes. So the number of weights is 59,168 x 16 = 946,688, plus 16 biases (one for each node in the dense layer), giving 946,704.
Whereas for number #28 the “model summary” looks like this:

We can see that the first dense 8 layer (third line from the bottom) has 473,352 parameters. This is because it has 8 nodes and those nodes are connected to the previous layer, which is the flatten layer with 59,168 nodes. So the number of weights is 59,168 x 8 = 473,344, plus 8 biases (one for each node in the dense layer), giving 473,352.
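This arithmetic is easy to check; a dense layer's parameter count is (inputs x nodes) + nodes, using the flatten size from the summaries above:
flatten_size = 59168
for nodes in (16, 8):
    weights = flatten_size * nodes   # one weight per connection to the flatten layer
    params = weights + nodes         # plus one bias per node
    print(nodes, params)             # prints 16 946704 and 8 473352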
Comparing the total number of parameters, model #26 has 963k parameters whereas model #28 has 490k parameters, only about half of model #26. Because of this, model #28 is a strong contender to model #26. Yes, model #28 has more validation fluctuation, but it has half the number of parameters.

We can see that the validation accuracy is still low. In part 2 of this article (link) I’m going to address that issue.
As we can see above, configuring a CNN is more of an art than a science. But after a few projects we should get a feel for how each hyperparameter influences the result. Happy learning!