This is a diagram of a neural network:

Each node in the above diagram receives input from all the nodes in the previous layer. For example, node 1 in hidden layer 1 receives input from the 4 nodes in the input layer (x1, x2, x3 and x4), like this:

Each of the lines has a weight, so the four inputs do not all carry the same magnitude. In the above example, input from x1 has a weight of 0.1, whereas input from x2 has a weight of 0.2.

In addition to the weighted inputs, there is one special input going into the node. It is called the “bias”, notated as b. It is a constant (a number). So node n1 in hidden layer 1 receives 4 weighted inputs plus a bias. The total input, called z, is each input multiplied by its weight, all added together, plus the bias. In the above example it is 13.3.
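The calculation of z can be sketched in a few lines of Python. This is only an illustration: the function name `weighted_input` is made up here, and apart from the weights 0.1 and 0.2 every number below is a hypothetical value (the diagram's other weights, inputs and bias are not given in the text), so the result is not meant to reproduce 13.3.

```python
def weighted_input(inputs, weights, bias):
    """Return z = w1*x1 + w2*x2 + ... + wn*xn + b for a single node."""
    if len(inputs) != len(weights):
        raise ValueError("each input needs exactly one weight")
    return sum(w * x for w, x in zip(weights, inputs)) + bias


# Hypothetical values; only the weights 0.1 (for x1) and 0.2 (for x2)
# are taken from the text.
x = [1.0, 2.0, 3.0, 4.0]   # x1, x2, x3, x4
w = [0.1, 0.2, 0.3, 0.4]   # one weight per line going into the node
b = 0.5                    # the bias

z = weighted_input(x, w, b)
print(z)  # approximately 3.5
```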

The output of each node is a function of the input z. This function is called the “activation function”. One commonly used activation function is the “rectified linear unit”, abbreviated as “relu”. It is simply the maximum of the input and zero. In the above example z is positive, so the output of node n1 is 13.3. This output is called a, which stands for “activation”.
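As a minimal sketch, relu is a one-line function. The value 13.3 is the z from the example above; the negative input is an extra hypothetical value added to show the other branch.

```python
def relu(z):
    """Rectified linear unit: the maximum of the input and zero."""
    return max(0.0, z)


a = relu(13.3)           # z is positive, so it passes through unchanged
print(a)                 # 13.3
print(relu(-2.0))        # 0.0, because negative inputs are cut off at zero
```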

__Forward Propagation__

So, if we have the values of input layer (x1, x2, x3, x4), the values of the weight on each line between the input layer and hidden layer 1, and the values of the biases for all 3 nodes in hidden layer 1, we can calculate the output of all 3 nodes in hidden layer 1 (a1, a2, a3), like step 1 in this diagram:

Once we have calculated the outputs of hidden layer 1, we can use them to calculate the outputs of the next layer, i.e. hidden layer 2. This is marked as Step 2 in the above diagram. We can then calculate the outputs of hidden layer 3 (Step 3) and finally the output of the entire network (Step 4).
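The layer-by-layer process can be sketched as a loop in which the outputs of one layer become the inputs of the next. The function names `layer_forward` and `forward` are made up for this illustration, and the tiny network below (2 inputs, one hidden layer of 2 nodes, 1 output node) uses hypothetical numbers chosen for easy arithmetic; it is not the network in the diagram.

```python
def relu(z):
    return max(0.0, z)


def layer_forward(inputs, weights, biases, activation):
    """Compute the outputs of one layer.

    weights[j] holds the weights of the lines going into node j,
    and biases[j] is the bias of node j.
    """
    outputs = []
    for node_weights, bias in zip(weights, biases):
        z = sum(w * x for w, x in zip(node_weights, inputs)) + bias
        outputs.append(activation(z))
    return outputs


def forward(inputs, layers):
    """Feed the inputs through every layer, from left to right."""
    activations = inputs
    for weights, biases, activation in layers:
        activations = layer_forward(activations, weights, biases, activation)
    return activations


# Hypothetical network: 2 inputs -> 2 hidden nodes -> 1 output node.
layers = [
    ([[1.0, 0.5], [-1.0, 0.25]], [0.0, 0.0], relu),  # hidden layer
    ([[0.5, 2.0]], [1.0], relu),                     # output layer
]

# Hidden layer: z = [2.0, -0.5], so a = [2.0, 0.0].
# Output layer: z = 0.5 * 2.0 + 2.0 * 0.0 + 1.0 = 2.0.
print(forward([1.0, 2.0], layers))  # [2.0]
```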

__Sigmoid Function__

If the neural network is used to predict a binary condition (e.g. whether an image is a car or not), the activation function used on the output layer is usually not relu but the sigmoid function, which looks like this: (source: Wikipedia, link)

We can see that for most x values (inputs), the y value (output) is very close to either 0 or 1, which fits the binary condition. Only for x values near zero does the output fall noticeably in between.

So in Step 4, we put 51.877 × 0.7 + 41.099 × 0.3 + 6 = 54.6436 as the x (input) of the sigmoid function, and get an output (a) that is practically 1.
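To check the arithmetic, here is a small sketch of the sigmoid function, 1 / (1 + e^(-z)), applied to the numbers from Step 4. The two-branch form is a choice made here to avoid overflow for large negative inputs; it is not something the text prescribes.

```python
import math


def sigmoid(z):
    """Sigmoid function 1 / (1 + e^(-z)), written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)          # safe: z is negative, so e is between 0 and 1
    return e / (1.0 + e)


# The numbers from Step 4: two activations, their weights, and the bias.
z = 51.877 * 0.7 + 41.099 * 0.3 + 6
a = sigmoid(z)

print(round(z, 4))   # 54.6436
print(a)             # so close to 1 that it is indistinguishable from it
print(sigmoid(0.0))  # 0.5, the midpoint of the curve
```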

We then predict the output of the neural network (ŷ) as follows: if a is greater than 0.5, set ŷ to 1; otherwise set ŷ to 0.
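This thresholding rule is a one-liner in code. The function name `predict` and the `threshold` parameter are made up for this sketch; 0.5 is the cut-off described above.

```python
def predict(a, threshold=0.5):
    """Turn the sigmoid output a into a binary prediction (1 or 0)."""
    return 1 if a > threshold else 0


print(predict(1.0))   # 1
print(predict(0.2))   # 0
print(predict(0.5))   # 0, because a is not greater than 0.5
```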

__Calculate The Cost__

We do this forward propagation process for every data set (one set of input values with its known output) that we have. In machine learning such a data set is called an “example”. So for every example (input) we will have a predicted output (ŷ). We then compare these predicted outputs with the actual outputs; the measure of how far off a prediction is, is called the “loss”. The average of the losses over all examples is called the “cost”.

There are many loss functions (link). For a binary classification where the output is a probability between 0 and 1 like above, an appropriate loss function is “cross entropy”, which looks like this: (source: “ML Cheatsheet” from “Read the Docs”, link)

So if the output should be 1 and the predicted output is 1, there is no loss (loss = 0). If the predicted output is very wrong, e.g. a small number like 0.1, then it is penalised heavily. This “heavy penalty” comes from taking the log of the predicted probability, so the loss is not linear in the error. The formula is as follows (p = predicted output, y = actual output):

Cross Entropy Loss = -(y log(p) + (1-y) log(1-p))
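The formula can be tried out with a short sketch. It assumes the natural logarithm, and the small `eps` clipping is an addition made here so that log(0) never occurs; neither detail is stated in the text.

```python
import math


def cross_entropy_loss(y, p, eps=1e-12):
    """Loss for one example: -(y log(p) + (1-y) log(1-p)).

    y is the actual output (0 or 1), p is the predicted probability.
    p is clipped to [eps, 1 - eps] (an assumption of this sketch)
    to avoid taking log(0).
    """
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


# The actual output is 1 in both cases.
print(cross_entropy_loss(1, 0.9))   # about 0.105: a good prediction, small loss
print(cross_entropy_loss(1, 0.1))   # about 2.303: a bad prediction, heavy penalty
```

The bad prediction is about 22 times more costly than the good one, which is the non-linear penalty described above.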

This is derived from the probability of observing the actual output: it is p if y is 1, and 1-p if y is 0. Both cases can be written as a single expression: probability = p^{y} · (1-p)^{1-y}

Taking the log it becomes: y log(p) + (1-y) log(1-p).
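As a quick numerical check of this step, the sketch below compares the log of p^{y} · (1-p)^{1-y} with y log(p) + (1-y) log(1-p) for a few hypothetical values of p. The function names are made up for this illustration.

```python
import math


def likelihood(y, p):
    """p if y is 1, and 1-p if y is 0, written as one expression."""
    return (p ** y) * ((1 - p) ** (1 - y))


def log_likelihood(y, p):
    """The same quantity after taking the log."""
    return y * math.log(p) + (1 - y) * math.log(1 - p)


for y in (0, 1):
    for p in (0.1, 0.5, 0.9):
        left = math.log(likelihood(y, p))
        right = log_likelihood(y, p)
        # Both sides agree up to floating-point rounding.
        assert math.isclose(left, right)
        print(y, p, round(right, 4))
```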

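Taking the negative of it gives the formula above. We take the negative because p is between 0 and 1, so log(p) is zero or negative; negating it gives a loss that is zero or positive and that grows as the prediction gets worse. log(x) is like the left graph below, whereas minus log(x) is like the right graph below: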

Note: “cross entropy” is the average number of bits needed to identify an event, see Wikipedia: link

That is the loss for 1 example (data set). The cost is the average of the losses over all examples, which is the sum of the above divided by m, where m is the number of examples, like this:
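In other words, cost = (1/m) × (loss of example 1 + loss of example 2 + … + loss of example m). A minimal sketch follows; the function names, the `eps` clipping and the four examples are all hypothetical, and the natural logarithm is assumed.

```python
import math


def cross_entropy_loss(y, p, eps=1e-12):
    """Loss for one example; p is clipped to avoid log(0) (an assumption)."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def cost(actuals, predictions):
    """Average of the per-example losses over all m examples."""
    m = len(actuals)
    if m == 0 or m != len(predictions):
        raise ValueError("need one prediction per example, and at least one example")
    total = sum(cross_entropy_loss(y, p) for y, p in zip(actuals, predictions))
    return total / m


# Four hypothetical examples: actual outputs and predicted probabilities.
ys = [1, 0, 1, 0]
ps = [0.9, 0.1, 0.8, 0.3]
print(cost(ys, ps))  # roughly 0.198
```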

That average loss is the cost of the entire neural network for this weighting (the weight on every line, plus the biases). It is unlikely to be the best weighting; there are better weightings which result in lower costs. If we can find the best weighting, we will get the lowest cost, which means the smallest gap between the predictions and the actual outputs across all examples, i.e. the closest fit to the data we have. To find the best weighting we need to go backward from the output layer towards the input layer on the left. This is called “back propagation”.

__Back Propagation__

…

__Update The Parameters__

…

References:

- Michael A. Nielsen, “Neural Networks and Deep Learning” book: link
- Denny Britz, “Implementing a Neural Network from scratch in Python”: link
- Sunil Ray, “Understanding and Coding Neural Network from scratch in Python and R”: link
- Matt Mazur, “Step by Step Back Propagation Example”: link.