
Backpropagation

Modern neural networks often consist of multiple layers and a large number of parameters, sometimes reaching millions. The central challenge in training such networks is determining how to efficiently update these parameters so that the model’s predictions improve over time. This learning process is achieved through an algorithm known as backpropagation, which systematically computes gradients of the loss function with respect to each parameter in the network.


Build Intuition of Chain Rule of Calculus

Consider we have a composite function $y = g(f(x))$.

Chain Rule of Calculus (scalar case):


Consider,

$$z = f(x) = x^2$$

$$y = g(z) = e^z = e^{x^2}$$

We can calculate $\frac{dy}{dz}$ and $\frac{dz}{dx}$ easily.

$$\frac{dy}{dz} = e^z, \qquad \frac{dz}{dx} = 2x$$

But how would you calculate $\frac{dy}{dx}$? We have to apply the chain rule.

$$\frac{dy}{dx} = \frac{dy}{dz} \cdot \frac{dz}{dx} = e^z \cdot 2x = e^{x^2} \cdot 2x$$
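To make the scalar chain rule concrete, here is a small sketch (the function names are my own) that compares the analytic derivative $e^{x^2} \cdot 2x$ against a finite-difference estimate:

```python
import math

def f(x):          # inner function: z = x^2
    return x ** 2

def g(z):          # outer function: y = e^z
    return math.exp(z)

def dydx_chain(x):
    # chain rule: dy/dx = dy/dz * dz/dx = e^{x^2} * 2x
    return math.exp(x ** 2) * 2 * x

def dydx_numeric(x, h=1e-6):
    # central finite difference on the composition g(f(x))
    return (g(f(x + h)) - g(f(x - h))) / (2 * h)

x = 0.7
print(dydx_chain(x), dydx_numeric(x))  # the two values agree closely
```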

Chain Rule of Calculus (vector case):

[Figure: vector chain rule, with $y$ depending on $x$ through $z_1, \dots, z_n$]

Here each $z_i$ is a function of $x = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_m \end{bmatrix}$ and $y$ is a function of $z = \begin{bmatrix} z_1 \\ z_2 \\ \vdots \\ z_n \end{bmatrix}$. So how do we calculate $\frac{\partial y}{\partial x_i}$ for any $x_i$?


Now to understand it simply: earlier, in the scalar case, we had only one path from $x$ to $z$ to $y$ ($x \to z \to y$). In the vector case we have three paths from $x_i$ to $y$ (considering $n = 3$). So we have to apply the chain rule along all three paths and take the summation.

$$\frac{\partial y}{\partial x_i} = \frac{\partial y}{\partial z_1} \cdot \frac{\partial z_1}{\partial x_i} + \frac{\partial y}{\partial z_2} \cdot \frac{\partial z_2}{\partial x_i} + \frac{\partial y}{\partial z_3} \cdot \frac{\partial z_3}{\partial x_i}$$

Basically,

$$\frac{\partial y}{\partial x_i} = \sum_{j=1}^{n} \frac{\partial y}{\partial z_j} \cdot \frac{\partial z_j}{\partial x_i}$$
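The path-summation formula can be sanity-checked numerically. Below is a sketch (the functions $z_j$ and $y$ are made up for illustration) with three intermediate variables, comparing the summed chain-rule result against a finite difference:

```python
import math

# Three intermediate variables, all functions of a scalar x (my choice):
def z(x):
    return [x ** 2, math.sin(x), 3 * x]

def dz_dx(x):
    return [2 * x, math.cos(x), 3.0]

# y(z) = z1*z2 + z3, so the partials are dy/dz = [z2, z1, 1]
def y(zs):
    return zs[0] * zs[1] + zs[2]

def dy_dz(zs):
    return [zs[1], zs[0], 1.0]

def dy_dx(x):
    # multivariable chain rule: sum over all paths x -> z_j -> y
    zs = z(x)
    return sum(dy * dz for dy, dz in zip(dy_dz(zs), dz_dx(x)))

def dy_dx_numeric(x, h=1e-6):
    return (y(z(x + h)) - y(z(x - h))) / (2 * h)

print(dy_dx(0.5), dy_dx_numeric(0.5))  # the two values agree closely
```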

What Is Our Target Here?

We want to train the model: we have decided on the neural network structure, and now we need to find the optimized weight parameters. A weight parameter is named $w_{ij}^{(l)}$, which means the weight from node $i$ in the previous layer to node $j$ in the current layer $l$ (layers numbered starting from the first hidden layer).


How do we find the proper weights? We run the Gradient Descent algorithm, and the weight parameters are updated at each iteration.

$$w^{t+1} = w^t - \alpha \frac{\partial E}{\partial w^t}$$

$w^t$ contains all the weight parameters $w_{ij}^{(l)}$.

So at each iteration, our target is to find $\frac{\partial E}{\partial w_{ij}^{(l)}}$ for all the weight parameters.

What is $E$ here? $E$ is the error (or loss) found by comparing the actual output $y$ and the predicted output $\hat{y}$.
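As a toy illustration of the update rule (a minimal sketch; the loss, learning rate, and starting point are made up), here is gradient descent on the single-parameter loss $E(w) = (w - 3)^2$, whose gradient is $2(w - 3)$ and whose minimum is at $w = 3$:

```python
# Minimal gradient-descent loop on a toy loss E(w) = (w - 3)^2.
alpha = 0.1   # learning rate
w = 0.0       # initial weight

for _ in range(200):
    grad = 2 * (w - 3)     # dE/dw at the current w
    w = w - alpha * grad   # the update rule w^{t+1} = w^t - alpha * dE/dw

print(w)  # converges very close to 3
```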

Loss/Error Functions

There are various types of loss/error functions depending on the task we want to perform. Here are a few examples.

Mean Square Error (MSE)

Used mostly in linear regression problems.

$$E = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

Root Mean Square Error (RMSE)

Just the square root of the Mean Square Error.

$$E = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$$

Mean Absolute Error (MAE)

Mean of the absolute differences between actual and predicted outputs.

$$E = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|$$

Binary Cross-Entropy

Used in binary classification.

$$E = -y \log(\hat{y}) - (1-y) \log(1-\hat{y})$$

Categorical Cross-Entropy

Used in multi-class classification when outputs are represented in one-hot encoded format and there are $K$ classes.

$$E = -\sum_{k=1}^{K} y_k \log(\hat{y}_k)$$

Sparse Categorical Cross-Entropy

Used in multi-class classification when outputs are represented with class indices $(1, 2, \dots, K)$; here $y$ is the correct class index and $\hat{y}_y$ is the predicted probability of the correct class $y$.

$$E = -\log(\hat{y}_y)$$
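The losses above can be written in a few lines of NumPy. This is a minimal sketch (function names and the `eps` clipping guard are my own additions, to avoid $\log(0)$), not a production implementation:

```python
import numpy as np

def mse(y, y_hat):
    return np.mean((y - y_hat) ** 2)

def rmse(y, y_hat):
    return np.sqrt(mse(y, y_hat))

def mae(y, y_hat):
    return np.mean(np.abs(y - y_hat))

def binary_cross_entropy(y, y_hat, eps=1e-12):
    y_hat = np.clip(y_hat, eps, 1 - eps)  # guard against log(0)
    return np.mean(-y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat))

def categorical_cross_entropy(y_onehot, y_hat, eps=1e-12):
    # y_onehot: one-hot labels, y_hat: predicted class probabilities
    return -np.sum(y_onehot * np.log(y_hat + eps))

def sparse_categorical_cross_entropy(class_idx, y_hat, eps=1e-12):
    # class_idx: index of the correct class (0-based here)
    return -np.log(y_hat[class_idx] + eps)
```

Note that categorical and sparse categorical cross-entropy compute the same quantity; they only differ in how the label is represented.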

Backward Propagation of Errors

Now we will see how we calculate the gradient of the error (or loss) with respect to all the weights. We start from the output layer and move back towards the input layer.


We are considering binary cross-entropy loss for our discussion here.

$$E = -y \log(\hat{y}) - (1-y) \log(1-\hat{y})$$

Output Layer Gradients

$$\frac{\partial E}{\partial \hat{y}} = -\frac{y}{\hat{y}} - \frac{1-y}{1-\hat{y}} \cdot (-1) = -\frac{y}{\hat{y}} + \frac{1-y}{1-\hat{y}}$$

As $a_1^{(2)} = \hat{y}$,

$$\frac{\partial E}{\partial a_1^{(2)}} = \frac{\partial E}{\partial \hat{y}}$$

Now assume $a_1^{(2)}$ uses the sigmoid activation function, i.e. $a_1^{(2)} = \sigma(z_1^{(2)})$.

We already know how to calculate the derivative of the sigmoid: $\sigma'(a) = \sigma(a)(1 - \sigma(a))$.

$$\frac{\partial a_1^{(2)}}{\partial z_1^{(2)}} = \sigma(z_1^{(2)})\left(1 - \sigma(z_1^{(2)})\right)$$

Now how do we calculate $\frac{\partial E}{\partial z_1^{(2)}}$? We have to apply the chain rule.

$$\frac{\partial E}{\partial z_1^{(2)}} = \frac{\partial E}{\partial a_1^{(2)}} \cdot \frac{\partial a_1^{(2)}}{\partial z_1^{(2)}} = \frac{\partial E}{\partial \hat{y}} \cdot \frac{\partial a_1^{(2)}}{\partial z_1^{(2)}}$$
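A useful aside: for a sigmoid output with binary cross-entropy, multiplying the two factors above simplifies algebraically to $\hat{y} - y$. A quick numeric check of that simplification (a sketch; the function names are mine):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dE_dz(z, y):
    # chain rule: dE/dz = (dE/dy_hat) * (dy_hat/dz)
    y_hat = sigmoid(z)
    dE_dyhat = -y / y_hat + (1 - y) / (1 - y_hat)  # BCE derivative
    dyhat_dz = y_hat * (1 - y_hat)                 # sigmoid derivative
    return dE_dyhat * dyhat_dz

z, y = 0.8, 1.0
print(dE_dz(z, y), sigmoid(z) - y)  # the two values agree
```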

Second Layer Gradients

[Figure: second layer, with $a_1^{(1)}, a_2^{(1)}, a_3^{(1)}$ feeding $z_1^{(2)}$]

Here $z_1^{(2)}$ is the linear weighted combination of the previous layer's neurons. Also, biases (like $w_{01}^{(2)}$) are not shown in the diagram, so just assume they are there :)

$$z_1^{(2)} = w_{01}^{(2)} + w_{11}^{(2)} a_1^{(1)} + w_{21}^{(2)} a_2^{(1)} + w_{31}^{(2)} a_3^{(1)}$$

Hence,

$$\frac{\partial z_1^{(2)}}{\partial w_{11}^{(2)}} = a_1^{(1)}, \qquad \frac{\partial z_1^{(2)}}{\partial w_{21}^{(2)}} = a_2^{(1)}, \qquad \frac{\partial z_1^{(2)}}{\partial w_{31}^{(2)}} = a_3^{(1)}$$

So the gradients with respect to the second-layer weights can be calculated using the chain rule.

$$\frac{\partial E}{\partial w_{11}^{(2)}} = \frac{\partial E}{\partial z_1^{(2)}} \cdot \frac{\partial z_1^{(2)}}{\partial w_{11}^{(2)}}, \qquad \frac{\partial E}{\partial w_{21}^{(2)}} = \frac{\partial E}{\partial z_1^{(2)}} \cdot \frac{\partial z_1^{(2)}}{\partial w_{21}^{(2)}}, \qquad \frac{\partial E}{\partial w_{31}^{(2)}} = \frac{\partial E}{\partial z_1^{(2)}} \cdot \frac{\partial z_1^{(2)}}{\partial w_{31}^{(2)}}$$

First Layer Gradients

$$z_1^{(2)} = w_{01}^{(2)} + w_{11}^{(2)} a_1^{(1)} + w_{21}^{(2)} a_2^{(1)} + w_{31}^{(2)} a_3^{(1)}$$

What is $\frac{\partial z_1^{(2)}}{\partial a_1^{(1)}}$?

$$\frac{\partial z_1^{(2)}}{\partial a_1^{(1)}} = w_{11}^{(2)}$$

What is $\frac{\partial a_1^{(1)}}{\partial z_1^{(1)}}$? It depends on the activation function used in $a_1^{(1)}$.

Mostly for hidden layers we use ReLU as the activation function ($f(x) = \max(0, x)$). The derivative of ReLU is $1$ if $x > 0$, $0$ if $x < 0$, and it is undefined at $0$ (in practice it is usually taken to be $0$ there).

$z_1^{(1)}$ is a weighted linear combination of the inputs using the first-layer weights.

$$z_1^{(1)} = w_{01}^{(1)} + w_{11}^{(1)} x_1 + w_{21}^{(1)} x_2$$

By using the chain rule we will calculate $\frac{\partial E}{\partial w_{11}^{(1)}}$; the other gradients with respect to the first-layer weights follow in the same way.

$$\frac{\partial E}{\partial w_{11}^{(1)}} = \frac{\partial E}{\partial z_1^{(2)}} \cdot \frac{\partial z_1^{(2)}}{\partial a_1^{(1)}} \cdot \frac{\partial a_1^{(1)}}{\partial z_1^{(1)}} \cdot \frac{\partial z_1^{(1)}}{\partial w_{11}^{(1)}}$$
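The whole derivation can be verified numerically on a tiny network. Below is a sketch in NumPy (the shapes, random seed, and variable names are my own choices) of a 2-3-1 network with a ReLU hidden layer, sigmoid output, and binary cross-entropy, comparing the chain-rule gradient for $w_{11}^{(1)}$ against a finite difference:

```python
import numpy as np

# A tiny 2-3-1 network like the one in the text (illustrative values).
rng = np.random.default_rng(0)
W1 = rng.normal(size=(2, 3))   # w_ij^(1): input i -> hidden j
b1 = rng.normal(size=3)
W2 = rng.normal(size=(3, 1))   # w_ij^(2): hidden i -> output
b2 = rng.normal(size=1)
x = np.array([0.5, -1.2])
y = 1.0

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(W1):
    z1 = b1 + x @ W1               # z^(1)
    a1 = np.maximum(0.0, z1)       # ReLU
    z2 = b2 + a1 @ W2              # z^(2)
    return z1, a1, sigmoid(z2)[0]  # y_hat

def bce(y_hat):
    return -y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat)

# Backward pass via the chain rule.
z1, a1, y_hat = forward(W1)
dE_dz2 = y_hat - y            # sigmoid + BCE: (dE/dy_hat)(dy_hat/dz) = y_hat - y
dE_da1 = W2[:, 0] * dE_dz2    # gradient flowing back through the weights
dE_dz1 = dE_da1 * (z1 > 0)    # ReLU derivative: 1 where z^(1) > 0
dE_dW1 = np.outer(x, dE_dz1)  # dE/dw_ij^(1) = x_i * dE/dz_j^(1)

# Finite-difference check on one first-layer weight, w_11^(1).
h = 1e-6
W1p, W1m = W1.copy(), W1.copy()
W1p[0, 0] += h
W1m[0, 0] -= h
numeric = (bce(forward(W1p)[2]) - bce(forward(W1m)[2])) / (2 * h)
print(dE_dW1[0, 0], numeric)  # the two should agree closely
```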

Visualize Error Backpropagation

[Figure: network with two hidden layers]

Here we are considering a different network with two hidden layers, and we will see how the error gradients flow backward from the output layer to the input layer.

[Figure: $\frac{\partial E}{\partial z_1^{(3)}}$ flowing backward along both incoming paths]

Here $z_1^{(3)}$ is a combination of $a_1^{(2)}$ and $a_2^{(2)}$. So $\frac{\partial E}{\partial z_1^{(3)}}$ will flow along both paths, as shown in the image.

$\frac{\partial E}{\partial z_1^{(2)}}$ will flow backward along three paths, corresponding to $a_1^{(1)}$, $a_2^{(1)}$ and $a_3^{(1)}$. Similarly, $\frac{\partial E}{\partial z_2^{(2)}}$ will also flow backward along three paths, corresponding to $a_1^{(1)}$, $a_2^{(1)}$ and $a_3^{(1)}$.


How do you calculate $\frac{\partial E}{\partial a_1^{(1)}}$? $a_1^{(1)}$ influences both $z_1^{(2)}$ and $z_2^{(2)}$, so both $\frac{\partial E}{\partial z_1^{(2)}}$ and $\frac{\partial E}{\partial z_2^{(2)}}$ are backpropagating toward $a_1^{(1)}$.


So we have to apply the vector case of the chain rule of calculus.

$$\frac{\partial E}{\partial a_1^{(1)}} = \frac{\partial E}{\partial z_1^{(2)}} \cdot \frac{\partial z_1^{(2)}}{\partial a_1^{(1)}} + \frac{\partial E}{\partial z_2^{(2)}} \cdot \frac{\partial z_2^{(2)}}{\partial a_1^{(1)}}$$

$\frac{\partial E}{\partial z_1^{(2)}}$ and $\frac{\partial E}{\partial z_2^{(2)}}$ are already backpropagating toward $a_1^{(1)}$. We just need to calculate $\frac{\partial z_1^{(2)}}{\partial a_1^{(1)}}$ and $\frac{\partial z_2^{(2)}}{\partial a_1^{(1)}}$.

$$z_1^{(2)} = w_{11}^{(2)} a_1^{(1)} + w_{21}^{(2)} a_2^{(1)} + w_{31}^{(2)} a_3^{(1)}$$

$$z_2^{(2)} = w_{12}^{(2)} a_1^{(1)} + w_{22}^{(2)} a_2^{(1)} + w_{32}^{(2)} a_3^{(1)}$$

Hence,

$$\frac{\partial z_1^{(2)}}{\partial a_1^{(1)}} = w_{11}^{(2)}, \qquad \frac{\partial z_2^{(2)}}{\partial a_1^{(1)}} = w_{12}^{(2)}$$
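The two-path sum can be checked numerically with a tiny made-up example (all weights, activations, and the toy "error" $E = z_1^2 + z_2^2$ below are my own illustrative choices):

```python
# a1 feeds two units z1 and z2; the other activations are held fixed.
w11, w21, w31 = 0.4, -0.2, 0.7   # weights into z1
w12, w22, w32 = 0.1, 0.5, -0.3   # weights into z2
a2, a3 = 0.6, -0.9

def E(a1):
    z1 = w11 * a1 + w21 * a2 + w31 * a3
    z2 = w12 * a1 + w22 * a2 + w32 * a3
    return z1 ** 2 + z2 ** 2     # toy error depending on both paths

a1 = 0.5
z1 = w11 * a1 + w21 * a2 + w31 * a3
z2 = w12 * a1 + w22 * a2 + w32 * a3
# dE/dz1 = 2*z1, dE/dz2 = 2*z2; dz1/da1 = w11, dz2/da1 = w12
two_path = 2 * z1 * w11 + 2 * z2 * w12

h = 1e-6
numeric = (E(a1 + h) - E(a1 - h)) / (2 * h)
print(two_path, numeric)  # the two values agree
```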

How Do We Update The Biases?

We will see one small example of how we calculate the gradient with respect to the biases, showing just one bias, $w_{01}^{(3)}$, used while computing $z_1^{(3)}$.

We will try to calculate $\frac{\partial E}{\partial w_{01}^{(3)}}$.

$$z_1^{(3)} = w_{01}^{(3)} + w_{11}^{(3)} a_1^{(2)} + w_{21}^{(3)} a_2^{(2)}$$

Here,

$$\frac{\partial z_1^{(3)}}{\partial w_{01}^{(3)}} = 1$$

Using the chain rule,

$$\frac{\partial E}{\partial w_{01}^{(3)}} = \frac{\partial E}{\partial z_1^{(3)}} \cdot \frac{\partial z_1^{(3)}}{\partial w_{01}^{(3)}}$$

Hence,

$$\frac{\partial E}{\partial w_{01}^{(3)}} = \frac{\partial E}{\partial z_1^{(3)}}$$
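A quick numeric check that the bias gradient equals the gradient with respect to $z$ (a sketch with a sigmoid output and binary cross-entropy; all the values and names below are made up):

```python
import math

# For z = b + w1*a1 + w2*a2, dz/db = 1, so dE/db = dE/dz.
def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(b, w1, w2, a1, a2, y):
    y_hat = sigmoid(b + w1 * a1 + w2 * a2)
    return -y * math.log(y_hat) - (1 - y) * math.log(1 - y_hat)

b, w1, w2, a1, a2, y = 0.1, 0.4, -0.3, 0.8, 0.2, 1.0
z = b + w1 * a1 + w2 * a2
dE_dz = sigmoid(z) - y   # dE/dz for sigmoid + BCE (from the earlier derivation)

h = 1e-6
dE_db = (loss(b + h, w1, w2, a1, a2, y) - loss(b - h, w1, w2, a1, a2, y)) / (2 * h)
print(dE_dz, dE_db)  # equal up to finite-difference error
```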
