
Forward Propagation

In previous chapters we learned about neural networks, their weights, and their activation functions. But how does a network actually make a prediction? Forward propagation is the answer: it is the journey of data from input to output, layer by layer.

Just as we decomposed XOR into hidden neurons in the motivation chapter, forward propagation is the mechanism that computes those intermediate values ($h_1$, $h_2$) and finally the output. It's the first step of every training iteration.

Think of it like water flowing through pipes:

  • Input enters at one end
  • Pipes (weights) control how much flows where
  • Junctions (activation functions) decide what gets passed forward
  • Output emerges at the other end

The Single Neuron Computation

Before understanding an entire network, let's start with one neuron. A single artificial neuron performs two simple operations:

Step 1: Linear Combination

First, the neuron computes a weighted sum of its inputs and adds a bias:

$$z = w_1 x_1 + w_2 x_2 + \ldots + w_n x_n + b = \sum_{i=1}^{n} w_i x_i + b$$

Where:

  • $x_i$ are the input values
  • $w_i$ are the weights (learned parameters)
  • $b$ is the bias (also learned)

Step 2: Activation Function

Then, it applies an activation function to introduce non-linearity:

$$a = \sigma(z) \quad \text{or} \quad a = \text{ReLU}(z) \quad \text{or} \quad a = \tanh(z)$$

The activation function we chooseβ€”from the previous chapterβ€”determines how the neuron "fires."

For a single neuron, forward propagation is simply:

$$\text{predicted output} \; \hat{y} = a = \text{activation}\left(\sum_{i} w_i x_i + b\right)$$
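As a sketch, this single-neuron forward pass fits in a few lines of Python. The weights, bias, and the choice of sigmoid here are illustrative, not learned values:

```python
import math

def neuron_forward(x, w, b):
    """Forward pass of a single neuron: weighted sum plus bias, then sigmoid.
    (Sigmoid is used for illustration; any activation from the previous
    chapter could be substituted.)"""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b  # linear combination
    return 1.0 / (1.0 + math.exp(-z))             # activation (sigmoid)

# Example with hand-picked (illustrative) weights:
y_hat = neuron_forward(x=[1.0, 0.0], w=[20.0, 20.0], b=-10.0)
print(round(y_hat, 5))  # z = 10, sigma(10) is approximately 0.99995
```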

Layer-by-Layer Propagation

Now let's scale this to multiple neurons organized in layers. A neural network has three types of layers:


Input Layer

The input layer receives raw features and passes them forward without computation:

$$\mathbf{a}^{(0)} = \mathbf{x}$$

Where $\mathbf{x}$ is the vector of input features (e.g., $[x_1, x_2]$ for our XOR example).

Hidden Layers

Each hidden layer transforms the representation from the previous layer. For layer $l$:

Linear step:

$$\mathbf{z}^{(l)} = \mathbf{W}^{(l)} \mathbf{a}^{(l-1)} + \mathbf{b}^{(l)}$$

Activation step:

$$\mathbf{a}^{(l)} = g^{(l)}(\mathbf{z}^{(l)})$$

Where:

  • $\mathbf{W}^{(l)}$ is the weight matrix connecting layer $l-1$ to layer $l$
  • $\mathbf{b}^{(l)}$ is the bias vector for layer $l$
  • $g^{(l)}$ is the activation function (ReLU, tanh, sigmoid, etc.)
  • $\mathbf{a}^{(l)}$ is the activation output of layer $l$

Output Layer

The final layer produces the network's prediction:

$$\hat{y} = \mathbf{a}^{(L)}$$

Where $L$ is the total number of layers. The activation function here depends on the task:

  • Binary classification: Sigmoid (outputs probability between 0 and 1)
  • Multi-class classification: Softmax (outputs probability distribution)
  • Regression: Linear (no activation, or ReLU for non-negative outputs)
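The three layer types above reduce to one loop over layers. Here is a minimal sketch; the `params` structure and the 2-2-1 layer sizes are illustrative assumptions, not a fixed API:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, params):
    """Run forward propagation through all layers.
    params is a list of (W, b, activation) tuples, one per layer."""
    a = x                  # a^(0) = x: the input layer passes features through
    for W, b, g in params:
        z = W @ a + b      # linear step:     z^(l) = W^(l) a^(l-1) + b^(l)
        a = g(z)           # activation step: a^(l) = g^(l)(z^(l))
    return a               # a^(L) = y_hat

# Tiny 2-2-1 network, sigmoid everywhere (hand-picked illustrative weights):
params = [
    (np.array([[20.0, 20.0], [20.0, 20.0]]), np.array([-10.0, -30.0]), sigmoid),
    (np.array([[20.0, -20.0]]), np.array([-10.0]), sigmoid),
]
print(forward(np.array([1.0, 0.0]), params))  # close to [1.0]
```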

Worked Example: Solving XOR with Forward Propagation

Let's walk through the XOR network we built in the motivation chapter. This network uses a hidden layer with two neurons (one approximating OR, one approximating AND) and an output layer that combines them.

Network Architecture

  • Input layer: 2 neurons ($x_1$, $x_2$)
  • Hidden layer: 2 neurons with sigmoid activation
  • Output layer: 1 neuron with sigmoid activation

The Learned Weights

From our earlier construction:


Hidden layer:

  • Neuron 1 (OR behavior): $w_1 = [20, 20]$, $b_1 = -10$
  • Neuron 2 (AND behavior): $w_2 = [20, 20]$, $b_2 = -30$

Output layer:

  • $w = [20, -20]$, $b = -10$

Forward Pass for Input $(1, 0)$

Let's compute step by step:

Hidden layer - Neuron 1 (OR):

$$z_1^{[1]} = 20 \times 1 + 20 \times 0 - 10 = 10$$

$$h_1 = \sigma(10) = \frac{1}{1 + e^{-10}} \approx 0.99995 \approx 1$$

Hidden layer - Neuron 2 (AND):

$$z_2^{[1]} = 20 \times 1 + 20 \times 0 - 30 = -10$$

$$h_2 = \sigma(-10) = \frac{1}{1 + e^{10}} \approx 0.000045 \approx 0$$

Output layer:

$$z^{[2]} = 20 \times 1 - 20 \times 0 - 10 = 10$$

$$\hat{y} = \sigma(10) \approx 1$$

Result: For input $(1, 0)$, the network predicts $1$. Since $1 \oplus 0 = 1$, this is correct.

Complete Truth Table

Let's verify all four input combinations:

| $x_1$ | $x_2$ | $h_1$ (OR) | $h_2$ (AND) | $\hat{y}$ (XOR) | Expected |
|-------|-------|------------|-------------|-----------------|----------|
| 0     | 0     | 0          | 0           | 0               | ✓        |
| 0     | 1     | 1          | 0           | 1               | ✓        |
| 1     | 0     | 1          | 0           | 1               | ✓        |
| 1     | 1     | 1          | 1           | 0               | ✓        |

The network correctly computes XOR by:

  1. First layer: Learning intermediate concepts (OR and AND)
  2. Second layer: Combining them to produce the final output

This is the power of depthβ€”each layer builds on the previous one.
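The whole truth table can be checked with a short script that runs the forward pass for all four inputs, using the hand-picked weights from this chapter:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def xor_forward(x1, x2):
    """Forward pass through the hand-built 2-2-1 XOR network."""
    h1 = sigmoid(20 * x1 + 20 * x2 - 10)    # hidden neuron 1: OR-like
    h2 = sigmoid(20 * x1 + 20 * x2 - 30)    # hidden neuron 2: AND-like
    return sigmoid(20 * h1 - 20 * h2 - 10)  # output: OR but not AND

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, round(xor_forward(x1, x2)))  # matches x1 XOR x2
```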

Vectorization: Processing Multiple Samples

Processing one sample at a time is inefficient. Modern hardware (GPUs) excels at performing the same operation on many samples simultaneously. Vectorization allows us to process an entire mini-batch in one go.

Single Sample (Slow)

For one sample, layer $l$ computes:

$$\mathbf{z}^{(l)} = \mathbf{W}^{(l)} \mathbf{a}^{(l-1)} + \mathbf{b}^{(l)}$$

Here $\mathbf{a}^{(l-1)}$ is a column vector of shape $(n^{(l-1)}, 1)$.

Mini-Batch of $m$ Samples (Fast)

For $m$ samples, we stack them as columns:

$$\mathbf{Z}^{(l)} = \mathbf{W}^{(l)} \mathbf{A}^{(l-1)} + \mathbf{b}^{(l)}$$

Where:

  • $\mathbf{A}^{(l-1)}$ has shape $(n^{(l-1)}, m)$ — each column is one sample
  • $\mathbf{Z}^{(l)}$ has shape $(n^{(l)}, m)$ — each column is the linear output for one sample
  • $\mathbf{b}^{(l)}$ is broadcast across all columns

Why this matters:

  • A batch of 128 samples takes roughly the same time as 1 sample on a GPU
  • Matrix multiplication is highly optimized
  • Training becomes dramatically faster
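A minimal NumPy sketch of the shapes involved; the layer sizes (2 inputs, 3 hidden units) and the batch size of 4 are arbitrary choices for illustration:

```python
import numpy as np

def sigmoid(Z):
    return 1.0 / (1.0 + np.exp(-Z))

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 2))       # W^(l): shape (n^(l), n^(l-1))
b = rng.standard_normal((3, 1))       # b^(l): shape (n^(l), 1)
A_prev = rng.standard_normal((2, 4))  # A^(l-1): each column is one sample

Z = W @ A_prev + b  # (3, 2) @ (2, 4) -> (3, 4); b broadcast across columns
A = sigmoid(Z)      # activation applied elementwise, shape preserved

print(Z.shape, A.shape)  # (3, 4) (3, 4)
```

One matrix multiply handles the whole batch: each column of `Z` is exactly what the single-sample formula would have produced for that sample.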

Loss/Error Computation: How Wrong Are We?

After forward propagation, we have predictions $\hat{y}$ for all samples. Now we need to measure how wrong these predictions are. This is the role of the loss function.

The choice of loss function depends on the task:

Binary Classification (Sigmoid Output)

When the output is a probability between 0 and 1 (like our XOR network):

$$E(\hat{y}, y) = -[y \log \hat{y} + (1-y) \log(1-\hat{y})]$$

This is called Binary Cross-Entropy. It penalizes confident wrong predictions heavily:

  • If $y=1$ and $\hat{y}=0.1$, the loss is large ($-\log(0.1) \approx 2.3$)
  • If $y=1$ and $\hat{y}=0.9$, the loss is small ($-\log(0.9) \approx 0.105$)
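This asymmetry is easy to see numerically. A direct sketch of the formula above (for a single prediction):

```python
import math

def binary_cross_entropy(y_hat, y):
    """Binary cross-entropy for one prediction y_hat against label y (0 or 1)."""
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

print(round(binary_cross_entropy(0.1, 1), 3))  # confident and wrong: ~2.303
print(round(binary_cross_entropy(0.9, 1), 3))  # confident and right: ~0.105
```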

Multi-Class Classification (Softmax Output)

For $C$ classes with softmax output:

$$E = -\sum_{i=1}^{C} y_i \log \hat{y}_i$$

This is Categorical Cross-Entropy. Only the true class contributes to the loss.

Regression (Linear Output)

When predicting continuous values:

$$E = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

This is Mean Squared Error (MSE).
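A direct sketch of MSE over $n$ predictions (the sample values below are made up for illustration):

```python
def mse(y, y_hat):
    """Mean squared error: average of squared differences over n predictions."""
    n = len(y)
    return sum((yi - yhi) ** 2 for yi, yhi in zip(y, y_hat)) / n

# Illustrative targets and predictions:
print(mse([3.0, -0.5, 2.0], [2.5, 0.0, 2.0]))  # (0.25 + 0.25 + 0) / 3 ~ 0.167
```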
