
Forward Propagation

In previous chapters we learned about neural networks, their weights, and their activation functions. But how does a network actually make a prediction? Forward propagation is the answer: it is the journey of data from input to output, layer by layer.

Just as we decomposed XOR into hidden neurons in the motivation chapter, forward propagation is the mechanism that computes those intermediate values ($h_1$, $h_2$) and finally the output. It's the first step of every training iteration.

Think of it like water flowing through pipes:

  • Input enters at one end
  • Pipes (weights) control how much flows where
  • Junctions (activation functions) decide what gets passed forward
  • Output emerges at the other end

The Single Neuron Computation

Before understanding an entire network, let's start with one neuron. A single artificial neuron performs two simple operations:

Step 1: Linear Combination

First, the neuron computes a weighted sum of its inputs and adds a bias:

$$z = w_1 x_1 + w_2 x_2 + \ldots + w_n x_n + b = \sum_{i=1}^{n} w_i x_i + b$$

Where:

  • $x_i$ are the input values
  • $w_i$ are the weights (learned parameters)
  • $b$ is the bias (also learned)

Step 2: Activation Function

Then, it applies an activation function to introduce non-linearity:

$$a = \sigma(z) \quad \text{or} \quad a = \text{ReLU}(z) \quad \text{or} \quad a = \tanh(z)$$

The activation function we chooseβ€”from the previous chapterβ€”determines how the neuron "fires."

For a single neuron, forward propagation is simply:

$$\text{predicted output} \; \hat{y} = a = \text{activation}\left(\sum_{i} w_i x_i + b\right)$$
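As a sketch, this single-neuron forward pass fits in a few lines of Python. The weights, bias, and the choice of sigmoid here are illustrative, not learned values:

```python
import math

def neuron_forward(x, w, b):
    """Forward pass of a single neuron: weighted sum plus bias, then sigmoid.
    (Sigmoid is used for illustration; any activation from the previous
    chapter could be substituted.)"""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b  # linear combination
    return 1.0 / (1.0 + math.exp(-z))             # activation (sigmoid)

# Example with hand-picked (illustrative) weights:
y_hat = neuron_forward(x=[1.0, 0.0], w=[20.0, 20.0], b=-10.0)
print(round(y_hat, 5))  # z = 10, sigma(10) is approximately 0.99995
```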

Layer-by-Layer Propagation

Now let's scale this to multiple neurons organized in layers. A neural network has three types of layers:


Input Layer

The input layer receives raw features and passes them forward without computation:

$$\mathbf{a}^{(0)} = \mathbf{x}$$

Where $\mathbf{x}$ is the vector of input features (e.g., $[x_1, x_2]$ for our XOR example).

Hidden Layers

Each hidden layer transforms the representation from the previous layer. For layer $l$:

Linear step:

$$\mathbf{z}^{(l)} = \mathbf{W}^{(l)} \mathbf{a}^{(l-1)} + \mathbf{b}^{(l)}$$

Activation step:

$$\mathbf{a}^{(l)} = g^{(l)}(\mathbf{z}^{(l)})$$

Where:

  • $\mathbf{W}^{(l)}$ is the weight matrix connecting layer $l-1$ to layer $l$
  • $\mathbf{b}^{(l)}$ is the bias vector for layer $l$
  • $g^{(l)}$ is the activation function (ReLU, tanh, sigmoid, etc.)
  • $\mathbf{a}^{(l)}$ is the activation output of layer $l$

Output Layer

The final layer produces the network's prediction:

$$\hat{y} = \mathbf{a}^{(L)}$$

Where $L$ is the total number of layers. The activation function here depends on the task:

  • Binary classification: Sigmoid (outputs probability between 0 and 1)
  • Multi-class classification: Softmax (outputs probability distribution)
  • Regression: Linear (no activation, or ReLU for non-negative outputs)
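The three layer types above reduce to one loop over layers. Here is a minimal sketch; the `params` structure and the 2-2-1 layer sizes are illustrative assumptions, not a fixed API:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, params):
    """Run forward propagation through all layers.
    params is a list of (W, b, activation) tuples, one per layer."""
    a = x                  # a^(0) = x: the input layer passes features through
    for W, b, g in params:
        z = W @ a + b      # linear step:     z^(l) = W^(l) a^(l-1) + b^(l)
        a = g(z)           # activation step: a^(l) = g^(l)(z^(l))
    return a               # a^(L) = y_hat

# Tiny 2-2-1 network, sigmoid everywhere (hand-picked illustrative weights):
params = [
    (np.array([[20.0, 20.0], [20.0, 20.0]]), np.array([-10.0, -30.0]), sigmoid),
    (np.array([[20.0, -20.0]]), np.array([-10.0]), sigmoid),
]
print(forward(np.array([1.0, 0.0]), params))  # close to [1.0]
```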

Worked Example: Solving XOR with Forward Propagation

Let's walk through the XOR network we built in the motivation chapter. This network uses a hidden layer with two neurons (one approximating OR, one approximating AND) and an output layer that combines them.

Network Architecture

  • Input layer: 2 neurons ($x_1$, $x_2$)
  • Hidden layer: 2 neurons with sigmoid activation
  • Output layer: 1 neuron with sigmoid activation

The Learned Weights

From our earlier construction:


Hidden layer:

  • Neuron 1 (OR behavior): $w_1 = [20, 20]$, $b_1 = -10$
  • Neuron 2 (AND behavior): $w_2 = [20, 20]$, $b_2 = -30$

Output layer:

  • $w = [20, -20]$, $b = -10$

Forward Pass for Input $(1, 0)$

Let's compute step by step:

Hidden layer - Neuron 1 (OR):

$$z_1^{[1]} = 20 \times 1 + 20 \times 0 - 10 = 10$$

$$h_1 = \sigma(10) = \frac{1}{1 + e^{-10}} \approx 0.99995 \approx 1$$

Hidden layer - Neuron 2 (AND):

$$z_2^{[1]} = 20 \times 1 + 20 \times 0 - 30 = -10$$

$$h_2 = \sigma(-10) = \frac{1}{1 + e^{10}} \approx 0.000045 \approx 0$$

Output layer:

$$z^{[2]} = 20 \times 1 - 20 \times 0 - 10 = 10$$

$$\hat{y} = \sigma(10) \approx 1$$

Result: For input $(1, 0)$, the network predicts $1$. Since $1 \oplus 0 = 1$, this is correct.

Complete Truth Table

Let's verify all four input combinations:

| $x_1$ | $x_2$ | $h_1$ (OR) | $h_2$ (AND) | $\hat{y}$ (XOR) | Expected |
|-------|-------|------------|-------------|-----------------|----------|
| 0     | 0     | 0          | 0           | 0               | ✓        |
| 0     | 1     | 1          | 0           | 1               | ✓        |
| 1     | 0     | 1          | 0           | 1               | ✓        |
| 1     | 1     | 1          | 1           | 0               | ✓        |

The network correctly computes XOR by:

  1. First layer: Learning intermediate concepts (OR and AND)
  2. Second layer: Combining them to produce the final output

This is the power of depthβ€”each layer builds on the previous one.
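The whole truth table can be checked with a short script that runs the forward pass for all four inputs, using the hand-picked weights from this chapter:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def xor_forward(x1, x2):
    """Forward pass through the hand-built 2-2-1 XOR network."""
    h1 = sigmoid(20 * x1 + 20 * x2 - 10)    # hidden neuron 1: OR-like
    h2 = sigmoid(20 * x1 + 20 * x2 - 30)    # hidden neuron 2: AND-like
    return sigmoid(20 * h1 - 20 * h2 - 10)  # output: OR but not AND

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, round(xor_forward(x1, x2)))  # matches x1 XOR x2
```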

Vectorization: Processing Multiple Samples

Processing one sample at a time is inefficient. Modern hardware (GPUs) excels at performing the same operation on many samples simultaneously. Vectorization allows us to process an entire mini-batch in one go.

Single Sample (Slow)

For one sample, layer $l$ computes:

$$\mathbf{z}^{(l)} = \mathbf{W}^{(l)} \mathbf{a}^{(l-1)} + \mathbf{b}^{(l)}$$

Here $\mathbf{a}^{(l-1)}$ is a column vector of shape $(n^{(l-1)}, 1)$.

Mini-Batch of $m$ Samples (Fast)

For $m$ samples, we stack them as columns:

$$\mathbf{Z}^{(l)} = \mathbf{W}^{(l)} \mathbf{A}^{(l-1)} + \mathbf{b}^{(l)}$$

Where:

  • $\mathbf{A}^{(l-1)}$ has shape $(n^{(l-1)}, m)$ — each column is one sample
  • $\mathbf{Z}^{(l)}$ has shape $(n^{(l)}, m)$ — each column is the linear output for one sample
  • $\mathbf{b}^{(l)}$ is broadcast across all columns

Why this matters:

  • A batch of 128 samples takes roughly the same time as 1 sample on a GPU
  • Matrix multiplication is highly optimized
  • Training becomes dramatically faster
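A minimal NumPy sketch of the shapes involved; the layer sizes (2 inputs, 3 hidden units) and the batch size of 4 are arbitrary choices for illustration:

```python
import numpy as np

def sigmoid(Z):
    return 1.0 / (1.0 + np.exp(-Z))

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 2))       # W^(l): shape (n^(l), n^(l-1))
b = rng.standard_normal((3, 1))       # b^(l): shape (n^(l), 1)
A_prev = rng.standard_normal((2, 4))  # A^(l-1): each column is one sample

Z = W @ A_prev + b  # (3, 2) @ (2, 4) -> (3, 4); b broadcast across columns
A = sigmoid(Z)      # activation applied elementwise, shape preserved

print(Z.shape, A.shape)  # (3, 4) (3, 4)
```

One matrix multiply handles the whole batch: each column of `Z` is exactly what the single-sample formula would have produced for that sample.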

Loss/Error Computation: How Wrong Are We?

After forward propagation, we have predictions $\hat{y}$ for all samples. Now we need to measure how wrong these predictions are. This is the role of the loss function.

The choice of loss function depends on the task:

Binary Classification (Sigmoid Output)

When the output is a probability between 0 and 1 (like our XOR network):

$$E(\hat{y}, y) = -[y \log \hat{y} + (1-y) \log(1-\hat{y})]$$

This is called Binary Cross-Entropy. It penalizes confident wrong predictions heavily:

  • If $y=1$ and $\hat{y}=0.1$, the loss is large ($-\log(0.1) \approx 2.3$)
  • If $y=1$ and $\hat{y}=0.9$, the loss is small ($-\log(0.9) \approx 0.105$)
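This asymmetry is easy to see numerically. A direct sketch of the formula above (for a single prediction):

```python
import math

def binary_cross_entropy(y_hat, y):
    """Binary cross-entropy for one prediction y_hat against label y (0 or 1)."""
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

print(round(binary_cross_entropy(0.1, 1), 3))  # confident and wrong: ~2.303
print(round(binary_cross_entropy(0.9, 1), 3))  # confident and right: ~0.105
```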

Multi-Class Classification (Softmax Output)

For $C$ classes with softmax output:

$$E = -\sum_{i=1}^{C} y_i \log \hat{y}_i$$

This is Categorical Cross-Entropy. Only the true class contributes to the loss.

Regression (Linear Output)

When predicting continuous values:

$$E = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

This is Mean Squared Error (MSE).
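A direct sketch of MSE over $n$ predictions (the sample values below are made up for illustration):

```python
def mse(y, y_hat):
    """Mean squared error: average of squared differences over n predictions."""
    n = len(y)
    return sum((yi - yhi) ** 2 for yi, yhi in zip(y, y_hat)) / n

# Illustrative targets and predictions:
print(mse([3.0, -0.5, 2.0], [2.5, 0.0, 2.0]))  # (0.25 + 0.25 + 0) / 3 ~ 0.167
```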
