Forward Propagation
We have already learned about neural networks with weights and activation functions. But how does a network actually make a prediction? Forward propagation is the answer: it is the journey of data from input to output, layer by layer.
Just as we decomposed XOR into hidden neurons in the motivation chapter, forward propagation is the mechanism that computes those intermediate values ($h_1$, $h_2$) and finally the output. It's the first step of every training iteration.
Think of it like water flowing through pipes:
- Input enters at one end
- Pipes (weights) control how much flows where
- Junctions (activation functions) decide what gets passed forward
- Output emerges at the other end
The Single Neuron Computation
Before understanding an entire network, let's start with one neuron. A single artificial neuron performs two simple operations:
Step 1: Linear Combination
First, the neuron computes a weighted sum of its inputs and adds a bias:

$$z = w_1 x_1 + w_2 x_2 + \dots + w_n x_n + b$$

Where:
- $x_1, \dots, x_n$ are the input values
- $w_1, \dots, w_n$ are the weights (learned parameters)
- $b$ is the bias (also learned)
Step 2: Activation Function
Then, it applies an activation function $f$ to introduce non-linearity:

$$a = f(z)$$
The activation function we choose (from the previous chapter) determines how the neuron "fires."
For a single neuron, forward propagation is simply:

$$a = f(w \cdot x + b)$$
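The two steps can be sketched in a few lines of NumPy (the input, weight, and bias values below are purely illustrative):

```python
import numpy as np

def sigmoid(z):
    # Squash any real number into the interval (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def neuron_forward(x, w, b):
    # Step 1: linear combination z = w . x + b
    z = np.dot(w, x) + b
    # Step 2: non-linear activation a = f(z)
    return sigmoid(z)

x = np.array([1.0, 0.0])   # example inputs
w = np.array([0.5, -0.3])  # example weights
b = 0.1                    # example bias
print(neuron_forward(x, w, b))  # sigmoid(0.6) ≈ 0.6457
```

Swapping `sigmoid` for ReLU or tanh changes only Step 2; the weighted sum is the same for every neuron type.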
Layer-by-Layer Propagation
Now let's scale this to multiple neurons organized in layers. A neural network has three types of layers:
Input Layer
The input layer receives raw features and passes them forward without computation:

$$a^{[0]} = x$$

Where $x$ is the vector of input features (e.g., $x = (x_1, x_2)$ for our XOR example).
Hidden Layers
Each hidden layer transforms the representation from the previous layer. For layer $l$:

Linear step:

$$z^{[l]} = W^{[l]} a^{[l-1]} + b^{[l]}$$

Activation step:

$$a^{[l]} = f^{[l]}(z^{[l]})$$

Where:
- $W^{[l]}$ is the weight matrix connecting layer $l-1$ to layer $l$
- $b^{[l]}$ is the bias vector for layer $l$
- $f^{[l]}$ is the activation function (ReLU, tanh, sigmoid, etc.)
- $a^{[l]}$ is the activation output of layer $l$
Output Layer
The final layer produces the network's prediction:

$$\hat{y} = a^{[L]} = f^{[L]}\left(W^{[L]} a^{[L-1]} + b^{[L]}\right)$$

Where $L$ is the total number of layers. The activation function here depends on the task:
- Binary classification: Sigmoid (outputs probability between 0 and 1)
- Multi-class classification: Softmax (outputs probability distribution)
- Regression: Linear (no activation, or ReLU for non-negative outputs)
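The full layer-by-layer pass can be sketched as a simple loop over `(W, b)` pairs. The network shape and weights here are arbitrary illustrative values, and sigmoid is used for every layer for brevity:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, params):
    # params: list of (W, b) pairs, one per layer.
    # a^[0] is the input itself; each layer computes
    # z^[l] = W^[l] a^[l-1] + b^[l], then a^[l] = f(z^[l]).
    a = x
    for W, b in params:
        z = W @ a + b
        a = sigmoid(z)
    return a

# Tiny 2-3-1 network with random illustrative weights
rng = np.random.default_rng(0)
params = [(rng.normal(size=(3, 2)), rng.normal(size=3)),
          (rng.normal(size=(1, 3)), rng.normal(size=1))]
y_hat = forward(np.array([1.0, 0.0]), params)
print(y_hat)  # a single sigmoid output in (0, 1)
```

In a real implementation the output layer would use the task-appropriate activation from the list above rather than sigmoid everywhere.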
Worked Example: Solving XOR with Forward Propagation
Let's walk through the XOR network we built in the motivation chapter. This network uses a hidden layer with two neurons (one approximating OR, one approximating AND) and an output layer that combines them.
Network Architecture
- Input layer: 2 neurons ($x_1$, $x_2$)
- Hidden layer: 2 neurons with sigmoid activation
- Output layer: 1 neuron with sigmoid activation
The Learned Weights
From our earlier construction (one standard choice of large sigmoid weights that realizes these behaviors):

Hidden layer:
- Neuron 1 (OR behavior): $w^{[1]}_1 = (20, 20)$, $b^{[1]}_1 = -10$
- Neuron 2 (AND behavior): $w^{[1]}_2 = (20, 20)$, $b^{[1]}_2 = -30$

Output layer:
- $w^{[2]} = (20, -20)$, $b^{[2]} = -10$
Forward Pass for Input $(1, 0)$

Let's compute $x_1 = 1$, $x_2 = 0$ step by step:

Hidden layer - Neuron 1 (OR):

$$z^{[1]}_1 = 20(1) + 20(0) - 10 = 10, \qquad h_1 = \sigma(10) \approx 1.00$$

Hidden layer - Neuron 2 (AND):

$$z^{[1]}_2 = 20(1) + 20(0) - 30 = -10, \qquad h_2 = \sigma(-10) \approx 0.00$$

Output layer:

$$z^{[2]} = 20(1.00) - 20(0.00) - 10 \approx 10, \qquad \hat{y} = \sigma(z^{[2]}) \approx 1.00$$

Result: For input $(1, 0)$, the network predicts $\hat{y} \approx 1$. Since $1 \oplus 0 = 1$, this is correct.
Complete Truth Table
Let's verify all four input combinations:
| $x_1$ | $x_2$ | $h_1$ (OR) | $h_2$ (AND) | $\hat{y}$ (XOR) | Expected | |
|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | ✓ |
| 0 | 1 | 1 | 0 | 1 | 1 | ✓ |
| 1 | 0 | 1 | 0 | 1 | 1 | ✓ |
| 1 | 1 | 1 | 1 | 0 | 0 | ✓ |
The network correctly computes XOR by:
- First layer: Learning intermediate concepts (OR and AND)
- Second layer: Combining them to produce the final output
This is the power of depth: each layer builds on the previous one.
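The full truth table can be checked with a short script. The weight values here are the assumed standard construction (large sigmoid weights approximating OR, AND, and "OR but not AND"), not values the network was actually trained to:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hand-set weights (assumed construction values):
W1 = np.array([[20.0, 20.0],    # hidden neuron 1: OR
               [20.0, 20.0]])   # hidden neuron 2: AND
b1 = np.array([-10.0, -30.0])
W2 = np.array([[20.0, -20.0]])  # output: OR AND NOT(AND)
b2 = np.array([-10.0])

for x1 in (0, 1):
    for x2 in (0, 1):
        h = sigmoid(W1 @ np.array([x1, x2]) + b1)   # hidden activations
        y = float(sigmoid(W2 @ h + b2)[0])          # network prediction
        print(x1, x2, round(y), x1 ^ x2)            # prediction vs true XOR
```

Rounding the sigmoid output to the nearest integer reproduces the truth table above for all four inputs.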
Vectorization: Processing Multiple Samples
Processing one sample at a time is inefficient. Modern hardware (GPUs) excels at performing the same operation on many samples simultaneously. Vectorization allows us to process an entire mini-batch in one go.
Single Sample (Slow)
For one sample, layer $l$ computes:

$$z^{[l]} = W^{[l]} a^{[l-1]} + b^{[l]}$$

Here $a^{[l-1]}$ is a column vector of shape $(n^{[l-1]}, 1)$.
Mini-Batch of $m$ Samples (Fast)

For $m$ samples, we stack them as columns:

$$Z^{[l]} = W^{[l]} A^{[l-1]} + b^{[l]}$$

Where:
- $A^{[l-1]}$ has shape $(n^{[l-1]}, m)$; each column is one sample
- $Z^{[l]}$ has shape $(n^{[l]}, m)$; each column is the linear output for one sample
- $b^{[l]}$ is broadcast across all $m$ columns
Why this matters:
- A batch of 128 samples takes roughly the same time as 1 sample on a GPU
- Matrix multiplication is highly optimized
- Training becomes dramatically faster
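A minimal sketch of the batched linear step, using arbitrary layer sizes, shows that one matrix multiply replaces a per-sample loop and that NumPy broadcasting handles the bias:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
W = rng.normal(size=(4, 3))    # layer weights: 3 inputs -> 4 units
b = rng.normal(size=(4, 1))    # bias as a column for broadcasting
X = rng.normal(size=(3, 128))  # 128 samples stacked as columns

# One matrix multiply handles the whole batch;
# b (shape (4, 1)) broadcasts across the 128 columns.
Z = W @ X + b
A = sigmoid(Z)
print(A.shape)  # (4, 128)

# Same result as the per-sample loop, computed column by column:
cols = [sigmoid(W @ X[:, [i]] + b) for i in range(128)]
assert np.allclose(A, np.hstack(cols))
```

On a GPU (e.g., via a framework like PyTorch or JAX) the single matrix multiply is where the speedup actually comes from; NumPy on a CPU illustrates the equivalence, not the performance gap.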
Loss/Error Computation: How Wrong Are We?
After forward propagation, we have predictions for all samples. Now we need to measure how wrong these predictions are. This is the role of the loss function.
The choice of loss function depends on the task:
Binary Classification (Sigmoid Output)
When the output is a probability between 0 and 1 (like our XOR network):

$$\mathcal{L} = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log \hat{y}^{(i)} + \left(1 - y^{(i)}\right) \log\left(1 - \hat{y}^{(i)}\right) \right]$$

This is called Binary Cross-Entropy. It penalizes confident wrong predictions heavily:
- If $y = 1$ and $\hat{y} \approx 0$, the loss is large ($-\log \hat{y} \to \infty$)
- If $y = 1$ and $\hat{y} \approx 1$, the loss is small ($-\log \hat{y} \to 0$)
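A direct translation of the formula makes the asymmetry concrete; the clipping constant is a common numerical safeguard, not part of the mathematical definition:

```python
import numpy as np

def binary_cross_entropy(y, y_hat, eps=1e-12):
    # Clip predictions away from 0 and 1 to avoid log(0)
    y_hat = np.clip(y_hat, eps, 1.0 - eps)
    return -np.mean(y * np.log(y_hat) + (1.0 - y) * np.log(1.0 - y_hat))

y = np.array([1.0, 1.0])
print(binary_cross_entropy(y, np.array([0.99, 0.99])))  # small, ~0.01
print(binary_cross_entropy(y, np.array([0.01, 0.01])))  # large, ~4.6
```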
Multi-Class Classification (Softmax Output)
For $K$ classes with softmax output:

$$\mathcal{L} = -\frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{K} y_k^{(i)} \log \hat{y}_k^{(i)}$$
This is Categorical Cross-Entropy. Only the true class contributes to the loss.
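Because the labels are one-hot, multiplying by $y_k^{(i)}$ zeroes out every term except the true class, as a short sketch shows (the label and prediction values are illustrative):

```python
import numpy as np

def categorical_cross_entropy(Y, Y_hat, eps=1e-12):
    # Y: one-hot true labels, Y_hat: softmax outputs, both shape (m, K).
    # Only the log-probability of each sample's true class survives the sum.
    Y_hat = np.clip(Y_hat, eps, 1.0)
    return -np.mean(np.sum(Y * np.log(Y_hat), axis=1))

Y = np.array([[0, 1, 0],                 # true class: 1
              [1, 0, 0]])                # true class: 0
Y_hat = np.array([[0.1, 0.8, 0.1],       # confident and correct
                  [0.3, 0.4, 0.3]])      # unsure
print(categorical_cross_entropy(Y, Y_hat))  # -(log 0.8 + log 0.3)/2 ≈ 0.714
```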
Regression (Linear Output)
When predicting continuous values:

$$\mathcal{L} = \frac{1}{m} \sum_{i=1}^{m} \left( \hat{y}^{(i)} - y^{(i)} \right)^2$$

This is Mean Squared Error (MSE).
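MSE is a one-liner in NumPy; the targets and predictions below are illustrative values:

```python
import numpy as np

def mse(y, y_hat):
    # Average squared difference between predictions and targets
    return np.mean((y_hat - y) ** 2)

y = np.array([3.0, -0.5, 2.0])
y_hat = np.array([2.5, 0.0, 2.0])
print(mse(y, y_hat))  # (0.25 + 0.25 + 0) / 3 ≈ 0.1667
```

Squaring makes large errors dominate the loss, which is why MSE is sensitive to outliers in the targets.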