Dropout
Deep neural networks are powerful, but with great power comes great responsibility: specifically, the responsibility to not overfit. When your network has millions of parameters, it can easily memorize the training data instead of learning generalizable patterns. Dropout is one of the most elegant and effective solutions to this problem.
The Overfitting Problem
Complex networks with many nodes and weights tend to learn spurious correlations. Each weight encodes the importance of a connection between two neurons, and when you have too many of them, the model starts picking up noise instead of real structure.
The Ensemble Connection
Dropout didn't appear out of nowhere. It takes inspiration from ensemble learning, a technique where you train multiple models and combine their predictions.
How Ensemble Learning Works
- Train multiple models with different data, architectures, or initializations
- Combine their predictions (averaging, voting, etc.)
- Achieve better generalization than any single model
What Do We Do at Test Time?
At test time, we average the predictions of the individual models. If we have $K$ models, we get $K$ predictions, and the ensemble output is their average:

$$\hat{y} = \frac{1}{K} \sum_{k=1}^{K} \hat{y}_k$$
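As a minimal sketch in plain NumPy (the prediction values are made up purely for illustration), test-time ensemble averaging looks like this:

```python
import numpy as np

# Toy example: predictions from K = 3 independently trained models
# for the same batch of 4 inputs (e.g., predicted probabilities).
model_predictions = [
    np.array([0.9, 0.2, 0.7, 0.4]),  # model 1
    np.array([0.8, 0.3, 0.6, 0.5]),  # model 2
    np.array([0.7, 0.1, 0.8, 0.3]),  # model 3
]

# Ensemble output: the element-wise average of the K predictions.
ensemble_output = np.mean(model_predictions, axis=0)
print(ensemble_output)  # [0.8 0.2 0.7 0.4]
```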
The Problem with Ensembles
- Train multiple models → computationally expensive
- Run all models at test time → slow inference
- Storage and maintenance → operational overhead
We want the benefits of ensembling without the costs. Enter dropout.
What Is Dropout?
Dropout is a regularization technique that simplifies the model by randomly "dropping out" (removing) neurons during training. In each mini-batch, we temporarily remove a subset of neurons along with all their incoming and outgoing connections.
In every mini-batch, we update weights only for the nodes and connections that are present in the thinned network; nodes and weights that were dropped simply retain their previous values.
But we don't run multiple models at test time. We run only one model, yet it mimics the behavior of many.
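Here's a minimal NumPy sketch of that idea; the activation values are hypothetical, and the mask is resampled for every mini-batch:

```python
import numpy as np

rng = np.random.default_rng(0)

p = 0.5                                   # dropout rate: probability of dropping a neuron
activations = np.array([1.2, 0.7, 3.1, 0.4, 2.2, 1.5])

# Sample a fresh binary mask for this mini-batch: False = dropped, True = kept.
mask = rng.random(activations.shape) >= p
thinned = activations * mask

print(mask)     # keep/drop pattern for this mini-batch
print(thinned)  # dropped neurons output exactly 0 for this mini-batch
```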
Dropout Rate
The dropout rate controls how many neurons are turned off:
- 0.2 → 20% of neurons dropped
- 0.5 → 50% of neurons dropped
You typically set separate probabilities:
- Input nodes: retained with a high keep probability
- Hidden nodes: retained with a lower keep probability
Common practice: 0.8 for inputs, 0.5 for hidden layers.
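For example, in PyTorch this is expressed per layer; note that `nn.Dropout` takes the drop probability, so a retain probability of 0.8 corresponds to `p=0.2` (the layer sizes below are arbitrary):

```python
import torch.nn as nn

model = nn.Sequential(
    nn.Dropout(p=0.2),   # input: retain with probability 0.8
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Dropout(p=0.5),   # hidden: retain with probability 0.5
    nn.Linear(256, 128),
    nn.ReLU(),
    nn.Dropout(p=0.5),
    nn.Linear(128, 10),
)
```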
How Dropout Creates an Ensemble
With $n$ nodes in the original network, there are $2^n$ possible thinned network configurations; a network with just 30 units already has over a billion distinct sub-networks. In each training iteration, you're effectively training a different architecture on a different subset of the data.
The Magic
Dropout trains a massive ensemble of thinned networks without:
- Explicitly storing multiple models
- Running multiple forward passes at test time
- Paying the full ensemble cost
Each update touches a different random sub-network, and they all share weights. By the end of training, you've implicitly trained an ensemble of exponentially many models.
Training vs. Testing
This is where it gets interesting, and where many people get confused.
Training Phase
- Dropout ON
- Randomly drop neurons with probability $p$
- Scale activations to maintain expected magnitude
Testing Phase
- Dropout OFF
- All neurons are used
- No random dropping
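To see the two phases concretely, here is how a single dropout layer behaves in each mode in PyTorch (which already applies the training-time scaling discussed below):

```python
import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)
x = torch.ones(8)

drop.train()       # training phase: dropout ON
print(drop(x))     # roughly half the entries are 0; survivors are scaled (see below)

drop.eval()        # testing phase: dropout OFF
print(drop(x))     # all ones: every neuron used, no random dropping
```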
But here's the catch: if you simply turn off dropout at test time, the activations will be larger because you're using more neurons. To fix this, you scale weights during training (inverted dropout) or during testing.
Imagine practicing free throws where your coach covers your eyes half the time. You learn to compensate by shooting harder. Then comes game day: your eyes are wide open, but you're still shooting with that extra force, and now you're overshooting every basket. That's standard dropout. During training, we randomly drop neurons, and the survivors learn to fire more intensely to compensate. At test time, all neurons are active, and suddenly the network is "overshooting" because every neuron is firing at that elevated intensity.
The Problem: Scale Mismatch
Let's understand why we need inverted dropout in the first place.
Training Phase (Dropout ON)
During training, you randomly drop neurons with probability $p$. If you have a neuron that normally outputs a value of 10, after dropout it will output:
- 10 with probability $1-p$ (when kept)
- 0 with probability $p$ (when dropped)

The expected output becomes:

$$\mathbb{E}[\text{output}] = 10 \cdot (1-p) + 0 \cdot p = 10\,(1-p)$$

So dropout shrinks the expected activation by a factor of $1-p$.
Testing Phase (Dropout OFF)
At test time, you use all neurons (no dropout). That same neuron now outputs 10 consistently.
The problem: your test-time activations are $\frac{1}{1-p}$ times larger than what the network experienced during training!
The network was trained on smaller activations but now sees larger ones, which breaks the balance the weights learned during training.
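A quick numerical check of that mismatch, using standard (non-inverted) dropout on a toy neuron that always wants to output 10:

```python
import numpy as np

rng = np.random.default_rng(42)

p = 0.5                       # drop probability
activation = 10.0             # the neuron's usual output
n_samples = 100_000

# Training: with probability p the neuron is dropped and outputs 0.
kept = rng.random(n_samples) >= p
train_outputs = np.where(kept, activation, 0.0)

print(train_outputs.mean())   # ≈ 10 * (1 - p) = 5.0
print(activation)             # test time: always 10.0, i.e. 1/(1 - p) = 2x larger
```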
We could scale at test time instead; that's actually the original dropout formulation. But inverted dropout is preferred because it keeps inference simple and avoids subtle issues.
Inverted Dropout: The Elegant Solution
Inverted dropout scales during training instead of testing.
How It Works
During training, after applying the dropout mask, you divide the surviving activations by the keep probability $1-p$:

$$\text{output} = \frac{\text{mask} \cdot \text{activation}}{1-p}$$

What Happens Mathematically
Consider a neuron whose output during training would normally be 10:
- With probability $1-p$: it outputs $\frac{10}{1-p}$
- With probability $p$: it outputs 0

Expected output:

$$\mathbb{E}[\text{output}] = (1-p) \cdot \frac{10}{1-p} + p \cdot 0 = 10$$

Boom! The expected output during training is now exactly 10, the same as at test time.
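The same toy neuron, now with an inverted-dropout sketch, shows the training-time expectation landing right back at 10:

```python
import numpy as np

def inverted_dropout(activations, p, rng):
    """Drop with probability p, then scale survivors by 1 / (1 - p)."""
    keep_prob = 1.0 - p
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob

rng = np.random.default_rng(0)
p = 0.5
activations = np.full(100_000, 10.0)   # a neuron that normally outputs 10

train_out = inverted_dropout(activations, p, rng)
print(train_out.mean())   # ≈ 10: training-time expectation matches test time
# Test time: dropout is simply switched off and the neuron outputs 10 directly.
```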