
Dropout

Deep neural networks are powerful, but with great power comes great responsibilityโ€”specifically, the responsibility to not overfit. When your network has millions of parameters, it can easily memorize the training data instead of learning generalizable patterns. Dropout is one of the most elegant and effective solutions to this problem.

The Overfitting Problem

Complex networks with many nodes and weights tend to learn spurious correlations. Each weight $w_{ij}$ defines the importance of the connection between neurons $i$ and $j$, and when you have too many of them, the model starts picking up noise.


The Ensemble Connection

Dropout didn't appear out of nowhere. It takes inspiration from ensemble learningโ€”a technique where you train multiple models and combine their predictions.

How Ensemble Learning Works

    1. Train multiple models with different data, architectures, or initializations
    2. Combine their predictions (averaging, voting, etc.)
    3. Achieve better generalization than any single model

What Do We Do at Test Time?

At test time, with $K$ models, we average the $K$ models' predictions.

If we have $K$ models, we get $K$ predictions $\hat{y}_1, \hat{y}_2, \hat{y}_3, \dots, \hat{y}_K$.

The combined output:

$$\hat{y} = \frac{1}{K} \sum_{k=1}^{K} \hat{y}_k$$
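As a quick sketch of this averaging step (with made-up predictions from three hypothetical models), the ensemble output is just a mean over the model axis:

```python
import numpy as np

# Hypothetical predictions from K = 3 models for the same 4 inputs
predictions = np.array([
    [0.9, 0.2, 0.6, 0.1],  # model 1
    [0.8, 0.3, 0.5, 0.2],  # model 2
    [1.0, 0.1, 0.7, 0.3],  # model 3
])

# Ensemble prediction: average over the K models (axis 0)
y_hat = predictions.mean(axis=0)
print(y_hat)  # ≈ [0.9, 0.2, 0.6, 0.2]
```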

The Problem with Ensembles

  • Train multiple models โ†’ computationally expensive
  • Run all models at test time โ†’ slow inference
  • Storage and maintenance โ†’ operational overhead

We want the benefits of ensembling without the costs. Enter dropout.

What Is Dropout?

Dropout is a regularization technique that simplifies the model by randomly "dropping out" (removing) neurons during training. In each mini-batch, we temporarily remove a subset of neurons along with all their incoming and outgoing connections.


In every mini-batch we update weights only for those nodes and weights that remain in the thinned network. Nodes and weights that were dropped for that mini-batch simply retain their previous values.

But we don't run multiple models at test time. We run only one model, yet it mimics the behavior of many models.
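A minimal NumPy sketch of that per-mini-batch masking (the function name and toy shapes are illustrative, not from any framework):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_forward(activations, drop_prob):
    """Zero out each activation independently with probability drop_prob.

    Dropped units contribute nothing to this mini-batch's forward pass,
    so only the surviving sub-network receives gradient updates.
    """
    keep_mask = rng.random(activations.shape) >= drop_prob
    return activations * keep_mask, keep_mask

h = np.ones((2, 5))                      # toy hidden activations
out, mask = dropout_forward(h, 0.5)
# 'out' is zero wherever 'mask' is False; survivors pass through unchanged
```

A fresh random mask is drawn every call, so each mini-batch trains a different thinned sub-network.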

Dropout Rate

The dropout rate controls how many neurons are turned off:

  • 0.2 โ†’ 20% of neurons dropped
  • 0.5 โ†’ 50% of neurons dropped

You typically set separate retain (keep) probabilities per layer type:

  • Input nodes: retained with probability $p$
  • Hidden nodes: retained with probability $q$

Common practice: a retain probability of 0.8 for inputs and 0.5 for hidden layers.

How Dropout Creates an Ensemble

With $m$ nodes in the original network, there are $2^m$ possible thinned network configurations. In each training iteration, you're effectively training a different architecture on a different subset of the data.
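For a toy network you can enumerate these configurations directly (a small illustration of the $2^m$ count, not something you would ever do for a real network):

```python
from itertools import product

m = 3  # toy network with 3 droppable units

# Each unit is either dropped (0) or kept (1),
# so the thinned networks are all binary masks of length m.
configs = list(product([0, 1], repeat=m))

print(len(configs))  # 8 = 2**3 distinct thinned networks
```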

The Magic

Dropout trains a massive ensemble of thinned networks without:

  • Explicitly storing multiple models
  • Running multiple forward passes at test time
  • Paying the full ensemble cost

Each update touches a different random sub-network, and they all share weights. By the end of training, you've implicitly trained an ensemble of exponentially many models.

Training vs. Testing

This is where it gets interestingโ€”and where many get confused.

Training Phase

  • Dropout ON
  • Randomly drop neurons with probability $p$
  • Scale activations to maintain expected magnitude

Testing Phase

  • Dropout OFF
  • All neurons are used
  • No random dropping

But here's the catch: if you simply turn off dropout at test time, the activations will be larger because you're using more neurons. To fix this, you scale weights during training (inverted dropout) or during testing.

Imagine practicing free throws where your coach covers your eyes half the time. You learn to compensate by shooting harder. Then comes game dayโ€”your eyes are wide open, but you're still shooting with that extra force, and now you're overshooting every basket. That's standard dropout. During training, we randomly drop neurons, and the survivors learn to fire more intensely to compensate. At test time, all neurons are active, and suddenly the network is "overshooting" because every neuron is firing at that elevated intensity.


The Problem: Scale Mismatch

Let's understand why we need inverted dropout in the first place.

Training Phase (Dropout ON)

During training, you randomly drop neurons with probability $p$. A neuron that normally outputs a value of 10 will, after dropout, output:

  • 10 with probability 1โˆ’p (when kept)
  • 0 with probability p (when dropped)

The expected output becomes:

$$E[\text{output}] = 10 \times (1-p) + 0 \times p = 10(1-p)$$

So dropout reduces the expected activation by a factor of 1โˆ’p.

Testing Phase (Dropout OFF)

At test time, you use all neurons (no dropout). That same neuron now outputs 10 consistently.

The problem: Your test-time activations are $\frac{1}{1-p}$ times larger than what the network experienced during training!

The network was trained on smaller activations but now sees larger ones. This breaks the weight distributions learned during training.
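The shrinkage by a factor of $1-p$ is easy to verify numerically. Here is a small Monte Carlo sketch (toy numbers, NumPy) simulating many training-time passes for the single neuron above:

```python
import numpy as np

rng = np.random.default_rng(42)

p = 0.5            # drop probability
activation = 10.0  # the neuron's nominal output

# Simulate 100,000 training-time forward passes for this one neuron:
# kept with probability 1 - p, zeroed with probability p.
kept = rng.random(100_000) >= p
train_outputs = activation * kept

print(train_outputs.mean())  # ≈ 10 * (1 - p) = 5
print(activation)            # test time (dropout off): always 10
```

The average training-time output hovers around 5, while the test-time output is 10: exactly the mismatch described above.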

We can scale at test time โ€” thatโ€™s actually the original dropout formulation. But inverted dropout is preferred because it simplifies things and avoids subtle issues.

Inverted Dropout: The Elegant Solution

Inverted dropout scales during training instead of testing.

How It Works

During training, after applying the dropout mask, you divide the surviving activations by the keep probability $1-p$:

What Happens Mathematically

Consider a neuron whose output during training would be 10.

With probability $1-p$ (kept): $\frac{10}{1-p}$

With probability $p$ (dropped): 0

Expected output:

$$E[\text{output}] = \frac{10}{1-p} \times (1-p) + 0 \times p = 10$$

Boom! The expected output during training is now exactly 10โ€”the same as at test time.
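The same Monte Carlo sketch as before, now with the inverted-dropout rescaling (again a toy NumPy illustration, not a framework implementation), confirms the expectation is preserved:

```python
import numpy as np

rng = np.random.default_rng(7)

def inverted_dropout(x, drop_prob):
    """Apply dropout, then rescale survivors by the keep probability.

    Dividing by (1 - drop_prob) keeps the expected activation unchanged,
    so no rescaling is needed at test time.
    """
    keep_prob = 1.0 - drop_prob
    mask = rng.random(x.shape) >= drop_prob
    return x * mask / keep_prob

# 100,000 copies of a neuron that nominally outputs 10
x = np.full(100_000, 10.0)
y = inverted_dropout(x, drop_prob=0.5)

print(y.mean())  # ≈ 10: training-time expectation matches test time
```

Each surviving value becomes 20 and roughly half are zeroed, so the mean stays near 10: the same value the untouched network produces at test time.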
