
Batch Normalization: Stabilizing and Accelerating Learning

Deep learning works amazingly well β€” but training deep networks is often unstable, slow, and sensitive. Batch Normalization was introduced to fix exactly that. Let’s build the intuition step by step.

Why Do We Need Scaling?

Real-world data is messy:

  • Different units (kg, cm, β‚Ή)
  • Different ranges
  • Some features dominate others

Consider this dataset:

Age   Salary (β‚Ή)   Height (cm)   Weight (kg)
25    35000        170           65
30    50000        175           72
28    42000        168           60
35    65000        180           78

πŸ‘‰ Notice: Salary is much larger in scale than other features.

This creates a problem:

  • The model becomes more sensitive to salary
  • Optimization becomes imbalanced

Two Ways to Scale Data

1. Normalization (Min-Max Scaling)

We squash values into a fixed range:

x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}
  • Smallest value β†’ 0
  • Largest value β†’ 1
  • Preserves relative distances
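
To make this concrete, here is a minimal NumPy sketch that applies min-max scaling to the salary column from the table above (the numbers are just the toy values from that table):

```python
import numpy as np

# Salary values from the example table
salary = np.array([35000, 50000, 42000, 65000], dtype=float)

# Min-max scaling: squash values into [0, 1]
salary_scaled = (salary - salary.min()) / (salary.max() - salary.min())

print(salary_scaled)  # smallest value -> 0.0, largest value -> 1.0
```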

2. Standardization (Z-score Scaling)

We center and scale the data:

x' = \frac{x - \mu}{\sigma}
  • Centers data β†’ mean β‰ˆ 0
  • Standard deviation β‰ˆ 1
  • The standard deviation describes how far typical values lie from the mean
  • Keeps gradients balanced
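
A similarly minimal NumPy sketch of z-score standardization on the same toy salary values:

```python
import numpy as np

salary = np.array([35000, 50000, 42000, 65000], dtype=float)

# Standardization: subtract the mean, divide by the standard deviation
salary_std = (salary - salary.mean()) / salary.std()

print(salary_std.mean())  # ~0.0
print(salary_std.std())   # ~1.0
```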

πŸ‘‰ This is the most common approach in machine learning.

The terms normalization and standardization are often used interchangeably; in fact, Batch Normalization effectively uses standardization.

Input Scaling in Deep Learning

We already scale inputs before training, which makes features comparable. Deep learning prefers data centered around zero with a controlled spread.

Tabular / numerical data

For tabular data we mostly use standardization, because features can have very different distributions.

Image Data

For image data we can use both.

Scale pixels: 0–255 β†’ 0–1 (normalization)

Then often: Standardize using dataset mean & standard deviation
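
A small NumPy sketch of this two-step image preprocessing; the per-channel mean and std values below are placeholders, and in practice they are computed from your training set:

```python
import numpy as np

# A fake 8-bit image batch: (batch, height, width, channels)
images = np.random.randint(0, 256, size=(4, 32, 32, 3)).astype(np.float32)

# Step 1: normalization -> pixels in [0, 1]
images /= 255.0

# Step 2: standardization with dataset statistics (placeholder per-channel values)
mean = np.array([0.49, 0.48, 0.45])  # assumed, computed from the training set
std  = np.array([0.25, 0.24, 0.26])  # assumed, computed from the training set
images = (images - mean) / std
```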

Why Input Normalization Helps

Without scaling or normalizing:

  • Features with larger scales dominate the loss and make it much more sensitive in their direction. πŸ‘‰ This creates steep curvature in one direction and flat curvature in another

πŸ‘‰ Imagine:

  • Steep in one direction
  • Flat in another

This leads to:

  • Slow convergence
  • Unstable updates
  • Need for very small learning rates

With standardization, we make the data zero-centered with a controlled spread.


Key Question

If scaling helps at the input…
why not apply it inside the network?


The Hidden Problem in Deep Networks

In deep networks:

  • Each layer’s output becomes the next layer’s input
  • But these outputs keep changing during training

πŸ‘‰ This means:

Every layer is constantly chasing a moving target.


Internal Covariate Shift

As parameters update:

  • Activations shift
  • Distributions change

πŸ‘‰ Each layer must continuously re-adapt

This slows down training significantly.


Core Idea of Batch Normalization

Normalize activations at every layer
so that distributions remain stable


How Batch Normalization Works

For a mini-batch, we compute:

\hat{x} = \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}}

This ensures:

  • Mean = 0
  • Variance = 1

πŸ‘‰ But importantly:

  • This is done per batch
  • It changes during training

Neuron-Level View

For each neuron j:

Mean

\mu_B^{(j)} = \frac{1}{m} \sum_{i=1}^{m} z_i^{(j)}

Variance

\sigma_B^{2(j)} = \frac{1}{m} \sum_{i=1}^{m} \left(z_i^{(j)} - \mu_B^{(j)}\right)^2

Normalization

\hat{z}_i^{(j)} = \frac{z_i^{(j)} - \mu_B^{(j)}}{\sqrt{\sigma_B^{2(j)} + \epsilon}}

πŸ‘‰ Each neuron is normalized independently across the batch.
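
To make the per-neuron view concrete, here is a minimal NumPy sketch of these three steps for one mini-batch of pre-activations z with shape (batch_size, num_neurons); each column (neuron) is normalized independently:

```python
import numpy as np

def batch_normalize(z, eps=1e-5):
    """Normalize each neuron (column) across the mini-batch (rows)."""
    mu = z.mean(axis=0)                    # per-neuron mean over the batch
    var = z.var(axis=0)                    # per-neuron (biased) variance over the batch
    z_hat = (z - mu) / np.sqrt(var + eps)  # per-neuron normalization
    return z_hat

z = np.random.randn(8, 4) * 3.0 + 5.0      # toy batch: 8 samples, 4 neurons
z_hat = batch_normalize(z)
print(z_hat.mean(axis=0))  # ~0 for every neuron
print(z_hat.std(axis=0))   # ~1 for every neuron
```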


The Problem with Pure Normalization

After normalization:

  • Mean = 0
  • Variance = 1

πŸ‘‰ Sounds good, but it’s too restrictive.

What if the model needs:

  • Mean = 5
  • Variance = 20

We are forcing everything into the same shape.


The Fix: Learnable Flexibility

We introduce two parameters:

y_i^{(j)} = \gamma^{(j)} \hat{z}_i^{(j)} + \beta^{(j)}
  • Ξ³ β†’ controls scale
  • Ξ² β†’ controls shift

πŸ‘‰ Now the model can learn the best distribution.
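
In code, Ξ³ and Ξ² are one learnable pair per neuron, typically initialized to 1 and 0 so the layer starts out as a plain normalization. A minimal sketch (z_hat here is a stand-in for the normalized activations from the previous example):

```python
import numpy as np

z_hat = np.random.randn(8, 4)  # stand-in for the normalized activations above

gamma = np.ones(4)   # learnable scale per neuron, initialized to 1
beta = np.zeros(4)   # learnable shift per neuron, initialized to 0

# During training, gamma and beta are updated by gradient descent,
# letting the model recover whatever mean and spread it actually needs.
y = gamma * z_hat + beta
```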


Training vs Inference

During Training

  • Use mini-batch statistics:
    • ΞΌ_B
    • Οƒ_BΒ²

πŸ‘‰ This introduces slight randomness.


Running Averages

We maintain moving averages:

\mu_{\text{run}}^{(j)} \leftarrow (1 - \alpha)\,\mu_{\text{run}}^{(j)} + \alpha\,\mu_B^{(j)}
\sigma_{\text{run}}^{2(j)} \leftarrow (1 - \alpha)\,\sigma_{\text{run}}^{2(j)} + \alpha\,\sigma_B^{2(j)}

πŸ‘‰ These approximate the true data distribution.


During Inference

We use fixed statistics:

\hat{z}^{(j)} = \frac{z^{(j)} - \mu_{\text{run}}^{(j)}}{\sqrt{\sigma_{\text{run}}^{2(j)} + \epsilon}}

πŸ‘‰ No batch dependency β†’ stable predictions
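
Putting the pieces together, here is a simplified NumPy sketch of a BatchNorm layer that uses batch statistics during training, updates running averages, and switches to those fixed averages at inference. It only mirrors the forward-pass formulas above; the backward pass and parameter updates are omitted:

```python
import numpy as np

class SimpleBatchNorm:
    def __init__(self, num_neurons, alpha=0.1, eps=1e-5):
        self.gamma = np.ones(num_neurons)      # learnable scale
        self.beta = np.zeros(num_neurons)      # learnable shift
        self.run_mean = np.zeros(num_neurons)  # running mean, used at inference
        self.run_var = np.ones(num_neurons)    # running variance, used at inference
        self.alpha = alpha                     # update rate for the running averages
        self.eps = eps

    def forward(self, z, training=True):
        if training:
            mu = z.mean(axis=0)
            var = z.var(axis=0)
            # Move the running averages toward the current batch statistics
            self.run_mean = (1 - self.alpha) * self.run_mean + self.alpha * mu
            self.run_var = (1 - self.alpha) * self.run_var + self.alpha * var
        else:
            mu, var = self.run_mean, self.run_var  # fixed statistics at inference
        z_hat = (z - mu) / np.sqrt(var + self.eps)
        return self.gamma * z_hat + self.beta

bn = SimpleBatchNorm(4)
out_train = bn.forward(np.random.randn(8, 4), training=True)
out_test = bn.forward(np.random.randn(1, 4), training=False)  # no batch dependency
```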


Why BatchNorm Works

  • Keeps distributions stable
  • Smooths the loss landscape
  • Improves gradient flow
  • Allows higher learning rates

πŸ‘‰ Training becomes faster and more reliable
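
In practice you rarely implement this yourself; frameworks ship it as a layer. As an illustration (a minimal sketch assuming PyTorch is available), a BatchNorm layer is typically placed between a linear layer and its activation:

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(4, 16),
    nn.BatchNorm1d(16),  # normalizes the 16 hidden activations per mini-batch
    nn.ReLU(),
    nn.Linear(16, 1),
)

model.train()                 # use batch statistics and update running averages
out = model(torch.randn(8, 4))

model.eval()                  # use the stored running averages
with torch.no_grad():
    pred = model(torch.randn(1, 4))
```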


Benefits

  • Faster convergence
  • Stable optimization
  • Less sensitive to initialization
  • Better gradient flow
  • Regularization effect

Limitations

  • Depends on batch size
  • Not ideal for small batches
  • Train-test mismatch
  • Not suitable for some architectures (e.g., RNNs)

Final Insight

Input normalization helps at the start.

BatchNorm extends that idea throughout the network.


Closing Thought

β€œWe didn’t just normalize data β€” we normalized learning itself.”
