Gradient Descent Variants
Gradient Descent is the workhorse of machine learning: the algorithm that lets models learn from data by iteratively stepping toward the minimum of a loss function. But not all gradient descent is created equal. The way you compute gradients and update weights has profound implications for training speed, stability, and final model quality.
In this chapter, we'll explore the three main variants: Batch Gradient Descent, Stochastic Gradient Descent (SGD), and Mini-Batch Gradient Descent. Each has its strengths, weaknesses, and ideal use cases.
The Core Algorithm
Before diving into variants, let's establish what a vanilla gradient descent algorithm looks like.
At its heart, gradient descent is a simple loop:
- Initialize weights and choose a learning rate
- Compute the gradient of the loss with respect to the weights: $\nabla_\theta J(\theta)$
- Update the weights by stepping opposite the gradient: $\theta \leftarrow \theta - \eta \nabla_\theta J(\theta)$, where $\eta$ is the learning rate
- Repeat until convergence
The difference between variants lies entirely in how we compute the gradient.
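The loop above can be sketched in a few lines of Python. This is a minimal illustration under assumed details, not code from the chapter: the toy quadratic loss $(\theta - 3)^2$ and all names here are chosen for demonstration.

```python
# Vanilla gradient descent on a 1-D toy loss J(theta) = (theta - 3)^2,
# whose gradient is 2 * (theta - 3). The minimum is at theta = 3.

def gradient_descent(grad, theta0, lr=0.1, steps=100):
    """Repeat the core update: theta <- theta - lr * grad(theta)."""
    theta = theta0
    for _ in range(steps):
        theta = theta - lr * grad(theta)  # step opposite the gradient
    return theta

theta_star = gradient_descent(lambda t: 2 * (t - 3), theta0=0.0)
```

The variants below keep this loop unchanged and differ only in how `grad` is computed at each step.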
Visualizing the Loss Landscape
To understand the behavior of different optimizers, it helps to visualize the loss function as a landscape. A contour plot is a 2D representation where each line connects points of equal loss value, like a topographic map.
- The center of the concentric circles represents the minimum (lowest loss)
- Steepest descent direction is perpendicular to the contours
- Noisy updates bounce around instead of taking a straight path
Keep this mental image as we explore each variant.
Batch Gradient Descent
How It Works
Batch Gradient Descent computes the gradient using the entire dataset before making a single update.
For a dataset $\{(x_i, y_i)\}_{i=1}^{N}$, the gradient is averaged over all $N$ samples:

$\nabla_\theta J(\theta) = \frac{1}{N} \sum_{i=1}^{N} \nabla_\theta L_i(\theta)$

Where $L_i$ is the loss for sample $i$.

The update rule:

$\theta \leftarrow \theta - \eta \nabla_\theta J(\theta)$
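Here is a hedged sketch of Batch GD for 1-D linear regression ($y = w \cdot x$) with squared error. The dataset, model, and names are illustrative assumptions, not from the chapter; the key point is that the gradient is averaged over the entire dataset before each update.

```python
# Batch Gradient Descent: one update per full pass over the data.
# Per-sample gradient of (w*x - y)^2 with respect to w is 2*(w*x - y)*x.

def batch_gd(xs, ys, lr=0.05, epochs=200):
    w = 0.0
    n = len(xs)
    for _ in range(epochs):
        # Exact (true) gradient: average over ALL samples
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n
        w -= lr * grad  # a single update per epoch
    return w

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]  # generated by w = 2, so GD should recover w ≈ 2
w = batch_gd(xs, ys)
```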
Visual Behavior
In contour plots, Batch GD moves like a heavy ship: smooth, deliberate, and pointed directly toward the minimum. Each step is computed with complete information, so the trajectory follows the true gradient path.
Strengths
- Very stable: the gradient is the true gradient, with no approximations
- Deterministic: the same initialization yields the same results every run
- Easy to debug: loss decreases monotonically, with no surprises
Weaknesses
- Slow for large datasets: one update requires processing every single sample
- Memory heavy: the entire dataset must be loaded into memory
- Poor for online learning: the model can't be updated as new data arrives
- Gets stuck in shallow local minima: no noise means no mechanism for escape
Best Use Cases
- Small datasets where the entire dataset fits comfortably in memory
- Convex problems where the global minimum is guaranteed
- Situations where stability is more important than speed
- When you need deterministic, reproducible results
Stochastic Gradient Descent (SGD)
How It Works
Stochastic Gradient Descent flips the script. Instead of using all data, it computes the gradient from a single randomly selected sample and updates immediately.
For a randomly chosen sample $(x_i, y_i)$, the gradient is estimated from that single sample:

$\nabla_\theta J(\theta) \approx \nabla_\theta L_i(\theta)$

The update happens after every sample:

$\theta \leftarrow \theta - \eta \nabla_\theta L_i(\theta)$
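The per-sample update can be sketched on the same toy regression used for illustration throughout (an assumption, not the chapter's own example): pick one random sample, compute its gradient, update immediately.

```python
import random

# Stochastic Gradient Descent: one randomly chosen sample per update.
# Updates are cheap but noisy; on average they follow the true gradient.

def sgd(xs, ys, lr=0.01, steps=2000, seed=0):
    rng = random.Random(seed)  # fixed seed for reproducibility of the sketch
    w = 0.0
    for _ in range(steps):
        i = rng.randrange(len(xs))               # pick a single random sample
        grad = 2 * (w * xs[i] - ys[i]) * xs[i]   # single-sample gradient estimate
        w -= lr * grad                           # update immediately
    return w

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]  # generated by w = 2
w = sgd(xs, ys)
```

Because this toy data is noiseless, SGD settles exactly at $w = 2$; on real data the iterates keep bouncing around the minimum.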
Visual Behavior
If Batch GD is a cruise ship, SGD is a drunk sailor. Each step is noisy and erratic: one sample might push you one direction, the next pulls you another. But here's the magic: on average, these noisy steps point in the same direction as the true gradient.
The contour plot shows a jagged, zigzagging path that bounces around before eventually converging near the minimum.
Strengths
- Extremely fast per update: process one sample, update immediately
- Handles massive datasets: no need to load everything at once
- Escapes shallow minima: noise helps jump out of local optima
- Online learning ready: update the model as new data streams in
Weaknesses
- Noisy updates: loss doesn't decrease smoothly; it bounces around
- Harder to tune: learning rate selection is more critical
- Not vectorized: can't leverage GPU parallelism efficiently
Best Use Cases
- Very large datasets with millions or billions of samples
- Online learning environments where data streams continuously
- Deep learning training where escaping local minima is valuable
- Situations where speed outweighs stability
Mini-Batch Gradient Descent
How It Works
Mini-Batch Gradient Descent strikes a balance. It splits the dataset into small batches of size $m$ (typically 32, 64, or 128) and computes the gradient over each batch.

For a mini-batch $\mathcal{B}$ of size $m$:

$\nabla_\theta J(\theta) \approx \frac{1}{m} \sum_{i \in \mathcal{B}} \nabla_\theta L_i(\theta)$

One update per mini-batch, many updates per epoch:

$\theta \leftarrow \theta - \eta \cdot \frac{1}{m} \sum_{i \in \mathcal{B}} \nabla_\theta L_i(\theta)$
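A minimal sketch of the mini-batch loop, again on the illustrative toy regression (the data, batch size, and names are assumptions for demonstration): shuffle each epoch, split into batches of size $m$, average the gradient over each batch, update once per batch.

```python
import random

# Mini-Batch Gradient Descent: several updates per epoch, each based on
# the averaged gradient of a small batch.

def minibatch_gd(xs, ys, batch_size=2, lr=0.05, epochs=100, seed=0):
    rng = random.Random(seed)
    w = 0.0
    idx = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(idx)  # new sample order each epoch
        for start in range(0, len(idx), batch_size):
            batch = idx[start:start + batch_size]
            # Gradient averaged over the mini-batch only
            grad = sum(2 * (w * xs[i] - ys[i]) * xs[i] for i in batch) / len(batch)
            w -= lr * grad  # one update per mini-batch
    return w

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]  # generated by w = 2
w = minibatch_gd(xs, ys)
```

In real frameworks the inner sum is replaced by a single vectorized matrix operation over the batch, which is where the GPU efficiency comes from.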
Visual Behavior
In contour plots, Mini-Batch GD walks the middle path. Updates are smoother than SGD but faster and more frequent than Batch GD. The path shows some wobble (less than SGD, more than Batch GD) but converges efficiently.
Strengths
- Best of both worlds: more stable than SGD, faster than Batch GD
- GPU-friendly: batches fit in GPU memory for parallel computation
- Vectorized operations: matrix math on batches is highly optimized
- Adjustable noise: batch size lets you control the stability-speed tradeoff
Weaknesses
- Batch size is a hyperparameter: it needs tuning for optimal performance
- Still memory constrained: each batch must fit in GPU or CPU memory
Best Use Cases
- Most deep learning applications (this is the default choice)
- When you have GPU hardware available
- Situations requiring both stability and speed
The Batch Size Tradeoff
Choosing the right batch size is one of the most important decisions in training. Here's the intuition:
Batch size = 1 (pure SGD)
- High noise, erratic updates
- Slow per update (no vectorization)
- Poor GPU utilization
- Excellent generalization (escapes sharp minima)
Batch size = 32–128
- Moderate noise, reasonably stable
- Fast updates with vectorization
- Excellent GPU utilization
- Best generalization (the sweet spot)
Batch size = 256–512
- Low noise, smooth updates
- Very fast per epoch
- Excellent GPU utilization
- Moderate generalization (tends toward sharp minima)
Batch size = full dataset (Batch GD)
- No noise, perfectly stable
- Slow per epoch (one update only)
- Poor GPU utilization
- Poor generalization (stuck in sharp minima)
Why Small Batches Often Generalize Better
Small batches introduce noise into the optimization process. This noise acts like a regularizer, helping the model escape sharp minima and find flatter, more generalizable solutions. Large batches tend to converge to sharp minima that don't transfer well to test data: they perform well on training but fail on unseen examples.
Performance Comparison
Update frequency
- Batch GD: Once per epoch
- SGD: Once per sample
- Mini-Batch GD: Once per batch (dozens to hundreds per epoch)
Gradient accuracy
- Batch GD: Exact gradient (no approximation)
- SGD: Very noisy (single sample estimate)
- Mini-Batch GD: Approximate but stable (batch average)
Memory usage
- Batch GD: High (entire dataset)
- SGD: Low (single sample)
- Mini-Batch GD: Moderate (batch size)
Convergence speed
- Batch GD: Slow (one step per full pass)
- SGD: Fast (updates after every sample)
- Mini-Batch GD: Fast (updates after every batch)
GPU efficiency
- Batch GD: Poor (no parallelization across samples)
- SGD: Poor (can't vectorize single samples)
- Mini-Batch GD: Excellent (batches leverage matrix operations)
Online learning capability
- Batch GD: No (needs full dataset)
- SGD: Yes (update with each new sample)
- Mini-Batch GD: Yes (update with batch of new data)
Escaping local minima
- Batch GD: No (deterministic path)
- SGD: Yes (noise provides escape)
- Mini-Batch GD: Yes (tunable via batch size)
When to Use Which Method
Choose Batch Gradient Descent When
- Your dataset fits comfortably in memory
- You're solving a convex problem like linear regression
- You need deterministic, reproducible results
- You're debugging or testing model behavior
Choose Stochastic Gradient Descent When
- Your dataset is massive (streaming or over a million samples)
- You're training online with continuously arriving data
- You want to escape local minima aggressively
- You're working with deep neural networks (though mini-batch is more common now)
Choose Mini-Batch Gradient Descent When
- You have GPU hardware (almost always the case for deep learning)
- You want the best tradeoff between stability and speed
- Your dataset is moderately large (most real-world scenarios)
- You need vectorized operations for efficiency
The Sweet Spot: Mini-Batch with Tuned Size
In practice, Mini-Batch Gradient Descent is the default choice for most modern machine learning applications. The typical workflow:
- Start with batch size 32 or 64 (the classic defaults)
- Increase if you have GPU memory to spare
- Monitor validation loss: if it worsens, your batch might be too large
- Adjust the learning rate: larger batches often need larger learning rates
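The last point is often applied via the linear scaling heuristic: grow the learning rate in proportion to batch size relative to a reference configuration. This rule of thumb is an assumption added here for illustration (it is a common practice, not something this chapter prescribes), and all names are hypothetical.

```python
# Linear learning-rate scaling heuristic: when you multiply the batch size
# by k, multiply the learning rate by k as well (relative to a known-good
# base configuration). Treat the result as a starting point for tuning.

def scaled_lr(base_lr, base_batch, batch_size):
    """Scale the learning rate linearly with batch size."""
    return base_lr * batch_size / base_batch

# E.g., if lr=0.1 worked at batch size 32, try lr=0.4 at batch size 128.
lr = scaled_lr(base_lr=0.1, base_batch=32, batch_size=128)
```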
Common Batch Sizes by Domain
Image classification: 32–128
- Models like ResNet, EfficientNet, ViT typically use this range
NLP / Transformers: 16–64
- Memory constrained by sequence length and model size
Reinforcement learning: 32–256
- Balance between stability and sample efficiency
Tabular data: 256–1024
- Larger batches often work well with dense, structured data
Looking Ahead
Mini-Batch Gradient Descent is the foundation, but it's just the beginning. In the later optimization chapters, we'll explore advanced optimizers like Momentum, RMSprop, and Adam: algorithms that adapt learning rates per parameter and accelerate convergence beyond what vanilla gradient descent can achieve.