Gradient Descent Variants
Gradient Descent is the workhorse of machine learning: the algorithm that lets models learn from data by iteratively stepping toward the minimum of a loss function. But not all gradient descent is created equal. The way you compute gradients and update weights has profound implications for training speed, stability, and final model quality.
In this chapter, we'll explore the three main variants: Batch Gradient Descent, Stochastic Gradient Descent (SGD), and Mini-Batch Gradient Descent. Each has its strengths, weaknesses, and ideal use cases.
The Core Algorithm
Before diving into variants, let's establish what a vanilla gradient descent algorithm looks like.
At its heart, gradient descent is a simple loop:
- Initialize weights and choose a learning rate
- Compute the gradient of the loss with respect to the weights: $\nabla_\theta J(\theta)$
- Update the weights by stepping opposite the gradient: $\theta \leftarrow \theta - \eta \nabla_\theta J(\theta)$, where $\eta$ is the learning rate
- Repeat until convergence
The difference between variants lies entirely in how we compute the gradient.
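The loop above can be sketched in a few lines of Python. This is a minimal illustration under assumed details, not code from the chapter: the toy quadratic loss $(\theta - 3)^2$ and all names here are chosen for demonstration.

```python
# Vanilla gradient descent on a 1-D toy loss J(theta) = (theta - 3)^2,
# whose gradient is 2 * (theta - 3). The minimum is at theta = 3.

def gradient_descent(grad, theta0, lr=0.1, steps=100):
    """Repeat the core update: theta <- theta - lr * grad(theta)."""
    theta = theta0
    for _ in range(steps):
        theta = theta - lr * grad(theta)  # step opposite the gradient
    return theta

theta_star = gradient_descent(lambda t: 2 * (t - 3), theta0=0.0)
```

The variants below keep this loop unchanged and differ only in how `grad` is computed at each step.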
Visualizing the Loss Landscape
To understand the behavior of different optimizers, it helps to visualize the loss function as a landscape. A contour plot is a 2D representation where each line connects points of equal loss value, like a topographic map.
- The center of the concentric circles represents the minimum (lowest loss)
- Steepest descent direction is perpendicular to the contours
- Noisy updates bounce around instead of taking a straight path
Keep this mental image as we explore each variant.
Batch Gradient Descent
How It Works
Batch Gradient Descent computes the gradient using the entire dataset before making a single update.
For a dataset $\{(x_i, y_i)\}_{i=1}^{N}$, the gradient is averaged over all $N$ samples:

$\nabla_\theta J(\theta) = \frac{1}{N} \sum_{i=1}^{N} \nabla_\theta L_i(\theta)$

Where $L_i$ is the loss for sample $i$.

The update rule:

$\theta \leftarrow \theta - \eta \nabla_\theta J(\theta)$
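Here is a hedged sketch of Batch GD for 1-D linear regression ($y = w \cdot x$) with squared error. The dataset, model, and names are illustrative assumptions, not from the chapter; the key point is that the gradient is averaged over the entire dataset before each update.

```python
# Batch Gradient Descent: one update per full pass over the data.
# Per-sample gradient of (w*x - y)^2 with respect to w is 2*(w*x - y)*x.

def batch_gd(xs, ys, lr=0.05, epochs=200):
    w = 0.0
    n = len(xs)
    for _ in range(epochs):
        # Exact (true) gradient: average over ALL samples
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n
        w -= lr * grad  # a single update per epoch
    return w

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]  # generated by w = 2, so GD should recover w ≈ 2
w = batch_gd(xs, ys)
```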
Visual Behavior
In contour plots, Batch GD moves like a heavy ship: smooth, deliberate, and pointed directly toward the minimum. Each step is computed with complete information, so the trajectory follows the true gradient path.
Strengths
- Very stable: the gradient is the true gradient, with no approximations
- Deterministic: the same initialization yields the same results every run
- Easy to debug: loss decreases monotonically, with no surprises
Weaknesses
- Slow for large datasets: one update requires processing every single sample
- Memory heavy: the entire dataset must be loaded into memory
- Poor for online learning: the model can't be updated as new data arrives
- Gets stuck in shallow local minima: no noise means no mechanism for escape
Best Use Cases
- Small datasets where the entire dataset fits comfortably in memory
- Convex problems where the global minimum is guaranteed
- Situations where stability is more important than speed
- When you need deterministic, reproducible results
Stochastic Gradient Descent (SGD)
How It Works
Stochastic Gradient Descent flips the script. Instead of using all data, it computes the gradient from a single randomly selected sample and updates immediately.
For a randomly chosen sample $(x_i, y_i)$, the gradient is estimated from that single sample:

$\nabla_\theta J(\theta) \approx \nabla_\theta L_i(\theta)$

The update happens after every sample:

$\theta \leftarrow \theta - \eta \nabla_\theta L_i(\theta)$
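The per-sample update can be sketched on the same toy regression used for illustration throughout (an assumption, not the chapter's own example): pick one random sample, compute its gradient, update immediately.

```python
import random

# Stochastic Gradient Descent: one randomly chosen sample per update.
# Updates are cheap but noisy; on average they follow the true gradient.

def sgd(xs, ys, lr=0.01, steps=2000, seed=0):
    rng = random.Random(seed)  # fixed seed for reproducibility of the sketch
    w = 0.0
    for _ in range(steps):
        i = rng.randrange(len(xs))               # pick a single random sample
        grad = 2 * (w * xs[i] - ys[i]) * xs[i]   # single-sample gradient estimate
        w -= lr * grad                           # update immediately
    return w

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]  # generated by w = 2
w = sgd(xs, ys)
```

Because this toy data is noiseless, SGD settles exactly at $w = 2$; on real data the iterates keep bouncing around the minimum.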
Visual Behavior
If Batch GD is a cruise ship, SGD is a drunk sailor. Each step is noisy and erratic: one sample might push you one direction, the next pulls you another. But here's the magic: on average, these noisy steps point in the same direction as the true gradient.
The contour plot shows a jagged, zigzagging path that bounces around before eventually converging near the minimum.
Strengths
- Extremely fast per update: process one sample, update immediately
- Handles massive datasets: no need to load everything at once
- Escapes shallow minima: noise helps jump out of local optima
- Online learning ready: update the model as new data streams in
Weaknesses
- Noisy updates: loss doesn't decrease smoothly; it bounces around
- Harder to tune: learning rate selection is more critical
- Not vectorized: can't leverage GPU parallelism efficiently
Best Use Cases
- Very large datasets with millions or billions of samples
- Online learning environments where data streams continuously
- Deep learning training where escaping local minima is valuable
- Situations where speed outweighs stability
Mini-Batch Gradient Descent
How It Works
Mini-Batch Gradient Descent strikes a balance. It splits the dataset into small batches of size $m$ (typically 32, 64, or 128) and computes the gradient over each batch.

For a mini-batch $\mathcal{B}$ of size $m$:

$\nabla_\theta J(\theta) \approx \frac{1}{m} \sum_{i \in \mathcal{B}} \nabla_\theta L_i(\theta)$

One update per mini-batch, many updates per epoch:

$\theta \leftarrow \theta - \eta \cdot \frac{1}{m} \sum_{i \in \mathcal{B}} \nabla_\theta L_i(\theta)$
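A minimal sketch of the mini-batch loop, again on the illustrative toy regression (the data, batch size, and names are assumptions for demonstration): shuffle each epoch, split into batches of size $m$, average the gradient over each batch, update once per batch.

```python
import random

# Mini-Batch Gradient Descent: several updates per epoch, each based on
# the averaged gradient of a small batch.

def minibatch_gd(xs, ys, batch_size=2, lr=0.05, epochs=100, seed=0):
    rng = random.Random(seed)
    w = 0.0
    idx = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(idx)  # new sample order each epoch
        for start in range(0, len(idx), batch_size):
            batch = idx[start:start + batch_size]
            # Gradient averaged over the mini-batch only
            grad = sum(2 * (w * xs[i] - ys[i]) * xs[i] for i in batch) / len(batch)
            w -= lr * grad  # one update per mini-batch
    return w

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]  # generated by w = 2
w = minibatch_gd(xs, ys)
```

In real frameworks the inner sum is replaced by a single vectorized matrix operation over the batch, which is where the GPU efficiency comes from.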
Visual Behavior
In contour plots, Mini-Batch GD walks the middle path. Updates are smoother than SGD but faster and more frequent than Batch GD. The path shows some wobble (less than SGD, more than Batch GD) but converges efficiently.
Strengths
- Best of both worlds: more stable than SGD, faster than Batch GD
- GPU-friendly: batches fit in GPU memory for parallel computation
- Vectorized operations: matrix math on batches is highly optimized
- Adjustable noise: batch size lets you control the stability-speed tradeoff
Weaknesses
- Batch size is a hyperparameter: it needs tuning for optimal performance
- Still memory constrained: each batch must fit in GPU or CPU memory
Best Use Cases
- Most deep learning applications (this is the default choice)
- When you have GPU hardware available
- Situations requiring both stability and speed
The Batch Size Tradeoff
Choosing the right batch size is one of the most important decisions in training. Here's the intuition:
Batch size = 1 (pure SGD)
- High noise, erratic updates
- Slow per update (no vectorization)
- Poor GPU utilization
- Excellent generalization (escapes sharp minima)
Batch size = 32–128
- Moderate noise, reasonably stable
- Fast updates with vectorization
- Excellent GPU utilization
- Best generalization (the sweet spot)
Batch size = 256–512
- Low noise, smooth updates
- Very fast per epoch
- Excellent GPU utilization
- Moderate generalization (tends toward sharp minima)
Batch size = full dataset (Batch GD)
- No noise, perfectly stable
- Slow per epoch (one update only)
- Poor GPU utilization
- Poor generalization (stuck in sharp minima)
Why Small Batches Often Generalize Better
Small batches introduce noise into the optimization process. This noise acts like a regularizer, helping the model escape sharp minima and find flatter, more generalizable solutions. Large batches tend to converge to sharp minima that don't transfer well to test data: they perform well on training but fail on unseen examples.
Performance Comparison
Update frequency
- Batch GD: Once per epoch
- SGD: Once per sample
- Mini-Batch GD: Once per batch (dozens to hundreds per epoch)
Gradient accuracy
- Batch GD: Exact gradient (no approximation)
- SGD: Very noisy (single sample estimate)
- Mini-Batch GD: Approximate but stable (batch average)
Memory usage
- Batch GD: High (entire dataset)
- SGD: Low (single sample)
- Mini-Batch GD: Moderate (batch size)
Convergence speed
- Batch GD: Slow (one step per full pass)
- SGD: Fast (updates after every sample)
- Mini-Batch GD: Fast (updates after every batch)
GPU efficiency
- Batch GD: Poor (no parallelization across samples)
- SGD: Poor (can't vectorize single samples)
- Mini-Batch GD: Excellent (batches leverage matrix operations)
Online learning capability
- Batch GD: No (needs full dataset)
- SGD: Yes (update with each new sample)
- Mini-Batch GD: Yes (update with batch of new data)
Escaping local minima
- Batch GD: No (deterministic path)
- SGD: Yes (noise provides escape)
- Mini-Batch GD: Yes (tunable via batch size)
When to Use Which Method
Choose Batch Gradient Descent When
- Your dataset fits comfortably in memory
- You're solving a convex problem like linear regression
- You need deterministic, reproducible results
- You're debugging or testing model behavior
Choose Stochastic Gradient Descent When
- Your dataset is massive (streaming or over a million samples)
- You're training online with continuously arriving data
- You want to escape local minima aggressively
- You're working with deep neural networks (though mini-batch is more common now)
Choose Mini-Batch Gradient Descent When
- You have GPU hardware (almost always the case for deep learning)
- You want the best tradeoff between stability and speed
- Your dataset is moderately large (most real-world scenarios)
- You need vectorized operations for efficiency
The Sweet Spot: Mini-Batch with Tuned Size
In practice, Mini-Batch Gradient Descent is the default choice for most modern machine learning applications. The typical workflow:
- Start with batch size 32 or 64 (the classic defaults)
- Increase if you have GPU memory to spare
- Monitor validation loss: if it worsens, your batch might be too large
- Adjust the learning rate: larger batches often need larger learning rates
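The last point is often applied via the linear scaling heuristic: grow the learning rate in proportion to batch size relative to a reference configuration. This rule of thumb is an assumption added here for illustration (it is a common practice, not something this chapter prescribes), and all names are hypothetical.

```python
# Linear learning-rate scaling heuristic: when you multiply the batch size
# by k, multiply the learning rate by k as well (relative to a known-good
# base configuration). Treat the result as a starting point for tuning.

def scaled_lr(base_lr, base_batch, batch_size):
    """Scale the learning rate linearly with batch size."""
    return base_lr * batch_size / base_batch

# E.g., if lr=0.1 worked at batch size 32, try lr=0.4 at batch size 128.
lr = scaled_lr(base_lr=0.1, base_batch=32, batch_size=128)
```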
Common Batch Sizes by Domain
Image classification: 32–128
- Models like ResNet, EfficientNet, ViT typically use this range
NLP / Transformers: 16–64
- Memory constrained by sequence length and model size
Reinforcement learning: 32–256
- Balance between stability and sample efficiency
Tabular data: 256–1024
- Larger batches often work well with dense, structured data
Looking Ahead
Mini-Batch Gradient Descent is the foundation, but it's just the beginning. In the later optimization chapters, we'll explore advanced optimizers like Momentum, RMSprop, and Adam: algorithms that adapt learning rates per parameter and accelerate convergence beyond what vanilla gradient descent can achieve.