CortexCookie

L1 and L2 Regularization

Deep learning models are extremely powerful because they can approximate highly complex functions. However, this power often comes with a downside: overfitting. When a neural network has too many parameters, it may learn the noise in the training data instead of the true underlying patterns.


The goal of regularization is to simplify such networks by controlling the magnitude of their weights and improving generalization.

Why Regularization Is Necessary

A neural network consists of layers of neurons connected by weights. These weights determine how much influence one neuron has on another. A complex network can have:

  • Many neurons
  • Dense interconnections
  • A large number of weight parameters

When this happens, the model can fit the training data extremely well but fail to perform on unseen data.

Regularization addresses this problem by penalizing large weights, effectively encouraging the model to learn simpler and more robust representations.


Regularized Optimization: The Big Picture

In standard deep learning optimization, we aim to minimize a loss function. If our loss/error function is $L(w)$ or $E(w)$, the objective is:

$$\min_w E(w)$$

Regularization modifies this objective by adding a penalty term:

$$\min_w \left( E(w) + \text{Regularization Term} \right)$$

The two most common regularization techniques are L1 regularization and L2 regularization.
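Before looking at each technique in detail, the two penalty terms themselves are easy to compute. A minimal numpy sketch, using a hypothetical weight vector (the values are illustrative, not from any real network):

```python
import numpy as np

# Hypothetical weight vector from a trained layer
w = np.array([0.5, -1.2, 0.0, 3.0])

lam = 0.1  # regularization strength (lambda)

# L1 penalty: lambda * sum_j |w_j|
l1_penalty = lam * np.sum(np.abs(w))

# L2 penalty: (lambda / 2) * sum_j w_j^2
l2_penalty = (lam / 2) * np.sum(w ** 2)

print(l1_penalty)  # 0.1 * (0.5 + 1.2 + 0.0 + 3.0) = 0.47
print(l2_penalty)  # 0.05 * (0.25 + 1.44 + 0.0 + 9.0) = 0.5345
```

Either penalty would simply be added to the data loss before backpropagation; the difference between them lies in the gradients they contribute, as the next sections show.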

L1 Regularization (Lasso)

L1 regularization adds the L1 norm of the weight vector to the loss function:

Usual Optimization:

$$\min_w E(w)$$

L1 Regularized Optimization:

$$\min_w \left( E(w) + \lambda \|w\|_1 \right)$$

where:

$$\|w\|_1 = \sum_{j=1}^{m} |w_j|$$

Here:

  • $w$ represents the network weights
  • $\lambda$ is the regularization parameter
  • $m$ is the number of weight parameters

Gradient Descent Update Rule (L1)

With L1 regularization, the gradient descent update rule becomes:

$$w := w - \alpha \left( \nabla_w E(w) + \lambda \, \text{sign}(w) \right)$$

This update introduces a constant force that pushes weights toward zero.

The most important property of L1 regularization is that it drives many weights exactly to zero. This effectively removes unnecessary connections in the network.

As highlighted in the comparison tables in the PDF (pages 6–7):

  • L1 regularization produces sparse models
  • It performs implicit feature selection
  • It is most effective when many features are irrelevant or redundant
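The L1 update rule above can be sketched in a few lines of numpy. The weight and gradient values here are hypothetical stand-ins for $\nabla_w E(w)$:

```python
import numpy as np

def l1_gradient_step(w, grad_E, alpha=0.1, lam=0.5):
    """One L1-regularized gradient descent step:
    w := w - alpha * (grad E(w) + lam * sign(w))"""
    return w - alpha * (grad_E + lam * np.sign(w))

# Hypothetical current weights and loss gradient
w = np.array([0.8, -0.03, 0.001])
grad_E = np.array([0.2, 0.0, 0.0])

w_new = l1_gradient_step(w, grad_E)
print(w_new)  # note: the smallest weights overshoot past zero
```

Because the L1 force is constant in magnitude, a naive gradient step makes tiny weights oscillate around zero rather than land on it; practical implementations therefore often apply a soft-thresholding (proximal) step that clamps such weights to exactly zero, which is what produces the sparsity described above.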

L2 Regularization (Ridge)

L2 regularization adds the squared L2 norm of the weights to the loss function.

Usual Optimization:

$$\min_w E(w)$$

L2 Regularized Optimization:

$$\min_w \left( E(w) + \frac{\lambda}{2} \|w\|_2^2 \right)$$

where:

$$\|w\|_2^2 = w_1^2 + w_2^2 + \dots + w_m^2 = \sum_{j=1}^{m} w_j^2 = w^T w = \text{squared L2 norm of the weight vector}$$

$$\|w\|_2 = \sqrt{w_1^2 + w_2^2 + \dots + w_m^2} = \text{L2 norm of the weight vector}$$

Gradient Descent Update Rule (L2)

With L2 regularization, the weight update rule is:

$$w := w - \alpha \left( \nabla_w E(w) + \lambda w \right)$$

Unlike L1 regularization, L2 regularization shrinks weights smoothly toward zero but never makes them exactly zero.

This results in:

  • Stable learning
  • Distributed importance across features
  • Dense but well-controlled models

L2 regularization works best when most features contribute a little rather than a few features dominating the prediction.
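Under the same toy setup as before, the L2 update is one line. Note that every weight is scaled toward zero in proportion to its own size, rather than pushed by a constant force:

```python
import numpy as np

def l2_gradient_step(w, grad_E, alpha=0.1, lam=0.5):
    """One L2-regularized gradient descent step:
    w := w - alpha * (grad E(w) + lam * w)
    The lam * w term shrinks each weight proportionally ("weight decay")."""
    return w - alpha * (grad_E + lam * w)

# Hypothetical current weights and loss gradient
w = np.array([0.8, -0.03, 0.001])
grad_E = np.array([0.2, 0.0, 0.0])

w_new = l2_gradient_step(w, grad_E)
print(w_new)  # all weights shrink; none become exactly zero
```

Contrast this with the L1 step: here the small weights $-0.03$ and $0.001$ shrink toward zero but keep their sign, which is why L2 yields dense rather than sparse models.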

The Role of the Regularization Parameter $\lambda$

The regularization parameter $\lambda$ controls the strength of the penalty term.

When $\lambda$ Is Too Small

  • The regularization term becomes negligible
  • The model behaves nearly like no regularization
  • May result in overfitting: high training accuracy but poor generalization

When $\lambda$ Is Too Large

  • The regularization term dominates the loss
  • The model focuses too much on shrinking weights
  • Underfitting may occur

Thus, choosing $\lambda$ involves balancing bias and variance.
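The effect of $\lambda$ is easiest to see in ridge (L2) regression, where the regularized solution has the closed form $w = (X^T X + \lambda I)^{-1} X^T y$. A sketch with synthetic data (the shrinking trend, not the exact numbers, is the point):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
true_w = np.array([2.0, -1.0, 0.5, 0.0, 0.0])
y = X @ true_w + 0.1 * rng.normal(size=50)

def ridge_weights(X, y, lam):
    """Closed-form L2-regularized least squares: (X^T X + lam*I)^{-1} X^T y."""
    m = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(m), X.T @ y)

for lam in [0.0, 1.0, 100.0]:
    w = ridge_weights(X, y, lam)
    print(lam, np.linalg.norm(w))  # the weight norm shrinks as lam grows
```

With $\lambda = 0$ the fit is unregularized (low bias, high variance); at $\lambda = 100$ the weights are crushed toward zero and the model underfits. In practice $\lambda$ is chosen by validating on held-out data between these extremes.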

Combining L1 and L2: Elastic Net

Limitations of L1 Regularization

  • Encourages sparsity by pushing weights exactly to zero
  • Can be unstable when features are highly correlated
  • May arbitrarily eliminate useful features

Limitations of L2 Regularization

  • Shrinks weights smoothly but never removes them
  • Does not perform feature selection
  • Keeps all features active, even weak ones

Elastic Net addresses these issues by blending both penalties into a single objective. It augments the loss function with both L1 and L2 regularization terms:

$$\min_w \left( L(w) + \lambda_1 \|w\|_1 + \frac{\lambda_2}{2} \|w\|_2^2 \right)$$

where:

  • $L(w)$ is the original loss function
  • $\|w\|_1 = \sum_{j=1}^{m} |w_j|$ is the L1 norm
  • $\|w\|_2^2 = \sum_{j=1}^{m} w_j^2$ is the squared L2 norm
  • $\lambda_1$ controls sparsity
  • $\lambda_2$ controls weight shrinkage

Intuition Behind Elastic Net

Elastic Net applies two simultaneous forces during training:

  • The L1 term pushes small and unimportant weights exactly to zero
  • The L2 term prevents remaining weights from becoming too large

As a result, Elastic Net:

  • Produces sparse yet stable models
  • Handles correlated features better than L1 alone
  • Improves generalization performance

The combined effect ensures that irrelevant connections are removed while important ones remain controlled.
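The two forces combine into a single gradient step. A sketch using the same toy weights as before ($\lambda_1$ and $\lambda_2$ are illustrative values, not tuned):

```python
import numpy as np

def elastic_net_step(w, grad_L, alpha=0.1, lam1=0.5, lam2=1.0):
    """One Elastic Net gradient descent step:
    w := w - alpha * (grad L(w) + lam1 * sign(w) + lam2 * w)
    lam1 * sign(w): constant L1 force toward zero (sparsity)
    lam2 * w:       proportional L2 force (shrinkage)"""
    return w - alpha * (grad_L + lam1 * np.sign(w) + lam2 * w)

# Hypothetical current weights and loss gradient
w = np.array([0.8, -0.03, 0.001])
grad_L = np.array([0.2, 0.0, 0.0])

print(elastic_net_step(w, grad_L))
```

As with pure L1, production implementations typically handle the L1 part with a soft-thresholding step so that small weights land exactly on zero instead of oscillating around it, while the L2 term keeps the surviving weights bounded.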

When to Use Elastic Net

Elastic Net is particularly effective when:

  • The dataset contains many features
  • Input features are highly correlated
  • Feature selection and stability are both desired
  • Pure L1 or pure L2 regularization performs poorly

Key Takeaways

  • Overfitting arises from overly complex neural networks
  • L1 regularization simplifies models by removing unnecessary weights
  • L2 regularization stabilizes learning by shrinking weights
  • The regularization parameter $\lambda$ must be tuned carefully
  • Combining L1 and L2 often yields the best performance

Conclusion

Regularization is a fundamental concept in deep learning. It ensures that models not only perform well on training data but also generalize effectively to unseen examples. By understanding and applying L1 and L2 regularization correctly, we can build neural networks that are both powerful and reliable.

In practice, a well-regularized model is often more valuable than a highly complex one.
