L1 vs. L2 Regularization: when should you use each?
Regularization is one of those concepts that appears in almost every machine learning course, yet many practitioners struggle to explain when to use L1 versus L2 regularization in practice. Understanding the intuition behind each method is more important than memorizing the formulas.
Why do we need regularization?
Machine learning models can overfit when they become too tailored to the training data. Regularization introduces a penalty on model complexity, discouraging excessively large coefficients and improving generalization to new data.
The objective function becomes Loss + Penalty, where the penalty is usually scaled by a tunable strength parameter. The type of penalty determines whether we are using L1 or L2 regularization.
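To make the Loss + Penalty idea concrete, here is a minimal sketch in plain Python. It assumes mean squared error as the loss and a strength parameter named lam. The function name, the residuals, and the coefficient values are all made up for illustration and do not come from any particular library.

```python
def penalized_loss(residuals, coefs, lam, kind):
    """Mean squared error plus an L1 or L2 penalty on the coefficients."""
    mse = sum(r * r for r in residuals) / len(residuals)
    if kind == "l1":
        penalty = sum(abs(w) for w in coefs)   # sum of absolute values
    elif kind == "l2":
        penalty = sum(w * w for w in coefs)    # sum of squares
    else:
        raise ValueError("kind must be 'l1' or 'l2'")
    return mse + lam * penalty

residuals = [1.0, -1.0, 2.0, 0.0]   # hypothetical prediction errors
coefs = [3.0, -0.5, 0.0]            # hypothetical model coefficients

l1_value = penalized_loss(residuals, coefs, lam=0.1, kind="l1")
l2_value = penalized_loss(residuals, coefs, lam=0.1, kind="l2")
print(l1_value, l2_value)
```

The loss term is identical in both calls. Only the penalty differs, and the squared penalty charges much more for the large coefficient (3.0) than the absolute-value penalty does.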
L1 Regularization (LASSO): L1 regularization adds the sum of the absolute values of the model coefficients to the loss function. L1 tends to drive some coefficients exactly to zero. As a result, unimportant features are removed from the model, and the model becomes more interpretable.
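One way to see where the exact zeros come from is a simplified special case. Assume the features are orthonormal and the loss is half the sum of squared errors (both assumptions are made here only to keep the math short). In that case the L1-penalized coefficient is the ordinary least-squares coefficient passed through a "soft-thresholding" step. The coefficient values and the strength lam below are hypothetical.

```python
def soft_threshold(w, lam):
    """Move w toward zero by lam; anything within lam of zero becomes exactly 0."""
    if w > lam:
        return w - lam
    if w < -lam:
        return w + lam
    return 0.0

ols_coefs = [2.5, -0.3, 0.1, -1.2]   # hypothetical unpenalized coefficients
lam = 0.5                            # hypothetical penalty strength

l1_coefs = [soft_threshold(w, lam) for w in ols_coefs]
print(l1_coefs)   # the two small coefficients come out as exactly 0.0
```

The large coefficients survive with reduced magnitude, while the small ones are removed entirely. That is the feature-removal behavior described above.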
Use L1 when:
You have a very large number of features
Many features are likely irrelevant
Interpretability is important
A common example is text classification using a document-term matrix containing thousands of words. L1 can identify the subset of terms that truly contribute to prediction.
L2 Regularization (Ridge): L2 regularization adds the sum of the squared coefficients to the loss function. L2 shrinks coefficients toward zero but rarely eliminates them entirely. As a result, all features remain in the model, predictions are often more stable, and correlated variables are handled better.
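The shrink-but-keep behavior can be sketched with the closed-form ridge solution in NumPy. Everything here is an illustrative assumption: the data is synthetic, two of the three features are deliberately near-duplicates, there is no intercept, and the helper name ridge_fit and the value lam=10.0 are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)   # nearly a copy of x1 (highly correlated)
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])
y = 2.0 * x1 + 0.5 * x3 + 0.5 * rng.normal(size=n)

def ridge_fit(X, y, lam):
    """Closed-form ridge solution (X^T X + lam * I)^(-1) X^T y, no intercept."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

w_ols = ridge_fit(X, y, lam=0.0)      # lam = 0 is ordinary least squares
w_ridge = ridge_fit(X, y, lam=10.0)   # hypothetical penalty strength

print("OLS coefficients:  ", np.round(w_ols, 3))
print("Ridge coefficients:", np.round(w_ridge, 3))
```

The ridge coefficient vector has a smaller norm than the unpenalized one, yet no coefficient is set to zero. With near-duplicate columns like x1 and x2, the unpenalized weights can swing widely from sample to sample, which is the instability L2 is meant to dampen.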
Use L2 when:
Most features contain some predictive signal
Features are highly correlated
Prediction accuracy is more important than interpretability
Many production machine learning systems default to L2 regularization because it often produces robust predictive performance.
A practical rule of thumb:
Choose L1 if your primary goal is identifying the most important predictors.
Choose L2 if your primary goal is maximizing predictive performance while reducing overfitting.
Use Elastic Net when you want a balance between the two, especially in high-dimensional datasets with correlated features.
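For reference, the three options can be written side by side. The notation is introduced here for illustration: L(w) is the unpenalized loss, w_j are the coefficients, lambda is the overall penalty strength, and alpha between 0 and 1 is the L1/L2 mix. This is one common way to write Elastic Net; libraries often scale the terms differently (for example, with a factor of 1/2 on the squared term).

```latex
\begin{aligned}
\text{L1 (LASSO):}  \quad & \min_{w}\; L(w) + \lambda \sum_{j} |w_j| \\
\text{L2 (Ridge):}  \quad & \min_{w}\; L(w) + \lambda \sum_{j} w_j^{2} \\
\text{Elastic Net:} \quad & \min_{w}\; L(w) + \lambda \Big( \alpha \sum_{j} |w_j| + (1 - \alpha) \sum_{j} w_j^{2} \Big)
\end{aligned}
```

Setting alpha to 1 recovers the L1 objective and setting it to 0 recovers the L2 objective.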
Think Like a Data Scientist: Suppose you are building a text classification model with 20,000 word features. Which regularization method would you start with?
Understanding the Bias-Variance Tradeoff:
The bias-variance tradeoff is one of the most fundamental concepts in machine learning. It explains why some models perform poorly because they are too simple, while others perform poorly because they are too complex. Understanding this tradeoff helps data scientists build models that generalize well to unseen data.
A machine learning model should perform well not only on the training data but also on unseen data. Two common problems can prevent this: underfitting and overfitting. The bias-variance tradeoff helps explain why these problems occur.
What is Bias? Bias refers to a model’s tendency to make consistent mistakes because it is too simple to capture the true relationship in the data. Think of a high-bias model as one that makes the wrong assumptions about how the world works. Because of these assumptions, the model misses important patterns and repeatedly produces predictions that are systematically off target.
Characteristics of High Bias:
Training error is high.
Test error is high.
Model is too simple.
Underfitting occurs.
Examples:
Using a linear model to represent a highly nonlinear relationship.
Building a shallow decision tree when the problem is complex.
What is Variance? Variance refers to how much a model's predictions change when trained on different samples of data. High-variance models are highly sensitive to the training data and often learn noise instead of true signal.
Characteristics of High Variance:
Training error is very low.
Test error is high.
Model is overly complex.
Overfitting occurs.
Examples:
An extremely deep decision tree.
A model with thousands of features and little regularization.
The Tradeoff:
As model complexity increases:
Bias generally decreases.
Variance generally increases.
As model complexity decreases:
Bias generally increases.
Variance generally decreases.
The challenge is finding the balance between the two. A model that is too simple misses important patterns, while a model that is too complex memorizes the training data.
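The tradeoff can be estimated directly by refitting the same model on many noisy versions of a dataset. The sketch below is a simulation under stated assumptions: the "true" relationship is a sine curve, the noise level is 0.3, and polynomial degree stands in for model complexity. All of these choices, and the helper name bias2_and_variance, are illustrative.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 30)
true_f = np.sin(2 * np.pi * x)   # assumed true relationship
noise_sd = 0.3                   # assumed noise level

def bias2_and_variance(degree, n_repeats=200):
    """Refit a polynomial on many noisy samples; measure bias^2 and variance."""
    preds = np.empty((n_repeats, x.size))
    for i in range(n_repeats):
        y = true_f + rng.normal(0.0, noise_sd, size=x.size)
        preds[i] = Polynomial.fit(x, y, degree)(x)
    mean_pred = preds.mean(axis=0)
    bias2 = np.mean((mean_pred - true_f) ** 2)   # systematic error
    variance = np.mean(preds.var(axis=0))        # sensitivity to the sample
    return bias2, variance

results = {d: bias2_and_variance(d) for d in (1, 3, 9)}
for d, (b2, var) in results.items():
    print(f"degree {d}: bias^2={b2:.4f}  variance={var:.4f}")
```

In this setup the straight line (degree 1) shows the largest bias and the smallest variance, and the degree-9 polynomial shows the reverse.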
How Do We Reduce Bias?
To reduce bias:
Increase model complexity.
Add relevant features.
Use more flexible algorithms.
Reduce regularization if it is too strong.
How Do We Reduce Variance?
To reduce variance:
Collect more training data.
Apply regularization.
Reduce model complexity.
Use cross-validation to choose model complexity and regularization strength.
Use ensemble methods such as Random Forests.
Why Does Regularization Help? Regularization intentionally introduces a small amount of bias to reduce variance. This often improves performance on unseen data because the reduction in variance outweighs the increase in bias. This is one reason why techniques such as Ridge, LASSO, and Elastic Net are widely used in machine learning.
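A small simulation can illustrate this trade. The sketch below fits closed-form ridge regression (no intercept) to many noisy targets generated from the same design matrix, then measures the squared bias and the variance of the estimated coefficients. The true coefficients, the noise level, the penalty strengths, and the helper name ridge_fit_many are all assumptions chosen for illustration, with weak signal and high noise so that regularization has room to help.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, noise_sd, n_repeats = 30, 5, 3.0, 500
X = rng.normal(size=(n, p))                      # fixed design matrix
true_w = np.array([0.5, -0.3, 0.2, 0.0, 0.4])    # hypothetical true coefficients
# Each row of Y is one simulated training target vector.
Y = X @ true_w + rng.normal(0.0, noise_sd, size=(n_repeats, n))

def ridge_fit_many(X, Y, lam):
    """Closed-form ridge fit (no intercept) for every row of Y at once."""
    gram = X.T @ X + lam * np.eye(X.shape[1])
    return np.linalg.solve(gram, X.T @ Y.T).T    # shape (n_repeats, p)

summary = {}
for lam in (0.0, 20.0, 200.0):
    fits = ridge_fit_many(X, Y, lam)
    bias2 = float(np.sum((fits.mean(axis=0) - true_w) ** 2))
    variance = float(np.sum(fits.var(axis=0)))
    summary[lam] = (bias2, variance, bias2 + variance)
    print(f"lambda={lam:>5}: bias^2={bias2:.3f}  variance={variance:.3f}  "
          f"total={bias2 + variance:.3f}")
```

Variance falls as the penalty grows, while bias rises. In this particular setup the moderate penalty gives a lower total error than no penalty at all. With strong signal and little noise, the same penalty could easily hurt instead.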
Food for Thought:
Suppose you build a decision tree that achieves 99% accuracy on the training set but only 70% accuracy on the test set. Would you describe this problem as high bias or high variance?
Feature Selection vs Feature Extraction:
While both feature selection and feature extraction aim to reduce the number of variables used by a model, they achieve this goal in fundamentally different ways.
Selection keeps a subset of the original features, while extraction creates entirely new ones.
1. Feature Selection (Choosing)
Feature Selection is the process of selecting a subset of the most relevant features from your original dataset without altering them. You simply drop the features that are noisy, redundant, or irrelevant.
How it works: If you start with features [A, B, C, D], a feature selection algorithm might decide that B and D are useless and leave you with just [A, C].
Analogy: Imagine filtering a closet full of clothes. You look at each item and decide to keep your 10 favorite jackets and donate the rest. The jackets you kept are exactly the same as they were before.
Common Methods:
Variance Thresholding (dropping features with constant values)
Correlation Matrices (dropping features that are too similar to each other)
Forward/Backward Selection (stepping through features to see which ones improve model score)
LASSO (L1 regularization)
Recursive Feature Elimination (RFE)
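Two of the methods above, variance thresholding and correlation filtering, can be sketched with NumPy on a toy version of the [A, B, C, D] example. The data is synthetic: B is built as a constant column and D as a near-copy of A. The cutoffs (1e-8 for variance, 0.95 for correlation) are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
a = rng.normal(size=n)
c = rng.normal(size=n)
X = np.column_stack([
    a,                                  # A: informative
    np.full(n, 5.0),                    # B: constant, carries no information
    c,                                  # C: informative
    a + 0.01 * rng.normal(size=n),      # D: near-duplicate of A
])
names = ["A", "B", "C", "D"]

# Step 1: variance threshold. Drop columns that (almost) never change.
variances = X.var(axis=0)
keep = [j for j in range(X.shape[1]) if variances[j] > 1e-8]

# Step 2: correlation filter. Keep a column only if it is not highly
# correlated with a column that has already been kept.
corr = np.corrcoef(X[:, keep], rowvar=False)
selected = []
for pos, j in enumerate(keep):
    if all(abs(corr[pos, keep.index(k)]) <= 0.95 for k in selected):
        selected.append(j)

selected_names = [names[j] for j in selected]
print(selected_names)
```

The surviving columns are unchanged copies of the originals, which is what separates selection from extraction.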
2. Feature Extraction (Transforming)
Feature Extraction reduces dimensionality by transforming the original features into a brand-new, smaller set of features. It combines the information from the old features to create something entirely new. These new features may be linear combinations (as in PCA) or complex nonlinear representations (as in neural networks).
How it works: If you start with features [A, B, C, D], an extraction algorithm might mash them together mathematically to create two brand-new features: [X, Y], where X = 2A + 3B and Y = C - D.
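The X = 2A + 3B and Y = C - D example is a linear transformation, so it can be written as a single matrix multiplication. In this sketch the two data rows are made-up values and the weight matrix name W is introduced only for illustration.

```python
import numpy as np

# Two hypothetical samples; columns are the original features A, B, C, D.
data = np.array([
    [1.0, 2.0, 3.0, 4.0],
    [0.0, 1.0, 5.0, 2.0],
])

# Each column of W defines one new feature as a weighted sum of A, B, C, D.
W = np.array([
    [2.0,  0.0],   # A contributes 2 to X
    [3.0,  0.0],   # B contributes 3 to X
    [0.0,  1.0],   # C contributes +1 to Y
    [0.0, -1.0],   # D contributes -1 to Y
])

new_features = data @ W   # columns are the new features X and Y
print(new_features)
```

Neither output column matches any original feature. Methods such as PCA follow the same pattern but learn the weights from the data instead of fixing them by hand.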
Analogy: Imagine taking those same clothes in your closet, shredding them down into raw fabric, and weaving them into 3 completely new, highly versatile outfits.
Common Methods:
Principal Component Analysis (PCA)
Linear Discriminant Analysis (LDA)
Autoencoders (in deep learning)
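As a concrete instance of the first method in the list, PCA can be sketched in a few lines with NumPy's singular value decomposition. The data below is synthetic and built on the assumption that four observed features are driven by two hidden factors plus a little noise. Variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 300
latent = rng.normal(size=(n, 2))            # two hidden factors (assumed)
mixing = rng.normal(size=(2, 4))            # how the factors map to 4 features
X = latent @ mixing + 0.05 * rng.normal(size=(n, 4))

# PCA: center the data, then take the SVD of the centered matrix.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

explained = S**2 / np.sum(S**2)             # share of variance per component
k = 2
Z = Xc @ Vt[:k].T                           # project onto the top-k components

print("explained variance ratio:", np.round(explained, 4))
print("reduced shape:", Z.shape)
```

Each column of Z is a weighted combination of all four original features, and the two columns are uncorrelated with each other. Because the data was generated from two factors, the first two components capture nearly all of the variance here. Real datasets rarely compress this cleanly.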