Real-world problems are far from ideal. The outcome of an event can depend on hundreds of other things. In terms of a machine learning (ML) problem, a target variable (quantitative or qualitative) can depend on many different variables, known as predictors or features. The model fitting and evaluation process becomes tricky when the number of predictors or features is very large. In a regression setting, a large number of predictors may
- Cause the overfitting problem
- Make the model less interpretable
- Make the model computationally expensive
The good news is that not all the predictors strongly affect the outcome. So, a feature selection algorithm can be used to reduce the number of predictors. Alternatively, we can keep all the predictors in the model and regularize their coefficient estimates so that many of them shrink towards zero (in the case of ridge regression) or become exactly zero (in the case of the lasso). This process is called regularization, and it comes in two types.
- L2 regularization or Ridge Regression
- L1 regularization or LASSO
In this post, we discuss regularization and its need. We further discuss the two types of regularization techniques and then try to differentiate them.
Regularization and Its Need
Regularization techniques are the extension of a simple linear regression technique. Recall that a simple linear regression model is given below.
Y ≈ β0 + β1X1 + β2X2 + … + βpXp
Where Y represents the predicted response and the βj are the coefficient estimates for the different variables or predictors (Xj).
In a simple linear regression setting, the coefficients are estimated so as to minimize the residual sum of squares (RSS) on the training data. The loss function for linear regression is given by the following relation, where the sum runs over the n training observations.

RSS = Σ (yi − β0 − β1xi1 − β2xi2 − … − βpxip)²
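To make this concrete, here is a minimal sketch in Python with NumPy of fitting a linear model by minimizing the RSS. The dataset and coefficient values are synthetic, invented purely for illustration.

```python
# Fit a linear regression by minimizing the residual sum of squares (RSS)
# on a small synthetic dataset. Data and coefficients are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3
X = rng.normal(size=(n, p))
true_beta = np.array([2.0, -1.0, 0.5])
y = 1.0 + X @ true_beta + rng.normal(scale=0.1, size=n)

# Prepend an intercept column and solve the least-squares problem,
# i.e. find beta minimizing RSS = sum((y - X1 @ beta) ** 2).
X1 = np.column_stack([np.ones(n), X])
beta_hat, *_ = np.linalg.lstsq(X1, y, rcond=None)

rss = np.sum((y - X1 @ beta_hat) ** 2)
print(beta_hat)  # close to the true values [1.0, 2.0, -1.0, 0.5]
print(rss)
```

With little noise and only three predictors, the estimates land close to the true coefficients; the overfitting issues discussed next arise when p grows large relative to n.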
However, such a model, which minimizes only the training RSS, does not generalize well to unseen data, and the overfitting problem creeps in. The training error keeps decreasing as we include more and more predictors in the model, but the test error does not. This is shown in Figure 1.
Figure 1. The blue curve shows training error which keeps on decreasing as we include more and more predictors into our model. The red curve depicts the test error. Image Source- By Gringer – Own work, CC BY 3.0, https://commons.wikimedia.org/w/index.php?curid=2959742
An overfit model has high variance and low bias, as shown in Figure 2. Also, a simple linear regression model provides unbiased estimates of the coefficients. This means that a simple linear regression model does not consider which predictors are important and which are not!
The goal of a model that tries to capture the underlying pattern of the data is to find the sweet spot of bias-variance trade-off, as shown in Figure 2.
Figure 2. A generalized model which works well on unseen data should hit the sweet spot of Bias-Variance Trade-off
Regularization makes the model generalizable and reduces its variance by introducing a small amount of bias, known as the regularization penalty, into the loss function. The additional term constrains/regularizes, or shrinks, the coefficient estimates towards zero. In other words, this technique discourages learning a more complex or flexible model, to avoid the risk of overfitting. Also, because regularization shrinks some of the coefficient estimates towards zero, it makes the model more interpretable.
Figure 3 explains how a regularized model tries to solve the overfitting problem.
Figure 3. The green and blue functions both incur zero loss on the given data points. A learned model can be induced to prefer the green function, which may generalize better to more points drawn from the underlying unknown distribution, by adjusting the weight of the regularization term. Image Source- By Nicoguaro – Own work, CC BY 4.0, https://commons.wikimedia.org/w/index.php?curid=46259145
In the next section, we will see two different types of regularization techniques.
L2 Regularization or Ridge Regression
L2 regularization, or ridge regression, is an extension of linear regression in which we minimize the following loss function.

Loss = Σ (yi − β0 − β1xi1 − … − βpxip)² + λ (β1² + β2² + … + βp²)
In the above equation, the first term is the same as the residual sum of squares, while the second term is a penalty term known as the L2 penalty. The minimization of the above loss function balances two conditions by varying the tuning parameter λ.
- The model should fit the data well, keeping the RSS small
- The model should keep the coefficients small, minimizing the sum of their squares
Let’s see the effects of different values of λ on the loss function.
- λ = 0:
- The loss function is the same as that of simple linear regression.
- No penalty is added, and the coefficients are the same as those of simple linear regression.
- λ → ∞:
- The impact of the shrinkage penalty grows, and the ridge regression coefficient estimates will approach zero.
- 0 < λ < ∞:
- The magnitude of λ decides the weight given to the different parts of the loss function.
- The coefficients will shrink relative to those of simple linear regression.
The selection of the tuning parameter λ is critical to the performance of ridge regression and is done using cross-validation. Figure 4 represents ridge regression.
Figure 4. When λ is 0, the ridge regression coefficients are the same as the simple linear regression estimates. As we increase λ, the coefficients shrink. The dotted vertical line represents the best value of the tuning parameter λ, for which the test error is minimum. Image source- James G, Witten D, Hastie T, Tibshirani R. An introduction to statistical learning. New York: Springer; 2013 Feb 11. Chapter 6
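As a rough illustration of this shrinkage, the sketch below fits ridge regression through its closed-form solution β = (XᵀX + λI)⁻¹Xᵀy on centred synthetic data. The data and λ values are invented for illustration; in practice one would use a library implementation and choose λ by cross-validation.

```python
# Ridge regression via its closed-form solution on centred synthetic data.
# As lambda grows, the norm of the coefficient vector shrinks towards zero.
import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 5
X = rng.normal(size=(n, p))
y = X @ np.array([3.0, -2.0, 1.0, 0.0, 0.5]) + rng.normal(size=n)
X = X - X.mean(axis=0)   # centring removes the need for an intercept
y = y - y.mean()

def ridge(X, y, lam):
    """Solve (X^T X + lam * I) beta = X^T y for beta."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

norms = [np.linalg.norm(ridge(X, y, lam)) for lam in (0.0, 10.0, 1000.0)]
print(norms)  # strictly decreasing: the coefficients shrink as lambda grows
```

At λ = 0 this reproduces the least-squares fit; as λ grows, every coefficient is pulled towards (but never exactly to) zero, matching the curves in Figure 4.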
L1 Regularization or LASSO
While ridge regression shrinks the coefficients of the predictors, it does not set any of them exactly to zero. The final model therefore contains all the predictors. This may not be a problem for prediction accuracy, but it can make the model hard to interpret when the number of predictors is large. The lasso overcomes this disadvantage.
In the lasso, some of the coefficients are set exactly to zero. Hence, the lasso inherently performs feature selection as well.
L1 regularization, or the lasso, is an extension of linear regression in which we minimize the following loss function.

Loss = Σ (yi − β0 − β1xi1 − … − βpxip)² + λ (|β1| + |β2| + … + |βp|)
Here, λ (lambda) plays the same role as in ridge regression, providing a trade-off between the RSS and the magnitude of the coefficients, and it can likewise take various values. In the above equation, the first term is the same as the residual sum of squares, while the second term is a penalty term known as the L1 penalty. The L1 penalty has the effect of forcing some of the coefficient estimates to be exactly zero when the tuning parameter λ is sufficiently large.
- λ = 0: the same coefficients as simple linear regression
- λ → ∞: all coefficients are zero (same logic as before)
- 0 < λ < ∞: coefficients between zero and the simple linear regression estimates
The lasso yields sparse models, that is, models that involve only a subset of the variables. As in ridge regression, selecting a good value of λ for the lasso is critical and is done using cross-validation. Figure 5 represents the lasso.
Figure 5. Standardized lasso coefficients as a function of λ. When λ is large, some of the coefficients become exactly zero. Image source- James G, Witten D, Hastie T, Tibshirani R. An introduction to statistical learning. New York: Springer; 2013 Feb 11. Chapter 6
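The sketch below shows this sparsity in action. It uses an illustrative coordinate-descent implementation with soft-thresholding on synthetic, standardized data (not production code; a library such as scikit-learn would be used in practice). Only two of the six predictors truly affect the response.

```python
# Lasso via coordinate descent with soft-thresholding on standardized,
# synthetic data. Only predictors 0 and 3 truly affect y; the lasso
# drives the other four coefficients exactly to zero.
import numpy as np

rng = np.random.default_rng(2)
n, p = 200, 6
X = rng.normal(size=(n, p))
X = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize predictors
y = X @ np.array([4.0, 0.0, 0.0, -3.0, 0.0, 0.0]) + rng.normal(size=n)
y = y - y.mean()

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Minimize (1/2n) * RSS + lam * sum(|beta_j|) by coordinate descent.

    With standardized columns, X[:, j] @ X[:, j] / n == 1, so each
    coordinate update is just a soft-thresholded correlation.
    """
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            r = y - X @ beta + X[:, j] * beta[j]   # partial residual
            beta[j] = soft_threshold(X[:, j] @ r / n, lam)
    return beta

beta = lasso_cd(X, y, lam=0.5)
print(beta)  # the four irrelevant coefficients come out as exactly 0.0
```

The soft-threshold operator is what distinguishes the lasso from ridge: any coordinate whose correlation with the residual falls below λ is set exactly to zero rather than merely shrunk.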
Ridge vs. Lasso
Both ridge and lasso regression try to solve the overfitting problem by introducing a small amount of bias to reduce the variance of the coefficient estimates. They also deal with the issue of multicollinearity. How can one decide whether to use ridge, lasso, or just simple linear regression? While there is no universally applicable rule, keeping a few points in mind helps us decide better.
- Lasso performs better when a small number of predictors is known to affect the output.
- Ridge performs better when all the predictors are known to affect the output.
- In practice, we do not know how the predictors affect the output. We can perform cross-validation to see whether lasso or ridge performs better.
- Lasso performs feature selection, while ridge does not.
- Both methods allow the use of correlated predictors, but they handle the multicollinearity issue differently:
- In ridge regression, the coefficients of correlated predictors are similar;
- In lasso, one of the correlated predictors has a larger coefficient, while the rest are (nearly) zeroed.
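A small sketch of this difference, using two strongly correlated synthetic predictors that share a common factor. The closed-form ridge and the coordinate-descent lasso below are illustrative implementations, and the data and λ values are invented; the response is built to depend slightly more on x1 than on x2.

```python
# Contrast ridge and lasso on two strongly correlated predictors.
# x1 and x2 share a common factor z; y depends directly on x1 only.
import numpy as np

rng = np.random.default_rng(3)
n = 200
z = rng.normal(size=n)
x1 = z + 0.75 * rng.normal(size=n)
x2 = z + 0.75 * rng.normal(size=n)
X = np.column_stack([x1, x2])
X = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize
y = 2.0 * x1 + 0.5 * rng.normal(size=n)
y = y - y.mean()

# Ridge: both correlated predictors keep clearly nonzero coefficients.
lam_ridge = 200.0
beta_ridge = np.linalg.solve(X.T @ X + lam_ridge * np.eye(2), X.T @ y)

# Lasso (coordinate descent with soft-thresholding): the weaker of the
# two correlated predictors is driven exactly to zero.
def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

lam_lasso = 1.0
beta_lasso = np.zeros(2)
for _ in range(500):
    for j in range(2):
        r = y - X @ beta_lasso + X[:, j] * beta_lasso[j]
        beta_lasso[j] = soft_threshold(X[:, j] @ r / n, lam_lasso)

print(beta_ridge)   # both entries clearly nonzero
print(beta_lasso)   # here the second entry is exactly 0.0
```

Ridge spreads the signal across both correlated columns, while the lasso keeps the predictor more strongly correlated with y and zeroes the other, exactly the behaviour described in the bullets above.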
In this post, we discussed
- The limitations of simple linear regression when the number of predictors is large
- What regularization is
- The need for regularization
- L1(Lasso) and L2 (Ridge) regularization
- Comparison of Ridge and Lasso
Reference- James G, Witten D, Hastie T, Tibshirani R. An introduction to statistical learning. New York: Springer; 2013 Feb 11.