A model that fits the training data too closely fails to fit unseen data reliably. Such an overfit model predicts or classifies future observations poorly.
In the picture below, the bed (the overfit model) is shaped too closely around one sleeping man (the training data), so it will not be a correct fit for a new person (unseen data). The bed is thus an example of an overfit model.
Let’s depict this problem pictorially.
In the first picture, we fit a regression model on the training data points shown in black. The black line depicts a model that fits the data just right and also performs reasonably well on unseen data. The blue curve, by contrast, passes through almost every point in the training set but will perform poorly on unseen data. The blue curve is an overfit model.
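To make the regression picture concrete, here is a minimal sketch using NumPy. Everything in it is an assumption made for illustration: the synthetic linear data, the noise level, the 12 training points, and the choice of a degree-9 polynomial to play the role of the blue curve. It is not the data behind the picture.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

def make_data(n):
    # Hypothetical ground truth: a straight line plus Gaussian noise.
    x = rng.uniform(-1.0, 1.0, n)
    y = 2.0 * x + 1.0 + rng.normal(0.0, 0.3, n)
    return x, y

x_train, y_train = make_data(12)   # small training set
x_test, y_test = make_data(200)    # unseen data from the same process

def fit_and_score(degree):
    # Least-squares polynomial fit on the training points only.
    coefs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coefs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coefs, x_test) - y_test) ** 2)
    return train_mse, test_mse

simple_train, simple_test = fit_and_score(1)   # the "black line"
wiggly_train, wiggly_test = fit_and_score(9)   # the "blue curve"

print(f"degree 1: train MSE={simple_train:.3f}, test MSE={simple_test:.3f}")
print(f"degree 9: train MSE={wiggly_train:.3f}, test MSE={wiggly_test:.3f}")
```

The flexible polynomial always reaches a training error at least as low as the straight line, because the straight line is a special case of it. What matters is the gap between its training error and its test error.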
The second picture depicts overfitting in a classification problem. The black line is a regularized, near-optimal classification boundary. It misclassifies some of the blue and red training points but will perform satisfactorily on unseen data. The green curve, on the other hand, classifies every training point correctly, which results in a wiggly boundary that will not perform well on unseen data points. Here the green curve is the overfit model.
Causes of Overfitting
Some of the significant causes of overfitting are listed below.
- The complexity of the model – When we increase the complexity of a model and include more and more features to track the training data more closely, the model doesn’t fare well on unseen data. In the figure above, as the complexity of the model increases, the blue curve tracks all the training points well but fails to fit the test data.
- Noisy Data – If our training data has too much random variation, noise, and outliers, then these data points can fool our model. The model learns these variations as genuine patterns and concepts.
- Quality and Quantity of training data – A model is only as good as the data it was trained on. Garbage in will result in garbage out. If our data sample is not representative of the population, it may not give the correct picture of the population. Moreover, non-standardized data could also lead to a misfit of the model.
Consequences of Overfitting
An overfit model will produce a large MSE or a large misclassification error on unseen data. Thus, while an overfit model performs well on the training data, which it has already seen, it does not generalize. It picks up the noise or random fluctuations in the training data and learns them as concepts, giving too much predictive power to what is only noise in the input data.
In the regression and classification examples shown in the figures earlier, if we evaluate the overfit model on the training data, the MSE (Mean Squared Error) in the regression case and the misclassification error in the classification case will be close to zero, but the test error will be large. Typical training and test error rates follow the graph below. In this plot, the X-axis represents the flexibility of the model (e.g., the degree of the fit), and the Y-axis represents the error. As you can see, the training error, shown as the blue line, keeps decreasing as you increase the complexity of the model. The test error, shown in red, decreases initially but begins to rise after a certain point. The region to the right of the global minimum of the test error, where the test error increases while the training error keeps decreasing, shows that the model is overfitting. The region to the left of that minimum, where both training and test errors are high, shows that the model is underfitting.
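The shape of these two error curves can be reproduced with a small experiment. The sketch below is illustrative only: the sine-shaped ground truth, the noise level, the sample sizes, and the use of polynomial degree as the "flexibility" axis are all assumptions, not the setup behind the plot.

```python
import numpy as np

rng = np.random.default_rng(42)

def sample(n):
    # Hypothetical non-linear ground truth plus Gaussian noise.
    x = rng.uniform(-1.0, 1.0, n)
    y = np.sin(np.pi * x) + rng.normal(0.0, 0.2, n)
    return x, y

x_train, y_train = sample(30)
x_test, y_test = sample(300)

degrees = list(range(1, 11))       # model flexibility (X-axis of the plot)
train_errors, test_errors = [], []

for d in degrees:
    coefs = np.polyfit(x_train, y_train, d)   # fit on training data only
    train_errors.append(float(np.mean((np.polyval(coefs, x_train) - y_train) ** 2)))
    test_errors.append(float(np.mean((np.polyval(coefs, x_test) - y_test) ** 2)))

# The degree with the lowest test error marks the minimum of the red curve.
best_degree = degrees[int(np.argmin(test_errors))]

for d, tr, te in zip(degrees, train_errors, test_errors):
    print(f"degree {d:2d}: train MSE={tr:.4f}  test MSE={te:.4f}")
print("lowest test error at degree", best_degree)
```

Degrees below `best_degree` correspond to the underfitting side of the plot, and degrees above it to the overfitting side. The training error never increases with the degree, since each polynomial family contains the previous one.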
An overfit model has high variance and low bias. We will look into this in the next section.
Bias-Variance Trade-off and The Optimal Model
Before talking about the bias-variance trade-off, let’s revisit these concepts briefly.
Bias refers to the simplifying assumptions a model makes so that the target function is easier to learn.
- Low Bias: The model makes fewer assumptions about the target function.
- High Bias: The model makes more assumptions about the target function.
Variance is the amount by which the estimate of the target function would change if different training data were used.
- Low Variance: Changes to the training dataset produce small changes in the estimate of the target function.
- High Variance: Changes to the training dataset produce large changes in the estimate of the target function.
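These two notions can be written down more formally. The decomposition below is a sketch under assumptions that this post does not state: the observations follow y = f(x) + ε with zero-mean noise of variance σ², the model f̂ is fitted on a randomly drawn training set, and the expectation is taken over training sets and noise. The symbols are ours. Under those assumptions, the bias term measures the systematic error introduced by the model's simplifying assumptions, and the variance term measures how much the fitted model moves from one training set to another.

```latex
\mathbb{E}\left[\left(y - \hat{f}(x)\right)^2\right]
  = \underbrace{\left(\mathbb{E}[\hat{f}(x)] - f(x)\right)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\left[\left(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\right)^2\right]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```

The last term does not depend on the model at all, which is why even a well-chosen model has a non-zero test error on noisy data.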
When our model is very simple, it assumes that the relationship between the features and the output is fairly simple, and therefore such a model has high bias. On the contrary, when our model makes few assumptions and uses everything the data features have to offer (and thereby becomes complex), it has low bias.
But as described in the preceding section, a complex model tends to overfit and thereby to have significant test errors associated with it.
A simple model has high bias and low variance, while a complex model has low bias and high variance. Because lowering one tends to raise the other, this problem is known as the bias-variance dilemma or bias-variance trade-off.
The following picture explains this situation.
The optimal model is the one at which the test error reaches its global minimum. Beyond this point, the model tends to overfit, and before this point, the model underfits.
In the picture above, the blue line depicts the training error as a function of model complexity. The red line represents the test error as a function of model complexity. As discussed earlier, a complex model has high variance and low bias and is called an overfit model. On the other hand, a simple model has low variance but high bias and is called an underfit model.
The following picture shows the bias-variance trade-off. The bull’s eye shown in red is what our model wants to achieve. To hit the bull’s eye, our model should have low bias and low variance. But in most cases, it is not feasible to minimize both of them together, and hence we have to trade off one against the other.
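The claim that simple models are high-bias/low-variance and complex models are low-bias/high-variance can be checked empirically by refitting the same model on many training sets. The sketch below is a rough illustration under several assumptions of ours: a known sine-shaped target, fixed training inputs with only the noise redrawn each time, and polynomial degree 1 versus degree 9 as the "simple" and "complex" models.

```python
import numpy as np

rng = np.random.default_rng(1)

def true_fn(x):
    # Hypothetical target function, known here only because the data is synthetic.
    return np.sin(np.pi * x)

x_train = np.linspace(-1.0, 1.0, 20)   # fixed training inputs
x_grid = np.linspace(-1.0, 1.0, 50)    # points where predictions are compared

def bias_variance(degree, n_repeats=200, noise=0.3):
    # Refit the same model on many noisy versions of the training set.
    preds = np.empty((n_repeats, x_grid.size))
    for r in range(n_repeats):
        y_train = true_fn(x_train) + rng.normal(0.0, noise, x_train.size)
        coefs = np.polyfit(x_train, y_train, degree)
        preds[r] = np.polyval(coefs, x_grid)
    mean_pred = preds.mean(axis=0)
    # Squared bias: how far the average prediction is from the truth.
    bias_sq = float(np.mean((mean_pred - true_fn(x_grid)) ** 2))
    # Variance: how much predictions move between training sets.
    variance = float(np.mean(preds.var(axis=0)))
    return bias_sq, variance

simple_bias_sq, simple_var = bias_variance(1)
complex_bias_sq, complex_var = bias_variance(9)

print(f"degree 1: bias^2={simple_bias_sq:.4f}  variance={simple_var:.4f}")
print(f"degree 9: bias^2={complex_bias_sq:.4f}  variance={complex_var:.4f}")
```

Keeping the training inputs fixed is a simplification that makes the estimates stable. With real data, we usually cannot compute the bias directly because the true function is unknown.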
Addressing The Issue of Overfitting
There are several techniques we can use to address the issue of overfitting.
- Cross-Validation – The idea of cross-validation is to generate multiple train-test splits, train the model on each training set, and evaluate it on the corresponding test set. This way, we can find the model that gives the lowest average test error. Cross-validation can also be used to tune the hyperparameters of the model and select the values that perform best on the held-out data.
- Reducing features – Some algorithms do the feature selection internally, while in others, we have to do it manually. We need to check which features are strongly related to the target and which variables are redundant.
- Regularization – A technique that reduces the magnitude of the features' effects by imposing a penalty on large coefficients, so that features which do not impact the target much contribute little. Some regularization techniques (like ridge/L2 regularization) keep all the features and shrink their coefficients toward small values, while others (like LASSO/L1 regularization) can set some coefficients to exactly zero.
- Training with more data – Training with more data gives the model more signal to learn from, but it doesn’t always work. If we are adding noisy data, then the model performance will deteriorate.
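Two of the techniques above, cross-validation and regularization, are often used together: cross-validation picks the strength of the penalty. Here is a minimal NumPy sketch of that combination. The synthetic data, the helper names `ridge_fit` and `cv_mse`, the candidate penalty values, and the choice of 5 folds are all assumptions made for illustration. In practice, a library such as scikit-learn would typically handle this.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical data: 40 samples, 15 features, only the first 3 matter.
n_samples, n_features = 40, 15
X = rng.normal(size=(n_samples, n_features))
true_w = np.zeros(n_features)
true_w[:3] = [3.0, -2.0, 1.5]
y = X @ true_w + rng.normal(0.0, 1.0, n_samples)

def ridge_fit(X, y, lam):
    # Closed-form ridge (L2) solution: (X^T X + lam * I)^(-1) X^T y.
    # No intercept term, since the synthetic data has none.
    identity = np.eye(X.shape[1])
    return np.linalg.solve(X.T @ X + lam * identity, X.T @ y)

def cv_mse(X, y, lam, k=5):
    # k-fold cross-validation: each fold is held out once for evaluation.
    # Rows are i.i.d. here, so contiguous folds are fine without shuffling.
    indices = np.arange(len(y))
    errors = []
    for fold in np.array_split(indices, k):
        train = np.setdiff1d(indices, fold)
        w = ridge_fit(X[train], y[train], lam)
        errors.append(np.mean((X[fold] @ w - y[fold]) ** 2))
    return float(np.mean(errors))

lambdas = [0.0, 0.1, 1.0, 10.0, 100.0]   # candidate penalty strengths
scores = {lam: cv_mse(X, y, lam) for lam in lambdas}
best_lam = min(scores, key=scores.get)

for lam in lambdas:
    print(f"lambda={lam:6.1f}  cross-validated MSE={scores[lam]:.3f}")
print("selected lambda:", best_lam)
```

A penalty of zero is plain least squares, which is free to chase noise in the 12 irrelevant features, while a very large penalty shrinks even the useful coefficients and underfits. Cross-validation is one way to choose a value between those extremes.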
In this post, we learned about:
- The problem of overfitting
- The causes of overfitting
- The consequences of overfitting
- The bias-variance trade-off and selection of the optimal model
- Ways to reduce overfitting in a model