The Overfitting Problem
In one of my previous posts, “The Overfitting Problem,” I discussed in detail the problem of overfitting: its causes, its consequences, and the ways to address it. In the same article, I also discussed the bias-variance trade-off and optimal model selection.
Recall that an overfit model fits the training data too well but fails to fit unseen data reliably. Such a model predicts or classifies future observations poorly.
Also recall from “The Overfitting Problem” that a complex model tends to overfit. In this post, with the help of the Auto-MPG dataset, we will try to understand how a complex model overfits. We will see that as model complexity increases, the mean squared error (MSE) on the test data stops improving while the MSE on the training data keeps decreasing. As a rule of thumb, the model complexity at which the test error reaches its global minimum is considered the optimal model. The source code for this exercise is included at the end of this post for your reference.
About the Auto-MPG Dataset
Description – Data on miles per gallon for a series of older automobiles, together with other information about each car, such as acceleration and horsepower.
Summary – This dataset summary was taken from UCI Machine Learning Repository.
This dataset was taken from the StatLib library, which is maintained at Carnegie Mellon University. The dataset was used in the 1983 American Statistical Association Exposition.
This dataset is a slightly modified version of the dataset provided in the StatLib library. In line with the use by Ross Quinlan (1993) in predicting the attribute “mpg”, 8 of the original instances were removed because they had unknown values for the “mpg” attribute. The original dataset is available in the file “auto-mpg.data-original”.
“The data concerns city-cycle fuel consumption in miles per gallon, to be predicted in terms of 3 multivalued discrete and 5 continuous attributes.” (Quinlan, 1993)
- mpg: continuous
- cylinders: multi-valued discrete
- displacement: continuous
- horsepower: continuous
- weight: continuous
- acceleration: continuous
- model year: multi-valued discrete
- origin: multi-valued discrete
- car name: string (unique for each instance)
Quinlan, R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.
Dataset Source – http://archive.ics.uci.edu/ml/datasets/Auto+MPG
Before framing our problem, let’s take a closer look at the dataset to understand the data better. A concise summary of the dataset is shown below. The dataset has 398 rows and 9 columns.
From the above summary, it appears that there are no null values in the dataset, as every column reports only non-null values. But there is a caveat. On closer inspection, we find 6 rows in which the horsepower column has the value “?”. In this dataset, a missing value is denoted by “?”. We will remove these rows from our data frame before going further.
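To make the “?” caveat concrete, here is a minimal sketch using pandas. It assumes a tiny inline sample with made-up placeholder rows (not rows copied from Auto-MPG) and a simplified two-column layout; the column names match the attribute list above, but how the real file is loaded may differ from what the notebook does.

```python
import io

import pandas as pd

# Tiny inline sample in the spirit of the Auto-MPG file. The values are
# illustrative placeholders, not rows copied from the dataset.
sample_csv = io.StringIO(
    "mpg,horsepower\n"
    "18.0,130\n"
    "25.0,?\n"
    "22.0,95\n"
    "21.0,?\n"
)

df = pd.read_csv(sample_csv)

# "?" is an ordinary string, so pandas counts no nulls in the column
# and reads it as text rather than as numbers.
null_count_before = int(df["horsepower"].isnull().sum())
question_marks = int((df["horsepower"] == "?").sum())

# Drop the rows marked "?" and convert the remaining values to numbers.
clean = df[df["horsepower"] != "?"].copy()
clean["horsepower"] = clean["horsepower"].astype(float)

print(null_count_before, question_marks, len(clean))
```

An alternative design is to pass `na_values="?"` to `pd.read_csv` and then call `dropna`, which lets pandas treat the marker as missing from the start.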
Framing the Machine Learning problem
To understand overfitting in this tutorial, we will frame our problem as follows.
- How is mpg related to horsepower? Let mpg be Y and horsepower be X; then our problem becomes Y = f(X, X^2, X^3, …) + C.
- This is a regression problem, and our goal is to predict the miles per gallon of a new vehicle given its power in units of horsepower.
- Here we have assumed that the mpg of a vehicle depends on various powers (polynomial degrees) of horsepower.
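The framing above can be sketched in a few lines of NumPy. This is only an illustration, assuming that f is linear in its coefficients (ordinary polynomial regression), so that Y = C + b1*X + b2*X^2 + … can be fitted by least squares. The helper names `polynomial_design_matrix` and `fit_polynomial` are hypothetical, and the toy points are generated from a known quadratic rather than taken from Auto-MPG.

```python
import numpy as np


def polynomial_design_matrix(x, degree):
    """Return the columns [X, X^2, ..., X^degree] for a 1-D array x."""
    x = np.asarray(x, dtype=float)
    return np.column_stack([x ** p for p in range(1, degree + 1)])


def fit_polynomial(x, y, degree):
    """Least-squares fit of Y = C + b1*X + ... + b_degree*X^degree."""
    features = polynomial_design_matrix(x, degree)
    # Prepend a column of ones so the first coefficient is the intercept C.
    design = np.column_stack([np.ones(len(features)), features])
    coeffs, *_ = np.linalg.lstsq(design, np.asarray(y, dtype=float), rcond=None)
    return coeffs  # ordered as [C, b1, ..., b_degree]


# Toy check: points generated exactly from y = 1 + 2x + 3x^2.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 1 + 2 * x + 3 * x ** 2
coeffs = fit_polynomial(x, y, degree=2)
print(coeffs)
```

Raising the `degree` argument adds more columns to the design matrix, which is exactly what “increasing model complexity” means in this post.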
Before we start building our model, let’s make the following assumptions.
- We will use a fixed training set (60% of the data) and a fixed test set (40% of the data)
- We will train different polynomial regression models on the same training set
In the code provided in the last section, note that a fixed value of random_state is used. If you vary random_state, you may get different results from mine. The idea behind a fixed random_state and a single training and test set is to evaluate how models of different complexities behave on the same data. The following figure shows polynomial regression fits for various degrees. You can see intuitively that beyond the 2nd degree there is no visible improvement in the fit.
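As a rough illustration of why random_state matters, the sketch below uses scikit-learn's `train_test_split` with a 60/40 split. The data is a synthetic stand-in for (horsepower, mpg), and the seed values 42 and 7 are arbitrary choices for the example, not the values used in the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for (horsepower, mpg); these are NOT Auto-MPG values.
rng = np.random.default_rng(0)
horsepower = rng.uniform(50, 200, size=100)
mpg = 40 - 0.15 * horsepower + rng.normal(0, 2, size=100)


def split(seed):
    """60% training / 40% test split controlled by a fixed seed."""
    return train_test_split(horsepower, mpg, test_size=0.4, random_state=seed)


x_train, x_test, y_train, y_test = split(seed=42)
x_train_again, *_ = split(seed=42)  # same seed: identical split
x_train_other, *_ = split(seed=7)   # different seed: different rows

print(len(x_train), len(x_test))
print(np.array_equal(x_train, x_train_again))
print(np.array_equal(x_train, x_train_other))
```

Because the same seed always reproduces the same rows, every polynomial degree is trained and evaluated on identical data, so differences in error come from the model and not from the split.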
Model Evaluation On Training And Test Sets
In the previous section, we modeled mpg as a polynomial function of horsepower for degrees 1 to 10. Once a model is trained, we predict mpg for the test set from its horsepower values. Note that the test data is unseen by the model, and we already know its actual mpg values. We can therefore compare the predictions with the actual values and evaluate model performance using an evaluation metric (mean squared error in this example). In the following figure, we have plotted the MSE on the training data and on the test data for each model.
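The evaluation loop described above might look like the following sketch with scikit-learn. Several things here are assumptions rather than the notebook's actual code: the data is synthetic (generated from a noisy quadratic), the seed is arbitrary, and a `StandardScaler` step is added before `PolynomialFeatures` purely to keep high powers of horsepower numerically well behaved.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in for (horsepower, mpg); these are NOT Auto-MPG values.
rng = np.random.default_rng(0)
x = rng.uniform(50, 200, size=200)
y = 60 - 0.4 * x + 0.001 * x ** 2 + rng.normal(0, 3, size=200)
X = x.reshape(-1, 1)  # scikit-learn expects a 2-D feature array

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0
)

train_mse, test_mse = {}, {}
for degree in range(1, 11):
    # Scale first, then expand to powers 1..degree, then fit linearly.
    model = make_pipeline(
        StandardScaler(),
        PolynomialFeatures(degree, include_bias=False),
        LinearRegression(),
    )
    model.fit(X_train, y_train)
    train_mse[degree] = mean_squared_error(y_train, model.predict(X_train))
    test_mse[degree] = mean_squared_error(y_test, model.predict(X_test))

for degree in train_mse:
    print(degree, round(train_mse[degree], 3), round(test_mse[degree], 3))
```

Plotting the two dictionaries against degree gives the kind of training-versus-test error curve discussed in this post; the exact numbers depend on the data and the split.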
The Problem Of Overfitting And The Optimal Model
As you can see in the above figure, as we increase the complexity of the model, the training MSE keeps decreasing. This means that the model performs well on the data it has already seen. On the other hand, there is no improvement in the test MSE (the error on data the model has not seen). This situation, where the training MSE keeps decreasing but the test MSE does not decrease (and may instead increase), is a typical characteristic of an overfit model. The model complexity at which the test error reaches its global minimum is considered the optimal point, or the point of bias-variance trade-off (read “The Overfitting Problem” for details). From the above figure, it is clear that increasing model complexity beyond the 2nd degree doesn’t improve the test MSE. We will look at model selection in other posts, but for now we can say that the 2nd-degree model is the optimal model for predicting mpg. Beyond the 2nd degree, the model overfits.
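The selection rule in the paragraph above can be written down as a small helper. This is a sketch of the rule of thumb only, assuming we already have dictionaries that map each polynomial degree to its training and test MSE; the function names `best_degree` and `overfit_degrees` are hypothetical.

```python
def best_degree(test_mse_by_degree):
    """Return the degree with the smallest test MSE (lowest degree wins ties)."""
    return min(sorted(test_mse_by_degree), key=lambda d: test_mse_by_degree[d])


def overfit_degrees(train_mse_by_degree, test_mse_by_degree):
    """Return degrees above the best one whose training MSE is lower than
    the training MSE at the best degree.

    By definition of the best degree, their test MSE is no better, so a
    lower training MSE there is the overfitting pattern described above.
    """
    best = best_degree(test_mse_by_degree)
    return [
        d
        for d in sorted(test_mse_by_degree)
        if d > best and train_mse_by_degree[d] < train_mse_by_degree[best]
    ]
```

Preferring the lowest degree on ties reflects the usual preference for the simpler model when two models perform equally well on unseen data.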
Example Jupyter Notebook
This section includes the Jupyter notebook used for this post.
In this post, we:
- Reviewed “The Overfitting Problem”
- Analysed the Auto-MPG dataset
- Defined a regression problem to model mpg with horsepower
- Modelled polynomial regressions of various degrees
- Predicted mpg and plotted the MSE on test and training data
- Explained the test and training MSE and the issue of overfitting
- Explained the code for the above points in the Jupyter notebook