A machine learning algorithm learns from data by fitting parametric or non-parametric models. However, no single model is universally applicable to all kinds of data and problems. An improper choice of model may lead to incorrect predictions or classifications and, consequently, to flawed conclusions. An essential step in a data science process is therefore to consider a set of candidate models and then select the most appropriate one for the given task.
Recall that Model Selection is the process of choosing, from a set of candidate models, the one that will be used in the production setting.
In this post, we discuss the objectives of Model Selection, its various methods, and the best practices to follow when doing Model Selection.
Objectives of Model Selection
When solving a Machine Learning problem, we may narrow the field down to several candidate models. We may further be interested in selecting:
- The best choice among various ML algorithms (e.g., Logistic regression, support vector machine, neural networks, etc.)
- Variables for linear regression
- Basis terms such as polynomials, splines, or wavelets in function estimation
- Most appropriate parametric family among several alternatives
What should we keep in mind while doing so, in order to select the best model? The two primary criteria for model selection are prediction accuracy and model interpretability, described below.
- Prediction Accuracy – One of the main objectives of Model Selection in Machine Learning is to find a model with the highest prediction accuracy. It can be measured in terms of MSE/Misclassification Error depending upon whether the target variable is quantitative or qualitative, respectively.
- Model Interpretability – A highly complex model with too many predictors not only introduces The Overfitting Problem but is also difficult to interpret. An appropriate model eliminates irrelevant variables, which makes it both simpler and more accurate.
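The two accuracy measures named above can be sketched in a few lines. This is only an illustration: the function names and the toy values are made up here and do not come from any particular library.

```python
def mean_squared_error(y_true, y_pred):
    """Average squared difference; suited to a quantitative target."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def misclassification_error(y_true, y_pred):
    """Fraction of wrongly predicted labels; suited to a qualitative target."""
    return sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy values, invented purely for illustration.
mse = mean_squared_error([3.0, 5.0, 2.0], [2.0, 5.0, 4.0])
err = misclassification_error(
    ["cat", "dog", "dog", "cat"],
    ["cat", "dog", "cat", "cat"],
)
```

In both cases a lower value means better prediction accuracy, so candidate models can be ranked by whichever measure matches the type of the target variable.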
Methods of Model Selection
If a set of competing models is given, which model is the most appropriate one? How may we answer this question?
A good model selection technique balances prediction accuracy against simplicity. Usually, we aim to find the model that works best on the test dataset. However, a designated test set is often not available while we are building a predictive model. To address this problem, two conventional approaches are used to estimate the test error.
- Analytic Methods – We can indirectly estimate the test error by adjusting the training error to account for the bias due to overfitting. In this group of methods, the training error is calculated first, and then a penalty is added to it to estimate the test error.
- Resampling Methods – We can directly estimate the test error using resampling methods. The model is fit on one part of the data and validated on the complementary part, and the validation error is recorded for each iteration. This process is repeated multiple times, and the mean validation error is taken as the estimate of the test error.
We briefly discuss both approaches below.
In a typical machine learning model, the training error usually underestimates the testing error. This is because the model is trained to minimize the training mean squared error (MSE). As we include more and more variables in the model, the training error continually decreases, but the testing error may not. This behaviour is shown in Figure 1 and is known as “The Overfitting Problem.”
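The effect can be reproduced with a small simulation. The sketch below assumes NumPy and uses synthetic data invented for this example: only the first two of thirty predictors actually influence the target, and nested least squares models are fit with a growing number of variables.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, n_vars = 40, 1000, 30


def make_data(n):
    """Synthetic data: only the first two predictors carry signal."""
    X = rng.normal(size=(n, n_vars))
    y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(size=n)
    return X, y


def design(X, k):
    """Intercept column plus the first k predictors."""
    return np.column_stack([np.ones(len(X)), X[:, :k]])


X_train, y_train = make_data(n_train)
X_test, y_test = make_data(n_test)

train_mse, test_mse = [], []
for k in range(n_vars + 1):
    # Least squares fit using the first k predictors only.
    beta, *_ = np.linalg.lstsq(design(X_train, k), y_train, rcond=None)
    train_mse.append(float(np.mean((y_train - design(X_train, k) @ beta) ** 2)))
    test_mse.append(float(np.mean((y_test - design(X_test, k) @ beta) ** 2)))

for k in (0, 2, 10, 30):
    print(f"k={k:2d}  train MSE={train_mse[k]:.3f}  test MSE={test_mse[k]:.3f}")
```

Because the models are nested, the training MSE cannot increase as k grows. The test MSE has no such guarantee, and in a setup like this one it typically stops improving once the informative predictors are included.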
As a result, the training error on its own is not a reliable guide to the prediction performance of the model. However, several techniques are available for adjusting the training error for the model size.
These techniques add a penalty term to the training error to estimate the testing error.
The benefit of these techniques is that they do not require a designated test set. The most important ones are AIC, BIC, Cp, and adjusted R squared. The formulas for AIC, BIC, and Cp presented here are for a linear model fit using least squares; however, these quantities can also be defined for more general types of models. In these formulas, RSS is the residual sum of squares, n is the sample size, and k is the number of variables in the model.
As k increases, so does the penalty added to the error in AIC and BIC. Since log n > 2 for any n > 7, the BIC statistic generally places a heavier penalty on models with many variables than AIC does. The model with the lowest AIC, BIC, or Cp, or the highest adjusted R squared, is considered the best among the candidate models.
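As a rough sketch, the four criteria can be computed as follows for nested least squares models. The code assumes one common textbook form, which may differ from the original formulas by constant factors: Cp = AIC = (RSS + 2kσ̂²)/n, BIC = (RSS + log(n)·k·σ̂²)/n, and adjusted R squared = 1 − [RSS/(n − k − 1)] / [TSS/(n − 1)]. Here σ̂² is an estimate of the error variance taken from the model with all predictors, and TSS is the total sum of squares. The helper names and the synthetic data are hypothetical.

```python
import math

import numpy as np


def rss_of_fit(X, y):
    """Residual sum of squares of a least squares fit with an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return float(np.sum((y - A @ beta) ** 2))


def analytic_criteria(X, y, k, sigma2_hat):
    """Criteria for the model that uses the first k columns of X."""
    n = len(y)
    rss = rss_of_fit(X[:, :k], y)
    tss = float(np.sum((y - y.mean()) ** 2))
    return {
        "Cp": (rss + 2 * k * sigma2_hat) / n,
        "AIC": (rss + 2 * k * sigma2_hat) / n,  # same as Cp in this form
        "BIC": (rss + math.log(n) * k * sigma2_hat) / n,
        "adj_R2": 1 - (rss / (n - k - 1)) / (tss / (n - 1)),
    }


# Synthetic data (an assumption): only the first two predictors matter.
rng = np.random.default_rng(1)
n, p = 200, 6
X = rng.normal(size=(n, p))
y = 1.5 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=n)

# Error variance estimated from the full model with all p predictors.
sigma2_hat = rss_of_fit(X, y) / (n - p - 1)

scores = {k: analytic_criteria(X, y, k, sigma2_hat) for k in range(p + 1)}
best_by_bic = min(scores, key=lambda k: scores[k]["BIC"])
best_by_adj_r2 = max(scores, key=lambda k: scores[k]["adj_R2"])
```

Note that no test set appears anywhere in this computation: every quantity comes from the training data, which is what makes these methods cheap.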
We have discussed Resampling Methods in detail in an earlier post. There are mainly two categories of resampling methods, as listed below.
The most widely used resampling technique in model selection is cross-validation. Using cross-validation, we can estimate the performance of each candidate model for a given problem. Then, we can select the model for which the estimated test error is smallest. In The Applications of Cross-Validation, we have shown an example of model selection using 10-fold cross-validation.
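A minimal sketch of this selection procedure is shown below. It is not the example from the earlier post: the candidate models (polynomials of degree 1 to 6), the synthetic data, and the helper name kfold_mse are all assumptions made for illustration, and NumPy is assumed to be available.

```python
import numpy as np


def kfold_mse(x, y, degree, n_folds=10, seed=0):
    """Mean validation MSE of a polynomial of the given degree over k folds."""
    idx = np.random.default_rng(seed).permutation(len(x))
    folds = np.array_split(idx, n_folds)
    errors = []
    for i, val_idx in enumerate(folds):
        # Train on every fold except the i-th, validate on the i-th.
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        coefs = np.polyfit(x[train_idx], y[train_idx], degree)
        pred = np.polyval(coefs, x[val_idx])
        errors.append(np.mean((y[val_idx] - pred) ** 2))
    return float(np.mean(errors))


# Synthetic data (an assumption): the true relationship is quadratic.
rng = np.random.default_rng(42)
x = rng.uniform(-3, 3, size=150)
y = 1.0 - 2.0 * x + 0.5 * x**2 + rng.normal(scale=1.0, size=150)

candidates = [1, 2, 3, 4, 5, 6]
cv_errors = {d: kfold_mse(x, y, d) for d in candidates}
best_degree = min(cv_errors, key=cv_errors.get)

for d in candidates:
    print(f"degree={d}  10-fold CV MSE={cv_errors[d]:.3f}")
```

Using the same seed for every candidate means all models are evaluated on identical folds, so differences in the estimated test error reflect the models rather than the random split.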
This procedure has the following advantages relative to AIC, BIC, Cp, and adjusted R².
- It provides a direct estimate of the test error
- It makes fewer assumptions about the true underlying model
- It can also be used in a wider range of model selection tasks, even in cases where it is hard to pinpoint the model degrees of freedom
- It can be used where it is hard to estimate the error variance σ²
The Best Practices for Model Selection
Some general recommendations and best practices that are widely followed in the data science community are listed below for reference.
- Keep in mind the objectives of model selection.
- Cross-Validation is the most attractive method for model selection.
- 5-fold or 10-fold cross-validation fares well in the majority of cases.
- For simple linear models with a large number of predictors (p) and a large sample size (n), analytic methods perform as well as resampling methods and are computationally inexpensive.
In this post, we learned the concept of model selection and why it is crucial. We also discussed the objectives of a model selection process. Then, we briefly discussed two different categories of model selection techniques. Lastly, we put forward some best practices to follow when doing model selection. You may consult the following resources to dive further into the topic of model selection.