In Resampling Methods, we discussed the concept of cross-validation in detail. Recall that cross-validation is a resampling technique for model evaluation: multiple models are fit on subsets of the data, and each model is evaluated on the complementary subset. We also discussed the various types of cross-validation and concluded that, despite some limitations, K-fold cross-validation is the most popular cross-validation technique in the data science community.
In this post, we will explore the applications of cross-validation. These applications fall broadly into three contexts:
- Performance Estimation
- Model Selection
- Hyperparameter Tuning
We will look at each of these contexts in detail, along with their implementation in Python.
Cross-Validation can be used to estimate the performance of a learning algorithm.
In a regression problem, estimates of Mean Squared Error (MSE), Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), etc. are used as performance indicators for an algorithm. In a classification setting, on the other hand, one may be interested in estimates of accuracy, precision, recall, or an F-score to evaluate model performance.
One can use K-fold cross-validation to train a model on k−1 folds and validate it on the remaining fold to obtain an estimate of the test error. The process is repeated k times, with each of the k subsamples used exactly once as the validation data. The k estimates of test error so obtained are then averaged to produce a single estimate.
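This procedure can be sketched with scikit-learn's KFold splitter. The data and model below are placeholders (a synthetic linear relationship and plain LinearRegression), just to make the fold-by-fold loop concrete:

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic stand-in data: y is roughly linear in x with some noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 3 * X.ravel() + rng.normal(scale=2, size=100)

kf = KFold(n_splits=10, shuffle=True, random_state=0)
fold_errors = []
for train_idx, val_idx in kf.split(X):
    # Fit on k-1 folds, validate on the held-out fold.
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    preds = model.predict(X[val_idx])
    fold_errors.append(mean_squared_error(y[val_idx], preds))

# Average the k per-fold errors into a single test-error estimate.
avg_mse = np.mean(fold_errors)
print(avg_mse)
```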
The consensus in the data science community is to take k as 10. With 10-fold cross-validation, one repeatedly uses 90% of the data to build a model and tests its accuracy on the remaining 10%. The resulting average test error tends to be a slight overestimate of the true test error, since each model sees only 90% of the data rather than all of it. Still, in most cases this estimate is reliable, particularly if the amount of training data is sufficiently large and the unseen data follows the same distribution as the training data.
Let’s take an example. In the classic Auto-MPG dataset, our goal is to predict a vehicle’s miles per gallon (mpg) from its power, measured in horsepower. Let’s create two simple regression models: a linear model where mpg depends on horsepower linearly, and a polynomial regression model where mpg depends on horsepower as well as horsepower^2. Applying ten-fold cross-validation, we evaluated the average MSE of the two models to be 27.43 and 21.23, respectively. In this simplistic example, the polynomial regression model of degree 2 clearly performs better.
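A sketch of this comparison follows. The Auto-MPG loading step is omitted, so synthetic data stands in for horsepower and mpg here; the resulting MSE values will therefore differ from the 27.43 and 21.23 quoted above:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Stand-in for Auto-MPG: mpg falls off nonlinearly with horsepower.
rng = np.random.default_rng(42)
horsepower = rng.uniform(50, 230, size=(300, 1))
mpg = (60 - 0.3 * horsepower.ravel()
       + 0.0006 * horsepower.ravel() ** 2
       + rng.normal(scale=2, size=300))

mses = {}
for degree in (1, 2):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    # scoring returns negative MSE, so negate to get the average MSE.
    mses[degree] = -cross_val_score(model, horsepower, mpg, cv=10,
                                    scoring="neg_mean_squared_error").mean()
    print(f"degree {degree}: average 10-fold MSE = {mses[degree]:.2f}")
```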
When trying to solve a machine learning problem, we explore different algorithms that can solve the given problem. The goal here is to find the best algorithm that solves our problem well.
Model selection is the process of selecting, out of various candidate models, the best model to be used in a production setting.
There may be various criteria (accuracy, complexity, running time, etc.) for selecting a model, but one of the most important is model performance.
Using cross-validation, one can easily estimate the performance of each candidate model for a given problem. Once the performance of each model is available, we can make a judgment keeping other constraints (model complexity, flexibility, interpretability, etc.) in mind.
In the classic Auto-MPG dataset, our goal is again to predict miles per gallon from the power of the vehicle, in horsepower. Let’s create three simple regression models:
1. A linear regression model, where y = f(x) + c
2. A polynomial regression model of degree 2, where y = f(x, x^2) + c
3. A polynomial regression model of degree 3, where y = f(x, x^2, x^3) + c
Here, x is the horsepower and y is the mpg.
We applied ten-fold cross-validation and evaluated the average MSE of the three models to be 27.43, 21.23, and 21.33, respectively. Assume these three models are the candidates for our problem. Then, keeping model performance, model complexity, and interpretability in mind, model 2 (polynomial regression of degree 2) is a rational choice.
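The selection step above can be sketched as a loop over candidate models, picking the one with the lowest cross-validated MSE. As before, synthetic stand-in data replaces the Auto-MPG loading step, so the scores will not match the quoted 27.43, 21.23, and 21.33:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for horsepower (x) and mpg (y).
rng = np.random.default_rng(7)
x = rng.uniform(50, 230, size=(300, 1))
y = (60 - 0.3 * x.ravel() + 0.0006 * x.ravel() ** 2
     + rng.normal(scale=2, size=300))

# Candidate models: polynomial regressions of degree 1, 2, and 3.
candidates = {
    f"degree {d}": make_pipeline(PolynomialFeatures(d), LinearRegression())
    for d in (1, 2, 3)
}
scores = {name: -cross_val_score(m, x, y, cv=10,
                                 scoring="neg_mean_squared_error").mean()
          for name, m in candidates.items()}

# On performance alone, the lowest average MSE wins; other criteria
# (complexity, interpretability) are weighed separately.
best = min(scores, key=scores.get)
print(best, scores)
```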
In machine learning models, there are some parameters that are external to the model. The values of these parameters must be provided before the learning process begins; such parameters are called hyperparameters. Typical examples include C, kernel, and gamma for a Support Vector Classifier, alpha for Lasso, etc.
There are many possible candidate values for the hyperparameters, and we want to select those that result in the best model performance. Using cross-validation, we can search over the collection of hyperparameters and look for the best cross-validation score. Figure 1 illustrates the hyperparameter tuning process.
Scikit-learn provides two generic approaches to sampling search candidates: GridSearchCV exhaustively considers all combinations of given parameter values, while RandomizedSearchCV samples a given number of candidates from a parameter space with a specified distribution.
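As a brief illustration of the randomized variant, the sketch below samples C and gamma for an SVC from log-uniform distributions; the dataset (iris) and the distribution bounds are illustrative choices, not a recommendation:

```python
from scipy.stats import loguniform
from sklearn.datasets import load_iris
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Sample 10 candidates from continuous distributions instead of a fixed grid.
param_distributions = {
    "C": loguniform(1e-2, 1e2),
    "gamma": loguniform(1e-4, 1e0),
}
search = RandomizedSearchCV(SVC(), param_distributions, n_iter=10,
                            cv=5, random_state=0)
search.fit(X, y)

print(search.best_params_)  # best sampled combination
print(search.best_score_)   # its mean cross-validation accuracy
```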
In this example, adapted from the scikit-learn documentation, hyperparameter tuning is performed with GridSearchCV, which fits every possible combination of parameter values on the dataset and retains the best one.
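A minimal GridSearchCV sketch in that spirit follows. The parameter grid and the iris dataset are illustrative choices here, not the exact example from the documentation:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Every combination of these values is fitted and cross-validated.
param_grid = {
    "C": [0.1, 1, 10],
    "kernel": ["linear", "rbf"],
    "gamma": ["scale", 0.1],
}

search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_)  # combination with the best cross-validation score
print(search.best_score_)
```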
In this article, we have shown the various applications of cross-validation and explained them through examples. You may explore the following resources to dive deeper into cross-validation.