Statistics For Data Science Course

Resampling: Cross-Validation Techniques

What is Resampling?

Resampling is a technique in which we repeatedly draw samples from the available training data and refit the model of interest to each of these samples, in order to obtain additional information about the model.

For example, suppose we want to build a regression model to predict a target variable. If we take different samples from the available data, each sample will produce a slightly different fitted model. We can then analyze the variability across these models and obtain information that a single training sample could not provide.

Because resampling involves repeatedly drawing samples and refitting the model, these methods are computationally expensive. There are two main categories of resampling methods, as listed below.

  1. Cross Validation
  2. Bootstrapping

In this post, we will briefly discuss cross-validation, why it is needed, and the various types of cross-validation techniques. The code used to illustrate the concepts is included later in this post as a Jupyter notebook.

What is Cross-Validation?

Cross-validation is a resampling technique for model evaluation in which multiple models are fit on subsets of the data and each fitted model is evaluated on the complementary subset.

An optimal model should generalize well and have a low test error. In real scenarios, a designated test data set is usually not available. In cross-validation, the data set is divided into two subsets: a training set and a validation (test) set. A model is trained on the training set and evaluated against the validation set. Different models are built using different samples of the training data and are evaluated against the complementary subsets, which gives an idea of how the model will perform on unseen data. Problems such as overfitting and the bias-variance tradeoff can be diagnosed using cross-validation.


Figure 1. The idea behind cross-validation. Source: https://docs.aws.amazon.com/machine-learning/latest/dg/cross-validation.html

Cross-validation is mainly used in the following scenarios.

  1. Model assessment – to estimate the test error associated with a given ML method in order to judge its performance.
  2. Model selection – to select an appropriate level of flexibility or suitable hyperparameters.

Types of Cross-Validation

Cross-validation is mainly categorized into the following types.

  1. The Validation Set approach
  2. Leave One Out Cross Validation (LOOCV)
  3. K Fold Cross Validation

The Validation Set Approach

The validation set approach is a very simple strategy in which the available data is randomly divided into two parts: a training set and a validation set (also called a holdout set). The model is trained on the training set, and the fitted model is used to predict or classify the target values in the validation set. The model’s performance is then measured in terms of the MSE or the misclassification error, depending on whether the target variable is quantitative or qualitative, respectively. The validation set approach is shown in Figure 2 below.


Figure 2. A schematic display of the validation set approach. A set of n observations is randomly split into a training set (shown in blue, containing observations 7, 22, and 13, among others) and a validation set (shown in beige, containing observation 91, among others). The statistical learning method is fit on the training set, and its performance is evaluated on the validation set. Source: James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical Learning (Vol. 112, p. 177). New York: Springer.

In the Jupyter notebook included later in this post, we have applied the validation set approach to the Auto-MPG dataset to estimate the test error that results from predicting mpg using polynomial functions of horsepower. Figure 3 shows the test error estimates for a single split into training and validation data sets.
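Below is a minimal sketch of this kind of experiment using scikit-learn. It assumes the Auto-MPG data is available locally as a CSV file named auto_mpg.csv with horsepower and mpg columns; the file name and column names are assumptions, and the actual notebook code may differ.

```python
# Validation set approach on the Auto-MPG data (sketch; assumptions noted above).
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

auto = pd.read_csv("auto_mpg.csv")          # hypothetical file name
X = auto[["horsepower"]].values
y = auto["mpg"].values

# One random 50/50 split into a training set and a validation (holdout) set.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.5, random_state=1)

for degree in range(1, 6):
    # Polynomial regression of mpg on horsepower, fit on the training half only.
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    # The MSE on the held-out half is the validation estimate of the test error.
    val_mse = mean_squared_error(y_val, model.predict(X_val))
    print(f"degree {degree}: validation MSE = {val_mse:.2f}")
```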


Figure 3. The validation set approach was used on the Auto data set in order to estimate the test error that results from predicting mpg using polynomial functions of horsepower. Validation error estimates for a single split into training and validation data sets.

In Figure 4, we have repeated the validation set approach ten times, each time using a different random split of observations into a training set and validation set.
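The repetition itself can be sketched in a few lines (self-contained, under the same assumed auto_mpg.csv file as above): repeat the split with ten different random seeds and watch the validation MSE change from run to run.

```python
# Repeating the validation split to see the variability of the error estimate.
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

auto = pd.read_csv("auto_mpg.csv")          # hypothetical file name
X, y = auto[["horsepower"]].values, auto["mpg"].values

for seed in range(10):
    # A different random_state gives a different training/validation split.
    X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.5, random_state=seed)
    model = make_pipeline(PolynomialFeatures(2), LinearRegression()).fit(X_tr, y_tr)
    mse = mean_squared_error(y_va, model.predict(X_va))
    print(f"split {seed}: validation MSE = {mse:.2f}")
```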


Figure 4. The validation set approach was used on the Auto data set to estimate the test error that results from predicting mpg using polynomial functions of horsepower. The validation method was repeated ten times, each time using a different random split of the observations into a training set and a validation set. This illustrates the variability in the estimated test MSE that results from this approach.

The validation set approach is easy to implement and conceptually simple but suffers from two major drawbacks.

  1. The validation estimate of the test error is highly variable, as can be seen in Figure 4: it depends strongly on which observations happen to fall into the training set and which fall into the validation set.
  2. Because the model is trained on only a subset of the observations, the validation estimate tends to overestimate the test error that the model would have if it were trained on the entire data set.

Leave One Out Cross Validation

Leave-one-out cross-validation (LOOCV) is closely related to the validation set approach but tries to address its drawbacks. In LOOCV, the available data is divided into two parts such that a single observation is kept aside for validation: if there are n observations, n − 1 observations are used for training and the one remaining observation is used for validation.

This process is repeated so that every observation is used for validation exactly once; hence, for n observations, the model is trained n times. The first time, the model is trained on n − 1 observations and tested on the single held-out observation, giving a test error denoted MSE1. Repeating this yields MSE2, MSE3, ..., MSEn.

The LOOCV estimate for the test MSE is the average of these n test error estimates.

$$\mathrm{CV}_{(n)} = \frac{1}{n}\sum_{i=1}^{n} \mathrm{MSE}_i$$

A schematic of the LOOCV is illustrated in Figure 5.


Figure 5. Illustration of leave-one-out cross-validation (LOOCV) when n = 8 observations. A total of 8 models will be trained and tested; in each, a single observation is used for testing and the remaining 7 observations are used for training. Source: MBanuelos22, own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=87684543

LOOCV has some advantages over the validation set approach.

  1. It has less bias, since each model is trained on nearly all (n − 1) of the observations.
  2. The LOOCV approach tends not to overestimate the test error rate as much as the validation set approach does.
  3. In contrast to the validation approach, which will yield different results when applied repeatedly due to randomness in the training/validation set splits, performing LOOCV multiple times will always yield the same results: there is no randomness in the training/validation set splits.

LOOCV is used on the Auto-MPG dataset to obtain an estimate of the test set MSE that results from fitting a linear regression model to predict mpg using polynomial functions of horsepower. The results are shown in Figure 6.
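A minimal LOOCV sketch with scikit-learn follows, again assuming the hypothetical auto_mpg.csv file with horsepower and mpg columns; the actual notebook may organize this differently.

```python
# LOOCV estimate of the test MSE for polynomial regressions of mpg on horsepower.
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

auto = pd.read_csv("auto_mpg.csv")          # hypothetical file name
X, y = auto[["horsepower"]].values, auto["mpg"].values

for degree in range(1, 6):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    # One model is fit per observation; each score is the (negative) squared
    # error on the single left-out observation.
    scores = cross_val_score(model, X, y, cv=LeaveOneOut(),
                             scoring="neg_mean_squared_error")
    print(f"degree {degree}: LOOCV MSE = {-scores.mean():.2f}")
```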


Figure 6. Cross-validation was used on the Auto data set to estimate the test error that results from predicting mpg using polynomial functions of horsepower. The LOOCV error curve.

K Fold Cross Validation

In k-fold cross-validation, the original sample is randomly partitioned into k different equal-sized subsamples. Of the k subsamples, a single subsample is retained as the validation data for testing the model, and the remaining k −1 subsamples are used as training data. The cross-validation process is then repeated k times, with each of the k subsamples used exactly once as the validation data. The k results can then be averaged to produce a single estimation.

The mean squared error, MSE, is computed on the observations in the held-out fold. This process results in k estimates of the test error, MSE1, MSE2,…, MSEk. The k-fold CV estimate is computed by averaging these values,

$$\mathrm{CV}_{(k)} = \frac{1}{k}\sum_{i=1}^{k} \mathrm{MSE}_i$$

K-fold cross-validation is illustrated in Figure 7.


Figure 7. Illustration of k-fold cross-validation when n = 12 observations and k = 3. After the data is shuffled, a total of 3 models will be trained and tested. Source: MBanuelos22, own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=87684542

Figure 8 displays nine different 10-fold CV estimates for the Auto data set, each resulting from a different random split of the observations into ten folds. As we can see from the figure, there is some variability in the CV estimates as a result of the variability in how the observations are divided into ten folds. But this variability is typically much lower than the variability in the test error estimates that results from the validation set approach, as in Figure 4.
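A sketch of the kind of experiment behind Figure 8 using scikit-learn's KFold is shown below, under the same assumptions about the data file as in the earlier sketches; a quadratic in horsepower is used purely for illustration.

```python
# Nine runs of 10-fold CV, each with a different random partition into ten folds.
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

auto = pd.read_csv("auto_mpg.csv")          # hypothetical file name
X, y = auto[["horsepower"]].values, auto["mpg"].values
model = make_pipeline(PolynomialFeatures(2), LinearRegression())

for run in range(9):
    # shuffle=True with a different seed gives a different split into ten folds.
    kf = KFold(n_splits=10, shuffle=True, random_state=run)
    scores = cross_val_score(model, X, y, cv=kf, scoring="neg_mean_squared_error")
    print(f"run {run}: 10-fold CV MSE = {-scores.mean():.2f}")
```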

 


Figure 8. 10-fold CV was run nine separate times, each with a different random split of the data into ten parts. The figure shows the nine slightly different CV error curves.

K-fold cross-validation has properties similar to those of LOOCV but is less computationally intensive: with k = 10, the model is refit only ten times rather than n times. As k increases, each training fold contains a larger share of the data, so the bias of the test error estimate is reduced; however, the variance of the estimate tends to increase as k approaches n, which is why k = 5 or k = 10 is a common compromise between bias and variance.

 

Example Jupyter Notebook

Summary

In this post we

  • Introduced the concept of resampling and its types
  • Discussed Cross-Validation and its need
  • Discussed the various types of Cross-Validation
  • Explained the cross-validation code in the Jupyter notebook
