### Introduction

Since ancient times, humankind has avidly sought ways to predict the future. One of the best-known examples is the Oracle of Delphi, who dispensed previews of the future to her petitioners in the form of divinely inspired prophecies<sup>1</sup>. Today the desire to know the future is still of interest to many of us, even if my feeling is that the increasing pace of technological innovation we observe every day has somewhat lessened this instinct: things that a few years ago seemed futuristic are now available to the masses (e.g. the World Wide Web).

Business decision making is one of the many areas of human activity where predictions are in high demand. The tools for formulating predictions about quantities of interest are commonly known as predictive analytics, which is itself an essential part of data science. At the heart of any prediction there is always a model, which typically depends on some unknown structural parameters (e.g. the coefficients of a regression model) as well as on one or more tuning parameters (e.g. the number of basis functions in a smoothing spline or the degree of a polynomial). The former are commonly estimated using a sample of data, while the latter have to be chosen so that the model provides predictions that are accurate enough. Tuning parameters usually regulate the model complexity and hence are a key ingredient of any predictive task. In this blog entry we focus on the most common strategy for choosing reasonable values for the tuning parameters, the cross-validation approach.

### The Bias-Variance Dilemma

The choice of the tuning parameter values matters because these values are intimately linked with the accuracy of the predictions returned by the model. What an analyst typically wants is a model that predicts well samples that were not part of the training sample, i.e., the data used for estimating the structural parameters. In other words, a predictive model is considered good when it is capable of predicting previously unseen samples with high accuracy. The accuracy of a model’s predictions is usually gauged using a loss function. Popular choices for the loss function are the mean-squared error for continuous outcomes and the 0-1 loss for categorical outcomes<sup>2</sup>.
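As a small illustration of the two loss functions just mentioned, here is a minimal sketch in base R. The function names `mse` and `zero_one_loss` and the toy vectors are my own choices for illustration, not part of any package.

```r
# Mean-squared error for a continuous outcome
mse <- function(y, y_hat) mean((y - y_hat)^2)

# 0-1 loss for a categorical outcome: the share of misclassified observations
zero_one_loss <- function(y, y_hat) mean(y != y_hat)

# Toy values, made up for illustration
mse_value <- mse(c(1, 2, 3), c(1.5, 2, 2))  # (0.25 + 0 + 1) / 3
loss_value <- zero_one_loss(c("good", "bad", "good", "good"),
                            c("good", "good", "good", "bad"))  # 2 errors out of 4
```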

At this point, it is important to distinguish between different prediction error concepts:

• the training error, which is the average loss over the training sample,
• the test error, the prediction error over an independent test sample.

The training error gets smaller as the predicted responses get closer to the observed responses, and larger when, for some of the observations, the predicted and observed responses differ substantially. It is calculated on the same training sample used to fit the model. Clearly, we shouldn’t care too much about the model’s predictive accuracy on the training data. On the contrary, we would like to assess the model’s ability to predict observations never seen during estimation. The test error provides a measure of this ability. In general, one should select the model corresponding to the lowest test error.

The R code below implements these ideas via simulated data. In particular, I simulate 100 training sets, each of size 50, from a polynomial regression model, and for each I fit a sequence of cubic spline models with degrees of freedom from 1 to 30.
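A simulation of this kind could be sketched as follows. This is not the original code of the post: the cubic data-generating function `f_true`, the noise level, the use of natural cubic splines via `splines::ns()`, and the reduced number of training sets (20 instead of 100, to keep it fast) are all assumptions made for illustration.

```r
library(splines)  # ships with R; ns() builds a natural cubic spline basis

set.seed(1)

# Assumed data-generating process: a cubic polynomial plus Gaussian noise
f_true <- function(x) 1 + 2 * x - 3 * x^2 + 0.5 * x^3
simulate <- function(n, sd = 1) {
  x <- runif(n, -2, 2)
  data.frame(x = x, y = f_true(x) + rnorm(n, sd = sd))
}

n_sets   <- 20              # the text uses 100 training sets
n_train  <- 50
df_grid  <- 1:30
test_set <- simulate(5000)  # large independent test sample

mse <- function(fit, data) mean((data$y - predict(fit, newdata = data))^2)

# One row per training set, one column per degrees-of-freedom value
train_err <- test_err <- matrix(NA_real_, n_sets, length(df_grid))
for (s in seq_len(n_sets)) {
  train_set <- simulate(n_train)
  for (j in seq_along(df_grid)) {
    fit <- lm(y ~ ns(x, df = df_grid[j]), data = train_set)
    train_err[s, j] <- mse(fit, train_set)
    test_err[s, j]  <- mse(fit, test_set)
  }
}

# Averages over training sets approximate the expected training and test errors
avg_train <- colMeans(train_err)
avg_test  <- colMeans(test_err)
```

Plotting `avg_train` and `avg_test` against `df_grid` (for instance with `matplot()`) gives curves of the kind discussed in the text.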

The next plot shows the first simulated training sample together with three fitted models corresponding to cubic splines with 1 (green line), 4 (orange line) and 25 (blue line) degrees of freedom respectively. These numbers have been chosen to show the full set of possibilities one may encounter in practice, i.e., either a model with low variability but high bias (degrees of freedom = 1), or a model with high variability but low bias (degrees of freedom = 25), or a model which tries to find a compromise between bias and variance (degrees of freedom = 4).

Then, for each training sample and fitted model, I compute the corresponding test error using a large test sample generated from the same (known!) population. These are represented in the following plot together with their averages, which are shown using thicker lines<sup>3</sup>. The solid points represent the three models illustrated in the previous diagram.

One can see that the training errors decrease monotonically as the model gets more complicated (and less smooth). On the other hand, even if the test error initially decreases, beyond a certain level of flexibility it starts increasing again. The turning point corresponds to the orange model, that is, the model that provides a good compromise between bias and variance. The reason why the test error starts increasing for degrees of freedom larger than 3 or 4 is the so-called overfitting problem. Overfitting is the tendency of a model to adapt too well to the training data, at the expense of generalization to previously unseen data points. In other words, an overfitted model fits the noise in the data rather than the actual underlying relationships among the variables. Overfitting usually occurs when a model is unnecessarily complex.

It is possible to show that the (expected) test error for a given observation in the test set can be decomposed into the sum of three components, namely

$$\text{Expected Test Error} = \text{Irreducible Noise} + (\text{Model Bias})^2 + \text{Model Variance}$$
which is known as the bias-variance decomposition. The first term is the variance of the data generating process. This term is unavoidable because we live in a noisy stochastic world, where even the best ideal model has non-zero error. The second term originates from the difficulty of capturing the correct functional form of the relationship that links the dependent and independent variables (it is sometimes also called the approximation bias). The last term is due to the fact that we estimate our models using only a limited amount of data. Fortunately, this term gets closer and closer to zero as we collect more and more training data. Typically, the more complex (i.e., flexible) we make the model, the lower the bias but the higher the variance. This general phenomenon is known as the bias-variance trade-off, and the challenge is to find a model which provides a good compromise between these two issues.
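In symbols, the decomposition can be written as follows. The notation is my own assumption, in the spirit of Hastie, Tibshirani, and Friedman (2009): the response at a test point $x_0$ is $Y = f(x_0) + \varepsilon$ with $\mathrm{E}[\varepsilon] = 0$ and $\mathrm{Var}(\varepsilon) = \sigma^2$, $\hat{f}$ is the model estimated on a random training sample, and the expectation is taken over both the training sample and $\varepsilon$.

$$
\mathrm{E}\left[\left(Y - \hat{f}(x_0)\right)^2\right]
= \underbrace{\sigma^2}_{\text{irreducible noise}}
+ \underbrace{\left(\mathrm{E}\left[\hat{f}(x_0)\right] - f(x_0)\right)^2}_{(\text{model bias})^2}
+ \underbrace{\mathrm{E}\left[\left(\hat{f}(x_0) - \mathrm{E}\left[\hat{f}(x_0)\right]\right)^2\right]}_{\text{model variance}}
$$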

Clearly, the situation illustrated above is only ideal, because in practice:

• We do not know the true model that generates the data. Indeed, our models are typically more or less mis-specified.
• We only have a limited amount of data.

One way to overcome these hurdles and approximate the search for the optimal model is to use the cross-validation approach.

### A Solution: Cross-Validation

In essence, all these ideas bring us to the conclusion that it is not advisable to compare the predictive accuracy of a set of models using the same observations used for estimating the models. Therefore, for assessing the models’ predictive performance we should use an independent set of data (the test sample). Then, the model showing the lowest error on the test sample (i.e., the lowest test error) is identified as the best.

Unfortunately, in many cases it is not possible to draw a (possibly large) independent set of observations for testing the models’ performance, because collecting data is typically an expensive activity. The immediate reaction is to split the available data into two sets, one used for training and the other for testing. The split is usually performed randomly so that the two parts follow the same distribution<sup>4</sup>.
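A random hold-out split needs only base R. The sketch below uses simulated data and a 70/30 split; both the data and the split proportion are arbitrary choices of mine for illustration.

```r
set.seed(7)

# Simulated data standing in for the available sample
n <- 200
dat <- data.frame(x = rnorm(n))
dat$y <- 2 * dat$x + rnorm(n)

# Randomly pick 30% of the rows for testing; the rest is used for training
test_idx  <- sample(n, size = round(0.3 * n))
train_set <- dat[-test_idx, ]
test_set  <- dat[test_idx, ]

# Fit on the training part only, then measure the error on the held-out part
fit <- lm(y ~ x, data = train_set)
holdout_mse <- mean((test_set$y - predict(fit, newdata = test_set))^2)
```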

Even if data splitting provides an unbiased estimate of the test error, that estimate is often quite noisy. A possible solution<sup>5</sup> is to use cross-validation (CV). In its basic version, the so-called k-fold cross-validation, the samples are randomly partitioned into k sets (called folds) of roughly equal size. A model is fit using all the samples except the first fold. Then, the prediction error of the fitted model is calculated on the held-out first fold. The same operation is repeated for each fold, and the model’s performance is calculated by averaging the errors across the k held-out folds. k is usually fixed at 5 or 10. Cross-validation provides an estimate of the test error for each model<sup>6</sup>. It is one of the most widely used methods for model selection and for choosing tuning parameter values.

The code below illustrates k-fold cross-validation using the same simulated data as above, but without pretending to know the data generating process. In particular, I generate 100 observations and choose k=10. Together with the training error curve, the plot reports both the CV and test error curves. I also provide the standard error bars, which are the standard errors of the individual prediction errors for each of the k=10 parts.
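A hand-rolled version of 10-fold cross-validation might look like the sketch below. It is not the original code of the post: the simulated cubic signal, the grid of degrees of freedom, and the use of `splines::ns()` are assumptions made for illustration.

```r
library(splines)

set.seed(2)

# Assumed data: 100 observations from a noisy cubic signal
n <- 100
x <- runif(n, -2, 2)
dat <- data.frame(x = x, y = 1 + 2 * x - 3 * x^2 + 0.5 * x^3 + rnorm(n))

k <- 10
df_grid <- 1:15

# Randomly assign each observation to one of k folds of equal size
fold_id <- sample(rep(seq_len(k), length.out = n))

# fold_err[i, j]: mean squared error on fold i for the model with df_grid[j]
fold_err <- matrix(NA_real_, k, length(df_grid))
for (i in seq_len(k)) {
  train_set <- dat[fold_id != i, ]
  held_out  <- dat[fold_id == i, ]
  for (j in seq_along(df_grid)) {
    fit <- lm(y ~ ns(x, df = df_grid[j]), data = train_set)
    fold_err[i, j] <- mean((held_out$y - predict(fit, newdata = held_out))^2)
  }
}

cv_err  <- colMeans(fold_err)                # CV estimate of the test error
cv_se   <- apply(fold_err, 2, sd) / sqrt(k)  # standard error across the k folds
best_df <- df_grid[which.min(cv_err)]
```

Here `cv_se` is computed from the spread of the k fold-level errors, which matches the description of the error bars in the text.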

Often a “one-standard error” rule is used with cross-validation, according to which one should choose the most parsimonious model whose error is no more than one standard error above the error of the best model. In the example above, the best model (that for which the CV error is minimized) uses 3 degrees of freedom, which also satisfies the requirement of the one-standard error rule.
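The one-standard-error rule is easy to express in code. In the sketch below, the helper `one_se_rule()` is a hypothetical name of mine, and the CV errors and standard errors are made-up numbers for illustration only, not the results of the simulation discussed in the post.

```r
# Made-up CV results, for illustration only
df_grid <- 1:8
cv_err  <- c(5.20, 2.10, 1.12, 1.05, 1.08, 1.15, 1.22, 1.31)
cv_se   <- c(0.60, 0.30, 0.12, 0.10, 0.11, 0.12, 0.13, 0.15)

# Pick the simplest model whose CV error is within one standard error
# of the minimum CV error (smaller complexity value = simpler model)
one_se_rule <- function(complexity, err, se) {
  best <- which.min(err)
  threshold <- err[best] + se[best]
  min(complexity[err <= threshold])
}

df_min    <- df_grid[which.min(cv_err)]           # model with the lowest CV error
df_one_se <- one_se_rule(df_grid, cv_err, cv_se)  # most parsimonious acceptable model
```

With these made-up numbers the minimum is at 4 degrees of freedom, while the rule selects the simpler model with 3, since 1.12 is below the threshold 1.05 + 0.10.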

The case where k=n corresponds to the so-called leave-one-out cross-validation (LOOCV) method. In this case the test set contains a single observation. The advantages of LOOCV are: 1) it doesn’t require random numbers to select the observations to test, meaning that it doesn’t produce different results when applied repeatedly, and 2) it has far less bias than k-fold CV because it employs larger training sets containing n−1 observations each. On the other hand, LOOCV also has some drawbacks: 1) it is potentially quite computationally intensive, and 2) because any two training sets share n−2 points, the models fit to those training sets tend to be strongly correlated with each other.

The code below implements LOOCV using the same example I have discussed so far. The next plot shows that most of the time LOOCV does not provide dramatically different results with respect to CV.
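A brute-force LOOCV loop could be sketched as follows. As before, this is not the original code: the simulated data, the sample size, and the grid of degrees of freedom are assumptions chosen to keep the example quick to run.

```r
library(splines)

set.seed(3)

# Assumed data: a noisy cubic signal
n <- 60
x <- runif(n, -2, 2)
dat <- data.frame(x = x, y = 1 + 2 * x - 3 * x^2 + 0.5 * x^3 + rnorm(n))

df_grid <- 1:10

# For each model, leave out one observation at a time, fit on the other n - 1,
# and record the squared prediction error on the single held-out point
loocv_err <- sapply(df_grid, function(d) {
  sq_err <- sapply(seq_len(n), function(i) {
    fit <- lm(y ~ ns(x, df = d), data = dat[-i, ])
    (dat$y[i] - predict(fit, newdata = dat[i, , drop = FALSE]))^2
  })
  mean(sq_err)
})

best_df_loocv <- df_grid[which.min(loocv_err)]
```

Note that no random fold assignment is involved, so rerunning the loop on the same data returns the same `loocv_err`. The price is n model fits per candidate model.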

### Doing Cross-Validation With R: the caret Package

There are many R packages that provide functions for performing different flavors of CV. In my opinion, one of the best implementations of these ideas is available in the caret package by Max Kuhn (see Kuhn and Johnson 2013)<sup>7</sup>. The aim of the caret package (short for classification and regression training) is to provide a very general and efficient suite of commands for building and assessing predictive models. It allows one to compare the predictive accuracy of a multitude of models (currently more than 200), including the most recent ones from machine learning. The comparison of different models can be done using cross-validation as well as other approaches. The package also provides many options for data pre-processing. It is not my aim to provide here a thorough presentation of all the package features. Rather, I will focus only on a handful of its functions, those that perform CV. For more details on the other package functions, you can inspect the package documentation and its website. To illustrate these features I will use some data for a credit scoring application whose data can be found here.

Since credit scoring is a classification problem, I will use the number of misclassified observations as the loss measure. The data set contains information about 4,455 individuals for the following variables:

| Variable | Description |
|---|---|
| Status | credit status |
| Seniority | job seniority (years) |
| Home | type of home ownership |
| Time | time of requested loan |
| Age | client’s age |
| Marital | marital status |
| Records | existence of records |
| Job | type of job |
| Expenses | amount of expenses |
| Income | amount of income |
| Assets | amount of assets |
| Debt | amount of debt |
| Amount | amount requested of loan |
| Price | price of good |

Here I use the “cleaned” version of the data set, where some pre-processing has already been performed (i.e., removal of a few observations, imputation of missing values and categorization of continuous predictors). The tidy data are contained in the file CleanCreditScoring.csv.

The caret package provides functions for splitting the data as well as functions that automatically do all the work for us, namely functions that create the resampled data sets, fit the models, and evaluate performance.

Among the functions for data splitting I just mention createDataPartition() and createFolds(). The former creates one or more random test/training partitions of the data, while the latter randomly splits the data into k subsets. In both functions the random sampling is done within the levels of y (when y is categorical) to balance the class distributions within the splits. These functions return vectors of indexes that can then be used to subset the original sample into training and test sets.
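The two splitting functions can be tried on a toy outcome. In this sketch the factor `status` is simulated as a stand-in for the Status variable (the real credit data are not loaded), and the 80/20 proportion and k=10 are arbitrary choices of mine.

```r
library(caret)

set.seed(4)

# Simulated unbalanced binary outcome, standing in for the credit Status
status <- factor(sample(c("good", "bad"), 500, replace = TRUE,
                        prob = c(0.7, 0.3)))

# One stratified 80/20 split: with list = FALSE the result is a one-column
# matrix holding the row indexes of the training part
in_train     <- createDataPartition(status, p = 0.8, list = FALSE)[, 1]
train_status <- status[in_train]
test_status  <- status[-in_train]

# Ten stratified folds: a list with the held-out row indexes of each fold
folds <- createFolds(status, k = 10)

# Class proportions are nearly identical in the full sample and the training part
prop_all   <- prop.table(table(status))
prop_train <- prop.table(table(train_status))
```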

To automatically split the data, fit the models and assess the performance, one can use the train() function in the caret package. The code below shows an example of the train() function on the credit scoring data, modeling the outcome with a penalized logistic regression that uses all the available predictors. More specifically, I use the glmnet package (Friedman, Hastie, and Tibshirani 2008), which fits a generalized linear model via penalized maximum likelihood. The algorithm implemented in the package computes the regularization path for the elastic-net penalty over a grid of values for the regularization parameter λ. The tuning parameter λ controls the overall strength of the penalty. A second tuning parameter, called the mixing percentage and denoted by α, controls the balance between the two components of the elastic-net penalty (Zou and Hastie 2005). It takes values in [0,1] and bridges the gap between the lasso (α=1) and ridge (α=0) approaches.

The train() function requires the model formula together with the indication of the model to fit and the grid of tuning parameter values to use. In the code below this grid is specified through the tuneGrid argument, while trControl provides the method to use for choosing the optimal values of the tuning parameters (in our case, 10-fold cross-validation). Finally, the preProcess argument applies a series of pre-processing operations to the predictors (in our case, centering and scaling the predictor values).
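A call of this kind could look like the sketch below. It does not reproduce the original analysis: the data frame `credit` is simulated with three predictors named after variables in the table above, and the grid values for α and λ are arbitrary. The glmnet package must be installed for `method = "glmnet"` to work.

```r
library(caret)  # train() calls glmnet behind the scenes

set.seed(5)

# Simulated stand-in for the credit data (the real CSV file is not loaded here)
n <- 400
credit <- data.frame(
  Income    = rnorm(n, 140, 40),
  Expenses  = rnorm(n, 55, 15),
  Seniority = rpois(n, 8)
)
lin_pred <- 0.03 * (credit$Income - 140) - 0.04 * (credit$Expenses - 55) +
  0.15 * (credit$Seniority - 8)
credit$Status <- factor(ifelse(runif(n) < plogis(lin_pred), "good", "bad"),
                        levels = c("bad", "good"))

# Grid of tuning parameter values: every combination of alpha and lambda
glmnet_grid <- expand.grid(alpha = c(0, 0.5, 1), lambda = c(0.001, 0.01, 0.1))

# Resampling scheme: 10-fold cross-validation
cv_ctrl <- trainControl(method = "cv", number = 10)

glmnet_fit <- train(Status ~ ., data = credit,
                    method = "glmnet",
                    tuneGrid = glmnet_grid,
                    trControl = cv_ctrl,
                    preProcess = c("center", "scale"))

glmnet_fit$bestTune  # combination of alpha and lambda with the best CV accuracy
glmnet_fit$results   # CV accuracy for every combination in the grid
```

Calling `plot(glmnet_fit)` draws the cross-validated accuracy against the tuning parameters.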

The previous plot shows the “accuracy”, that is, the percentage of correctly classified observations, for the penalized logistic regression model with each combination of the two tuning parameters α and λ. The optimal tuning parameter values are α=0 and λ=0.01.

Then, it is possible to predict new samples with the identified optimal model using the predict method:
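For example, with a model fitted through train(), predictions could be obtained as in the sketch below. The data are simulated, the predictors `x1` and `x2` are hypothetical, and the small tuning grid is arbitrary. Only the two predict() calls at the end are the point of the example.

```r
library(caret)

set.seed(6)

# Simulated two-class data (a stand-in for the credit scoring sample)
n <- 300
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$Status <- factor(ifelse(runif(n) < plogis(1.5 * dat$x1 - dat$x2),
                            "good", "bad"))

# Keep 20% of the rows aside to play the role of new samples
idx         <- createDataPartition(dat$Status, p = 0.8, list = FALSE)[, 1]
training    <- dat[idx, ]
new_samples <- dat[-idx, ]

fit <- train(Status ~ ., data = training,
             method = "glmnet",
             tuneGrid = expand.grid(alpha = c(0, 1), lambda = c(0.01, 0.1)),
             trControl = trainControl(method = "cv", number = 5),
             preProcess = c("center", "scale"))

# Predicted classes (the default) and predicted class probabilities
pred_class <- predict(fit, newdata = new_samples)
pred_prob  <- predict(fit, newdata = new_samples, type = "prob")

# Share of correctly classified new samples
holdout_accuracy <- mean(pred_class == new_samples$Status)
```

The pre-processing learned on the training data (centering and scaling) is applied to `new_samples` automatically by predict().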

If you need to deepen your knowledge of predictive analytics, you may find something interesting in the R Course Data Mining with R.

Stay tuned for the next article on the MilanoR blog!

### References

Efron, B., and R. Tibshirani. 1993. An Introduction to the Bootstrap. CRC Press.

Friedman, J., T. Hastie, and R. Tibshirani. 2008. “Regularization Paths for Generalized Linear Models via Coordinate Descent.” Journal of Statistical Software 33 (1): 1–22.

Hastie, T., R. Tibshirani, and J. Friedman. 2009. The Elements of Statistical Learning. 2nd ed. Springer.

James, G., D. Witten, T. Hastie, and R. Tibshirani. 2013. An Introduction to Statistical Learning. Springer.

Kuhn, M., and K. Johnson. 2013. Applied Predictive Modeling. Springer.

Zou, H., and T. Hastie. 2005. “Regularization and Variable Selection via the Elastic Net.” Journal of the Royal Statistical Society: Series B 67 (2): 301–20.

1. By the way, it seems that the oracular powers were associated with hallucinogenic gases that puffed out from the temple floor.
2. You can find a thorough formal illustration of all these concepts in Hastie, Tibshirani, and Friedman (2009), Chapter 7. A somewhat simpler presentation can be found in James et al. (2013).
3. More precisely, the light red curves correspond to what is called the conditional test error, which means that each curve is conditional on the corresponding training sample. The heavier red curve corresponds to the expected test error. In general, we would like to focus on the conditional test error for the particular training sample we have. However, this curve is very difficult to estimate, and in practice the expected test error is typically targeted. As we will see, cross-validation is a method for estimating the expected test error. For more details see Hastie, Tibshirani, and Friedman (2009).
4. A variant of the purely random split is to use stratified random sampling in order to create subsets that are balanced with respect to the outcome. This is useful in particular in classification problems when one class has a disproportionately small frequency compared to the others.
5. An alternative approach with the same objective is the bootstrap, which won’t be illustrated here (see Efron and Tibshirani (1993)).
6. More precisely, cross-validation provides an estimate of the expected test error.
7. The boot package also contains a nice function called cv.glm, which implements k-fold cross-validation for generalized linear models.
