Cross validation is an approach in which the data is repeatedly split into training and validation sets; this is done so that the model can be evaluated on samples it was not trained on while still making efficient use of a limited dataset.
The following methods are commonly used in cross validation:
k-Fold: divide the whole dataset into k groups (folds) and validate on each fold in turn.
Leave-one-out: validate on a single sample from one group at a time.
Hold-out validation: a single split into training and validation examples (a small sketch follows this list).
Train, validate, test: a three-way split that also holds back a separate test set.
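To make the hold-out variants concrete before we get to k-fold, here is a minimal sketch in base R. The data frame, the single feature, and the 60/20/20 proportions are hypothetical choices for illustration only.

```r
# Hypothetical dataset: 100 samples of one feature x and a response y.
set.seed(1)
df <- data.frame(x = rnorm(100))
df$y <- 3 * df$x + rnorm(100)

# Simple hold-out split: 60% train, 20% validation, 20% test.
idx   <- sample(nrow(df))      # shuffle the row indices
train <- df[idx[1:60], ]
valid <- df[idx[61:80], ]
test  <- df[idx[81:100], ]
```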
K-fold
In k-fold cross validation, k random subgroups (folds) are drawn from the original samples. This method works well when there are no other variables, such as time or group structure, that might affect how the samples should be split.
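Here is a minimal sketch of that random assignment in base R; the made-up dataset and the choice of k = 5 are assumptions for illustration.

```r
# Hypothetical dataset with 90 samples.
set.seed(1)
df <- data.frame(x = rnorm(90), y = rnorm(90))

# Assign every row to one of k random, roughly equal-sized folds.
k <- 5
folds <- sample(rep(1:k, length.out = nrow(df)))
table(folds)   # sanity check: roughly n/k samples per fold
```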
Let’s consider a situation where you have 5 variables plus a variable X representing time, say the time each sample takes to complete a task (roughly 0.5 seconds each). Once time is involved, the samples are ordered, and a purely random split may no longer behave the way we expect.
Why use K-fold cross validation?
For simplicity, let’s assume we have 30 samples per day and the samples are stored in chronological order, i.e. they come from previous days. Label the samples from days 3, 4 and 5 as m3, m4 and m5; splitting on the day gives us three folds, m3 (n=30), m4 (n=30) and m5 (n=30). A simple way to interpret what’s going on is that “x” is the variable we measure, each fold m holds one day’s worth of replicate samples (here 30 of them, though it could be 10 or 20), and “m” indexes the time.
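A rough sketch of that day-based grouping in R, with made-up numbers, could look like this:

```r
# Hypothetical example: 30 samples per day for days 3, 4 and 5, kept in time order.
df <- data.frame(
  day = rep(c(3, 4, 5), each = 30),   # the fold labels m3, m4, m5
  x   = rnorm(90),                    # some measured feature
  y   = rnorm(90)                     # some response
)

folds_by_day <- split(seq_len(nrow(df)), df$day)   # one fold per day
sapply(folds_by_day, length)                        # each fold has n = 30 samples
```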
Now suppose m1 only contains samples with the value 1 and m2 only contains samples with the value 2. Train on one fold and validate on the other and the validation error will be large, so we might conclude that our model is overfitting, when the real problem is that the folds are not representative of each other. So instead of leaning on the fold index “k”, maybe “x” (the time) should be treated as an explicit variable that can be fed to the model without changing anything in terms of the other independent variables. That is essentially what a cross-validation “fold” is; now let’s look at the same idea with more than one variable and how to go about it.
First, let’s say the first two groups are m0 and m1 and the last group is m5. One way to think about what happens here is that m0 serves as a reference group holding the earliest samples, and m1 is the group that follows it. Another way to think about it is that each group is nested in the next: m1 contains all the samples of m0, and m2 contains all the samples of m1 plus other samples, so each time we add another set to the group we have more samples to fit, i.e. first m1 and then m2.
If we simply pool all the samples from every group, we end up mixing samples that come from two different underlying distributions, which ultimately makes it harder to see the patterns we were hoping to see. On the other hand, the goal of cross-validation is to find the model that minimizes the validation error in situations where we’re not yet sure which variables will be included; it is also usually about reducing bias, which we achieve by drawing validation samples from both m1 and m2 rather than from only one of them.
The next thing is that once m1 and m2 are both defined, we can perform a proper fitting exercise, so we define an error function over both groups. Our error function f(x, m1, m2) computes the error for the samples labelled m1 and m2, and each fold is evaluated at least once per iteration of the experiment. Since the variable “x” refers back to the earlier samples, we can use it as a label telling us which fold each sample belongs to.
However, if we change “x” to the variable “y”, then m1 will no longer contain the same samples as before, so the labels change and the whole k-fold procedure has to be repeated. Done by hand, over and over, this quickly becomes tedious, which is why it’s important to understand when and why cross validation is performed before automating it.
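As a small illustration, here is one way such a per-fold error function could look in R, using mean squared error; the observed and predicted values for m1 and m2 are made up for the example.

```r
# Hypothetical per-fold error function: mean squared error on one fold.
fold_error <- function(y_true, y_pred) {
  mean((y_true - y_pred)^2)
}

# Toy observed and predicted values for two folds, m1 and m2.
y_m1 <- c(1.0, 2.0, 3.0); pred_m1 <- c(1.1, 1.9, 3.2)
y_m2 <- c(0.5, 1.5, 2.5); pred_m2 <- c(0.4, 1.7, 2.4)

fold_error(y_m1, pred_m1)   # error on fold m1
fold_error(y_m2, pred_m2)   # error on fold m2
```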
Step 4.
Now that we’ve defined our variables, let’s move further and define the validation set! For a given dataset we choose the number of folds, which fixes the validation subset size; here we’ll use m = 5. For example, if we have n = 25 samples and k = 5 folds, each validation fold holds m = 5 samples, each training set holds n - m = 20 samples, and the model is fit and tested k = 5 times in total. This is when we’ll perform cross-validation.
But before we actually start the process, let’s see what the validation set consists of. The general idea of cross-validation is that we split the entire dataset into a training part and a validation part, and we repeat this k times so that each fold takes a turn as the validation set for one model or another. Each fold is labelled m1, m2, m3, m4, and we can use these labels to compare the accuracy obtained on m1 vs m2, m3 vs m4, etc.
Let’s say we have features “x” and labels “y”; we’ll do k-fold cross validation using “x” as the model inputs and “y” as the validation target. For each fold we train on the training portion, keep an eye on the training fit, and then predict on the held-out validation portion and check whether the two agree. Here’s how such an experiment could look in R.
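Below is a minimal sketch of such an experiment in base R; the simulated data, the plain linear model, and the choice of k = 5 are assumptions made for illustration, not a prescription.

```r
# A minimal k-fold cross-validation experiment in base R.
set.seed(42)

# Simulated data: n = 25 samples, one feature x and a response y.
n  <- 25
df <- data.frame(x = rnorm(n))
df$y <- 2 * df$x + rnorm(n, sd = 0.5)

k <- 5
folds <- sample(rep(1:k, length.out = n))     # random fold label for every row

cv_errors <- numeric(k)
for (i in 1:k) {
  train <- df[folds != i, ]                   # n - m = 20 training samples
  valid <- df[folds == i, ]                   # m = 5 validation samples
  fit   <- lm(y ~ x, data = train)            # fit on the training part
  pred  <- predict(fit, newdata = valid)      # predict on the held-out fold
  cv_errors[i] <- mean((valid$y - pred)^2)    # validation error for this fold
}

cv_errors        # one error per fold
mean(cv_errors)  # cross-validated estimate of the error
```

Each fold contributes one validation error, and their mean is the cross-validated estimate you would compare across candidate models.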
Step 5
So far we’ve seen how cross-validation works; now let’s jump into specifics. What is the benefit?
There are many reasons why a given cross-validation technique is used. Some of them are summarized as follows:
It improves the internal validity of your model and makes it faster to select and tune hyperparameters.
It helps detect and reduce over-fitting in your model.
It reduces bias, which is the problem you get from relying too heavily on a single split or parameter choice, and it flags a high error rate early so the model can be corrected sooner.
It helps us visualize relationships between the variables during training.
It provides insight into which factors are contributing to the model.
It allows us to estimate the statistical significance of the coefficients and thus make inferences about the effect of various parameters on the training set.
In the end, we want our model to be accurate, and that includes not only predicting the value of y but also understanding the behavior of the features. In short, cross validation is important for both.
What’s important is that it’s just one part of a process of learning and tweaking a model that eventually becomes better, and just one aspect of a larger modeling workflow. Many kinds of modeling can go together, for example time series forecasting, machine learning, deep learning, and neural networks.
With time series modeling, for example, one expects to predict the future from historical values, but future values are never fully predictable, so we also have to learn to deal with that. To do so, we need to adjust both the model and the way we validate it, so that we never validate on samples that came before the ones we trained on.
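One common adjustment, sketched below with simulated data, is to validate only on samples that come after the training window, sometimes called expanding-window or rolling-origin validation. The trend model, the window sizes, and the data here are all made-up choices.

```r
# Expanding-window validation for time-ordered data (simulated example).
set.seed(7)
n <- 100
ts_df <- data.frame(t = 1:n)
ts_df$y <- 0.05 * ts_df$t + rnorm(n)   # a simple trend plus noise

horizon <- 10                          # size of each validation window
origins <- seq(50, n - horizon, by = horizon)

errors <- sapply(origins, function(origin) {
  train <- ts_df[1:origin, ]                         # everything up to the origin
  valid <- ts_df[(origin + 1):(origin + horizon), ]  # the next `horizon` points
  fit   <- lm(y ~ t, data = train)
  pred  <- predict(fit, newdata = valid)
  mean((valid$y - pred)^2)                           # forward-looking error
})

mean(errors)   # averaged error over all forward-looking windows
```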
Here are some steps to follow when working with cross validation (a compact sketch tying them together follows the list):
Get data
Create validation set
Run k-fold
Run all models
Check data
Run validation set
Train on data
Output the results
Visualize
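If you prefer not to wire these steps up by hand, libraries can wrap most of them. Here is a compact sketch using the caret package in R; the dataset and the plain linear model are hypothetical stand-ins, and caret is assumed to be installed.

```r
# A compact version of the workflow using the caret package (assumed installed).
library(caret)

set.seed(42)
df <- data.frame(x = rnorm(100))   # get data (hypothetical)
df$y <- 2 * df$x + rnorm(100)

ctrl <- trainControl(method = "cv", number = 5)   # create validation folds / run k-fold
fit  <- train(y ~ x, data = df, method = "lm",    # train on data with a simple model
              trControl = ctrl)

fit$resample   # per-fold results (output the results)
fit$results    # summary across folds, ready to visualize
```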
Feedback
Let me know if you liked my article. If you did, I’d very much like to hear your feedback. How can we improve it? Thank you!