Before getting into the details of Cross Validation techniques and their applications, we will look at the steps in a Machine Learning Pipeline. This will help us better visualize the purpose of Cross Validation.
To understand Cross Validation, we need to know a couple of things involved in model creation. Suppose we have a dataset with 10k records. Based on the above flow, we would perform a Train Test Split. The training and validation set would be around 70% of the data (more or less, depending on the requirement), and the rest would be the testing set.
This means that we split the data into a training set to train our model, and later test the accuracy of the model using the testing set.
Now the question is how we select these training and testing records. Basically, we select them randomly. Random selection helps ensure that the types of records present in the test set are also present in the training set and vice versa. To control this, we generally use a parameter called random state. We can pass any integer value to it; it seeds the random shuffling, so the same value always reproduces the same split, while different values produce different splits.
But wait! While this sounds fine, there is a significant drawback to selecting the records randomly for the Train Test Split.
Suppose we pass random state = 50 and get an accuracy of 90%. If we run the code again with random state = 100, we won't get the same result as before. This fluctuation in accuracy makes it hard to convey our results reliably.
So, every time we change the value of the random state, the accuracy will fluctuate. To mitigate this problem, we use Cross Validation.
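The fluctuation described above can be reproduced with a small sketch. This assumes scikit-learn and uses a synthetic dataset as a stand-in for our 10k records (shrunk to 1000 rows to keep it fast); the model choice (logistic regression) is also just an illustrative assumption.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for our dataset (1000 records here instead of 10k)
X, y = make_classification(n_samples=1000, random_state=0)

accuracies = {}
for seed in (50, 100):
    # random_state seeds the shuffle, so each seed produces a different split
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=seed
    )
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    accuracies[seed] = model.score(X_test, y_test)

print(accuracies)  # the two accuracies generally differ
```

Running this typically prints two slightly different accuracies, one per seed, which is exactly the instability we want Cross Validation to smooth out.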
Let's look at the different types of Cross Validation:
1. Leave one out Cross Validation
To understand the concept of Leave one out Cross Validation, we will again consider a dataset with 10k records. In this cross validation technique, we take one record as the testing set and the remaining records as the training set.
The blue shaded area is the testing set and the brown shaded area is the training set.
This is essentially a brute-force approach: we loop through every element of the dataset, considering one element at a time as the testing set and the rest as the training set.
This approach seems straightforward, but it has some major drawbacks:
- Since we loop through every element, choosing one as the testing set and the rest as the training set, the approach is time-consuming and computationally intensive: for a dataset of n records, we must train n models.
- This process leads to a low-bias condition: because each model is trained on almost the entire dataset, it fits the data very closely, which can lead to overfitting. When we run such a model on new data, it may exhibit high variance, resulting in poor accuracy and a high error rate.
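Here is a minimal sketch of Leave one out Cross Validation using scikit-learn's `LeaveOneOut`. The iris dataset (150 records) and logistic regression are assumptions for illustration; with 10k records this would mean 10k model fits, which is why the approach is so expensive.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = load_iris(return_X_y=True)  # 150 records, standing in for the 10k

# One model is trained per record: 150 fits for 150 samples,
# each scored on the single held-out record
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=LeaveOneOut())
print(len(scores), scores.mean())  # number of fits and their mean accuracy
```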
2. K Fold Cross Validation
This is the most commonly used cross validation technique for training models. To understand it, we need to revisit the disadvantage of using the random state in the train test split.
Each time we change the random state, the training and test data are randomly re-selected, which changes the accuracy of the model every time.
To prevent this fluctuation of accuracy, we will use the K Fold Cross Validation approach.
In this approach we need to select a value for K, which is the number of folds, i.e., the number of experiments we want to perform on our dataset.
Now, let’s understand this using an example.
Suppose we have 10k records in our dataset and we select K = 10. The figure below shows how the data is split into train and test sets over 10 iterations, since our K value is 10.
So, fold size = 10000 / 10 = 1000
This means that each test split will have 1000 records.
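The fold-size arithmetic can be checked directly with scikit-learn's `KFold` (assumed here), using an array of zeros as a stand-in for the 10k records:

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.zeros((10_000, 1))  # stand-in for the 10k records

# Each of the 10 splits holds out one block of 10000 / 10 = 1000 records
fold_sizes = [len(test_idx) for _, test_idx in KFold(n_splits=10).split(X)]
print(fold_sizes)
```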
Let’s try to understand how the K Fold Cross Validation works.
In the first iteration, the first set of 1000 records would be considered as the Test Split. The rest would be the Train Split. Similarly for the 2nd experiment/iteration, the next block of 1000 records will be considered as the Test Split.
The 1st block, together with the blocks after the 2nd test block, would then be the Train Split. This process continues until the end of the 10th experiment/iteration, and for each experiment/iteration we record the accuracy of the model.
Based on these results, we can take the mean of all 10 accuracy scores and report it as the overall model performance, or we can report the lower and upper bounds of the accuracies to characterize the model's performance.
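The whole K Fold procedure can be sketched with `cross_val_score` (assuming scikit-learn; the iris dataset and logistic regression are placeholder choices for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)

# K = 10 folds: each record lands in the test split exactly once
kf = KFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=kf)

# Report either the mean accuracy or the lower/upper bounds across folds
print("mean accuracy:", scores.mean())
print("bounds:", scores.min(), "-", scores.max())
```

Because every record is scored exactly once, the mean no longer depends on one lucky (or unlucky) random split.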
Now let's consider a disadvantage of K Fold Cross Validation:
- Say, for Experiment 1 in the figure above, we split the data into train and test, and the test set happens to contain only one class of records. To see how, suppose we are working on a binary classification problem, where the data has only 2 classes, 0 and 1. The first block of 1000 records used as the test set for Experiment 1 might contain only 0s. Such an imbalanced split will yield misleading accuracy results.
To prevent this situation, we use another cross validation technique called the Stratified Cross Validation.
3. Stratified Cross Validation
Stratified Cross Validation is similar to K Fold Cross Validation. Here too we choose a K value and split the data into testing and training sets accordingly, as we did in K Fold.
The only difference is that in every experiment/iteration, the records sampled into the train and test sets are distributed so that every class is represented in both.
Take the example of binary classification, where the data has only 2 classes, 0 and 1. With Stratified Cross Validation, the train and test sets are split so that instances of each class (0/1) are present in both, roughly in their original proportions. This is how we deal with the disadvantage of K Fold Cross Validation.
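A small sketch with scikit-learn's `StratifiedKFold` (assumed here) shows the class balance being preserved; the 90/10 label split is an invented example of an imbalanced binary dataset:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Imbalanced binary labels: 90 zeros and 10 ones
y = np.array([0] * 90 + [1] * 10)
X = np.arange(100).reshape(-1, 1)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Count class 0 and class 1 in each test fold
test_counts = [np.bincount(y[test_idx]).tolist() for _, test_idx in skf.split(X, y)]
print(test_counts)  # every test fold keeps the original 9:1 ratio: [18, 2]
```

With plain `KFold` on the same sorted labels, some test folds would contain only 0s; stratification guarantees both classes appear in every fold.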
Now the last technique that we need to know is Time series cross validation.
4. Time Series Cross Validation
So, as the name suggests, this cross validation technique is used for Time Series data. Here we can't simply apply a random train test split, because shuffling would let the model train on future values to predict past ones; the temporal order must be preserved.
To understand this, let's take the example of the daily sales of a store. Say we have the sales data for 5 days and we want to predict the sales for the 6th to 10th day.
In order to predict the sales from 6th day till the 10th day, we have to leverage the sales figures for the first 5 days.
As you can see in the image below, we use the 5 days of sales to predict the 6th day's sales.
Then we use the given sales along with the predicted 6th-day sales to predict the sales for the 7th day.
This process continues until we have predicted the sales up to the 10th day.
Now let’s implement whatever we discussed so far in code and see how it works !!