Introduction to Bias and Variance
Bias and Variance play a very important role while building a model. To frame it in simple terms, Bias can be interpreted as the model's error on the training data, and Variance as its error on the test data.
To understand the concept of Bias and Variance, we first need to know what Overfitting and Underfitting are and how to work towards rectifying them.
But wait!! What exactly do Overfitting and Underfitting mean, and why do they matter so much?
Let’s break these down into steps and understand them by taking different scenarios for both Regression and Classification problems.
Case -I (Degree of Polynomial = 1)
Let’s take the above example, where we have created a best fit line using a Polynomial Linear Regression algorithm with the degree of the polynomial set to 1.
An example of an equation with degree 1 is y = mx + b. Models of this kind are similar to a simple Linear Regression model, and they create a straight best fit line as shown above.
Now if you notice above, most of the points are spread non-linearly across the region, and the linear best fit line clearly won’t be able to fit them properly. If we calculate R-Squared for this model (check out this post on the R-Squared metric), we will get a low value, and the summation of all the distances between the actual and predicted points will be high.
In this case, we built a model on the training dataset, but we get a very high error on both the training and the test data. That means our model doesn’t fit the data. This scenario is called Underfitting.
In case of Underfitting, accuracy is very low for both the training and the test data.
Bias and Variance
Let’s understand what Bias and Variance means.
Bias can be interpreted as the model's error on the training data, and Variance as its error on the test data.
In this case, the error is high for both the training and the test data, which clearly means that our model has High Bias and High Variance.
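This Case I scenario can be sketched quickly with NumPy. The noisy cubic data below is an illustrative assumption, not taken from the original example:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative non-linear data: a noisy cubic curve
x = np.linspace(-3, 3, 100)
y = x**3 - 2 * x + rng.normal(0, 1, size=x.shape)

# Alternate points between training and test sets
x_train, y_train = x[::2], y[::2]
x_test, y_test = x[1::2], y[1::2]

# Degree-1 fit: a straight best fit line y = mx + b
line = np.poly1d(np.polyfit(x_train, y_train, deg=1))

train_mse = np.mean((line(x_train) - y_train) ** 2)
test_mse = np.mean((line(x_test) - y_test) ** 2)

# Both errors stay high: high bias and high variance (Underfitting)
print(f"train MSE: {train_mse:.2f}, test MSE: {test_mse:.2f}")
```

The straight line simply cannot follow the curvature of the data, so the error stays high no matter which split it is measured on.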
Case -II (Degree of Polynomial = 3)
Now let’s increase the degree of the polynomial to 3. You will notice that we get a better best fit line, one that fits the points better than in the previous case.
If you notice above, the best fit line is now a curve that tries to fit the points more closely, satisfying most of the training data points. Therefore, the residual error is comparatively lower (and R-Squared higher) in this case.
Bias and Variance
This will be a well-generalized model, as it has low training and testing error. This means that it has Low Bias and Low Variance.
Case -III (Degree of Polynomial = n)
Now let’s increase the degree of the polynomial to some large n. You will notice that we get a best fit line that passes through almost every point.
In this case, each training point is fitted almost perfectly by the best fit line. This is a classic example of Overfitting.
In case of Overfitting, the model’s accuracy is very high for training data but very low for test data. A good model should have high accuracy for both training and test data.
Bias and Variance
In this case, the error is low for training data but high for test data which clearly means that our model has Low Bias and High Variance.
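Cases II and III can be contrasted on the same data: a degree-3 polynomial keeps both errors low, while a very high degree drives the training error toward zero and blows up the test error. The data and the specific degree values below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative noisy cubic data: 15 training points, 15 test points
x = np.linspace(-3, 3, 30)
y = x**3 - 2 * x + rng.normal(0, 1, size=x.shape)
x_train, y_train = x[::2], y[::2]
x_test, y_test = x[1::2], y[1::2]

def mse(degree):
    """Fit a polynomial of the given degree; return (train MSE, test MSE)."""
    fit = np.poly1d(np.polyfit(x_train, y_train, degree))
    return (np.mean((fit(x_train) - y_train) ** 2),
            np.mean((fit(x_test) - y_test) ** 2))

train3, test3 = mse(3)     # Case II: low bias, low variance
train14, test14 = mse(14)  # Case III: near-zero training error, huge test error
print(f"degree 3 : train={train3:.2f}  test={test3:.2f}")
print(f"degree 14: train={train14:.2e}  test={test14:.2f}")
```

With 15 training points, a degree-14 polynomial can pass through essentially all of them, including the noise, which is exactly why it falls apart on the points it has not seen.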
Let’s take a classification use case where the training and testing error for the model is high.
Training error = ~35%
Testing error = ~37%
This has high bias and high variance which clearly shows that it is a case of Underfitting.
Let’s take a classification use case where the training and testing error for the model is low.
Training error = <5%
Testing error = <5%
This has low bias and low variance, which shows that this is our most generalized model.
Let’s take a classification use case where we have a low training error, but high testing error.
Training error = <5%
Testing error = ~35%
This has low bias and high variance which clearly shows that it is a case of Overfitting.
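The same three classification scenarios can be reproduced with a decision tree of varying depth; scikit-learn and its make_moons toy dataset are assumptions used purely for illustration:

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Noisy two-class toy data
X, y = make_moons(n_samples=400, noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)

def errors(depth):
    """Train a tree of the given depth; return (train error, test error)."""
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    return 1 - tree.score(X_train, y_train), 1 - tree.score(X_test, y_test)

under_tr, under_te = errors(1)     # too shallow: high train and test error
good_tr, good_te = errors(4)       # moderate depth: both errors reasonably low
over_tr, over_te = errors(None)    # full depth: train error ~0, test error high
print(under_tr, under_te, good_tr, good_te, over_tr, over_te)
```

The depth-1 stump mirrors the Underfitting case, the moderate depth mirrors the generalized case, and the fully grown tree mirrors the Overfitting case.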
Now that we have understood different scenarios of Classification and Regression cases with respect to Bias and Variance, let’s see a more generalized representation of Bias and Variance.
Generalized representation of Bias and Variance
Let’s consider model error as the y-axis and the degree of polynomial as the x-axis.
Now let’s say that we have an Underfitting condition. Generally, in this case, the degree of the polynomial is low, and the error is pretty high for both the training and the testing data, signifying high bias and high variance. This can be seen in the graph below.
In case of an Overfitting condition, the degree of the polynomial is high; the error is low for the training data but pretty high for the testing data, signifying low bias and high variance.
If we notice the graph, as the degree of the polynomial increases, the error for both the training and the testing data gradually descends. But after a certain point the testing error just skyrockets. That is the high variance region: the range of polynomial degrees where the training data is fitted almost perfectly (Overfitting).
The basic aim is to find a model that has a generalized fit for both the training and the test data. The oval encompassing both the training and test error curves marks the point that would be suitable for a generalized model. This represents low bias and low variance.
Now let’s go ahead and see these scenarios of Overfitting and Underfitting with respect to Decision Tree and Random Forest.
Now let’s say we have the above decision tree, which has been split to its full depth based on its features. Splitting the tree to its complete depth is a typical Overfitting scenario: it gives us very good training results, but on test data it will not be as accurate. So we can say that a fully grown decision tree has low bias and high variance.
To mitigate such cases, we use methods like pruning, where we grow the decision tree only up to a certain depth to avoid Overfitting. This helps us convert the high variance into low variance.
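A minimal sketch of depth-limited pruning with scikit-learn (the dataset here is an illustrative assumption; cost-complexity pruning via the `ccp_alpha` parameter is another built-in option):

```python
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=300, noise=0.3, random_state=1)

# Grown to full depth: fits the training data perfectly (Overfitting)
full = DecisionTreeClassifier(random_state=1).fit(X, y)

# Limited to a certain depth: a simple form of pruning
pruned = DecisionTreeClassifier(max_depth=4, random_state=1).fit(X, y)

print("full tree   depth:", full.get_depth(), "leaves:", full.get_n_leaves())
print("pruned tree depth:", pruned.get_depth(), "leaves:", pruned.get_n_leaves())
```

The depth-limited tree has far fewer leaves, so it memorizes less of the training noise, trading a little bias for a large reduction in variance.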
There are also many hyperparameter tuning techniques to tune it further. We will learn those in detail in further posts on decision trees.
In case of Random Forest, we use multiple decision trees in parallel. Each individual tree still has the low bias and high variance property, but because we aggregate many trees trained in parallel, the high variance gets converted into low variance.
In the figure below, you can see that we have 60K records which have been split across the individual decision trees. The Random Forest algorithm aggregates the outputs of all the decision trees to give us the aggregated outcome.
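That aggregation can be sketched with scikit-learn's RandomForestClassifier; the dataset and sizes below are illustrative assumptions, not the 60K-record example from the figure:

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=600, noise=0.35, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)

# A single fully grown tree: low bias, high variance
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# 100 trees, each trained on a bootstrap sample of the records; their votes
# are aggregated, converting the individual trees' high variance into low variance
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

print("single tree test accuracy  :", tree.score(X_test, y_test))
print("random forest test accuracy:", forest.score(X_test, y_test))
```

Each tree overfits its own bootstrap sample, but their errors largely cancel when the votes are averaged, which is why the ensemble generalizes better than any single tree.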