R-Squared and Adjusted R-Squared

Introduction

R-Squared and Adjusted R-Squared are the key metrics used to check the accuracy of a regression model. We will understand each in detail in the subsequent sections.

There are various techniques to check the accuracy of different kinds of problems. For classification problems, we use the confusion matrix, F1-score, precision, recall, etc. You can check the detailed post on classification performance metrics here.

R-Squared

The formula for R-Squared is:

R^2 = 1 - \frac{SS_{res}}{SS_{tot}}

Where,

SS_{res} = residual sum of squares (the sum of squared differences between the actual and predicted values)

SS_{tot} = total sum of squares (the sum of squared differences between the actual values and their average)

Best Fit Line

To understand what SS_{res} is, let's look at the graph below.

The blue dots in the graph are the actual points. The double-ended arrow between each blue dot and the diagonal line (best fit line) shows the difference between the predicted and actual point. This is the error, or residual. The summation of the squares of all these differences between the actual and the predicted points is what we call SS_{res}.

SS_{res} = \sum_{i}\left(y_i - \hat{y}_i\right)^2

Here y_i are the actual points and \hat{y}_i are the predicted points.
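As a quick sanity check, here is a minimal sketch of SS_{res} in Python using NumPy. The actual and predicted values below are made-up illustrative numbers, and the predictions are assumed to come from whatever best fit line you have trained.

```python
import numpy as np

# Hypothetical actual values and predictions from a fitted best fit line
y_actual = np.array([3.0, 4.5, 6.1, 8.0, 9.8])
y_pred = np.array([3.2, 4.4, 6.0, 7.7, 10.1])

# SS_res: sum of squared differences between actual and predicted points
ss_res = np.sum((y_actual - y_pred) ** 2)
print(ss_res)
```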

Now let's understand SS_{tot} using the graph shown below.

Average Fit Line

In the above figure, you can see that instead of finding the best fit line, the average output line is taken. The blue dots in the graph are the actual points.

The double-ended arrow between each blue dot and the average output line gives the difference between the actual point and the average prediction. The summation of the squares of all these differences is what we call SS_{tot}.

SS_{tot} = \sum_{i}\left(y_i - \bar{y}\right)^2, where \bar{y} is the average of the actual values.
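Continuing the same sketch (same illustrative numbers), SS_{tot} needs only the actual values, because the prediction of the average fit line is simply their mean.

```python
import numpy as np

# Same illustrative actual values as in the SS_res sketch above
y_actual = np.array([3.0, 4.5, 6.1, 8.0, 9.8])

# SS_tot: sum of squared differences between each actual point and the mean
ss_tot = np.sum((y_actual - np.mean(y_actual)) ** 2)
print(ss_tot)
```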

So, substituting the values of SS_{res} and SS_{tot} in the R2 equation, we will usually get a value somewhere between 0 and 1.

R^2 = 1 - \frac{SS_{res}}{SS_{tot}}\,, \qquad \text{where typically } SS_{res} < SS_{tot} \text{ and hence } 0 < R^2 < 1

The logic behind this is that the error in SS_{tot} will generally be higher, since it is measured against a simple average fit.

Whereas the error for SS_{res}, measured against the best fit line, will be comparatively lower.

Therefore, \frac{SS_{res}}{SS_{tot}} will be a fraction smaller than 1, and subtracting it from 1 gives a value somewhere between 0 and 1.

If the R2 value is closer to 1, then our best fit line has fitted the data quite well.
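Putting the two pieces together, here is a minimal sketch that computes R2 from SS_{res} and SS_{tot} and cross-checks it against scikit-learn's r2_score (assuming scikit-learn is installed); the data are the same illustrative values as above.

```python
import numpy as np
from sklearn.metrics import r2_score

y_actual = np.array([3.0, 4.5, 6.1, 8.0, 9.8])
y_pred = np.array([3.2, 4.4, 6.0, 7.7, 10.1])

ss_res = np.sum((y_actual - y_pred) ** 2)
ss_tot = np.sum((y_actual - np.mean(y_actual)) ** 2)
r2_manual = 1 - ss_res / ss_tot

# Both values should agree
print(r2_manual, r2_score(y_actual, y_pred))
```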

But wait!! Can we encounter a scenario where the R2 value is less than 0?

Yes, the value of R2 can be less than 0 in cases where the output of the best fit line is worse than the average output line. That means SS_{res} > SS_{tot}.

Substituting these values into the R2 equation below:

R^2 = 1 - \frac{SS_{res}}{SS_{tot}} = 1 - \frac{\text{larger value}}{\text{smaller value}} = \text{negative value}

This means that the model that we have created is not at all a good model. Therefore R2 is used to check the goodness of fit.
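A quick sketch of that scenario: if the predictions are worse than simply predicting the mean, SS_{res} exceeds SS_{tot} and R2 turns negative. The prediction values below are deliberately bad, made-up numbers.

```python
import numpy as np

y_actual = np.array([3.0, 4.5, 6.1, 8.0, 9.8])
# Deliberately poor predictions: the trend is the opposite of the actual values
y_bad = np.array([9.0, 8.0, 6.0, 4.0, 3.0])

ss_res = np.sum((y_actual - y_bad) ** 2)
ss_tot = np.sum((y_actual - np.mean(y_actual)) ** 2)
print(1 - ss_res / ss_tot)  # negative, since ss_res > ss_tot
```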

Drawback of R2

There is a drawback to R2 which often makes it difficult to judge the accuracy of the model.

Let's say we have a simple linear regression model with one independent feature and the equation y = ax + b. Now we add a few more independent features to the model. The result is a multiple linear regression model with an equation somewhat like y = ax1 + bx2 + cx3 + d.

So, as the number of independent features increases, our R2 also increases.

How does the value of R2 increase?

Every time we add an independent feature, the linear regression algorithm assigns a coefficient to that feature. For example, the coefficients in the above equation are a, b, and c, which were added when the features x1, x2, and x3 were introduced to the model.

The linear regression algorithm assigns the coefficients in such a way that SS_{res} never increases (and usually decreases) whenever we add a new independent feature.

If we substitute this logic into the R2 equation:

R^2 = 1 - \frac{SS_{res}}{SS_{tot}} = 1 - \frac{\text{decreasing as features are added}}{SS_{tot}\ (\text{constant})} = \text{value creeping towards } 1

This sounds perfect, right? Not really!

As we increase the number of independent features in the model, the R2 value will keep on increasing even if the added features are not correlated with the dependent variable.

Chances are that a feature we include is completely irrelevant. It might have no relation to the dependent variable, yet it still gets a coefficient that contributes to the output, because the linear regression algorithm assigns a coefficient to every feature present in the model.

For example, suppose we are predicting the age of students and one of the features in our model is the students' contact number. This feature has no real relationship with age, but it may still receive a coefficient that contributes to the output, thereby increasing the overall R2 of the model.

This clearly means that R2 does not account for whether the independent features are actually correlated with the dependent variable. It simply increases (or at best stays the same) whenever we add a new feature to the model.
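Here is a small sketch of that behaviour on synthetic data: a pure-noise column is appended to the feature matrix, yet the training R2 reported by scikit-learn's LinearRegression does not go down (it typically creeps up slightly). The data and the noise column are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic data: y depends only on the single real feature
X = rng.normal(size=(100, 1))
y = 2.0 * X[:, 0] + rng.normal(scale=0.5, size=100)

r2_one = LinearRegression().fit(X, y).score(X, y)

# Append a feature that has nothing to do with y (pure noise)
X_noisy = np.hstack([X, rng.normal(size=(100, 1))])
r2_two = LinearRegression().fit(X_noisy, y).score(X_noisy, y)

print(r2_one, r2_two)  # r2_two >= r2_one on the training data
```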

To avoid being misled in such scenarios, we use Adjusted R2.

Adjusted R-Squared

The formula for Adjusted R-Squared is as follows:

\text{Adjusted } R^2 = 1 - (1 - R^2)\,\frac{N-1}{N-P-1}

Where,

R^2 = the R-squared value of the model

P = the number of independent features
N = the sample size of the dataset

Adjusted R2 has a penalizing factor. It penalizes the model for adding independent variables that do not contribute to it in any way or are not correlated with the dependent variable.
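A minimal helper that implements the formula above; the function name and arguments are just illustrative choices.

```python
def adjusted_r2(r2: float, n: int, p: int) -> float:
    """Adjusted R-squared for a model with p independent features
    fitted on n samples, given the ordinary R-squared value."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Example: an R-squared of 0.85 with 100 samples and 5 features
print(adjusted_r2(0.85, n=100, p=5))  # slightly below 0.85
```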

To understand this penalizing factor, let's look at 2 cases:

Case – I

Let's say we increase the number of independent features (P) in the model, and these new features do not really contribute to the model or are not correlated with the dependent variable.

Let's substitute this logic into the Adjusted R-Squared equation. The value of N-P-1 decreases as P increases. Thus, the value of \frac{N-1}{N-P-1} increases.

Now there is one thing we need to understand here. As we add new features, the R-Squared value will still increase, but only by an insignificant amount, because the newly added features are not correlated with the dependent variable. So, (1-R^2) will not decrease much, while \frac{N-1}{N-P-1} keeps increasing.

Now the product of \frac{N-1}{N-P-1} and (1-R^2) will therefore not be a smaller value; if anything, it grows.

Finally, subtracting it from 1 gives us a smaller value:

\text{Adjusted } R^2 = 1 - (1 - R^2)\,\frac{N-1}{N-P-1}

= 1 - (\text{increasing value, still less than } 1)

= \text{smaller value}

This is how Adjusted R-Squared penalizes the model when the features are not correlated with the dependent variable.
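To see Case I with concrete (purely illustrative) numbers: suppose adding a useless feature nudges R2 from 0.800 to only 0.802 while P goes from 3 to 4 on a sample of 50. The tiny gain does not offset the penalty, so Adjusted R2 drops.

```python
def adjusted_r2(r2, n, p):
    # Same helper as sketched earlier
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(adjusted_r2(0.800, n=50, p=3))  # before adding the useless feature (~0.787)
print(adjusted_r2(0.802, n=50, p=4))  # after: drops to ~0.784
```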

Case – II

Now let's say we are adding features which are strongly correlated with the dependent variable. In this case R2 will rise sharply, and its increase will outweigh the growth of the \frac{N-1}{N-P-1} factor.

So, (1-R^2) will be much smaller, and even when multiplied by the larger \frac{N-1}{N-P-1} factor the product remains small. Subtracting this from 1 gives an Adjusted R-Squared that is higher than in the previous case.

Substituting this logic into the Adjusted R-Squared equation:

\text{Adjusted } R^2 = 1 - (1 - R^2)\,\frac{N-1}{N-P-1}

= 1 - (\text{much smaller value})\times(\text{larger penalty factor})

= 1 - \text{small value} = \text{increased Adjusted } R^2 \text{ value}

So this signifies that when the independent features are correlated with the dependent variable, the Adjusted R-Squared value goes up.
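The same substitution for Case II, again with made-up numbers: a genuinely correlated feature pushes R2 from 0.80 to 0.90, which more than offsets the penalty, so Adjusted R2 rises.

```python
def adjusted_r2(r2, n, p):
    # Same helper as sketched earlier
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(adjusted_r2(0.80, n=50, p=3))  # before adding the correlated feature (~0.787)
print(adjusted_r2(0.90, n=50, p=4))  # after: rises to ~0.891
```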

Conclusion

  1. Whenever we add an independent feature to the model, the R-Squared value will increase (or at least stay the same), even if the feature is not correlated with the dependent variable; it will never decrease. On the other hand, Adjusted R-Squared increases only when the added feature is genuinely correlated with the dependent variable.
  2. The value of Adjusted R-Squared will always be less than or equal to the R-Squared value.
