Multicollinearity for numerical features.

In [73]:
import pandas as pd
import statsmodels.api as sm

1. Import the dataset

In [91]:
df = pd.read_csv("Advertising.csv",index_col=0)
In [92]:
df.head()
Out[92]:
TV radio newspaper sales
1 230.1 37.8 69.2 22.1
2 44.5 39.3 45.1 10.4
3 17.2 45.9 69.3 9.3
4 151.5 41.3 58.5 18.5
5 180.8 10.8 58.4 12.9

2. Selecting the dependent and the independent features

In [93]:
X = df.iloc[:,:-1]
y = df.iloc[:,-1]
In [94]:
X.head()
Out[94]:
TV radio newspaper
1 230.1 37.8 69.2
2 44.5 39.3 45.1
3 17.2 45.9 69.3
4 151.5 41.3 58.5
5 180.8 10.8 58.4
In [95]:
y.head()
Out[95]:
1    22.1
2    10.4
3     9.3
4    18.5
5    12.9
Name: sales, dtype: float64

3. Fitting a Linear Regression model to see the coefficients and the intercept

In [96]:
from sklearn.linear_model import LinearRegression
Linear = LinearRegression()
Linear.fit(X,y)
print("Coefficients: ",Linear.coef_)
print("Intercepts: ",Linear.intercept_)
Coefficients:  [ 0.04576465  0.18853002 -0.00103749]
Intercepts:  2.9388893694594085

4. For the OLS model we have to explicitly add the intercept column to the dataset

In [97]:
X = sm.add_constant(X)
In [98]:
X.head()
Out[98]:
const TV radio newspaper
1 1.0 230.1 37.8 69.2
2 1.0 44.5 39.3 45.1
3 1.0 17.2 45.9 69.3
4 1.0 151.5 41.3 58.5
5 1.0 180.8 10.8 58.4

5. Fitting the OLS model

In [99]:
model = sm.OLS(y,X).fit()

6. Let's try to understand in detail what the model summary tells us about the model

If you notice, the coefficients and the intercept of the LinearRegression model and the OLS model are the same, so there is no difference between the two fits. The main advantage of the OLS model is that it gives us a summary report of the model.
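
To see this side by side, here is a minimal check, assuming the Linear and model objects fitted above, that prints both sets of estimates:

print("sklearn intercept, coefs:", Linear.intercept_, Linear.coef_)
print("statsmodels params      :", model.params.values)   # const, TV, radio, newspaper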

The summary is a comprehensive report on the model: the main parameters to look at and the different statistical tests that are performed to validate whether a feature is necessary for the model. All of these important things are covered in the OLS model summary report.

So first, let's start with the R-Squared metric and see what it says about the model.

R-Squared and Adjusted R-Squared

R-Squared is the proportion of variance in the dependent variable that is predictable from the independent variables. The main idea to understand is that if the Linear Regression model fits really well, we'll have an R-Squared value close to 1. There is also a common misconception that the R-Squared value cannot be negative, but that's not true: if the model fits worse than simply predicting the mean of the target (which can happen, for example, when the model is fit without an intercept or evaluated on data it was not trained on), the R-Squared can come out negative.

Another thing we need to check is the Adjusted R-Squared value. If we keep adding features to the model for better prediction accuracy, the R-Squared value will only ever stay the same or move closer to 1, even when the new features are useless. To counter this problem the Adjusted R-Squared is used, which penalizes the R-Squared for including predictors that do not help the model. So if we add features that have very little relevance to our model, the Adjusted R-Squared value will go down. If the Adjusted R-Squared is much less than the R-Squared value, it's a sign that some variable might not be relevant to the overall model, so we need to find that variable and remove it.
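
As a rough sanity check, both values can be recomputed by hand from the fitted model: R-Squared is 1 - SS_res/SS_tot, and Adjusted R-Squared is 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the number of observations and p the number of predictors. A minimal sketch, assuming the model object fitted above:

# R-squared by hand: 1 - residual sum of squares / total sum of squares
ss_res = ((y - model.fittedvalues) ** 2).sum()
ss_tot = ((y - y.mean()) ** 2).sum()
r2 = 1 - ss_res / ss_tot

# Adjusted R-squared additionally penalizes the number of predictors p
n, p = len(y), 3
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(round(r2, 3), round(adj_r2, 3))   # ~0.897 and ~0.896, matching the summary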

In our case the R-Squared value is 0.897 and the Adjusted R-Squared is 0.896. So it's a very well fitted model, and we don't have any feature that is dragging the model's performance down out of proportion.

So, a couple of things to keep in mind:

  1. First, check the R-Squared to see how close it is to 1. If it's close, our model is fitted well.
  2. Check whether the Adjusted R-Squared and the R-Squared are very different. If they are close, we have selected relevant features. If they are far apart, chances are we have included a feature that is less relevant.

F-Statistic

The F-statistic is used to assess the significance of the overall regression model. To understand this metric we need to consider 2 cases:

  1. Model 1: The model has no features and only the intercept, and is therefore called the intercept-only model
  2. Model 2: The model has all the features (in our case: TV, radio, newspaper)

Now let's state the Null hypothesis (H0) and the Alternate hypothesis (H1):

H0: The two models above fit the data equally well (the feature coefficients are all zero).
H1: The intercept-only model (Model 1) is worse than Model 2.

Based on this hypothesis test, we get back a p-value which helps us decide whether to reject the null hypothesis.

From the summary below, we can see that the p-value is close to 0 and the F-statistic is really large, so we can reject the null hypothesis H0. There is therefore clear evidence of a linear relationship between the features TV, radio, newspaper and the target variable, sales.

So, an F-statistic much greater than 1, along with a p-value of less than 0.05, signifies that there is a meaningful linear relationship between the feature variables and the target variable.
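
The same number can be checked by hand: the F-statistic for the overall regression can be written as (R²/p) / ((1 - R²)/(n - p - 1)). A minimal sketch, assuming the model object fitted above:

n, p = len(y), 3                        # 200 observations, 3 predictors
r2 = model.rsquared
F = (r2 / p) / ((1 - r2) / (n - p - 1))
print(round(F, 1))                      # ~570, matching the summary
print(model.fvalue, model.f_pvalue)     # the same statistic and its p-value from statsmodels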

T-test

If we want to see whether a particular variable is significant or relevant to the target variable, we perform a t-test. The t-test checks the relationship between the target variable and each predictor variable separately.

Rather than assessing all the features jointly the way the F-test does, it tests one coefficient at a time and checks whether that feature contributes to explaining the target, given the other features in the model.

It basically works like this:

T-test 1 --> Feature 1 (TV) and target
T-test 2 --> Feature 2 (radio) and target
T-test 3 --> Feature 3 (newspaper) and target

Given the hypotheses below, we perform the t-test for each feature:

Null Hypothesis (H0): The coefficient of the feature is 0.
Alternate Hypothesis (H1): The coefficient of the feature is not equal to 0.

A higher absolute t-value (and correspondingly lower p-value) signifies that we can reject the null hypothesis and accept the alternate hypothesis.

In our case, if we look at the p-value column, we can see that the constant term and the TV and radio features all have a p-value of 0.000. All of these features have a p-value less than 0.05, the threshold that corresponds to testing at a 95% confidence level. Therefore we reject the null hypothesis and accept the alternate hypothesis, stating that the coefficients of these features are non-zero.

Now if we take a look at the newspaper feature, the t-value is very small and the p-value is very high, at 0.860. As we know, if the p-value is less than 0.05 we reject the null hypothesis; here it is greater than 0.05, so we cannot reject the null hypothesis. Under the null hypothesis, the coefficient of the feature is 0. This means the newspaper feature seems to be irrelevant for our Linear Regression model, so we can ignore or drop this feature.
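
The t-values and p-values shown in the summary can also be pulled directly from the fitted model, and each t-value is simply the coefficient divided by its standard error. A minimal sketch, assuming the model object fitted above:

print(model.tvalues)   # t-statistics for const, TV, radio, newspaper
print(model.pvalues)   # corresponding p-values

# e.g. for newspaper: coefficient / standard error gives the t-value
print(model.params["newspaper"] / model.bse["newspaper"])   # ~ -0.18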

In [100]:
model.summary()
Out[100]:
OLS Regression Results
Dep. Variable: sales R-squared: 0.897
Model: OLS Adj. R-squared: 0.896
Method: Least Squares F-statistic: 570.3
Date: Sat, 25 Jul 2020 Prob (F-statistic): 1.58e-96
Time: 15:43:21 Log-Likelihood: -386.18
No. Observations: 200 AIC: 780.4
Df Residuals: 196 BIC: 793.6
Df Model: 3
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const 2.9389 0.312 9.422 0.000 2.324 3.554
TV 0.0458 0.001 32.809 0.000 0.043 0.049
radio 0.1885 0.009 21.893 0.000 0.172 0.206
newspaper -0.0010 0.006 -0.177 0.860 -0.013 0.011
Omnibus: 60.414 Durbin-Watson: 2.084
Prob(Omnibus): 0.000 Jarque-Bera (JB): 151.241
Skew: -1.327 Prob(JB): 1.44e-33
Kurtosis: 6.332 Cond. No. 454.


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

7. Checking the correlation between features. From the table below we can see that none of the features are strongly correlated with each other, which means there is no real multicollinearity problem.

In [101]:
X.iloc[:,1:].corr()
Out[101]:
TV radio newspaper
TV 1.000000 0.054809 0.056648
radio 0.054809 1.000000 0.354104
newspaper 0.056648 0.354104 1.000000
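
Beyond pairwise correlations, a common way to quantify multicollinearity directly is the variance inflation factor (VIF): a VIF near 1 means a feature is nearly uncorrelated with the other features, while values above roughly 5-10 are a warning sign. A minimal sketch using statsmodels, assuming X still contains the constant column added earlier:

from statsmodels.stats.outliers_influence import variance_inflation_factor

vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
print(vif)   # look at TV, radio, newspaper; values close to 1 indicate little collinearity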

8. Dropping the newspaper feature and checking the model summary report

In [109]:
cols = ['const','TV','radio']
Xvar = X[cols]
In [110]:
Xvar.head()
Out[110]:
const TV radio
1 1.0 230.1 37.8
2 1.0 44.5 39.3
3 1.0 17.2 45.9
4 1.0 151.5 41.3
5 1.0 180.8 10.8
In [111]:
model = sm.OLS(y,Xvar).fit()

9. If you check the report, there has not been much change in the overall model after dropping the newspaper feature

In [112]:
model.summary()
Out[112]:
OLS Regression Results
Dep. Variable: sales R-squared: 0.897
Model: OLS Adj. R-squared: 0.896
Method: Least Squares F-statistic: 859.6
Date: Sat, 25 Jul 2020 Prob (F-statistic): 4.83e-98
Time: 17:47:40 Log-Likelihood: -386.20
No. Observations: 200 AIC: 778.4
Df Residuals: 197 BIC: 788.3
Df Model: 2
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const 2.9211 0.294 9.919 0.000 2.340 3.502
TV 0.0458 0.001 32.909 0.000 0.043 0.048
radio 0.1880 0.008 23.382 0.000 0.172 0.204
Omnibus: 60.022 Durbin-Watson: 2.081
Prob(Omnibus): 0.000 Jarque-Bera (JB): 148.679
Skew: -1.323 Prob(JB): 5.19e-33
Kurtosis: 6.292 Cond. No. 425.


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
In [113]:
Xvar.iloc[:,1:].corr()
Out[113]:
TV radio
TV 1.000000 0.054809
radio 0.054809 1.000000
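
As a final side-by-side check, the two fits can be compared on Adjusted R-Squared and AIC; a minimal sketch, assuming X (with newspaper) and Xvar (without it) as defined above:

model_full = sm.OLS(y, X).fit()          # const, TV, radio, newspaper
model_reduced = sm.OLS(y, Xvar).fit()    # const, TV, radio
print("Adj. R-squared:", round(model_full.rsquared_adj, 3), "vs", round(model_reduced.rsquared_adj, 3))
print("AIC:           ", round(model_full.aic, 1), "vs", round(model_reduced.aic, 1))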