In this post, we will discuss Covariance and Correlation. Both play an important role in feature selection.
Covariance, as the name suggests, measures how 2 variables vary together. When we have one variable we talk about its variance, but with 2 variables we use Covariance to measure how they vary together. Covariance is a very important concept in data analysis, and we will discuss it in greater detail in the subsequent sections. We will also discuss some limitations of Covariance and how we could mitigate them using Correlation, which defines the strength of the relation between two sets of data.
i. Correlation is positive when both values increase together.
ii. Correlation is negative when one value decreases as the other increases.
We will also look in detail at the following correlation measures:
- Pearson’s Correlation Coefficient
- Spearman’s Rank Correlation Coefficient
Let’s try to understand Covariance using an example. Let’s consider the age and salary of individuals in a town.
In the table above, you can see a positive relation between Age and Salary: as the Age increases, the Salary also increases.
So, the equation for covariance is specified below:
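Written out in the notation used in this post, and assuming the population form that divides by n (the sample form divides by n - 1 instead), the standard definition is:

$$\operatorname{Cov}(X, Y) = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu_x)(y_i - \mu_y)$$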
Where xi (i = 1 … n) are the individual ages and yi (i = 1 … n) are the individual salaries. µx and µy are the means of Age and Salary respectively.
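As a minimal sketch, the covariance calculation could look like this in Python. It assumes the population form (dividing by n). The ages and salaries are made-up illustration values rather than the table from this post, and the function name `covariance` is our own.

```python
def covariance(xs, ys):
    """Population covariance: mean of the products of deviations from the means."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("xs and ys must be non-empty and of equal length")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n


# Hypothetical data (not from the post): salary rises with age.
ages = [25, 30, 35, 40, 45]
salaries = [30000, 42000, 50000, 61000, 72000]

cov_age_salary = covariance(ages, salaries)
var_age = covariance(ages, ages)  # covariance of a variable with itself is its variance

print(cov_age_salary)  # positive, because age and salary rise together
print(var_age)
```

Passing the same list twice gives the variance of that variable, which is the link between the two equations discussed next.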
If you look at the equation for covariance closely, you can see that it is quite similar to the equation for Variance. The only difference is that Variance uses one variable, whereas Covariance uses 2 variables.
Let’s derive the Variance equation to understand this concept more clearly:
Take the above equation and, instead of 2 variables, use the same variable in both places:
So, the above equation looks similar to the Covariance equation. We could rewrite the equation as follows:
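Concretely, setting Y = X in the standard covariance definition (again assuming the population form with 1/n) gives the variance:

$$\operatorname{Cov}(X, X) = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu_x)(x_i - \mu_x) = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu_x)^2 = \operatorname{Var}(X) = \sigma_x^2$$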
Positive and Negative Covariance
Using the equation of covariance stated above, we can find out whether the covariance is positive or negative.
Suppose that with the increase in Age (X) the Salary (Y) also increases; then we have a positive covariance.
If instead the Salary (Y) decreases with the increase in Age (X), then we have a negative covariance.
Let’s plot a graph to better understand this relation.
1. In the graph above, look at the blue dot: both its x and y coordinates are greater than their respective means. So if you put the values in the covariance equation, both (xi - µx) and (yi - µy) are positive, and so is their product.
So, this basically means that if the random variable X increases and Y increases along with it, then we have a positive Covariance.
2. Now suppose we have a scenario where the random variable X increases but the value of Y decreases. In that case we will have a negative Covariance.
You can see that in the graph below, where the value of X is above its mean but the value of Y is below its mean, which means that with the increase in X, the value of Y has decreased.
Putting that in the equation, (xi - µx) is positive and (yi - µy) is negative, so we get a negative value for Covariance.
The cases above show when the covariance is negative and when it is positive. But there is something we should notice here: Covariance gives us the sign of the relation, not how strong it is. Its magnitude depends on the units of X and Y, so it does not tell us how positive or how negative the relation is.
To mitigate this limitation, we use another metric, which is Pearson’s Correlation Coefficient.
The key takeaway for Covariance is that it helps us understand the relationship between 2 variables in a dataset. So, if a certain column value in our dataset is increasing and at the same time another column value is also increasing, then we have a positive covariance. And if one column value is increasing while another column value is decreasing, then we have a negative covariance.
Now that we have understood the concept of Covariance, let’s understand how we could mitigate its limitation using Pearson’s Correlation Coefficient.
i. Pearson Correlation Coefficient
Pearson Correlation Coefficient measures the linear relationship between 2 sets of data, i.e. how well the relation between them can be described by a straight line.
The formula for Pearson Correlation Coefficient is:
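In its standard form, using the covariance and the standard deviations described in this section:

$$\rho_{X,Y} = \frac{\operatorname{Cov}(X, Y)}{\sigma_x \, \sigma_y}$$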
Where σx, σy are the standard deviations for x and y.
As discussed above in the Covariance section, if one of the 2 variables increases as the other increases, then we have a positive covariance. So Covariance gives us the Direction of the relationship, i.e. whether it is positive or negative.
Pearson Correlation Coefficient has an added advantage. Because it divides by the standard deviations of X and Y (σx, σy), it can also tell us the Strength of the correlation between X and Y.
It also tells us the Direction of the relationship between X and Y.
So the basic difference between Covariance and Pearson Correlation Coefficient is this: Covariance tells us the Direction of the relationship, but not the Strength of the positive or negative relation between X and Y.
With Pearson Correlation Coefficient we get both, because we divide the Covariance by the product of the standard deviations of X and Y.
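To make the difference concrete, here is a small sketch showing that changing the unit of one variable changes the covariance but leaves the Pearson coefficient alone. The data is hypothetical, the helper names `covariance` and `pearson` are our own, and the population form (dividing by n) is assumed.

```python
from math import sqrt


def covariance(xs, ys):
    """Population covariance (divides by n)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n


def pearson(xs, ys):
    """Covariance divided by the product of the two standard deviations."""
    std_x = sqrt(covariance(xs, xs))
    std_y = sqrt(covariance(ys, ys))
    return covariance(xs, ys) / (std_x * std_y)


# Hypothetical data: the same salaries expressed in two different units.
ages = [25, 30, 35, 40, 45]
salaries = [30000, 42000, 50000, 61000, 72000]
salaries_in_thousands = [s / 1000 for s in salaries]

cov_units = covariance(ages, salaries)
cov_thousands = covariance(ages, salaries_in_thousands)
r_units = pearson(ages, salaries)
r_thousands = pearson(ages, salaries_in_thousands)

print(cov_units, cov_thousands)  # the covariance shrinks by a factor of 1000
print(r_units, r_thousands)      # the correlation is the same up to rounding
```

The relationship between age and salary is identical in both cases. Only the correlation reflects that.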
Range of Pearson Correlation Coefficient
The value of the correlation coefficient always lies between -1 and 1.
Let’s try to understand this concept by taking a few examples where the correlation coefficient varies between -1 and 1.
1. Let’s say we have a scenario where X increases and Y increases along with it, and the values lie on a straight line as shown below. In that case the value of the Pearson Correlation Coefficient will always be 1.
2. Let’s say we have a scenario where X increases and Y decreases, and the values lie on a straight line as shown below. In that case the value of the Pearson Correlation Coefficient will always be -1.
3. Let’s consider a case where we don’t have any kind of relation between X and Y. The points are scattered everywhere. In that case the value of the Pearson Correlation Coefficient will be close to 0.
4. In the below scenario, the X and Y values are negatively correlated: as the X value increases, the Y value decreases. However, the points don’t all fall on a straight line. This means that the value of the Pearson Correlation Coefficient lies between -1 and 0, without reaching -1.
5. In the below scenario, the X and Y values are positively correlated: as the X value increases, the Y value also increases. However, the points don’t all fall on a straight line. This means that the value of the Pearson Correlation Coefficient lies between 0 and 1, without reaching 1.
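The five scenarios above can be reproduced with small datasets. The values below are hand-picked for illustration and are not the data behind the figures in this post. `pearson` is our own helper name.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson coefficient; the 1/n factors cancel, so plain sums are enough."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


x = [1, 2, 3, 4, 5]
scenarios = {
    "perfect positive line": [3, 5, 7, 9, 11],    # y = 2x + 1
    "perfect negative line": [7, 4, 1, -2, -5],   # y = 10 - 3x
    "no linear relation": [2, 1, 3, 1, 2],
    "negative, not on a line": [5, 3, 4, 1, 2],
    "positive, not on a line": [1, 3, 2, 5, 4],
}

results = {name: pearson(x, y) for name, y in scenarios.items()}
for name, r in results.items():
    print(f"{name}: {r:.3f}")
```

The two straight-line cases come out at 1 and -1 (up to floating-point rounding), and the two noisy cases fall strictly inside the range.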
Now that we have discussed the different scenarios and the values that are possible, let’s understand why Pearson’s Correlation Coefficient is used and where it can be used.
Importance of Correlation Coefficient
i. Feature Selection using Pearson correlation coefficient
Correlation Coefficient is commonly used in Feature Selection. Let’s take an example to see how it could be used there.
Let’s consider 2 variables, X and Y.
Suppose X and Y are both independent features in the dataset, and we find that the correlation between them is 1, which means that when X increases, Y always increases along a straight line.
A correlation of 1 means that X and Y carry the same information, so we can drop one of the two features and apply a machine learning algorithm to the remaining ones. (If Y were the outcome variable/label instead, a correlation close to 1 or -1 would mark X as a strong linear predictor worth keeping.)
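One common way to apply this is to drop any feature that is almost perfectly correlated with a feature we have already kept. The sketch below assumes a cutoff of 0.95 on the absolute correlation, which is an arbitrary choice for illustration. The feature names, the data and the helper `drop_redundant_features` are all hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def drop_redundant_features(features, threshold=0.95):
    """Keep a feature only if |r| with every already-kept feature is below threshold."""
    kept = {}
    for name, values in features.items():
        if all(abs(pearson(values, other)) < threshold for other in kept.values()):
            kept[name] = values
    return kept


# Hypothetical feature columns: age_months repeats the information in age_years.
features = {
    "age_years": [25, 30, 35, 40, 45],
    "age_months": [300, 360, 420, 480, 540],
    "commute_km": [12, 3, 9, 4, 10],
}

selected = drop_redundant_features(features)
print(list(selected))  # age_months is dropped as redundant
```

Which of two redundant features survives depends on the order in which they are visited, so in practice you might prefer to keep the one that is easier to interpret.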
Now that we have discussed the concepts and different scenarios involving Pearson Correlation Coefficient and how useful it is for Feature Selection, let’s go ahead and discuss Spearman’s Rank Correlation Coefficient and the limitation of Pearson Correlation Coefficient that it addresses.
ii. Spearman’s Rank Correlation Coefficient
Suppose we have X and Y values that are positively correlated, so that as the X value increases the Y value also increases, but the relation between them is non-linear.
The figure below shows a graph that illustrates this:
In the above graph, X and Y are positively correlated. However, if we apply Pearson Correlation to it, the value is 0.88, whereas Spearman correlation gives a result of 1.
So, Spearman Correlation has an advantage over Pearson Correlation when the relation between 2 attributes is monotonic but non-linear.
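A quick way to see this effect is to compare both coefficients on data that always increases but not along a straight line. The sketch below uses y = x³ as a hypothetical example, so the Pearson value will not match the 0.88 from the figure. The `ranks` helper is our own and assumes there are no tied values.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def ranks(values):
    """Ascending ranks starting at 1; assumes all values are distinct."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for position, index in enumerate(order, start=1):
        result[index] = position
    return result


# Hypothetical monotonic but non-linear data.
x = list(range(1, 11))
y = [v ** 3 for v in x]

pearson_r = pearson(x, y)
spearman_r = pearson(ranks(x), ranks(y))  # Spearman = Pearson applied to the ranks

print(pearson_r)   # below 1, because the points do not lie on a line
print(spearman_r)  # 1 up to rounding, because the ranks agree exactly
```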
Concept behind Spearman Rank Correlation Coefficient
We can also write the equation as follows:
- ρ denotes the Pearson Correlation Coefficient, which is applied to the ranks of X and Y here.
- cov(rgx , rgy) denotes the covariance of the ranks of X and Y.
- σrgx , σrgy are the standard deviations of the ranks of X and Y.
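Putting the three symbols above together, the standard definition of the Spearman coefficient (written here as r_s, a label this post does not use itself) is:

$$r_s = \rho_{\operatorname{rg}_X, \operatorname{rg}_Y} = \frac{\operatorname{cov}(\operatorname{rg}_X, \operatorname{rg}_Y)}{\sigma_{\operatorname{rg}_X} \, \sigma_{\operatorname{rg}_Y}}$$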
Scenarios while finding the correlation using Spearman Rank Correlation Coefficient
We can encounter 2 types of scenarios while finding the correlation between 2 attributes using Spearman Rank correlation coefficient method.
Case 1: When all the ranks are distinct integers, i.e. there are no ties in the ranks, we can compute the correlation using the below formula:
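This is the standard shortcut formula for untied ranks (r_s is our label for the coefficient):

$$r_s = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}$$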
Where di = rg(Xi) - rg(Yi) is the difference between the ranks of X and Y for observation i, and n is the number of observations.
This sounds a bit abstract and complicated, but trust me it’s not. We will take an example to understand this concept better.
Example - When there are no ties in the ranks
The above formula is a form of Pearson Correlation; the only difference is that it is applied to the ranks of X and Y.
Let’s try to understand ranks using the example below. We have taken this example from Wikipedia. Feel free to go and check it out there as well.
| IQ (Xi) | Hours of TV per week (Yi) |
To apply the Spearman Rank formula, we need to follow the steps below:
- Sort the data by the 1st column.
- Create a separate column xi to hold the ranks of the sorted values of the 1st column: Rank (xi).
- Similarly, create a separate column yi and assign ranks to the values of the 2nd column: Rank (yi).
- Now, create a column for the difference between the 2 rank columns: di.
- Lastly, create a column for the square of the difference between the 2 rank columns: di².
| IQ (Xi) | Hours of TV per week (Yi) | Rank (xi) | Rank (yi) | di | di² |
As you can see above, there are no ties in the ranks. So we can use the below formula to find the correlation.
= – 29/165 = -0.175757575
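The steps can be checked with a short script. The IQ and TV values below are the ones used in the Wikipedia example that this post cites; we assume they are the same rows as in the table above, and they do reproduce -29/165. The `ranks` helper is our own and relies on there being no ties.

```python
from fractions import Fraction

# Values from the Wikipedia example (assumed to match the table in this post).
iq = [106, 100, 86, 101, 99, 103, 97, 113, 112, 110]
tv_hours = [7, 27, 2, 50, 28, 29, 20, 12, 6, 17]


def ranks(values):
    """Ascending ranks starting at 1; assumes all values are distinct."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for position, index in enumerate(order, start=1):
        result[index] = position
    return result


rank_x = ranks(iq)
rank_y = ranks(tv_hours)

# Squared rank differences, one per observation.
d_squared = [(rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y)]
n = len(iq)

# No-ties formula, kept as an exact fraction to avoid rounding.
rho = 1 - Fraction(6 * sum(d_squared), n * (n ** 2 - 1))

print(sum(d_squared))  # sum of squared rank differences
print(rho)             # exact value as a fraction
print(float(rho))      # decimal approximation
```

Using `Fraction` keeps the result exact, so it can be compared directly with the hand calculation.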
Conclusion from the Outcome
From the above result we can say that IQ and Hours of TV per week are negatively correlated. As the value is close to zero, the correlation between them is very weak. The negative sign suggests that, in this sample, people who watch more hours of TV per week tend to have a slightly lower IQ.
Case 2: When there are ties in the ranks. Let’s take an example to better understand this:
Suppose we have the following records as shown in the table below:
Now to find the Spearman Correlation we need to follow the steps below:
1. Sort the data by the 1st column.
2. Create a separate column xi to hold the ranks of the sorted values of the 1st column: Rank (xi). We can rank in either ascending or descending order, as long as we use the same order for both X and Y.
3. If a value occurs more than once in the 1st column, add up the positions (index values) of the tied entries and divide by the number of entries that share that value.
Ex., there are 2 positions in column X which have the same value of 30. Suppose those two 30’s are in the 1st and 2nd positions respectively.
The rank would be (1+2)/2, which is equal to 1.5. So we would assign a rank of 1.5 to both occurrences of 30. The same approach is followed for the values in Y as well.
4. For the next element, the rank is its actual position/index value. If it is tied with other values, we apply step 3 again.
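The tie-handling steps above can be sketched as a small function. The input list is hypothetical (it only borrows the repeated values 30 and 23 mentioned in this section), the function name `average_ranks` is our own, and ranking is done in ascending order.

```python
def average_ranks(values):
    """Rank in ascending order; tied values share the mean of the positions they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the group while the next sorted value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        # Mean of the 1-based positions start+1 .. end+1.
        shared_rank = (start + 1 + end + 1) / 2
        for k in range(start, end + 1):
            result[order[k]] = shared_rank
        start = end + 1
    return result


# Hypothetical column with two tied pairs (23 and 30).
x = [30, 45, 30, 23, 50, 23]
x_ranks = average_ranks(x)
print(x_ranks)
```

Here the two 23's occupy positions 1 and 2 and both get rank 1.5, while the two 30's occupy positions 3 and 4 and both get rank 3.5.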
| X | Rx | Y | Ry | D = (Rx - Ry) | D² = (Rx - Ry)² |
We will use the below formula to calculate the correlation coefficient.
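Assuming the formula meant here is the common textbook tie-corrected form, which matches the symbols m and N used in this section, it reads as follows. The inner sum runs over every group of tied values in either column, and m_j is the size of group j:

$$r_s = 1 - \frac{6\left[\sum D^2 + \frac{1}{12}\sum_{j} \left(m_j^3 - m_j\right)\right]}{N(N^2 - 1)}$$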
In the above equation,
1. m denotes the number of times a particular value is repeated.
Example: 30 is repeated twice in column X, so (m³ - m) will be (2³ - 2) = 6
Similarly, 23 is repeated twice in column X, so (m³ - m) will be (2³ - 2) = 6
We repeat this for every group of repeated values.
2. N denotes the number of records.
Let’s solve the above problem by substituting the values into the above formula:
Conclusion from the Outcome
From the above result we can say that the attributes X and Y are negatively correlated. As the value is close to zero, the correlation between X and Y is very weak. The negative sign suggests that X tends to be lower for records that have higher Y values.
The concepts of Covariance and Correlation are used in data preprocessing and exploratory data analysis.