**Introduction**

In this post, we will discuss **Covariance** and **Correlation**. These play an important role while doing feature selection.

**Covariance**, as the name suggests, is a measure of how 2 variables vary when they are taken together. When we have one variable we call it variance, but in the case of 2 variables we call it **Covariance**, to measure how the 2 variables vary together. **Covariance** is considered a very important concept when it comes to data analysis. We will discuss this in greater detail in the subsequent sections. We will also discuss some limitations of **Covariance** and how we can mitigate them using **Correlation**, which defines the strength of the relation between two sets of data.

i. **Correlation** is positive when both values increase together.

ii. **Correlation** is negative when one value decreases as the other increases.

We will also get into the details of the below types of **Correlation**:

i. **Pearson's Correlation Coefficient**
ii. **Spearman's Rank Correlation Coefficient**

**Covariance**

Let’s try to understand **Covariance** using an example. Let’s consider the age and salary of Individuals in a town.

| Age | Salary ($) |
|-----|------------|
| 22  | 40,000     |
| 25  | 45,000     |
| 39  | 150,000    |
| 45  | 200,000    |
| 33  | 120,000    |

So, if you look at the above table, you can see a positive relation between **Age** and **Salary**: as **Age** increases, **Salary** also increases.

So, the equation for covariance is specified below:

**cov(X, Y) = (1/n) Σ (x_i – µ_x)(y_i – µ_y)**

Where **x_{i…n}** are the individual ages, **y_{i…n}** are the individual salaries, and **µ_x, µ_y** are the means of Age and Salary respectively.

If you look closely at the equation for covariance, you can see that it's quite similar to what we do in the case of **Variance**; the only difference is that in the case of **Variance** we use one variable, while in **Covariance** we use 2 variables.
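To make this concrete, here is a small sketch that computes the covariance of the Age/Salary table above by hand and checks it against NumPy (the variable names are my own):

```python
import numpy as np

ages = np.array([22, 25, 39, 45, 33], dtype=float)
salaries = np.array([40_000, 45_000, 150_000, 200_000, 120_000], dtype=float)

# Population covariance: the mean of the products of deviations from the means
manual_cov = np.mean((ages - ages.mean()) * (salaries - salaries.mean()))

# np.cov uses the sample (n-1) denominator by default; bias=True gives
# the population (n) denominator used in the manual computation above
numpy_cov = np.cov(ages, salaries, bias=True)[0, 1]

print(manual_cov, numpy_cov)  # both positive: Age and Salary rise together
```

A positive result here matches what the table suggests visually: older individuals in this sample earn more.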

**Variance**

Let’s derive the **Variance** equation to understand this concept in a clearer way.

If you take the covariance equation above and, instead of 2 variables, use the same variable twice, you get:

**Var(X) = (1/n) Σ (x_i – µ_x)(x_i – µ_x)**

So the above equation looks similar to the **Covariance** equation. We could rewrite it as follows:

**Var(X) = (1/n) Σ (x_i – µ_x)²**
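This relationship, that variance is just the covariance of a variable with itself, can be verified numerically (a small sketch, reusing the ages from the earlier table):

```python
import numpy as np

x = np.array([22, 25, 39, 45, 33], dtype=float)

# Population variance vs. covariance of x with itself
var_x = np.var(x)                       # (1/n) * sum((x_i - mean)^2)
cov_xx = np.cov(x, x, bias=True)[0, 1]  # same quantity via the covariance formula

print(var_x, cov_xx)  # the two values agree
```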

**Positive and Negative Covariance**

Using the equation of **covariance** stated above, we can find whether the covariance is positive or negative.

Suppose with the increase in Age (X) the Salary (Y) also increases; then we have a positive covariance.

In another case, suppose with the increase in Age (X) the Salary (Y) decreases; then we have a negative covariance.

Let’s plot a graph to better understand this relation.

**Positive Covariance**

1. From the above graph, if you look at the blue dot, both the **x** and **y** coordinates are greater than their respective means. So if you put the values in the covariance equation, the product (x_i – µ_x)(y_i – µ_y) comes out positive.

So this basically means that if the random variable **X** increases and along with that **Y** also increases, then we have a positive **Covariance**.

**Negative Covariance**

2. Now suppose we have a scenario where the random variable **X** increases but the value of **Y** decreases. In that case we will have a negative **Covariance**.

You can see in the graph below that the value of **X** is above its mean while the value of **Y** is below its mean, which means that with the increase in **X**, the value of **Y** has decreased.

Putting that in the equation, (x_i – µ_x) is positive while (y_i – µ_y) is negative, so we get a negative value for **Covariance**.

From the above cases we learned about negative and positive covariance. But there is something we should notice here: **Covariance** tells us only the sign of the relationship, not how strong that positive or negative relationship is, because its magnitude depends on the units and scale of the variables.

To mitigate this limitation, we use another metric, which is **Pearson's Correlation Coefficient**.

The key takeaway for **Covariance** is that it helps us understand and quantify the relationship between 2 variables in a dataset. So, in our dataset, if a certain column's values increase while another column's values also increase, we have **a positive covariance**. And if one column's values increase while another column's values decrease, we have **a negative covariance**.

**Correlation**

Now that we have understood the concept of **Covariance**, let's understand how we can mitigate its limitations using **Pearson's Correlation Coefficient**.

__i. Pearson Correlation Coefficient__


The Pearson Correlation Coefficient basically shows the **Linear Relationship** between 2 sets of data, i.e. how well the 2 sets of data can be represented by a straight line.

The formula for the Pearson Correlation Coefficient is:

**ρ = cov(X, Y) / (σ_x σ_y)**

Where **σ_x** and **σ_y** are the standard deviations of **x** and **y**.

As discussed above in the **Covariance** section, if we find the **covariance** of 2 variables and one is increasing w.r.t. the other, then we have a positive covariance. So what we understand from this is that **Covariance** provides us the **Direction of the relationship**, i.e. whether it moves in the positive or negative direction.

Now in the case of the **Pearson Correlation Coefficient** we have an added advantage. By dividing by the standard deviations of **X** and **Y** (σ_x, σ_y), it is able to tell us the **Strength** of the correlation between X and Y.

It will also tell us the **Direction of relationship** between **X** and **Y**.

So the basic difference between **Covariance** and the **Pearson correlation coefficient** is that with **Covariance** we can't tell the **Strength** of the positive or negative correlation between **X and Y**, only the **Direction of relationship**.

But with the **Pearson Correlation Coefficient** we can, because we are dividing the **Covariance** by the standard deviations of **X and Y**, which removes the effect of each variable's scale.
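As a quick check of that definition, this sketch (reusing the Age/Salary numbers from the earlier table; the variable names are my own) computes Pearson's r from the covariance and standard deviations, and compares it with NumPy's built-in:

```python
import numpy as np

ages = np.array([22, 25, 39, 45, 33], dtype=float)
salaries = np.array([40_000, 45_000, 150_000, 200_000, 120_000], dtype=float)

# r = cov(X, Y) / (sigma_x * sigma_y); population forms are used consistently,
# so the 1/n factors cancel in the ratio
cov_xy = np.mean((ages - ages.mean()) * (salaries - salaries.mean()))
r_manual = cov_xy / (ages.std() * salaries.std())

r_numpy = np.corrcoef(ages, salaries)[0, 1]

print(r_manual, r_numpy)  # a strong positive value between 0 and 1
```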

__Range of Pearson Correlation Coefficient__


The value of the correlation coefficient will always be between **-1 and 1**.

Let's try to understand this concept by taking a few examples where the correlation coefficient varies between -1 and 1.

1. Let's say we have a scenario where **X** increases and along with that **Y** also increases, and the values lie on a straight line as shown below. In that case the value of the **Pearson Correlation Coefficient** will always be **1**.

2. Let's say we have a scenario where as **X** increases, **Y** decreases, and the values lie on a straight line as shown below. In that case the value of the **Pearson Correlation Coefficient** will always be **-1**.

3. Let's consider a case where we don't have any kind of relation between **X and Y**; the points are scattered everywhere. In that case the value of the **Pearson Correlation Coefficient** will be **0**.

4. In the below scenario, the **X and Y** values are negatively correlated (as the **X** value increases, the **Y** value decreases); however, all the points don't fall on a straight line. This means that the value of the **Pearson Correlation Coefficient** is greater than **-1** and less than **0**.

5. In the below scenario, the **X and Y** values are positively correlated (as the **X** value increases, the **Y** value also increases); however, all the points don't fall on a straight line. This means that the value of the **Pearson Correlation Coefficient** is greater than **0** and less than **1**.
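Scenarios 1 and 2 are easy to reproduce numerically: any exact linear relationship yields a coefficient of exactly 1 or -1. A minimal sketch with made-up data (the slopes and intercepts are arbitrary):

```python
import numpy as np

x = np.arange(10, dtype=float)

r_pos = np.corrcoef(x, 2 * x + 1)[0, 1]   # exact increasing line -> r = 1
r_neg = np.corrcoef(x, -3 * x + 7)[0, 1]  # exact decreasing line -> r = -1

print(r_pos, r_neg)
```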

Now that we have discussed the different scenarios and the possible values, let's understand why the **Pearson Correlation Coefficient** is used and where it can be used.

__Importance of Correlation Coefficient__


**i. Feature Selection using Pearson correlation coefficient**

The **Correlation Coefficient** is widely used for **Feature Selection**. Let's take an example to see how it could be used in **Feature Selection**.

Let’s consider 2 variables, **X, Y.**

**X** is the independent feature and **Y** is the outcome variable/label for the dataset. Let's say we find that the correlation between **X and Y** is 1, which means that when **X** increases, **Y** also increases.

A correlation value of 1 means **X** and **Y** carry exactly the same information (one is a linear function of the other). So whenever two columns are perfectly correlated like this, one of them is redundant, and we can drop it before applying a machine learning algorithm.
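In practice this is commonly done by computing a correlation matrix over the feature columns and dropping one feature from each highly correlated pair. A sketch with a made-up DataFrame and a hypothetical 0.95 threshold:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=100)
df = pd.DataFrame({
    "feat_a": a,
    "feat_b": 2 * a + 0.01 * rng.normal(size=100),  # near-duplicate of feat_a
    "feat_c": rng.normal(size=100),                  # unrelated feature
})

corr = df.corr().abs()
# Keep only the upper triangle so each pair is considered once,
# then collect one column from each highly correlated pair
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]

reduced = df.drop(columns=to_drop)
print(to_drop)  # feat_b is redundant with feat_a and gets dropped
```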

Now that we have discussed the concepts and different scenarios involving the **Pearson Correlation Coefficient** and how critical it is for **Feature Selection**, let's go ahead and discuss **Spearman's Rank Correlation Coefficient** and the limitations of the Pearson correlation coefficient that it addresses.

__ii. Spearman’s Rank Correlation Coefficient__


Suppose we have **X and Y** values that are positively correlated (as the **X** value increases, the **Y** value also increases), but the relation between them is **non-linear**.

The figure below illustrates this:

In the above graph, **X and Y** are positively correlated; however, **Pearson Correlation** gives a value of **0.88**, while **Spearman Correlation** gives a result of **1**.

So **Spearman Correlation** has an advantage over **Pearson Correlation** when it comes to a **non-linear** (but monotonic) relation between 2 attributes.
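The exact data behind the figure isn't reproduced here, but any monotonic non-linear relationship shows the same effect. A sketch using y = x³ (my own choice of curve, not the one in the figure):

```python
import numpy as np
from scipy import stats

x = np.arange(1, 11, dtype=float)
y = x ** 3  # increases with x, but not along a straight line

pearson_r, _ = stats.pearsonr(x, y)    # < 1: the relationship is not linear
spearman_r, _ = stats.spearmanr(x, y)  # = 1: the relationship is monotonic

print(pearson_r, spearman_r)
```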

__Concept behind Spearman Rank Correlation Coefficient__


We can also write the equation as follows:

**r_s = ρ(rg_X, rg_Y) = cov(rg_X, rg_Y) / (σ_rg_X σ_rg_Y)**

Here,

- **ρ** denotes the **Pearson Correlation Coefficient**, which is applied to the **ranks of X and Y** here.
- **cov(rg_X, rg_Y)** denotes the **covariance of the ranks of X and Y**.
- **σ_rg_X, σ_rg_Y** are the standard deviations of the ranks of **X and Y**.

__Scenarios while finding the correlation using Spearman Rank Correlation Coefficient__


We can encounter 2 types of scenarios while finding the correlation between 2 attributes using the **Spearman Rank correlation coefficient** method.

__Case 1:__ When all the ranks are distinct integers, i.e. there are no ties in the ranks, we can compute the correlation using the below formula:

**r_s = 1 – (6 Σ d_i²) / (n(n² – 1))**

Where **d_i = rg(X_i) – rg(Y_i)** is the difference between the ranks of **X and Y**, and **n** is the number of observations.

This sounds a bit abstract and complicated, but trust me it’s not. We will take an example to understand this concept better.

__Example – When there are no ties in the Ranks__

The above formula is a special case of **Pearson Correlation**; the only difference is that it is applied to the **ranks of X and Y**.

Let’s try to understand ranks using the example below. We have taken this example from Wikipedia; feel free to go and check it out there as well.

| IQ (X_i) | Hours of TV per week (Y_i) |
|----------|----------------------------|
| 106      | 7                          |
| 100      | 27                         |
| 86       | 2                          |
| 101      | 50                         |
| 99       | 28                         |
| 103      | 29                         |
| 97       | 20                         |
| 113      | 12                         |
| 112      | 6                          |
| 110      | 17                         |

To apply the **Spearman Rank** formula, we need to follow the below steps:

1. Sort the data on the 1st column.
2. Create a separate column **Rank(x_i)** to assign ranks to the sorted values of the 1st column.
3. Similarly, create a separate column **Rank(y_i)** and assign ranks to the values of the 2nd column.
4. Now, create a column **d_i** for the difference between the 2 rank columns.
5. Lastly, create a column **d_i²** for the squared difference between the 2 rank columns.

| IQ (X_i) | Hours of TV per week (Y_i) | Rank(x_i) | Rank(y_i) | d_i | d_i² |
|----------|----------------------------|-----------|-----------|-----|------|
| 86       | 2                          | 1         | 1         | 0   | 0    |
| 97       | 20                         | 2         | 6         | -4  | 16   |
| 99       | 28                         | 3         | 8         | -5  | 25   |
| 100      | 27                         | 4         | 7         | -3  | 9    |
| 101      | 50                         | 5         | 10        | -5  | 25   |
| 103      | 29                         | 6         | 9         | -3  | 9    |
| 106      | 7                          | 7         | 3         | 4   | 16   |
| 110      | 17                         | 8         | 5         | 3   | 9    |
| 112      | 6                          | 9         | 2         | 7   | 49   |
| 113      | 12                         | 10        | 4         | 6   | 36   |

As you can see above, there are no ties in the ranks, so we can use the no-ties formula. With Σ d_i² = 194 and n = 10:

**r_s = 1 – (6 × 194) / (10 × (100 – 1)) = 1 – 1164/990 = –29/165 ≈ –0.1758**
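We can cross-check this hand computation with scipy, whose `spearmanr` computes the same rank correlation:

```python
from scipy import stats

iq = [106, 100, 86, 101, 99, 103, 97, 113, 112, 110]
tv_hours = [7, 27, 2, 50, 28, 29, 20, 12, 6, 17]

rho, p_value = stats.spearmanr(iq, tv_hours)
print(rho)  # -29/165 = -0.17575...
```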

__Conclusion from the Outcome__

From the above result we can say that the attributes **IQ** and **Hours of TV per week** are negatively correlated. And as the value is close to **zero**, we can say that the correlation between **IQ** and **Hours of TV per week** is very weak. The negative sign suggests that **IQ** tends to be lower for those with **higher Hours of TV per week**.

__Case 2:__ When there are ties in the ranks. Let's take an example to better understand this:


Suppose we have the following records as shown in the table below:

| X  | Y  |
|----|----|
| 20 | 28 |
| 22 | 24 |
| 28 | 24 |
| 23 | 25 |
| 30 | 26 |
| 30 | 27 |
| 23 | 32 |
| 24 | 30 |

Now to find the **Spearman Correlation **we need to follow the below steps:

**1**. Sort the data on the 1st column.

**2**. Create a separate column **Rank(x_i)** to assign ranks to the sorted values of the 1st column. We can start the ranking either in ascending or descending order of the values of **X and Y**, as long as the same order is used for both.

**3**. If the 1st column has repeated values, take the positions/index values of the tied entries and average them, i.e. divide their sum by the count of records having that value.

E.g., there are 2 positions in column **X** which have the same value of 30. After sorting in ascending order, those two 30's sit at the 7th and 8th positions, so the rank would be (7 + 8)/2, which is equal to 7.5, and we assign 7.5 as the rank of both 30's. This approach is followed for the values in **Y** as well.

**4**. For the next element, the rank is its actual position/index value. If it is also tied, we repeat step 3.
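The tie-averaging described in step 3 is exactly what `scipy.stats.rankdata` does by default (`method='average'`). A quick sketch on the **X** column from the table above:

```python
from scipy.stats import rankdata

x = [20, 22, 23, 23, 24, 28, 30, 30]
ranks = rankdata(x)  # ties get the average of the positions they occupy
print(ranks)  # [1.  2.  3.5 3.5 5.  6.  7.5 7.5]
```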

| X  | R_x | Y  | R_y | D = (R_x – R_y) | D² = (R_x – R_y)² |
|----|-----|----|-----|-----------------|-------------------|
| 20 | 1   | 28 | 6   | -5              | 25                |
| 22 | 2   | 24 | 1.5 | 0.5             | 0.25              |
| 23 | 3.5 | 25 | 3   | 0.5             | 0.25              |
| 23 | 3.5 | 32 | 8   | -4.5            | 20.25             |
| 24 | 5   | 30 | 7   | -2              | 4                 |
| 28 | 6   | 24 | 1.5 | 4.5             | 20.25             |
| 30 | 7.5 | 26 | 4   | 3.5             | 12.25             |
| 30 | 7.5 | 27 | 5   | 2.5             | 6.25              |

We will use the below formula to calculate the correlation coefficient when there are ties:

**r_s = 1 – [6 × (Σ d_i² + Σ (m³ – m)/12)] / (N(N² – 1))**

In the above equation,

1. **m** denotes the number of times a particular value is repeated.

E.g., 30 is repeated twice in column **X**, so (m³ – m) = (2³ – 2) = 6. Similarly, 23 is repeated twice in column **X**, and 24 is repeated twice in column **Y**, so each of those also contributes (2³ – 2) = 6.

We repeat this for every set of repeated values and sum the results.

2. **N** denotes the number of records.

Let’s solve the above problem by substituting the values into the formula. Here Σ d_i² = 88.5, the tie correction is (6 + 6 + 6)/12 = 1.5, and N = 8:

**r_s = 1 – [6 × (88.5 + 1.5)] / (8 × (64 – 1)) = 1 – 540/504 ≈ –0.0714**
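As a cross-check, `scipy.stats.spearmanr` handles ties by computing Pearson's correlation directly on the tie-averaged ranks. That exact computation can differ slightly from the shortcut formula with the tie-correction term, but both land very close to zero for this data:

```python
from scipy import stats

x = [20, 22, 28, 23, 30, 30, 23, 24]
y = [28, 24, 24, 25, 26, 27, 32, 30]

rho, p_value = stats.spearmanr(x, y)
print(rho)  # approximately -0.073: a very weak negative correlation
```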

__Conclusion from the Outcome__

From the above result we can say that the attributes **X** and **Y** are negatively correlated. And as the value is close to **zero**, we can say that the correlation between **X** and **Y** is very weak. The negative sign suggests that **X** tends to be lower where the **Y** values are **higher**.

The concepts of **Covariance** and **Correlation** will be used while doing data preprocessing and exploratory data analysis.