**Introduction**

Choosing the right **Classification Metrics** is crucial for model evaluation. The **Confusion Matrix** is a simple yet very powerful classification metric when it comes to evaluating the performance of a classification model. A **Confusion Matrix** is a performance measurement for machine learning problems where the output can be two or more classes. Similarly, we have **Precision**[3], which is defined as the fraction of relevant instances among the retrieved instances; **Recall**[3], which is the fraction of the total amount of relevant instances that were actually retrieved; and **F-Beta**[5], which is the weighted harmonic mean of **Precision** and **Recall**.

We will discuss these in detail in the upcoming sections. Below are the various classification metrics that we should use in machine learning:

- Confusion Matrix
- Accuracy
- Recall (True Positive Rate, Sensitivity)
- Precision (Positive Predictive Value)
- F-Beta
- Cohen Kappa
- ROC Curve, AUC Score
- PR Curve

It is very important to use the correct kind of metric to find out how good the model is. If we are not using the correct metric, then it would be really difficult to tell the efficiency of our model.

So, let’s understand each metric and see which one best fits which kind of scenario.

Now let’s consider a classification problem statement. As shown in the below figure, there are two ways in which we can solve a classification problem statement:

**1. **__Predicting the Class Labels__

Suppose we have a binary classification with classes A and B. The threshold boundary in this case will by default be 0.5 as we have 2 classes.

So, let’s say our prediction value is greater than 0.5 then it will belong to class B and if it’s less than 0.5 then it will belong to class A.
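This mapping can be sketched in plain Python, using made-up probability scores (assumed for illustration) and the default 0.5 threshold:

```python
# Hypothetical predicted probabilities of belonging to class B (assumed data).
scores = [0.12, 0.47, 0.50, 0.86, 0.91]

# With two classes, the default decision threshold is 0.5:
# score >= 0.5 -> class B, score < 0.5 -> class A.
labels = ["B" if s >= 0.5 else "A" for s in scores]
print(labels)  # ['A', 'A', 'B', 'B', 'B']
```

Note that a score of exactly 0.5 has to be assigned to one of the classes by convention; here it goes to class B.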

**2. **__Probability__

In the case of probability too, we have to find out the class labels by selecting the right threshold value.

The threshold value we choose will depend on the use case. Let’s say we want to predict whether a person has cancer or not. In this case, choosing the threshold value is very critical and should be done in a proper way.

The probability approach involves the following classification metrics, which we can use for choosing the correct threshold:

- ROC Curve
- AUC Score
- PR Curve

Now that we know how we can solve a classification problem, let’s understand which metrics to use for a given dataset.

- If we have a dataset which has 1000 records split into equal halves, then it is a balanced dataset. In such cases we use **Accuracy** as the classification metric.
- If we have an imbalanced dataset, where the distribution of data is not equal for the binary classification, then we consider **Recall**, **Precision** and **F-beta** as the classification metrics.

Now that we have briefly discussed balanced and imbalanced datasets and what type of metric should be used for each, let’s understand each of these metrics in detail.

**1. **__Confusion Matrix__

A **Confusion Matrix**, in the case of binary classification, is a 2x2 matrix as shown below. The top values are the actual values and the left part are the predicted values. It is an error matrix which allows visualization of the performance of an algorithm.

- The field corresponding to 1 for the predicted value and 1 for the actual value is the **True Positive (TP)** field.
- Similarly, the field corresponding to 1 for the predicted value and 0 for the actual value is the **False Positive (FP)** field, which is also called the **type I error**; the **false positive rate (FPR)** is derived from it.
- The field corresponding to 0 for the predicted value and 1 for the actual value is the **False Negative (FN)** field, which is also called the **type II error**; the **false negative rate (FNR)** is derived from it.
- The field corresponding to 0 for the predicted value and 0 for the actual value is the **True Negative (TN)** field.

One way to remember the formula for the FPR is: we take the false positives (FP) with respect to all the actual negative values (FP and TN), i.e. FPR = FP / (FP + TN).

Our correct results are TP and TN. Our aim should always be to reduce both the type I error and the type II error.
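The four fields can be counted with a minimal sketch in plain Python, using the 1 = positive, 0 = negative convention above and made-up labels (assumed for illustration):

```python
# Hypothetical actual and predicted labels (1 = positive, 0 = negative).
actual    = [1, 1, 1, 0, 0, 0, 1, 0]
predicted = [1, 0, 1, 0, 1, 0, 1, 0]

tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)  # true positives
fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)  # type I error
fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)  # type II error
tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)  # true negatives

print(tp, fp, fn, tn)  # 3 1 1 3
```

The four counts always sum to the total number of records, which is a handy sanity check when filling in the matrix by hand.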

**2. Accuracy**

As we discussed before, if our dataset is a balanced one then we use **Accuracy** as the classification metric.

The formula for **Accuracy** is:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Here TP and TN are the correct results out of all four outcomes.

Now what would happen if our dataset is not balanced and we still use **Accuracy** as the classification metric? To understand this, let’s take an example:

Suppose we have 10K records, with label A being 9K and label B being 1K. If we now calculate the **Accuracy**, it’s obvious that we will get around 90% accuracy, where the model predicts most of the records as belonging to label A.

Clearly this is not a good way of calculating the efficiency of the model if our dataset is not balanced.

So, in such situations we use **Recall**, **Precision** and **F-beta** as the classification metrics.

### 3. __Recall (True Positive Rate, Sensitivity)__

For a classification problem, **Recall** tells us, out of the total actual positive values, how many positives we were able to predict correctly: Recall = TP / (TP + FN). This can be seen in the figure below.

One thing to remember here is that, in the case of **Recall**, we deal with the **False Negatives**.

### 4. __Precision (Positive Predictive Value)__

Out of the total predicted positive results, how many were actually positive: Precision = TP / (TP + FP). One thing to remember here is that, in the case of **Precision**, we deal with the **False Positives**.
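Both metrics can be sketched in plain Python from the confusion-matrix counts (the numbers here are made up, assumed purely for illustration):

```python
# Hypothetical confusion-matrix counts.
tp, fp, fn = 8, 2, 4

precision = tp / (tp + fp)  # of all predicted positives, how many were right
recall    = tp / (tp + fn)  # of all actual positives, how many we caught

print(precision, recall)  # 0.8 and roughly 0.667
```

With these counts the model is more precise than it is complete: it rarely raises false alarms (2 FP) but misses a third of the actual positives (4 FN).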

Now let’s take a few examples to better understand the scenarios where we could use **Precision** and **Recall**.

__Precision Example__

- Let’s take the use case of **Spam Detection**. In this case, we mostly have to consider **Precision**. Let’s say we got an email which is originally not spam, but the model detected it as spam, which means it is a **False Positive**.

In such cases, where the **False Positive** count is high, our main focus should always be to reduce it to a minimum, so that if we get an important email, it is not wrongly classified as spam.

__Recall Example__

Now let’s say our model is tasked with predicting whether a person is covid positive or not. Suppose the model predicted a person as not having covid whereas they were actually covid positive, which is a **False Negative**. This might turn out to be a blunder by the model.

In such cases a **False Positive** won’t be a very big issue, because even if the person is not covid positive but is predicted as positive, they could go for another test to verify the result.

But if the person has covid and is predicted as negative (False Negative) then chances are he might not go for another test which might turn out to be a disaster.

Therefore, it’s important to use **Recall **in such situations.

NOTE: Our goal should always be to reduce the **False Positives** and **False Negatives** (i.e. to maximize **Precision** and **Recall**), however:

- Whenever the **False Positive** is of more importance with respect to the problem statement, use **Precision**.
- If the **False Negative** has greater importance with respect to the problem statement, use **Recall**.

Now that we have understood what **Precision** and **Recall** are, let’s go ahead and understand **F-Beta** and where we can possibly use it.

### 5. __F-Beta__

We will encounter some scenarios in which both the **False Positive** and the **False Negative** play an important role in an **imbalanced dataset**. In such cases we have to consider both **Recall** and **Precision**.

So, if we are considering both these metrics, then we have to use the **F-Beta** score.

If the **Beta **value is 1, then the **F-Beta** becomes a **F1-Score. **Similarly **Beta **value can also be 0.5 or 2.

The general formula is:

F-Beta = (1 + β²) × (Precision × Recall) / (β² × Precision + Recall)

If **β = 1**, then:

F1 = 2 × (Precision × Recall) / (Precision + Recall)

This formula is the harmonic mean of **Precision** and **Recall**. Now, let’s understand when to choose which value of **Beta**.

**Case I:**

If both **False Positive **and **False Negative **are equally important, then we will select **Beta = 1.**

**Case II:**

Suppose the **False Positive** has more impact than the **False Negative**; then we need to reduce the **Beta** value by selecting something between 0 and 1.

**Case III:**

Suppose the **False Negative** impact is high, which is what **Recall** captures; in such cases we increase the **Beta** value to more than 1.
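These three cases can be sketched in plain Python with a small helper; the precision and recall values below are made up, assumed purely for illustration:

```python
def f_beta(precision, recall, beta):
    """Weighted harmonic mean of precision and recall."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

p, r = 0.8, 0.4  # hypothetical precision and recall

print(round(f_beta(p, r, 1.0), 3))  # 0.533 -- F1: both errors weighted equally
print(round(f_beta(p, r, 0.5), 3))  # 0.667 -- beta < 1 favors precision
print(round(f_beta(p, r, 2.0), 3))  # 0.444 -- beta > 1 favors recall
```

Since precision (0.8) is higher than recall (0.4) in this example, the score rises when beta shrinks and falls when beta grows, matching the three cases above.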

In the next part of this blog, we will discuss the rest of the metrics, which are Cohen Kappa, ROC Curve, AUC Score and PR Curve.

Also, if you want to understand Regression Metrics in detail, refer to this link.