Classification Metrics in Machine Learning


Choosing the right classification metrics is crucial for model evaluation. The Confusion Matrix is a simple yet powerful tool for evaluating the performance of a classifier, and it applies to machine learning problems where the output can be two or more classes. Similarly, we have Precision[3], which is defined as the fraction of relevant instances among the retrieved instances; Recall[3], which is the fraction of the total relevant instances that were actually retrieved; and F-Beta[5], which is the weighted harmonic mean of Precision and Recall.

We will discuss these in detail in the upcoming sections. Below are the various classification metrics that we can use in Machine Learning.

  • Confusion Matrix
  • Accuracy
  • Recall (True Positive Rate, Sensitivity)
  • Precision (Positive Prediction Value)
  • F-Beta
  • Cohen Kappa
  • ROC Curve, AUC Score
  • PR Curve

It is very important to use the correct metrics to find out how good the model is. If we do not use the correct metrics, it becomes really difficult to tell how well our model performs.

So, let’s understand each metric and see which one best fits which kind of scenario.

Now let’s consider a classification problem statement. As shown in the figure below, there are two ways in which we can solve a classification problem:

1. Predicting the Class Labels

Suppose we have a binary classification problem with classes A and B. The decision threshold in this case is 0.5 by default, as we have 2 classes.

So, if the predicted value is greater than 0.5, the record belongs to class B, and if it is less than 0.5, it belongs to class A.
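The thresholding rule above can be sketched in a few lines of Python. This is only an illustration: the function name `to_labels` and the sample scores are made up for this example, and a score of exactly 0.5 is assigned to class A here, which is an assumption since the text does not cover that case.

```python
def to_labels(scores, threshold=0.5):
    """Map each predicted score to class "B" if it exceeds the threshold, else "A"."""
    # Assumption: a score exactly equal to the threshold falls into class A.
    return ["B" if score > threshold else "A" for score in scores]


# Hypothetical model scores for five records
scores = [0.10, 0.45, 0.50, 0.51, 0.92]
print(to_labels(scores))  # ['A', 'A', 'A', 'B', 'B']
```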

2. Probability

In the case of probabilities, we also have to derive the class labels, by selecting the right threshold value.

The threshold value we choose depends on the use case. Let’s say we want to predict whether a person has cancer or not. In this case, the choice of threshold is critical and should be made carefully.

The probability approach involves the following classification metrics, which we can use for choosing a suitable threshold.

  1. ROC Curve
  2. AUC Score
  3. PR Curve
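As a rough sketch of why the threshold matters, the snippet below evaluates the same predicted probabilities at several thresholds and reports the true positive rate and false positive rate at each one. These pairs are the kind of points a ROC curve is built from. The helper `rates_at_threshold`, the labels and the scores are all hypothetical and chosen only for illustration.

```python
def rates_at_threshold(y_true, scores, threshold):
    """Return (TPR, FPR) when scores above the threshold are labelled positive (1)."""
    tp = fp = fn = tn = 0
    for actual, score in zip(y_true, scores):
        predicted = 1 if score > threshold else 0
        if predicted == 1 and actual == 1:
            tp += 1
        elif predicted == 1 and actual == 0:
            fp += 1
        elif predicted == 0 and actual == 1:
            fn += 1
        else:
            tn += 1
    # Guard against empty classes to avoid division by zero.
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return tpr, fpr


# Hypothetical actual labels and predicted probabilities
y_true = [0, 0, 1, 0, 1, 1]
scores = [0.1, 0.3, 0.4, 0.6, 0.7, 0.9]

for threshold in (0.2, 0.5, 0.8):
    tpr, fpr = rates_at_threshold(y_true, scores, threshold)
    print(f"threshold={threshold}: TPR={tpr:.2f}, FPR={fpr:.2f}")
```

In this toy data, raising the threshold lowers both rates, so picking a threshold is a trade-off between catching positives and raising false alarms.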

Now that we know how we can solve a classification problem, let’s understand which metrics to use for which kind of dataset.

  1. If we have a dataset of 1000 records that is split into equal halves between the two classes, then it is a balanced dataset. In such cases we use Accuracy as the classification metric.
  2. If we have an imbalanced dataset, where the records are not distributed equally between the two classes, then we consider Recall, Precision and F-Beta as the classification metrics.

Now that we have briefly discussed balanced and imbalanced datasets and which metrics to use for each, let’s understand each metric in detail.

1. Confusion Matrix

The Confusion Matrix, in the case of binary classification, is a 2×2 matrix as shown below. The values along the top are the actual values and the values along the left are the predicted values. It is an error matrix, which allows us to visualize the performance of an algorithm.

  1. The first field corresponding to 1 for predicted value and 1 for actual value is the True Positive (TP) field.
  2. Similarly, the field corresponding to 1 for predicted value and 0 for actual value is the False Positive (FP) field, which is also called the type I error. The false positive rate (FPR) is computed from this count.
  3. The field corresponding to 0 for predicted value and 1 for actual value is the False Negative (FN) field, which is also called the type II error. The false negative rate (FNR) is computed from this count as FNR = FN / (FN + TP).
  4. The field corresponding to 0 for the predicted value and 0 for the actual value is the True Negative (TN).

One way to remember the formula for FPR is that we take the false positives (FP) with respect to all the actual negative values (FP and TN), that is, FPR = FP / (FP + TN).

TP and TN are our correct predictions. Our aim should always be to reduce the type I and type II errors.
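The four fields of the matrix can be counted directly from a list of actual labels and a list of predicted labels. The following is a minimal sketch in plain Python, assuming labels are encoded as 1 (positive) and 0 (negative). The function name `confusion_counts` and the sample data are illustrative only.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, FN and TN for binary labels encoded as 1 and 0."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for actual, pred in pairs if actual == 1 and pred == 1)
    fp = sum(1 for actual, pred in pairs if actual == 0 and pred == 1)
    fn = sum(1 for actual, pred in pairs if actual == 1 and pred == 0)
    tn = sum(1 for actual, pred in pairs if actual == 0 and pred == 0)
    return tp, fp, fn, tn


# Hypothetical actual and predicted labels for eight records
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
print(tp, fp, fn, tn)  # 3 1 1 3

fpr = fp / (fp + tn)  # false positive rate: FP over all actual negatives
fnr = fn / (fn + tp)  # false negative rate: FN over all actual positives
print(fpr, fnr)  # 0.25 0.25
```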

2. Accuracy

As we discussed before, if our dataset is a balanced one then we use Accuracy as the classification metric.

The formula for Accuracy is:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

Here TP and TN are the correct predictions out of all the results.

Now, what would happen if our dataset is not balanced and we still use Accuracy as the classification metric? To understand this, let’s take an example:

Suppose we have 10K records, with 9K belonging to label A and 1K to label B. Even a model that simply tags every record as label A reaches 90% accuracy, although it never identifies a single record of label B.

Clearly, Accuracy is not a good way of measuring the performance of the model if our dataset is not balanced.
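The 10K-record example can be reproduced with a short sketch. It assumes a trivial "model" that predicts label A for every record, which is an illustrative stand-in and not a real classifier. The function name `accuracy` is likewise chosen only for this example.

```python
def accuracy(y_true, y_pred):
    """Fraction of records whose predicted label matches the actual label."""
    correct = sum(1 for actual, pred in zip(y_true, y_pred) if actual == pred)
    return correct / len(y_true)


# 9K records of label A and 1K records of label B, as in the example above
y_true = ["A"] * 9000 + ["B"] * 1000
# Hypothetical model that always predicts the majority label A
y_pred = ["A"] * 10000

print(accuracy(y_true, y_pred))  # 0.9

# Number of label-B records the model actually found
b_found = sum(1 for actual, pred in zip(y_true, y_pred) if actual == "B" and pred == "B")
print(b_found)  # 0
```

The accuracy looks good, yet the model has not found any record of the minority class.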

So, in such situations we use Recall, Precision and F-Beta as the classification metrics.

3. Recall (True Positive Rate, Sensitivity)

Recall tells us, out of the total actual positive values, how many we were able to predict correctly as positive, that is, Recall = TP / (TP + FN). This can be seen in the figure below.

One thing to remember here is that, in the case of Recall, we deal with False Negatives.
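As a small sketch, Recall can be computed from the confusion-matrix counts as TP / (TP + FN). The function name `recall` and the counts used below are hypothetical. Returning 0.0 when there are no actual positives is a convention assumed here, not something the text prescribes.

```python
def recall(tp, fn):
    """Recall = TP / (TP + FN): share of actual positives that were found."""
    actual_positives = tp + fn
    # Assumption: report 0.0 when there are no actual positives at all.
    return tp / actual_positives if actual_positives else 0.0


# Hypothetical counts: 80 positives found, 20 positives missed
print(recall(tp=80, fn=20))  # 0.8
```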

4. Precision (Positive Prediction Value)

Precision tells us, out of the total predicted positive results, how many were actually positive, that is, Precision = TP / (TP + FP). One thing to remember here is that, in the case of Precision, we deal with False Positives.

Now let’s take a few examples to better understand the scenarios where we could use Precision and Recall.
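In the same spirit, Precision can be sketched from the counts as TP / (TP + FP). The function name `precision` and the numbers below are made up for illustration, and returning 0.0 when nothing was predicted positive is an assumed convention.

```python
def precision(tp, fp):
    """Precision = TP / (TP + FP): share of predicted positives that were correct."""
    predicted_positives = tp + fp
    # Assumption: report 0.0 when the model predicted no positives at all.
    return tp / predicted_positives if predicted_positives else 0.0


# Hypothetical counts: 30 correct positive predictions, 10 false alarms
print(precision(tp=30, fp=10))  # 0.75
```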

Precision Example

Let’s take the use case of Spam Detection. In this case, we mostly have to consider Precision. Let’s say we got an email which is not actually spam, but the model detected it as spam, which means it is a False Positive.

In such cases, where the cost of a False Positive is high, our main focus should be to reduce False Positives to a minimum, so that an important email is not wrongly classified as spam.

Recall Example

Now let’s say our model is tasked with predicting whether a person is covid positive or not. Suppose the model predicted that a person does not have covid whereas they were actually covid positive, which is a False Negative. This would be a serious mistake by the model.

In such cases a False Positive won’t be a very big issue, because even if a person who is not covid positive is predicted as positive, they can go for another test to verify the result.

But if the person has covid and is predicted as negative (False Negative), then chances are they will not go for another test, which might turn out to be a disaster.

Therefore, it’s important to use Recall in such situations.

NOTE: Our goal should always be to increase both Precision and Recall (that is, to reduce False Positives and False Negatives), however:

  1. Whenever the False Positive is of more importance with respect to the problem statement, use Precision.
  2. If the False Negative is of greater importance with respect to the problem statement, use Recall.

Now that we have understood what Precision and Recall are, let’s go ahead and understand F-Beta and where we can use it.

5. F-Beta

We will encounter scenarios in which both False Positives and False Negatives play an important role in an imbalanced dataset. In such cases we have to consider both Recall and Precision.

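So, if we are considering both these metrics, then we have to use the F-Beta score:

$$F_\beta = \frac{(1 + \beta^2) \cdot \text{Precision} \cdot \text{Recall}}{\beta^2 \cdot \text{Precision} + \text{Recall}}$$

If the Beta value is 1, then the F-Beta score becomes the F1-Score. Similarly, the Beta value can also be 0.5 or 2.

If β = 1, then:

$$F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$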

This formula is the harmonic mean of Precision and Recall. Now, let’s understand which value of Beta to choose in which case.
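A minimal sketch of the F-Beta score in plain Python is shown below, using the standard definition (1 + β²) · P · R / (β² · P + R). The function name `f_beta` and the sample values are assumptions made for this example. In practice a library routine such as `sklearn.metrics.fbeta_score`[5] would usually be used instead.

```python
def f_beta(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall."""
    beta_sq = beta ** 2
    denominator = beta_sq * precision + recall
    # Assumption: report 0.0 when both precision and recall are zero.
    if denominator == 0:
        return 0.0
    return (1 + beta_sq) * precision * recall / denominator


# Hypothetical precision and recall values
p, r = 0.75, 0.6

f1 = f_beta(p, r, beta=1.0)
harmonic_mean = 2 * p * r / (p + r)
print(abs(f1 - harmonic_mean) < 1e-12)  # True: with beta = 1 this is the harmonic mean
```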

Case I:

If both False Positive and False Negative are equally important, then we will select Beta = 1.

Case II:

Suppose a False Positive has more impact than a False Negative; then we need to reduce the Beta value by selecting something between 0 and 1 (for example, 0.5), which gives more weight to Precision.

Case III:

Suppose the impact of a False Negative is high, which is where Recall matters; in such cases we increase the Beta value above 1 (for example, 2), which gives more weight to Recall.

In the next part of this blog, we will discuss the rest of the metrics, which are Cohen Kappa, ROC Curve, AUC Score and PR Curve.
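To see the three cases side by side, the sketch below scores one hypothetical model, with high Precision (0.9) and low Recall (0.5), under Beta values of 0.5, 1 and 2. The function `f_beta` is redefined here so that the snippet stands on its own. It follows the standard F-Beta definition, and the numbers are illustrative only.

```python
def f_beta(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall."""
    beta_sq = beta ** 2
    denominator = beta_sq * precision + recall
    if denominator == 0:
        return 0.0
    return (1 + beta_sq) * precision * recall / denominator


# Hypothetical model: few false positives, many false negatives
precision, recall = 0.9, 0.5

f_half = f_beta(precision, recall, beta=0.5)  # Case II: favours Precision
f_one = f_beta(precision, recall, beta=1.0)   # Case I: equal weight
f_two = f_beta(precision, recall, beta=2.0)   # Case III: favours Recall

print(f"{f_half:.3f} {f_one:.3f} {f_two:.3f}")  # 0.776 0.643 0.549
```

Because this model is weak on Recall, its score drops as Beta grows, which is the behaviour we want when False Negatives are the costlier error.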

Also, if you want to understand the Regression Metrics in detail, refer to this link.


  1. A Gentle Introduction to the Fbeta-Measure for Machine Learning
  2. Performance Metrics For Classification Problem In Machine Learning- Part1
  3. Precision and recall
  4. Classification: Precision and Recall
  5. sklearn.metrics.fbeta_score