Another problem that we often face in data is outliers. These are the one-off values that stand out from the rest of the population: they may be very large or very small relative to the other values in the data. Outlier detection, also known as anomaly detection, is the identification of events or observations that are not normal and differ from the rest of the data, and it is a crucial step in exploratory data analysis. In this post we look for outliers in numerical data only. Box plots are a very effective way to visualise outliers.
Categories of anomaly detection
Anomaly detection techniques fall into three categories.
i. Unsupervised anomaly detection finds outliers in an unlabeled dataset. It assumes that the majority of the instances in the dataset are normal.
ii. Supervised anomaly detection requires a dataset in which each instance is labeled as “normal” or “abnormal”. It involves training a classifier on those labels.
iii. Semi-supervised anomaly detection builds a model of normal behavior from a training dataset that contains only normal instances, and then tests how likely it is that a new instance was generated by the learnt model.
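To make the semi-supervised idea concrete, here is a minimal sketch using only Python's standard library. It assumes the "model of normal behavior" is a single normal distribution fitted to the training values, which is only one of many possible choices. The readings, the `is_anomaly` helper and the 3-standard-deviation cut-off are hypothetical and chosen purely for illustration.

```python
from statistics import NormalDist

# Hypothetical training data that contains only "normal" readings.
normal_training = [50.1, 49.8, 50.3, 49.9, 50.0, 50.2, 49.7, 50.4]

# Assumed model of normal behavior: one Gaussian fitted to the training data.
model = NormalDist.from_samples(normal_training)

def is_anomaly(x, model, max_sigma=3.0):
    """Flag x if it lies more than max_sigma standard deviations from the mean."""
    distance_in_sigmas = abs(x - model.mean) / model.stdev
    return distance_in_sigmas > max_sigma

# Score new, unseen instances against the learnt model.
for reading in (50.2, 53.0):
    print(reading, "anomaly" if is_anomaly(reading, model) else "normal")
```

A reading close to the training values is reported as normal, while one far away from them is flagged. In practice the model and the threshold would be chosen to suit the data.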
Applications of outlier detection
Outlier detection has many applications, such as fraud detection, intrusion detection, fault detection, and data pre-processing to remove anomalies from a dataset.
Example of an outlier in data
An outlier is not necessarily incorrect or wrong data. With proper subject matter expertise and business knowledge about the data, we can decide whether a value is an actual outlier and whether we need to remove it or retain it.
Before getting into the actual code, let’s take an example to see how to find the possible outliers in a dataset.
1. Let’s first sort the data: 12, 32, 45, 51, 54, 59, 65, 74, 77, 87, 121, 139, 321
2. Now let’s calculate the median, which is the middle number after sorting: 65
3. Now let’s calculate the lower quartile (Q1) which is the median of the 1st half of the dataset:
1st half = 12, 32, 45, 51, 54, 59
The median = (45 + 51)/2 = 48
4. Now we calculate the upper quartile, or 3rd quartile (Q3), which is the median of the 2nd half of the dataset:
2nd half = 74, 77, 87, 121, 139, 321
The median = (87 + 121)/2 = 104
5. Now we find the interquartile range (IQR)
IQR = Q3 – Q1 => 104 – 48 = 56
6. Now let’s find the upper extreme and lower extreme:
Lower extreme = Q1 – 1.5(IQR) => 48 – 1.5(56) = 48 – 84 = -36
Upper extreme = Q3 + 1.5(IQR) => 104 + 1.5(56) = 104 + 84 = 188
Boundaries of our fences = -36 and 188
Any data point lower than the lower extreme or greater than the upper extreme is an outlier. In our example, the only outlier is 321.
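The steps above can be sketched in Python with nothing but the standard library. The function name `iqr_fences` and the parameter `k` are our own choices for illustration. The quartiles are computed as in the worked example, as medians of the two halves with the overall median left out when the number of values is odd. Other tools may use a different quartile convention and give slightly different fences.

```python
from statistics import median

def iqr_fences(values, k=1.5):
    """Return (q1, q3, lower_fence, upper_fence) using the median-of-halves method."""
    data = sorted(values)                      # step 1: sort the data
    n = len(data)                              # needs at least 2 values
    half = n // 2
    lower_half = data[:half]
    # Skip the middle value when n is odd so both halves have equal length.
    upper_half = data[half + 1:] if n % 2 else data[half:]
    q1 = median(lower_half)                    # step 3: lower quartile
    q3 = median(upper_half)                    # step 4: upper quartile
    iqr = q3 - q1                              # step 5: interquartile range
    return q1, q3, q1 - k * iqr, q3 + k * iqr  # step 6: fences

# The example dataset, deliberately unsorted.
data = [65, 12, 321, 45, 74, 51, 139, 54, 87, 59, 121, 32, 77]

q1, q3, lower, upper = iqr_fences(data)
outliers = [x for x in data if x < lower or x > upper]

print(q1, q3, lower, upper, outliers)  # 48.0 104.0 -36.0 188.0 [321]
```

For this dataset, `statistics.quantiles(data, n=4, method='exclusive')` returns the same quartiles (48, 65 and 104), so it can serve as a cross-check.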
Let’s visualize the same in the box plot below.
This post covers one of the steps involved in data pre-processing. Refer to this post to understand and implement data cleaning, which is another crucial step in data pre-processing.