Data Preprocessing is one of the most important steps when we are building a model. In the Data Preprocessing step, the data is transformed into a form that is suitable for model ingestion. The various steps involved in Data Preprocessing are shown in the flowchart below. In this post we will cover only the first step of Data Preprocessing, which is Data Cleaning.
The steps that follow Data Cleaning are linked at the end of the post.
The first step of Data Preprocessing is Data Cleaning. Most of the data that we work with today is not clean and requires a substantial amount of cleaning. Some datasets have missing values and some contain junk data. If these missing values and inconsistencies are not handled properly, our model won’t give accurate results.
So, before getting into the nitty-gritty details of Data Cleaning, let’s get a high-level understanding of the problems we face with real-world data.
Missing Values:
Handling missing values is crucial when it comes to building a model. If they are not handled properly, the model can end up making inaccurate predictions. Let’s look at the example below to understand more about it.
The dataset below can be used to predict the graduate admission of students. However, it has some missing values in features that are critical for predicting admission.
As you can see above, some of the records are missing the GRE Score, TOEFL Score, University Rating, SOP, LOR, CGPA or Research values, all of which are important features for predicting a student’s admission.
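One way to confirm which features have gaps is to count the missing values per column. Below is a minimal sketch with pandas. The DataFrame is a hypothetical sample: the column names follow the features mentioned above, but the values are made up for illustration and are not taken from the actual dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: column names follow the features named above,
# the values are invented for illustration only.
df = pd.DataFrame({
    "GRE Score": [337, np.nan, 316, 322],
    "TOEFL Score": [118, 107, np.nan, 110],
    "CGPA": [9.65, 8.87, np.nan, 8.67],
    "Research": [1, 1, np.nan, 1],
})

# Number of missing values in each column
missing_per_column = df.isna().sum()

# Rows that contain at least one missing value
rows_with_missing = df[df.isna().any(axis=1)]

print(missing_per_column)
print(rows_with_missing)
```

Here `isna()` marks every missing cell as `True`, so summing it gives a per-column count, and `any(axis=1)` picks out the affected rows.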
Handling Missing Data:
In order to build a robust model that handles complex tasks, we need to handle missing data carefully. There are many ways of handling missing data. Some of them are as follows:
Method 1: Removing the Data
The first thing we should do is check whether the dataset has any missing values, since most models cannot accept them. One common and easy way to handle missing values is to delete the entire row if any value in that row is missing. Similarly, we can delete an entire column if 70 to 75% of its values are missing. This percentage is not fixed, however. It mostly depends on what kind of data we are dealing with and what kind of features are in the dataset.
The advantage of this method is that it’s a quick and dirty way of fixing the missing values issue. But it is not always the go-to method, as you might sometimes end up losing critical information by deleting rows or features.
Let’s understand this method using the example below:
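A minimal sketch of this method with pandas, under a few assumptions: the DataFrame is a hypothetical sample with made-up values, the `Notes` column is invented to illustrate a mostly empty feature, and the 0.7 cutoff is just one choice from the 70 to 75% range mentioned above.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; "Notes" is an invented, mostly empty column.
df = pd.DataFrame({
    "GRE Score": [337, 324, np.nan, 322, 314],
    "TOEFL Score": [118, 107, 104, 110, 103],
    "CGPA": [9.65, 8.87, 8.00, np.nan, 8.21],
    "Notes": [np.nan, np.nan, np.nan, "retake", np.nan],
})

# Fraction of missing values in each column (between 0.0 and 1.0)
missing_fraction = df.isna().mean()

# Assumed cutoff: drop a column when 70% or more of its values are missing
COLUMN_THRESHOLD = 0.7
columns_to_drop = missing_fraction[missing_fraction >= COLUMN_THRESHOLD].index
df_reduced = df.drop(columns=columns_to_drop)

# Drop every remaining row that has at least one missing value
df_clean = df_reduced.dropna(axis=0, how="any")

print(df_clean)
```

Dropping the sparse column first is a deliberate choice in this sketch. If the rows were dropped first, the mostly empty `Notes` column would cause almost every row to be deleted.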
Method 2: Mean/Median/Mode Imputation
In this method we will use the Mean/Median/Mode to replace missing values.
1. In the case of Numerical data, we can compute its mean or median and use the result to replace missing values.
2. In the case of Categorical (non-numerical) data, we can compute its mode and use the result to replace missing values.
This process is known as Mean/Median/Mode imputation.
The advantage of this method is that we don’t remove any data, which prevents data loss.
The drawback is that you don’t know how accurate the mean, median, or mode is going to be as a substitute in a given situation.
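Mean/Median/Mode imputation can be sketched with pandas as follows. This is only an illustration under stated assumptions: the DataFrame and its values are hypothetical, `Research` is treated here as a categorical yes/no column, and the choice of mean for one numerical column and median for the other is arbitrary.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with one gap in each column
df = pd.DataFrame({
    "GRE Score": [330.0, np.nan, 310.0, 320.0],
    "CGPA": [9.0, 8.0, np.nan, 8.5],
    "Research": ["yes", "no", np.nan, "yes"],
})

imputed = df.copy()

# Numerical column: replace missing values with the column mean
imputed["GRE Score"] = imputed["GRE Score"].fillna(imputed["GRE Score"].mean())

# Numerical column: replace missing values with the column median
imputed["CGPA"] = imputed["CGPA"].fillna(imputed["CGPA"].median())

# Categorical column: replace missing values with the most frequent value.
# mode() returns a Series because ties are possible, so take the first entry.
imputed["Research"] = imputed["Research"].fillna(imputed["Research"].mode().iloc[0])

print(imputed)
```

The median is usually the safer choice for a numerical column with outliers, since a few extreme values can pull the mean away from the typical value.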