Categorical Encoding using Label Encoding

Handling Categorical Data — Label Encoding

In machine learning we often encounter data with multiple labels in one or more columns. These labels may be text or numbers. Most machine learning models cannot take such data in its raw form, so the labels are first converted to numbers, often with label encoding. Label encoding is a technique for converting labels into numeric form so that they can be fed to a machine learning model, and it is an important preprocessing step for supervised learning. In this method, each distinct value in a categorical column is replaced with an integer from 0 to N-1, where N is the number of distinct values. In scikit-learn, LabelEncoder is a utility class that normalizes labels so that they contain only values between 0 and n_classes-1.

For example, say we have a column of employee heights. After label encoding, each distinct height value is assigned a numeric label. The image below shows how the heights are encoded to their respective numeric labels:
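The same mapping can be sketched in a few lines of plain Python. This is only an illustration: the height categories are made-up values, and numbering the distinct values in sorted order is an assumption chosen to keep the result deterministic.

```python
# Hypothetical height categories; the values are made up for illustration.
heights = ["Tall", "Medium", "Short", "Medium", "Tall"]

# Sort the distinct values so the mapping is deterministic,
# then number them 0..N-1.
classes = sorted(set(heights))
label_of = {value: index for index, value in enumerate(classes)}

# Replace every value in the column with its numeric label.
encoded = [label_of[h] for h in heights]

print(label_of)  # {'Medium': 0, 'Short': 1, 'Tall': 2}
print(encoded)   # [2, 0, 1, 0, 2]
```

With three distinct values (N = 3), the labels run from 0 to 2.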

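Label encoding is not suitable for all cases, however, because a model may treat the numeric labels as magnitudes or weights and conclude that a category labeled 2 is somehow greater than one labeled 0. It works best for ordinal data, where the categories have a natural order, provided the assigned numbers follow that order.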
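To make the ordinal case concrete, here is a minimal sketch that compares an explicit, order-preserving mapping with plain alphabetical numbering. The category names and their order are assumptions chosen for illustration.

```python
# Hypothetical ordinal categories, listed in their intended order (smallest first).
order = ["Short", "Medium", "Tall"]
rank_of = {category: rank for rank, category in enumerate(order)}

heights = ["Tall", "Medium", "Short", "Medium", "Tall"]

# Order-preserving labels: a larger number means a taller category.
encoded_ordinal = [rank_of[h] for h in heights]

# Alphabetical numbering, for comparison: "Medium" < "Short" < "Tall".
alphabetical = {c: i for i, c in enumerate(sorted(order))}
encoded_alpha = [alphabetical[h] for h in heights]

print(encoded_ordinal)  # [2, 1, 0, 1, 2]
print(encoded_alpha)    # [2, 0, 1, 0, 2]
```

In the alphabetical version "Medium" receives a smaller number than "Short", so the integers no longer reflect the real ordering. When the order matters, an explicit mapping like `rank_of` avoids that problem.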

The scikit-learn library provides the LabelEncoder class, which is used to encode categorical data. We will see how it is implemented in the subsequent section.
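As a brief preview, a minimal sketch of LabelEncoder might look like the following. It assumes scikit-learn is installed, and the height values are hypothetical sample data.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample data, made up for illustration.
heights = ["Tall", "Medium", "Short", "Medium", "Tall"]

encoder = LabelEncoder()
# fit_transform learns the distinct classes and returns the integer labels.
encoded = encoder.fit_transform(heights)

# classes_ holds the learned classes in sorted order; the index is the label.
print(encoder.classes_.tolist())  # ['Medium', 'Short', 'Tall']
print(encoded.tolist())           # [2, 0, 1, 0, 2]

# inverse_transform maps integer labels back to the original values.
print(encoder.inverse_transform([0, 2]).tolist())  # ['Medium', 'Tall']
```

LabelEncoder numbers the classes in sorted order, so the integers will not necessarily follow a meaningful ordinal ranking. The scikit-learn documentation also describes it as intended for target labels rather than input features.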


