Handling Categorical Data – One-Hot Encoding
In label encoding, categorical data is converted to numerical data by assigning each category an integer label (such as 1, 2, and 3). This approach has a flaw: predictive models that use these numbers might mistake the labels for an ordering (for example, a model might treat a label of 3 as “better” than a label of 1, which is incorrect). To avoid this confusion, we can use one-hot encoding. In this technique, the integer-encoded variable is removed and a new binary variable is added for each unique integer value. These binary variables are often called “dummy variables” in other fields, such as statistics.
In one-hot encoding, the label-encoded column is split into n columns, where n is the number of unique labels generated during label encoding. Each row holds a 1 in the column that matches its own label and a 0 in all the others.
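The two steps described above can be sketched in plain Python. This is only a minimal illustration, not a definitive implementation: the `colors` column and all variable names are hypothetical, and the choice to assign labels in alphabetical order is an assumption made for the sketch.

```python
# Hypothetical toy column; the category names are made up for illustration.
colors = ["red", "green", "blue", "green", "red"]

# Step 1: label encoding. Map each unique category to an integer (1, 2, 3, ...).
# Assumption: labels are assigned in alphabetical order of the categories.
categories = sorted(set(colors))
label_of = {cat: i for i, cat in enumerate(categories, start=1)}
labels = [label_of[c] for c in colors]

# Step 2: one-hot encoding. Create one binary column per unique label,
# so each row becomes a list of n zeros and ones with exactly one 1.
n = len(categories)
one_hot = [[1 if label == k else 0 for k in range(1, n + 1)] for label in labels]

# Print each original value next to its label and its one-hot row.
for color, label, row in zip(colors, labels, one_hot):
    print(color, label, row)
```

In practice, a library routine such as `pandas.get_dummies` or scikit-learn's `OneHotEncoder` would usually do this work. The hand-written version is shown only to make the mechanics visible.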
For example, suppose label encoding generates three labels. One-hot encoding then splits the column into three binary columns, so the value of n is 3. The image below shows an example of how this technique works.
Let’s look at an exercise to get further clarification.