Data Transformation

Data transformation is the process of converting data from one format or structure to another. It can be divided into the following steps. Which of these steps are applied depends on the complexity of the transformation.

  1. Data Discovery: This is more of an exploratory step which involves profiling the data using data profiling tools or sometimes using manual scripts. The goal of this step is to understand the structure and characteristics of data.
  2. Data Mapping: This is a process which defines how individual fields are mapped, modified, joined, filtered, aggregated etc. to produce the final desired output.
  3. Data Transformation code: This is a process of generating code (e.g., SQL, Python, R, etc.) which will transform data based on the data mapping rules.
  4. Code implementation: This is a process in which the generated code is executed against the data to create the desired output.
  5. Review of data: This step ensures that the output data meets the transformation requirements. It is mostly carried out by the business or end user.
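The five steps above can be sketched in a few lines of plain Python. This is only an illustrative sketch: the records, the field names (name, country, amount) and the mapping rules are assumptions made up for the example, and in practice a library such as pandas or a SQL engine would usually do this work.

```python
# Hypothetical raw records; the field names are assumptions for illustration.
raw_rows = [
    {"name": " Alice ", "country": "in", "amount": "120.50"},
    {"name": "Bob", "country": "US", "amount": "80"},
    {"name": "Carol", "country": "IN", "amount": "99.90"},
]

# 1. Data discovery: profile each field (distinct values, Python types seen).
def profile(rows):
    report = {}
    for field in rows[0]:
        values = [row[field] for row in rows]
        report[field] = {
            "distinct": len(set(values)),
            "types": sorted({type(v).__name__ for v in values}),
        }
    return report

# 2. Data mapping: for each output field, name the source field and the rule.
mapping = {
    "customer": ("name", str.strip),
    "country": ("country", str.upper),
    "amount": ("amount", float),
}

# 3. Transformation code: a function driven by the mapping rules.
def transform(rows, rules):
    return [
        {target: rule(row[source]) for target, (source, rule) in rules.items()}
        for row in rows
    ]

# 4. Code implementation: execute the transformation against the data.
clean_rows = transform(raw_rows, mapping)

# 5. Review of data: check that the output meets the requirements.
assert all(isinstance(row["amount"], float) for row in clean_rows)
assert all(row["country"].isupper() for row in clean_rows)

print(profile(raw_rows))
print(clean_rows)
```

Keeping the mapping as data, separate from the code that applies it, makes the rules easy to review with the business or end user in step 5.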

Data Transformation in Machine Learning

Now that we have seen the different steps involved in Data Transformation, let’s get into some more detail and see how to transform data into a machine-learning-digestible format. Machine learning algorithms are built on mathematical operations, so in general we need to convert all the columns into a numerical format. Before that, let’s look at the different types of data we have.

Taking a broader perspective, data is classified into numerical and categorical data:

1. Numerical: As the name suggests, this is numeric data that is quantifiable.

2. Categorical: This is string or other non-numeric data that is qualitative in nature.

Numerical data is further divided into the following:

i. Discrete: To explain in simple terms, any numerical data that is countable is called discrete, for example, the number of people in a family or the number of students in a class. Discrete data can only take certain values (such as 1, 2, 3, 4, etc).

ii. Continuous: Any numerical data that is measurable is called continuous, for example, the height of a person or the time taken to reach a destination. Continuous data can take virtually any value (for example, 1.25, 3.8888, and 77.1276).

Categorical data is further divided into the following:

i. Ordered: Any categorical data that has some order associated with it is called ordered categorical data, for example, movie ratings (excellent, good, bad, worst) and feedback (happy, not bad, bad). You can think of ordered data as being something you could mark on a scale.

ii. Nominal: Any categorical data that has no order is called nominal categorical data. Examples include gender and country.
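As a rough illustration of this classification, the sketch below guesses the broad type of a column from its values. The function name classify_column and the sample columns are hypothetical, and the rule is only a heuristic: whether a categorical column is ordered or nominal cannot be read from the values and has to come from knowledge of the data.

```python
def classify_column(values):
    """Guess the broad data type of a column from its values (heuristic)."""
    # bool is a subclass of int in Python, so it is excluded explicitly.
    def is_int(v):
        return isinstance(v, int) and not isinstance(v, bool)

    def is_number(v):
        return isinstance(v, (int, float)) and not isinstance(v, bool)

    if all(is_int(v) for v in values):
        return "numerical (discrete)"
    if all(is_number(v) for v in values):
        return "numerical (continuous)"
    return "categorical"

# Hypothetical sample columns based on the examples in the text.
columns = {
    "family_size": [3, 4, 2, 5],
    "height_cm": [162.5, 180.1, 175.0, 158.3],
    "movie_rating": ["excellent", "good", "bad", "worst"],
    "country": ["India", "Japan", "Brazil", "India"],
}

for name, values in columns.items():
    print(name, "->", classify_column(values))
```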

From these different types of data, we will focus on categorical data. In the next section, we’ll discuss how to handle categorical data.

Handling Categorical Data

Some algorithms, such as decision trees, can work well with categorical data. But most machine learning algorithms cannot operate directly on categorical data: they require both the input and the output to be in numerical form. If the output to be predicted is categorical, then after prediction we convert it back from numerical to categorical form. Let’s discuss some key challenges that we face while dealing with categorical data:

i. High cardinality: Cardinality is the number of distinct values in a column. A high-cardinality column has a lot of different values. A good example is User ID: in a table of 500 different users, the User ID column would have 500 unique values.

ii. Rare occurrences: A column might have categories that occur very rarely and therefore would not be significant enough to have an impact on the model.

iii. Frequent occurrences: A column might have a category that occurs in almost every row. Such a column has very low variance and would fail to make an impact on the model.

iv. Won’t fit: Categorical data, left unprocessed, won’t fit our model.
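The first three challenges can be checked before any modelling. The sketch below counts distinct values and flags rare and dominant categories in a column. The function name profile_categories, the two thresholds (5% and 90%) and the sample columns are assumptions chosen for illustration, not recommended values.

```python
from collections import Counter

def profile_categories(values, rare_share=0.05, dominant_share=0.90):
    """Summarize a categorical column: cardinality, rare and dominant categories."""
    counts = Counter(values)
    total = len(values)
    return {
        # Number of distinct categories in the column.
        "cardinality": len(counts),
        # Categories whose share of rows is below the (assumed) rare threshold.
        "rare": sorted(c for c, n in counts.items() if n / total < rare_share),
        # Categories whose share of rows is above the (assumed) dominant threshold.
        "dominant": sorted(c for c, n in counts.items() if n / total > dominant_share),
    }

# Hypothetical columns: 500 unique user IDs and a heavily skewed payment column.
user_ids = [f"user_{i}" for i in range(500)]
payment = ["card"] * 95 + ["cash"] * 4 + ["cheque"] * 1

print(profile_categories(user_ids)["cardinality"])
print(profile_categories(payment))
```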

1. Encoding: To address the problems associated with categorical data, we can use encoding. This is the process by which we convert a categorical variable into a numerical form. Here, we will look at three simple methods of encoding categorical data.

2. Replacing: This is a technique in which we replace the categorical data with a number. This is a simple replacement and does not involve much logical processing. Let’s look at an exercise to get a better idea of this.
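A minimal sketch of replacing, using plain Python: each category is looked up in a hand-written table and swapped for a number. The feedback column and the numbers assigned to each category are assumptions for illustration. The inverse table shows how predictions can be converted back from numbers to categories.

```python
# Hypothetical feedback column and a hand-written replacement table.
feedback = ["happy", "bad", "not bad", "happy", "bad"]
replacements = {"bad": 0, "not bad": 1, "happy": 2}

# Replace every category with its number.
# A category missing from the table raises KeyError, which surfaces bad data early.
encoded = [replacements[value] for value in feedback]
print(encoded)

# Build the inverse table to convert numbers back to categories.
inverse = {number: label for label, number in replacements.items()}
decoded = [inverse[number] for number in encoded]
print(decoded)
```

Because feedback is ordered, the numbers here follow the order of the categories. For nominal data such as country, the numbers carry no meaning of their own, and a model may still read an order into them.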

Handling Categorical Data — Method 1 : Replacing
