##### What is Feature Scaling

Feature Scaling is a preprocessing step in Machine Learning that involves transforming the features of your dataset onto a similar scale. The primary purpose of Feature Scaling is to ensure that all features contribute equally to the computation of distances and gradients during the training process, preventing certain features from dominating due to their larger scales.

Let’s try to understand this concept with the help of an example:

We have a dataset with two features: “Age” and “Income”. The “Age” feature ranges from 0 to 100, while the “Income” feature ranges from 20,000 to 100,000. If we fit a model on this data as-is, the results may be biased and the model won’t perform optimally: without Feature Scaling, the algorithm may give more weight to “Income” simply because its values are much larger than those of “Age”.

So, how do we deal with this scaling issue? Well, there are a few methods for Feature Scaling.

##### Different Methods for Feature Scaling

**Min-Max Scaling (Normalization):**

This technique scales the values to a fixed range, usually between 0 and 1. It’s calculated using the formula:

[math]X_{\text{norm}} = \frac{X - X_{\text{min}}}{X_{\text{max}} - X_{\text{min}}}[/math]

where:

- [math]X_{\text{norm}}[/math] is the scaled value
- [math]X[/math] is the original value of the feature
- [math]X_{\text{min}}[/math] is the minimum value of the feature in the dataset
- [math]X_{\text{max}}[/math] is the maximum value of the feature in the dataset

Now let’s dive into the mathematical derivation of Min-Max Scaling by breaking it down into a few steps:

**Normalization to the range [0,1]:**

The goal of Min-Max Scaling is to map the original values to a range between 0 and 1. We do this by subtracting the minimum value [math](X_{\text{min}})[/math] from each value [math](X)[/math] to shift the range so that the minimum value becomes 0. Then, we divide by the difference between the maximum value [math](X_{\text{max}})[/math] and the minimum value [math](X_{\text{min}})[/math] to scale the values within the range 0 to 1.

**Maintaining the relative distance between values:**

Min-Max Scaling maintains the relative distances between values in the original range. By subtracting the minimum value and dividing by the range, we ensure that the relative distances between values are preserved. This is important in many machine learning algorithms, particularly those that rely on distance calculations (such as k-nearest neighbors or support vector machines).

**Ensuring the transformed data has a specific range:**

By using the minimum and maximum values of the original data, Min-Max Scaling guarantees that the transformed data will fall within the specified range [0, 1]. This can be useful for algorithms that require input data to be within a certain range, or for interpretability purposes.
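The three properties above can be sketched in a few lines of Python. `min_max_scale` here is a hypothetical helper written for illustration, not a library function:

```python
import numpy as np

def min_max_scale(x):
    """Scale a 1-D array into the range [0, 1] using the Min-Max formula."""
    x = np.asarray(x, dtype=float)
    x_min, x_max = x.min(), x.max()
    return (x - x_min) / (x_max - x_min)

# "Age" values on their original scale
ages = np.array([25, 40, 60, 100])
scaled = min_max_scale(ages)
print(scaled)  # the minimum maps to 0, the maximum to 1
```

Note how relative distances are preserved: 40 sits 20% of the way between 25 and 100, and its scaled value is exactly 0.2.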

**Limitations of Min-Max Scaling**

Overall, Min-Max Scaling is a simple and effective technique for scaling features to a specific range, making it easier to compare and combine different features in machine learning models. However, Min-Max Scaling is not suitable for all datasets, especially if the data contains outliers or if the distribution of the data is highly skewed. In such cases, other scaling techniques like Standardization (Z-score normalization) may be more appropriate.
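To make the outlier problem concrete, here is a small sketch (with made-up income values) comparing how Min-Max Scaling and Standardization react to a single extreme value:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical income values with one extreme outlier
income = np.array([[30_000], [40_000], [50_000], [60_000], [1_000_000]])

minmax = MinMaxScaler().fit_transform(income)
standard = StandardScaler().fit_transform(income)

# The outlier forces the first four Min-Max values into a narrow band near 0
print(minmax.ravel())
print(standard.ravel())
```

With Min-Max Scaling the outlier becomes 1 and squashes every other value below 0.04, destroying most of the useful spread; Standardization is affected too, but less severely.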

Let’s see this with a very basic example; once that is done, we will implement it in code as well.

Suppose we have a dataset with the following feature:

| Feature |
|---------|
| 10 |
| 20 |
| 30 |
| 40 |

To apply Min-Max Scaling to this feature, we need to calculate the scaled values using the formula:

[math]X_{\text{norm}} = \frac{X - X_{\text{min}}}{X_{\text{max}} - X_{\text{min}}}[/math]

For the given dataset:

- [math]X_{\text{min}}[/math] = 10
- [math]X_{\text{max}}[/math] = 40

Now, we can calculate the scaled values:

- for 10: [math]\frac{10 - 10}{40 - 10} = \frac{0}{30} = 0[/math]
- for 20: [math]\frac{20 - 10}{40 - 10} = \frac{10}{30} = \frac{1}{3} \approx 0.33[/math]
- for 30: [math]\frac{30 - 10}{40 - 10} = \frac{20}{30} = \frac{2}{3} \approx 0.67[/math]
- for 40: [math]\frac{40 - 10}{40 - 10} = \frac{30}{30} = 1[/math]
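Before moving to scikit-learn, the arithmetic above can be checked directly with NumPy:

```python
import numpy as np

feature = np.array([10, 20, 30, 40], dtype=float)
scaled = (feature - feature.min()) / (feature.max() - feature.min())
print(scaled)  # 0, 1/3, 2/3, 1 — matching the hand calculation
```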

Now that we have understood how to mathematically calculate Min-Max Scaling for any given feature, let’s write a simple linear regression model and apply Min-Max Scaling to its features:

```python
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import numpy as np

# Sample dataset
X = np.array([[1], [2], [3], [4]])
y = np.array([2, 4, 6, 8])

# Split the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize MinMaxScaler
scaler = MinMaxScaler()

# Fit the scaler to the training data and transform both training and test data
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("Scaled Training Data:")
print(X_train_scaled)
print("\nScaled Test Data:")
print(X_test_scaled)

# Initialize and fit the linear regression model
model = LinearRegression()
model.fit(X_train_scaled, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test_scaled)

# Calculate Mean Squared Error
mse = mean_squared_error(y_test, y_pred)
print("\nMean Squared Error:", mse)
```

Output:

```
Scaled Training Data:
[[1.        ]
 [0.        ]
 [0.66666667]]

Scaled Test Data:
[[0.33333333]]

Mean Squared Error: 1.9721522630525295e-31
```

Looking at the above code, you might have a question (one I also asked myself): why didn’t we scale the output feature “y”? This is a very important and valid question, so let’s address it before we move forward!

To understand this, let’s first break down the idea behind scaling:

**Why do we scale features?**

- When we’re training a machine learning model, we want all the features to contribute equally to the model’s predictions.
- If features have different scales (i.e., they’re on different numerical scales), features with larger scales can dominate the learning process. This can lead to biased or inefficient models.
- By scaling features, we ensure that all features are on the same scale, preventing any single feature from having too much influence on the model.

**Why don’t we scale the target variable (output feature)?**

- The target variable (or output feature) in regression problems is what we’re trying to predict. We’re interested in understanding its relationship with the features.
- Unlike features, which are input to the model, the target variable is the output we’re trying to estimate. Its scale doesn’t affect the model’s ability to learn the relationship between features and the target.
- Scaling the target variable isn’t necessary for most regression tasks because we’re focused on predicting its value based on the features, not comparing its magnitude across different observations.

**When might we scale the target variable?**

- In some cases, algorithms or techniques used for regression might require the target variable to be on a specific scale.

For example, if you’re using a distance-based algorithm like k-nearest neighbors or an algorithm with regularization terms that involve the scale of the target variable, scaling the target might be necessary to ensure optimal performance.
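For such cases, scikit-learn provides `TransformedTargetRegressor`, which scales `y` before fitting and automatically inverts the transform at prediction time, so predictions come back on the original scale. A minimal sketch using the same toy data as above:

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

X = np.array([[1], [2], [3], [4]])
y = np.array([2, 4, 6, 8])

# StandardScaler is applied to y during fit; predictions are
# automatically inverse-transformed back to the original scale of y
model = TransformedTargetRegressor(regressor=LinearRegression(),
                                   transformer=StandardScaler())
model.fit(X, y)
print(model.predict([[5]]))  # ~10.0, on the original scale of y
```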

Now that we have understood Min-Max Scaling, let’s move on to the next Feature Scaling technique: Standardization (Z-score Normalization).

**Standardization (Z-score Normalization):**

This technique scales the values to have a mean of 0 and a standard deviation of 1. It’s calculated using the formula:

[math]X_{\text{std}} = \frac{X – \mu}{\sigma}[/math]

Where:

- [math]X_{\text{std}}[/math] is the standardized value
- [math]X[/math] is the original value of the feature
- [math]\mu[/math] is the mean of the feature in the dataset
- [math]\sigma[/math] is the standard deviation of the feature in the dataset

**Centering around the mean ([math]\mu[/math]):**

The first step in Standardization is to shift the distribution of the data so that it is centered around the mean [math]\mu[/math]. This is done by subtracting the mean from each value [math](X)[/math]. By doing this, the mean of the standardized data becomes 0. Let’s understand with the help of an example:

Below are the original values for the “Age” and “Income” features:

| Age | Income |
|-----|--------|
| 30 | 50,000 |
| 40 | 70,000 |
| 25 | 60,000 |
| 35 | 80,000 |

We calculate the mean and standard deviation for each feature:

Mean of Age: [math]\mu_{\text{Age}} = \frac{{30+40+25+35}}{4}[/math] = [math]32.5[/math]

Mean of Income: [math]\mu_{\text{Income}} = \frac{{50,000 + 70,000 + 60,000 + 80,000}}{4}[/math] = [math]65,000[/math]

**Scaling by the Standard Deviation ([math]\sigma[/math]):**

After centering the data around the mean, the next step is to scale the data by dividing each value by the standard deviation [math](\sigma)[/math]. This ensures that the spread of the data is consistent across different features. Scaling by the standard deviation also ensures that the variance of the standardized data is 1.

Standard Deviation of Age: [math]\sigma_{\text{Age}} = \sqrt{\frac{(30 - 32.5)^2 + (40 - 32.5)^2 + (25 - 32.5)^2 + (35 - 32.5)^2}{4}} \approx 5.59[/math]

Standard Deviation of Income: [math]\sigma_{\text{Income}} = \sqrt{\frac{(50,000 - 65,000)^2 + (70,000 - 65,000)^2 + (60,000 - 65,000)^2 + (80,000 - 65,000)^2}{4}} \approx 11,180.34[/math]
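These hand calculations are easy to verify with NumPy (note that `np.std` divides by n by default — the population formula used here):

```python
import numpy as np

age = np.array([30, 40, 25, 35], dtype=float)
income = np.array([50_000, 70_000, 60_000, 80_000], dtype=float)

print(age.mean(), age.std())        # 32.5, ~5.59
print(income.mean(), income.std())  # 65000.0, ~11180.34
```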

**Preserving the relative distances between values:**

Similar to Min-Max Scaling, Standardization preserves the relative distances between values in the original range. By subtracting the mean and dividing by the standard deviation, the relative distances between values are maintained. This is important for algorithms that rely on distance calculations or assume a standard normal distribution.

**Interpretability:** Standardization makes the interpretation of coefficients in linear models easier. Since the standardized values have a mean of 0 and a standard deviation of 1, each coefficient represents the change in the target variable (dependent variable) associated with a one standard deviation change in the predictor variable (independent variable).

Let’s see how standardization makes the interpretation of coefficients in a linear model easier.

Suppose we have a dataset containing information about the weight (in kilograms) and height (in meters) of individuals, and we want to predict their body mass index (BMI), which is calculated as weight (kg) divided by the square of height (m).

Here’s a simplified version of the dataset:


| Height (m) | Weight (kg) | BMI |
|------------|-------------|-----|
| 1.60 | 50 | 19.5 |
| 1.75 | 65 | 21.2 |
| 1.80 | 70 | 21.6 |
| 1.68 | 60 | 21.3 |
| 1.85 | 75 | 22.0 |

Let’s say we want to build a linear regression model to predict BMI based on height and weight.

If we don’t standardize the features (height and weight), the coefficients obtained from the linear regression model will be in terms of the original units (meters for height and kilograms for weight). For example, if the coefficient for height is 2.5, it means that for each additional meter in height, the BMI is expected to increase by 2.5 units.

However, if we standardize the features using standardization (Z-score normalization), the coefficients will be in terms of standard deviations.

Here’s how it works:

- After standardization, the mean of each feature becomes 0 and the standard deviation becomes 1.
- Let’s say the coefficient for height after standardization is 0.7. This means that for each additional standard deviation in height, the BMI is expected to increase by 0.7 units.
- Similarly, if the coefficient for weight after standardization is 0.9, it means that for each additional standard deviation in weight, the BMI is expected to increase by 0.9 units.

This makes the interpretation of coefficients easier because they represent the change in the target variable (BMI) associated with a one standard deviation change in the predictor variable (height or weight).

So, in summary, standardization enhances interpretability by providing coefficients that represent the change in the target variable per standard deviation change in the predictor variable, making comparisons between coefficients more straightforward and meaningful.
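Using the table above, the change of units is easy to see in code. The exact coefficient values printed here are illustrative, not the 0.7 and 0.9 from the discussion:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Height (m), Weight (kg) and BMI values from the table above
X = np.array([[1.60, 50], [1.75, 65], [1.80, 70], [1.68, 60], [1.85, 75]])
y = np.array([19.5, 21.2, 21.6, 21.3, 22.0])

raw = LinearRegression().fit(X, y)      # coefficients per metre / per kilogram
X_std = StandardScaler().fit_transform(X)
std = LinearRegression().fit(X_std, y)  # coefficients per standard deviation

print("raw units:   ", raw.coef_)
print("standardized:", std.coef_)
```

Both models make identical predictions; only the units of the coefficients change, because standardization is an invertible affine transformation of the features.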

Coming back to our original example, let’s calculate the standardized values for Age and Income:

For “Age”:

[math]
\text{For 30:} \quad \frac{30 - 32.5}{5.59} = \frac{-2.5}{5.59} \approx -0.45
[/math]

[math]
\text{For 40:} \quad \frac{40 - 32.5}{5.59} = \frac{7.5}{5.59} \approx 1.34
[/math]

[math]
\text{For 25:} \quad \frac{25 - 32.5}{5.59} = \frac{-7.5}{5.59} \approx -1.34
[/math]

[math]
\text{For 35:} \quad \frac{35 - 32.5}{5.59} = \frac{2.5}{5.59} \approx 0.45
[/math]

For “Income”:

[math]
\text{For 50,000:} \quad \frac{50,000 - 65,000}{11,180.34} = \frac{-15,000}{11,180.34} \approx -1.34
[/math]

[math]
\text{For 70,000:} \quad \frac{70,000 - 65,000}{11,180.34} = \frac{5,000}{11,180.34} \approx 0.45
[/math]

[math]
\text{For 60,000:} \quad \frac{60,000 - 65,000}{11,180.34} = \frac{-5,000}{11,180.34} \approx -0.45
[/math]

[math]
\text{For 80,000:} \quad \frac{80,000 - 65,000}{11,180.34} = \frac{15,000}{11,180.34} \approx 1.34
[/math]

One thing to keep in mind in the context of Z-score normalization (standardization): in exact arithmetic, standardized values always have a mean of exactly 0 and a standard deviation of exactly 1 (provided the feature is not constant). In practice, though, if we compute z-scores by hand and round them to a couple of decimal places, the standard deviation of the rounded results will be close to, but not exactly, 1.

Let’s check this by calculating the standard deviation of the (rounded) standardized values for “Age” and “Income”.

For “Income”:

The calculated values for the standardized “Income” feature were:

- For 50,000: approximately -1.34
- For 70,000: approximately 0.45
- For 60,000: approximately -0.45
- For 80,000: approximately 1.34

Now, let’s calculate the standard deviation of these standardized values. We first calculate their mean and then apply the formula for standard deviation.

[math]
\text{Mean of standardized “Income”:} \quad \text{Mean} = \frac{-1.34 + 0.45 - 0.45 + 1.34}{4} = \frac{0}{4} = 0
[/math]

Next, we calculate the standard deviation using the formula:

[math]
\text{Standard deviation} = \sqrt{\frac{(-1.34 - 0)^2 + (0.45 - 0)^2 + (-0.45 - 0)^2 + (1.34 - 0)^2}{4}} = \sqrt{\frac{1.7956 + 0.2025 + 0.2025 + 1.7956}{4}} = \sqrt{0.99905} \approx 0.999
[/math]

The calculated values for the standardized “Age” feature were:

- For 30: approximately -0.45
- For 40: approximately 1.34
- For 25: approximately -1.34
- For 35: approximately 0.45

Now, let’s calculate the standard deviation for these standardized values. We’ll first calculate the mean of the standardized values and then use the formula for standard deviation to compute it as done above.


[math]
\text{Mean of standardized “Age”:} \quad \text{Mean} = \frac{-0.45 + 1.34 - 1.34 + 0.45}{4} = \frac{0}{4} = 0
[/math]

Next, we calculate the standard deviation using the formula:

[math]
\text{Standard deviation} = \sqrt{\frac{(-0.45 - 0)^2 + (1.34 - 0)^2 + (-1.34 - 0)^2 + (0.45 - 0)^2}{4}} = \sqrt{\frac{0.2025 + 1.7956 + 1.7956 + 0.2025}{4}} = \sqrt{0.99905} \approx 0.999
[/math]

Hence both standard deviations come out close to, but not exactly, 1. The small deviation is purely an artifact of rounding the z-scores to two decimal places before computing their standard deviation; in exact arithmetic, standardized values always have a standard deviation of exactly 1.
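To pin down exactly where the deviation from 1 comes from, we can compare the exact z-scores with their rounded versions in NumPy:

```python
import numpy as np

income = np.array([50_000, 70_000, 60_000, 80_000], dtype=float)

z_exact = (income - income.mean()) / income.std()
z_rounded = np.round(z_exact, 2)  # two decimal places, as in the hand calculation

print(z_exact.std())    # 1.0 (up to floating-point error)
print(z_rounded.std())  # close to, but not exactly, 1
```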

Let’s implement the code below using the above dataset for 3 different scenarios:

- Without Scaling
- With Min-Max Scaling
- With Standardization (Z-Score Normalization)

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Sample dataset
X = np.array([[1, 20_000], [2, 30_000], [3, 40_000], [4, 50_000]])
y = np.array([3000, 3500, 4000, 4500])  # Target variable (e.g., salary)

# Splitting the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Applying feature scaling
# Min-Max Scaling (Normalization)
min_max_scaler = MinMaxScaler()
X_train_min_max_scaled = min_max_scaler.fit_transform(X_train)
X_test_min_max_scaled = min_max_scaler.transform(X_test)

# Standardization (Z-score normalization)
standard_scaler = StandardScaler()
X_train_standard_scaled = standard_scaler.fit_transform(X_train)
X_test_standard_scaled = standard_scaler.transform(X_test)

# Linear Regression model
model = LinearRegression()

# Without feature scaling
model.fit(X_train, y_train)
train_score_no_scaling = model.score(X_train, y_train)
test_score_no_scaling = model.score(X_test, y_test)

# With Min-Max Scaling (Normalization)
model.fit(X_train_min_max_scaled, y_train)
train_score_min_max = model.score(X_train_min_max_scaled, y_train)
test_score_min_max = model.score(X_test_min_max_scaled, y_test)

# With Standardization (Z-score normalization)
model.fit(X_train_standard_scaled, y_train)
train_score_standard = model.score(X_train_standard_scaled, y_train)
test_score_standard = model.score(X_test_standard_scaled, y_test)

# Plotting
plt.figure(figsize=(10, 6))

# Plot without scaling
plt.plot(['Train (No Scaling)', 'Test (No Scaling)'],
         [train_score_no_scaling, test_score_no_scaling],
         label='No Scaling', marker='o')

# Plot with Min-Max Scaling
plt.plot(['Train (Min-Max Scaling)', 'Test (Min-Max Scaling)'],
         [train_score_min_max, test_score_min_max],
         label='Min-Max Scaling', marker='o')

# Plot with Standardization
plt.plot(['Train (Standardization)', 'Test (Standardization)'],
         [train_score_standard, test_score_standard],
         label='Standardization', marker='o')

plt.xlabel('Data Type')
plt.ylabel('R^2 Score')
plt.title('Model Scores with and without Feature Scaling')
plt.legend()
plt.grid(True)
plt.show()
```

Output: a line plot comparing the train and test R² scores for the three scenarios (no scaling, Min-Max Scaling, and Standardization).