Linear regression is a statistical method that can be used to model the relationship between a dependent and one or more independent variables.

Now, you might be thinking: that's just the definition, and I'm not here for a definition of linear regression. So let's try to understand the concept with an example. Say you want to buy the dream house you have been working so hard for over the years, but at the same time you are worried that you might end up overpaying for it. To avoid that, you ask a friend who is a statistician to help you with a strategy for deciding whether or not to buy a particular house.

Now your friend asks you what primary factor you would look for in a house, to which your reply is: the size of the house.

Based on this one input from you, he plans to apply linear regression with the following strategy to guide your decision-making:

**Data Collection:** Collect data on various houses, noting their sizes (in square feet) and corresponding prices.

**Data Visualization:** Plot the data on a scatter plot, where the x-axis represents the size of the house and the y-axis represents the price.

**Linear Regression Model:** Apply linear regression to create a model that fits the trend in the data. The model will have a slope (coefficient) and an intercept.

**Model Interpretation:** The linear regression model will give you an equation like $price = m \times size + b$, where $m$ is the slope (indicating the price per square foot) and $b$ is the y-intercept (representing the base price).

**Prediction:** Now, if you have the size of a new house that's not in the input dataset, you can use the model to predict its price. Plug the size into the equation, and the model will give you an estimated price.

**Evaluation:** Evaluate the model's performance using metrics like Mean Squared Error (MSE) to see how well it predicts prices based on size. MSE is a measure of the average squared difference between the predicted values and the actual values. In the context of linear regression, it helps assess how well the model is performing. Lower MSE values indicate better predictive performance.
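To make the MSE calculation concrete, here is a minimal hand-rolled sketch; the actual and predicted prices below are made-up numbers purely for illustration:

```python
import numpy as np

# Hypothetical actual and predicted prices for three houses
actual = np.array([300000.0, 350000.0, 400000.0])
predicted = np.array([310000.0, 340000.0, 395000.0])

# MSE: the average of the squared differences
mse = np.mean((actual - predicted) ** 2)
print(mse)  # 75000000.0
```

Note how squaring the errors makes the metric insensitive to the sign of the error but very sensitive to large misses, which is why a few bad predictions can inflate the MSE dramatically.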

In simpler terms, the linear regression model is essentially learning the relationship between the size of a house and its price from the provided data. Once trained, it can be used to make predictions for new houses. This example illustrates how linear regression, with its straightforward equation, can be applied to real-world situations to make predictions based on a single input variable. In this case, it's predicting house prices based on the size of the house.

Now that you know what linear regression can do, let's try to understand it mathematically before we implement it in code.

The formula for linear regression is: $y = mx + b$

Where:

- $y$ is the dependent variable,
- $x$ is the independent variable,
- $m$ is the slope of the line (coefficient),
- $b$ is the y-intercept.
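For the single-variable case, the slope and intercept that minimize the squared error have a well-known closed form: $m = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}$ and $b = \bar{y} - m\bar{x}$. A minimal sketch with made-up data (chosen to lie exactly on $y = 300x + 50000$ so the recovered values are easy to check):

```python
import numpy as np

# Toy data: five hypothetical house sizes and prices on the line y = 300x + 50000
x = np.array([150.0, 155.0, 160.0, 165.0, 170.0])
y = np.array([95000.0, 96500.0, 98000.0, 99500.0, 101000.0])

# Least-squares estimates: m = cov(x, y) / var(x), b = mean(y) - m * mean(x)
m = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b = y.mean() - m * x.mean()
print(m, b)  # 300.0 50000.0
```

This is exactly what libraries like scikit-learn compute under the hood for a single feature, so it is a useful sanity check before reaching for a library.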

In the case of multiple independent variables, the equation becomes:

$y=b_{0}+b_{1}⋅x_{1}+b_{2}⋅x_{2}+…+b_{n}⋅x_{n}$

Where:

- $y$ is the dependent variable,
- $b_{0}$ is the y-intercept,
- $b_{1},b_{2},…,b_{n}$ are the coefficients for the independent variables $x_{1},x_{2},…,x_{n}$.

The goal of Linear regression is to find the values of $m$, $b$, $b_{0}$, $b_{1}$, $b_{2}$, etc., that minimize the sum of squared differences between the actual and predicted values of the dependent variable.
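In the multiple-variable case, the coefficients that minimize the sum of squared differences can be found by solving the least-squares problem directly (the normal equation $b = (X^{T}X)^{-1}X^{T}y$). A sketch with noise-free synthetic data, so the known coefficients are recovered exactly:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data generated as y = 5 + 2*x1 + 3*x2 (no noise, for an exact check)
X = rng.random((50, 2))
y = 5 + 2 * X[:, 0] + 3 * X[:, 1]

# Prepend a column of ones so the intercept b0 is estimated alongside b1, b2
X1 = np.column_stack([np.ones(len(X)), X])

# Solve the least-squares problem (equivalent to the normal equation,
# but numerically more stable than forming the inverse explicitly)
coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
print(coef)  # approximately [5. 2. 3.]
```

With real, noisy data the recovered coefficients would only approximate the true ones, but the same machinery applies.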

Let's implement linear regression to model the above example of house prices, where our independent variable is `house_sizes` and the dependent variable we will be predicting is `house_prices`.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Generate synthetic data for house sizes and prices
np.random.seed(42)
house_sizes = 150 + 20 * np.random.rand(20, 1)  # Random house sizes between 150 and 170 sq. ft
house_prices = 50000 + 300 * house_sizes + 10000 * np.random.randn(20, 1)  # Linear relation with some random noise

# Visualize the data
plt.scatter(house_sizes, house_prices, color='blue')
plt.title('House Prices vs. Size')
plt.xlabel('Size (sq. ft)')
plt.ylabel('Price ($)')
plt.show()

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(house_sizes, house_prices, test_size=0.2, random_state=42)

# Create a Linear Regression model
model = LinearRegression()

# Train the model on the training data
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

# Calculate Mean Squared Error (MSE) on the test set
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error on the test set: {mse:.2f}")

# Make predictions for new house sizes
new_sizes = np.array([[160], [165], [170]])  # Sizes of new houses to predict prices for
predicted_prices = model.predict(new_sizes)

# Display the predictions
for size, price in zip(new_sizes, predicted_prices):
    print(f"Predicted price for a house of size {size[0]} sq. ft: ${price[0]:.2f}")
```

```
Mean Squared Error on the test set: 151621882.14
Predicted price for a house of size 160 sq. ft: $94282.63
Predicted price for a house of size 165 sq. ft: $94658.20
Predicted price for a house of size 170 sq. ft: $95033.76
```

If you look at the output closely, the Mean Squared Error (MSE) is very high, which indicates that the linear regression model is not fitting the data well. Let's investigate this further. To do that, let's visualize the actual vs. predicted house prices.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Generate synthetic data for house sizes and prices
np.random.seed(42)
house_sizes = 150 + 20 * np.random.rand(100, 1)
house_prices = 50000 + 300 * house_sizes + 10000 * np.random.randn(100, 1)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(house_sizes, house_prices, test_size=0.2, random_state=42)

# Visualize the actual test-set prices
plt.scatter(X_test, y_test, color='blue', label='Actual Prices')
plt.title('Actual vs. Predicted Prices (Test Set)')
plt.xlabel('Size (sq. ft)')
plt.ylabel('Price ($)')

# Create a Linear Regression model
model = LinearRegression()

# Train the model on the training data
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

# Visualize the predicted values
plt.scatter(X_test, y_pred, color='red', label='Predicted Prices')
plt.legend()

# Show the plot
plt.show()

# Calculate Mean Squared Error (MSE) on the test set
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error on the test set: {mse:.2f}")
```

Looking at the above visualization of the actual vs. predicted prices on the test set, a high Mean Squared Error (MSE) could be attributed to the following factors:

**Non-linearity:** The scatter plot suggests that there may be a non-linear relationship between house size and price. Linear regression assumes a linear relationship, and if the true relationship is more complex, the model may not capture it well. In such cases, we would need a more expressive model that can capture non-linear relationships.

**Outliers:** If there are outliers in the data, they can have a significant impact on the performance of linear regression. Outliers can influence the model’s coefficients and predictions, leading to higher MSE.

**Underfitting:** Linear regression may be too simple to capture the underlying patterns in the data. If the relationship between house size and prices is more complex, a linear model may underfit the data, resulting in higher MSE.

**Randomness in Data Generation:** The synthetic data generation process includes random noise (`np.random.randn(100, 1)`). If this noise is large, it introduces variability in the data, leading to a higher MSE. I used randomly generated data to illustrate the example, but in real-world scenarios, we should always model a real dataset.

To address the above issues, you might consider the following:

**Feature Engineering:** Explore non-linear features or transformations of features to capture more complex relationships.

**Outlier Detection and Removal:** Identify and handle outliers in the data that might be affecting the model’s performance.

**Polynomial Regression:** If a linear model is insufficient, consider polynomial regression, which can capture non-linear relationships.

**Increase Data Size:** If the dataset is small, increasing its size might help the model learn better patterns.

**Explore More Realistic Data:** Consider using a more realistic dataset that better represents the actual problem you’re trying to solve.
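As one concrete illustration of the polynomial-regression remedy above, here is a minimal sketch. The quadratic relationship in the synthetic data is an assumption made purely for demonstration, not something from the house-price example:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(42)

# Synthetic non-linear data: price grows quadratically with size, plus noise
sizes = 150 + 20 * rng.random((100, 1))
prices = 50000 + 5 * sizes.ravel() ** 2 + 1000 * rng.standard_normal(100)

# Fit a plain linear model and a degree-2 polynomial model on the same data
linear = LinearRegression().fit(sizes, prices)
poly = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(sizes, prices)

mse_linear = mean_squared_error(prices, linear.predict(sizes))
mse_poly = mean_squared_error(prices, poly.predict(sizes))
print(mse_linear, mse_poly)  # the polynomial fit has the lower training MSE
```

Because the linear model is nested inside the polynomial one, the polynomial fit can never have a higher training MSE; to judge whether the extra flexibility genuinely helps (rather than overfits), compare the two on a held-out test set as in the earlier listing.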