##### Introduction

In this post, we will go over all the nitty-gritty details of Multiple Linear Regression, but first let's start by understanding what Multiple Linear Regression is.

Multiple linear regression is a statistical technique used to model the relationship between multiple independent variables (also called predictors or features) and a dependent variable (also called the outcome or response variable). It assumes a linear relationship between the predictors and the outcome variable.

##### Model Representation

The multiple linear regression model can be represented as:

[math]Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_n X_n + \varepsilon[/math]

Where:

- [math]Y[/math] is the dependent variable (outcome).
- [math]X_1, X_2, \ldots, X_n[/math] are the independent variables (predictors).
- [math]\beta_0[/math] is the intercept (constant term).
- [math]\beta_1, \beta_2, \ldots, \beta_n[/math] are the coefficients (parameters) representing the change in [math]Y[/math] for a unit change in each corresponding [math]X[/math].
- [math]\varepsilon[/math] is the error term, representing the difference between the observed and predicted values.
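To make the notation concrete, here is a tiny sketch with made-up numbers (two predictors and hypothetical coefficient values, ignoring the error term):

```python
import numpy as np

# Hypothetical coefficients for a model with two predictors (made-up values)
beta_0 = 1.0                   # intercept
beta = np.array([2.0, 3.0])    # beta_1, beta_2
x = np.array([4.0, 5.0])       # a single observation: X_1, X_2

# Y = beta_0 + beta_1*X_1 + beta_2*X_2 (the error term epsilon is ignored here)
y_hat = beta_0 + beta.dot(x)
print(y_hat)  # 1 + 2*4 + 3*5 = 24.0
```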

##### Mathematical Derivation

The objective of Multiple Linear Regression is to estimate the coefficients [math]\beta_0, \beta_1, \ldots, \beta_n[/math] that minimize the sum of squared differences between the observed and predicted values. This is typically done using the method of ordinary least squares (OLS).

The OLS estimates for the coefficients can be obtained by minimizing the sum of squared residuals (errors):

[math]\sum_{i=1}^{n} (Y_i - (\beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \ldots + \beta_n X_{in}))^2[/math]
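As an illustration, this sum of squared residuals can be computed directly for a small made-up dataset and an arbitrary (not yet optimal) choice of coefficients:

```python
import numpy as np

# Made-up data: 5 observations, 2 predictors
Y = np.array([3.0, 5.0, 4.0, 7.0, 6.0])
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [1.5, 1.5],
              [3.0, 2.0],
              [2.5, 1.0]])

# Arbitrary trial coefficients (hypothetical, not the OLS estimates)
beta_0 = 1.0
beta = np.array([1.0, 0.5])

# Sum of squared differences between observed and predicted values
residuals = Y - (beta_0 + X @ beta)
rss = np.sum(residuals ** 2)
print(rss)  # 10.8125
```

The OLS estimates are exactly the coefficient values that make this quantity as small as possible.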

To minimize this expression, the partial derivatives of the sum of squared residuals with respect to each coefficient are equated to zero:

[math]\frac{\partial}{\partial \beta_0} \sum_{i=1}^{n} (Y_i - (\beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \ldots + \beta_n X_{in}))^2 = 0[/math]

[math]\frac{\partial}{\partial \beta_1} \sum_{i=1}^{n} (Y_i - (\beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \ldots + \beta_n X_{in}))^2 = 0[/math]

[math]\vdots[/math]

[math]\frac{\partial}{\partial \beta_n} \sum_{i=1}^{n} (Y_i - (\beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \ldots + \beta_n X_{in}))^2 = 0[/math]

Before going any further, we should understand why we took the partial derivatives of the [math]\sum_{i=1}^{n} (Y_i - (\beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \ldots + \beta_n X_{in}))^2[/math] expression and equated them to zero.

As we saw in the section above, the [math]\sum_{i=1}^{n} (Y_i - (\beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \ldots + \beta_n X_{in}))^2[/math] expression represents the residual sum of squares (RSS) in the context of linear regression, which quantifies the discrepancy between the observed values [math]Y_i[/math] and the predicted values [math]\hat{Y}_i[/math] generated by the regression model.

RSS = [math]\sum_{i=1}^{n} (Y_i - (\beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \ldots + \beta_n X_{in}))^2[/math]

It can also be represented more compactly as

RSS = [math]\sum_{i=1}^{n} (Y_i - X_i \beta)^2[/math]

The goal in linear regression is to find the values of the coefficients [math]\beta_0, \beta_1, \ldots, \beta_n[/math] that minimize the RSS. This is typically done by finding the values of the coefficients that make the partial derivatives of the RSS with respect to each coefficient equal to zero, as shown in the equations above. These values are the critical points where the RSS might be at a minimum.

Now, let’s understand how the partial derivative works:

**Partial Derivatives**: Taking the partial derivatives of the RSS with respect to each coefficient [math]\beta_j \quad \text{where} \quad j = 0, 1, \ldots, n[/math] tells us how the RSS changes with respect to changes in each coefficient individually. Setting these partial derivatives to zero identifies candidate solutions where the RSS might be minimized. That is why we take the partial derivative of the RSS with respect to each coefficient and equate it to zero:

[math]\frac{\partial}{\partial \beta_0} \sum_{i=1}^{n} (Y_i - (\beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \ldots + \beta_n X_{in}))^2 = 0[/math]

[math]\frac{\partial}{\partial \beta_1} \sum_{i=1}^{n} (Y_i - (\beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \ldots + \beta_n X_{in}))^2 = 0[/math]

[math]\vdots[/math]

[math]\frac{\partial}{\partial \beta_n} \sum_{i=1}^{n} (Y_i - (\beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \ldots + \beta_n X_{in}))^2 = 0[/math]
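Because the RSS is a smooth function of the coefficients, all of its partial derivatives vanish at the minimizer. We can verify this numerically on made-up data by evaluating a central-difference gradient of the RSS at the OLS solution and checking that it is (approximately) zero; this sketch is only a sanity check, not part of the derivation:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.random((50, 2))])  # intercept column + 2 predictors
y = X @ np.array([1.0, 2.0, 3.0]) + rng.normal(size=50)

def rss(beta):
    r = y - X @ beta
    return r @ r

# OLS solution via the normal equations
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Central-difference partial derivatives of the RSS at beta_hat
eps = 1e-6
grad = np.array([
    (rss(beta_hat + eps * np.eye(3)[j]) - rss(beta_hat - eps * np.eye(3)[j])) / (2 * eps)
    for j in range(3)
])
print(np.allclose(grad, 0.0, atol=1e-4))  # True
```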

##### Vector Representation

Now that we have understood the mathematical model representation, let's interpret it in vector form, which will make it easier to translate into code when modeling the data.

Consider the multiple regression model: [math]Y = X\beta + \varepsilon[/math]

RSS = [math]\sum_{i=1}^{n} (Y_i - X_i \beta)^2[/math]

To represent RSS in matrix form, let’s denote the following:

- [math]Y[/math] as the vector of observed values of the dependent variable (of length n).
- [math]X[/math] as the matrix of predictor variables (of size [math]n \times (p+1)[/math], where [math]p[/math] is the number of predictors, and the first column of [math]X[/math] is a vector of ones for the intercept term).
- [math]\hat{Y}[/math] as the vector of predicted values of the dependent variable.

Then, the predicted values [math]\hat{Y}[/math] can be expressed as [math]\hat{Y} = X \beta[/math], where [math]X \beta[/math] represents the matrix product of [math]X[/math] and [math]\beta[/math].

Now we can define a residual vector [math]\varepsilon[/math], which denotes the difference between the observed values [math]Y[/math] and the predicted values [math]\hat{Y}[/math], i.e., [math]\varepsilon = Y - \hat{Y}[/math]

Using the above equation, we can rewrite the RSS as the sum of squares of the entries of the residual vector [math]\varepsilon[/math]:

[math]RSS = \sum_{i=1}^{n} \varepsilon_i^2[/math]

In matrix form, this can be written as the product of the transpose [math]\varepsilon^T[/math] and [math]\varepsilon[/math]:

[math]RSS = \varepsilon^T \varepsilon[/math]

Expanding this expression using the definitions of [math]\varepsilon[/math] and [math]\hat{Y}[/math], we get:

[math]RSS = (Y - X\beta)^T (Y - X\beta)[/math]

Expanding and simplifying the above expression:

[math]RSS = Y^T Y - 2\beta^T X^T Y + \beta^T X^T X \beta[/math]
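This expansion can be sanity-checked numerically with arbitrary made-up values of Y, X, and β:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.random((10, 3))
Y = rng.random(10)
beta = rng.random(3)

lhs = (Y - X @ beta) @ (Y - X @ beta)                      # (Y - X beta)^T (Y - X beta)
rhs = Y @ Y - 2 * beta @ X.T @ Y + beta @ X.T @ X @ beta   # expanded form
print(np.isclose(lhs, rhs))  # True
```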

To find the coefficients [math]\beta[/math] that minimize the RSS, we take the derivative of the RSS with respect to [math]\beta[/math], set it equal to zero, and solve for [math]\beta[/math]. This is the same method we used in the partial derivatives section above; the only difference is that here we solve it in vectorized form, whereas earlier we worked in scalar form.

[math]\frac{\partial RSS}{\partial \beta} = -2X^T Y + 2X^T X\beta = 0[/math]

Now, solving the above equation for [math]\beta[/math] gives us:

[math](X^T X) \beta = X^T Y[/math]

[math]\hat{\beta} = (X^T X)^{-1} X^T Y[/math]

where [math]\hat{\beta}[/math] is the vector of estimated coefficients.
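As a quick check, the closed-form estimate from the normal equations should agree with NumPy's least-squares solver on made-up data (a sketch; in practice np.linalg.lstsq or np.linalg.solve is preferred over an explicit matrix inverse for numerical stability):

```python
import numpy as np

rng = np.random.default_rng(42)
X = np.column_stack([np.ones(100), rng.random((100, 3))])  # intercept column + 3 predictors
y = rng.random(100)

# beta_hat = (X^T X)^{-1} X^T y via the normal equations
beta_normal = np.linalg.inv(X.T @ X) @ X.T @ y

# Reference: NumPy's least-squares solver
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(beta_normal, beta_lstsq))  # True
```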

Now that we have understood the model representation and the mathematical derivation of Multiple Linear Regression, let's implement a model in code. I will first write each block of code with an explanation, and later consolidate the whole code.

```python
import numpy as np

class MultipleLinearRegression:
    def __init__(self):
        self.intercept_ = None
        self.coef_ = None
```

The above block of code defines a Python class named MultipleLinearRegression. It initializes two instance variables, intercept_ and coef_, to None.

```python
    def fit(self, X, y):
        ones = np.ones((X.shape[0], 1))
        X = np.concatenate((ones, X), axis=1)  # Adding a column of ones for the intercept term
        # Compute coefficients using the closed-form solution (OLS)
        self.coefficients = np.linalg.inv(X.T.dot(X)).dot(X.T).dot(y)
        self.intercept_ = self.coefficients[0]
        self.coef_ = self.coefficients[1:]
```

The above method, fit, is used to train the Multiple Linear Regression model. It takes two arguments: X (the input features) and y (the target variable). It first creates a column vector of ones using np.ones((X.shape[0], 1)). This vector is added as the first column of the input features X using np.concatenate. This is done to account for the intercept term in the linear regression model.

Next, it computes the coefficients of the linear regression model using the closed-form solution, also known as ordinary least squares (OLS). It first multiplies the transpose of X with X (X.T.dot(X)), then takes the inverse of the resulting matrix (np.linalg.inv), and finally multiplies the result with the transpose of X and with y to get the coefficients. This is exactly what we derived in the vector representation section.

The intercept term is stored in self.intercept_, and the remaining coefficients are stored in self.coef_.

```python
    def predict(self, X):
        ones = np.ones((X.shape[0], 1))
        X = np.concatenate((ones, X), axis=1)  # Adding a column of ones for the intercept term
        return X.dot(self.coefficients)
```

This method predict is used to make predictions using the trained linear regression model.

Similar to the fit method, it adds a column of ones to the input features to account for the intercept term.

It then calculates the predicted values using the dot product of the input features X and the coefficients self.coefficients.

```python
# Example usage:
# Generate some sample data
np.random.seed(0)
X = np.random.rand(100, 2)  # 100 samples, 2 features
# True relationship: y = 2*x1 + 3*x2 + noise
y = 2*X[:, 0] + 3*X[:, 1] + np.random.randn(100)

# Split the data into train and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the multiple linear regression model
model = MultipleLinearRegression()
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

# Evaluate the model
from sklearn.metrics import r2_score
accuracy = r2_score(y_test, y_pred)
print('Accuracy (R^2 score):', accuracy)
print('Intercept:', model.intercept_)
print('Coefficients:', model.coef_)
```

This block of code demonstrates the usage of the MultipleLinearRegression class. It generates some sample data X and y, where the relationship between the features and the target variable y is known.

Then, it splits the data into training and testing sets using train_test_split from scikit-learn. It fits the MultipleLinearRegression model to the training data using the fit method. After that, it makes predictions on the test set using the predict method. Finally, it evaluates the model's performance using the R^2 score and prints the result.

Finally, running the whole code:

```python
import numpy as np
import matplotlib.pyplot as plt

class MultipleLinearRegression:
    def __init__(self):
        self.intercept_ = None
        self.coef_ = None

    def fit(self, X, y):
        ones = np.ones((X.shape[0], 1))
        X = np.concatenate((ones, X), axis=1)  # Adding a column of ones for the intercept term
        # Compute coefficients using the closed-form solution (OLS)
        self.coefficients = np.linalg.inv(X.T.dot(X)).dot(X.T).dot(y)
        self.intercept_ = self.coefficients[0]
        self.coef_ = self.coefficients[1:]

    def predict(self, X):
        ones = np.ones((X.shape[0], 1))
        X = np.concatenate((ones, X), axis=1)  # Adding a column of ones for the intercept term
        return X.dot(self.coefficients)

# Example usage:
# Generate some sample data
np.random.seed(0)
X = np.random.rand(100, 2)  # 100 samples, 2 features
y = 2*X[:, 0] + 3*X[:, 1] + np.random.randn(100)  # True relationship: y = 2*x1 + 3*x2 + noise

# Split the data into train and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the multiple linear regression model
model = MultipleLinearRegression()
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

# Evaluate the model
from sklearn.metrics import r2_score
accuracy = r2_score(y_test, y_pred)
print('Accuracy (R^2 score):', accuracy)
print('Intercept:', model.intercept_)
print('Coefficients:', model.coef_)

# Plotting the model's predictions vs. actual values
plt.scatter(y_test, y_pred, color='blue')
plt.plot(y_test, y_test, color='red', linestyle='--')  # Plotting the y = x line for reference
plt.title('Model Predictions vs. Actual Values')
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
plt.show()
```

Output:

```
Accuracy (R^2 score): 0.43228286040209773
Intercept: 0.04493496059558341
Coefficients: [1.56277657 3.09631373]
```