Math behind Simple Linear Regression

Simple Linear Regression

In this post we will try to understand the Math behind Simple Linear Regression. But before getting into the details let’s understand what Simple Linear Regression means.

Simple Linear Regression basically defines relationship between one feature and a continuous outcome variable/ dependent variable y = α +βx.

This equation is similar to the slope-intercept form, where y is the predicted value of the dependent variable, α is the intercept and β is the slope and x is the feature/independent variable.

Providing the value of x, the Linear regression model would evaluate the α and β values that would minimize the absolute difference between the predicted value of y which is ŷ and actual value y.

Also refer this post if you want to see the end to end implementation of Simple Linear Regression.

Let’s understand this using an example below:

Suppose we are trying to predict the weight of individuals based on their heights. Below is how the distribution looks like when we plot it out:

Our goal here as part of the Simple Linear Regression is to plot a best fit line which basically encompasses or approximates the points corresponding to the heights and weights that we have for the distribution. And if we have a future height then we could predict the corresponding weight using the best fit line as shown below in red.

Now if you see there are still some points which are not being covered by the best fit line. How do we take those points also into consideration?

Approach -1 (Simple Linear Regression — Cost Function/Loss Function)

One way is to take multiple best fit lines and calculate the average of distance of the points (let’s call it a J) from each best fit line and then consider the best fit line which has the least value for J. This is called the Cost Function.

There is one more concept that we need to understand called the Loss Function, but before getting into that let’s just see the equation for Cost Function. Later we will see how the Cost Function is basically derived from the Loss Function.

Cost Function (J) =  \frac{\mathbf{1}}{{m}}\sum_{i=1}^{m}L\left({\hat{y}}^{\left(i\right)}-y^{\left(i\right)}\right)


L = Loss Function

m = number of points

{\hat{{y}}}^{\left(i\right)} =Prediction output

{y}^{\left(i\right)} = true label/actual outcome variable.

Below figure you can see the green best fit line seems to have the least value for the Cost Function and hence would be considered as our best fit line.

Loss Function

The L that we saw in the above equation is the Loss Function. It is basically measures the discrepancy between the prediction ({\hat{{y}}}^{\left(i\right)}) and the desired output ({y}^{\left(i\right)})

However, Loss Function computes the error for a single training example. The equation for Loss Function is {L}({\hat{{y}}}^{\left(i\right)}-{y}^{\left(i\right)}) = \frac{\mathbf{1}}{\mathbf{2}}{({\hat{{y}}}^{\left(i\right)}-{y}^{\left(i\right)})}^\mathbf{2}


{\hat{{y}}}^{\left(i\right)}  = Prediction output

{y}^{\left(i\right)} =true label/actual outcome variable.

So, Loss Function is applied with respect to a single training example, however Cost Function is applied for the entire training set.

Based on this understanding we could derive the Cost Function and write it as follows

Cost Function (J) = \frac{\mathbf{1}}{{m}}\sum_{i=1}^{m}{L({\hat{y}}^{\left(i\right)}-y^{\left(i\right)})}^\mathbf{2} => \frac{\mathbf{1}}{\mathbf{2}{m}}\sum_{i=1}^{m}{({\hat{{y}}}^{\left(i\right)}-{y}^{\left(i\right)})}^\mathbf{2}

The above approach seems perfect and we could use it to select the best fit line which has the minimum Cost Function and our life is easy.

But wait we haven’t considered a very important factor here. While considering lot of best fit lines and computing the cost function for each of them and then selecting the minimum value for the cost function would consume more amount of time as well and compute power.

This doesn’t seem to me as a very time/cost efficient approach. So, let’s see some other approach which would be better than this.

Approach -2 –Gradient Descent (2 – Dimensional)

Now, before getting into the details of Gradient Descent let’s first see how we approach the problem first.

Suppose we have a set of points and the equation for that is given by y = x. Below plot shows the distribution of the data for the equation.

Now based on the distribution, we need to find out the best fit line for it. So, the equation for the line would be {\hat{{y}}}^{\left(i\right)} =α +βx. The α value here will be 0 as we are considering a 2-dimensional space for the independent feature (x) and the dependent outcome variable (y) which means the equation for the best fit line would become {\hat{{y}}}^{\left(i\right)} = βx.

Now let’s take sample values and find the values of {\hat{{y}}}^{\left(i\right)}

We will also calculate the value of cost function using the equation:

Cost Function (J) = \frac{\mathbf{1}}{{m}}\sum_{i=1}^{m}{L({\hat{y}}^{\left(i\right)}-y^{\left(i\right)})}^\mathbf{2} => \frac{\mathbf{1}}{\mathbf{2}{m}}\sum_{i=1}^{m}{({\hat{{y}}}^{\left(i\right)}-{y}^{\left(i\right)})}^\mathbf{2}

For β =1, the cost function would be as follows:

Cost Function (J) = \frac{\mathbf{1}}{\mathbf{2}{m}}\sum_{{i}=\mathbf{1}}^{{m}}\left(\hat{{y}}-{y}\right)^\mathbf{2}=\frac{\mathbf{1}}{\mathbf{8}}\left(\mathbf{1}-\mathbf{1}\right)^\mathbf{2}+\left(\mathbf{2}-\mathbf{2}\right)^\mathbf{2}+\left(\mathbf{3}-\mathbf{3}\right)^\mathbf{2}=\mathbf{0}

For β =0.5, the cost function would be as follows:

Cost Function (J) = \frac{\mathbf{1}}{\mathbf{2}{m}}\sum_{{i}=\mathbf{1}}^{{m}}\left(\hat{{y}}-{y}\right)^\mathbf{2}=\frac{\mathbf{1}}{\mathbf{8}}\left(\mathbf{0}\cdot\mathbf{5}-\mathbf{1}\right)^\mathbf{2}+\left(\mathbf{1}-\mathbf{2}\right)^\mathbf{2}+\left(\mathbf{1}\cdot\mathbf{5}-\mathbf{3}\right)^\mathbf{2}=\mathbf{0}.\mathbf{4375}

Similarly, we would calculate the cost function for all the values of the slope β.

Note here that the value of m used in the cost function is the number of data points. In our case the value is 4 as we are using 4 data points as specified in the above graph.

If we plot a graph with different values of the Cost Function (J) vs Slope (β) we will get something like below, where the graph starts from a high value for cost function and then gradually descends to minimum and then again starts increasing to a higher value.

This phenomenon is what we call as Gradient descent.

Global Minima

Now one more question which we need to answer is:

  1. From the above graph, where should we stop to select the value for cost function (J) such that it has the best fit line.In other words, the value where we need to stop to select the cost function(J) which yields us the best fit line is called the Global Minima. As per our graph, the Global Minima seems to be at slope (β) =1

Let’s try to understand how this works.

When we start to evaluate the cost function, let’s say it starts from a higher value and gradually starts descending to the global minima (red dot). For the cost function to descend towards the global minima we need to use the Convergence theorem.

Convergence Theorem

Until now we understood how we could reach the global minima using Gradient Descent, but we need to also understand how this whole mechanism works.

To understand this let’s step back a bit and reiterate:

In a 2- dimensional space where we are considering only the slope(β) and not the intercept (α)

our motive is to find a point where the cost function is minimum, and we use Gradient descent to calculate that. So Gradient descent is basically the rate at which the cost function descends to the minimum value i.e. the rate at which the difference between the predicted value ({\hat{{y}}}^{\left(i\right)}) and the true value(y) decreases to become minimum.

The formula for Gradient Descent is as follows:

 Loop: {

                                      β: = β – α\frac{\partial J\left({\beta}\right)}{\partial\beta}


Note that the α, that we have used in the formula above is called the Learning Rate which will be a very small constant value of around 0.001.

Smaller value of α which will allow us to descend the slope towards the global Minima in very smaller steps.

If we take the Learning Rate as a larger value, then chances are we won’t be able to reach the global minima even after many iterations.

\frac{\partial J\left({\beta}\right)}{\partial\beta}, is the partial derivative of the cost function with respect to the slope.

We run the above algorithm for Gradient Descent in loop until it converges to a Global Minima. Let’s understand this with the help of the diagram below:

Consider that the we just have the function J(β) and we want to find the gradient descent for it:

In order to find the gradient descent, we would run the Gradient descent algorithm in a loop until it descends to a local minimum of the above graph.

So, if you want to find the slope of the right part of the curve, that will get us a positive slope (see the figure above for formula of slope).

There will decrease in the value of β from the right part of the curve, because using the Gradient descent formula, the learning rate (α) multiplied with positive \frac{\partial J\left({\beta}\right)}{\partial\beta}

and then subtracted from β will decrease the value of β and making it descend down the curve.

Similarly, for the left part of the curve the slope will be negative (See the figure for the formula of slope).

So the above equation will show an increase in the value of W from the left part of the curve, because learning rate(α) multiplied with negative  \frac{\partial J\left({\beta}\right)}{\partial\beta} and then subtracted from β will increase the value of β and making it descend down the curve.

Now, after we reach the Global Minima, if we try to find the slope of the curve, it will basically be 0 and this should essentially be the slope of our best fit line.

Gradient Descent (Multi- Dimensional)

In the above approach we had just one feature which we try to converge it to the Global Minima in order to find the best fit line.

What will happen if we have a multiple feature. What would be our approach in that case?

The approach would remain the same, however in this case all the features will try to approach the Global Minima. You could see the below 3 – dimensional graph where we have all the features converging at a particular point specified in the diagram below.