Understanding the math behind any algorithm is very important. Often we dive straight into solving a problem using some machine learning algorithm and applying all sorts of techniques without understanding the underlying fundamentals of the algorithm, the assumptions that should be taken while implementing it on a particular scenario etc.In this post, we will try to understand the math behind regression and the various assumptions involved.So, before gettings our hands dirty by diving into the maths of regression, let’s define the dataset that we will be using throughout this post.
Input for Regression model
Let’s consider inputs which will be denoted by X as being drawn from some p-dimensional space which we call Rp .
So, if you take an example, let’s consider input that has age and income as the attributes. This means that the p is 2-dimensions. One of the dimensions is age and the other is income.
In this case we are trying to move to a more general setting where we are talking about any kind of p-dimensional space where p could be much larger than 2.
The output that we are going to look at in the Regression case will be drawn from the real numbers (assumption).
In case of Classification, the output will come from a discrete space.
The data comes from some kind of joint probability distribution. We don’t know this joint distribution apriori, but the assumption that we are going to make is, there is an underlying data distribution which is like a joint distribution over the inputs and the outputs = Pr(X,Y) and that it is fixed and we are going to be given samples (X1, Y1) ….(Xn, Yn) which is drawn from this joint probability distribution.
So, basically the joint probability distribution Pr(X,Y) will be the main dataset and (X1, Y1) ….(Xn, Yn) which will be out training data and possibly for validation data as well if required.
The goal is, given such a training data learn a function f(x) which goes from p-dimensional space Rp to the real line R.
The p-dimensional space Rp essentially corresponds to a point in the input space and the real line R corresponds to the output space.
So, basically the function f is going to take any input that is given to it and produce a number.
The function f could take any forms, may be a straight line or anything else.
Now that we have defined the dataset and set our expectations, let’s understand how we could formulate the mathematical expression for Linear Regression.
So, let’s take one example of f where we will be predicting Y-hat = f(x) which means that given an x we are going to predict a label Y-hat.
Here the x1 , x2, …xp are essentially the coordinates of X. So, X here comprises of (x1 , x2, …xp ) where x1 could be age, x2 could be income etc.
Basically each coordinate corresponds the different attribute which describes the data.
So, using the equation above we can rewrite f(x) as:
The alternate way of writing this is to declare that x0 = 1 and then rewrite the function as below:
This is essentially what we do in case of Linear Regression.
f(x) in vector notation can be written as below:
If you look at it from a training data point of view, we can write it in form of matrix notation:
In the matrix X, each row corresponds to an input x and our β will be a vector of coefficients from β0 to βp. So, we could consider the output to be essentially Xβ.
Therefore, the overall error is going to be
This is the estimate of the expected prediction error based on the training data that has been given to us.
In order to get the value of β, we can take the derivative of the above which will be as below:
Now that we have understood how the Linear Regression expression is formulated, let’s take the same dataset and formulate the mathematical expression for K – Nearest Neighbour .
K-NN (Nearest Neighbour )
The KNN classifier is given by the formula:
Let’s try to understand this formula using an example
So, in this case we basically pick the 3 nearest neighbors as k = 3 and find the corresponding y’s. In this case we will pick y1, y3 and y6 and will take the average if these 3 points and report the value of this function as X is the average of these 3 points. This is called the K-NN regressor.
So, basically we will just take the average of the outputs of y1, y3 and y6 and report that as a value at X.
There are different ways in which you could define the function f(x). But we should remember that unless we make an assumption about the form of f we really cannot do any generalization.
So, in the first example, our assumption for function f was lines as we needed to talk about lines, but in the example 2 we are talking about different assumption for the function f which need not necessarily be lines.
In the first example the function was about line and the 2nd example, it’s about an average function which is a local average.
Now let’s see some performance measure for this function:
1. Loss Function
The loss function is a performance measure with respect to Regression which will compare the true output Y with the predicted output f(x)
Our goal here is to find an f(x) such that the loss function is minimized.
2. Squared Error
One of the most common loss functions that is used generally is called the Squared error.
3. Expected Prediction Error
The performance measure that we will be using for Regression is the Expected Prediction Error of function f
So, the Expected Prediction Error is the Expected value of (Y – f(x))2
So now we need to understand the distribution w.r.t which we are taking this Expectation. Whenever we talk about the Expectation of a random variable we will have to talk about the underlying distribution.
Coming back to the question, what is the distribution w.r.t which we are taking the Expectation.
It’s basically the joint probability distribution Pr(X,Y) which we had discussed above.
We can write the Expectation using the joint distribution as follows:
4. Conditional Probability
Now, we can do some kind of modification to the above and discuss about some real time situations by considering conditional distribution.
The above conditional probability says that, there is some chance with which we could choose a data point X and having chosen the data point X what is the probability of seeing a particular output value Y.
So, now the question is why are we looking at probabilities again ?
Scenarios in Conditional Probability
Basically probabilities help us to model a variety of different scenarios like below.
If there is noise in the measurement. In this case what will be the probability of Y given X. So to understand this, let’s take an example where we are measuring the temperature at 3 in the morning. Basically there will be some natural variations in temperature due to weather everytime we measure the temperature at 3 in the morning. That variation is modeled using the conditional probability.May be there will be some set of temperature that will be probable and some set of temperature which will be not. For example, measuring the temperature at 3 AM, then 40C is not a probable temperature which will have lower probabilities and something with lower temperature will have higher probabilities.
The second scenario that this conditional probability helps us to consider is our ignorance about the whole system. In the above example we just considered the time of the day at 3AM, however there might be other factors that could also be taken to consideration while forming the data. So, these factors which we do not know or haven’t considered will or might appear as noise. So it might not be important whether we take the temperature at 3AM, rather it might be more important as to where in the building are we taking the temperature measurement. May be we are measuring the temperature on the outer part of the AC where the things will be comparatively more warmer or may be measuring inside an AC room where temperature will be lot lower. So even though we say that we are measuring at 3AM there could be many other factors for natural variations which we have not modeled. So, one way of arguing about this is, natural variations could be due to the factors that we don’t know anything about which is a valid argument. If we look at it like this that there is no real noise and all the uncertainty in the data arises due to our lack of knowledge, however this is more of a philosophical question.
So, to better frame it, things that are measurable and we don’t measure is called the lack of knowledge and things that are unmeasurable are called as noise.
5. Expected Prediction Error based on Conditional Probability
So, the above 2 scenarios introduces probability to our system. So based on the conditional probability equation for the Regression, let’s go back and write the Expected Prediction Error.
The above equation is the Expectation over X of the Expectation of Y given X. So this above equation has the same quantity (Y – f(x))2 as we saw in the earlier equation for EPE, the only difference now is we are conditioning on the value of X.
So what this equation basically says is : Given the value of X, what will the error (Y – f(x))2 be. The uncertainty here is over the value of Y.
This basically says that I will give you X, tell me what Y is. So we will look at the error only after conditioning on X and it will look only on the variation of Y.
This is what the inner part of the Expectation says which is EY|X([Y – f(X)]2 | X)
And the outer Expectation EX gives us the variation w.r.t X.
6. Minimum Prediction Error
Now that we know the Expected Prediction Error for the Regression, let’s try to find the minimum of the prediction Error by conditioning on a specific value for X
Note: We are not making any assumption about the function F. It can be any function.
So now we want to look at each and every value of X and then pick an F such that for every value of X it makes the best possible prediction.
So what does that mean for the below equation:
For a specific value of x, f(x) will give an output such that the inner expectation EY|X([Y – f(X)]2 | X) is minimized.
Let’s write down the above statement in the form of an equation:
Ⅎ(x) = argminc EY|X ([Y – C]2 | X = x)
Given an X, Ⅎ(x) has to be a specific number. Let’s say someone with an age of 25 and an income of 15,000 walks into the shop, we can give only one O/P which is whether he will buy a computer or not buy a computer which is just a single value. Since we have fixed the input description, we are going to get only one output corresponding to the input description.
So that output will be called C which is the output we are going to give for f(x).
what is the value we should choose for C ?
It should be such that the error which is [Y – C]2 is as small as possible. So, we are minimizing over different possible values of C that we could assign for f(x) such that the C that we are picking will give us the smallest error.
So, argmin means first minimize w.r.t C and take that value of C which achieves the minimal for EY|X ([Y – C]2 | X = x). If there are multiple values that give the minimal then you can pick any one.
The equation Ⅎ(x) = argminc EY|X ([Y – C]2 | X = x) is essentially called conditioning on a point. So, instead of conditioning on a random variable X, we are conditioning on a specific point where X = x as shown in the formula above.
What happens now is for every possible input x, we will basically find the corresponding C and say that f(x) is equal to that particular C.
Note: We should note here that we have not made any assumption for the function f. It might be anything. So this approach can potentially become a huge failure as we might as well end up overfitting the data.
However let’s just move ahead taking the current outcomes and establish some general principles.
Conditional Expectation or Regression Function
Let’s dig a bit more and see how we could simplify a bit more.
Now that we have decided to say that the minimizer argminc EY|X ([Y – C]2 | X = x) is the one whose value we assign to f(x), then what is the value of C that will minimize this minimizer. For that we have to look at [Y – C]2 and assign a single value for C.To understand this let’s take the below example.
Suppose, we give an input as X, and make a measurement Y1 and again we make a measurement by giving the same input X and make a measurement Y2 and again we give the same input X and make the third measurement Y3
So, we have 3 measurements Y1 , Y2 , Y3 for same input measurements X.
Now, we are asking to give us a prediction for what will be the output, given the input X
Prediction will be the average of the 3 Y’s.
The reason we are taking average here is, essentially the quantity that will minimize the squared error is the average.
So, based on this theory we will end up writing the following:
So, this is essentially the Conditional Expectation or Regression Function.
Now, there are a few problems/limitation to the above Regression Function expression.
- First is that we do not know what the distribution w.r.t which we are taking the Expectation. So, to estimate the distribution we have to look at the data (X1, Y1) …. (Xn, Yn) and estimate the data by taking averages.
In our case we will pick all the training data points which will have [X = x] and the find the corresponding Y and take the average. We can write this as following:
There is still a limitation to the above expression.
If we look at the expression, you could see how many samples will we get from the same input x that we have specified above in the expression.
What this means is, we are trying to make an estimation by taking the averages, but what happens when we don’t have enough measurements ?
Our average is going to be bad !!
We will not be able to make prediction for any data point which is not there in the input.
What this means is we are making an average at a particular point and if that point doesn’t exist in the training data then we are not going to return an estimate for it.
So, we need to address these limitations somehow. In order to do that we will basically relax the conditioning criteria.
So, instead of Conditioning on a point, we will condition on a region.
To understand this let’s go back to our previous expression where we were taking the average of all the data points for which xi was equal to x which has the limitations as we discussed in the limitations 1.
So, let’s change the expression so that it will take the average of all the data points which will belong to some region around xi which is essentially the neighbourhood as shown below with the inner marker.
So, inner circle above will correspond to the neighbourhood around xi on which we will be conditioning.
However there is one implicit assumption that we are making here which is the output of the function over the region is going to be a constant.
Let’s understand this using an example:
Let’s say there are multiple data points (x1, y1), (x2,y2)…(x8,y8) as shown below in the figure. These are out training data.
Now, given a query point denoted by +, we want to know what is the output value for it.
Let’s say, we pick 3 nearest neighbours of + which is denoted by the square brackets in the graph below and then will take the average of the points for the 3 neighbours which will be (x4,y4), (x5,y5) and (x6, y6) in our case. Like this we will make assumptions for other data points as well.
Now let’s take another query point beside the + denoted by dot and see the possible output. If we notice, the nearest neighbours remain the same. So, if the neighbours are the same then the average of their 3 corresponding points will also be same which means it remains constant.
As a matter of fact for a certain region around those 3 points the output will always remain constant.
As you can see above in the figure, for all those data points for which the above 3 are the nearest neighbours, the output is going to be a constant.
This was the assumption that we were making for output which is going to be consistent over a region so that we can write an expectation over the region.
So, if you notice, the below expressions are the same which is our nearest neighbour classifier.
So, to summarize we can say that, conditioning on a data point to relaxing that to conditioning on a region gives us Nearest Neighbour Classifier.
Below are the points that show this as a very powerful classifier:
- As k increases the estimate becomes more and more stable. For small changes of input data, the classifier doesn’t change tremendously.
As the number of data points is very large and the number of points that we could look at the neighbourhood also becomes larger and larger, provided the no of data points have to grow at a faster rate than the size of the neighbourhood, then we could show that the actual prediction we make using the below expression
and it becomes closer or approaches the true prediction which we are interested in as shown below:
So, there are a few exceptions/caveats here that we need to point out.
- We were assuming in the above statement that N goes to infinity. This is not a very correct statement to make as N rarely goes to infinity as coming with large datasets is pretty difficult or rare. So this means we cannot really have a classifier that gives us the right output.
- The 2nd problem is, as the dimensionality becomes larger, the data becomes sparse. So, in this case if we are looking at k-neighbours in a 100 dimensional space, the volume covered by these k-neighbours will be very large because they are very sparse space. Therefore it is not a good assumption to make that the output is a constant over this large area.
So if the dimension of the input is large then using a KNN is not a good idea.
For both the KNN and Linear regression one thing that is common is, in both the cases we start off with the same formulation where we want to minimize the Expected Prediction Error
And we name different assumptions. One assumption leads us to Linear regression where we assume that the data is going to be Globally Linear and the other assumption that we made was where the data is going to be Locally Constant which leads us to KNN.
Check this post for the code implementation of Linear Regression model. If you want to explore more about Linear Regression and how to work with Loss Function and Gradient descent then check this post out.
We might encounter overfitting scenarios while modeling data using Linear Regression models. Check this post on how to avoid the overfitting condition using Regularization with Ridge and Lasso Regression.