In Linear Regression, the Cost Function and Gradient Descent are fundamental concepts that play a crucial role in training a model. Let's understand both in detail and also implement them in code with a simple example.
Cost Function
The cost function is a very important metric when it comes to quantifying the error, or mismatch, between the model's predictions and the actual values. The choice of cost function depends on the type of problem you are trying to solve and the nature of the output. What does the nature of the output mean in this context?
1. In regression problems, where the goal is to predict continuous values, Mean Squared Error (MSE) is the most common cost function. The MSE measures the average squared difference between the predicted values and the actual values. A lower MSE indicates better performance.
[math]J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})^2 [/math]
2. In classification problems, where the goal is to predict discrete class labels, the cross-entropy loss or the hinge loss might be used. These loss functions penalize the model more when it makes incorrect predictions, with the magnitude of the penalty depending on the degree of error.
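As a quick illustration of that penalty behaviour, here is a minimal sketch of binary cross-entropy (the helper name and sample values are made up for this example): confidently wrong predictions incur a much larger loss than mostly correct ones.

```python
import numpy as np

# Hypothetical example: binary cross-entropy on a few predicted probabilities.
def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    # Clip probabilities to avoid log(0)
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

y_true = np.array([1, 0, 1, 1])
confident = np.array([0.9, 0.1, 0.8, 0.95])  # mostly correct, confident
wrong = np.array([0.2, 0.9, 0.3, 0.1])       # badly wrong predictions

print(binary_cross_entropy(y_true, confident))  # small loss
print(binary_cross_entropy(y_true, wrong))      # much larger loss
```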
Let’s come back to the main discussion now!
We will first break the above cost function equation down into smaller components and try to understand each one, and later use that understanding to derive the cost function from scratch. This might sound counter-intuitive at first, but it will start making sense, so stay with me as we go through it one step at a time!
Let’s re-write the equation for the cost function again:
[math]J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})^2 [/math]
Where:
- [math]J(\theta)[/math] is the cost function.
- [math]m[/math] is the number of training examples.
- [math]h_\theta(x^{(i)})[/math] is the predicted value for the $i$-th example.
- [math]y^{(i)}[/math] is the actual value for the $i$-th example.
- $θ$ represents the parameters of the linear regression model.
The goal during training is to find the values of $θ$ that minimize this cost function.
So far, we have understood that the above cost function is the MSE: we calculate the squared difference between the predicted and actual values and take their mean over the number of training examples ($m$). But what is that [math]\frac{1}{2}[/math] in the formula, and what is its significance?
Well, the fraction [math]\frac{1}{2}[/math] in the cost function is included for mathematical convenience, and it does not change the optimization problem. It is commonly used to simplify the calculations during the gradient descent optimization process.
When you take the derivative of the cost function with respect to the parameters ($θ_{j}$), the factor of 2 produced by differentiating the squared term cancels the [math]\frac{1}{2}[/math], leading to simpler expressions. If you want to learn the basics of calculus, you can go through my following blog post as well.
Specifically, when you take the derivative with respect to $θ_{j}$, you get:
[math]\frac{\partial J}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) \cdot x_j^{(i)}[/math]
The [math]\frac{1}{2}[/math] factor is introduced to cancel the 2 that arises when taking the derivative of the squared term in the Mean Squared Error (MSE) cost function. It does not affect the optimal values of $θ$ that minimize the cost function, but it simplifies the subsequent computations. This will make more sense when we go over the derivation of Gradient Descent in the subsequent section.
So, the presence of [math]\frac{1}{2}[/math] in the cost function is a convention to make the optimization calculations more convenient and does not change the underlying objective of finding parameters that minimize the prediction error.
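We can verify this numerically. The sketch below (with made-up data and parameter values) compares the analytic partial derivative [math]\frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) x^{(i)}[/math] against a finite-difference estimate of the [math]\frac{1}{2m}[/math]-scaled cost; the two agree, confirming that no stray factor of 2 survives.

```python
import numpy as np

# Illustrative check (data is made up): with the 1/(2m) factor, the
# analytic gradient of J carries no extra factor of 2.
np.random.seed(0)
m = 50
x = np.random.rand(m)
y = 3 * x + 1 + 0.1 * np.random.randn(m)
theta0, theta1 = 0.5, 0.5

def J(t0, t1):
    pred = t0 + t1 * x
    return (1 / (2 * m)) * np.sum((pred - y) ** 2)

# Analytic partial derivative with respect to theta1
pred = theta0 + theta1 * x
analytic = (1 / m) * np.sum((pred - y) * x)

# Central finite-difference approximation of the same partial derivative
h = 1e-6
numeric = (J(theta0, theta1 + h) - J(theta0, theta1 - h)) / (2 * h)

print(analytic, numeric)  # the two values agree closely
```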
Now, let’s derive the cost function from the ground up.
1. Hypothesis function: we consider the following hypothesis function for linear regression:
[math]h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \ldots + \theta_n x_n[/math]
Here, [math]x[/math] represents the feature vector [math][x_0, x_1, x_2, \ldots, x_n][/math], where [math]x_0 = 1[/math] is the bias term.
2. Difference between the predicted and actual values:
[math](h_\theta(x^{(i)}) - y^{(i)})[/math]
3. Square the difference between the predicted and actual values: we do this so that overestimations and underestimations of the error both count positively.
[math](h_\theta(x^{(i)}) - y^{(i)})^2[/math]
Piecing the above three steps together, we write the cost function [math]J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})^2 [/math], which is the average of the squared differences over all training examples. We sum the squared differences for each example and divide by twice the number of training examples ($2m$) for the mathematical convenience explained above.
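To make the pieces concrete, here is a tiny worked example with made-up numbers: two training examples with one feature, a bias column [math]x_0 = 1[/math] prepended, and a guessed theta. We form the hypothesis as a dot product and then evaluate the cost.

```python
import numpy as np

# Tiny worked example (values are made up): two training examples,
# one feature, bias column x_0 = 1 prepended.
X = np.array([[1.0, 1.0],
              [1.0, 2.0]])        # rows: [x_0, x_1]
y = np.array([[3.0], [5.0]])
theta = np.array([[1.0], [1.5]])  # [theta_0, theta_1]

# Step 1: hypothesis h_theta(x) = theta_0 + theta_1 * x_1
predictions = X.dot(theta)        # [[2.5], [4.0]]

# Steps 2-3 plus averaging: J = (1/2m) * sum of squared differences
m = len(y)
J = (1 / (2 * m)) * np.sum((predictions - y) ** 2)
print(J)  # ((2.5-3)^2 + (4-5)^2) / 4 = 1.25 / 4 = 0.3125
```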
Now that we have understood the cost function in detail, let's understand what gradient descent is and how it relates to the cost function.
Gradient Descent
Gradient descent is an optimization algorithm used to minimize the cost function by adjusting the model parameters ($θ$). It iteratively updates $θ$ in the direction of the steepest decrease in the cost function. The update formula for gradient descent is given by:
[math]\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) \cdot x_j^{(i)}[/math]
Where:
- $θ_{j}$ is the $j$-th parameter.
- $α$ is the learning rate, a hyperparameter that determines the size of the steps in the parameter space.
- $m$ is the number of training examples.
- [math]h_\theta(x^{(i)})[/math] is the predicted value for the $i$-th example.
- [math]y^{(i)}[/math] is the actual value for the $i$-th example.
- [math]x_j^{(i)}[/math] is the $j$-th feature of the $i$-th example.
This process is repeated until convergence, where the algorithm finds parameter values that result in a minimal cost function. In short, the cost function quantifies the error in the model’s predictions, and gradient descent is the optimization algorithm that adjusts the model parameters to minimize this error.
Let’s understand how gradient descent can be derived from the cost function:
1. Cost function:
[math]J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})^2 [/math]
2. Compute the partial derivative of the cost function with respect to $θ_{j}$:
[math]\frac{\partial J}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) \cdot x_j^{(i)}[/math]
3. Update $θ_{j}$ using the gradient descent update rule:
[math]\theta_j := \theta_j - \alpha \frac{\partial J}{\partial \theta_j}[/math]
Substituting the partial derivative, we get:
[math]\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) \cdot x_j^{(i)}[/math]
This update rule is applied simultaneously for all $θ_{j}$ until convergence is reached. The intuition behind this process is that we update each parameter in the direction that reduces the cost function, and the size of the update is determined by the learning rate ($α$). The sum term represents the average contribution of each training example to the update. The process is repeated iteratively until the algorithm converges to parameter values that minimize the cost function.
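To see the rule in action, here is one hand-traced update on made-up data (two examples, one feature plus bias), starting from theta = [0, 0]:

```python
import numpy as np

# One gradient-descent step (data is made up):
# theta_j := theta_j - alpha * (1/m) * sum((h(x_i) - y_i) * x_j_i)
X = np.array([[1.0, 1.0],   # each row: [x_0 = 1, x_1]
              [1.0, 2.0]])
y = np.array([[3.0], [5.0]])
theta = np.zeros((2, 1))
alpha = 0.1
m = len(y)

errors = X.dot(theta) - y             # [[-3], [-5]]
gradient = (1 / m) * X.T.dot(errors)  # [[-4], [-6.5]]
theta = theta - alpha * gradient      # [[0.4], [0.65]]
print(theta.ravel())
```

Both parameters move up from zero because the initial predictions undershoot the targets, so the gradient is negative and the update rule subtracts a negative quantity.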
Below is a visual representation of the above iterative process, showing the eventual convergence to the minimum of the cost function.
[Source: wikipedia]
Now that we have understood both concepts and their derivations, we will implement the code by generating random synthetic data.
1. Function to create random synthetic data. To keep things simple, we will only consider one independent variable with a sample size of 100. The assumption here is that we have already established the relation between the dependent (y) and independent (X) variables using the formula y = 3X + 4. Using this understanding, we create a function that generates 100 random values for X and y:

def generate_synthetic_data():
    np.random.seed(42)
    X = 2 * np.random.rand(100, 1)
    y = 4 + 3 * X + np.random.randn(100, 1)
    return X, y
2. We implement the below function to define the cost function:

def compute_cost(X, y, theta):
    m = len(y)
    predictions = X.dot(theta)
    cost = (1 / (2 * m)) * np.sum((predictions - y) ** 2)
    return cost
3. Function to implement gradient descent. Notice that we transpose the X matrix while calculating the gradient. After the bias column is added, X is a (100, 2) matrix and the errors form a (100, 1) column vector, so the product $X^T \cdot \text{errors}$ yields a (2, 1) gradient vector whose dimensions match theta. In general, whenever X has more than one column, the transpose is needed for the matrix dimensions to line up in the dot product.

def gradient_descent(X, y, theta, learning_rate, iterations):
    m = len(y)
    cost_history = []
    for _ in range(iterations):
        predictions = X.dot(theta)
        errors = predictions - y
        gradient = (1 / m) * X.T.dot(errors)
        theta = theta - learning_rate * gradient
        cost = compute_cost(X, y, theta)
        cost_history.append(cost)
    return theta, cost_history
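As a quick sanity check on the shapes involved (with placeholder random data): with the bias column, X is (100, 2) and the errors are (100, 1), so X.T.dot(errors) produces the (2, 1) gradient that matches theta.

```python
import numpy as np

# Shape check (illustrative data): X is (100, 2) after the bias column,
# theta is (2, 1), so errors are (100, 1) and X.T.dot(errors) is (2, 1).
X = np.c_[np.ones((100, 1)), 2 * np.random.rand(100, 1)]
theta = np.zeros((2, 1))
errors = X.dot(theta) - np.random.rand(100, 1)
gradient = (1 / 100) * X.T.dot(errors)
print(gradient.shape)  # (2, 1), matching theta
```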
4. Generate the synthetic data:
X, y = generate_synthetic_data()
5. Add a bias term to X (for theta_0)
X_b = np.c_[np.ones((100, 1)), X]
6. Initialize theta with zeros; its value will change with every iteration of gradient descent:

theta_initial = np.zeros((2, 1))
print(theta_initial)
7. Set the hyperparameters:

learning_rate = 0.01
iterations = 10000
8. Run gradient descent:
theta_optimal, cost_history = gradient_descent(X_b, y, theta_initial, learning_rate, iterations)
9. Print the optimal parameters:

print("Final Parameters (theta):")
print(f"Theta 0 (Intercept): {theta_optimal[0][0]}")
print(f"Theta 1 (Slope): {theta_optimal[1][0]}")
10. Plot the cost history:

plt.plot(cost_history)
plt.xlabel('Iterations')
plt.ylabel('Cost')
plt.title('Cost Function Convergence')
plt.show()
Piecing all the above code blocks together:
import numpy as np
import matplotlib.pyplot as plt

# Generate synthetic data following y = 4 + 3X + noise
def generate_synthetic_data():
    np.random.seed(42)
    X = 2 * np.random.rand(100, 1)
    y = 4 + 3 * X + np.random.randn(100, 1)
    return X, y

# Cost function: J(theta) = (1/2m) * sum of squared errors
def compute_cost(X, y, theta):
    m = len(y)
    predictions = X.dot(theta)
    cost = (1 / (2 * m)) * np.sum((predictions - y) ** 2)
    return cost

# Gradient descent: iteratively update theta and record the cost
def gradient_descent(X, y, theta, learning_rate, iterations):
    m = len(y)
    cost_history = []
    for _ in range(iterations):
        predictions = X.dot(theta)
        errors = predictions - y
        gradient = (1 / m) * X.T.dot(errors)
        theta = theta - learning_rate * gradient
        cost = compute_cost(X, y, theta)
        cost_history.append(cost)
    return theta, cost_history

X, y = generate_synthetic_data()
X_b = np.c_[np.ones((100, 1)), X]  # add bias column (x_0 = 1)
theta_initial = np.zeros((2, 1))
learning_rate = 0.01
iterations = 10000

theta_optimal, cost_history = gradient_descent(X_b, y, theta_initial, learning_rate, iterations)

print("Final Parameters (theta):")
print(f"Theta 0 (Intercept): {theta_optimal[0][0]}")
print(f"Theta 1 (Slope): {theta_optimal[1][0]}")

plt.plot(cost_history)
plt.xlabel('Iterations')
plt.ylabel('Cost')
plt.title('Cost Function Convergence')
plt.show()
Output:

Final Parameters (theta):
Theta 0 (Intercept): 4.215096157546533
Theta 1 (Slope): 2.770113386438666