Simple Linear Regression


What is Simple Linear Regression

Simple Linear Regression describes the relationship between a single feature and the outcome variable.

This can be specified using the formula y = α + βx, which is similar to the slope-intercept form, where y is the value of the dependent variable, α is the intercept, β is the slope, and x is the value of the independent variable.

Given observed values of the independent variable x and the dependent variable y, the regression model estimates α and β such that the sum of the squared differences between the predicted y values and the actual y values is minimal.

By looking at the difference between the predicted y value and the actual y value, we can tell whether our model is performing well or needs fine-tuning.
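
To make the formula concrete, here is a minimal sketch of how α and β can be estimated with the closed-form least squares expressions. The toy x and y values are made up for illustration (they follow y = 1 + 2x exactly) and are not taken from the dataset used later in this post.

```python
import numpy as np

# Hypothetical toy data: y is exactly 1 + 2x, so the expected estimates are known
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.0, 5.0, 7.0, 9.0, 11.0])

# Closed-form least squares estimates for y = alpha + beta * x
beta = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
alpha = y.mean() - beta * x.mean()

# Residuals: actual y minus predicted y
residuals = y - (alpha + beta * x)

print("alpha:", alpha, "beta:", beta)
print("sum of squared residuals:", np.sum(residuals ** 2))
```

Because the toy data lie exactly on a line, the residuals are all zero. With real data they will not be, and their size is what tells us how well the line fits.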

You can also refer to this post to understand the math behind Linear Regression models in greater detail.

Supervised Learning

In Supervised learning, the model is trained using labeled data (data in which the target value, or dependent variable, is already known).

The model learns the patterns in the feature matrix (the independent variables) and how they map to the target value. When we feed in a new dataset, the model uses what it has learned to predict the target variable. This process is known as predictive modeling.

Supervised Learning has the following categories:

  1. Classification models
  2. Regression models
  3. Time series analysis models

On a lighter note, the curve below shows what Classification models do.

As I am a big fan of both Marvel and DC, I have classified my favourite superheroes in the curve below.

Now let’s understand this more scientifically.

Classification methods deal with categorical target values. They help predict which group or class a data point belongs to. When we are predicting between 2 classes, it is called binary classification.

An example is predicting whether it will rain or not; here the prediction will be either yes or no.

Similarly, if we are predicting more than 2 target classes, it is known as multi-class classification.

What are Regression models

Regression deals with numerical target values.

A regression model helps predict the numerical value of a target variable based on what it has learned from the training dataset.

Last but not least:

Time series analysis models

On a lighter note, the video below shows a time-series analysis of all the things that happened around the COVID-19 outbreak. This video was published by the popular YouTube channel Know Your Meme.

Now let’s understand this more scientifically.

Time series models deal with data which is distributed chronologically.

Examples of time series data are stock and share market prices. Depending on the use case, time series analysis can be either a regression or a classification method.

Now that we have discussed the different types of Supervised learning methods at a high level, let’s dive into one of these methods, called Linear regression.

Linear regression is a statistical data analysis technique which can be used to determine the relationship between a dependent variable and one or more independent variables.

We could divide Linear Regression into two types:

  1. Simple Linear Regression
  2. Multiple Linear Regression.
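
As a rough illustration of the difference between the two types, the sketch below fits scikit-learn's LinearRegression once with a single feature and once with two features. The data are synthetic and noise-free, and the names x1, x2, simple, and multiple are made up for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Hypothetical features and a noise-free target: y = 1 + 2*x1 + 0.5*x2
x1 = rng.uniform(0, 10, size=100)
x2 = rng.uniform(0, 10, size=100)
y = 1.0 + 2.0 * x1 + 0.5 * x2

# Simple linear regression: one feature, so one slope
simple = LinearRegression().fit(x1.reshape(-1, 1), y)

# Multiple linear regression: two features, so two slopes
multiple = LinearRegression().fit(np.column_stack([x1, x2]), y)

print("simple slopes:", simple.coef_)
print("multiple slopes:", multiple.coef_, "intercept:", multiple.intercept_)
```

The simple model has to absorb the effect of the missing feature into its intercept and residuals, while the multiple model recovers both slopes.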

When should we use the Simple Linear Regression Model?

The Simple Linear Regression model should generally be used when we are using one feature (independent variable) to predict the outcome (dependent variable).


Let’s take an example to understand simple linear regression in much more detail and then we will also implement it in code to see how accurate our results are.

  • Predicting the chance of getting admitted to graduate school based on the GRE Score.

About the Dataset

The dataset and the content described below have been taken from Kaggle for the better understanding of the readers. Please click the link to download the dataset.

The dataset version that has been used here is Admission_Predict_Ver1.1.csv


This dataset is created for prediction of Graduate Admissions from an Indian perspective.


The dataset contains several parameters which are considered important during the application for Master’s programs. The parameters included are:

  1. GRE Scores (out of 340)
  2. TOEFL Scores (out of 120)
  3. University Rating (out of 5)
  4. Statement of Purpose and Letter of Recommendation Strength (out of 5)
  5. Undergraduate GPA (out of 10)
  6. Research Experience (either 0 or 1)
  7. Chance of Admit (ranging from 0 to 1)


This dataset is inspired by the UCLA Graduate Dataset. The test scores and GPA are in the older format. The dataset is owned by Mohan S Acharya.


This dataset was built with the purpose of helping students in shortlisting universities with their profiles. The predicted output gives them a fair idea about their chances for a particular university.


Our goal here is to predict the “Chance of Admit” from the parameters provided in the dataset.

We will achieve this goal by using the Simple Linear Regression model, with the GRE score as the single feature.

Based on the data that we have, we will split our data into training and testing sets. The training set has the features and labels on which our model will be trained. The label here is the “Chance of Admit”. From a non-technical standpoint, the label is the output that we want, and the features are the parameters that drive us towards that output. Once our model is trained, we will run it on the test set to predict the output. Then we will compare the predicted results with the actual results to see how our model performed.

This whole process of training the model using features and known labels and later testing it to predict the output is called Supervised Learning, as discussed above.

1. Import the data

import pandas as pd
df = pd.read_csv("Admission_Predict_Ver1.1.csv")

2. Explore the data
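
A first look at the data usually involves a handful of standard pandas calls. The sketch below runs them on a tiny stand-in frame so that it works without the CSV file; the name df_sample and its values are made up for illustration, and in the walkthrough the same calls would be applied to df.

```python
import pandas as pd

# Hypothetical stand-in for the admissions data (made-up values)
df_sample = pd.DataFrame({
    "GRE Score": [330, 320, 310, 300],
    "TOEFL Score": [115, 110, 105, 100],
    "Chance of Admit ": [0.90, 0.80, 0.65, 0.50],
})

print(df_sample.head())       # first few rows
print(df_sample.shape)        # (number of rows, number of columns)
print(df_sample.dtypes)       # data type of each column
print(df_sample.describe())   # summary statistics for numeric columns

# Count of missing values per column
missing = df_sample.isnull().sum()
print(missing)
```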

3. Let’s format the column names by converting them to lowercase, replacing the spaces within names with “_”, and removing any “(” or “)”

# regex=False treats '(' and ')' as literal characters rather than regex syntax
df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_', regex=False).str.replace('(', '', regex=False).str.replace(')', '', regex=False)
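
To see what this renaming does, here is a small check on a few hypothetical column names chosen to include capitals, a trailing space, and parentheses. The names are illustrative only and are not claimed to match the dataset's actual headers.

```python
import pandas as pd

# Hypothetical column names for illustration
cols = pd.Index(["GRE Score", "Chance of Admit ", "CGPA (out of 10)"])

cleaned = (
    cols.str.strip()                        # drop leading/trailing whitespace
    .str.lower()                            # lowercase everything
    .str.replace(" ", "_", regex=False)     # spaces become underscores
    .str.replace("(", "", regex=False)      # remove literal "("
    .str.replace(")", "", regex=False)      # remove literal ")"
)

print(list(cleaned))
```

Note that str.strip() runs first, so a trailing space does not turn into a trailing underscore.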

Note that in our data set we don’t have any categorical values. Categorical variables hold a limited set of discrete classes, often more than two. While building a model, categorical variables should first be converted into dummy variables through the dummy encoding process, because most machine learning algorithms perform matrix operations on numeric values and cannot produce accurate results from raw categorical values. Please refer to the Data Cleaning post, where we discussed the dummy encoding technique in detail.

df_dummies = pd.get_dummies(df, drop_first=True)
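
Since the admissions data has no categorical columns, the effect of dummy encoding is easier to see on a toy frame. The sketch below uses a made-up "city" column; the frame toy and its values are hypothetical.

```python
import pandas as pd

# Hypothetical frame with one numeric and one categorical column
toy = pd.DataFrame({
    "gre_score": [320, 310, 330],
    "city": ["Pune", "Delhi", "Mumbai"],
})

# drop_first=True drops the first category (alphabetically) to avoid a redundant column
toy_dummies = pd.get_dummies(toy, drop_first=True)
print(toy_dummies.columns.tolist())

# A frame with only numeric columns keeps the same columns
numeric_only = pd.get_dummies(toy[["gre_score"]], drop_first=True)
print(numeric_only.columns.tolist())
```

With three cities, only two dummy columns are needed: a row with zeros in both is implicitly the dropped category.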

4. Let’s remove any possible order effects in the data by shuffling the rows before splitting the data into features (X) and the dependent variable (y).

from sklearn.utils import shuffle
df_shuffled = shuffle(df_dummies,random_state = 44)

5. Now that we have shuffled the data, let’s go ahead and split the data into features (X) and the dependent variable (y).

DV = 'chance_of_admit'
X = df_shuffled.drop(DV,axis=1)
y = df_shuffled[DV]

6. Now that we have split the data into features and the dependent variable, let’s go ahead and split the data into training and testing sets. The testing size that we are taking here is 33% of the total size of the dataset.

from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size = 0.33, random_state = 42)
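
It can be worth confirming how many rows end up in each set. The sketch below does this on a synthetic stand-in with 500 rows; the row count and the names X_demo and y_demo are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data with 500 rows
X_demo = pd.DataFrame({"gre_score": np.arange(500)})
y_demo = pd.Series(np.linspace(0, 1, 500), name="chance_of_admit")

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42
)

print("training rows:", len(X_tr))
print("testing rows:", len(X_te))

# Features and labels stay aligned: both halves share the same index
print(X_tr.index.equals(y_tr.index))
```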

7. Now the data is split into training and testing sets. Let’s go ahead and create the simple linear regression model, which we will then fit to determine the Intercept and Coefficient.

from sklearn.linear_model import LinearRegression
model = LinearRegression()

8. Now let’s fit the model to the “gre_score” column of the training data, which is the feature column, and extract the Intercept and Coefficient

model.fit(X_train[['gre_score']],y_train)

9. Extract the values of the intercept and the coefficient using the code below

intercept = model.intercept_
print("Intercept: ",intercept)
Intercept: -2.4379055229632742
coeff = model.coef_
print("Coefficient: ",coeff)
Coefficient: [0.00997539]

10. Now that we have the intercept and coefficient with us, let’s print out the whole formula which will be used for predicting the “chance_of_admit”

print('Chance of Admit = {0:0.2f} + ({1:0.2f} x gre_score)'.format(intercept,coeff[0]))
Chance of Admit = -2.44 + (0.01 x gre_score)
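
As a quick worked example, the formula can be applied by hand to a single applicant. The intercept and coefficient below are copied from the output printed in step 9; the GRE score of 320 is a hypothetical value chosen for illustration.

```python
# Values copied from the output printed in step 9
intercept = -2.4379055229632742
coeff = 0.00997539

gre_score = 320  # hypothetical applicant score

# Prediction using the full-precision intercept and coefficient
chance_full = intercept + coeff * gre_score

# Prediction using the two-decimal formula printed in step 10
chance_rounded = -2.44 + 0.01 * gre_score

print(round(chance_full, 3))
print(round(chance_rounded, 3))
```

The two results differ slightly because the printed formula rounds the coefficient to two decimals. The rounded formula is fine for reading the relationship, but predictions should use the full-precision values (or model.predict).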

11. Now that we have extracted the intercept and coefficient and framed the equation, let’s generate the prediction and also evaluate the model performance on unseen data

The code below generates the predictions on the test data

predictions = model.predict(X_test[['gre_score']])
print("Predictions: ",predictions)

12. Let’s compare the actual Chance of Admit vs the Predicted Chance of admit

df1 = pd.DataFrame({'Chance of Admit Actual value': y_test, 'Chance of Admit Predicted value': predictions})
print(df1.head())

13. Now that we have seen the comparison between the actual and the predicted values, let’s go ahead and generate the model performance metrics

One of the ways to evaluate the model performance is to use a scatter plot to examine the correlation between the actual and predicted values, which shows how the two are related. For an ideal regression model, the points would fall on a straight, diagonal line between the predicted and actual values.

We can quantify the relationship between the predicted and actual values using the Pearson r correlation coefficient. To learn more about this, please check the post <–link–>

Now let’s dive into the code and see how the correlation between the predicted and actual value looks.

import matplotlib.pyplot as plt
from scipy.stats import pearsonr
# Scatter the predicted values against the actual values
plt.scatter(y_test, predictions)
plt.xlabel('Y-test(Actual value)')
plt.ylabel('Predicted value')
plt.title('Predicted vs Actual values(r = {0:0.2f})'.format(pearsonr(y_test,predictions)[0]))
plt.show()
pearsonr(y_test,predictions)
(0.8180041678901111, 5.409068047135853e-41)

The Pearson r value for the above scatter plot is 0.82, which shows that there is a strong positive linear correlation between the predicted and actual values; this can also be seen from the red trend line. For a perfect model, all the points would fall on a straight line with a Pearson r value of 1.0
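
For readers who want to see what pearsonr computes, here is a small sketch that calculates r by hand and compares it with SciPy. The actual and predicted arrays are hypothetical values, not output from the model above.

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical actual and predicted values
actual = np.array([0.50, 0.60, 0.70, 0.80, 0.90])
predicted = np.array([0.55, 0.58, 0.72, 0.76, 0.88])

# Pearson r: covariance of the two series divided by the product of their spreads
a = actual - actual.mean()
b = predicted - predicted.mean()
r_manual = np.sum(a * b) / np.sqrt(np.sum(a ** 2) * np.sum(b ** 2))

r_scipy, p_value = pearsonr(actual, predicted)

print("manual r:", r_manual)
print("scipy r:", r_scipy)
```

Pearson r only measures how linear the relationship is. A model whose predictions are consistently too high can still have r close to 1, which is why error metrics are checked as well.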

import numpy
# Scatter the predicted values against the actual values
plt.scatter(y_test, predictions)
plt.xlabel('Y-test(Actual value)')
plt.ylabel('Predicted value')
plt.title('Predicted vs Actual values(r = {0:0.2f})'.format(pearsonr(y_test,predictions)[0]))
##calc the trendline
##here the 1 denotes the degree which means a straight line
z = numpy.polyfit(y_test, predictions, 1)
p = numpy.poly1d(z)
##draw the fitted trendline in red
plt.plot(y_test, p(y_test), color='red', linewidth=2)
print("value of p: ",p)
print('y = {0:0.6f}x +({1:0.6f})'.format(z[0],z[1]))
plt.show()

14. However, to evaluate the model better, we need to check whether its residuals are normally distributed. Let’s create a density plot to see this. For a detailed understanding of what normally distributed residuals mean, refer to the post

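import seaborn as sns
from scipy.stats import shapiro
# histplot with a density curve replaces the deprecated distplot
sns.histplot((y_test-predictions), bins=50, stat='density', kde=True)
plt.title('Residual distribution(Shapiro W p-value = {0:0.3f})'.format(shapiro(y_test-predictions)[1]))
plt.show()
shapiro(y_test-predictions)
(0.9585781693458557, 8.17107647890225e-05)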

In the above output, the histogram of the residuals is negatively skewed and the Shapiro W p-value is about 0.00008 (shown as 0.000 in the plot title). Since this is well below 0.05, we reject the hypothesis that the residuals are normally distributed. This is evidence that the model requires improvement.
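
To get a feel for how the Shapiro-Wilk test behaves, the sketch below runs it on two synthetic samples: one drawn from a normal distribution and one drawn from a clearly skewed (exponential) distribution. Both samples are generated for illustration and are unrelated to the model residuals above.

```python
import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(0)

# Hypothetical "residuals": one roughly normal sample, one strongly skewed sample
normal_resid = rng.normal(loc=0.0, scale=1.0, size=200)
skewed_resid = rng.exponential(scale=1.0, size=200)

w_normal, p_normal = shapiro(normal_resid)
w_skewed, p_skewed = shapiro(skewed_resid)

print("normal sample: W = {0:0.3f}, p = {1:0.3g}".format(w_normal, p_normal))
print("skewed sample: W = {0:0.3f}, p = {1:0.3g}".format(w_skewed, p_skewed))

# A p-value below 0.05 is commonly read as evidence against normality
print("skewed sample looks non-normal:", p_skewed < 0.05)
```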

15. Finally, let’s compute the mean absolute error, mean squared error, root mean squared error, and R-squared

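from sklearn import metrics
import numpy as np
# Assumed completion: the 'Values' column (and the use of r2_score for R-squared) is a guess at the missing part of this listing
mse = metrics.mean_squared_error(y_test, predictions)
metrics_df = pd.DataFrame({'Metrics':['MAE','MSE','RMSE','R-squared'],
                           'Values':[metrics.mean_absolute_error(y_test,predictions), mse, np.sqrt(mse), metrics.r2_score(y_test,predictions)]})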

1. Mean absolute error (MAE) is the average of the absolute differences between the predicted values (predictions) and the actual values (y_test).

2. Mean squared error (MSE) is the average of the squared differences between the predicted and actual values.

3. Root mean squared error (RMSE) is the square root of the MSE.

4. R-squared is a statistical measure of how close the data are to the fitted regression line. R-squared typically lies between 0 and 100% (on unseen data it can drop below 0 when the model fits worse than simply predicting the mean):

  • For instance, an R-squared value of 0% indicates that the model explains none of the variability of the response data around its mean. Similarly, 100% indicates that the model explains all the variability of the response data around its mean.
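
The four metrics described above can be computed by hand in a few lines, which makes the definitions easier to check. The sketch below does this with NumPy on hypothetical y_true and y_pred arrays and compares the results against scikit-learn's functions.

```python
import numpy as np
from sklearn import metrics

# Hypothetical actual and predicted values
y_true = np.array([0.50, 0.60, 0.70, 0.80, 0.90])
y_pred = np.array([0.55, 0.58, 0.72, 0.76, 0.88])

errors = y_true - y_pred

mae = np.mean(np.abs(errors))    # average absolute error
mse = np.mean(errors ** 2)       # average squared error
rmse = np.sqrt(mse)              # back in the units of y
# R-squared: 1 minus (unexplained variation / total variation around the mean)
r2 = 1 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print("MAE:", mae, "sklearn:", metrics.mean_absolute_error(y_true, y_pred))
print("MSE:", mse, "sklearn:", metrics.mean_squared_error(y_true, y_pred))
print("RMSE:", rmse)
print("R-squared:", r2, "sklearn:", metrics.r2_score(y_true, y_pred))
```

MAE and RMSE are in the same units as the target, so for a target between 0 and 1 they can be read directly as a typical prediction error.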


In conclusion, the R-squared value for the Simple Linear Regression was 66.8%, which means that “gre_score” explained 66.8% of the variance in “chance_of_admit”.

As a result, it is evident that our model still needs improvement. Also, the predictions were within a range of ±0.64 units.


Mohan S Acharya, Asfia Armaan, Aneeta S Antony: A Comparison of Regression Models for Prediction of Graduate Admissions, IEEE International Conference on Computational Intelligence in Data Science 2019