Linear Regression

The figure above details the various steps in data preprocessing as well as linear regression. We have covered the data preprocessing steps in detail here. In this post we will go over the lower portion of the figure, which is linear regression.


Before getting into the details of linear regression and its types, let’s first understand what happens in regression.
In regression, a single dependent (outcome) variable is predicted with the help of one or more independent variables.

Below are a few examples of regression:

  1. The likelihood of a thunderstorm, based on several climate measurements
  2. The risk of diabetes, based on the family health history and various other factors
  3. The chances of admission to a college, based on GRE and TOEFL scores

The most common choices for predicting such outcomes are linear and logistic regression. These are preferred mostly because of their transparency and interpretability. They can also be extrapolated easily to values that are not present in the training data.

In the case of linear regression, the main objective is to draw a straight line through the observations such that the sum of the squared vertical distances between the line and the observations is minimized. This is called the line of best fit.

So, in case of Linear regression, it is assumed that the relationship between the feature(s) and the Continuous dependent variable follows a straight line.

The formula for a straight line is y = mx + c

The variables y and x in the formula are the ones whose relationship will be determined.

Both the variables are named as below:

  1. y : Continuous Dependent variable
  2. x : Independent variable, that is, the feature(s)

The above equation is the slope-intercept form of a line. The continuous dependent variable is denoted by y, c denotes the intercept, m denotes the slope, and x is the independent variable.

So, given observed values of the independent variable x and the dependent variable y, the regression model computes the values of c and m that minimize the squared difference between the actual value of the dependent variable and the predicted value.
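
To make the computation of m and c concrete, here is a minimal sketch of the closed-form least squares solution for a single feature. It assumes the usual ordinary least squares criterion (the sum of squared differences); the function name fit_line and the toy data are hypothetical and only for illustration.

```python
import numpy as np

def fit_line(x, y):
    """Return (m, c) that minimize the sum of squared residuals for y ~ m*x + c."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = x.mean(), y.mean()
    # Slope: co-variation of x and y divided by the variation of x
    m = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    # Intercept: the fitted line passes through the point of means
    c = y_mean - m * x_mean
    return m, c

# Hypothetical toy data lying exactly on y = 2x + 1
x_toy = [0, 1, 2, 3, 4]
y_toy = [1, 3, 5, 7, 9]
m_toy, c_toy = fit_line(x_toy, y_toy)
print(m_toy, c_toy)
```

Because the toy points lie exactly on a line, the fit recovers that line's slope and intercept. With noisy data the same formulas return the line of best fit.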

There are two types of Linear regression that we will discuss in this post:

1. Simple Linear Regression

2. Multiple Linear Regression

So, now let’s try to understand Simple Linear Regression in a bit more detail.

1. Simple Linear Regression

Simple linear regression models are used to define the relationship between one independent variable (feature) and the continuous outcome variable, that is, the dependent variable.

The formula that we use is the slope-intercept form y = mx + c, where y denotes the predicted value of the dependent variable, c denotes the intercept, m denotes the slope, and x is the independent variable (feature).

So how do we use this equation to make predictions?

Given the training values of x and y, the simple linear regression model computes the values of m and c that minimize the squared difference between the predicted y value and the actual y value. A prediction for a new x is then obtained by plugging it into the equation.

Let’s take an example to better understand this concept:

Suppose that height was the only determinant of body weight. If we were to plot body weight (the dependent or ‘outcome’ variable) as a function of height (the independent or ‘predictor’ variable), we might see a very linear relationship.

We could also describe this relationship with the equation for a line, Y = c + m(x), where ‘c’ is the Y-intercept and ‘m’ is the slope of the line. We could use the equation to predict weight if we knew an individual’s height.

In this example, if an individual was 70 inches tall, we would predict his weight to be:

Weight = 80 + 2 x (70) = 220 lbs.
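
The same arithmetic can be written as a tiny function. This is only a sketch of the example above: the function name predict_weight is hypothetical, and the intercept of 80 and slope of 2 are the illustrative values from the equation, not fitted values.

```python
def predict_weight(height_in, intercept=80.0, slope=2.0):
    """Predict weight in lbs from height in inches using Y = c + m(x)."""
    return intercept + slope * height_in

weight_70 = predict_weight(70)
print(weight_70)  # 220.0
```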

In this simple linear regression, we are examining the impact of one independent variable on the outcome.

If height were the only determinant of body weight, we would expect that the points for individual subjects would lie close to the line.

However, if there were other factors (independent variables) that influenced body weight besides height (e.g., age, calorie intake, and exercise level), we might expect that the points for individual subjects would be more loosely scattered around the line.
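
A small simulation can illustrate the difference between tightly and loosely scattered points. This is a sketch under stated assumptions: the heights, the noise levels and the variable names are all made up, and the unexplained factors are modelled as random noise around the line Weight = 80 + 2 x Height.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical heights in inches
heights = rng.uniform(60, 76, size=200)

# Case 1: height is almost the only determinant (small noise)
weights_tight = 80 + 2 * heights + rng.normal(0, 1, size=200)

# Case 2: other factors add a lot of unexplained variation (large noise)
weights_loose = 80 + 2 * heights + rng.normal(0, 30, size=200)

# The correlation with height is a rough measure of how closely the points hug the line
r_tight = np.corrcoef(heights, weights_tight)[0, 1]
r_loose = np.corrcoef(heights, weights_loose)[0, 1]
print("r with small noise:", round(r_tight, 3))
print("r with large noise:", round(r_loose, 3))
```

The more the outcome depends on factors outside the model, the weaker the correlation with the single predictor becomes.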

Exercise 1: Preparing data for a Linear Regression Model

Before preparing the data for the linear regression model, let us first look at the different features in the data and understand them.

The data that we will be using is a weather-related dataset which consists of hourly weather measurements. Below are the features of the data:

Temperature_c: The temperature in Celsius

Humidity: The proportion of humidity

Wind_Speed_kmh: The wind speed in kilometers per hour

Wind_Bearing_Degrees: The wind direction in degrees clockwise from due north

Visibility_km: The visibility in kilometers

Pressure_millibars: The atmospheric pressure as measured in millibars

Rain: rain = 1, snow = 0

Description: Warm, normal, or cold

1. Import the dataset

import pandas as pd
weather_df = pd.read_csv("weather.csv")

2. Explore the data to figure out which column is numerical and which is categorical data
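
One possible way to do this is to look at the column dtypes. The sketch below uses a hypothetical three-row stand-in for weather.csv (the values are made up, only the column names come from the feature list above), so that it runs without the file. On the real data you would call the same methods on weather_df.

```python
import pandas as pd

# Hypothetical stand-in for weather.csv; the values are invented for illustration
sample_df = pd.DataFrame({
    "Temperature_c": [9.5, 21.1, -3.2],
    "Humidity": [0.89, 0.45, 0.93],
    "Wind_Speed_kmh": [14.1, 7.3, 20.0],
    "Wind_Bearing_Degrees": [251, 130, 310],
    "Visibility_km": [15.8, 9.9, 4.2],
    "Pressure_millibars": [1015.1, 1012.3, 1020.6],
    "Rain": [1, 1, 0],
    "Description": ["Normal", "Warm", "Cold"],
})

# dtypes shows the storage type of every column
print(sample_df.dtypes)

# Separate numerical columns from everything else
numeric_cols = sample_df.select_dtypes(include="number").columns.tolist()
categorical_cols = sample_df.select_dtypes(exclude="number").columns.tolist()
print("Numerical:", numeric_cols)
print("Categorical:", categorical_cols)
```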

3. The Description column is the only categorical column. Let’s see how many categories it has.

# nunique() counts the distinct categories (pd.value_counts is deprecated in recent pandas versions)
levels = weather_df['Description'].nunique()
print("There are {} categories in the Description column".format(levels))
O/P: There are 3 categories in the Description column

4. As you saw above, the description column has 3 categories. To handle categorical variables we need to encode the value of the column. Before implementing the encoding, let’s understand the encoding technique that we are going to use.

Multi-class, categorical variables must be converted into dummy variables via a process termed “dummy coding”, as we discussed in the data preprocessing post.

Dummy coding a multi-class, categorical variable creates n-1 new binary features, which correspond to the levels within the categorical variable.

For example, a multi-class, categorical variable with three levels will create two binary features. After the multi-class, categorical feature has been dummy coded, the original feature must be dropped.

df_dummies = pd.get_dummies(weather_df, drop_first=True)
print(df_dummies.shape)
O/P : (10000, 9)

5. The original DataFrame, weather_df, consisted of eight columns, one of which (that is, Description) was a multi-class, categorical variable with three levels. Using the dummy encoding technique, we transformed this feature into n-1 (that is, 2) separate dummy variables and dropped the original feature, Description. Thus, df_dummies should now contain one more column than weather_df (that is, 9 columns).

print("There are {} columns in df_dummies".format(df_dummies.shape[1]))
O/P : There are 9 columns in df_dummies
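
The n-1 rule can be checked on a toy frame. The sketch below uses made-up data (the names toy_df and toy_dummies are hypothetical) with one numeric column and a three-level Description column, so we expect 1 + 2 = 3 columns after dummy coding.

```python
import pandas as pd

# Hypothetical toy data: one numeric column and one three-level categorical column
toy_df = pd.DataFrame({
    "Temperature_c": [9.5, 21.1, -3.2, 15.0],
    "Description": ["Normal", "Warm", "Cold", "Normal"],
})

# drop_first=True keeps n-1 dummy columns and removes the original column
toy_dummies = pd.get_dummies(toy_df, drop_first=True)
print(toy_dummies.columns.tolist())
print(toy_dummies.shape)
```

The dropped level (here Cold, the first level in sorted order) becomes the baseline: a row where both remaining dummy columns are 0 or False belongs to that level.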

6. Before splitting the data into testing and training sets, it is a good practice to shuffle the rows of the dataset to avoid any ordering effect.

from sklearn.utils import shuffle
df_shuffled = shuffle(df_dummies,random_state=42)
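
To see what shuffle does, here is a small sketch on made-up data (ordered_df and its columns are hypothetical). It checks that the rows are reordered as whole rows, so no row is lost and the values within a row stay together.

```python
import pandas as pd
from sklearn.utils import shuffle

# Hypothetical ordered data; value is always 1.5 times row_id
ordered_df = pd.DataFrame({"row_id": range(10), "value": [i * 1.5 for i in range(10)]})

shuffled_df = shuffle(ordered_df, random_state=42)
print(shuffled_df.index.tolist())

# The same rows are still there, only in a different order
same_rows = sorted(shuffled_df["row_id"]) == list(ordered_df["row_id"])
# Each row keeps its own values together
pairs_intact = bool((shuffled_df["value"] == shuffled_df["row_id"] * 1.5).all())
print(same_rows, pairs_intact)
```

Fixing random_state makes the shuffle reproducible from run to run.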

7. Linear regression is used for predicting a continuous outcome. Thus, in this exercise, we will pretend that the continuous variable Temperature_c (the temperature in Celsius) is the dependent variable, and that we are preparing data to fit a linear regression model.

Split the data into X (the independent variables) and y (the continuous outcome variable)

DV = 'Temperature_c'
X = df_shuffled.drop(DV,axis =1)
y = df_shuffled[DV]

8. Splitting X and y into testing and training data

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
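
As a quick sanity check of what test_size=0.33 means, the sketch below splits 100 made-up rows (X_demo and y_demo are hypothetical arrays) and inspects the resulting sizes.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 rows, 2 features
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.arange(100)

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42
)

# Roughly one third of the rows end up in the test set
print(Xd_train.shape, Xd_test.shape)

# No row appears in both sets
overlap = set(yd_train.tolist()) & set(yd_test.tolist())
print(len(overlap))
```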

Exercise 2: Fitting the simple linear regression model and determining the Intercept and Coefficient

9. Instantiating the Linear Regression model

from sklearn.linear_model import LinearRegression
model = LinearRegression()

10. Fitting the model to the Humidity column in training data

model.fit(X_train[['Humidity']], y_train)

11. Extracting the value of Intercept

intercept = model.intercept_
O/P: 34.499407598825144

12. Extracting the value of Coefficient

coefficient = model.coef_
O/P: array([-30.69215601])

13. Printing the formula for predicting temperature

print("Temperature = {0:0.2f} + ({1:0.2f} x Humidity)".format(intercept,coefficient[0]))
O/P: Temperature = 34.50 + (-30.69 x Humidity)
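
To check our understanding of intercept_ and coef_, here is a sketch on synthetic data. The humidity values are made up, and the temperatures are generated without noise from the rounded formula above, so the fitted model should recover the same intercept and coefficient. The names demo_X, demo_y and demo_model are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical, noise-free data generated from Temperature = 34.50 + (-30.69 x Humidity)
humidity = np.linspace(0.1, 1.0, 10)
demo_X = pd.DataFrame({"Humidity": humidity})
demo_y = 34.50 - 30.69 * humidity

demo_model = LinearRegression().fit(demo_X, demo_y)
print(demo_model.intercept_, demo_model.coef_)

# A prediction is just intercept + coefficient * feature
manual = demo_model.intercept_ + demo_model.coef_[0] * demo_X["Humidity"]
matches = np.allclose(manual, demo_model.predict(demo_X))
print(matches)
```

On real, noisy data the recovered values will not match any "true" formula exactly, but the prediction is still computed in this way.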

Exercise 3: Generating Predictions and Evaluating the performance of the Simple Linear Regression Model

Now that we have extracted the intercept and coefficient of our simple linear regression model, let’s go ahead and generate predictions and evaluate how the model performs on the unseen test data.

14. Generate the predictions on the test data

predictions = model.predict(X_test[['Humidity']])

In order to evaluate the model performance, we can plot the predicted values against the actual values in a scatter plot.

The scatter plot displays the relation between the predicted value and the actual value. A perfect regression model will display a straight, diagonal line between the predicted and actual value.

We can quantify this relationship using the Pearson r correlation coefficient and display it in the plot’s title.

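import matplotlib.pyplot as plt
from scipy.stats import pearsonr
plt.scatter(y_test, predictions)  # actual values on the x-axis, predicted values on the y-axis
plt.xlabel("Y Test (Actual Values)")
plt.ylabel("Predicted Values")
plt.title("Predicted vs Actual value (r = {0:0.2f})".format(pearsonr(y_test, predictions)[0]))
plt.show()
O/P of pearsonr(y_test, predictions): (0.6239569499210785, 0.0)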

With a Pearson r value of 0.62, there is a moderate, positive, linear correlation between the predicted and actual values. A perfect model would have all points on the plot in a straight line and an r value of 1.0.
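
For intuition, Pearson r can also be computed by hand. The sketch below uses made-up actual and predicted values (not the weather data) and a hypothetical helper named pearson_r, then compares the result with scipy.

```python
import numpy as np
from scipy.stats import pearsonr

def pearson_r(a, b):
    """Pearson correlation: co-variation of a and b, scaled by their individual variation."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a_centered = a - a.mean()
    b_centered = b - b.mean()
    return np.sum(a_centered * b_centered) / np.sqrt(
        np.sum(a_centered ** 2) * np.sum(b_centered ** 2)
    )

# Hypothetical actual and predicted values
actual = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [1.2, 1.9, 3.4, 3.8, 5.1]

r_manual = pearson_r(actual, predicted)
r_scipy = pearsonr(actual, predicted)[0]
print(r_manual, r_scipy)

# Predictions identical to the actual values give r = 1
r_perfect = pearson_r(actual, actual)
print(r_perfect)
```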

Hypothesis testing: Test for normality of a distribution

Looking at a histogram, we can often judge whether a distribution is normal or not. However, sometimes this is difficult to determine by eye. A distribution may appear normal when it is not, and sometimes a distribution may appear not normal when it is normal.

To test the Normality of the distribution we will use the Shapiro-Wilk test.

The null hypothesis for the Shapiro-Wilk test is that the data is normally distributed.

Thus, a p-value < 0.05 leads us to reject the null hypothesis and indicates a non-normal distribution, while a p-value > 0.05 means we cannot reject normality.
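
Here is a small sketch of this decision rule on simulated data. The two samples are made up (one drawn from a normal distribution, one from a clearly skewed exponential distribution), and the 0.05 threshold is the one stated above.

```python
import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(0)

# Hypothetical samples: one normal, one strongly right-skewed
normal_sample = rng.normal(loc=0.0, scale=1.0, size=500)
skewed_sample = rng.exponential(scale=1.0, size=500)

results = {}
for name, sample in (("normal", normal_sample), ("skewed", skewed_sample)):
    w_stat, p_value = shapiro(sample)
    results[name] = p_value
    verdict = "reject normality" if p_value < 0.05 else "cannot reject normality"
    print(f"{name}: W = {w_stat:.3f}, p = {p_value:.3g} -> {verdict}")
```

Note that a p-value above 0.05 does not prove that the data is normal. It only means that the test found no strong evidence against normality.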

In this use case we will use the Shapiro-Wilk test to create a programmatic title which tells the reader whether the distribution is normal or not.

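import seaborn as sns
from scipy.stats import shapiro
residuals = y_test - predictions
sns.histplot(residuals, bins=50, kde=True)  # histplot replaces distplot, which is deprecated in recent seaborn
plt.title("Histogram of residuals (Shapiro W p-value = {0:0.3f})".format(shapiro(residuals)[1]))
plt.show()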

The histogram shows us that the residuals are negatively skewed, and the Shapiro W p-value in the title tells us that the distribution is not normal. Therefore we reject the null hypothesis that the residuals are normally distributed. This gives us further evidence that our model has room for improvement.

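15. Let’s compute the mean absolute error, mean squared error, root mean squared error and R-squared, and put them in a DataFrame.

from sklearn import metrics
import numpy as np
mse = metrics.mean_squared_error(y_test, predictions)  # assumed completion from here on: the 'Values' column name and r2_score are not from the source
metrics_df = pd.DataFrame({'Metrics':['MAE','MSE','RMSE','R-Squared'],
                           'Values':[metrics.mean_absolute_error(y_test, predictions), mse, np.sqrt(mse), metrics.r2_score(y_test, predictions)]}).round(3)
print(metrics_df)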

Mean absolute error (MAE) is the average absolute difference between the predicted values and the actual values.

Mean squared error (MSE) is the average of the squared differences between the predicted and actual values.

Root mean squared error (RMSE) is the square root of the MSE.

R-squared tells us the proportion of variance in the dependent variable that can be explained by the model.
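
The four definitions above translate directly into a few lines of NumPy. This is a sketch on made-up numbers, not the weather data; the helper name regression_metrics is hypothetical, and R-squared is computed as 1 minus the ratio of the residual sum of squares to the total sum of squares.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Compute MAE, MSE, RMSE and R-squared from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred
    mae = np.mean(np.abs(errors))          # average absolute difference
    mse = np.mean(errors ** 2)             # average squared difference
    rmse = np.sqrt(mse)                    # square root of the MSE
    ss_res = np.sum(errors ** 2)           # variation left unexplained by the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation in y
    r2 = 1 - ss_res / ss_tot
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "R-Squared": r2}

# Hypothetical actual and predicted values
toy_metrics = regression_metrics([3, -0.5, 2, 7], [2.5, 0.0, 2, 8])
for metric_name, metric_value in toy_metrics.items():
    print(metric_name, round(metric_value, 3))
```

Since RMSE is in the same units as the dependent variable, it is often the easiest of the error metrics to interpret.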

Thus, in this simple linear regression model, humidity explained only 38.9% of the variance in temperature. Additionally, our predictions were within ±6.052 degrees Celsius of the actual values.

