In this post, we cover the basics of probability distributions. They are a foundation for building machine learning models, and they are also necessary for performing Exploratory Data Analysis (EDA).
There are many types of distribution that we could encounter. The ones covered in this post are as follows:
- Gaussian/Normal Distribution
- Log Normal Distribution
So, let’s dive right into it.
As the name suggests, a population is the complete list of observations about the subject we are concerned with. For example:
- all the people in a town who own a dog,
- the weights of all the people in a city (from which we can compute the average weight).
When we talk about a population, there is one more term we need to discuss: the population mean.
What is the population mean in our case? Suppose that in the 2nd example there are 100k people in the whole city.
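The population mean would be:
$$\mu = \frac{1}{N}\sum_{i=1}^{N} x_i = \frac{x_1 + x_2 + \dots + x_N}{N}, \qquad N = 100{,}000$$
where $x_1, \dots, x_N$ are the weights of the individuals in the whole population of 100k people.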
When the population is huge, it is usually not feasible to measure every member and calculate the weight (or whatever quantity is required). So instead we take a sample from the whole population.
A sample is a subset of the whole population that is actually analyzed. Because it is usually expensive to perform analysis on the whole population, most statistical analysis is done on a sample of the population, and conclusions are drawn from it.
For example, suppose that in the 2nd example we take a sample of 10k people out of the 100k.
Similarly, the sample mean is defined as:
$$\bar{w} = \frac{1}{n}\sum_{i=1}^{n} w_i = \frac{w_1 + w_2 + \dots + w_n}{n}, \qquad n = 10{,}000$$
where $w_1, \dots, w_n$ are the weights of the individuals in the sample of 10k people.
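To make the distinction concrete, here is a minimal sketch that compares a population mean with a sample mean. The weights are synthetic (drawn with NumPy purely for illustration), and the names `population` and `sample`, as well as the chosen mean and spread, are assumptions rather than values from a real dataset.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed so the run is repeatable

# Hypothetical population: weights (kg) of 100,000 people, made up for illustration.
population = rng.normal(loc=70.0, scale=12.0, size=100_000)

# Draw a simple random sample of 10,000 people without replacement.
sample = rng.choice(population, size=10_000, replace=False)

population_mean = population.mean()  # sum of all weights / 100,000
sample_mean = sample.mean()          # sum of sampled weights / 10,000

print(f"population mean: {population_mean:.2f}")
print(f"sample mean:     {sample_mean:.2f}")
```

With a random sample of this size, the two means should land close to each other, which is why working from a sample is usually good enough.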
Now that we have discussed what a population and a sample are, let’s try to understand what random variables are.
Random Variable distribution
Formally, a random variable is a variable whose values depend on the outcome of a random phenomenon.
Random variables are divided into the following types:
- Discrete Random Variable: a random variable that can take only a countable number of distinct values, such as 1, 2, 3, 4, 5 and so on. Examples: the number of children in a family, the number of students enrolled in the fall semester, the number of cars bought from a dealer.
- Continuous Random Variable: a random variable that can take any value within a range, so it has infinitely many possible values. Examples: weight, height, the time taken to travel from office to home.
Now that we have understood what random variables are, let’s look at the different types of distributions.
Statistics has many distributions, but the most important one from a machine learning standpoint is the Gaussian/Normal distribution.
Gaussian/Normal Distribution
Before understanding the Gaussian/Normal distribution, let’s quickly brush up on the formulas and definitions of Mean, Variance and Standard deviation.
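As a quick refresher, the standard definitions are sketched below. They use the population form (dividing by $N$); that choice, and the symbols $\mu$, $\sigma^2$, and $\sigma$, are assumptions made here. When working from a sample, the variance is commonly divided by $n - 1$ instead.

```latex
\mu = \frac{1}{N}\sum_{i=1}^{N} x_i
\qquad
\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2
\qquad
\sigma = \sqrt{\sigma^2}
```

The mean locates the center of the data, the variance is the average squared distance from the mean, and the standard deviation is the square root of the variance, which puts it back in the same units as the data.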
Coming back to the normal distribution: we say that a continuous random variable X follows a Gaussian distribution with some mean and standard deviation.
When that is the case, the distribution follows a bell curve, as shown below.
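For reference, the bell curve is described by the Gaussian probability density function. Here $\mu$ denotes the mean and $\sigma$ the standard deviation; the notation $X \sim \mathcal{N}(\mu, \sigma^2)$ is the usual convention and is assumed here.

```latex
X \sim \mathcal{N}(\mu, \sigma^2)
\quad\Longrightarrow\quad
f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)
```

The density peaks at $x = \mu$ and falls off symmetrically on both sides, with $\sigma$ controlling how quickly it falls.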
Here are a few things to know and remember about normal distributions.
The empirical rule tells us what percentage of the data falls within a given number of standard deviations from the mean.
As per the empirical rule, about 68% of the data falls within 1 standard deviation of the mean, about 95% falls within 2 standard deviations, and about 99.7% falls within 3 standard deviations.
What we can infer from this is that the standard deviation is a measure of the spread of a distribution.
A distribution with a smaller standard deviation would indicate that the data is tightly clustered around the mean whereas a large standard deviation signifies that the data is spread out around the mean.
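The empirical rule can be checked numerically. The sketch below uses `NormalDist` from Python's standard `statistics` module; the mean of 70 and standard deviation of 12 are arbitrary illustrative values, and `share_within` is a hypothetical helper name.

```python
from statistics import NormalDist

# Illustrative parameters (assumed): mean 70, standard deviation 12.
dist = NormalDist(mu=70.0, sigma=12.0)

def share_within(k: float) -> float:
    """Probability that a value lies within k standard deviations of the mean."""
    upper = dist.cdf(dist.mean + k * dist.stdev)
    lower = dist.cdf(dist.mean - k * dist.stdev)
    return upper - lower

for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {share_within(k):.4f}")
```

The shares come out at roughly 0.6827, 0.9545 and 0.9973, and they do not depend on which mean and standard deviation are chosen.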
There are a few properties of the Gaussian/Normal distribution that we should keep in mind:
- The mean, median, and mode of a Gaussian/Normal distribution are equal.
- The total area under the curve is always 1.
- The curve is symmetric around the mean.
- The distribution is divided equally on both sides of the mean: the left half of the curve mirrors the right half.
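These properties can be verified with a short sketch, again using `NormalDist` from the standard library. The parameters and the `offset` value are assumptions chosen only for illustration.

```python
from statistics import NormalDist

dist = NormalDist(mu=70.0, sigma=12.0)  # illustrative parameters (assumed)

# Mean, median and mode coincide.
print(dist.mean, dist.median, dist.mode)

# Symmetry: the density is the same at equal distances on either side of the mean.
offset = 5.0
left = dist.pdf(dist.mean - offset)
right = dist.pdf(dist.mean + offset)
print(abs(left - right) < 1e-12)

# Half of the total area (probability) lies on each side of the mean.
area_left_of_mean = dist.cdf(dist.mean)
print(area_left_of_mean)
```

Since the cumulative probability up to the mean is one half and the total area is 1, the other half must lie to the right of the mean.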
Now that we have discussed the Gaussian/Normal distribution, let’s move on to the Log Normal distribution and its properties.
Log Normal Distribution
A lognormal distribution is a probability distribution in which the logarithm of the random variable is normally distributed.
Let’s write this in the form of an equation:
$$X \sim \text{Lognormal}(\mu, \sigma^2) \iff \ln(X) \sim \mathcal{N}(\mu, \sigma^2)$$
That is, if the log of the random variable is normally distributed, then the random variable itself follows a lognormal distribution.
The curve of a lognormal distribution is not a symmetric bell curve; it is right-skewed, with a long tail to the right.
For example, consider the salaries of the whole population of a town. If you plot them, they roughly follow a lognormal distribution: most of the population is concentrated in the main body of the curve, denoted by the red lines in the graph below.
The lines towards the right side of the graph represent the people with very high salaries; the count of such people is very low.
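The salary example can be sketched with synthetic data. The snippet below draws values with NumPy's `Generator.lognormal`; the parameters (log-mean 10.5, log-standard-deviation 0.5) and the variable names are assumptions, not real salary figures.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed so the run is repeatable

# Hypothetical salaries: ln(salary) is normal with mean 10.5 and std 0.5 (assumed values).
salaries = rng.lognormal(mean=10.5, sigma=0.5, size=50_000)

# Right skew: a few very high salaries pull the mean above the median.
print(f"mean salary:   {salaries.mean():.0f}")
print(f"median salary: {np.median(salaries):.0f}")

# Taking the log recovers an approximately normal variable.
log_salaries = np.log(salaries)
print(f"mean of log:   {log_salaries.mean():.3f}")
print(f"std of log:    {log_salaries.std():.3f}")
```

The mean sitting above the median is the numerical signature of the long right tail, while the log-transformed values have a mean and standard deviation close to the parameters used to generate them.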
Now that we have understood these 2 distributions, the fundamental question to ask ourselves is why we are learning them. How is this going to help us solve a problem or build a model?
For that let’s take a scenario and understand the holistic picture behind all this!!
For example, suppose David has categorized his monthly expenses into Travel, Food, Apartment Rent, Utilities, and Vehicle Installment.
Based on the expenses let’s find out what kind of distribution the different expenses follow. Suppose the Food category follows a Gaussian/Normal distribution.
To ensure uniformity, we convert the Food category column to a standard normal distribution, where the mean is 0 and the standard deviation is 1. This conversion is known as standard scaling (the operation performed by tools such as scikit-learn’s StandardScaler).
In other words, we rescale the distribution so that its mean is 0 and its standard deviation is 1.
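A minimal sketch of this rescaling: each value has the column mean subtracted and is then divided by the column standard deviation, $z = (x - \mu) / \sigma$. The Food amounts below are made-up numbers, and the variable names are assumptions.

```python
import numpy as np

# Hypothetical monthly Food spending (made-up numbers for illustration).
food = np.array([310.0, 295.0, 330.0, 305.0, 320.0, 290.0])

mu = food.mean()    # mean of the column
sigma = food.std()  # population standard deviation (ddof=0)

# z-score: subtract the mean, then divide by the standard deviation.
food_scaled = (food - mu) / sigma

print(np.round(food_scaled, 3))
print(f"mean={food_scaled.mean():.3f}, std={food_scaled.std():.3f}")
```

Using the population standard deviation (`ddof=0`) matches what scikit-learn's `StandardScaler` documents as its behavior; a sample standard deviation would give slightly different values on small columns.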
Next, we check the Travel category; let’s assume that it follows a lognormal distribution. To ensure uniformity, we first take the log of every value in this column. The log-transformed values will now follow a Gaussian/normal distribution.
After converting the data to a Gaussian distribution, we can convert it further into a standard normal distribution using the formula below:
$$z = \frac{x - \mu}{\sigma}$$
where $\mu$ and $\sigma$ are the mean and standard deviation of the log-transformed column. Applying this formula is exactly what standard scaling does.
Now the values in the Food category and the Travel category will be in the same scale.
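Putting the two cases together, here is a hedged sketch of the whole procedure on synthetic data. The `standardize` helper, the generated expense values, and their parameters are all assumptions for illustration; real expense data would replace the generated arrays.

```python
import numpy as np

def standardize(values: np.ndarray) -> np.ndarray:
    """Return z-scores with mean 0 and standard deviation 1."""
    return (values - values.mean()) / values.std()

rng = np.random.default_rng(seed=1)  # fixed seed so the run is repeatable

# Synthetic expenses (assumed): Food roughly normal, Travel roughly lognormal.
food = rng.normal(loc=300.0, scale=25.0, size=1_000)
travel = rng.lognormal(mean=4.0, sigma=0.8, size=1_000)

# Food is already roughly normal, so it is standardized directly.
food_scaled = standardize(food)
# Travel: take the log first so the values are roughly normal, then standardize.
travel_scaled = standardize(np.log(travel))

for name, column in (("food", food_scaled), ("travel", travel_scaled)):
    print(f"{name}: mean={column.mean():.3f}, std={column.std():.3f}")
```

Both columns end up with mean 0 and standard deviation 1 even though the raw values were on very different scales and had different shapes.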
If we feed this uniformly scaled data to our model, the results will often be more accurate than they would have been without standard scaling, particularly for models that are sensitive to the scale of their inputs.