**Introduction**

In this post, we will cover the basics of probability distributions. These basics are a foundation for creating machine learning models going forward, and they are also necessary for performing Exploratory Data Analysis (EDA).

There are various types of distribution that we could encounter. The ones we will cover in this post are as follows:

- Gaussian/Normal Distribution
- Log Normal Distribution

So, let’s dive right into it.

**Population**

As the name suggests, the population is the complete set of observations about the subject we are concerned with.

E.g.,

- All the people in a town who own a dog,
- The weights of all the people in a city.

Now, when we talk about the **Population**, there is one more term associated with it that we need to discuss here: the **Population mean**.

What will the **Population mean** be in our case? Suppose that for the 2nd example we have 100k people in the whole city.

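The **Population mean** would be:

$$\mu = \frac{1}{N}\sum_{i=1}^{N} x_i = \frac{x_1 + x_2 + \dots + x_N}{N}, \qquad N = 100\text{k}$$

Where $x_1, \dots, x_N$ represent the weights of the individuals in the whole population of 100k people.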

Usually, when the population is huge, it is not practical to measure every member and calculate the weight or whatever quantity is required. So, what we do instead is take a **Sample** from the whole **population**.

__Sample distribution__

A sample is a subset of the whole population that is actually analyzed. As it is usually expensive to perform analysis on the whole population, most statistical analysis is done on a sample of the population, and conclusions about the population are drawn from it.

Ex. Suppose that in the 2nd example we take a sample of 10k people out of the 100k.

Similarly, the **sample mean** is defined as:

$$\bar{w} = \frac{1}{n}\sum_{i=1}^{n} w_i, \qquad n = 10\text{k}$$

Where $w_1, \dots, w_n$ represent the weights of the individuals in the **sample** of 10k people.
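The relationship between the population mean and the sample mean can be sketched in a few lines of Python. This is only an illustration: the weights are synthetic, and the normal shape, the mean of 70 kg and the spread of 12 kg are assumptions made up for the example, not data from a real city.

```python
import random
import statistics

random.seed(42)  # fixed seed so the run is repeatable

# Hypothetical population: weights (kg) of 100,000 people.
# The normal shape, mean 70 and spread 12 are made-up illustration values.
population = [random.gauss(70, 12) for _ in range(100_000)]

# Draw a simple random sample of 10,000 people without replacement.
sample = random.sample(population, 10_000)

population_mean = statistics.fmean(population)  # plays the role of mu
sample_mean = statistics.fmean(sample)          # plays the role of w-bar

print(f"population mean: {population_mean:.2f}")
print(f"sample mean:     {sample_mean:.2f}")
print(f"difference:      {abs(population_mean - sample_mean):.3f}")
```

With a random sample this large, the two means typically land very close to each other, which is why analysing a sample is usually an acceptable substitute for analysing the whole population.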

Now that we have discussed what a **Population** and a **sample** are, let’s try to understand what **random variables** are.

__Random Variable distribution__

Formally, a **Random Variable** is a variable whose values depend on the outcome of a random phenomenon.

It is divided into the following types:

- **Discrete Random Variable**: a variable that may take only a countable number of distinct values, such as 1, 2, 3, 4, 5 and so on. Ex. the number of children in a family, the number of students enrolled in the fall semester, the number of cars bought from a dealer, etc.
- **Continuous Random Variable**: a variable that can take any value within a range, so it has infinitely many possible values. Ex. weight, height, or the time taken to travel from office to home.

Now that we have understood what **Random Variables** are, let’s look at the different types of **Distributions**.
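Before moving on, here is a small sketch of the difference between the two kinds of random variable. The ranges used (0 to 4 children, 20 to 60 minutes of commute) are arbitrary values chosen only for illustration.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

# Discrete: number of children in a family, always a whole-number count.
# The 0-4 range is an arbitrary choice for illustration.
children = [random.randint(0, 4) for _ in range(10)]

# Continuous: commute time in minutes, any real value within a range.
# The 20-60 minute range is likewise made up.
commute_minutes = [random.uniform(20, 60) for _ in range(10)]

print(children)
print([round(t, 2) for t in commute_minutes])
```

The first list contains only whole numbers, while the second can contain any value between its bounds.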

There are a lot of distributions in statistics, but the most important one from a Machine Learning standpoint is the **Gaussian/Normal Distribution**.

__Gaussian/ Normal Distribution__

Before understanding the Gaussian/Normal distribution, let’s quickly brush up on the formulas and definitions of **Mean, Variance and Standard deviation.**
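For reference, these are the standard textbook definitions for a population of $N$ values $x_1, \dots, x_N$. The symbols $\mu$, $\sigma^2$ and $\sigma$ are the usual conventions and are assumed here, since the original formulas are not shown in the text.

$$\text{Mean:}\quad \mu = \frac{1}{N}\sum_{i=1}^{N} x_i$$

$$\text{Variance:}\quad \sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2$$

$$\text{Standard deviation:}\quad \sigma = \sqrt{\sigma^2}$$

When the variance is estimated from a sample rather than the whole population, the sum is usually divided by $n - 1$ instead of $n$.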

Now, coming back to the **Normal Distribution**: we say that a **continuous random variable x** follows a **Gaussian distribution** with some **mean** $\mu$ and **standard deviation** $\sigma$, written $x \sim \mathcal{N}(\mu, \sigma^2)$.

When this holds, the distribution follows the familiar bell-shaped curve.

Below are a few things that we need to know and remember about normal distributions.

We have the empirical rule, which tells us what percentage of the data falls within a given number of **Standard Deviations** from the **mean**.

As per the empirical rule, about **68%** of the data falls within **1 standard deviation** from the **mean**. If you go a bit further, about **95%** of the data falls within **2 standard deviations** from the **mean**, and about **99.7%** of the data falls within **3 standard deviations** from the **mean**.

What we can infer from the above is that the **standard deviation** is a measure of the spread of a **distribution**.

A **distribution** with a small standard deviation indicates that the data is tightly clustered around the **mean**, whereas a large standard deviation signifies that the data is spread out around the **mean**.
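The empirical rule can be checked numerically. The sketch below uses `NormalDist` from Python's standard `statistics` module; the mean of 70 and standard deviation of 12 are arbitrary assumptions, and any other values give the same percentages.

```python
from statistics import NormalDist

# Hypothetical parameters: any mean and standard deviation give the same shares.
mu, sigma = 70.0, 12.0
dist = NormalDist(mu, sigma)

coverage = {}
for k in (1, 2, 3):
    # Probability mass between mu - k*sigma and mu + k*sigma.
    coverage[k] = dist.cdf(mu + k * sigma) - dist.cdf(mu - k * sigma)
    print(f"within {k} standard deviation(s): {coverage[k]:.4f}")
```

The three values come out at roughly 0.6827, 0.9545 and 0.9973, which are the 68%, 95% and 99.7% of the empirical rule.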

There are a few properties of the **Gaussian/Normal distribution** that we should keep in mind:

- A Gaussian/Normal distribution has equal **mean, median, mode.**
- The total area under the curve is always 1.
- The curve is symmetric around the **mean.**
- The distribution is equally divided on both sides of the curve. That means the left half of the curve is the mirror image of the right half.
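The properties listed above can be verified with a short sketch. It uses a standard normal distribution (mean 0, standard deviation 1) purely for convenience, and the offset `d`, the integration range and the step size are arbitrary choices.

```python
from statistics import NormalDist

dist = NormalDist(mu=0.0, sigma=1.0)  # standard normal, chosen for simplicity

# Mean, median and mode coincide for a normal distribution.
print(dist.mean, dist.median, dist.mode)

# Symmetry: the density at mean + d equals the density at mean - d.
d = 1.3  # arbitrary offset
right = dist.pdf(dist.mean + d)
left = dist.pdf(dist.mean - d)
print(right, left)

# Total area: approximate the integral of the density with a Riemann sum
# over [-8, 8]; the tails beyond that range are negligible.
step = 0.001
area = sum(dist.pdf(-8 + i * step) * step for i in range(16_000))
print(f"area under the curve: {area:.6f}")
```

The Riemann sum is only a numerical approximation of the area, but it is accurate enough to confirm that the total is 1.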

Now that we have discussed the Gaussian/Normal distribution, let’s move on to the **Log Normal Distribution** and its properties.

__Log Normal Distribution__

A **Lognormal distribution** is a probability distribution in which the **logarithm** of the **random variable** is **normally distributed**.

Let’s write this in the form of an equation:

$$\ln(x) \sim \mathcal{N}(\mu, \sigma^2) \;\Longrightarrow\; x \sim \text{Lognormal}(\mu, \sigma^2)$$

As per the above equation, if the **log** of the random variable x is normally distributed, then x follows a Log Normal Distribution. Note that x must be positive for its log to exist.

The curve of a Lognormal distribution is not a symmetric bell curve; rather, it is **right skewed**, with a long tail towards larger values.

**Ex.** Consider the salaries of the whole population of a town. If they follow a log normal distribution, then you will notice that most of the population falls around the mean, denoted by the red lines in the graph.

If you notice the lines towards the right side of the graph, these are the people who have a very high salary; however, the count of such people is very low.
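The salary example can be simulated as a rough sketch. The salaries below are synthetic: the parameters 10.5 and 0.5 for the mean and standard deviation of the log-salaries are assumptions picked only for illustration.

```python
import math
import random
import statistics

random.seed(7)  # fixed seed so the run is repeatable

# Hypothetical salaries: the log of each salary is normal with
# mean 10.5 and standard deviation 0.5 (made-up values).
mu, sigma = 10.5, 0.5
salaries = [random.lognormvariate(mu, sigma) for _ in range(50_000)]

# Right skew: a few very high salaries pull the mean above the median.
mean_salary = statistics.fmean(salaries)
median_salary = statistics.median(salaries)
print(f"mean salary:   {mean_salary:,.0f}")
print(f"median salary: {median_salary:,.0f}")

# Taking the log of every value recovers the normal parameters.
log_salaries = [math.log(s) for s in salaries]
log_mean = statistics.fmean(log_salaries)
log_std = statistics.pstdev(log_salaries)
print(f"mean of log:   {log_mean:.3f}")
print(f"stdev of log:  {log_std:.3f}")
```

The mean salary ends up above the median because of the long right tail, while the log-salaries have a mean and standard deviation close to the chosen 10.5 and 0.5.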

So, now that we have understood the above 2 distributions, the fundamental question that we need to ask ourselves is why we are learning these distributions. How is this going to help us solve a problem or build a model?

**For that let’s take a scenario and understand the holistic picture behind all this!!**

**Example**

Suppose David has categorized his monthly expenses, which include spending on Travel, Food, Apartment Rent, Utilities and Vehicle Installment.

Based on the expenses, let’s find out what kind of distribution each category follows. Suppose the Food category follows a **Gaussian/Normal distribution**.

So, to ensure uniformity, we convert the Food category column to a **Standard Normal Distribution**, where the **mean** is **0** and the **standard deviation** is **1**. This conversion is commonly called **standard scaling**.

We are basically rescaling the distribution to a **mean** of **0** and a **standard deviation** of **1**.

Now we check the Travel category, and let’s suppose that it follows a **Log Normal Distribution**. To ensure uniformity, we first convert this column to a **Gaussian/Normal distribution** by calculating the log of every value. If you check the transformed column now, it will follow a **Gaussian/Normal distribution**.

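After converting the data to a **Gaussian distribution**, we can further convert it into a **Standard Normal Distribution** using the formula below:

$$z = \frac{x - \mu}{\sigma}$$

Here x is a value in the column, and $\mu$ and $\sigma$ are the mean and standard deviation of that column.

Applying this formula to every value in the column is exactly what **Standard Scaling** does.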

Now the values in the **Food** category and the **Travel** category will be on the same scale.
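The whole procedure for David's expenses can be sketched as follows. Everything here is hypothetical: the 24 months of data are randomly generated, the parameters are made up, and `standardize` is a helper written for this example rather than a library function.

```python
import math
import random
import statistics

random.seed(1)  # fixed seed so the run is repeatable

# Hypothetical monthly expenses for 24 months (all numbers are made up).
food = [random.gauss(400, 50) for _ in range(24)]              # roughly normal
travel = [random.lognormvariate(5.0, 0.8) for _ in range(24)]  # roughly lognormal


def standardize(values):
    """Rescale values to mean 0 and standard deviation 1 (z-scores)."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation
    return [(v - mean) / std for v in values]


# Food is already roughly normal, so it is standardized directly.
food_scaled = standardize(food)

# Travel is log-transformed first, then standardized.
travel_scaled = standardize([math.log(v) for v in travel])

for name, column in (("food", food_scaled), ("travel", travel_scaled)):
    print(name, round(statistics.fmean(column), 6), round(statistics.pstdev(column), 6))
```

Both columns end up with a mean of 0 and a standard deviation of 1 (up to floating-point rounding). In practice, scikit-learn's `StandardScaler` applies the same rescaling to each column.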

Now, if we feed this uniformly scaled data to our model, the results will often be better than they would have been without **Standard Scaling**, particularly for models that are sensitive to the scale of their input features.