In this post, we will discuss two important topics and how they help in Exploratory Data Analysis: the Probability Density Function and the Cumulative Distribution Function.
The distribution of a continuous random variable can be characterized by its Probability Density Function. We will look at this statement in more detail in the next section.
The Cumulative Distribution Function gives the cumulative probability: the probability that a random variable X with a given probability distribution takes a value less than or equal to x.
If this seems confusing, don’t worry. We will go through it in detail to understand the concept behind it.
Probability Density Function (PDF)
Before getting into the Probability Density Function, we need to understand a few basic concepts. Let’s see what those are:
- Discrete Distribution: A statistical distribution over discrete data, that is, values that can be counted, such as a number of users.
- Continuous Distribution: A statistical distribution over continuous data, that is, values that can fall anywhere in a range, such as a height.
Using these 2 types of distributions, we can loosely define a Probability Density Function as a function that describes how probability is assigned to the values of a random variable, which belongs to either a Discrete or a Continuous Distribution.
Let’s try to understand this concept properly with respect to both Discrete and Continuous data.
Discrete Distribution and Probability Mass Function (PMF)
The Probability Density Function is called the Probability Mass Function when we are dealing with Discrete data.
Let’s understand this by taking a couple of examples:
In the above example, we are tossing a coin, and the probability of getting Heads or Tails is 0.5 each.
The table above lists the probabilities of the outcomes of the Random Variable associated with tossing the coin. Here, the Result is the Random Variable.
So, the table above is the Probability Mass Function for the Result obtained from tossing the coin.
Similar to the 1st example, the result in this case is considered the Random Variable.
A couple of things that we need to remember about a Probability Mass Function:
- The probability of any value of the Result will always fall between 0 and 1.
- The probabilities of all values of the Result will always add up to 1. We can see this in both examples, where the probabilities of the Result sum to 1.
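The two rules above can be checked in a few lines of code. This is a minimal sketch, assuming we store the coin-toss PMF as a Python dictionary; the names `coin_pmf` and `is_valid_pmf` are made up for illustration.

```python
# PMF for a fair coin: each outcome maps to its probability.
coin_pmf = {"Heads": 0.5, "Tails": 0.5}

def is_valid_pmf(pmf, tol=1e-9):
    """Return True if every probability is in [0, 1] and they sum to 1."""
    in_range = all(0.0 <= p <= 1.0 for p in pmf.values())
    sums_to_one = abs(sum(pmf.values()) - 1.0) <= tol
    return in_range and sums_to_one

print(is_valid_pmf(coin_pmf))                      # the fair coin passes both checks
print(is_valid_pmf({"Heads": 0.7, "Tails": 0.5}))  # fails: the total is 1.2
```

The small tolerance `tol` is there because floating-point sums are not always exactly 1.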
The above 2 examples were straightforward, and the number of possible outcomes was small. But let’s take an example where the distribution is more complicated.
Let’s take the example of the number of users signing up for an event online in an hour.
To approach this problem, we must follow a couple of steps:
- Using a statistical distribution, we need to approximate the process of users signing up for the event in an hour.
- Then we use the Probability Mass Function of that statistical distribution.
Let’s see how we could do this:
The Probability Mass Function will assign probabilities to the different possible outcomes.
Suppose we have the following possible outcomes for users signing up in an hour:
- 0 users signing up in an hour
- 1 user signing up in an hour
- 2 users signing up in an hour
- 3 users signing up in an hour and so on.
Here, the number of users signing up for the event in an hour is discrete data, which means we cannot have a value like 3.5 users signing up or anything of that sort. That means we would use only discrete values to approximate the process of users signing up for the event.
So, the Probability Mass Function would assign probabilities only to the discrete values of users signing up, and would have 0 probability for any number between 2 and 3, like 2.9 or 2.5.
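To make the sign-up example concrete, here is a small sketch. The post does not say which statistical distribution to use, so the Poisson distribution here is an assumption (it is a common choice for counts per unit of time), and the average rate of 2 sign-ups per hour is a made-up number.

```python
import math

# Assumption: on average 2 users sign up per hour (hypothetical rate).
RATE_PER_HOUR = 2.0

def signup_pmf(k, rate=RATE_PER_HOUR):
    """Poisson PMF: probability of exactly k sign-ups in one hour."""
    # Negative or non-integer counts are impossible outcomes, so they get 0.
    if k < 0 or k != int(k):
        return 0.0
    k = int(k)
    return math.exp(-rate) * rate**k / math.factorial(k)

for k in [0, 1, 2, 3, 2.5]:
    print(k, round(signup_pmf(k), 4))
```

Whole-number counts get a positive probability, while an in-between value such as 2.5 gets exactly 0.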
That covers Discrete distributions and the Probability Mass Function. Now let’s discuss Continuous distributions and the corresponding Probability Density Function.
Continuous Distribution and Probability Density Function (PDF)
Below are a couple of differences between PDF and PMF.
- PDF is used with a Continuous distribution, whereas PMF is used with a Discrete distribution.
- In the case of a PDF, the probability of any single exact outcome is always zero.
Let’s take an example to understand this concept better:
Example: Let’s take the heights of the students in a class. Suppose we want to find the probability of a student’s height being 5’10’’.
Between 5’ and 6’ there are infinitely many possible heights, so the probability of a student being exactly 5’10’’ tall is 0. Intuitively, a height is never exactly 5’10’’: measured precisely enough, it will turn out to be something like 5’10.01’’ instead.
Therefore, we can conclude that the probability of someone having one fixed, exact height will always be 0.
Hence, it’s more reasonable to consider a Range of Outcomes instead of an exact value.
Let’s see a few examples to make this more concrete:
- What would be the probability of someone’s height being between 5’6’’ and 5’8’’?
- What would be the probability of someone’s height being less than 6’?
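Questions like these can be answered in code once we pick a distribution. The sketch below assumes heights follow a normal distribution with a mean of 68 inches (5’8’’) and a standard deviation of 3 inches; both numbers are illustrative and not taken from any data. It relies on the cumulative probability, which is covered in the Cumulative Distribution Function section later in this post.

```python
from statistics import NormalDist

# Assumption: heights in inches are normally distributed with
# mean 68 (5'8'') and standard deviation 3. Illustrative numbers only.
heights = NormalDist(mu=68, sigma=3)

def prob_between(dist, low, high):
    """P(low < X < high) as a difference of cumulative probabilities."""
    return dist.cdf(high) - dist.cdf(low)

p_66_to_68 = prob_between(heights, 66, 68)    # between 5'6'' and 5'8''
p_below_72 = heights.cdf(72)                  # less than 6'
p_exactly_70 = prob_between(heights, 70, 70)  # exactly 5'10'': zero-width range

print(round(p_66_to_68, 4), round(p_below_72, 4), p_exactly_70)
```

The ranges get a positive probability, while the zero-width "range" for exactly 5’10’’ gets 0.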
Area Under the Curve (AUC)
The curve below is a continuous distribution curve, with the x-axis being the height of the students. It approximates the way heights are distributed among the students of the classroom, so it is the Probability Density Function of the continuous distribution we are using to model the random variable height.
Higher regions of the curve correspond to heights that are more likely (higher probability density), and lower regions correspond to heights that are less likely.
Let’s say we want to find the probability of a student’s height being less than 6’. It will be the area under the curve to the left of 6’, shown in the shaded graph below. P(height<6’)
Now let’s say we want to find the probability of a student’s height being more than 6’3’’. It will be the area under the curve to the right of 6’3’’, as shown below. P(height>6’3’’)
Now let’s say we want to find the probability of a student’s height being more than 5’6’’ and less than 6’6’’, that is, P(5’6’’<height<6’6’’). It will be the area under the curve between 5’6’’ and 6’6’’, as shown below.
Now let’s say we want to find the probability of a student’s height being exactly 6’. It will be the area under the curve at the single point 6’. This is the area of the straight dotted line shown below, which has no width, so the area is 0.
This supports the argument we made earlier: the probability of a student’s height being exactly equal to a particular value, say 6’, is 0.
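We can watch this happen numerically. The sketch below approximates the area under a PDF with the trapezoid rule and shrinks the interval around 6’ (72 inches) step by step. The normal distribution with mean 68 inches and standard deviation 3 inches is an assumption made for illustration, and `area_under_pdf` is a hypothetical helper written for this example.

```python
from statistics import NormalDist

# Assumption: heights in inches are normal with mean 68 and
# standard deviation 3 (illustrative numbers only).
heights = NormalDist(mu=68, sigma=3)

def area_under_pdf(dist, low, high, steps=1000):
    """Approximate the area under dist.pdf between low and high (trapezoid rule)."""
    width = (high - low) / steps
    total = 0.5 * (dist.pdf(low) + dist.pdf(high))
    for i in range(1, steps):
        total += dist.pdf(low + i * width)
    return total * width

# Shrink an interval centred on 6' (72 inches) and watch the area vanish.
for half_width in [3, 1, 0.1, 0.001, 0]:
    area = area_under_pdf(heights, 72 - half_width, 72 + half_width)
    print(half_width, area)
```

As the interval gets narrower, the area gets smaller, and for a single point (width 0) it is exactly 0.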
Now that we have discussed the Probability Density Function at length, let’s look at what the Cumulative Distribution Function is and why it is significant.
Cumulative Distribution Function (CDF)
The Cumulative Distribution Function gives us the cumulative probability. Let’s try to understand what this means and how it can be interpreted.
Suppose we have the distribution of weights shown below, with the share of the distribution on the y-axis.
Let’s say we want to find the share of the distribution at a weight of 120. According to the curve, the value is 0.1.
For a weight of 130, the share of the distribution is 0.2. To find the CDF at 130, we take the cumulative sum of the shares up to that point, which is (0.1 + 0.2) = 0.3. This is the cumulative probability, and we apply the same approach to every point in the distribution.
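The running total described above is easy to sketch in code. Only the first two values (0.1 at 120 and 0.2 at 130) come from the example; the remaining weights and shares are made up so that the total is 1, and they do not try to reproduce the plotted curve.

```python
from itertools import accumulate

# The first two shares (0.1 at 120, 0.2 at 130) come from the text;
# the rest are hypothetical values chosen so that the shares sum to 1.
weights = [120, 130, 140, 150, 160]
shares = [0.1, 0.2, 0.4, 0.2, 0.1]

# The CDF at each weight is the running total of the shares up to that weight.
cdf = list(accumulate(shares))

for weight, share, cumulative in zip(weights, shares, cdf):
    print(weight, share, round(cumulative, 2))
```

Each CDF value is at least as large as the one before it, and the last one reaches 1 because all of the probability has been accumulated.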
So, the curve of a cumulative distribution would look like the one below:
If you notice, beyond a certain point the CDF flattens out, because the share of the distribution at each weight keeps falling after 160.
Now, let’s say we want to find the CDF value for 200, and suppose it is 0.9.
This means that 90% of the students weigh 200 lbs or less. It also indicates that the remaining 10% of the students weigh more than 200 lbs.
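Written as a formula, with F as the Cumulative Distribution Function and X as the weight (the symbols are our own notation for this example), the reasoning is:

```latex
F(200) = P(X \le 200) = 0.9
\quad\Longrightarrow\quad
P(X > 200) = 1 - F(200) = 1 - 0.9 = 0.1
```

The two probabilities must add up to 1 because every student either weighs 200 lbs or less, or weighs more than 200 lbs.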
Now let’s implement what we have discussed so far in code and see how it works!
For other posts related to Exploratory Data Analysis (EDA), refer to the links below:
ii. Exploratory Data Analysis for Univariate, Bivariate and Multivariate data