About the dataset

The dataset consists of placement data for MBA students at a B-school. It includes the following features:

  1. serial number (sl_no)
  2. gender (gender)
  3. secondary school percentage (ssc_p)
  4. secondary school board (ssc_b)
  5. higher secondary school percentage (hsc_p)
  6. higher secondary school board (hsc_b)
  7. higher secondary school specialization (hsc_s)
  8. degree percentage (degree_p)
  9. degree specialization (degree_t)
  10. work experience (workex)
  11. competitive exam percentage (etest_p)
  12. MBA specialization (specialisation)
  13. MBA percentage (mba_p)
  14. placement status (status)
  15. salary (salary)

You can download the dataset from Kaggle: https://www.kaggle.com/benroshan/factors-affecting-campus-placement?select=Placement_Data_Full_Class.csv

1. Import the dataset

In [8]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
In [9]:
# Assumes the Kaggle file was saved locally as 'Placement.csv'
df = pd.read_csv('Placement.csv')
df.head()
Out[9]:
   sl_no  gender  ssc_p    ssc_b  hsc_p    hsc_b     hsc_s  degree_p   degree_t  workex  etest_p  specialisation  mba_p      status    salary
0      1       M  67.00   Others  91.00   Others  Commerce     58.00   Sci&Tech      No     55.0          Mkt&HR  58.80      Placed  270000.0
1      2       M  79.33  Central  78.33   Others   Science     77.48   Sci&Tech     Yes     86.5         Mkt&Fin  66.28      Placed  200000.0
2      3       M  65.00  Central  68.00  Central      Arts     64.00  Comm&Mgmt      No     75.0         Mkt&Fin  57.80      Placed  250000.0
3      4       M  56.00  Central  52.00  Central   Science     52.00   Sci&Tech      No     66.0          Mkt&HR  59.43  Not Placed       NaN
4      5       M  85.80  Central  73.60  Central  Commerce     73.30  Comm&Mgmt      No     96.8         Mkt&Fin  55.50      Placed  425000.0

2. Create histograms of the 'hsc_p' and 'ssc_p' columns with 10 bins each to see how the data is distributed

In [59]:
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(10, 5))

# 1. Histogram of hsc_p with 10 bins
axes[0].set_title('hsc histogram')
axes[0].hist(df['hsc_p'], bins=10)
axes[0].set_xlabel('hsc_percentage')
axes[0].set_ylabel('frequency')

# 2. Histogram of ssc_p with 10 bins
axes[1].set_title('ssc histogram')
axes[1].hist(df['ssc_p'], bins=10)
axes[1].set_xlabel('ssc_percentage')
axes[1].set_ylabel('frequency')

plt.tight_layout()
plt.show()
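
With an integer bins argument, the histogram range from the smallest to the largest value is split into equal-width intervals (Matplotlib's hist uses NumPy for this). The following is a minimal sketch of that rule, assuming the hsc_p values span 37.0 to 97.7, which are the outermost bin edges printed later in this notebook. The two-element array is only a stand-in for the real column, not the actual data.

```python
import numpy as np

# Stand-in for df['hsc_p']: only the minimum and maximum affect the edges.
# The values 37.0 and 97.7 are assumed from the bin edges printed in the notebook.
hsc_range = np.array([37.0, 97.7])

# 10 bins need 11 edges, evenly spaced between the minimum and the maximum.
edges = np.histogram_bin_edges(hsc_range, bins=10)
bin_width = edges[1] - edges[0]

print(np.round(edges, 2))
print(round(float(bin_width), 2))
```

Every bin therefore has the same width, so bar heights in the two histograms can be compared directly within each plot.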
In [61]:
# 1. PDF and CDF of the 'hsc_p' column

fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(10, 5))
# density=True returns densities (not raw counts) for each bin
count_hsc, bin_edges_hsc = np.histogram(df['hsc_p'], bins=10, density=True)
print("counts_hsc: ", count_hsc)
# Normalise so the values sum to 1: the fraction of observations in each bin
pdf_hsc = count_hsc / count_hsc.sum()
print("pdf_hsc:", pdf_hsc)
print("bin_edges_hsc: ", bin_edges_hsc)
cdf_hsc = np.cumsum(pdf_hsc)
axes[0].set_title('CDF and PDF -- HSC_P')
# Plot against the right-hand edge of each bin
axes[0].plot(bin_edges_hsc[1:], pdf_hsc, label='PDF')
axes[0].plot(bin_edges_hsc[1:], cdf_hsc, label='CDF')
axes[0].set_xlabel("HSC Percentage")
axes[0].set_ylabel("Fraction of students")
axes[0].legend()

# 2. PDF and CDF of the 'ssc_p' column

count_ssc, bin_edges_ssc = np.histogram(df['ssc_p'], bins=10, density=True)
print("counts_ssc: ", count_ssc)
pdf_ssc = count_ssc / count_ssc.sum()
print("pdf_ssc:", pdf_ssc)
print("bin_edges_ssc: ", bin_edges_ssc)
cdf_ssc = np.cumsum(pdf_ssc)
axes[1].set_title('CDF and PDF -- SSC_P')
axes[1].plot(bin_edges_ssc[1:], pdf_ssc, label='PDF')
axes[1].plot(bin_edges_ssc[1:], cdf_ssc, label='CDF')
axes[1].set_xlabel("SSC Percentage")
axes[1].set_ylabel("Fraction of students")
axes[1].legend()

plt.tight_layout()
plt.show()
counts_hsc:  [0.00383127 0.00536378 0.01302632 0.02452013 0.05210528 0.0283514
 0.02145512 0.00613003 0.00766254 0.00229876]
pdf_hsc: [0.02325581 0.03255814 0.07906977 0.14883721 0.31627907 0.17209302
 0.13023256 0.0372093  0.04651163 0.01395349]
bin_edges_hsc:  [37.   43.07 49.14 55.21 61.28 67.35 73.42 79.49 85.56 91.63 97.7 ]
counts_ssc:  [0.00479402 0.00671163 0.02109371 0.01821729 0.04122861 0.03068176
 0.02780534 0.02492893 0.02109371 0.00958805]
pdf_ssc: [0.02325581 0.03255814 0.10232558 0.08837209 0.2        0.14883721
 0.13488372 0.12093023 0.10232558 0.04651163]
bin_edges_ssc:  [40.89  45.741 50.592 55.443 60.294 65.145 69.996 74.847 79.698 84.549
 89.4  ]
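
The printed arrays can be cross-checked by hand. Because np.histogram was called with density=True, the values labelled counts_hsc are densities rather than raw counts. Multiplying each density by its bin width gives the share of observations in that bin, which is what pdf_hsc holds. The sketch below uses only the hsc_p numbers printed above, copied at the precision shown there, so the comparison uses a small tolerance.

```python
import numpy as np

# Values copied from the printed output for hsc_p (rounded as displayed).
density_hsc = np.array([0.00383127, 0.00536378, 0.01302632, 0.02452013, 0.05210528,
                        0.0283514, 0.02145512, 0.00613003, 0.00766254, 0.00229876])
pdf_hsc_printed = np.array([0.02325581, 0.03255814, 0.07906977, 0.14883721, 0.31627907,
                            0.17209302, 0.13023256, 0.0372093, 0.04651163, 0.01395349])
bin_edges_hsc = np.array([37.0, 43.07, 49.14, 55.21, 61.28, 67.35,
                          73.42, 79.49, 85.56, 91.63, 97.7])

# Probability mass per bin = density * bin width.
bin_widths = np.diff(bin_edges_hsc)
mass_hsc = density_hsc * bin_widths

# The cumulative sum of the per-bin mass is the empirical CDF at each right edge.
cdf_hsc = np.cumsum(mass_hsc)

print(np.allclose(mass_hsc, pdf_hsc_printed, atol=1e-6))
print(round(float(cdf_hsc[-1]), 4))
```

Since all bins have equal width, dividing the densities by their sum, as the notebook does, gives the same result as multiplying by the bin width. The two approaches would differ if the bins had unequal widths.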