Chi-Squared Test of Independence


The Chi-Squared Test of Independence determines whether two categorical variables are associated, that is, whether they are related to each other or independent. It’s also called the Chi-Square Test of Association.

The Chi-Squared Test uses a contingency table to determine the association. The contingency table holds the counts of observations classified by the two categorical features: the categories of one feature are arranged in rows and the categories of the other in columns, so each cell holds the number of observations with that combination of categories. Each categorical feature should have 2 or more categories.
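To make the layout of a contingency table concrete, here is a minimal sketch using pandas. The toy values are made up for illustration; only the column names (gender, status) are borrowed from the dataset described later, and the category labels are assumptions.

```python
import pandas as pd

# Hypothetical toy data: the values are invented for illustration only.
df = pd.DataFrame({
    "gender": ["M", "F", "M", "M", "F", "F", "M", "F"],
    "status": ["Placed", "Placed", "Not Placed", "Placed",
               "Not Placed", "Placed", "Placed", "Not Placed"],
})

# Rows hold the categories of one feature, columns those of the other;
# each cell counts the observations with that combination.
table = pd.crosstab(df["gender"], df["status"])
print(table)
```

Each cell of the resulting table is a count, and the row and column totals are what the test later uses to work out expected frequencies.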

To carry out the Chi-Squared Test, we need to meet the following requirements:

  1. We should use 2 categorical features.
  2. Each categorical feature should have 2 or more categories.
  3. The observations should be independent.
    • There should be no relationship between the subjects in each group.
    • The categorical features should not be paired (for example, the same subjects measured twice).
  4. The sample size should be large.
    • The expected frequency of every cell should be at least 1.
    • The expected frequency should be at least 5 for the majority of cells (a common rule of thumb is 80% of cells).
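The sample-size requirement can be checked before running the test. The sketch below uses plain Python and hypothetical observed counts (not taken from the dataset) to compute the expected frequency of each cell under independence, which is row total × column total / grand total, and then checks both conditions. The 80% threshold is the rule of thumb mentioned above, not a hard limit.

```python
# Hypothetical observed counts: rows are the categories of one feature,
# columns are the categories of the other.
observed = [
    [30, 10],
    [20, 20],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Under independence: expected = row total * column total / grand total.
expected = [[r * c / n for c in col_totals] for r in row_totals]

cells = [e for row in expected for e in row]
all_at_least_1 = all(e >= 1 for e in cells)
share_at_least_5 = sum(e >= 5 for e in cells) / len(cells)

# Uncorrected Pearson statistic: sum of (observed - expected)^2 / expected.
chi2 = sum(
    (o - e) ** 2 / e
    for o_row, e_row in zip(observed, expected)
    for o, e in zip(o_row, e_row)
)

print(expected)
print(all_at_least_1, share_at_least_5 >= 0.8, round(chi2, 3))
```

In practice, scipy.stats.chi2_contingency also returns the expected table alongside the statistic, p-value and degrees of freedom. Note that it applies Yates' continuity correction to 2x2 tables by default, so its statistic would differ from the uncorrected value computed here.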

Before running the test on real data, let’s check the dataset and see what features it has:

About the dataset

The dataset consists of placement data for MBA students of a B-school. It includes the following features:

  1. serial number (sl_no)
  2. gender (gender)
  3. secondary school percentage (ssc_p)
  4. secondary school specialization (ssc_b)
  5. higher secondary school percentage (hsc_p)
  6. higher secondary school specialization (hsc_b)
  7. degree percentage (degree_p)
  8. degree specialization (degree_t)
  9. work experience (workex)
  10. competitive exam percentage (etest_p)
  11. MBA specialization (specialisation)
  12. MBA percentage (mba_p)
  13. placement status (status)
  14. salary (salary)

You can download the dataset from Kaggle.

Also, if you want to learn about hypothesis testing using the t-test, then click here.