Analyzing Student’s Performance in Exams using Python


Hey coder! Today we will analyze a student performance dataset to understand which factors can affect students' performance in various subjects.

Let’s get started already!


Understanding the Student Dataset

You can download the dataset from here. The dataset contains around 1000 data points and 8 columns. Three of the columns hold the scores for the three subjects (math, reading, and writing). The remaining five are the independent variables:

  1. gender: sex of students
  2. race/ethnicity : ethnicity of students
  3. parental level of education : parents’ final education
  4. lunch : type of lunch the student had (standard or free/reduced)
  5. test preparation course : whether the student completed the course before the test (completed or none)

Code Implementation to get information from Dataset

Now that we know what the dataset contains, let's extract information from it using Python.

Importing all necessary modules/libraries

import seaborn as sns
import matplotlib.pyplot as plt 
import pandas as pd 
import numpy as np

Loading and Cleaning Dataset

Let’s take a look at the dataset using the `read_csv` and `head` functions of the pandas module. The code for the same is below.

data = pd.read_csv('StudentsPerformance.csv')
print("Number of data points : ",data.shape[0])
data.head()
Figure: StudentsPerformance dataset
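
Since this step is also about cleaning, it can help to confirm that no values are missing and that the score columns are numeric before going further. Below is a minimal sketch of such a check. It uses a small made-up frame in place of the real CSV: the rows (including the missing lunch value) are hypothetical and only mimic the column layout.

```python
import pandas as pd

# Hypothetical rows that mimic the layout of StudentsPerformance.csv.
# With the real file you would use: sample = pd.read_csv('StudentsPerformance.csv')
sample = pd.DataFrame({
    "gender": ["female", "male", "female", "male"],
    "lunch": ["standard", "free/reduced", "standard", None],
    "math score": [72, 47, 90, 76],
    "reading score": [72, 57, 95, 78],
    "writing score": [74, 44, 93, 75],
})

# Count missing values in every column
missing_per_column = sample.isnull().sum()
print(missing_per_column)

# Check that the three score columns hold numbers, not strings
score_columns = ["math score", "reading score", "writing score"]
scores_are_numeric = all(
    pd.api.types.is_numeric_dtype(sample[col]) for col in score_columns
)
print("Scores numeric:", scores_are_numeric)

# Drop rows with any missing value (one possible cleaning choice)
cleaned = sample.dropna()
print("Rows before:", len(sample), "after:", len(cleaned))
```

Dropping incomplete rows is only one option. If the real file turns out to have no missing values, this step changes nothing.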

Some column names are long and can get confusing. Let's rename them to make things simpler. After running the code below, the data is much easier to read.

# Shorten the column names; "pre" stands for the test preparation course
data.rename(columns={"race/ethnicity": "ethnicity",
                     "parental level of education": "parent_education",
                     "math score": "math",
                     "reading score": "reading",
                     "writing score": "writing",
                     "test preparation course": "pre"},
            inplace=True)
data.head()
Figure: StudentsPerformance cleaner dataset

Understanding the factors that affect student performance

To find out which factors may affect a student's performance, we can classify the scores into a few ranks and compare how each feature relates to them. In this tutorial we make the comparison visually, one feature at a time.
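
The idea of classifying scores into ranks can be sketched with `pd.cut`. This is only an illustration: the scores below are made up, and the rank boundaries (below 50, 50 to 74, 75 and above) are an assumption, not something defined by the dataset.

```python
import pandas as pd

# Hypothetical scores; in the tutorial these would come from data["math"]
scores = pd.DataFrame({
    "pre": ["none", "completed", "none", "completed", "none", "completed"],
    "math": [35, 88, 62, 71, 49, 95],
})

# Assumed rank boundaries: [0, 50) -> low, [50, 75) -> medium, [75, 101) -> high.
# The last edge is 101 so that a perfect score of 100 still lands in "high".
scores["math_rank"] = pd.cut(
    scores["math"],
    bins=[0, 50, 75, 101],
    labels=["low", "medium", "high"],
    right=False,
)

# Count how many students of each preparation status fall into each rank
rank_table = pd.crosstab(scores["pre"], scores["math_rank"])
print(rank_table)
```

A table like this makes it easy to see whether one value of a feature is concentrated in the higher ranks.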

Visualizing Male and Female Performance

We will look at the other features later. First, let's figure out how males and females perform in the three subjects present in the dataset.

We will start off by separating the male and female datasets using the code below.

male_data = data[data["gender"]=='male']
female_data = data[data["gender"]=='female']

The next step is to plot the total scores of males and females in the three subjects, using matplotlib subplots and seaborn's sns.barplot. The code and its output are below.

plt.figure(figsize=(20,10),facecolor='w')

x_data = ["Male","Female"]

plt.subplot(1,3,1)

plt.title("Maths Score Male v/s Female",size=14)
plt.xlabel("Gender",size=14)
plt.ylabel("Score of Student",size=14)

# Total maths score per gender (a sum, so it also depends on group size)
math_data = [male_data['math'].sum(), female_data['math'].sum()]

# Recent seaborn versions require x and y as keyword arguments
math_bar = sns.barplot(x=x_data, y=math_data)
for p in math_bar.patches:
    math_bar.annotate(format(p.get_height(), '.1f'), 
                   (p.get_x() + p.get_width() / 2., p.get_height()), 
                   ha = 'center', va = 'center', 
                   xytext = (0, 9), 
                   textcoords = 'offset points')
    
plt.subplot(1,3,2)

plt.title("Reading Score Male v/s Female",size=14)
plt.xlabel("Gender",size=14)
plt.ylabel("Score of Student",size=14)

# Total reading score per gender
reading_data = [male_data['reading'].sum(), female_data['reading'].sum()]

# Recent seaborn versions require x and y as keyword arguments
reading_bar = sns.barplot(x=x_data, y=reading_data)
for p in reading_bar.patches:
    reading_bar.annotate(format(p.get_height(), '.1f'), 
                   (p.get_x() + p.get_width() / 2., p.get_height()), 
                   ha = 'center', va = 'center', 
                   xytext = (0, 9), 
                   textcoords = 'offset points')
    
plt.subplot(1,3,3)

plt.title("Writing Score Male v/s Female",size=14)
plt.xlabel("Gender",size=14)
plt.ylabel("Score of Student",size=14)

# Total writing score per gender
writing_data = [male_data['writing'].sum(), female_data['writing'].sum()]

# Recent seaborn versions require x and y as keyword arguments
writing_bar = sns.barplot(x=x_data, y=writing_data)
for p in writing_bar.patches:
    writing_bar.annotate(format(p.get_height(), '.1f'), 
                   (p.get_x() + p.get_width() / 2., p.get_height()), 
                   ha = 'center', va = 'center', 
                   xytext = (0, 9), 
                   textcoords = 'offset points')
    
plt.tight_layout()
plt.show()
Figure: Male vs. female scores bar plots

You can observe that females score higher in both reading and writing, while males score higher in math. Keep in mind that the bars show summed scores, so the comparison also depends on how many students are in each group. The dataset itself cannot explain why the difference exists. Claims such as girls using both brain hemispheres for reading and writing tasks while boys use only one are popular explanations, but they are not something this data can confirm.
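
Because the bar plots above are built from summed scores, a group with more students can look stronger simply because it is larger. A hedged sketch of the difference between totals and means follows. The rows are hypothetical and deliberately unbalanced (three females, two males) to make the effect visible.

```python
import pandas as pd

# Hypothetical, deliberately unbalanced sample: 3 females, 2 males
sample = pd.DataFrame({
    "gender": ["female", "female", "female", "male", "male"],
    "math": [60, 70, 65, 72, 78],
    "reading": [75, 80, 70, 65, 71],
})

# Totals grow with group size ...
totals = sample.groupby("gender")[["math", "reading"]].sum()
# ... while means describe a typical student in each group
means = sample.groupby("gender")[["math", "reading"]].mean()

print(totals)
print(means)
```

In this made-up sample the female total in math is higher, yet the male mean is higher. With the real data, replacing `.sum()` with `.mean()` in the plotting code would give a comparison that does not depend on group size.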

Visualizing the performance of various groups

Next, let's figure out how the various ethnicity groups perform in the three subjects, using the code below.

In the resulting plot, group E has the highest mean score in every subject and group A has the lowest.

subjects = list(data.columns[-3:])  # math, reading, writing

# Create the three panels up front instead of stacking plt.subplot calls
# on top of an existing full-size axes
fig, axes = plt.subplots(1, 3, figsize=(15,7), facecolor='w')
for ax, subject in zip(axes, subjects):
    # Mean score of each ethnicity group in this subject
    ethn_df = data.groupby("ethnicity")[subject].mean()
    sns.barplot(x=ethn_df.index, y=ethn_df.values, palette="Reds", ax=ax)
    ax.set_xlabel("Group Name")
    ax.set_ylabel("Mean Scores in Subject")
    ax.tick_params(axis="x", labelrotation=90)
    ax.set_title(subject)
plt.tight_layout()
plt.show()
Figure: Group-wise scores bar plots
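
The best and worst groups can also be read from a table instead of a plot. Here is a small sketch using `groupby` with `idxmax` and `idxmin`. The rows are hypothetical and cover only three of the groups, so the numbers do not reproduce the real dataset.

```python
import pandas as pd

# Hypothetical rows for three of the groups
sample = pd.DataFrame({
    "ethnicity": ["group A", "group A", "group B", "group B", "group E", "group E"],
    "math": [55, 61, 63, 67, 74, 80],
    "reading": [60, 64, 66, 70, 72, 76],
    "writing": [58, 62, 65, 69, 70, 74],
})

subjects = ["math", "reading", "writing"]

# One row per group, one column per subject
group_means = sample.groupby("ethnicity")[subjects].mean()
print(group_means)

# For each subject, the group with the highest and lowest mean
best_group = group_means.idxmax()
worst_group = group_means.idxmin()
print(best_group)
print(worst_group)
```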

Visualizing the performance on the basis of test preparation

We can compare students' performance in the three subjects on the basis of whether they completed the test preparation course.

The code for this is below. You can observe that the score distribution is narrower for students who completed the preparation before the test, and that their average score is also higher.

plt.figure(figsize=(20,7),facecolor='w')
# One box plot per subject, split by test preparation status
for i, item in enumerate(data.columns[-3:], start=1):
    plt.subplot(1,3,i)
    sns.boxplot(x=data["pre"], y=data[item])
    plt.title(item+" vs pre test",size=14)
plt.show()
Figure: Test preparation scores box plots
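
The "narrower distribution" and "better average" seen in the box plots can be put into numbers with a grouped summary. The sketch below uses made-up scores. The `iqr` helper is a name introduced here for illustration, not part of pandas.

```python
import pandas as pd

# Hypothetical math scores for students with and without preparation
sample = pd.DataFrame({
    "pre": ["none"] * 4 + ["completed"] * 4,
    "math": [40, 55, 70, 85, 66, 70, 74, 78],
})

def iqr(series):
    """Interquartile range: the height of the box in a box plot."""
    return series.quantile(0.75) - series.quantile(0.25)

# Centre (mean, median) and spread (std, iqr) for each preparation status
summary = sample.groupby("pre")["math"].agg(["mean", "median", "std", iqr])
print(summary)
```

A smaller `std` or `iqr` for one group corresponds to a narrower box, and a higher `mean` or `median` corresponds to a box that sits higher on the plot.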

Visualizing the performance on the basis of lunch type

We can compare the performance of the students in the three subjects on the basis of the type of lunch the students had before the exam.

The code for this is below. You can observe that students who had a standard lunch tend to score better.

plt.figure(figsize=(20,7),facecolor='w')
# One box plot per subject, split by lunch type
for i, item in enumerate(data.columns[-3:], start=1):
    plt.subplot(1,3,i)
    sns.boxplot(x=data["lunch"], y=data[item])
    plt.title(item+" vs Lunch",size=14)
plt.show()
Figure: Meal type scores box plots

Conclusion

From the tutorial, we can conclude that some of the factors related to students' performance in exams are as follows:

  1. Parents' education level may have an effect on student performance, but it does not appear to be a major one (it was not plotted in this tutorial).
  2. Finishing the test preparation course before the exam is beneficial.
  3. Having a standard lunch is associated with higher scores, and it appears to be one of the most significant factors.

In conclusion, if students want to perform well, they should eat properly and make the effort to prepare for any test or exam.
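
One rough way to compare how strongly each factor relates to the scores is to measure the gap between its best and worst group means. The sketch below shows the technique only. The rows are made up, so the resulting order is not evidence for the points above, and a gap in means is a simple heuristic rather than a statistical test.

```python
import pandas as pd

# Hypothetical rows using the renamed columns from this tutorial
sample = pd.DataFrame({
    "parent_education": ["high school", "bachelor's degree"] * 4,
    "lunch": ["standard"] * 4 + ["free/reduced"] * 4,
    "pre": ["none", "none", "completed", "completed"] * 2,
    "math": [66, 70, 72, 76, 54, 58, 62, 66],
})

factors = ["parent_education", "lunch", "pre"]

# For each factor, the gap between its highest and lowest group mean
gaps = {}
for factor in factors:
    group_means = sample.groupby(factor)["math"].mean()
    gaps[factor] = group_means.max() - group_means.min()

# Largest gap first
gap_ranking = pd.Series(gaps).sort_values(ascending=False)
print(gap_ranking)
```

Running the same loop on the real `data` frame, for each of the three subjects, would show whether the ranking of factors stated above holds.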

I hope you liked the tutorial!

Thank you for reading!

