Data

Our Dataset

Our group selected the Student Performance Dataset from the UC Irvine Machine Learning Repository. It records academic and engagement outcomes for students at two Portuguese secondary schools, along with demographic, family, and school-related factors.

View Our Dataset

About our Dataset

Where our data comes from

The dataset was generated from two secondary schools in Portugal, Gabriel Pereira and Mousinho da Silveira, using a combination of school records and questionnaires. Students and their parents provided demographic and social information through surveys, while academic records documented student grades in Mathematics and Portuguese, two core subjects in Portugal. The dataset consists of 649 subjects and captures both numeric variables (such as grades and absences) and categorical variables (such as gender, school, and parental status). To ensure privacy, student names and other personally identifying details were excluded. The dataset was created by Paulo Cortez, a professor at the University of Minho. No specific funding source is mentioned, which leaves open the question of whether the research was institutionally or governmentally supported.

Boy writing on a piece of paper next to a girl. — Photo courtesy of Unsplash.

Three dimensional bar graph against a light blue background. — Photo courtesy of Unsplash.

What our data reveals

The dataset includes key features such as the school attended (GP for Gabriel Pereira or MS for Mousinho da Silveira), student sex (F for female, M for male), age (ranging from 15 to 22), and home address type (U for urban, R for rural). It also records family size (LE3 for families with three or fewer members, GT3 for families with more than three members) and parental cohabitation status (T for parents living together, A for parents living apart). Additionally, the dataset captures parental education levels, with both mother’s and father’s education coded as 0 (none), 1 (primary education, up to 4th grade), 2 (5th to 9th grade), 3 (secondary education), or 4 (higher education). The dataset further includes the mother’s job category, classified as teacher, health care related, civil services, at_home, or other; the father’s occupation uses the same categories.
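
The coding scheme above can be sketched as a small lookup table. This is an illustrative sketch, not part of our processing pipeline: the column names (`school`, `sex`, `address`, `famsize`, `Pstatus`, `Medu`, `Fedu`) are assumed to match the headers in the UCI files, and the sample row is made up for demonstration rather than taken from the data.

```python
# Labels for the parental education codes (shared by mother and father).
EDUCATION = {
    "0": "none",
    "1": "primary education (4th grade)",
    "2": "5th to 9th grade",
    "3": "secondary education",
    "4": "higher education",
}

# Code-to-label lookup for the categorical columns described above.
# Column names are an assumption; check them against your copy of the file.
CODEBOOK = {
    "school": {"GP": "Gabriel Pereira", "MS": "Mousinho da Silveira"},
    "sex": {"F": "female", "M": "male"},
    "address": {"U": "urban", "R": "rural"},
    "famsize": {"LE3": "3 or fewer members", "GT3": "more than 3 members"},
    "Pstatus": {"T": "living together", "A": "living apart"},
    "Medu": EDUCATION,
    "Fedu": EDUCATION,
}


def decode_row(row):
    """Return a copy of `row` with known codes replaced by readable labels.

    Columns without an entry in CODEBOOK (for example, age) pass through
    unchanged, as do values that are not recognised codes.
    """
    decoded = {}
    for column, value in row.items():
        labels = CODEBOOK.get(column, {})
        decoded[column] = labels.get(str(value), value)
    return decoded


# Hypothetical row for illustration only; not an actual student record.
example = {"school": "GP", "sex": "F", "age": "17", "famsize": "GT3", "Medu": "4"}
print(decode_row(example))
```

Keeping the codes in one dictionary makes it easy to audit the mapping against the dataset documentation before relabeling any charts or tables.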

This dataset offers valuable insights into the external factors that may influence student engagement and academic success. It can show whether school attended, family environment, or parental education levels are associated with differences in final grades. However, there are notable limitations to what the dataset can reveal. It does not account for students’ mental health, family relationships beyond parents, or social interactions at school, all of which can significantly influence academic performance. Factors such as bullying, peer influence, or additional family stressors remain unaccounted for and may affect student outcomes in ways that are not reflected in the data.
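
One simple way to explore such associations is to compare average final grades across the levels of a single factor. The sketch below is a minimal illustration using only the Python standard library. It assumes the final grade is stored in a column named `G3` (our reading of the UCI documentation, so treat it as an assumption), and the sample rows are invented for demonstration, not drawn from the dataset.

```python
from collections import defaultdict
from statistics import mean


def mean_grade_by(rows, factor, grade="G3"):
    """Average the grade column within each level of `factor`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[factor]].append(float(row[grade]))
    return {level: mean(values) for level, values in groups.items()}


# Made-up rows for illustration only; these are NOT values from the dataset.
sample = [
    {"school": "GP", "G3": "12"},
    {"school": "GP", "G3": "14"},
    {"school": "MS", "G3": "10"},
]

print(mean_grade_by(sample, "school"))  # {'GP': 13.0, 'MS': 10.0}
```

A gap between group averages is only descriptive. It does not show that the factor causes the difference, especially given the unmeasured influences noted above.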

Data Critique

While the dataset provides valuable insights into student performance, it has several limitations that affect the depth and generalizability of its findings.

Some variables rely on vague categories, such as “other,” and gender is recorded only in a binary format, limiting representation. The “famsup” variable, which indicates family educational support, does not capture the quality or extent of that support. Additionally, the dataset offers only a narrow view of students’ socio-cultural backgrounds, omitting key factors such as whether schools are publicly or privately funded, student-to-teacher ratios, class sizes, and the number of subjects contributing to final grades. This lack of context raises concerns about the dataset’s applicability beyond the two schools studied, as findings may not generalize to other educational or geographical settings. Furthermore, the dataset reflects an ideological bias toward quantifying student success through numeric indicators, reinforcing a grade-centric approach. By prioritizing structured data, it overlooks important qualitative factors such as student motivation, mental health, teacher quality, and learning styles.

Despite these shortcomings, the dataset remains valuable for understanding how specific external factors correlate with academic outcomes and can serve as a foundation for further research into student success.

View our data processing methodology

Edulytics