Date of Completion
Dr. Anna Bargagliotti
Machine learning is often used to build predictive models by extracting patterns from large data sets, and such techniques are increasingly used to predict outcomes in the social sciences. One such application is predicting student acceptance and success in academia. Using these tools for education-related data analysis may enable the evaluation of programs, resources, and curricula. Research is currently needed to examine application, admissions, and retention data in order to address equity in college computer science programs. However, most student-level data sets contain sensitive data that cannot be made public. To facilitate research and the application of machine learning models in this field, we generate an artificial data set of 50,000 students that simulates college admissions data and can be shared publicly without privacy concerns. We then analyze the generated data using logistic regression, K-Nearest Neighbor, random forest, neural network, and XGBoost techniques to demonstrate and compare the types of analyses that can be conducted on data sets of this kind. Finally, we analyze whether the predictive gains of machine learning models outweigh the potential loss of interpretability relative to classical statistical methods.
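As a rough illustration of the workflow described in the abstract, the sketch below generates a small artificial admissions data set and compares a linear classifier (logistic regression) with a non-linear one (K-Nearest Neighbor) on held-out accuracy. It is not the thesis's generator or code. The two features (GPA and a scaled test score), the coefficients of the admission rule, the sample size of 2,000 (chosen instead of 50,000 to keep pure-Python runtime short), and all function names are assumptions made for this example.

```python
import math
import random


def generate_students(n, seed=0):
    """Hypothetical generator: admission depends on GPA and a scaled test score."""
    rng = random.Random(seed)
    students = []
    for _ in range(n):
        gpa = min(4.0, max(0.0, rng.gauss(3.0, 0.5)))
        score = min(1.0, max(0.0, rng.gauss(0.6, 0.15)))
        # Assumed admission rule: a logistic function of the two features.
        logit = 3.0 * (gpa - 3.0) + 6.0 * (score - 0.6)
        p_admit = 1.0 / (1.0 + math.exp(-logit))
        students.append(((gpa, score), 1 if rng.random() < p_admit else 0))
    return students


def standardize(train, test):
    """Scale features to zero mean and unit variance using training statistics."""
    dims = len(train[0][0])
    means = [sum(x[j] for x, _ in train) / len(train) for j in range(dims)]
    stds = [math.sqrt(sum((x[j] - means[j]) ** 2 for x, _ in train) / len(train))
            for j in range(dims)]

    def scale(rows):
        return [(tuple((x[j] - means[j]) / stds[j] for j in range(dims)), y)
                for x, y in rows]

    return scale(train), scale(test)


def fit_logistic(train, epochs=300, lr=0.5):
    """Fit logistic regression by batch gradient descent on the log-loss."""
    dims = len(train[0][0])
    w, b = [0.0] * dims, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dims, 0.0
        for x, y in train:
            z = b + sum(wj * xj for wj, xj in zip(w, x))
            err = 1.0 / (1.0 + math.exp(-z)) - y
            grad_b += err
            for j in range(dims):
                grad_w[j] += err * x[j]
        b -= lr * grad_b / len(train)
        w = [wj - lr * gj / len(train) for wj, gj in zip(w, grad_w)]
    return w, b


def predict_logistic(w, b, x):
    """Predict admission when the fitted linear score is positive."""
    return 1 if b + sum(wj * xj for wj, xj in zip(w, x)) > 0 else 0


def predict_knn(train, x, k=15):
    """Predict by majority vote among the k nearest training points."""
    nearest = sorted(train, key=lambda row: sum((a - c) ** 2 for a, c in zip(row[0], x)))[:k]
    return 1 if 2 * sum(y for _, y in nearest) > k else 0


def accuracy(predict, rows):
    """Fraction of rows whose predicted label matches the true label."""
    return sum(predict(x) == y for x, y in rows) / len(rows)


data = generate_students(2000)
train, test = standardize(data[:1500], data[1500:])
w, b = fit_logistic(train)
acc_logistic = accuracy(lambda x: predict_logistic(w, b, x), test)
acc_knn = accuracy(lambda x: predict_knn(train, x), test)
print(f"logistic regression accuracy: {acc_logistic:.3f}")
print(f"K-Nearest Neighbor accuracy:  {acc_knn:.3f}")
```

Because the assumed admission rule is itself logistic, the linear model is well specified here, and its fitted weights can be read directly as the direction and strength of each feature's effect. K-Nearest Neighbor offers no comparable summary, which is the interpretability trade-off the abstract refers to.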
Mauro, Jack; Martinez, Elena; and Bargagliotti, Anna, "Generating a Dataset for Comparing Linear vs. Non-Linear Prediction Methods in Education Research" (2022). Honors Thesis. 446.