Date of Completion
5-14-2026
Degree Type
Honors Thesis
Discipline
Computer Science (CMSI)
First Advisor
Lanyu Shang
Abstract
We investigate preference optimization over chain-of-thought (CoT) reasoning using automatically constructed preference signals derived from a model's accuracy and internal consistency. Our results show that framing reasoning as a preference learning problem improves both the accuracy of the final answer and the structure of the model outputs. We observe a non-monotonic relationship between performance and the Direct Preference Optimization (DPO) scaling parameter β: moderate values maximize accuracy, while lower values improve stability, highlighting a tradeoff between optimization strength and reliable generation. We further identify a tradeoff between reasoning consistency and accuracy. Increasing the consistency weight improves agreement between reasoning and re-evaluated answers but can decrease accuracy, suggesting that consistency may reinforce both correct and incorrect reasoning. This tradeoff varies by dataset, indicating that the usefulness of consistency as a training signal is task-dependent. Evaluating models trained with various hyperparameter combinations shows how hyperparameter choices affect a model’s performance, and identifying regions of stable performance with low missing-answer rates allows starting hyperparameter values to be selected with more confidence. Overall, our findings demonstrate that preference-based learning over reasoning traces is a promising approach for improving language model reasoning, provided that optimization strength and scoring design are carefully balanced.
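The two quantities the abstract refers to, the consistency weight used when scoring reasoning traces and the DPO scaling parameter β, can be illustrated with a minimal Python sketch. This is not the thesis's implementation: the `Trace` type, the additive scoring rule in `score`, and the best-versus-worst pairing in `build_pair` are assumptions made for illustration only. The `dpo_loss` function follows the standard per-pair DPO objective, -log σ(β[(log π(y_w) - log π_ref(y_w)) - (log π(y_l) - log π_ref(y_l))]).

```python
import math
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass
class Trace:
    """Hypothetical container for one sampled chain-of-thought trace."""
    text: str
    correct: bool     # final answer matches the reference answer
    consistent: bool  # re-evaluated answer agrees with the reasoning


def score(trace: Trace, consistency_weight: float) -> float:
    # Assumed scoring rule: accuracy plus a weighted consistency term.
    return float(trace.correct) + consistency_weight * float(trace.consistent)


def build_pair(
    traces: Sequence[Trace], consistency_weight: float
) -> Optional[Tuple[Trace, Trace]]:
    """Return (chosen, rejected), or None if no trace outscores another."""
    ranked = sorted(traces, key=lambda t: score(t, consistency_weight))
    worst, best = ranked[0], ranked[-1]
    if score(best, consistency_weight) == score(worst, consistency_weight):
        return None  # no preference signal for this prompt
    return best, worst


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float,
) -> float:
    """Standard DPO loss for a single preference pair."""
    margin = beta * (
        (policy_chosen_logp - ref_chosen_logp)
        - (policy_rejected_logp - ref_rejected_logp)
    )
    # -log(sigmoid(margin)) computed in a numerically stable way.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Two traces for the same prompt: one correct but inconsistent,
# one incorrect but internally consistent.
a = Trace("trace A", correct=True, consistent=False)
b = Trace("trace B", correct=False, consistent=True)

low_weight_pair = build_pair([a, b], consistency_weight=0.5)
high_weight_pair = build_pair([a, b], consistency_weight=2.0)
```

Under these assumed scoring rules, a consistency weight of 0.5 prefers the correct trace, while a weight of 2.0 prefers the consistent but incorrect one. This is one way a larger consistency weight could reinforce incorrect reasoning, as the abstract suggests.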
Recommended Citation
Scolari, Cameron and Shang, Lanyu, "Automatically Constructed Preference Pairs for Chain-of-Thought: Consistency Gains with Accuracy Tradeoffs" (2026). Honors Thesis. 627.
https://digitalcommons.lmu.edu/honors-thesis/627

