Honors Thesis

Automatically Constructed Preference Pairs for Chain-of-Thought: Consistency Gains with Accuracy Tradeoffs

Date of Completion

5-14-2026

Degree Type

Honors Thesis

Discipline

Computer Science (CMSI)

First Advisor

Lanyu Shang

Abstract

We investigate preference optimization over chain-of-thought (CoT) reasoning using automatically constructed preference signals derived from a model’s accuracy and internal consistency. Our results show that framing reasoning as a preference learning problem improves both final-answer accuracy and the structure of model outputs. We observe a non-monotonic relationship between performance and the Direct Preference Optimization (DPO) scaling parameter β: moderate values maximize accuracy while lower values improve stability, highlighting a tradeoff between optimization strength and reliable generation. We further identify a tradeoff between reasoning consistency and accuracy. Increasing the consistency weight improves agreement between reasoning and re-evaluated answers but can reduce accuracy, suggesting that consistency may reinforce both correct and incorrect reasoning. This tradeoff varies by dataset, indicating that the usefulness of consistency as a training signal is task-dependent. Evaluating training runs across a range of hyperparameter combinations shows how hyperparameter choices affect a model’s performance, and identifying regions of stable performance with low missing-answer rates allows more confident selection of starting hyperparameter values. Overall, our findings demonstrate that preference-based learning over reasoning traces is a promising approach for improving language model reasoning, provided that optimization strength and scoring design are carefully balanced.
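
For readers unfamiliar with the two knobs named in the abstract, the following is a minimal sketch and not the thesis's implementation. The function dpo_loss is the standard per-pair DPO objective, showing where β enters. The function build_pair, its additive scoring rule, and the (trace, is_correct, is_consistent) tuple layout are hypothetical, chosen only to illustrate how a consistency weight could change which trace is preferred.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta):
    """Standard DPO loss for one preference pair.

    Inputs are sequence log-probabilities under the trained policy and
    the frozen reference model; beta scales the preference margin.
    """
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    z = beta * margin
    # -log(sigmoid(z)) computed stably as softplus(-z)
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


def build_pair(candidates, consistency_weight):
    """Hypothetical pair construction (an assumption, not from the thesis).

    candidates: list of (trace, is_correct, is_consistent) tuples.
    Returns (chosen_trace, rejected_trace), or None if all scores tie.
    """
    def score(candidate):
        _, is_correct, is_consistent = candidate
        # Assumed scoring rule: accuracy plus weighted consistency.
        return float(is_correct) + consistency_weight * float(is_consistent)

    ranked = sorted(candidates, key=score, reverse=True)
    best, worst = ranked[0], ranked[-1]
    if score(best) == score(worst):
        return None  # no usable preference signal
    return best[0], worst[0]
```

Under this assumed scoring rule, a correct but inconsistent trace is preferred over an incorrect but consistent one when the consistency weight is below 1, and the preference flips when the weight exceeds 1, which is one way a consistency signal could end up reinforcing incorrect reasoning.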

Recommended Citation

Scolari, Cameron and Shang, Lanyu, "Automatically Constructed Preference Pairs for Chain-of-Thought: Consistency Gains with Accuracy Tradeoffs" (2026). Honors Thesis. 627.
https://digitalcommons.lmu.edu/honors-thesis/627

Included in

Artificial Intelligence and Robotics Commons
