Date of Award

Spring 5-16-2026

Access Restriction

Thesis

Degree Name

Master of Science

Department

Computer Science

School or College

Seaver College of Science and Engineering

First Advisor

Andrew Forney

Abstract

Reinforcement learning under unobserved confounders produces biased estimates that randomization-based algorithms cannot fix; in the bandit setting such algorithms incur unbounded regret [Bareinboim et al., 2015]. Forney et al. [2017] establish that under intent conditioning the effect of treatment on the treated becomes empirically identifiable, and that an inverse-variance combination of three estimators yields a Thompson-sampling agent that outperforms any non-counterfactual baseline. The bandit’s tabular Beta posterior is a permissive substrate: observational seeding, experimental warm-up, and counterfactual fusion all write into a single sufficient statistic per (intent, arm) cell. A deep Q-network is not the same kind of object. This thesis addresses two open questions: whether the machinery survives the lift to function-approximated reinforcement learning, and whether the deep-RL analogs of Forney et al.’s two best bandit algorithms (which in the bandit are alternatives over a single Beta posterior, never combined) compose helpfully when the function approximator’s substrate splits into a shared-parameter network and a tabular correction layer. We port the three-estimator combiner to a deep Q-network with a Gaussian-variance adaptation suited to bootstrapped TD targets, ablate in WindMaze (a purpose-built gridworld with a hidden per-episode wind direction and a tunable intent-noise channel), and bound performance with a finite-horizon value-iteration oracle over the augmented state. The transfer holds with a precondition: the counterfactual-fusion agent significantly outperforms plain intent-conditioning at every reliability level we tested, but the function-approximated extension requires substantially more pre-collection data than the bandit to reach its asymptotic regime.
The deep-RL combination interferes destructively: running observational warm-up of the Q-network alongside the inverse-variance correction in the tabular layer underperforms either component alone at moderate intent reliability and recovers only at oracle-quality intent. The bandit cannot pose this combination question — its single Beta posterior leaves no second substrate to run a second algorithm over — so the interference is a structural property of the function-approximation substrate, not of the data.
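
For readers unfamiliar with the inverse-variance combination the abstract refers to, the following is a minimal, generic sketch of the textbook technique, assuming independent estimates with known positive variances. It is not the thesis's implementation: the function name, the argument layout, and the example numbers are hypothetical, and the thesis's Gaussian-variance adaptation for bootstrapped TD targets is not reproduced here.

```python
def inverse_variance_combine(estimates, variances):
    """Fuse independent estimates of one quantity, weighting each by 1/variance.

    Hypothetical helper for illustration only; the three inputs stand in for
    the three estimators mentioned in the abstract.
    """
    if not estimates or len(estimates) != len(variances):
        raise ValueError("need equally sized, non-empty estimates and variances")
    if any(v <= 0 for v in variances):
        raise ValueError("variances must be strictly positive")

    # Lower-variance (more reliable) estimates receive larger weights.
    weights = [1.0 / v for v in variances]
    total_weight = sum(weights)

    combined = sum(w * e for w, e in zip(weights, estimates)) / total_weight
    # Under independence, the fused estimate has variance 1 / sum(1 / var_i).
    combined_variance = 1.0 / total_weight
    return combined, combined_variance


# Illustrative numbers only (not results from the thesis).
value, variance = inverse_variance_combine([0.6, 0.5, 0.7], [0.01, 0.04, 0.02])
```

The fused variance is never larger than the smallest input variance, which is why a noisy third estimator cannot make the combination worse under the independence assumption.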

Available for download on Wednesday, November 18, 2026
