Reinforcement Learning Journal, vol. TBD, 2025, pp. TBD.
Presented at the Reinforcement Learning Conference (RLC), Edmonton, Alberta, Canada, August 5–9, 2025.
Reinforcement learning from human feedback (RLHF) aims to learn or fine-tune policies via human preference data when a ground-truth reward function is not known. However, conventional RLHF methods provide no performance guarantees and can have an unacceptably high probability of returning poorly performing policies. We propose Policy Optimization and Safety Test for Policy Improvement (POSTPI), an algorithm that provides high-confidence policy performance guarantees without direct knowledge of the ground-truth reward function, given only a preference dataset. The user of the algorithm may select any initial policy $\pi_\text{init}$ and confidence level $1 - \delta$, and POSTPI will ensure that the probability it returns a policy with performance worse than $\pi_\text{init}$ under the unobserved ground-truth reward function is at most $\delta$. We present theoretical results, as well as empirical results in the Safety Gymnasium suite, demonstrating that POSTPI reliably provides the desired guarantee.
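A minimal formal reading of this guarantee, assuming $J(\pi)$ denotes the expected return of policy $\pi$ under the unobserved ground-truth reward function and $\pi_\text{ret}$ denotes the policy returned by POSTPI (notation assumed here for illustration, not taken from the paper body): $$\Pr\big(J(\pi_\text{ret}) < J(\pi_\text{init})\big) \le \delta.$$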
Hon Tik Tse, Philip S. Thomas, and Scott Niekum. "High-Confidence Policy Improvement from Human Feedback." Reinforcement Learning Journal, vol. TBD, 2025, pp. TBD.
BibTeX:
@article{tse2025high,
  title={High-Confidence Policy Improvement from Human Feedback},
  author={Tse, Hon Tik and Thomas, Philip S. and Niekum, Scott},
  journal={Reinforcement Learning Journal},
  year={2025}
}