Reinforcement Learning Journal, vol. TBD, 2025, pp. TBD.
Presented at the Reinforcement Learning Conference (RLC), Edmonton, Alberta, Canada, August 5–9, 2025.
Existing approaches to language model alignment often treat safety as a tradeoff against helpfulness, which can lead to unacceptable actions in sensitive domains. To ensure reliable performance in such settings, we propose High-Confidence Safe Reinforcement Learning from Human Feedback (HC-RLHF), a method that provides high-confidence safety guarantees while maximizing helpfulness. Like previous methods, HC-RLHF explicitly decouples human preferences regarding helpfulness and harmlessness (safety), training a separate reward model and cost model on each, respectively. It then employs a two-step process to find safe solutions. In the first step, it optimizes the reward function while ensuring that a specific upper-confidence bound on the cost constraint is satisfied. In the second step, the trained model undergoes a safety test to verify whether its performance satisfies a separate upper-confidence bound on the cost constraint. We provide a theoretical analysis of HC-RLHF, including a proof that it will not return an unsafe solution with probability greater than a user-specified threshold. For our empirical analysis, we apply HC-RLHF to align three different language models (Qwen2-1.5B, Qwen2.5-3B, and LLaMa-3.2-3B) with human preferences. Our results demonstrate that HC-RLHF produces safe models with high probability while also improving helpfulness and harmlessness compared to previous methods.
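To make the second step of the abstract concrete, below is a minimal, hedged sketch (not the authors' implementation) of a high-confidence safety test: it computes a one-sided upper confidence bound on the mean cost of a candidate model and passes the model only if that bound falls below the cost limit. The names `cost_samples`, `cost_limit`, and `delta`, and the use of a Student's t bound, are illustrative assumptions in the style of Seldonian safety tests, not details taken from the paper.

```python
# Hedged sketch of a high-confidence safety test; variable names and the
# choice of a Student's t bound are assumptions for illustration only.
import numpy as np
from scipy import stats


def ttest_upper_bound(cost_samples: np.ndarray, delta: float) -> float:
    """One-sided (1 - delta) upper confidence bound on the expected cost,
    assuming the sample mean is approximately Student-t distributed."""
    n = len(cost_samples)
    mean = cost_samples.mean()
    std = cost_samples.std(ddof=1)
    return mean + (std / np.sqrt(n)) * stats.t.ppf(1.0 - delta, df=n - 1)


def safety_test(cost_samples: np.ndarray, cost_limit: float, delta: float) -> bool:
    """Return True only if the upper confidence bound on expected cost is at
    most the limit, so an unsafe model is accepted with probability at most
    delta (under the bound's distributional assumptions)."""
    return ttest_upper_bound(cost_samples, delta) <= cost_limit


# Illustrative usage: per-response costs scored by a trained cost model
# on held-out prompts for the candidate policy.
costs = np.random.default_rng(0).normal(loc=0.1, scale=0.3, size=500)
if safety_test(costs, cost_limit=0.2, delta=0.05):
    print("Candidate model passes the high-confidence safety test.")
else:
    print("No solution found: candidate fails the safety test.")
```

In this sketch, a candidate that fails the test is rejected rather than deployed, mirroring the paper's guarantee that an unsafe solution is returned with probability no greater than the user-specified threshold.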
Yaswanth Chittepu, Blossom Metevier, Will Schwarzer, Austin Hoag, Scott Niekum, and Philip S. Thomas. "Reinforcement Learning from Human Feedback with High-Confidence Safety Guarantees." Reinforcement Learning Journal, vol. TBD, 2025, pp. TBD.
BibTeX:
@article{chittepu2025reinforcement,
title={Reinforcement Learning from Human Feedback with High-Confidence Safety Guarantees},
author={Chittepu, Yaswanth and Metevier, Blossom and Schwarzer, Will and Hoag, Austin and Niekum, Scott and Thomas, Philip S.},
journal={Reinforcement Learning Journal},
year={2025}
}