Reinforcement Learning Journal, vol. TBD, 2025, pp. TBD.
Presented at the Reinforcement Learning Conference (RLC), Edmonton, Alberta, Canada, August 5–9, 2025.
The integration of AI into high-stakes decision-making domains demands safety and accountability. Traditional contextual bandit algorithms for online, adaptive decision-making must balance exploration and exploitation, which poses significant risks in critical environments where exploratory actions can lead to severe consequences. To address these challenges, we propose MixUCB, a flexible human-in-the-loop contextual bandit framework that enhances safe exploration by combining human expertise and oversight with machine automation. Based on the model's confidence and the associated risks, MixUCB intelligently determines when to seek human intervention, and its reliance on human input gradually decreases as the system learns and gains confidence. Theoretically, we analyze regret and query complexity to rigorously answer the question of when to query. Empirically, we validate MixUCB's effectiveness through extensive experiments on both synthetic and real-world datasets. Our findings underscore the importance of designing decision-making frameworks that are not only theoretically and technically sound, but also aligned with societal expectations of accountability and safety. Our experimental code is available at: https://github.com/sdean-group/MixUCB
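To make the confidence-gated query rule concrete, below is a minimal Python sketch in the style of LinUCB: the learner defers to a human whenever the confidence width of its chosen action is still large. This is an illustration under assumed details, not the algorithm from the paper; the class name, the query_threshold parameter, and the ask_human callback are all hypothetical (the precise MixUCB specification is in the paper and the linked repository).

import numpy as np

class ConfidenceGatedLinUCB:
    """Hypothetical sketch: a LinUCB-style bandit that queries a human
    whenever its confidence width for the chosen arm exceeds a threshold."""

    def __init__(self, n_arms, dim, alpha=1.0, query_threshold=0.5, lam=1.0):
        self.alpha = alpha              # exploration-bonus scale
        self.tau = query_threshold      # query the human if width > tau
        # Per-arm regularized Gram matrices and response vectors
        self.A = [lam * np.eye(dim) for _ in range(n_arms)]
        self.b = [np.zeros(dim) for _ in range(n_arms)]

    def select(self, x, ask_human):
        """Pick an arm for context x; defer to the human when uncertain."""
        means, widths = [], []
        for A, b in zip(self.A, self.b):
            A_inv = np.linalg.inv(A)
            theta = A_inv @ b           # ridge-regression estimate
            means.append(theta @ x)
            widths.append(self.alpha * np.sqrt(x @ A_inv @ x))
        ucb = np.array(means) + np.array(widths)
        arm = int(np.argmax(ucb))
        # Defer only while the chosen arm's confidence width is still wide;
        # widths shrink as data accumulates, so queries naturally taper off.
        if widths[arm] > self.tau:
            arm = ask_human(x)
        return arm

    def update(self, arm, x, reward):
        """Standard rank-one update of the played arm's statistics."""
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x

Here ask_human stands in for the oversight channel; because the per-arm confidence widths shrink as data accumulates, calls to it become rarer over time, matching the abstract's description of decreasing reliance on human input.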
Jinyan Su, Rohan Banerjee, Jiankai Sun, Wen Sun, and Sarah Dean. "MixUCB: Enhancing Safe Exploration in Contextual Bandits with Human Oversight." Reinforcement Learning Journal, vol. TBD, 2025, pp. TBD.
BibTeX:
@article{su2025mixucb,
  title   = {{MixUCB}: {E}nhancing Safe Exploration in Contextual Bandits with Human Oversight},
  author  = {Su, Jinyan and Banerjee, Rohan and Sun, Jiankai and Sun, Wen and Dean, Sarah},
  journal = {Reinforcement Learning Journal},
  year    = {2025}
}