Improved Regret Bound for Safe Reinforcement Learning via Tighter Cost Pessimism and Reward Optimism

By Kihyun Yu, Duksang Lee, William Overman, and Dabeen Lee

Reinforcement Learning Journal, vol. TBD, 2025, pp. TBD.

Presented at the Reinforcement Learning Conference (RLC), Edmonton, Alberta, Canada, August 5–9, 2025.


Abstract:

This paper studies the safe reinforcement learning problem formulated as an episodic finite-horizon tabular constrained Markov decision process with an unknown transition kernel and stochastic reward and cost functions. We propose a model-based algorithm built on novel cost and reward function estimators that provide tighter cost pessimism and reward optimism. While guaranteeing no constraint violation in any episode, our algorithm achieves a regret upper bound of $\widetilde{\mathcal{O}}((\bar C - \bar C_b)^{-1}H^{2.5} S\sqrt{AK})$, where $\bar C$ is the cost budget for an episode, $\bar C_b$ is the expected cost under a safe baseline policy over an episode, $H$ is the horizon, and $S$, $A$, and $K$ are the numbers of states, actions, and episodes, respectively. This improves upon the best-known regret upper bound, and when $\bar C - \bar C_b = \Omega(H)$, the gap from the regret lower bound of $\Omega(H^{1.5}\sqrt{SAK})$ is $\widetilde{\mathcal{O}}(\sqrt{S})$. The reduction in the regret upper bound is a consequence of our novel reward and cost function estimators. The key is to control the error in estimating the unknown transition kernel over each episode. In particular, we provide a tighter per-episode bound on this estimation error, derived via a Bellman-type law of total variance that analyzes the expected sum of the variances of the value function estimates. The bound is a function of the estimated transition kernel, whose choice the algorithm can optimize, which leads to a tighter dependence on the horizon in the function estimators. We also present numerical results demonstrating the computational effectiveness of the proposed framework.
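To unpack the stated gap (a brief sketch using only the quantities defined in the abstract), substituting $\bar C - \bar C_b = \Theta(H)$ into the regret upper bound gives

$$\widetilde{\mathcal{O}}\big((\bar C - \bar C_b)^{-1} H^{2.5} S \sqrt{AK}\big) = \widetilde{\mathcal{O}}\big(H^{-1} \cdot H^{2.5} S \sqrt{AK}\big) = \widetilde{\mathcal{O}}\big(H^{1.5} S \sqrt{AK}\big),$$

which exceeds the lower bound $\Omega(H^{1.5}\sqrt{SAK})$ by a factor of $\sqrt{S}$, up to logarithmic terms.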


Citation Information:

Kihyun Yu, Duksang Lee, William Overman, and Dabeen Lee. "Improved Regret Bound for Safe Reinforcement Learning via Tighter Cost Pessimism and Reward Optimism." Reinforcement Learning Journal, vol. TBD, 2025, pp. TBD.

BibTeX:
@article{yu2025improved,
    title={Improved Regret Bound for Safe Reinforcement Learning via Tighter Cost Pessimism and Reward Optimism},
    author={Yu, Kihyun and Lee, Duksang and Overman, William and Lee, Dabeen},
    journal={Reinforcement Learning Journal},
    year={2025}
}