Reinforcement Learning Journal, vol. 5, 2024, pp. 2107–2122.
Presented at the Reinforcement Learning Conference (RLC), Amherst, Massachusetts, August 9–12, 2024.
Existing posterior sampling algorithms for continuing reinforcement learning (RL) rely on maintaining state-action visitation counts, making them unsuitable for complex environments with high-dimensional state spaces. We develop the first extension of posterior sampling for RL (PSRL) that is suited for a continuing agent-environment interface and integrates naturally into scalable agent designs. Our approach, continuing PSRL (CPSRL), determines when to resample a new model of the environment from the posterior distribution based on a simple randomization scheme. We establish an $\tilde{O}(\tau S \sqrt{A T})$ bound on the Bayesian regret in the tabular setting, where $S$ is the number of environment states, $A$ is the number of actions, $T$ is the number of interaction time steps, and $\tau$ denotes the {\it reward averaging time}, a bound on the duration required to accurately estimate the average reward of any policy. Our work is the first to formalize and rigorously analyze this random resampling approach. Simulations demonstrate CPSRL's effectiveness in high-dimensional state spaces where algorithms that rely on visitation counts become impractical.
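To make the overall procedure concrete, the following is a minimal tabular sketch of continuing posterior sampling with randomized resampling. The abstract does not spell out the resampling rule, so this sketch uses an independent per-step coin flip as a stand-in for the paper's randomization scheme; the `env_step` interface, the `resample_prob` parameter, the Dirichlet/Gaussian posterior model, and the discounted-value-iteration planner are all illustrative assumptions, not the authors' specification.

import numpy as np

def sample_mdp(rng, trans_counts, rew_sums, rew_counts):
    """Draw one plausible MDP from the posterior.

    Transitions: Dirichlet posterior per (s, a) pair.
    Mean rewards: Gaussian posterior with unit observation noise
    (a modeling assumption made purely for this sketch).
    """
    S, A, _ = trans_counts.shape
    P = np.zeros((S, A, S))
    R = np.zeros((S, A))
    for s in range(S):
        for a in range(A):
            P[s, a] = rng.dirichlet(trans_counts[s, a])
            post_var = 1.0 / (1.0 + rew_counts[s, a])
            post_mean = rew_sums[s, a] * post_var
            R[s, a] = rng.normal(post_mean, np.sqrt(post_var))
    return P, R

def greedy_policy(P, R, gamma=0.99, iters=500):
    """Stand-in planner: discounted value iteration with gamma near 1,
    used here as a proxy for average-reward planning."""
    S, A, _ = P.shape
    V = np.zeros(S)
    for _ in range(iters):
        Q = R + gamma * (P @ V)   # (S, A) action values
        V = Q.max(axis=1)
    return Q.argmax(axis=1)

def continuing_psrl(env_step, S, A, T, resample_prob=0.01, seed=0):
    """Single uninterrupted stream of interaction; no environment resets.

    `env_step(s, a)` is assumed to return (reward, next_state).
    `resample_prob` is a per-step Bernoulli probability standing in for
    the paper's randomization scheme (an assumption for this sketch).
    """
    rng = np.random.default_rng(seed)
    trans_counts = np.ones((S, A, S))   # Dirichlet(1, ..., 1) prior
    rew_sums = np.zeros((S, A))
    rew_counts = np.zeros((S, A))
    policy = rng.integers(A, size=S)
    s = 0
    total_reward = 0.0
    for t in range(T):
        # Randomized resampling: occasionally commit to a fresh
        # posterior sample and re-plan against it.
        if t == 0 or rng.random() < resample_prob:
            P_hat, R_hat = sample_mdp(rng, trans_counts, rew_sums, rew_counts)
            policy = greedy_policy(P_hat, R_hat)
        a = policy[s]
        r, s_next = env_step(s, a)
        # Conjugate posterior updates from the observed transition.
        trans_counts[s, a, s_next] += 1
        rew_sums[s, a] += r
        rew_counts[s, a] += 1
        total_reward += r
        s = s_next
    return total_reward / T

The key design point mirrored here is that the agent does not resample after every step: it holds a sampled model (and its greedy policy) fixed for a randomly determined stretch of interaction, which is what allows the analysis to control how often plans change over the continuing, reset-free interface.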
Wanqiao Xu, Shi Dong, and Benjamin Van Roy. "Posterior Sampling for Continuing Environments." Reinforcement Learning Journal, vol. 5, 2024, pp. 2107–2122.
BibTeX:
@article{xu2024posterior,
  title   = {Posterior Sampling for Continuing Environments},
  author  = {Xu, Wanqiao and Dong, Shi and Van Roy, Benjamin},
  journal = {Reinforcement Learning Journal},
  volume  = {5},
  pages   = {2107--2122},
  year    = {2024}
}