Reinforcement Learning Journal, vol. TBD, 2025, pp. TBD.
Presented at the Reinforcement Learning Conference (RLC), Edmonton, Alberta, Canada, August 5–9, 2025.
For the non-stationary multi-armed bandit (MAB) problem, many existing methods allow a general mechanism for the non-stationarity, but rely on a non-stationarity budget that is sub-linear in the total number of time steps $T$. In many real-world settings, however, the mechanism for the non-stationarity can be modeled, but there is no budget for the non-stationarity. We instead consider the non-stationary bandit problem where the reward means change due to a latent, auto-regressive (AR) state. We develop Latent AR LinUCB (LARL), an online linear contextual bandit algorithm that does not rely on a non-stationarity budget, but instead forms predictions of reward means by implicitly predicting the latent state. The key idea is to reduce the problem to a linear dynamical system which can be solved as a linear contextual bandit. In fact, LARL approximates a steady-state Kalman filter and efficiently learns system parameters online. We provide an interpretable regret bound for LARL with respect to the level of non-stationarity in the environment. LARL achieves sub-linear regret in this setting if the noise variance of the latent state process is sufficiently small with respect to $T$. Empirically, LARL outperforms various baseline methods in this non-stationary bandit problem.
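To make the setting concrete, below is a minimal sketch of a bandit whose reward means drift with a latent AR(1) state, paired with a LinUCB-style agent whose contextual features are recent rewards so that the linear model can implicitly track the latent state. This is an illustration of the reduction described in the abstract under assumed parameters (AR coefficient `phi`, noise scales, lag length); it is not the paper's exact LARL estimator or its steady-state Kalman filter construction.

```python
import numpy as np

class LatentARBandit:
    """Sketch of a non-stationary bandit whose reward means drift with a
    latent AR(1) state (assumed generative model; the paper's may differ)."""

    def __init__(self, n_arms=3, phi=0.95, state_noise=0.05, obs_noise=0.1, seed=0):
        self.rng = np.random.default_rng(seed)
        self.phi = phi                      # AR(1) coefficient of the latent state
        self.state_noise = state_noise      # std of the latent-state process noise
        self.obs_noise = obs_noise          # std of the reward observation noise
        self.base_means = self.rng.normal(size=n_arms)  # stationary part of each arm's mean
        self.z = 0.0                        # latent AR state

    def step(self, arm):
        # Latent state evolves as an AR(1) process; rewards shift with it.
        self.z = self.phi * self.z + self.rng.normal(scale=self.state_noise)
        mean = self.base_means[arm] + self.z
        return mean + self.rng.normal(scale=self.obs_noise)


class LaggedLinUCB:
    """LinUCB on features built from recent rewards, so the linear model can
    implicitly predict the latent state (illustrative, not the LARL estimator)."""

    def __init__(self, n_arms, n_lags=3, alpha=1.0, lam=1.0):
        self.n_arms, self.n_lags, self.alpha = n_arms, n_lags, alpha
        d = n_lags + 1                      # lagged rewards + intercept
        self.A = [lam * np.eye(d) for _ in range(n_arms)]
        self.b = [np.zeros(d) for _ in range(n_arms)]
        self.history = [0.0] * n_lags       # most recent observed rewards

    def _features(self):
        return np.array([1.0] + self.history[-self.n_lags:])

    def select(self):
        x = self._features()
        ucbs = []
        for A, b in zip(self.A, self.b):
            A_inv = np.linalg.inv(A)
            theta = A_inv @ b
            ucbs.append(theta @ x + self.alpha * np.sqrt(x @ A_inv @ x))
        return int(np.argmax(ucbs))

    def update(self, arm, reward):
        x = self._features()
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x
        self.history = self.history[1:] + [reward]


# Usage: run the sketch for a few thousand steps.
env = LatentARBandit()
agent = LaggedLinUCB(n_arms=3)
total = 0.0
for t in range(5000):
    a = agent.select()
    r = env.step(a)
    agent.update(a, r)
    total += r
print(f"average reward: {total / 5000:.3f}")
```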
Anna L Trella, Walter H Dempsey, Asim Gazi, Ziping Xu, Finale Doshi-Velez, and Susan Murphy. "Non-Stationary Latent Auto-Regressive Bandits." Reinforcement Learning Journal, vol. TBD, 2025, pp. TBD.
BibTeX:
@article{trella2025stationary,
title={Non-Stationary Latent Auto-Regressive Bandits},
author={Trella, Anna L. and Dempsey, Walter H. and Gazi, Asim and Xu, Ziping and Doshi-Velez, Finale and Murphy, Susan},
journal={Reinforcement Learning Journal},
year={2025}
}