Reinforcement Learning Journal, vol. TBD, 2025, pp. TBD.
Presented at the Reinforcement Learning Conference (RLC), Edmonton, Alberta, Canada, August 5–9, 2025.
Many policy gradient methods prevent drastic changes to policies during learning. This is commonly achieved through a Kullback-Leibler (KL) divergence term. Recent work has established a theoretical connection between this heuristic and Mirror Descent (MD), offering insight into the empirical successes of existing policy gradient and actor-critic algorithms. This insight has further motivated the development of novel algorithms that better adhere to the principles of MD, alongside a growing body of theoretical research on policy mirror descent. In this study, we examine the empirical feasibility of MD-based policy updates in off-policy actor-critic. Specifically, we introduce principled MD adaptations of three widely used actor-critic algorithms and systematically evaluate their empirical effectiveness. Our findings indicate that, while MD-style policy updates do not seem to exhibit significant practical advantages over conventional approaches to off-policy actor-critic, they can somewhat mitigate sensitivity to step size selection with widely used deep-learning optimizers.
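For context, the following is a minimal illustrative sketch (not the paper's specific formulation) of the KL-regularized proximal update that policy mirror descent yields when the negative entropy is used as the mirror map; the notation here is assumed for illustration only, with η a step size, d a state-weighting distribution, and Q^{π_k} the action-value function of the current policy π_k:

\[
\pi_{k+1} \;=\; \arg\max_{\pi}\; \mathbb{E}_{s \sim d,\, a \sim \pi(\cdot \mid s)}\!\left[ Q^{\pi_k}(s, a) \right] \;-\; \frac{1}{\eta}\, \mathbb{E}_{s \sim d}\!\left[ \mathrm{KL}\!\left( \pi(\cdot \mid s) \,\middle\|\, \pi_k(\cdot \mid s) \right) \right]
\]

For small η the KL term keeps π_{k+1} close to π_k, which is the constraint-on-policy-change behaviour that the KL penalty in existing policy gradient methods is meant to approximate.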
Samuel Neumann, Jiamin He, Adam White, and Martha White. "Investigating the Utility of Mirror Descent in Off-policy Actor-Critic." Reinforcement Learning Journal, vol. TBD, 2025, pp. TBD.
BibTeX:
@article{neumann2025investigating,
    title={Investigating the Utility of Mirror Descent in Off-policy Actor-Critic},
    author={Neumann, Samuel and He, Jiamin and White, Adam and White, Martha},
    journal={Reinforcement Learning Journal},
    year={2025}
}