Reinforcement Learning Journal, vol. 2, 2024, pp. 864–883.
Presented at the Reinforcement Learning Conference (RLC), Amherst, Massachusetts, August 9–12, 2024.
Policy gradient methods form the basis for many successful reinforcement learning algorithms, but their success depends heavily on selecting an appropriate step size and many other hyperparameters. While many adaptive step size methods exist, none are both free of hyperparameter tuning and able to converge quickly to an optimal policy. It is unclear why these methods are insufficient, so we aim to uncover what needs to be addressed to make an effective adaptive step size for policy gradient methods. Through extensive empirical investigation, we find that when the step size is above optimal, the policy overcommits to sub-optimal actions, leading to longer training times. These findings suggest the need for a new kind of policy optimization that can prevent or recover from entropy collapses.
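To make the overcommitment phenomenon concrete, the sketch below is a minimal illustration, not the paper's experimental setup: it trains a softmax policy with REINFORCE on a three-armed bandit and compares the final policy and its entropy under a small versus a large step size. The bandit, reward means, step sizes, and seed are all illustrative assumptions; with a large step size, a few noisy early updates can collapse the policy's entropy onto a sub-optimal arm.

import numpy as np

def softmax(theta):
    z = theta - theta.max()          # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()

def entropy(p):
    return -np.sum(p * np.log(p + 1e-12))

def run(step_size, steps=500, seed=0):
    rng = np.random.default_rng(seed)
    mean_rewards = np.array([1.0, 0.8, 0.0])  # arm 0 is optimal (illustrative values)
    theta = np.zeros(3)                       # softmax-policy parameters
    for _ in range(steps):
        p = softmax(theta)
        a = rng.choice(3, p=p)
        r = mean_rewards[a] + rng.normal(scale=1.0)
        # Softmax score function: grad log pi(a) = one_hot(a) - pi
        grad_log_pi = -p
        grad_log_pi[a] += 1.0
        theta += step_size * r * grad_log_pi  # vanilla REINFORCE update
    return softmax(theta)

for alpha in (0.05, 5.0):
    p = run(alpha)
    print(f"step size {alpha}: final policy {np.round(p, 3)}, entropy {entropy(p):.3f}")

Under the small step size the policy typically retains some entropy and concentrates on the optimal arm, while the large step size often drives the policy nearly deterministic early on, sometimes onto a sub-optimal arm, mirroring the overcommitment behavior described in the abstract.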
Scott M Jordan, Samuel Neumann, James E Kostas, Adam White, and Philip S Thomas. "The Cliff of Overcommitment with Policy Gradient Step Sizes." Reinforcement Learning Journal, vol. 2, 2024, pp. 864–883.
BibTeX:
@article{jordan2024cliff,
  title={The Cliff of Overcommitment with Policy Gradient Step Sizes},
  author={Jordan, Scott M. and Neumann, Samuel and Kostas, James E. and White, Adam and Thomas, Philip S.},
  journal={Reinforcement Learning Journal},
  volume={2},
  pages={864--883},
  year={2024}
}