Policy Gradient Algorithms with Monte Carlo Tree Learning for Non-Markov Decision Processes

By Tetsuro Morimura, Kazuhiro Ota, Kenshi Abe, and Peinan Zhang

Reinforcement Learning Journal, vol. 3, 2024, pp. 1351–1376.

Presented at the Reinforcement Learning Conference (RLC), Amherst, Massachusetts, August 9–12, 2024.


Abstract:

Policy gradient (PG) is a reinforcement learning (RL) approach that optimizes a parameterized policy with respect to the expected return using gradient ascent. While PG can work well even in non-Markovian environments, it can suffer from plateaus or peakiness issues. Another successful RL approach is the family of algorithms based on Monte Carlo Tree Search (MCTS), including AlphaZero, which have obtained groundbreaking results, especially in game-playing domains. They are also effective when applied to non-Markov decision processes. However, standard MCTS is a method for decision-time planning, which differs from the online RL setting. In this work, we first introduce Monte Carlo Tree Learning (MCTL), an adaptation of MCTS to online RL setups. We then explore a combined policy approach that leverages the strengths of both PG and MCTL. We derive conditions for asymptotic convergence using results from two-timescale stochastic approximation and propose an algorithm that satisfies these conditions and converges to a reasonable solution. Our numerical experiments validate the effectiveness of the proposed methods.
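
To make the high-level idea concrete, below is a minimal Python sketch of one way a policy-gradient policy and an online, tree-based policy could be mixed and updated on two timescales. It is an illustration under stated assumptions, not the authors' algorithm: the toy history-dependent task, the softmax-over-Q tree policy, the mixture weight MIX, and the step sizes are all hypothetical choices introduced only for this sketch.

import numpy as np

rng = np.random.default_rng(0)

# Toy history-dependent (non-Markov) task, assumed only for illustration:
# two binary actions are chosen in sequence, and the terminal reward depends
# on the full action history, so a tree over histories is a natural policy.
HORIZON = 2
N_ACTIONS = 2

def terminal_reward(history):
    return 1.0 if history == (1, 1) else (0.1 if history == (1, 0) else 0.0)

theta = {}    # logits of the parameterized (PG) policy, indexed by history
tree_q = {}   # MCTL-style value estimates per (history, action) tree node
tree_n = {}   # visit counts per tree node

def node(table, history, default):
    return table.setdefault(history, np.full(N_ACTIONS, default, dtype=float))

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def mixed_policy(history, mix):
    # Mixture of the PG softmax policy and a softmax over tree values.
    # The mixture rule and the temperature 0.1 are illustrative choices.
    pi_pg = softmax(node(theta, history, 0.0))
    pi_tree = softmax(node(tree_q, history, 0.0) / 0.1)
    return (1.0 - mix) * pi_pg + mix * pi_tree

def run_episode(mix):
    history, visited = (), []
    for _ in range(HORIZON):
        p = mixed_policy(history, mix)
        a = int(rng.choice(N_ACTIONS, p=p))
        visited.append((history, a))
        history = history + (a,)
    return visited, terminal_reward(history)

ALPHA_PG = 0.05   # slow timescale: policy-gradient step size (heuristic choice)
MIX = 0.5         # weight of the tree policy in the mixture (heuristic choice)

for episode in range(3000):
    visited, G = run_episode(MIX)
    for h, a in visited:
        # Fast timescale: running average of sampled returns at each visited
        # tree node, i.e. tree statistics are learned online from rollouts
        # rather than by decision-time planning.
        n = node(tree_n, h, 0.0)
        q = node(tree_q, h, 0.0)
        n[a] += 1.0
        q[a] += (G - q[a]) / n[a]
        # Slow timescale: REINFORCE-style update of the PG logits at the node.
        logits = node(theta, h, 0.0)
        pi = softmax(logits)
        grad_log = -pi
        grad_log[a] += 1.0
        logits += ALPHA_PG * G * grad_log

print("root action probabilities after training:", mixed_policy((), MIX))

On this toy task the mixture concentrates on the rewarding action sequence; the separation of a fast averaging update for the tree values and a slower ascent step for the policy parameters is meant only to echo the two-timescale structure mentioned in the abstract.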


Citation Information:

Tetsuro Morimura, Kazuhiro Ota, Kenshi Abe, and Peinan Zhang. "Policy Gradient Algorithms with Monte Carlo Tree Learning for Non-Markov Decision Processes." Reinforcement Learning Journal, vol. 3, 2024, pp. 1351–1376.

BibTeX:

@article{morimura2024policy,
    title={Policy Gradient Algorithms with {Monte} {Carlo} Tree Learning for Non-{Markov} Decision Processes},
    author={Morimura, Tetsuro and Ota, Kazuhiro and Abe, Kenshi and Zhang, Peinan},
    journal={Reinforcement Learning Journal},
    volume={3},
    pages={1351--1376},
    year={2024}
}