Mitigating the Curse of Horizon in Monte-Carlo Returns

By Alex Ayoub, David Szepesvari, Francesco Zanini, Bryan Chan, Dhawal Gupta, Bruno Castro da Silva, and Dale Schuurmans

Reinforcement Learning Journal, vol. 2, 2024, pp. 563–572.

Presented at the Reinforcement Learning Conference (RLC), Amherst, Massachusetts, August 9–12, 2024.



Abstract:

The standard framework in reinforcement learning (RL) dictates that an agent should use every observation collected from interactions with the environment when updating its value estimates. As this sequence of observations becomes longer, the agent is afflicted with the curse of horizon since the computational cost of its updates scales linearly with the length of the sequence. In this paper, we propose methods to mitigate this curse when computing value estimates with Monte-Carlo methods. This is accomplished by selecting a subsequence of observations on which the value estimates are computed. We empirically demonstrate on standard RL benchmarks that adopting an adaptive sampling scheme outperforms the default uniform sampling procedure.
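
Example (illustrative sketch):

The Python sketch below illustrates the idea described in the abstract: Monte-Carlo returns are computed over a full trajectory, but value estimates are then updated only on a selected subsequence of time steps. The function names and the "adaptive" rule shown here (sampling steps in proportion to return magnitude) are hypothetical stand-ins; the abstract does not specify the paper's actual adaptive sampling scheme.

import numpy as np

def monte_carlo_returns(rewards, gamma=0.99):
    # Discounted return G_t at every time step of one episode.
    returns = np.zeros(len(rewards))
    g = 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns

def select_update_steps(returns, k, adaptive=False, rng=None):
    # Pick k time steps on which to update value estimates.
    # Uniform: every step is equally likely (the default baseline).
    # "Adaptive" here weights steps by |return|; this is only an
    # illustrative placeholder, not the scheme proposed in the paper.
    rng = rng or np.random.default_rng()
    T = len(returns)
    k = min(k, T)
    if adaptive:
        weights = np.abs(returns) + 1e-8
        probs = weights / weights.sum()
        return rng.choice(T, size=k, replace=False, p=probs)
    return rng.choice(T, size=k, replace=False)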


Citation Information:

Alex Ayoub, David Szepesvari, Francesco Zanini, Bryan Chan, Dhawal Gupta, Bruno Castro da Silva, and Dale Schuurmans. "Mitigating the Curse of Horizon in Monte-Carlo Returns." Reinforcement Learning Journal, vol. 2, 2024, pp. 563–572.

BibTeX:

@article{ayoub2024mitigating,
    title={Mitigating the Curse of Horizon in {Monte}-{Carlo} Returns},
    author={Ayoub, Alex and Szepesvari, David and Zanini, Francesco and Chan, Bryan and Gupta, Dhawal and Silva, Bruno Castro da and Schuurmans, Dale},
    journal={Reinforcement Learning Journal},
    volume={2},
    pages={563--572},
    year={2024}
}