Reinforcement Learning Journal, vol. TBD, 2025, pp. TBD.
Presented at the Reinforcement Learning Conference (RLC), Edmonton, Alberta, Canada, August 5–9, 2025.
In the traditional view of reinforcement learning, the agent's goal is to find an optimal policy that maximizes its expected sum of rewards. Once the agent finds this policy, learning ends. This view contrasts with *continual reinforcement learning*, where learning does not end and agents are expected to continually learn and adapt indefinitely. Despite the clear distinction between these two learning paradigms, much of the progress in continual reinforcement learning has been shaped by foundations rooted in the traditional view of reinforcement learning. In this paper, we first examine whether the foundations of traditional reinforcement learning are suitable for the continual reinforcement learning paradigm. We identify four key pillars of these foundations that are antithetical to the goals of continual learning: the Markov decision process formalism, the focus on atemporal artifacts, the expected sum of rewards as an evaluation metric, and episodic benchmark environments that embrace the other three foundations. We then propose a new formalism that sheds the first and third of these foundations, replacing them with the history process as a mathematical formalism and a new definition of deviation regret, adapted for continual learning, as an evaluation metric. Finally, we discuss possible approaches to shedding the other two foundations.
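For reference, the "expected sum of rewards" criterion named as the third pillar is usually written as the expected discounted return; the display below is the textbook formulation in common notation ($\pi$, $\gamma$, $R_{t+1}$, $J$), not an equation reproduced from the paper:

\[
J(\pi) \;=\; \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t} R_{t+1}\right],
\qquad
\pi^{*} \in \operatorname*{arg\,max}_{\pi} J(\pi),
\qquad \gamma \in [0,1).
\]

Once $\pi^{*}$ is found, this objective asks nothing further of the agent, which is the sense in which "learning ends"; the paper proposes deviation regret, defined in the full text, as the continual-learning replacement for this criterion.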
Esraa Elelimy, David Szepesvari, Martha White, and Michael Bowling. "Rethinking the Foundations for Continual Reinforcement Learning." Reinforcement Learning Journal, vol. TBD, 2025, pp. TBD.
BibTeX:
@article{elelimy2025rethinking,
    title={Rethinking the Foundations for Continual Reinforcement Learning},
    author={Elelimy, Esraa and Szepesvari, David and White, Martha and Bowling, Michael},
    journal={Reinforcement Learning Journal},
    year={2025}
}