Gaussian Process Q-Learning for Finite-Horizon Markov Decision Processes

By Maximilian Bloor, Tom Savage, Calvin Tsay, Antonio Del Rio Chanona, and Max Mowbray

Reinforcement Learning Journal, vol. TBD, 2025, pp. TBD.

Presented at the Reinforcement Learning Conference (RLC), Edmonton, Alberta, Canada, August 5–9, 2025.


Abstract:

Many real-world control and optimization problems require making decisions over a finite time horizon to maximize performance. This paper proposes a reinforcement learning framework that approximately solves the finite-horizon Markov Decision Process (MDP) by combining Gaussian Processes (GPs) with Q-learning. The method addresses two key challenges: the intractability of exact dynamic programming in continuous state-control spaces, and the need for sample-efficient state-action value function approximation in systems where data collection is expensive. Using GPs and backward induction, we construct state-action value function approximations that enable efficient policy learning from limited data. To handle the growing computational burden of GPs as data accumulate across iterations, we propose a subset selection mechanism that uses M-determinantal point processes to draw diverse, high-performing subsets. The proposed method is evaluated on a linear quadratic regulator problem and on online optimization of a non-isothermal semi-batch reactor, showing improved learning efficiency relative to deep Q-networks and to exact GPs built on all available data.
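As a rough illustration of the backward-induction scheme summarized above, the sketch below fits one GP state-action value model per time step, bootstrapping each step's regression targets from the model fitted at the following step. This is a minimal sketch under stated assumptions, not the paper's implementation: the transition-buffer layout, the scikit-learn GP, and the discretized action_grid used for the inner maximization are all simplifications introduced here.

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def fit_gp_q_functions(transitions, horizon, action_grid):
    """Fit one GP Q-model per time step by backward induction.

    transitions[t]: list of (state, action, reward, next_state) tuples
    observed at time step t (states as 1-D arrays, actions as scalars).
    action_grid: 1-D array discretizing the control space for the
    inner maximization (a simplification of the continuous case).
    """
    q_models = [None] * horizon
    for t in reversed(range(horizon)):
        X, y = [], []
        for s, a, r, s_next in transitions[t]:
            if t == horizon - 1:
                target = r  # terminal step: no future value
            else:
                # Bootstrap the target from the GP fitted at step t + 1.
                q_next = q_models[t + 1]
                candidates = np.array([np.append(s_next, a_next)
                                       for a_next in action_grid])
                target = r + np.max(q_next.predict(candidates))
            X.append(np.append(s, a))
            y.append(target)
        gp = GaussianProcessRegressor(kernel=RBF(), normalize_y=True)
        gp.fit(np.array(X), np.array(y))
        q_models[t] = gp
    return q_models

A greedy policy at step t then acts by maximizing q_models[t] over action_grid at the current state. The subset-selection step can be sketched in the same spirit: the paper draws diverse, high-performing subsets with M-determinantal point processes, whereas the hypothetical helper below substitutes a simple greedy log-determinant heuristic over a kernel matrix K, which captures the diversity objective but not the stochastic sampling or the performance weighting of the actual mechanism.

def greedy_diverse_subset(K, m):
    """Greedily pick m indices approximately maximizing det(K[S, S]).

    A deterministic stand-in for DPP sampling: each step adds the point
    that most increases the log-determinant of the selected kernel
    submatrix, favouring mutually dissimilar points. K must be a
    positive semi-definite kernel (similarity) matrix.
    """
    n = K.shape[0]
    selected, remaining = [], list(range(n))
    for _ in range(m):
        best_i, best_val = None, -np.inf
        for i in remaining:
            idx = selected + [i]
            sign, logdet = np.linalg.slogdet(K[np.ix_(idx, idx)])
            val = logdet if sign > 0 else -np.inf
            if val > best_val:
                best_i, best_val = i, val
        selected.append(best_i)
        remaining.remove(best_i)
    return selected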


Citation Information:

Maximilian Bloor, Tom Savage, Calvin Tsay, Antonio Del Rio Chanona, and Max Mowbray. "Gaussian Process Q-Learning for Finite-Horizon Markov Decision Processes." Reinforcement Learning Journal, vol. TBD, 2025, pp. TBD.

BibTeX:
@article{bloor2025gaussian,
    title={Gaussian Process {Q-Learning} for Finite-Horizon {Markov} Decision Processes},
    author={Bloor, Maximilian and Savage, Tom and Tsay, Calvin and {Del Rio Chanona}, Antonio and Mowbray, Max},
    journal={Reinforcement Learning Journal},
    year={2025}
}