Sampling from Energy-based Policies using Diffusion

By Vineet Jain, Tara Akhound-Sadegh, and Siamak Ravanbakhsh

Reinforcement Learning Journal, vol. TBD, 2025, pp. TBD.

Presented at the Reinforcement Learning Conference (RLC), Edmonton, Alberta, Canada, August 5–9, 2025.


Abstract:

Energy-based policies offer a flexible framework for modeling complex, multimodal behaviors in reinforcement learning (RL). In maximum entropy RL, the optimal policy is a Boltzmann distribution derived from the soft Q-function, but direct sampling from this distribution in continuous action spaces is computationally intractable. As a result, existing methods typically use simpler parametric distributions, like Gaussians, for policy representation — limiting their ability to capture the full complexity of multimodal action distributions. In this paper, we introduce a diffusion-based approach for sampling from energy-based policies, where the negative Q-function defines the energy function. Based on this approach, we propose an actor-critic method called Diffusion Q-Sampling (DQS) that enables more expressive policy representations, allowing stable learning in diverse environments. We show that our approach enhances sample efficiency in continuous control tasks and captures multimodal behaviors, addressing key limitations of existing methods.
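For context (this sketch is not quoted from the paper), the maximum entropy RL result referenced in the abstract is commonly written as follows, where Q_soft denotes the soft Q-function and alpha the temperature parameter:

    \pi^*(a \mid s) \propto \exp\big( Q_{\text{soft}}(s, a) / \alpha \big),
    \qquad
    E(a; s) = -\, Q_{\text{soft}}(s, a),
    \qquad
    \nabla_a \log \pi^*(a \mid s) = \tfrac{1}{\alpha}\, \nabla_a Q_{\text{soft}}(s, a).

The last identity (which follows because the normalizing constant does not depend on the action) is what makes score-based samplers a natural fit for this setting: the action gradient of the Q-function plays the role of the score of the target Boltzmann policy.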


Citation Information:

Vineet Jain, Tara Akhound-Sadegh, and Siamak Ravanbakhsh. "Sampling from Energy-based Policies using Diffusion." Reinforcement Learning Journal, vol. TBD, 2025, pp. TBD.

BibTeX:
@article{jain2025sampling,
    title={Sampling from Energy-based Policies using Diffusion},
    author={Jain, Vineet and Akhound-Sadegh, Tara and Ravanbakhsh, Siamak},
    journal={Reinforcement Learning Journal},
    year={2025}
}