RLJ 2025: Volume 6
You can download this entire issue as one large (320 MB) PDF here: link. DOI: https://doi.org/10.5281/zenodo.13899776. Below are links to the individual papers.
Download Cover Pages (9 MB PDF)
Reinforcement Learning for Finite Space Mean-Field Type Game, by Kai Shao, Jiacheng Shen, and Mathieu Lauriere.
Understanding Behavioral Metric Learning: A Large-Scale Study on Distracting Reinforcement Learning Environments, by Ziyan Luo, Tianwei Ni, Pierre-Luc Bacon, Doina Precup, and Xujie Si.
Which Experiences Are Influential for RL Agents? Efficiently Estimating The Influence of Experiences, by Takuya Hiraoka, Takashi Onishi, Guanquan Wang, and Yoshimasa Tsuruoka.
Online Intrinsic Rewards for Decision Making Agents from Large Language Model Feedback, by Qinqing Zheng, Mikael Henaff, Amy Zhang, Aditya Grover, and Brandon Amos.
A Finite-Time Analysis of Distributed Q-Learning, by Han-Dong Lim and Donghwan Lee.
Finite-Time Analysis of Minimax Q-Learning, by Narim Jeong and Donghwan Lee.
Collaboration Promotes Group Resilience in Multi-Agent RL, by Ilai Shraga, Guy Azran, Matthias Gerstgrasser, Ofir Abu, Jeffrey Rosenschein, and Sarah Keren.
Bayesian Meta-Reinforcement Learning with Laplace Variational Recurrent Networks, by Joery A. de Vries, Jinke He, Mathijs de Weerdt, and Matthijs T. J. Spaan.
Foundation Model Self-Play: Open-Ended Strategy Innovation via Foundation Models, by Aaron Dharna, Cong Lu, and Jeff Clune.
Action Mapping for Reinforcement Learning in Continuous Environments with Constraints, by Mirco Theile, Lukas Dirnberger, Raphael Trumpp, Marco Caccamo, and Alberto Sangiovanni-Vincentelli.
Chargax: A JAX Accelerated EV Charging Simulator, by Koen Ponse, Jan Felix Kleuker, Thomas M. Moerland, and Aske Plaat.
Effect of a slowdown correlated to the current state of the environment on an asynchronous learning architecture, by Idriss Abdallah, Laurent Ciarletta, Patrick Henaff, Jonathan Champagne, and Matthieu Bonavent.
Cascade - A sequential ensemble method for continuous control tasks, by Robin Schmöcker and Alexander Dockhorn.
Average-Reward Soft Actor-Critic, by Jacob Adamczyk, Volodymyr Makarenko, Stas Tiomkin, and Rahul V Kulkarni.
Burning RED: Unlocking Subtask-Driven Reinforcement Learning and Risk-Awareness in Average-Reward Markov Decision Processes, by Juan Sebastian Rojas and Chi-Guhn Lee.
Your Learned Constraint is Secretly a Backward Reachable Tube, by Mohamad Qadri, Gokul Swamy, Jonathan Francis, Michael Kaess, and Andrea Bajcsy.
Improved Regret Bound for Safe Reinforcement Learning via Tighter Cost Pessimism and Reward Optimism, by Kihyun Yu, Duksang Lee, William Overman, and Dabeen Lee.
Offline vs. Online Learning in Model-based RL: Lessons for Data Collection Strategies, by Jiaqi Chen, Ji Shi, Cansu Sancaktar, Jonas Frey, and Georg Martius.
Uncertainty Prioritized Experience Replay, by Rodrigo Antonio Carrasco-Davis, Sebastian Lee, Claudia Clopath, and Will Dabney.
RL$^3$: Boosting Meta Reinforcement Learning via RL inside RL$^2$, by Abhinav Bhatia, Samer B. Nashed, and Shlomo Zilberstein.
Pareto Optimal Learning from Preferences with Hidden Context, by Ryan Bahlous-Boldi, Li Ding, Lee Spector, and Scott Niekum.
WOFOSTGym: A Crop Simulator for Learning Annual and Perennial Crop Management Strategies, by William Solow, Sandhya Saisubramanian, and Alan Fern.
When and Why Hyperbolic Discounting Matters for Reinforcement Learning Interventions, by Ian M. Moore, Eura Nofshin, Siddharth Swaroop, Susan Murphy, Finale Doshi-Velez, and Weiwei Pan.
Reinforcement Learning from Human Feedback with High-Confidence Safety Guarantees, by Yaswanth Chittepu, Blossom Metevier, Will Schwarzer, Austin Hoag, Scott Niekum, and Philip S. Thomas.
AVID: Adapting Video Diffusion Models to World Models, by Marc Rigter, Tarun Gupta, Agrin Hilmkil, and Chao Ma.
Non-Stationary Latent Auto-Regressive Bandits, by Anna L. Trella, Walter H. Dempsey, Asim Gazi, Ziping Xu, Finale Doshi-Velez, and Susan Murphy.
Hierarchical Multi-agent Reinforcement Learning for Cyber Network Defense, by Aditya Vikram Singh, Ethan Rathbun, Emma Graham, Lisa Oakley, Simona Boboila, Peter Chin, and Alina Oprea.
The Confusing Instance Principle for Online Linear Quadratic Control, by Waris Radji and Odalric-Ambrym Maillard.
Drive Fast, Learn Faster: On-Board RL for High Performance Autonomous Racing, by Benedict Hildisch, Edoardo Ghignone, Nicolas Baumann, Cheng Hu, Andrea Carron, and Michele Magno.
Towards Large Language Models that Benefit for All: Benchmarking Group Fairness in Reward Models, by Kefan Song, Jin Yao, Runnan Jiang, Rohan Chandra, and Shangtong Zhang.
Pure Exploration for Constrained Best Mixed Arm Identification with a Fixed Budget, by Dengwang Tang, Rahul Jain, Ashutosh Nayyar, and Pierluigi Nuzzo.
Quantitative Resilience Modeling for Autonomous Cyber Defense, by Xavier Cadet, Simona Boboila, Edward Koh, Peter Chin, and Alina Oprea.
Efficient Information Sharing for Training Decentralized Multi-Agent World Models, by Xiaoling Zeng and Qi Zhang.
Recursive Reward Aggregation, by Yuting Tang, Yivan Zhang, Johannes Ackermann, Yu-Jie Zhang, Soichiro Nishimori, and Masashi Sugiyama.
A Finite-Sample Analysis of an Actor-Critic Algorithm for Mean-Variance Optimization in a Discounted MDP, by Tejaram Sangadi, Prashanth L. A., and Krishna Jagannathan.
Impoola: The Power of Average Pooling for Image-based Deep Reinforcement Learning, by Raphael Trumpp, Ansgar Schäfftlein, Mirco Theile, and Marco Caccamo.
Fast Adaptation with Behavioral Foundation Models, by Harshit Sikchi, Andrea Tirinzoni, Ahmed Touati, Yingchen Xu, Anssi Kanervisto, Scott Niekum, Amy Zhang, Alessandro Lazaric, and Matteo Pirotta.
Multi-Task Reinforcement Learning Enables Parameter Scaling, by Reginald McLean, Evangelos Chatzaroulas, J K Terry, Isaac Woungang, Nariman Farsad, and Pablo Samuel Castro.
Eau De $Q$-Network: Adaptive Distillation of Neural Networks in Deep Reinforcement Learning, by Théo Vincent, Tim Faust, Yogesh Tripathi, Jan Peters, and Carlo D'Eramo.
Disentangling Recognition and Decision Regrets in Image-Based Reinforcement Learning, by Alihan Hüyük, Arndt Ryo Koblitz, Atefeh Mohajeri Moghaddam, and Matthew Andrews.
Learning to Explore in Diverse Reward Settings via Temporal-Difference-Error Maximization, by Sebastian Griesbach and Carlo D'Eramo.
Nonparametric Policy Improvement in Continuous Action Spaces via Expert Demonstrations, by Agustin Castellano, Sohrab Rezaei, Jared Markowitz, and Enrique Mallada.
DisDP: Robust Imitation Learning via Disentangled Diffusion Policies, by Pankhuri Vanjani, Paul Mattes, Xiaogang Jia, Vedant Dave, and Rudolf Lioutikov.
Mitigating Goal Misgeneralization via Minimax Regret, by Karim Abdel Sadek, Matthew Farrugia-Roberts, Usman Anwar, Hannah Erlebach, Christian Schroeder de Witt, David Krueger, and Michael D Dennis.
Long-Horizon Planning with Predictable Skills, by Nico Gürtler and Georg Martius.
HANQ: Hypergradients, Asymmetry, and Normalization for Fast and Stable Deep $Q$-Learning, by Braham Snyder and Chen-Yu Wei.
Benchmarking Massively Parallelized Multi-Task Reinforcement Learning for Robotics Tasks, by Viraj Joshi, Zifan Xu, Bo Liu, Peter Stone, and Amy Zhang.
Optimal discounting for offline input-driven MDP, by Randy Lefebvre and Audrey Durand.
Make the Pertinent Salient: Task-Relevant Reconstruction for Visual Control with Distractions, by Kyungmin Kim, JB Lanier, and Roy Fox.
Reinforcement Learning for Human-AI Collaboration via Probabilistic Intent Inference, by Yuxin Lin, Seyede Fatemeh Ghoreishi, Tian Lan, and Mahdi Imani.
PufferLib 2.0: Reinforcement Learning at 1M steps/s, by Joseph Suarez.
Uncovering RL Integration in SSL Loss: Objective-Specific Implications for Data-Efficient RL, by Ömer Veysel Çağatan and Baris Akgun.
Benchmarking Partial Observability in Reinforcement Learning with a Suite of Memory-Improvable Domains, by Ruo Yu Tao, Kaicheng Guo, Cameron Allen, and George Konidaris.
Rectifying Regression in Reinforcement Learning, by Alex Ayoub, David Szepesvari, Alireza Bakhtiari, Csaba Szepesvari, and Dale Schuurmans.
High-Confidence Policy Improvement from Human Feedback, by Hon Tik Tse, Philip S. Thomas, and Scott Niekum.
Adaptive Reward Sharing to Enhance Learning in the Context of Multiagent Teams, by Kyle Tilbury and David Radke.
MixUCB: Enhancing Safe Exploration in Contextual Bandits with Human Oversight, by Jinyan Su, Rohan Banerjee, Jiankai Sun, Wen Sun, and Sarah Dean.
Efficient Morphology-Aware Policy Transfer to New Embodiments, by Michael Przystupa, Hongyao Tang, Glen Berseth, Mariano Phielipp, Santiago Miret, Martin Jägersand, and Matthew E. Taylor.
Understanding Learned Representations and Action Collapse in Visual Reinforcement Learning, by Xi Chen, Zhihui Zhu, and Andrew Perrault.
Mitigating Suboptimality of Deterministic Policy Gradients in Complex Q-functions, by Ayush Jain, Norio Kosaka, Xinhu Li, Kyung-Min Kim, Erdem Biyik, and Joseph J Lim.
Leveraging priors on distribution functions for multi-arm bandits, by Sumit Vashishtha and Odalric-Ambrym Maillard.
ProtoCRL: Prototype-based Network for Continual Reinforcement Learning, by Michela Proietti, Peter R. Wurman, Peter Stone, and Roberto Capobianco.
Finer Behavioral Foundation Models via Auto-Regressive Features and Advantage Weighting, by Edoardo Cetin, Ahmed Touati, and Yann Ollivier.
Pretraining Decision Transformers with Reward Prediction for In-Context Multi-task Structured Bandit Learning, by Subhojyoti Mukherjee, Josiah P. Hanna, Qiaomin Xie, and Robert D Nowak.
Multi-task Representation Learning for Fixed Budget Pure-Exploration in Linear and Bilinear Bandits, by Subhojyoti Mukherjee, Qiaomin Xie, and Robert D Nowak.
Offline Reinforcement Learning with Domain-Unlabeled Data, by Soichiro Nishimori, Xin-Qiang Cai, Johannes Ackermann, and Masashi Sugiyama.
Multi-Agent Reinforcement Learning for Inverse Design in Photonic Integrated Circuits, by Yannik Mahlau, Maximilian Schier, Christoph Reinders, Frederik Schubert, Marco Bügling, and Bodo Rosenhahn.
Syllabus: Portable Curricula for Reinforcement Learning Agents, by Ryan Sullivan, Ryan Pégoud, Ameen Ur Rehman, Xinchen Yang, Junyun Huang, Aayush Verma, Nistha Mitra, and John P Dickerson.
Exploration-Free Reinforcement Learning with Linear Function Approximation, by Luca Civitavecchia and Matteo Papini.
SPEQ: Offline Stabilization Phases for Efficient Q-Learning in High Update-To-Data Ratio Reinforcement Learning, by Carlo Romeo, Girolamo Macaluso, Alessandro Sestini, and Andrew D. Bagdanov.
Value Bonuses using Ensemble Errors for Exploration in Reinforcement Learning, by Abdul Wahab, Raksha Kumaraswamy, and Martha White.
Gaussian Process Q-Learning for Finite-Horizon Markov Decision Processes, by Maximilian Bloor, Tom Savage, Calvin Tsay, Antonio Del Rio Chanona, and Max Mowbray.
On the Effect of Regularization in Policy Mirror Descent, by Jan Felix Kleuker, Aske Plaat, and Thomas M. Moerland.
Concept-Based Off-Policy Evaluation, by Ritam Majumdar, Jack Teversham, and Sonali Parbhoo.
Investigating the Utility of Mirror Descent in Off-policy Actor-Critic, by Samuel Neumann, Jiamin He, Adam White, and Martha White.
Hybrid Classical/RL Local Planner for Ground Robot Navigation, by Vishnu Dutt Sharma, Jeongran Lee, Matthew Andrews, and Ilija Hadžić.
How Should We Meta-Learn Reinforcement Learning Algorithms?, by Alexander David Goldie, Zilin Wang, Jaron Cohen, Jakob Nicolaus Foerster, and Shimon Whiteson.
Seldonian Reinforcement Learning for Ad Hoc Teamwork, by Edoardo Zorzi, Alberto Castellini, Leonidas Bakopoulos, Georgios Chalkiadakis, and Alessandro Farinelli.
Offline Reinforcement Learning with Wasserstein Regularization via Optimal Transport Maps, by Motoki Omura, Yusuke Mukuta, Kazuki Ota, Takayuki Osa, and Tatsuya Harada.
Intrinsically Motivated Discovery of Temporally Abstract Graph-based Models of the World, by Akhil Bagaria, Anita De Mello Koch, Rafael Rodriguez-Sanchez, Sam Lobel, and George Konidaris.
An Optimisation Framework for Unsupervised Environment Design, by Nathan Monette, Alistair Letcher, Michael Beukman, Matthew Thomas Jackson, Alexander Rutherford, Alexander David Goldie, and Jakob Nicolaus Foerster.
Epistemically-guided forward-backward exploration, by Núria Armengol Urpí, Marin Vlastelica, Georg Martius, and Stelian Coros.
Rethinking the Foundations for Continual Reinforcement Learning, by Esraa Elelimy, David Szepesvari, Martha White, and Michael Bowling.
Modelling human exploration with light-weight meta reinforcement learning algorithms, by Thomas D. Ferguson, Alona Fyshe, and Adam White.
Zero-Shot Reinforcement Learning Under Partial Observability, by Scott Jeen, Tom Bewley, and Jonathan Cullen.
Building Sequential Resource Allocation Mechanisms without Payments, by Sihan Zeng, Sujay Bhatt, Alec Koppel, and Sumitra Ganesh.
From Explainability to Interpretability: Interpretable Reinforcement Learning Via Model Explanations, by Peilang Li, Umer Siddique, and Yongcan Cao.
Joint-Local Grounded Action Transformation for Sim-to-Real Transfer in Multi-Agent Traffic Control, by Justin Turnau, Longchao Da, Khoa Vo, Ferdous Al Rafi, Shreyas Bachiraju, Tiejin Chen, and Hua Wei.
Sampling from Energy-based Policies using Diffusion, by Vineet Jain, Tara Akhound-Sadegh, and Siamak Ravanbakhsh.
Multiple-Frequencies Population-Based Training, by Waël Doulazmi, Auguste Lehuger, Marin Toromanoff, Valentin Charraut, Thibault Buhet, and Fabien Moutarde.
TransAM: Transformer-Based Agent Modeling for Multi-Agent Systems via Local Trajectory Encoding, by Conor Wallace, Umer Siddique, and Yongcan Cao.
Towards Improving Reward Design in RL: A Reward Alignment Metric for RL Practitioners, by Calarina Muslimani, Kerrick Johnstonbaugh, Suyog Chandramouli, Serena Booth, W. Bradley Knox, and Matthew E. Taylor.
Optimistic critics can empower small actors, by Olya Mastikhina, Dhruv Sreenivas, and Pablo Samuel Castro.
PAC Apprenticeship Learning with Bayesian Active Inverse Reinforcement Learning, by Ondrej Bajgar, Dewi Sid William Gould, Jonathon Liu, Alessandro Abate, Konstantinos Gatsis, and Michael A Osborne.
AVG-DICE: Stationary Distribution Correction by Regression, by Fengdi Che, Bryan Chan, Chen Ma, and A. Rupam Mahmood.
V-Max: A RL Framework for Autonomous Driving, by Valentin Charraut, Waël Doulazmi, Thomas Tournaire, and Thibault Buhet.
Offline Action-Free Learning of Ex-BMDPs by Comparing Diverse Datasets, by Alexander Levine, Peter Stone, and Amy Zhang.
One Goal, Many Challenges: Robust Preference Optimization Amid Content-Aware, Multi-Source Noise, by Amirabbas Afzali, Amirhossein Afsharrad, Seyed Shahabeddin Mousavi, and Sanjay Lall.
A Timer-Based Hybrid Supervisor for Robust, Chatter-Free Policy Switching, by Jan de Priester and Ricardo Sanfelice.
Deep Reinforcement Learning with Gradient Eligibility Traces, by Esraa Elelimy, Brett Daley, Andrew Patterson, Marlos C. Machado, Adam White, and Martha White.
On Slowly-varying Non-stationary Bandits, by Ramakrishnan K and Aditya Gopalan.
Focused Skill Discovery: Learning to Control Specific State Variables while Minimizing Side Effects, by Jonathan Colaço Carr, Qinyi Sun, and Cameron Allen.
Goals vs. Rewards: A Preliminary Comparative Study of Objective Specification Mechanisms, by Septia Rani, Serena Booth, and Sarath Sreedharan.
An Analysis of Action-Value Temporal-Difference Methods That Learn State Values, by Brett Daley, Prabhat Nagarajan, Martha White, and Marlos C. Machado.
PEnGUiN: Partially Equivariant Graph NeUral Networks for Sample Efficient MARL, by Joshua McClellan, Greyson Brothers, Furong Huang, and Pratap Tokekar.
Shaping Laser Pulses with Reinforcement Learning, by Francesco Capuano, Davorin Peceli, and Gabriele Tiboni.
Reinforcement Learning with Adaptive Temporal Discounting, by Sahaj Singh Maini and Zoran Tiganj.
Human-Level Competitive Pokémon via Scalable Offline Reinforcement Learning with Transformers, by Jake Grigsby, Yuqi Xie, Justin Sasek, Steven Zheng, and Yuke Zhu.
Adaptive Submodular Policy Optimization, by Branislav Kveton, Anup Rao, Viet Dac Lai, Nikos Vlassis, and David Arbour.
Learning Fair Pareto-Optimal Policies in Multi-Objective Reinforcement Learning, by Umer Siddique, Peilang Li, and Yongcan Cao.
Representation Learning and Skill Discovery with Empowerment, by Andrew Levy, Alessandro G Allievi, and George Konidaris.
Empirical Bound Information-Directed Sampling for Norm-Agnostic Bandits, by Piotr M. Suder and Eric Laber.
Thompson Sampling for Constrained Bandits, by Rohan Deb, Mohammad Ghavamzadeh, and Arindam Banerjee.
AI in a vat: Fundamental limits of efficient world modelling for agent sandboxing and interpretability, by Fernando Rosas, Alexander Boyd, and Manuel Baltieri.
Achieving Limited Adaptivity for Multinomial Logistic Bandits, by Sukruta Prakash Midigeshi, Tanmay Goyal, and Gaurav Sinha.