RLJ 2024: Volumes 1–5
You can download this entire issue as one large (320 MB) PDF here: link. DOI: https://doi.org/10.5281/zenodo.13899776. Below are links to the individual papers.
- Co-Learning Empirical Games & World Models, by Max Olan Smith and Michael P. Wellman.
- Improving Thompson Sampling via Information Relaxation for Budgeted Multi-armed Bandits, by Woojin Jeong and Seungki Min.
- Graph Neural Thompson Sampling, by Shuang Wu and Arash A. Amini.
- JoinGym: An Efficient Join Order Selection Environment, by Junxiong Wang, Kaiwen Wang, Yueying Li, Nathan Kallus, Immanuel Trummer, and Wen Sun.
- An Open-Loop Baseline for Reinforcement Learning Locomotion Tasks, by Antonin Raffin, Olivier Sigaud, Jens Kober, Alin Albu-Schaeffer, João Silvério, and Freek Stulp.
- Online Planning in POMDPs with State-Requests, by Raphaël Avalos, Eugenio Bargiacchi, Ann Nowe, Diederik Roijers, and Frans A Oliehoek.
- A Recipe for Unbounded Data Augmentation in Visual Reinforcement Learning, by Abdulaziz Almuzairee, Nicklas Hansen, and Henrik I Christensen.
- BetaZero: Belief-State Planning for Long-Horizon POMDPs using Learned Approximations, by Robert J. Moss, Anthony Corso, Jef Caers, and Mykel Kochenderfer.
- Non-adaptive Online Finetuning for Offline Reinforcement Learning, by Audrey Huang, Mohammad Ghavamzadeh, Nan Jiang, and Marek Petrik.
- Guided Data Augmentation for Offline Reinforcement Learning and Imitation Learning, by Nicholas E. Corrado, Yuxiao Qu, John U. Balis, Adam Labiosa, and Josiah P. Hanna.
- Towards Principled, Practical Policy Gradient for Bandits and Tabular MDPs, by Michael Lu, Matin Aghaei, Anant Raj, and Sharan Vaswani.
- Unifying Model-Based and Model-Free Reinforcement Learning with Equivalent Policy Sets, by Benjamin Freed, Thomas Wei, Roberto Calandra, Jeff Schneider, and Howie Choset.
- The Role of Inherent Bellman Error in Offline Reinforcement Learning with Linear Function Approximation, by Noah Golowich and Ankur Moitra.
- Learning Action-based Representations Using Invariance, by Max Rudolph, Caleb Chuck, Kevin Black, Misha Lvovsky, Scott Niekum, and Amy Zhang.
- Cyclicity-Regularized Coordination Graphs, by Oliver Järnefelt, Mahdi Kallel, and Carlo D'Eramo.
- Assigning Credit with Partial Reward Decoupling in Multi-Agent Proximal Policy Optimization, by Aditya Kapoor, Benjamin Freed, Jeff Schneider, and Howie Choset.
- OCAtari: Object-Centric Atari 2600 Reinforcement Learning Environments, by Quentin Delfosse, Jannis Blüml, Bjarne Gregori, Sebastian Sztwiertnia, and Kristian Kersting.
- SplAgger: Split Aggregation for Meta-Reinforcement Learning, by Jacob Beck, Matthew Thomas Jackson, Risto Vuorio, Zheng Xiong, and Shimon Whiteson.
- A Tighter Convergence Proof of Reverse Experience Replay, by Nan Jiang, Jinzhao Li, and Yexiang Xue.
- Learning to Optimize for Reinforcement Learning, by Qingfeng Lan, A. Rupam Mahmood, Shuicheng Yan, and Zhongwen Xu.
- Multi-view Disentanglement for Reinforcement Learning with Multiple Cameras, by Mhairi Dunion and Stefano V Albrecht.
- Planning to Go Out-of-Distribution in Offline-to-Online Reinforcement Learning, by Trevor McInroe, Adam Jelley, Stefano V Albrecht, and Amos Storkey.
- Surprise-Adaptive Intrinsic Motivation for Unsupervised Reinforcement Learning, by Adriana Hugessen, Roger Creus Castanyer, Faisal Mohamed, and Glen Berseth.
- Mitigating the Curse of Horizon in Monte-Carlo Returns, by Alex Ayoub, David Szepesvari, Francesco Zanini, Bryan Chan, Dhawal Gupta, Bruno Castro da Silva, and Dale Schuurmans.
- A Simple Mixture Policy Parameterization for Improving Sample Efficiency of CVaR Optimization, by Yudong Luo, Yangchen Pan, Han Wang, Philip Torr, and Pascal Poupart.
- ROIL: Robust Offline Imitation Learning without Trajectories, by Gersi Doko, Guang Yang, Daniel S. Brown, and Marek Petrik.
- Harnessing Discrete Representations for Continual Reinforcement Learning, by Edan Jacob Meyer, Adam White, and Marlos C. Machado.
- Three Dogmas of Reinforcement Learning, by David Abel, Mark K Ho, and Anna Harutyunyan.
- Policy Gradient with Active Importance Sampling, by Matteo Papini, Giorgio Manganini, Alberto Maria Metelli, and Marcello Restelli.
- The Limits of Pure Exploration in POMDPs: When the Observation Entropy is Enough, by Riccardo Zamboni, Duilio Cirino, Marcello Restelli, and Mirco Mutti.
- Physics-Informed Model and Hybrid Planning for Efficient Dyna-Style Reinforcement Learning, by Zakariae El Asri, Olivier Sigaud, and Nicolas Thome.
- Trust-based Consensus in Multi-Agent Reinforcement Learning Systems, by Ho Long Fung, Victor-Alexandru Darvariu, Stephen Hailes, and Mirco Musolesi.
- Bidirectional-Reachable Hierarchical Reinforcement Learning with Mutually Responsive Policies, by Yu Luo, Fuchun Sun, Tianying Ji, and Xianyuan Zhan.
- Informed POMDP: Leveraging Additional Information in Model-Based RL, by Gaspard Lambrechts, Adrien Bolland, and Damien Ernst.
- An Optimal Tightness Bound for the Simulation Lemma, by Sam Lobel and Ronald Parr.
- Best Response Shaping, by Milad Aghajohari, Tim Cooijmans, Juan Agustin Duque, Shunichi Akatsuka, and Aaron Courville.
- A Provably Efficient Option-Based Algorithm for both High-Level and Low-Level Learning, by Gianluca Drappo, Alberto Maria Metelli, and Marcello Restelli.
- SwiftTD: A Fast and Robust Algorithm for Temporal Difference Learning, by Khurram Javed, Arsalan Sharifnassab, and Richard S. Sutton.
- The Cliff of Overcommitment with Policy Gradient Step Sizes, by Scott M. Jordan, Samuel Neumann, James E. Kostas, Adam White, and Philip S. Thomas.
- Multistep Inverse Is Not All You Need, by Alexander Levine, Peter Stone, and Amy Zhang.
- Contextualized Hybrid Ensemble Q-learning: Learning Fast with Control Priors, by Emma Cramer, Bernd Frauenknecht, Ramil Sabirov, and Sebastian Trimpe.
- Sequential Decision-Making for Inline Text Autocomplete, by Rohan Chitnis, Shentao Yang, and Alborz Geramifard.
- Exploring Uncertainty in Distributional Reinforcement Learning, by Georgy Antonov and Peter Dayan.
- Robotic Manipulation Datasets for Offline Compositional Reinforcement Learning, by Marcel Hussing, Jorge Mendez-Mendez, Anisha Singrodia, Cassandra Kent, and Eric Eaton.
- Dissecting Deep RL with High Update Ratios: Combatting Value Divergence, by Marcel Hussing, Claas A Voelcker, Igor Gilitschenski, Amir-massoud Farahmand, and Eric Eaton.
- Demystifying the Recency Heuristic in Temporal-Difference Learning, by Brett Daley, Marlos C. Machado, and Martha White.
- On the consistency of hyper-parameter selection in value-based deep reinforcement learning, by Johan Samir Obando Ceron, João Guilherme Madeira Araújo, Aaron Courville, and Pablo Samuel Castro.
- Value Internalization: Learning and Generalizing from Social Reward, by Frieda Rong and Max Kleiman-Weiner.
- Mixture of Experts in a Mixture of RL settings, by Timon Willi, Johan Samir Obando Ceron, Jakob Nicolaus Foerster, Gintare Karolina Dziugaite, and Pablo Samuel Castro.
- Aquatic Navigation: A Challenging Benchmark for Deep Reinforcement Learning, by Davide Corsi, Davide Camponogara, and Alessandro Farinelli.
- On Welfare-Centric Fair Reinforcement Learning, by Cyrus Cousins, Kavosh Asadi, Elita Lobo, and Michael Littman.
- Inverse Reinforcement Learning with Multiple Planning Horizons, by Jiayu Yao, Weiwei Pan, Finale Doshi-Velez, and Barbara E Engelhardt.
- Constant Stepsize Q-learning: Distributional Convergence, Bias and Extrapolation, by Yixuan Zhang, and Qiaomin Xie.
- More Efficient Randomized Exploration for Reinforcement Learning via Approximate Sampling, by Haque Ishfaq, Yixin Tan, Yu Yang, Qingfeng Lan, Jianfeng Lu, A. Rupam Mahmood, Doina Precup, and Pan Xu.
- Reinforcement Learning from Human Feedback without Reward Inference: Model-Free Algorithm and Instance-Dependent Analysis, by Qining Zhang, Honghao Wei, and Lei Ying.
- A Natural Extension To Online Algorithms For Hybrid RL With Limited Coverage, by Kevin Tan and Ziping Xu.
- Tiered Reward: Designing Rewards for Specification and Fast Learning of Desired Behavior, by Zhiyuan Zhou, Shreyas Sundara Raman, Henry Sowerby, and Michael Littman.
- Enabling Intelligent Interactions between an Agent and an LLM: A Reinforcement Learning Approach, by Bin Hu, Chenyang Zhao, Pu Zhang, Zihao Zhou, Yuanhang Yang, Zenglin Xu, and Bin Liu.
- An Idiosyncrasy of Time-discretization in Reinforcement Learning, by Kris De Asis and Richard S. Sutton.
- Dreaming of Many Worlds: Learning Contextual World Models aids Zero-Shot Generalization, by Sai Prasanna, Karim Farid, Raghu Rajan, and André Biedenkapp.
- Policy Gradient Algorithms with Monte Carlo Tree Learning for Non-Markov Decision Processes, by Tetsuro Morimura, Kazuhiro Ota, Kenshi Abe, and Peinan Zhang.
- Offline Diversity Maximization under Imitation Constraints, by Marin Vlastelica, Jin Cheng, Georg Martius, and Pavel Kolev.
- Zero-shot cross-modal transfer of Reinforcement Learning policies through a Global Workspace, by Léopold Maytié, Benjamin Devillers, Alexandre Arnold, and Rufin VanRullen.
- Stabilizing Extreme Q-learning by Maclaurin Expansion, by Motoki Omura, Takayuki Osa, Yusuke Mukuta, and Tatsuya Harada.
- Combining Automated Optimisation of Hyperparameters and Reward Shape, by Julian Dierkes, Emma Cramer, Holger Hoos, and Sebastian Trimpe.
- Sample Complexity of Offline Distributionally Robust Linear Markov Decision Processes, by He Wang, Laixi Shi, and Yuejie Chi.
- PASTA: Pretrained Action-State Transformer Agents, by Raphael Boige, Yannis Flet-Berliac, Lars C.P.M. Quaedvlieg, Arthur Flajolet, Guillaume Richard, and Thomas Pierrot.
- Cost Aware Best Arm Identification, by Kellen Kanarios, Qining Zhang, and Lei Ying.
- ICU-Sepsis: A Benchmark MDP Built from Real Medical Data, by Kartik Choudhary, Dhawal Gupta, and Philip S. Thomas.
- When does Self-Prediction help? Understanding Auxiliary Tasks in Reinforcement Learning, by Claas A Voelcker, Tyler Kastner, Igor Gilitschenski, and Amir-massoud Farahmand.
- ROER: Regularized Optimal Experience Replay, by Changling Li, Zhang-Wei Hong, Pulkit Agrawal, Divyansh Garg, and Joni Pajarinen.
- Combining Reconstruction and Contrastive Methods for Multimodal Representations in RL, by Philipp Becker, Sebastian Mossburger, Fabian Otto, and Gerhard Neumann.
- RL for Consistency Models: Reward Guided Text-to-Image Generation with Fast Inference, by Owen Oertell, Jonathan Daniel Chang, Yiyi Zhang, Kianté Brantley, and Wen Sun.
- A Super-human Vision-based Reinforcement Learning Agent for Autonomous Racing in Gran Turismo, by Miguel Vasco, Takuma Seno, Kenta Kawamoto, Kaushik Subramanian, Peter R. Wurman, and Peter Stone.
- Bad Habits: Policy Confounding and Out-of-Trajectory Generalization in RL, by Miguel Suau, Matthijs T. J. Spaan, and Frans A Oliehoek.
- Learning Abstract World Models for Value-preserving Planning with Options, by Rafael Rodriguez-Sanchez and George Konidaris.
- Verification-Guided Shielding for Deep Reinforcement Learning, by Davide Corsi, Guy Amir, Andoni Rodríguez, Guy Katz, César Sánchez, and Roy Fox.
- Learning Discrete World Models for Heuristic Search, by Forest Agostinelli and Misagh Soltani.
- Distributionally Robust Constrained Reinforcement Learning under Strong Duality, by Zhengfei Zhang, Kishan Panaganti, Laixi Shi, Yanan Sui, Adam Wierman, and Yisong Yue.
- Representation Alignment from Human Feedback for Cross-Embodiment Reward Learning from Mixed-Quality Demonstrations, by Connor Mattson, Anurag Sidharth Aribandi, and Daniel S. Brown.
- Revisiting Sparse Rewards for Goal-Reaching Reinforcement Learning, by Gautham Vasan, Yan Wang, Fahim Shahriar, James Bergstra, Martin Jägersand, and A. Rupam Mahmood.
- Policy-Guided Diffusion, by Matthew Thomas Jackson, Michael Matthews, Cong Lu, Benjamin Ellis, Shimon Whiteson, and Jakob Nicolaus Foerster.
- Agent-Centric Human Demonstrations Train World Models, by James Staley, Elaine Short, Shivam Goel, and Yash Shukla.
- Can Differentiable Decision Trees Enable Interpretable Reward Learning from Human Feedback?, by Akansha Kalra and Daniel S. Brown.
- Imitation Learning from Observation through Optimal Transport, by Wei-Di Chang, Scott Fujimoto, David Meger, and Gregory Dudek.
- Light-weight Probing of Unsupervised Representations for Reinforcement Learning, by Wancong Zhang, Anthony GX-Chen, Vlad Sobal, Yann LeCun, and Nicolas Carion.
- Quantifying Interaction Level Between Agents Helps Cost-efficient Generalization in Multi-agent Reinforcement Learning, by Yuxin Chen, Chen Tang, Thomas Tian, Chenran Li, Jinning Li, Masayoshi Tomizuka, and Wei Zhan.
- Shield Decomposition for Safe Reinforcement Learning in General Partially Observable Multi-Agent Environments, by Daniel Melcer, Christopher Amato, and Stavros Tripakis.
- Reward Centering, by Abhishek Naik, Yi Wan, Manan Tomar, and Richard S. Sutton.
- MultiHyRL: Robust Hybrid RL for Obstacle Avoidance against Adversarial Attacks on the Observation Space, by Jan de Priester, Zachary Bell, Prashant Ganesh, and Ricardo Sanfelice.
- Investigating the Interplay of Prioritized Replay and Generalization, by Parham Mohammad Panahi, Andrew Patterson, Martha White, and Adam White.
- Towards General Negotiation Strategies with End-to-End Reinforcement Learning, by Bram M. Renting, Thomas M. Moerland, Holger Hoos, and Catholijn M Jonker.
- PID Accelerated Temporal Difference Algorithms, by Mark Bedaywi, Amin Rakhsha, and Amir-massoud Farahmand.
- States as goal-directed concepts: an epistemic approach to state-representation learning, by Nadav Amir, Yael Niv, and Angela J Langdon.
- Posterior Sampling for Continuing Environments, by Wanqiao Xu, Shi Dong, and Benjamin Van Roy.
- Reinforcement Learning from Delayed Observations via World Models, by Armin Karamzade, Kyungmin Kim, Montek Kalsi, and Roy Fox.
- Offline Reinforcement Learning from Datasets with Structured Non-Stationarity, by Johannes Ackermann, Takayuki Osa, and Masashi Sugiyama.
- Resource Usage Evaluation of Discrete Model-Free Deep Reinforcement Learning Algorithms, by Olivia P. Dizon-Paradis, Stephen E. Wormald, Daniel E. Capecci, Avanti Bhandarkar, and Damon L. Woodard.
- D5RL: Diverse Datasets for Data-Driven Deep Reinforcement Learning, by Rafael Rafailov, Kyle Beltran Hatch, Anikait Singh, Aviral Kumar, Laura Smith, Ilya Kostrikov, Philippe Hansen-Estruch, Victor Kolev, Philip J. Ball, Jiajun Wu, Sergey Levine, and Chelsea Finn.
- Weight Clipping for Deep Continual and Reinforcement Learning, by Mohamed Elsayed, Qingfeng Lan, Clare Lyle, and A. Rupam Mahmood.
- A Batch Sequential Halving Algorithm without Performance Degradation, by Sotetsu Koyamada, Soichiro Nishimori, and Shin Ishii.
- Causal Contextual Bandits with Adaptive Context, by Rahul Madhavan, Aurghya Maiti, Gaurav Sinha, and Siddharth Barman.
- Policy Architectures for Compositional Generalization in Control, by Allan Zhou, Vikash Kumar, Chelsea Finn, and Aravind Rajeswaran.
- Semi-Supervised One Shot Imitation Learning, by Philipp Wu, Kourosh Hakhamaneshi, Yuqing Du, Igor Mordatch, Aravind Rajeswaran, and Pieter Abbeel.
- Cross-environment Hyperparameter Tuning for Reinforcement Learning, by Andrew Patterson, Samuel Neumann, Raksha Kumaraswamy, Martha White, and Adam White.
- Human-compatible driving agents through data-regularized self-play reinforcement learning, by Daphne Cornelisse and Eugene Vinitsky.
- Inception: Efficiently Computable Misinformation Attacks on Markov Games, by Jeremy McMahan, Young Wu, Yudong Chen, Jerry Zhu, and Qiaomin Xie.
- Learning to Navigate in Mazes with Novel Layouts using Abstract Top-down Maps, by Linfeng Zhao, and Lawson L.S. Wong.
- Boosting Soft Q-Learning by Bounding, by Jacob Adamczyk, Volodymyr Makarenko, Stas Tiomkin, and Rahul V Kulkarni.
- Bandits with Multimodal Structure, by Hassan Saber and Odalric-Ambrym Maillard.
- Bounding-Box Inference for Error-Aware Model-Based Reinforcement Learning, by Erin J Talvitie, Zilei Shao, Huiying Li, Jinghan Hu, Jacob Boerma, Rory Zhao, and Xintong Wang.
- Non-stationary Bandits and Meta-Learning with a Small Set of Optimal Arms, by Javad Azizi, Thang Duong, Yasin Abbasi-Yadkori, András György, Claire Vernade, and Mohammad Ghavamzadeh.
- Optimizing Rewards while meeting ω-regular Constraints, by Christopher Zeitler, Kristina Miller, Sayan Mitra, John Schierman, and Mahesh Viswanathan.