Planning to Go Out-of-Distribution in Offline-to-Online Reinforcement Learning

By Trevor McInroe, Adam Jelley, Stefano V Albrecht, and Amos Storkey

Reinforcement Learning Journal, vol. 2, 2024, pp. 516–546.

Presented at the Reinforcement Learning Conference (RLC), Amherst, Massachusetts, August 9–12, 2024.


Abstract:

Offline pretraining with a static dataset followed by online fine-tuning (offline-to-online, or OtO) is a paradigm well matched to a real-world RL deployment process. In this scenario, we aim to find the best-performing policy within a limited budget of online interactions. Previous work in the OtO setting has focused on correcting for bias introduced by the policy-constraint mechanisms of offline RL algorithms. Such constraints keep the learned policy close to the behavior policy that collected the dataset, but we show this can unnecessarily limit policy performance if the behavior policy is far from optimal. Instead, we forgo policy constraints and frame OtO RL as an exploration problem that aims to maximize the benefit of online data collection. We first study the major online RL exploration methods based on intrinsic rewards and UCB in the OtO setting, showing that intrinsic rewards add training instability through reward-function modification, and that UCB methods are myopic and it is unclear which learned component's ensemble to use for action selection. We then introduce an algorithm for planning to go out of distribution (PTGOOD) that avoids these issues. PTGOOD uses a non-myopic planning procedure that targets exploration in relatively high-reward regions of the state-action space unlikely to be visited by the behavior policy. By leveraging concepts from the Conditional Entropy Bottleneck, PTGOOD encourages data collected online to provide new information relevant to improving the final deployment policy without altering rewards. We show empirically in different control tasks that PTGOOD significantly improves agent returns during online fine-tuning and finds the optimal policy in as few as 10k online steps in the Walker control task and in as few as 50k in complex control tasks such as Humanoid. We find that PTGOOD avoids the suboptimal policy convergence that many of our baselines exhibit in several environments.
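To make the abstract's planning idea concrete, below is a minimal, illustrative sketch (not the authors' implementation) of the kind of loop it describes: candidate action sequences are rolled out in a learned dynamics model and scored by predicted return plus a bonus for visiting state-action pairs that are unlikely under the behavior policy. All names here (rollout_model, reward_model, behavior_log_density, plan_ptgood_style) are hypothetical placeholders, and the toy Gaussian density stands in for the paper's Conditional Entropy Bottleneck-based novelty estimate.

# Illustrative sketch only; toy stand-ins, not the paper's code.
import numpy as np

rng = np.random.default_rng(0)

def rollout_model(state, action):
    """Hypothetical learned dynamics model: predicts the next state."""
    return state + 0.1 * action  # placeholder linear dynamics

def reward_model(state, action):
    """Hypothetical learned reward model."""
    return -np.sum(state**2) - 0.01 * np.sum(action**2)

def behavior_log_density(state, action):
    """Hypothetical estimate of how likely the behavior policy is to visit
    this state-action pair (the CEB plays this role in the paper); here a
    toy Gaussian centred at the origin."""
    z = np.concatenate([state, action])
    return -0.5 * np.sum(z**2)

def plan_ptgood_style(state, horizon=5, n_candidates=64, novelty_weight=1.0):
    """Non-myopic planning sketch: score candidate action sequences by
    predicted reward plus a bonus for low behavior-policy density, and
    return the first action of the best-scoring sequence."""
    best_score, best_first_action = -np.inf, None
    for _ in range(n_candidates):
        actions = rng.normal(size=(horizon, state.shape[0]))
        s, score = state.copy(), 0.0
        for a in actions:
            score += reward_model(s, a)                            # predicted reward
            score -= novelty_weight * behavior_log_density(s, a)   # low density => bonus
            s = rollout_model(s, a)
        if score > best_score:
            best_score, best_first_action = score, actions[0]
    return best_first_action

# Usage: plan the next exploratory action from the current state.
action = plan_ptgood_style(np.zeros(3))
print("first planned action:", action)

Note that the novelty bonus is added to the planning objective only; as the abstract emphasizes, the rewards used to train the deployment policy are not modified.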


Citation Information:

Trevor McInroe, Adam Jelley, Stefano V Albrecht, and Amos Storkey. "Planning to Go Out-of-Distribution in Offline-to-Online Reinforcement Learning." Reinforcement Learning Journal, vol. 2, 2024, pp. 516–546.

BibTeX:

@article{mcinroe2024planning,
    title={Planning to Go Out-of-Distribution in Offline-to-Online Reinforcement Learning},
    author={McInroe, Trevor and Jelley, Adam and Albrecht, Stefano V and Storkey, Amos},
    journal={Reinforcement Learning Journal},
    volume={2},
    pages={516--546},
    year={2024}
}