A Natural Extension To Online Algorithms For Hybrid RL With Limited Coverage

By Kevin Tan and Ziping Xu

Reinforcement Learning Journal, vol. 3, 2024, pp. 1252–1264.

Presented at the Reinforcement Learning Conference (RLC), Amherst, Massachusetts, August 9–12, 2024.


Abstract:

Hybrid Reinforcement Learning (RL), leveraging both online and offline data, has garnered recent interest, yet research on its provable benefits remains sparse. Additionally, many existing hybrid RL algorithms (Song et al., 2023; Nakamoto et al., 2023; Amortila et al., 2024) impose a stringent coverage assumption on the offline dataset, known as single-policy concentrability, requiring that the behavior policy visit every state-action pair that the optimal policy does. Under such an assumption, no exploration of unseen state-action pairs is needed during online learning. We show that this is unnecessary, and instead study online algorithms designed to "fill in the gaps" in the offline dataset, exploring states and actions that the behavior policy did not. Previous approaches in this direction focus on estimating the offline data distribution to guide online exploration (Li et al., 2023). We show that a natural extension of standard optimistic online algorithms, namely warm-starting them by including the offline dataset in the experience replay buffer, achieves similar provable gains from hybrid data even when the offline dataset does not have single-policy concentrability. We accomplish this by partitioning the state-action space into two parts, bounding the regret on each partition through an offline and an online complexity measure, and showing that the regret of this hybrid RL algorithm can be characterized by the best partition, despite the algorithm not knowing the partition itself. As an example, we propose DISC-GOLF, a modification of GOLF, an existing optimistic online algorithm with general function approximation used in Jin et al. (2021) and Xie et al. (2022), and show that it demonstrates provable gains over both online-only and offline-only reinforcement learning, with competitive bounds when specialized to the tabular, linear, and block MDP cases. Numerical simulations further validate our theory that hybrid data facilitates more efficient exploration, supporting the potential of hybrid RL in various scenarios.
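
To make the warm-start idea in the abstract concrete, the sketch below shows, at a high level, an optimistic online learner whose experience replay buffer is seeded with the offline dataset before online interaction begins. It is an illustrative sketch only, not the paper's DISC-GOLF algorithm; the environment interface and the names hybrid_warm_start, learner.fit_optimistic, env.reset, and env.step are hypothetical placeholders assumed for the example.

def hybrid_warm_start(env, offline_dataset, learner, num_episodes):
    # Seed the experience replay buffer with the offline transitions
    # (state, action, reward, next_state) before any online interaction.
    replay_buffer = list(offline_dataset)

    for _ in range(num_episodes):
        # Fit an optimistic estimate on all data gathered so far
        # (offline + online), exactly as the base online algorithm would.
        policy = learner.fit_optimistic(replay_buffer)

        # Roll out the resulting policy and append the new transitions,
        # exploring states and actions the behavior policy did not cover.
        state, done = env.reset(), False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            replay_buffer.append((state, action, reward, next_state))
            state = next_state

    return learner

The point of the sketch is that no separate mechanism is needed to model the offline data distribution: the offline transitions simply enter the same buffer the optimistic algorithm already uses, and online exploration concentrates on whatever the offline data leaves uncovered.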


Citation Information:

Kevin Tan and Ziping Xu. "A Natural Extension To Online Algorithms For Hybrid RL With Limited Coverage." Reinforcement Learning Journal, vol. 3, 2024, pp. 1252–1264.

BibTeX:

@article{tan2024natural,
    title={A Natural Extension To Online Algorithms For Hybrid {RL} With Limited Coverage},
    author={Tan, Kevin and Xu, Ziping},
    journal={Reinforcement Learning Journal},
    volume={3},
    pages={1252--1264},
    year={2024}
}