Reinforcement Learning Journal, vol. TBD, 2025, pp. TBD.
Presented at the Reinforcement Learning Conference (RLC), Edmonton, Alberta, Canada, August 5–9, 2025.
In reinforcement learning (RL), deep $Q$-learning algorithms are often more sample- and compute-efficient than alternatives like the Monte Carlo policy gradient, but tend to suffer from instability that limits their use in practice. Some of this instability can be mitigated through a delayed \textit{target network}, yet this doubles memory usage and arguably slows down convergence. In this work, we explore the possibility of stabilization (returns do not drop with further gradient steps) without sacrificing the speed of convergence (high returns do not require many gradient steps). Inspired by self-supervised learning (SSL) and adaptive optimization, we empirically arrive at three modifications to the standard deep $Q$-network (DQN), no two of which work well alone in our experiments. These modifications are, in the order of our experiments: 1) an \textbf{A}symmetric \textit{predictor} in the neural network, 2) a particular combination of \textbf{N}ormalization layers, and 3) \textbf{H}ypergradient descent on the learning rate. Aligning with prior work in SSL, \textbf{HANQ} (pronounced ``\textit{hank}'') avoids DQN's target network, uses the same number of hyperparameters as DQN, and yet matches or exceeds DQN's performance in our offline RL experiments on three out of four environments.
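Of the three modifications, hypergradient descent on the learning rate is the most self-contained to illustrate. The following is a minimal sketch of the generic technique applied to plain gradient descent on a toy least-squares objective; the variable names, toy objective, and step sizes are illustrative assumptions, not the paper's HANQ update, which applies the idea inside deep $Q$-learning.

Python sketch:
import numpy as np

# Minimal sketch (illustrative assumptions, not the HANQ update itself):
# hypergradient descent on the learning rate for plain gradient descent,
# demonstrated on the toy objective 0.5 * ||A @ theta - b||^2.
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
b = rng.standard_normal(20)

def grad(theta):
    # Gradient of the toy objective with respect to theta.
    return A.T @ (A @ theta - b)

theta = np.zeros(5)
alpha = 1e-3                    # learning rate, itself adapted online
beta = 1e-4                     # hyper-learning rate (step size for alpha)
prev_g = np.zeros_like(theta)   # gradient from the previous step

for step in range(500):
    g = grad(theta)
    # The hypergradient of the current loss w.r.t. alpha is -g . prev_g,
    # so descending on alpha adds beta * (g . prev_g): alpha grows while
    # consecutive gradients agree and shrinks when they oppose each other.
    alpha += beta * float(g @ prev_g)
    theta -= alpha * g
    prev_g = g

In the paper's setting, the same style of update would adapt the learning rate used to train the $Q$-network; the exact form used by HANQ is specified in the paper itself.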
Braham Snyder and Chen-Yu Wei. "HANQ: Hypergradients, Asymmetry, and Normalization for Fast and Stable Deep $Q$-Learning." Reinforcement Learning Journal, vol. TBD, 2025, pp. TBD.
BibTeX:
@article{snyder2025hanq,
    title={{HANQ}: {H}ypergradients, Asymmetry, and Normalization for Fast and Stable Deep $Q$-Learning},
    author={Snyder, Braham and Wei, Chen-Yu},
    journal={Reinforcement Learning Journal},
    year={2025}
}