Foundation Model Self-Play: Open-Ended Strategy Innovation via Foundation Models

By Aaron Dharna, Cong Lu, and Jeff Clune

Reinforcement Learning Journal, vol. TBD, 2025, pp. TBD.

Presented at the Reinforcement Learning Conference (RLC), Edmonton, Alberta, Canada, August 5–9, 2025.


Abstract:

Multi-agent interactions have long fueled innovation, from natural predator-prey dynamics to the space race. Self-play (SP) algorithms try to harness these dynamics by pitting agents against ever-improving opponents, thereby creating an implicit curriculum. However, SP often fails to produce diverse solutions and can get stuck in locally optimal behaviors. We introduce Foundation-Model Self-Play (FMSP), a new direction that leverages the code-generation capabilities and vast knowledge of foundation models (FMs) to overcome these challenges by leaping across optima in policy space. We propose a family of approaches: (1) Vanilla FMSP (vFMSP) continually refines and improves an agent’s policy via competitive self-play; (2) Novelty-Search Self-Play (NSSP) builds a diverse population of strategies, ignoring performance; and (3) the most promising variant, Quality-Diversity Self-Play (QDSP), creates a diverse set of high-quality policies by combining elements of both NSSP and vFMSP. We evaluate FMSPs in a continuous-control pursuer-evader setting (Car Tag) and in “Gandalf,” a simple AI safety simulation in which an attacker tries to jailbreak an LLM’s defenses. In Car Tag, our algorithms explore a wide variety of methods, including reinforcement learning, tree search, and heuristic-based strategies. In terms of discovered policy quality, QDSP and vFMSP find policies that surpass strong human-designed strategies. In Gandalf, our algorithms automatically red-team an LLM, successfully jailbreaking six different, progressively stronger levels of defense. Furthermore, FMSPs enable us to automatically close the loop and rapidly patch the discovered vulnerabilities. Overall, FMSP and its many possible variants represent a promising new research frontier: improving self-play with foundation models, opening fresh paths toward more creative and open-ended strategy discovery.
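
To make the three variants concrete, below is a minimal, hypothetical Python sketch of a QDSP-style loop as described in the abstract: a foundation model proposes or refines policies, each candidate is scored against opponents drawn from the current archive (the self-play component), and a candidate is kept if it is sufficiently novel or sufficiently strong (the quality-diversity component). All function names, the policy representation, and the thresholds are illustrative placeholders, not the paper's implementation.

    import random

    def fm_propose_policy(archive):
        # Placeholder for a foundation-model call that writes or refines policy code.
        # Here it simply returns a random parameter vector.
        return [random.uniform(-1.0, 1.0) for _ in range(4)]

    def evaluate(policy, opponent):
        # Placeholder for running a match (e.g., one Car Tag episode) and returning a score.
        return sum(p - o for p, o in zip(policy, opponent))

    def novelty(policy, archive):
        # Mean distance to archived policies; a crude stand-in for behavioral novelty.
        if not archive:
            return float("inf")
        dists = [sum(abs(p - q) for p, q in zip(policy, other)) for other, _ in archive]
        return sum(dists) / len(dists)

    archive = []  # list of (policy, score) pairs discovered so far
    for step in range(100):
        candidate = fm_propose_policy(archive)
        # Self-play: score the candidate against opponents sampled from the archive.
        opponents = [p for p, _ in random.sample(archive, min(3, len(archive)))] or [[0.0] * 4]
        score = min(evaluate(candidate, opp) for opp in opponents)
        # Quality-diversity acceptance: keep the candidate if it is novel enough or strong enough.
        if novelty(candidate, archive) > 1.0 or score > 0.0:
            archive.append((candidate, score))

In this sketch, dropping the score term from the acceptance test would resemble NSSP (diversity only), while keeping only the score term and refining a single policy would resemble vFMSP; the actual algorithms in the paper may differ in their details.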


Citation Information:

Aaron Dharna, Cong Lu, and Jeff Clune. "Foundation Model Self-Play: Open-Ended Strategy Innovation via Foundation Models." Reinforcement Learning Journal, vol. TBD, 2025, pp. TBD.

BibTeX:
@article{dharna2025foundation,
    title={Foundation Model Self-Play: {O}pen-Ended Strategy Innovation via Foundation Models},
    author={Dharna, Aaron and Lu, Cong and Clune, Jeff},
    journal={Reinforcement Learning Journal},
    year={2025}
}