From Explainability to Interpretability: Interpretable Reinforcement Learning Via Model Explanations

By Peilang Li, Umer Siddique, and Yongcan Cao

Reinforcement Learning Journal, vol. TBD, 2025, pp. TBD.

Presented at the Reinforcement Learning Conference (RLC), Edmonton, Alberta, Canada, August 5–9, 2025.


Abstract:

Deep reinforcement learning (RL) has shown remarkable success in complex domains; however, the inherent black-box nature of deep neural network policies raises significant challenges in understanding and trusting the decision-making process. While existing explainable RL methods provide local insights, they fail to deliver a global understanding of the model, particularly in high-stakes applications. To overcome this limitation, we propose a novel model-agnostic framework that bridges the gap between explainability and interpretability by leveraging Shapley values to transform complex deep RL policies into transparent representations. The proposed approach, SILVER (Shapley value-based Interpretable poLicy Via Explanation Regression), offers two key contributions: a novel approach that employs Shapley values for policy interpretation beyond local explanations, and a general framework applicable to both off-policy and on-policy algorithms. We evaluate SILVER with three existing deep RL algorithms and validate its performance in three classic control environments. The results demonstrate that SILVER not only preserves the original models' performance but also generates more stable interpretable policies.
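
Illustrative sketch (Python): the abstract describes using Shapley values to turn a deep RL policy into a transparent surrogate via explanation regression. The snippet below is a minimal, hypothetical reading of that idea, not the authors' implementation; it computes Shapley values for a stand-in policy's action scores and fits a simple linear surrogate on that explanation space. The policy function, state dimensionality, and data are illustrative assumptions.

# Hypothetical sketch of Shapley-value-based explanation regression
# (not the SILVER reference implementation). Assumes a trained policy
# is available as a function mapping state batches to action scores.
import numpy as np
import shap
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Stand-in for a trained deep RL policy on a 4-dimensional state space
# (e.g., CartPole-like observations); returns one action score per state.
def policy_score(states: np.ndarray) -> np.ndarray:
    return np.tanh(states @ np.array([0.8, -0.5, 1.2, 0.3]))

# States gathered from rollouts of the trained policy (random here).
rollout_states = rng.normal(size=(200, 4))

# Local explanations: Shapley values of each state feature for the policy output.
explainer = shap.KernelExplainer(policy_score, shap.sample(rollout_states, 50))
shap_values = explainer.shap_values(rollout_states, nsamples=100)

# "Explanation regression": fit a transparent model that reproduces the
# policy's scores from the Shapley-value representation of each state.
targets = policy_score(rollout_states)
surrogate = LinearRegression().fit(shap_values, targets)
print("surrogate fidelity (R^2):", surrogate.score(shap_values, targets))

In this reading, the linear surrogate plays the role of the interpretable policy, while the fidelity score indicates how closely it tracks the original black-box policy on the collected states; the paper's actual procedure and evaluation may differ.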


Citation Information:

Peilang Li, Umer Siddique, and Yongcan Cao. "From Explainability to Interpretability: Interpretable Reinforcement Learning Via Model Explanations." Reinforcement Learning Journal, vol. TBD, 2025, pp. TBD.

BibTeX:
@article{li2025from,
    title={From Explainability to Interpretability: {I}nterpretable Reinforcement Learning Via Model Explanations},
    author={Li, Peilang and Siddique, Umer and Cao, Yongcan},
    journal={Reinforcement Learning Journal},
    year={2025}
}