Explainable deep reinforcement learning via online mimicking
Ερμηνεύσιμη βαθιά ενισχυτική μάθηση με ταυτόχρονη μίμηση
Master Thesis
Author
Makris, Nikolaos
Μακρής, Νικόλαος
Date
2025-01
Advisor
Vouros, George
Βούρος, Γεώργιος
Keywords
Explainable Deep Reinforcement Learning (XDRL); Soft Actor-Critic (SAC); Dual gradient descent; XGBoost; Continuous state-action spaces; Optimality-interpretability trade-off
Abstract
This study proposes a method for training interpretable reinforcement learning policies in continuous state-action spaces, in close interplay with the original deep models, and examines the effects that training the interpretable policy models has on the original models. The objective is thus to validate the feasibility of this method while examining the trade-off between the optimality and interpretability of policy models. This extends previous work in the field of Explainable Deep Reinforcement Learning (XDRL), which has focused on XDQN (Explainable Deep Q-Networks) and on the interpretability of actor-critic methods in discrete action spaces, without considering the trade-off between optimality and interpretability.
Specifically, in the proposed framework, the original and the interpretable policy models, obtained from Soft Actor-Critic (SAC) and XGBoost respectively, interact during training by influencing each other. The XGBoost model is trained to closely approximate the SAC policy model, after which SAC is fine-tuned towards the XGBoost model to minimize the difference between their predictions and thereby increase the fidelity of the interpretable model. This latter step is carried out with the Dual Gradient Descent method, which is used for constrained optimization problems.
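To make the alternation concrete, the following is a minimal sketch of such an online mimicking loop with dual gradient descent. It is an assumption-laden toy, not the thesis implementation: a linear policy stands in for the SAC actor, a capacity-limited least-squares fit stands in for XGBoost, and all names and hyperparameters (epsilon, actor_lr, dual_lr, etc.) are illustrative.

```python
import numpy as np

# Toy sketch: alternate (1) fitting an interpretable surrogate to the policy,
# (2) fine-tuning the policy under a fidelity constraint via a Lagrangian,
# (3) dual ascent on the Lagrange multiplier.

rng = np.random.default_rng(0)
state_dim, action_dim = 3, 1
theta = rng.normal(size=(state_dim, action_dim))   # stand-in for SAC actor parameters
theta_star = np.array([[1.0], [2.0], [-1.0]])      # toy "task-optimal" parameters
lam = 0.0                                           # dual variable (Lagrange multiplier)
epsilon = 0.05                                      # tolerated policy/surrogate gap
actor_lr, dual_lr = 0.05, 0.5

def policy(states, theta):
    """Toy deterministic stand-in for the SAC actor's mean action."""
    return states @ theta

def fit_surrogate(states, actions):
    """Stand-in for XGBoost mimicking the policy. Deliberately limited capacity
    (it only sees the first state feature), so a fidelity gap remains."""
    x = states[:, :1]
    w, *_ = np.linalg.lstsq(x, actions, rcond=None)
    return lambda s: s[:, :1] @ w

for step in range(300):
    states = rng.normal(size=(256, state_dim))      # batch of visited states

    # 1) Refit the interpretable model to mimic the current policy.
    surrogate = fit_surrogate(states, policy(states, theta))

    # 2) Fine-tune the policy under the fidelity constraint
    #    E||pi_theta(s) - f(s)||^2 <= epsilon, via the Lagrangian
    #    L(theta, lam) = J_task(theta) + lam * (gap(theta) - epsilon).
    diff = policy(states, theta) - surrogate(states)
    gap = np.mean(diff ** 2)
    task_grad = theta - theta_star                  # grad of toy task loss 0.5||theta - theta*||^2
    fidelity_grad = 2.0 * states.T @ diff / len(states)
    theta = theta - actor_lr * (task_grad + lam * fidelity_grad)

    # 3) Dual ascent: raise lam while the constraint is violated, keep it >= 0.
    lam = max(0.0, lam + dual_lr * (gap - epsilon))

print("final gap:", gap, "dual variable:", lam)
```

In this formulation the dual variable grows while the interpretable surrogate cannot reproduce the policy closely enough, pulling the policy towards behaviours the surrogate can represent; this is the mechanism behind the optimality-interpretability trade-off discussed below.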
All experiments are conducted in the OpenAI Gym environment, in four settings with continuous action spaces of increasing dimensionality, to evaluate the framework's effectiveness. It is observed that, because the two models are trained in close interplay, the final SAC policy differs significantly from the optimal SAC policy (the one learned by SAC alone), and this difference grows, as expected, as the complexity of the experimental setting increases. However, the interplay between the two models leads to convergence on policies that, while not necessarily optimal, are interpretable. Indeed, the results demonstrate that the final SAC policy and the XGBoost model's predictions are closely aligned, allowing the two models to be used interchangeably regardless of the complexity of the experimental setting.
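As an illustration of how such policy/surrogate alignment could be checked on a Gym rollout, the sketch below measures the mean per-step action disagreement while acting with one of the two models. It is not the thesis evaluation code: it uses the gymnasium API, and `sac_action` and `xgb_action` are placeholder callables standing in for the trained SAC actor and the fitted XGBoost model.

```python
import numpy as np
import gymnasium as gym

# Illustrative fidelity check on a continuous-action Gym task.
env = gym.make("Pendulum-v1")
sac_action = lambda obs: 2.0 * np.tanh(obs[:1])    # dummy stand-in for the SAC policy
xgb_action = lambda obs: np.clip(obs[:1], -2, 2)   # dummy stand-in for the XGBoost surrogate

gaps, ep_return = [], 0.0
obs, _ = env.reset(seed=0)
for _ in range(200):
    a_sac = sac_action(obs)
    a_xgb = xgb_action(obs)
    gaps.append(np.linalg.norm(a_sac - a_xgb))     # per-step action disagreement
    obs, reward, terminated, truncated, _ = env.step(a_sac)  # act with the SAC policy
    ep_return += reward
    if terminated or truncated:
        obs, _ = env.reset()

print(f"mean action gap: {np.mean(gaps):.3f}, accumulated return: {ep_return:.1f}")
```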
This thesis contributes by introducing a novel framework that supports integrating interpretable policy models into Deep Reinforcement Learning methods. It demonstrates this through the interaction of SAC and XGBoost policy models via the Dual Gradient Descent optimization method, while providing results on the trade-off between the optimality and interpretability of policies.