Paper Summary - Contrastive explanations for reinforcement learning in terms of expected consequences by van der Waa et al. (2018)


Waa, J., Diggelen, J. V., Bosch, K., & Neerincx, M. (2018). Contrastive explanations for reinforcement learning in terms of expected consequences. In Proceedings of the Workshop on Explainable AI at the IJCAI conference, Stockholm, Sweden, 37.

Paper Summary

The paper presents a method to generate global and counterfactual explanations for RL agent policies. A user query is translated into an alternative policy and used to simulate a trajectory, which is then translated into human-interpretable terms using descriptive sets and outcome mappings. The explanation is then the translation of the trajectory using these tools, or a diff between the trajectories generated using the learned policy and the alternative policy (for a counterfactual explanation). The paper presents a user study which surveys which properties of the explanation users found most useful.


A user can ask why a certain policy was not followed. This query is translated into a partial policy description, which is used to learn an alternate policy. Trajectories are simulated with an environment model under both the original agent policy and the newly learned policy, then translated into a human-interpretable form using descriptive sets and action outcomes. The translated trajectories are fit to a template which serves as the explanation for the agent policy. Contrastive explanations are generated by taking the relative complement between the trajectory generated using the original policy and the trajectory generated using the alternative policy.

User-interpretable MDP

The paper presents techniques to construct a user-interpretable MDP using a descriptive set $\mathbf{C}$ and action outcomes $\mathbf{O}$. These are similar to predicate sets as described in Hayes & Shah (2017) and are hand-authored labels that can be applied to a state with a set of binary classifiers. $\mathbf{O}$ is similar to $\mathbf{C}$, but is meant to describe the outcomes of actions (i.e., the effect an action has on a state). More formally, the MDP $\langle S, A, R, T, \lambda \rangle$ is augmented as $\langle S, A, R, T, \lambda, \mathbf{C}, \mathbf{O}, \mathbf{t}, \mathbf{k} \rangle$. $\mathbf{k}$ is a function which translates a state feature vector $s$ into its descriptive set description ($\mathbf{k}: S \rightarrow \mathbf{C}$). $\mathbf{t}$ is a function which translates the application of an action $a$ in a translated state $c$ into a set of outcomes using $\mathbf{O}$ ($\mathbf{t}: \mathbf{C} \times A \rightarrow Pr(\mathbf{O})$).

$\mathbf{C}, \mathbf{O}, \mathbf{t}, \mathbf{k}$ are independent of the other variables in the MDP and do not affect the training of the agent, allowing this method to be independent of the agent architecture. They are hand-authored, although $\mathbf{k}$ can be implemented with a number of binary classification models which learn to classify a state $s \in S$ with a predicate $c \in \mathbf{C}$.
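To make the roles of $\mathbf{k}$ and $\mathbf{t}$ concrete, here is a minimal sketch for a toy grid world. Everything in it is an assumption made for illustration and not taken from the paper: the `(row, col)` state layout, the label names, the trap and goal positions, and the outcome probabilities are all hypothetical.

```python
from typing import Callable, Dict, FrozenSet, Tuple

State = Tuple[int, int]  # hypothetical grid-world state: (row, col)

# Hypothetical hand-authored descriptive set C: one binary classifier per label.
CLASSIFIERS: Dict[str, Callable[[State], bool]] = {
    "next to trap": lambda s: abs(s[0] - 2) + abs(s[1] - 2) == 1,  # trap assumed at (2, 2)
    "at goal": lambda s: s == (0, 3),                              # goal assumed at (0, 3)
}


def k(state: State) -> FrozenSet[str]:
    """Translate a state into the set of labels from C that hold for it."""
    return frozenset(label for label, holds in CLASSIFIERS.items() if holds(state))


# Hypothetical hand-authored outcome table backing t:
# (state description, action) -> {outcome label: probability}
OUTCOME_TABLE: Dict[Tuple[FrozenSet[str], str], Dict[str, float]] = {
    (frozenset({"next to trap"}), "right"): {"fall in trap": 0.8, "stay safe": 0.2},
}


def t(description: FrozenSet[str], action: str) -> Dict[str, float]:
    """Return a distribution over outcome labels from O (empty if nothing is known)."""
    return OUTCOME_TABLE.get((description, action), {})
```

For example, `k((2, 1))` yields the single label `"next to trap"`, and `t(k((2, 1)), "right")` returns the hand-written outcome distribution for moving right from such a state. In practice each lambda could be replaced by a learned binary classifier, as the paragraph above notes.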

Outcomes can be annotated as positive or negative to add further information to the explanations.

Contrastive Query to Alternate Policy

A user query is broken down into a set of $(s, a_t, a_f)$ tuples, where $a_t$ is the original action taken in state $s$, and $a_f$ is the alternate (foil) action that the user expects the agent to have taken. These are used to define an alternate MDP using the reward function $R_I$:

\[R_I(s_i, a_f) = \frac{\mathbf{\lambda_f}}{\lambda} w(s_i, s_t) [R_t(s_i, a_f) - R_t(s_i, a_t)](1 + \mathbf{\epsilon})\]

where $\lambda_f$ is the discount factor for the new MDP, $a_f$ is the current foil action, $a_t$ is the action taken by the original policy $\pi_t(s_t)$, $s_i, i \in \lbrace t, t+1, \cdots, t + \mathbf{n} \rbrace$ is the $i$th state starting with $s_t$, and $w$ is a distance-based weight between the states $s_i$ and $s_t$ defined as:

\[w(s_i, s_t) = \exp \left( - \left( \frac{\mathbf{d}(s_i, s_t)}{\mathbf{\sigma}} \right)^2 \right)\]

Here $w$ is a radial basis function (RBF) with a Gaussian kernel and distance function $\mathbf{d}$, so the weight is largest at $s_t$ and decays as $s_i$ moves away from it.
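The reward shaping above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: it assumes the standard Gaussian RBF form $w = \exp(-(d/\sigma)^2)$, uses Euclidean distance as a stand-in for $\mathbf{d}$, and treats the original reward `R_t` as a plain function of `(state, action)`. The function names are hypothetical.

```python
import math
from typing import Callable, Sequence

State = Sequence[float]


def rbf_weight(s_i: State, s_t: State, sigma: float) -> float:
    """Gaussian RBF weight; Euclidean distance is an assumed choice for d."""
    d = math.dist(s_i, s_t)
    return math.exp(-((d / sigma) ** 2))


def foil_reward(
    R_t: Callable[[State, str], float],
    s_i: State,
    s_t: State,
    a_f: str,
    a_t: str,
    lambda_f: float,
    lam: float,
    sigma: float,
    epsilon: float,
) -> float:
    """R_I(s_i, a_f): weighted reward difference between foil and fact actions."""
    w = rbf_weight(s_i, s_t, sigma)
    return (lambda_f / lam) * w * (R_t(s_i, a_f) - R_t(s_i, a_t)) * (1 + epsilon)
```

With this form, states far from the queried state $s_t$ receive a weight close to zero, so the shaped reward only changes the agent's behaviour in the neighbourhood of the query.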

I am not sure how a reward function defined in terms of $s$ and $a$ would work, since the usual model for it involves reward being given for existing in a state, thus being defined only in terms of $s$. How would this change the agent function? When is a reward given to the agent?

This reward function is designed so that the new value function satisfies $Q_I(s_i, a_f) > Q_t(s_t, a_t)$. The bolded terms indicate hyperparameters which can be tuned to adjust the magnitude of this change. The result is a value function that prefers the actions the user expects to see. $Q_f = Q_t + Q_I$ defines a new policy $\pi_f$ which can be used to simulate a trajectory.
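As a small illustration of how $Q_f = Q_t + Q_I$ induces a foil policy, the sketch below assumes tabular Q-functions stored as dictionaries keyed by `(state, action)` and a greedy action selection rule. Both are assumptions for the sake of the example; the paper does not prescribe a representation, and the helper names are hypothetical.

```python
from typing import Callable, Dict, Hashable, Sequence, Tuple

QTable = Dict[Tuple[Hashable, str], float]


def combine_q(q_t: QTable, q_i: QTable) -> QTable:
    """Q_f = Q_t + Q_I, treating missing entries as zero."""
    keys = set(q_t) | set(q_i)
    return {key: q_t.get(key, 0.0) + q_i.get(key, 0.0) for key in keys}


def greedy_policy(q: QTable, actions: Sequence[str]) -> Callable[[Hashable], str]:
    """Return a policy that picks the highest-valued action in each state."""
    def pi(state: Hashable) -> str:
        return max(actions, key=lambda a: q.get((state, a), 0.0))
    return pi


# Toy usage: the original agent prefers "right" in s0, the foil table boosts "left".
q_t = {("s0", "left"): 1.0, ("s0", "right"): 2.0}
q_i = {("s0", "left"): 1.5}
pi_t = greedy_policy(q_t, ["left", "right"])
pi_f = greedy_policy(combine_q(q_t, q_i), ["left", "right"])
```

Here `pi_t("s0")` picks `"right"` while `pi_f("s0")` picks `"left"`, because the added $Q_I$ value lifts the foil action above the original choice.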

The authors describe the effects and trade-offs of each hyperparameter in the paper. I will omit them here for brevity.

The authors do not specify how $Q_I$ should be learned. Presumably, we can learn it using the original agent’s learning algorithm while using the new reward structure.

Generating Explanations

A trajectory $\gamma$ is generated by running the agent in the environment (or simulating it with an environment model) with the original action selection mechanism and new value function $Q_f$. A trajectory is simply a sequence of states and the actions taken in those states by a given policy. If the environment is stochastic, the most probable transition is assumed. This gives us a sequence $\gamma(s_t, \pi) = \lbrace (s_0, \pi(s_0)), \cdots, (s_n, \pi(s_n)) \mid T \rbrace$. These are translated into a human-interpretable form using the $\mathbf{C}$ and $\mathbf{O}$ sets and presented in natural language using a template.

A counterfactual is generated by using $\pi_f$ to generate a trajectory and analysing the relative complement between the original and new trajectories.
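The trajectory comparison can be sketched as follows. This is a rough illustration and not the paper's algorithm: `step` is a hypothetical function returning the most probable next state, `k` and `t` stand for the translation functions of the user-interpretable MDP, and since it is not stated here which direction of the relative complement is intended, the sketch simply reports both.

```python
from typing import Callable, Dict, Hashable, List, Set, Tuple

Trajectory = List[Tuple[Hashable, str]]


def simulate(start: Hashable, policy: Callable, step: Callable, horizon: int) -> Trajectory:
    """Roll out a policy, following the most probable transition at each step."""
    trajectory, state = [], start
    for _ in range(horizon):
        action = policy(state)
        trajectory.append((state, action))
        state = step(state, action)  # assumed to return the most probable next state
    return trajectory


def outcomes_along(trajectory: Trajectory, k: Callable, t: Callable) -> Set[str]:
    """Collect the most probable outcome label for every step of a trajectory."""
    seen = set()
    for state, action in trajectory:
        distribution = t(k(state), action)
        if distribution:
            seen.add(max(distribution, key=distribution.get))
    return seen


def contrast(fact_outcomes: Set[str], foil_outcomes: Set[str]) -> Dict[str, Set[str]]:
    """Relative complements in both directions between fact and foil outcomes."""
    return {
        "only_fact": fact_outcomes - foil_outcomes,
        "only_foil": foil_outcomes - fact_outcomes,
    }
```

The two sets returned by `contrast` could then be slotted into a sentence template, for example listing what the agent's own policy leads to that the user's suggestion would not, and vice versa.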

The paper does not present a template to actually generate this kind of counterfactual explanation.

Experimental Setup

The authors conduct a user study ($N = 82$) in which explanations generated using this method were presented to the subjects. The explanations were presented in pairs, and users were asked to indicate which explanation they preferred and why. They were also asked to rate which properties of the explanations (such as length, information content, and whether the explanation was about a single action or the whole policy) they valued when making the comparison.

The authors conducted this study using a modified grid-world environment that is only briefly described in the paper. They do not describe how a suitable RL agent for the environment was obtained.

Results and Analysis

A majority of participants preferred explanations that address strategy and policy, and that provide ample information.

Exact numbers are not provided, and no tests are reported to determine whether the differences between the groups that chose complementary features are statistically significant.


The paper builds on existing work by Hayes & Shah (2017). The authors identify a flaw in the explanations generated using the method in Hayes & Shah (2017): they only explain what the agent does, and not why it does it. Specifically, that method does not explain the correlations between states and policy in terms of rewards and state transitions. The method in this paper does so by using the action outcomes mapping, which provides a description of the effects of actions in a human-interpretable manner.


The method presented in this paper requires learning a Q-function, equivalent to training a new agent, just to answer a user query. This seems intractable for anything beyond toy problems. A computational evaluation of the method would have helped alleviate concerns.

The paper runs into the same issues as Hayes & Shah (2017) regarding the hand-craftedness of the $C$ and $O$ sets.

This paper also lacked clarity in its presentation for me. Providing a clear algorithm would have helped figure out what the steps are.

The explanation quality depends a lot on how accurate the learned transition model used to simulate trajectories is. If it is inaccurate, the explanations will suffer (and it would be interesting to see how user satisfaction with the explanation changes with the quality of the learned transition model).

I want to try replacing the foil policy described in this method with a simple policy that follows the original agent policy and carries out the user-specified foil actions wherever applicable (no learning $Q_I$ with an alternate reward function). I am curious to see how the explanation from that fares against the explanations generated by this method in terms of explanation quality. I like the idea of using the simulation of “what will happen” if a certain action is taken as an explanation.
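A minimal sketch of that simpler foil policy, assuming states are hashable and the user's query has already been reduced to a hypothetical `foil_actions` mapping from state to foil action:

```python
from typing import Callable, Dict, Hashable


def override_policy(
    base_policy: Callable[[Hashable], str],
    foil_actions: Dict[Hashable, str],
) -> Callable[[Hashable], str]:
    """Follow the original policy, except in states where the user named a foil action."""
    def pi(state: Hashable) -> str:
        if state in foil_actions:
            return foil_actions[state]
        return base_policy(state)
    return pi
```

Unlike the learned $Q_f$, this policy deviates only in the exact states named in the query and immediately returns to the original behaviour afterwards, which is the trade-off such a comparison would probe.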

This method is not easily transferred to another environment due to the requirement of the $d$ function used to measure distance between states. I imagine a new one must be constructed for every environment, and it may not be clear what a good measure is either. A demonstration of this technique with a different environment would have helped.