Overview#

We describe the problem setup of Off-Policy Evaluation (OPE) and Selection (OPS).

Off-Policy Evaluation#

We consider a general reinforcement learning setup, which is formalized as a Markov Decision Process (MDP) \(\langle \mathcal{S}, \mathcal{A}, \mathcal{T}, P_r, \gamma \rangle\). \(\mathcal{S}\) is the state space and \(\mathcal{A}\) is the action space, which is either discrete or continuous. \(\mathcal{T}: \mathcal{S} \times \mathcal{A} \rightarrow \mathcal{P}(\mathcal{S})\) is the state transition probability, where \(\mathcal{T}(s' | s,a)\) is the probability of observing state \(s'\) after taking action \(a\) in state \(s\). \(P_r: \mathcal{S} \times \mathcal{A} \times \mathbb{R} \rightarrow [0,1]\) is the probability distribution of the immediate reward. Given \(P_r\), \(R: \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}\) is the expected reward function, where \(R(s,a) := \mathbb{E}_{r \sim P_r (r | s, a)}[r]\) is the expected reward when taking action \(a\) in state \(s\). We also let \(\gamma \in (0,1]\) be a discount factor. Finally, \(\pi: \mathcal{S} \rightarrow \mathcal{P}(\mathcal{A})\) denotes a policy, where \(\pi(a| s)\) is the probability of taking action \(a\) in state \(s\). We also denote the initial state distribution by \(d_0\).

Off-Policy Evaluation and Selection (OPE/OPS) in the process of offline RL

In OPE/OPS, we are given a logged dataset \(\mathcal{D}\) consisting of \(n\) trajectories, each of which is generated by a behavior policy \(\pi_b\) as follows.

\[\tau := \{ (s_t, a_t, s_{t+1}, r_t) \}_{t=0}^{T-1} \sim d_0(s_0) \prod_{t=0}^{T-1} \pi_b(a_t | s_t) \mathcal{T}(s_{t+1} | s_t, a_t) P_r (r_t | s_t, a_t)\]
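The data-generating process above can be sketched on a toy tabular MDP. This is only an illustration under stated assumptions: the function name `sample_trajectory`, the array-based representation of \(d_0\), \(\pi_b\), and \(\mathcal{T}\), the Gaussian reward noise, and the fixed number of steps `horizon` are all hypothetical choices, not part of the library.

```python
import numpy as np

def sample_trajectory(rng, d0, pi_b, trans, reward_mean, horizon):
    """Roll out one trajectory of (s_t, a_t, s_{t+1}, r_t) tuples under pi_b."""
    n_states, n_actions = pi_b.shape
    s = rng.choice(n_states, p=d0)                    # s_0 ~ d_0
    trajectory = []
    for _ in range(horizon):
        a = rng.choice(n_actions, p=pi_b[s])          # a_t ~ pi_b(. | s_t)
        s_next = rng.choice(n_states, p=trans[s, a])  # s_{t+1} ~ T(. | s_t, a_t)
        r = rng.normal(reward_mean[s, a], 1.0)        # r_t ~ P_r(. | s_t, a_t), Gaussian by assumption
        trajectory.append((s, a, s_next, r))
        s = s_next
    return trajectory

rng = np.random.default_rng(0)
n_states, n_actions, horizon = 3, 2, 5
d0 = np.array([1.0, 0.0, 0.0])                        # always start in state 0
pi_b = np.full((n_states, n_actions), 0.5)            # uniform behavior policy
trans = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
reward_mean = rng.normal(size=(n_states, n_actions))

# Logged dataset D: n = 100 trajectories collected by the behavior policy.
logged_dataset = [
    sample_trajectory(rng, d0, pi_b, trans, reward_mean, horizon)
    for _ in range(100)
]
```

Each element of `logged_dataset` is one trajectory \(\tau\); the next state of every step is the current state of the following step.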

Our goal is to leverage the logged dataset to accurately evaluate the performance of evaluation policies (Off-Policy Evaluation; OPE) and to select the best policy among the candidates based on the OPE results (Off-Policy Selection; OPS).

Policy Value Estimation#

In basic OPE, we aim to evaluate the policy value, i.e., the expected trajectory-wise reward, of the given evaluation policy \(\pi\):

\[J(\pi) := \mathbb{E}_{\tau} \left [ \sum_{t=0}^{T-1} \gamma^t r_{t} \mid \pi \right ].\]
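As a point of reference, when trajectories collected by \(\pi\) itself are available, the expectation in \(J(\pi)\) can be approximated by averaging discounted returns. The sketch below assumes each trajectory is given simply as a list of rewards; the helper names are hypothetical.

```python
import numpy as np

def discounted_return(rewards, gamma):
    """Trajectory-wise reward: sum_t gamma^t * r_t."""
    rewards = np.asarray(rewards, dtype=float)
    discounts = gamma ** np.arange(len(rewards))
    return float(np.sum(discounts * rewards))

def on_policy_value(reward_sequences, gamma):
    """Monte Carlo estimate of J(pi) from trajectories collected by pi itself."""
    return float(np.mean([discounted_return(r, gamma) for r in reward_sequences]))

# Two short trajectories with three rewards each.
reward_sequences = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0]]
value = on_policy_value(reward_sequences, gamma=0.5)
# Returns are 1 + 0 + 0.25 * 2 = 1.5 and 0 + 0.5 + 0.25 = 0.75, so value = 1.125.
```

In OPE this on-policy average is not available, because the logged trajectories come from \(\pi_b\) rather than \(\pi\).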

Estimating the policy value before deploying the policy in an online environment is beneficial, as it reduces the cost and risk of online evaluation. The challenge, however, is that we need to answer the counterfactual question "What if a new policy chooses a different action from that of the behavior policy?" by dealing with the distribution shift between \(\pi_b\) and \(\pi\).
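One standard way to correct for this distribution shift is trajectory-wise importance sampling, which reweights each logged return by the product of the ratios \(\pi(a_t | s_t) / \pi_b(a_t | s_t)\). The following is a generic sketch of that technique, not necessarily how the library implements its estimators; tabular policies stored as arrays and trajectories stored as `(s, a, r)` tuples are assumptions made for illustration.

```python
import numpy as np

def trajectory_wise_is(trajectories, pi_e, pi_b, gamma):
    """Trajectory-wise importance sampling estimate of J(pi_e).

    trajectories: list of trajectories, each a list of (s, a, r) tuples logged by pi_b.
    pi_e, pi_b: arrays of shape [n_states, n_actions] holding action probabilities.
    """
    estimates = []
    for traj in trajectories:
        weight = 1.0   # product of importance ratios over the trajectory
        ret = 0.0      # discounted trajectory-wise reward
        for t, (s, a, r) in enumerate(traj):
            weight *= pi_e[s, a] / pi_b[s, a]
            ret += gamma ** t * r
        estimates.append(weight * ret)
    return float(np.mean(estimates))

# One state, two actions: action 0 yields reward 1, action 1 yields reward 0.
pi_b = np.array([[0.5, 0.5]])
pi_e = np.array([[0.9, 0.1]])
logged = [[(0, 0, 1.0)], [(0, 1, 0.0)]]
estimate = trajectory_wise_is(logged, pi_e, pi_b, gamma=1.0)
# Weighted returns are 1.8 * 1 and 0.2 * 0, so the estimate is 0.9,
# which matches the true value of pi_e in this toy problem.
```

The estimate requires \(\pi_b(a | s) > 0\) wherever \(\pi(a | s) > 0\), and the product of ratios can have high variance for long trajectories.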

We discuss the properties of various OPE estimators together with their implementation details in Supported OPE estimators.

See also

Supported OPE estimators and their API reference
(advanced) Marginal OPE estimators and their API reference
Quickstart and related example codes

Cumulative Distribution and Risk Function Estimation#

In practical situations, we are sometimes more interested in risk functions such as conditional value at risk (CVaR) and interquartile range than in the expectation of the trajectory-wise reward. To derive these risk functions, we first estimate the following cumulative distribution function.

\[F(m, \pi) := \mathbb{E} \left[ \mathbb{I} \left \{ \sum_{t=0}^{T-1} \gamma^t r_t \leq m \right \} \mid \pi \right]\]

Then, we can derive various risk functions based on \(F(\cdot)\) as follows.

Mean: \(\mu(F) := \int_{G} G \, \mathrm{d}F(G)\)
Variance: \(\sigma^2(F) := \int_{G} (G - \mu(F))^2 \, \mathrm{d}F(G)\)
\(\alpha\)-quantile: \(Q^{\alpha}(F) := \min \{ G \mid F(G) \geq \alpha \}\)
Conditional Value at Risk (CVaR): \(\int_{G} G \, \mathbb{I}\{ G \leq Q^{\alpha}(F) \} \, \mathrm{d}F(G)\)

where we let \(G := \sum_{t=0}^{T-1} \gamma^t r_t\) represent the trajectory-wise reward as a random variable and \(\mathrm{d}F(G) := \lim_{\Delta \rightarrow 0} \left( F(G) - F(G - \Delta) \right)\).
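To make these definitions concrete, the sketch below evaluates them on an empirical distribution, where each sampled trajectory-wise reward carries probability mass \(1/n\). It assumes the samples already follow the distribution of interest (no off-policy correction) and takes the \(\alpha\)-quantile as the smallest sample whose empirical CDF reaches \(\alpha\). The function and key names are hypothetical. The `lower_tail` entry follows the CVaR expression above as written; CVaR is also commonly reported after dividing this integral by \(\alpha\).

```python
import numpy as np

def empirical_cdf(returns, m):
    """F(m): fraction of trajectory-wise rewards that are <= m."""
    return float(np.mean(np.asarray(returns, dtype=float) <= m))

def risk_functions(returns, alpha):
    """Mean, variance, alpha-quantile, and lower-tail integral of an empirical CDF."""
    g = np.sort(np.asarray(returns, dtype=float))
    n = len(g)
    cdf = np.arange(1, n + 1) / n                  # CDF level reached at each sorted sample
    mean = g.mean()
    variance = ((g - mean) ** 2).mean()
    quantile = g[np.searchsorted(cdf, alpha)]      # smallest G with F(G) >= alpha
    lower_tail = g[g <= quantile].sum() / n        # integral of G * 1{G <= Q^alpha} dF(G)
    return {
        "mean": float(mean),
        "variance": float(variance),
        "quantile": float(quantile),
        "lower_tail": float(lower_tail),
    }

stats = risk_functions([4.0, 1.0, 3.0, 2.0], alpha=0.5)
# mean = 2.5, variance = 1.25, quantile = 2.0, lower_tail = (1 + 2) / 4 = 0.75
```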

We also discuss the properties of various cumulative distribution OPE estimators together with their implementation details in Supported OPE estimators.

Off-Policy Selection#

Finally, OPS aims to select the best policy among several candidates as follows.

\[\hat{\pi} := {\arg \max}_{\pi \in \Pi} \hat{J}(\pi)\]

where \(\hat{J}(\cdot)\) is the OPE estimate of the policy value, which can be replaced by other metrics such as CVaR.
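Once an estimate is available for every candidate, the selection step itself is a plain arg max. A minimal sketch, assuming the estimates are held in a dictionary keyed by hypothetical policy names with made-up values:

```python
def select_policy(estimates):
    """Return the candidate with the largest estimated value (arg max over Pi)."""
    return max(estimates, key=estimates.get)

def rank_policies(estimates):
    """Return all candidates ordered from best to worst estimated value."""
    return sorted(estimates, key=estimates.get, reverse=True)

# Illustrative estimates only; in practice these come from an OPE estimator.
ope_estimates = {"policy_a": 1.2, "policy_b": 0.7, "policy_c": 1.5}
best = select_policy(ope_estimates)      # "policy_c"
ranking = rank_policies(ope_estimates)   # ["policy_c", "policy_a", "policy_b"]
```

Because the selection depends only on the estimates, any error in \(\hat{J}(\cdot)\) that reorders the candidates directly changes which policy is chosen.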

In OPS, how well the ranking of the candidate policies is preserved and how safe the chosen policy is matter as much as the accuracy of OPE. On the next page, we review conventional evaluation metrics of OPE/OPS and describe the risk-return tradeoff metrics of top-\(k\) policy selection. We also feature SharpeRatio@k, the main contribution of our research paper “Towards Assessing and Benchmarking Risk-Return Tradeoff of Off-Policy Evaluation in Reinforcement Learning”, on this page.

See also

For further theoretical properties of OPE estimators, we refer readers to a survey paper [28]. awesome-offline-rl also provides a comprehensive list of literature.

See also

Overview (online/offline RL) describes the problem setting of the policy learning (offline RL) part.
