Supported Implementation#

Create OPE Input#

Before proceeding to OPE/OPS, we first create input_dict, which streamlines the subsequent implementation.

# create input for OPE class
from scope_rl.ope import CreateOPEInput
prep = CreateOPEInput(
    env=env,
)
input_dict = prep.obtain_whole_inputs(
    logged_dataset=logged_dataset,
    evaluation_policies=evaluation_policies,
    require_value_prediction=True,  # use model-based prediction
    n_trajectories_on_policy_evaluation=100,
    random_state=random_state,
)

Tip

How to create input_dict for multiple logged datasets?

When obtaining input_dict from the same evaluation policies across multiple datasets, try the following command.

multiple_input_dict = prep.obtain_whole_inputs(
    logged_dataset=logged_dataset,            # MultipleLoggedDataset
    evaluation_policies=evaluation_policies,  # single list
    ...,
)

When obtaining input_dict from different evaluation policies for each logged dataset, try the following command.

multiple_input_dict = prep.obtain_whole_inputs(
    logged_dataset=logged_dataset,                                 # MultipleLoggedDataset (two logged datasets in this case)
    evaluation_policies=evaluation_policies,                       # nested list or dict with the same keys as the logged datasets
    ...,
)

In both cases, MultipleInputDict will be returned.

MultipleInputDict saves the path to each input_dict and makes it accessible through the following command.

input_dict_ = multiple_input_dict.get(behavior_policy_name=behavior_policy.name, dataset_id=0)

How to select models for value/weight learning methods?

To enable value prediction (for model-based estimators) and weight prediction (for marginal estimators), set the following arguments to True.

input_dict = prep.obtain_whole_inputs(
    ...,
    require_value_prediction=True,
    require_weight_prediction=True,
    ...,
)

Then, we can customize the choice of weight and value functions using the following arguments.

input_dict = prep.obtain_whole_inputs(
    ...,
    q_function_method="fqe",   # one of {"fqe", "dice", "mql"}, default="fqe"
    v_function_method="fqe",   # one of {"fqe", "dice_q", "dice_v", "mql", "mvl"}, default="fqe"
    w_function_method="dice",  # one of {"dice", "mwl"}, default="dice"
    ...,
)

To further customize the models, please specify model_args when initializing CreateOPEInput as follows.

from d3rlpy.models.encoders import VectorEncoderFactory
from d3rlpy.models.q_functions import MeanQFunctionFactory

prep = CreateOPEInput(
    env=env,
    model_args={
        "fqe": {
            "encoder_factory": VectorEncoderFactory(hidden_units=[30, 30]),
            "q_func_factory": MeanQFunctionFactory(),
            "learning_rate": 1e-4,
        },
        "state_action_dual" : {  # "dice"
            "method": "dual_dice",
        },
        "state_action_value": {  # "mql"
            "batch_size": 64,
            "lr": 1e-4,
        },
    }
)

where the keys of model_args are the following.

key: [
    "fqe",                  # fqe
    "state_action_dual",    # dice_q
    "state_action_value",   # mql
    "state_action_weight",  # mwl
    "state_dual",           # dice_v
    "state_value",          # mvl
    "state_weight",         # mwl
    "hidden_dim",           # hidden dim of value/weight function, except FQE
]

How to collect input_dict in a non-episodic setting?

When the goal is to evaluate the policy under a stationary distribution (\(d^{\pi}(s)\)) rather than in an episodic setting (e.g., cartpole or taxi used in [12, 13]), we need to (re-)collect initial states from the evaluation policy's stationary distribution.

In this case, please turn on the following options.

input_dict = prep.obtain_whole_inputs(
    ...,
    resample_initial_state=True,
    use_stationary_distribution_on_policy_evaluation=True,  # when env is provided
    ...,
)

See also

Supported Implementation (learning) describes how to obtain logged_dataset using a behavior policy in detail.

Basic Off-Policy Evaluation (OPE)#

The goal of (basic) OPE is to evaluate the following expected trajectory-wise reward of a policy (referred to as policy value).

\[J(\pi) := \mathbb{E}_{\tau} \left [ \sum_{t=0}^{T-1} \gamma^t r_{t} \mid \pi \right ],\]

where \(\pi\) is the (evaluation) policy, \(\tau\) is the trajectory observed by the evaluation policy, and \(r_t\) is the immediate reward at each timestep. (Please refer to the problem setup for additional notations.)

Here, we describe the class for conducting OPE and the implemented OPE estimators for estimating the policy value. We begin with the OffPolicyEvaluation class to streamline the OPE procedure.

# initialize the OPE class
from scope_rl.ope import OffPolicyEvaluation as OPE
ope = OPE(
    logged_dataset=logged_dataset,
    ope_estimators=[DM(), TIS(), PDIS(), DR()],
)

Using the OPE class, we can obtain the OPE results of various estimators at once as follows.

ope_dict = ope.estimate_policy_value(input_dict)

Tip

How to conduct OPE with multiple logged datasets?

Conducting OPE with multiple logged datasets requires no additional effort.

First, the same command as in the single logged dataset case also works with multiple logged datasets.

ope = OPE(
    logged_dataset=logged_dataset,  # MultipleLoggedDataset
    ope_estimators=[DM(), TIS(), PDIS(), DR()],
)
multiple_ope_dict = ope.estimate_policy_value(
    input_dict,  # MultipleInputDict
)

The returned value is a dictionary containing the OPE results.

In addition, we can specify which logged dataset and input_dict to use by setting behavior_policy_name and dataset_id.

multiple_ope_dict = ope.estimate_policy_value(
    input_dict,
    behavior_policy_name=behavior_policy.name,
    dataset_id=0,  # specify which logged dataset and input_dict to use
)

The basic visualization function also works by specifying the dataset id.

ope.visualize_off_policy_estimates(
    input_dict,
    behavior_policy_name=behavior_policy.name,
    dataset_id=0,
    ...,
)
[Figure: policy value estimated with the specified dataset]

Moreover, we provide additional visualization functions for the case with multiple logged datasets.

ope.visualize_policy_value_with_multiple_estimates(
    input_dict,      # MultipleInputDict
    behavior_policy_name=None,                   # compare estimators with multiple behavior policies
    # behavior_policy_name=behavior_policy.name  # compare estimators with a single behavior policy
    plot_type="ci",  # one of {"ci", "violin", "scatter"}, default="ci"
    ...,
)

When the plot_type is “ci”, the plot is somewhat similar to the basic visualization. (The star indicates the ground-truth policy value, and the confidence intervals are derived from the multiple estimates across datasets.)

[Figure: policy value estimated with the multiple datasets]

When the plot_type is “violin”, the plot visualizes the distribution of multiple estimates. This is particularly useful to see how the estimation result can vary depending on different datasets or random seeds.

[Figure: policy value estimated with the multiple datasets (violin)]

Finally, when the plot_type is “scatter”, the plot visualizes each estimation with its color specifying the behavior policy. This function is particularly useful to see how the choice of behavior policy (e.g., their stochasticity) affects the estimation result.

[Figure: policy value estimated with the multiple datasets (scatter)]

The OPE class implements the following functions.

(OPE)

  • estimate_policy_value

  • estimate_intervals

  • summarize_off_policy_estimates

(Evaluation of OPE estimators)

  • evaluate_performance_of_ope_estimators

(Visualization)

  • visualize_off_policy_estimates

(Visualization with multiple estimates on multiple logged datasets)

  • visualize_policy_value_with_multiple_estimates

Below, we describe the implemented OPE estimators.

Standard OPE estimators

Direct Method (DM)

Trajectory-wise Importance Sampling (TIS)

Per-Decision Importance Sampling (PDIS)

Doubly Robust (DR)

Self-Normalized estimators

Marginal OPE estimators

State Marginal estimators

State-Action Marginal estimators

Double Reinforcement Learning

Spectrum of Off-Policy Evaluation

Extensions

High Confidence Off-Policy Evaluation

Extension to the continuous action space

Tip

How to define my own OPE estimator?

To define your own OPE estimator, use BaseOffPolicyEstimator.

Basically, the common inputs for each function are the following keys from logged_dataset and input_dict.

(logged_dataset)

key: [
    size,
    step_per_trajectory,
    action,
    reward,
    pscore,
]

(input_dict)

key: [
    evaluation_policy_action,
    evaluation_policy_action_dist,
    state_action_value_prediction,
    initial_state_value_prediction,
    state_action_marginal_importance_weight,
    state_marginal_importance_weight,
    on_policy_policy_value,
    gamma,
]

n_step_pdis is also applicable to marginal estimators, and action_scaler and sigma are added in the continuous-action case.

If you want to add other arguments, please add them to the initialization arguments for API consistency.

Finally, contributions to SCOPE-RL with a new OPE estimator are more than welcome! Please read the guidelines for contribution (CONTRIBUTING.md).

Direct Method (DM)#

DM [3] is a model-based approach that uses the initial state value (estimated by e.g., Fitted Q Evaluation (FQE) [4]). It first learns the Q-function and then leverages the learned Q-function as follows.

\[\hat{J}_{\mathrm{DM}} (\pi; \mathcal{D}) := \frac{1}{n} \sum_{i=1}^n \sum_{a \in \mathcal{A}} \pi(a | s_{0}^{(i)}) \hat{Q}(s_{0}^{(i)}, a) = \frac{1}{n} \sum_{i=1}^n \hat{V}(s_{0}^{(i)}),\]

where \(\mathcal{D}=\{\{(s_t^{(i)}, a_t^{(i)}, r_t^{(i)})\}_{t=0}^{T-1}\}_{i=1}^n\) is the logged dataset with \(n\) trajectories and \(T\) is the number of steps per trajectory. \(\hat{Q}(s_t, a_t)\) is the estimated state-action value and \(\hat{V}(s_t)\) is the estimated state value.

DM has low variance compared to other estimators, but it can produce a larger bias due to approximation errors.

  • DirectMethod

Note

We use the implementation of FQE provided by d3rlpy.

Trajectory-wise Importance Sampling (TIS)#

TIS [5] uses the importance sampling technique to correct the distribution shift between \(\pi\) and \(\pi_0\) as follows.

\[\hat{J}_{\mathrm{TIS}} (\pi; \mathcal{D}) := \frac{1}{n} \sum_{i=1}^n \sum_{t=0}^{T-1} \gamma^t w_{0:T-1}^{(i)} r_t^{(i)},\]

where \(w_{0:T-1} := \prod_{t=0}^{T-1} (\pi(a_t | s_t) / \pi_0(a_t | s_t))\) is the trajectory-wise importance weight.

TIS enables an unbiased estimation of the policy value. However, when the trajectory length \(T\) is large, TIS suffers from high variance due to the product of importance weights over the entire horizon.
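To make the formula concrete, the following is a minimal, standalone NumPy sketch of TIS (independent of the SCOPE-RL API); the array shapes and function name are illustrative assumptions.

import numpy as np

def trajectory_wise_is(ratio, reward, gamma=0.99):
    """Sketch of TIS: ratio[i, t] = pi(a_t|s_t) / pi_0(a_t|s_t) and reward[i, t] = r_t,
    both of shape (n, T)."""
    n, T = ratio.shape
    w_traj = ratio.prod(axis=1, keepdims=True)   # trajectory-wise weight w_{0:T-1}
    discount = gamma ** np.arange(T)             # gamma^t
    return np.mean((w_traj * discount * reward).sum(axis=1))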

  • TrajectoryWiseImportanceSampling

Per-Decision Importance Sampling (PDIS)#

PDIS [5] leverages the sequential nature of the MDP to reduce the variance of TIS. Specifically, since the immediate reward \(r_t\) depends only on the past states and actions (\(s_0, a_0, \ldots, s_t, a_t\)) and is independent of the future states and actions (\(s_{t+1}, a_{t+1}, \ldots, s_{T}, a_{T}\)), PDIS only considers the importance weights of the past interactions when estimating \(r_t\), as follows.

\[\hat{J}_{\mathrm{PDIS}} (\pi; \mathcal{D}) := \frac{1}{n} \sum_{i=1}^n \sum_{t=0}^{T-1} \gamma^t w_{0:t}^{(i)} r_t^{(i)},\]

where \(w_{0:t} := \prod_{t'=0}^t (\pi(a_{t'} | s_{t'}) / \pi_0(a_{t'} | s_{t'}))\) is the importance weight with respect to the actions taken up to timestep \(t\).

PDIS remains unbiased while reducing the variance of TIS. However, when \(t\) is large, PDIS still suffers from high variance.
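As a reference, here is a standalone NumPy sketch of PDIS under the same illustrative array conventions as the TIS sketch above; note that only the ratios up to timestep t are multiplied when weighting r_t.

import numpy as np

def per_decision_is(ratio, reward, gamma=0.99):
    """Sketch of PDIS: cumulative products of the step-wise ratios give w_{0:t}."""
    w_0_to_t = np.cumprod(ratio, axis=1)          # w_{0:t} for each timestep t
    discount = gamma ** np.arange(ratio.shape[1])
    return np.mean((w_0_to_t * discount * reward).sum(axis=1))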

  • PerDecisionImportanceSampling

Doubly Robust (DR)#

DR [6, 7] is a hybrid of model-based estimation and importance sampling. It introduces \(\hat{Q}\) as a baseline estimate in the recursive form of PDIS and applies importance weighting only to the residual.

\[\hat{J}_{\mathrm{DR}} (\pi; \mathcal{D}) := \frac{1}{n} \sum_{i=1}^n \sum_{t=0}^{T-1} \gamma^t \left(w_{0:t}^{(i)} (r_t^{(i)} - \hat{Q}(s_t^{(i)}, a_t^{(i)})) + w_{0:t-1}^{(i)} \sum_{a \in \mathcal{A}} \pi(a | s_t^{(i)}) \hat{Q}(s_t^{(i)}, a) \right),\]

DR is unbiased and has lower variance than PDIS when \(\hat{Q}(\cdot)\) is reasonably accurate to satisfy \(0 < \hat{Q}(\cdot) < 2 Q(\cdot)\). However, when the importance weight is quite large, it may still suffer from high variance.
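The following standalone NumPy sketch mirrors the DR formula above; q_hat and v_hat are assumed to be pre-computed arrays of \(\hat{Q}(s_t, a_t)\) and \(\sum_{a} \pi(a | s_t) \hat{Q}(s_t, a)\), respectively.

import numpy as np

def doubly_robust(ratio, reward, q_hat, v_hat, gamma=0.99):
    """Sketch of DR: all arrays have shape (n, T)."""
    w_0_to_t = np.cumprod(ratio, axis=1)              # w_{0:t}
    w_0_to_t_minus_1 = np.roll(w_0_to_t, 1, axis=1)   # w_{0:t-1}
    w_0_to_t_minus_1[:, 0] = 1.0                      # convention: w_{0:-1} = 1
    discount = gamma ** np.arange(ratio.shape[1])
    per_step = w_0_to_t * (reward - q_hat) + w_0_to_t_minus_1 * v_hat
    return np.mean((discount * per_step).sum(axis=1))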

  • DoublyRobust

Self-Normalized estimators#

Self-normalized estimators [11] aim to reduce the scale of the importance weights for variance reduction. Specifically, the self-normalized versions of PDIS and DR are defined as follows.

\[\hat{J}_{\mathrm{SNPDIS}} (\pi; \mathcal{D}) := \sum_{i=1}^n \sum_{t=0}^{T-1} \gamma^t \frac{w_{0:t}^{(i)}}{\sum_{i'=1}^n w_{0:t}^{(i')}} r_t^{(i)},\]
\[\hat{J}_{\mathrm{SNDR}} (\pi; \mathcal{D}) := \sum_{i=1}^n \sum_{t=0}^{T-1} \gamma^t \left(\frac{w_{0:t}^{(i)}}{\sum_{i'=1}^n w_{0:t}^{(i')}} (r_t^{(i)} - \hat{Q}(s_t^{(i)}, a_t^{(i)})) + \frac{w_{0:t-1}^{(i)}}{\sum_{i'=1}^n w_{0:t-1}^{(i')}} \sum_{a \in \mathcal{A}} \pi(a | s_t^{(i)}) \hat{Q}(s_t^{(i)}, a) \right),\]

More generally, self-normalized estimators replace the importance weight \(w_{\ast}\) as follows.

\[\tilde{w}_{\ast} := \frac{w_{\ast}}{\sum_{i=1}^n w_{\ast}}\]

where \(\tilde{w}_{\ast}\) is the self-normalized importance weight.

Self-normalized estimators are no longer unbiased, but they remain consistent and their variance is bounded by \(r_{\max}^2\).
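For concreteness, a standalone NumPy sketch of SNPDIS under the same illustrative array conventions is shown below; the only change from PDIS is that each weight is divided by its sum across trajectories.

import numpy as np

def self_normalized_pdis(ratio, reward, gamma=0.99):
    """Sketch of SNPDIS: w_{0:t} is normalized by its sum over the n trajectories."""
    w_0_to_t = np.cumprod(ratio, axis=1)                        # (n, T)
    w_tilde = w_0_to_t / w_0_to_t.sum(axis=0, keepdims=True)    # self-normalized weight
    discount = gamma ** np.arange(ratio.shape[1])
    return (w_tilde * discount * reward).sum()                  # no extra 1/n factor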

  • SelfNormalizedTIS

  • SelfNormalizedPDIS

  • SelfNormalizedDR

Marginalized Importance Sampling Estimators#

When the length of the trajectory (\(T\)) is large, even per-decision importance weights can be exponentially large in the latter part of the trajectory. To alleviate this, state marginal or state-action marginal importance weights can be used instead of the per-decision importance weight as follows [12, 13].

\[\begin{split}\rho(s, a) &:= d^{\pi}(s, a) / d^{\pi_0}(s, a) \\ \rho(s) &:= d^{\pi}(s) / d^{\pi_0}(s)\end{split}\]

\(d^{\pi}(s, a)\) and \(d^{\pi}(s)\) are the marginal visitation probabilities of the policy \(\pi\) on \((s, a)\) and \(s\), respectively. The use of marginal importance weights is particularly beneficial when the policy visits the same or similar states across different trajectories or timesteps (e.g., when the state transition looks like \(\cdots \rightarrow s_1 \rightarrow s_2 \rightarrow s_1 \rightarrow s_2 \rightarrow \cdots\), or when the trajectories always revisit a particular state as \(\cdots \rightarrow s_{*} \rightarrow s_{1} \rightarrow s_{*} \rightarrow \cdots\)). Then, State-Action Marginal Importance Sampling (SAM-IS) and State-Action Marginal Doubly Robust (SAM-DR) are defined as follows.

\[\hat{J}_{\mathrm{SAM-IS}} (\pi; \mathcal{D}) := \frac{1}{n} \sum_{i=1}^n \sum_{t=0}^{T-1} \gamma^t \rho(s_t^{(i)}, a_t^{(i)}) r_t^{(i)},\]
\[\begin{split}\hat{J}_{\mathrm{SAM-DR}} (\pi; \mathcal{D}) &:= \frac{1}{n} \sum_{i=1}^n \sum_{a \in \mathcal{A}} \pi(a | s_0^{(i)}) \hat{Q}(s_0^{(i)}, a) \\ & \quad \quad + \frac{1}{n} \sum_{i=1}^n \sum_{t=0}^{T-1} \gamma^t \rho(s_t^{(i)}, a_t^{(i)}) \left(r_t^{(i)} + \gamma \sum_{a \in \mathcal{A}} \pi(a | s_{t+1}^{(i)}) \hat{Q}(s_{t+1}^{(i)}, a) - \hat{Q}(s_t^{(i)}, a_t^{(i)}) \right),\end{split}\]

Similarly, State Marginal Importance Sampling (SM-IS) and State Marginal Doubly Robust (SM-DR) are defined as follows.

\[\hat{J}_{\mathrm{SM-IS}} (\pi; \mathcal{D}) := \frac{1}{n} \sum_{i=1}^n \sum_{t=0}^{T-1} \gamma^t \rho(s_t^{(i)}) w_t(s_t^{(i)}, a_t^{(i)}) r_t^{(i)},\]
\[\begin{split}\hat{J}_{\mathrm{SM-DR}} (\pi; \mathcal{D}) &:= \frac{1}{n} \sum_{i=1}^n \sum_{a \in \mathcal{A}} \pi(a | s_0^{(i)}) \hat{Q}(s_0^{(i)}, a) \\ & \quad \quad + \frac{1}{n} \sum_{i=1}^n \sum_{t=0}^{T-1} \gamma^t \rho(s_t^{(i)}) w_t(s_t^{(i)}, a_t^{(i)}) \left(r_t^{(i)} + \gamma \sum_{a \in \mathcal{A}} \pi(a | s_{t+1}^{(i)}) \hat{Q}(s_{t+1}^{(i)}, a) - \hat{Q}(s_t^{(i)}, a_t^{(i)}) \right),\end{split}\]

where \(w_t(s_t, a_t) := \pi(a_t | s_t) / \pi_0(a_t | s_t)\) is the immediate importance weight at timestep \(t\).
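Assuming the marginal importance weights have already been estimated (e.g., via the weight-learning methods in the tip below), SAM-IS reduces to a simple weighted sum; here is a standalone NumPy sketch with illustrative array shapes.

import numpy as np

def state_action_marginal_is(rho_sa, reward, gamma=0.99):
    """Sketch of SAM-IS: rho_sa[i, t] = d^pi(s_t, a_t) / d^pi0(s_t, a_t), reward[i, t] = r_t."""
    discount = gamma ** np.arange(reward.shape[1])
    return np.mean((rho_sa * discount * reward).sum(axis=1))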

Tip

How to obtain state(-action) marginal importance weight?

To use marginalized importance sampling estimators, we need to first estimate the state marginal or state-action marginal importance weight. A dominant way to do this is to leverage the following relationship between the importance weights and the state-action value function under the assumption that the state visitation probability is consistent across various timesteps [12].

\[\begin{split}&\mathbb{E}_{(s, a, r, s') \sim \mathcal{D_{\pi_0}}}[w(s, a) r] \\ &= \mathbb{E}_{(s, a, r, s') \sim \mathcal{D_{\pi_0}}}[w(s, a)(Q_{\pi}(s, a) - \gamma \mathbb{E}_{a' \sim \pi(a' | s')}[Q(s', a')])] \\ &= (1 - \gamma) \mathbb{E}_{s_0 \sim d^{\pi}(s_0), a_0 \sim \pi(a_0 | s_0)}[Q_{\pi}(s_0, a_0)]\end{split}\]

The objective of weight learning is to minimize the difference between the middle term and the last term of the above equation while the Q-function adversarially maximizes that difference. In particular, we provide the following algorithms to estimate the state marginal and state-action marginal importance weights (and the corresponding state-action value function) via minimax learning.

  • Augmented Lagrangian Method (ALM/DICE) [16]:

    This method simultaneously optimizes both \(w(s, a)\) and \(Q(s, a)\). By setting different hyperparameters, ALM recovers BestDICE [16], DualDICE [20], GenDICE [18], AlgaeDICE [19], and MQL/MWL [12] as special cases.

  • Minimax Q-Learning and Weight Learning (MQL/MWL) [12]:

    This method assumes that one of the value function and the weight function is expressed by a function class in a reproducing kernel Hilbert space (RKHS), and explicitly optimizes only the other.

We implement the state marginal and state-action marginal OPE estimators in the following classes (for both discrete and continuous action spaces).

(State Marginal Estimators)

  • StateMarginalDM

  • StateMarginalIS

  • StateMarginalDR

  • StateMarginalSNIS

  • StateMarginalSNDR

(State-Action Marginal Estimators)

  • StateActionMarginalIS

  • StateActionMarginalDR

  • StateActionMarginalSNIS

  • StateActionMarginalSNDR

Double Reinforcement Learning (DRL)#

Comparing DR in standard and marginal OPE, we notice that their formulations differ slightly, as shown below.

(DR in standard OPE)

\[\hat{J}_{\mathrm{DR}} (\pi; \mathcal{D}) := \frac{1}{n} \sum_{i=1}^n \sum_{t=0}^{T-1} \gamma^t \left( w_{0:t}^{(i)} (r_t^{(i)} - \hat{Q}(s_t^{(i)}, a_t^{(i)})) + w_{0:t-1}^{(i)} \sum_{a \in \mathcal{A}} \pi(a | s_t^{(i)}) \hat{Q}(s_t^{(i)}, a) \right),\]

(DR in marginal OPE)

\[\begin{split}\hat{J}_{\mathrm{SAM-DR}} (\pi; \mathcal{D}) &:= \frac{1}{n} \sum_{i=1}^n \sum_{a \in \mathcal{A}} \pi(a | s_0^{(i)}) \hat{Q}(s_0^{(i)}, a) \\ & \quad \quad + \frac{1}{n} \sum_{i=1}^n \sum_{t=0}^{T-1} \gamma^t \rho(s_t^{(i)}, a_t^{(i)}) \left(r_t^{(i)} + \gamma \sum_{a \in \mathcal{A}} \pi(a | s_{t+1}^{(i)}) \hat{Q}(s_{t+1}^{(i)}, a) - \hat{Q}(s_t^{(i)}, a_t^{(i)}) \right),\end{split}\]

A natural question then arises: would it be possible to use the marginal importance weight in the standard DR formulation?

DRL [15] leverages the marginal importance sampling in the standard OPE formulation as follows.

\[\begin{split}\hat{J}_{\mathrm{DRL}} (\pi; \mathcal{D}) & := \frac{1}{n} \sum_{k=1}^K \sum_{i=1}^{n_k} \sum_{t=0}^{T-1} (\rho^k(s_{t}^{(i)}, a_{t}^{(i)}) (r_{t}^{(i)} - Q^k(s_{t}^{(i)}, a_{t}^{(i)})) \\ & \quad \quad + \rho^k(s_{t-1}^{(i)}, a_{t-1}^{(i)}) \sum_{a \in \mathcal{A}} \pi(a | s_t^{(i)}) Q^k(s_{t}^{(i)}, a))\end{split}\]

DRL achieves the semiparametric efficiency bound with a consistent value predictor \(Q\). Therefore, to alleviate the potential bias introduced in \(Q\), DRL uses the “cross-fitting” technique to estimate the value function. Specifically, let \(K\) be the number of folds and \(\mathcal{D}_k\) be the \(k\)-th split of the logged data consisting of \(n_k\) trajectories. Cross-fitting trains \(\rho^k\) and \(Q^k\) on the data excluding the \(k\)-th fold, i.e., \(\mathcal{D} \setminus \mathcal{D}_k\), and then uses them to form the estimate on \(\mathcal{D}_k\).
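The following standalone sketch illustrates the cross-fitting split described above; fitting \(Q^k\) (and \(\rho^k\)) is left abstract, since any value/weight learning routine (e.g., FQE) can be plugged in.

import numpy as np

def cross_fitting_splits(n_trajectories, k_fold=3, random_state=12345):
    """Yield (fold id, training indices, evaluation indices): Q^k and rho^k are trained
    on the complement of D_k and used only to form the estimate on D_k."""
    rng = np.random.default_rng(random_state)
    folds = np.array_split(rng.permutation(n_trajectories), k_fold)
    for k, eval_idx in enumerate(folds):
        train_idx = np.concatenate([fold for j, fold in enumerate(folds) if j != k])
        yield k, train_idx, eval_idx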

  • DoubleReinforcementLearning

Tip

How to obtain Q-hat via cross-fitting?

To obtain \(\hat{Q}\) via cross-fitting, please specify the k_fold argument of obtain_whole_inputs in CreateOPEInput.

prep = CreateOPEInput(
    env=env,
)
input_dict = prep.obtain_whole_inputs(
    logged_dataset=logged_dataset,
    evaluation_policies=evaluation_policies,
    require_value_prediction=True,  # use model-based prediction
    k_fold=3,                       # use 3-fold cross-fitting
    random_state=random_state,
)

The default k_fold=1 trains \(\hat{Q}\) and \(\hat{w}\) without cross-fitting.

Spectrum of Off-Policy Estimators (SOPE)#

While state marginal or state-action marginal importance weight effectively alleviates the variance of per-decision importance weight, the estimation error of marginal importance weights may introduce some bias in estimation. To alleviate this and control the bias-variance tradeoff more flexibly, SOPE uses the following interpolated importance weights [14].

\[\begin{split}w_{\mathrm{SOPE}}(s_t, a_t) &= \begin{cases} \prod_{t'=0}^{t} w_{t'}(s_{t'}, a_{t'}) & \mathrm{if} \, t < k \\ \rho(s_{t-k}, a_{t-k}) \prod_{t'=t-k+1}^{t} w_{t'}(s_{t'}, a_{t'}) & \mathrm{otherwise} \end{cases} \\ w_{\mathrm{SOPE}}(s_t, a_t) &= \begin{cases} \prod_{t'=0}^{t} w_{t'}(s_{t'}, a_{t'}) & \mathrm{if} \, t < k \\ \rho(s_{t-k}) \prod_{t'=t-k}^{t} w_{t'}(s_{t'}, a_{t'}) & \mathrm{otherwise} \end{cases}\end{split}\]

where SOPE uses per-decision importance weight \(w_t(s_t, a_t) := \pi(a_t | s_t) / \pi_0(a_t | s_t)\) for the \(k\) most recent timesteps.
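The interpolated weight above can be computed directly once the per-decision ratios and the (estimated) marginal weights are available; below is a standalone NumPy sketch of the state-action marginal version with illustrative array shapes.

import numpy as np

def sope_weights(ratio, rho_sa, k):
    """Sketch of the SOPE weight: ratio[i, t] is the per-decision ratio w_t and
    rho_sa[i, t] is the marginal weight rho(s_t, a_t); both have shape (n, T)."""
    n, T = ratio.shape
    w = np.empty_like(ratio)
    cum = np.cumprod(ratio, axis=1)                        # w_{0:t}
    for t in range(T):
        if t < k:
            w[:, t] = cum[:, t]                            # full per-decision weight
        else:
            recent = np.prod(ratio[:, t - k + 1 : t + 1], axis=1)
            w[:, t] = rho_sa[:, t - k] * recent            # marginal weight at t-k times the k most recent ratios
    return w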

For instance, the SOPE versions of State-Action Marginal Importance Sampling (SAM-IS) and State-Action Marginal Doubly Robust (SAM-DR) are defined as follows.

\[\hat{J}_{\mathrm{SOPE-SAM-IS}} (\pi; \mathcal{D}) := \frac{1}{n} \sum_{i=1}^n \sum_{t=0}^{k-1} \gamma^t w_{0:t}^{(i)} r_t^{(i)} + \frac{1}{n} \sum_{i=1}^n \sum_{t=k}^{T-1} \gamma^t \rho(s_{t-k}^{(i)}, a_{t-k}^{(i)}) w_{t-k+1:t}^{(i)} r_t^{(i)},\]
\[\begin{split}\hat{J}_{\mathrm{SOPE-SAM-DR}} (\pi; \mathcal{D}) &:= \frac{1}{n} \sum_{i=1}^n \sum_{a \in \mathcal{A}} \pi(a | s_0^{(i)}) \hat{Q}(s_0^{(i)}, a) \\ & \quad \quad + \frac{1}{n} \sum_{i=1}^n \sum_{t=0}^{k-1} \gamma^t w_{0:t}^{(i)} \left(r_t^{(i)} + \gamma \sum_{a \in \mathcal{A}} \pi(a | s_{t+1}^{(i)}) \hat{Q}(s_{t+1}^{(i)}, a) - \hat{Q}(s_t^{(i)}, a_t^{(i)}) \right) \\ & \quad \quad + \frac{1}{n} \sum_{i=1}^n \sum_{t=k}^{T-1} \gamma^t \rho(s_{t-k}^{(i)}, a_{t-k}^{(i)}) w_{t-k+1:t}^{(i)} \left(r_t^{(i)} + \gamma \sum_{a \in \mathcal{A}} \pi(a | s_{t+1}^{(i)}) \hat{Q}(s_{t+1}^{(i)}, a) - \hat{Q}(s_t^{(i)}, a_t^{(i)}) \right),\end{split}\]

Tip

How to change the spectrum of (marginal) OPE?

SOPE is available by specifying n_step_pdis in the state marginal and state-action marginal estimators.

ope = OPE(
    logged_dataset=logged_dataset,
    ope_estimators=[SMIS(), SMDR(), SAMIS(), SAMDR()],  # any marginal estimators
    n_step_pdis=5,  # number of recent timesteps using per-decision importance sampling
)
estimation_dict = ope.estimate_policy_value(
    input_dict,
)

n_step_pdis=0 is equivalent to the original marginal OPE estimators.

High Confidence Off-Policy Evaluation (HCOPE)#

To alleviate the risk of optimistically overestimating the policy value, we are sometimes interested in the confidence intervals and the lower bound of the estimated policy value. We implement four methods to estimate the confidence intervals [21, 22, 23].

  • Hoeffding [23]:

\[|\hat{J}(\pi; \mathcal{D}) - \mathbb{E}_{\mathcal{D}}[\hat{J}(\pi; \mathcal{D})]| \leq \hat{J}_{\max} \displaystyle \sqrt{\frac{\log(1 / \alpha)}{2 n}}.\]
  • Empirical Bernstein [21, 23]:

\[|\hat{J}(\pi; \mathcal{D}) - \mathbb{E}_{\mathcal{D}}[\hat{J}(\pi; \mathcal{D})]| \leq \displaystyle \frac{7 \hat{J}_{\max} \log(2 / \alpha)}{3 (n - 1)} + \displaystyle \sqrt{\frac{2 \hat{\mathbb{V}}_{\mathcal{D}}(\hat{J}) \log(2 / \alpha)}{(n - 1)}}.\]
  • Student T-test [21]:

\[|\hat{J}(\pi; \mathcal{D}) - \mathbb{E}_{\mathcal{D}}[\hat{J}(\pi; \mathcal{D})]| \leq \displaystyle \frac{T_{\mathrm{test}}(1 - \alpha, n-1)}{\sqrt{n} / \hat{\sigma}}.\]
  • Bootstrapping [21, 22]:

\[|\hat{J}(\pi; \mathcal{D}) - \mathbb{E}_{\mathcal{D}}[\hat{J}(\pi; \mathcal{D})]| \leq \mathrm{Bootstrap}(1 - \alpha).\]

Note that all the above bounds hold with probability \(1 - \alpha\). Regarding notation, \(\hat{\mathbb{V}}_{\mathcal{D}}(\cdot)\) denotes the sample variance, \(T_{\mathrm{test}}(\cdot,\cdot)\) denotes the T-value, and \(\hat{\sigma}\) denotes the (estimated) standard deviation.

Among the above high-confidence interval estimation methods, Hoeffding and empirical Bernstein derive a lower bound without any distributional assumption on \(p(\hat{J})\), which sometimes leads to quite conservative estimates. In contrast, the T-test is based on the assumption that each sample of \(\hat{J}\) follows a normal distribution.
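As an illustration, the Hoeffding-based lower bound can be computed from the trajectory-wise estimates as in the standalone sketch below (here, the maximum is taken over the observed samples for simplicity; in practice \(\hat{J}_{\max}\) is the maximum possible value of the estimate).

import numpy as np

def hoeffding_lower_bound(trajectory_wise_estimates, alpha=0.05):
    """Sketch of the Hoeffding bound above: returns J_hat - J_max * sqrt(log(1/alpha) / (2n))."""
    n = len(trajectory_wise_estimates)
    j_hat = np.mean(trajectory_wise_estimates)
    j_max = np.max(trajectory_wise_estimates)   # illustrative proxy for J_max
    return j_hat - j_max * np.sqrt(np.log(1.0 / alpha) / (2 * n))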

Tip

How to use High-confidence OPE?

The implementation is available by calling estimate_intervals as follows.

ope = OPE(
    logged_dataset=logged_dataset,
    ope_estimators=[DM(), TIS(), PDIS(), DR()],  # any standard or marginal estimators
)
estimation_dict = ope.estimate_intervals(
    input_dict,
    ci="hoeffding",  # one of {"hoeffding", "bernstein", "ttest", "bootstrap"}
    alpha=0.05,      # confidence level
)

Extension to the Continuous Action Space#

When the action space is continuous, the naive importance weight \(w_t = \pi(a_t|s_t) / \pi_0(a_t|s_t) = (\pi(a |s_t) / \pi_0(a_t|s_t)) \cdot \mathbb{I} \{a = a_t \}\) (where \(a\) denotes the action chosen by the evaluation policy) rejects almost every action, as the indicator function \(\mathbb{I}\{a = a_t\}\) accepts only the action observed in the logged data.

To address this issue, continuous-action OPE estimators apply a kernel density estimation technique to smooth the importance weight [57, 58]. (A small Monte Carlo sketch of the smoothed weight appears after the kernel list below.)

\[\overline{w}_t = \int_{a \in \mathcal{A}} \frac{\pi(a | s_t)}{\pi_0(a_t | s_t)} \cdot \frac{1}{h} K \left( \frac{a - a_t}{h} \right) da,\]

where \(K(\cdot)\) denotes a kernel function and \(h\) is the bandwidth hyperparameter. We can use any function as \(K(\cdot)\) that satisfies the following properties:

    1. \(\int xK(x) dx = 0\),

    2. \(\int K(x) dx = 1\),

    3. \(\lim _{x \rightarrow-\infty} K(x)=\lim _{x \rightarrow+\infty} K(x)=0\),

    4. \(K(x) \geq 0, \forall x\).

We provide the following kernel functions in SCOPE-RL.

  • Gaussian kernel: \(K(x) = \frac{1}{\sqrt{2 \pi}} e^{-\frac{x^{2}}{2}}\)

  • Epanechnikov kernel: \(K(x) = \frac{3}{4} (1 - x^2) \, (|x| \leq 1)\)

  • Triangular kernel: \(K(x) = 1 - |x| \, (|x| \leq 1)\)

  • Cosine kernel: \(K(x) = \frac{\pi}{4} \mathrm{cos} \left( \frac{\pi}{2} x \right) \, (|x| \leq 1)\)

  • Uniform kernel: \(K(x) = \frac{1}{2} \, (|x| \leq 1)\)
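As referenced above, the smoothed weight can be approximated by Monte Carlo sampling from the evaluation policy; the following standalone sketch uses the Gaussian kernel and a one-dimensional action for illustration (variable names are assumptions).

import numpy as np

def gaussian_kernel(x):
    return np.exp(-0.5 * x ** 2) / np.sqrt(2 * np.pi)

def smoothed_weight(eval_action_samples, logged_action, behavior_pscore, h=1.0):
    """Monte-Carlo sketch of the kernel-smoothed weight: average (1/h) K((a - a_t) / h)
    over actions a sampled from pi(.|s_t), then divide by pi_0(a_t|s_t)."""
    kernel_values = gaussian_kernel((eval_action_samples - logged_action) / h) / h
    return kernel_values.mean() / behavior_pscore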

Tip

How to control the bias-variance tradeoff with a kernel?

The bandwidth parameter \(h\) controls the bias-variance tradeoff. Specifically, a large value of \(h\) leads to a low-variance but high-bias estimation, while a small value of \(h\) leads to a high-variance but low-bias estimation.

The bandwidth parameter corresponds to bandwidth in the OffPolicyEvaluation class.

ope = OPE(
    logged_dataset=logged_dataset,
    ope_estimators=[DM(), TIS(), PDIS(), DR()],
    bandwidth=1.0,  # bandwidth hyperparameter
)

For multi-dimensional actions, we define the kernel via the dot product between actions as \(K(a, a') := K(a^T a')\). To control the scale of each dimension, action_scaler, which is specified in OffPolicyEvaluation, is also useful.

from d3rlpy.preprocessing import MinMaxActionScaler
ope = OPE(
    logged_dataset=logged_dataset,
    ope_estimators=[DM(), TIS(), PDIS(), DR()],
    bandwidth=1.0,  # bandwidth hyperparameter
    action_scaler=MinMaxActionScaler(
        minimum=env.action_space.low,
        maximum=env.action_space.high,
    ),
)

Cumulative Distribution Off-Policy Evaluation (CD-OPE)#

While the basic OPE aims to estimate the average policy performance, we are often also interested in the performance distribution of the evaluation policy. Cumulative distribution OPE enables flexible estimation of various risk functions such as variance and conditional value at risk (CVaR) using the cumulative distribution function (CDF) [8, 9, 10].

(Cumulative Distribution Function)

\[F(m, \pi) := \mathbb{E} \left[ \mathbb{I} \left \{ \sum_{t=0}^{T-1} \gamma^t r_t \leq m \right \} \mid \pi \right]\]

(Risk Functions derived by CDF)

  • Mean: \(\mu(F) := \int_{G} G \, \mathrm{d}F(G)\)

  • Variance: \(\sigma^2(F) := \int_{G} (G - \mu(F))^2 \, \mathrm{d}F(G)\)

  • \(\alpha\)-quantile: \(Q^{\alpha}(F) := \min \{ G \mid F(G) \geq \alpha \}\)

  • Conditional Value at Risk (CVaR): \(\int_{G} G \, \mathbb{I}\{ G \leq Q^{\alpha}(F) \} \, \mathrm{d}F(G)\)

where we let \(G := \sum_{t=0}^{T-1} \gamma^t r_t\) denote the trajectory-wise reward as a random variable and \(\mathrm{d}F(G) := \lim_{\Delta \rightarrow 0} F(G) - F(G- \Delta)\).
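Given a discretized CDF estimate, the risk functions above can be computed directly; the following standalone NumPy sketch assumes reward_scale is an increasing grid of thresholds and cdf contains the corresponding values of \(F\).

import numpy as np

def risk_functions_from_cdf(reward_scale, cdf, alpha=0.05):
    """Sketch: derive mean, variance, alpha-quantile, and CVaR from a discretized CDF."""
    dF = np.diff(cdf, prepend=0.0)                      # probability mass on each grid point
    mean = np.sum(reward_scale * dF)
    variance = np.sum((reward_scale - mean) ** 2 * dF)
    quantile = reward_scale[np.argmax(cdf >= alpha)]    # min {G | F(G) >= alpha}
    cvar = np.sum(reward_scale * dF * (reward_scale <= quantile))  # CVaR formula above
    return mean, variance, quantile, cvar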

To estimate both the CDF and various risk functions, we provide the CumulativeDistributionOPE class.

# initialize the OPE class
from scope_rl.ope import CumulativeDistributionOPE
cd_ope = CumulativeDistributionOPE(
    logged_dataset=logged_dataset,
    ope_estimators=[CD_DM(), CD_IS(), CD_DR()],
)

It estimates the cumulative distribution of the trajectory-wise reward and various risk functions as follows.

cdf_dict = cd_ope.estimate_cumulative_distribution_function(input_dict)
variance_dict = cd_ope.estimate_variance(input_dict)

Tip

How to conduct Cumulative Distribution OPE with multiple logged datasets?

Conducting Cumulative Distribution OPE with multiple logged datasets requires no additional effort.

First, the same command as in the single logged dataset case also works with multiple logged datasets.

ope = CumulativeDistributionOPE(
    logged_dataset=logged_dataset,  # MultipleLoggedDataset
    ope_estimators=[CD_DM(), CD_IS(), CD_DR()],
)
multiple_cdf_dict = ope.estimate_cumulative_distribution_function(
    input_dict,  # MultipleInputDict
)

The returned value is a dictionary containing the OPE results.

In addition, we can specify which logged dataset and input_dict to use by setting behavior_policy_name and dataset_id.

multiple_ope_dict = ope.estimate_cumulative_distribution_function(
    input_dict,
    behavior_policy_name=behavior_policy.name,
    dataset_id=0,  # specify which logged dataset and input_dict to use
)

The basic visualization function also works by specifying the dataset id.

ope.visualize_cumulative_distribution_function(
    input_dict,
    behavior_policy_name=behavior_policy.name,
    dataset_id=0,
    random_state=random_state,
)
[Figure: cumulative distribution function estimated with the specified dataset]

Moreover, we provide additional visualization functions for the multiple logged dataset case.

The following visualizes confidence intervals of the cumulative distribution function.

ope.visualize_cumulative_distribution_function_with_multiple_estimates(
    input_dict,      # MultipleInputDict
    behavior_policy_name=None,                   # compare estimators with multiple behavior policies
    # behavior_policy_name=behavior_policy.name  # compare estimators with a single behavior policy
    random_state=random_state,
)
[Figure: cumulative distribution function estimated with the multiple datasets]

In contrast, the following visualizes the distribution of multiple estimates of point-wise policy performance (e.g., policy value, variance, conditional value at risk, lower quartile).

ope.visualize_policy_value_with_multiple_estimates(
    input_dict,      # MultipleInputDict
    plot_type="ci",  # one of {"ci", "violin", "scatter"}, default="ci"
    random_state=random_state,
)

When the plot_type is “ci”, the plot is somewhat similar to the basic visualization. (The star indicates the ground-truth policy value, and the confidence intervals are derived from the multiple estimates across datasets.)

[Figure: policy value estimated with the multiple datasets]

When the plot_type is “violin”, the plot visualizes the distribution of multiple estimates. This is particularly useful to see how the estimation result can vary depending on different datasets or random seeds.

[Figure: policy value estimated with the multiple datasets (violin)]

Finally, when the plot_type is “scatter”, the plot visualizes each estimation with its color specifying the behavior policy. This function is particularly useful to see how the choice of behavior policy (e.g., their stochasticity) affects the estimation result.

[Figure: policy value estimated with the multiple datasets (scatter)]

The CumulativeDistributionOPE class implements the following functions.

(Cumulative Distribution Function)

  • estimate_cumulative_distribution_function

(Risk Functions and Statistics)

  • estimate_mean

  • estimate_variance

  • estimate_conditional_value_at_risk

  • estimate_interquartile_range

(Visualization)

  • visualize_policy_value

  • visualize_conditional_value_at_risk

  • visualize_interquartile_range

  • visualize_cumulative_distribution_function

(Visualization with multiple estimates on multiple logged datasets)

  • visualize_policy_value_with_multiple_estimates

  • visualize_variance_with_multiple_estimates

  • visualize_cumulative_distribution_function_with_multiple_estimates

  • visualize_lower_quartile_with_multiple_estimates

(Others)

  • obtain_reward_scale

Below, we describe the implemented cumulative distribution OPE estimators.

Direct Method (DM)

Trajectory-wise Importance Sampling (TIS)

Trajectory-wise Doubly Robust (TDR)

Self-Normalized estimators

Extension to the continuous action space

Tip

How to define my own cumulative distribution OPE estimator?

To define your own OPE estimator, use BaseCumulativeDistributionOPEEstimator.

Basically, the common inputs for each function are reward_scale (np.ndarray indicating x-axis of cumulative distribution function) and the following keys from logged_dataset and input_dict.

(logged_dataset)

key: [
    size,
    step_per_trajectory,
    action,
    reward,
    pscore,
]

(input_dict)

key: [
    evaluation_policy_action,
    evaluation_policy_action_dist,
    state_action_value_prediction,
    initial_state_value_prediction,
    state_action_marginal_importance_weight,
    state_marginal_importance_weight,
    on_policy_policy_value,
    gamma,
]

action_scaler and sigma are also added in the continuous-action case.

If you want to add other arguments, please add them to the initialization arguments for API consistency.

Finally, contributions to SCOPE-RL with a new OPE estimator are more than welcome! Please read the guidelines for contribution (CONTRIBUTING.md).

See also

API reference of BaseOffPolicyEstimator explains the abstract methods.

Direct Method (DM)#

DM adopts a model-based approach to estimate the cumulative distribution function.

\[\hat{F}_{\mathrm{DM}}(m, \pi; \mathcal{D}) := \frac{1}{n} \sum_{i=1}^n \sum_{a \in \mathcal{A}} \pi(a | s_0^{(i)}) \hat{G}(m; s_0^{(i)}, a)\]

where \(\hat{F}(\cdot)\) is the estimated cumulative distribution function and \(\hat{G}(\cdot)\) is an estimator for \(\mathbb{E} \left[ \mathbb{I} \left \{\sum_{t=0}^{T-1} \gamma^t r_t \leq m \right \} \mid s,a \right]\).

DM is vulnerable to the approximation error, but has low variance.

  • CumulativeDistributionDM

Trajectory-wise Importance Sampling (TIS)#

TIS corrects the distribution shift by applying the importance sampling technique on the cumulative distribution estimation.

\[\hat{F}_{\mathrm{TIS}}(m, \pi; \mathcal{D}) := \frac{1}{n} \sum_{i=1}^n w_{0:T-1}^{(i)} \mathbb{I} \left \{\sum_{t=0}^{T-1} \gamma^t r_t^{(i)} \leq m \right \}\]

where \(w_{0:T-1} := \prod_{t=0}^{T-1} (\pi(a_t | s_t) / \pi_0(a_t | s_t))\) is the trajectory-wise importance weight. TIS is unbiased but can suffer from high variance. As a consequence, \(\hat{F}_{\mathrm{TIS}}(\cdot)\) can sometimes exceed 1.0 when the variance is high. Therefore, we correct the CDF as follows [8].

\[\hat{F}^{\ast}_{\mathrm{TIS}}(m, \pi; \mathcal{D}) := \min(\max_{m' \leq m} \hat{F}_{\mathrm{TIS}}(m', \pi; \mathcal{D}), 1).\]
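The estimate and its correction can be sketched in a few lines of standalone NumPy (array conventions follow the earlier sketches; reward_scale is an increasing grid of thresholds \(m\)).

import numpy as np

def cdf_tis(ratio, reward, reward_scale, gamma=0.99):
    """Sketch of the TIS-based CDF estimate with the monotonicity/clipping correction."""
    w_traj = ratio.prod(axis=1)                            # w_{0:T-1}
    discount = gamma ** np.arange(reward.shape[1])
    G = (discount * reward).sum(axis=1)                    # trajectory-wise reward
    F = np.array([(w_traj * (G <= m)).mean() for m in reward_scale])
    return np.minimum(np.maximum.accumulate(F), 1.0)       # max over m' <= m, then clip at 1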

  • CumulativeDistributionTIS

Trajectory-wise Doubly Robust (TDR)#

TDR combines TIS and DM to reduce the variance while being unbiased.

\[\hat{F}_{\mathrm{TDR}}(m, \pi; \mathcal{D}) := \frac{1}{n} \sum_{i=1}^n w_{0:T-1}^{(i)} \left( \mathbb{I} \left \{\sum_{t=0}^{T-1} \gamma^t r_t^{(i)} \leq m \right \} - \hat{G}(m; s_0^{(i)}, a_0^{(i)}) \right) + \hat{F}_{\mathrm{DM}}(m, \pi; \mathcal{D})\]

TDR reduces the variance of TIS while being unbiased, leveraging the model-based estimate (i.e., DM) as a control variate. Since \(\hat{F}_{\mathrm{TDR}}(\cdot)\) may be less than zero or more than one, we should apply the following transformation to bound \(\hat{F}_{\mathrm{TDR}}(\cdot) \in [0, 1]\) [8].

\[\hat{F}^{\ast}_{\mathrm{TDR}}(m, \pi; \mathcal{D}) := \mathrm{clip}(\max_{m' \leq m} \hat{F}_{\mathrm{TDR}}(m', \pi; \mathcal{D}), 0, 1).\]

Note that this estimator is not equivalent to the (recursive) DR estimator defined by [9]. We are planning to implement the recursive version in a future update of the software.

  • CumulativeDistributionTDR

Finally, we also provide the self-normalized estimators for TIS and TDR. They use the self-normalized importance weight \(\tilde{w}_{\ast} := w_{\ast} / (\sum_{i=1}^{n} w_{\ast})\) for variance reduction.

  • CumulativeDistributionSNTIS

  • CumulativeDistributionSNDR

Evaluation Metrics of OPE/OPS#

Finally, we describe the metrics used to evaluate the quality of OPE estimators and their OPS results. (A small standalone sketch of these metrics follows the list below.)

  • Mean Squared Error (MSE) [24, 25, 26]:

    This metric measures the estimation accuracy as \(\sum_{\pi \in \Pi} (\hat{J}(\pi; \mathcal{D}) - J(\pi))^2 / |\Pi|\).

  • Regret@k [24, 26]:

    This metric measures how well the selected policy(ies) performs. In particular, Regret@1 indicates the expected performance difference between the (oracle) best policy and the selected policy as \(J(\pi^{\ast}) - J(\hat{\pi}^{\ast})\), where \(\pi^{\ast} := {\arg\max}_{\pi \in \Pi} J(\pi)\) and \(\hat{\pi}^{\ast} := {\arg\max}_{\pi \in \Pi} \hat{J}(\pi; \mathcal{D})\).

  • Spearman’s Rank Correlation Coefficient [24, 26]:

    This metric measures how well the ranking of the candidate policies is preserved in the OPE result.

  • Type I and Type II Error Rate:

    This metric measures how well an OPE estimator validates whether the policy performance surpasses the given safety threshold or not.
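As referenced above, these metrics are straightforward to compute from arrays of true and estimated policy values; the sketch below is standalone (it uses scipy only for the rank correlation) and the variable names are illustrative.

import numpy as np
from scipy.stats import spearmanr

def ops_metrics(true_values, estimated_values):
    """Sketch of MSE, Regret@1, and Spearman's rank correlation for candidate policies."""
    mse = np.mean((estimated_values - true_values) ** 2)
    regret_at_1 = np.max(true_values) - true_values[np.argmax(estimated_values)]
    rank_corr = spearmanr(true_values, estimated_values).correlation
    return mse, regret_at_1, rank_corr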

To ease the comparison of candidate (evaluation) policies and the OPE estimators, we provide the OffPolicySelection class.

# Initialize the OPS class
from scope_rl.ope import OffPolicySelection
ops = OffPolicySelection(
    ope=ope,
    cumulative_distribution_ope=cd_ope,
)

The OffPolicySelection class returns both the OPE results and the OPS metrics as follows.

ranking_df, metric_df = ops.select_by_policy_value(
    input_dict,
    return_metrics=True,
    return_by_dataframe=True,
)

Moreover, the OPS class enables us to validate the best/worst/mean/std performance of the top-\(k\) deployment and how well the safety requirement is satisfied. Note that we provide a detailed description of these top-\(k\) metrics and the proposed SharpeRatio@k metric on this page: Risk-Return Assessments of OPE via SharpeRatio@k.

ops.visualize_topk_policy_value_selected_by_standard_ope(
    input_dict=input_dict,
    safety_criteria=1.0,
)

Finally, the OPS class also implements the modules to compare the OPE result and the true policy metric as follows.

ops.visualize_policy_value_for_validation(
    input_dict=input_dict,
    n_cols=4,
    share_axes=True,
)

Tip

How to conduct OPS with multiple logged datasets?

Conducting OPS with multiple logged datasets requires no additional effort.

First, the same command as in the single logged dataset case also works with multiple logged datasets.

ops = OffPolicySelection(
    ope=ope,                             # initialized with MultipleLoggedDataset
    cumulative_distribution_ope=cd_ope,  # initialized with MultipleLoggedDataset
)
ranking_df, metric_df = ops.select_by_policy_value(
    input_dict,  # MultipleInputDict
    return_metrics=True,
    return_by_dataframe=True,
)

The returned value is a dictionary containing the OPS results.

Next, the visualization functions for OPS show the aggregated OPS results by default. For example, the top-k visualization shows the average top-k performance and its confidence intervals.

ops.visualize_topk_policy_value_selected_by_standard_ope(
    input_dict=input_dict,
    safety_criteria=1.0,
)
[Figure: top-k deployment result with multiple logged datasets]

In the validation visualization, colors indicate the behavior policies. This function is particularly useful to see how the choice of behavior policy (e.g., their stochasticity) affects the estimation result.

ops.visualize_policy_value_for_validation(
    input_dict=input_dict,
    n_cols=4,
    share_axes=True,
)
[Figure: validation results on multiple logged datasets]

Note that when behavior_policy_name and dataset_id are specified, the methods show the result on the specified dataset.

The OPS class implements the following functions.

(OPS)

  • obtain_oracle_selection_result

  • select_by_policy_value

  • select_by_policy_value_via_cumulative_distribution_ope

  • select_by_policy_value_lower_bound

  • select_by_lower_quartile

  • select_by_conditional_value_at_risk

(Visualization)

  • visualize_policy_value_for_selection

  • visualize_cumulative_distribution_function_for_selection

  • visualize_policy_value_of_cumulative_distribution_ope_for_selection

  • visualize_conditional_value_at_risk_for_selection

  • visualize_interquartile_range_for_selection

(Visualization with multiple estimates on multiple logged datasets)

  • visualize_policy_value_with_multiple_estimates_standard_ope

  • visualize_policy_value_with_multiple_estimates_cumulative_distribution_ope

  • visualize_variance_with_multiple_estimates

  • visualize_cumulative_distribution_function_with_multiple_estimates

  • visualize_lower_quartile_with_multiple_estimates

(Visualization of top k performance)

  • visualize_topk_policy_value_selected_by_standard_ope

  • visualize_topk_policy_value_selected_by_cumulative_distribution_ope

  • visualize_topk_policy_value_selected_by_lower_bound

  • visualize_topk_conditional_value_at_risk_selected_by_standard_ope

  • visualize_topk_conditional_value_at_risk_selected_by_cumulative_distribution_ope

  • visualize_topk_lower_quartile_selected_by_standard_ope

  • visualize_topk_lower_quartile_selected_by_cumulative_distribution_ope

(Visualization for validation)

  • visualize_policy_value_for_validation

  • visualize_policy_value_of_cumulative_distribution_ope_for_validation

  • visualize_policy_value_lower_bound_for_validation

  • visualize_variance_for_validation

  • visualize_lower_quartile_for_validation

  • visualize_conditional_value_at_risk_for_validation
