scope_rl.ope.discrete.marginal_estimators

State(-Action) Marginal Off-Policy Estimators for discrete action cases.

Classes

DoubleReinforcementLearning

Double Reinforcement Learning (DRL) estimator for discrete action spaces.

StateActionMarginalDR

State-Action Marginal Doubly Robust (SAM-DR) for discrete action spaces.

StateActionMarginalIS

State-Action Marginal Importance Sampling (SAM-IS) for discrete action spaces.

StateActionMarginalSNDR

State-Action Marginal Self-Normalized Doubly Robust (SAM-SNDR) for discrete action spaces.

StateActionMarginalSNIS

State-Action Marginal Self-Normalized Importance Sampling (SAM-SNIS) for discrete action spaces.

StateMarginalDM

Direct Method (DM) for discrete-action and stationary OPE.

StateMarginalDR

State Marginal Doubly Robust (SM-DR) for discrete action spaces.

StateMarginalIS

State Marginal Importance Sampling (SM-IS) for discrete action spaces.

StateMarginalSNDR

State Marginal Self-Normalized Doubly Robust (SM-SNDR) for discrete action spaces.

StateMarginalSNIS

State Marginal Self-Normalized Importance Sampling (SM-SNIS) for discrete action spaces.

class scope_rl.ope.discrete.marginal_estimators.DoubleReinforcementLearning(estimator_name='drl')

Double Reinforcement Learning (DRL) estimator for discrete action spaces.

Bases: scope_rl.ope.BaseOffPolicyEstimator

Imported as: scope_rl.ope.discrete.DoubleReinforcementLearning

Note

DRL estimates the policy value using the state-action marginal importance weight and the Q-function, both estimated by cross-fitting.

\[\hat{J}_{\mathrm{DRL}} (\pi; \mathcal{D}) := \frac{1}{n} \sum_{k=1}^K \sum_{i=1}^{n_k} \sum_{t=0}^{T-1} \left( \rho^k(s_t^{(i)}, a_t^{(i)}) (r_t^{(i)} - Q^k(s_t^{(i)}, a_t^{(i)})) + \rho^k(s_{t-1}^{(i)}, a_{t-1}^{(i)}) \sum_{a \in \mathcal{A}} \pi(a | s_t^{(i)}) Q^k(s_t^{(i)}, a) \right)\]

where \(\rho(s, a) \approx d^{\pi}(s, a) / d^{\pi_b}(s, a)\) is the state-action marginal importance weight, \(d^{\pi}(s, a)\) is the marginal visitation probability of the policy \(\pi\) on \((s, a)\), and \(Q(s, a)\) is the Q-function. \(K\) is the number of folds and \(\mathcal{D}_k\) is the \(k\)-th split of the logged data, consisting of \(n_k\) samples. \(\rho^k\) and \(Q^k\) are estimated on the data outside the \(k\)-th split, i.e., \(\mathcal{D} \setminus \mathcal{D}_k\), and are then used for OPE on \(\mathcal{D}_k\).

DRL achieves the semiparametric efficiency bound with a consistent value predictor.

There are several ways to estimate the state(-action) marginal importance weight such as Augmented Lagrangian Method (ALM) (Yang et al., 2020) and Minimax Weight Learning (MWL) (Uehara et al., 2020).

See also

The implementations of such weight learning methods are available at scope_rl.ope.weight_value_learning.
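The DRL formula above can be sketched for a single fold as follows. This is a minimal pure-Python illustration, not the library's implementation: the function name and the argument names `weight`, `pi`, and `q_hat` are hypothetical, the flat inputs are assumed to be ordered trajectory by trajectory, and the weight preceding the first step of a trajectory is assumed to be 1.

```python
def drl_policy_value(step_per_trajectory, action, reward, weight, pi, q_hat):
    """Sketch of the DRL estimate on one fold, following the formula above.

    action, reward, weight: flat lists of length n_trajectories * step_per_trajectory.
    pi, q_hat: flat lists of rows, one row of n_action values per step.
    """
    T = step_per_trajectory
    n = len(reward) // T
    total = 0.0
    for i in range(n):
        for t in range(T):
            j = i * T + t  # position of step t of trajectory i in the flat inputs
            # V_hat(s_t) = sum_a pi(a | s_t) * Q_hat(s_t, a)
            v_hat = sum(p * q for p, q in zip(pi[j], q_hat[j]))
            # Assumption: the weight preceding the first step is taken to be 1.
            prev_weight = weight[j - 1] if t > 0 else 1.0
            total += weight[j] * (reward[j] - q_hat[j][action[j]])
            total += prev_weight * v_hat
    return total / n
```

With all weights equal to 1 and an evaluation policy that always picks the logged action, the Q-function terms cancel and the estimate is the average undiscounted return.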

Parameters:

estimator_name (str, default="drl") – Name of the estimator.

References

Nathan Kallus and Masatoshi Uehara. “Double Reinforcement Learning for Efficient Off-Policy Evaluation in Markov Decision Processes.” 2020.

Masatoshi Uehara, Jiawei Huang, and Nan Jiang. “Minimax Weight and Q-Function Learning for Off-Policy Evaluation.” 2020.

Mengjiao Yang, Ofir Nachum, Bo Dai, Lihong Li, and Dale Schuurmans. “Off-Policy Evaluation via the Regularized Lagrangian.” 2020.

Methods

estimate_interval(step_per_trajectory, ...)

Estimate the confidence interval of the policy value by nonparametric bootstrap.

estimate_policy_value(step_per_trajectory, ...)

Estimate the policy value of the evaluation policy.

estimate_policy_value(step_per_trajectory, action, reward, state_action_marginal_importance_weight, evaluation_policy_action_dist, state_action_value_prediction, **kwargs)

Estimate the policy value of the evaluation policy.

Parameters:
  • step_per_trajectory (int (> 0)) – Number of timesteps in an episode.

  • action (array-like of shape (n_trajectories * step_per_trajectory, )) – Action chosen by the behavior policy.

  • reward (array-like of shape (n_trajectories * step_per_trajectory, )) – Observed immediate rewards.

  • state_action_marginal_importance_weight (array-like of shape (n_trajectories * step_per_trajectory, )) – Importance weight wrt the state-action marginal distribution, i.e., \(d^{\pi}(s, a) / d^{\pi_b}(s, a)\)

  • evaluation_policy_action_dist (array-like of shape (n_trajectories * step_per_trajectory, n_action)) – Conditional action distribution induced by the evaluation policy, i.e., \(\pi(a | s) \forall a \in \mathcal{A}\)

  • state_action_value_prediction (array-like of shape (n_trajectories * step_per_trajectory, n_action)) – \(\hat{Q}\) for all actions, i.e., \(\hat{Q}(s_t, a) \forall a \in \mathcal{A}\).

Returns:

V_hat – Estimated policy value.

Return type:

float

estimate_interval(step_per_trajectory, action, reward, state_action_marginal_importance_weight, evaluation_policy_action_dist, state_action_value_prediction, alpha=0.05, ci='bootstrap', n_bootstrap_samples=10000, random_state=None, **kwargs)

Estimate the confidence interval of the policy value by nonparametric bootstrap.

Parameters:
  • step_per_trajectory (int (> 0)) – Number of timesteps in an episode.

  • action (array-like of shape (n_trajectories * step_per_trajectory, )) – Action chosen by the behavior policy.

  • reward (array-like of shape (n_trajectories * step_per_trajectory, )) – Observed immediate rewards.

  • state_action_marginal_importance_weight (array-like of shape (n_trajectories * step_per_trajectory, )) – Importance weight wrt the state-action marginal distribution, i.e., \(d^{\pi}(s, a) / d^{\pi_b}(s, a)\)

  • evaluation_policy_action_dist (array-like of shape (n_trajectories * step_per_trajectory, n_action)) – Conditional action distribution induced by the evaluation policy, i.e., \(\pi(a | s) \forall a \in \mathcal{A}\)

  • state_action_value_prediction (array-like of shape (n_trajectories * step_per_trajectory, n_action)) – \(\hat{Q}\) for all actions, i.e., \(\hat{Q}(s_t, a) \forall a \in \mathcal{A}\).

  • alpha (float, default=0.05) – Significance level. The value should be within [0, 1).

  • ci ({"bootstrap", "hoeffding", "bernstein", "ttest"}, default="bootstrap") – Method to estimate the confidence interval.

  • n_bootstrap_samples (int, default=10000 (> 0)) – Number of resampling performed in the bootstrap procedure.

  • random_state (int, default=None (>= 0)) – Random state.

Returns:

estimated_confidence_interval – Dictionary storing the estimated mean and upper-lower confidence bounds.

key: [
    mean,
    {100 * (1. - alpha)}% CI (lower),
    {100 * (1. - alpha)}% CI (upper),
]

Return type:

dict
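The shape of the returned dictionary can be illustrated with a percentile bootstrap over per-trajectory values. This is a sketch under assumptions: the helper name, the choice to resample trajectory-level values, and the percentile rule are illustrative and may differ from what the library does internally.

```python
import random


def bootstrap_interval(values, alpha=0.05, n_bootstrap_samples=1000, random_state=None):
    """Percentile bootstrap of the mean of `values` (one value per trajectory)."""
    rng = random.Random(random_state)
    n = len(values)
    # Resample n values with replacement and record the mean of each resample.
    boot_means = sorted(
        sum(rng.choices(values, k=n)) / n for _ in range(n_bootstrap_samples)
    )
    lower_idx = int((alpha / 2) * (n_bootstrap_samples - 1))
    upper_idx = int((1 - alpha / 2) * (n_bootstrap_samples - 1))
    level = 100 * (1.0 - alpha)
    return {
        "mean": sum(boot_means) / n_bootstrap_samples,
        f"{level}% CI (lower)": boot_means[lower_idx],
        f"{level}% CI (upper)": boot_means[upper_idx],
    }
```

Passing the same `random_state` twice gives the same interval, which is convenient when comparing estimators on the same logged data.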

class scope_rl.ope.discrete.marginal_estimators.StateMarginalDM(estimator_name='sm_dm')

Direct Method (DM) for discrete-action and stationary OPE.

Bases: scope_rl.ope.BaseStateMarginalOPEEstimator -> scope_rl.ope.BaseOffPolicyEstimator

Imported as: scope_rl.ope.discrete.StateMarginalDM

Note

DM estimates the policy value using an estimated initial state value as follows.

\[\hat{J}_{\mathrm{DM}} (\pi; \mathcal{D}) := \frac{1}{n} \sum_{i=1}^n \sum_{a \in \mathcal{A}} \pi(a | s_0^{(i)}) \hat{Q}(s_0^{(i)}, a) = \frac{1}{n} \sum_{i=1}^n \hat{V}(s_0^{(i)}),\]

where \(\mathcal{D}=\{\{(s_t^{(i)}, a_t^{(i)}, r_t^{(i)})\}_{t=0}^{T-1}\}_{i=1}^n\) is the logged dataset with \(n\) trajectories and \(T\) is the number of steps per episode. \(\hat{Q}(s_t, a_t)\) is the estimated Q-value of a state-action pair, and \(\hat{V}(s_t)\) is the estimated value of a state.

DM has low variance compared to other estimators, but can produce larger bias due to approximation errors.

There are several methods to estimate \(\hat{Q}(s, a)\) such as Fitted Q Evaluation (FQE) (Le et al., 2019), Minimax Q-Function Learning (MQL) (Uehara et al., 2020), and Augmented Lagrangian Method (ALM) (Yang et al., 2020).

See also

The implementation of FQE is provided by d3rlpy. The implementations of Minimax Weight and Value Learning (including ALM) are available at scope_rl.ope.weight_value_learning.

Note

This estimator differs from DirectMethod in that the initial state is sampled from the stationary distribution \(d^{\pi}(s_0)\).
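The DM formula above amounts to averaging the estimated initial state values. A minimal pure-Python sketch follows; the helper names and the row-per-trajectory inputs are hypothetical, and the Q-function estimates are assumed to come from a separate method such as FQE.

```python
def initial_state_value(initial_pi, initial_q_hat):
    """V_hat(s_0) = sum_a pi(a | s_0) * Q_hat(s_0, a), one value per trajectory."""
    return [
        sum(p * q for p, q in zip(pi_row, q_row))
        for pi_row, q_row in zip(initial_pi, initial_q_hat)
    ]


def dm_policy_value(initial_state_value_prediction):
    """Average of the estimated initial state values over the n trajectories."""
    return sum(initial_state_value_prediction) / len(initial_state_value_prediction)
```

The output of the first helper has shape (n_trajectories, ), which matches the `initial_state_value_prediction` argument documented below.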

Parameters:

estimator_name (str, default="sm_dm") – Name of the estimator.

References

Yuta Saito, Shunsuke Aihara, Megumi Matsutani, and Yusuke Narita. “Open Bandit Dataset and Pipeline: Towards Realistic and Reproducible Off-Policy Evaluation.” 2021.

Takuma Seno and Michita Imai. “d3rlpy: An Offline Deep Reinforcement Library.” 2021.

Masatoshi Uehara, Jiawei Huang, and Nan Jiang. “Minimax Weight and Q-Function Learning for Off-Policy Evaluation.” 2020.

Hoang Le, Cameron Voloshin, and Yisong Yue. “Batch Policy Learning under Constraints.” 2019.

Alina Beygelzimer and John Langford. “The Offset Tree for Learning with Partial Labels.” 2009.

Methods

estimate_interval(initial_state_value_prediction)

Estimate the confidence interval of the policy value by nonparametric bootstrap.

estimate_policy_value(...)

Estimate the policy value of the evaluation policy.

estimate_policy_value(initial_state_value_prediction, **kwargs)

Estimate the policy value of the evaluation policy.

Parameters:

initial_state_value_prediction (array-like of shape (n_trajectories, )) – Estimated initial state value.

Returns:

V_hat – Estimated policy value.

Return type:

float

estimate_interval(initial_state_value_prediction, alpha=0.05, ci='bootstrap', n_bootstrap_samples=10000, random_state=None, **kwargs)

Estimate the confidence interval of the policy value by nonparametric bootstrap.

Parameters:
  • initial_state_value_prediction (array-like of shape (n_trajectories, )) – Estimated initial state value.

  • alpha (float, default=0.05) – Significance level. The value should be within [0, 1).

  • ci ({"bootstrap", "hoeffding", "bernstein", "ttest"}, default="bootstrap") – Method to estimate the confidence interval.

  • n_bootstrap_samples (int, default=10000 (> 0)) – Number of resampling performed in the bootstrap procedure.

  • random_state (int, default=None (>= 0)) – Random state.

Returns:

estimated_confidence_interval – Dictionary storing the estimated mean and upper-lower confidence bounds.

key: [
    mean,
    {100 * (1. - alpha)}% CI (lower),
    {100 * (1. - alpha)}% CI (upper),
]

Return type:

dict

class scope_rl.ope.discrete.marginal_estimators.StateMarginalIS(estimator_name='sm_is')

State Marginal Importance Sampling (SM-IS) for discrete action spaces.

Bases: scope_rl.ope.BaseStateMarginalOPEEstimator -> scope_rl.ope.BaseOffPolicyEstimator

Imported as: scope_rl.ope.discrete.StateMarginalIS

Note

SM-IS estimates the policy value using state marginal importance weighting. Following SOPE (Yuan et al., 2021), we combine state-marginal importance weighting and \(k\)-step PDIS as follows.

\[\hat{J}_{\mathrm{SM-IS}} (\pi; \mathcal{D}) := \frac{1}{n} \sum_{i=1}^n \sum_{t=0}^{k-1} \gamma^t w_{0:t}^{(i)} r_t^{(i)} + \frac{1}{n} \sum_{i=1}^n \sum_{t=k}^{T-1} \gamma^t \rho(s_{t-k}^{(i)}) w_{t-k:t}^{(i)} r_t^{(i)},\]

where \(w_{t_1:t_2} := \prod_{t=t_1}^{t_2} (\pi(a_t | s_t) / \pi_b(a_t | s_t))\) and \(\rho(s) \approx d^{\pi}(s) / d^{\pi_b}(s)\) is the state-marginal importance weight, where \(d^{\pi}(s)\) is the marginal visitation probability of the policy \(\pi\) on \(s\). When \(k=0\), this estimator reduces to the vanilla state marginal IS.
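As a worked special case, setting \(k=0\) in the formula above empties the first sum, and each remaining product \(w_{t:t}^{(i)}\) contains only the single-step ratio at step \(t\):

\[\hat{J}_{\mathrm{SM-IS}} (\pi; \mathcal{D}) \big|_{k=0} = \frac{1}{n} \sum_{i=1}^n \sum_{t=0}^{T-1} \gamma^t \rho(s_t^{(i)}) w_{t:t}^{(i)} r_t^{(i)}.\]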

SM-IS is unbiased when the marginal importance weight is estimated correctly. Moreover, SM-IS reduces the variance caused by trajectory-wise or per-decision importance weighting by considering the marginal distribution across various timesteps.

There are several ways to estimate the state(-action) marginal importance weight such as Augmented Lagrangian Method (ALM) (Yang et al., 2020) and Minimax Weight Learning (MWL) (Uehara et al., 2020).

See also

The implementations of such weight learning methods are available at scope_rl.ope.weight_value_learning.
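The SM-IS formula above can be sketched in pure Python as follows. This is an illustration of the formula rather than the library's code: the function name and the nested-list inputs are hypothetical, with `ratio[i][t]` holding the step-wise ratio of evaluation to behavior policy probabilities and `rho[i][t]` holding the state-marginal weight at step `t` of trajectory `i`.

```python
import math


def sm_is_policy_value(n_step_pdis, ratio, rho, reward, gamma=1.0):
    """SM-IS with k-step PDIS, written directly from the formula."""
    k = n_step_pdis
    n = len(reward)
    total = 0.0
    for i in range(n):
        for t in range(len(reward[i])):
            if t < k:
                # First k steps: plain per-decision importance weight w_{0:t}.
                weight = math.prod(ratio[i][: t + 1])
            else:
                # Later steps: marginal weight at s_{t-k} times w_{t-k:t}.
                weight = rho[i][t - k] * math.prod(ratio[i][t - k : t + 1])
            total += gamma ** t * weight * reward[i][t]
    return total / n
```

Choosing `n_step_pdis` at least as large as the trajectory length leaves only the per-decision term, while `n_step_pdis=0` uses the marginal weight at every step.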

Parameters:

estimator_name (str, default="sm_is") – Name of the estimator.

References

Christina J. Yuan, Yash Chandak, Stephen Giguere, Philip S. Thomas, and Scott Niekum. “SOPE: Spectrum of Off-Policy Estimators.” 2021.

Masatoshi Uehara, Jiawei Huang, and Nan Jiang. “Minimax Weight and Q-Function Learning for Off-Policy Evaluation.” 2020.

Mengjiao Yang, Ofir Nachum, Bo Dai, Lihong Li, and Dale Schuurmans. “Off-Policy Evaluation via the Regularized Lagrangian.” 2020.

Qiang Liu, Lihong Li, Ziyang Tang, and Dengyong Zhou. “Breaking the Curse of Horizon: Infinite-Horizon Off-Policy Estimation.” 2018.

Doina Precup, Richard S. Sutton, and Satinder P. Singh. “Eligibility Traces for Off-Policy Policy Evaluation.” 2000.

Methods

estimate_interval(n_step_pdis, ...[, gamma, ...])

Estimate the confidence interval of the policy value by nonparametric bootstrap.

estimate_policy_value(n_step_pdis, ...[, gamma])

Estimate the policy value of the evaluation policy.

estimate_policy_value(n_step_pdis, step_per_trajectory, action, reward, state_marginal_importance_weight, pscore, evaluation_policy_action_dist, gamma=1.0, **kwargs)

Estimate the policy value of the evaluation policy.

Parameters:
  • n_step_pdis (int (>= 0)) – Number of initial steps whose rewards are estimated by step-wise importance weighting. When set to zero, the estimator is reduced to the vanilla state marginal IS.

  • step_per_trajectory (int (> 0)) – Number of timesteps in an episode.

  • action (array-like of shape (n_trajectories * step_per_trajectory, )) – Action chosen by the behavior policy.

  • reward (array-like of shape (n_trajectories * step_per_trajectory, )) – Observed immediate rewards.

  • state_marginal_importance_weight (array-like of shape (n_trajectories * step_per_trajectory, )) – Importance weight wrt the state marginal distribution, i.e., \(d^{\pi}(s) / d^{\pi_b}(s)\)

  • pscore (array-like of shape (n_trajectories * step_per_trajectory, )) – Conditional action choice probability of the behavior policy, i.e., \(\pi_b(a | s)\)

  • evaluation_policy_action_dist (array-like of shape (n_trajectories * step_per_trajectory, n_action)) – Conditional action distribution induced by the evaluation policy, i.e., \(\pi(a | s) \forall a \in \mathcal{A}\)

  • gamma (float, default=1.0) – Discount factor. The value should be within (0, 1].

Returns:

V_hat – Estimated policy value.

Return type:

float
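To make the input layout concrete, the sketch below derives the step-wise ratio \(\pi(a_t | s_t) / \pi_b(a_t | s_t)\) from `action`, `pscore`, and `evaluation_policy_action_dist`, then regroups it per trajectory. The helper name is hypothetical, and the trajectory-by-trajectory ordering of the flat inputs is an assumption suggested by the documented shapes.

```python
def stepwise_importance_ratio(step_per_trajectory, action, pscore, evaluation_policy_action_dist):
    """Return ratio[i][t] = pi(a_t | s_t) / pi_b(a_t | s_t) for each trajectory i."""
    T = step_per_trajectory
    flat = [
        # Probability the evaluation policy assigns to the logged action,
        # divided by the behavior policy's probability of that action.
        evaluation_policy_action_dist[j][action[j]] / pscore[j]
        for j in range(len(action))
    ]
    # Assumption: consecutive blocks of T entries belong to the same trajectory.
    return [flat[i * T : (i + 1) * T] for i in range(len(flat) // T)]
```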

estimate_interval(n_step_pdis, step_per_trajectory, action, reward, state_marginal_importance_weight, pscore, evaluation_policy_action_dist, gamma=1.0, alpha=0.05, ci='bootstrap', n_bootstrap_samples=10000, random_state=None, **kwargs)

Estimate the confidence interval of the policy value by nonparametric bootstrap.

Parameters:
  • n_step_pdis (int (>= 0)) – Number of initial steps whose rewards are estimated by step-wise importance weighting. When set to zero, the estimator is reduced to the vanilla state marginal IS.

  • step_per_trajectory (int (> 0)) – Number of timesteps in an episode.

  • action (array-like of shape (n_trajectories * step_per_trajectory, )) – Action chosen by the behavior policy.

  • reward (array-like of shape (n_trajectories * step_per_trajectory, )) – Observed immediate rewards.

  • state_marginal_importance_weight (array-like of shape (n_trajectories * step_per_trajectory, )) – Importance weight wrt the state marginal distribution, i.e., \(d^{\pi}(s) / d^{\pi_b}(s)\)

  • pscore (array-like of shape (n_trajectories * step_per_trajectory, )) – Conditional action choice probability of the behavior policy, i.e., \(\pi_b(a | s)\)

  • evaluation_policy_action_dist (array-like of shape (n_trajectories * step_per_trajectory, n_action)) – Conditional action distribution induced by the evaluation policy, i.e., \(\pi(a | s) \forall a \in \mathcal{A}\)

  • gamma (float, default=1.0) – Discount factor. The value should be within (0, 1].

  • alpha (float, default=0.05) – Significance level. The value should be within [0, 1).

  • ci ({"bootstrap", "hoeffding", "bernstein", "ttest"}, default="bootstrap") – Method to estimate the confidence interval.

  • n_bootstrap_samples (int, default=10000 (> 0)) – Number of resampling performed in the bootstrap procedure.

  • random_state (int, default=None (>= 0)) – Random state.

Returns:

estimated_confidence_interval – Dictionary storing the estimated mean and upper-lower confidence bounds.

key: [
    mean,
    {100 * (1. - alpha)}% CI (lower),
    {100 * (1. - alpha)}% CI (upper),
]

Return type:

dict

class scope_rl.ope.discrete.marginal_estimators.StateMarginalDR(estimator_name='sm_dr')

State Marginal Doubly Robust (SM-DR) for discrete action spaces.

Bases: scope_rl.ope.BaseStateMarginalOPEEstimator -> scope_rl.ope.BaseOffPolicyEstimator

Imported as: scope_rl.ope.discrete.StateMarginalDR

Note

SM-DR estimates the policy value using state marginal importance weighting together with an estimated Q-function. Following SOPE (Yuan et al., 2021), we combine state-marginal importance weighting and \(k\)-step PDIS as follows.

\[\begin{split}\hat{J}_{\mathrm{SM-DR}} (\pi; \mathcal{D}) &:= \frac{1}{n} \sum_{i=1}^n \sum_{a \in \mathcal{A}} \pi(a | s_0^{(i)}) \hat{Q}(s_0^{(i)}, a) \\ & \quad \quad + \frac{1}{n} \sum_{i=1}^n \sum_{t=0}^{k-1} \gamma^t w_{0:t}^{(i)} \left(r_t^{(i)} + \gamma \sum_{a \in \mathcal{A}} \pi(a | s_{t+1}^{(i)}) \hat{Q}(s_{t+1}^{(i)}, a) - \hat{Q}(s_t^{(i)}, a_t^{(i)}) \right) \\ & \quad \quad + \frac{1}{n} \sum_{i=1}^n \sum_{t=k}^{T-1} \gamma^t \rho(s_{t-k}^{(i)}) w_{t-k:t}^{(i)} \left( r_t^{(i)} + \gamma \sum_{a \in \mathcal{A}} \pi(a | s_{t+1}^{(i)}) \hat{Q}(s_{t+1}^{(i)}, a) - \hat{Q}(s_t^{(i)}, a_t^{(i)}) \right),\end{split}\]

where \(w_{t_1:t_2} := \prod_{t=t_1}^{t_2} (\pi(a_t | s_t) / \pi_b(a_t | s_t))\) and \(\rho(s) \approx d^{\pi}(s) / d^{\pi_b}(s)\) is the state-marginal importance weight, where \(d^{\pi}(s)\) is the marginal visitation probability of the policy \(\pi\) on \(s\). When \(k=0\), this estimator reduces to the vanilla state marginal DR.

SM-DR is unbiased when either the marginal importance weight or Q-function is estimated correctly. Moreover, SM-DR reduces the variance caused by trajectory-wise or per-decision importance weighting by considering the marginal distribution across various timesteps.

There are several ways to estimate the state(-action) marginal importance weight such as Augmented Lagrangian Method (ALM) (Yang et al., 2020) and Minimax Weight Learning (MWL) (Uehara et al., 2020).

See also

The implementations of such weight learning methods are available at scope_rl.ope.weight_value_learning.
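The SM-DR formula above adds an importance-weighted correction to the DM baseline. The pure-Python sketch below illustrates this under stated assumptions: the function and argument names are hypothetical, `q_taken[i][t]` stands for \(\hat{Q}(s_t, a_t)\), `v_hat[i][t]` stands for the estimated value of \(s_t\) under the evaluation policy, and the value after the final step is assumed to be 0.

```python
import math


def sm_dr_policy_value(n_step_pdis, ratio, rho, reward, q_taken, v_hat, gamma=1.0):
    """SM-DR with k-step PDIS, written directly from the formula."""
    k = n_step_pdis
    n = len(reward)
    total = 0.0
    for i in range(n):
        T = len(reward[i])
        total += v_hat[i][0]  # DM baseline: estimated value of the initial state
        for t in range(T):
            if t < k:
                weight = math.prod(ratio[i][: t + 1])
            else:
                weight = rho[i][t - k] * math.prod(ratio[i][t - k : t + 1])
            # Assumption: no value is collected after the last logged step.
            next_v = v_hat[i][t + 1] if t + 1 < T else 0.0
            correction = reward[i][t] + gamma * next_v - q_taken[i][t]
            total += gamma ** t * weight * correction
    return total / n
```

When the Q-function estimates are all zero the correction equals the reward and the sketch returns the SM-IS value; when they match the observed returns exactly the corrections vanish and only the baseline remains.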

Parameters:

estimator_name (str, default="sm_dr") – Name of the estimator.

References

Christina J. Yuan, Yash Chandak, Stephen Giguere, Philip S. Thomas, and Scott Niekum. “SOPE: Spectrum of Off-Policy Estimators.” 2021.

Masatoshi Uehara, Jiawei Huang, and Nan Jiang. “Minimax Weight and Q-Function Learning for Off-Policy Evaluation.” 2020.

Mengjiao Yang, Ofir Nachum, Bo Dai, Lihong Li, and Dale Schuurmans. “Off-Policy Evaluation via the Regularized Lagrangian.” 2020.

Qiang Liu, Lihong Li, Ziyang Tang, and Dengyong Zhou. “Breaking the Curse of Horizon: Infinite-Horizon Off-Policy Estimation.” 2018.

Nan Jiang and Lihong Li. “Doubly Robust Off-policy Value Evaluation for Reinforcement Learning.” 2016.

Philip S. Thomas and Emma Brunskill. “Data-Efficient Off-Policy Policy Evaluation for Reinforcement Learning.” 2016.

Methods

estimate_interval(n_step_pdis, ...[, gamma, ...])

Estimate the confidence interval of the policy value by nonparametric bootstrap.

estimate_policy_value(n_step_pdis, ...[, gamma])

Estimate the policy value of the evaluation policy.

estimate_policy_value(n_step_pdis, step_per_trajectory, action, reward, state_marginal_importance_weight, pscore, evaluation_policy_action_dist, state_action_value_prediction, initial_state_value_prediction, gamma=1.0, **kwargs)

Estimate the policy value of the evaluation policy.

Parameters:
  • n_step_pdis (int (>= 0)) – Number of initial steps whose rewards are estimated by step-wise importance weighting. When set to zero, the estimator is reduced to the vanilla state marginal DR.

  • step_per_trajectory (int (> 0)) – Number of timesteps in an episode.

  • action (array-like of shape (n_trajectories * step_per_trajectory, )) – Action chosen by the behavior policy.

  • reward (array-like of shape (n_trajectories * step_per_trajectory, )) – Observed immediate rewards.

  • state_marginal_importance_weight (array-like of shape (n_trajectories * step_per_trajectory, )) – Importance weight wrt the state marginal distribution, i.e., \(d^{\pi}(s) / d^{\pi_b}(s)\)

  • pscore (array-like of shape (n_trajectories * step_per_trajectory, )) – Conditional action choice probability of the behavior policy, i.e., \(\pi_b(a | s)\)

  • evaluation_policy_action_dist (array-like of shape (n_trajectories * step_per_trajectory, n_action)) – Conditional action distribution induced by the evaluation policy, i.e., \(\pi(a | s) \forall a \in \mathcal{A}\)

  • state_action_value_prediction (array-like of shape (n_trajectories * step_per_trajectory, n_action)) – \(\hat{Q}\) for all actions, i.e., \(\hat{Q}(s_t, a) \forall a \in \mathcal{A}\).

  • initial_state_value_prediction (array-like of shape (n_trajectories, )) – Estimated initial state value.

  • gamma (float, default=1.0) – Discount factor. The value should be within (0, 1].

Returns:

V_hat – Estimated policy value.

Return type:

float

estimate_interval(n_step_pdis, step_per_trajectory, action, reward, state_marginal_importance_weight, pscore, evaluation_policy_action_dist, state_action_value_prediction, initial_state_value_prediction, gamma=1.0, alpha=0.05, ci='bootstrap', n_bootstrap_samples=10000, random_state=None, **kwargs)

Estimate the confidence interval of the policy value by nonparametric bootstrap.

Parameters:
  • n_step_pdis (int (>= 0)) – Number of initial steps whose rewards are estimated by step-wise importance weighting. When set to zero, the estimator is reduced to the vanilla state marginal DR.

  • step_per_trajectory (int (> 0)) – Number of timesteps in an episode.

  • action (array-like of shape (n_trajectories * step_per_trajectory, )) – Action chosen by the behavior policy.

  • reward (array-like of shape (n_trajectories * step_per_trajectory, )) – Observed immediate rewards.

  • state_marginal_importance_weight (array-like of shape (n_trajectories * step_per_trajectory, )) – Importance weight wrt the state marginal distribution, i.e., \(d^{\pi}(s) / d^{\pi_b}(s)\)

  • pscore (array-like of shape (n_trajectories * step_per_trajectory, )) – Conditional action choice probability of the behavior policy, i.e., \(\pi_b(a | s)\)

  • evaluation_policy_action_dist (array-like of shape (n_trajectories * step_per_trajectory, n_action)) – Conditional action distribution induced by the evaluation policy, i.e., \(\pi(a | s) \forall a \in \mathcal{A}\)

  • state_action_value_prediction (array-like of shape (n_trajectories * step_per_trajectory, n_action)) – \(\hat{Q}\) for all actions, i.e., \(\hat{Q}(s_t, a) \forall a \in \mathcal{A}\).

  • initial_state_value_prediction (array-like of shape (n_trajectories, )) – Estimated initial state value.

  • gamma (float, default=1.0) – Discount factor. The value should be within (0, 1].

  • alpha (float, default=0.05) – Significance level. The value should be within [0, 1).

  • ci ({"bootstrap", "hoeffding", "bernstein", "ttest"}, default="bootstrap") – Method to estimate the confidence interval.

  • n_bootstrap_samples (int, default=10000 (> 0)) – Number of resampling performed in the bootstrap procedure.

  • random_state (int, default=None (>= 0)) – Random state.

Returns:

estimated_confidence_interval – Dictionary storing the estimated mean and upper-lower confidence bounds.

key: [
    mean,
    {100 * (1. - alpha)}% CI (lower),
    {100 * (1. - alpha)}% CI (upper),
]

Return type:

dict

class scope_rl.ope.discrete.marginal_estimators.StateMarginalSNIS(estimator_name='sm_snis')

State Marginal Self-Normalized Importance Sampling (SM-SNIS) for discrete action spaces.

Bases: scope_rl.ope.discrete.StateMarginalIS -> scope_rl.ope.BaseStateMarginalOPEEstimator -> scope_rl.ope.BaseOffPolicyEstimator

Imported as: scope_rl.ope.discrete.StateMarginalSNIS

Note

SM-SNIS estimates the policy value using state marginal importance weighting. Following SOPE (Yuan et al., 2021), we combine state-marginal importance weighting and \(k\)-step PDIS as follows.

\[\hat{J}_{\mathrm{SM-SNIS}} (\pi; \mathcal{D}) := \sum_{i=1}^n \sum_{t=0}^{k-1} \gamma^t \frac{w_{0:t}^{(i)}}{\sum_{i'=1}^{n} w_{0:t}^{(i')}} r_t^{(i)} + \sum_{i=1}^n \sum_{t=k}^{T-1} \gamma^t \frac{\rho(s_{t-k}^{(i)}) w_{t-k:t}^{(i)}}{\sum_{i'=1}^n \rho(s_{t-k}^{(i')}) w_{t-k:t}^{(i')}} r_t^{(i)},\]

where \(w_{t_1:t_2} := \prod_{t=t_1}^{t_2} (\pi(a_t | s_t) / \pi_b(a_t | s_t))\) and \(\rho(s) \approx d^{\pi}(s) / d^{\pi_b}(s)\) is the state-marginal importance weight, where \(d^{\pi}(s)\) is the marginal visitation probability of the policy \(\pi\) on \(s\). When \(k=0\), this estimator reduces to the vanilla state marginal SNIS.

SM-SNIS is consistent when the marginal importance weight is estimated correctly. Moreover, SM-SNIS reduces the variance caused by trajectory-wise or per-decision importance weighting by considering the marginal distribution across various timesteps.

There are several ways to estimate the state(-action) marginal importance weight such as Augmented Lagrangian Method (ALM) (Yang et al., 2020) and Minimax Weight Learning (MWL) (Uehara et al., 2020).

See also

The implementations of such weight learning methods are available at scope_rl.ope.weight_value_learning.
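Self-normalization in the SM-SNIS formula above divides each weight by the sum of the weights at the same timestep across trajectories. A minimal pure-Python sketch follows; the function name and nested-list inputs are hypothetical, all trajectories are assumed to have the same length, and the weight sum at every timestep is assumed to be nonzero.

```python
import math


def sm_snis_policy_value(n_step_pdis, ratio, rho, reward, gamma=1.0):
    """SM-SNIS with k-step PDIS, written directly from the formula."""
    k = n_step_pdis
    n = len(reward)
    T = len(reward[0])
    total = 0.0
    for t in range(T):
        weights = []
        for i in range(n):
            if t < k:
                weights.append(math.prod(ratio[i][: t + 1]))
            else:
                weights.append(rho[i][t - k] * math.prod(ratio[i][t - k : t + 1]))
        # Normalize by the sum over trajectories at this timestep.
        norm = sum(weights)
        weighted_reward = sum(w * reward[i][t] for i, w in enumerate(weights))
        total += gamma ** t * weighted_reward / norm
    return total
```

Because of the normalization, multiplying every marginal weight by the same constant leaves the estimate unchanged, which is what makes the estimator less sensitive to the scale of the learned weights.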

Parameters:

estimator_name (str, default="sm_snis") – Name of the estimator.

References

Christina J. Yuan, Yash Chandak, Stephen Giguere, Philip S. Thomas, and Scott Niekum. “SOPE: Spectrum of Off-Policy Estimators.” 2021.

Masatoshi Uehara, Jiawei Huang, and Nan Jiang. “Minimax Weight and Q-Function Learning for Off-Policy Evaluation.” 2020.

Mengjiao Yang, Ofir Nachum, Bo Dai, Lihong Li, and Dale Schuurmans. “Off-Policy Evaluation via the Regularized Lagrangian.” 2020.

Qiang Liu, Lihong Li, Ziyang Tang, and Dengyong Zhou. “Breaking the Curse of Horizon: Infinite-Horizon Off-Policy Estimation.” 2018.

Doina Precup, Richard S. Sutton, and Satinder P. Singh. “Eligibility Traces for Off-Policy Policy Evaluation.” 2000.

Methods

estimate_interval(n_step_pdis, ...[, gamma, ...])

Estimate the confidence interval of the policy value by nonparametric bootstrap.

estimate_policy_value(n_step_pdis, ...[, gamma])

Estimate the policy value of the evaluation policy.

class scope_rl.ope.discrete.marginal_estimators.StateMarginalSNDR(estimator_name='sm_sndr')

State Marginal Self-Normalized Doubly Robust (SM-SNDR) for discrete action spaces.

Bases: scope_rl.ope.discrete.StateMarginalDR -> scope_rl.ope.BaseStateMarginalOPEEstimator -> scope_rl.ope.BaseOffPolicyEstimator

Imported as: scope_rl.ope.discrete.StateMarginalSNDR

Note

SM-SNDR estimates the policy value using self-normalized state marginal importance weighting together with an estimated Q-function. Following SOPE (Yuan et al., 2021), we combine state-marginal importance weighting and \(k\)-step PDIS as follows.

\[\begin{split}\hat{J}_{\mathrm{SM-SNDR}} (\pi; \mathcal{D}) &:= \frac{1}{n} \sum_{i=1}^n \sum_{a \in \mathcal{A}} \pi(a | s_0^{(i)}) \hat{Q}(s_0^{(i)}, a) \\ & \quad \quad + \sum_{i=1}^n \sum_{t=0}^{k-1} \gamma^t \frac{w_{0:t}^{(i)}}{\sum_{i'=1}^n w_{0:t}^{(i')}} \left(r_t^{(i)} + \gamma \sum_{a \in \mathcal{A}} \pi(a | s_{t+1}^{(i)}) \hat{Q}(s_{t+1}^{(i)}, a) - \hat{Q}(s_t^{(i)}, a_t^{(i)}) \right) \\ & \quad \quad + \sum_{i=1}^n \sum_{t=k}^{T-1} \gamma^t \frac{\rho(s_{t-k}^{(i)}) w_{t-k:t}^{(i)}}{\sum_{i'=1}^n \rho(s_{t-k}^{(i')}) w_{t-k:t}^{(i')}} \left(r_t^{(i)} + \gamma \sum_{a \in \mathcal{A}} \pi(a | s_{t+1}^{(i)}) \hat{Q}(s_{t+1}^{(i)}, a) - \hat{Q}(s_t^{(i)}, a_t^{(i)}) \right),\end{split}\]

where \(w_{t_1:t_2} := \prod_{t=t_1}^{t_2} (\pi(a_t | s_t) / \pi_b(a_t | s_t))\) and \(\rho(s) \approx d^{\pi}(s) / d^{\pi_b}(s)\) is the state-marginal importance weight, where \(d^{\pi}(s)\) is the marginal visitation probability of the policy \(\pi\) on \(s\). When \(k=0\), this estimator reduces to the vanilla state marginal SNDR.

SM-SNDR is consistent when either the marginal importance weight or Q-function is estimated correctly. Moreover, SM-SNDR reduces the variance caused by trajectory-wise or per-decision importance weighting by considering the marginal distribution across various timesteps.

There are several ways to estimate the state(-action) marginal importance weight such as Augmented Lagrangian Method (ALM) (Yang et al., 2020) and Minimax Weight Learning (MWL) (Uehara et al., 2020).

See also

The implementations of such weight learning methods are available at scope_rl.ope.weight_value_learning.

Parameters:

estimator_name (str, default="sm_sndr") – Name of the estimator.

References

Christina J. Yuan, Yash Chandak, Stephen Giguere, Philip S. Thomas, and Scott Niekum. “SOPE: Spectrum of Off-Policy Estimators.” 2021.

Masatoshi Uehara, Jiawei Huang, and Nan Jiang. “Minimax Weight and Q-Function Learning for Off-Policy Evaluation.” 2020.

Mengjiao Yang, Ofir Nachum, Bo Dai, Lihong Li, and Dale Schuurmans. “Off-Policy Evaluation via the Regularized Lagrangian.” 2020.

Qiang Liu, Lihong Li, Ziyang Tang, and Dengyong Zhou. “Breaking the Curse of Horizon: Infinite-Horizon Off-Policy Estimation.” 2018.

Nan Jiang and Lihong Li. “Doubly Robust Off-policy Value Evaluation for Reinforcement Learning.” 2016.

Philip S. Thomas and Emma Brunskill. “Data-Efficient Off-Policy Policy Evaluation for Reinforcement Learning.” 2016.

Methods

estimate_interval(n_step_pdis, ...[, gamma, ...])

Estimate the confidence interval of the policy value by nonparametric bootstrap.

estimate_policy_value(n_step_pdis, ...[, gamma])

Estimate the policy value of the evaluation policy.

class scope_rl.ope.discrete.marginal_estimators.StateActionMarginalIS(estimator_name='sam_is')

State-Action Marginal Importance Sampling (SAM-IS) for discrete action spaces.

Bases: scope_rl.ope.BaseStateActionMarginalOPEEstimator -> scope_rl.ope.BaseOffPolicyEstimator

Imported as: scope_rl.ope.discrete.StateActionMarginalIS

Note

SAM-IS estimates the policy value using state-action marginal importance weighting. Following SOPE (Yuan et al., 2021), we combine state-action marginal importance weighting and \(k\)-step PDIS as follows.

\[\hat{J}_{\mathrm{SAM-IS}} (\pi; \mathcal{D}) := \frac{1}{n} \sum_{i=1}^n \sum_{t=0}^{k-1} \gamma^t w_{0:t}^{(i)} r_t^{(i)} + \frac{1}{n} \sum_{i=1}^n \sum_{t=k}^{T-1} \gamma^t \rho(s_{t-k}^{(i)}, a_{t-k}^{(i)}) w_{t-k+1:t}^{(i)} r_t^{(i)},\]

where \(w_{t_1:t_2} := \prod_{t=t_1}^{t_2} (\pi(a_t | s_t) / \pi_b(a_t | s_t))\) and \(\rho(s, a) \approx d^{\pi}(s, a) / d^{\pi_b}(s, a)\) is the state-action marginal importance weight, where \(d^{\pi}(s, a)\) is the marginal visitation probability of the policy \(\pi\) on \((s, a)\). When \(k=0\), this estimator reduces to the vanilla state-action marginal IS.

SAM-IS is unbiased when the marginal importance weight is estimated correctly. Moreover, SAM-IS reduces the variance caused by trajectory-wise or per-decision importance weighting by considering the marginal distribution across various timesteps.

There are several ways to estimate the state(-action) marginal importance weight such as Augmented Lagrangian Method (ALM) (Yang et al., 2020) and Minimax Weight Learning (MWL) (Uehara et al., 2020).

See also

The implementations of such weight learning methods are available at scope_rl.ope.weight_value_learning.
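The SAM-IS formula above differs from SM-IS in that the marginal weight already accounts for the action at step \(t-k\), so the step-wise product starts one step later. A minimal pure-Python sketch follows; the function name and nested-list inputs are hypothetical, with `rho_sa[i][t]` holding the state-action marginal weight at step `t` of trajectory `i`.

```python
import math


def sam_is_policy_value(n_step_pdis, ratio, rho_sa, reward, gamma=1.0):
    """SAM-IS with k-step PDIS, written directly from the formula."""
    k = n_step_pdis
    n = len(reward)
    total = 0.0
    for i in range(n):
        for t in range(len(reward[i])):
            if t < k:
                # First k steps: plain per-decision importance weight w_{0:t}.
                weight = math.prod(ratio[i][: t + 1])
            else:
                # w_{t-k+1:t} is an empty product (equal to 1) when k = 0.
                weight = rho_sa[i][t - k] * math.prod(ratio[i][t - k + 1 : t + 1])
            total += gamma ** t * weight * reward[i][t]
    return total / n
```

With `n_step_pdis=0` the step-wise ratios are never used and each reward is weighted by the state-action marginal weight alone.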

Parameters:

estimator_name (str, default="sam_is") – Name of the estimator.

References

Christina J. Yuan, Yash Chandak, Stephen Giguere, Philip S. Thomas, and Scott Niekum. “SOPE: Spectrum of Off-Policy Estimators.” 2021.

Masatoshi Uehara, Jiawei Huang, and Nan Jiang. “Minimax Weight and Q-Function Learning for Off-Policy Evaluation.” 2020.

Mengjiao Yang, Ofir Nachum, Bo Dai, Lihong Li, and Dale Schuurmans. “Off-Policy Evaluation via the Regularized Lagrangian.” 2020.

Doina Precup, Richard S. Sutton, and Satinder P. Singh. “Eligibility Traces for Off-Policy Policy Evaluation.” 2000.

Methods

estimate_interval(n_step_pdis, ...[, gamma, ...])

Estimate the confidence interval of the policy value by nonparametric bootstrap.

estimate_policy_value(n_step_pdis, ...[, gamma])

Estimate the policy value of the evaluation policy.

estimate_policy_value(n_step_pdis, step_per_trajectory, action, reward, state_action_marginal_importance_weight, pscore, evaluation_policy_action_dist, gamma=1.0, **kwargs)[source]#

Estimate the policy value of the evaluation policy.

Parameters:
  • n_step_pdis (int (>= 0)) – Number of initial steps whose rewards are estimated by step-wise importance weighting. When set to zero, the estimator reduces to the vanilla state-action marginal IS.

  • step_per_trajectory (int (> 0)) – Number of timesteps in an episode.

  • action (array-like of shape (n_trajectories * step_per_trajectory, )) – Action chosen by the behavior policy.

  • reward (array-like of shape (n_trajectories * step_per_trajectory, )) – Observed immediate rewards.

  • state_action_marginal_importance_weight (array-like of shape (n_trajectories * step_per_trajectory, )) – Importance weight wrt the state-action marginal distribution, i.e., \(d^{\pi}(s, a) / d^{\pi_b}(s, a)\)

  • pscore (array-like of shape (n_trajectories * step_per_trajectory, )) – Conditional action choice probability of the behavior policy, i.e., \(\pi_b(a | s)\)

  • evaluation_policy_action_dist (array-like of shape (n_trajectories * step_per_trajectory, n_action)) – Conditional action distribution induced by the evaluation policy, i.e., \(\pi(a | s) \forall a \in \mathcal{A}\)

  • gamma (float, default=1.0) – Discount factor. The value should be within (0, 1].

Returns:

V_hat – Estimated policy value.

Return type:

float
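The flattened input layout can be illustrated with a short sketch. It assumes logged data is held as equal-length per-trajectory lists of `(state, action, reward, behavior_pscore)` tuples and that `eval_policy(state)` returns a list of action probabilities; both, along with the helper names, are hypothetical stand-ins and not part of the library.

```python
def flatten_logged_data(trajectories, eval_policy):
    """Flatten per-trajectory logs so that step t of trajectory i
    lands at index i * step_per_trajectory + t (hypothetical helper)."""
    step_per_trajectory = len(trajectories[0])
    action, reward, pscore, action_dist = [], [], [], []
    for trajectory in trajectories:
        if len(trajectory) != step_per_trajectory:
            raise ValueError("all trajectories must have the same length")
        for state, a, r, p in trajectory:
            action.append(a)                        # logged action index
            reward.append(r)                        # observed reward
            pscore.append(p)                        # pi_b(a_t | s_t)
            action_dist.append(eval_policy(state))  # pi(. | s_t)
    return step_per_trajectory, action, reward, pscore, action_dist


def taken_action_prob(action_dist, action):
    """Pick pi(a_t | s_t) for each logged action from the
    (n_trajectories * step_per_trajectory, n_action) table."""
    return [dist[a] for dist, a in zip(action_dist, action)]
```

The resulting lists have length n_trajectories * step_per_trajectory, and the action distribution has one row of length n_action per step, which is the shape the parameters above describe.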

estimate_interval(n_step_pdis, step_per_trajectory, action, reward, state_action_marginal_importance_weight, pscore, evaluation_policy_action_dist, gamma=1.0, alpha=0.05, ci='bootstrap', n_bootstrap_samples=10000, random_state=None, **kwargs)[source]#

Estimate the confidence interval of the policy value by nonparametric bootstrap.

Parameters:
  • n_step_pdis (int (>= 0)) – Number of initial steps whose rewards are estimated by step-wise importance weighting. When set to zero, the estimator reduces to the vanilla state-action marginal IS.

  • step_per_trajectory (int (> 0)) – Number of timesteps in an episode.

  • action (array-like of shape (n_trajectories * step_per_trajectory, )) – Action chosen by the behavior policy.

  • reward (array-like of shape (n_trajectories * step_per_trajectory, )) – Observed immediate rewards.

  • state_action_marginal_importance_weight (array-like of shape (n_trajectories * step_per_trajectory, )) – Importance weight wrt the state-action marginal distribution, i.e., \(d^{\pi}(s, a) / d^{\pi_b}(s, a)\)

  • pscore (array-like of shape (n_trajectories * step_per_trajectory, )) – Conditional action choice probability of the behavior policy, i.e., \(\pi_b(a | s)\)

  • evaluation_policy_action_dist (array-like of shape (n_trajectories * step_per_trajectory, n_action)) – Conditional action distribution induced by the evaluation policy, i.e., \(\pi(a | s) \forall a \in \mathcal{A}\)

  • gamma (float, default=1.0) – Discount factor. The value should be within (0, 1].

  • alpha (float, default=0.05) – Significance level. The value should be within [0, 1).

  • ci ({"bootstrap", "hoeffding", "bernstein", "ttest"}, default="bootstrap") – Method to estimate the confidence interval.

  • n_bootstrap_samples (int, default=10000 (> 0)) – Number of resamplings performed in the bootstrap procedure.

  • random_state (int, default=None (>= 0)) – Random state.

Returns:

estimated_confidence_interval – Dictionary storing the estimated mean and upper-lower confidence bounds.

key: [
    mean,
    {100 * (1. - alpha)}% CI (lower),
    {100 * (1. - alpha)}% CI (upper),
]

Return type:

dict
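As a rough illustration of the `ci="bootstrap"` option, the sketch below builds a percentile bootstrap interval from per-trajectory estimates and returns a dictionary with the three keys listed above. The percentile method, the use of the bootstrap mean for `mean`, and the names `bootstrap_interval_sketch` and `per_trajectory_estimates` are assumptions; the library's actual procedure may differ.

```python
import random


def bootstrap_interval_sketch(per_trajectory_estimates, alpha=0.05,
                              n_bootstrap_samples=1000, random_state=None):
    """Percentile bootstrap over per-trajectory estimates (illustrative)."""
    rng = random.Random(random_state)
    n = len(per_trajectory_estimates)
    # Resample trajectories with replacement and record each resample's mean.
    boot_means = sorted(
        sum(rng.choices(per_trajectory_estimates, k=n)) / n
        for _ in range(n_bootstrap_samples)
    )
    # Empirical alpha/2 and 1 - alpha/2 quantiles of the bootstrap means.
    lower_idx = int((alpha / 2) * (n_bootstrap_samples - 1))
    upper_idx = int((1 - alpha / 2) * (n_bootstrap_samples - 1))
    level = 100 * (1.0 - alpha)
    return {
        "mean": sum(boot_means) / n_bootstrap_samples,
        f"{level}% CI (lower)": boot_means[lower_idx],
        f"{level}% CI (upper)": boot_means[upper_idx],
    }
```

Fixing `random_state` makes the resampling, and therefore the interval, reproducible.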

class scope_rl.ope.discrete.marginal_estimators.StateActionMarginalDR(estimator_name='sam_dr')[source]#

State-Action Marginal Doubly Robust (SAM-DR) for discrete action spaces.

Bases: scope_rl.ope.BaseStateActionMarginalOPEEstimator -> scope_rl.ope.BaseOffPolicyEstimator

Imported as: scope_rl.ope.discrete.StateActionMarginalDR

Note

SAM-DR estimates the policy value using state-action marginal importance weighting. Following SOPE (Yuan et al., 2021), we combine state-action marginal importance weighting and \(k\)-step PDIS as follows.

\[\begin{split}\hat{J}_{\mathrm{SAM-DR}} (\pi; \mathcal{D}) &:= \frac{1}{n} \sum_{i=1}^n \sum_{a \in \mathcal{A}} \pi(a | s_0^{(i)}) \hat{Q}(s_0^{(i)}, a) \\ & \quad \quad + \frac{1}{n} \sum_{i=1}^n \sum_{t=0}^{k-1} \gamma^t w_{0:t}^{(i)} \left( r_t^{(i)} + \gamma \sum_{a \in \mathcal{A}} \pi(a | s_{t+1}^{(i)}) \hat{Q}(s_{t+1}^{(i)}, a) - \hat{Q}(s_t^{(i)}, a_t^{(i)}) \right) \\ & \quad \quad + \frac{1}{n} \sum_{i=1}^n \sum_{t=k}^{T-1} \gamma^t \rho(s_{t-k}^{(i)}, a_{t-k}^{(i)}) w_{t-k+1:t}^{(i)} \left( r_t^{(i)} + \gamma \sum_{a \in \mathcal{A}} \pi(a | s_{t+1}^{(i)}) \hat{Q}(s_{t+1}^{(i)}, a) - \hat{Q}(s_t^{(i)}, a_t^{(i)}) \right),\end{split}\]

where \(w_{t_1:t_2} := \prod_{t=t_1}^{t_2} (\pi(a_t | s_t) / \pi_b(a_t | s_t))\) is the cumulative per-step importance weight and \(\rho(s, a) \approx d^{\pi}(s, a) / d^{\pi_b}(s, a)\) is the state-action marginal importance weight, with \(d^{\pi}(s, a)\) denoting the marginal visitation probability of the policy \(\pi\) on \((s, a)\). When \(k=0\), this estimator reduces to the vanilla state-action marginal DR.

SAM-DR is unbiased when either the marginal importance weight or Q-function is estimated correctly. Moreover, SAM-DR reduces the variance caused by trajectory-wise or per-decision importance weighting by considering the marginal distribution across various timesteps.
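A minimal sketch of the doubly robust combination follows, written with plain lists. It is not the library's implementation: the name `sam_dr_sketch` is hypothetical, the next-state value is taken as \(\sum_{a} \pi(a | s_{t+1}) \hat{Q}(s_{t+1}, a)\), and that value is assumed to be zero after the final step of each trajectory.

```python
def sam_dr_sketch(n_step_pdis, step_per_trajectory, action, reward, rho,
                  pscore, evaluation_policy_action_dist,
                  state_action_value_prediction,
                  initial_state_value_prediction, gamma=1.0):
    """Illustrative SAM-DR with k-step PDIS (hypothetical helper)."""
    T, k = step_per_trajectory, n_step_pdis
    n = len(reward) // T
    total = 0.0
    for i in range(n):
        base = i * T
        dist = evaluation_policy_action_dist[base:base + T]
        q_hat = state_action_value_prediction[base:base + T]
        act = action[base:base + T]
        # Per-step ratios pi(a_t | s_t) / pi_b(a_t | s_t).
        ratio = [dist[t][act[t]] / pscore[base + t] for t in range(T)]
        # V_hat(s_t) = sum_a pi(a | s_t) Q_hat(s_t, a); assumed 0 at t = T.
        value = [sum(p * q for p, q in zip(dist[t], q_hat[t]))
                 for t in range(T)] + [0.0]
        # Baseline term: estimated value of the initial state.
        total += initial_state_value_prediction[i]
        for t in range(T):
            if t < k:
                weight, start = 1.0, 0
            else:
                weight, start = rho[base + t - k], t - k + 1
            for u in range(start, t + 1):
                weight *= ratio[u]
            # Importance-weighted temporal-difference residual.
            residual = (reward[base + t] + gamma * value[t + 1]
                        - q_hat[t][act[t]])
            total += gamma ** t * weight * residual
    return total / n
```

When \(\hat{Q}\) and the initial state value are all zero, the residual is just the reward and the sketch collapses to importance sampling; when the weights are exact, errors in \(\hat{Q}\) are corrected by the weighted residuals.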

There are several ways to estimate the state(-action) marginal importance weight such as Augmented Lagrangian Method (ALM) (Yang et al., 2020) and Minimax Weight Learning (MWL) (Uehara et al., 2020).

See also

The implementations of such weight learning methods are available at scope_rl.ope.weight_value_learning.

Parameters:

estimator_name (str, default="sam_dr") – Name of the estimator.

References

Christina J. Yuan, Yash Chandak, Stephen Giguere, Philip S. Thomas, and Scott Niekum. “SOPE: Spectrum of Off-Policy Estimators.” 2021.

Masatoshi Uehara, Jiawei Huang, and Nan Jiang. “Minimax Weight and Q-Function Learning for Off-Policy Evaluation.” 2020.

Mengjiao Yang, Ofir Nachum, Bo Dai, Lihong Li, and Dale Schuurmans. “Off-Policy Evaluation via the Regularized Lagrangian.” 2020.

Nan Jiang and Lihong Li. “Doubly Robust Off-policy Value Evaluation for Reinforcement Learning.” 2016.

Philip S. Thomas and Emma Brunskill. “Data-Efficient Off-Policy Policy Evaluation for Reinforcement Learning.” 2016.

Methods

estimate_interval(n_step_pdis, ...[, gamma, ...])

Estimate the confidence interval of the policy value by nonparametric bootstrap.

estimate_policy_value(n_step_pdis, ...[, gamma])

Estimate the policy value of the evaluation policy.

estimate_policy_value(n_step_pdis, step_per_trajectory, action, reward, state_action_marginal_importance_weight, pscore, evaluation_policy_action_dist, state_action_value_prediction, initial_state_value_prediction, gamma=1.0, **kwargs)[source]#

Estimate the policy value of the evaluation policy.

Parameters:
  • n_step_pdis (int (>= 0)) – Number of initial steps whose rewards are estimated by step-wise importance weighting. When set to zero, the estimator reduces to the vanilla state-action marginal DR.

  • step_per_trajectory (int (> 0)) – Number of timesteps in an episode.

  • action (array-like of shape (n_trajectories * step_per_trajectory, )) – Action chosen by the behavior policy.

  • reward (array-like of shape (n_trajectories * step_per_trajectory, )) – Observed immediate rewards.

  • state_action_marginal_importance_weight (array-like of shape (n_trajectories * step_per_trajectory, )) – Importance weight wrt the state-action marginal distribution, i.e., \(d^{\pi}(s, a) / d^{\pi_b}(s, a)\)

  • pscore (array-like of shape (n_trajectories * step_per_trajectory, )) – Conditional action choice probability of the behavior policy, i.e., \(\pi_b(a | s)\)

  • evaluation_policy_action_dist (array-like of shape (n_trajectories * step_per_trajectory, n_action)) – Conditional action distribution induced by the evaluation policy, i.e., \(\pi(a | s) \forall a \in \mathcal{A}\)

  • state_action_value_prediction (array-like of shape (n_trajectories * step_per_trajectory, n_action)) – \(\hat{Q}\) for all actions, i.e., \(\hat{Q}(s_t, a) \forall a \in \mathcal{A}\).

  • initial_state_value_prediction (array-like of shape (n_trajectories, )) – Estimated initial state value.

  • gamma (float, default=1.0) – Discount factor. The value should be within (0, 1].

Returns:

V_hat – Estimated policy value.

Return type:

float

estimate_interval(n_step_pdis, step_per_trajectory, action, reward, state_action_marginal_importance_weight, pscore, evaluation_policy_action_dist, state_action_value_prediction, initial_state_value_prediction, gamma=1.0, alpha=0.05, ci='bootstrap', n_bootstrap_samples=10000, random_state=None, **kwargs)[source]#

Estimate the confidence interval of the policy value by nonparametric bootstrap.

Parameters:
  • n_step_pdis (int (>= 0)) – Number of initial steps whose rewards are estimated by step-wise importance weighting. When set to zero, the estimator reduces to the vanilla state-action marginal DR.

  • step_per_trajectory (int (> 0)) – Number of timesteps in an episode.

  • action (array-like of shape (n_trajectories * step_per_trajectory, )) – Action chosen by the behavior policy.

  • reward (array-like of shape (n_trajectories * step_per_trajectory, )) – Observed immediate rewards.

  • state_action_marginal_importance_weight (array-like of shape (n_trajectories * step_per_trajectory, )) – Importance weight wrt the state-action marginal distribution, i.e., \(d^{\pi}(s, a) / d^{\pi_b}(s, a)\)

  • pscore (array-like of shape (n_trajectories * step_per_trajectory, )) – Conditional action choice probability of the behavior policy, i.e., \(\pi_b(a | s)\)

  • evaluation_policy_action_dist (array-like of shape (n_trajectories * step_per_trajectory, n_action)) – Conditional action distribution induced by the evaluation policy, i.e., \(\pi(a | s) \forall a \in \mathcal{A}\)

  • state_action_value_prediction (array-like of shape (n_trajectories * step_per_trajectory, n_action)) – \(\hat{Q}\) for all actions, i.e., \(\hat{Q}(s_t, a) \forall a \in \mathcal{A}\).

  • initial_state_value_prediction (array-like of shape (n_trajectories, )) – Estimated initial state value.

  • gamma (float, default=1.0) – Discount factor. The value should be within (0, 1].

  • alpha (float, default=0.05) – Significance level. The value should be within [0, 1).

  • ci ({"bootstrap", "hoeffding", "bernstein", "ttest"}, default="bootstrap") – Method to estimate the confidence interval.

  • n_bootstrap_samples (int, default=10000 (> 0)) – Number of resamplings performed in the bootstrap procedure.

  • random_state (int, default=None (>= 0)) – Random state.

Returns:

estimated_confidence_interval – Dictionary storing the estimated mean and upper-lower confidence bounds.

key: [
    mean,
    {100 * (1. - alpha)}% CI (lower),
    {100 * (1. - alpha)}% CI (upper),
]

Return type:

dict

class scope_rl.ope.discrete.marginal_estimators.StateActionMarginalSNIS(estimator_name='sam_snis')[source]#

State-Action Marginal Self-Normalized Importance Sampling (SAM-SNIS) for discrete action spaces.

Bases: scope_rl.ope.discrete.StateActionMarginalIS -> scope_rl.ope.BaseStateActionMarginalOPEEstimator -> scope_rl.ope.BaseOffPolicyEstimator

Imported as: scope_rl.ope.discrete.StateActionMarginalSNIS

Note

SAM-SNIS estimates the policy value using state-action marginal importance weighting. Following SOPE (Yuan et al., 2021), we combine state-action marginal importance weighting and \(k\)-step PDIS as follows.

\[\begin{split}\hat{J}_{\mathrm{SAM-SNIS}} (\pi; \mathcal{D}) &:= \sum_{i=1}^n \sum_{t=0}^{k-1} \gamma^t \frac{w_{0:t}^{(i)}}{\sum_{i'=1}^n w_{0:t}^{(i')}} r_t^{(i)} \\ & \quad \quad + \sum_{i=1}^n \sum_{t=k}^{T-1} \gamma^t \frac{\rho(s_{t-k}^{(i)}, a_{t-k}^{(i)}) w_{t-k+1:t}^{(i)}}{\sum_{i'=1}^n \rho(s_{t-k}^{(i')}, a_{t-k}^{(i')}) w_{t-k+1:t}^{(i')}} r_t^{(i)},\end{split}\]

where \(w_{t_1:t_2} := \prod_{t=t_1}^{t_2} (\pi(a_t | s_t) / \pi_b(a_t | s_t))\) is the cumulative per-step importance weight and \(\rho(s, a) \approx d^{\pi}(s, a) / d^{\pi_b}(s, a)\) is the state-action marginal importance weight, with \(d^{\pi}(s, a)\) denoting the marginal visitation probability of the policy \(\pi\) on \((s, a)\). When \(k=0\), this estimator reduces to the vanilla state-action marginal SNIS.

SAM-SNIS is consistent when the marginal importance weight is estimated correctly. Moreover, SAM-SNIS reduces the variance caused by trajectory-wise or per-decision importance weighting by considering the marginal distribution across various timesteps.
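The self-normalization step can be sketched on its own. The example assumes the combined weight for each trajectory and timestep (\(w_{0:t}\) for \(t < k\), \(\rho \cdot w_{t-k+1:t}\) otherwise) has already been computed and is passed as a nested list; the function name and the choice to skip timesteps whose weights sum to zero are assumptions, not the library's behavior.

```python
def self_normalized_estimate(weights, rewards, gamma=1.0):
    """Self-normalized weighted return (illustrative).

    weights[i][t] and rewards[i][t] refer to step t of trajectory i.
    """
    n = len(weights)
    T = len(weights[0])
    estimate = 0.0
    for t in range(T):
        # Normalize by the sum of weights over trajectories at step t,
        # instead of dividing by the number of trajectories n.
        normalizer = sum(weights[i][t] for i in range(n))
        if normalizer == 0.0:
            continue  # assumption: a step with no weight mass adds nothing
        for i in range(n):
            estimate += gamma ** t * (weights[i][t] / normalizer) * rewards[i][t]
    return estimate
```

Because the weights at each timestep are divided by their own sum, rescaling every weight by a constant leaves the estimate unchanged, which is the source of the variance reduction relative to the unnormalized estimator.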

There are several ways to estimate the state(-action) marginal importance weight such as Augmented Lagrangian Method (ALM) (Yang et al., 2020) and Minimax Weight Learning (MWL) (Uehara et al., 2020).

See also

The implementations of such weight learning methods are available at scope_rl.ope.weight_value_learning.

Parameters:

estimator_name (str, default="sam_snis") – Name of the estimator.

References

Christina J. Yuan, Yash Chandak, Stephen Giguere, Philip S. Thomas, and Scott Niekum. “SOPE: Spectrum of Off-Policy Estimators.” 2021.

Masatoshi Uehara, Jiawei Huang, and Nan Jiang. “Minimax Weight and Q-Function Learning for Off-Policy Evaluation.” 2020.

Mengjiao Yang, Ofir Nachum, Bo Dai, Lihong Li, and Dale Schuurmans. “Off-Policy Evaluation via the Regularized Lagrangian.” 2020.

Doina Precup, Richard S. Sutton, and Satinder P. Singh. “Eligibility Traces for Off-Policy Policy Evaluation.” 2000.

Methods

estimate_interval(n_step_pdis, ...[, gamma, ...])

Estimate the confidence interval of the policy value by nonparametric bootstrap.

estimate_policy_value(n_step_pdis, ...[, gamma])

Estimate the policy value of the evaluation policy.

class scope_rl.ope.discrete.marginal_estimators.StateActionMarginalSNDR(estimator_name='sam_sndr')[source]#

State-Action Marginal Self-Normalized Doubly Robust (SAM-SNDR) for discrete action spaces.

Bases: scope_rl.ope.discrete.StateActionMarginalDR -> scope_rl.ope.BaseStateActionMarginalOPEEstimator -> scope_rl.ope.BaseOffPolicyEstimator

Imported as: scope_rl.ope.discrete.StateActionMarginalSNDR

Note

SAM-SNDR estimates the policy value using state-action marginal importance weighting. Following SOPE (Yuan et al., 2021), we combine state-action marginal importance weighting and \(k\)-step PDIS as follows.

\[\begin{split}\hat{J}_{\mathrm{SAM-SNDR}} (\pi; \mathcal{D}) &:= \frac{1}{n} \sum_{i=1}^n \sum_{a \in \mathcal{A}} \pi(a | s_0^{(i)}) \hat{Q}(s_0^{(i)}, a) \\ & \quad \quad + \sum_{i=1}^n \sum_{t=0}^{k-1} \gamma^t \frac{w_{0:t}^{(i)}}{\sum_{i'=1}^n w_{0:t}^{(i')}} \left( r_t^{(i)} + \gamma \sum_{a \in \mathcal{A}} \pi(a | s_{t+1}^{(i)}) \hat{Q}(s_{t+1}^{(i)}, a) - \hat{Q}(s_t^{(i)}, a_t^{(i)}) \right) \\ & \quad \quad + \sum_{i=1}^n \sum_{t=k}^{T-1} \gamma^t \frac{\rho(s_{t-k}^{(i)}, a_{t-k}^{(i)}) w_{t-k+1:t}^{(i)}}{\sum_{i'=1}^n \rho(s_{t-k}^{(i')}, a_{t-k}^{(i')}) w_{t-k+1:t}^{(i')}} \left( r_t^{(i)} + \gamma \sum_{a \in \mathcal{A}} \pi(a | s_{t+1}^{(i)}) \hat{Q}(s_{t+1}^{(i)}, a) - \hat{Q}(s_t^{(i)}, a_t^{(i)}) \right),\end{split}\]

where \(w_{t_1:t_2} := \prod_{t=t_1}^{t_2} (\pi(a_t | s_t) / \pi_b(a_t | s_t))\) is the cumulative per-step importance weight and \(\rho(s, a) \approx d^{\pi}(s, a) / d^{\pi_b}(s, a)\) is the state-action marginal importance weight, with \(d^{\pi}(s, a)\) denoting the marginal visitation probability of the policy \(\pi\) on \((s, a)\). When \(k=0\), this estimator reduces to the vanilla state-action marginal SNDR.

SAM-SNDR is consistent when either the marginal importance weight or Q-function is estimated correctly. Moreover, SAM-SNDR reduces the variance caused by trajectory-wise or per-decision importance weighting by considering the marginal distribution across various timesteps.

There are several ways to estimate the state(-action) marginal importance weight such as Augmented Lagrangian Method (ALM) (Yang et al., 2020) and Minimax Weight Learning (MWL) (Uehara et al., 2020).

See also

The implementations of such weight learning methods are available at scope_rl.ope.weight_value_learning.

Parameters:

estimator_name (str, default="sam_sndr") – Name of the estimator.

References

Christina J. Yuan, Yash Chandak, Stephen Giguere, Philip S. Thomas, and Scott Niekum. “SOPE: Spectrum of Off-Policy Estimators.” 2021.

Masatoshi Uehara, Jiawei Huang, and Nan Jiang. “Minimax Weight and Q-Function Learning for Off-Policy Evaluation.” 2020.

Mengjiao Yang, Ofir Nachum, Bo Dai, Lihong Li, and Dale Schuurmans. “Off-Policy Evaluation via the Regularized Lagrangian.” 2020.

Nan Jiang and Lihong Li. “Doubly Robust Off-policy Value Evaluation for Reinforcement Learning.” 2016.

Philip S. Thomas and Emma Brunskill. “Data-Efficient Off-Policy Policy Evaluation for Reinforcement Learning.” 2016.

Methods

estimate_interval(n_step_pdis, ...[, gamma, ...])

Estimate the confidence interval of the policy value by nonparametric bootstrap.

estimate_policy_value(n_step_pdis, ...[, gamma])

Estimate the policy value of the evaluation policy.