scope_rl.ope.continuous.marginal_estimators.StateActionMarginalSNDR

class scope_rl.ope.continuous.marginal_estimators.StateActionMarginalSNDR(estimator_name='sam_sndr')

State-Action Marginal Self-Normalized Doubly Robust (SAM-SNDR) for continuous action spaces.

Bases: scope_rl.ope.continuous.StateActionMarginalDR -> scope_rl.ope.BaseStateActionMarginalOPEEstimator -> scope_rl.ope.BaseOffPolicyEstimator

Imported as: scope_rl.ope.continuous.StateActionMarginalSNDR

Note

SAM-SNDR estimates the policy value using state-action marginal importance weighting. Following SOPE (Yuan et al., 2021), we combine state-action marginal importance weighting and \(k\)-step PDIS as follows.

\[\begin{split}\hat{J}_{\mathrm{SAM-SNDR}} (\pi; \mathcal{D}) &:= \frac{1}{n} \sum_{i=1}^n \hat{Q}(s_0^{(i)}, \pi(s_0^{(i)})) \\ & \quad \quad + \sum_{i=1}^n \sum_{t=0}^{k-1} \gamma^t \frac{w_{0:t}^{(i)} \delta(\pi, a_{0:t}^{(i)})}{\sum_{i'=1}^n w_{0:t}^{(i')} \delta(\pi, a_{0:t}^{(i')})} (r_t^{(i)} + \gamma \hat{Q}(s_{t+1}^{(i)}, \pi(s_{t+1}^{(i)})) - \hat{Q}(s_t^{(i)}, a_t^{(i)})) \\ & \quad \quad + \sum_{i=1}^n \sum_{t=k}^{T-1} \gamma^t \frac{w(s_{t-k}^{(i)}, a_{t-k}^{(i)}) w_{t-k+1:t}^{(i)} \delta(\pi, a_{t-k+1:t}^{(i)})}{\sum_{i'=1}^n w(s_{t-k}^{(i')}, a_{t-k}^{(i')}) w_{t-k+1:t}^{(i')} \delta(\pi, a_{t-k+1:t}^{(i')})} (r_t^{(i)} + \gamma \hat{Q}(s_{t+1}^{(i)}, \pi(s_{t+1}^{(i)})) - \hat{Q}(s_t^{(i)}, a_t^{(i)})),\end{split}\]

where \(w_{t_1:t_2} := \prod_{t=t_1}^{t_2} (\pi(a_t | s_t) / \pi_b(a_t | s_t))\) is the step-wise importance weight and \(w(s, a) \approx d^{\pi}(s, a) / d^{\pi_b}(s, a)\) is the state-action marginal importance weight, where \(d^{\pi}(s, a)\) is the marginal visitation probability of the policy \(\pi\) on \((s, a)\). \(\hat{Q}(s, a)\) is the estimated state-action value. \(\delta(\pi, a_{t_1:t_2}) = \prod_{t=t_1}^{t_2} K(\pi(s_t), a_t)\) quantifies the similarity between the action logged in the dataset and that taken by the evaluation policy (\(K(\cdot, \cdot)\) is a kernel function). Note that the bandwidth of the kernel is an important hyperparameter: a larger bandwidth typically reduces the variance of the above estimator but increases its bias. Additionally, when \(k=0\), this estimator reduces to the vanilla state-action marginal SNDR.
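
The self-normalized weights in the formula above can be illustrated with a minimal NumPy sketch. This is not the library's implementation: the helper names `gaussian_kernel` and `k_step_weights` are hypothetical, actions are assumed to be one-dimensional, and the per-step ratio is assumed to be the kernel similarity divided by the bandwidth and the behavior density.

```python
import numpy as np


def gaussian_kernel(u):
    """Standard Gaussian kernel evaluated at the scaled distance u."""
    return np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)


def k_step_weights(action, evaluation_policy_action, pscore,
                   marginal_weight, k, bandwidth):
    """Self-normalized weight of every (trajectory, timestep) pair.

    All array inputs have shape (n_trajectories, T).
    """
    # Assumption: kernelized per-step ratio K(pi(s_t), a_t) / (h * pi_b(a_t | s_t)).
    u = (evaluation_policy_action - action) / bandwidth
    step_ratio = gaussian_kernel(u) / (bandwidth * pscore)

    n, T = step_ratio.shape
    weight = np.empty((n, T))
    for t in range(T):
        start = max(t - k + 1, 0)
        # Product of the most recent (at most k) step-wise ratios.
        # For k = 0 the slice is empty and the product is 1.
        weight[:, t] = step_ratio[:, start:t + 1].prod(axis=1)
        if t >= k:
            # Earlier history is summarized by w(s_{t-k}, a_{t-k}).
            weight[:, t] *= marginal_weight[:, t - k]

    # Self-normalization: weights at each timestep sum to one over trajectories.
    return weight / weight.sum(axis=0, keepdims=True)


rng = np.random.default_rng(0)
n, T = 5, 4
action = rng.normal(size=(n, T))
evaluation_policy_action = action + 0.1 * rng.normal(size=(n, T))
pscore = rng.uniform(0.2, 1.0, size=(n, T))
marginal_weight = rng.uniform(0.5, 1.5, size=(n, T))

weights = k_step_weights(
    action, evaluation_policy_action, pscore, marginal_weight, k=2, bandwidth=1.0
)
print(weights.sum(axis=0))  # one value per timestep
```

With `k=0` the step-wise product is empty, so the weights depend only on the normalized marginal weight, which matches the reduction to the vanilla state-action marginal SNDR noted above.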

SAM-SNDR corrects the distribution shift between the behavior and evaluation policies. Moreover, SAM-SNDR reduces the variance caused by trajectory-wise or per-decision importance weighting by considering the marginal distribution across various timesteps.

There are several ways to estimate the state(-action) marginal importance weight, such as the Augmented Lagrangian Method (ALM) (Yang et al., 2020) and Minimax Weight Learning (MWL) (Uehara et al., 2020).

See also

The implementations of such weight learning methods are available at scope_rl.ope.weight_value_learning.

Parameters:

estimator_name (str, default="sam_sndr") – Name of the estimator.

References

Christina J. Yuan, Yash Chandak, Stephen Giguere, Philip S. Thomas, and Scott Niekum. “SOPE: Spectrum of Off-Policy Estimators.” 2021.

Masatoshi Uehara, Jiawei Huang, and Nan Jiang. “Minimax Weight and Q-Function Learning for Off-Policy Evaluation.” 2020.

Mengjiao Yang, Ofir Nachum, Bo Dai, Lihong Li, and Dale Schuurmans. “Off-Policy Evaluation via the Regularized Lagrangian.” 2020.

Nathan Kallus and Angela Zhou. “Policy Evaluation and Optimization with Continuous Treatments.” 2019.

Nan Jiang and Lihong Li. “Doubly Robust Off-policy Value Evaluation for Reinforcement Learning.” 2016.

Philip S. Thomas and Emma Brunskill. “Data-Efficient Off-Policy Policy Evaluation for Reinforcement Learning.” 2016.

Methods

estimate_interval(n_step_pdis, ...[, gamma, ...])

Estimate the confidence interval of the policy value by nonparametric bootstrap.

estimate_policy_value(n_step_pdis, ...[, ...])

Estimate the policy value of the evaluation policy.

estimate_interval(n_step_pdis, step_per_trajectory, action, reward, state_action_marginal_importance_weight, pscore, evaluation_policy_action, state_action_value_prediction, initial_state_value_prediction, gamma=1.0, kernel='gaussian', bandwidth=1.0, action_scaler=None, alpha=0.05, ci='bootstrap', n_bootstrap_samples=10000, random_state=None, **kwargs)

Estimate the confidence interval of the policy value by nonparametric bootstrap.

Parameters:
  • n_step_pdis (int (>= 0)) – Number of initial steps whose rewards are estimated by step-wise importance weighting. When set to zero, the estimator is reduced to the vanilla state-action marginal SNDR.

  • step_per_trajectory (int (> 0)) – Number of timesteps in an episode.

  • action (array-like of shape (n_trajectories * step_per_trajectory, action_dim)) – Action chosen by the behavior policy.

  • reward (array-like of shape (n_trajectories * step_per_trajectory, )) – Observed immediate rewards.

  • state_action_marginal_importance_weight (array-like of shape (n_trajectories * step_per_trajectory, )) – Importance weight wrt the state-action marginal distribution, i.e., \(d^{\pi}(s, a) / d^{\pi_b}(s, a)\)

  • pscore (array-like of shape (n_trajectories * step_per_trajectory, )) – Conditional action choice probability of the behavior policy, i.e., \(\pi_b(a | s)\)

  • evaluation_policy_action (array-like of shape (n_trajectories * step_per_trajectory, action_dim)) – Action chosen by the evaluation policy.

  • state_action_value_prediction (array-like of shape (n_trajectories * step_per_trajectory, 2)) – \(\hat{Q}\) for the observed action and that chosen by the evaluation policy, i.e., (column 0) \(\hat{Q}(s_t, a_t)\) and (column 1) \(\hat{Q}(s_t, \pi(s_t))\).

  • initial_state_value_prediction (array-like of shape (n_trajectories, )) – Estimated initial state value.

  • gamma (float, default=1.0) – Discount factor. The value should be within (0, 1].

  • kernel ({"gaussian", "epanechnikov", "triangular", "cosine", "uniform"}, default="gaussian") – Name of the kernel function to smooth importance weights.

  • bandwidth (float, default=1.0 (> 0)) – Bandwidth hyperparameter of the kernel function.

  • action_scaler (d3rlpy.preprocessing.ActionScaler, default=None) – Scaling factor of action.

  • alpha (float, default=0.05) – Significance level. The value should be within [0, 1).

  • ci ({"bootstrap", "hoeffding", "bernstein", "ttest"}, default="bootstrap") – Method to estimate the confidence interval.

  • n_bootstrap_samples (int, default=10000 (> 0)) – Number of resamples drawn in the bootstrap procedure.

  • random_state (int, default=None (>= 0)) – Random state.

Returns:

estimated_confidence_interval – Dictionary storing the estimated mean and upper-lower confidence bounds.

key: [
    mean,
    {100 * (1. - alpha)}% CI (lower),
    {100 * (1. - alpha)}% CI (upper),
]

Return type:

dict
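
The default `ci='bootstrap'` option can be pictured with the following minimal sketch of a nonparametric (percentile) bootstrap. It is an illustration under stated assumptions rather than the library's code: the function name `bootstrap_interval` is hypothetical, and it assumes the input is a one-dimensional array of per-trajectory estimates.

```python
import numpy as np


def bootstrap_interval(samples, alpha=0.05, n_bootstrap_samples=10000,
                       random_state=None):
    """Percentile bootstrap interval for the mean of `samples`."""
    rng = np.random.default_rng(random_state)
    samples = np.asarray(samples, dtype=float)

    # Resample with replacement and record the mean of each resample.
    idx = rng.integers(0, len(samples), size=(n_bootstrap_samples, len(samples)))
    boot_means = samples[idx].mean(axis=1)

    lower, upper = np.quantile(boot_means, [alpha / 2, 1 - alpha / 2])
    level = 100 * (1.0 - alpha)
    # Keys follow the pattern listed under "Returns" above.
    return {
        "mean": boot_means.mean(),
        f"{level}% CI (lower)": lower,
        f"{level}% CI (upper)": upper,
    }


# Hypothetical per-trajectory estimates, for illustration only.
per_trajectory_estimates = np.random.default_rng(0).normal(loc=1.0, size=200)
interval = bootstrap_interval(
    per_trajectory_estimates, alpha=0.05, n_bootstrap_samples=1000, random_state=0
)
print(interval)
```

A smaller `alpha` widens the interval, and a larger `n_bootstrap_samples` makes the estimated bounds more stable at the cost of more computation.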

estimate_policy_value(n_step_pdis, step_per_trajectory, action, reward, state_action_marginal_importance_weight, pscore, evaluation_policy_action, state_action_value_prediction, initial_state_value_prediction, gamma=1.0, kernel='gaussian', bandwidth=1.0, action_scaler=None, **kwargs)

Estimate the policy value of the evaluation policy.

Parameters:
  • n_step_pdis (int (>= 0)) – Number of initial steps whose rewards are estimated by step-wise importance weighting. When set to zero, the estimator is reduced to the vanilla state-action marginal SNDR.

  • step_per_trajectory (int (> 0)) – Number of timesteps in an episode.

  • action (array-like of shape (n_trajectories * step_per_trajectory, action_dim)) – Action chosen by the behavior policy.

  • reward (array-like of shape (n_trajectories * step_per_trajectory, )) – Observed immediate rewards.

  • state_action_marginal_importance_weight (array-like of shape (n_trajectories * step_per_trajectory, )) – Importance weight wrt the state-action marginal distribution, i.e., \(d^{\pi}(s, a) / d^{\pi_b}(s, a)\)

  • pscore (array-like of shape (n_trajectories * step_per_trajectory, )) – Conditional action choice probability of the behavior policy, i.e., \(\pi_b(a | s)\)

  • evaluation_policy_action (array-like of shape (n_trajectories * step_per_trajectory, action_dim)) – Action chosen by the evaluation policy.

  • state_action_value_prediction (array-like of shape (n_trajectories * step_per_trajectory, 2)) – \(\hat{Q}\) for the observed action and that chosen by the evaluation policy, i.e., (column 0) \(\hat{Q}(s_t, a_t)\) and (column 1) \(\hat{Q}(s_t, \pi(s_t))\).

  • initial_state_value_prediction (array-like of shape (n_trajectories, )) – Estimated initial state value.

  • gamma (float, default=1.0) – Discount factor. The value should be within (0, 1].

  • kernel ({"gaussian", "epanechnikov", "triangular", "cosine", "uniform"}, default="gaussian") – Name of the kernel function to smooth importance weights.

  • bandwidth (float, default=1.0 (> 0)) – Bandwidth hyperparameter of the kernel function.

  • action_scaler (d3rlpy.preprocessing.ActionScaler, default=None) – Scaling factor of action.

Returns:

V_hat – Estimated policy value.

Return type:

float
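
The flattened array shapes listed in the parameters above can be prepared as in the following sketch. The data are random placeholders and the variable names are hypothetical. The sketch also assumes a deterministic evaluation policy, so the initial state value is taken to be \(\hat{Q}(s_0, \pi(s_0))\), as in the first term of the estimator.

```python
import numpy as np

n_trajectories, step_per_trajectory, action_dim = 3, 4, 2
n_rows = n_trajectories * step_per_trajectory
rng = np.random.default_rng(0)

# Hypothetical logs, organized as (trajectory, timestep, ...).
action_3d = rng.normal(size=(n_trajectories, step_per_trajectory, action_dim))
eval_action_3d = rng.normal(size=(n_trajectories, step_per_trajectory, action_dim))
reward_2d = rng.normal(size=(n_trajectories, step_per_trajectory))
pscore_2d = rng.uniform(0.1, 1.0, size=(n_trajectories, step_per_trajectory))
marginal_weight_2d = rng.uniform(0.5, 1.5, size=(n_trajectories, step_per_trajectory))
q_logged = rng.normal(size=(n_trajectories, step_per_trajectory))  # Q_hat(s_t, a_t)
q_eval = rng.normal(size=(n_trajectories, step_per_trajectory))    # Q_hat(s_t, pi(s_t))

# Flatten so that the timesteps of each trajectory occupy consecutive rows.
inputs = {
    "step_per_trajectory": step_per_trajectory,
    "action": action_3d.reshape(n_rows, action_dim),
    "reward": reward_2d.reshape(n_rows),
    "state_action_marginal_importance_weight": marginal_weight_2d.reshape(n_rows),
    "pscore": pscore_2d.reshape(n_rows),
    "evaluation_policy_action": eval_action_3d.reshape(n_rows, action_dim),
    # Column 0: logged action, column 1: evaluation policy action.
    "state_action_value_prediction": np.stack([q_logged, q_eval], axis=-1).reshape(n_rows, 2),
    # Assumption: V_hat(s_0) = Q_hat(s_0, pi(s_0)) for a deterministic policy.
    "initial_state_value_prediction": q_eval[:, 0],
}

for name, value in inputs.items():
    print(name, np.shape(value))
```

Arrays prepared this way match the documented shapes and could be passed as keyword arguments to `estimate_policy_value` together with `n_step_pdis`.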
