scope_rl.ope.discrete.basic_estimators.PerDecisionImportanceSampling

class scope_rl.ope.discrete.basic_estimators.PerDecisionImportanceSampling

Per-Decision Importance Sampling (PDIS) for discrete action spaces.

Bases: scope_rl.ope.BaseOffPolicyEstimator

Imported as: scope_rl.ope.discrete.PerDecisionImportanceSampling

Note

PDIS estimates the policy value via step-wise importance weighting as follows.

\[\hat{J}_{\mathrm{PDIS}} (\pi; \mathcal{D}) := \frac{1}{n} \sum_{i=1}^n \sum_{t=0}^{T-1} \gamma^t w_{0:t}^{(i)} r_t^{(i)},\]

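where \(w_{0:t} := \prod_{t'=0}^t (\pi(a_{t'} | s_{t'}) / \pi_b(a_{t'} | s_{t'}))\) is the cumulative importance weight at time step \(t\), i.e., the product of the action probability ratios between the evaluation policy \(\pi\) and the behavior policy \(\pi_b\) over all actions taken up to and including step \(t\) (referred to as the per-decision or step-wise importance weight).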

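By weighting each reward only with the importance ratios of the actions taken up to that step, instead of the full-trajectory weight used by Trajectory-wise Importance Sampling (TIS), PDIS typically has lower variance than TIS while remaining unbiased. However, when the trajectory length (\(T\)) is large, PDIS still suffers from high variance, because the weights of late time steps remain products of many ratios.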
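The estimator defined above can be sketched in plain Python as follows. This is a minimal illustration of the formula, not the library's implementation; the helper names `pdis_per_trajectory` and `pdis_policy_value` and the toy data are assumptions made for this example, while the argument names mirror the parameters documented for `estimate_policy_value`.

```python
def pdis_per_trajectory(step_per_trajectory, action, reward, pscore,
                        evaluation_policy_action_dist, gamma=1.0):
    """Return one PDIS return per trajectory (hypothetical helper).

    All inputs are flattened over (n_trajectories * step_per_trajectory).
    """
    n_trajectories = len(reward) // step_per_trajectory
    estimates = []
    for i in range(n_trajectories):
        weight = 1.0    # running product w_{0:t}
        discount = 1.0  # gamma ** t
        total = 0.0
        for t in range(step_per_trajectory):
            k = i * step_per_trajectory + t
            # ratio pi(a_t | s_t) / pi_b(a_t | s_t) for the logged action
            weight *= evaluation_policy_action_dist[k][action[k]] / pscore[k]
            total += discount * weight * reward[k]
            discount *= gamma
        estimates.append(total)
    return estimates


def pdis_policy_value(*args, **kwargs):
    """Average the per-trajectory PDIS returns (the 1/n sum in the formula)."""
    per_trajectory = pdis_per_trajectory(*args, **kwargs)
    return sum(per_trajectory) / len(per_trajectory)


# Toy data (made up): 2 trajectories, 2 steps each, 2 actions.
action = [0, 1, 1, 0]
reward = [1.0, 0.0, 0.0, 1.0]
pscore = [0.5, 0.5, 0.5, 0.5]                  # uniform behavior policy
evaluation_policy_action_dist = [[0.8, 0.2]] * 4

value = pdis_policy_value(
    2, action, reward, pscore, evaluation_policy_action_dist, gamma=0.9
)
```

When the evaluation policy equals the behavior policy, every ratio is 1 and the sketch reduces to the average discounted return of the logged trajectories.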

Parameters:

estimator_name (str, default="pdis") – Name of the estimator.

References

Doina Precup, Richard S. Sutton, and Satinder P. Singh. “Eligibility Traces for Off-Policy Policy Evaluation.” 2000.

Methods

estimate_interval(step_per_trajectory, ...)

Estimate the confidence interval of the policy value, by nonparametric bootstrap unless another method is selected via ci.

estimate_policy_value(step_per_trajectory, ...)

Estimate the policy value of the evaluation policy.

estimate_policy_value(step_per_trajectory, action, reward, pscore, evaluation_policy_action_dist, gamma=1.0, **kwargs)

Estimate the policy value of the evaluation policy.

Parameters:
  • step_per_trajectory (int (> 0)) – Number of timesteps in an episode.

  • action (array-like of shape (n_trajectories * step_per_trajectory, )) – Action chosen by the behavior policy.

  • reward (array-like of shape (n_trajectories * step_per_trajectory, )) – Observed immediate rewards.

  • pscore (array-like of shape (n_trajectories * step_per_trajectory, )) – Conditional probability with which the behavior policy chose the logged action, i.e., \(\pi_b(a | s)\).

  • evaluation_policy_action_dist (array-like of shape (n_trajectories * step_per_trajectory, n_action)) – Conditional action distribution induced by the evaluation policy, i.e., \(\pi(a | s_t) \forall a \in \mathcal{A}\).

  • gamma (float, default=1.0) – Discount factor. The value should be within (0, 1].

Returns:

V_hat – Estimated policy value.

Return type:

float
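The flattened input layout described above can be illustrated with a small sketch that converts per-trajectory logs into the documented shapes. The function `flatten_logged_data`, the two toy policies, and the `(state, action, reward)` tuple format are all hypothetical and exist only for this example; they are not part of scope_rl.

```python
def flatten_logged_data(trajectories, behavior_policy, evaluation_policy):
    """Flatten equal-length trajectories into the documented 1-D/2-D layouts.

    trajectories: list of trajectories, each a list of (state, action, reward).
    behavior_policy / evaluation_policy: callables mapping a state to a list
    of action probabilities (assumed interface for this sketch).
    """
    step_per_trajectory = len(trajectories[0])
    if any(len(tr) != step_per_trajectory for tr in trajectories):
        raise ValueError("all trajectories must have the same length")

    action, reward, pscore, evaluation_policy_action_dist = [], [], [], []
    for trajectory in trajectories:
        for state, a, r in trajectory:
            action.append(a)
            reward.append(r)
            # probability of the logged action under the behavior policy
            pscore.append(behavior_policy(state)[a])
            # full action distribution of the evaluation policy at this state
            evaluation_policy_action_dist.append(list(evaluation_policy(state)))
    return step_per_trajectory, action, reward, pscore, evaluation_policy_action_dist


def uniform_policy(state):
    return [0.5, 0.5]


def state_matching_policy(state):
    # prefers the action whose index equals the state
    return [0.9, 0.1] if state == 0 else [0.1, 0.9]


# Toy logs (made up): 2 trajectories of 2 steps, states and actions in {0, 1}.
trajectories = [
    [(0, 0, 1.0), (1, 1, 0.0)],
    [(1, 0, 0.0), (0, 1, 1.0)],
]
step, action, reward, pscore, dist = flatten_logged_data(
    trajectories, uniform_policy, state_matching_policy
)
```

Row `k = i * step_per_trajectory + t` of each output corresponds to step `t` of trajectory `i`, which is the ordering the shapes above imply.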

estimate_interval(step_per_trajectory, action, reward, pscore, evaluation_policy_action_dist, gamma=1.0, alpha=0.05, ci='bootstrap', n_bootstrap_samples=10000, random_state=None, **kwargs)

Estimate the confidence interval of the policy value, by nonparametric bootstrap unless another method is selected via ci.

Parameters:
  • step_per_trajectory (int (> 0)) – Number of timesteps in an episode.

  • action (array-like of shape (n_trajectories * step_per_trajectory, )) – Action chosen by the behavior policy.

  • reward (array-like of shape (n_trajectories * step_per_trajectory, )) – Observed immediate rewards.

  • pscore (array-like of shape (n_trajectories * step_per_trajectory, )) – Conditional probability with which the behavior policy chose the logged action, i.e., \(\pi_b(a | s)\).

  • evaluation_policy_action_dist (array-like of shape (n_trajectories * step_per_trajectory, n_action)) – Conditional action distribution induced by the evaluation policy, i.e., \(\pi(a | s_t) \forall a \in \mathcal{A}\).

  • gamma (float, default=1.0) – Discount factor. The value should be within (0, 1].

  • alpha (float, default=0.05) – Significance level. The value should be within [0, 1).

  • ci ({"bootstrap", "hoeffding", "bernstein", "ttest"}, default="bootstrap") – Name of the method to estimate the confidence interval.

  • n_bootstrap_samples (int (> 0), default=10000) – Number of resamplings performed in the bootstrap procedure.

  • random_state (int (>= 0), default=None) – Random state.

Returns:

estimated_confidence_interval – Dictionary storing the estimated mean and the lower and upper confidence bounds.

key: [
    mean,
    {100 * (1. - alpha)}% CI (lower),
    {100 * (1. - alpha)}% CI (upper),
]

Return type:

dict
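For intuition, a percentile-style nonparametric bootstrap over per-trajectory estimates might look like the sketch below. It is an assumption-laden illustration rather than the library's code: the function `bootstrap_interval` is hypothetical, the percentile indexing is one of several reasonable conventions, and the exact key strings produced by scope_rl may be formatted differently from the pattern shown in the Returns section.

```python
import random


def bootstrap_interval(per_trajectory_estimates, alpha=0.05,
                       n_bootstrap_samples=10000, random_state=None):
    """Percentile bootstrap CI for the mean of per-trajectory estimates."""
    rng = random.Random(random_state)
    n = len(per_trajectory_estimates)
    # resample trajectories with replacement and record each resample's mean
    boot_means = sorted(
        sum(rng.choices(per_trajectory_estimates, k=n)) / n
        for _ in range(n_bootstrap_samples)
    )
    lower_idx = int((alpha / 2) * (n_bootstrap_samples - 1))
    upper_idx = int((1 - alpha / 2) * (n_bootstrap_samples - 1))
    level = 100 * (1.0 - alpha)
    return {
        "mean": sum(boot_means) / n_bootstrap_samples,
        f"{level}% CI (lower)": boot_means[lower_idx],
        f"{level}% CI (upper)": boot_means[upper_idx],
    }


# Made-up per-trajectory PDIS returns, e.g. one value per logged episode.
per_trajectory_estimates = [1.6, 0.576, 0.9, 1.2, 0.3, 2.1]
interval = bootstrap_interval(
    per_trajectory_estimates, alpha=0.05, n_bootstrap_samples=1000, random_state=12345
)
```

Resampling whole trajectories rather than individual steps keeps the temporal structure of each episode intact, which is why the sketch operates on per-trajectory values.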
