scope_rl.ope.continuous.basic_estimators.PerDecisionImportanceSampling

class scope_rl.ope.continuous.basic_estimators.PerDecisionImportanceSampling(estimator_name='pdis')

Per-Decision Importance Sampling (PDIS) for continuous action spaces, designed for deterministic evaluation policies.

Bases: scope_rl.ope.BaseOffPolicyEstimator

Imported as: scope_rl.ope.continuous.PerDecisionImportanceSampling

Note

PDIS estimates the policy value via step-wise importance weighting as follows.

\[\hat{J}_{\mathrm{PDIS}} (\pi; \mathcal{D}) := \frac{1}{n} \sum_{i=1}^n \sum_{t=0}^{T-1} \gamma^t w_{0:t}^{(i)} \delta(\pi, a_{0:t}^{(i)}) r_t^{(i)},\]

where \(w_{0:t} := \prod_{t'=0}^t (1 / \pi_0(a_{t'} | s_{t'}))\) is the importance weight at each time step with respect to the actions taken up to that step (referred to as the per-decision or step-wise importance weight), and \(\pi_0\) denotes the behavior policy. \(\delta(\pi, a_{0:t}) = \prod_{t'=0}^t K(\pi(s_{t'}), a_{t'})\) quantifies the similarity between the action logged in the dataset and the action taken by the evaluation policy, where \(K(\cdot)\) is a kernel function. Note that the bandwidth of the kernel is an important hyperparameter: a larger bandwidth typically reduces the variance of the above estimator but increases its bias.
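As an illustration of the formula above, the following is a minimal sketch in plain Python, not the library's implementation. It assumes one-dimensional actions, a Gaussian kernel, and the common convention of dividing the kernel value by the bandwidth; the names `gaussian_kernel` and `pdis_value` and the tuple layout of each step are hypothetical.

```python
import math


def gaussian_kernel(u):
    """Standard Gaussian kernel K(u)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def pdis_value(trajectories, gamma=1.0, bandwidth=1.0):
    """Sketch of the PDIS estimate for one-dimensional actions.

    trajectories: list of trajectories, each a list of
        (logged_action, evaluation_action, pscore, reward) tuples.
    """
    total = 0.0
    for trajectory in trajectories:
        # Running product of kernel similarity / pscore over steps 0..t,
        # i.e. w_{0:t} * delta(pi, a_{0:t}) in the notation above.
        weight = 1.0
        for t, (action, eval_action, pscore, reward) in enumerate(trajectory):
            # Assumption: kernel applied to the scaled distance, divided by bandwidth.
            similarity = gaussian_kernel((eval_action - action) / bandwidth) / bandwidth
            weight *= similarity / pscore
            total += (gamma ** t) * weight * reward
    # Average over the n logged trajectories.
    return total / len(trajectories)
```

Each reward is weighted only by the product of the per-step terms up to its own time step, which is what distinguishes the per-decision weight from a single weight for the whole trajectory.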

By using per-decision importance weighting instead of the trajectory-wise importance weighting of Trajectory-wise Importance Sampling (TIS), PDIS has lower variance than TIS while still correcting for the distribution shift between the behavior and evaluation policies. However, when the trajectory length (\(T\)) is large, PDIS still suffers from high variance.
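To make the difference between the two weighting schemes concrete, the sketch below compares them on a single trajectory. The values in `step_ratios` are hypothetical per-step weights (kernel similarity divided by the behavior policy density), chosen only for illustration.

```python
from itertools import accumulate
import math
import operator

# Hypothetical per-step weights for one trajectory of length T = 4.
step_ratios = [2.0, 0.5, 3.0, 0.1]

# PDIS: the reward at step t is weighted by the product of ratios up to step t.
pdis_weights = list(accumulate(step_ratios, operator.mul))

# TIS: every reward is weighted by the product over the whole trajectory.
tis_weights = [math.prod(step_ratios)] * len(step_ratios)

print("PDIS weights:", pdis_weights)
print("TIS weights: ", tis_weights)
```

Under PDIS, the small ratio at the last step affects only the last reward, whereas under TIS it rescales every reward in the trajectory. The two schemes agree at the final step, so the product of many ratios can still be extreme when \(T\) is large.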

Parameters: estimator_name (str, default="pdis") – Name of the estimator.

References

Nathan Kallus and Angela Zhou. “Policy Evaluation and Optimization with Continuous Treatments.” 2019.

Doina Precup, Richard S. Sutton, and Satinder P. Singh. “Eligibility Traces for Off-Policy Policy Evaluation.” 2000.

Methods

`estimate_interval`(step_per_trajectory, ...)	Estimate the confidence interval of the policy value by nonparametric bootstrap.
`estimate_policy_value`(step_per_trajectory, ...)	Estimate the policy value of the evaluation policy.

estimate_policy_value(step_per_trajectory, action, reward, pscore, evaluation_policy_action, gamma=1.0, kernel='gaussian', bandwidth=1.0, action_scaler=None, **kwargs)

Estimate the policy value of the evaluation policy.

Parameters:

step_per_trajectory (int (> 0)) – Number of timesteps in an episode.
action (array-like of shape (n_trajectories * step_per_trajectory, action_dim)) – Action chosen by the behavior policy.
reward (array-like of shape (n_trajectories * step_per_trajectory, )) – Observed immediate rewards.
pscore (array-like of shape (n_trajectories * step_per_trajectory, action_dim)) – Conditional action choice probability of the behavior policy, i.e., \(\pi_b(a | s)\)
evaluation_policy_action (array-like of shape (n_trajectories * step_per_trajectory, action_dim)) – Action chosen by the evaluation policy.
gamma (float, default=1.0) – Discount factor. The value should be within (0, 1].
kernel ({"gaussian", "epanechnikov", "triangular", "cosine", "uniform"}, default="gaussian") – Name of the kernel function to smooth importance weights.
bandwidth (float, default=1.0 (> 0)) – Bandwidth hyperparameter of the kernel function.
action_scaler (d3rlpy.preprocessing.ActionScaler, default=None) – Scaling factor of action.

Returns:

V_hat – Estimated policy value.

Return type:

float
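The kernel and bandwidth parameters together determine how strongly a logged action that differs from the evaluation policy's action is down-weighted. The sketch below uses the textbook definitions of the five named kernels for a one-dimensional action; it is an assumption that these match the library's implementation, and `KERNELS` and `kernel_weight` are hypothetical names.

```python
import math

# Textbook kernel definitions on the scaled distance u (assumed, not taken from the library).
KERNELS = {
    "gaussian": lambda u: math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi),
    "epanechnikov": lambda u: 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0,
    "triangular": lambda u: 1.0 - abs(u) if abs(u) <= 1.0 else 0.0,
    "cosine": lambda u: (math.pi / 4.0) * math.cos(math.pi * u / 2.0) if abs(u) <= 1.0 else 0.0,
    "uniform": lambda u: 0.5 if abs(u) <= 1.0 else 0.0,
}


def kernel_weight(logged_action, evaluation_action, kernel="gaussian", bandwidth=1.0):
    """Similarity between a logged action and the evaluation policy's action."""
    u = (evaluation_action - logged_action) / bandwidth
    # Dividing by the bandwidth keeps the kernel normalized as a density.
    return KERNELS[kernel](u) / bandwidth


# With a compactly supported kernel, an action 2.0 away gets zero weight at
# bandwidth 1.0 but a nonzero weight once the bandwidth is widened to 4.0.
print(kernel_weight(0.0, 2.0, kernel="uniform", bandwidth=1.0))
print(kernel_weight(0.0, 2.0, kernel="uniform", bandwidth=4.0))
```

A wider bandwidth lets more logged samples contribute, which tends to lower variance, but it also gives weight to actions that are further from what the evaluation policy would choose, which tends to raise bias.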

estimate_interval(step_per_trajectory, action, reward, pscore, evaluation_policy_action, gamma=1.0, kernel='gaussian', bandwidth=1.0, action_scaler=None, alpha=0.05, ci='bootstrap', n_bootstrap_samples=10000, random_state=12345, **kwargs)

Estimate the confidence interval of the policy value, by nonparametric bootstrap by default or by the method selected with ci.

Parameters:

step_per_trajectory (int (> 0)) – Number of timesteps in an episode.
action (array-like of shape (n_trajectories * step_per_trajectory, action_dim)) – Action chosen by the behavior policy.
reward (array-like of shape (n_trajectories * step_per_trajectory, )) – Observed immediate rewards.
pscore (array-like of shape (n_trajectories * step_per_trajectory, action_dim)) – Conditional action choice probability of the behavior policy, i.e., \(\pi_b(a | s)\)
evaluation_policy_action (array-like of shape (n_trajectories * step_per_trajectory, action_dim)) – Action chosen by the evaluation policy.
gamma (float, default=1.0) – Discount factor. The value should be within (0, 1].
kernel ({"gaussian", "epanechnikov", "triangular", "cosine", "uniform"}, default="gaussian") – Name of the kernel function to smooth importance weights.
bandwidth (float, default=1.0 (> 0)) – Bandwidth hyperparameter of the kernel function.
action_scaler (d3rlpy.preprocessing.ActionScaler, default=None) – Scaling factor of action.
alpha (float, default=0.05) – Significance level. The value should be within [0, 1).
ci ({"bootstrap", "hoeffding", "bernstein", "ttest"}, default="bootstrap") – Name of the method to estimate the confidence interval.
n_bootstrap_samples (int, default=10000 (> 0)) – Number of resampling performed in the bootstrap procedure.
random_state (int, default=12345 (>= 0)) – Random state.

Returns:

estimated_confidence_interval – Dictionary storing the estimated mean and the lower and upper confidence bounds.

key: [
    mean,
    {100 * (1. - alpha)}% CI (lower),
    {100 * (1. - alpha)}% CI (upper),
]

Return type:

dict
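For intuition about the default ci="bootstrap" option, the following is a minimal sketch of a nonparametric percentile bootstrap over per-trajectory estimates, returning a dictionary with the three keys listed above. It is written under assumptions and is not the library's code: `bootstrap_interval` and `per_trajectory_values` are hypothetical names, and the exact percentile rule used by the library may differ.

```python
import random
import statistics


def bootstrap_interval(per_trajectory_values, alpha=0.05,
                       n_bootstrap_samples=1000, random_state=12345):
    """Percentile bootstrap interval for the mean of per-trajectory estimates."""
    rng = random.Random(random_state)
    n = len(per_trajectory_values)
    # Resample n trajectories with replacement and record the mean each time.
    boot_means = sorted(
        statistics.fmean(rng.choices(per_trajectory_values, k=n))
        for _ in range(n_bootstrap_samples)
    )
    # Indices of the alpha/2 and 1 - alpha/2 empirical quantiles.
    lower_idx = int((alpha / 2.0) * (n_bootstrap_samples - 1))
    upper_idx = int((1.0 - alpha / 2.0) * (n_bootstrap_samples - 1))
    level = 100 * (1.0 - alpha)
    return {
        "mean": statistics.fmean(boot_means),
        f"{level}% CI (lower)": boot_means[lower_idx],
        f"{level}% CI (upper)": boot_means[upper_idx],
    }


# Hypothetical per-trajectory weighted returns.
interval = bootstrap_interval([1.0, 2.0, 3.0, 4.0, 5.0], n_bootstrap_samples=500)
for key, value in interval.items():
    print(key, value)
```

Fixing `random_state` makes the resampling reproducible, and a smaller `alpha` widens the interval.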
