scope_rl.ope.discrete.basic_estimators.TrajectoryWiseImportanceSampling

class scope_rl.ope.discrete.basic_estimators.TrajectoryWiseImportanceSampling

Trajectory-wise Importance Sampling (TIS) for discrete action spaces.

Bases: scope_rl.ope.BaseOffPolicyEstimator

Imported as: scope_rl.ope.discrete.TrajectoryWiseImportanceSampling

Note

TIS estimates the policy value via trajectory-wise importance weighting as follows.

\[\hat{J}_{\mathrm{TIS}} (\pi; \mathcal{D}) := \frac{1}{n} \sum_{i=1}^n \sum_{t=0}^{T-1} \gamma^t w_{0:T-1}^{(i)} r_t^{(i)},\]

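where \(w_{0:T-1} := \prod_{t=0}^{T-1} (\pi(a_t | s_t) / \pi_0(a_t | s_t))\) is the trajectory-wise importance weight, \(\pi\) is the evaluation policy, and \(\pi_0\) is the behavior policy (written \(\pi_b\) in the parameter descriptions).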

TIS yields an unbiased estimate of the policy value. However, when the trajectory length \(T\) is large, TIS suffers from high variance because each trajectory's weight is a product of \(T\) per-step importance ratios.
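The formula above can be sketched directly in NumPy. This is an illustration of the estimator's definition rather than SCOPE-RL's implementation; the helper name `tis_from_formula` and the `(n, T)` input layout are assumptions made for the example.

```python
import numpy as np


def tis_from_formula(pi_e, pi_0, rewards, gamma=1.0):
    """TIS estimate from arrays of shape (n, T) (hypothetical helper).

    pi_e / pi_0 hold the probabilities of the logged actions under the
    evaluation / behavior policy; rewards holds the observed rewards.
    """
    pi_e, pi_0, rewards = (np.asarray(x, dtype=float) for x in (pi_e, pi_0, rewards))
    # w_{0:T-1}: product of the per-step ratios over the whole trajectory
    weights = np.prod(pi_e / pi_0, axis=1)                  # shape (n,)
    discounts = gamma ** np.arange(rewards.shape[1])        # shape (T,)
    discounted_return = (discounts * rewards).sum(axis=1)   # shape (n,)
    # Average of weight * discounted return over the n trajectories
    return float(np.mean(weights * discounted_return))


# Two trajectories of length two (made-up numbers)
pi_e = [[0.5, 0.5], [0.2, 0.8]]
pi_0 = [[0.5, 0.25], [0.4, 0.8]]
rewards = [[1.0, 1.0], [0.0, 2.0]]
estimate = tis_from_formula(pi_e, pi_0, rewards, gamma=0.5)
```

Here the trajectory weights are 2.0 and 0.5 and the discounted returns are 1.5 and 1.0, so the estimate is (3.0 + 0.5) / 2 = 1.75. Because the same weight multiplies every reward in a trajectory, a single extreme ratio affects the whole trajectory's contribution.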

Parameters: estimator_name (str, default="tis") – Name of the estimator.

References

Doina Precup, Richard S. Sutton, and Satinder P. Singh. “Eligibility Traces for Off-Policy Policy Evaluation.” 2000.

Methods

`estimate_interval`(step_per_trajectory, ...)	Estimate the confidence interval of the policy value (by nonparametric bootstrap by default).
`estimate_policy_value`(step_per_trajectory, ...)	Estimate the policy value of the evaluation policy.

estimate_policy_value(step_per_trajectory, action, reward, pscore, evaluation_policy_action_dist, gamma=1.0, **kwargs)

Estimate the policy value of the evaluation policy.

Parameters:

step_per_trajectory (int (> 0)) – Number of timesteps in an episode.
action (array-like of shape (n_trajectories * step_per_trajectory, )) – Action chosen by the behavior policy.
reward (array-like of shape (n_trajectories * step_per_trajectory, )) – Observed immediate rewards.
pscore (array-like of shape (n_trajectories * step_per_trajectory, )) – Conditional action choice probability of the behavior policy, i.e., \(\pi_b(a | s)\).
evaluation_policy_action_dist (array-like of shape (n_trajectories * step_per_trajectory, n_action)) – Conditional action distribution induced by the evaluation policy, i.e., \(\pi(a | s_t) \ \forall a \in \mathcal{A}\).
gamma (float, default=1.0) – Discount factor. The value should be within (0, 1].

Returns:

V_hat – Estimated policy value.

Return type:

float
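To make the expected input shapes concrete, the following sketch computes a TIS estimate from flattened arrays laid out like the parameters above. It is not the library's code: the function name `tis_policy_value` is hypothetical, and it assumes the flattened arrays are ordered trajectory by trajectory (all steps of the first trajectory, then all steps of the second, and so on).

```python
import numpy as np


def tis_policy_value(step_per_trajectory, action, reward, pscore,
                     evaluation_policy_action_dist, gamma=1.0):
    """Hypothetical NumPy re-implementation for illustration only."""
    action = np.asarray(action, dtype=int)
    reward = np.asarray(reward, dtype=float)
    pscore = np.asarray(pscore, dtype=float)
    dist = np.asarray(evaluation_policy_action_dist, dtype=float)

    # pi(a_t | s_t) for the action that was actually logged at each step
    eval_pscore = dist[np.arange(len(action)), action]

    # Assumed layout: rows are trajectories, columns are timesteps
    ratio = (eval_pscore / pscore).reshape(-1, step_per_trajectory)
    reward = reward.reshape(-1, step_per_trajectory)

    weight = ratio.prod(axis=1, keepdims=True)          # w_{0:T-1} per trajectory
    discount = gamma ** np.arange(step_per_trajectory)  # gamma^t
    return float((discount * weight * reward).sum(axis=1).mean())


# Two trajectories of two steps each, two actions (made-up numbers)
value = tis_policy_value(
    step_per_trajectory=2,
    action=[0, 1, 1, 0],
    reward=[1.0, 1.0, 0.0, 2.0],
    pscore=[0.5, 0.25, 0.4, 0.8],
    evaluation_policy_action_dist=[[0.5, 0.5], [0.5, 0.5], [0.8, 0.2], [0.8, 0.2]],
    gamma=0.5,
)
```

Note that `pscore` carries one probability per logged step, while `evaluation_policy_action_dist` carries a full distribution over actions per step; the sketch selects the entry for the logged action before forming the ratio.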

estimate_interval(step_per_trajectory, action, reward, pscore, evaluation_policy_action_dist, gamma=1.0, alpha=0.05, ci='bootstrap', n_bootstrap_samples=10000, random_state=None, **kwargs)

Estimate the confidence interval of the policy value using the method selected by ci (nonparametric bootstrap by default).

Parameters:

step_per_trajectory (int (> 0)) – Number of timesteps in an episode.
action (array-like of shape (n_trajectories * step_per_trajectory, )) – Action chosen by the behavior policy.
reward (array-like of shape (n_trajectories * step_per_trajectory, )) – Observed immediate rewards.
pscore (array-like of shape (n_trajectories * step_per_trajectory, )) – Conditional action choice probability of the behavior policy, i.e., \(\pi_b(a | s)\).
evaluation_policy_action_dist (array-like of shape (n_trajectories * step_per_trajectory, n_action)) – Conditional action distribution induced by the evaluation policy, i.e., \(\pi(a | s_t) \ \forall a \in \mathcal{A}\).
gamma (float, default=1.0) – Discount factor. The value should be within (0, 1].
alpha (float, default=0.05) – Significance level. The value should be within [0, 1).
ci ({"bootstrap", "hoeffding", "bernstein", "ttest"}, default="bootstrap") – Name of the method to estimate the confidence interval.
n_bootstrap_samples (int (> 0), default=10000) – Number of resampling iterations performed in the bootstrap procedure.
random_state (int (>= 0), default=None) – Random state.

Returns:

estimated_confidence_interval – Dictionary storing the estimated mean and the lower and upper confidence bounds.

key: [
    mean,
    {100 * (1. - alpha)}% CI (lower),
    {100 * (1. - alpha)}% CI (upper),
]

Return type:

dict
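As a rough illustration of the default `ci="bootstrap"` option and of the returned dictionary, the sketch below runs a percentile bootstrap over per-trajectory estimates. The exact resampling scheme used by SCOPE-RL is not shown on this page, so the helper `bootstrap_interval` and its details are assumptions; only the key format follows the description above.

```python
import numpy as np


def bootstrap_interval(per_trajectory_value, alpha=0.05,
                       n_bootstrap_samples=10000, random_state=None):
    """Percentile bootstrap over trajectories (hypothetical helper)."""
    values = np.asarray(per_trajectory_value, dtype=float)
    rng = np.random.default_rng(random_state)
    # Resample trajectories with replacement; keep the mean of each resample
    idx = rng.integers(0, len(values), size=(n_bootstrap_samples, len(values)))
    boot_means = values[idx].mean(axis=1)
    level = 100 * (1.0 - alpha)
    return {
        "mean": float(boot_means.mean()),
        f"{level}% CI (lower)": float(np.percentile(boot_means, 100 * alpha / 2)),
        f"{level}% CI (upper)": float(np.percentile(boot_means, 100 * (1 - alpha / 2))),
    }


# Per-trajectory weighted returns (made-up numbers)
interval = bootstrap_interval(np.arange(1.0, 21.0),
                              n_bootstrap_samples=2000, random_state=12345)
for key, bound in interval.items():
    print(f"{key}: {bound:.3f}")
```

Resampling whole trajectories rather than individual timesteps keeps each trajectory's weight and rewards together, which matches how the TIS estimate averages over trajectories.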
