scope_rl.ope.continuous.basic_estimators.SelfNormalizedTIS
- class scope_rl.ope.continuous.basic_estimators.SelfNormalizedTIS(estimator_name='sntis')
Self-Normalized Trajectory-wise Importance Sampling (SNTIS) for continuous action spaces, designed for deterministic evaluation policies.
Bases:
scope_rl.ope.continuous.TrajectoryWiseImportanceSampling -> scope_rl.ope.BaseOffPolicyEstimator
Imported as:
scope_rl.ope.continuous.SelfNormalizedTIS
Note
SNTIS estimates the policy value via self-normalized trajectory-wise importance weighting as follows.
\[\hat{J}_{\mathrm{SNTIS}} (\pi; \mathcal{D}) := \sum_{i=1}^n \sum_{t=0}^{T-1} \gamma^t \frac{w_{0:T-1}^{(i)} \delta(\pi, a_{0:T-1}^{(i)})}{\sum_{i'=1}^n w_{0:T-1}^{(i')} \delta(\pi, a_{0:T-1}^{(i')})} r_t^{(i)},\]
where \(w_{0:T-1} := \prod_{t=0}^{T-1} (1 / \pi_b(a_t | s_t))\) is the trajectory-wise importance weight and \(\pi_b\) is the behavior policy. \(\delta(\pi, a_{0:T-1}) = \prod_{t=0}^{T-1} K(\pi(s_t), a_t)\) quantifies the similarity between the actions logged in the dataset and those taken by the evaluation policy, where \(K(\cdot, \cdot)\) is a kernel function. The bandwidth of the kernel is an important hyperparameter: a larger bandwidth often reduces the variance of the estimator but increases its bias.
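Because the weights are normalized to sum to one across trajectories, the self-normalized estimator has variance bounded by \(r_{\max}^2\), at the cost of no longer being unbiased.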
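The estimator defined in the note above can be sketched in plain NumPy. This is an illustrative sketch, not the library's implementation: the names `gaussian_kernel` and `sntis_sketch` are hypothetical, a Gaussian kernel on the Euclidean distance between actions is assumed, per-dimension `pscore` values are assumed to multiply into one density per step, and `action_scaler` is left out.

```python
import numpy as np


def gaussian_kernel(x, y, bandwidth):
    # Hypothetical helper: Gaussian kernel on the squared Euclidean
    # distance between two batches of actions, scaled by the bandwidth.
    sq_dist = ((x - y) ** 2).sum(axis=-1)
    return np.exp(-sq_dist / (2.0 * bandwidth**2)) / np.sqrt(2.0 * np.pi * bandwidth**2)


def sntis_sketch(step_per_trajectory, action, reward, pscore,
                 evaluation_policy_action, gamma=1.0, bandwidth=1.0):
    T = step_per_trajectory
    action = np.asarray(action, dtype=float)
    reward = np.asarray(reward, dtype=float)
    pscore = np.asarray(pscore, dtype=float)
    evaluation_policy_action = np.asarray(evaluation_policy_action, dtype=float)

    action_dim = action.shape[1]
    n = action.shape[0] // T

    # Assumption: per-dimension densities multiply into one density per step.
    step_pscore = pscore.reshape(n, T, action_dim).prod(axis=2)
    # K(pi(s_t), a_t) for every step, arranged as (n_trajectories, T).
    similarity = gaussian_kernel(evaluation_policy_action, action, bandwidth).reshape(n, T)

    # w_{0:T-1} * delta(pi, a_{0:T-1}) for each trajectory.
    traj_weight = (similarity / step_pscore).prod(axis=1)
    # Self-normalization: the weights sum to one across trajectories.
    normalized_weight = traj_weight / traj_weight.sum()

    discount = gamma ** np.arange(T)
    discounted_return = (discount * reward.reshape(n, T)).sum(axis=1)
    return float((normalized_weight * discounted_return).sum())


# Usage with synthetic data: 50 trajectories of 5 steps, 2-dimensional actions.
rng = np.random.default_rng(0)
n_trajectories, T, action_dim = 50, 5, 2
logged_action = rng.normal(size=(n_trajectories * T, action_dim))
logged_reward = rng.uniform(size=n_trajectories * T)
logged_pscore = rng.uniform(0.2, 1.0, size=(n_trajectories * T, action_dim))
eval_action = logged_action + 0.1 * rng.normal(size=logged_action.shape)
estimate = sntis_sketch(T, logged_action, logged_reward, logged_pscore,
                        eval_action, gamma=0.95, bandwidth=1.0)
print(estimate)
```

Because the normalized weights are non-negative and sum to one, the result is a weighted average of the per-trajectory discounted returns and therefore always lies between the smallest and largest of them. With a compactly supported kernel the weight sum can be zero, which this sketch does not guard against.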
- Parameters:
estimator_name (str, default="sntis") – Name of the estimator.
References
Nathan Kallus and Angela Zhou. “Policy Evaluation and Optimization with Continuous Treatments.” 2019.
Nathan Kallus and Masatoshi Uehara. “Intrinsically Efficient, Stable, and Bounded Off-Policy Evaluation for Reinforcement Learning.” 2019.
Doina Precup, Richard S. Sutton, and Satinder P. Singh. “Eligibility Traces for Off-Policy Policy Evaluation.” 2000.
Methods
estimate_interval(step_per_trajectory, ...) – Estimate the confidence interval of the policy value (by nonparametric bootstrap unless another method is selected via ci).
estimate_policy_value(step_per_trajectory, ...) – Estimate the policy value of the evaluation policy.
- estimate_interval(step_per_trajectory, action, reward, pscore, evaluation_policy_action, gamma=1.0, kernel='gaussian', bandwidth=1.0, action_scaler=None, alpha=0.05, ci='bootstrap', n_bootstrap_samples=10000, random_state=12345, **kwargs)
Estimate the confidence interval of the policy value. Nonparametric bootstrap is used by default; the ci parameter selects another method.
- Parameters:
step_per_trajectory (int (> 0)) – Number of timesteps in an episode.
action (array-like of shape (n_trajectories * step_per_trajectory, action_dim)) – Action chosen by the behavior policy.
reward (array-like of shape (n_trajectories * step_per_trajectory, )) – Observed immediate rewards.
pscore (array-like of shape (n_trajectories * step_per_trajectory, action_dim)) – Conditional action choice probability of the behavior policy, i.e., \(\pi_b(a | s)\).
evaluation_policy_action (array-like of shape (n_trajectories * step_per_trajectory, action_dim)) – Action chosen by the evaluation policy.
gamma (float, default=1.0) – Discount factor. The value should be within (0, 1].
kernel ({"gaussian", "epanechnikov", "triangular", "cosine", "uniform"}, default="gaussian") – Name of the kernel function used to smooth the importance weights.
bandwidth (float, default=1.0 (> 0)) – Bandwidth hyperparameter of the kernel function.
action_scaler (d3rlpy.preprocessing.ActionScaler, default=None) – Scaling factor of action.
alpha (float, default=0.05) – Significance level. The value should be within [0, 1).
ci ({"bootstrap", "hoeffding", "bernstein", "ttest"}, default="bootstrap") – Name of the method to estimate the confidence interval.
n_bootstrap_samples (int, default=10000 (> 0)) – Number of resamplings performed in the bootstrap procedure.
random_state (int, default=None (>= 0)) – Random state.
- Returns:
estimated_confidence_interval – Dictionary storing the estimated mean and upper-lower confidence bounds.
key: [ mean, {100 * (1. - alpha)}% CI (lower), {100 * (1. - alpha)}% CI (upper), ]
- Return type:
dict
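The default `ci="bootstrap"` option can be illustrated with a percentile bootstrap over per-trajectory value estimates. The following is a minimal sketch under stated assumptions, not the library's code: the function name `bootstrap_interval_sketch` is hypothetical, the input is assumed to be one value estimate per trajectory, and the dictionary keys follow the format given in the Returns description above.

```python
import numpy as np


def bootstrap_interval_sketch(samples, alpha=0.05, n_bootstrap_samples=10000,
                              random_state=12345):
    # Assumption: `samples` holds one value estimate per trajectory.
    samples = np.asarray(samples, dtype=float)
    rng = np.random.default_rng(random_state)

    # Resample trajectories with replacement and average each resample.
    idx = rng.integers(0, samples.shape[0],
                       size=(n_bootstrap_samples, samples.shape[0]))
    boot_means = samples[idx].mean(axis=1)

    # Percentile interval at significance level alpha.
    lower, upper = np.percentile(boot_means,
                                 [100 * alpha / 2, 100 * (1 - alpha / 2)])
    level = 100 * (1.0 - alpha)
    return {
        "mean": float(boot_means.mean()),
        f"{level}% CI (lower)": float(lower),
        f"{level}% CI (upper)": float(upper),
    }


# Usage with synthetic per-trajectory values.
values = np.random.default_rng(0).normal(loc=1.0, scale=0.5, size=200)
interval = bootstrap_interval_sketch(values, alpha=0.05, n_bootstrap_samples=2000)
for key, value in interval.items():
    print(key, value)
```

Fixing `random_state` makes the resampling reproducible. A smaller `alpha` widens the interval, and a larger `n_bootstrap_samples` reduces the Monte Carlo noise in its endpoints.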
- estimate_policy_value(step_per_trajectory, action, reward, pscore, evaluation_policy_action, gamma=1.0, kernel='gaussian', bandwidth=1.0, action_scaler=None, **kwargs)
Estimate the policy value of the evaluation policy.
- Parameters:
step_per_trajectory (int (> 0)) – Number of timesteps in an episode.
action (array-like of shape (n_trajectories * step_per_trajectory, action_dim)) – Action chosen by the behavior policy.
reward (array-like of shape (n_trajectories * step_per_trajectory, )) – Observed immediate rewards.
pscore (array-like of shape (n_trajectories * step_per_trajectory, action_dim)) – Conditional action choice probability of the behavior policy, i.e., \(\pi_b(a | s)\).
evaluation_policy_action (array-like of shape (n_trajectories * step_per_trajectory, action_dim)) – Action chosen by the evaluation policy.
gamma (float, default=1.0) – Discount factor. The value should be within (0, 1].
kernel ({"gaussian", "epanechnikov", "triangular", "cosine", "uniform"}, default="gaussian") – Name of the kernel function used to smooth the importance weights.
bandwidth (float, default=1.0 (> 0)) – Bandwidth hyperparameter of the kernel function.
action_scaler (d3rlpy.preprocessing.ActionScaler, default=None) – Scaling factor of action.
- Returns:
V_hat – Estimated policy value.
- Return type:
float
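The kernel and bandwidth parameters above control how strictly a logged action must match the evaluation policy's action to contribute to the estimate. The sketch below uses the textbook one-dimensional definitions of the five named kernels. How the library applies them to multi-dimensional actions is not stated here, so the Euclidean distance scaled by the bandwidth, and the names `KERNELS` and `kernel_similarity_sketch`, are assumptions made for illustration.

```python
import numpy as np

# Textbook kernel profiles K(u); each integrates to one over the real line.
KERNELS = {
    "gaussian": lambda u: np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi),
    "epanechnikov": lambda u: 0.75 * (1.0 - u**2) * (np.abs(u) <= 1.0),
    "triangular": lambda u: (1.0 - np.abs(u)) * (np.abs(u) <= 1.0),
    "cosine": lambda u: (np.pi / 4.0) * np.cos(np.pi * u / 2.0) * (np.abs(u) <= 1.0),
    "uniform": lambda u: 0.5 * (np.abs(u) <= 1.0),
}


def kernel_similarity_sketch(evaluation_policy_action, action,
                             kernel="gaussian", bandwidth=1.0):
    # Assumption: similarity depends on the Euclidean distance between the
    # two actions, measured in units of the bandwidth.
    diff = (np.asarray(evaluation_policy_action, dtype=float)
            - np.asarray(action, dtype=float))
    u = np.linalg.norm(diff, axis=-1) / bandwidth
    return KERNELS[kernel](u) / bandwidth


# A logged action at distance 0.5 from the evaluation policy's action.
pi_action = np.array([[0.0]])
logged = np.array([[0.5]])
for name in KERNELS:
    narrow = kernel_similarity_sketch(pi_action, logged, name, bandwidth=0.25)
    wide = kernel_similarity_sketch(pi_action, logged, name, bandwidth=2.0)
    print(name, narrow, wide)
```

With a narrow bandwidth, the compactly supported kernels (all except the Gaussian) assign zero weight to the logged action, so few samples contribute and the variance grows. With a wide bandwidth, distant actions still receive weight, which lowers the variance but biases the estimate toward the behavior of nearby but different actions.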