scope_rl.ope.continuous.basic_estimators.SelfNormalizedDR#

class scope_rl.ope.continuous.basic_estimators.SelfNormalizedDR[source]#

Self-Normalized Doubly Robust (SNDR) (designed for deterministic evaluation policies) for continuous action spaces.

Bases: scope_rl.ope.continuous.DoublyRobust -> scope_rl.ope.BaseOffPolicyEstimator

Imported as: scope_rl.ope.continuous.SelfNormalizedDR

Note

SNDR estimates the policy value via self-normalized step-wise importance weighting and estimated Q-function \(\hat{Q}\) as follows.

\[\hat{J}_{\mathrm{SNDR}} (\pi; \mathcal{D}) := \sum_{i=1}^n \sum_{t=0}^{T-1} \gamma^t \left( \frac{w_{0:t}^{(i)} \delta(\pi, a_{0:t}^{(i)})}{\sum_{i'=1}^n w_{0:t}^{(i')} \delta(\pi, a_{0:t}^{(i')})} (r_t^{(i)} - \hat{Q}(s_t^{(i)}, a_t^{(i)})) + \frac{w_{0:t-1}^{(i)} \delta(\pi, a_{0:t-1}^{(i)})}{\sum_{i'=1}^n w_{0:t-1}^{(i')} \delta(\pi, a_{0:t-1}^{(i')})} \hat{Q}(s_t^{(i)}, \pi(s_t^{(i)})) \right),\]

where \(w_{0:t} := \prod_{t'=0}^t (1 / \pi_0(a_{t'} | s_{t'}))\) is the per-decision importance weight. \(\delta(\pi, a_{0:t}) = \prod_{t'=1}^t K(\pi(s_{t'}), a_{t'})\) quantifies the similarity between the action logged in the dataset and that taken by the evaluation policy where \(K(\cdot)\) is a kernel function. Note that the bandwidth of the kernel is an important hyperparameter; the variance of the above estimator often becomes small when the bandwidth of the kernel is large, while the bias often becomes large in those cases.

The self-normalized estimator has variance bounded by \(r_{max}^2\).

Parameters:

estimator_name (str, default="sndr") – Name of the estimator.

References

Nathan Kallus and Angela Zhou. “Policy Evaluation and Optimization with Continuous Treatments.” 2019.

Nathan Kallus and Masatoshi Uehara. “Intrinsically Efficient, Stable, and Bounded Off-Policy Evaluation for Reinforcement Learning.” 2019.

Nan Jiang and Lihong Li. “Doubly Robust Off-policy Value Evaluation for Reinforcement Learning.” 2016.

Philip S. Thomas and Emma Brunskill. “Data-Efficient Off-Policy Policy Evaluation for Reinforcement Learning.” 2016.

Methods

estimate_interval(step_per_trajectory, ...)

Estimate the confidence interval of the policy value by nonparametric bootstrap.

estimate_policy_value(step_per_trajectory, ...)

Estimate the policy value of the evaluation policy.

estimate_interval(step_per_trajectory, action, reward, pscore, evaluation_policy_action, state_action_value_prediction, gamma=1.0, kernel='gaussian', bandwidth=1.0, action_scaler=None, alpha=0.05, ci='bootstrap', n_bootstrap_samples=10000, random_state=12345, **kwargs)#

Estimate the confidence interval of the policy value by nonparametric bootstrap.

Parameters:
  • step_per_trajectory (int (> 0)) – Number of timesteps in an episode.

  • action (array-like of shape (n_trajectories * step_per_trajectory, action_dim)) – Action chosen by the behavior policy.

  • reward (array-like of shape (n_trajectories * step_per_trajectory, )) – Observed immediate rewards.

  • pscore (array-like of shape (n_trajectories * step_per_trajectory, action_dim)) – Conditional action choice probability of the behavior policy, i.e., \(\pi_b(a | s)\)

  • evaluation_policy_action (array-like of shape (n_trajectories * step_per_trajectory, action_dim)) – Action chosen by the evaluation policy.

  • state_action_value_prediction (array-like of shape (n_trajectories * step_per_trajectory, 2)) – \(\hat{Q}\) for the observed action and that chosen by the evaluation policy, i.e., (row 0) \(\hat{Q}(s_t, a_t)\) and (row 2) \(\hat{Q}(s_t, \pi(a | s_t))\).

  • gamma (float, default=1.0) – Discount factor. The value should be within (0, 1].

  • kernel ({"gaussian", "epanechnikov", "triangular", "cosine", "uniform"}) – Name of the kernel function to smooth importance weights.

  • bandwidth (float, default=1.0 (> 0)) – Bandwidth hyperparameter of the kernel function.

  • action_scaler (d3rlpy.preprocessing.ActionScaler, default=None) – Scaling factor of action.

  • alpha (float, default=0.05) – Significance level. The value should be within [0, 1).

  • ci ({"bootstrap", "hoeffding", "bernstein", "ttest"}, default="bootstrap") – Name of the method to estimate the confidence interval.

  • n_bootstrap_samples (int, default=10000 (> 0)) – Number of resampling performed in the bootstrap procedure.

  • random_state (int, default=None (>= 0)) – Random state.

Returns:

estimated_confidence_interval – Dictionary storing the estimated mean and upper-lower confidence bounds.

key: [
    mean,
    {100 * (1. - alpha)}% CI (lower),
    {100 * (1. - alpha)}% CI (upper),
]

Return type:

dict

estimate_policy_value(step_per_trajectory, action, reward, pscore, evaluation_policy_action, state_action_value_prediction, gamma=1.0, kernel='gaussian', bandwidth=1.0, action_scaler=None, **kwargs)#

Estimate the policy value of the evaluation policy.

Parameters:
  • step_per_trajectory (int (> 0)) – Number of timesteps in an episode.

  • action (array-like of shape (n_trajectories * step_per_trajectory, action_dim)) – Action chosen by the behavior policy.

  • reward (array-like of shape (n_trajectories * step_per_trajectory, )) – Observed immediate rewards.

  • pscore (array-like of shape (n_trajectories * step_per_trajectory, action_dim)) – Conditional action choice probability of the behavior policy, i.e., \(\pi_b(a | s)\)

  • evaluation_policy_action (array-like of shape (n_trajectories * step_per_trajectory, action_dim)) – Action chosen by the evaluation policy.

  • state_action_value_prediction (array-like of shape (n_trajectories * step_per_trajectory, 2)) – \(\hat{Q}\) for the observed action and that chosen by the evaluation policy, i.e., (row 0) \(\hat{Q}(s_t, a_t)\) and (row 2) \(\hat{Q}(s_t, \pi(a | s_t))\).

  • gamma (float, default=1.0) – Discount factor. The value should be within (0, 1].

  • kernel ({"gaussian", "epanechnikov", "triangular", "cosine", "uniform"}) – Name of the kernel function to smooth importance weights.

  • bandwidth (float, default=1.0 (> 0)) – Bandwidth hyperparameter of the kernel function.

  • action_scaler (d3rlpy.preprocessing.ActionScaler, default=None) – Scaling factor of action.

Returns:

V_hat – Estimated policy value.

Return type:

float

Methods