scope_rl.ope.continuous.basic_estimators

Off-Policy Estimators for continuous action cases (designed for deterministic evaluation policies).

Classes

DirectMethod

Direct Method (DM) (designed for deterministic evaluation policies) for continuous action spaces.

DoublyRobust

Doubly Robust (DR) (designed for deterministic evaluation policies) for continuous action spaces.

PerDecisionImportanceSampling

Per-Decision Importance Sampling (PDIS) (designed for deterministic evaluation policies) for continuous action spaces.

SelfNormalizedDR

Self-Normalized Doubly Robust (SNDR) (designed for deterministic evaluation policies) for continuous action spaces.

SelfNormalizedPDIS

Self-Normalized Per-Decision Importance Sampling (SNPDIS) (designed for deterministic evaluation policies) for continuous action spaces.

SelfNormalizedTIS

Self-Normalized Trajectory-wise Importance Sampling (SNTIS) (designed for deterministic evaluation policies) for continuous action spaces.

TrajectoryWiseImportanceSampling

Trajectory-wise Importance Sampling (TIS) (designed for deterministic evaluation policies) for continuous action spaces.

class scope_rl.ope.continuous.basic_estimators.DirectMethod

Direct Method (DM) (designed for deterministic evaluation policies) for continuous action spaces.

Bases: scope_rl.ope.BaseOffPolicyEstimator

Imported as: scope_rl.ope.continuous.DirectMethod

Note

DM estimates the policy value using an estimated initial state value as follows.

\[\hat{J}_{\mathrm{DM}} (\pi; \mathcal{D}) := \frac{1}{n} \sum_{i=1}^n \hat{Q}(s_0^{(i)}, \pi(s_0^{(i)})) = \frac{1}{n} \sum_{i=1}^n \hat{V}(s_0^{(i)}),\]

where \(\mathcal{D}=\{\{(s_t, a_t, r_t)\}_{t=0}^{T-1}\}_{i=1}^n\) is the logged dataset with \(n\) trajectories and \(T\) is the number of steps per episode. \(\hat{Q}(s_t, a_t)\) is the estimated Q-value of a state-action pair, and \(\hat{V}(s_t)\) is the estimated value of a state.

DM has low variance compared to other estimators, but can produce larger bias due to approximation errors.

There are several methods to estimate \(\hat{Q}(s, a)\) such as Fitted Q Evaluation (FQE) (Le et al., 2019) and Minimax Q-Function Learning (MQL) (Uehara et al., 2020).
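The DM formula can be sketched in a few lines of NumPy. This is an illustrative reimplementation of the equation above, not the library's code: the helper name `direct_method_value` is hypothetical, and the sketch assumes that index 1 on the last axis of the prediction array holds \(\hat{Q}(s_t, \pi(s_t))\).

```python
import numpy as np


def direct_method_value(step_per_trajectory, state_action_value_prediction):
    """Average Q_hat(s_0, pi(s_0)) over trajectories (hypothetical helper)."""
    q = np.asarray(state_action_value_prediction, dtype=float)
    # Reshape the flat (n * T, 2) predictions to (n, T, 2).
    q = q.reshape(-1, step_per_trajectory, 2)
    # Assumption: index 1 on the last axis holds Q_hat(s_t, pi(s_t)).
    initial_state_value = q[:, 0, 1]
    return float(initial_state_value.mean())


# Two trajectories of three steps each; the values are made up.
q_hat = np.array([
    [0.9, 1.0], [0.5, 0.6], [0.1, 0.2],
    [1.9, 2.0], [1.5, 1.6], [1.1, 1.2],
])
value = direct_method_value(3, q_hat)  # mean of 1.0 and 2.0
```

Only the first step of each trajectory contributes, which is why DM ignores the logged rewards and importance weights entirely.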

See also

The implementation of FQE is provided by d3rlpy. The implementation of Minimax Learning is available in scope_rl.ope.weight_value_learning.

Parameters:

estimator_name (str, default="dm") – Name of the estimator.

References

Masatoshi Uehara, Jiawei Huang, and Nan Jiang. “Minimax Weight and Q-Function Learning for Off-Policy Evaluation.” 2020.

Hoang Le, Cameron Voloshin, and Yisong Yue. “Batch Policy Learning under Constraints.” 2019.

Methods

estimate_interval(step_per_trajectory, ...)

Estimate the confidence interval of the policy value by nonparametric bootstrap.

estimate_policy_value(step_per_trajectory, ...)

Estimate the policy value of the evaluation policy.

estimate_policy_value(step_per_trajectory, state_action_value_prediction, **kwargs)

Estimate the policy value of the evaluation policy.

Parameters:
  • step_per_trajectory (int (> 0)) – Number of timesteps in an episode.

  • state_action_value_prediction (array-like of shape (n_trajectories * step_per_trajectory, 2)) – \(\hat{Q}\) for the observed action and for the action chosen by the evaluation policy, i.e., (column 0) \(\hat{Q}(s_t, a_t)\) and (column 1) \(\hat{Q}(s_t, \pi(s_t))\).

Returns:

V_hat – Estimated policy value.

Return type:

float

estimate_interval(step_per_trajectory, state_action_value_prediction, alpha=0.05, ci='bootstrap', n_bootstrap_samples=10000, random_state=None, **kwargs)

Estimate the confidence interval of the policy value by nonparametric bootstrap.

Parameters:
  • step_per_trajectory (int (> 0)) – Number of timesteps in an episode.

  • state_action_value_prediction (array-like of shape (n_trajectories * step_per_trajectory, 2)) – \(\hat{Q}\) for the observed action and for the action chosen by the evaluation policy, i.e., (column 0) \(\hat{Q}(s_t, a_t)\) and (column 1) \(\hat{Q}(s_t, \pi(s_t))\).

  • alpha (float, default=0.05) – Significance level. The value should be within [0, 1).

  • ci ({"bootstrap", "hoeffding", "bernstein", "ttest"}, default="bootstrap") – Name of the method to estimate the confidence interval.

  • n_bootstrap_samples (int, default=10000 (> 0)) – Number of resamples drawn in the bootstrap procedure.

  • random_state (int, default=None (>= 0)) – Random state.

Returns:

estimated_confidence_interval – Dictionary storing the estimated mean and upper-lower confidence bounds.

key: [
    mean,
    {100 * (1. - alpha)}% CI (lower),
    {100 * (1. - alpha)}% CI (upper),
]

Return type:

dict
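As a rough illustration of the returned dictionary, the following sketch runs a percentile bootstrap over per-trajectory estimates. It is not the library's implementation: the helper name `bootstrap_interval` is hypothetical, the percentile method is an assumption, and the key strings simply follow the format listed above.

```python
import numpy as np


def bootstrap_interval(samples, alpha=0.05, n_bootstrap_samples=10000,
                       random_state=None):
    """Percentile bootstrap for the mean of `samples` (hypothetical helper)."""
    samples = np.asarray(samples, dtype=float)
    rng = np.random.default_rng(random_state)
    # Resample indices with replacement, one row per bootstrap replicate.
    idx = rng.integers(0, samples.size, size=(n_bootstrap_samples, samples.size))
    boot_means = samples[idx].mean(axis=1)
    lower, upper = np.quantile(boot_means, [alpha / 2, 1 - alpha / 2])
    level = 100 * (1.0 - alpha)
    return {
        "mean": float(boot_means.mean()),
        f"{level}% CI (lower)": float(lower),
        f"{level}% CI (upper)": float(upper),
    }


# Made-up per-trajectory estimates.
interval = bootstrap_interval(np.arange(10.0), n_bootstrap_samples=1000,
                              random_state=12345)
```

Fixing `random_state` makes the resampling, and therefore the interval, reproducible.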

class scope_rl.ope.continuous.basic_estimators.TrajectoryWiseImportanceSampling(estimator_name='tis')

Trajectory-wise Importance Sampling (TIS) (designed for deterministic evaluation policies) for continuous action spaces.

Bases: scope_rl.ope.BaseOffPolicyEstimator

Imported as: scope_rl.ope.continuous.TrajectoryWiseImportanceSampling

Note

TIS estimates the policy value via trajectory-wise importance weighting as follows.

\[\hat{J}_{\mathrm{TIS}} (\pi; \mathcal{D}) := \frac{1}{n} \sum_{i=1}^n \sum_{t=0}^{T-1} \gamma^t w_{0:T-1}^{(i)} \delta(\pi, a_{0:T-1}^{(i)}) r_t^{(i)},\]

where \(w_{0:T-1} := \prod_{t=0}^{T-1} (1 / \pi_0(a_t | s_t))\) is the (trajectory-wise) importance weight. \(\delta(\pi, a_{0:T-1}) = \prod_{t=0}^{T-1} K(\pi(s_t), a_t)\) quantifies the similarity between the action logged in the dataset and that taken by the evaluation policy where \(K(\cdot)\) is a kernel function. Note that the bandwidth of the kernel is an important hyperparameter; the variance of the above estimator often becomes small when the bandwidth of the kernel is large, while the bias often becomes large in those cases.

TIS is able to correct the distribution shift between the behavior and evaluation policies. However, when the trajectory length (\(T\)) is large, TIS suffers from high variance due to the product of importance weights over the entire horizon.
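A minimal NumPy sketch of the TIS formula follows. It is an illustrative reimplementation under stated assumptions, not the library's code: the helper names `gaussian_kernel` and `tis_value` are hypothetical, actions are taken to be one-dimensional, and the kernel is assumed to be a Gaussian scaled by the bandwidth.

```python
import numpy as np


def gaussian_kernel(u, bandwidth):
    """Bandwidth-scaled Gaussian kernel (an assumed form of K)."""
    z = u / bandwidth
    return np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2.0 * np.pi))


def tis_value(T, action, reward, pscore, evaluation_policy_action,
              gamma=1.0, bandwidth=1.0):
    """Trajectory-wise importance sampling for 1-D actions (hypothetical helper)."""
    # Flat (n * T,) inputs are reshaped to (n, T).
    a = np.asarray(action, dtype=float).reshape(-1, T)
    r = np.asarray(reward, dtype=float).reshape(-1, T)
    p = np.asarray(pscore, dtype=float).reshape(-1, T)
    e = np.asarray(evaluation_policy_action, dtype=float).reshape(-1, T)
    # Per-step factor K(pi(s_t), a_t) / pi_0(a_t | s_t).
    step_weight = gaussian_kernel(e - a, bandwidth) / p
    # Product over the whole trajectory: w_{0:T-1} * delta(pi, a_{0:T-1}).
    trajectory_weight = step_weight.prod(axis=1, keepdims=True)
    discount = gamma ** np.arange(T)
    return float((discount * trajectory_weight * r).sum(axis=1).mean())


# Made-up data: logged actions coincide with the evaluation policy's actions
# and pscore equals the kernel peak, so every weight is exactly 1.
peak = gaussian_kernel(0.0, 1.0)
value = tis_value(
    T=2,
    action=[0.1, 0.2, 0.3, 0.4],
    reward=[1.0, 2.0, 3.0, 4.0],
    pscore=[peak] * 4,
    evaluation_policy_action=[0.1, 0.2, 0.3, 0.4],
    gamma=0.5,
)
```

Because the same trajectory weight multiplies every reward, a single small `pscore` anywhere in a trajectory inflates all of its terms, which is the source of the variance growth with \(T\).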

Parameters:

estimator_name (str, default="tis") – Name of the estimator.

References

Nathan Kallus and Angela Zhou. “Policy Evaluation and Optimization with Continuous Treatments.” 2019.

Doina Precup, Richard S. Sutton, and Satinder P. Singh. “Eligibility Traces for Off-Policy Policy Evaluation.” 2000.

Methods

estimate_interval(step_per_trajectory, ...)

Estimate the confidence interval of the policy value by nonparametric bootstrap.

estimate_policy_value(step_per_trajectory, ...)

Estimate the policy value of the evaluation policy.

estimate_policy_value(step_per_trajectory, action, reward, pscore, evaluation_policy_action, gamma=1.0, kernel='gaussian', bandwidth=1.0, action_scaler=None, **kwargs)

Estimate the policy value of the evaluation policy.

Parameters:
  • step_per_trajectory (int (> 0)) – Number of timesteps in an episode.

  • action (array-like of shape (n_trajectories * step_per_trajectory, action_dim)) – Action chosen by the behavior policy.

  • reward (array-like of shape (n_trajectories * step_per_trajectory, )) – Observed immediate rewards.

  • pscore (array-like of shape (n_trajectories * step_per_trajectory, action_dim)) – Conditional action choice probability of the behavior policy, i.e., \(\pi_0(a | s)\).

  • evaluation_policy_action (array-like of shape (n_trajectories * step_per_trajectory, action_dim)) – Action chosen by the evaluation policy.

  • gamma (float, default=1.0) – Discount factor. The value should be within (0, 1].

  • kernel ({"gaussian", "epanechnikov", "triangular", "cosine", "uniform"}, default="gaussian") – Name of the kernel function to smooth importance weights.

  • bandwidth (float, default=1.0 (> 0)) – Bandwidth hyperparameter of the kernel function.

  • action_scaler (d3rlpy.preprocessing.ActionScaler, default=None) – Scaling factor of action.

Returns:

V_hat – Estimated policy value.

Return type:

float

estimate_interval(step_per_trajectory, action, reward, pscore, evaluation_policy_action, gamma=1.0, kernel='gaussian', bandwidth=1.0, action_scaler=None, alpha=0.05, ci='bootstrap', n_bootstrap_samples=10000, random_state=12345, **kwargs)

Estimate the confidence interval of the policy value by nonparametric bootstrap.

Parameters:
  • step_per_trajectory (int (> 0)) – Number of timesteps in an episode.

  • action (array-like of shape (n_trajectories * step_per_trajectory, action_dim)) – Action chosen by the behavior policy.

  • reward (array-like of shape (n_trajectories * step_per_trajectory, )) – Observed immediate rewards.

  • pscore (array-like of shape (n_trajectories * step_per_trajectory, action_dim)) – Conditional action choice probability of the behavior policy, i.e., \(\pi_0(a | s)\).

  • evaluation_policy_action (array-like of shape (n_trajectories * step_per_trajectory, action_dim)) – Action chosen by the evaluation policy.

  • gamma (float, default=1.0) – Discount factor. The value should be within (0, 1].

  • kernel ({"gaussian", "epanechnikov", "triangular", "cosine", "uniform"}, default="gaussian") – Name of the kernel function to smooth importance weights.

  • bandwidth (float, default=1.0 (> 0)) – Bandwidth hyperparameter of the kernel function.

  • action_scaler (d3rlpy.preprocessing.ActionScaler, default=None) – Scaling factor of action.

  • alpha (float, default=0.05) – Significance level. The value should be within [0, 1).

  • ci ({"bootstrap", "hoeffding", "bernstein", "ttest"}, default="bootstrap") – Name of the method to estimate the confidence interval.

  • n_bootstrap_samples (int, default=10000 (> 0)) – Number of resamples drawn in the bootstrap procedure.

  • random_state (int, default=12345 (>= 0)) – Random state.

Returns:

estimated_confidence_interval – Dictionary storing the estimated mean and upper-lower confidence bounds.

key: [
    mean,
    {100 * (1. - alpha)}% CI (lower),
    {100 * (1. - alpha)}% CI (upper),
]

Return type:

dict

class scope_rl.ope.continuous.basic_estimators.PerDecisionImportanceSampling(estimator_name='pdis')

Per-Decision Importance Sampling (PDIS) (designed for deterministic evaluation policies) for continuous action spaces.

Bases: scope_rl.ope.BaseOffPolicyEstimator

Imported as: scope_rl.ope.continuous.PerDecisionImportanceSampling

Note

PDIS estimates the policy value via step-wise importance weighting as follows.

\[\hat{J}_{\mathrm{PDIS}} (\pi; \mathcal{D}) := \frac{1}{n} \sum_{i=1}^n \sum_{t=0}^{T-1} \gamma^t w_{0:t}^{(i)} \delta(\pi, a_{0:t}^{(i)}) r_t^{(i)},\]

where \(w_{0:t} := \prod_{t'=0}^t (1 / \pi_0(a_{t'} | s_{t'}))\) is the importance weight for each time step with respect to the previous actions (referred to as the per-decision or step-wise importance weight). \(\delta(\pi, a_{0:t}) = \prod_{t'=0}^t K(\pi(s_{t'}), a_{t'})\) quantifies the similarity between the action logged in the dataset and that taken by the evaluation policy where \(K(\cdot)\) is a kernel function. Note that the bandwidth of the kernel is an important hyperparameter; the variance of the above estimator often becomes small when the bandwidth of the kernel is large, while the bias often becomes large in those cases.

By using per-decision importance weighting instead of trajectory-wise importance weighting of TIS, PDIS has lower variance than TIS while still correcting the distribution shift between the behavior and evaluation policies. However, when the trajectory length (\(T\)) is large, PDIS still suffers from high variance.
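The difference from trajectory-wise weighting is easiest to see with a cumulative product. The sketch below is an illustrative reimplementation of the PDIS formula, not the library's code: the helper name `pdis_value` is hypothetical, and it assumes the per-step factors \(K(\pi(s_t), a_t) / \pi_0(a_t | s_t)\) have already been computed and are passed in as `step_weight`.

```python
import numpy as np


def pdis_value(T, step_weight, reward, gamma=1.0):
    """Per-decision importance sampling from precomputed per-step factors."""
    w = np.asarray(step_weight, dtype=float).reshape(-1, T)
    r = np.asarray(reward, dtype=float).reshape(-1, T)
    # Cumulative product up to each step: w_{0:t} * delta(pi, a_{0:t}).
    cumulative_weight = np.cumprod(w, axis=1)
    discount = gamma ** np.arange(T)
    return float((discount * cumulative_weight * r).sum(axis=1).mean())


# One made-up trajectory of two steps, each with per-step factor 2.
value = pdis_value(T=2, step_weight=[2.0, 2.0], reward=[1.0, 1.0])
# The weights are 2 and 4, so the estimate is 2 * 1 + 4 * 1 = 6, whereas a
# trajectory-wise weight of 4 on both rewards would give 8.
```

Early rewards are multiplied by fewer factors than late ones, so only the later terms carry the full product.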

Parameters:

estimator_name (str, default="pdis") – Name of the estimator.

References

Nathan Kallus and Angela Zhou. “Policy Evaluation and Optimization with Continuous Treatments.” 2019.

Doina Precup, Richard S. Sutton, and Satinder P. Singh. “Eligibility Traces for Off-Policy Policy Evaluation.” 2000.

Methods

estimate_interval(step_per_trajectory, ...)

Estimate the confidence interval of the policy value by nonparametric bootstrap.

estimate_policy_value(step_per_trajectory, ...)

Estimate the policy value of the evaluation policy.

estimate_policy_value(step_per_trajectory, action, reward, pscore, evaluation_policy_action, gamma=1.0, kernel='gaussian', bandwidth=1.0, action_scaler=None, **kwargs)

Estimate the policy value of the evaluation policy.

Parameters:
  • step_per_trajectory (int (> 0)) – Number of timesteps in an episode.

  • action (array-like of shape (n_trajectories * step_per_trajectory, action_dim)) – Action chosen by the behavior policy.

  • reward (array-like of shape (n_trajectories * step_per_trajectory, )) – Observed immediate rewards.

  • pscore (array-like of shape (n_trajectories * step_per_trajectory, action_dim)) – Conditional action choice probability of the behavior policy, i.e., \(\pi_0(a | s)\).

  • evaluation_policy_action (array-like of shape (n_trajectories * step_per_trajectory, action_dim)) – Action chosen by the evaluation policy.

  • gamma (float, default=1.0) – Discount factor. The value should be within (0, 1].

  • kernel ({"gaussian", "epanechnikov", "triangular", "cosine", "uniform"}, default="gaussian") – Name of the kernel function to smooth importance weights.

  • bandwidth (float, default=1.0 (> 0)) – Bandwidth hyperparameter of the kernel function.

  • action_scaler (d3rlpy.preprocessing.ActionScaler, default=None) – Scaling factor of action.

Returns:

V_hat – Estimated policy value.

Return type:

float

estimate_interval(step_per_trajectory, action, reward, pscore, evaluation_policy_action, gamma=1.0, kernel='gaussian', bandwidth=1.0, action_scaler=None, alpha=0.05, ci='bootstrap', n_bootstrap_samples=10000, random_state=12345, **kwargs)

Estimate the confidence interval of the policy value by nonparametric bootstrap.

Parameters:
  • step_per_trajectory (int (> 0)) – Number of timesteps in an episode.

  • action (array-like of shape (n_trajectories * step_per_trajectory, action_dim)) – Action chosen by the behavior policy.

  • reward (array-like of shape (n_trajectories * step_per_trajectory, )) – Observed immediate rewards.

  • pscore (array-like of shape (n_trajectories * step_per_trajectory, action_dim)) – Conditional action choice probability of the behavior policy, i.e., \(\pi_0(a | s)\).

  • evaluation_policy_action (array-like of shape (n_trajectories * step_per_trajectory, action_dim)) – Action chosen by the evaluation policy.

  • gamma (float, default=1.0) – Discount factor. The value should be within (0, 1].

  • kernel ({"gaussian", "epanechnikov", "triangular", "cosine", "uniform"}, default="gaussian") – Name of the kernel function to smooth importance weights.

  • bandwidth (float, default=1.0 (> 0)) – Bandwidth hyperparameter of the kernel function.

  • action_scaler (d3rlpy.preprocessing.ActionScaler, default=None) – Scaling factor of action.

  • alpha (float, default=0.05) – Significance level. The value should be within [0, 1).

  • ci ({"bootstrap", "hoeffding", "bernstein", "ttest"}, default="bootstrap") – Name of the method to estimate the confidence interval.

  • n_bootstrap_samples (int, default=10000 (> 0)) – Number of resamples drawn in the bootstrap procedure.

  • random_state (int, default=12345 (>= 0)) – Random state.

Returns:

estimated_confidence_interval – Dictionary storing the estimated mean and upper-lower confidence bounds.

key: [
    mean,
    {100 * (1. - alpha)}% CI (lower),
    {100 * (1. - alpha)}% CI (upper),
]

Return type:

dict

class scope_rl.ope.continuous.basic_estimators.DoublyRobust

Doubly Robust (DR) (designed for deterministic evaluation policies) for continuous action spaces.

Bases: scope_rl.ope.BaseOffPolicyEstimator

Imported as: scope_rl.ope.continuous.DoublyRobust

Note

DR estimates the policy value via step-wise importance weighting and estimated Q-function \(\hat{Q}\) as follows.

\[\hat{J}_{\mathrm{DR}} (\pi; \mathcal{D}) := \frac{1}{n} \sum_{i=1}^n \sum_{t=0}^{T-1} \gamma^t \left(w_{0:t}^{(i)} \delta(\pi, a_{0:t}^{(i)}) (r_t^{(i)} - \hat{Q}(s_t^{(i)}, a_t^{(i)})) + w_{0:t-1}^{(i)} \delta(\pi, a_{0:t-1}^{(i)}) \hat{Q}(s_t^{(i)}, \pi(s_t^{(i)})) \right),\]

where \(w_{0:t} := \prod_{t'=0}^t (1 / \pi_0(a_{t'} | s_{t'}))\) is the per-decision importance weight. \(\delta(\pi, a_{0:t}) = \prod_{t'=0}^t K(\pi(s_{t'}), a_{t'})\) quantifies the similarity between the action logged in the dataset and that taken by the evaluation policy where \(K(\cdot)\) is a kernel function. Note that the bandwidth of the kernel is an important hyperparameter; the variance of the above estimator often becomes small when the bandwidth of the kernel is large, while the bias often becomes large in those cases.

DR corrects the distribution shift between the behavior and evaluation policies and has lower variance than PDIS when \(\hat{Q}(\cdot)\) is reasonably accurate and satisfies \(0 < \hat{Q}(\cdot) < 2 Q(\cdot)\). However, when the importance weights are large, DR may still suffer from high variance.
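The two terms of the DR formula can be sketched as follows. This is an illustrative reimplementation, not the library's code: the helper name `dr_value` is hypothetical, the per-step factors \(K(\pi(s_t), a_t) / \pi_0(a_t | s_t)\) are assumed to be precomputed, the weight before the first step is taken to be 1, and the last axis of the prediction array is assumed to hold \(\hat{Q}(s_t, a_t)\) at index 0 and \(\hat{Q}(s_t, \pi(s_t))\) at index 1.

```python
import numpy as np


def dr_value(T, step_weight, reward, state_action_value_prediction, gamma=1.0):
    """Doubly robust estimate from precomputed per-step factors (hypothetical)."""
    w = np.asarray(step_weight, dtype=float).reshape(-1, T)
    r = np.asarray(reward, dtype=float).reshape(-1, T)
    q = np.asarray(state_action_value_prediction, dtype=float).reshape(-1, T, 2)
    # w_{0:t} * delta(pi, a_{0:t}) for each step t.
    cumulative = np.cumprod(w, axis=1)
    # w_{0:t-1} * delta(pi, a_{0:t-1}); assumed to be 1 at t = 0.
    previous = np.concatenate(
        [np.ones((w.shape[0], 1)), cumulative[:, :-1]], axis=1
    )
    # Importance-weighted residual plus the model-based baseline.
    per_step = cumulative * (r - q[:, :, 0]) + previous * q[:, :, 1]
    discount = gamma ** np.arange(T)
    return float((discount * per_step).sum(axis=1).mean())


# Made-up on-policy case: all weights are 1 and both columns of Q_hat agree,
# so the Q_hat terms cancel and the estimate is the plain return 1 + 2 = 3.
value = dr_value(
    T=2,
    step_weight=[1.0, 1.0],
    reward=[1.0, 2.0],
    state_action_value_prediction=[[5.0, 5.0], [7.0, 7.0]],
)
```

With \(\hat{Q}\) set to zero everywhere the expression reduces to PDIS, which is one way to sanity-check an implementation.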

Parameters:

estimator_name (str, default="dr") – Name of the estimator.

References

Nathan Kallus and Angela Zhou. “Policy Evaluation and Optimization with Continuous Treatments.” 2019.

Nan Jiang and Lihong Li. “Doubly Robust Off-policy Value Evaluation for Reinforcement Learning.” 2016.

Philip S. Thomas and Emma Brunskill. “Data-Efficient Off-Policy Policy Evaluation for Reinforcement Learning.” 2016.

Methods

estimate_interval(step_per_trajectory, ...)

Estimate the confidence interval of the policy value by nonparametric bootstrap.

estimate_policy_value(step_per_trajectory, ...)

Estimate the policy value of the evaluation policy.

estimate_policy_value(step_per_trajectory, action, reward, pscore, evaluation_policy_action, state_action_value_prediction, gamma=1.0, kernel='gaussian', bandwidth=1.0, action_scaler=None, **kwargs)

Estimate the policy value of the evaluation policy.

Parameters:
  • step_per_trajectory (int (> 0)) – Number of timesteps in an episode.

  • action (array-like of shape (n_trajectories * step_per_trajectory, action_dim)) – Action chosen by the behavior policy.

  • reward (array-like of shape (n_trajectories * step_per_trajectory, )) – Observed immediate rewards.

  • pscore (array-like of shape (n_trajectories * step_per_trajectory, action_dim)) – Conditional action choice probability of the behavior policy, i.e., \(\pi_0(a | s)\).

  • evaluation_policy_action (array-like of shape (n_trajectories * step_per_trajectory, action_dim)) – Action chosen by the evaluation policy.

  • state_action_value_prediction (array-like of shape (n_trajectories * step_per_trajectory, 2)) – \(\hat{Q}\) for the observed action and for the action chosen by the evaluation policy, i.e., (column 0) \(\hat{Q}(s_t, a_t)\) and (column 1) \(\hat{Q}(s_t, \pi(s_t))\).

  • gamma (float, default=1.0) – Discount factor. The value should be within (0, 1].

  • kernel ({"gaussian", "epanechnikov", "triangular", "cosine", "uniform"}, default="gaussian") – Name of the kernel function to smooth importance weights.

  • bandwidth (float, default=1.0 (> 0)) – Bandwidth hyperparameter of the kernel function.

  • action_scaler (d3rlpy.preprocessing.ActionScaler, default=None) – Scaling factor of action.

Returns:

V_hat – Estimated policy value.

Return type:

float

estimate_interval(step_per_trajectory, action, reward, pscore, evaluation_policy_action, state_action_value_prediction, gamma=1.0, kernel='gaussian', bandwidth=1.0, action_scaler=None, alpha=0.05, ci='bootstrap', n_bootstrap_samples=10000, random_state=12345, **kwargs)

Estimate the confidence interval of the policy value by nonparametric bootstrap.

Parameters:
  • step_per_trajectory (int (> 0)) – Number of timesteps in an episode.

  • action (array-like of shape (n_trajectories * step_per_trajectory, action_dim)) – Action chosen by the behavior policy.

  • reward (array-like of shape (n_trajectories * step_per_trajectory, )) – Observed immediate rewards.

  • pscore (array-like of shape (n_trajectories * step_per_trajectory, action_dim)) – Conditional action choice probability of the behavior policy, i.e., \(\pi_0(a | s)\).

  • evaluation_policy_action (array-like of shape (n_trajectories * step_per_trajectory, action_dim)) – Action chosen by the evaluation policy.

  • state_action_value_prediction (array-like of shape (n_trajectories * step_per_trajectory, 2)) – \(\hat{Q}\) for the observed action and for the action chosen by the evaluation policy, i.e., (column 0) \(\hat{Q}(s_t, a_t)\) and (column 1) \(\hat{Q}(s_t, \pi(s_t))\).

  • gamma (float, default=1.0) – Discount factor. The value should be within (0, 1].

  • kernel ({"gaussian", "epanechnikov", "triangular", "cosine", "uniform"}, default="gaussian") – Name of the kernel function to smooth importance weights.

  • bandwidth (float, default=1.0 (> 0)) – Bandwidth hyperparameter of the kernel function.

  • action_scaler (d3rlpy.preprocessing.ActionScaler, default=None) – Scaling factor of action.

  • alpha (float, default=0.05) – Significance level. The value should be within [0, 1).

  • ci ({"bootstrap", "hoeffding", "bernstein", "ttest"}, default="bootstrap") – Name of the method to estimate the confidence interval.

  • n_bootstrap_samples (int, default=10000 (> 0)) – Number of resamples drawn in the bootstrap procedure.

  • random_state (int, default=12345 (>= 0)) – Random state.

Returns:

estimated_confidence_interval – Dictionary storing the estimated mean and upper-lower confidence bounds.

key: [
    mean,
    {100 * (1. - alpha)}% CI (lower),
    {100 * (1. - alpha)}% CI (upper),
]

Return type:

dict

class scope_rl.ope.continuous.basic_estimators.SelfNormalizedTIS(estimator_name='sntis')

Self-Normalized Trajectory-wise Importance Sampling (SNTIS) (designed for deterministic evaluation policies) for continuous action spaces.

Bases: scope_rl.ope.continuous.TrajectoryWiseImportanceSampling -> scope_rl.ope.BaseOffPolicyEstimator

Imported as: scope_rl.ope.continuous.SelfNormalizedTIS

Note

SNTIS estimates the policy value via self-normalized trajectory-wise importance weighting as follows.

\[\hat{J}_{\mathrm{SNTIS}} (\pi; \mathcal{D}) := \sum_{i=1}^n \sum_{t=0}^{T-1} \gamma^t \frac{w_{0:T-1}^{(i)} \delta(\pi, a_{0:T-1}^{(i)})}{\sum_{i'=1}^n w_{0:T-1}^{(i')} \delta(\pi, a_{0:T-1}^{(i')})} r_t^{(i)},\]

where \(w_{0:T-1} := \prod_{t=0}^{T-1} (1 / \pi_0(a_t | s_t))\) is the trajectory-wise importance weight. \(\delta(\pi, a_{0:T-1}) = \prod_{t=0}^{T-1} K(\pi(s_t), a_t)\) quantifies the similarity between the action logged in the dataset and that taken by the evaluation policy where \(K(\cdot)\) is a kernel function. Note that the bandwidth of the kernel is an important hyperparameter; the variance of the above estimator often becomes small when the bandwidth of the kernel is large, while the bias often becomes large in those cases.

The self-normalized estimator has variance bounded by \(r_{max}^2\).
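A short sketch makes the normalization concrete. It is an illustrative reimplementation of the SNTIS formula, not the library's code: the helper name `sntis_value` is hypothetical, and the per-step factors \(K(\pi(s_t), a_t) / \pi_0(a_t | s_t)\) are assumed to be precomputed and to have a non-zero total.

```python
import numpy as np


def sntis_value(T, step_weight, reward, gamma=1.0):
    """Self-normalized trajectory-wise importance sampling (hypothetical helper)."""
    w = np.asarray(step_weight, dtype=float).reshape(-1, T)
    r = np.asarray(reward, dtype=float).reshape(-1, T)
    # Trajectory-wise weight, then normalization across trajectories.
    trajectory_weight = w.prod(axis=1)
    normalized = trajectory_weight / trajectory_weight.sum()
    discounted_return = (gamma ** np.arange(T) * r).sum(axis=1)
    return float((normalized * discounted_return).sum())


# Two made-up trajectories with weights 1 and 3 and returns 2 and 4.
value = sntis_value(T=2, step_weight=[1.0, 1.0, 3.0, 1.0],
                    reward=[1.0, 1.0, 2.0, 2.0])
# Normalized weights are 0.25 and 0.75, so the estimate is 0.5 + 3.0 = 3.5.
```

The normalized weights are non-negative and sum to one, so the estimate is a weighted average of the observed returns and cannot leave their range, however extreme the raw weights are.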

Parameters:

estimator_name (str, default="sntis") – Name of the estimator.

References

Nathan Kallus and Angela Zhou. “Policy Evaluation and Optimization with Continuous Treatments.” 2019.

Nathan Kallus and Masatoshi Uehara. “Intrinsically Efficient, Stable, and Bounded Off-Policy Evaluation for Reinforcement Learning.” 2019.

Doina Precup, Richard S. Sutton, and Satinder P. Singh. “Eligibility Traces for Off-Policy Policy Evaluation.” 2000.

Methods

estimate_interval(step_per_trajectory, ...)

Estimate the confidence interval of the policy value by nonparametric bootstrap.

estimate_policy_value(step_per_trajectory, ...)

Estimate the policy value of the evaluation policy.

class scope_rl.ope.continuous.basic_estimators.SelfNormalizedPDIS(estimator_name='snpdis')

Self-Normalized Per-Decision Importance Sampling (SNPDIS) (designed for deterministic evaluation policies) for continuous action spaces.

Bases: scope_rl.ope.continuous.PerDecisionImportanceSampling -> scope_rl.ope.BaseOffPolicyEstimator

Imported as: scope_rl.ope.continuous.SelfNormalizedPDIS

Note

SNPDIS estimates the policy value via self-normalized step-wise importance weighting as follows.

\[\hat{J}_{\mathrm{SNPDIS}} (\pi; \mathcal{D}) := \sum_{i=1}^n \sum_{t=0}^{T-1} \gamma^t \frac{w_{0:t}^{(i)} \delta(\pi, a_{0:t}^{(i)})}{\sum_{i'=1}^n w_{0:t}^{(i')} \delta(\pi, a_{0:t}^{(i')})} r_t^{(i)},\]

where \(w_{0:t} := \prod_{t'=0}^t (1 / \pi_0(a_{t'} | s_{t'}))\) is the per-decision importance weight. \(\delta(\pi, a_{0:t}) = \prod_{t'=0}^t K(\pi(s_{t'}), a_{t'})\) quantifies the similarity between the action logged in the dataset and that taken by the evaluation policy where \(K(\cdot)\) is a kernel function. Note that the bandwidth of the kernel is an important hyperparameter; the variance of the above estimator often becomes small when the bandwidth of the kernel is large, while the bias often becomes large in those cases.

The self-normalized estimator has variance bounded by \(r_{max}^2\).
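For the per-decision variant, the normalization is applied separately at every time step. The sketch below is an illustrative reimplementation of the SNPDIS formula, not the library's code: the helper name `snpdis_value` is hypothetical, and the per-step factors \(K(\pi(s_t), a_t) / \pi_0(a_t | s_t)\) are assumed to be precomputed.

```python
import numpy as np


def snpdis_value(T, step_weight, reward, gamma=1.0):
    """Self-normalized per-decision importance sampling (hypothetical helper)."""
    w = np.asarray(step_weight, dtype=float).reshape(-1, T)
    r = np.asarray(reward, dtype=float).reshape(-1, T)
    cumulative = np.cumprod(w, axis=1)
    # Normalize over trajectories (axis 0) independently at each step t.
    normalized = cumulative / cumulative.sum(axis=0, keepdims=True)
    discount = gamma ** np.arange(T)
    return float((discount * normalized * r).sum())


# Two made-up trajectories; cumulative weights are [1, 3] and [3, 3].
value = snpdis_value(T=2, step_weight=[1.0, 3.0, 3.0, 1.0],
                     reward=[1.0, 1.0, 2.0, 2.0])
# Step 0 contributes 0.25 * 1 + 0.75 * 2 = 1.75 and step 1 contributes
# 0.5 * 1 + 0.5 * 2 = 1.5, for a total of 3.25.
```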

Parameters:

estimator_name (str, default="snpdis") – Name of the estimator.

References

Nathan Kallus and Angela Zhou. “Policy Evaluation and Optimization with Continuous Treatments.” 2019.

Nathan Kallus and Masatoshi Uehara. “Intrinsically Efficient, Stable, and Bounded Off-Policy Evaluation for Reinforcement Learning.” 2019.

Doina Precup, Richard S. Sutton, and Satinder P. Singh. “Eligibility Traces for Off-Policy Policy Evaluation.” 2000.

Methods

estimate_interval(step_per_trajectory, ...)

Estimate the confidence interval of the policy value by nonparametric bootstrap.

estimate_policy_value(step_per_trajectory, ...)

Estimate the policy value of the evaluation policy.

class scope_rl.ope.continuous.basic_estimators.SelfNormalizedDR

Self-Normalized Doubly Robust (SNDR) (designed for deterministic evaluation policies) for continuous action spaces.

Bases: scope_rl.ope.continuous.DoublyRobust -> scope_rl.ope.BaseOffPolicyEstimator

Imported as: scope_rl.ope.continuous.SelfNormalizedDR

Note

SNDR estimates the policy value via self-normalized step-wise importance weighting and estimated Q-function \(\hat{Q}\) as follows.

\[\hat{J}_{\mathrm{SNDR}} (\pi; \mathcal{D}) := \sum_{i=1}^n \sum_{t=0}^{T-1} \gamma^t \left( \frac{w_{0:t}^{(i)} \delta(\pi, a_{0:t}^{(i)})}{\sum_{i'=1}^n w_{0:t}^{(i')} \delta(\pi, a_{0:t}^{(i')})} (r_t^{(i)} - \hat{Q}(s_t^{(i)}, a_t^{(i)})) + \frac{w_{0:t-1}^{(i)} \delta(\pi, a_{0:t-1}^{(i)})}{\sum_{i'=1}^n w_{0:t-1}^{(i')} \delta(\pi, a_{0:t-1}^{(i')})} \hat{Q}(s_t^{(i)}, \pi(s_t^{(i)})) \right),\]

where \(w_{0:t} := \prod_{t'=0}^t (1 / \pi_0(a_{t'} | s_{t'}))\) is the per-decision importance weight. \(\delta(\pi, a_{0:t}) = \prod_{t'=0}^t K(\pi(s_{t'}), a_{t'})\) quantifies the similarity between the action logged in the dataset and that taken by the evaluation policy where \(K(\cdot)\) is a kernel function. Note that the bandwidth of the kernel is an important hyperparameter; the variance of the above estimator often becomes small when the bandwidth of the kernel is large, while the bias often becomes large in those cases.

The self-normalized estimator has variance bounded by \(r_{max}^2\).
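The SNDR formula normalizes both weights that appear in it, each at its own time step. The following sketch is an illustrative reimplementation, not the library's code: the helper name `sndr_value` is hypothetical, the per-step factors are assumed to be precomputed, the weight before the first step is taken to be 1 for every trajectory, and the last axis of the prediction array is assumed to hold \(\hat{Q}(s_t, a_t)\) at index 0 and \(\hat{Q}(s_t, \pi(s_t))\) at index 1.

```python
import numpy as np


def sndr_value(T, step_weight, reward, state_action_value_prediction, gamma=1.0):
    """Self-normalized doubly robust estimate (hypothetical helper)."""
    w = np.asarray(step_weight, dtype=float).reshape(-1, T)
    r = np.asarray(reward, dtype=float).reshape(-1, T)
    q = np.asarray(state_action_value_prediction, dtype=float).reshape(-1, T, 2)
    cumulative = np.cumprod(w, axis=1)
    # Weight up to the previous step; assumed to be 1 at t = 0.
    previous = np.concatenate(
        [np.ones((w.shape[0], 1)), cumulative[:, :-1]], axis=1
    )
    # Normalize each weight over trajectories at every step.
    cumulative = cumulative / cumulative.sum(axis=0, keepdims=True)
    previous = previous / previous.sum(axis=0, keepdims=True)
    per_step = cumulative * (r - q[:, :, 0]) + previous * q[:, :, 1]
    discount = gamma ** np.arange(T)
    return float((discount * per_step).sum())


# Made-up on-policy case with two trajectories: all weights are 1 and both
# columns of Q_hat agree, so the estimate is the average return (3 + 7) / 2.
value = sndr_value(
    T=2,
    step_weight=[1.0, 1.0, 1.0, 1.0],
    reward=[1.0, 2.0, 3.0, 4.0],
    state_action_value_prediction=[[5.0, 5.0], [7.0, 7.0], [1.0, 1.0], [2.0, 2.0]],
)
```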

Parameters:

estimator_name (str, default="sndr") – Name of the estimator.

References

Nathan Kallus and Angela Zhou. “Policy Evaluation and Optimization with Continuous Treatments.” 2019.

Nathan Kallus and Masatoshi Uehara. “Intrinsically Efficient, Stable, and Bounded Off-Policy Evaluation for Reinforcement Learning.” 2019.

Nan Jiang and Lihong Li. “Doubly Robust Off-policy Value Evaluation for Reinforcement Learning.” 2016.

Philip S. Thomas and Emma Brunskill. “Data-Efficient Off-Policy Policy Evaluation for Reinforcement Learning.” 2016.

Methods

estimate_interval(step_per_trajectory, ...)

Estimate the confidence interval of the policy value by nonparametric bootstrap.

estimate_policy_value(step_per_trajectory, ...)

Estimate the policy value of the evaluation policy.