scope_rl.ope.discrete.basic_estimators#

Off-Policy Estimators for discrete action cases.

Classes

`DirectMethod`	Direct Method (DM) for discrete action spaces.
`DoublyRobust`	Doubly Robust (DR) for discrete action spaces.
`PerDecisionImportanceSampling`	Per-Decision Importance Sampling (PDIS) for discrete action spaces.
`SelfNormalizedDR`	Self-Normalized Doubly Robust (SNDR) for discrete action spaces.
`SelfNormalizedPDIS`	Self-Normalized Per-Decision Importance Sampling (SNPDIS) for discrete action spaces.
`SelfNormalizedTIS`	Self-Normalized Trajectory-wise Importance Sampling (SNTIS) for discrete action spaces.
`TrajectoryWiseImportanceSampling`	Trajectory-wise Importance Sampling (TIS) for discrete action spaces.

class scope_rl.ope.discrete.basic_estimators.DirectMethod(estimator_name='dm')[source]#

Direct Method (DM) for discrete action spaces.

Bases: scope_rl.ope.BaseOffPolicyEstimator

Imported as: scope_rl.ope.discrete.DirectMethod

Note

DM estimates the policy value using an estimated initial state value as follows.

\[\hat{J}_{\mathrm{DM}} (\pi; \mathcal{D}) := \frac{1}{n} \sum_{i=1}^n \sum_{a \in \mathcal{A}} \pi(a | s_0^{(i)}) \hat{Q}(s_0^{(i)}, a) = \frac{1}{n} \sum_{i=1}^n \hat{V}(s_0^{(i)}),\]

where \(\mathcal{D}=\{\{(s_t^{(i)}, a_t^{(i)}, r_t^{(i)})\}_{t=0}^{T-1}\}_{i=1}^n\) is the logged dataset with \(n\) trajectories and \(T\) is the number of steps per episode. \(\hat{Q}(s_t, a_t)\) is the estimated Q-value of a state-action pair, and \(\hat{V}(s_t)\) is the estimated value of a state.

DM has low variance compared to other estimators, but can produce larger bias due to approximation errors.
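The DM formula above can be illustrated with a minimal NumPy sketch. This is not the SCOPE-RL implementation; the function name `direct_method_value` is hypothetical, and the inputs are assumed to follow the flattened `(n_trajectories * step_per_trajectory, n_action)` layout described for `estimate_policy_value`.

```python
import numpy as np


def direct_method_value(step_per_trajectory, evaluation_policy_action_dist,
                        state_action_value_prediction):
    """Hypothetical helper: average V_hat(s_0) over trajectories."""
    pi = np.asarray(evaluation_policy_action_dist, dtype=float)
    q_hat = np.asarray(state_action_value_prediction, dtype=float)
    n_action = pi.shape[1]
    # Unflatten to (n_trajectories, T, n_action) and keep only the initial step.
    pi_0 = pi.reshape(-1, step_per_trajectory, n_action)[:, 0, :]
    q_0 = q_hat.reshape(-1, step_per_trajectory, n_action)[:, 0, :]
    # V_hat(s_0) = sum_a pi(a | s_0) * Q_hat(s_0, a)
    v_hat_0 = (pi_0 * q_0).sum(axis=1)
    return float(v_hat_0.mean())


# Toy data: 2 trajectories, 2 steps each, 2 actions.
pi = np.array([[0.5, 0.5], [1.0, 0.0], [0.2, 0.8], [0.0, 1.0]])
q = np.array([[1.0, 3.0], [9.0, 9.0], [0.0, 5.0], [9.0, 9.0]])
dm_value = direct_method_value(2, pi, q)
```

Only the rows at \(t = 0\) contribute, so the quality of the estimate depends entirely on \(\hat{Q}\) at the initial states.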

There are several methods to estimate \(\hat{Q}(s, a)\) such as Fitted Q Evaluation (FQE) (Le et al., 2019) and Minimax Q-Function Learning (MQL) (Uehara et al., 2020).

See also

The implementation of FQE is provided by d3rlpy. The implementation of Minimax Learning is available in scope_rl.ope.weight_value_learning.

Parameters:

estimator_name (str, default="dm") – Name of the estimator.

References

Masatoshi Uehara, Jiawei Huang, and Nan Jiang. “Minimax Weight and Q-Function Learning for Off-Policy Evaluation.” 2020.

Hoang Le, Cameron Voloshin, and Yisong Yue. “Batch Policy Learning under Constraints.” 2019.

Methods

`estimate_interval`(step_per_trajectory, ...)	Estimate the confidence interval of the policy value by nonparametric bootstrap.
`estimate_policy_value`(step_per_trajectory, ...)	Estimate the policy value of the evaluation policy.

estimate_policy_value(step_per_trajectory, evaluation_policy_action_dist, state_action_value_prediction, **kwargs)[source]#

Estimate the policy value of the evaluation policy.

Parameters:

step_per_trajectory (int (> 0)) – Number of timesteps in an episode.
evaluation_policy_action_dist (array-like of shape (n_trajectories * step_per_trajectory, n_action)) – Conditional action distribution induced by the evaluation policy, i.e., \(\pi(a | s_t) \forall a \in \mathcal{A}\)
state_action_value_prediction (array-like of shape (n_trajectories * step_per_trajectory, n_action)) – \(\hat{Q}\) for all actions, i.e., \(\hat{Q}(s_t, a) \forall a \in \mathcal{A}\).

Returns:

V_hat – Estimated policy value.

Return type:

float

estimate_interval(step_per_trajectory, evaluation_policy_action_dist, state_action_value_prediction, alpha=0.05, ci='bootstrap', n_bootstrap_samples=10000, random_state=None, **kwargs)[source]#

Estimate the confidence interval of the policy value by nonparametric bootstrap.

Parameters:

step_per_trajectory (int (> 0)) – Number of timesteps in an episode.
evaluation_policy_action_dist (array-like of shape (n_trajectories * step_per_trajectory, n_action)) – Conditional action distribution induced by the evaluation policy, i.e., \(\pi(a | s_t) \forall a \in \mathcal{A}\)
state_action_value_prediction (array-like of shape (n_trajectories * step_per_trajectory, n_action)) – \(\hat{Q}\) for all actions, i.e., \(\hat{Q}(s_t, a) \forall a \in \mathcal{A}\).
alpha (float, default=0.05) – Significance level. The value should be within [0, 1).
ci ({"bootstrap", "hoeffding", "bernstein", "ttest"}, default="bootstrap") – Name of the method to estimate the confidence interval.
n_bootstrap_samples (int, default=10000 (> 0)) – Number of resamplings performed in the bootstrap procedure.
random_state (int, default=None (>= 0)) – Random state.

Returns:

estimated_confidence_interval – Dictionary storing the estimated mean and upper-lower confidence bounds.

key: [
    mean,
    {100 * (1. - alpha)}% CI (lower),
    {100 * (1. - alpha)}% CI (upper),
]

Return type:

dict
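As a rough illustration of the returned dictionary, the following sketch computes a percentile bootstrap interval over per-trajectory estimates. It is an assumption-laden example rather than the library's code: the helper name `bootstrap_interval` and the use of per-trajectory estimates as the resampling unit are hypothetical, and the actual procedure in SCOPE-RL may differ in detail.

```python
import numpy as np


def bootstrap_interval(samples, alpha=0.05, n_bootstrap_samples=10000,
                       random_state=None):
    """Hypothetical helper: percentile bootstrap CI of the mean of `samples`."""
    rng = np.random.default_rng(random_state)
    samples = np.asarray(samples, dtype=float)
    n = samples.shape[0]
    # Resample n items with replacement, n_bootstrap_samples times.
    idx = rng.integers(0, n, size=(n_bootstrap_samples, n))
    boot_means = samples[idx].mean(axis=1)
    lower, upper = np.quantile(boot_means, [alpha / 2, 1 - alpha / 2])
    level = 100 * (1.0 - alpha)
    return {
        "mean": float(boot_means.mean()),
        f"{level}% CI (lower)": float(lower),
        f"{level}% CI (upper)": float(upper),
    }


# Toy per-trajectory estimates (synthetic, for illustration only).
toy_rng = np.random.default_rng(0)
per_trajectory_estimates = toy_rng.normal(loc=1.0, scale=0.5, size=50)
ci = bootstrap_interval(per_trajectory_estimates, alpha=0.05,
                        n_bootstrap_samples=1000, random_state=12345)
```

Fixing `random_state` makes the resampling, and therefore the interval, reproducible.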

class scope_rl.ope.discrete.basic_estimators.TrajectoryWiseImportanceSampling[source]#

Trajectory-wise Importance Sampling (TIS) for discrete action spaces.

Bases: scope_rl.ope.BaseOffPolicyEstimator

Imported as: scope_rl.ope.discrete.TrajectoryWiseImportanceSampling

Note

TIS estimates the policy value via trajectory-wise importance weighting as follows.

\[\hat{J}_{\mathrm{TIS}} (\pi; \mathcal{D}) := \frac{1}{n} \sum_{i=1}^n \sum_{t=0}^{T-1} \gamma^t w_{0:T-1}^{(i)} r_t^{(i)},\]

where \(w_{0:T-1} := \prod_{t=0}^{T-1} (\pi(a_t | s_t) / \pi_0(a_t | s_t))\) is the trajectory-wise importance weight and \(\pi_0\) is the behavior policy (written \(\pi_b\) in the parameter descriptions).

TIS enables an unbiased estimation of the policy value. However, when the trajectory length (\(T\)) is large, TIS suffers from high variance due to the product of importance weights over the entire horizon.
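The TIS formula can be sketched in NumPy as follows. This is a minimal illustration, not the SCOPE-RL implementation; `tis_value` is a hypothetical name, and the arrays are assumed to be flattened trajectory by trajectory as in the parameter descriptions.

```python
import numpy as np


def tis_value(step_per_trajectory, action, reward, pscore,
              evaluation_policy_action_dist, gamma=1.0):
    """Hypothetical helper: trajectory-wise importance sampling estimate."""
    T = step_per_trajectory
    action = np.asarray(action, dtype=int)
    reward = np.asarray(reward, dtype=float).reshape(-1, T)
    pscore = np.asarray(pscore, dtype=float).reshape(-1, T)
    pi = np.asarray(evaluation_policy_action_dist, dtype=float)
    # pi(a_t | s_t) for the logged action at every step.
    pi_taken = pi[np.arange(action.shape[0]), action].reshape(-1, T)
    # One weight per trajectory: product of the ratios over the full horizon.
    w_traj = (pi_taken / pscore).prod(axis=1)
    discount = gamma ** np.arange(T)
    discounted_return = (discount * reward).sum(axis=1)
    return float((w_traj * discounted_return).mean())


# Toy data: 2 trajectories, 2 steps each, 2 actions.
action = [0, 1, 1, 0]
reward = [1.0, 2.0, 3.0, 4.0]
pscore = [0.5, 0.5, 0.5, 0.5]
pi = np.array([[1.0, 0.0], [0.5, 0.5], [0.5, 0.5], [0.5, 0.5]])
tis = tis_value(2, action, reward, pscore, pi)
```

Because every reward in a trajectory is multiplied by the same full-horizon product, a single extreme ratio inflates the whole trajectory's contribution, which is the source of the variance issue noted above.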

Parameters:

estimator_name (str, default="tis") – Name of the estimator.

References

Doina Precup, Richard S. Sutton, and Satinder P. Singh. “Eligibility Traces for Off-Policy Policy Evaluation.” 2000.

Methods

`estimate_interval`(step_per_trajectory, ...)	Estimate the confidence interval of the policy value by nonparametric bootstrap.
`estimate_policy_value`(step_per_trajectory, ...)	Estimate the policy value of the evaluation policy.

estimate_policy_value(step_per_trajectory, action, reward, pscore, evaluation_policy_action_dist, gamma=1.0, **kwargs)[source]#

Estimate the policy value of the evaluation policy.

Parameters:

step_per_trajectory (int (> 0)) – Number of timesteps in an episode.
action (array-like of shape (n_trajectories * step_per_trajectory, )) – Action chosen by the behavior policy.
reward (array-like of shape (n_trajectories * step_per_trajectory, )) – Observed immediate rewards.
pscore (array-like of shape (n_trajectories * step_per_trajectory, )) – Conditional action choice probability of the behavior policy, i.e., \(\pi_b(a | s)\)
evaluation_policy_action_dist (array-like of shape (n_trajectories * step_per_trajectory, n_action)) – Conditional action distribution induced by the evaluation policy, i.e., \(\pi(a | s_t) \forall a \in \mathcal{A}\)
gamma (float, default=1.0) – Discount factor. The value should be within (0, 1].

Returns:

V_hat – Estimated policy value.

Return type:

float

estimate_interval(step_per_trajectory, action, reward, pscore, evaluation_policy_action_dist, gamma=1.0, alpha=0.05, ci='bootstrap', n_bootstrap_samples=10000, random_state=None, **kwargs)[source]#

Estimate the confidence interval of the policy value by nonparametric bootstrap.

Parameters:

step_per_trajectory (int (> 0)) – Number of timesteps in an episode.
action (array-like of shape (n_trajectories * step_per_trajectory, )) – Action chosen by the behavior policy.
reward (array-like of shape (n_trajectories * step_per_trajectory, )) – Observed immediate rewards.
pscore (array-like of shape (n_trajectories * step_per_trajectory, )) – Conditional action choice probability of the behavior policy, i.e., \(\pi_b(a | s)\)
evaluation_policy_action_dist (array-like of shape (n_trajectories * step_per_trajectory, n_action)) – Conditional action distribution induced by the evaluation policy, i.e., \(\pi(a | s_t) \forall a \in \mathcal{A}\)
gamma (float, default=1.0) – Discount factor. The value should be within (0, 1].
alpha (float, default=0.05) – Significance level. The value should be within [0, 1).
ci ({"bootstrap", "hoeffding", "bernstein", "ttest"}, default="bootstrap") – Name of the method to estimate the confidence interval.
n_bootstrap_samples (int, default=10000 (> 0)) – Number of resamplings performed in the bootstrap procedure.
random_state (int, default=None (>= 0)) – Random state.

Returns:

estimated_confidence_interval – Dictionary storing the estimated mean and upper-lower confidence bounds.

key: [
    mean,
    {100 * (1. - alpha)}% CI (lower),
    {100 * (1. - alpha)}% CI (upper),
]

Return type:

dict

class scope_rl.ope.discrete.basic_estimators.PerDecisionImportanceSampling[source]#

Per-Decision Importance Sampling (PDIS) for discrete action spaces.

Bases: scope_rl.ope.BaseOffPolicyEstimator

Imported as: scope_rl.ope.discrete.PerDecisionImportanceSampling

Note

PDIS estimates the policy value via step-wise importance weighting as follows.

\[\hat{J}_{\mathrm{PDIS}} (\pi; \mathcal{D}) := \frac{1}{n} \sum_{i=1}^n \sum_{t=0}^{T-1} \gamma^t w_{0:t}^{(i)} r_t^{(i)},\]

where \(w_{0:t} := \prod_{t'=0}^t (\pi(a_{t'} | s_{t'}) / \pi_0(a_{t'} | s_{t'}))\) is the importance weight for each time step wrt the previous actions (referred to as the per-decision or step-wise importance weight).

By using per-decision importance weighting instead of trajectory-wise importance weighting of TIS, PDIS has lower variance than TIS while remaining unbiased. However, when the trajectory length (\(T\)) is large, PDIS still suffers from high variance.
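A minimal NumPy sketch of the PDIS formula is given below. It is illustrative only and not the library's code; `pdis_value` is a hypothetical name, and the flattened input layout is assumed from the parameter descriptions.

```python
import numpy as np


def pdis_value(step_per_trajectory, action, reward, pscore,
               evaluation_policy_action_dist, gamma=1.0):
    """Hypothetical helper: per-decision importance sampling estimate."""
    T = step_per_trajectory
    action = np.asarray(action, dtype=int)
    reward = np.asarray(reward, dtype=float).reshape(-1, T)
    pscore = np.asarray(pscore, dtype=float).reshape(-1, T)
    pi = np.asarray(evaluation_policy_action_dist, dtype=float)
    pi_taken = pi[np.arange(action.shape[0]), action].reshape(-1, T)
    # w_{0:t}: cumulative product of the ratios up to and including step t.
    w_cum = np.cumprod(pi_taken / pscore, axis=1)
    discount = gamma ** np.arange(T)
    return float((discount * w_cum * reward).sum(axis=1).mean())


# Toy data: 2 trajectories, 2 steps each, 2 actions.
action = [0, 0, 1, 0]
reward = [1.0, 2.0, 3.0, 4.0]
pscore = [0.5, 0.5, 0.5, 0.5]
pi = np.array([[0.5, 0.5], [1.0, 0.0], [0.5, 0.5], [0.5, 0.5]])
pdis = pdis_value(2, action, reward, pscore, pi)
```

The only change from TIS is that the reward at step \(t\) is weighted by the ratios up to step \(t\) rather than by the full-horizon product.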

Parameters:

estimator_name (str, default="pdis") – Name of the estimator.

References

Doina Precup, Richard S. Sutton, and Satinder P. Singh. “Eligibility Traces for Off-Policy Policy Evaluation.” 2000.

Methods

`estimate_interval`(step_per_trajectory, ...)	Estimate the confidence interval of the policy value by nonparametric bootstrap.
`estimate_policy_value`(step_per_trajectory, ...)	Estimate the policy value of the evaluation policy.

estimate_policy_value(step_per_trajectory, action, reward, pscore, evaluation_policy_action_dist, gamma=1.0, **kwargs)[source]#

Estimate the policy value of the evaluation policy.

Parameters:

step_per_trajectory (int (> 0)) – Number of timesteps in an episode.
action (array-like of shape (n_trajectories * step_per_trajectory, )) – Action chosen by the behavior policy.
reward (array-like of shape (n_trajectories * step_per_trajectory, )) – Observed immediate rewards.
pscore (array-like of shape (n_trajectories * step_per_trajectory, )) – Conditional action choice probability of the behavior policy, i.e., \(\pi_b(a | s)\)
evaluation_policy_action_dist (array-like of shape (n_trajectories * step_per_trajectory, n_action)) – Conditional action distribution induced by the evaluation policy, i.e., \(\pi(a | s_t) \forall a \in \mathcal{A}\)
gamma (float, default=1.0) – Discount factor. The value should be within (0, 1].

Returns:

V_hat – Estimated policy value.

Return type:

float

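estimate_interval(step_per_trajectory, action, reward, pscore, evaluation_policy_action_dist, gamma=1.0, alpha=0.05, ci='bootstrap', n_bootstrap_samples=10000, random_state=None, **kwargs)

Estimate the confidence interval of the policy value by nonparametric bootstrap.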

Parameters:

step_per_trajectory (int (> 0)) – Number of timesteps in an episode.
action (array-like of shape (n_trajectories * step_per_trajectory, )) – Action chosen by the behavior policy.
reward (array-like of shape (n_trajectories * step_per_trajectory, )) – Observed immediate rewards.
pscore (array-like of shape (n_trajectories * step_per_trajectory, )) – Conditional action choice probability of the behavior policy, i.e., \(\pi_b(a | s)\)
evaluation_policy_action_dist (array-like of shape (n_trajectories * step_per_trajectory, n_action)) – Conditional action distribution induced by the evaluation policy, i.e., \(\pi(a | s_t) \forall a \in \mathcal{A}\)
gamma (float, default=1.0) – Discount factor. The value should be within (0, 1].
alpha (float, default=0.05) – Significance level. The value should be within [0, 1).
ci ({"bootstrap", "hoeffding", "bernstein", "ttest"}, default="bootstrap") – Name of the method to estimate the confidence interval.
n_bootstrap_samples (int, default=10000 (> 0)) – Number of resamplings performed in the bootstrap procedure.
random_state (int, default=None (>= 0)) – Random state.

Returns:

estimated_confidence_interval – Dictionary storing the estimated mean and upper-lower confidence bounds.

key: [
    mean,
    {100 * (1. - alpha)}% CI (lower),
    {100 * (1. - alpha)}% CI (upper),
]

Return type:

dict

class scope_rl.ope.discrete.basic_estimators.DoublyRobust[source]#

Doubly Robust (DR) for discrete action spaces.

Bases: scope_rl.ope.BaseOffPolicyEstimator

Imported as: scope_rl.ope.discrete.DoublyRobust

Note

DR estimates the policy value via step-wise importance weighting and estimated Q-function \(\hat{Q}\) as follows.

\[\hat{J}_{\mathrm{DR}} (\pi; \mathcal{D}) := \frac{1}{n} \sum_{i=1}^n \sum_{t=0}^{T-1} \gamma^t \left( w_{0:t}^{(i)} (r_t^{(i)} - \hat{Q}(s_t^{(i)}, a_t^{(i)})) + w_{0:t-1}^{(i)} \sum_{a \in \mathcal{A}} \pi(a | s_t^{(i)}) \hat{Q}(s_t^{(i)}, a) \right),\]

where \(w_{0:t} := \prod_{t'=0}^t (\pi(a_{t'} | s_{t'}) / \pi_0(a_{t'} | s_{t'}))\) is the per-decision importance weight.

DR is unbiased and has lower variance than PDIS when \(\hat{Q}(\cdot)\) is reasonably accurate and satisfies \(0 < \hat{Q}(\cdot) < 2 Q(\cdot)\). However, when the importance weights are large, DR may still suffer from high variance.
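The DR formula can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the SCOPE-RL implementation: `dr_value` is a hypothetical name, and \(w_{0:-1}\) (the weight before the first step) is taken to be 1.

```python
import numpy as np


def dr_value(step_per_trajectory, action, reward, pscore,
             evaluation_policy_action_dist, state_action_value_prediction,
             gamma=1.0):
    """Hypothetical helper: doubly robust estimate with per-decision weights."""
    T = step_per_trajectory
    action = np.asarray(action, dtype=int)
    rows = np.arange(action.shape[0])
    reward = np.asarray(reward, dtype=float).reshape(-1, T)
    pscore = np.asarray(pscore, dtype=float).reshape(-1, T)
    pi = np.asarray(evaluation_policy_action_dist, dtype=float)
    q_hat = np.asarray(state_action_value_prediction, dtype=float)
    ratio = pi[rows, action].reshape(-1, T) / pscore
    w = np.cumprod(ratio, axis=1)  # w_{0:t}
    # w_{0:t-1}, assuming the weight before the first step equals 1.
    w_prev = np.concatenate([np.ones((w.shape[0], 1)), w[:, :-1]], axis=1)
    q_taken = q_hat[rows, action].reshape(-1, T)      # Q_hat(s_t, a_t)
    v_hat = (pi * q_hat).sum(axis=1).reshape(-1, T)   # sum_a pi(a|s_t) Q_hat(s_t, a)
    discount = gamma ** np.arange(T)
    per_trajectory = (discount * (w * (reward - q_taken) + w_prev * v_hat)).sum(axis=1)
    return float(per_trajectory.mean())


# Toy data: 1 trajectory, 2 steps, 2 actions.
action = [0, 1]
reward = [1.0, 2.0]
pscore = [0.5, 0.25]
pi = np.array([[0.5, 0.5], [0.5, 0.5]])
q_hat = np.array([[1.0, 3.0], [2.0, 5.0]])
dr = dr_value(2, action, reward, pscore, pi, q_hat)
dr_zero_q = dr_value(2, action, reward, pscore, pi, np.zeros_like(q_hat))
```

With \(\hat{Q} \equiv 0\) the baseline terms vanish and the sketch reduces to PDIS, which is a convenient sanity check.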

Parameters:

estimator_name (str, default="dr") – Name of the estimator.

References

Nan Jiang and Lihong Li. “Doubly Robust Off-policy Value Evaluation for Reinforcement Learning.” 2016.

Philip S. Thomas and Emma Brunskill. “Data-Efficient Off-Policy Policy Evaluation for Reinforcement Learning.” 2016.

Methods

`estimate_interval`(step_per_trajectory, ...)	Estimate the confidence interval of the policy value by nonparametric bootstrap.
`estimate_policy_value`(step_per_trajectory, ...)	Estimate the policy value of the evaluation policy.

estimate_policy_value(step_per_trajectory, action, reward, pscore, evaluation_policy_action_dist, state_action_value_prediction, gamma=1.0, **kwargs)[source]#

Estimate the policy value of the evaluation policy.

Parameters:

step_per_trajectory (int (> 0)) – Number of timesteps in an episode.
action (array-like of shape (n_trajectories * step_per_trajectory, )) – Action chosen by the behavior policy.
reward (array-like of shape (n_trajectories * step_per_trajectory, )) – Observed immediate rewards.
pscore (array-like of shape (n_trajectories * step_per_trajectory, )) – Conditional action choice probability of the behavior policy, i.e., \(\pi_b(a | s)\)
evaluation_policy_action_dist (array-like of shape (n_trajectories * step_per_trajectory, n_action)) – Conditional action distribution induced by the evaluation policy, i.e., \(\pi(a | s_t) \forall a \in \mathcal{A}\)
state_action_value_prediction (array-like of shape (n_trajectories * step_per_trajectory, n_action)) – \(\hat{Q}\) for all actions, i.e., \(\hat{Q}(s_t, a) \forall a \in \mathcal{A}\).
gamma (float, default=1.0) – Discount factor. The value should be within (0, 1].

Returns:

V_hat – Estimated policy value.

Return type:

float

estimate_interval(step_per_trajectory, action, reward, pscore, evaluation_policy_action_dist, state_action_value_prediction, gamma=1.0, alpha=0.05, ci='bootstrap', n_bootstrap_samples=10000, random_state=None, **kwargs)[source]#

Estimate the confidence interval of the policy value by nonparametric bootstrap.

Parameters:

step_per_trajectory (int (> 0)) – Number of timesteps in an episode.
action (array-like of shape (n_trajectories * step_per_trajectory, )) – Action chosen by the behavior policy.
reward (array-like of shape (n_trajectories * step_per_trajectory, )) – Observed immediate rewards.
pscore (array-like of shape (n_trajectories * step_per_trajectory, )) – Conditional action choice probability of the behavior policy, i.e., \(\pi_b(a | s)\)
evaluation_policy_action_dist (array-like of shape (n_trajectories * step_per_trajectory, n_action)) – Conditional action distribution induced by the evaluation policy, i.e., \(\pi(a | s_t) \forall a \in \mathcal{A}\)
state_action_value_prediction (array-like of shape (n_trajectories * step_per_trajectory, n_action)) – \(\hat{Q}\) for all actions, i.e., \(\hat{Q}(s_t, a) \forall a \in \mathcal{A}\).
gamma (float, default=1.0) – Discount factor. The value should be within (0, 1].
alpha (float, default=0.05) – Significance level. The value should be within [0, 1).
ci ({"bootstrap", "hoeffding", "bernstein", "ttest"}, default="bootstrap") – Name of the method to estimate the confidence interval.
n_bootstrap_samples (int, default=10000 (> 0)) – Number of resamplings performed in the bootstrap procedure.
random_state (int, default=None (>= 0)) – Random state.

Returns:

estimated_confidence_interval – Dictionary storing the estimated mean and upper-lower confidence bounds.

key: [
    mean,
    {100 * (1. - alpha)}% CI (lower),
    {100 * (1. - alpha)}% CI (upper),
]

Return type:

dict

class scope_rl.ope.discrete.basic_estimators.SelfNormalizedTIS[source]#

Self-Normalized Trajectory-wise Importance Sampling (SNTIS) for discrete action spaces.

Bases: scope_rl.ope.discrete.TrajectoryWiseImportanceSampling -> scope_rl.ope.BaseOffPolicyEstimator

Imported as: scope_rl.ope.discrete.SelfNormalizedTIS

Note

SNTIS estimates the policy value via self-normalized trajectory-wise importance weighting as follows.

\[\hat{J}_{\mathrm{SNTIS}} (\pi; \mathcal{D}) := \sum_{i=1}^n \sum_{t=0}^{T-1} \gamma^t \frac{w_{0:T-1}^{(i)}}{\sum_{i'=1}^n w_{0:T-1}^{(i')}} r_t^{(i)},\]

where \(w_{0:T-1} := \prod_{t=0}^{T-1} (\pi(a_t | s_t) / \pi_0(a_t | s_t))\) is the trajectory-wise importance weight.

The self-normalized estimator is no longer unbiased, but has variance bounded by \(r_{max}^2\) while also remaining consistent.
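The self-normalization in SNTIS can be sketched in NumPy as follows. This is an illustrative example rather than the library's code; `sntis_value` is a hypothetical name, and the sketch assumes at least one trajectory has a nonzero weight.

```python
import numpy as np


def sntis_value(step_per_trajectory, action, reward, pscore,
                evaluation_policy_action_dist, gamma=1.0):
    """Hypothetical helper: self-normalized trajectory-wise IS estimate."""
    T = step_per_trajectory
    action = np.asarray(action, dtype=int)
    reward = np.asarray(reward, dtype=float).reshape(-1, T)
    pscore = np.asarray(pscore, dtype=float).reshape(-1, T)
    pi = np.asarray(evaluation_policy_action_dist, dtype=float)
    pi_taken = pi[np.arange(action.shape[0]), action].reshape(-1, T)
    w_traj = (pi_taken / pscore).prod(axis=1)
    # Divide by the sum of weights instead of n (assumes the sum is nonzero).
    w_normalized = w_traj / w_traj.sum()
    discount = gamma ** np.arange(T)
    discounted_return = (discount * reward).sum(axis=1)
    return float((w_normalized * discounted_return).sum())


# Toy data: 2 trajectories, 2 steps each, 2 actions.
action = [0, 1, 1, 0]
reward = [1.0, 2.0, 3.0, 4.0]
pscore = [0.5, 0.5, 0.5, 0.5]
pi = np.array([[1.0, 0.0], [0.5, 0.5], [0.5, 0.5], [0.5, 0.5]])
sntis = sntis_value(2, action, reward, pscore, pi)
```

Since the normalized weights sum to one, the estimate is a weighted average of the observed returns and cannot exceed the largest return in the data.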

Parameters:

estimator_name (str, default="sntis") – Name of the estimator.

References

Nathan Kallus and Masatoshi Uehara. “Intrinsically Efficient, Stable, and Bounded Off-Policy Evaluation for Reinforcement Learning.” 2019.

Doina Precup, Richard S. Sutton, and Satinder P. Singh. “Eligibility Traces for Off-Policy Policy Evaluation.” 2000.

Methods

`estimate_interval`(step_per_trajectory, ...)	Estimate the confidence interval of the policy value by nonparametric bootstrap.
`estimate_policy_value`(step_per_trajectory, ...)	Estimate the policy value of the evaluation policy.

class scope_rl.ope.discrete.basic_estimators.SelfNormalizedPDIS[source]#

Self-Normalized Per-Decision Importance Sampling (SNPDIS) for discrete action spaces.

Bases: scope_rl.ope.discrete.PerDecisionImportanceSampling -> scope_rl.ope.BaseOffPolicyEstimator

Imported as: scope_rl.ope.discrete.SelfNormalizedPDIS

Note

SNPDIS estimates the policy value via self-normalized step-wise importance weighting as follows.

\[\hat{J}_{\mathrm{SNPDIS}} (\pi; \mathcal{D}) := \sum_{i=1}^n \sum_{t=0}^{T-1} \gamma^t \frac{w_{0:t}^{(i)}}{\sum_{i'=1}^n w_{0:t}^{(i')}} r_t^{(i)},\]

where \(w_{0:t} := \prod_{t'=0}^t (\pi(a_{t'} | s_{t'}) / \pi_0(a_{t'} | s_{t'}))\) is the per-decision importance weight.

The self-normalized estimator is no longer unbiased, but has variance bounded by \(r_{max}^2\) while also remaining consistent.
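A minimal NumPy sketch of SNPDIS is shown below. It is not the SCOPE-RL implementation; `snpdis_value` is a hypothetical name, and the per-decision weights are assumed to be normalized separately at each time step across trajectories, with a nonzero sum at every step.

```python
import numpy as np


def snpdis_value(step_per_trajectory, action, reward, pscore,
                 evaluation_policy_action_dist, gamma=1.0):
    """Hypothetical helper: self-normalized per-decision IS estimate."""
    T = step_per_trajectory
    action = np.asarray(action, dtype=int)
    reward = np.asarray(reward, dtype=float).reshape(-1, T)
    pscore = np.asarray(pscore, dtype=float).reshape(-1, T)
    pi = np.asarray(evaluation_policy_action_dist, dtype=float)
    pi_taken = pi[np.arange(action.shape[0]), action].reshape(-1, T)
    w_cum = np.cumprod(pi_taken / pscore, axis=1)
    # Normalize each time step's weights over trajectories (column-wise).
    w_normalized = w_cum / w_cum.sum(axis=0, keepdims=True)
    discount = gamma ** np.arange(T)
    return float((discount * w_normalized * reward).sum())


# Toy data: 2 trajectories, 2 steps each, 2 actions.
action = [0, 0, 1, 0]
reward = [1.0, 2.0, 3.0, 4.0]
pscore = [0.5, 0.5, 0.5, 0.5]
pi = np.array([[0.5, 0.5], [1.0, 0.0], [0.5, 0.5], [0.5, 0.5]])
snpdis = snpdis_value(2, action, reward, pscore, pi)
```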

Parameters:

estimator_name (str, default="snpdis") – Name of the estimator.

References

Nathan Kallus and Masatoshi Uehara. “Intrinsically Efficient, Stable, and Bounded Off-Policy Evaluation for Reinforcement Learning.” 2019.

Doina Precup, Richard S. Sutton, and Satinder P. Singh. “Eligibility Traces for Off-Policy Policy Evaluation.” 2000.

Methods

`estimate_interval`(step_per_trajectory, ...)	Estimate the confidence interval of the policy value by nonparametric bootstrap.
`estimate_policy_value`(step_per_trajectory, ...)	Estimate the policy value of the evaluation policy.

class scope_rl.ope.discrete.basic_estimators.SelfNormalizedDR[source]#

Self-Normalized Doubly Robust (SNDR) for discrete action spaces.

Bases: scope_rl.ope.discrete.DoublyRobust -> scope_rl.ope.BaseOffPolicyEstimator

Imported as: scope_rl.ope.discrete.SelfNormalizedDR

Note

SNDR estimates the policy value via self-normalized step-wise importance weighting and estimated Q-function \(\hat{Q}\) as follows.

\[\hat{J}_{\mathrm{SNDR}} (\pi; \mathcal{D}) := \sum_{i=1}^n \sum_{t=0}^{T-1} \gamma^t \left( \frac{w_{0:t}^{(i)}}{\sum_{i'=1}^n w_{0:t}^{(i')}} (r_t^{(i)} - \hat{Q}(s_t^{(i)}, a_t^{(i)})) + \frac{w_{0:t-1}^{(i)}}{\sum_{i'=1}^n w_{0:t-1}^{(i')}} \sum_{a \in \mathcal{A}} \pi(a | s_t^{(i)}) \hat{Q}(s_t^{(i)}, a) \right),\]

where \(w_{0:t} := \prod_{t'=0}^t (\pi(a_{t'} | s_{t'}) / \pi_0(a_{t'} | s_{t'}))\) is the per-decision importance weight.

The self-normalized estimator is no longer unbiased, but has variance bounded by \(r_{max}^2\) while also remaining consistent.
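The SNDR formula can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the library's code: `sndr_value` is a hypothetical name, \(w_{0:-1}\) is taken to be 1 for every trajectory, and each step's weights are assumed to have a nonzero sum.

```python
import numpy as np


def sndr_value(step_per_trajectory, action, reward, pscore,
               evaluation_policy_action_dist, state_action_value_prediction,
               gamma=1.0):
    """Hypothetical helper: self-normalized doubly robust estimate."""
    T = step_per_trajectory
    action = np.asarray(action, dtype=int)
    rows = np.arange(action.shape[0])
    reward = np.asarray(reward, dtype=float).reshape(-1, T)
    pscore = np.asarray(pscore, dtype=float).reshape(-1, T)
    pi = np.asarray(evaluation_policy_action_dist, dtype=float)
    q_hat = np.asarray(state_action_value_prediction, dtype=float)
    ratio = pi[rows, action].reshape(-1, T) / pscore
    w = np.cumprod(ratio, axis=1)  # w_{0:t}
    # w_{0:t-1}, assuming the weight before the first step equals 1.
    w_prev = np.concatenate([np.ones((w.shape[0], 1)), w[:, :-1]], axis=1)
    # Normalize both weight arrays per time step across trajectories.
    w_sn = w / w.sum(axis=0, keepdims=True)
    w_prev_sn = w_prev / w_prev.sum(axis=0, keepdims=True)
    q_taken = q_hat[rows, action].reshape(-1, T)
    v_hat = (pi * q_hat).sum(axis=1).reshape(-1, T)
    discount = gamma ** np.arange(T)
    return float((discount * (w_sn * (reward - q_taken) + w_prev_sn * v_hat)).sum())


# Toy data: 2 trajectories, 2 steps each, 2 actions.
action = [0, 1, 1, 0]
reward = [1.0, 2.0, 3.0, 4.0]
pscore = [0.5, 0.25, 0.5, 0.5]
pi = np.array([[0.5, 0.5]] * 4)
q_hat = np.array([[1.0, 3.0], [2.0, 5.0], [1.0, 3.0], [2.0, 5.0]])
sndr = sndr_value(2, action, reward, pscore, pi, q_hat)
```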

Parameters:

estimator_name (str, default="sndr") – Name of the estimator.

References

Nathan Kallus and Masatoshi Uehara. “Intrinsically Efficient, Stable, and Bounded Off-Policy Evaluation for Reinforcement Learning.” 2019.

Nan Jiang and Lihong Li. “Doubly Robust Off-policy Value Evaluation for Reinforcement Learning.” 2016.

Philip S. Thomas and Emma Brunskill. “Data-Efficient Off-Policy Policy Evaluation for Reinforcement Learning.” 2016.

Methods

`estimate_interval`(step_per_trajectory, ...)	Estimate the confidence interval of the policy value by nonparametric bootstrap.
`estimate_policy_value`(step_per_trajectory, ...)	Estimate the policy value of the evaluation policy.