scope_rl.ope.discrete.cumulative_distribution_estimators

Cumulative Distribution Off-Policy Estimators for discrete action cases.

Classes

CumulativeDistributionDM

Direct Method (DM) for estimating the cumulative distribution function (CDF) for discrete action spaces.

CumulativeDistributionSNTDR

Self Normalized Trajectory-wise Doubly Robust (SNTDR) for estimating the cumulative distribution function (CDF) for discrete action spaces.

CumulativeDistributionSNTIS

Self Normalized Trajectory-wise Importance Sampling (SNTIS) for estimating the cumulative distribution function (CDF) for discrete action spaces.

CumulativeDistributionTDR

Trajectory-wise Doubly Robust (TDR) for estimating the cumulative distribution function (CDF) for discrete action spaces.

CumulativeDistributionTIS

Trajectory-wise Importance Sampling (TIS) for estimating the cumulative distribution function (CDF) for discrete action spaces.

class scope_rl.ope.discrete.cumulative_distribution_estimators.CumulativeDistributionDM(estimator_name='cdf_dm')

Direct Method (DM) for estimating the cumulative distribution function (CDF) for discrete action spaces.

Bases: scope_rl.ope.BaseCumulativeDistributionOPEEstimator

Imported as: scope_rl.ope.discrete.CumulativeDistributionDM

Note

DM estimates the CDF using the initial state value as follows.

\[\hat{F}_{\mathrm{DM}}(m, \pi; \mathcal{D}) := \frac{1}{n} \sum_{i=1}^n \sum_{a \in \mathcal{A}} \pi(a \mid s_0^{(i)}) \hat{G}(m; s_0^{(i)}, a)\]

where \(\hat{F}(\cdot)\) is the estimated cumulative distribution function and \(\hat{G}(\cdot)\) is an estimator for \(\mathbb{E} \left[ \mathbb{I} \left \{\sum_{t=0}^{T-1} \gamma^t r_t \leq m \right \} \mid s,a \right]\).

DM has low variance compared to other estimators, but can produce larger bias due to approximation errors.

There are several methods to estimate \(\hat{Q}(s, a)\) such as Fitted Q Evaluation (FQE) (Le et al., 2019) and Minimax Q-Function Learning (MQL) (Uehara et al., 2020).
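The DM formula above can be sketched in NumPy as follows. This is a minimal illustration rather than the library's implementation: the helper name dm_cdf and its arguments are hypothetical, and \(\hat{G}\) is assumed to be already evaluated at the initial state of each trajectory for every action and every threshold \(m\).

```python
import numpy as np


def dm_cdf(initial_policy_dist, initial_g_hat):
    """Hypothetical helper: DM estimate of the CDF on a grid of thresholds.

    initial_policy_dist: shape (n, n_action), pi(a | s_0^{(i)})
    initial_g_hat: shape (n, n_action, n_partition), G_hat(m; s_0^{(i)}, a)
    """
    # Expectation over actions under the evaluation policy, per trajectory
    per_trajectory = np.einsum("ia,iam->im", initial_policy_dist, initial_g_hat)
    # Average over the n logged trajectories
    return per_trajectory.mean(axis=0)


# Toy inputs (made up for illustration): 2 trajectories, 2 actions, 3 thresholds
pi_initial = np.array([[0.5, 0.5], [1.0, 0.0]])
g_hat = np.array(
    [
        [[0.0, 0.5, 1.0], [0.0, 0.0, 1.0]],
        [[0.2, 0.6, 1.0], [0.0, 0.0, 0.0]],
    ]
)
dm_estimate = dm_cdf(pi_initial, g_hat)
```

Because no importance weight appears, the estimate depends on the logged data only through the initial states, which is where both the low variance and the sensitivity to errors in \(\hat{G}\) come from.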

See also

The implementation of FQE is provided by d3rlpy. The implementation of Minimax Learning is available at scope_rl.ope.weight_value_learning.

Parameters:

estimator_name (str, default="cdf_dm") – Name of the estimator.

References

Yash Chandak, Scott Niekum, Bruno Castro da Silva, Erik Learned-Miller, Emma Brunskill, and Philip S. Thomas. “Universal Off-Policy Evaluation.” 2021.

Audrey Huang, Liu Leqi, Zachary C. Lipton, and Kamyar Azizzadenesheli. “Off-Policy Risk Assessment in Contextual Bandits.” 2021.

Masatoshi Uehara, Jiawei Huang, and Nan Jiang. “Minimax Weight and Q-Function Learning for Off-Policy Evaluation.” 2020.

Hoang Le, Cameron Voloshin, and Yisong Yue. “Batch Policy Learning under Constraints.” 2019.

Methods

estimate_conditional_value_at_risk(...[, ...])

Estimate conditional value at risk.

estimate_cumulative_distribution_function(...)

Estimate the cumulative distribution function (CDF) of the reward distribution under the evaluation policy.

estimate_interquartile_range(...[, gamma, alpha])

Estimate interquartile range.

estimate_mean(step_per_trajectory, reward, ...)

Estimate mean.

estimate_variance(step_per_trajectory, ...)

Estimate variance.

estimate_cumulative_distribution_function(step_per_trajectory, reward, evaluation_policy_action_dist, state_action_value_prediction, reward_scale, gamma=1.0, **kwargs)

Estimate the cumulative distribution function (CDF) of the reward distribution under the evaluation policy.

Parameters:
  • step_per_trajectory (int (> 0)) – Number of timesteps in an episode.

  • reward (array-like of shape (n_trajectories * step_per_trajectory, )) – Observed immediate rewards.

  • evaluation_policy_action_dist (array-like of shape (n_trajectories * step_per_trajectory, n_action)) – Conditional action distribution induced by the evaluation policy, i.e., \(\pi(a \mid s_t) \forall a \in \mathcal{A}\)

  • state_action_value_prediction (array-like of shape (n_trajectories * step_per_trajectory, n_action)) – \(\hat{Q}\) for all actions, i.e., \(\hat{Q}(s_t, a) \forall a \in \mathcal{A}\).

  • reward_scale (array-like of shape (n_partition, )) – Scale of the trajectory-wise reward used for x-axis of the CDF plot.

  • gamma (float, default=1.0) – Discount factor. The value should be within (0, 1].

Returns:

estimated_cumulative_distribution_function – Estimated cumulative distribution function for the pre-defined reward scale.

Return type:

ndarray of shape (n_partition, ) or (n_episode, )
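The array arguments above are documented as flattened over trajectories and timesteps. A minimal sketch of how such an input could be reshaped into per-trajectory discounted returns and compared against reward_scale is given below. It assumes that the timesteps of each trajectory are stored contiguously; the helper discounted_trajectory_returns is hypothetical and not part of scope_rl.

```python
import numpy as np


def discounted_trajectory_returns(reward, step_per_trajectory, gamma=1.0):
    """Hypothetical helper: sum_t gamma^t r_t for every trajectory."""
    # (n_trajectories * step_per_trajectory,) -> (n_trajectories, step_per_trajectory)
    reward = np.asarray(reward, dtype=float).reshape(-1, step_per_trajectory)
    discount = gamma ** np.arange(step_per_trajectory)
    return (reward * discount).sum(axis=1)


# Toy data: 2 trajectories of 2 steps each
reward = np.array([1.0, 1.0, 0.0, 2.0])
returns = discounted_trajectory_returns(reward, step_per_trajectory=2, gamma=0.5)

# Indicator I{return <= m} for every threshold m on the reward scale
reward_scale = np.array([0.0, 1.0, 2.0])
indicator = returns[:, None] <= reward_scale[None, :]
```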

estimate_mean(step_per_trajectory, reward, evaluation_policy_action_dist, state_action_value_prediction, reward_scale, gamma=1.0, **kwargs)

Estimate mean.

Parameters:
  • step_per_trajectory (int (> 0)) – Number of timesteps in an episode.

  • reward (array-like of shape (n_trajectories * step_per_trajectory, )) – Observed immediate rewards.

  • evaluation_policy_action_dist (array-like of shape (n_trajectories * step_per_trajectory, n_action)) – Conditional action distribution induced by the evaluation policy, i.e., \(\pi(a \mid s_t) \forall a \in \mathcal{A}\)

  • state_action_value_prediction (array-like of shape (n_trajectories * step_per_trajectory, n_action)) – \(\hat{Q}\) for all actions, i.e., \(\hat{Q}(s_t, a) \forall a \in \mathcal{A}\).

  • reward_scale (array-like of shape (n_partition, )) – Scale of the trajectory-wise reward used for x-axis of the CDF plot.

  • gamma (float, default=1.0) – Discount factor. The value should be within (0, 1].

Returns:

estimated_mean – Estimated mean of the reward under the evaluation policy.

Return type:

float
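As an illustration of how a mean can be read off an estimated CDF, the sketch below treats the increments of the CDF over reward_scale as a probability mass function. This is one simple discretization and not necessarily the computation performed by the library; mean_from_cdf is a hypothetical helper.

```python
import numpy as np


def mean_from_cdf(cdf, reward_scale):
    """Hypothetical helper: mean of a distribution given its CDF on a grid."""
    cdf = np.asarray(cdf, dtype=float)
    reward_scale = np.asarray(reward_scale, dtype=float)
    # Probability mass assigned to each grid point
    pmf = np.diff(cdf, prepend=0.0)
    return float(np.sum(pmf * reward_scale))


cdf = np.array([0.25, 0.75, 1.0])
reward_scale = np.array([0.0, 1.0, 2.0])
estimated_mean = mean_from_cdf(cdf, reward_scale)  # 0*0.25 + 1*0.5 + 2*0.25
```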

estimate_variance(step_per_trajectory, reward, evaluation_policy_action_dist, state_action_value_prediction, reward_scale, gamma=1.0, **kwargs)

Estimate variance.

Parameters:
  • step_per_trajectory (int (> 0)) – Number of timesteps in an episode.

  • reward (array-like of shape (n_trajectories * step_per_trajectory, )) – Observed immediate rewards.

  • evaluation_policy_action_dist (array-like of shape (n_trajectories * step_per_trajectory, n_action)) – Conditional action distribution induced by the evaluation policy, i.e., \(\pi(a \mid s_t) \forall a \in \mathcal{A}\)

  • state_action_value_prediction (array-like of shape (n_trajectories * step_per_trajectory, n_action)) – \(\hat{Q}\) for all actions, i.e., \(\hat{Q}(s_t, a) \forall a \in \mathcal{A}\).

  • reward_scale (array-like of shape (n_partition, )) – Scale of the trajectory-wise reward used for x-axis of the CDF plot.

  • gamma (float, default=1.0) – Discount factor. The value should be within (0, 1].

Returns:

estimated_variance – Estimated variance of the reward under the evaluation policy.

Return type:

float
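A variance can be derived from an estimated CDF in the same spirit. The sketch below is an assumption about one reasonable discretization, not a description of the library internals; variance_from_cdf is a hypothetical helper.

```python
import numpy as np


def variance_from_cdf(cdf, reward_scale):
    """Hypothetical helper: variance of a distribution given its CDF on a grid."""
    cdf = np.asarray(cdf, dtype=float)
    reward_scale = np.asarray(reward_scale, dtype=float)
    # Probability mass assigned to each grid point
    pmf = np.diff(cdf, prepend=0.0)
    mean = np.sum(pmf * reward_scale)
    # Second central moment under the discretized distribution
    return float(np.sum(pmf * (reward_scale - mean) ** 2))


cdf = np.array([0.25, 0.75, 1.0])
reward_scale = np.array([0.0, 1.0, 2.0])
estimated_variance = variance_from_cdf(cdf, reward_scale)
```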

estimate_conditional_value_at_risk(step_per_trajectory, reward, evaluation_policy_action_dist, state_action_value_prediction, reward_scale, gamma=1.0, alphas=None, **kwargs)

Estimate conditional value at risk.

Parameters:
  • step_per_trajectory (int (> 0)) – Number of timesteps in an episode.

  • reward (array-like of shape (n_trajectories * step_per_trajectory, )) – Observed immediate rewards.

  • evaluation_policy_action_dist (array-like of shape (n_trajectories * step_per_trajectory, n_action)) – Conditional action distribution induced by the evaluation policy, i.e., \(\pi(a \mid s_t) \forall a \in \mathcal{A}\)

  • state_action_value_prediction (array-like of shape (n_trajectories * step_per_trajectory, n_action)) – \(\hat{Q}\) for all actions, i.e., \(\hat{Q}(s_t, a) \forall a \in \mathcal{A}\).

  • reward_scale (array-like of shape (n_partition, )) – Scale of the trajectory-wise reward used for x-axis of the CDF plot.

  • gamma (float, default=1.0) – Discount factor. The value should be within (0, 1].

  • alphas (array-like of shape (n_alpha, ), default=None) – Set of proportions of the shaded region. The values should be within [0, 1). If None is given, np.linspace(0, 1, 21) will be used.

Returns:

estimated_conditional_value_at_risk – Estimated conditional value at risk (CVaR) of the reward under the evaluation policy.

Return type:

ndarray of shape (n_alpha, )
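For intuition, CVaR at level \(\alpha\) is the expected trajectory-wise reward within the worst \(\alpha\) fraction of outcomes. The sketch below computes it from a CDF on a grid by clipping the CDF at \(\alpha\); it is an illustrative assumption rather than the library's code, cvar_from_cdf is a hypothetical helper, and it requires every \(\alpha\) to be strictly positive.

```python
import numpy as np


def cvar_from_cdf(cdf, reward_scale, alphas):
    """Hypothetical helper: lower-tail CVaR for each alpha (alpha > 0 assumed)."""
    cdf = np.asarray(cdf, dtype=float)
    reward_scale = np.asarray(reward_scale, dtype=float)
    alphas = np.asarray(alphas, dtype=float)
    # Clip the CDF at alpha so that only the lower alpha-tail carries mass
    clipped = np.minimum(cdf[None, :], alphas[:, None])  # (n_alpha, n_partition)
    tail_mass = np.diff(clipped, prepend=0.0, axis=1)
    # Renormalize the tail mass by alpha to obtain a conditional expectation
    return (tail_mass * reward_scale[None, :]).sum(axis=1) / alphas


cdf = np.array([0.25, 0.75, 1.0])
reward_scale = np.array([0.0, 1.0, 2.0])
cvar = cvar_from_cdf(cdf, reward_scale, alphas=[0.25, 0.5, 1.0])
```

With \(\alpha = 1\) the tail covers the whole distribution, so the last entry coincides with the mean of the discretized distribution.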

estimate_interquartile_range(step_per_trajectory, reward, evaluation_policy_action_dist, state_action_value_prediction, reward_scale, gamma=1.0, alpha=0.05, **kwargs)

Estimate interquartile range.

Parameters:
  • step_per_trajectory (int (> 0)) – Number of timesteps in an episode.

  • reward (array-like of shape (n_trajectories * step_per_trajectory, )) – Observed immediate rewards.

  • evaluation_policy_action_dist (array-like of shape (n_trajectories * step_per_trajectory, n_action)) – Conditional action distribution induced by the evaluation policy, i.e., \(\pi(a \mid s_t) \forall a \in \mathcal{A}\)

  • state_action_value_prediction (array-like of shape (n_trajectories * step_per_trajectory, n_action)) – \(\hat{Q}\) for all actions, i.e., \(\hat{Q}(s_t, a) \forall a \in \mathcal{A}\).

  • reward_scale (array-like of shape (n_partition, )) – Scale of the trajectory-wise reward used for x-axis of the CDF plot.

  • gamma (float, default=1.0) – Discount factor. The value should be within (0, 1].

  • alpha (float, default=0.05) – Proportion of the shaded region.

Returns:

estimated_interquartile_range – Estimated interquartile range of the reward under the evaluation policy.

key: [
    mean,
    {100 * (1. - alpha)}% quartile (lower),
    {100 * (1. - alpha)}% quartile (upper),
]

Return type:

dict
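A sketch of how lower and upper quantiles could be extracted from an estimated CDF is given below. It assumes a non-decreasing CDF and uses the smallest grid point whose CDF value reaches the requested level. The helper interquartile_from_cdf and the dictionary keys "lower" and "upper" are hypothetical and chosen for readability; they do not reproduce the key names returned by the library.

```python
import numpy as np


def interquartile_from_cdf(cdf, reward_scale, alpha=0.05):
    """Hypothetical helper: mean plus the alpha and (1 - alpha) quantiles."""
    cdf = np.asarray(cdf, dtype=float)
    reward_scale = np.asarray(reward_scale, dtype=float)

    def quantile(level):
        # First grid index whose CDF value is >= level (CDF assumed sorted)
        idx = int(np.searchsorted(cdf, level, side="left"))
        return float(reward_scale[min(idx, len(reward_scale) - 1)])

    pmf = np.diff(cdf, prepend=0.0)
    return {
        "mean": float(np.sum(pmf * reward_scale)),
        "lower": quantile(alpha),
        "upper": quantile(1.0 - alpha),
    }


cdf = np.array([0.1, 0.3, 0.7, 0.95, 1.0])
reward_scale = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
summary = interquartile_from_cdf(cdf, reward_scale, alpha=0.25)
```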

class scope_rl.ope.discrete.cumulative_distribution_estimators.CumulativeDistributionTIS(estimator_name='cdf_tis')

Trajectory-wise Importance Sampling (TIS) for estimating the cumulative distribution function (CDF) for discrete action spaces.

Bases: scope_rl.ope.BaseCumulativeDistributionOPEEstimator

Imported as: scope_rl.ope.discrete.CumulativeDistributionTIS

Note

TIS estimates the CDF via trajectory-wise importance weighting as follows.

\[\hat{F}_{\mathrm{TIS}}(m, \pi; \mathcal{D}) := \frac{1}{n} \sum_{i=1}^n w_{0:T-1}^{(i)} \mathbb{I} \left \{\sum_{t=0}^{T-1} \gamma^t r_t^{(i)} \leq m \right \}\]

where \(\hat{F}(\cdot)\) is the estimated cumulative distribution function, \(w_{0:T-1} := \prod_{t=0}^{T-1} (\pi(a_t \mid s_t) / \pi_0(a_t \mid s_t))\) is the trajectory-wise importance weight, and \(\mathbb{I} \{ \cdot \}\) is the indicator function.

TIS enables unbiased estimation of the CDF. However, when the trajectory length (\(T\)) is large, TIS suffers from high variance due to the product of importance weights over the entire horizon.
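The TIS formula above can be sketched directly from the documented input layout. The function below is a hypothetical stand-alone helper (tis_cdf is not the library's method), and it assumes that the timesteps of each trajectory are stored contiguously in the flattened arrays.

```python
import numpy as np


def tis_cdf(step_per_trajectory, action, reward, pscore,
            evaluation_policy_action_dist, reward_scale, gamma=1.0):
    """Hypothetical helper: TIS estimate of the CDF on the reward scale."""
    T = step_per_trajectory
    action = np.asarray(action, dtype=int)
    # pi(a_t | s_t) evaluated at the logged action of every step
    pi_logged = np.asarray(evaluation_policy_action_dist)[np.arange(len(action)), action]
    # Trajectory-wise importance weight w_{0:T-1}
    weight = (pi_logged / np.asarray(pscore)).reshape(-1, T).prod(axis=1)
    # Discounted return of every trajectory
    discount = gamma ** np.arange(T)
    returns = (np.asarray(reward, dtype=float).reshape(-1, T) * discount).sum(axis=1)
    # I{return <= m} for every threshold m, weighted and averaged over trajectories
    indicator = returns[:, None] <= np.asarray(reward_scale)[None, :]
    return (weight[:, None] * indicator).mean(axis=0)


# Toy data: 2 trajectories, 2 steps, 2 actions
tis_estimate = tis_cdf(
    step_per_trajectory=2,
    action=[0, 1, 1, 1],
    reward=[1.0, 0.0, 0.0, 2.0],
    pscore=[0.5, 0.5, 0.5, 0.5],
    evaluation_policy_action_dist=[[1.0, 0.0], [0.5, 0.5], [0.5, 0.5], [0.5, 0.5]],
    reward_scale=[0.0, 1.0, 2.0],
)
```

In this toy example the last value exceeds 1 because the importance weights do not average to one in a finite sample, so the raw TIS estimate is not guaranteed to be a valid CDF.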

Parameters:

estimator_name (str, default="cdf_tis") – Name of the estimator.

References

Yash Chandak, Scott Niekum, Bruno Castro da Silva, Erik Learned-Miller, Emma Brunskill, and Philip S. Thomas. “Universal Off-Policy Evaluation.” 2021.

Audrey Huang, Liu Leqi, Zachary C. Lipton, and Kamyar Azizzadenesheli. “Off-Policy Risk Assessment in Contextual Bandits.” 2021.

Methods

estimate_conditional_value_at_risk(...[, ...])

Estimate conditional value at risk.

estimate_cumulative_distribution_function(...)

Estimate the cumulative distribution function (CDF) of the reward distribution under the evaluation policy.

estimate_interquartile_range(...[, gamma, alpha])

Estimate interquartile range.

estimate_mean(step_per_trajectory, action, ...)

Estimate mean.

estimate_variance(step_per_trajectory, ...)

Estimate variance.

estimate_cumulative_distribution_function(step_per_trajectory, action, reward, pscore, evaluation_policy_action_dist, reward_scale, gamma=1.0, **kwargs)

Estimate the cumulative distribution function (CDF) of the reward distribution under the evaluation policy.

Parameters:
  • step_per_trajectory (int (> 0)) – Number of timesteps in an episode.

  • action (array-like of shape (n_trajectories * step_per_trajectory, )) – Action chosen by the behavior policy.

  • reward (array-like of shape (n_trajectories * step_per_trajectory, )) – Observed immediate rewards.

  • pscore (array-like of shape (n_trajectories * step_per_trajectory, )) – Conditional action choice probability of the behavior policy, i.e., \(\pi_0(a \mid s)\)

  • evaluation_policy_action_dist (array-like of shape (n_trajectories * step_per_trajectory, n_action)) – Conditional action distribution induced by the evaluation policy, i.e., \(\pi(a \mid s_t) \forall a \in \mathcal{A}\)

  • reward_scale (array-like of shape (n_partition, )) – Scale of the trajectory-wise reward used for x-axis of the CDF plot.

  • gamma (float, default=1.0) – Discount factor. The value should be within (0, 1].

Returns:

estimated_cumulative_distribution_function – Estimated cumulative distribution function for the pre-defined reward scale.

Return type:

ndarray of shape (n_partition, ) or (n_episode, )

estimate_mean(step_per_trajectory, action, reward, pscore, evaluation_policy_action_dist, reward_scale, gamma=1.0, **kwargs)

Estimate mean.

Parameters:
  • step_per_trajectory (int (> 0)) – Number of timesteps in an episode.

  • action (array-like of shape (n_trajectories * step_per_trajectory, )) – Action chosen by the behavior policy.

  • reward (array-like of shape (n_trajectories * step_per_trajectory, )) – Observed immediate rewards.

  • pscore (array-like of shape (n_trajectories * step_per_trajectory, )) – Conditional action choice probability of the behavior policy, i.e., \(\pi_0(a \mid s)\)

  • evaluation_policy_action_dist (array-like of shape (n_trajectories * step_per_trajectory, n_action)) – Conditional action distribution induced by the evaluation policy, i.e., \(\pi(a \mid s_t) \forall a \in \mathcal{A}\)

  • reward_scale (array-like of shape (n_partition, )) – Scale of the trajectory-wise reward used for x-axis of the CDF plot.

  • gamma (float, default=1.0) – Discount factor. The value should be within (0, 1].

Returns:

estimated_mean – Estimated mean of the reward under the evaluation policy.

Return type:

float

estimate_variance(step_per_trajectory, action, reward, pscore, evaluation_policy_action_dist, reward_scale, gamma=1.0, **kwargs)

Estimate variance.

Parameters:
  • step_per_trajectory (int (> 0)) – Number of timesteps in an episode.

  • action (array-like of shape (n_trajectories * step_per_trajectory, )) – Action chosen by the behavior policy.

  • reward (array-like of shape (n_trajectories * step_per_trajectory, )) – Observed immediate rewards.

  • pscore (array-like of shape (n_trajectories * step_per_trajectory, )) – Conditional action choice probability of the behavior policy, i.e., \(\pi_0(a \mid s)\)

  • evaluation_policy_action_dist (array-like of shape (n_trajectories * step_per_trajectory, n_action)) – Conditional action distribution induced by the evaluation policy, i.e., \(\pi(a \mid s_t) \forall a \in \mathcal{A}\)

  • reward_scale (array-like of shape (n_partition, )) – Scale of the trajectory-wise reward used for x-axis of the CDF plot.

  • gamma (float, default=1.0) – Discount factor. The value should be within (0, 1].

Returns:

estimated_variance – Estimated variance of the reward under the evaluation policy.

Return type:

float

estimate_conditional_value_at_risk(step_per_trajectory, action, reward, pscore, evaluation_policy_action_dist, reward_scale, gamma=1.0, alphas=None, **kwargs)

Estimate conditional value at risk.

Parameters:
  • step_per_trajectory (int (> 0)) – Number of timesteps in an episode.

  • action (array-like of shape (n_trajectories * step_per_trajectory, )) – Action chosen by the behavior policy.

  • reward (array-like of shape (n_trajectories * step_per_trajectory, )) – Observed immediate rewards.

  • pscore (array-like of shape (n_trajectories * step_per_trajectory, )) – Conditional action choice probability of the behavior policy, i.e., \(\pi_0(a \mid s)\)

  • evaluation_policy_action_dist (array-like of shape (n_trajectories * step_per_trajectory, n_action)) – Conditional action distribution induced by the evaluation policy, i.e., \(\pi(a \mid s_t) \forall a \in \mathcal{A}\)

  • reward_scale (array-like of shape (n_partition, )) – Scale of the trajectory-wise reward used for x-axis of the CDF plot.

  • gamma (float, default=1.0) – Discount factor. The value should be within (0, 1].

  • alphas (array-like of shape (n_alpha, ), default=None) – Set of proportions of the shaded region. The values should be within [0, 1). If None is given, np.linspace(0, 1, 21) will be used.

Returns:

estimated_conditional_value_at_risk – Estimated conditional value at risk (CVaR) of the reward under the evaluation policy.

Return type:

ndarray of shape (n_alpha, )

estimate_interquartile_range(step_per_trajectory, action, reward, pscore, evaluation_policy_action_dist, reward_scale, gamma=1.0, alpha=0.05, **kwargs)

Estimate interquartile range.

Parameters:
  • step_per_trajectory (int (> 0)) – Number of timesteps in an episode.

  • action (array-like of shape (n_trajectories * step_per_trajectory, )) – Action chosen by the behavior policy.

  • reward (array-like of shape (n_trajectories * step_per_trajectory, )) – Observed immediate rewards.

  • pscore (array-like of shape (n_trajectories * step_per_trajectory, )) – Conditional action choice probability of the behavior policy, i.e., \(\pi_0(a \mid s)\)

  • evaluation_policy_action_dist (array-like of shape (n_trajectories * step_per_trajectory, n_action)) – Conditional action distribution induced by the evaluation policy, i.e., \(\pi(a \mid s_t) \forall a \in \mathcal{A}\)

  • reward_scale (array-like of shape (n_partition, )) – Scale of the trajectory-wise reward used for x-axis of the CDF plot.

  • gamma (float, default=1.0) – Discount factor. The value should be within (0, 1].

  • alpha (float, default=0.05) – Proportion of the shaded region.

Returns:

estimated_interquartile_range – Estimated interquartile range of the reward under the evaluation policy.

key: [
    mean,
    {100 * (1. - alpha)}% quartile (lower),
    {100 * (1. - alpha)}% quartile (upper),
]

Return type:

dict

class scope_rl.ope.discrete.cumulative_distribution_estimators.CumulativeDistributionTDR(estimator_name='cdf_tdr')

Trajectory-wise Doubly Robust (TDR) for estimating the cumulative distribution function (CDF) for discrete action spaces.

Bases: scope_rl.ope.BaseCumulativeDistributionOPEEstimator

Imported as: scope_rl.ope.discrete.CumulativeDistributionTDR

Note

TDR estimates the CDF via trajectory-wise importance weighting and estimated Q-function \(\hat{Q}\) as follows.

\[\begin{split}\hat{F}_{\mathrm{TDR}}(m, \pi; \mathcal{D}) &:= \frac{1}{n} \sum_{i=1}^n \sum_{a \in \mathcal{A}} \pi(a \mid s_0^{(i)}) \hat{G}(m; s_0^{(i)}, a) \\ & \quad \quad + \frac{1}{n} \sum_{i=1}^n w_{0:T-1}^{(i)} \left( \mathbb{I} \left \{\sum_{t=0}^{T-1} \gamma^t r_t^{(i)} \leq m \right \} - \hat{G}(m; s_0^{(i)}, a_0^{(i)}) \right)\end{split}\]

where \(\hat{F}(\cdot)\) is the estimated cumulative distribution function and \(\hat{G}(\cdot;s,a)\) is an estimator for \(\mathbb{E} \left[ \mathbb{I} \left \{\sum_{t=0}^{T-1} \gamma^t r_t \leq m \right \} \mid s,a \right]\). \(w_{0:T-1} := \prod_{t=0}^{T-1} (\pi(a_t \mid s_t) / \pi_0(a_t \mid s_t))\) is the trajectory-wise importance weight and \(\mathbb{I} \{ \cdot \}\) is the indicator function.

TDR is unbiased and has lower variance than TIS when \(\hat{Q}(\cdot)\) is reasonably accurate and satisfies \(0 < \hat{Q}(\cdot) < 2 Q(\cdot)\). However, when the importance weight is large, it may still suffer from high variance.
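The TDR formula can be sketched as a DM baseline plus an importance-weighted correction. The helper below is hypothetical (tdr_cdf is not the library's method) and takes the trajectory-wise weights, discounted returns, and \(\hat{G}\) values at the initial states as precomputed inputs; how \(\hat{G}\) is obtained is outside the scope of this sketch.

```python
import numpy as np


def tdr_cdf(weight, returns, reward_scale,
            initial_policy_dist, initial_g_hat, initial_action):
    """Hypothetical helper: TDR estimate of the CDF on the reward scale.

    weight: (n,) trajectory-wise importance weights w_{0:T-1}
    returns: (n,) discounted returns
    initial_policy_dist: (n, n_action), pi(a | s_0^{(i)})
    initial_g_hat: (n, n_action, n_partition), G_hat(m; s_0^{(i)}, a)
    initial_action: (n,) logged action a_0^{(i)}
    """
    n = len(weight)
    indicator = (returns[:, None] <= reward_scale[None, :]).astype(float)
    # First term: model-based (DM) baseline
    baseline = np.einsum("ia,iam->im", initial_policy_dist, initial_g_hat)
    # Second term: importance-weighted residual of the model at the logged action
    g_logged = initial_g_hat[np.arange(n), initial_action]
    correction = weight[:, None] * (indicator - g_logged)
    return baseline.mean(axis=0) + correction.mean(axis=0)


# Toy inputs (made up for illustration)
weight = np.array([2.0, 1.0])
returns = np.array([1.0, 2.0])
reward_scale = np.array([0.0, 1.0, 2.0])
initial_policy_dist = np.array([[1.0, 0.0], [0.5, 0.5]])
initial_action = np.array([0, 1])
initial_g_hat = np.array(
    [
        [[0.0, 0.5, 1.0], [0.0, 0.0, 1.0]],
        [[0.0, 1.0, 1.0], [0.0, 0.5, 1.0]],
    ]
)
tdr_estimate = tdr_cdf(weight, returns, reward_scale,
                       initial_policy_dist, initial_g_hat, initial_action)
```

When \(\hat{G}\) matches the indicator closely, the residual is small and the large weights multiply values near zero, which is the source of the variance reduction relative to TIS.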

Parameters:

estimator_name (str, default="cdf_tdr") – Name of the estimator.

References

Yash Chandak, Scott Niekum, Bruno Castro da Silva, Erik Learned-Miller, Emma Brunskill, and Philip S. Thomas. “Universal Off-Policy Evaluation.” 2021.

Audrey Huang, Liu Leqi, Zachary C. Lipton, and Kamyar Azizzadenesheli. “Off-Policy Risk Assessment in Contextual Bandits.” 2021.

Methods

estimate_conditional_value_at_risk(...[, ...])

Estimate conditional value at risk.

estimate_cumulative_distribution_function(...)

Estimate the cumulative distribution function (CDF) of the reward distribution under the evaluation policy.

estimate_interquartile_range(...[, gamma, alpha])

Estimate interquartile range.

estimate_mean(step_per_trajectory, action, ...)

Estimate mean.

estimate_variance(step_per_trajectory, ...)

Estimate variance.

estimate_cumulative_distribution_function(step_per_trajectory, action, reward, pscore, evaluation_policy_action_dist, state_action_value_prediction, reward_scale, gamma=1.0, **kwargs)

Estimate the cumulative distribution function (CDF) of the reward distribution under the evaluation policy.

Parameters:
  • step_per_trajectory (int (> 0)) – Number of timesteps in an episode.

  • action (array-like of shape (n_trajectories * step_per_trajectory, )) – Action chosen by the behavior policy.

  • reward (array-like of shape (n_trajectories * step_per_trajectory, )) – Observed immediate rewards.

  • pscore (array-like of shape (n_trajectories * step_per_trajectory, )) – Conditional action choice probability of the behavior policy, i.e., \(\pi_0(a \mid s)\)

  • evaluation_policy_action_dist (array-like of shape (n_trajectories * step_per_trajectory, n_action)) – Conditional action distribution induced by the evaluation policy, i.e., \(\pi(a \mid s_t) \forall a \in \mathcal{A}\)

  • state_action_value_prediction (array-like of shape (n_trajectories * step_per_trajectory, n_action)) – \(\hat{Q}\) for all actions, i.e., \(\hat{Q}(s_t, a) \forall a \in \mathcal{A}\).

  • reward_scale (array-like of shape (n_partition, )) – Scale of the trajectory-wise reward used for x-axis of the CDF plot.

  • gamma (float, default=1.0) – Discount factor. The value should be within (0, 1].

Returns:

estimated_cumulative_distribution_function – Estimated cumulative distribution function for the pre-defined reward scale.

Return type:

ndarray of shape (n_partition, ) or (n_episode, )

estimate_mean(step_per_trajectory, action, reward, pscore, evaluation_policy_action_dist, state_action_value_prediction, reward_scale, gamma=1.0, **kwargs)

Estimate mean.

Parameters:
  • step_per_trajectory (int (> 0)) – Number of timesteps in an episode.

  • action (array-like of shape (n_trajectories * step_per_trajectory, )) – Action chosen by the behavior policy.

  • reward (array-like of shape (n_trajectories * step_per_trajectory, )) – Observed immediate rewards.

  • pscore (array-like of shape (n_trajectories * step_per_trajectory, )) – Conditional action choice probability of the behavior policy, i.e., \(\pi_0(a \mid s)\)

  • evaluation_policy_action_dist (array-like of shape (n_trajectories * step_per_trajectory, n_action)) – Conditional action distribution induced by the evaluation policy, i.e., \(\pi(a \mid s_t) \forall a \in \mathcal{A}\)

  • state_action_value_prediction (array-like of shape (n_trajectories * step_per_trajectory, n_action)) – \(\hat{Q}\) for all actions, i.e., \(\hat{Q}(s_t, a) \forall a \in \mathcal{A}\).

  • reward_scale (array-like of shape (n_partition, )) – Scale of the trajectory-wise reward used for x-axis of the CDF plot.

  • gamma (float, default=1.0) – Discount factor. The value should be within (0, 1].

Returns:

estimated_mean – Estimated mean of the reward under the evaluation policy.

Return type:

float

estimate_variance(step_per_trajectory, action, reward, pscore, evaluation_policy_action_dist, state_action_value_prediction, reward_scale, gamma=1.0, **kwargs)

Estimate variance.

Parameters:
  • step_per_trajectory (int (> 0)) – Number of timesteps in an episode.

  • action (array-like of shape (n_trajectories * step_per_trajectory, )) – Action chosen by the behavior policy.

  • reward (array-like of shape (n_trajectories * step_per_trajectory, )) – Observed immediate rewards.

  • pscore (array-like of shape (n_trajectories * step_per_trajectory, )) – Conditional action choice probability of the behavior policy, i.e., \(\pi_0(a \mid s)\)

  • evaluation_policy_action_dist (array-like of shape (n_trajectories * step_per_trajectory, n_action)) – Conditional action distribution induced by the evaluation policy, i.e., \(\pi(a \mid s_t) \forall a \in \mathcal{A}\)

  • state_action_value_prediction (array-like of shape (n_trajectories * step_per_trajectory, n_action)) – \(\hat{Q}\) for all actions, i.e., \(\hat{Q}(s_t, a) \forall a \in \mathcal{A}\).

  • reward_scale (array-like of shape (n_partition, )) – Scale of the trajectory-wise reward used for x-axis of the CDF plot.

  • gamma (float, default=1.0) – Discount factor. The value should be within (0, 1].

Returns:

estimated_variance – Estimated variance of the reward under the evaluation policy.

Return type:

float

estimate_conditional_value_at_risk(step_per_trajectory, action, reward, pscore, evaluation_policy_action_dist, state_action_value_prediction, reward_scale, gamma=1.0, alphas=None, **kwargs)

Estimate conditional value at risk.

Parameters:
  • step_per_trajectory (int (> 0)) – Number of timesteps in an episode.

  • action (array-like of shape (n_trajectories * step_per_trajectory, )) – Action chosen by the behavior policy.

  • reward (array-like of shape (n_trajectories * step_per_trajectory, )) – Observed immediate rewards.

  • pscore (array-like of shape (n_trajectories * step_per_trajectory, )) – Conditional action choice probability of the behavior policy, i.e., \(\pi_0(a \mid s)\)

  • evaluation_policy_action_dist (array-like of shape (n_trajectories * step_per_trajectory, n_action)) – Conditional action distribution induced by the evaluation policy, i.e., \(\pi(a \mid s_t) \forall a \in \mathcal{A}\)

  • state_action_value_prediction (array-like of shape (n_trajectories * step_per_trajectory, n_action)) – \(\hat{Q}\) for all actions, i.e., \(\hat{Q}(s_t, a) \forall a \in \mathcal{A}\).

  • reward_scale (array-like of shape (n_partition, )) – Scale of the trajectory-wise reward used for x-axis of the CDF plot.

  • gamma (float, default=1.0) – Discount factor. The value should be within (0, 1].

  • alphas (array-like of shape (n_alpha, ), default=None) – Set of proportions of the shaded region. The values should be within [0, 1). If None is given, np.linspace(0, 1, 21) will be used.

Returns:

estimated_conditional_value_at_risk – Estimated conditional value at risk (CVaR) of the reward under the evaluation policy.

Return type:

ndarray of shape (n_alpha, )

estimate_interquartile_range(step_per_trajectory, action, reward, pscore, evaluation_policy_action_dist, state_action_value_prediction, reward_scale, gamma=1.0, alpha=0.05, **kwargs)

Estimate interquartile range.

Parameters:
  • step_per_trajectory (int (> 0)) – Number of timesteps in an episode.

  • action (array-like of shape (n_trajectories * step_per_trajectory, )) – Action chosen by the behavior policy.

  • reward (array-like of shape (n_trajectories * step_per_trajectory, )) – Observed immediate rewards.

  • pscore (array-like of shape (n_trajectories * step_per_trajectory, )) – Conditional action choice probability of the behavior policy, i.e., \(\pi_0(a \mid s)\)

  • evaluation_policy_action_dist (array-like of shape (n_trajectories * step_per_trajectory, n_action)) – Conditional action distribution induced by the evaluation policy, i.e., \(\pi(a \mid s_t) \forall a \in \mathcal{A}\)

  • state_action_value_prediction (array-like of shape (n_trajectories * step_per_trajectory, n_action)) – \(\hat{Q}\) for all actions, i.e., \(\hat{Q}(s_t, a) \forall a \in \mathcal{A}\).

  • reward_scale (array-like of shape (n_partition, )) – Scale of the trajectory-wise reward used for x-axis of the CDF plot.

  • gamma (float, default=1.0) – Discount factor. The value should be within (0, 1].

  • alpha (float, default=0.05) – Proportion of the shaded region.

Returns:

estimated_interquartile_range – Estimated interquartile range of the reward under the evaluation policy.

key: [
    mean,
    {100 * (1. - alpha)}% quartile (lower),
    {100 * (1. - alpha)}% quartile (upper),
]

Return type:

dict

class scope_rl.ope.discrete.cumulative_distribution_estimators.CumulativeDistributionSNTIS(estimator_name='cdf_sntis')

Self Normalized Trajectory-wise Importance Sampling (SNTIS) for estimating the cumulative distribution function (CDF) for discrete action spaces.

Bases: scope_rl.ope.discrete.CumulativeDistributionTIS, scope_rl.ope.BaseCumulativeDistributionOPEEstimator

Imported as: scope_rl.ope.discrete.CumulativeDistributionSNTIS

Note

SNTIS estimates the CDF via trajectory-wise importance weighting as follows.

\[\hat{F}_{\mathrm{SNTIS}}(m, \pi; \mathcal{D}) := \sum_{i=1}^n \frac{w_{0:T-1}^{(i)}}{\sum_{i'=1}^n w_{0:T-1}^{(i')}} \mathbb{I} \left \{\sum_{t=0}^{T-1} \gamma^t r_t^{(i)} \leq m \right \}\]

where \(\hat{F}(\cdot)\) is the estimated cumulative distribution function, \(w_{0:T-1} := \prod_{t=0}^{T-1} (\pi(a_t \mid s_t) / \pi_0(a_t \mid s_t))\) is the trajectory-wise importance weight, and \(\mathbb{I} \{ \cdot \}\) is the indicator function.

The self-normalized estimator is no longer unbiased, but has a bounded variance while also remaining consistent.
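The effect of self-normalization can be seen in a short sketch. The helper sntis_cdf below is hypothetical and takes precomputed trajectory-wise importance weights and discounted returns as inputs.

```python
import numpy as np


def sntis_cdf(weight, returns, reward_scale):
    """Hypothetical helper: SNTIS estimate of the CDF on the reward scale."""
    indicator = returns[:, None] <= reward_scale[None, :]
    # Divide each weight by the sum of all weights instead of by n
    normalized = weight / weight.sum()
    return (normalized[:, None] * indicator).sum(axis=0)


# Toy inputs (made up for illustration)
weight = np.array([2.0, 1.0])
returns = np.array([1.0, 2.0])
reward_scale = np.array([0.0, 1.0, 2.0])
sntis_estimate = sntis_cdf(weight, returns, reward_scale)
```

Since the normalized weights are non-negative and sum to one, the estimate always lies in [0, 1] and reaches 1 once the threshold covers every observed return.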

Parameters:

estimator_name (str, default="cdf_sntis") – Name of the estimator.

References

Yash Chandak, Scott Niekum, Bruno Castro da Silva, Erik Learned-Miller, Emma Brunskill, and Philip S. Thomas. “Universal Off-Policy Evaluation.” 2021.

Audrey Huang, Liu Leqi, Zachary C. Lipton, and Kamyar Azizzadenesheli. “Off-Policy Risk Assessment in Contextual Bandits.” 2021.

Methods

estimate_conditional_value_at_risk(...[, ...])

Estimate conditional value at risk.

estimate_cumulative_distribution_function(...)

Estimate the cumulative distribution function (CDF) of the reward distribution under the evaluation policy.

estimate_interquartile_range(...[, gamma, alpha])

Estimate interquartile range.

estimate_mean(step_per_trajectory, action, ...)

Estimate mean.

estimate_variance(step_per_trajectory, ...)

Estimate variance.

estimate_cumulative_distribution_function(step_per_trajectory, action, reward, pscore, evaluation_policy_action_dist, reward_scale, gamma=1.0, **kwargs)

Estimate the cumulative distribution function (CDF) of the reward distribution under the evaluation policy.

Parameters:
  • step_per_trajectory (int (> 0)) – Number of timesteps in an episode.

  • action (array-like of shape (n_trajectories * step_per_trajectory, )) – Action chosen by the behavior policy.

  • reward (array-like of shape (n_trajectories * step_per_trajectory, )) – Observed immediate rewards.

  • pscore (array-like of shape (n_trajectories * step_per_trajectory, )) – Conditional action choice probability of the behavior policy, i.e., \(\pi_0(a \mid s)\)

  • evaluation_policy_action_dist (array-like of shape (n_trajectories * step_per_trajectory, n_action)) – Conditional action distribution induced by the evaluation policy, i.e., \(\pi(a \mid s_t) \forall a \in \mathcal{A}\)

  • reward_scale (array-like of shape (n_partition, )) – Scale of the trajectory-wise reward used for x-axis of the CDF plot.

  • gamma (float, default=1.0) – Discount factor. The value should be within (0, 1].

Returns:

estimated_cumulative_distribution_function – Estimated cumulative distribution function for the pre-defined reward scale.

Return type:

ndarray of shape (n_partition, ) or (n_episode, )

class scope_rl.ope.discrete.cumulative_distribution_estimators.CumulativeDistributionSNTDR(estimator_name='cdf_sntdr')

Self Normalized Trajectory-wise Doubly Robust (SNTDR) for estimating the cumulative distribution function (CDF) for discrete action spaces.

Bases: scope_rl.ope.discrete.CumulativeDistributionTDR, scope_rl.ope.BaseCumulativeDistributionOPEEstimator

Imported as: scope_rl.ope.discrete.CumulativeDistributionSNTDR

Note

SNTDR estimates the CDF via trajectory-wise importance weighting and estimated Q-function \(\hat{Q}\) as follows.

\[\begin{split}\hat{F}_{\mathrm{SNTDR}}(m, \pi; \mathcal{D}) &:= \frac{1}{n} \sum_{i=1}^n \sum_{a \in \mathcal{A}} \pi(a \mid s_0^{(i)}) \hat{G}(m; s_0^{(i)}, a) \\ & \quad \quad + \sum_{i=1}^n \frac{w_{0:T-1}^{(i)}}{\sum_{i'=1}^n w_{0:T-1}^{(i')}} \left( \mathbb{I} \left \{\sum_{t=0}^{T-1} \gamma^t r_t^{(i)} \leq m \right \} - \hat{G}(m; s_0^{(i)}, a_0^{(i)}) \right)\end{split}\]

where \(\hat{F}(\cdot)\) is the estimated cumulative distribution function and \(\hat{G}(\cdot)\) is an estimator for \(\mathbb{E} \left[ \mathbb{I} \left \{\sum_{t=0}^{T-1} \gamma^t r_t \leq m \right \} \mid s,a \right]\). \(w_{0:T-1} := \prod_{t=0}^{T-1} (\pi(a_t \mid s_t) / \pi_0(a_t \mid s_t))\) is the trajectory-wise importance weight and \(\mathbb{I} \{ \cdot \}\) is the indicator function.

The self-normalized estimator is no longer unbiased, but has a bounded variance while also remaining consistent.
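A minimal sketch of the SNTDR formula follows: a DM baseline averaged over trajectories, plus a correction term that uses self-normalized weights. The helper sntdr_cdf is hypothetical, and the weights, discounted returns, and \(\hat{G}\) values are assumed to be precomputed.

```python
import numpy as np


def sntdr_cdf(weight, returns, reward_scale,
              initial_policy_dist, initial_g_hat, initial_action):
    """Hypothetical helper: SNTDR estimate of the CDF on the reward scale."""
    n = len(weight)
    indicator = (returns[:, None] <= reward_scale[None, :]).astype(float)
    # Model-based baseline, averaged over trajectories
    baseline = np.einsum("ia,iam->im", initial_policy_dist, initial_g_hat).mean(axis=0)
    # Residual at the logged initial action, weighted by self-normalized weights
    g_logged = initial_g_hat[np.arange(n), initial_action]
    normalized = weight / weight.sum()
    correction = (normalized[:, None] * (indicator - g_logged)).sum(axis=0)
    return baseline + correction


# Toy inputs (made up for illustration)
weight = np.array([2.0, 1.0])
returns = np.array([1.0, 2.0])
reward_scale = np.array([0.0, 1.0, 2.0])
initial_policy_dist = np.array([[1.0, 0.0], [0.5, 0.5]])
initial_action = np.array([0, 1])
initial_g_hat = np.array(
    [
        [[0.0, 0.5, 1.0], [0.0, 0.0, 1.0]],
        [[0.0, 1.0, 1.0], [0.0, 0.5, 1.0]],
    ]
)
sntdr_estimate = sntdr_cdf(weight, returns, reward_scale,
                           initial_policy_dist, initial_g_hat, initial_action)
```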

Parameters:

estimator_name (str, default="cdf_sntdr") – Name of the estimator.

References

Yash Chandak, Scott Niekum, Bruno Castro da Silva, Erik Learned-Miller, Emma Brunskill, and Philip S. Thomas. “Universal Off-Policy Evaluation.” 2021.

Audrey Huang, Liu Leqi, Zachary C. Lipton, and Kamyar Azizzadenesheli. “Off-Policy Risk Assessment in Contextual Bandits.” 2021.

Methods

estimate_conditional_value_at_risk(...[, ...])

Estimate conditional value at risk.

estimate_cumulative_distribution_function(...)

Estimate the cumulative distribution function (CDF) of the reward distribution under the evaluation policy.

estimate_interquartile_range(...[, gamma, alpha])

Estimate interquartile range.

estimate_mean(step_per_trajectory, action, ...)

Estimate mean.

estimate_variance(step_per_trajectory, ...)

Estimate variance.

estimate_cumulative_distribution_function(step_per_trajectory, action, reward, pscore, evaluation_policy_action_dist, state_action_value_prediction, reward_scale, gamma=1.0, **kwargs)

Estimate the cumulative distribution function (CDF) of the reward distribution under the evaluation policy.

Parameters:
  • step_per_trajectory (int (> 0)) – Number of timesteps in an episode.

  • action (array-like of shape (n_trajectories * step_per_trajectory, )) – Action chosen by the behavior policy.

  • reward (array-like of shape (n_trajectories * step_per_trajectory, )) – Observed immediate rewards.

  • pscore (array-like of shape (n_trajectories * step_per_trajectory, )) – Conditional action choice probability of the behavior policy, i.e., \(\pi_0(a \mid s)\)

  • evaluation_policy_action_dist (array-like of shape (n_trajectories * step_per_trajectory, n_action)) – Conditional action distribution induced by the evaluation policy, i.e., \(\pi(a \mid s_t) \forall a \in \mathcal{A}\)

  • state_action_value_prediction (array-like of shape (n_trajectories * step_per_trajectory, n_action)) – \(\hat{Q}\) for all actions, i.e., \(\hat{Q}(s_t, a) \forall a \in \mathcal{A}\).

  • reward_scale (array-like of shape (n_partition, )) – Scale of the trajectory-wise reward used for x-axis of the CDF plot.

  • gamma (float, default=1.0) – Discount factor. The value should be within (0, 1].

Returns:

estimated_cumulative_distribution_function – Estimated cumulative distribution function for the pre-defined reward scale.

Return type:

ndarray of shape (n_partition, ) or (n_episode, )