scope_rl.ope.continuous.cumulative_distribution_estimators
Cumulative Distribution Off-Policy Estimators for continuous action cases (designed for deterministic evaluation policies).
Classes
CumulativeDistributionDM – Direct Method (DM) for estimating the cumulative distribution function (CDF) for continuous action spaces.
CumulativeDistributionSNTDR – Self Normalized Trajectory-wise Doubly Robust (SNTDR) for estimating the cumulative distribution function (CDF) for continuous action spaces.
CumulativeDistributionSNTIS – Self Normalized Trajectory-wise Importance Sampling (SNTIS) for estimating the cumulative distribution function (CDF) for continuous action spaces.
CumulativeDistributionTDR – Trajectory-wise Doubly Robust (TDR) for estimating the cumulative distribution function (CDF) for continuous action spaces.
CumulativeDistributionTIS – Trajectory-wise Importance Sampling (TIS) for estimating the cumulative distribution function (CDF) for continuous action spaces.
- class scope_rl.ope.continuous.cumulative_distribution_estimators.CumulativeDistributionDM(estimator_name='cdf_dm')
Direct Method (DM) for estimating the cumulative distribution function (CDF) for continuous action spaces.
Bases: scope_rl.ope.BaseCumulativeDistributionOPEEstimator
Imported as: scope_rl.ope.continuous.CumulativeDistributionDM
Note
DM estimates the CDF using the initial state value as follows.
\[\hat{F}_{\mathrm{DM}}(m, \pi; \mathcal{D}) := \frac{1}{n} \sum_{i=1}^n \hat{G}(m; s_0^{(i)}, \pi(s_0^{(i)}))\]
where \(\hat{F}(\cdot)\) is the estimated cumulative distribution function and \(\hat{G}(\cdot)\) is an estimator for \(\mathbb{E} \left[ \mathbb{I} \left \{\sum_{t=0}^{T-1} \gamma^t r_t \leq m \right \} \mid s,a \right]\).
DM has low variance compared to other estimators, but can produce larger bias due to approximation errors.
There are several methods to estimate \(\hat{Q}(s, a)\) such as Fitted Q Evaluation (FQE) (Le et al., 2019) and Minimax Q-Function Learning (MQL) (Uehara et al., 2020).
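The averaging in the DM formula can be illustrated with a small sketch. It assumes a crude plug-in choice for \(\hat{G}\), namely the indicator \(\mathbb{I}\{\hat{Q}(s_0, \pi(s_0)) \leq m\}\); this choice, the function name `dm_cdf_plugin`, and the toy values are assumptions of the example, not a description of SCOPE-RL's internals.

```python
from typing import List, Sequence


def dm_cdf_plugin(initial_value_predictions: Sequence[float],
                  reward_scale: Sequence[float]) -> List[float]:
    """Average a plug-in G-hat over trajectories for each threshold m.

    Assumption: G-hat(m; s_0, pi(s_0)) is approximated by the indicator
    1{Q-hat(s_0, pi(s_0)) <= m}. A learned conditional CDF model could be
    substituted for the indicator without changing the averaging step.
    """
    n = len(initial_value_predictions)
    return [
        sum(1.0 for q in initial_value_predictions if q <= m) / n
        for m in reward_scale
    ]


# Hypothetical Q-hat(s_0, pi(s_0)) values for four logged trajectories.
q_hat_initial = [0.5, 1.5, 2.5, 3.5]
cdf_dm = dm_cdf_plugin(q_hat_initial, reward_scale=[0.0, 1.0, 2.0, 3.0, 4.0])
print(cdf_dm)  # [0.0, 0.25, 0.5, 0.75, 1.0]
```

No logged rewards or importance weights enter this computation, which is why the quality of the estimate rests entirely on the value model.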
See also
The implementation of FQE is provided by d3rlpy. The implementation of Minimax Learning is available at scope_rl.ope.weight_value_learning.
- Parameters:
estimator_name (str, default="cdf_dm") – Name of the estimator.
References
Yash Chandak, Scott Niekum, Bruno Castro da Silva, Erik Learned-Miller, Emma Brunskill, and Philip S. Thomas. “Universal Off-Policy Evaluation.” 2021.
Audrey Huang, Liu Leqi, Zachary C. Lipton, and Kamyar Azizzadenesheli. “Off-Policy Risk Assessment in Contextual Bandits.” 2021.
Masatoshi Uehara, Jiawei Huang, and Nan Jiang. “Minimax Weight and Q-Function Learning for Off-Policy Evaluation.” 2020.
Hoang Le, Cameron Voloshin, and Yisong Yue. “Batch Policy Learning under Constraints.” 2019.
Methods
estimate_conditional_value_at_risk(...[, ...]) – Estimate conditional value at risk.
estimate_cumulative_distribution_function(...) – Estimate the cumulative distribution function (CDF) of the reward distribution under the evaluation policy.
estimate_interquartile_range(...[, gamma, alpha]) – Estimate interquartile range.
estimate_mean(step_per_trajectory, reward, ...) – Estimate mean.
estimate_variance(step_per_trajectory, ...) – Estimate variance.
- estimate_cumulative_distribution_function(step_per_trajectory, reward, state_action_value_prediction, reward_scale, gamma=1.0, **kwargs)
Estimate the cumulative distribution function (CDF) of the reward distribution under the evaluation policy.
- Parameters:
step_per_trajectory (int (> 0)) – Number of timesteps in an episode.
reward (array-like of shape (n_trajectories * step_per_trajectory, )) – Observed immediate rewards.
state_action_value_prediction (array-like of shape (n_trajectories * step_per_trajectory, n_action)) – \(\hat{Q}\) for all actions, i.e., \(\hat{Q}(s_t, a) \forall a \in \mathcal{A}\).
reward_scale (array-like of shape (n_partition, )) – Scale of the trajectory-wise reward used for x-axis of the CDF plot.
gamma (float, default=1.0) – Discount factor. The value should be within (0, 1].
- Returns:
estimated_cumulative_distribution_function – Estimated cumulative distribution function for the pre-defined reward scale.
- Return type:
ndarray of shape (n_partition, ) or (n_episode, )
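A CDF evaluated on a grid such as reward_scale can be turned into summary statistics through its increments. The sketch below shows one way to do this; the discretization (each increment of the CDF is treated as a point mass on the corresponding grid value) and the function name are assumptions of the example and may differ from what the library does.

```python
from typing import Sequence, Tuple


def mean_and_variance_from_cdf(cdf: Sequence[float],
                               reward_scale: Sequence[float]) -> Tuple[float, float]:
    """Approximate the mean and variance of the return from a gridded CDF.

    Assumption: cdf[0] is the mass at or below reward_scale[0], and each later
    increment cdf[k] - cdf[k-1] is placed on the grid point reward_scale[k].
    """
    masses = [cdf[0]] + [cdf[k] - cdf[k - 1] for k in range(1, len(cdf))]
    mean = sum(p * m for p, m in zip(masses, reward_scale))
    second_moment = sum(p * m * m for p, m in zip(masses, reward_scale))
    return mean, second_moment - mean ** 2


cdf = [0.0, 0.25, 0.5, 0.75, 1.0]
grid = [0.0, 1.0, 2.0, 3.0, 4.0]
mean, variance = mean_and_variance_from_cdf(cdf, grid)
print(mean, variance)  # 2.5 1.25
```

A finer reward_scale reduces the discretization error of such statistics at the cost of a longer CDF vector.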
- estimate_mean(step_per_trajectory, reward, state_action_value_prediction, reward_scale, gamma=1.0, **kwargs)
Estimate mean.
- Parameters:
step_per_trajectory (int (> 0)) – Number of timesteps in an episode.
reward (array-like of shape (n_trajectories * step_per_trajectory, )) – Observed immediate rewards.
state_action_value_prediction (array-like of shape (n_trajectories * step_per_trajectory, n_action)) – \(\hat{Q}\) for all actions, i.e., \(\hat{Q}(s_t, a) \forall a \in \mathcal{A}\).
reward_scale (array-like of shape (n_partition, )) – Scale of the trajectory-wise reward used for x-axis of the CDF plot.
gamma (float, default=1.0) – Discount factor. The value should be within (0, 1].
- Returns:
estimated_mean – Estimated mean of the reward under the evaluation policy.
- Return type:
- estimate_variance(step_per_trajectory, reward, state_action_value_prediction, reward_scale, gamma=1.0, **kwargs)
Estimate variance.
- Parameters:
step_per_trajectory (int (> 0)) – Number of timesteps in an episode.
reward (array-like of shape (n_trajectories * step_per_trajectory, )) – Observed immediate rewards.
state_action_value_prediction (array-like of shape (n_trajectories * step_per_trajectory, n_action)) – \(\hat{Q}\) for all actions, i.e., \(\hat{Q}(s_t, a) \forall a \in \mathcal{A}\).
reward_scale (array-like of shape (n_partition, )) – Scale of the trajectory-wise reward used for x-axis of the CDF plot.
gamma (float, default=1.0) – Discount factor. The value should be within (0, 1].
- Returns:
estimated_variance – Estimated variance of the reward under the evaluation policy.
- Return type:
- estimate_conditional_value_at_risk(step_per_trajectory, reward, state_action_value_prediction, reward_scale, gamma=1.0, alphas=None, **kwargs)
Estimate conditional value at risk.
- Parameters:
step_per_trajectory (int (> 0)) – Number of timesteps in an episode.
reward (array-like of shape (n_trajectories * step_per_trajectory, )) – Observed immediate rewards.
state_action_value_prediction (array-like of shape (n_trajectories * step_per_trajectory, n_action)) – \(\hat{Q}\) for all actions, i.e., \(\hat{Q}(s_t, a) \forall a \in \mathcal{A}\).
reward_scale (array-like of shape (n_partition, )) – Scale of the trajectory-wise reward used for x-axis of the CDF plot.
gamma (float, default=1.0) – Discount factor. The value should be within (0, 1].
alphas (array-like of shape (n_alpha, ), default=None) – Set of proportions of the shaded region. The values should be within [0, 1). If None is given, np.linspace(0, 1, 21) will be used.
- Returns:
estimated_conditional_value_at_risk – Estimated conditional value at risk (CVaR) of the reward under the evaluation policy.
- Return type:
ndarray of shape (n_alpha, )
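As an illustration of what a CVaR value derived from a gridded CDF can look like, the following sketch computes the mean return over the worst alpha fraction of outcomes. The lower-tail convention, the point-mass discretization, the restriction to alpha > 0, and the name `lower_tail_cvar` are all assumptions of this example.

```python
from typing import Sequence


def lower_tail_cvar(cdf: Sequence[float], reward_scale: Sequence[float],
                    alpha: float) -> float:
    """Mean return over the worst `alpha` fraction of outcomes.

    Assumptions: 0 < alpha <= cdf[-1], and the gridded CDF is read as point
    masses on reward_scale (cdf[0] at reward_scale[0], increments afterwards).
    """
    if alpha <= 0.0:
        raise ValueError("alpha must be positive in this sketch")
    masses = [cdf[0]] + [cdf[k] - cdf[k - 1] for k in range(1, len(cdf))]
    remaining, total = alpha, 0.0
    for p, m in zip(masses, reward_scale):
        used = min(p, remaining)  # take only the probability mass still needed
        total += used * m
        remaining -= used
        if remaining <= 0.0:
            break
    return total / alpha


cdf = [0.0, 0.25, 0.5, 0.75, 1.0]
grid = [0.0, 1.0, 2.0, 3.0, 4.0]
print(lower_tail_cvar(cdf, grid, alpha=0.5))  # 1.5
print(lower_tail_cvar(cdf, grid, alpha=1.0))  # 2.5
```

With alpha = 1 the whole distribution is averaged, so the value coincides with the mean of the discretized distribution.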
- estimate_interquartile_range(step_per_trajectory, reward, state_action_value_prediction, reward_scale, gamma=1.0, alpha=0.05, **kwargs)
Estimate interquartile range.
- Parameters:
step_per_trajectory (int (> 0)) – Number of timesteps in an episode.
reward (array-like of shape (n_trajectories * step_per_trajectory, )) – Observed immediate rewards.
state_action_value_prediction (array-like of shape (n_trajectories * step_per_trajectory, n_action)) – \(\hat{Q}\) for all actions, i.e., \(\hat{Q}(s_t, a) \forall a \in \mathcal{A}\).
reward_scale (array-like of shape (n_partition, )) – Scale of the trajectory-wise reward used for x-axis of the CDF plot.
gamma (float, default=1.0) – Discount factor. The value should be within (0, 1].
alpha (float, default=0.05) – Proportion of the shaded region.
- Returns:
estimated_interquartile_range – Estimated interquartile range of the reward under the evaluation policy.
- Return type:
- class scope_rl.ope.continuous.cumulative_distribution_estimators.CumulativeDistributionTIS(estimator_name='cdf_tis')
Trajectory-wise Importance Sampling (TIS) for estimating the cumulative distribution function (CDF) for continuous action spaces.
Bases: scope_rl.ope.BaseCumulativeDistributionOPEEstimator
Imported as: scope_rl.ope.continuous.CumulativeDistributionTIS
Note
TIS estimates the CDF via trajectory-wise importance weighting as follows.
\[\hat{F}_{\mathrm{TIS}}(m, \pi; \mathcal{D}) := \frac{1}{n} \sum_{i=1}^n w_{0:T-1}^{(i)} \delta(\pi, a_{0:T-1}^{(i)}) \mathbb{I} \left \{\sum_{t=0}^{T-1} \gamma^t r_t^{(i)} \leq m \right \}\]
where \(\hat{F}(\cdot)\) is the estimated cumulative distribution function, \(w_{0:T-1} := \prod_{t=0}^{T-1} (\pi(a_t \mid s_t) / \pi_0(a_t \mid s_t))\) is the trajectory-wise importance weight, and \(\mathbb{I} \{ \cdot \}\) is the indicator function. \(\delta(\pi, a_{0:T-1}) = \prod_{t=0}^{T-1} K(\pi(s_t), a_t)\) quantifies the similarity between the action logged in the dataset and that taken by the evaluation policy. The kernel bandwidth is an important hyperparameter: a larger bandwidth typically reduces the variance of the estimator but increases its bias.
TIS corrects the distribution shift between the behavior and evaluation policies. However, when the trajectory length (\(T\)) is large, TIS suffers from high variance due to the product of importance weights over the entire horizon.
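The TIS formula can be sketched in plain Python as follows. The sketch assumes scalar actions, a Gaussian kernel, and reads the product \(w_{0:T-1} \delta(\pi, a_{0:T-1})\) for a deterministic evaluation policy as \(\prod_t K((\pi(s_t) - a_t) / h) / (h \, \pi_0(a_t \mid s_t))\) with bandwidth \(h\). That reading, the per-trajectory list layout, and all function names are assumptions of the example rather than the library's implementation.

```python
import math
from typing import List, Sequence


def gaussian_kernel(u: float) -> float:
    """Standard normal density, used here as the kernel K."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def trajectory_weight(actions: Sequence[float], eval_actions: Sequence[float],
                      pscores: Sequence[float], bandwidth: float) -> float:
    """prod_t K((pi(s_t) - a_t) / h) / (h * pi_0(a_t | s_t)) for scalar actions."""
    weight = 1.0
    for a, a_eval, p in zip(actions, eval_actions, pscores):
        weight *= gaussian_kernel((a_eval - a) / bandwidth) / (bandwidth * p)
    return weight


def tis_cdf(actions, eval_actions, pscores, rewards, reward_scale,
            gamma: float = 1.0, bandwidth: float = 1.0) -> List[float]:
    """Each of the first four arguments is a list of per-trajectory lists."""
    n = len(rewards)
    weights = [trajectory_weight(a, e, p, bandwidth)
               for a, e, p in zip(actions, eval_actions, pscores)]
    returns = [sum(gamma ** t * r for t, r in enumerate(traj)) for traj in rewards]
    # Weighted share of trajectories whose discounted return is at most m.
    return [sum(w for w, g in zip(weights, returns) if g <= m) / n
            for m in reward_scale]


# Toy data: logged actions coincide with the evaluation policy's actions and
# pscore equals K(0) / h, so every trajectory weight is 1.
h = 1.0
p = gaussian_kernel(0.0) / h
actions = [[0.3, -0.1], [0.2, 0.4]]
estimate = tis_cdf(actions, actions, [[p, p], [p, p]],
                   rewards=[[1.0, 1.0], [0.0, 1.0]],
                   reward_scale=[0.5, 1.5, 2.5], bandwidth=h)
print(estimate)
```

Because the weight is a product over all T steps, a single small pscore or a poorly matched action can dominate the average, which is the variance issue described above.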
- Parameters:
estimator_name (str, default="cdf_tis") – Name of the estimator.
References
Yash Chandak, Scott Niekum, Bruno Castro da Silva, Erik Learned-Miller, Emma Brunskill, and Philip S. Thomas. “Universal Off-Policy Evaluation.” 2021.
Audrey Huang, Liu Leqi, Zachary C. Lipton, and Kamyar Azizzadenesheli. “Off-Policy Risk Assessment in Contextual Bandits.” 2021.
Nathan Kallus and Angela Zhou. “Policy Evaluation and Optimization with Continuous Treatments.” 2019.
Methods
estimate_conditional_value_at_risk(...[, ...]) – Estimate conditional value at risk.
estimate_cumulative_distribution_function(...) – Estimate the cumulative distribution function (CDF) of the reward distribution under the evaluation policy.
estimate_interquartile_range(...[, gamma, ...]) – Estimate interquartile range.
estimate_mean(step_per_trajectory, action, ...) – Estimate mean.
estimate_variance(step_per_trajectory, ...) – Estimate variance.
- estimate_cumulative_distribution_function(step_per_trajectory, action, reward, pscore, evaluation_policy_action, reward_scale, gamma=1.0, kernel='gaussian', bandwidth=1.0, action_scaler=None, **kwargs)
Estimate the cumulative distribution function (CDF) of the reward distribution under the evaluation policy.
- Parameters:
step_per_trajectory (int (> 0)) – Number of timesteps in an episode.
action (array-like of shape (n_trajectories * step_per_trajectory, action_dim)) – Action chosen by the behavior policy.
reward (array-like of shape (n_trajectories * step_per_trajectory, )) – Observed immediate rewards.
pscore (array-like of shape (n_trajectories * step_per_trajectory, )) – Conditional action choice probability of the behavior policy, i.e., \(\pi_0(a \mid s)\)
evaluation_policy_action (array-like of shape (n_trajectories * step_per_trajectory, action_dim)) – Action chosen by the evaluation policy.
reward_scale (array-like of shape (n_partition, )) – Scale of the trajectory-wise reward used for x-axis of the CDF plot.
gamma (float, default=1.0) – Discount factor. The value should be within (0, 1].
kernel ({"gaussian", "epanechnikov", "triangular", "cosine", "uniform"}) – Name of the kernel function to smooth importance weights.
bandwidth (float, default=1.0 (> 0)) – Bandwidth hyperparameter of the kernel function.
action_scaler (d3rlpy.preprocessing.ActionScaler, default=None) – Scaling factor of action.
- Returns:
estimated_cumulative_distribution_function – Estimated cumulative distribution function for the pre-defined reward scale.
- Return type:
ndarray of shape (n_partition, ) or (n_episode, )
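The flattened inputs of shape (n_trajectories * step_per_trajectory, ) relate to per-trajectory quantities through a simple reshape. The sketch below assumes that the steps of each trajectory are stored contiguously, one trajectory after another; this layout and the helper names are assumptions of the example.

```python
from typing import List, Sequence


def split_by_trajectory(flat: Sequence[float],
                        step_per_trajectory: int) -> List[List[float]]:
    """Cut a flattened array into consecutive chunks of equal length."""
    if step_per_trajectory <= 0 or len(flat) % step_per_trajectory != 0:
        raise ValueError("length must be a positive multiple of step_per_trajectory")
    return [list(flat[i:i + step_per_trajectory])
            for i in range(0, len(flat), step_per_trajectory)]


def trajectory_returns(reward: Sequence[float], step_per_trajectory: int,
                       gamma: float = 1.0) -> List[float]:
    """Discounted return sum_t gamma^t r_t of each trajectory."""
    return [sum(gamma ** t * r for t, r in enumerate(traj))
            for traj in split_by_trajectory(reward, step_per_trajectory)]


# Two trajectories of three steps each.
flat_reward = [1, 0, 1, 0, 2, 0]
print(trajectory_returns(flat_reward, step_per_trajectory=3, gamma=0.5))
# [1.25, 1.0]
```

These per-trajectory returns are the quantities compared against each value of reward_scale inside the indicator function.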
- estimate_mean(step_per_trajectory, action, reward, pscore, evaluation_policy_action, reward_scale, gamma=1.0, kernel='gaussian', bandwidth=1.0, action_scaler=None, **kwargs)
Estimate mean.
- Parameters:
step_per_trajectory (int (> 0)) – Number of timesteps in an episode.
action (array-like of shape (n_trajectories * step_per_trajectory, action_dim)) – Action chosen by the behavior policy.
reward (array-like of shape (n_trajectories * step_per_trajectory, )) – Observed immediate rewards.
pscore (array-like of shape (n_trajectories * step_per_trajectory, )) – Conditional action choice probability of the behavior policy, i.e., \(\pi_0(a \mid s)\)
evaluation_policy_action (array-like of shape (n_trajectories * step_per_trajectory, action_dim)) – Action chosen by the evaluation policy.
reward_scale (array-like of shape (n_partition, )) – Scale of the trajectory-wise reward used for x-axis of the CDF plot.
gamma (float, default=1.0) – Discount factor. The value should be within (0, 1].
kernel ({"gaussian", "epanechnikov", "triangular", "cosine", "uniform"}) – Name of the kernel function to smooth importance weights.
bandwidth (float, default=1.0 (> 0)) – Bandwidth hyperparameter of the kernel function.
action_scaler (d3rlpy.preprocessing.ActionScaler, default=None) – Scaling factor of action.
- Returns:
estimated_mean – Estimated mean of the reward under the evaluation policy.
- Return type:
- estimate_variance(step_per_trajectory, action, reward, pscore, evaluation_policy_action, reward_scale, gamma=1.0, kernel='gaussian', bandwidth=1.0, action_scaler=None, **kwargs)
Estimate variance.
- Parameters:
step_per_trajectory (int (> 0)) – Number of timesteps in an episode.
action (array-like of shape (n_trajectories * step_per_trajectory, action_dim)) – Action chosen by the behavior policy.
reward (array-like of shape (n_trajectories * step_per_trajectory, )) – Observed immediate rewards.
pscore (array-like of shape (n_trajectories * step_per_trajectory, )) – Conditional action choice probability of the behavior policy, i.e., \(\pi_0(a \mid s)\)
evaluation_policy_action (array-like of shape (n_trajectories * step_per_trajectory, action_dim)) – Action chosen by the evaluation policy.
reward_scale (array-like of shape (n_partition, )) – Scale of the trajectory-wise reward used for x-axis of the CDF plot.
gamma (float, default=1.0) – Discount factor. The value should be within (0, 1].
kernel ({"gaussian", "epanechnikov", "triangular", "cosine", "uniform"}) – Name of the kernel function to smooth importance weights.
bandwidth (float, default=1.0 (> 0)) – Bandwidth hyperparameter of the kernel function.
action_scaler (d3rlpy.preprocessing.ActionScaler, default=None) – Scaling factor of action.
- Returns:
estimated_variance – Estimated variance of the reward under the evaluation policy.
- Return type:
- estimate_conditional_value_at_risk(step_per_trajectory, action, reward, pscore, evaluation_policy_action, reward_scale, gamma=1.0, kernel='gaussian', bandwidth=1.0, action_scaler=None, alphas=None, **kwargs)
Estimate conditional value at risk.
- Parameters:
step_per_trajectory (int (> 0)) – Number of timesteps in an episode.
action (array-like of shape (n_trajectories * step_per_trajectory, action_dim)) – Action chosen by the behavior policy.
reward (array-like of shape (n_trajectories * step_per_trajectory, )) – Observed immediate rewards.
pscore (array-like of shape (n_trajectories * step_per_trajectory, )) – Conditional action choice probability of the behavior policy, i.e., \(\pi_0(a \mid s)\)
evaluation_policy_action (array-like of shape (n_trajectories * step_per_trajectory, action_dim)) – Action chosen by the evaluation policy.
reward_scale (array-like of shape (n_partition, )) – Scale of the trajectory-wise reward used for x-axis of the CDF plot.
gamma (float, default=1.0) – Discount factor. The value should be within (0, 1].
kernel ({"gaussian", "epanechnikov", "triangular", "cosine", "uniform"}) – Name of the kernel function to smooth importance weights.
bandwidth (float, default=1.0 (> 0)) – Bandwidth hyperparameter of the kernel function.
action_scaler (d3rlpy.preprocessing.ActionScaler, default=None) – Scaling factor of action.
alphas (array-like of shape (n_alpha, ), default=None) – Set of proportions of the shaded region. The values should be within [0, 1). If None is given, np.linspace(0, 1, 21) will be used.
- Returns:
estimated_conditional_value_at_risk – Estimated conditional value at risk (CVaR) of the reward under the evaluation policy.
- Return type:
ndarray of shape (n_alpha, )
- estimate_interquartile_range(step_per_trajectory, action, reward, pscore, evaluation_policy_action, reward_scale, gamma=1.0, kernel='gaussian', bandwidth=1.0, action_scaler=None, alpha=0.05, **kwargs)
Estimate interquartile range.
- Parameters:
step_per_trajectory (int (> 0)) – Number of timesteps in an episode.
action (array-like of shape (n_trajectories * step_per_trajectory, action_dim)) – Action chosen by the behavior policy.
reward (array-like of shape (n_trajectories * step_per_trajectory, )) – Observed immediate rewards.
pscore (array-like of shape (n_trajectories * step_per_trajectory, )) – Conditional action choice probability of the behavior policy, i.e., \(\pi_0(a \mid s)\)
evaluation_policy_action (array-like of shape (n_trajectories * step_per_trajectory, action_dim)) – Action chosen by the evaluation policy.
reward_scale (array-like of shape (n_partition, )) – Scale of the trajectory-wise reward used for x-axis of the CDF plot.
gamma (float, default=1.0) – Discount factor. The value should be within (0, 1].
kernel ({"gaussian", "epanechnikov", "triangular", "cosine", "uniform"}) – Name of the kernel function to smooth importance weights.
bandwidth (float, default=1.0 (> 0)) – Bandwidth hyperparameter of the kernel function.
action_scaler (d3rlpy.preprocessing.ActionScaler, default=None) – Scaling factor of action.
alpha (float, default=0.05) – Proportion of the shaded region.
- Returns:
estimated_interquartile_range – Estimated interquartile range of the reward under the evaluation policy.
- Return type:
- class scope_rl.ope.continuous.cumulative_distribution_estimators.CumulativeDistributionTDR(estimator_name='cdf_tdr')
Trajectory-wise Doubly Robust (TDR) for estimating the cumulative distribution function (CDF) for continuous action spaces.
Bases: scope_rl.ope.BaseCumulativeDistributionOPEEstimator
Imported as: scope_rl.ope.continuous.CumulativeDistributionTDR
Note
TDR estimates the CDF via trajectory-wise importance weighting and estimated Q-function \(\hat{Q}\) as follows.
\[\begin{split}\hat{F}_{\mathrm{TDR}}(m, \pi; \mathcal{D}) &:= \frac{1}{n} \sum_{i=1}^n \hat{G}(m; s_0^{(i)}, \pi(s_0^{(i)})) \\ & \quad \quad + \frac{1}{n} \sum_{i=1}^n w_{0:T-1}^{(i)} \delta(\pi, a_{0:T-1}^{(i)}) \left( \mathbb{I} \left \{\sum_{t=0}^{T-1} \gamma^t r_t^{(i)} \leq m \right \} - \hat{G}(m; s_0^{(i)}, a_0^{(i)}) \right)\end{split}\]
where \(\hat{F}(\cdot)\) is the estimated cumulative distribution function and \(\hat{G}(\cdot)\) is an estimator for \(\mathbb{E} \left[ \mathbb{I} \left \{\sum_{t=0}^{T-1} \gamma^t r_t \leq m \right \} \mid s,a \right]\). \(w_{0:T-1} := \prod_{t=0}^{T-1} (\pi(a_t \mid s_t) / \pi_0(a_t \mid s_t))\) is the trajectory-wise importance weight and \(\mathbb{I} \{ \cdot \}\) is the indicator function. \(\delta(\pi, a_{0:T-1}) = \prod_{t=0}^{T-1} K(\pi(s_t), a_t)\) quantifies the similarity between the action logged in the dataset and that taken by the evaluation policy. The kernel bandwidth is an important hyperparameter: a larger bandwidth typically reduces the variance of the estimator but increases its bias.
TDR corrects the distribution shift between the behavior and evaluation policies and often has lower variance than TIS when \(\hat{Q}(\cdot)\) is reasonably accurate and satisfies \(0 < \hat{Q}(\cdot) < 2 Q(\cdot)\). However, when the importance weight is quite large, it may still suffer from a high variance.
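The two terms of the TDR formula, a model-based baseline plus a weighted correction, can be sketched as follows. The sketch takes the model outputs and the trajectory weights as precomputed inputs; the argument layout and the name `tdr_cdf` are assumptions of the example, not the library's interface.

```python
from typing import List, Sequence


def tdr_cdf(g_hat_eval: Sequence[Sequence[float]],
            g_hat_logged: Sequence[Sequence[float]],
            weights: Sequence[float],
            returns: Sequence[float],
            reward_scale: Sequence[float]) -> List[float]:
    """Doubly robust CDF estimate on a grid of thresholds.

    g_hat_eval[i][k]   ~ G-hat(m_k; s_0^(i), pi(s_0^(i)))
    g_hat_logged[i][k] ~ G-hat(m_k; s_0^(i), a_0^(i))
    weights[i]         ~ w_{0:T-1}^(i) * delta(pi, a_{0:T-1}^(i))
    returns[i]         ~ sum_t gamma^t r_t^(i)
    """
    n = len(returns)
    cdf = []
    for k, m in enumerate(reward_scale):
        baseline = sum(g_hat_eval[i][k] for i in range(n)) / n
        correction = sum(
            weights[i] * ((1.0 if returns[i] <= m else 0.0) - g_hat_logged[i][k])
            for i in range(n)
        ) / n
        cdf.append(baseline + correction)
    return cdf


# Hypothetical model outputs for two trajectories and three thresholds.
g_hat = [[0.1, 0.6, 0.9], [0.0, 0.4, 1.0]]
returns = [1.0, 2.0]
scale = [0.5, 1.5, 2.5]
with_unit_weights = tdr_cdf(g_hat, g_hat, [1.0, 1.0], returns, scale)
with_zero_weights = tdr_cdf(g_hat, g_hat, [0.0, 0.0], returns, scale)
print(with_unit_weights, with_zero_weights)
```

With zero weights the estimate reduces to the DM baseline, and with unit weights and identical model inputs the model terms cancel and the empirical CDF of the logged returns remains. Nothing in the formula forces the result into [0, 1] or makes it monotone in m, so clipping or monotone post-processing may be needed in practice.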
- Parameters:
estimator_name (str, default="cdf_tdr") – Name of the estimator.
References
Yash Chandak, Scott Niekum, Bruno Castro da Silva, Erik Learned-Miller, Emma Brunskill, and Philip S. Thomas. “Universal Off-Policy Evaluation.” 2021.
Audrey Huang, Liu Leqi, Zachary C. Lipton, and Kamyar Azizzadenesheli. “Off-Policy Risk Assessment in Contextual Bandits.” 2021.
Nathan Kallus and Angela Zhou. “Policy Evaluation and Optimization with Continuous Treatments.” 2019.
Methods
estimate_conditional_value_at_risk(...[, ...]) – Estimate conditional value at risk.
estimate_cumulative_distribution_function(...) – Estimate the cumulative distribution function (CDF) of the reward distribution under the evaluation policy.
estimate_interquartile_range(...[, gamma, ...]) – Estimate interquartile range.
estimate_mean(step_per_trajectory, action, ...) – Estimate mean.
estimate_variance(step_per_trajectory, ...) – Estimate variance.
- estimate_cumulative_distribution_function(step_per_trajectory, action, reward, pscore, evaluation_policy_action, state_action_value_prediction, reward_scale, gamma=1.0, kernel='gaussian', bandwidth=1.0, action_scaler=None, **kwargs)
Estimate the cumulative distribution function (CDF) of the reward distribution under the evaluation policy.
- Parameters:
step_per_trajectory (int (> 0)) – Number of timesteps in an episode.
action (array-like of shape (n_trajectories * step_per_trajectory, action_dim)) – Action chosen by the behavior policy.
reward (array-like of shape (n_trajectories * step_per_trajectory, )) – Observed immediate rewards.
pscore (array-like of shape (n_trajectories * step_per_trajectory, )) – Conditional action choice probability of the behavior policy, i.e., \(\pi_0(a \mid s)\)
evaluation_policy_action (array-like of shape (n_trajectories * step_per_trajectory, action_dim)) – Action chosen by the evaluation policy.
state_action_value_prediction (array-like of shape (n_trajectories * step_per_trajectory, n_action)) – \(\hat{Q}\) for all actions, i.e., \(\hat{Q}(s_t, a) \forall a \in \mathcal{A}\).
reward_scale (array-like of shape (n_partition, )) – Scale of the trajectory-wise reward used for x-axis of the CDF plot.
gamma (float, default=1.0) – Discount factor. The value should be within (0, 1].
kernel ({"gaussian", "epanechnikov", "triangular", "cosine", "uniform"}) – Name of the kernel function to smooth importance weights.
bandwidth (float, default=1.0 (> 0)) – Bandwidth hyperparameter of the kernel function.
action_scaler (d3rlpy.preprocessing.ActionScaler, default=None) – Scaling factor of action.
- Returns:
estimated_cumulative_distribution_function – Estimated cumulative distribution function for the pre-defined reward scale.
- Return type:
ndarray of shape (n_partition, ) or (n_episode, )
- estimate_mean(step_per_trajectory, action, reward, pscore, evaluation_policy_action, state_action_value_prediction, reward_scale, gamma=1.0, kernel='gaussian', bandwidth=1.0, action_scaler=None, **kwargs)
Estimate mean.
- Parameters:
step_per_trajectory (int (> 0)) – Number of timesteps in an episode.
action (array-like of shape (n_trajectories * step_per_trajectory, action_dim)) – Action chosen by the behavior policy.
reward (array-like of shape (n_trajectories * step_per_trajectory, )) – Observed immediate rewards.
pscore (array-like of shape (n_trajectories * step_per_trajectory, )) – Conditional action choice probability of the behavior policy, i.e., \(\pi_0(a \mid s)\)
evaluation_policy_action (array-like of shape (n_trajectories * step_per_trajectory, action_dim)) – Action chosen by the evaluation policy.
state_action_value_prediction (array-like of shape (n_trajectories * step_per_trajectory, n_action)) – \(\hat{Q}\) for all actions, i.e., \(\hat{Q}(s_t, a) \forall a \in \mathcal{A}\).
reward_scale (array-like of shape (n_partition, )) – Scale of the trajectory-wise reward used for x-axis of the CDF plot.
gamma (float, default=1.0) – Discount factor. The value should be within (0, 1].
kernel ({"gaussian", "epanechnikov", "triangular", "cosine", "uniform"}) – Name of the kernel function to smooth importance weights.
bandwidth (float, default=1.0 (> 0)) – Bandwidth hyperparameter of the kernel function.
action_scaler (d3rlpy.preprocessing.ActionScaler, default=None) – Scaling factor of action.
- Returns:
estimated_mean – Estimated mean of the reward under the evaluation policy.
- Return type:
- estimate_variance(step_per_trajectory, action, reward, pscore, evaluation_policy_action, state_action_value_prediction, reward_scale, gamma=1.0, kernel='gaussian', bandwidth=1.0, action_scaler=None, **kwargs)
Estimate variance.
- Parameters:
step_per_trajectory (int (> 0)) – Number of timesteps in an episode.
action (array-like of shape (n_trajectories * step_per_trajectory, action_dim)) – Action chosen by the behavior policy.
reward (array-like of shape (n_trajectories * step_per_trajectory, )) – Observed immediate rewards.
pscore (array-like of shape (n_trajectories * step_per_trajectory, )) – Conditional action choice probability of the behavior policy, i.e., \(\pi_0(a \mid s)\)
evaluation_policy_action (array-like of shape (n_trajectories * step_per_trajectory, action_dim)) – Action chosen by the evaluation policy.
state_action_value_prediction (array-like of shape (n_trajectories * step_per_trajectory, n_action)) – \(\hat{Q}\) for all actions, i.e., \(\hat{Q}(s_t, a) \forall a \in \mathcal{A}\).
reward_scale (array-like of shape (n_partition, )) – Scale of the trajectory-wise reward used for x-axis of the CDF plot.
gamma (float, default=1.0) – Discount factor. The value should be within (0, 1].
kernel ({"gaussian", "epanechnikov", "triangular", "cosine", "uniform"}) – Name of the kernel function to smooth importance weights.
bandwidth (float, default=1.0 (> 0)) – Bandwidth hyperparameter of the kernel function.
action_scaler (d3rlpy.preprocessing.ActionScaler, default=None) – Scaling factor of action.
- Returns:
estimated_variance – Estimated variance of the reward under the evaluation policy.
- Return type:
- estimate_conditional_value_at_risk(step_per_trajectory, action, reward, pscore, evaluation_policy_action, state_action_value_prediction, reward_scale, gamma=1.0, kernel='gaussian', bandwidth=1.0, action_scaler=None, alphas=None, **kwargs)
Estimate conditional value at risk.
- Parameters:
step_per_trajectory (int (> 0)) – Number of timesteps in an episode.
action (array-like of shape (n_trajectories * step_per_trajectory, action_dim)) – Action chosen by the behavior policy.
reward (array-like of shape (n_trajectories * step_per_trajectory, )) – Observed immediate rewards.
pscore (array-like of shape (n_trajectories * step_per_trajectory, )) – Conditional action choice probability of the behavior policy, i.e., \(\pi_0(a \mid s)\)
evaluation_policy_action (array-like of shape (n_trajectories * step_per_trajectory, action_dim)) – Action chosen by the evaluation policy.
state_action_value_prediction (array-like of shape (n_trajectories * step_per_trajectory, n_action)) – \(\hat{Q}\) for all actions, i.e., \(\hat{Q}(s_t, a) \forall a \in \mathcal{A}\).
reward_scale (array-like of shape (n_partition, )) – Scale of the trajectory-wise reward used for x-axis of the CDF plot.
gamma (float, default=1.0) – Discount factor. The value should be within (0, 1].
kernel ({"gaussian", "epanechnikov", "triangular", "cosine", "uniform"}) – Name of the kernel function to smooth importance weights.
bandwidth (float, default=1.0 (> 0)) – Bandwidth hyperparameter of the kernel function.
action_scaler (d3rlpy.preprocessing.ActionScaler, default=None) – Scaling factor of action.
alphas (array-like of shape (n_alpha, ), default=None) – Set of proportions of the shaded region. The values should be within [0, 1). If None is given, np.linspace(0, 1, 21) will be used.
- Returns:
estimated_conditional_value_at_risk – Estimated conditional value at risk (CVaR) of the reward under the evaluation policy.
- Return type:
ndarray of shape (n_alpha, )
- estimate_interquartile_range(step_per_trajectory, action, reward, pscore, evaluation_policy_action, state_action_value_prediction, reward_scale, gamma=1.0, kernel='gaussian', bandwidth=1.0, action_scaler=None, alpha=0.05, **kwargs)
Estimate interquartile range.
- Parameters:
step_per_trajectory (int (> 0)) – Number of timesteps in an episode.
action (array-like of shape (n_trajectories * step_per_trajectory, action_dim)) – Action chosen by the behavior policy.
reward (array-like of shape (n_trajectories * step_per_trajectory, )) – Observed immediate rewards.
pscore (array-like of shape (n_trajectories * step_per_trajectory, )) – Conditional action choice probability of the behavior policy, i.e., \(\pi_0(a \mid s)\)
evaluation_policy_action (array-like of shape (n_trajectories * step_per_trajectory, action_dim)) – Action chosen by the evaluation policy.
state_action_value_prediction (array-like of shape (n_trajectories * step_per_trajectory, n_action)) – \(\hat{Q}\) for all actions, i.e., \(\hat{Q}(s_t, a) \forall a \in \mathcal{A}\).
reward_scale (array-like of shape (n_partition, )) – Scale of the trajectory-wise reward used for x-axis of the CDF plot.
gamma (float, default=1.0) – Discount factor. The value should be within (0, 1].
kernel ({"gaussian", "epanechnikov", "triangular", "cosine", "uniform"}) – Name of the kernel function to smooth importance weights.
bandwidth (float, default=1.0 (> 0)) – Bandwidth hyperparameter of the kernel function.
action_scaler (d3rlpy.preprocessing.ActionScaler, default=None) – Scaling factor of action.
alpha (float, default=0.05) – Proportion of the shaded region.
- Returns:
estimated_interquartile_range – Estimated interquartile range of the reward under the evaluation policy.
- Return type:
- class scope_rl.ope.continuous.cumulative_distribution_estimators.CumulativeDistributionSNTIS(estimator_name='cdf_sntis')
Self Normalized Trajectory-wise Importance Sampling (SNTIS) for estimating the cumulative distribution function (CDF) for continuous action spaces.
Bases: scope_rl.ope.continuous.CumulativeDistributionTIS -> scope_rl.ope.BaseCumulativeDistributionOPEEstimator
Imported as: scope_rl.ope.continuous.CumulativeDistributionSNTIS
Note
SNTIS estimates the CDF via trajectory-wise importance weighting as follows.
\[\hat{F}_{\mathrm{SNTIS}}(m, \pi; \mathcal{D}) := \sum_{i=1}^n \frac{w_{0:T-1}^{(i)} \delta(\pi, a_{0:T-1}^{(i)})}{\sum_{i'=1}^n w_{0:T-1}^{(i')} \delta(\pi, a_{0:T-1}^{(i')})} \mathbb{I} \left \{\sum_{t=0}^{T-1} \gamma^t r_t^{(i)} \leq m \right \}\]
where \(\hat{F}(\cdot)\) is the estimated cumulative distribution function, \(w_{0:T-1} := \prod_{t=0}^{T-1} (\pi(a_t \mid s_t) / \pi_0(a_t \mid s_t))\) is the trajectory-wise importance weight, and \(\mathbb{I} \{ \cdot \}\) is the indicator function. \(\delta(\pi, a_{0:T-1}) = \prod_{t=0}^{T-1} K(\pi(s_t), a_t)\) quantifies the similarity between the action logged in the dataset and that taken by the evaluation policy. The kernel bandwidth is an important hyperparameter: a larger bandwidth typically reduces the variance of the estimator but increases its bias.
The self-normalized estimator has a bounded variance.
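The effect of self-normalization can be seen in a short sketch that takes precomputed trajectory weights \(w_{0:T-1}^{(i)} \delta(\pi, a_{0:T-1}^{(i)})\) and discounted returns as inputs. The function name and the toy numbers are assumptions of the example.

```python
from typing import List, Sequence


def sntis_cdf(weights: Sequence[float], returns: Sequence[float],
              reward_scale: Sequence[float]) -> List[float]:
    """Self-normalized weighted CDF: weights are divided by their sum, not by n."""
    total = sum(weights)
    if total <= 0.0:
        raise ValueError("trajectory weights must have a positive sum")
    return [sum(w for w, g in zip(weights, returns) if g <= m) / total
            for m in reward_scale]


# Hypothetical weights for two trajectories with returns 1.0 and 2.0.
weights = [3.0, 1.0]
returns = [1.0, 2.0]
cdf_sntis = sntis_cdf(weights, returns, reward_scale=[0.5, 1.5, 2.5])
print(cdf_sntis)  # [0.0, 0.75, 1.0]
```

Dividing by n = 2 instead of the weight sum would give 1.5 and 2.0 at the last two thresholds, which are not valid probabilities; the normalized weights always sum to one, so the estimate stays within [0, 1].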
- Parameters:
estimator_name (str, default="cdf_sntis") – Name of the estimator.
References
Yash Chandak, Scott Niekum, Bruno Castro da Silva, Erik Learned-Miller, Emma Brunskill, and Philip S. Thomas. “Universal Off-Policy Evaluation.” 2021.
Audrey Huang, Liu Leqi, Zachary C. Lipton, and Kamyar Azizzadenesheli. “Off-Policy Risk Assessment in Contextual Bandits.” 2021.
Nathan Kallus and Angela Zhou. “Policy Evaluation and Optimization with Continuous Treatments.” 2019.
Methods
estimate_conditional_value_at_risk(...[, ...]) – Estimate conditional value at risk.
estimate_cumulative_distribution_function(...) – Estimate the cumulative distribution function (CDF) of the reward distribution under the evaluation policy.
estimate_interquartile_range(...[, gamma, ...]) – Estimate interquartile range.
estimate_mean(step_per_trajectory, action, ...) – Estimate mean.
estimate_variance(step_per_trajectory, ...) – Estimate variance.
- estimate_cumulative_distribution_function(step_per_trajectory, action, reward, pscore, evaluation_policy_action, reward_scale, gamma=1.0, kernel='gaussian', bandwidth=1.0, action_scaler=None, **kwargs)
Estimate the cumulative distribution function (CDF) of the reward distribution under the evaluation policy.
- Parameters:
step_per_trajectory (int (> 0)) – Number of timesteps in an episode.
action (array-like of shape (n_trajectories * step_per_trajectory, action_dim)) – Action chosen by the behavior policy.
reward (array-like of shape (n_trajectories * step_per_trajectory, )) – Observed immediate rewards.
pscore (array-like of shape (n_trajectories * step_per_trajectory, )) – Conditional action choice probability of the behavior policy, i.e., \(\pi_0(a \mid s)\).
evaluation_policy_action (array-like of shape (n_trajectories * step_per_trajectory, action_dim)) – Action chosen by the evaluation policy.
reward_scale (array-like of shape (n_partition, )) – Scale of the trajectory-wise reward used for x-axis of the CDF plot.
gamma (float, default=1.0) – Discount factor. The value should be within (0, 1].
kernel ({"gaussian", "epanechnikov", "triangular", "cosine", "uniform"}, default="gaussian") – Name of the kernel function used to smooth importance weights.
bandwidth (float (> 0), default=1.0) – Bandwidth hyperparameter of the kernel function.
action_scaler (d3rlpy.preprocessing.ActionScaler, default=None) – Scaling factor of action.
- Returns:
estimated_cumulative_distribution_function – Estimated cumulative distribution function for the pre-defined reward scale.
- Return type:
ndarray of shape (n_partition, ) or (n_episode, )
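To make the `kernel` and `bandwidth` parameters concrete, the sketch below writes out the textbook one-dimensional forms of the five named kernels and applies them to the distance between logged and evaluation-policy actions. The names `KERNELS` and `action_similarity`, and the use of the Euclidean distance for multi-dimensional actions, are assumptions for illustration; the library may scale or normalize the kernels differently.

```python
import numpy as np

# Textbook kernel definitions on the scaled distance u (assumed forms).
KERNELS = {
    "gaussian": lambda u: np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi),
    "epanechnikov": lambda u: 0.75 * (1.0 - u ** 2) * (np.abs(u) <= 1.0),
    "triangular": lambda u: (1.0 - np.abs(u)) * (np.abs(u) <= 1.0),
    "cosine": lambda u: (np.pi / 4.0) * np.cos(np.pi * u / 2.0) * (np.abs(u) <= 1.0),
    "uniform": lambda u: 0.5 * (np.abs(u) <= 1.0),
}


def action_similarity(evaluation_policy_action, action, kernel="gaussian", bandwidth=1.0):
    """Per-step similarity K(pi(s_t), a_t) for arrays of shape (n_steps, action_dim)."""
    diff = np.asarray(evaluation_policy_action, dtype=float) - np.asarray(action, dtype=float)
    # Scale the Euclidean distance by the bandwidth before applying the kernel.
    u = np.linalg.norm(diff, axis=-1) / bandwidth
    return KERNELS[kernel](u)
```

With a compactly supported kernel such as `"uniform"`, any step whose action is farther than one bandwidth from the evaluation policy's action receives zero weight, which zeroes out the whole trajectory-wise product. A larger bandwidth keeps more trajectories in play at the cost of matching actions less precisely.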
- class scope_rl.ope.continuous.cumulative_distribution_estimators.CumulativeDistributionSNTDR(estimator_name='cdf_sntdr')
Self Normalized Trajectory-wise Doubly Robust (SNTDR) for estimating the cumulative distribution function (CDF) for continuous action spaces.
Bases:
scope_rl.ope.continuous.CumulativeDistributionTDR -> scope_rl.ope.BaseCumulativeDistributionOPEEstimator
Imported as:
scope_rl.ope.continuous.CumulativeDistributionSNTDR
Note
SNTDR estimates the CDF via trajectory-wise importance weighting and an estimated Q-function \(\hat{Q}\) as follows.
\[\begin{split}\hat{F}_{\mathrm{SNTDR}}(m, \pi; \mathcal{D}) &:= \frac{1}{n} \sum_{i=1}^n \hat{G}(m; s_0^{(i)}, \pi(s_0^{(i)})) \\ & \quad \quad + \sum_{i=1}^n \frac{w_{0:T-1}^{(i)} \delta(\pi, a_{0:T-1}^{(i)})}{\sum_{i'=1}^n w_{0:T-1}^{(i')} \delta(\pi, a_{0:T-1}^{(i')})} \left( \mathbb{I} \left \{\sum_{t=0}^{T-1} \gamma^t r_t^{(i)} \leq m \right \} - \hat{G}(m; s_0^{(i)}, a_0^{(i)}) \right)\end{split}\]
where \(\hat{F}(\cdot)\) is the estimated cumulative distribution function and \(\hat{G}(\cdot)\) is an estimator for \(\mathbb{E} \left[ \mathbb{I} \left \{\sum_{t=0}^{T-1} \gamma^t r_t \leq m \right \} \mid s,a \right]\). \(w_{0:T-1} := \prod_{t=0}^{T-1} (\pi(a_t \mid s_t) / \pi_0(a_t \mid s_t))\) is the trajectory-wise importance weight and \(\mathbb{I} \{ \cdot \}\) is the indicator function. \(\delta(\pi, a_{0:T-1}) = \prod_{t=0}^{T-1} K(\pi(s_t), a_t)\) quantifies the similarity between the action logged in the dataset and the action taken by the evaluation policy, where \(K(\cdot, \cdot)\) is a kernel function. The bandwidth of the kernel is an important hyperparameter: a larger bandwidth typically reduces the variance of the estimator but increases its bias.
The self-normalized estimator has a bounded variance.
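The SNTDR formula above combines a model-based baseline with a self-normalized importance-weighted correction. The sketch below evaluates it for precomputed inputs. It is an illustration under stated assumptions, not the library's code: `sntdr_cdf` is a hypothetical name, `traj_weight` stands for the product \(w_{0:T-1} \delta(\pi, a_{0:T-1})\) per trajectory, and `g_hat_eval` and `g_hat_logged` are hypothetical arrays holding \(\hat{G}(m; s_0, \pi(s_0))\) and \(\hat{G}(m; s_0, a_0)\) on the threshold grid. How \(\hat{G}\) is derived from `state_action_value_prediction` is not covered here.

```python
import numpy as np


def sntdr_cdf(traj_weight, returns, g_hat_eval, g_hat_logged, reward_scale):
    """Illustrative self-normalized trajectory-wise doubly robust CDF estimate.

    traj_weight  : shape (n,), unnormalized trajectory-wise weights
    returns      : shape (n,), discounted trajectory-wise returns
    g_hat_eval   : shape (n, n_partition), G_hat at the evaluation policy's initial action
    g_hat_logged : shape (n, n_partition), G_hat at the logged initial action
    reward_scale : shape (n_partition,), thresholds m
    """
    traj_weight = np.asarray(traj_weight, dtype=float)
    returns = np.asarray(returns, dtype=float)
    reward_scale = np.asarray(reward_scale, dtype=float)
    g_hat_eval = np.asarray(g_hat_eval, dtype=float)
    g_hat_logged = np.asarray(g_hat_logged, dtype=float)

    # Self-normalized weights sum to one across trajectories.
    normalized_weight = traj_weight / traj_weight.sum()

    # Indicator of each return falling at or below each threshold.
    indicator = (returns[:, None] <= reward_scale[None, :]).astype(float)

    # First term: model-based baseline averaged over initial states.
    baseline = g_hat_eval.mean(axis=0)

    # Second term: weighted residual between observed indicator and the model.
    correction = normalized_weight @ (indicator - g_hat_logged)

    return baseline + correction
```

When \(\hat{G}\) is identically zero the expression reduces to the SNTIS estimate. With an imperfect \(\hat{G}\) the result is not guaranteed to be monotone or to stay within [0, 1].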
- Parameters:
estimator_name (str, default="cdf_sntdr") – Name of the estimator.
References
Yash Chandak, Scott Niekum, Bruno Castro da Silva, Erik Learned-Miller, Emma Brunskill, and Philip S. Thomas. “Universal Off-Policy Evaluation.” 2021.
Audrey Huang, Liu Leqi, Zachary C. Lipton, and Kamyar Azizzadenesheli. “Off-Policy Risk Assessment in Contextual Bandits.” 2021.
Nathan Kallus and Angela Zhou. “Policy Evaluation and Optimization with Continuous Treatments.” 2019.
Methods
estimate_conditional_value_at_risk(...[, ...]) – Estimate conditional value at risk.
estimate_cumulative_distribution_function(...) – Estimate the cumulative distribution function (CDF) of the reward distribution under the evaluation policy.
estimate_interquartile_range(...[, gamma, ...]) – Estimate interquartile range.
estimate_mean(step_per_trajectory, action, ...) – Estimate mean.
estimate_variance(step_per_trajectory, ...) – Estimate variance.
- estimate_cumulative_distribution_function(step_per_trajectory, action, reward, pscore, evaluation_policy_action, state_action_value_prediction, reward_scale, gamma=1.0, kernel='gaussian', bandwidth=1.0, action_scaler=None, **kwargs)
Estimate the cumulative distribution function (CDF) of the reward distribution under the evaluation policy.
- Parameters:
step_per_trajectory (int (> 0)) – Number of timesteps in an episode.
action (array-like of shape (n_trajectories * step_per_trajectory, action_dim)) – Action chosen by the behavior policy.
reward (array-like of shape (n_trajectories * step_per_trajectory, )) – Observed immediate rewards.
pscore (array-like of shape (n_trajectories * step_per_trajectory, )) – Conditional action choice probability of the behavior policy, i.e., \(\pi_0(a \mid s)\).
evaluation_policy_action (array-like of shape (n_trajectories * step_per_trajectory, action_dim)) – Action chosen by the evaluation policy.
state_action_value_prediction (array-like of shape (n_trajectories * step_per_trajectory, n_action)) – \(\hat{Q}\) for all actions, i.e., \(\hat{Q}(s_t, a) \forall a \in \mathcal{A}\).
reward_scale (array-like of shape (n_partition, )) – Scale of the trajectory-wise reward used for x-axis of the CDF plot.
gamma (float, default=1.0) – Discount factor. The value should be within (0, 1].
kernel ({"gaussian", "epanechnikov", "triangular", "cosine", "uniform"}, default="gaussian") – Name of the kernel function used to smooth importance weights.
bandwidth (float (> 0), default=1.0) – Bandwidth hyperparameter of the kernel function.
action_scaler (d3rlpy.preprocessing.ActionScaler, default=None) – Scaling factor of action.
- Returns:
estimated_cumulative_distribution_function – Estimated cumulative distribution function for the pre-defined reward scale.
- Return type:
ndarray of shape (n_partition, ) or (n_episode, )
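The returned array holds one CDF value per entry of `reward_scale`, so summary statistics such as the mean, variance, and interquartile range can be read off that grid. The sketch below shows one plausible discretization. The helper name `cdf_to_statistics` and the choice to place each probability increment at the grid point where it appears are assumptions for illustration, and the estimator's own `estimate_mean`, `estimate_variance`, and `estimate_interquartile_range` methods may compute these quantities differently.

```python
import numpy as np


def cdf_to_statistics(cdf, reward_scale, alpha=0.25):
    """Read summary statistics off a CDF evaluated on a grid (illustrative).

    Assumes `cdf` is non-decreasing and reaches 1 at the end of `reward_scale`.
    """
    cdf = np.asarray(cdf, dtype=float)
    reward_scale = np.asarray(reward_scale, dtype=float)

    # Probability mass assigned to each grid point.
    pmf = np.diff(cdf, prepend=0.0)

    mean = float(pmf @ reward_scale)
    variance = float(pmf @ (reward_scale - mean) ** 2)

    def quantile(q):
        # First grid point whose CDF value reaches q.
        idx = np.searchsorted(cdf, q, side="left")
        return float(reward_scale[min(idx, len(reward_scale) - 1)])

    lower, upper = quantile(alpha), quantile(1.0 - alpha)
    return {
        "mean": mean,
        "variance": variance,
        "lower_quantile": lower,
        "upper_quantile": upper,
        "interquartile_range": upper - lower,
    }
```

A finer `reward_scale` grid reduces the discretization error of these statistics, since each probability increment is attributed to a single grid point.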