scope_rl.ope.discrete.cumulative_distribution_estimators.CumulativeDistributionSNTIS
- class scope_rl.ope.discrete.cumulative_distribution_estimators.CumulativeDistributionSNTIS(estimator_name='cdf_sntis')
Self Normalized Trajectory-wise Importance Sampling (SNTIS) for estimating the cumulative distribution function (CDF) for discrete action spaces.
Bases: scope_rl.ope.discrete.CumulativeDistributionTIS, scope_rl.ope.BaseCumulativeDistributionOPEEstimator
Imported as: scope_rl.ope.discrete.CumulativeDistributionSNTIS
Note
SNTIS estimates the CDF via trajectory-wise importance weighting as follows.
\[\hat{F}_{\mathrm{SNTIS}}(m, \pi; \mathcal{D}) := \sum_{i=1}^n \frac{w_{0:T-1}^{(i)}}{\sum_{i'=1}^n w_{0:T-1}^{(i')}} \mathbb{I} \left \{\sum_{t=0}^{T-1} \gamma^t r_t^{(i)} \leq m \right \}\]
where \(\hat{F}(\cdot)\) is the estimated cumulative distribution function, \(w_{0:T-1} := \prod_{t=0}^{T-1} (\pi(a_t \mid s_t) / \pi_0(a_t \mid s_t))\) is the trajectory-wise importance weight, and \(\mathbb{I} \{ \cdot \}\) is the indicator function.
The self-normalized estimator is no longer unbiased. However, because the normalized weights sum to one, every CDF estimate lies in [0, 1], so the variance is bounded, and the estimator remains consistent.
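The formula above can be sketched directly in NumPy. This is a minimal illustration and not the library's implementation: the function name `sntis_cdf` is hypothetical, and it assumes the flattened inputs store each trajectory's timesteps contiguously.

```python
import numpy as np


def sntis_cdf(step_per_trajectory, action, reward, pscore,
              evaluation_policy_action_dist, reward_scale, gamma=1.0):
    """Illustrative self-normalized trajectory-wise IS estimate of the CDF."""
    action = np.asarray(action)
    reward = np.asarray(reward, dtype=float).reshape(-1, step_per_trajectory)
    pscore = np.asarray(pscore, dtype=float).reshape(-1, step_per_trajectory)
    dist = np.asarray(evaluation_policy_action_dist, dtype=float)

    # pi(a_t | s_t): evaluation-policy probability of each logged action.
    pi_e = dist[np.arange(len(action)), action].reshape(-1, step_per_trajectory)

    # Trajectory-wise importance weight w_{0:T-1}, then self-normalization.
    weight = np.prod(pi_e / pscore, axis=1)
    weight = weight / weight.sum()

    # Discounted return of each trajectory.
    discount = gamma ** np.arange(step_per_trajectory)
    returns = (discount * reward).sum(axis=1)

    # Indicator {return <= m} for every threshold m on the reward scale.
    indicator = returns[:, None] <= np.asarray(reward_scale, dtype=float)[None, :]
    return (weight[:, None] * indicator).sum(axis=0)


# Two trajectories of two steps each, two actions, uniform behavior policy.
cdf = sntis_cdf(
    step_per_trajectory=2,
    action=[0, 1, 0, 0],
    reward=[1.0, 1.0, 0.0, 0.0],
    pscore=[0.5, 0.5, 0.5, 0.5],
    evaluation_policy_action_dist=[[0.8, 0.2]] * 4,
    reward_scale=[0.0, 1.0, 2.0],
)
```

Here the unnormalized weights are 0.64 and 2.56, so the normalized weights are 0.2 and 0.8, and the trajectory with return 0 contributes 0.8 of the mass at every threshold.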
- Parameters:
estimator_name (str, default="cdf_sntis") – Name of the estimator.
References
Yash Chandak, Scott Niekum, Bruno Castro da Silva, Erik Learned-Miller, Emma Brunskill, and Philip S. Thomas. “Universal Off-Policy Evaluation.” 2021.
Audrey Huang, Liu Leqi, Zachary C. Lipton, and Kamyar Azizzadenesheli. “Off-Policy Risk Assessment in Contextual Bandits.” 2021.
Methods
estimate_conditional_value_at_risk(...[, ...]) – Estimate conditional value at risk.
estimate_cumulative_distribution_function(...) – Estimate the cumulative distribution function (CDF) of the reward distribution under the evaluation policy.
estimate_interquartile_range(...[, gamma, alpha]) – Estimate interquartile range.
estimate_mean(step_per_trajectory, action, ...) – Estimate mean.
estimate_variance(step_per_trajectory, ...) – Estimate variance.
- estimate_cumulative_distribution_function(step_per_trajectory, action, reward, pscore, evaluation_policy_action_dist, reward_scale, gamma=1.0, **kwargs)
Estimate the cumulative distribution function (CDF) of the reward distribution under the evaluation policy.
- Parameters:
step_per_trajectory (int (> 0)) – Number of timesteps in an episode.
action (array-like of shape (n_trajectories * step_per_trajectory, )) – Action chosen by the behavior policy.
reward (array-like of shape (n_trajectories * step_per_trajectory, )) – Observed immediate rewards.
pscore (array-like of shape (n_trajectories * step_per_trajectory, )) – Conditional action choice probability of the behavior policy, i.e., \(\pi_0(a \mid s)\)
evaluation_policy_action_dist (array-like of shape (n_trajectories * step_per_trajectory, n_action)) – Conditional action distribution induced by the evaluation policy, i.e., \(\pi(a \mid s_t) \forall a \in \mathcal{A}\)
reward_scale (array-like of shape (n_partition, )) – Scale of the trajectory-wise reward used for x-axis of the CDF plot.
gamma (float, default=1.0) – Discount factor. The value should be within (0, 1].
- Returns:
estimated_cumulative_distribution_function – Estimated cumulative distribution function for the pre-defined reward scale.
- Return type:
ndarray of shape (n_partition, ) or (n_episode, )
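The expected input shapes can be illustrated with synthetic data. This sketch assumes that logged data is stored one row per trajectory and flattened in row-major order; all variable values are made up for illustration, and the choice of `reward_scale` grid is only one reasonable option.

```python
import numpy as np

n_trajectories, step_per_trajectory, n_action = 3, 4, 2
rng = np.random.default_rng(0)

# Hypothetical logged data, one row per trajectory.
action_2d = rng.integers(n_action, size=(n_trajectories, step_per_trajectory))
reward_2d = rng.normal(size=(n_trajectories, step_per_trajectory))
pscore_2d = np.full((n_trajectories, step_per_trajectory), 1.0 / n_action)

# Row-major flattening keeps each trajectory's timesteps contiguous,
# giving arrays of shape (n_trajectories * step_per_trajectory, ).
action = action_2d.reshape(-1)
reward = reward_2d.reshape(-1)
pscore = pscore_2d.reshape(-1)

# One evaluation-policy distribution over actions per logged step,
# shape (n_trajectories * step_per_trajectory, n_action).
evaluation_policy_action_dist = np.tile(
    [0.7, 0.3], (n_trajectories * step_per_trajectory, 1)
)

# A grid of thresholds spanning the observed discounted returns,
# shape (n_partition, ).
gamma = 1.0
returns = (gamma ** np.arange(step_per_trajectory) * reward_2d).sum(axis=1)
reward_scale = np.linspace(returns.min(), returns.max(), num=20)
```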
- estimate_conditional_value_at_risk(step_per_trajectory, action, reward, pscore, evaluation_policy_action_dist, reward_scale, gamma=1.0, alphas=None, **kwargs)
Estimate conditional value at risk.
- Parameters:
step_per_trajectory (int (> 0)) – Number of timesteps in an episode.
action (array-like of shape (n_trajectories * step_per_trajectory, )) – Action chosen by the behavior policy.
reward (array-like of shape (n_trajectories * step_per_trajectory, )) – Observed immediate rewards.
pscore (array-like of shape (n_trajectories * step_per_trajectory, )) – Conditional action choice probability of the behavior policy, i.e., \(\pi_0(a \mid s)\)
evaluation_policy_action_dist (array-like of shape (n_trajectories * step_per_trajectory, n_action)) – Conditional action distribution induced by the evaluation policy, i.e., \(\pi(a \mid s_t) \forall a \in \mathcal{A}\)
reward_scale (array-like of shape (n_partition, )) – Scale of the trajectory-wise reward used for x-axis of the CDF plot.
gamma (float, default=1.0) – Discount factor. The value should be within (0, 1].
alphas (array-like of shape (n_alpha, ), default=None) – Set of proportions of the shaded region. The values should be within [0, 1). If None is given, np.linspace(0, 1, 21) will be used.
- Returns:
estimated_conditional_value_at_risk – Estimated conditional value at risk (CVaR) of the reward under the evaluation policy.
- Return type:
ndarray of shape (n_alpha, )
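For intuition, a lower-tail CVaR can be read off an estimated CDF by averaging the reward grid over the worst `alpha` fraction of the probability mass. The sketch below uses that textbook definition under the assumption that `alpha` lies in (0, 1] and that the CDF reaches 1 at the last grid point; the helper name `cvar_from_cdf` is hypothetical, and the library's own computation may differ in detail.

```python
import numpy as np


def cvar_from_cdf(cdf, reward_scale, alpha):
    """Illustrative lower-tail CVaR at level alpha from a CDF on a grid."""
    cdf = np.asarray(cdf, dtype=float)
    reward_scale = np.asarray(reward_scale, dtype=float)

    # Clip the CDF at alpha so that only the lowest alpha mass remains,
    # then difference it to get the mass placed on each grid point.
    tail_cdf = np.minimum(cdf, alpha)
    tail_mass = np.diff(tail_cdf, prepend=0.0)

    # Average reward over the tail (the tail mass sums to alpha).
    return float((tail_mass * reward_scale).sum() / alpha)


cdf = [0.8, 0.8, 1.0]
reward_scale = [0.0, 1.0, 2.0]
cvar_half = cvar_from_cdf(cdf, reward_scale, alpha=0.5)  # worst 50% all have return 0
cvar_full = cvar_from_cdf(cdf, reward_scale, alpha=1.0)  # equals the mean
```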
- estimate_interquartile_range(step_per_trajectory, action, reward, pscore, evaluation_policy_action_dist, reward_scale, gamma=1.0, alpha=0.05, **kwargs)
Estimate interquartile range.
- Parameters:
step_per_trajectory (int (> 0)) – Number of timesteps in an episode.
action (array-like of shape (n_trajectories * step_per_trajectory, )) – Action chosen by the behavior policy.
reward (array-like of shape (n_trajectories * step_per_trajectory, )) – Observed immediate rewards.
pscore (array-like of shape (n_trajectories * step_per_trajectory, )) – Conditional action choice probability of the behavior policy, i.e., \(\pi_0(a \mid s)\)
evaluation_policy_action_dist (array-like of shape (n_trajectories * step_per_trajectory, n_action)) – Conditional action distribution induced by the evaluation policy, i.e., \(\pi(a \mid s_t) \forall a \in \mathcal{A}\)
reward_scale (array-like of shape (n_partition, )) – Scale of the trajectory-wise reward used for x-axis of the CDF plot.
gamma (float, default=1.0) – Discount factor. The value should be within (0, 1].
alpha (float, default=0.05) – Proportion of the shaded region.
- Returns:
estimated_interquartile_range – Estimated interquartile range of the reward under the evaluation policy.
key: [ mean, {100 * (1. - alpha)}% quartile (lower), {100 * (1. - alpha)}% quartile (upper), ]
- Return type:
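As a rough illustration of how quantiles can be extracted from an estimated CDF, the sketch below takes the smallest grid point whose CDF value reaches the target level. It assumes the lower and upper bounds correspond to the `alpha / 2` and `1 - alpha / 2` quantiles, which is an assumption made for this example rather than something stated above; both helper names are hypothetical.

```python
import numpy as np


def quantile_from_cdf(cdf, reward_scale, q):
    """Smallest grid point whose CDF value is at least q (illustrative)."""
    cdf = np.asarray(cdf, dtype=float)
    reward_scale = np.asarray(reward_scale, dtype=float)
    # side="left" returns the first index i with cdf[i] >= q.
    idx = np.searchsorted(cdf, q, side="left")
    # Guard against a CDF that falls slightly short of q at the last point.
    idx = min(idx, len(reward_scale) - 1)
    return float(reward_scale[idx])


def quantile_range_from_cdf(cdf, reward_scale, alpha=0.05):
    """Lower and upper quantiles enclosing the central 1 - alpha mass."""
    lower = quantile_from_cdf(cdf, reward_scale, alpha / 2)
    upper = quantile_from_cdf(cdf, reward_scale, 1 - alpha / 2)
    return lower, upper


bounds = quantile_range_from_cdf(
    cdf=[0.1, 0.5, 0.9, 1.0],
    reward_scale=[0.0, 1.0, 2.0, 3.0],
    alpha=0.5,
)
```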
- estimate_mean(step_per_trajectory, action, reward, pscore, evaluation_policy_action_dist, reward_scale, gamma=1.0, **kwargs)
Estimate mean.
- Parameters:
step_per_trajectory (int (> 0)) – Number of timesteps in an episode.
action (array-like of shape (n_trajectories * step_per_trajectory, )) – Action chosen by the behavior policy.
reward (array-like of shape (n_trajectories * step_per_trajectory, )) – Observed immediate rewards.
pscore (array-like of shape (n_trajectories * step_per_trajectory, )) – Conditional action choice probability of the behavior policy, i.e., \(\pi_0(a \mid s)\)
evaluation_policy_action_dist (array-like of shape (n_trajectories * step_per_trajectory, n_action)) – Conditional action distribution induced by the evaluation policy, i.e., \(\pi(a \mid s_t) \forall a \in \mathcal{A}\)
reward_scale (array-like of shape (n_partition, )) – Scale of the trajectory-wise reward used for x-axis of the CDF plot.
gamma (float, default=1.0) – Discount factor. The value should be within (0, 1].
- Returns:
estimated_mean – Estimated mean of the reward under the evaluation policy.
- Return type:
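One way to see how a mean follows from an estimated CDF is to difference the CDF into per-grid-point probability mass and take the weighted sum of the grid. This is a minimal sketch assuming the CDF reaches 1 at the last grid point; `mean_from_cdf` is a hypothetical helper, not a library function.

```python
import numpy as np


def mean_from_cdf(cdf, reward_scale):
    """Illustrative mean of a distribution given by its CDF on a grid."""
    cdf = np.asarray(cdf, dtype=float)
    reward_scale = np.asarray(reward_scale, dtype=float)
    # Probability mass on each grid point: F(m_k) - F(m_{k-1}), with F(m_{-1}) = 0.
    mass = np.diff(cdf, prepend=0.0)
    return float((mass * reward_scale).sum())


mean = mean_from_cdf(cdf=[0.8, 0.8, 1.0], reward_scale=[0.0, 1.0, 2.0])
```

The accuracy of such an estimate depends on how finely `reward_scale` resolves the observed returns.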
- estimate_variance(step_per_trajectory, action, reward, pscore, evaluation_policy_action_dist, reward_scale, gamma=1.0, **kwargs)
Estimate variance.
- Parameters:
step_per_trajectory (int (> 0)) – Number of timesteps in an episode.
action (array-like of shape (n_trajectories * step_per_trajectory, )) – Action chosen by the behavior policy.
reward (array-like of shape (n_trajectories * step_per_trajectory, )) – Observed immediate rewards.
pscore (array-like of shape (n_trajectories * step_per_trajectory, )) – Conditional action choice probability of the behavior policy, i.e., \(\pi_0(a \mid s)\)
evaluation_policy_action_dist (array-like of shape (n_trajectories * step_per_trajectory, n_action)) – Conditional action distribution induced by the evaluation policy, i.e., \(\pi(a \mid s_t) \forall a \in \mathcal{A}\)
reward_scale (array-like of shape (n_partition, )) – Scale of the trajectory-wise reward used for x-axis of the CDF plot.
gamma (float, default=1.0) – Discount factor. The value should be within (0, 1].
- Returns:
estimated_variance – Estimated variance of the reward under the evaluation policy.
- Return type:
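A variance can be derived from an estimated CDF in a similar spirit: convert the CDF into probability mass on the reward grid, then take the mass-weighted squared deviation from the mean. The sketch assumes the CDF reaches 1 at the last grid point; `variance_from_cdf` is a hypothetical helper used only for illustration.

```python
import numpy as np


def variance_from_cdf(cdf, reward_scale):
    """Illustrative variance of a distribution given by its CDF on a grid."""
    cdf = np.asarray(cdf, dtype=float)
    reward_scale = np.asarray(reward_scale, dtype=float)
    # Probability mass on each grid point.
    mass = np.diff(cdf, prepend=0.0)
    mean = (mass * reward_scale).sum()
    # Mass-weighted squared deviation from the mean.
    return float((mass * (reward_scale - mean) ** 2).sum())


variance = variance_from_cdf(cdf=[0.8, 0.8, 1.0], reward_scale=[0.0, 1.0, 2.0])
```

With mass 0.8 at return 0 and 0.2 at return 2, the mean is 0.4 and the variance is 0.8 * 0.16 + 0.2 * 2.56 = 0.64.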