scope_rl.ope.continuous.cumulative_distribution_estimators.CumulativeDistributionTIS

class scope_rl.ope.continuous.cumulative_distribution_estimators.CumulativeDistributionTIS(estimator_name='cdf_tis')

Trajectory-wise Importance Sampling (TIS) for estimating the cumulative distribution function (CDF) for continuous action spaces.

Bases: scope_rl.ope.BaseCumulativeDistributionOPEEstimator

Imported as: scope_rl.ope.continuous.CumulativeDistributionTIS

Note

TIS estimates the CDF via trajectory-wise importance weighting as follows.

\[\hat{F}_{\mathrm{TIS}}(m, \pi; \mathcal{D}) := \frac{1}{n} \sum_{i=1}^n w_{0:T-1}^{(i)} \delta(\pi, a_{0:T-1}^{(i)}) \mathbb{I} \left \{\sum_{t=0}^{T-1} \gamma^t r_t^{(i)} \leq m \right \}\]

where \(\hat{F}(\cdot)\) is the estimated cumulative distribution function, \(m\) is the threshold on the discounted trajectory-wise reward at which the CDF is evaluated, \(w_{0:T-1} := \prod_{t=0}^{T-1} (1 / \pi_0(a_t \mid s_t))\) is the trajectory-wise importance weight, and \(\mathbb{I} \{ \cdot \}\) is the indicator function. \(\delta(\pi, a_{0:T-1}) = \prod_{t=0}^{T-1} K(\pi(s_t), a_t)\) quantifies the similarity between the actions logged in the dataset and those taken by the evaluation policy, where \(K(\cdot, \cdot)\) is a kernel function. The kernel bandwidth is an important hyperparameter: a larger bandwidth typically reduces the variance of the estimator but increases its bias.

TIS corrects the distribution shift between the behavior and evaluation policies. However, when the trajectory length (\(T\)) is large, TIS suffers from high variance due to the product of importance weights over the entire horizon.
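
The estimator above can be sketched in plain NumPy. This is a minimal illustration under stated assumptions, not SCOPE-RL's implementation: the names `tis_cdf_sketch` and `gaussian_kernel` are hypothetical, the evaluation policy is assumed to be deterministic, and the per-step similarity is assumed to be a product Gaussian kernel over action dimensions, scaled by `1 / bandwidth`.

```python
import numpy as np


def gaussian_kernel(u):
    """Standard Gaussian density, used here as the kernel K (an assumption)."""
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)


def tis_cdf_sketch(step_per_trajectory, action, reward, pscore,
                   evaluation_policy_action, reward_scale,
                   gamma=1.0, bandwidth=1.0):
    """Hypothetical helper: kernelized trajectory-wise IS estimate of the CDF."""
    T = step_per_trajectory
    action = np.asarray(action, dtype=float)
    eval_action = np.asarray(evaluation_policy_action, dtype=float)

    # Assumption: per-step similarity is a product kernel over action dims.
    u = (eval_action - action) / bandwidth
    similarity = np.prod(gaussian_kernel(u) / bandwidth, axis=1)

    # Per-step weight K(pi(s_t), a_t) / pi_0(a_t | s_t), multiplied over the horizon.
    step_weight = (similarity / np.asarray(pscore, dtype=float)).reshape(-1, T)
    trajectory_weight = step_weight.prod(axis=1)

    # Discounted trajectory-wise reward: sum_t gamma^t r_t.
    discount = gamma ** np.arange(T)
    rewards = np.asarray(reward, dtype=float).reshape(-1, T)
    trajectory_return = (rewards * discount).sum(axis=1)

    # Weighted fraction of trajectories whose return is at most each threshold m.
    reward_scale = np.asarray(reward_scale, dtype=float)
    indicator = trajectory_return[:, None] <= reward_scale[None, :]
    return (trajectory_weight[:, None] * indicator).mean(axis=0)
```

When the evaluation action equals the logged action and `pscore` equals `K(0) / bandwidth`, every weight is one and the sketch reduces to the empirical CDF of the discounted returns. The raw estimate is not clipped or normalized here, so it can exceed 1 when the weights are large.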

Parameters:

estimator_name (str, default="cdf_tis") – Name of the estimator.

References

Yash Chandak, Scott Niekum, Bruno Castro da Silva, Erik Learned-Miller, Emma Brunskill, and Philip S. Thomas. “Universal Off-Policy Evaluation.” 2021.

Audrey Huang, Liu Leqi, Zachary C. Lipton, and Kamyar Azizzadenesheli. “Off-Policy Risk Assessment in Contextual Bandits.” 2021.

Nathan Kallus and Angela Zhou. “Policy Evaluation and Optimization with Continuous Treatments.” 2019.

Methods

`estimate_conditional_value_at_risk`(...[, ...])	Estimate conditional value at risk.
`estimate_cumulative_distribution_function`(...)	Estimate the cumulative distribution function (CDF) of the reward distribution under the evaluation policy.
`estimate_interquartile_range`(...[, gamma, ...])	Estimate interquartile range.
`estimate_mean`(step_per_trajectory, action, ...)	Estimate mean.
`estimate_variance`(step_per_trajectory, ...)	Estimate variance.

estimate_cumulative_distribution_function(step_per_trajectory, action, reward, pscore, evaluation_policy_action, reward_scale, gamma=1.0, kernel='gaussian', bandwidth=1.0, action_scaler=None, **kwargs)

Estimate the cumulative distribution function (CDF) of the reward distribution under the evaluation policy.

Parameters:

step_per_trajectory (int (> 0)) – Number of timesteps in an episode.
action (array-like of shape (n_trajectories * step_per_trajectory, action_dim)) – Action chosen by the behavior policy.
reward (array-like of shape (n_trajectories * step_per_trajectory, )) – Observed immediate rewards.
pscore (array-like of shape (n_trajectories * step_per_trajectory, )) – Conditional action choice probability of the behavior policy, i.e., \(\pi_0(a \mid s)\)
evaluation_policy_action (array-like of shape (n_trajectories * step_per_trajectory, action_dim)) – Action chosen by the evaluation policy.
reward_scale (array-like of shape (n_partition, )) – Scale of the trajectory-wise reward used for x-axis of the CDF plot.
gamma (float, default=1.0) – Discount factor. The value should be within (0, 1].
kernel ({"gaussian", "epanechnikov", "triangular", "cosine", "uniform"}, default="gaussian") – Name of the kernel function to smooth importance weights.
bandwidth (float, default=1.0 (> 0)) – Bandwidth hyperparameter of the kernel function.
action_scaler (d3rlpy.preprocessing.ActionScaler, default=None) – Scaling factor of action.

Returns:

estimated_cumulative_distribution_function – Estimated cumulative distribution function for the pre-defined reward scale.

Return type:

ndarray of shape (n_partition, ) or (n_episode, )
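
The inputs are flattened so that consecutive blocks of `step_per_trajectory` rows belong to one trajectory. The sketch below shows that layout on synthetic data and one way to build a `reward_scale` grid from the observed discounted returns. The helper `make_reward_scale` and the choice of an evenly spaced grid are assumptions made for illustration, not part of the library.

```python
import numpy as np

n_trajectories, step_per_trajectory, action_dim = 3, 4, 2
rng = np.random.default_rng(0)

# Flattened logged data: rows 0..3 form trajectory 0, rows 4..7 trajectory 1, etc.
action = rng.normal(size=(n_trajectories * step_per_trajectory, action_dim))
reward = rng.uniform(size=n_trajectories * step_per_trajectory)


def make_reward_scale(reward, step_per_trajectory, gamma=1.0, n_partition=20):
    """Hypothetical helper: evenly spaced grid spanning the observed returns."""
    discount = gamma ** np.arange(step_per_trajectory)
    rewards = np.asarray(reward, dtype=float).reshape(-1, step_per_trajectory)
    returns = (rewards * discount).sum(axis=1)
    return np.linspace(returns.min(), returns.max(), n_partition)


reward_scale = make_reward_scale(reward, step_per_trajectory, gamma=0.99)
```

The resulting `reward_scale` has shape `(n_partition, )`, which matches the first of the documented return shapes.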

estimate_mean(step_per_trajectory, action, reward, pscore, evaluation_policy_action, reward_scale, gamma=1.0, kernel='gaussian', bandwidth=1.0, action_scaler=None, **kwargs)

Estimate mean.

Parameters:

step_per_trajectory (int (> 0)) – Number of timesteps in an episode.
action (array-like of shape (n_trajectories * step_per_trajectory, action_dim)) – Action chosen by the behavior policy.
reward (array-like of shape (n_trajectories * step_per_trajectory, )) – Observed immediate rewards.
pscore (array-like of shape (n_trajectories * step_per_trajectory, )) – Conditional action choice probability of the behavior policy, i.e., \(\pi_0(a \mid s)\)
evaluation_policy_action (array-like of shape (n_trajectories * step_per_trajectory, action_dim)) – Action chosen by the evaluation policy.
reward_scale (array-like of shape (n_partition, )) – Scale of the trajectory-wise reward used for x-axis of the CDF plot.
gamma (float, default=1.0) – Discount factor. The value should be within (0, 1].
kernel ({"gaussian", "epanechnikov", "triangular", "cosine", "uniform"}, default="gaussian") – Name of the kernel function to smooth importance weights.
bandwidth (float, default=1.0 (> 0)) – Bandwidth hyperparameter of the kernel function.
action_scaler (d3rlpy.preprocessing.ActionScaler, default=None) – Scaling factor of action.

Returns:

estimated_mean – Estimated mean of the reward under the evaluation policy.

Return type:

float
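
One way to obtain a mean from an estimated CDF is to difference the CDF into probability masses on the `reward_scale` grid and take the weighted sum. The sketch below assumes a monotone CDF evaluated on `reward_scale`; the function name `mean_from_cdf` is hypothetical, and the library may compute the mean differently.

```python
import numpy as np


def mean_from_cdf(cdf, reward_scale):
    """Hypothetical helper: mean of the discrete distribution implied by a CDF."""
    cdf = np.asarray(cdf, dtype=float)
    reward_scale = np.asarray(reward_scale, dtype=float)
    # Mass at each grid point: F(m_j) - F(m_{j-1}), with F(m_{-1}) taken as 0.
    mass = np.diff(np.concatenate([[0.0], cdf]))
    return float((mass * reward_scale).sum())
```

For example, a CDF of `[0.5, 0.5, 1.0]` on the grid `[0, 1, 2]` places half of the mass at 0 and half at 2.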

estimate_variance(step_per_trajectory, action, reward, pscore, evaluation_policy_action, reward_scale, gamma=1.0, kernel='gaussian', bandwidth=1.0, action_scaler=None, **kwargs)

Estimate variance.

Parameters:

step_per_trajectory (int (> 0)) – Number of timesteps in an episode.
action (array-like of shape (n_trajectories * step_per_trajectory, action_dim)) – Action chosen by the behavior policy.
reward (array-like of shape (n_trajectories * step_per_trajectory, )) – Observed immediate rewards.
pscore (array-like of shape (n_trajectories * step_per_trajectory, )) – Conditional action choice probability of the behavior policy, i.e., \(\pi_0(a \mid s)\)
evaluation_policy_action (array-like of shape (n_trajectories * step_per_trajectory, action_dim)) – Action chosen by the evaluation policy.
reward_scale (array-like of shape (n_partition, )) – Scale of the trajectory-wise reward used for x-axis of the CDF plot.
gamma (float, default=1.0) – Discount factor. The value should be within (0, 1].
kernel ({"gaussian", "epanechnikov", "triangular", "cosine", "uniform"}, default="gaussian") – Name of the kernel function to smooth importance weights.
bandwidth (float, default=1.0 (> 0)) – Bandwidth hyperparameter of the kernel function.
action_scaler (d3rlpy.preprocessing.ActionScaler, default=None) – Scaling factor of action.

Returns:

estimated_variance – Estimated variance of the reward under the evaluation policy.

Return type:

float
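
A variance can be read off an estimated CDF in the same spirit: convert the CDF to probability masses, then average the squared deviation from the mean. This is a sketch assuming a monotone CDF on the `reward_scale` grid; `variance_from_cdf` is a hypothetical name, and the library's computation may differ.

```python
import numpy as np


def variance_from_cdf(cdf, reward_scale):
    """Hypothetical helper: variance of the discrete distribution implied by a CDF."""
    cdf = np.asarray(cdf, dtype=float)
    reward_scale = np.asarray(reward_scale, dtype=float)
    # Mass at each grid point: F(m_j) - F(m_{j-1}), with F(m_{-1}) taken as 0.
    mass = np.diff(np.concatenate([[0.0], cdf]))
    mean = (mass * reward_scale).sum()
    return float((mass * (reward_scale - mean) ** 2).sum())
```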

estimate_conditional_value_at_risk(step_per_trajectory, action, reward, pscore, evaluation_policy_action, reward_scale, gamma=1.0, kernel='gaussian', bandwidth=1.0, action_scaler=None, alphas=None, **kwargs)

Estimate conditional value at risk.

Parameters:

step_per_trajectory (int (> 0)) – Number of timesteps in an episode.
action (array-like of shape (n_trajectories * step_per_trajectory, action_dim)) – Action chosen by the behavior policy.
reward (array-like of shape (n_trajectories * step_per_trajectory, )) – Observed immediate rewards.
pscore (array-like of shape (n_trajectories * step_per_trajectory, )) – Conditional action choice probability of the behavior policy, i.e., \(\pi_0(a \mid s)\)
evaluation_policy_action (array-like of shape (n_trajectories * step_per_trajectory, action_dim)) – Action chosen by the evaluation policy.
reward_scale (array-like of shape (n_partition, )) – Scale of the trajectory-wise reward used for x-axis of the CDF plot.
gamma (float, default=1.0) – Discount factor. The value should be within (0, 1].
kernel ({"gaussian", "epanechnikov", "triangular", "cosine", "uniform"}, default="gaussian") – Name of the kernel function to smooth importance weights.
bandwidth (float, default=1.0 (> 0)) – Bandwidth hyperparameter of the kernel function.
action_scaler (d3rlpy.preprocessing.ActionScaler, default=None) – Scaling factor of action.
alphas (array-like of shape (n_alpha, ), default=None) – Set of proportions of the shaded region. The values should be within [0, 1]. If None is given, np.linspace(0, 1, 21) will be used.

Returns:

estimated_conditional_value_at_risk – Estimated conditional value at risk (CVaR) of the reward under the evaluation policy.

Return type:

ndarray of shape (n_alpha, )
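
As an illustration of what CVaR measures, the sketch below computes the average reward over the lowest `alpha` fraction of the distribution implied by a CDF. It uses the standard lower-tail definition and assumes a monotone CDF that reaches 1 on the grid; the name `cvar_from_cdf` is hypothetical, the library's handling of the tail boundary may differ, and `alpha = 0` is excluded here because the tail is empty.

```python
import numpy as np


def cvar_from_cdf(cdf, reward_scale, alphas):
    """Hypothetical helper: lower-tail CVaR for each alpha in (0, 1]."""
    cdf = np.asarray(cdf, dtype=float)
    reward_scale = np.asarray(reward_scale, dtype=float)
    out = np.empty(len(alphas))
    for k, alpha in enumerate(alphas):
        # Cap the CDF at alpha so only mass inside the lower tail is counted.
        tail_cdf = np.minimum(cdf, alpha)
        tail_mass = np.diff(np.concatenate([[0.0], tail_cdf]))
        # Normalize by alpha to obtain the conditional expectation.
        out[k] = (tail_mass * reward_scale).sum() / alpha
    return out
```

With `alpha = 1.0` the tail covers the whole distribution, so the result coincides with the mean implied by the CDF.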

estimate_interquartile_range(step_per_trajectory, action, reward, pscore, evaluation_policy_action, reward_scale, gamma=1.0, kernel='gaussian', bandwidth=1.0, action_scaler=None, alpha=0.05, **kwargs)

Estimate interquartile range.

Parameters:

step_per_trajectory (int (> 0)) – Number of timesteps in an episode.
action (array-like of shape (n_trajectories * step_per_trajectory, action_dim)) – Action chosen by the behavior policy.
reward (array-like of shape (n_trajectories * step_per_trajectory, )) – Observed immediate rewards.
pscore (array-like of shape (n_trajectories * step_per_trajectory, )) – Conditional action choice probability of the behavior policy, i.e., \(\pi_0(a \mid s)\)
evaluation_policy_action (array-like of shape (n_trajectories * step_per_trajectory, action_dim)) – Action chosen by the evaluation policy.
reward_scale (array-like of shape (n_partition, )) – Scale of the trajectory-wise reward used for x-axis of the CDF plot.
gamma (float, default=1.0) – Discount factor. The value should be within (0, 1].
kernel ({"gaussian", "epanechnikov", "triangular", "cosine", "uniform"}, default="gaussian") – Name of the kernel function to smooth importance weights.
bandwidth (float, default=1.0 (> 0)) – Bandwidth hyperparameter of the kernel function.
action_scaler (d3rlpy.preprocessing.ActionScaler, default=None) – Scaling factor of action.
alpha (float, default=0.05) – Proportion of the shaded region.

Returns:

estimated_interquartile_range – Estimated interquartile range of the reward under the evaluation policy.

Return type:

dict
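
Quantile-based ranges can be read off a CDF by locating the first grid point at which the CDF reaches the target level. The sketch below assumes that `alpha` is the mass cut from each tail, so the range spans the `alpha` and `1 - alpha` quantiles. The function names and the dictionary keys (`"median"`, `"lower"`, `"upper"`) are hypothetical and are not the keys returned by the library.

```python
import numpy as np


def quantile_from_cdf(cdf, reward_scale, q):
    """Hypothetical helper: smallest grid value m with F(m) >= q."""
    cdf = np.asarray(cdf, dtype=float)
    reward_scale = np.asarray(reward_scale, dtype=float)
    idx = np.searchsorted(cdf, q, side="left")
    # Guard against a CDF that never reaches q on the grid.
    idx = min(idx, len(reward_scale) - 1)
    return float(reward_scale[idx])


def interquartile_range_from_cdf(cdf, reward_scale, alpha=0.05):
    """Hypothetical helper: median plus the alpha and (1 - alpha) quantiles."""
    return {
        "median": quantile_from_cdf(cdf, reward_scale, 0.5),
        "lower": quantile_from_cdf(cdf, reward_scale, alpha),
        "upper": quantile_from_cdf(cdf, reward_scale, 1.0 - alpha),
    }
```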
