scope_rl.ope.continuous.cumulative_distribution_estimators.CumulativeDistributionSNTDR#

class scope_rl.ope.continuous.cumulative_distribution_estimators.CumulativeDistributionSNTDR(estimator_name='cdf_sntdr')[source]#

Self Normalized Trajectory-wise Doubly Robust (SNTDR) for estimating the cumulative distribution function (CDF) for continuous action spaces.

Bases: scope_rl.ope.continuous.CumulativeDistributionTDR -> scope_rl.ope.BaseCumulativeDistributionOPEEstimator

Imported as: scope_rl.ope.continuous.CumulativeDistributionSNTDR

Note

SNTDR estimates the CDF via trajectory-wise importance weighting and estimated Q-function \(\hat{Q}\) as follows.

\[\begin{split}\hat{F}_{\mathrm{SNTDR}}(m, \pi; \mathcal{D})) &:= \frac{1}{n} \sum_{i=1}^n \hat{G}(m; s_0^{(i)}, \pi(s_0^{(i)})) \\ & \quad \quad + \sum_{i=1}^n \frac{w_{0:T-1}^{(i)} \delta(\pi, a_{0:T-1}^{(i)})}{\sum_{i'=1}^n w_{0:T-1}^{(i')} \delta(\pi, a_{0:T-1}^{(i')})} \left( \mathbb{I} \left \{\sum_{t=0}^{T-1} \gamma^t r_t^{(i)} \leq m \right \} - \hat{G}(m; s_0^{(i)}, a_0^{(i)}) \right)\end{split}\]

where \(\hat{F}(\cdot)\) is the estimated cumulative distribution function and \(\hat{G}(\cdot)\) is an estimator for \(\mathbb{E} \left[ \mathbb{I} \left \{\sum_{t=0}^{T-1} \gamma^t r_t \leq m \right \} \mid s,a \right]\). \(w_{0:T-1} := \prod_{t=0}^{T-1} (\pi(a_t \mid s_t) / \pi_0(a_t \mid s_t))\) is the trajectory-wise importance weight and \(\mathbb{I} \{ \cdot \}\) is the indicator function. \(\delta(\pi, a_{0:T-1}) = \prod_{t=0}^{T-1} K(\pi(s_t), a_t)\) quantifies the similarity between the action logged in the dataset and that taken by the evaluation policy. Note that the bandwidth of the kernel is an important hyperparameter; the variance of the above estimator often becomes small when the bandwidth of the kernel is large, while the bias often becomes large in those cases.

The self-normalized estimator has a bounded variance.

Parameters:

estimator_name (str, default="cdf_sntdr") – Name of the estimator.

References

Yash Chandak, Scott Niekum, Bruno Castro da Silva, Erik Learned-Miller, Emma Brunskill, and Philip S. Thomas. “Universal Off-Policy Evaluation.” 2021.

Audrey Huang, Liu Leqi, Zachary C. Lipton, and Kamyar Azizzadenesheli. “Off-Policy Risk Assessment in Contextual Bandits.” 2021.

Nathan Kallus and Angela Zhou. “Policy Evaluation and Optimization with Continuous Treatments.” 2019.

Methods

estimate_conditional_value_at_risk(...[, ...])

Estimate conditional value at risk.

estimate_cumulative_distribution_function(...)

Estimate the cumulative distribution function (CDF) of the reward distribution under the evaluation policy.

estimate_interquartile_range(...[, gamma, ...])

Estimate interquartile range.

estimate_mean(step_per_trajectory, action, ...)

Estimate mean.

estimate_variance(step_per_trajectory, ...)

Estimate variance.

estimate_cumulative_distribution_function(step_per_trajectory, action, reward, pscore, evaluation_policy_action, state_action_value_prediction, reward_scale, gamma=1.0, kernel='gaussian', bandwidth=1.0, action_scaler=None, **kwargs)[source]#

Estimate the cumulative distribution function (CDF) of the reward distribution under the evaluation policy.

Parameters:
  • step_per_trajectory (int (> 0)) – Number of timesteps in an episode.

  • action (array-like of shape (n_trajectories * step_per_trajectory, action_dim)) – Action chosen by the behavior policy.

  • reward (array-like of shape (n_trajectories * step_per_trajectory, )) – Observed immediate rewards.

  • pscore (array-like of shape (n_trajectories * step_per_trajectory, )) – Conditional action choice probability of the behavior policy, i.e., \(\pi_0(a \mid s)\)

  • evaluation_policy_action (array-like of shape (n_trajectories * step_per_trajectory, action_dim)) – Action chosen by the evaluation policy.

  • state_action_value_prediction (array-like of shape (n_trajectories * step_per_trajectory, n_action)) – \(\hat{Q}\) for all actions, i.e., \(\hat{Q}(s_t, a) \forall a \in \mathcal{A}\).

  • reward_scale (array-like of shape (n_partition, )) – Scale of the trajectory-wise reward used for x-axis of the CDF plot.

  • gamma (float, default=1.0) – Discount factor. The value should be within (0, 1].

  • kernel ({"gaussian", "epanechnikov", "triangular", "cosine", "uniform"}) – Name of the kernel function to smooth importance weights.

  • bandwidth (float, default=1.0 (> 0)) – Bandwidth hyperparameter of the kernel function.

  • action_scaler (d3rlpy.preprocessing.ActionScaler, default=None) – Scaling factor of action.

Returns:

estimated_cumulative_distribution_function – Estimated cumulative distribution function for the pre-defined reward scale.

Return type:

ndarray of shape (n_partition, ) or (n_episode, )

estimate_conditional_value_at_risk(step_per_trajectory, action, reward, pscore, evaluation_policy_action, state_action_value_prediction, reward_scale, gamma=1.0, kernel='gaussian', bandwidth=1.0, action_scaler=None, alphas=None, **kwargs)#

Estimate conditional value at risk.

Parameters:
  • step_per_trajectory (int (> 0)) – Number of timesteps in an episode.

  • action (array-like of shape (n_trajectories * step_per_trajectory, action_dim)) – Action chosen by the behavior policy.

  • reward (array-like of shape (n_trajectories * step_per_trajectory, )) – Observed immediate rewards.

  • pscore (array-like of shape (n_trajectories * step_per_trajectory, )) – Conditional action choice probability of the behavior policy, i.e., \(\pi_0(a \mid s)\)

  • evaluation_policy_action (array-like of shape (n_trajectories * step_per_trajectory, action_dim)) – Action chosen by the evaluation policy.

  • state_action_value_prediction (array-like of shape (n_trajectories * step_per_trajectory, n_action)) – \(\hat{Q}\) for all actions, i.e., \(\hat{Q}(s_t, a) \forall a \in \mathcal{A}\).

  • reward_scale (array-like of shape (n_partition, )) – Scale of the trajectory-wise reward used for x-axis of the CDF plot.

  • gamma (float, default=1.0) – Discount factor. The value should be within (0, 1].

  • kernel ({"gaussian", "epanechnikov", "triangular", "cosine", "uniform"}) – Name of the kernel function to smooth importance weights.

  • bandwidth (float, default=1.0 (> 0)) – Bandwidth hyperparameter of the kernel function.

  • action_scaler (d3rlpy.preprocessing.ActionScaler, default=None) – Scaling factor of action.

  • alphas (array-like of shape (n_alpha, ), default=None) – Set of proportions of the shaded region. The values should be within [0, 1). If None is given, np.linspace(0, 1, 21) will be used.

Returns:

estimated_conditional_value_at_risk – Estimated conditional value at risk (CVaR) of the reward under the evaluation policy.

Return type:

ndarray of (n_alpha, )

estimate_interquartile_range(step_per_trajectory, action, reward, pscore, evaluation_policy_action, state_action_value_prediction, reward_scale, gamma=1.0, kernel='gaussian', bandwidth=1.0, action_scaler=None, alpha=0.05, **kwargs)#

Estimate interquartile range.

Parameters:
  • step_per_trajectory (int (> 0)) – Number of timesteps in an episode.

  • action (array-like of shape (n_trajectories * step_per_trajectory, action_dim)) – Action chosen by the behavior policy.

  • reward (array-like of shape (n_trajectories * step_per_trajectory, )) – Observed immediate rewards.

  • pscore (array-like of shape (n_trajectories * step_per_trajectory, )) – Conditional action choice probability of the behavior policy, i.e., \(\pi_0(a \mid s)\)

  • evaluation_policy_action (array-like of shape (n_trajectories * step_per_trajectory, action_dim)) – Action chosen by the evaluation policy.

  • state_action_value_prediction (array-like of shape (n_trajectories * step_per_trajectory, n_action)) – \(\hat{Q}\) for all actions, i.e., \(\hat{Q}(s_t, a) \forall a \in \mathcal{A}\).

  • reward_scale (array-like of shape (n_partition, )) – Scale of the trajectory-wise reward used for x-axis of the CDF plot.

  • gamma (float, default=1.0) – Discount factor. The value should be within (0, 1].

  • kernel ({"gaussian", "epanechnikov", "triangular", "cosine", "uniform"}) – Name of the kernel function to smooth importance weights.

  • bandwidth (float, default=1.0 (> 0)) – Bandwidth hyperparameter of the kernel function.

  • action_scaler (d3rlpy.preprocessing.ActionScaler, default=None) – Scaling factor of action.

  • alpha (float, default=0.05) – Proportion of the shaded region.

Returns:

estimated_interquartile_range – Estimated interquartile range of the reward under the evaluation policy.

Return type:

dict

estimate_mean(step_per_trajectory, action, reward, pscore, evaluation_policy_action, state_action_value_prediction, reward_scale, gamma=1.0, kernel='gaussian', bandwidth=1.0, action_scaler=None, **kwargs)#

Estimate mean.

Parameters:
  • step_per_trajectory (int (> 0)) – Number of timesteps in an episode.

  • action (array-like of shape (n_trajectories * step_per_trajectory, action_dim)) – Action chosen by the behavior policy.

  • reward (array-like of shape (n_trajectories * step_per_trajectory, )) – Observed immediate rewards.

  • pscore (array-like of shape (n_trajectories * step_per_trajectory, )) – Conditional action choice probability of the behavior policy, i.e., \(\pi_0(a \mid s)\)

  • evaluation_policy_action (array-like of shape (n_trajectories * step_per_trajectory, action_dim)) – Action chosen by the evaluation policy.

  • state_action_value_prediction (array-like of shape (n_trajectories * step_per_trajectory, n_action)) – \(\hat{Q}\) for all actions, i.e., \(\hat{Q}(s_t, a) \forall a \in \mathcal{A}\).

  • reward_scale (array-like of shape (n_partition, )) – Scale of the trajectory-wise reward used for x-axis of the CDF plot.

  • gamma (float, default=1.0) – Discount factor. The value should be within (0, 1].

  • kernel ({"gaussian", "epanechnikov", "triangular", "cosine", "uniform"}) – Name of the kernel function to smooth importance weights.

  • bandwidth (float, default=1.0 (> 0)) – Bandwidth hyperparameter of the kernel function.

  • action_scaler (d3rlpy.preprocessing.ActionScaler, default=None) – Scaling factor of action.

Returns:

estimated_mean – Estimated mean of the reward under the evaluation policy.

Return type:

float

estimate_variance(step_per_trajectory, action, reward, pscore, evaluation_policy_action, state_action_value_prediction, reward_scale, gamma=1.0, kernel='gaussian', bandwidth=1.0, action_scaler=None, **kwargs)#

Estimate variance.

Parameters:
  • step_per_trajectory (int (> 0)) – Number of timesteps in an episode.

  • action (array-like of shape (n_trajectories * step_per_trajectory, action_dim)) – Action chosen by the behavior policy.

  • reward (array-like of shape (n_trajectories * step_per_trajectory, )) – Observed immediate rewards.

  • pscore (array-like of shape (n_trajectories * step_per_trajectory, )) – Conditional action choice probability of the behavior policy, i.e., \(\pi_0(a \mid s)\)

  • evaluation_policy_action (array-like of shape (n_trajectories * step_per_trajectory, action_dim)) – Action chosen by the evaluation policy.

  • state_action_value_prediction (array-like of shape (n_trajectories * step_per_trajectory, n_action)) – \(\hat{Q}\) for all actions, i.e., \(\hat{Q}(s_t, a) \forall a \in \mathcal{A}\).

  • reward_scale (array-like of shape (n_partition, )) – Scale of the trajectory-wise reward used for x-axis of the CDF plot.

  • gamma (float, default=1.0) – Discount factor. The value should be within (0, 1].

  • kernel ({"gaussian", "epanechnikov", "triangular", "cosine", "uniform"}) – Name of the kernel function to smooth importance weights.

  • bandwidth (float, default=1.0 (> 0)) – Bandwidth hyperparameter of the kernel function.

  • action_scaler (d3rlpy.preprocessing.ActionScaler, default=None) – Scaling factor of action.

Returns:

estimated_variance – Estimated variance of the reward under the evaluation policy.

Return type:

float

Methods