scope_rl.ope.continuous.cumulative_distribution_estimators.CumulativeDistributionTDR

class scope_rl.ope.continuous.cumulative_distribution_estimators.CumulativeDistributionTDR(estimator_name='cdf_tdr')

Trajectory-wise Doubly Robust (TDR) estimator of the cumulative distribution function (CDF) for continuous action spaces.

Bases: scope_rl.ope.BaseCumulativeDistributionOPEEstimator

Imported as: scope_rl.ope.continuous.CumulativeDistributionTDR

Note

TDR estimates the CDF by combining trajectory-wise importance weighting with an estimated Q-function \(\hat{Q}\), as follows.

\[\begin{split}\hat{F}_{\mathrm{TDR}}(m, \pi; \mathcal{D}) &:= \frac{1}{n} \sum_{i=1}^n \hat{G}(m; s_0^{(i)}, \pi(s_0^{(i)})) \\ & \quad \quad + \frac{1}{n} \sum_{i=1}^n w_{0:T-1}^{(i)} \delta(\pi, a_{0:T-1}^{(i)}) \left( \mathbb{I} \left \{\sum_{t=0}^{T-1} \gamma^t r_t^{(i)} \leq m \right \} - \hat{G}(m; s_0^{(i)}, a_0^{(i)}) \right)\end{split}\]

where \(\hat{F}(\cdot)\) is the estimated cumulative distribution function and \(\hat{G}(\cdot)\) is an estimator for \(\mathbb{E} \left[ \mathbb{I} \left \{\sum_{t=0}^{T-1} \gamma^t r_t \leq m \right \} \mid s,a \right]\). \(w_{0:T-1} := \prod_{t=0}^{T-1} (\pi(a_t \mid s_t) / \pi_0(a_t \mid s_t))\) is the trajectory-wise importance weight and \(\mathbb{I} \{ \cdot \}\) is the indicator function. \(\delta(\pi, a_{0:T-1}) = \prod_{t=0}^{T-1} K(\pi(s_t), a_t)\), where \(K(\cdot, \cdot)\) is a kernel function, quantifies the similarity between the actions logged in the dataset and those taken by the evaluation policy. Note that the bandwidth of the kernel is an important hyperparameter: a larger bandwidth typically reduces the variance of the above estimator but increases its bias.
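To make the role of the bandwidth concrete, here is a minimal NumPy sketch of a per-step Gaussian kernel weight \(K(\pi(s_t), a_t)\). The helper name `gaussian_kernel_similarity` and its normalization (a Gaussian density with covariance `bandwidth**2 * I`) are assumptions made for illustration and may differ from what the library computes internally.

```python
import numpy as np


def gaussian_kernel_similarity(eval_action, logged_action, bandwidth=1.0):
    """Per-step Gaussian kernel weight (hypothetical helper, not the library API).

    Both inputs have shape (n_steps, action_dim); the output has shape (n_steps,).
    """
    eval_action = np.asarray(eval_action, dtype=float)
    logged_action = np.asarray(logged_action, dtype=float)
    # Difference between the evaluation-policy action and the logged action,
    # measured in units of the bandwidth.
    scaled_diff = (eval_action - logged_action) / bandwidth
    sq_dist = np.sum(scaled_diff ** 2, axis=-1)
    action_dim = scaled_diff.shape[-1]
    # Normalizing constant of a Gaussian density with covariance bandwidth^2 * I.
    norm = (2.0 * np.pi) ** (action_dim / 2) * bandwidth ** action_dim
    return np.exp(-0.5 * sq_dist) / norm


# Step 0: the two actions coincide. Step 1: they differ by 1.0.
eval_a = np.array([[0.0], [0.5]])
logged_a = np.array([[0.0], [1.5]])
narrow = gaussian_kernel_similarity(eval_a, logged_a, bandwidth=0.5)
wide = gaussian_kernel_similarity(eval_a, logged_a, bandwidth=2.0)

# Relative weight of the mismatched step compared with the matched step.
narrow_ratio = narrow[1] / narrow[0]
wide_ratio = wide[1] / wide[0]
```

With the narrow bandwidth the mismatched step is almost discarded, while the wide bandwidth keeps most of its weight. Keeping more of the mismatched steps is what lowers variance, and treating them as if they matched is what adds bias.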

TDR corrects the distribution shift between the behavior and evaluation policies and often has lower variance than trajectory-wise importance sampling (TIS) when \(\hat{Q}(\cdot)\) is reasonably accurate and satisfies \(0 < \hat{Q}(\cdot) < 2 Q(\cdot)\). However, it may still suffer from high variance when the importance weights are large.
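The estimator above can be written out in a few lines of NumPy. The following is only a sketch of the formula, not the library's implementation: the function `tdr_cdf_sketch` and its argument names are hypothetical, and it assumes that the combined trajectory weight \(w_{0:T-1} \delta(\pi, a_{0:T-1})\) and the baseline values \(\hat{G}\) have already been computed elsewhere.

```python
import numpy as np


def tdr_cdf_sketch(reward, traj_weight, g_hat_logged, g_hat_eval, reward_scale, gamma=1.0):
    """Evaluate the TDR formula on a grid of thresholds (illustrative only).

    reward:        (n, T) immediate rewards per trajectory
    traj_weight:   (n,) precomputed w_{0:T-1} * delta for each trajectory
    g_hat_logged:  (n, n_partition) baseline G-hat(m; s_0, a_0)
    g_hat_eval:    (n, n_partition) baseline G-hat(m; s_0, pi(s_0))
    reward_scale:  (n_partition,) thresholds m
    """
    reward = np.asarray(reward, dtype=float)
    traj_weight = np.asarray(traj_weight, dtype=float)
    g_hat_logged = np.asarray(g_hat_logged, dtype=float)
    g_hat_eval = np.asarray(g_hat_eval, dtype=float)
    reward_scale = np.asarray(reward_scale, dtype=float)

    # Discounted return of each trajectory: sum_t gamma^t r_t.
    discount = gamma ** np.arange(reward.shape[1])
    returns = (reward * discount).sum(axis=1)

    # Indicator I{return <= m} for every trajectory and threshold.
    indicator = (returns[:, None] <= reward_scale[None, :]).astype(float)

    # First term: model-based baseline at the evaluation policy's initial action.
    baseline = g_hat_eval.mean(axis=0)
    # Second term: importance-weighted residual of the baseline on the logged data.
    correction = (traj_weight[:, None] * (indicator - g_hat_logged)).mean(axis=0)
    return baseline + correction


# Sanity check: with unit weights and a zero baseline, the formula reduces
# to the empirical CDF of the trajectory-wise returns.
reward = np.array([[1.0, 0.0], [0.0, 0.0], [1.0, 1.0]])
reward_scale = np.array([0.0, 1.0, 2.0])
zeros = np.zeros((3, 3))
cdf_hat = tdr_cdf_sketch(reward, np.ones(3), zeros, zeros, reward_scale)
```

Because the correction term can be negative, the raw output is not guaranteed to be monotone or to stay within [0, 1]; any clipping or monotone projection would be a separate post-processing choice.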

Parameters:

estimator_name (str, default="cdf_tdr") – Name of the estimator.

References

Yash Chandak, Scott Niekum, Bruno Castro da Silva, Erik Learned-Miller, Emma Brunskill, and Philip S. Thomas. “Universal Off-Policy Evaluation.” 2021.

Audrey Huang, Liu Leqi, Zachary C. Lipton, and Kamyar Azizzadenesheli. “Off-Policy Risk Assessment in Contextual Bandits.” 2021.

Nathan Kallus and Angela Zhou. “Policy Evaluation and Optimization with Continuous Treatments.” 2019.

Methods

estimate_conditional_value_at_risk(...[, ...])

Estimate conditional value at risk.

estimate_cumulative_distribution_function(...)

Estimate the cumulative distribution function (CDF) of the reward distribution under the evaluation policy.

estimate_interquartile_range(...[, gamma, ...])

Estimate interquartile range.

estimate_mean(step_per_trajectory, action, ...)

Estimate mean.

estimate_variance(step_per_trajectory, ...)

Estimate variance.

estimate_cumulative_distribution_function(step_per_trajectory, action, reward, pscore, evaluation_policy_action, state_action_value_prediction, reward_scale, gamma=1.0, kernel='gaussian', bandwidth=1.0, action_scaler=None, **kwargs)

Estimate the cumulative distribution function (CDF) of the reward distribution under the evaluation policy.

Parameters:
  • step_per_trajectory (int (> 0)) – Number of timesteps in an episode.

  • action (array-like of shape (n_trajectories * step_per_trajectory, action_dim)) – Action chosen by the behavior policy.

  • reward (array-like of shape (n_trajectories * step_per_trajectory, )) – Observed immediate rewards.

  • pscore (array-like of shape (n_trajectories * step_per_trajectory, )) – Conditional action choice probability of the behavior policy, i.e., \(\pi_0(a \mid s)\)

  • evaluation_policy_action (array-like of shape (n_trajectories * step_per_trajectory, action_dim)) – Action chosen by the evaluation policy.

  • state_action_value_prediction (array-like of shape (n_trajectories * step_per_trajectory, n_action)) – \(\hat{Q}\) for all actions, i.e., \(\hat{Q}(s_t, a) \forall a \in \mathcal{A}\).

  • reward_scale (array-like of shape (n_partition, )) – Scale of the trajectory-wise reward used for x-axis of the CDF plot.

  • gamma (float, default=1.0) – Discount factor. The value should be within (0, 1].

  • kernel ({"gaussian", "epanechnikov", "triangular", "cosine", "uniform"}) – Name of the kernel function to smooth importance weights.

  • bandwidth (float, default=1.0 (> 0)) – Bandwidth hyperparameter of the kernel function.

  • action_scaler (d3rlpy.preprocessing.ActionScaler, default=None) – Scaling factor of action.

Returns:

estimated_cumulative_distribution_function – Estimated cumulative distribution function for the pre-defined reward scale.

Return type:

ndarray of shape (n_partition, ) or (n_episode, )
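The per-step inputs listed above are flattened to length `n_trajectories * step_per_trajectory`. As a rough illustration of how such arrays can be regrouped into trajectories, the sketch below reshapes per-step kernel similarities and behavior-policy densities and multiplies them along the time axis. The helper `trajectory_wise_weight` is hypothetical, and treating the per-step ratio as `similarity / pscore` is an assumption suited to a deterministic evaluation policy.

```python
import numpy as np


def trajectory_wise_weight(similarity, pscore, step_per_trajectory):
    """Product over time of per-step ratios (hypothetical helper).

    similarity, pscore: flattened arrays of shape (n_trajectories * step_per_trajectory,)
    Returns an array of shape (n_trajectories,).
    """
    # Regroup the flattened steps so that each row is one trajectory.
    similarity = np.asarray(similarity, dtype=float).reshape(-1, step_per_trajectory)
    pscore = np.asarray(pscore, dtype=float).reshape(-1, step_per_trajectory)
    # Multiply the per-step ratios along the time axis.
    return np.prod(similarity / pscore, axis=1)


# Two trajectories with two steps each, stored back to back.
similarity = np.array([0.5, 0.5, 1.0, 1.0])
pscore = np.array([0.5, 0.25, 0.5, 0.5])
weights = trajectory_wise_weight(similarity, pscore, step_per_trajectory=2)
```

Since the weight is a product over all steps, a few large per-step ratios compound quickly, which is the source of the high variance mentioned for long horizons.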

estimate_mean(step_per_trajectory, action, reward, pscore, evaluation_policy_action, state_action_value_prediction, reward_scale, gamma=1.0, kernel='gaussian', bandwidth=1.0, action_scaler=None, **kwargs)

Estimate mean.

Parameters:
  • step_per_trajectory (int (> 0)) – Number of timesteps in an episode.

  • action (array-like of shape (n_trajectories * step_per_trajectory, action_dim)) – Action chosen by the behavior policy.

  • reward (array-like of shape (n_trajectories * step_per_trajectory, )) – Observed immediate rewards.

  • pscore (array-like of shape (n_trajectories * step_per_trajectory, )) – Conditional action choice probability of the behavior policy, i.e., \(\pi_0(a \mid s)\)

  • evaluation_policy_action (array-like of shape (n_trajectories * step_per_trajectory, action_dim)) – Action chosen by the evaluation policy.

  • state_action_value_prediction (array-like of shape (n_trajectories * step_per_trajectory, n_action)) – \(\hat{Q}\) for all actions, i.e., \(\hat{Q}(s_t, a) \forall a \in \mathcal{A}\).

  • reward_scale (array-like of shape (n_partition, )) – Scale of the trajectory-wise reward used for x-axis of the CDF plot.

  • gamma (float, default=1.0) – Discount factor. The value should be within (0, 1].

  • kernel ({"gaussian", "epanechnikov", "triangular", "cosine", "uniform"}) – Name of the kernel function to smooth importance weights.

  • bandwidth (float, default=1.0 (> 0)) – Bandwidth hyperparameter of the kernel function.

  • action_scaler (d3rlpy.preprocessing.ActionScaler, default=None) – Scaling factor of action.

Returns:

estimated_mean – Estimated mean of the reward under the evaluation policy.

Return type:

float
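One generic way to obtain a mean from a CDF evaluated on `reward_scale` is to difference the CDF into probability masses and take the weighted sum of the grid points. The sketch below illustrates that idea only; `mean_from_cdf` is a hypothetical helper, and the library may discretize the grid differently.

```python
import numpy as np


def mean_from_cdf(cdf, reward_scale):
    """Mean of a discretized distribution given its CDF on a grid (illustrative)."""
    cdf = np.asarray(cdf, dtype=float)
    reward_scale = np.asarray(reward_scale, dtype=float)
    # Probability mass at each grid point: F(m_j) - F(m_{j-1}), with F = 0 before the grid.
    mass = np.diff(cdf, prepend=0.0)
    return float(np.sum(mass * reward_scale))


# Uniform distribution over the returns {0, 1, 2}.
cdf = np.array([1.0 / 3.0, 2.0 / 3.0, 1.0])
reward_scale = np.array([0.0, 1.0, 2.0])
mean_hat = mean_from_cdf(cdf, reward_scale)
```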

estimate_variance(step_per_trajectory, action, reward, pscore, evaluation_policy_action, state_action_value_prediction, reward_scale, gamma=1.0, kernel='gaussian', bandwidth=1.0, action_scaler=None, **kwargs)

Estimate variance.

Parameters:
  • step_per_trajectory (int (> 0)) – Number of timesteps in an episode.

  • action (array-like of shape (n_trajectories * step_per_trajectory, action_dim)) – Action chosen by the behavior policy.

  • reward (array-like of shape (n_trajectories * step_per_trajectory, )) – Observed immediate rewards.

  • pscore (array-like of shape (n_trajectories * step_per_trajectory, )) – Conditional action choice probability of the behavior policy, i.e., \(\pi_0(a \mid s)\)

  • evaluation_policy_action (array-like of shape (n_trajectories * step_per_trajectory, action_dim)) – Action chosen by the evaluation policy.

  • state_action_value_prediction (array-like of shape (n_trajectories * step_per_trajectory, n_action)) – \(\hat{Q}\) for all actions, i.e., \(\hat{Q}(s_t, a) \forall a \in \mathcal{A}\).

  • reward_scale (array-like of shape (n_partition, )) – Scale of the trajectory-wise reward used for x-axis of the CDF plot.

  • gamma (float, default=1.0) – Discount factor. The value should be within (0, 1].

  • kernel ({"gaussian", "epanechnikov", "triangular", "cosine", "uniform"}) – Name of the kernel function to smooth importance weights.

  • bandwidth (float, default=1.0 (> 0)) – Bandwidth hyperparameter of the kernel function.

  • action_scaler (d3rlpy.preprocessing.ActionScaler, default=None) – Scaling factor of action.

Returns:

estimated_variance – Estimated variance of the reward under the evaluation policy.

Return type:

float
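The variance can be read off a gridded CDF in the same spirit: difference the CDF into masses, compute the mean, and average the squared deviations. This is a sketch under the assumption that the CDF is monotone on `reward_scale`; `variance_from_cdf` is a hypothetical name, not a library function.

```python
import numpy as np


def variance_from_cdf(cdf, reward_scale):
    """Variance of a discretized distribution given its CDF on a grid (illustrative)."""
    cdf = np.asarray(cdf, dtype=float)
    reward_scale = np.asarray(reward_scale, dtype=float)
    # Probability mass at each grid point, assuming F = 0 before the grid.
    mass = np.diff(cdf, prepend=0.0)
    mean = np.sum(mass * reward_scale)
    # Mass-weighted squared deviation from the mean.
    return float(np.sum(mass * (reward_scale - mean) ** 2))


# Uniform distribution over the returns {0, 1, 2}.
cdf = np.array([1.0 / 3.0, 2.0 / 3.0, 1.0])
reward_scale = np.array([0.0, 1.0, 2.0])
variance_hat = variance_from_cdf(cdf, reward_scale)
```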

estimate_conditional_value_at_risk(step_per_trajectory, action, reward, pscore, evaluation_policy_action, state_action_value_prediction, reward_scale, gamma=1.0, kernel='gaussian', bandwidth=1.0, action_scaler=None, alphas=None, **kwargs)

Estimate conditional value at risk.

Parameters:
  • step_per_trajectory (int (> 0)) – Number of timesteps in an episode.

  • action (array-like of shape (n_trajectories * step_per_trajectory, action_dim)) – Action chosen by the behavior policy.

  • reward (array-like of shape (n_trajectories * step_per_trajectory, )) – Observed immediate rewards.

  • pscore (array-like of shape (n_trajectories * step_per_trajectory, )) – Conditional action choice probability of the behavior policy, i.e., \(\pi_0(a \mid s)\)

  • evaluation_policy_action (array-like of shape (n_trajectories * step_per_trajectory, action_dim)) – Action chosen by the evaluation policy.

  • state_action_value_prediction (array-like of shape (n_trajectories * step_per_trajectory, n_action)) – \(\hat{Q}\) for all actions, i.e., \(\hat{Q}(s_t, a) \forall a \in \mathcal{A}\).

  • reward_scale (array-like of shape (n_partition, )) – Scale of the trajectory-wise reward used for x-axis of the CDF plot.

  • gamma (float, default=1.0) – Discount factor. The value should be within (0, 1].

  • kernel ({"gaussian", "epanechnikov", "triangular", "cosine", "uniform"}) – Name of the kernel function to smooth importance weights.

  • bandwidth (float, default=1.0 (> 0)) – Bandwidth hyperparameter of the kernel function.

  • action_scaler (d3rlpy.preprocessing.ActionScaler, default=None) – Scaling factor of action.

  • alphas (array-like of shape (n_alpha, ), default=None) – Set of proportions of the shaded region. The values should be within [0, 1). If None is given, np.linspace(0, 1, 21) will be used.

Returns:

estimated_conditional_value_at_risk – Estimated conditional value at risk (CVaR) of the reward under the evaluation policy.

Return type:

ndarray of shape (n_alpha, )
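For intuition, CVaR at level \(\alpha\) is the average return over the worst \(\alpha\) fraction of outcomes. The sketch below applies that textbook definition to a CDF on a grid. It is not the library's implementation: `cvar_from_cdf` is a hypothetical helper, it assumes a monotone CDF, and it requires every `alpha` to be strictly positive.

```python
import numpy as np


def cvar_from_cdf(cdf, reward_scale, alphas):
    """Lower-tail CVaR of a discretized distribution (illustrative, alpha > 0)."""
    cdf = np.asarray(cdf, dtype=float)
    reward_scale = np.asarray(reward_scale, dtype=float)
    alphas = np.asarray(alphas, dtype=float)
    # Probability mass at each grid point and the CDF value just before it.
    mass = np.diff(cdf, prepend=0.0)
    prev_cdf = cdf - mass
    cvar = np.empty(len(alphas))
    for k, alpha in enumerate(alphas):
        # Portion of each grid point's mass that falls inside the lowest-alpha tail.
        tail_mass = np.clip(alpha - prev_cdf, 0.0, mass)
        # Average return within the tail.
        cvar[k] = np.sum(tail_mass * reward_scale) / alpha
    return cvar


# Uniform distribution over the returns {0, 1, 2}.
cdf = np.array([1.0 / 3.0, 2.0 / 3.0, 1.0])
reward_scale = np.array([0.0, 1.0, 2.0])
cvar_hat = cvar_from_cdf(cdf, reward_scale, alphas=[1.0 / 3.0, 2.0 / 3.0, 1.0])
```

At `alpha = 1.0` the tail covers the whole distribution, so the value coincides with the mean.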

estimate_interquartile_range(step_per_trajectory, action, reward, pscore, evaluation_policy_action, state_action_value_prediction, reward_scale, gamma=1.0, kernel='gaussian', bandwidth=1.0, action_scaler=None, alpha=0.05, **kwargs)

Estimate interquartile range.

Parameters:
  • step_per_trajectory (int (> 0)) – Number of timesteps in an episode.

  • action (array-like of shape (n_trajectories * step_per_trajectory, action_dim)) – Action chosen by the behavior policy.

  • reward (array-like of shape (n_trajectories * step_per_trajectory, )) – Observed immediate rewards.

  • pscore (array-like of shape (n_trajectories * step_per_trajectory, )) – Conditional action choice probability of the behavior policy, i.e., \(\pi_0(a \mid s)\)

  • evaluation_policy_action (array-like of shape (n_trajectories * step_per_trajectory, action_dim)) – Action chosen by the evaluation policy.

  • state_action_value_prediction (array-like of shape (n_trajectories * step_per_trajectory, n_action)) – \(\hat{Q}\) for all actions, i.e., \(\hat{Q}(s_t, a) \forall a \in \mathcal{A}\).

  • reward_scale (array-like of shape (n_partition, )) – Scale of the trajectory-wise reward used for x-axis of the CDF plot.

  • gamma (float, default=1.0) – Discount factor. The value should be within (0, 1].

  • kernel ({"gaussian", "epanechnikov", "triangular", "cosine", "uniform"}) – Name of the kernel function to smooth importance weights.

  • bandwidth (float, default=1.0 (> 0)) – Bandwidth hyperparameter of the kernel function.

  • action_scaler (d3rlpy.preprocessing.ActionScaler, default=None) – Scaling factor of action.

  • alpha (float, default=0.05) – Proportion of the shaded region.

Returns:

estimated_interquartile_range – Estimated interquartile range of the reward under the evaluation policy.

Return type:

dict
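Quantiles can be recovered from a monotone CDF on a grid by locating the first threshold whose CDF value reaches the requested level. The sketch below returns a median together with a lower and an upper quantile. The dictionary keys, the helper names, and the use of `alpha` and `1 - alpha` as the two levels are all assumptions for illustration; the keys of the dict actually returned by the method are not specified here.

```python
import numpy as np


def quantile_from_cdf(cdf, reward_scale, level):
    """Smallest grid point whose CDF value is at least `level` (illustrative)."""
    cdf = np.asarray(cdf, dtype=float)
    reward_scale = np.asarray(reward_scale, dtype=float)
    # searchsorted requires a non-decreasing CDF.
    idx = int(np.searchsorted(cdf, level, side="left"))
    # Fall back to the last grid point if the level is never reached.
    idx = min(idx, len(reward_scale) - 1)
    return float(reward_scale[idx])


def interquantile_summary(cdf, reward_scale, alpha=0.05):
    """Median plus lower/upper quantiles; the key names are hypothetical."""
    return {
        "median": quantile_from_cdf(cdf, reward_scale, 0.5),
        "lower": quantile_from_cdf(cdf, reward_scale, alpha),
        "upper": quantile_from_cdf(cdf, reward_scale, 1.0 - alpha),
    }


cdf = np.array([0.1, 0.3, 0.6, 0.9, 1.0])
reward_scale = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
summary = interquantile_summary(cdf, reward_scale, alpha=0.25)
```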
