scope_rl.ope.discrete.cumulative_distribution_estimators.CumulativeDistributionDM

class scope_rl.ope.discrete.cumulative_distribution_estimators.CumulativeDistributionDM(estimator_name='cdf_dm')

Direct Method (DM) for estimating the cumulative distribution function (CDF) for discrete action spaces.

Bases: scope_rl.ope.BaseCumulativeDistributionOPEEstimator

Imported as: scope_rl.ope.discrete.CumulativeDistributionDM

Note

DM estimates the CDF using the initial state value as follows.

\[\hat{F}_{\mathrm{DM}}(m, \pi; \mathcal{D}) := \frac{1}{n} \sum_{i=1}^n \sum_{a \in \mathcal{A}} \pi(a \mid s_0^{(i)}) \hat{G}(m; s_0^{(i)}, a)\]

where \(\hat{F}(\cdot)\) is the estimated cumulative distribution function and \(\hat{G}(\cdot)\) is an estimator for \(\mathbb{E} \left[ \mathbb{I} \left \{\sum_{t=0}^{T-1} \gamma^t r_t \leq m \right \} \mid s,a \right]\).

DM has lower variance than other estimators, but it can incur larger bias when the approximation \(\hat{G}\) is inaccurate.

The methods of this class take \(\hat{Q}(s, a)\) as input (`state_action_value_prediction`). There are several methods to estimate \(\hat{Q}(s, a)\), such as Fitted Q Evaluation (FQE) (Le et al., 2019) and Minimax Q-Function Learning (MQL) (Uehara et al., 2020).
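The DM formula above can be sketched in NumPy as follows. This is a minimal illustration, not SCOPE-RL's implementation: the function name `dm_cdf` and the array layout of `g_hat` (one CDF value per trajectory, action, and threshold) are assumptions made for the example.

```python
import numpy as np


def dm_cdf(initial_action_dist, g_hat):
    """Average pi(a | s_0) * G_hat(m; s_0, a) over actions and trajectories.

    initial_action_dist: shape (n, n_action), pi(a | s_0^{(i)})
    g_hat: shape (n, n_action, n_partition), G_hat(m; s_0^{(i)}, a) (assumed layout)
    """
    initial_action_dist = np.asarray(initial_action_dist, dtype=float)
    g_hat = np.asarray(g_hat, dtype=float)
    # Inner sum over actions, weighted by the evaluation policy.
    per_trajectory = np.einsum("ia,iam->im", initial_action_dist, g_hat)
    # Outer average over the n trajectories.
    return per_trajectory.mean(axis=0)


# Two trajectories, two actions, three thresholds m.
pi0 = np.array([[0.5, 0.5], [1.0, 0.0]])
g_hat = np.array(
    [
        [[0.0, 0.5, 1.0], [0.0, 1.0, 1.0]],
        [[0.2, 0.6, 1.0], [0.0, 0.0, 1.0]],
    ]
)
cdf = dm_cdf(pi0, g_hat)
```

Each entry of `cdf` corresponds to one threshold \(m\); the result is non-decreasing whenever every \(\hat{G}(\cdot; s, a)\) is.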

See also

The implementation of FQE is provided by d3rlpy. The implementation of Minimax Q-Function Learning (MQL) is available at scope_rl.ope.weight_value_learning.

Parameters:

estimator_name (str, default="cdf_dm") – Name of the estimator.

References

Yash Chandak, Scott Niekum, Bruno Castro da Silva, Erik Learned-Miller, Emma Brunskill, and Philip S. Thomas. “Universal Off-Policy Evaluation.” 2021.

Audrey Huang, Liu Leqi, Zachary C. Lipton, and Kamyar Azizzadenesheli. “Off-Policy Risk Assessment in Contextual Bandits.” 2021.

Masatoshi Uehara, Jiawei Huang, and Nan Jiang. “Minimax Weight and Q-Function Learning for Off-Policy Evaluation.” 2020.

Hoang Le, Cameron Voloshin, and Yisong Yue. “Batch Policy Learning under Constraints.” 2019.

Methods

`estimate_conditional_value_at_risk`(...[, ...])	Estimate conditional value at risk.
`estimate_cumulative_distribution_function`(...)	Estimate the cumulative distribution function (CDF) of the reward distribution under the evaluation policy.
`estimate_interquartile_range`(...[, gamma, alpha])	Estimate interquartile range.
`estimate_mean`(step_per_trajectory, reward, ...)	Estimate mean.
`estimate_variance`(step_per_trajectory, ...)	Estimate variance.

estimate_cumulative_distribution_function(step_per_trajectory, reward, evaluation_policy_action_dist, state_action_value_prediction, reward_scale, gamma=1.0, **kwargs)

Estimate the cumulative distribution function (CDF) of the reward distribution under the evaluation policy.

Parameters:

step_per_trajectory (int (> 0)) – Number of timesteps in an episode.
reward (array-like of shape (n_trajectories * step_per_trajectory, )) – Observed immediate rewards.
evaluation_policy_action_dist (array-like of shape (n_trajectories * step_per_trajectory, n_action)) – Conditional action distribution induced by the evaluation policy, i.e., \(\pi(a \mid s_t) \forall a \in \mathcal{A}\).
state_action_value_prediction (array-like of shape (n_trajectories * step_per_trajectory, n_action)) – \(\hat{Q}\) for all actions, i.e., \(\hat{Q}(s_t, a) \forall a \in \mathcal{A}\).
reward_scale (array-like of shape (n_partition, )) – Scale of the trajectory-wise reward used for the x-axis of the CDF plot.
gamma (float, default=1.0) – Discount factor. The value should be within (0, 1].

Returns:

estimated_cumulative_distribution_function – Estimated cumulative distribution function for the pre-defined reward scale.

Return type:

ndarray of shape (n_partition, ) or (n_episode, )
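To illustrate the flattened input layout, the sketch below reshapes the inputs per trajectory, computes the initial state value \(\sum_a \pi(a \mid s_0) \hat{Q}(s_0, a)\), and evaluates its empirical CDF on `reward_scale`. This is one plausible way to turn \(\hat{Q}\) into a CDF and is an assumption, not a description of SCOPE-RL's internals; the helper name `sketch_dm_cdf` is hypothetical.

```python
import numpy as np


def sketch_dm_cdf(
    step_per_trajectory,
    evaluation_policy_action_dist,
    state_action_value_prediction,
    reward_scale,
):
    pi = np.asarray(evaluation_policy_action_dist, dtype=float)
    q_hat = np.asarray(state_action_value_prediction, dtype=float)
    reward_scale = np.asarray(reward_scale, dtype=float)
    # V(s_t) = sum_a pi(a | s_t) * Q_hat(s_t, a) for every flattened row.
    state_value = (pi * q_hat).sum(axis=1)
    # Rows are grouped by trajectory, so column 0 after reshaping is t = 0.
    initial_state_value = state_value.reshape(-1, step_per_trajectory)[:, 0]
    # Fraction of trajectories whose initial state value is <= each threshold.
    return (initial_state_value[:, None] <= reward_scale[None, :]).mean(axis=0)


# Two trajectories of two steps each, two actions.
pi = np.array([[0.5, 0.5], [1.0, 0.0], [0.5, 0.5], [1.0, 0.0]])
q_hat = np.array([[1.0, 3.0], [0.0, 0.0], [4.0, 6.0], [0.0, 0.0]])
reward_scale = np.array([0.0, 2.0, 4.0, 6.0])
cdf = sketch_dm_cdf(2, pi, q_hat, reward_scale)
```

Here the initial state values are 2 and 5, so the CDF jumps by 0.5 at the thresholds 2 and 6.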

estimate_mean(step_per_trajectory, reward, evaluation_policy_action_dist, state_action_value_prediction, reward_scale, gamma=1.0, **kwargs)

Estimate mean.

Parameters:

step_per_trajectory (int (> 0)) – Number of timesteps in an episode.
reward (array-like of shape (n_trajectories * step_per_trajectory, )) – Observed immediate rewards.
evaluation_policy_action_dist (array-like of shape (n_trajectories * step_per_trajectory, n_action)) – Conditional action distribution induced by the evaluation policy, i.e., \(\pi(a \mid s_t) \forall a \in \mathcal{A}\).
state_action_value_prediction (array-like of shape (n_trajectories * step_per_trajectory, n_action)) – \(\hat{Q}\) for all actions, i.e., \(\hat{Q}(s_t, a) \forall a \in \mathcal{A}\).
reward_scale (array-like of shape (n_partition, )) – Scale of the trajectory-wise reward used for the x-axis of the CDF plot.
gamma (float, default=1.0) – Discount factor. The value should be within (0, 1].

Returns:

estimated_mean – Estimated mean of the reward under the evaluation policy.

Return type:

float
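Given a CDF evaluated on a grid, a mean can be recovered by differencing the CDF into per-point probability masses and taking a weighted sum. The sketch below shows that generic calculation; the function name `mean_from_cdf` is hypothetical, and whether the library computes the mean exactly this way is an assumption.

```python
import numpy as np


def mean_from_cdf(cdf, reward_scale):
    cdf = np.asarray(cdf, dtype=float)
    reward_scale = np.asarray(reward_scale, dtype=float)
    # Probability mass assigned to each grid point (first point gets cdf[0]).
    pdf = np.diff(cdf, prepend=0.0)
    return float((pdf * reward_scale).sum())


estimated_mean = mean_from_cdf([0.0, 0.5, 0.5, 1.0], [0.0, 2.0, 4.0, 6.0])
```

The result depends on how fine `reward_scale` is, since all mass between two grid points is attributed to the upper one.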

estimate_variance(step_per_trajectory, reward, evaluation_policy_action_dist, state_action_value_prediction, reward_scale, gamma=1.0, **kwargs)

Estimate variance.

Parameters:

step_per_trajectory (int (> 0)) – Number of timesteps in an episode.
reward (array-like of shape (n_trajectories * step_per_trajectory, )) – Observed immediate rewards.
evaluation_policy_action_dist (array-like of shape (n_trajectories * step_per_trajectory, n_action)) – Conditional action distribution induced by the evaluation policy, i.e., \(\pi(a \mid s_t) \forall a \in \mathcal{A}\).
state_action_value_prediction (array-like of shape (n_trajectories * step_per_trajectory, n_action)) – \(\hat{Q}\) for all actions, i.e., \(\hat{Q}(s_t, a) \forall a \in \mathcal{A}\).
reward_scale (array-like of shape (n_partition, )) – Scale of the trajectory-wise reward used for the x-axis of the CDF plot.
gamma (float, default=1.0) – Discount factor. The value should be within (0, 1].

Returns:

estimated_variance – Estimated variance of the reward under the evaluation policy.

Return type:

float
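A variance can be derived from a gridded CDF in the same spirit: convert the CDF into probability masses and take the mass-weighted squared deviation from the mean. This is a generic sketch under that assumption, with the hypothetical helper name `variance_from_cdf`; it is not taken from the library source.

```python
import numpy as np


def variance_from_cdf(cdf, reward_scale):
    cdf = np.asarray(cdf, dtype=float)
    reward_scale = np.asarray(reward_scale, dtype=float)
    # Probability mass assigned to each grid point.
    pdf = np.diff(cdf, prepend=0.0)
    mean = (pdf * reward_scale).sum()
    # Mass-weighted squared deviation from the mean.
    return float((pdf * (reward_scale - mean) ** 2).sum())


estimated_variance = variance_from_cdf([0.0, 0.5, 0.5, 1.0], [0.0, 2.0, 4.0, 6.0])
```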

estimate_conditional_value_at_risk(step_per_trajectory, reward, evaluation_policy_action_dist, state_action_value_prediction, reward_scale, gamma=1.0, alphas=None, **kwargs)

Estimate conditional value at risk.

Parameters:

step_per_trajectory (int (> 0)) – Number of timesteps in an episode.
reward (array-like of shape (n_trajectories * step_per_trajectory, )) – Observed immediate rewards.
evaluation_policy_action_dist (array-like of shape (n_trajectories * step_per_trajectory, n_action)) – Conditional action distribution induced by the evaluation policy, i.e., \(\pi(a \mid s_t) \forall a \in \mathcal{A}\).
state_action_value_prediction (array-like of shape (n_trajectories * step_per_trajectory, n_action)) – \(\hat{Q}\) for all actions, i.e., \(\hat{Q}(s_t, a) \forall a \in \mathcal{A}\).
reward_scale (array-like of shape (n_partition, )) – Scale of the trajectory-wise reward used for the x-axis of the CDF plot.
gamma (float, default=1.0) – Discount factor. The value should be within (0, 1].
alphas (array-like of shape (n_alpha, ), default=None) – Set of proportions of the shaded region. The values should be within [0, 1). If None is given, np.linspace(0, 1, 21) will be used.

Returns:

estimated_conditional_value_at_risk – Estimated conditional value at risk (CVaR) of the reward under the evaluation policy.

Return type:

ndarray of shape (n_alpha, )
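For intuition, CVaR at level alpha is the expected return within the worst alpha fraction of outcomes. The sketch below computes it from a gridded CDF by collecting the probability mass that falls inside the lower alpha-tail. It is a textbook-style illustration, not the library's code: the name `cvar_from_cdf` is hypothetical, and the sketch assumes alpha > 0 and that the CDF reaches 1 at the last grid point.

```python
import numpy as np


def cvar_from_cdf(cdf, reward_scale, alphas):
    cdf = np.asarray(cdf, dtype=float)
    reward_scale = np.asarray(reward_scale, dtype=float)
    pdf = np.diff(cdf, prepend=0.0)
    # Cumulative probability strictly below each grid point.
    prev_cdf = cdf - pdf
    cvar = []
    for alpha in alphas:
        # Portion of each point's mass that lies inside the lower alpha-tail.
        tail_mass = np.clip(alpha - prev_cdf, 0.0, pdf)
        cvar.append((tail_mass * reward_scale).sum() / alpha)
    return np.array(cvar)


cvar = cvar_from_cdf(
    cdf=[0.0, 0.5, 0.5, 1.0],
    reward_scale=[0.0, 2.0, 4.0, 6.0],
    alphas=[0.25, 0.5, 1.0],
)
```

With alpha = 1.0 the tail covers the whole distribution, so the value coincides with the mean computed from the same grid.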

estimate_interquartile_range(step_per_trajectory, reward, evaluation_policy_action_dist, state_action_value_prediction, reward_scale, gamma=1.0, alpha=0.05, **kwargs)

Estimate interquartile range.

Parameters:

step_per_trajectory (int (> 0)) – Number of timesteps in an episode.
reward (array-like of shape (n_trajectories * step_per_trajectory, )) – Observed immediate rewards.
evaluation_policy_action_dist (array-like of shape (n_trajectories * step_per_trajectory, n_action)) – Conditional action distribution induced by the evaluation policy, i.e., \(\pi(a \mid s_t) \forall a \in \mathcal{A}\).
state_action_value_prediction (array-like of shape (n_trajectories * step_per_trajectory, n_action)) – \(\hat{Q}\) for all actions, i.e., \(\hat{Q}(s_t, a) \forall a \in \mathcal{A}\).
reward_scale (array-like of shape (n_partition, )) – Scale of the trajectory-wise reward used for the x-axis of the CDF plot.
gamma (float, default=1.0) – Discount factor. The value should be within (0, 1].
alpha (float, default=0.05) – Proportion of the shaded region.

Returns:

estimated_interquartile_range – Estimated interquartile range of the reward under the evaluation policy.

key: [
    mean,
    {100 * (1. - alpha)}% quartile (lower),
    {100 * (1. - alpha)}% quartile (upper),
]

Return type:

dict
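The lower and upper bounds can be read off a gridded CDF as the first grid points at which the CDF reaches alpha and 1 - alpha, respectively. The sketch below shows that lookup; the function name `interquartile_range_from_cdf` and the dictionary keys `"mean"`, `"lower"`, and `"upper"` are hypothetical simplifications and do not match the key names documented above.

```python
import numpy as np


def interquartile_range_from_cdf(cdf, reward_scale, alpha=0.05):
    cdf = np.asarray(cdf, dtype=float)
    reward_scale = np.asarray(reward_scale, dtype=float)
    pdf = np.diff(cdf, prepend=0.0)
    # First grid index whose CDF value is >= the requested level.
    lower_idx = np.searchsorted(cdf, alpha, side="left")
    upper_idx = np.searchsorted(cdf, 1.0 - alpha, side="left")
    # Guard against levels that the gridded CDF never reaches.
    last = len(reward_scale) - 1
    return {
        "mean": float((pdf * reward_scale).sum()),
        "lower": float(reward_scale[min(lower_idx, last)]),
        "upper": float(reward_scale[min(upper_idx, last)]),
    }


iqr = interquartile_range_from_cdf(
    cdf=[0.0, 0.5, 0.5, 1.0],
    reward_scale=[0.0, 2.0, 4.0, 6.0],
    alpha=0.05,
)
```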
