scope_rl.ope.online.calc_on_policy_conditional_value_at_risk

scope_rl.ope.online.calc_on_policy_conditional_value_at_risk(env, policy, n_trajectories=100, step_per_trajectory=None, evaluate_on_stationary_distribution=False, gamma=1.0, alphas=None, use_custom_reward_scale=False, scale_min=None, scale_max=None, n_partition=None, random_state=None)

Calculate the conditional value at risk (CVaR) of the on-policy policy value.

Parameters:
  • env (gym.Env) – Reinforcement learning (RL) environment.

  • policy ({QLearningAlgoBase, BaseHead}) – A policy to be evaluated.

  • n_trajectories (int, default=100 (> 0)) – Number of trajectories to roll out.

  • step_per_trajectory (int, default=None (> 0)) – Number of timesteps in a trajectory.

  • evaluate_on_stationary_distribution (bool, default=False) – Whether to evaluate the policy on the stationary state distribution that it induces. When True, the policy is rolled out without resetting the environment at the start of each trajectory. This argument is irrelevant in the finite-horizon setting.

  • gamma (float, default=1.0) – Discount factor. The value should be within (0, 1].

  • alphas (array-like of shape (n_alpha, ) or float, default=None) – Set of proportions of the shaded region. The values should be within [0, 1). If None is given, np.linspace(0, 1, 21) will be used.

  • use_custom_reward_scale (bool, default=False) – Whether to use a customized reward scale or the reward observed under the behavior policy. If True, the reward scale is uniform, following Huang et al. (2021). If False, the reward scale follows the one defined in Chandak et al. (2021).

  • scale_min (float, default=None) – Minimum value of the reward scale in the CDF. When use_custom_reward_scale == True, a value must be given.

  • scale_max (float, default=None) – Maximum value of the reward scale in the CDF. When use_custom_reward_scale == True, a value must be given.

  • n_partition (int, default=None) – Number of partitions in the reward scale (x-axis of the CDF). When use_custom_reward_scale == True, a value must be given.

  • random_state (int, default=None (>= 0)) – Random state.
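
The interplay of use_custom_reward_scale, scale_min, scale_max, and n_partition can be illustrated with a minimal sketch. The helper build_reward_scale below is hypothetical and is not part of scope_rl. It assumes that the custom scale is a uniform grid of n_partition points between scale_min and scale_max, and that the default scale consists of the sorted unique returns that were observed; the library's actual construction may differ.

```python
import numpy as np


def build_reward_scale(
    returns,
    use_custom_reward_scale=False,
    scale_min=None,
    scale_max=None,
    n_partition=None,
):
    """Hypothetical helper: build the x-axis (reward scale) of an empirical CDF."""
    if use_custom_reward_scale:
        # All three values are required when a custom scale is requested.
        if scale_min is None or scale_max is None or n_partition is None:
            raise ValueError(
                "scale_min, scale_max, and n_partition must be given "
                "when use_custom_reward_scale == True"
            )
        # Assumption: n_partition is the number of grid points, not of bins.
        return np.linspace(scale_min, scale_max, num=n_partition)
    # Otherwise, use the sorted unique values of the observed returns.
    return np.unique(np.asarray(returns, dtype=float))


observed = [3.0, 1.0, 1.0, 2.0]
data_driven_scale = build_reward_scale(observed)
uniform_scale = build_reward_scale(
    observed,
    use_custom_reward_scale=True,
    scale_min=0.0,
    scale_max=10.0,
    n_partition=5,
)
```

A uniform grid makes CDFs from different policies directly comparable on the same x-axis, while the data-driven scale places grid points exactly where returns were observed.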

Returns:

on_policy_conditional_value_at_risk – CVaR of the on-policy policy value.

Return type:

np.ndarray
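
To make the returned quantity concrete, here is a minimal NumPy sketch of an empirical CVaR computed from per-trajectory discounted returns. It is not the scope_rl implementation: the helpers discounted_return and empirical_cvar are hypothetical, and the sketch assumes the lower-tail convention, where the CVaR at level alpha is the mean of the worst alpha-fraction of returns (at least one sample is always kept, so alpha = 0 yields the minimum return).

```python
import numpy as np


def discounted_return(rewards, gamma=1.0):
    """Hypothetical helper: sum of gamma**t * r_t over one trajectory."""
    rewards = np.asarray(rewards, dtype=float)
    discounts = gamma ** np.arange(len(rewards))
    return float(np.sum(discounts * rewards))


def empirical_cvar(returns, alphas):
    """Hypothetical helper: mean of the worst alpha-fraction of returns."""
    sorted_returns = np.sort(np.asarray(returns, dtype=float))
    alphas = np.atleast_1d(np.asarray(alphas, dtype=float))
    n = len(sorted_returns)
    cvar = np.empty_like(alphas)
    for i, alpha in enumerate(alphas):
        # Number of lowest returns to average; keep at least one sample.
        k = max(1, int(np.ceil(alpha * n)))
        cvar[i] = sorted_returns[:k].mean()
    return cvar


# Toy reward sequences standing in for rollouts of a policy.
trajectories = [
    [1.0, 1.0, 1.0],
    [0.0, 2.0, 0.0],
    [3.0, 0.0, 0.0],
    [0.0, 0.0, 0.0],
]
returns = [discounted_return(r, gamma=0.5) for r in trajectories]
cvar = empirical_cvar(returns, alphas=[0.5, 1.0])
```

Under this convention, the CVaR at alpha = 1 equals the ordinary mean return, and smaller values of alpha focus on progressively worse outcomes.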