scope_rl.ope.online.calc_on_policy_statistics

scope_rl.ope.online.calc_on_policy_statistics(env, policy, n_trajectories=100, step_per_trajectory=None, evaluate_on_stationary_distribution=False, gamma=1.0, quartile_alpha=0.05, cvar_alpha=0.05, use_custom_reward_scale=False, scale_min=None, scale_max=None, n_partition=None, random_state=None)

Calculate the mean, variance, conditional value at risk (CVaR), and interquartile range of the on-policy policy value.

Parameters:

env (gym.Env) – Reinforcement learning (RL) environment.
policy ({QLearningAlgoBase, BaseHead}) – A policy to be evaluated.
n_trajectories (int, default=100 (> 0)) – Number of trajectories to roll out.
step_per_trajectory (int, default=None (> 0)) – Number of timesteps in a trajectory.
evaluate_on_stationary_distribution (bool, default=False) – Whether to evaluate the policy on the stationary state distribution it induces. When True, the policy is evaluated by rollout without resetting the environment at the start of each trajectory. This argument is irrelevant in the finite-horizon setting.
gamma (float, default=1.0) – Discount factor. The value should be within (0, 1].
quartile_alpha (float, default=0.05) – Proportion of the shaded region used for the interquartile range. The value should be within (0, 1].
cvar_alpha (float, default=0.05) – Proportion of the shaded region used for the CVaR. The value should be within (0, 1].
use_custom_reward_scale (bool, default=False) – Whether to use a customized reward scale or the reward observed under the behavior policy. If True, the reward scale is uniform, following Huang et al. (2021). If False, the reward scale follows the one defined in Chandak et al. (2021).
scale_min (float, default=None) – Minimum value of the reward scale in the CDF. When use_custom_reward_scale == True, a value must be given.
scale_max (float, default=None) – Maximum value of the reward scale in the CDF. When use_custom_reward_scale == True, a value must be given.
n_partition (int, default=None) – Number of partitions in the reward scale (x-axis of the CDF). When use_custom_reward_scale == True, a value must be given.
random_state (int, default=None (>= 0)) – Random state.
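
As an illustration of how scale_min, scale_max, and n_partition could define a uniform reward scale for the CDF, the following is a minimal sketch in plain Python. The helper names uniform_reward_scale and empirical_cdf are hypothetical and not part of scope_rl. Treating both endpoints as inclusive and using exactly n_partition grid points are assumptions, not confirmed library behavior.

```python
from typing import List, Sequence


def uniform_reward_scale(scale_min: float, scale_max: float, n_partition: int) -> List[float]:
    """Hypothetical helper: n_partition evenly spaced points on [scale_min, scale_max].

    Assumption: both endpoints are included in the grid.
    """
    if n_partition < 2:
        raise ValueError("n_partition must be at least 2")
    if scale_max <= scale_min:
        raise ValueError("scale_max must be greater than scale_min")
    step = (scale_max - scale_min) / (n_partition - 1)
    return [scale_min + i * step for i in range(n_partition)]


def empirical_cdf(returns: Sequence[float], reward_scale: Sequence[float]) -> List[float]:
    """Fraction of returns that are less than or equal to each grid point."""
    n = len(returns)
    return [sum(r <= x for r in returns) / n for x in reward_scale]


returns = [0.0, 1.0, 1.0, 3.0]              # toy trajectory-wise returns
grid = uniform_reward_scale(0.0, 4.0, 5)    # x-axis of the CDF
cdf = empirical_cdf(returns, grid)          # y-axis of the CDF
```

With a grid like this, the resolution of the CDF along the x-axis depends only on n_partition and the chosen range, not on the returns that happen to be observed.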

Returns:

statistics_dict – Dictionary containing the mean, variance, CVaR, and interquartile range of the on-policy policy value.

Return type:

dict
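
To make the returned quantities concrete, here is a minimal sketch of how the four statistics could be computed from per-step rewards collected by on-policy rollouts. It is not the scope_rl implementation. The function names, the dictionary keys, the nearest-rank quantile rule, the lower-tail CVaR, and the use of the sample variance (dividing by n - 1) are all assumptions made for illustration.

```python
import math
from typing import Dict, List, Sequence


def discounted_return(rewards: Sequence[float], gamma: float = 1.0) -> float:
    """Sum of gamma**t * r_t over the timesteps of one trajectory."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


def summarize_returns(
    returns: Sequence[float],
    quartile_alpha: float = 0.05,
    cvar_alpha: float = 0.05,
) -> Dict[str, float]:
    """Hypothetical summary of trajectory-wise returns (needs at least 2 returns)."""
    n = len(returns)
    ordered = sorted(returns)
    mean = sum(ordered) / n
    # Assumption: unbiased sample variance (divide by n - 1).
    variance = sum((x - mean) ** 2 for x in ordered) / (n - 1)
    # Assumption: CVaR is the mean of the worst ceil(cvar_alpha * n) returns.
    k = max(1, math.ceil(cvar_alpha * n))
    cvar = sum(ordered[:k]) / k
    # Assumption: nearest-rank quantiles at quartile_alpha and 1 - quartile_alpha.
    lower = ordered[max(0, math.ceil(quartile_alpha * n) - 1)]
    upper = ordered[max(0, math.ceil((1.0 - quartile_alpha) * n) - 1)]
    return {
        "mean": mean,
        "variance": variance,
        "cvar": cvar,
        "lower_quantile": lower,
        "upper_quantile": upper,
    }


# Toy per-step rewards for four trajectories of two steps each.
trajectories: List[List[float]] = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0]]
returns = [discounted_return(rewards, gamma=0.5) for rewards in trajectories]
stats = summarize_returns(returns, quartile_alpha=0.25, cvar_alpha=0.25)
```

In this toy example the returns are 1.5, 3.0, 4.5, and 6.0, so the mean is 3.75 and the CVaR at cvar_alpha=0.25 is the single worst return, 1.5.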