scope_rl.ope.online.calc_on_policy_interquartile_range

scope_rl.ope.online.calc_on_policy_interquartile_range(env, policy, n_trajectories=100, step_per_trajectory=None, evaluate_on_stationary_distribution=False, gamma=1.0, alpha=0.05, use_custom_reward_scale=False, scale_min=None, scale_max=None, n_partition=None, random_state=None)

Calculate the interquartile range of the on-policy policy value.

env: gym.Env

Reinforcement learning (RL) environment.

policy: {QLearningAlgoBase, BaseHead}

A policy to be evaluated.

n_trajectories: int, default=100 (> 0)

Number of trajectories to roll out.

step_per_trajectory: int, default=None (> 0)

Number of timesteps in a trajectory.

evaluate_on_stationary_distribution: bool, default=False

Whether to evaluate a policy based on the stationary state distribution induced by it. When True, the policy is evaluated by rollouts that do not reset the environment at the start of each trajectory. This argument is irrelevant in the finite-horizon setting.
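The difference between resetting and not resetting can be sketched on a toy problem. ToyChainEnv and rollout_returns below are hypothetical names invented for this illustration; they are not part of scope_rl, and the toy environment does not follow the full gym.Env interface.

```python
import random


class ToyChainEnv:
    """Hypothetical 1-D random walk, used only for illustration."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # Move by the action plus noise; reward is the negative distance from 0.
        self.state += action + self.rng.choice([-1, 1])
        return self.state, -abs(self.state)


def rollout_returns(env, policy, n_trajectories, step_per_trajectory, reset_each=True):
    """Collect undiscounted returns, optionally without resetting between trajectories."""
    returns = []
    state = env.reset()
    for i in range(n_trajectories):
        if reset_each and i > 0:
            # Episodic evaluation: every trajectory starts from the initial state.
            state = env.reset()
        total = 0.0
        for _ in range(step_per_trajectory):
            state, reward = env.step(policy(state))
            total += reward
        returns.append(total)
    return returns
```

With reset_each=False, each trajectory starts where the previous one ended, which is the behavior described for evaluate_on_stationary_distribution=True.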

gamma: float, default=1.0

Discount factor. The value should be within (0, 1].
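As a minimal sketch of how gamma enters the computation, assuming the policy value of one trajectory is its discounted sum of rewards. The helper name discounted_return is hypothetical and not part of scope_rl.

```python
def discounted_return(rewards, gamma=1.0):
    """Return sum_t gamma**t * r_t for one trajectory of rewards."""
    if not 0.0 < gamma <= 1.0:
        raise ValueError("gamma must be within (0, 1]")
    total = 0.0
    for t, reward in enumerate(rewards):
        total += (gamma ** t) * reward
    return total
```

With gamma=1.0 (the default) this reduces to the plain sum of rewards.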

alpha: float, default=0.05

Proportion of the shaded region. The value should be within (0, 1].
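The text above does not spell out which quantiles bound the range. The sketch below assumes the lower and upper ends are the alpha and 1 - alpha quantiles of the empirical distribution of trajectory-wise policy values; that convention, the helper names, and the dictionary keys are all assumptions made for illustration, not the library's implementation.

```python
import math


def empirical_quantile(sorted_values, q):
    """Smallest sample whose empirical CDF is at least q."""
    n = len(sorted_values)
    idx = max(math.ceil(q * n) - 1, 0)
    return sorted_values[idx]


def quantile_range(returns, alpha=0.05):
    """Median plus lower/upper quantiles of a list of trajectory returns."""
    if not returns:
        raise ValueError("returns must not be empty")
    values = sorted(returns)
    return {
        "median": empirical_quantile(values, 0.5),
        "lower": empirical_quantile(values, alpha),        # assumed: alpha quantile
        "upper": empirical_quantile(values, 1.0 - alpha),  # assumed: (1 - alpha) quantile
    }
```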

use_custom_reward_scale: bool, default=False

Whether to use a customized reward scale or the reward observed under the behavior policy. If True, the reward scale is uniform, following Huang et al. (2021). If False, the reward scale follows the one defined in Chandak et al. (2021).

scale_min: float, default=None

Minimum value of the reward scale in the CDF. When use_custom_reward_scale == True, a value must be given.

scale_max: float, default=None

Maximum value of the reward scale in the CDF. When use_custom_reward_scale == True, a value must be given.

n_partition: int, default=None

Number of partitions in the reward scale (x-axis of the CDF). When use_custom_reward_scale == True, a value must be given.
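The interplay of use_custom_reward_scale, scale_min, scale_max, and n_partition can be sketched as follows. This is an illustration under stated assumptions, not the library's code: build_reward_scale is a hypothetical helper, n_partition is treated as the number of grid points, and the non-custom scale is taken to be the sorted unique observed values.

```python
def build_reward_scale(returns, use_custom_reward_scale=False,
                       scale_min=None, scale_max=None, n_partition=None):
    """Return the x-axis grid of the CDF as a list of floats or observed values."""
    if use_custom_reward_scale:
        # All three values are required when the custom scale is requested.
        if scale_min is None or scale_max is None or n_partition is None:
            raise ValueError(
                "scale_min, scale_max, and n_partition must be given "
                "when use_custom_reward_scale is True"
            )
        if n_partition < 2 or scale_max <= scale_min:
            raise ValueError("need n_partition >= 2 and scale_max > scale_min")
        step = (scale_max - scale_min) / (n_partition - 1)
        # Uniform grid from scale_min to scale_max (assumed to include both ends).
        return [scale_min + i * step for i in range(n_partition)]
    # Assumed data-driven scale: the distinct observed values in ascending order.
    return sorted(set(returns))
```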

random_state: int, default=None (>= 0)

Random state.

Returns:

interquartile_range_dict – Dictionary containing the interquartile range of the on-policy policy value.

Return type:

dict
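A hedged usage sketch: the wrapper below forwards the documented defaults from the signature above and rejects misspelled keyword names before calling the function. DOCUMENTED_DEFAULTS and on_policy_iqr are hypothetical names; env and policy are assumed to be a gym.Env and a policy of one of the listed types, created elsewhere.

```python
# Defaults copied from the signature documented above.
DOCUMENTED_DEFAULTS = {
    "n_trajectories": 100,
    "step_per_trajectory": None,
    "evaluate_on_stationary_distribution": False,
    "gamma": 1.0,
    "alpha": 0.05,
    "use_custom_reward_scale": False,
    "scale_min": None,
    "scale_max": None,
    "n_partition": None,
    "random_state": None,
}


def on_policy_iqr(env, policy, **overrides):
    """Call calc_on_policy_interquartile_range with documented defaults plus overrides."""
    unknown = set(overrides) - set(DOCUMENTED_DEFAULTS)
    if unknown:
        raise TypeError(f"unknown arguments: {sorted(unknown)}")
    kwargs = dict(DOCUMENTED_DEFAULTS)
    kwargs.update(overrides)
    # Imported lazily so the sketch can be loaded without scope_rl installed.
    from scope_rl.ope.online import calc_on_policy_interquartile_range
    return calc_on_policy_interquartile_range(env, policy, **kwargs)
```

A call such as on_policy_iqr(env, policy, n_trajectories=200, random_state=12345) would then return the dictionary described under Returns.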