scope_rl.ope.online.calc_on_policy_interquartile_range
- scope_rl.ope.online.calc_on_policy_interquartile_range(env, policy, n_trajectories=100, step_per_trajectory=None, evaluate_on_stationary_distribution=False, gamma=1.0, alpha=0.05, use_custom_reward_scale=False, scale_min=None, scale_max=None, n_partition=None, random_state=None)
Calculate the interquartile range of the on-policy policy value.
- env: gym.Env
Reinforcement learning (RL) environment.
- policy: {QLearningAlgoBase, BaseHead}
A policy to be evaluated.
- n_trajectories: int, default=100 (> 0)
Number of trajectories to roll out.
- step_per_trajectory: int, default=None (> 0)
Number of timesteps in a trajectory.
- evaluate_on_stationary_distribution: bool, default=False
Whether to evaluate the policy on the stationary state distribution it induces. When True, the policy is rolled out without resetting the environment at the start of each trajectory. This argument is irrelevant in the finite-horizon setting.
- gamma: float, default=1.0
Discount factor. The value should be within (0, 1].
- alpha: float, default=0.05
Proportion of the shaded region. The value should be within (0, 1].
- use_custom_reward_scale: bool, default=False
Whether to use a customized reward scale or the reward observed under the behavior policy. If True, the reward scale is uniform, following Huang et al. (2021). If False, the reward scale follows the one defined in Chandak et al. (2021).
- scale_min: float, default=None
Minimum value of the reward scale in the CDF. When use_custom_reward_scale == True, a value must be given.
- scale_max: float, default=None
Maximum value of the reward scale in the CDF. When use_custom_reward_scale == True, a value must be given.
- n_partition: int, default=None
Number of partitions in the reward scale (x-axis of the CDF). When use_custom_reward_scale == True, a value must be given.
- random_state: int, default=None (>= 0)
Random state.
- Returns:
interquartile_range_dict – Dictionary containing the interquartile range of the on-policy policy value.
- Return type:
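The following is a minimal, library-independent sketch of the idea behind this function, not scope_rl's implementation. It computes discounted returns from rolled-out reward sequences, builds an empirical CDF on either the observed returns or a uniform grid (mirroring scale_min, scale_max, and n_partition), and reads off quantiles. It assumes that alpha selects the alpha and 1 - alpha quantiles; the helper names and the dictionary keys "median", "lower", and "upper" are hypothetical, and random rewards stand in for an actual env and policy rollout.

```python
import random


def discounted_return(rewards, gamma=1.0):
    """Sum of gamma**t * r_t over a single trajectory."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


def quantile_range_from_returns(returns, alpha=0.05, scale_min=None,
                                scale_max=None, n_partition=None):
    """Read the alpha, 0.5, and (1 - alpha) quantiles off an empirical CDF.

    Hypothetical helper: if n_partition is given (n_partition >= 2), the CDF
    is evaluated on a uniform grid over [scale_min, scale_max]; otherwise it
    is evaluated on the sorted observed returns.
    """
    if n_partition is not None:
        step = (scale_max - scale_min) / (n_partition - 1)
        scale = [scale_min + i * step for i in range(n_partition)]
    else:
        scale = sorted(set(returns))

    n = len(returns)
    # Empirical CDF: fraction of trajectories whose return is <= x.
    cdf = [sum(r <= x for r in returns) / n for x in scale]

    def quantile(q):
        # Smallest point on the scale whose CDF value reaches q.
        for x, c in zip(scale, cdf):
            if c >= q:
                return x
        return scale[-1]

    return {
        "median": quantile(0.5),
        "lower": quantile(alpha),
        "upper": quantile(1 - alpha),
    }


# Toy stand-in for rolling out a policy: 100 trajectories of 10 random rewards.
rng = random.Random(12345)
trajectories = [[rng.random() for _ in range(10)] for _ in range(100)]
returns = [discounted_return(t, gamma=0.99) for t in trajectories]
result = quantile_range_from_returns(returns, alpha=0.05)
print(result)
```

With a uniform grid, the reported quantiles are snapped to grid points, so a coarse n_partition trades resolution for a fixed, comparable x-axis across policies.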