scope_rl.ope.online.calc_on_policy_policy_value

scope_rl.ope.online.calc_on_policy_policy_value(env, policy, n_trajectories=100, step_per_trajectory=None, evaluate_on_stationary_distribution=False, gamma=1.0, alpha=0.05, use_bootstrap=False, n_bootstrap_samples=100, random_state=None)

Calculate the on-policy policy value of a given policy by rolling it out in the environment.

Parameters:

env (gym.Env) – Reinforcement learning (RL) environment.
policy ({QLearningAlgoBase, BaseHead}) – A policy to be evaluated.
n_trajectories (int, default=100 (> 0)) – Number of trajectories to roll out.
step_per_trajectory (int, default=None (> 0)) – Number of timesteps in a trajectory.
evaluate_on_stationary_distribution (bool, default=False) – Whether to evaluate the policy on the stationary state distribution that it induces. When True, the policy is evaluated by rolling it out without resetting the environment between trajectories. This argument is irrelevant in the finite-horizon setting.
gamma (float, default=1.0) – Discount factor. The value should be within (0, 1].
alpha (float, default=0.05) – Significance level. The value should be within (0, 1].
use_bootstrap (bool, default=False) – Whether to use bootstrap sampling or not.
n_bootstrap_samples (int, default=100 (> 0)) – Number of resamplings performed in the bootstrap procedure.
random_state (int, default=None (>= 0)) – Random state.
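
To make the roles of n_trajectories and gamma concrete, the following is a minimal sketch of on-policy evaluation by rollout. It is not the scope_rl implementation: ToyEnv, rollout_value, and the three-value step() return are hypothetical stand-ins chosen so the example runs with the standard library only.

```python
import random


class ToyEnv:
    """Hypothetical one-state environment (not part of scope_rl or gym)."""

    def __init__(self, horizon=5, success_prob=0.8, seed=0):
        self.horizon = horizon
        self.success_prob = success_prob
        self.rng = random.Random(seed)
        self.t = 0

    def reset(self):
        self.t = 0
        return 0  # single dummy state

    def step(self, action):
        self.t += 1
        # Action 1 pays reward 1.0 with probability success_prob; action 0 pays nothing.
        paid = action == 1 and self.rng.random() < self.success_prob
        reward = 1.0 if paid else 0.0
        done = self.t >= self.horizon
        return 0, reward, done


def rollout_value(env, policy, n_trajectories=100, gamma=1.0):
    """Average discounted return over n_trajectories on-policy rollouts."""
    returns = []
    for _ in range(n_trajectories):
        state = env.reset()
        done, discount, total = False, 1.0, 0.0
        while not done:
            action = policy(state)
            state, reward, done = env.step(action)
            total += discount * reward
            discount *= gamma
        returns.append(total)
    return sum(returns) / len(returns)


def always_one(state):
    """Hypothetical policy that always picks action 1."""
    return 1


value = rollout_value(ToyEnv(horizon=5, success_prob=0.8), always_one,
                      n_trajectories=200, gamma=0.9)
```

With gamma=1.0 the per-trajectory return is the plain sum of rewards; smaller values of gamma down-weight later timesteps.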

Returns:

on_policy_policy_value – Average on-policy policy value.

Return type:

float
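
The parameters use_bootstrap, n_bootstrap_samples, alpha, and random_state suggest that the average can be estimated by bootstrap resampling of the per-trajectory returns. The sketch below shows one common way to do this (a percentile bootstrap). It is an assumption about the general technique, not a description of what scope_rl does internally; bootstrap_mean and its returned dictionary keys are hypothetical names.

```python
import random


def bootstrap_mean(samples, n_bootstrap_samples=100, alpha=0.05, random_state=None):
    """Percentile-bootstrap estimate of the mean with a (1 - alpha) interval."""
    rng = random.Random(random_state)
    n = len(samples)
    means = []
    for _ in range(n_bootstrap_samples):
        # Resample n values with replacement and record their mean.
        resample = [samples[rng.randrange(n)] for _ in range(n)]
        means.append(sum(resample) / n)
    means.sort()
    lower = means[int((alpha / 2) * (n_bootstrap_samples - 1))]
    upper = means[int((1 - alpha / 2) * (n_bootstrap_samples - 1))]
    return {"mean": sum(means) / len(means), "lower": lower, "upper": upper}


# Hypothetical per-trajectory returns, for illustration only.
trajectory_returns = [3.0, 4.0, 2.0, 5.0, 4.0, 3.0, 1.0, 4.0]
estimate = bootstrap_mean(trajectory_returns, n_bootstrap_samples=100,
                          alpha=0.05, random_state=12345)
```

Fixing random_state makes the resampling, and therefore the estimate, reproducible across calls.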