scope_rl.ope.online.rollout_policy_online

scope_rl.ope.online.rollout_policy_online(env, policy, n_trajectories=100, step_per_trajectory=None, evaluate_on_stationary_distribution=False, gamma=1.0, random_state=None)

Roll out a given policy on the environment online and generate the trajectory-wise rewards obtained under that policy.

Parameters:

env (gym.Env) – Reinforcement learning (RL) environment.
policy ({QLearningAlgoBase, BaseHead}) – A policy to be evaluated.
n_trajectories (int, default=100 (> 0)) – Number of trajectories to roll out.
step_per_trajectory (int, default=None (> 0)) – Number of timesteps in a trajectory.
evaluate_on_stationary_distribution (bool, default=False) – Whether to evaluate the policy on the stationary state distribution that it induces. When True, the policy is rolled out without resetting the environment between trajectories. This argument is irrelevant in the finite-horizon setting.
gamma (float, default=1.0) – Discount factor. The value should be within (0, 1].
random_state (int, default=None (>= 0)) – Random state.
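
To make the role of gamma concrete, the following is a minimal sketch of how a discounted trajectory-wise value can be computed from a table of per-step rewards. It is an illustration only, not SCOPE-RL's implementation: the helper name `discounted_returns` is hypothetical, and it assumes the conventional definition in which the reward at step t is weighted by gamma ** t.

```python
import numpy as np


def discounted_returns(rewards, gamma=1.0):
    """Hypothetical helper: discounted sum of rewards for each trajectory.

    rewards has shape (n_trajectories, step_per_trajectory).
    Assumes the reward at step t is weighted by gamma ** t.
    """
    rewards = np.asarray(rewards, dtype=float)
    if not 0.0 < gamma <= 1.0:
        raise ValueError("gamma must be within (0, 1]")
    # One discount weight per timestep: 1, gamma, gamma**2, ...
    discounts = gamma ** np.arange(rewards.shape[1])
    # Weighted sum over the time axis gives one value per trajectory.
    return rewards @ discounts


# Two toy trajectories of three steps each.
rewards = np.array([[1.0, 1.0, 1.0], [0.0, 2.0, 4.0]])
values = discounted_returns(rewards, gamma=0.5)
undiscounted = discounted_returns(rewards, gamma=1.0)
```

With gamma=1.0 the result reduces to the plain sum of rewards per trajectory, which matches the default setting of this function.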

Returns:

on_policy_policy_values – Trajectory-wise on-policy policy values.

Return type:

ndarray of shape (n_trajectories, )
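
The difference between resetting the environment for every trajectory and continuing from the previous final state can be sketched with a toy example. Everything below is an assumption made for illustration: `ToyChainEnv`, `rollout_sketch`, and the `keep_state` flag are hypothetical names, the environment uses a simplified `reset`/`step` interface rather than the real `gym.Env` API, and the loop is not taken from SCOPE-RL's source.

```python
import numpy as np


class ToyChainEnv:
    """Hypothetical 5-state chain standing in for a gym.Env."""

    def __init__(self, n_states=5):
        self.n_states = n_states
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # Action 1 moves right, any other action moves left.
        move = 1 if action == 1 else -1
        self.state = int(np.clip(self.state + move, 0, self.n_states - 1))
        # Reward 1.0 only while sitting in the right-most state.
        reward = float(self.state == self.n_states - 1)
        return self.state, reward


def rollout_sketch(env, policy_fn, n_trajectories, step_per_trajectory,
                   gamma=1.0, keep_state=False):
    """Return one discounted value per trajectory, shape (n_trajectories,)."""
    values = np.zeros(n_trajectories)
    state = env.reset()
    for i in range(n_trajectories):
        # keep_state=True mimics evaluating without resetting the environment.
        if i > 0 and not keep_state:
            state = env.reset()
        for t in range(step_per_trajectory):
            state, reward = env.step(policy_fn(state))
            values[i] += gamma ** t * reward
    return values


def always_right(state):
    return 1


with_reset = rollout_sketch(ToyChainEnv(), always_right, 3, 5)
no_reset = rollout_sketch(ToyChainEnv(), always_right, 3, 5, keep_state=True)
```

In this toy setup every trajectory that starts from a reset needs several steps to reach the rewarding state, whereas trajectories that continue from the previous final state collect reward from their first step, so the two settings yield different trajectory-wise values.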