scope_rl.ope.online

On-Policy performance comparison.

Functions

calc_on_policy_conditional_value_at_risk

Calculate the conditional value at risk (CVaR) of the on-policy policy value.

calc_on_policy_cumulative_distribution_function

Calculate the cumulative distribution function of the on-policy policy value.

calc_on_policy_interquartile_range

Calculate the interquartile range of the on-policy policy value.

calc_on_policy_policy_value

Calculate an on-policy policy value of a given policy.

calc_on_policy_policy_value_interval

Estimate the confidence interval of on-policy policy value by nonparametric bootstrap.

calc_on_policy_statistics

Calculate the mean, variance, conditional value at risk, and interquartile range of the on-policy policy value.

calc_on_policy_variance

Calculate the variance of the on-policy policy value.

rollout_policy_online

Roll out a given policy on the environment and generate trajectory-wise rewards under the policy online.

visualize_on_policy_conditional_value_at_risk

Visualize the conditional value at risk of the on-policy policy value.

visualize_on_policy_cumulative_distribution_function

Visualize the cumulative distribution function of the on-policy policy value.

visualize_on_policy_interquartile_range

Visualize the interquartile range of the on-policy policy value.

visualize_on_policy_policy_value

Visualize on-policy policy value estimates of the given policies.

visualize_on_policy_policy_value_with_variance

Visualize on-policy policy value estimates of the given policies together with their variance.

scope_rl.ope.online.visualize_on_policy_policy_value(env, policies, policy_names, n_trajectories=100, step_per_trajectory=None, evaluate_on_stationary_distribution=False, gamma=1.0, alpha=0.05, n_bootstrap_samples=100, random_state=None, fig_dir=None, fig_name='on_policy_policy_value.png')

Visualize on-policy policy value estimates of the given policies.

Parameters:
  • env (gym.Env) – Reinforcement learning (RL) environment.

  • policies (list of {QLearningAlgoBase, BaseHead}) – List of policies to be evaluated.

  • policy_names (list of str) – Name of policies.

  • n_trajectories (int, default=100 (> 0)) – Number of trajectories to rollout.

  • step_per_trajectory (int, default=None (> 0)) – Number of timesteps in a trajectory.

  • evaluate_on_stationary_distribution (bool, default=False) – Whether to evaluate a policy based on the stationary state distribution induced by it. When True, the evaluation policy is evaluated by rollout without resetting the environment at each trajectory. This argument is irrelevant in the finite-horizon setting.

  • gamma (float, default=1.0) – Discount factor. The value should be within (0, 1].

  • alpha (float, default=0.05) – Significance level. The value should be within [0, 1).

  • n_bootstrap_samples (int, default=100 (> 0)) – Number of resamplings performed in the bootstrap procedure.

  • random_state (int, default=None (>= 0)) – Random state.

  • fig_dir (Path, default=None) – Path to store the bar figure. If None is given, the figure will not be saved.

  • fig_name (str, default="on_policy_policy_value.png") – Name of the bar figure.

scope_rl.ope.online.visualize_on_policy_policy_value_with_variance(env, policies, policy_names, n_trajectories=100, step_per_trajectory=None, evaluate_on_stationary_distribution=False, gamma=1.0, alpha=0.05, random_state=None, fig_dir=None, fig_name='estimated_policy_value.png')

Visualize on-policy policy value estimates of the given policies together with their variance.

Parameters:
  • env (gym.Env) – Reinforcement learning (RL) environment.

  • policies (list of {QLearningAlgoBase, BaseHead}) – List of policies to be evaluated.

  • policy_names (list of str) – Name of policies.

  • n_trajectories (int, default=100 (> 0)) – Number of trajectories to rollout.

  • step_per_trajectory (int, default=None (> 0)) – Number of timesteps in a trajectory.

  • evaluate_on_stationary_distribution (bool, default=False) – Whether to evaluate a policy based on the stationary state distribution induced by it. When True, the evaluation policy is evaluated by rollout without resetting the environment at each trajectory. This argument is irrelevant in the finite-horizon setting.

  • gamma (float, default=1.0) – Discount factor. The value should be within (0, 1].

  • alpha (float, default=0.05) – Significance level. The value should be within [0, 1).

  • random_state (int, default=None (>= 0)) – Random state.

  • fig_dir (Path, default=None) – Path to store the bar figure. If None is given, the figure will not be saved.

  • fig_name (str, default="estimated_policy_value.png") – Name of the bar figure.

scope_rl.ope.online.visualize_on_policy_cumulative_distribution_function(env, policies, policy_names, n_trajectories=100, step_per_trajectory=None, evaluate_on_stationary_distribution=False, gamma=1.0, use_custom_reward_scale=False, scale_min=None, scale_max=None, n_partition=None, random_state=None, legend=True, fig_dir=None, fig_name='on_policy_cumulative_distribution_function.png')

Visualize the cumulative distribution function of the on-policy policy value.

Parameters:
  • env (gym.Env) – Reinforcement learning (RL) environment.

  • policies (list of {QLearningAlgoBase, BaseHead}) – List of policies to be evaluated.

  • policy_names (list of str) – Name of policies.

  • n_trajectories (int, default=100 (> 0)) – Number of trajectories to rollout.

  • step_per_trajectory (int, default=None (> 0)) – Number of timesteps in a trajectory.

  • evaluate_on_stationary_distribution (bool, default=False) – Whether to evaluate a policy based on the stationary state distribution induced by it. When True, the evaluation policy is evaluated by rollout without resetting the environment at each trajectory. This argument is irrelevant in the finite-horizon setting.

  • gamma (float, default=1.0) – Discount factor. The value should be within (0, 1].

  • use_custom_reward_scale (bool, default=False) – Whether to use a customized reward scale or the reward observed under the behavior policy. If True, the reward scale is uniform, following Huang et al. (2021). If False, the reward scale follows the one defined in Chandak et al. (2021).

  • scale_min (float, default=None) – Minimum value of the reward scale in the CDF. When use_custom_reward_scale == True, a value must be given.

  • scale_max (float, default=None) – Maximum value of the reward scale in the CDF. When use_custom_reward_scale == True, a value must be given.

  • n_partition (int, default=None) – Number of partitions in the reward scale (x-axis of the CDF). When use_custom_reward_scale == True, a value must be given.

  • random_state (int, default=None (>= 0)) – Random state.

  • legend (bool, default=True) – Whether to include a legend in the figure.

  • fig_dir (Path, default=None) – Path to store the figure. If None is given, the figure will not be saved.

  • fig_name (str, default="on_policy_cumulative_distribution_function.png") – Name of the figure.

scope_rl.ope.online.visualize_on_policy_conditional_value_at_risk(env, policies, policy_names, n_trajectories=100, step_per_trajectory=None, evaluate_on_stationary_distribution=False, gamma=1.0, alphas=None, use_custom_reward_scale=False, scale_min=None, scale_max=None, n_partition=None, random_state=None, legend=True, fig_dir=None, fig_name='on_policy_conditional_value_at_risk.png')

Visualize the conditional value at risk of the on-policy policy value.

Parameters:
  • env (gym.Env) – Reinforcement learning (RL) environment.

  • policies (list of {QLearningAlgoBase, BaseHead}) – List of policies to be evaluated.

  • policy_names (list of str) – Name of policies.

  • n_trajectories (int, default=100 (> 0)) – Number of trajectories to rollout.

  • step_per_trajectory (int, default=None (> 0)) – Number of timesteps in a trajectory.

  • evaluate_on_stationary_distribution (bool, default=False) – Whether to evaluate a policy based on the stationary state distribution induced by it. When True, the evaluation policy is evaluated by rollout without resetting the environment at each trajectory. This argument is irrelevant in the finite-horizon setting.

  • gamma (float, default=1.0) – Discount factor. The value should be within (0, 1].

  • alphas (array-like of shape (n_alpha, ), default=None) – Set of proportions of the shaded region. The values should be within [0, 1). If None is given, np.linspace(0, 1, 21) will be used.

  • use_custom_reward_scale (bool, default=False) – Whether to use a customized reward scale or the reward observed under the behavior policy. If True, the reward scale is uniform, following Huang et al. (2021). If False, the reward scale follows the one defined in Chandak et al. (2021).

  • scale_min (float, default=None) – Minimum value of the reward scale in the CDF. When use_custom_reward_scale == True, a value must be given.

  • scale_max (float, default=None) – Maximum value of the reward scale in the CDF. When use_custom_reward_scale == True, a value must be given.

  • n_partition (int, default=None) – Number of partitions in the reward scale (x-axis of the CDF). When use_custom_reward_scale == True, a value must be given.

  • random_state (int, default=None (>= 0)) – Random state.

  • legend (bool, default=True) – Whether to include a legend in the figure.

  • fig_dir (Path, default=None) – Path to store the figure. If None is given, the figure will not be saved.

  • fig_name (str, default="on_policy_conditional_value_at_risk.png") – Name of the figure.

scope_rl.ope.online.visualize_on_policy_interquartile_range(env, policies, policy_names, n_trajectories=100, step_per_trajectory=None, evaluate_on_stationary_distribution=False, gamma=1.0, alpha=0.05, use_custom_reward_scale=False, scale_min=None, scale_max=None, n_partition=None, random_state=None, fig_dir=None, fig_name='on_policy_interquartile_range.png')

Visualize the interquartile range of the on-policy policy value.

Parameters:
  • env (gym.Env) – Reinforcement learning (RL) environment.

  • policies (list of {QLearningAlgoBase, BaseHead}) – List of policies to be evaluated.

  • policy_names (list of str) – Name of policies.

  • n_trajectories (int, default=100 (> 0)) – Number of trajectories to rollout.

  • step_per_trajectory (int, default=None (> 0)) – Number of timesteps in a trajectory.

  • evaluate_on_stationary_distribution (bool, default=False) – Whether to evaluate a policy based on the stationary state distribution induced by it. When True, the evaluation policy is evaluated by rollout without resetting the environment at each trajectory. This argument is irrelevant in the finite-horizon setting.

  • gamma (float, default=1.0) – Discount factor. The value should be within (0, 1].

  • alpha (float, default=0.05) – Significance level. The value should be within [0, 1).

  • use_custom_reward_scale (bool, default=False) – Whether to use a customized reward scale or the reward observed under the behavior policy. If True, the reward scale is uniform, following Huang et al. (2021). If False, the reward scale follows the one defined in Chandak et al. (2021).

  • scale_min (float, default=None) – Minimum value of the reward scale in the CDF. When use_custom_reward_scale == True, a value must be given.

  • scale_max (float, default=None) – Maximum value of the reward scale in the CDF. When use_custom_reward_scale == True, a value must be given.

  • n_partition (int, default=None) – Number of partitions in the reward scale (x-axis of the CDF). When use_custom_reward_scale == True, a value must be given.

  • random_state (int, default=None (>= 0)) – Random state.

  • fig_dir (Path, default=None) – Path to store the bar figure. If None is given, the figure will not be saved.

  • fig_name (str, default="on_policy_interquartile_range.png") – Name of the bar figure.

scope_rl.ope.online.calc_on_policy_statistics(env, policy, n_trajectories=100, step_per_trajectory=None, evaluate_on_stationary_distribution=False, gamma=1.0, quartile_alpha=0.05, cvar_alpha=0.05, use_custom_reward_scale=False, scale_min=None, scale_max=None, n_partition=None, random_state=None)

Calculate the mean, variance, conditional value at risk, and interquartile range of the on-policy policy value.

Parameters:
  • env (gym.Env) – Reinforcement learning (RL) environment.

  • policy ({QLearningAlgoBase, BaseHead}) – A policy to be evaluated.

  • n_trajectories (int, default=100 (> 0)) – Number of trajectories to rollout.

  • step_per_trajectory (int, default=None (> 0)) – Number of timesteps in a trajectory.

  • evaluate_on_stationary_distribution (bool, default=False) – Whether to evaluate a policy based on the stationary state distribution induced by it. When True, the evaluation policy is evaluated by rollout without resetting the environment at each trajectory. This argument is irrelevant in the finite-horizon setting.

  • gamma (float, default=1.0) – Discount factor. The value should be within (0, 1].

  • quartile_alpha (float, default=0.05) – Proportion of the shaded region. The value should be within (0, 1].

  • cvar_alpha (float, default=0.05) – Proportion of the shaded region. The value should be within (0, 1].

  • use_custom_reward_scale (bool, default=False) – Whether to use a customized reward scale or the reward observed under the behavior policy. If True, the reward scale is uniform, following Huang et al. (2021). If False, the reward scale follows the one defined in Chandak et al. (2021).

  • scale_min (float, default=None) – Minimum value of the reward scale in the CDF. When use_custom_reward_scale == True, a value must be given.

  • scale_max (float, default=None) – Maximum value of the reward scale in the CDF. When use_custom_reward_scale == True, a value must be given.

  • n_partition (int, default=None) – Number of partitions in the reward scale (x-axis of the CDF). When use_custom_reward_scale == True, a value must be given.

  • random_state (int, default=None (>= 0)) – Random state.

Returns:

statistics_dict – Dictionary containing the mean, variance, CVaR, and interquartile range of the on-policy policy value.

Return type:

dict

scope_rl.ope.online.calc_on_policy_policy_value(env, policy, n_trajectories=100, step_per_trajectory=None, evaluate_on_stationary_distribution=False, gamma=1.0, alpha=0.05, use_bootstrap=False, n_bootstrap_samples=100, random_state=None)

Calculate an on-policy policy value of a given policy.

Parameters:
  • env (gym.Env) – Reinforcement learning (RL) environment.

  • policy ({QLearningAlgoBase, BaseHead}) – A policy to be evaluated.

  • n_trajectories (int, default=100 (> 0)) – Number of trajectories to rollout.

  • step_per_trajectory (int, default=None (> 0)) – Number of timesteps in a trajectory.

  • evaluate_on_stationary_distribution (bool, default=False) – Whether to evaluate a policy based on the stationary state distribution induced by it. When True, the evaluation policy is evaluated by rollout without resetting the environment at each trajectory. This argument is irrelevant in the finite-horizon setting.

  • gamma (float, default=1.0) – Discount factor. The value should be within (0, 1].

  • alpha (float, default=0.05) – Significance level. The value should be within (0, 1].

  • use_bootstrap (bool, default=False) – Whether to use bootstrap sampling or not.

  • n_bootstrap_samples (int, default=100 (> 0)) – Number of resamplings performed in the bootstrap procedure.

  • random_state (int, default=None (>= 0)) – Random state.

Returns:

on_policy_policy_value – Average on-policy policy value.

Return type:

float

scope_rl.ope.online.calc_on_policy_policy_value_interval(env, policy, n_trajectories=100, step_per_trajectory=None, evaluate_on_stationary_distribution=False, gamma=1.0, alpha=0.05, n_bootstrap_samples=100, random_state=None)

Estimate the confidence interval of on-policy policy value by nonparametric bootstrap.

Parameters:
  • env (gym.Env) – Reinforcement learning (RL) environment.

  • policy ({QLearningAlgoBase, BaseHead}) – A policy to be evaluated.

  • n_trajectories (int, default=100 (> 0)) – Number of trajectories to rollout.

  • step_per_trajectory (int, default=None (> 0)) – Number of timesteps in a trajectory.

  • evaluate_on_stationary_distribution (bool, default=False) – Whether to evaluate a policy based on the stationary state distribution induced by it. When True, the evaluation policy is evaluated by rollout without resetting the environment at each trajectory. This argument is irrelevant in the finite-horizon setting.

  • gamma (float, default=1.0) – Discount factor. The value should be within (0, 1].

  • alpha (float, default=0.05) – Significance level. The value should be within [0, 1).

  • n_bootstrap_samples (int, default=100 (> 0)) – Number of resamplings performed in the bootstrap procedure.

  • random_state (int, default=None (>= 0)) – Random state.

Returns:

on_policy_confidence_interval – Dictionary storing the calculated mean and upper-lower confidence bounds.

Return type:

dict
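The nonparametric bootstrap described above can be sketched on an array of trajectory-wise returns as follows. This is a minimal illustration using a percentile interval, not necessarily how the library computes it; the helper name `bootstrap_interval` and the dictionary keys `"mean"`, `"lower"`, and `"upper"` are hypothetical.

```python
import numpy as np


def bootstrap_interval(returns, alpha=0.05, n_bootstrap_samples=100, random_state=None):
    """Percentile bootstrap interval for the mean return (hypothetical helper)."""
    returns = np.asarray(returns, dtype=float)
    rng = np.random.default_rng(random_state)
    n = len(returns)
    # Resample trajectories with replacement; one row per bootstrap replicate.
    idx = rng.integers(0, n, size=(n_bootstrap_samples, n))
    boot_means = returns[idx].mean(axis=1)
    # The interval covers the central (1 - alpha) mass of the bootstrap means.
    lower, upper = np.quantile(boot_means, [alpha / 2, 1 - alpha / 2])
    # Key names are assumptions made for this sketch only.
    return {"mean": float(boot_means.mean()), "lower": float(lower), "upper": float(upper)}


interval = bootstrap_interval([1.0, 2.0, 3.0, 4.0, 5.0], alpha=0.05, random_state=0)
```

A smaller alpha widens the interval, and a larger n_bootstrap_samples makes the bounds less noisy at the cost of more computation.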

scope_rl.ope.online.calc_on_policy_variance(env, policy, n_trajectories=100, step_per_trajectory=None, evaluate_on_stationary_distribution=False, gamma=1.0, random_state=None)

Calculate the variance of the on-policy policy value.

Parameters:
  • env (gym.Env) – Reinforcement learning (RL) environment.

  • policy ({QLearningAlgoBase, BaseHead}) – A policy to be evaluated.

  • n_trajectories (int, default=100 (> 0)) – Number of trajectories to rollout.

  • step_per_trajectory (int, default=None (> 0)) – Number of timesteps in a trajectory.

  • evaluate_on_stationary_distribution (bool, default=False) – Whether to evaluate a policy based on the stationary state distribution induced by it. When True, the evaluation policy is evaluated by rollout without resetting the environment at each trajectory. This argument is irrelevant in the finite-horizon setting.

  • gamma (float, default=1.0) – Discount factor. The value should be within (0, 1].

  • random_state (int, default=None (>= 0)) – Random state.

Returns:

on_policy_variance – Variance of the on-policy policy value.

Return type:

float

scope_rl.ope.online.calc_on_policy_conditional_value_at_risk(env, policy, n_trajectories=100, step_per_trajectory=None, evaluate_on_stationary_distribution=False, gamma=1.0, alphas=None, use_custom_reward_scale=False, scale_min=None, scale_max=None, n_partition=None, random_state=None)

Calculate the conditional value at risk (CVaR) of the on-policy policy value.

Parameters:
  • env (gym.Env) – Reinforcement learning (RL) environment.

  • policy ({QLearningAlgoBase, BaseHead}) – A policy to be evaluated.

  • n_trajectories (int, default=100 (> 0)) – Number of trajectories to rollout.

  • step_per_trajectory (int, default=None (> 0)) – Number of timesteps in a trajectory.

  • evaluate_on_stationary_distribution (bool, default=False) – Whether to evaluate a policy based on the stationary state distribution induced by it. When True, the evaluation policy is evaluated by rollout without resetting the environment at each trajectory. This argument is irrelevant in the finite-horizon setting.

  • gamma (float, default=1.0) – Discount factor. The value should be within (0, 1].

  • alphas (array-like of shape (n_alpha, ) or float, default=None) – Set of proportions of the shaded region. The values should be within [0, 1). If None is given, np.linspace(0, 1, 21) will be used.

  • use_custom_reward_scale (bool, default=False) – Whether to use a customized reward scale or the reward observed under the behavior policy. If True, the reward scale is uniform, following Huang et al. (2021). If False, the reward scale follows the one defined in Chandak et al. (2021).

  • scale_min (float, default=None) – Minimum value of the reward scale in the CDF. When use_custom_reward_scale == True, a value must be given.

  • scale_max (float, default=None) – Maximum value of the reward scale in the CDF. When use_custom_reward_scale == True, a value must be given.

  • n_partition (int, default=None) – Number of partitions in the reward scale (x-axis of the CDF). When use_custom_reward_scale == True, a value must be given.

  • random_state (int, default=None (>= 0)) – Random state.

Returns:

on_policy_conditional_value_at_risk – CVaR of the on-policy policy value.

Return type:

np.ndarray
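As a rough illustration of what a CVaR array over several alphas means, the sketch below computes the sample CVaR directly from trajectory-wise returns as the mean of the worst alpha fraction. It is an assumption-laden simplification: the helper `empirical_cvar` is hypothetical, and the library may instead derive CVaR from the estimated CDF on the chosen reward scale.

```python
import numpy as np


def empirical_cvar(returns, alphas):
    """Mean of the lowest alpha fraction of returns, for each alpha (hypothetical helper)."""
    sorted_returns = np.sort(np.asarray(returns, dtype=float))
    alphas = np.atleast_1d(np.asarray(alphas, dtype=float))
    n = len(sorted_returns)
    cvar = np.empty(len(alphas))
    for i, alpha in enumerate(alphas):
        # Keep at least one trajectory so that very small alphas stay defined.
        k = max(1, int(np.ceil(alpha * n)))
        cvar[i] = sorted_returns[:k].mean()
    return cvar


cvar = empirical_cvar([4.0, 1.0, 3.0, 2.0], alphas=[0.25, 0.5, 1.0])
# cvar is [1.0, 1.5, 2.5]: the worst 25%, the worst 50%, and all trajectories.
```

With alpha equal to 1 the CVaR coincides with the mean return, which is a quick sanity check on any implementation.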

scope_rl.ope.online.calc_on_policy_interquartile_range(env, policy, n_trajectories=100, step_per_trajectory=None, evaluate_on_stationary_distribution=False, gamma=1.0, alpha=0.05, use_custom_reward_scale=False, scale_min=None, scale_max=None, n_partition=None, random_state=None)

Calculate the interquartile range of the on-policy policy value.

Parameters:
  • env (gym.Env) – Reinforcement learning (RL) environment.

  • policy ({QLearningAlgoBase, BaseHead}) – A policy to be evaluated.

  • n_trajectories (int, default=100 (> 0)) – Number of trajectories to rollout.

  • step_per_trajectory (int, default=None (> 0)) – Number of timesteps in a trajectory.

  • evaluate_on_stationary_distribution (bool, default=False) – Whether to evaluate a policy based on the stationary state distribution induced by it. When True, the evaluation policy is evaluated by rollout without resetting the environment at each trajectory. This argument is irrelevant in the finite-horizon setting.

  • gamma (float, default=1.0) – Discount factor. The value should be within (0, 1].

  • alpha (float, default=0.05) – Proportion of the shaded region. The value should be within (0, 1].

  • use_custom_reward_scale (bool, default=False) – Whether to use a customized reward scale or the reward observed under the behavior policy. If True, the reward scale is uniform, following Huang et al. (2021). If False, the reward scale follows the one defined in Chandak et al. (2021).

  • scale_min (float, default=None) – Minimum value of the reward scale in the CDF. When use_custom_reward_scale == True, a value must be given.

  • scale_max (float, default=None) – Maximum value of the reward scale in the CDF. When use_custom_reward_scale == True, a value must be given.

  • n_partition (int, default=None) – Number of partitions in the reward scale (x-axis of the CDF). When use_custom_reward_scale == True, a value must be given.

  • random_state (int, default=None (>= 0)) – Random state.

Returns:

interquartile_range_dict – Dictionary containing the interquartile range of the on-policy policy value.

Return type:

dict
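One plausible reading of the alpha parameter is that the range spans the alpha and (1 - alpha) quantiles of the trajectory-wise returns around the median. The sketch below follows that reading with plain sample quantiles; the helper `empirical_quantile_range` and its dictionary keys are hypothetical, and the library may compute the bounds from the CDF on the reward scale instead.

```python
import numpy as np


def empirical_quantile_range(returns, alpha=0.05):
    """Median plus the alpha and (1 - alpha) sample quantiles (hypothetical helper)."""
    returns = np.asarray(returns, dtype=float)
    lower, median, upper = np.quantile(returns, [alpha, 0.5, 1 - alpha])
    # Key names are assumptions made for this sketch only.
    return {"median": float(median), "lower": float(lower), "upper": float(upper)}


# Returns 0, 1, ..., 100 make the quantiles easy to read off.
result = empirical_quantile_range(np.arange(101), alpha=0.05)
```

Under this reading, alpha=0.25 would give the conventional interquartile range between the first and third quartiles.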

scope_rl.ope.online.calc_on_policy_cumulative_distribution_function(env, policy, n_trajectories=100, step_per_trajectory=None, evaluate_on_stationary_distribution=False, gamma=1.0, use_custom_reward_scale=False, scale_min=None, scale_max=None, n_partition=None, random_state=None)

Calculate the cumulative distribution function of the on-policy policy value.

Parameters:
  • env (gym.Env) – Reinforcement learning (RL) environment.

  • policy ({QLearningAlgoBase, BaseHead}) – A policy to be evaluated.

  • n_trajectories (int, default=100 (> 0)) – Number of trajectories to rollout.

  • step_per_trajectory (int, default=None (> 0)) – Number of timesteps in a trajectory.

  • evaluate_on_stationary_distribution (bool, default=False) – Whether to evaluate a policy based on the stationary state distribution induced by it. When True, the evaluation policy is evaluated by rollout without resetting the environment at each trajectory. This argument is irrelevant in the finite-horizon setting.

  • gamma (float, default=1.0) – Discount factor. The value should be within (0, 1].

  • use_custom_reward_scale (bool, default=False) – Whether to use a customized reward scale or the reward observed under the behavior policy. If True, the reward scale is uniform, following Huang et al. (2021). If False, the reward scale follows the one defined in Chandak et al. (2021).

  • scale_min (float, default=None) – Minimum value of the reward scale in the CDF. When use_custom_reward_scale == True, a value must be given.

  • scale_max (float, default=None) – Maximum value of the reward scale in the CDF. When use_custom_reward_scale == True, a value must be given.

  • n_partition (int, default=None) – Number of partitions in the reward scale (x-axis of the CDF). When use_custom_reward_scale == True, a value must be given.

  • random_state (int, default=None (>= 0)) – Random state.

Returns:

  • cumulative_distribution_function (np.ndarray) – Cumulative distribution function of the on-policy policy value.

  • reward_scale (ndarray of shape (n_unique_reward, ) or (n_partition, )) – Reward Scale (x-axis of the cumulative distribution function).
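The two output shapes, (n_unique_reward, ) and (n_partition, ), correspond to the two ways of choosing the reward scale. The sketch below shows both on an array of trajectory-wise returns, assuming that the uniform scale is an evenly spaced grid between scale_min and scale_max; the helper `empirical_cdf` is hypothetical and is not the library's implementation.

```python
import numpy as np


def empirical_cdf(returns, use_custom_reward_scale=False,
                  scale_min=None, scale_max=None, n_partition=None):
    """Empirical CDF of returns on an observed or uniform reward scale (hypothetical helper)."""
    returns = np.asarray(returns, dtype=float)
    if use_custom_reward_scale:
        if scale_min is None or scale_max is None or n_partition is None:
            raise ValueError("scale_min, scale_max, and n_partition must be given")
        # Uniform grid on [scale_min, scale_max]: shape (n_partition, ).
        reward_scale = np.linspace(scale_min, scale_max, n_partition)
    else:
        # Sorted unique observed returns: shape (n_unique_reward, ).
        reward_scale = np.unique(returns)
    # F(x) is the fraction of trajectories whose return is at most x.
    cdf = (returns[None, :] <= reward_scale[:, None]).mean(axis=1)
    return cdf, reward_scale


cdf, scale = empirical_cdf([1.0, 2.0, 2.0, 4.0])
grid_cdf, grid = empirical_cdf([1.0, 2.0, 2.0, 4.0], use_custom_reward_scale=True,
                               scale_min=0.0, scale_max=4.0, n_partition=5)
```

The observed scale places a point at every distinct return, whereas the uniform grid gives a fixed x-axis that is comparable across policies.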

scope_rl.ope.online.rollout_policy_online(env, policy, n_trajectories=100, step_per_trajectory=None, evaluate_on_stationary_distribution=False, gamma=1.0, random_state=None)

Roll out a given policy on the environment and generate trajectory-wise rewards under the policy online.

Parameters:
  • env (gym.Env) – Reinforcement learning (RL) environment.

  • policy ({QLearningAlgoBase, BaseHead}) – A policy to be evaluated.

  • n_trajectories (int, default=100 (> 0)) – Number of trajectories to rollout.

  • step_per_trajectory (int, default=None (> 0)) – Number of timesteps in a trajectory.

  • evaluate_on_stationary_distribution (bool, default=False) – Whether to evaluate a policy based on the stationary state distribution induced by it. When True, the evaluation policy is evaluated by rollout without resetting the environment at each trajectory. This argument is irrelevant in the finite-horizon setting.

  • gamma (float, default=1.0) – Discount factor. The value should be within (0, 1].

  • random_state (int, default=None (>= 0)) – Random state.

Returns:

on_policy_policy_values – Trajectory-wise on-policy policy values.

Return type:

ndarray of shape (n_trajectories, )
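To make the returned array concrete, the sketch below rolls out a policy on a toy environment and accumulates the discounted reward of each trajectory. Everything in it is an assumption for illustration: `ToyEnv` is a hypothetical stand-in with a Gymnasium-style reset/step interface, the policy is a plain callable rather than a QLearningAlgoBase or BaseHead, and the environment is reset at every trajectory (the evaluate_on_stationary_distribution=False case).

```python
import numpy as np


class ToyEnv:
    """Hypothetical 3-step environment with a Gymnasium-style reset/step API."""

    horizon = 3

    def reset(self, seed=None):
        self.t = 0
        return 0, {}

    def step(self, action):
        self.t += 1
        reward = float(action)  # the reward simply equals the chosen action
        truncated = self.t >= self.horizon
        return self.t, reward, False, truncated, {}


def rollout_returns(env, policy, n_trajectories=100, gamma=1.0):
    """Discounted return of each trajectory, with a reset per trajectory."""
    returns = np.zeros(n_trajectories)
    for i in range(n_trajectories):
        state, _ = env.reset()
        done, discount = False, 1.0
        while not done:
            state, reward, terminated, truncated, _ = env.step(policy(state))
            returns[i] += discount * reward
            discount *= gamma
            done = terminated or truncated
    return returns


# A constant policy that always picks action 1 earns 1 + 0.5 + 0.25 per trajectory.
returns = rollout_returns(ToyEnv(), policy=lambda state: 1, n_trajectories=5, gamma=0.5)
```

The mean and variance of such an array are the sample analogues of the on-policy policy value and its variance.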