scope_rl.ope.online

On-Policy performance comparison.

Functions

`calc_on_policy_conditional_value_at_risk`	Calculate the conditional value at risk (CVaR) of the on-policy policy value.
`calc_on_policy_cumulative_distribution_function`	Calculate the cumulative distribution function of the on-policy policy value.
`calc_on_policy_interquartile_range`	Calculate the interquartile range of the on-policy policy value.
`calc_on_policy_policy_value`	Calculate an on-policy policy value of a given policy.
`calc_on_policy_policy_value_interval`	Estimate the confidence interval of on-policy policy value by nonparametric bootstrap.
`calc_on_policy_statistics`	Calculate the mean, variance, conditional value at risk, and interquartile range of the on-policy policy value.
`calc_on_policy_variance`	Calculate the variance of the on-policy policy value.
`rollout_policy_online`	Roll out a given policy on the environment and generate trajectory-wise rewards under the policy online.
`visualize_on_policy_conditional_value_at_risk`	Visualize the conditional value at risk of the on-policy policy value.
`visualize_on_policy_cumulative_distribution_function`	Visualize the cumulative distribution function of the on-policy policy value.
`visualize_on_policy_interquartile_range`	Visualize the interquartile range of the on-policy policy value.
`visualize_on_policy_policy_value`	Visualize on-policy policy value estimates of the given policies.
`visualize_on_policy_policy_value_with_variance`	Visualize the on-policy policy value of the given policies along with its variance.

scope_rl.ope.online.visualize_on_policy_policy_value(env, policies, policy_names, n_trajectories=100, step_per_trajectory=None, evaluate_on_stationary_distribution=False, gamma=1.0, alpha=0.05, n_bootstrap_samples=100, random_state=None, fig_dir=None, fig_name='on_policy_policy_value.png')

Visualize on-policy policy value estimates of the given policies.

Parameters:

env (gym.Env) – Reinforcement learning (RL) environment.
policies (list of {QLearningAlgoBase, BaseHead}) – List of policies to be evaluated.
policy_names (list of str) – Names of the policies.
n_trajectories (int, default=100 (> 0)) – Number of trajectories to roll out.
step_per_trajectory (int, default=None (> 0)) – Number of timesteps in a trajectory.
evaluate_on_stationary_distribution (bool, default=False) – Whether to evaluate a policy based on the stationary state distribution induced by it. When True, the policy is evaluated by rolling it out without resetting the environment between trajectories. This argument is irrelevant in the finite-horizon setting.
gamma (float, default=1.0) – Discount factor. The value should be within (0, 1].
alpha (float, default=0.05) – Significance level. The value should be within [0, 1).
n_bootstrap_samples (int, default=100 (> 0)) – Number of resamples drawn in the bootstrap procedure.
random_state (int, default=None (>= 0)) – Random state.
fig_dir (Path, default=None) – Path to store the bar figure. If None is given, the figure will not be saved.
fig_name (str, default="on_policy_policy_value.png") – Name of the bar figure.

scope_rl.ope.online.visualize_on_policy_policy_value_with_variance(env, policies, policy_names, n_trajectories=100, step_per_trajectory=None, evaluate_on_stationary_distribution=False, gamma=1.0, alpha=0.05, random_state=None, fig_dir=None, fig_name='estimated_policy_value.png')

Visualize the on-policy policy value of the given policies along with its variance.

Parameters:

env (gym.Env) – Reinforcement learning (RL) environment.
policies (list of {QLearningAlgoBase, BaseHead}) – List of policies to be evaluated.
policy_names (list of str) – Names of the policies.
n_trajectories (int, default=100 (> 0)) – Number of trajectories to roll out.
step_per_trajectory (int, default=None (> 0)) – Number of timesteps in a trajectory.
evaluate_on_stationary_distribution (bool, default=False) – Whether to evaluate a policy based on the stationary state distribution induced by it. When True, the policy is evaluated by rolling it out without resetting the environment between trajectories. This argument is irrelevant in the finite-horizon setting.
gamma (float, default=1.0) – Discount factor. The value should be within (0, 1].
alpha (float, default=0.05) – Significance level. The value should be within [0, 1).
random_state (int, default=None (>= 0)) – Random state.
fig_dir (Path, default=None) – Path to store the bar figure. If None is given, the figure will not be saved.
fig_name (str, default="estimated_policy_value.png") – Name of the bar figure.

scope_rl.ope.online.visualize_on_policy_cumulative_distribution_function(env, policies, policy_names, n_trajectories=100, step_per_trajectory=None, evaluate_on_stationary_distribution=False, gamma=1.0, use_custom_reward_scale=False, scale_min=None, scale_max=None, n_partition=None, random_state=None, legend=True, fig_dir=None, fig_name='on_policy_cumulative_distribution_function.png')

Visualize the cumulative distribution function of the on-policy policy value.

Parameters:

env (gym.Env) – Reinforcement learning (RL) environment.
policies (list of {QLearningAlgoBase, BaseHead}) – List of policies to be evaluated.
policy_names (list of str) – Names of the policies.
n_trajectories (int, default=100 (> 0)) – Number of trajectories to roll out.
step_per_trajectory (int, default=None (> 0)) – Number of timesteps in a trajectory.
evaluate_on_stationary_distribution (bool, default=False) – Whether to evaluate a policy based on the stationary state distribution induced by it. When True, the policy is evaluated by rolling it out without resetting the environment between trajectories. This argument is irrelevant in the finite-horizon setting.
gamma (float, default=1.0) – Discount factor. The value should be within (0, 1].
use_custom_reward_scale (bool, default=False) – Whether to use a customized reward scale or the rewards observed in the rollouts. If True, the reward scale is a uniform grid, following Huang et al. (2021). If False, the reward scale follows the one defined in Chandak et al. (2021).
scale_min (float, default=None) – Minimum value of the reward scale in the CDF. When use_custom_reward_scale == True, a value must be given.
scale_max (float, default=None) – Maximum value of the reward scale in the CDF. When use_custom_reward_scale == True, a value must be given.
n_partition (int, default=None) – Number of partitions in the reward scale (x-axis of the CDF). When use_custom_reward_scale == True, a value must be given.
random_state (int, default=None (>= 0)) – Random state.
legend (bool, default=True) – Whether to include a legend in the figure.
fig_dir (Path, default=None) – Path to store the figure. If None is given, the figure will not be saved.
fig_name (str, default="on_policy_cumulative_distribution_function.png") – Name of the figure.

scope_rl.ope.online.visualize_on_policy_conditional_value_at_risk(env, policies, policy_names, n_trajectories=100, step_per_trajectory=None, evaluate_on_stationary_distribution=False, gamma=1.0, alphas=None, use_custom_reward_scale=False, scale_min=None, scale_max=None, n_partition=None, random_state=None, legend=True, fig_dir=None, fig_name='on_policy_conditional_value_at_risk.png')

Visualize the conditional value at risk of the on-policy policy value.

Parameters:

env (gym.Env) – Reinforcement learning (RL) environment.
policies (list of {QLearningAlgoBase, BaseHead}) – List of policies to be evaluated.
policy_names (list of str) – Names of the policies.
n_trajectories (int, default=100 (> 0)) – Number of trajectories to roll out.
step_per_trajectory (int, default=None (> 0)) – Number of timesteps in a trajectory.
evaluate_on_stationary_distribution (bool, default=False) – Whether to evaluate a policy based on the stationary state distribution induced by it. When True, the policy is evaluated by rolling it out without resetting the environment between trajectories. This argument is irrelevant in the finite-horizon setting.
gamma (float, default=1.0) – Discount factor. The value should be within (0, 1].
alphas (array-like of shape (n_alpha, ), default=None) – Set of proportions of the shaded region. The values should be within [0, 1). If None is given, np.linspace(0, 1, 21) will be used.
use_custom_reward_scale (bool, default=False) – Whether to use a customized reward scale or the rewards observed in the rollouts. If True, the reward scale is a uniform grid, following Huang et al. (2021). If False, the reward scale follows the one defined in Chandak et al. (2021).
scale_min (float, default=None) – Minimum value of the reward scale in the CDF. When use_custom_reward_scale == True, a value must be given.
scale_max (float, default=None) – Maximum value of the reward scale in the CDF. When use_custom_reward_scale == True, a value must be given.
n_partition (int, default=None) – Number of partitions in the reward scale (x-axis of the CDF). When use_custom_reward_scale == True, a value must be given.
random_state (int, default=None (>= 0)) – Random state.
legend (bool, default=True) – Whether to include a legend in the figure.
fig_dir (Path, default=None) – Path to store the figure. If None is given, the figure will not be saved.
fig_name (str, default="on_policy_conditional_value_at_risk.png") – Name of the figure.

scope_rl.ope.online.visualize_on_policy_interquartile_range(env, policies, policy_names, n_trajectories=100, step_per_trajectory=None, evaluate_on_stationary_distribution=False, gamma=1.0, alpha=0.05, use_custom_reward_scale=False, scale_min=None, scale_max=None, n_partition=None, random_state=None, fig_dir=None, fig_name='on_policy_interquartile_range.png')

Visualize the interquartile range of the on-policy policy value.

Parameters:

env (gym.Env) – Reinforcement learning (RL) environment.
policies (list of {QLearningAlgoBase, BaseHead}) – List of policies to be evaluated.
policy_names (list of str) – Names of the policies.
n_trajectories (int, default=100 (> 0)) – Number of trajectories to roll out.
step_per_trajectory (int, default=None (> 0)) – Number of timesteps in a trajectory.
evaluate_on_stationary_distribution (bool, default=False) – Whether to evaluate a policy based on the stationary state distribution induced by it. When True, the policy is evaluated by rolling it out without resetting the environment between trajectories. This argument is irrelevant in the finite-horizon setting.
gamma (float, default=1.0) – Discount factor. The value should be within (0, 1].
alpha (float, default=0.05) – Significance level. The value should be within [0, 1).
use_custom_reward_scale (bool, default=False) – Whether to use a customized reward scale or the rewards observed in the rollouts. If True, the reward scale is a uniform grid, following Huang et al. (2021). If False, the reward scale follows the one defined in Chandak et al. (2021).
scale_min (float, default=None) – Minimum value of the reward scale in the CDF. When use_custom_reward_scale == True, a value must be given.
scale_max (float, default=None) – Maximum value of the reward scale in the CDF. When use_custom_reward_scale == True, a value must be given.
n_partition (int, default=None) – Number of partitions in the reward scale (x-axis of the CDF). When use_custom_reward_scale == True, a value must be given.
random_state (int, default=None (>= 0)) – Random state.
fig_dir (Path, default=None) – Path to store the bar figure. If None is given, the figure will not be saved.
fig_name (str, default="on_policy_interquartile_range.png") – Name of the bar figure.

scope_rl.ope.online.calc_on_policy_statistics(env, policy, n_trajectories=100, step_per_trajectory=None, evaluate_on_stationary_distribution=False, gamma=1.0, quartile_alpha=0.05, cvar_alpha=0.05, use_custom_reward_scale=False, scale_min=None, scale_max=None, n_partition=None, random_state=None)

Calculate the mean, variance, conditional value at risk, and interquartile range of the on-policy policy value.

Parameters:

env (gym.Env) – Reinforcement learning (RL) environment.
policy ({QLearningAlgoBase, BaseHead}) – A policy to be evaluated.
n_trajectories (int, default=100 (> 0)) – Number of trajectories to roll out.
step_per_trajectory (int, default=None (> 0)) – Number of timesteps in a trajectory.
evaluate_on_stationary_distribution (bool, default=False) – Whether to evaluate a policy based on the stationary state distribution induced by it. When True, the policy is evaluated by rolling it out without resetting the environment between trajectories. This argument is irrelevant in the finite-horizon setting.
gamma (float, default=1.0) – Discount factor. The value should be within (0, 1].
quartile_alpha (float, default=0.05) – Proportion of the shaded region. The value should be within (0, 1].
cvar_alpha (float, default=0.05) – Proportion of the shaded region. The value should be within (0, 1].
use_custom_reward_scale (bool, default=False) – Whether to use a customized reward scale or the rewards observed in the rollouts. If True, the reward scale is a uniform grid, following Huang et al. (2021). If False, the reward scale follows the one defined in Chandak et al. (2021).
scale_min (float, default=None) – Minimum value of the reward scale in the CDF. When use_custom_reward_scale == True, a value must be given.
scale_max (float, default=None) – Maximum value of the reward scale in the CDF. When use_custom_reward_scale == True, a value must be given.
n_partition (int, default=None) – Number of partitions in the reward scale (x-axis of the CDF). When use_custom_reward_scale == True, a value must be given.
random_state (int, default=None (>= 0)) – Random state.

Returns:

statistics_dict – Dictionary containing the mean, variance, CVaR, and interquartile range of the on-policy policy value.

Return type:

dict

scope_rl.ope.online.calc_on_policy_policy_value(env, policy, n_trajectories=100, step_per_trajectory=None, evaluate_on_stationary_distribution=False, gamma=1.0, alpha=0.05, use_bootstrap=False, n_bootstrap_samples=100, random_state=None)

Calculate an on-policy policy value of a given policy.

Parameters:

env (gym.Env) – Reinforcement learning (RL) environment.
policy ({QLearningAlgoBase, BaseHead}) – A policy to be evaluated.
n_trajectories (int, default=100 (> 0)) – Number of trajectories to roll out.
step_per_trajectory (int, default=None (> 0)) – Number of timesteps in a trajectory.
evaluate_on_stationary_distribution (bool, default=False) – Whether to evaluate a policy based on the stationary state distribution induced by it. When True, the policy is evaluated by rolling it out without resetting the environment between trajectories. This argument is irrelevant in the finite-horizon setting.
gamma (float, default=1.0) – Discount factor. The value should be within (0, 1].
alpha (float, default=0.05) – Significance level. The value should be within (0, 1].
use_bootstrap (bool, default=False) – Whether to use bootstrap sampling or not.
n_bootstrap_samples (int, default=100 (> 0)) – Number of resamples drawn in the bootstrap procedure.
random_state (int, default=None (>= 0)) – Random state.

Returns:

on_policy_policy_value – Average on-policy policy value.

Return type:

float
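
As a rough illustration of the quantity described above, the following sketch averages discounted trajectory returns in plain Python. The helper names `discounted_return` and `policy_value` and the toy reward lists are assumptions made for this example; they are not part of scope_rl and are not meant to mirror its internals.

```python
def discounted_return(rewards, gamma=1.0):
    """Sum of gamma**t * r_t over one trajectory (hypothetical helper)."""
    total = 0.0
    for t, reward in enumerate(rewards):
        total += (gamma ** t) * reward
    return total


def policy_value(trajectory_rewards, gamma=1.0):
    """Average discounted return over trajectories (hypothetical helper)."""
    returns = [discounted_return(rewards, gamma) for rewards in trajectory_rewards]
    return sum(returns) / len(returns)


# Toy data (assumption): two trajectories of three steps each.
trajectory_rewards = [[1.0, 1.0, 1.0], [0.0, 2.0, 0.0]]
value = policy_value(trajectory_rewards, gamma=0.5)
print(value)
```

With `gamma=1.0` the same helper reduces to the plain average of undiscounted cumulative rewards.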

scope_rl.ope.online.calc_on_policy_policy_value_interval(env, policy, n_trajectories=100, step_per_trajectory=None, evaluate_on_stationary_distribution=False, gamma=1.0, alpha=0.05, n_bootstrap_samples=100, random_state=None)

Estimate the confidence interval of on-policy policy value by nonparametric bootstrap.

Parameters:

env (gym.Env) – Reinforcement learning (RL) environment.
policy ({QLearningAlgoBase, BaseHead}) – A policy to be evaluated.
n_trajectories (int, default=100 (> 0)) – Number of trajectories to roll out.
step_per_trajectory (int, default=None (> 0)) – Number of timesteps in a trajectory.
evaluate_on_stationary_distribution (bool, default=False) – Whether to evaluate a policy based on the stationary state distribution induced by it. When True, the policy is evaluated by rolling it out without resetting the environment between trajectories. This argument is irrelevant in the finite-horizon setting.
gamma (float, default=1.0) – Discount factor. The value should be within (0, 1].
alpha (float, default=0.05) – Significance level. The value should be within [0, 1).
n_bootstrap_samples (int, default=100 (> 0)) – Number of resamples drawn in the bootstrap procedure.
random_state (int, default=None (>= 0)) – Random state.

Returns:

on_policy_confidence_interval – Dictionary storing the calculated mean and upper-lower confidence bounds.

Return type:

dict
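
To make the nonparametric bootstrap concrete, here is a minimal percentile-bootstrap sketch over trajectory-wise returns. The function name `bootstrap_interval`, the dictionary keys `"mean"`, `"lower"`, and `"upper"`, and the toy data are assumptions for illustration; the actual keys and resampling details used by scope_rl may differ.

```python
import random


def bootstrap_interval(values, alpha=0.05, n_bootstrap_samples=100, random_state=None):
    """Percentile bootstrap interval for the mean (hypothetical helper)."""
    rng = random.Random(random_state)
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    boot_means = sorted(
        sum(rng.choices(values, k=n)) / n for _ in range(n_bootstrap_samples)
    )
    lower_idx = int((alpha / 2) * n_bootstrap_samples)
    upper_idx = min(
        int((1 - alpha / 2) * n_bootstrap_samples), n_bootstrap_samples - 1
    )
    return {
        "mean": sum(boot_means) / n_bootstrap_samples,  # key names are assumptions
        "lower": boot_means[lower_idx],
        "upper": boot_means[upper_idx],
    }


# Toy trajectory-wise returns (assumption).
returns = [float(v % 7) for v in range(50)]
interval = bootstrap_interval(returns, alpha=0.05, n_bootstrap_samples=200, random_state=0)
print(interval)
```

Fixing `random_state` makes the resampling reproducible, which is the role that argument appears to play in the signature above.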

scope_rl.ope.online.calc_on_policy_variance(env, policy, n_trajectories=100, step_per_trajectory=None, evaluate_on_stationary_distribution=False, gamma=1.0, random_state=None)

Calculate the variance of the on-policy policy value.

Parameters:

env (gym.Env) – Reinforcement learning (RL) environment.
policy ({QLearningAlgoBase, BaseHead}) – A policy to be evaluated.
n_trajectories (int, default=100 (> 0)) – Number of trajectories to roll out.
step_per_trajectory (int, default=None (> 0)) – Number of timesteps in a trajectory.
evaluate_on_stationary_distribution (bool, default=False) – Whether to evaluate a policy based on the stationary state distribution induced by it. When True, the policy is evaluated by rolling it out without resetting the environment between trajectories. This argument is irrelevant in the finite-horizon setting.
gamma (float, default=1.0) – Discount factor. The value should be within (0, 1].
random_state (int, default=None (>= 0)) – Random state.

Returns:

on_policy_variance – Variance of the on-policy policy value.

Return type:

float
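
This page does not state whether the variance is normalized by n or by n - 1. The short sketch below computes both over a toy list of trajectory-wise returns so the difference is visible; the data and variable names are assumptions for illustration only.

```python
import statistics

# Toy trajectory-wise returns (assumption).
returns = [1.0, 2.0, 3.0, 4.0]

# Population variance: squared deviations from the mean divided by n.
population_variance = statistics.pvariance(returns)

# Sample variance: squared deviations from the mean divided by n - 1.
sample_variance = statistics.variance(returns)

print(population_variance, sample_variance)
```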

scope_rl.ope.online.calc_on_policy_conditional_value_at_risk(env, policy, n_trajectories=100, step_per_trajectory=None, evaluate_on_stationary_distribution=False, gamma=1.0, alphas=None, use_custom_reward_scale=False, scale_min=None, scale_max=None, n_partition=None, random_state=None)

Calculate the conditional value at risk (CVaR) of the on-policy policy value.

Parameters:

env (gym.Env) – Reinforcement learning (RL) environment.
policy ({QLearningAlgoBase, BaseHead}) – A policy to be evaluated.
n_trajectories (int, default=100 (> 0)) – Number of trajectories to roll out.
step_per_trajectory (int, default=None (> 0)) – Number of timesteps in a trajectory.
evaluate_on_stationary_distribution (bool, default=False) – Whether to evaluate a policy based on the stationary state distribution induced by it. When True, the policy is evaluated by rolling it out without resetting the environment between trajectories. This argument is irrelevant in the finite-horizon setting.
gamma (float, default=1.0) – Discount factor. The value should be within (0, 1].
alphas (array-like of shape (n_alpha, ) or float, default=None) – Set of proportions of the shaded region. The values should be within [0, 1). If None is given, np.linspace(0, 1, 21) will be used.
use_custom_reward_scale (bool, default=False) – Whether to use a customized reward scale or the rewards observed in the rollouts. If True, the reward scale is a uniform grid, following Huang et al. (2021). If False, the reward scale follows the one defined in Chandak et al. (2021).
scale_min (float, default=None) – Minimum value of the reward scale in the CDF. When use_custom_reward_scale == True, a value must be given.
scale_max (float, default=None) – Maximum value of the reward scale in the CDF. When use_custom_reward_scale == True, a value must be given.
n_partition (int, default=None) – Number of partitions in the reward scale (x-axis of the CDF). When use_custom_reward_scale == True, a value must be given.
random_state (int, default=None (>= 0)) – Random state.

Returns:

on_policy_conditional_value_at_risk – CVaR of the on-policy policy value.

Return type:

np.ndarray
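
For intuition, CVaR at level alpha is commonly understood as the average of the worst alpha-fraction of outcomes. The sketch below applies that lower-tail definition directly to sorted returns. The helper `empirical_cvar` and the toy data are assumptions; scope_rl may instead derive the value from a CDF over a reward scale, so treat this as a simplified illustration rather than the library's method.

```python
import math


def empirical_cvar(returns, alpha):
    """Mean of the lowest alpha-fraction of returns (hypothetical helper)."""
    ordered = sorted(returns)
    # Keep at least one sample so that very small alpha values stay defined.
    k = max(1, math.ceil(alpha * len(ordered)))
    tail = ordered[:k]
    return sum(tail) / k


# Toy trajectory-wise returns 1.0, 2.0, ..., 10.0 (assumption).
returns = [float(v) for v in range(1, 11)]
cvars = [empirical_cvar(returns, alpha) for alpha in (0.1, 0.5, 1.0)]
print(cvars)
```

At `alpha=1.0` the tail covers every sample, so the value coincides with the plain mean of the returns.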

scope_rl.ope.online.calc_on_policy_interquartile_range(env, policy, n_trajectories=100, step_per_trajectory=None, evaluate_on_stationary_distribution=False, gamma=1.0, alpha=0.05, use_custom_reward_scale=False, scale_min=None, scale_max=None, n_partition=None, random_state=None)

Calculate the interquartile range of the on-policy policy value.

Parameters:

env (gym.Env) – Reinforcement learning (RL) environment.
policy ({QLearningAlgoBase, BaseHead}) – A policy to be evaluated.
n_trajectories (int, default=100 (> 0)) – Number of trajectories to roll out.
step_per_trajectory (int, default=None (> 0)) – Number of timesteps in a trajectory.
evaluate_on_stationary_distribution (bool, default=False) – Whether to evaluate a policy based on the stationary state distribution induced by it. When True, the policy is evaluated by rolling it out without resetting the environment between trajectories. This argument is irrelevant in the finite-horizon setting.
gamma (float, default=1.0) – Discount factor. The value should be within (0, 1].
alpha (float, default=0.05) – Proportion of the shaded region. The value should be within (0, 1].
use_custom_reward_scale (bool, default=False) – Whether to use a customized reward scale or the rewards observed in the rollouts. If True, the reward scale is a uniform grid, following Huang et al. (2021). If False, the reward scale follows the one defined in Chandak et al. (2021).
scale_min (float, default=None) – Minimum value of the reward scale in the CDF. When use_custom_reward_scale == True, a value must be given.
scale_max (float, default=None) – Maximum value of the reward scale in the CDF. When use_custom_reward_scale == True, a value must be given.
n_partition (int, default=None) – Number of partitions in the reward scale (x-axis of the CDF). When use_custom_reward_scale == True, a value must be given.
random_state (int, default=None (>= 0)) – Random state.

Returns:

interquartile_range_dict – Dictionary containing the interquartile range of the on-policy policy value.

Return type:

dict
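
The sketch below shows one way a quantile-based range could be read off the empirical distribution of returns. It assumes that `alpha` selects the `alpha` and `1 - alpha` quantiles around the median, which this page does not spell out. The helper names and dictionary keys are hypothetical and are not the keys returned by scope_rl.

```python
def empirical_quantile(ordered, q):
    """Smallest sorted value whose empirical CDF reaches q (hypothetical helper)."""
    n = len(ordered)
    for i, value in enumerate(ordered, start=1):
        if i / n >= q:
            return value
    return ordered[-1]


def quantile_range(returns, alpha=0.05):
    """Median with lower and upper quantiles; key names are assumptions."""
    ordered = sorted(returns)
    return {
        "median": empirical_quantile(ordered, 0.5),
        "lower": empirical_quantile(ordered, alpha),
        "upper": empirical_quantile(ordered, 1 - alpha),
    }


# Toy trajectory-wise returns 1.0, 2.0, ..., 20.0 (assumption).
returns = [float(v) for v in range(1, 21)]
summary = quantile_range(returns, alpha=0.25)
print(summary)
```

With `alpha=0.25` this is the textbook interquartile range; smaller values of `alpha` widen the interval toward the tails.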

scope_rl.ope.online.calc_on_policy_cumulative_distribution_function(env, policy, n_trajectories=100, step_per_trajectory=None, evaluate_on_stationary_distribution=False, gamma=1.0, use_custom_reward_scale=False, scale_min=None, scale_max=None, n_partition=None, random_state=None)

Calculate the cumulative distribution function of the on-policy policy value.

Parameters:

env (gym.Env) – Reinforcement learning (RL) environment.
policy ({QLearningAlgoBase, BaseHead}) – A policy to be evaluated.
n_trajectories (int, default=100 (> 0)) – Number of trajectories to roll out.
step_per_trajectory (int, default=None (> 0)) – Number of timesteps in a trajectory.
evaluate_on_stationary_distribution (bool, default=False) – Whether to evaluate a policy based on the stationary state distribution induced by it. When True, the policy is evaluated by rolling it out without resetting the environment between trajectories. This argument is irrelevant in the finite-horizon setting.
gamma (float, default=1.0) – Discount factor. The value should be within (0, 1].
use_custom_reward_scale (bool, default=False) – Whether to use a customized reward scale or the rewards observed in the rollouts. If True, the reward scale is a uniform grid, following Huang et al. (2021). If False, the reward scale follows the one defined in Chandak et al. (2021).
scale_min (float, default=None) – Minimum value of the reward scale in the CDF. When use_custom_reward_scale == True, a value must be given.
scale_max (float, default=None) – Maximum value of the reward scale in the CDF. When use_custom_reward_scale == True, a value must be given.
n_partition (int, default=None) – Number of partitions in the reward scale (x-axis of the CDF). When use_custom_reward_scale == True, a value must be given.
random_state (int, default=None (>= 0)) – Random state.

Returns:

cumulative_distribution_function (np.ndarray) – Cumulative distribution function of the on-policy policy value.
reward_scale (ndarray of shape (n_unique_reward, ) or (n_partition, )) – Reward scale (x-axis of the cumulative distribution function).
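
The two possible shapes of `reward_scale` correspond to two ways of choosing the x-axis. The sketch below evaluates an empirical CDF once on the unique observed returns and once on a uniform grid with `n_partition` points between `scale_min` and `scale_max`. The helpers `empirical_cdf` and `uniform_scale` and the toy data are assumptions for illustration and do not reproduce the library's code.

```python
def empirical_cdf(returns, reward_scale):
    """Fraction of returns at or below each point of the scale (hypothetical helper)."""
    n = len(returns)
    return [sum(1 for r in returns if r <= x) / n for x in reward_scale]


def uniform_scale(scale_min, scale_max, n_partition):
    """Evenly spaced grid with n_partition points, endpoints included (assumption)."""
    step = (scale_max - scale_min) / (n_partition - 1)
    return [scale_min + i * step for i in range(n_partition)]


# Toy trajectory-wise returns (assumption).
returns = [0.0, 1.0, 1.0, 3.0]

# Scale built from the unique observed values: length n_unique_reward.
observed_scale = sorted(set(returns))
cdf_observed = empirical_cdf(returns, observed_scale)

# Custom uniform scale: length n_partition.
custom_scale = uniform_scale(scale_min=0.0, scale_max=4.0, n_partition=5)
cdf_custom = empirical_cdf(returns, custom_scale)

print(observed_scale, cdf_observed)
print(custom_scale, cdf_custom)
```

A uniform grid keeps the x-axis identical across policies, which makes curves from different policies easier to overlay.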

scope_rl.ope.online.rollout_policy_online(env, policy, n_trajectories=100, step_per_trajectory=None, evaluate_on_stationary_distribution=False, gamma=1.0, random_state=None)

Roll out a given policy on the environment and generate trajectory-wise rewards under the policy online.

Parameters:

env (gym.Env) – Reinforcement learning (RL) environment.
policy ({QLearningAlgoBase, BaseHead}) – A policy to be evaluated.
n_trajectories (int, default=100 (> 0)) – Number of trajectories to roll out.
step_per_trajectory (int, default=None (> 0)) – Number of timesteps in a trajectory.
evaluate_on_stationary_distribution (bool, default=False) – Whether to evaluate a policy based on the stationary state distribution induced by it. When True, the policy is evaluated by rolling it out without resetting the environment between trajectories. This argument is irrelevant in the finite-horizon setting.
gamma (float, default=1.0) – Discount factor. The value should be within (0, 1].
random_state (int, default=None (>= 0)) – Random state.

Returns:

on_policy_policy_values – Trajectory-wise on-policy policy values.

Return type:

ndarray of shape (n_trajectories, )
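
The rollout procedure can be pictured with a small self-contained loop. The sketch below uses a hypothetical `ToyEnv` whose `step` returns a simplified `(state, reward, done)` tuple, plus a constant policy given as a plain callable. Neither is the gym or scope_rl interface, and the stationary-distribution option is left out; it only shows how one discounted return per trajectory is accumulated.

```python
import random


class ToyEnv:
    """Hypothetical stand-in for an RL environment (not the gym API)."""

    def __init__(self, horizon=5, seed=0):
        self.horizon = horizon
        self.rng = random.Random(seed)
        self.t = 0

    def reset(self):
        self.t = 0
        return 0.0

    def step(self, action):
        self.t += 1
        # Reward is the action plus small Gaussian noise (assumption).
        reward = float(action) + self.rng.gauss(0.0, 0.1)
        done = self.t >= self.horizon
        return float(self.t), reward, done


def rollout_returns(env, policy, n_trajectories=100, step_per_trajectory=5, gamma=1.0):
    """Collect one discounted return per trajectory (hypothetical helper)."""
    returns = []
    for _ in range(n_trajectories):
        state = env.reset()  # fresh episode for every trajectory
        total, discount = 0.0, 1.0
        for _ in range(step_per_trajectory):
            state, reward, done = env.step(policy(state))
            total += discount * reward
            discount *= gamma
            if done:
                break
        returns.append(total)
    return returns


returns = rollout_returns(ToyEnv(), policy=lambda state: 1, n_trajectories=10, gamma=0.9)
print(len(returns), sum(returns) / len(returns))
```

The resulting list plays the role of the trajectory-wise values of shape (n_trajectories, ) described above, and it is the kind of input the statistics sketched elsewhere on this page would consume.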