scope_rl.ope.ops.OffPolicySelection
- class scope_rl.ope.ops.OffPolicySelection(ope=None, cumulative_distribution_ope=None)
Class to conduct OPS and evaluation of OPE/OPS with multiple estimators simultaneously.
Imported as:
scope_rl.ope.OffPolicySelection
Note
Off-Policy Selection (OPS)
OPS selects the “best” policy among several candidates based on the policy value or other statistics estimated by OPE.
\[\hat{\pi} := {\arg \max}_{\pi \in \Pi} \hat{J}(\pi)\]
where \(\Pi\) is a set of candidate policies and \(\hat{J}(\cdot)\) is an OPE estimate of the policy performance. Below, we describe the two types of OPE used to estimate policy performance.
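The selection rule above is an argmax over estimated values. A minimal sketch in plain Python, assuming the estimates are already available as a name-to-value mapping; the helpers `select_best_policy` and `rank_policies` and the numbers are hypothetical and not part of SCOPE-RL:

```python
def select_best_policy(estimated_values):
    """Return the candidate name with the largest estimated value."""
    return max(estimated_values, key=estimated_values.get)


def rank_policies(estimated_values):
    """Return candidate names sorted from best to worst estimated value."""
    return sorted(estimated_values, key=estimated_values.get, reverse=True)


# hypothetical OPE estimates for three candidate policies
estimated_values = {"policy_a": 12.5, "policy_b": 18.0, "policy_c": 3.2}
best = select_best_policy(estimated_values)
ranking = rank_policies(estimated_values)
```

The full ranking matters as much as the argmax, since the top-k methods of this class deploy more than one candidate.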
Off-Policy Evaluation (OPE)
(Basic) OPE estimates the expected policy performance called the policy value.
\[V(\pi) := \mathbb{E} \left[ \sum_{t=1}^T \gamma^{t-1} r_t \mid \pi \right]\]
where \(r_t\) is the reward observed at each timestep \(t\), \(T\) is the total number of timesteps in an episode, and \(\gamma\) is the discount factor.
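To make the definition concrete, the sketch below computes the discounted return of a trajectory and averages it over trajectories. This is the on-policy (Monte Carlo) version of the quantity that OPE tries to estimate from logged data. The function names and reward sequences are hypothetical.

```python
def discounted_return(rewards, gamma):
    """Compute sum_{t=1}^T gamma^(t-1) * r_t for one trajectory."""
    # enumerate starts at 0, so the first reward is weighted by gamma^0
    return sum(gamma ** t * r for t, r in enumerate(rewards))


def on_policy_policy_value(trajectories, gamma):
    """Monte Carlo estimate of V(pi): mean discounted return over trajectories."""
    returns = [discounted_return(rewards, gamma) for rewards in trajectories]
    return sum(returns) / len(returns)


# hypothetical reward sequences obtained by running pi in the environment
trajectories = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0]]
value = on_policy_policy_value(trajectories, gamma=0.5)
```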
See also
OffPolicyEvaluation
Cumulative Distribution OPE
In contrast, cumulative distribution OPE first estimates the following cumulative distribution function.
\[F(m, \pi) := \mathbb{E} \left[ \mathbb{I} \left \{ \sum_{t=1}^T \gamma^{t-1} r_t \leq m \right \} \mid \pi \right]\]
where \(m\) is a threshold on the trajectory-wise reward. Cumulative distribution OPE then estimates risk functions, including variance, conditional value at risk, and interquartile range, from the CDF estimate.
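The sketch below illustrates how a CDF and the risk functions relate, using an empirical CDF built from observed trajectory-wise returns. It does not use importance weights as the OPE estimators do, and the quantile and CVaR conventions shown are common textbook choices that may differ from SCOPE-RL's. All names and numbers are hypothetical.

```python
import math


def empirical_cdf(returns, threshold):
    """Fraction of trajectory-wise returns that are <= threshold."""
    return sum(1 for g in returns if g <= threshold) / len(returns)


def lower_quantile(returns, alpha):
    """Smallest observed return whose empirical CDF is at least alpha."""
    for g in sorted(returns):
        if empirical_cdf(returns, g) >= alpha:
            return g
    return max(returns)


def conditional_value_at_risk(returns, alpha):
    """Mean of the worst ceil(alpha * n) returns."""
    n_tail = max(1, math.ceil(alpha * len(returns)))
    tail = sorted(returns)[:n_tail]
    return sum(tail) / len(tail)


def variance(returns):
    """Population variance of the trajectory-wise returns."""
    mean = sum(returns) / len(returns)
    return sum((g - mean) ** 2 for g in returns) / len(returns)


# hypothetical discounted returns of four trajectories
returns = [1.0, 2.0, 3.0, 4.0]
cdf_at_2 = empirical_cdf(returns, 2.0)
q25 = lower_quantile(returns, 0.25)
cvar50 = conditional_value_at_risk(returns, 0.5)
var = variance(returns)
```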
See also
CumulativeDistributionOPE
- Parameters:
ope (OffPolicyEvaluation, default=None) – Instance of the (standard) OPE class.
cumulative_distribution_ope (CumulativeDistributionOPE, default=None) – Instance of the cumulative distribution OPE class.
Examples
Preparation:
# import necessary module from SCOPE-RL
from scope_rl.dataset import SyntheticDataset
from scope_rl.policy import EpsilonGreedyHead
from scope_rl.ope import CreateOPEInput
from scope_rl.ope import OffPolicySelection
from scope_rl.ope import OffPolicyEvaluation as OPE
from scope_rl.ope.discrete import TrajectoryWiseImportanceSampling as TIS
from scope_rl.ope.discrete import PerDecisionImportanceSampling as PDIS
from scope_rl.ope import CumulativeDistributionOPE
from scope_rl.ope.discrete import CumulativeDistributionTIS as CD_IS
from scope_rl.ope.discrete import CumulativeDistributionSNTIS as CD_SNIS

# import necessary module from other libraries
import gym
import rtbgym
from d3rlpy.algos import DoubleDQNConfig
from d3rlpy.dataset import create_fifo_replay_buffer
from d3rlpy.algos import ConstantEpsilonGreedy

# initialize environment
env = gym.make("RTBEnv-discrete-v0")

# define (RL) agent (i.e., policy) and train on the environment
ddqn = DoubleDQNConfig().create()
buffer = create_fifo_replay_buffer(
    limit=10000,
    env=env,
)
explorer = ConstantEpsilonGreedy(
    epsilon=0.3,
)
ddqn.fit_online(
    env=env,
    buffer=buffer,
    explorer=explorer,
    n_steps=10000,
    n_steps_per_epoch=1000,
)

# convert ddqn policy to stochastic data collection policy
behavior_policy = EpsilonGreedyHead(
    ddqn,
    n_actions=env.action_space.n,
    epsilon=0.3,
    name="ddqn_epsilon_0.3",
    random_state=12345,
)

# initialize dataset class
dataset = SyntheticDataset(
    env=env,
    max_episode_steps=env.step_per_episode,
)

# data collection
logged_dataset = dataset.obtain_episodes(
    behavior_policies=behavior_policy,
    n_trajectories=100,
    random_state=12345,
)
Create Input for OPE:
# evaluation policy
ddqn_ = EpsilonGreedyHead(
    base_policy=ddqn,
    n_actions=env.action_space.n,
    name="ddqn",
    epsilon=0.0,
    random_state=12345
)
random_ = EpsilonGreedyHead(
    base_policy=ddqn,
    n_actions=env.action_space.n,
    name="random",
    epsilon=1.0,
    random_state=12345
)

# create input for off-policy evaluation (OPE)
prep = CreateOPEInput(
    env=env,
)
input_dict = prep.obtain_whole_inputs(
    logged_dataset=logged_dataset,
    evaluation_policies=[ddqn_, random_],
    n_trajectories_on_policy_evaluation=100,
    random_state=12345,
)
Off-Policy Evaluation and Selection:
# OPS
ope = OPE(
    logged_dataset=logged_dataset,
    ope_estimators=[TIS(), PDIS()],
)
cd_ope = CumulativeDistributionOPE(
    logged_dataset=logged_dataset,
    ope_estimators=[
        CD_IS(estimator_name="cd_is"),
        CD_SNIS(estimator_name="cd_snis"),
    ],
)
ops = OffPolicySelection(
    ope=ope,
    cumulative_distribution_ope=cd_ope,
)
ops_dict = ops.select_by_policy_value(
    input_dict=input_dict,
    return_metrics=True,
)
Output:
>>> ops_dict
{'tis': {'estimated_ranking': ['ddqn', 'random'],
         'estimated_policy_value': array([21.3624954, 0.3827044]),
         'estimated_relative_policy_value': array([1.44732354, 0.02592848]),
         'mean_squared_error': 94.79587393975419,
         'rank_correlation': SpearmanrResult(correlation=0.9999999999999999, pvalue=nan),
         'regret': (0.0, 1),
         'type_i_error_rate': 0.0,
         'type_ii_error_rate': 0.0,
         'safety_threshold': 13.284},
 'pdis': {'estimated_ranking': ['ddqn', 'random'],
          'estimated_policy_value': array([18.02806424, 7.13847486]),
          'estimated_relative_policy_value': array([1.22141357, 0.48363651]),
          'mean_squared_error': 19.45349619733373,
          'rank_correlation': SpearmanrResult(correlation=0.9999999999999999, pvalue=nan),
          'regret': (0.0, 1),
          'type_i_error_rate': 0.0,
          'type_ii_error_rate': 0.0,
          'safety_threshold': 13.284}}
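The evaluation metrics in such an output compare each estimator's estimates against on-policy ground truth. The sketch below shows textbook versions of three of them: mean squared error, unnormalized regret@k, and Spearman rank correlation without ties. SCOPE-RL's exact definitions and normalization may differ. The function names and the values are hypothetical.

```python
def mean_squared_error(true_values, estimated_values):
    """Average squared gap between estimated and true values across candidates."""
    names = list(true_values)
    return sum((estimated_values[n] - true_values[n]) ** 2 for n in names) / len(names)


def regret_at_k(true_values, estimated_values, k):
    """Best true value overall minus the best true value among the
    k candidates ranked highest by the estimator."""
    top_k = sorted(estimated_values, key=estimated_values.get, reverse=True)[:k]
    return max(true_values.values()) - max(true_values[n] for n in top_k)


def spearman_rank_correlation(true_values, estimated_values):
    """Spearman's rho via 1 - 6 * sum(d^2) / (n * (n^2 - 1)); assumes no ties."""
    def ranks(values):
        ordered = sorted(values, key=values.get, reverse=True)
        return {name: position for position, name in enumerate(ordered, start=1)}

    true_rank, est_rank = ranks(true_values), ranks(estimated_values)
    n = len(true_values)
    d_squared = sum((true_rank[name] - est_rank[name]) ** 2 for name in true_values)
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


# hypothetical ground-truth and estimated policy values
true_values = {"a": 10.0, "b": 6.0, "c": 2.0}
estimated_values = {"a": 7.0, "b": 8.0, "c": 1.0}

mse = mean_squared_error(true_values, estimated_values)
regret_1 = regret_at_k(true_values, estimated_values, k=1)
regret_2 = regret_at_k(true_values, estimated_values, k=2)
rho = spearman_rank_correlation(true_values, estimated_values)
```

In this toy case the estimator ranks "b" first, so regret@1 is positive, while regret@2 is zero because the true best candidate is within the top two.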
See also
Related tutorials (OPS) and related tutorials (assessments)
References
Vladislav Kurenkov and Sergey Kolesnikov. “Showing Your Offline Reinforcement Learning Work: Online Evaluation Budget Matters.” 2022.
Shengpu Tang and Jenna Wiens. “Model Selection for Offline Reinforcement Learning: Practical Considerations for Healthcare Settings.” 2021.
Justin Fu, Mohammad Norouzi, Ofir Nachum, George Tucker, Ziyu Wang, Alexander Novikov, Mengjiao Yang, Michael R. Zhang, Yutian Chen, Aviral Kumar, Cosmin Paduraru, Sergey Levine, and Tom Le Paine. “Benchmarks for Deep Off-Policy Evaluation.” 2021.
Tom Le Paine, Cosmin Paduraru, Andrea Michi, Caglar Gulcehre, Konrad Zolna, Alexander Novikov, Ziyu Wang, and Nando de Freitas. “Hyperparameter Selection for Offline Reinforcement Learning.” 2020.
- Attributes:
- cumulative_distribution_ope
- estimators_name
- ope
Methods
obtain_topk_conditional_value_at_risk_selected_by_cumulative_distribution_ope(...) – Obtain the topk deployment result (CVaR) selected by cumulative distribution OPE.
obtain_topk_conditional_value_at_risk_selected_by_standard_ope(...) – Obtain the topk deployment result (CVaR) selected by standard OPE.
obtain_topk_lower_quartile_selected_by_cumulative_distribution_ope(...) – Obtain the topk deployment result (lower quartile) selected by cumulative distribution OPE.
Obtain the topk deployment result (lower quartile) selected by standard OPE.
obtain_topk_policy_value_selected_by_cumulative_distribution_ope(...) – Obtain the topk deployment result (policy value) selected by cumulative distribution OPE.
Obtain the topk deployment result (policy value) selected by its estimated lower bound.
Obtain the topk deployment result (policy value) selected by standard OPE.
obtain_true_selection_result(input_dict[, ...]) – Obtain the oracle selection result based on the ground-truth policy value.
select_by_conditional_value_at_risk(input_dict) – Rank the candidate policies by their estimated conditional value at risk.
select_by_lower_quartile(input_dict[, ...]) – Rank the candidate policies by their estimated lower quartile of the trajectory-wise reward.
select_by_policy_value(input_dict[, ...]) – Rank the candidate policies by their estimated policy values.
select_by_policy_value_lower_bound(input_dict) – Rank the candidate policies by their estimated policy value lower bound.
select_by_policy_value_via_cumulative_distribution_ope(...) – Rank the candidate policies by their estimated policy value via cumulative distribution OPE methods.
Visualize the conditional value at risk estimated by cumulative distribution OPE estimators (cdf plot).
Visualize the true conditional value at risk and its estimate (scatter plot).
visualize_conditional_value_at_risk_with_multiple_estimates(...) – Visualize the conditional value at risk of the trajectory-wise reward under the evaluation policy estimated by OPE estimators across multiple logged datasets.
visualize_cumulative_distribution_function_for_selection(...) – Visualize the cumulative distribution function (cdf plot).
visualize_cumulative_distribution_function_with_multiple_estimates(...) – Visualize the policy value estimated by OPE estimators across multiple logged datasets.
Visualize the interquartile range estimated by cumulative distribution OPE estimators (box plot).
Visualize the true lower quartile and its estimate (scatter plot).
Visualize the lower quartile of the trajectory-wise reward under the evaluation policy estimated by OPE estimators across multiple logged datasets.
visualize_policy_value_for_selection(input_dict) – Visualize the policy value estimated by OPE estimators (box plot).
visualize_policy_value_for_validation(input_dict) – Visualize the true policy value and its estimate (scatter plot).
Visualize the true policy value and its estimated lower bound (scatter plot).
visualize_policy_value_of_cumulative_distribution_ope_for_selection(...) – Visualize the policy value estimated by cumulative distribution OPE estimators (box plot).
visualize_policy_value_of_cumulative_distribution_ope_for_validation(...) – Visualize the true policy value and its estimate obtained by cumulative distribution OPE (scatter plot).
visualize_policy_value_with_multiple_estimates_cumulative_distribution_ope(...) – Visualize the policy value estimated by OPE estimators across multiple logged datasets.
visualize_policy_value_with_multiple_estimates_standard_ope(...) – Visualize the policy value estimated by OPE estimators across multiple logged datasets.
visualize_topk_conditional_value_at_risk_selected_by_cumulative_distribution_ope(...) – Visualize the topk deployment result (CVaR) selected by cumulative distribution OPE.
visualize_topk_conditional_value_at_risk_selected_by_standard_ope(...) – Visualize the topk deployment result (CVaR) selected by standard OPE.
visualize_topk_lower_quartile_selected_by_cumulative_distribution_ope(...) – Visualize the topk deployment result (lower quartile) selected by cumulative distribution OPE.
Visualize the topk deployment result (lower quartile) selected by standard OPE.
visualize_topk_policy_value_selected_by_cumulative_distribution_ope(...) – Visualize the topk deployment result (policy value) selected by cumulative distribution OPE.
Visualize the topk deployment result (policy value) selected by its estimated lower bound.
Visualize the topk deployment result (policy value) selected by standard OPE.
visualize_variance_for_validation(input_dict) – Visualize the true variance and its estimate (scatter plot).
Visualize the variance of the trajectory-wise reward under the evaluation policy estimated by OPE estimators across multiple logged datasets.
- obtain_true_selection_result(input_dict, behavior_policy_name=None, dataset_id=None, return_variance=False, return_lower_quartile=False, return_conditional_value_at_risk=False, return_by_dataframe=False, quartile_alpha=0.05, cvar_alpha=0.05)
Obtain the oracle selection result based on the ground-truth policy value.
- Parameters:
input_dict (OPEInputDict or MultipleInputDict) –
Dictionary of the OPE inputs for each evaluation policy.
key: [evaluation_policy][
    evaluation_policy_action,
    evaluation_policy_action_dist,
    state_action_value_prediction,
    initial_state_value_prediction,
    state_action_marginal_importance_weight,
    state_marginal_importance_weight,
    on_policy_policy_value,
    gamma,
    behavior_policy,
    evaluation_policy,
    dataset_id,
]
See also
scope_rl.ope.input.CreateOPEInput describes the components of input_dict.
behavior_policy_name (str, default=None) – Name of the behavior policy.
dataset_id (int, default=None) – Id of the logged dataset.
return_variance (bool, default=False) – Whether to return the variance or not.
return_lower_quartile (bool, default=False) – Whether to return the lower quartile or not.
return_conditional_value_at_risk (bool, default=False) – Whether to return the conditional value at risk or not.
return_by_dataframe (bool, default=False) – Whether to return the result in a dataframe format.
quartile_alpha (float, default=0.05) – Proportion of the shaded region of the interquartile range.
cvar_alpha (float, default=0.05) – Proportion of the shaded region of the conditional value at risk.
- Returns:
ground_truth_dict/ground_truth_df – Dictionary/dataframe containing the following ground-truth (on-policy) metrics.
key: [
    ranking,
    policy_value,
    relative_policy_value,
    variance,
    ranking_by_lower_quartile,
    lower_quartile,
    ranking_by_conditional_value_at_risk,
    conditional_value_at_risk,
    parameters,  # only when return_by_dataframe == False
]
- ranking: list of str
Name of the candidate policies sorted by the ground-truth policy value.
- policy_value: list of float
Ground-truth policy value of the candidate policies (sorted by ranking).
- relative_policy_value: list of float
Ground-truth relative policy value of the candidate policies compared to the behavior policy (sorted by ranking).
- variance: list of float
Ground-truth variance of the trajectory-wise reward of the candidate policies (sorted by ranking). If return_variance is False, None is recorded.
- ranking_by_lower_quartile: list of str
Name of the candidate policies sorted by the ground-truth lower quartile of the trajectory-wise reward. If return_lower_quartile is False, None is recorded.
- lower_quartile: list of float
Ground-truth lower quartile of the candidate policies (sorted by ranking_by_lower_quartile). If return_lower_quartile is False, None is recorded.
- ranking_by_conditional_value_at_risk: list of str
Name of the candidate policies sorted by the ground-truth conditional value at risk. If return_conditional_value_at_risk is False, None is recorded.
- conditional_value_at_risk: list of float
Ground-truth conditional value at risk of the candidate policies (sorted by ranking_by_conditional_value_at_risk). If return_conditional_value_at_risk is False, None is recorded.
- parameters: dict
Dictionary containing quartile_alpha and cvar_alpha. If return_by_dataframe is True, parameters is not returned.
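As a rough illustration of the oracle result, the sketch below sorts candidates by their ground-truth value and reports each value relative to the behavior policy. Treating the relative policy value as a plain ratio is an assumption. The function `oracle_selection` and the numbers are hypothetical, and only three of the keys listed above are reproduced.

```python
def oracle_selection(on_policy_values, behavior_policy_value):
    """Rank candidates by ground-truth value and attach relative values."""
    ranking = sorted(on_policy_values, key=on_policy_values.get, reverse=True)
    policy_value = [on_policy_values[name] for name in ranking]
    # assumption: relative value = candidate value / behavior policy value
    relative_policy_value = [v / behavior_policy_value for v in policy_value]
    return {
        "ranking": ranking,
        "policy_value": policy_value,
        "relative_policy_value": relative_policy_value,
    }


# hypothetical on-policy values of two candidates and of the behavior policy
ground_truth = oracle_selection({"random": 5.0, "ddqn": 15.0}, behavior_policy_value=10.0)
```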
- Return type:
- select_by_policy_value(input_dict, compared_estimators=None, behavior_policy_name=None, dataset_id=None, return_true_values=False, return_metrics=False, return_by_dataframe=False, top_k_in_eval_metrics=1, safety_threshold=None, relative_safety_criteria=None)
Rank the candidate policies by their estimated policy values.
- Parameters:
input_dict (OPEInputDict or MultipleInputDict) –
Dictionary of the OPE inputs for each evaluation policy.
key: [evaluation_policy][
    evaluation_policy_action,
    evaluation_policy_action_dist,
    state_action_value_prediction,
    initial_state_value_prediction,
    state_action_marginal_importance_weight,
    state_marginal_importance_weight,
    on_policy_policy_value,
    gamma,
    behavior_policy,
    evaluation_policy,
    dataset_id,
]
See also
scope_rl.ope.input.CreateOPEInput describes the components of input_dict.
compared_estimators (list of str, default=None) – Name of compared estimators. When None is given, all the estimators are compared.
behavior_policy_name (str, default=None) – Name of the behavior policy.
dataset_id (int, default=None) – Id of the logged dataset.
return_true_values (bool, default=False) – Whether to return the true policy value and corresponding ranking of the candidate policies.
return_metrics (bool, default=False) – Whether to return the following evaluation metrics in terms of OPE and OPS: mean-squared-error, rank-correlation, regret@k, and Type I and Type II error rate.
return_by_dataframe (bool, default=False) – Whether to return the result in a dataframe format.
top_k_in_eval_metrics (int, default=1) – How many candidate policies are included in regret@k.
safety_threshold (float, default=None) – A policy whose policy value is below the given threshold is considered unsafe.
relative_safety_criteria (float, default=None (>= 0)) – The relative policy value required to be considered a safe policy. For example, when 0.9 is given, a candidate policy must exceed 90% of the behavior policy performance. Only applicable when using a single behavior policy.
- Returns:
ops_dict/(ranking_df_dict, metric_df) – Dictionary/dataframe containing the result of OPS conducted by OPE estimators.
key: [estimator_name][
    estimated_ranking,
    estimated_policy_value,
    estimated_relative_policy_value,
    true_ranking,
    true_policy_value,
    true_relative_policy_value,
    mean_squared_error,
    rank_correlation,
    regret,
    type_i_error_rate,
    type_ii_error_rate,
]
- estimated_ranking: list of str
Name of the candidate policies sorted by the estimated policy value. Recorded in ranking_df_dict if return_by_dataframe is True.
- estimated_policy_value: list of float
Estimated policy value of the candidate policies (sorted by estimated_ranking). Recorded in ranking_df_dict if return_by_dataframe is True.
- estimated_relative_policy_value: list of float
Estimated relative policy value of the candidate policies compared to the behavior policy (sorted by estimated_ranking). Recorded in ranking_df_dict if return_by_dataframe is True.
- true_ranking: list of int
Ranking index of the (true) policy value of the candidate policies (sorted by estimated_ranking). Recorded only when return_true_values is True. Recorded in ranking_df_dict if return_by_dataframe is True.
- true_policy_value: list of float
True policy value of the candidate policies (sorted by estimated_ranking). Recorded only when return_true_values is True. Recorded in ranking_df_dict when return_by_dataframe is True.
- true_relative_policy_value: list of float
True relative policy value of the candidate policies compared to the behavior policy (sorted by estimated_ranking). Recorded only when return_true_values is True. Recorded in ranking_df_dict if return_by_dataframe is True.
- mean_squared_error: float
Mean-squared-error of the estimators calculated across candidate evaluation policies. Recorded only when return_metrics is True. Recorded in metric_df if return_by_dataframe is True.
- rank_correlation: tuple of float
Rank correlation coefficient between the true ranking and the estimated ranking, and its p-value. Recorded only when return_metrics is True. Recorded in metric_df if return_by_dataframe is True.
- regret: tuple of float and int
Regret@k and k. Recorded only when return_metrics is True. Recorded in metric_df if return_by_dataframe is True.
- type_i_error_rate: float
Type I error rate of the hypothesis test. A Type I error (false positive) occurs when the policy is safe but estimated as unsafe. Recorded only when return_metrics is True. Recorded in metric_df if return_by_dataframe is True.
- type_ii_error_rate: float
Type II error rate of the hypothesis test. A Type II error (false negative) occurs when the policy is unsafe but undetected. Recorded only when return_metrics is True. Recorded in metric_df if return_by_dataframe is True.
- safety_threshold: float
A policy whose policy value is below the given threshold is to be considered unsafe.
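The safety-related entries can be illustrated with a small sketch: a policy counts as safe when its value reaches the threshold, and the two error rates count disagreements between the true and the estimated classification. Deriving the threshold as relative_safety_criteria times the behavior policy value follows the 90% example in the parameter description. The denominators used for the rates are an assumption, and the function name and all numbers are hypothetical.

```python
def safety_error_rates(true_values, estimated_values, safety_threshold):
    """Return (type_i_error_rate, type_ii_error_rate).

    Type I: a truly safe policy is estimated as unsafe (share of safe policies).
    Type II: a truly unsafe policy is estimated as safe (share of unsafe policies).
    """
    safe = [n for n in true_values if true_values[n] >= safety_threshold]
    unsafe = [n for n in true_values if true_values[n] < safety_threshold]
    n_type_i = sum(estimated_values[n] < safety_threshold for n in safe)
    n_type_ii = sum(estimated_values[n] >= safety_threshold for n in unsafe)
    type_i_rate = n_type_i / len(safe) if safe else 0.0
    type_ii_rate = n_type_ii / len(unsafe) if unsafe else 0.0
    return type_i_rate, type_ii_rate


# hypothetical values; the threshold is 90% of the behavior policy value
behavior_policy_value = 10.0
relative_safety_criteria = 0.9
safety_threshold = relative_safety_criteria * behavior_policy_value

true_values = {"a": 12.0, "b": 9.5, "c": 5.0, "d": 8.0}
estimated_values = {"a": 11.0, "b": 8.0, "c": 9.2, "d": 7.0}
type_i, type_ii = safety_error_rates(true_values, estimated_values, safety_threshold)
```

Here "b" is safe but estimated below the threshold (a Type I error), and "c" is unsafe but estimated above it (a Type II error).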
- Return type:
- select_by_policy_value_via_cumulative_distribution_ope(input_dict, compared_estimators=None, behavior_policy_name=None, dataset_id=None, return_true_values=False, return_metrics=False, return_by_dataframe=False, top_k_in_eval_metrics=1, safety_threshold=None, relative_safety_criteria=None)
Rank the candidate policies by their estimated policy value via cumulative distribution OPE methods.
- Parameters:
input_dict (OPEInputDict or MultipleInputDict) –
Dictionary of the OPE inputs for each evaluation policy.
key: [evaluation_policy][
    evaluation_policy_action,
    evaluation_policy_action_dist,
    state_action_value_prediction,
    initial_state_value_prediction,
    state_action_marginal_importance_weight,
    state_marginal_importance_weight,
    on_policy_policy_value,
    gamma,
    behavior_policy,
    evaluation_policy,
    dataset_id,
]
See also
scope_rl.ope.input.CreateOPEInput describes the components of input_dict.
compared_estimators (list of str, default=None) – Name of compared estimators. If None is given, all the estimators are compared.
behavior_policy_name (str, default=None) – Name of the behavior policy.
dataset_id (int, default=None) – Id of the logged dataset.
return_true_values (bool, default=False) – Whether to return the true policy value and corresponding ranking of the candidate policies.
return_metrics (bool, default=False) – Whether to return the following evaluation metrics in terms of OPE and OPS: mean-squared-error, rank-correlation, regret@k, and Type I and Type II error rate.
return_by_dataframe (bool, default=False) – Whether to return the result in a dataframe format.
top_k_in_eval_metrics (int, default=1) – How many candidate policies are included in regret@k.
safety_threshold (float, default=None) – A policy whose policy value is below the given threshold is considered unsafe.
relative_safety_criteria (float, default=None (>= 0)) – The relative policy value required to be considered a safe policy. For example, when 0.9 is given, a candidate policy must exceed 90% of the behavior policy performance. Only applicable when using a single behavior policy.
- Returns:
ops_dict/(ranking_df_dict, metric_df) – Dictionary/dataframe containing the result of OPS conducted by OPE estimators.
key: [estimator_name][
    estimated_ranking,
    estimated_policy_value,
    estimated_relative_policy_value,
    true_ranking,
    true_policy_value,
    true_relative_policy_value,
    mean_squared_error,
    rank_correlation,
    regret,
    type_i_error_rate,
    type_ii_error_rate,
]
- estimated_ranking: list of str
Name of the candidate policies sorted by the estimated policy value. Recorded in ranking_df_dict if return_by_dataframe is True.
- estimated_policy_value: list of float
Estimated policy value of the candidate policies (sorted by estimated_ranking). Recorded in ranking_df_dict if return_by_dataframe is True.
- estimated_relative_policy_value: list of float
Estimated relative policy value of the candidate policies compared to the behavior policy (sorted by estimated_ranking). Recorded in ranking_df_dict if return_by_dataframe is True.
- true_ranking: list of int
Ranking index of the (true) policy value of the candidate policies (sorted by estimated_ranking). Recorded only when return_true_values is True. Recorded in ranking_df_dict if return_by_dataframe is True.
- true_policy_value: list of float
True policy value of the candidate policies (sorted by estimated_ranking). Recorded only when return_true_values is True. Recorded in ranking_df_dict if return_by_dataframe is True.
- true_relative_policy_value: list of float
True relative policy value of the candidate policies compared to the behavior policy (sorted by estimated_ranking). Recorded only when return_true_values is True. Recorded in ranking_df_dict if return_by_dataframe is True.
- mean_squared_error: float
Mean-squared-error of the estimators calculated across candidate evaluation policies. Recorded only when return_metrics is True. Recorded in metric_df if return_by_dataframe is True.
- rank_correlation: tuple of float
Rank correlation coefficient between the true ranking and the estimated ranking, and its p-value. Recorded only when return_metrics is True. Recorded in metric_df if return_by_dataframe is True.
- regret: tuple of float and int
Regret@k and k. Recorded only when return_metrics is True. Recorded in metric_df if return_by_dataframe is True.
- type_i_error_rate: float
Type I error rate of the hypothesis test. A Type I error (false positive) occurs when the policy is safe but estimated as unsafe. Recorded only when return_metrics is True. Recorded in metric_df if return_by_dataframe is True.
- type_ii_error_rate: float
Type II error rate of the hypothesis test. A Type II error (false negative) occurs when the policy is unsafe but undetected. Recorded only when return_metrics is True. Recorded in metric_df if return_by_dataframe is True.
- safety_threshold: float
A policy whose policy value is below the given threshold is to be considered unsafe.
- Return type:
- select_by_policy_value_lower_bound(input_dict, compared_estimators=None, behavior_policy_name=None, dataset_id=None, return_true_values=False, return_metrics=False, return_by_dataframe=False, top_k_in_eval_metrics=1, safety_threshold=None, relative_safety_criteria=None, cis=['bootstrap'], alpha=0.05, n_bootstrap_samples=100, random_state=None)
Rank the candidate policies by their estimated policy value lower bound.
- Parameters:
input_dict (OPEInputDict or MultipleInputDict) –
Dictionary of the OPE inputs for each evaluation policy.
key: [evaluation_policy][
    evaluation_policy_action,
    evaluation_policy_action_dist,
    state_action_value_prediction,
    initial_state_value_prediction,
    state_action_marginal_importance_weight,
    state_marginal_importance_weight,
    on_policy_policy_value,
    gamma,
    behavior_policy,
    evaluation_policy,
    dataset_id,
]
See also
scope_rl.ope.input.CreateOPEInput describes the components of input_dict.
compared_estimators (list of str, default=None) – Name of compared estimators. When None is given, all the estimators are compared.
behavior_policy_name (str, default=None) – Name of the behavior policy.
dataset_id (int, default=None) – Id of the logged dataset.
return_true_values (bool, default=False) – Whether to return the true policy value and corresponding ranking of the candidate policies.
return_metrics (bool, default=False) – Whether to return the following evaluation metrics in terms of OPE and OPS: rank-correlation, regret@k, and Type I and Type II error rate.
return_by_dataframe (bool, default=False) – Whether to return the result in a dataframe format.
top_k_in_eval_metrics (int, default=1) – How many candidate policies are included in regret@k.
safety_threshold (float, default=None) – A policy whose policy value is below the given threshold is considered unsafe.
relative_safety_criteria (float, default=None (>= 0)) – The relative policy value required to be considered a safe policy. For example, when 0.9 is given, a candidate policy must exceed 90% of the behavior policy performance. Only applicable when using a single behavior policy.
cis (list of {"bootstrap", "hoeffding", "bernstein", "ttest"}, default=["bootstrap"]) – Estimation methods for confidence intervals.
alpha (float, default=0.05) – Significance level. The value should be within [0, 1).
n_bootstrap_samples (int, default=100 (> 0)) – Number of resampling performed in the bootstrap procedure.
random_state (int, default=None (>= 0)) – Random state.
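To show what a bootstrap lower bound looks like, the sketch below resamples per-trajectory estimates with replacement and takes a low percentile of the resampled means. This is a generic percentile bootstrap, not SCOPE-RL's implementation. In particular, using the alpha / 2 percentile (the lower end of a two-sided interval) is an assumption. The parameter names mirror the ones above, while the function and the data are hypothetical.

```python
import random


def bootstrap_lower_bound(samples, alpha=0.05, n_bootstrap_samples=100, random_state=None):
    """Percentile-bootstrap lower bound of the mean of `samples`."""
    rng = random.Random(random_state)
    n = len(samples)
    # mean of each resampled dataset, sorted in ascending order
    boot_means = sorted(
        sum(rng.choices(samples, k=n)) / n for _ in range(n_bootstrap_samples)
    )
    # assumption: lower end of the two-sided (1 - alpha) interval
    index = int((alpha / 2) * n_bootstrap_samples)
    return boot_means[index]


# hypothetical per-trajectory value estimates of one candidate policy
samples = [3.0, 5.0, 4.0, 6.0, 2.0, 7.0, 5.0, 4.0, 6.0, 3.0]
lower_bound = bootstrap_lower_bound(
    samples, alpha=0.05, n_bootstrap_samples=100, random_state=12345
)
```

Ranking by such a bound instead of the point estimate penalizes candidates whose estimates are noisy, which is the motivation for selecting by lower bound.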
- Returns:
ops_dict/(ranking_df_dict, metric_df) – Dictionary/dataframe containing the result of OPS conducted by OPE estimators.
key: [ci][estimator_name][
    estimated_ranking,
    estimated_policy_value_lower_bound,
    estimated_relative_policy_value_lower_bound,
    true_ranking,
    true_policy_value,
    true_relative_policy_value,
    mean_squared_error,
    rank_correlation,
    regret,
    type_i_error_rate,
    type_ii_error_rate,
]
- estimated_ranking: list of str
Name of the candidate policies sorted by the estimated policy value lower bound. Recorded in ranking_df_dict if return_by_dataframe is True.
- estimated_policy_value_lower_bound: list of float
Estimated policy value lower bound of the candidate policies (sorted by estimated_ranking). Recorded in ranking_df_dict if return_by_dataframe is True.
- estimated_relative_policy_value_lower_bound: list of float
Estimated relative policy value lower bound of the candidate policies compared to the behavior policy (sorted by estimated_ranking). Recorded in ranking_df_dict if return_by_dataframe is True.
- true_ranking: list of int
Ranking index of the (true) policy value of the candidate policies (sorted by estimated_ranking). Recorded only when return_true_values is True. Recorded in ranking_df_dict if return_by_dataframe is True.
- true_policy_value: list of float
True policy value of the candidate policies (sorted by estimated_ranking). Recorded only when return_true_values is True. Recorded in ranking_df_dict if return_by_dataframe is True.
- true_relative_policy_value: list of float
True relative policy value of the candidate policies compared to the behavior policy (sorted by estimated_ranking). Recorded only when return_true_values is True. Recorded in ranking_df_dict if return_by_dataframe is True.
- mean_squared_error: None
Always None; kept for API consistency. Recorded in metric_df if return_by_dataframe is True.
- rank_correlation: tuple of float
Rank correlation coefficient between the true ranking and the estimated ranking, and its p-value. Recorded only when return_metrics is True. Recorded in metric_df if return_by_dataframe is True.
- regret: tuple of float and int
Regret@k and k. Recorded only when return_metrics is True. Recorded in metric_df if return_by_dataframe is True.
- type_i_error_rate: float
Type I error rate of the hypothesis test. A Type I error (false positive) occurs when the policy is safe but estimated as unsafe. Recorded only when return_metrics is True. Recorded in metric_df if return_by_dataframe is True.
- type_ii_error_rate: float
Type II error rate of the hypothesis test. A Type II error (false negative) occurs when the policy is unsafe but undetected. Recorded only when return_metrics is True. Recorded in metric_df if return_by_dataframe is True.
- safety_threshold: float
A policy whose policy value is below the given threshold is to be considered unsafe.
- Return type:
- select_by_lower_quartile(input_dict, compared_estimators=None, behavior_policy_name=None, dataset_id=None, alpha=0.05, return_true_values=False, return_metrics=False, return_by_dataframe=False, safety_threshold=0.0)
Rank the candidate policies by their estimated lower quartile of the trajectory-wise reward.
- Parameters:
input_dict (OPEInputDict) –
Dictionary of the OPE inputs for each evaluation policy.
key: [evaluation_policy][
    evaluation_policy_action,
    evaluation_policy_action_dist,
    state_action_value_prediction,
    initial_state_value_prediction,
    state_action_marginal_importance_weight,
    state_marginal_importance_weight,
    on_policy_policy_value,
    gamma,
    behavior_policy,
    evaluation_policy,
    dataset_id,
]
See also
scope_rl.ope.input.CreateOPEInput describes the components of input_dict.
compared_estimators (list of str, default=None) – Name of compared estimators. When None is given, all the estimators are compared.
behavior_policy_name (str, default=None) – Name of the behavior policy.
dataset_id (int, default=None) – Id of the logged dataset.
alpha (float, default=0.05) – Proportion of the shaded region. The value should be within [0, 0.5].
return_true_values (bool, default=False) – Whether to return the true lower quartile of the trajectory-wise reward and corresponding ranking of the candidate evaluation policies.
return_metrics (bool, default=False) – Whether to return the following evaluation metrics in terms of OPE and OPS: mean-squared-error, rank-correlation, and Type I and Type II error rate.
return_by_dataframe (bool, default=False) – Whether to return the result in a dataframe format.
safety_threshold (float, default=0.0 (>= 0)) – The lower quartile required to be considered a safe policy.
- Returns:
ops_dict/(ranking_df_dict, metric_df) – Dictionary/dataframe containing the result of OPS conducted by OPE estimators.
key: [estimator_name][
    estimated_ranking,
    estimated_lower_quartile,
    true_ranking,
    true_lower_quartile,
    mean_squared_error,
    rank_correlation,
    regret,
    type_i_error_rate,
    type_ii_error_rate,
]
- estimated_ranking: list of str
Name of the candidate policies sorted by the estimated lower quartile of the trajectory-wise reward. Recorded in ranking_df_dict if return_by_dataframe is True.
- estimated_lower_quartile: list of float
Estimated lower quartile of the trajectory-wise reward of the candidate policies (sorted by estimated_ranking). Recorded in ranking_df_dict if return_by_dataframe is True.
- true_ranking: list of int
Ranking index of the (true) lower quartile of the trajectory-wise reward of the candidate policies (sorted by estimated_ranking). Recorded only when return_true_values is True. Recorded in ranking_df_dict if return_by_dataframe is True.
- true_lower_quartile: list of float
True lower quartile of the trajectory-wise reward of the candidate policies (sorted by estimated_ranking). Recorded only when return_true_values is True. Recorded in ranking_df_dict if return_by_dataframe is True.
- mean_squared_error: float
Mean-squared-error of the estimated lower quartile of the trajectory-wise reward. Recorded only when return_metrics is True. Recorded in metric_df if return_by_dataframe is True.
- rank_correlation: tuple of float
Rank correlation coefficient between the true ranking and the estimated ranking, and its p-value. Recorded only when return_metrics is True. Recorded in metric_df if return_by_dataframe is True.
- regret: None
This is for API consistency. Recorded in metric_df if return_by_dataframe is True.
- type_i_error_rate: float
Type I error rate of the hypothesis test. An error is counted when the policy is safe but estimated as unsafe. Recorded only when return_metrics is True. Recorded in metric_df if return_by_dataframe is True.
- type_ii_error_rate: float
Type II error rate of the hypothesis test. An error is counted when the policy is unsafe but undetected (estimated as safe). Recorded only when return_metrics is True. Recorded in metric_df if return_by_dataframe is True.
- safety_threshold: float
The lower quartile required to be considered a safe policy.
- Return type:
dict or dataframe
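To illustrate what the ranking entries and the ranking metrics above represent, here is a minimal sketch in plain Python. It does not call SCOPE-RL: the policy names and values are hypothetical, the Spearman formula assumes no ties, and the library's internal computation may differ.

```python
# Illustrative only: policy names and values are hypothetical, not SCOPE-RL output.
estimated = {"policy_a": 1.2, "policy_b": 0.4, "policy_c": 2.0}
true = {"policy_a": 1.0, "policy_b": 0.8, "policy_c": 1.5}


def rank_desc(values):
    """Map each policy name to its rank (0 = highest value)."""
    ordered = sorted(values, key=values.get, reverse=True)
    return {name: rank for rank, name in enumerate(ordered)}


def spearman(values_x, values_y):
    """Spearman rank correlation, assuming there are no ties."""
    rank_x, rank_y = rank_desc(values_x), rank_desc(values_y)
    n = len(rank_x)
    squared_diff = sum((rank_x[k] - rank_y[k]) ** 2 for k in rank_x)
    return 1.0 - 6.0 * squared_diff / (n * (n ** 2 - 1))


# Candidate names sorted by the estimated statistic (best first).
estimated_ranking = sorted(estimated, key=estimated.get, reverse=True)

# True ranking index of each candidate, listed in the estimated order.
true_rank = rank_desc(true)
true_ranking = [true_rank[name] for name in estimated_ranking]

# Metrics comparing the estimates against the true values.
mean_squared_error = sum((estimated[k] - true[k]) ** 2 for k in estimated) / len(estimated)
rank_correlation = spearman(estimated, true)
```

In this toy case the estimated order matches the true order, so true_ranking is already sorted and the rank correlation is 1 even though the mean-squared-error is nonzero.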
- select_by_conditional_value_at_risk(input_dict, compared_estimators=None, behavior_policy_name=None, dataset_id=None, alpha=0.05, return_true_values=False, return_metrics=False, return_by_dataframe=False, safety_threshold=0.0)[source]#
Rank the candidate policies by their estimated conditional value at risk.
- Parameters:
input_dict (OPEInputDict or MultipleInputDict) –
Dictionary of the OPE inputs for each evaluation policy.
key: [evaluation_policy][ evaluation_policy_action, evaluation_policy_action_dist, state_action_value_prediction, initial_state_value_prediction, state_action_marginal_importance_weight, state_marginal_importance_weight, on_policy_policy_value, gamma, behavior_policy, evaluation_policy, dataset_id, ]
See also
scope_rl.ope.input.CreateOPEInput describes the components of input_dict.
compared_estimators (list of str, default=None) – Name of compared estimators. When None is given, all the estimators are compared.
behavior_policy_name (str, default=None) – Name of the behavior policy.
dataset_id (int, default=None) – Id of the logged dataset.
alpha (float, default=0.05) – Proportion of the shaded region. The value should be within [0, 1].
return_true_values (bool, default=False) – Whether to return the true conditional value at risk and corresponding ranking of the candidate evaluation policies.
return_metrics (bool, default=False) – Whether to return the following evaluation metrics in terms of OPE and OPS: mean-squared-error, rank-correlation, and Type I and Type II error rate.
return_by_dataframe (bool, default=False) – Whether to return the result in a dataframe format.
safety_threshold (float, default=0.0 (>= 0)) – The conditional value at risk required to be considered a safe policy.
- Returns:
ops_dict/(ranking_df_dict, metric_df) – Dictionary/dataframe containing the result of OPS conducted by OPE estimators.
key: [estimator_name][ estimated_ranking, estimated_conditional_value_at_risk, true_ranking, true_conditional_value_at_risk, mean_squared_error, rank_correlation, regret, type_i_error_rate, type_ii_error_rate, ]
- estimated_ranking: list of str
Name of the candidate policies sorted by the estimated conditional value at risk. Recorded in ranking_df_dict if return_by_dataframe is True.
- estimated_conditional_value_at_risk: list of float
Estimated conditional value at risk of the candidate policies (sorted by estimated_ranking). Recorded in ranking_df_dict if return_by_dataframe is True.
- true_ranking: list of int
Ranking index of the (true) conditional value at risk of the candidate policies (sorted by estimated_ranking). Recorded only when return_true_values is True. Recorded in ranking_df_dict if return_by_dataframe is True.
- true_conditional_value_at_risk: list of float
True conditional value at risk of the candidate policies (sorted by estimated_ranking). Recorded only when return_true_values is True. Recorded in ranking_df_dict if return_by_dataframe is True.
- mean_squared_error: float
Mean-squared-error of the estimated conditional value at risk. Recorded only when return_metrics is True. Recorded in metric_df if return_by_dataframe is True.
- rank_correlation: tuple of float
Rank correlation coefficient between the true ranking and the estimated ranking, and its p-value. Recorded only when return_metrics is True. Recorded in metric_df if return_by_dataframe is True.
- regret: None
This is for API consistency. Recorded in metric_df if return_by_dataframe is True.
- type_i_error_rate: float
Type I error rate of the hypothesis test. An error is counted when the policy is safe but estimated as unsafe. Recorded only when return_metrics is True. Recorded in metric_df if return_by_dataframe is True.
- type_ii_error_rate: float
Type II error rate of the hypothesis test. An error is counted when the policy is unsafe but undetected (estimated as safe). Recorded only when return_metrics is True. Recorded in metric_df if return_by_dataframe is True.
- safety_threshold: float
The conditional value at risk required to be considered a safe policy.
- Return type:
dict or dataframe
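The conditional value at risk and the two error rates can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the library's implementation: CVaR is taken as the mean of the worst alpha-fraction of trajectory-wise returns, a policy counts as safe when its true value is at least safety_threshold, and all names and numbers are hypothetical.

```python
import math


def conditional_value_at_risk(returns, alpha=0.05):
    """Average of the worst ceil(alpha * n) returns (lower tail)."""
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    ordered = sorted(returns)
    n_tail = max(1, math.ceil(alpha * len(ordered)))
    tail = ordered[:n_tail]
    return sum(tail) / len(tail)


def error_rates(estimated, true, safety_threshold=0.0):
    """Type I: safe but estimated unsafe. Type II: unsafe but estimated safe."""
    safe = [k for k in true if true[k] >= safety_threshold]
    unsafe = [k for k in true if true[k] < safety_threshold]
    type_i = sum(estimated[k] < safety_threshold for k in safe) / len(safe) if safe else 0.0
    type_ii = sum(estimated[k] >= safety_threshold for k in unsafe) / len(unsafe) if unsafe else 0.0
    return type_i, type_ii


# Hypothetical trajectory-wise returns of one candidate policy: 1.0, ..., 20.0.
returns = [float(v) for v in range(1, 21)]
cvar_quarter = conditional_value_at_risk(returns, alpha=0.25)  # mean of the 5 worst
cvar_full = conditional_value_at_risk(returns, alpha=1.0)  # plain mean

# Hypothetical estimated and true CVaR values of four candidate policies.
estimated_cvar = {"a": 0.5, "b": -0.2, "c": 1.0, "d": 0.3}
true_cvar = {"a": 0.4, "b": 0.1, "c": -0.3, "d": 0.2}
type_i_error_rate, type_ii_error_rate = error_rates(estimated_cvar, true_cvar, 0.0)
```

Here policy "b" is safe but flagged as unsafe (a Type I error), while policy "c" is unsafe but passes the threshold (a Type II error).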
- visualize_policy_value_for_selection(input_dict, compared_estimators=None, behavior_policy_name=None, dataset_id=None, alpha=0.05, ci='bootstrap', n_bootstrap_samples=100, random_state=None, is_relative=False, hue='estimator', sharey=False, fig_dir=None, fig_name='estimated_policy_value_standard_ope.png')[source]#
Visualize the policy value estimated by OPE estimators (box plot).
- Parameters:
input_dict (OPEInputDict or MultipleInputDict) –
Dictionary of the OPE inputs for each evaluation policy.
key: [evaluation_policy][ evaluation_policy_action, evaluation_policy_action_dist, state_action_value_prediction, initial_state_value_prediction, state_action_marginal_importance_weight, state_marginal_importance_weight, on_policy_policy_value, gamma, behavior_policy, evaluation_policy, dataset_id, ]
See also
scope_rl.ope.input.CreateOPEInput describes the components of input_dict.
compared_estimators (list of str, default=None) – Name of compared estimators. When None is given, all the estimators are compared.
behavior_policy_name (str, default=None) – Name of the behavior policy.
dataset_id (int, default=None) – Id of the logged dataset.
alpha (float, default=0.05) – Significance level. The value should be within [0, 1).
ci ({"bootstrap", "hoeffding", "bernstein", "ttest"}, default="bootstrap") – Method to estimate the confidence interval.
n_bootstrap_samples (int, default=100 (> 0)) – Number of resampling performed in the bootstrap procedure.
random_state (int, default=None (>= 0)) – Random state.
is_relative (bool, default=False) – If True, the method visualizes the estimated policy value of the evaluation policy relative to the on-policy policy value of the behavior policy.
hue ({"estimator", "policy"}, default="estimator") – Hue of the plot.
sharey (bool, default=False) – If True, the y-axis will be shared among different estimators or evaluation policies.
fig_dir (Path, default=None) – Path to store the bar figure. If None is given, the figure will not be saved.
fig_name (str, default="estimated_policy_value_standard_ope.png") – Name of the bar figure.
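As a rough picture of what ci="bootstrap" with alpha and n_bootstrap_samples means, the sketch below computes a percentile bootstrap confidence interval for a mean in plain Python. The function name bootstrap_ci and the sample values are hypothetical, and the percentile method is an assumption; the library's own procedure may differ in detail.

```python
import random
import statistics


def bootstrap_ci(samples, alpha=0.05, n_bootstrap_samples=100, random_state=None):
    """Return (mean, lower, upper) using a percentile bootstrap of the mean."""
    rng = random.Random(random_state)
    n = len(samples)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(samples, k=n))
        for _ in range(n_bootstrap_samples)
    )
    lower_idx = int((alpha / 2) * n_bootstrap_samples)
    upper_idx = min(n_bootstrap_samples - 1, int((1 - alpha / 2) * n_bootstrap_samples))
    return statistics.fmean(samples), means[lower_idx], means[upper_idx]


# Hypothetical per-trajectory estimates of the policy value.
estimates = [0.8, 1.1, 0.9, 1.3, 1.0, 0.7, 1.2, 1.0]
mean, lower, upper = bootstrap_ci(
    estimates, alpha=0.05, n_bootstrap_samples=100, random_state=12345
)
```

Fixing random_state makes the interval reproducible, which is why the method exposes it alongside n_bootstrap_samples.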
- visualize_cumulative_distribution_function_for_selection(input_dict, compared_estimators=None, behavior_policy_name=None, dataset_id=None, hue='estimator', legend=True, n_cols=None, fig_dir=None, fig_name='estimated_cumulative_distribution_function.png')[source]#
Visualize the cumulative distribution function (cdf plot).
- Parameters:
input_dict (OPEInputDict or MultipleInputDict) –
Dictionary of the OPE inputs for each evaluation policy.
key: [evaluation_policy][ evaluation_policy_action, evaluation_policy_action_dist, state_action_value_prediction, initial_state_value_prediction, state_action_marginal_importance_weight, state_marginal_importance_weight, on_policy_policy_value, gamma, behavior_policy, evaluation_policy, dataset_id, ]
See also
scope_rl.ope.input.CreateOPEInput describes the components of input_dict.
compared_estimators (list of str, default=None) – Name of compared estimators. When None is given, all the estimators are compared.
behavior_policy_name (str, default=None) – Name of the behavior policy.
dataset_id (int, default=None) – Id of the logged dataset.
hue ({"estimator", "policy"}, default="estimator") – Hue of the plot.
legend (bool, default=True) – Whether to include a legend in the figure.
n_cols (int, default=None (> 0)) – Number of columns in the figure.
fig_dir (Path, default=None) – Path to store the bar figure. If None is given, the figure will not be saved.
fig_name (str, default="estimated_cumulative_distribution_function.png") – Name of the bar figure.
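For intuition about the curve being plotted, the following sketch evaluates a cumulative distribution function of trajectory-wise returns on a grid, optionally reweighting each trajectory. It is plain Python with hypothetical data; the weights stand in for importance weights and the clipping to [0, 1] is an assumption, so this should be read as a simplified picture rather than the estimators' actual code.

```python
def estimate_cdf(returns, grid, weights=None):
    """Weighted fraction of trajectories whose return is at most each grid point."""
    n = len(returns)
    if weights is None:
        weights = [1.0] * n  # unweighted case: the ordinary empirical CDF
    cdf = []
    for threshold in grid:
        value = sum(w for r, w in zip(returns, weights) if r <= threshold) / n
        cdf.append(min(1.0, max(0.0, value)))  # keep the estimate a valid probability
    return cdf


# Hypothetical discounted returns of four logged trajectories.
returns = [0.0, 1.0, 1.0, 3.0]
grid = [-1.0, 0.0, 1.0, 2.0, 3.0]

plain_cdf = estimate_cdf(returns, grid)
weighted_cdf = estimate_cdf(returns, grid, weights=[2.0, 0.5, 0.5, 1.0])
```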
- visualize_policy_value_of_cumulative_distribution_ope_for_selection(input_dict, compared_estimators=None, behavior_policy_name=None, dataset_id=None, alpha=0.05, is_relative=False, hue='estimator', sharey=False, fig_dir=None, fig_name='estimated_policy_value_cumulative_distribution_ope.png')[source]#
Visualize the policy value estimated by cumulative distribution OPE estimators (box plot).
- Parameters:
input_dict (OPEInputDict or MultipleInputDict) –
Dictionary of the OPE inputs for each evaluation policy.
key: [evaluation_policy][ evaluation_policy_action, evaluation_policy_action_dist, state_action_value_prediction, initial_state_value_prediction, state_action_marginal_importance_weight, state_marginal_importance_weight, on_policy_policy_value, gamma, behavior_policy, evaluation_policy, dataset_id, ]
See also
scope_rl.ope.input.CreateOPEInput describes the components of input_dict.
compared_estimators (list of str, default=None) – Name of compared estimators. When None is given, all the estimators are compared.
behavior_policy_name (str, default=None) – Name of the behavior policy.
dataset_id (int, default=None) – Id of the logged dataset.
alpha (float, default=0.05) – Significance level. The value should be within [0, 1).
is_relative (bool, default=False) – If True, the method visualizes the estimated policy value of the evaluation policy relative to the ground-truth policy value of the behavior policy.
hue ({"estimator", "policy"}, default="estimator") – Hue of the plot.
sharey (bool, default=False) – If True, the y-axis will be shared among different evaluation policies.
fig_dir (Path, default=None) – Path to store the bar figure. If None is given, the figure will not be saved.
fig_name (str, default="estimated_policy_value_cumulative_distribution_ope.png") – Name of the bar figure.
- visualize_conditional_value_at_risk_for_selection(input_dict, compared_estimators=None, behavior_policy_name=None, dataset_id=None, alphas=None, hue='estimator', legend=True, n_cols=None, sharey=False, fig_dir=None, fig_name='estimated_conditional_value_at_risk.png')[source]#
Visualize the conditional value at risk estimated by cumulative distribution OPE estimators (cdf plot).
- Parameters:
input_dict (OPEInputDict or MultipleInputDict) –
Dictionary of the OPE inputs for each evaluation policy.
key: [evaluation_policy][ evaluation_policy_action, evaluation_policy_action_dist, state_action_value_prediction, initial_state_value_prediction, state_action_marginal_importance_weight, state_marginal_importance_weight, on_policy_policy_value, gamma, behavior_policy, evaluation_policy, dataset_id, ]
See also
scope_rl.ope.input.CreateOPEInput describes the components of input_dict.
compared_estimators (list of str, default=None) – Name of compared estimators. When None is given, all the estimators are compared.
behavior_policy_name (str, default=None) – Name of the behavior policy.
dataset_id (int, default=None) – Id of the logged dataset.
alphas (array-like of shape (n_alpha, ), default=None) – Set of proportions of the shaded region. The values should be within [0, 1). If None is given, np.linspace(0, 1, 21) will be used.
hue ({"estimator", "policy"}, default="estimator") – Hue of the plot.
legend (bool, default=True) – Whether to include a legend in the figure.
n_cols (int, default=None (> 0)) – Number of columns in the figure.
sharey (bool, default=False) – If True, the y-axis will be shared among different evaluation policies.
fig_dir (Path, default=None) – Path to store the bar figure. If None is given, the figure will not be saved.
fig_name (str, default="estimated_conditional_value_at_risk.png") – Name of the bar figure.
- visualize_interquartile_range_for_selection(input_dict, compared_estimators=None, behavior_policy_name=None, dataset_id=None, alpha=0.05, hue='estimator', sharey=False, fig_dir=None, fig_name='estimated_interquartile_range.png')[source]#
Visualize the interquartile range estimated by cumulative distribution OPE estimators (box plot).
- Parameters:
input_dict (OPEInputDict or MultipleInputDict) –
Dictionary of the OPE inputs for each evaluation policy.
key: [evaluation_policy][ evaluation_policy_action, evaluation_policy_action_dist, state_action_value_prediction, initial_state_value_prediction, state_action_marginal_importance_weight, state_marginal_importance_weight, on_policy_policy_value, gamma, behavior_policy, evaluation_policy, dataset_id, ]
See also
scope_rl.ope.input.CreateOPEInput describes the components of input_dict.
compared_estimators (list of str, default=None) – Name of compared estimators. When None is given, all the estimators are compared.
behavior_policy_name (str, default=None) – Name of the behavior policy.
dataset_id (int, default=None) – Id of the logged dataset.
alpha (float, default=0.05) – Significance level. The value should be within [0, 1).
hue ({"estimator", "policy"}, default="estimator") – Hue of the plot.
sharey (bool, default=False) – If True, the y-axis will be shared among different evaluation policies.
fig_dir (Path, default=None) – Path to store the bar figure. If None is given, the figure will not be saved.
fig_name (str, default="estimated_interquartile_range.png") – Name of the bar figure.
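One way to read the interquartile range in terms of alpha is as the interval between the alpha and (1 - alpha) quantiles of the estimated CDF. The sketch below reads those quantiles off a CDF given on a grid; the grid, the CDF values, and the helper quantile_from_cdf are hypothetical, and the library may interpolate or index differently.

```python
def quantile_from_cdf(grid, cdf, level):
    """Smallest grid point whose CDF value reaches the requested level."""
    for point, value in zip(grid, cdf):
        if value >= level:
            return point
    return grid[-1]  # level not reached on the grid: fall back to the last point


# Hypothetical estimated CDF of the trajectory-wise reward.
grid = [0.0, 1.0, 2.0, 3.0, 4.0]
cdf = [0.02, 0.30, 0.60, 0.96, 1.00]
alpha = 0.05

lower = quantile_from_cdf(grid, cdf, alpha)
median = quantile_from_cdf(grid, cdf, 0.5)
upper = quantile_from_cdf(grid, cdf, 1 - alpha)
```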
- visualize_policy_value_with_multiple_estimates_standard_ope(input_dict, compared_estimators=None, behavior_policy_name=None, plot_type='ci', hue='estimator', legend=True, sharey=False, fig_dir=None, fig_name='estimated_policy_value_multiple_standard_ope.png')[source]#
Visualize the policy value estimated by OPE estimators across multiple logged datasets.
Note
This function is applicable only when MultipleLoggedDataset is used and MultipleInputDict is collected by the same evaluation policy across logged datasets.
- Parameters:
input_dict (OPEInputDict or MultipleInputDict) –
Dictionary of the OPE inputs for each evaluation policy.
key: [evaluation_policy][ evaluation_policy_action, evaluation_policy_action_dist, state_action_value_prediction, initial_state_value_prediction, state_action_marginal_importance_weight, state_marginal_importance_weight, on_policy_policy_value, gamma, behavior_policy, evaluation_policy, dataset_id, ]
See also
scope_rl.ope.input.CreateOPEInput describes the components of input_dict.
compared_estimators (list of str, default=None) – Name of compared estimators. If None is given, all the estimators are compared.
behavior_policy_name (str, default=None) – Name of the behavior policy.
plot_type ({"ci", "scatter", "violin"}, default="ci") – Type of plot. If “ci” is given, we get the empirical average of the estimated values with their estimated confidence intervals. If “scatter” is given, we get a scatter plot of estimated values.
hue ({"estimator", "policy"}, default="estimator") – Hue of the plot.
legend (bool, default=True) – Whether to include a legend in the scatter plot.
sharey (bool, default=False) – If True, the y-axis will be shared among different estimators or evaluation policies.
fig_dir (Path, default=None) – Path to store the bar figure. If None is given, the figure will not be saved.
fig_name (str, default="estimated_policy_value_multiple_standard_ope.png") – Name of the bar figure.
- visualize_cumulative_distribution_function_with_multiple_estimates(input_dict, compared_estimators=None, behavior_policy_name=None, scale_min=None, scale_max=None, n_partition=None, plot_type='ci_hue', hue='estimator', legend=True, n_cols=None, sharey=False, fig_dir=None, fig_name='estimated_policy_value_multiple.png')[source]#
Visualize the cumulative distribution function estimated by OPE estimators across multiple logged datasets.
Note
This function is applicable only when MultipleLoggedDataset is used and MultipleInputDict is collected by the same evaluation policy across logged datasets.
This function is not applicable when the data-driven reward scaler is used. Please set scale_min, scale_max, and n_partition to use this function.
- Parameters:
input_dict (OPEInputDict or MultipleInputDict) –
Dictionary of the OPE inputs for each evaluation policy.
key: [evaluation_policy][ evaluation_policy_action, evaluation_policy_action_dist, state_action_value_prediction, initial_state_value_prediction, state_action_marginal_importance_weight, state_marginal_importance_weight, on_policy_policy_value, gamma, behavior_policy, evaluation_policy, dataset_id, ]
See also
scope_rl.ope.input.CreateOPEInput describes the components of input_dict.
compared_estimators (list of str, default=None) – Name of compared estimators. If None is given, all the estimators are compared.
behavior_policy_name (str, default=None) – Name of the behavior policy.
scale_min (float, default=None) – Minimum value of the reward scale in the CDF.
scale_max (float, default=None) – Maximum value of the reward scale in the CDF.
n_partition (int, default=None) – Number of partitions in the reward scale (x-axis of the CDF).
plot_type ({"ci_hue", "ci_behavior_policy", "enumerate"}, default="ci_hue") – Type of plot. If “ci” is given, the method visualizes the average policy value and its 95% confidence intervals based on the multiple estimates. If “enumerate” is given, we get a scatter plot of estimated values.
hue ({"estimator", "policy"}, default="estimator") – Hue of the plot.
legend (bool, default=True) – Whether to include a legend in the scatter plot.
sharey (bool, default=False) – If True, the y-axis will be shared among different estimators or evaluation policies.
fig_dir (Path, default=None) – Path to store the bar figure. If None is given, the figure will not be saved.
fig_name (str, default="estimated_policy_value_multiple.png") – Name of the bar figure.
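A fixed reward scale matters here because CDFs from different logged datasets can only be averaged pointwise when they share the same evaluation points. The sketch below builds such a grid from scale_min, scale_max, and n_partition and averages two empirical CDFs on it. Treating n_partition as the number of evenly spaced points (endpoints included) is an assumption, and the helper names and data are hypothetical.

```python
def reward_grid(scale_min, scale_max, n_partition):
    """Evenly spaced evaluation points, endpoints included (assumed convention)."""
    if n_partition < 2:
        raise ValueError("n_partition must be at least 2")
    step = (scale_max - scale_min) / (n_partition - 1)
    return [scale_min + i * step for i in range(n_partition)]


def empirical_cdf(returns, grid):
    """Fraction of returns that are at most each grid point."""
    n = len(returns)
    return [sum(r <= point for r in returns) / n for point in grid]


grid = reward_grid(scale_min=0.0, scale_max=4.0, n_partition=5)

# Hypothetical trajectory-wise returns from two logged datasets.
returns_by_dataset = [[0.0, 1.0, 2.0, 4.0], [1.0, 1.0, 3.0, 3.0]]
cdfs = [empirical_cdf(returns, grid) for returns in returns_by_dataset]

# Pointwise average across datasets, possible only because the grid is shared.
average_cdf = [sum(column) / len(column) for column in zip(*cdfs)]
```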
- visualize_policy_value_with_multiple_estimates_cumulative_distribution_ope(input_dict, compared_estimators=None, behavior_policy_name=None, plot_type='ci', hue='estimator', legend=True, sharey=False, fig_dir=None, fig_name='estimated_policy_value_multiple_cumulative_distribution_ope.png')[source]#
Visualize the policy value estimated by OPE estimators across multiple logged datasets.
Note
This function is applicable only when MultipleLoggedDataset is used and MultipleInputDict is collected by the same evaluation policy across logged datasets.
- Parameters:
input_dict (OPEInputDict or MultipleInputDict) –
Dictionary of the OPE inputs for each evaluation policy.
key: [evaluation_policy][ evaluation_policy_action, evaluation_policy_action_dist, state_action_value_prediction, initial_state_value_prediction, state_action_marginal_importance_weight, state_marginal_importance_weight, on_policy_policy_value, gamma, behavior_policy, evaluation_policy, dataset_id, ]
See also
scope_rl.ope.input.CreateOPEInput describes the components of input_dict.
compared_estimators (list of str, default=None) – Name of compared estimators. If None is given, all the estimators are compared.
behavior_policy_name (str, default=None) – Name of the behavior policy.
plot_type ({"ci", "scatter", "violin"}, default="ci") – Type of plot. If “ci” is given, we get the empirical average of the estimated values with their estimated confidence intervals. If “scatter” is given, we get a scatter plot of estimated values.
hue ({"estimator", "policy"}, default="estimator") – Hue of the plot.
legend (bool, default=True) – Whether to include a legend in the scatter plot.
sharey (bool, default=False) – If True, the y-axis will be shared among different estimators or evaluation policies.
fig_dir (Path, default=None) – Path to store the bar figure. If None is given, the figure will not be saved.
fig_name (str, default="estimated_policy_value_multiple_cumulative_distribution_ope.png") – Name of the bar figure.
- visualize_variance_with_multiple_estimates(input_dict, compared_estimators=None, behavior_policy_name=None, plot_type='ci', hue='estimator', legend=True, sharey=False, fig_dir=None, fig_name='estimated_variance_multiple.png')[source]#
Visualize the variance of the trajectory-wise reward under the evaluation policy estimated by OPE estimators across multiple logged datasets.
Note
This function is applicable only when MultipleLoggedDataset is used and MultipleInputDict is collected by the same evaluation policy across logged datasets.
- Parameters:
input_dict (OPEInputDict or MultipleInputDict) –
Dictionary of the OPE inputs for each evaluation policy.
key: [evaluation_policy][ evaluation_policy_action, evaluation_policy_action_dist, state_action_value_prediction, initial_state_value_prediction, state_action_marginal_importance_weight, state_marginal_importance_weight, on_policy_policy_value, gamma, behavior_policy, evaluation_policy, dataset_id, ]
See also
scope_rl.ope.input.CreateOPEInput describes the components of input_dict.
compared_estimators (list of str, default=None) – Name of compared estimators. If None is given, all the estimators are compared.
behavior_policy_name (str, default=None) – Name of the behavior policy.
plot_type ({"ci", "scatter", "violin"}, default="ci") – Type of plot. If “ci” is given, we get the empirical average of the estimated values with their estimated confidence intervals. If “scatter” is given, we get a scatter plot of estimated values.
hue ({"estimator", "policy"}, default="estimator") – Hue of the plot.
legend (bool, default=True) – Whether to include a legend in the scatter plot.
sharey (bool, default=False) – If True, the y-axis will be shared among different estimators or evaluation policies.
fig_dir (Path, default=None) – Path to store the bar figure. If None is given, the figure will not be saved.
fig_name (str, default="estimated_variance_multiple.png") – Name of the bar figure.
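The "ci" plot type summarizes one estimate per logged dataset by an empirical average and a confidence interval. The sketch below shows one common way to compute such a summary, using a normal approximation. How the library actually derives the interval is not stated here, so the method, the helper name mean_with_ci, and the data are all assumptions.

```python
import math
import statistics


def mean_with_ci(estimates, alpha=0.05):
    """Return (mean, lower, upper) under a normal approximation."""
    n = len(estimates)
    mean = statistics.fmean(estimates)
    standard_error = statistics.stdev(estimates) / math.sqrt(n)
    z = statistics.NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha=0.05
    return mean, mean - z * standard_error, mean + z * standard_error


# Hypothetical variance estimates obtained from five logged datasets.
variance_estimates = [1.0, 2.0, 3.0, 4.0, 5.0]
mean, lower, upper = mean_with_ci(variance_estimates, alpha=0.05)
```

With only a handful of datasets a t-based interval would be wider than this normal approximation, which is worth keeping in mind when reading such plots.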
- visualize_conditional_value_at_risk_with_multiple_estimates(input_dict, compared_estimators=None, behavior_policy_name=None, alpha=0.05, plot_type='ci', hue='estimator', legend=True, sharey=False, fig_dir=None, fig_name='estimated_conditional_value_at_risk_multiple.png')[source]#
Visualize the conditional value at risk of the trajectory-wise reward under the evaluation policy estimated by OPE estimators across multiple logged datasets.
Note
This function is applicable only when MultipleLoggedDataset is used and MultipleInputDict is collected by the same evaluation policy across logged datasets.
- Parameters:
input_dict (OPEInputDict or MultipleInputDict) –
Dictionary of the OPE inputs for each evaluation policy.
key: [evaluation_policy][ evaluation_policy_action, evaluation_policy_action_dist, state_action_value_prediction, initial_state_value_prediction, state_action_marginal_importance_weight, state_marginal_importance_weight, on_policy_policy_value, gamma, behavior_policy, evaluation_policy, dataset_id, ]
See also
scope_rl.ope.input.CreateOPEInput describes the components of input_dict.
compared_estimators (list of str, default=None) – Name of compared estimators. If None is given, all the estimators are compared.
behavior_policy_name (str, default=None) – Name of the behavior policy.
alpha (float, default=0.05) – Proportion of the shaded region in CVaR estimate. The value should be within [0, 1).
plot_type ({"ci", "scatter", "violin"}, default="ci") – Type of plot. If “ci” is given, we get the empirical average of the estimated values with their estimated confidence intervals. If “scatter” is given, we get a scatter plot of estimated values.
hue ({"estimator", "policy"}, default="estimator") – Hue of the plot.
legend (bool, default=True) – Whether to include a legend in the scatter plot.
sharey (bool, default=False) – If True, the y-axis will be shared among different estimators or evaluation policies.
fig_dir (Path, default=None) – Path to store the bar figure. If None is given, the figure will not be saved.
fig_name (str, default="estimated_conditional_value_at_risk_multiple.png") – Name of the bar figure.
- visualize_lower_quartile_with_multiple_estimates(input_dict, compared_estimators=None, behavior_policy_name=None, alpha=0.05, plot_type='ci', hue='estimator', legend=True, sharey=False, fig_dir=None, fig_name='estimated_conditional_value_at_risk_multiple.png')[source]#
Visualize the lower quartile of the trajectory-wise reward under the evaluation policy estimated by OPE estimators across multiple logged datasets.
Note
This function is applicable only when MultipleLoggedDataset is used and MultipleInputDict is collected by the same evaluation policy across logged datasets.
- Parameters:
input_dict (OPEInputDict or MultipleInputDict) –
Dictionary of the OPE inputs for each evaluation policy.
key: [evaluation_policy][ evaluation_policy_action, evaluation_policy_action_dist, state_action_value_prediction, initial_state_value_prediction, state_action_marginal_importance_weight, state_marginal_importance_weight, on_policy_policy_value, gamma, behavior_policy, evaluation_policy, dataset_id, ]
See also
scope_rl.ope.input.CreateOPEInput describes the components of input_dict.
compared_estimators (list of str, default=None) – Name of compared estimators. If None is given, all the estimators are compared.
behavior_policy_name (str, default=None) – Name of the behavior policy.
alpha (float, default=0.05) – Proportion of the shaded region. The value should be within [0, 1).
plot_type ({"ci", "scatter", "violin"}, default="ci") – Type of plot. If “ci” is given, we get the empirical average of the estimated values with their estimated confidence intervals. If “scatter” is given, we get a scatter plot of estimated values.
hue ({"estimator", "policy"}, default="estimator") – Hue of the plot.
legend (bool, default=True) – Whether to include a legend in the scatter plot.
sharey (bool, default=False) – If True, the y-axis will be shared among different estimators or evaluation policies.
fig_dir (Path, default=None) – Path to store the bar figure. If None is given, the figure will not be saved.
fig_name (str, default="estimated_conditional_value_at_risk_multiple.png") – Name of the bar figure.
- obtain_topk_policy_value_selected_by_standard_ope(input_dict, compared_estimators=None, behavior_policy_name=None, dataset_id=None, max_topk=None, return_safety_violation_rate=False, safety_threshold=None, relative_safety_criteria=None, return_by_dataframe=False)[source]#
Obtain the topk deployment result (policy value) selected by standard OPE.
- Parameters:
input_dict (OPEInputDict or MultipleInputDict) –
Dictionary of the OPE inputs for each evaluation policy.
key: [evaluation_policy][ evaluation_policy_action, evaluation_policy_action_dist, state_action_value_prediction, initial_state_value_prediction, state_action_marginal_importance_weight, state_marginal_importance_weight, on_policy_policy_value, gamma, behavior_policy, evaluation_policy, dataset_id, ]
See also
scope_rl.ope.input.CreateOPEInput describes the components of input_dict.
compared_estimators (list of str, default=None) – Name of compared estimators. When None is given, all the estimators are compared.
behavior_policy_name (str, default=None) – Name of the behavior policy.
dataset_id (int, default=None) – Id of the logged dataset. If None, the average of the result will be shown.
max_topk (int, default=None) – Maximum number of policies to be deployed. If None is given, all the policies will be deployed.
return_safety_violation_rate (bool, default=False) – Whether to calculate and return the safety violation rate.
safety_threshold (float, default=None) – A policy whose policy value is below the given threshold is considered unsafe.
relative_safety_criteria (float, default=None) – The relative policy value required to be considered a safe policy. For example, when 0.9 is given, candidate policy must exceed 90% of the behavior policy performance. Only applicable when using a single behavior policy.
return_by_dataframe (bool, default=False) – Whether to return the result in a dataframe format.
- Returns:
topk_metric_dict/topk_metric_df – Dictionary/dataframe containing the following top-k risk return tradeoff metrics. Note that policy performance refers to the (standard) policy value here. When returning dataframe, the average value will be returned.
key: [estimator][
    k-th,
    best,  # return
    worst,  # risk
    mean,  # risk
    std,  # risk
    safety_violation_rate,  # risk
    sharpe_ratio,  # risk-return tradeoff
]
- k-th: ndarray of shape (max_topk, total_n_datasets)
Policy performance of the k-th deployment policy.
- best: ndarray of shape (max_topk, total_n_datasets)
Best policy performance among the top-k deployment policies.
- worst: ndarray of shape (max_topk, total_n_datasets)
Worst policy performance among the top-k deployment policies.
- mean: ndarray of shape (max_topk, total_n_datasets)
Mean policy performance of the top-k deployment policies.
- std: ndarray of shape (max_topk, total_n_datasets)
Standard deviation of the policy performance among the top-k deployment policies.
- safety_violation_rate: ndarray of shape (max_topk, total_n_datasets)
Safety violation rate regarding the policy performance of the top-k deployment policies.
- sharpe_ratio: ndarray of shape (max_topk, total_n_datasets)
Risk-return tradeoff metrics defined as follows: \(S(\hat{V}) := (\mathrm{Best@k} - V(\pi_0)) / \mathrm{Std@k}\).
- Return type:
dict or dataframe
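The top-k metrics listed above can be illustrated for a single logged dataset with the sketch below. It takes the true policy values of the candidates, already sorted by an estimator's ranking, and computes each metric for k = 1, ..., max_topk. The function name, the use of the population standard deviation, and returning NaN when the standard deviation is zero are assumptions, not the library's documented behavior.

```python
import statistics


def topk_metrics(values_in_estimated_order, behavior_value, max_topk=None):
    """Top-k metrics for k = 1..max_topk over values sorted by estimated rank."""
    values = list(values_in_estimated_order)
    max_topk = len(values) if max_topk is None else max_topk
    rows = []
    for k in range(1, max_topk + 1):
        topk = values[:k]
        best = max(topk)
        std = statistics.pstdev(topk)  # population std (assumed convention)
        rows.append(
            {
                "k-th": topk[-1],
                "best": best,
                "worst": min(topk),
                "mean": statistics.fmean(topk),
                "std": std,
                # Sharpe ratio: (Best@k - V(pi_0)) / Std@k, undefined when std is 0.
                "sharpe_ratio": (best - behavior_value) / std if std > 0 else float("nan"),
            }
        )
    return rows


# Hypothetical true policy values, ordered by an estimator's ranking.
rows = topk_metrics([3.0, 1.0, 5.0], behavior_value=2.0)
```

In this toy ranking the best policy is placed third, so best improves only at k = 3 while worst already drops at k = 2, which is the kind of risk-return tradeoff these metrics are meant to expose.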
- obtain_topk_policy_value_selected_by_cumulative_distribution_ope(input_dict, compared_estimators=None, behavior_policy_name=None, dataset_id=None, max_topk=None, return_safety_violation_rate=False, safety_threshold=None, relative_safety_criteria=None, return_by_dataframe=False)[source]#
Obtain the topk deployment result (policy value) selected by cumulative distribution OPE.
- Parameters:
input_dict (OPEInputDict or MultipleInputDict) –
Dictionary of the OPE inputs for each evaluation policy.
key: [evaluation_policy][ evaluation_policy_action, evaluation_policy_action_dist, state_action_value_prediction, initial_state_value_prediction, state_action_marginal_importance_weight, state_marginal_importance_weight, on_policy_policy_value, gamma, behavior_policy, evaluation_policy, dataset_id, ]
See also
scope_rl.ope.input.CreateOPEInput describes the components of input_dict.
compared_estimators (list of str, default=None) – Name of compared estimators. When None is given, all the estimators are compared.
behavior_policy_name (str, default=None) – Name of the behavior policy.
dataset_id (int, default=None) – Id of the logged dataset. If None, the average of the result will be shown.
max_topk (int, default=None) – Maximum number of policies to be deployed. If None is given, all the policies will be deployed.
return_safety_violation_rate (bool, default=False) – Whether to calculate and return the safety violation rate.
safety_threshold (float, default=None) – A policy whose policy value is below the given threshold is considered unsafe.
relative_safety_criteria (float, default=None) – The relative policy value required to be considered a safe policy. For example, when 0.9 is given, candidate policy must exceed 90% of the behavior policy performance. Only applicable when using a single behavior policy.
return_by_dataframe (bool, default=False) – Whether to return the result in a dataframe format.
- Returns:
topk_metric_dict/topk_metric_df – Dictionary/dataframe containing the following top-k risk return tradeoff metrics. Note that policy performance refers to the (standard) policy value here. When returning dataframe, the average value will be returned.
key: [estimator][
    k-th,
    best,  # return
    worst,  # risk
    mean,  # risk
    std,  # risk
    safety_violation_rate,  # risk
    sharpe_ratio,  # risk-return tradeoff
]
- k-th: ndarray of shape (max_topk, total_n_datasets)
Policy performance of the k-th deployment policy.
- best: ndarray of shape (max_topk, total_n_datasets)
Best policy performance among the top-k deployment policies.
- worst: ndarray of shape (max_topk, total_n_datasets)
Worst policy performance among the top-k deployment policies.
- mean: ndarray of shape (max_topk, total_n_datasets)
Mean policy performance of the top-k deployment policies.
- std: ndarray of shape (max_topk, total_n_datasets)
Standard deviation of the policy performance among the top-k deployment policies.
- safety_violation_rate: ndarray of shape (max_topk, total_n_datasets)
Safety violation rate regarding the policy performance of the top-k deployment policies.
- sharpe_ratio: ndarray of shape (max_topk, total_n_datasets)
Risk-return tradeoff metrics defined as follows: \(S(\hat{V}) := (\mathrm{Best@k} - V(\pi_0)) / \mathrm{Std@k}\).
- Return type:
dict or dataframe
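The returned metrics can be illustrated for a single estimator and a single logged dataset. The following is a minimal sketch, not SCOPE-RL's implementation: `topk_metrics` and its arguments are hypothetical names, the standard deviation is assumed to be the population standard deviation, and the Sharpe ratio is reported as NaN when the standard deviation is zero.

```python
import statistics


def topk_metrics(true_values, estimated_values, behavior_value,
                 safety_threshold, max_topk=None):
    """Top-k deployment metrics for one estimator on one dataset (illustrative)."""
    n = len(true_values)
    max_topk = n if max_topk is None else min(max_topk, n)
    # Rank candidate policies by their *estimated* performance, best first.
    ranking = sorted(range(n), key=lambda i: estimated_values[i], reverse=True)

    names = ("k-th", "best", "worst", "mean", "std",
             "safety_violation_rate", "sharpe_ratio")
    metrics = {name: [] for name in names}
    for k in range(1, max_topk + 1):
        # True performance of the k policies that would be deployed.
        deployed = [true_values[i] for i in ranking[:k]]
        best = max(deployed)
        std = statistics.pstdev(deployed)  # assumption: population std
        metrics["k-th"].append(deployed[-1])
        metrics["best"].append(best)
        metrics["worst"].append(min(deployed))
        metrics["mean"].append(statistics.fmean(deployed))
        metrics["std"].append(std)
        metrics["safety_violation_rate"].append(
            sum(v < safety_threshold for v in deployed) / k
        )
        # S := (Best@k - V(pi_0)) / Std@k, undefined when Std@k == 0.
        metrics["sharpe_ratio"].append(
            (best - behavior_value) / std if std > 0 else float("nan")
        )
    return metrics
```

Each list in the result has one entry per k, which corresponds to one column of the `(max_topk, total_n_datasets)` arrays described above.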
- obtain_topk_policy_value_selected_by_lower_bound(input_dict, compared_estimators=None, behavior_policy_name=None, dataset_id=None, max_topk=None, return_safety_violation_rate=False, safety_threshold=None, relative_safety_criteria=None, clip_sharpe_ratio=False, cis=['bootstrap'], ope_alpha=0.05, n_bootstrap_samples=100, random_state=None, return_by_dataframe=False)[source]#
Obtain the topk deployment result (policy value) selected by its estimated lower bound.
- Parameters:
input_dict (OPEInputDict or MultipleInputDict) –
Dictionary of the OPE inputs for each evaluation policy.
key: [evaluation_policy][ evaluation_policy_action, evaluation_policy_action_dist, state_action_value_prediction, initial_state_value_prediction, state_action_marginal_importance_weight, state_marginal_importance_weight, on_policy_policy_value, gamma, behavior_policy, evaluation_policy, dataset_id, ]
See also
scope_rl.ope.input.CreateOPEInput describes the components of input_dict.
compared_estimators (list of str, default=None) – Name of compared estimators. When None is given, all the estimators are compared.
behavior_policy_name (str, default=None) – Name of the behavior policy.
dataset_id (int, default=None) – Id of the logged dataset. If None, the average of the result will be shown.
max_topk (int, default=None) – Maximum number of policies to be deployed. If None is given, all the policies will be deployed.
return_safety_violation_rate (bool, default=False) – Whether to calculate and return the safety violation rate.
safety_threshold (float, default=None) – A policy whose policy value is below the given threshold is considered unsafe.
relative_safety_criteria (float, default=None) – The relative policy value required to be considered a safe policy. For example, when 0.9 is given, candidate policy must exceed 90% of the behavior policy performance. Only applicable when using a single behavior policy.
clip_sharpe_ratio (bool, default=False) – Whether to clip a large value of SharpeRatio with 1e2.
cis (list of {"bootstrap", "hoeffding", "bernstein", "ttest"}, default=["bootstrap"]) – Estimation methods for confidence intervals.
ope_alpha (float, default=0.05) – Significance level. The value should be within [0, 1).
n_bootstrap_samples (int, default=100 (> 0)) – Number of resampling performed in the bootstrap procedure.
random_state (int, default=None (>= 0)) – Random state.
return_by_dataframe (bool, default=False) – Whether to return the result in a dataframe format.
- Returns:
topk_metric_dict/topk_metric_df – Dictionary/dataframe containing the following top-k risk return tradeoff metrics. Note that policy performance refers to the (standard) policy value here. When returning dataframe, the average value will be returned.
key: [estimator][
    k-th,
    best,                   # return
    worst,                  # risk
    mean,                   # risk
    std,                    # risk
    safety_violation_rate,  # risk
    sharpe_ratio,           # risk-return tradeoff
]
- k-th: ndarray of shape (max_topk, total_n_datasets)
Policy performance of the k-th deployment policy.
- best: ndarray of shape (max_topk, total_n_datasets)
Best policy performance among the top-k deployment policies.
- worst: ndarray of shape (max_topk, total_n_datasets)
Worst policy performance among the top-k deployment policies.
- mean: ndarray of shape (max_topk, total_n_datasets)
Mean policy performance of the top-k deployment policies.
- std: ndarray of shape (max_topk, total_n_datasets)
Standard deviation of the policy performance among the top-k deployment policies.
- safety_violation_rate: ndarray of shape (max_topk, total_n_datasets)
Safety violation rate regarding the policy performance of the top-k deployment policies.
- sharpe_ratio: ndarray of shape (max_topk, total_n_datasets)
Risk-return tradeoff metrics defined as follows: \(S(\hat{V}) := (\mathrm{Best@k} - V(\pi_0)) / \mathrm{Std@k}\).
- Return type:
dict or dataframe
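Selecting by a lower bound ranks candidates by a pessimistic estimate rather than by the point estimate. The sketch below shows the idea with a percentile bootstrap; it is an assumption-laden illustration, not the library's routine. `bootstrap_lower_bound` and `rank_by_lower_bound` are hypothetical helpers, the per-policy samples stand in for trajectory-wise estimates, and the one-sided alpha-quantile is an assumed convention.

```python
import random
import statistics


def bootstrap_lower_bound(samples, alpha=0.05, n_bootstrap_samples=100,
                          random_state=None):
    """Percentile-bootstrap lower bound on the mean of `samples` (illustrative)."""
    rng = random.Random(random_state)
    n = len(samples)
    # Resample with replacement and record the mean of each resample.
    boot_means = sorted(
        statistics.fmean(rng.choices(samples, k=n))
        for _ in range(n_bootstrap_samples)
    )
    # Assumption: use the empirical alpha-quantile of the bootstrap means.
    index = min(int(alpha * n_bootstrap_samples), n_bootstrap_samples - 1)
    return boot_means[index]


def rank_by_lower_bound(per_policy_samples, **kwargs):
    """Return (policy names sorted by lower bound, best first; the bounds)."""
    bounds = {
        name: bootstrap_lower_bound(samples, **kwargs)
        for name, samples in per_policy_samples.items()
    }
    return sorted(bounds, key=bounds.get, reverse=True), bounds
```

A candidate with a higher mean but heavy variance can rank below a steadier candidate under this criterion, which is the motivation for lower-bound selection.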
- obtain_topk_conditional_value_at_risk_selected_by_standard_ope(input_dict, compared_estimators=None, behavior_policy_name=None, dataset_id=None, ope_alpha=0.05, max_topk=None, return_safety_violation_rate=False, safety_threshold=None, clip_sharpe_ratio=False, return_by_dataframe=False)[source]#
Obtain the topk deployment result (CVaR) selected by standard OPE.
- Parameters:
input_dict (OPEInputDict) –
Dictionary of the OPE inputs for each evaluation policy.
key: [evaluation_policy][ evaluation_policy_action, evaluation_policy_action_dist, state_action_value_prediction, initial_state_value_prediction, state_action_marginal_importance_weight, state_marginal_importance_weight, on_policy_policy_value, gamma, behavior_policy, evaluation_policy, dataset_id, ]
See also
scope_rl.ope.input.CreateOPEInput describes the components of input_dict.
compared_estimators (list of str, default=None) – Name of compared estimators. When None is given, all the estimators are compared.
behavior_policy_name (str, default=None) – Name of the behavior policy.
dataset_id (int, default=None) – Id of the logged dataset. If None, the average of the result will be shown.
ope_alpha (float, default=0.05) – Proportion of the shaded region. The value should be within [0, 1].
max_topk (int, default=None) – Maximum number of policies to be deployed. If None is given, all the policies will be deployed.
return_safety_violation_rate (bool, default=False) – Whether to calculate and return the safety violation rate.
safety_threshold (float, default=None) – A policy whose policy value is below the given threshold is considered unsafe.
clip_sharpe_ratio (bool, default=False) – Whether to clip a large value of SharpeRatio with 1e2.
return_by_dataframe (bool, default=False) – Whether to return the result in a dataframe format.
- Returns:
topk_metric_dict/topk_metric_df – Dictionary/dataframe containing the following top-k risk return tradeoff metrics. Note that policy performance refers to CVaR here. When returning dataframe, the average value will be returned.
key: [estimator][
    k-th,
    best,                   # return
    worst,                  # risk
    mean,                   # risk
    std,                    # risk
    safety_violation_rate,  # risk
    sharpe_ratio,           # risk-return tradeoff
]
- k-th: ndarray of shape (max_topk, total_n_datasets)
Policy performance of the k-th deployment policy.
- best: ndarray of shape (max_topk, total_n_datasets)
Best policy performance among the top-k deployment policies.
- worst: ndarray of shape (max_topk, total_n_datasets)
Worst policy performance among the top-k deployment policies.
- mean: ndarray of shape (max_topk, total_n_datasets)
Mean policy performance of the top-k deployment policies.
- std: ndarray of shape (max_topk, total_n_datasets)
Standard deviation of the policy performance among the top-k deployment policies.
- safety_violation_rate: ndarray of shape (max_topk, total_n_datasets)
Safety violation rate regarding the policy performance of the top-k deployment policies.
- sharpe_ratio: ndarray of shape (max_topk, total_n_datasets)
Risk-return tradeoff metrics defined as follows: \(S(\hat{V}) := (\mathrm{Best@k} - V(\pi_0)) / \mathrm{Std@k}\).
- Return type:
dict or dataframe
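For reference, CVaR at level alpha is commonly computed as the mean of the worst alpha fraction of trajectory-wise returns. The following sketch assumes that empirical definition; `conditional_value_at_risk` is a hypothetical helper, and it rejects `alpha == 0` because the tail would be empty.

```python
import math


def conditional_value_at_risk(returns, alpha=0.05):
    """Mean of the worst ceil(alpha * n) returns (empirical CVaR, illustrative)."""
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    ordered = sorted(returns)
    # Keep at least one sample in the tail so the mean is defined.
    n_tail = max(1, math.ceil(alpha * len(ordered)))
    tail = ordered[:n_tail]
    return sum(tail) / n_tail
```

With `alpha=1` the tail covers every sample, so the result reduces to the ordinary mean.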
- obtain_topk_conditional_value_at_risk_selected_by_cumulative_distribution_ope(input_dict, compared_estimators=None, behavior_policy_name=None, dataset_id=None, ope_alpha=0.05, max_topk=None, return_safety_violation_rate=False, safety_threshold=None, clip_sharpe_ratio=False, return_by_dataframe=False)[source]#
Obtain the topk deployment result (CVaR) selected by cumulative distribution OPE.
- Parameters:
input_dict (OPEInputDict) –
Dictionary of the OPE inputs for each evaluation policy.
key: [evaluation_policy][ evaluation_policy_action, evaluation_policy_action_dist, state_action_value_prediction, initial_state_value_prediction, state_action_marginal_importance_weight, state_marginal_importance_weight, on_policy_policy_value, gamma, behavior_policy, evaluation_policy, dataset_id, ]
See also
scope_rl.ope.input.CreateOPEInput describes the components of input_dict.
compared_estimators (list of str, default=None) – Name of compared estimators. When None is given, all the estimators are compared.
behavior_policy_name (str, default=None) – Name of the behavior policy.
dataset_id (int, default=None) – Id of the logged dataset. If None, the average of the result will be shown.
ope_alpha (float, default=0.05) – Proportion of the shaded region. The value should be within [0, 1].
max_topk (int, default=None) – Maximum number of policies to be deployed. If None is given, all the policies will be deployed.
return_safety_violation_rate (bool, default=False) – Whether to calculate and return the safety violation rate.
safety_threshold (float, default=None) – A policy whose policy value is below the given threshold is considered unsafe.
clip_sharpe_ratio (bool, default=False) – Whether to clip a large value of SharpeRatio with 1e2.
return_by_dataframe (bool, default=False) – Whether to return the result in a dataframe format.
- Returns:
topk_metric_dict/topk_metric_df – Dictionary/dataframe containing the following top-k risk return tradeoff metrics. Note that policy performance refers to CVaR here. When returning dataframe, the average value will be returned.
key: [estimator][
    k-th,
    best,                   # return
    worst,                  # risk
    mean,                   # risk
    std,                    # risk
    safety_violation_rate,  # risk
    sharpe_ratio,           # risk-return tradeoff
]
- k-th: ndarray of shape (max_topk, total_n_datasets)
Policy performance of the k-th deployment policy.
- best: ndarray of shape (max_topk, total_n_datasets)
Best policy performance among the top-k deployment policies.
- worst: ndarray of shape (max_topk, total_n_datasets)
Worst policy performance among the top-k deployment policies.
- mean: ndarray of shape (max_topk, total_n_datasets)
Mean policy performance of the top-k deployment policies.
- std: ndarray of shape (max_topk, total_n_datasets)
Standard deviation of the policy performance among the top-k deployment policies.
- safety_violation_rate: ndarray of shape (max_topk, total_n_datasets)
Safety violation rate regarding the policy performance of the top-k deployment policies.
- sharpe_ratio: ndarray of shape (max_topk, total_n_datasets)
Risk-return tradeoff metrics defined as follows: \(S(\hat{V}) := (\mathrm{Best@k} - V(\pi_0)) / \mathrm{Std@k}\).
- Return type:
dict or dataframe
- obtain_topk_lower_quartile_selected_by_standard_ope(input_dict, compared_estimators=None, behavior_policy_name=None, dataset_id=None, ope_alpha=0.05, max_topk=None, return_safety_violation_rate=False, safety_threshold=None, clip_sharpe_ratio=False, return_by_dataframe=False)[source]#
Obtain the topk deployment result (lower quartile) selected by standard OPE.
- Parameters:
input_dict (OPEInputDict or MultipleInputDict) –
Dictionary of the OPE inputs for each evaluation policy.
key: [evaluation_policy][ evaluation_policy_action, evaluation_policy_action_dist, state_action_value_prediction, initial_state_value_prediction, state_action_marginal_importance_weight, state_marginal_importance_weight, on_policy_policy_value, gamma, behavior_policy, evaluation_policy, dataset_id, ]
See also
scope_rl.ope.input.CreateOPEInput describes the components of input_dict.
compared_estimators (list of str, default=None) – Name of compared estimators. When None is given, all the estimators are compared.
behavior_policy_name (str, default=None) – Name of the behavior policy.
dataset_id (int, default=None) – Id of the logged dataset. If None, the average of the result will be shown.
ope_alpha (float, default=0.05) – Proportion of the shaded region. The value should be within [0, 0.5].
max_topk (int, default=None) – Maximum number of policies to be deployed. If None is given, all the policies will be deployed.
return_safety_violation_rate (bool, default=False) – Whether to calculate and return the safety violation rate.
safety_threshold (float, default=None) – A policy whose policy value is below the given threshold is considered unsafe.
clip_sharpe_ratio (bool, default=False) – Whether to clip a large value of SharpeRatio with 1e2.
return_by_dataframe (bool, default=False) – Whether to return the result in a dataframe format.
- Returns:
topk_metric_dict/topk_metric_df – Dictionary/dataframe containing the following top-k risk return tradeoff metrics. Note that policy performance refers to the lower quartile here. When returning dataframe, the average value will be returned.
key: [estimator][
    k-th,
    best,                   # return
    worst,                  # risk
    mean,                   # risk
    std,                    # risk
    safety_violation_rate,  # risk
    sharpe_ratio,           # risk-return tradeoff
]
- k-th: ndarray of shape (max_topk, total_n_datasets)
Policy performance of the k-th deployment policy.
- best: ndarray of shape (max_topk, total_n_datasets)
Best policy performance among the top-k deployment policies.
- worst: ndarray of shape (max_topk, total_n_datasets)
Worst policy performance among the top-k deployment policies.
- mean: ndarray of shape (max_topk, total_n_datasets)
Mean policy performance of the top-k deployment policies.
- std: ndarray of shape (max_topk, total_n_datasets)
Standard deviation of the policy performance among the top-k deployment policies.
- safety_violation_rate: ndarray of shape (max_topk, total_n_datasets)
Safety violation rate regarding the policy performance of the top-k deployment policies.
- sharpe_ratio: ndarray of shape (max_topk, total_n_datasets)
Risk-return tradeoff metrics defined as follows: \(S(\hat{V}) := (\mathrm{Best@k} - V(\pi_0)) / \mathrm{Std@k}\).
- Return type:
dict or dataframe
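The lower quartile at level alpha can be read as the alpha-quantile of the return distribution. A minimal sketch under that reading follows; `lower_quantile` is a hypothetical helper that takes the smallest observed return whose empirical CDF value reaches alpha, which is one of several possible quantile conventions.

```python
import math


def lower_quantile(returns, alpha=0.05):
    """Smallest return x with empirical CDF F(x) >= alpha (illustrative)."""
    ordered = sorted(returns)
    n = len(ordered)
    # The empirical CDF at ordered[i] is (i + 1) / n.
    index = max(0, math.ceil(alpha * n) - 1)
    return ordered[index]
```

Unlike CVaR, which averages over the tail, this statistic is a single point of the distribution, so it ignores how bad the outcomes below it are.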
- obtain_topk_lower_quartile_selected_by_cumulative_distribution_ope(input_dict, compared_estimators=None, behavior_policy_name=None, dataset_id=None, ope_alpha=0.05, max_topk=None, return_safety_violation_rate=False, safety_threshold=None, clip_sharpe_ratio=False, return_by_dataframe=False)[source]#
Obtain the topk deployment result (lower quartile) selected by cumulative distribution OPE.
- Parameters:
input_dict (OPEInputDict) –
Dictionary of the OPE inputs for each evaluation policy.
key: [evaluation_policy][ evaluation_policy_action, evaluation_policy_action_dist, state_action_value_prediction, initial_state_value_prediction, state_action_marginal_importance_weight, state_marginal_importance_weight, on_policy_policy_value, gamma, behavior_policy, evaluation_policy, dataset_id, ]
See also
scope_rl.ope.input.CreateOPEInput describes the components of input_dict.
compared_estimators (list of str, default=None) – Name of compared estimators. When None is given, all the estimators are compared.
behavior_policy_name (str, default=None) – Name of the behavior policy.
dataset_id (int, default=None) – Id of the logged dataset. If None, the average of the result will be shown.
ope_alpha (float, default=0.05) – Proportion of the shaded region. The value should be within [0, 0.5].
max_topk (int, default=None) – Maximum number of policies to be deployed. If None is given, all the policies will be deployed.
return_safety_violation_rate (bool, default=False) – Whether to calculate and return the safety violation rate.
safety_threshold (float, default=None) – A policy whose policy value is below the given threshold is considered unsafe.
clip_sharpe_ratio (bool, default=False) – Whether to clip a large value of SharpeRatio with 1e2.
return_by_dataframe (bool, default=False) – Whether to return the result in a dataframe format.
- Returns:
topk_metric_dict/topk_metric_df – Dictionary/dataframe containing the following top-k risk return tradeoff metrics. Note that policy performance refers to the lower quartile here. When returning dataframe, the average value will be returned.
key: [estimator][
    k-th,
    best,                   # return
    worst,                  # risk
    mean,                   # risk
    std,                    # risk
    safety_violation_rate,  # risk
    sharpe_ratio,           # risk-return tradeoff
]
- k-th: ndarray of shape (max_topk, total_n_datasets)
Policy performance of the k-th deployment policy.
- best: ndarray of shape (max_topk, total_n_datasets)
Best policy performance among the top-k deployment policies.
- worst: ndarray of shape (max_topk, total_n_datasets)
Worst policy performance among the top-k deployment policies.
- mean: ndarray of shape (max_topk, total_n_datasets)
Mean policy performance of the top-k deployment policies.
- std: ndarray of shape (max_topk, total_n_datasets)
Standard deviation of the policy performance among the top-k deployment policies.
- safety_violation_rate: ndarray of shape (max_topk, total_n_datasets)
Safety violation rate regarding the policy performance of the top-k deployment policies.
- sharpe_ratio: ndarray of shape (max_topk, total_n_datasets)
Risk-return tradeoff metrics defined as follows: \(S(\hat{V}) := (\mathrm{Best@k} - V(\pi_0)) / \mathrm{Std@k}\).
- Return type:
dict or dataframe
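Cumulative distribution OPE estimates the CDF of the return under the evaluation policy before reading off a quantile. One way to picture this is a weighted empirical CDF, where each logged trajectory's return carries an importance weight. The sketch below assumes self-normalized trajectory-wise weights; `weighted_lower_quantile` is a hypothetical helper and does not reproduce any specific SCOPE-RL estimator.

```python
def weighted_lower_quantile(returns, weights, alpha=0.05):
    """alpha-quantile of a self-normalized weighted empirical CDF (illustrative)."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    cumulative = 0.0
    # Accumulate normalized weight in increasing order of return.
    for value, weight in sorted(zip(returns, weights)):
        cumulative += weight / total
        if cumulative >= alpha:
            return value
    # Guard against floating-point shortfall when alpha is close to 1.
    return max(returns)
```

With uniform weights this reduces to the ordinary empirical quantile; trajectories that the evaluation policy would rarely produce receive small weights and barely move the estimate.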
- visualize_topk_policy_value_selected_by_standard_ope(input_dict, compared_estimators=None, behavior_policy_name=None, dataset_id=None, metrics=['k-th', 'best', 'worst', 'mean', 'std', 'safety_violation_rate', 'sharpe_ratio'], max_topk=None, safety_threshold=None, relative_safety_criteria=None, clip_sharpe_ratio=False, ymax_sharpe_ratio=None, visualize_ci=False, plot_ci='bootstrap', plot_alpha=0.05, plot_n_bootstrap_samples=100, random_state=None, legend=True, fig_dir=None, fig_name='topk_policy_value_standard_ope.png')[source]#
Visualize the topk deployment result (policy value) selected by standard OPE.
- Parameters:
input_dict (OPEInputDict or MultipleInputDict) –
Dictionary of the OPE inputs for each evaluation policy.
key: [evaluation_policy][ evaluation_policy_action, evaluation_policy_action_dist, state_action_value_prediction, initial_state_value_prediction, state_action_marginal_importance_weight, state_marginal_importance_weight, on_policy_policy_value, gamma, behavior_policy, evaluation_policy, dataset_id, ]
See also
scope_rl.ope.input.CreateOPEInput describes the components of input_dict.
compared_estimators (list of str, default=None) – Name of compared estimators. When None is given, all the estimators are compared.
behavior_policy_name (str, default=None) – Name of the behavior policy.
dataset_id (int, default=None) – Id of the logged dataset. If None, the average of the result will be shown.
metrics (list of {"k-th", "best", "worst", "mean", "std", "safety_violation_rate", "sharpe_ratio"}, default=["k-th", "best", "worst", "mean", "std", "safety_violation_rate", "sharpe_ratio"]) –
Indicate which of the policy performance among {“best”, “worst”, “mean”, “std”}, sharpe ratio, and safety violation rate to report. For “k-th”, it means that the policy performance of the (estimated) k-th policy will be visualized.
We define the sharpe ratio for OPE as \(S(\hat{V}) := (\mathrm{Best@k} - V(\pi_0)) / \mathrm{Std@k}\).
max_topk (int, default=None) – Maximum number of policies to be deployed. If None is given, all the policies will be deployed.
safety_threshold (float, default=None) – A policy whose policy value is below the given threshold is considered unsafe.
relative_safety_criteria (float, default=None) – The relative policy value required to be considered a safe policy. For example, when 0.9 is given, candidate policy must exceed 90% of the behavior policy performance. Only applicable when using a single behavior policy.
clip_sharpe_ratio (bool, default=False) – Whether to clip a large value of SharpeRatio with 1e2.
ymax_sharpe_ratio (float, default=None) – Maximum value of the y-axis in the SharpeRatio plot.
visualize_ci (bool, default=False) – Whether to visualize the confidence interval. (Only applicable when MultipleInputDict is given.)
plot_ci ({"bootstrap", "hoeffding", "bernstein", "ttest"}, default="bootstrap") – Method to estimate the confidence interval.
plot_alpha (float, default=0.05) – Significance level. The value should be within [0, 1).
plot_n_bootstrap_samples (int, default=100 (> 0)) – Number of resampling performed in the bootstrap procedure.
random_state (int, default=None (>= 0)) – Random state.
legend (bool, default=True) – Whether to include a legend in the figure.
fig_dir (Path, default=None) – Path to store the bar figure. If None is given, the figure will not be saved.
fig_name (str, default="topk_policy_value_standard_ope.png") – Name of the bar figure.
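The two safety parameters above describe the same check in absolute and relative terms. The following sketch shows how they could be resolved into a single threshold and turned into a violation rate. Both helpers are hypothetical, and the precedence (an explicit `safety_threshold` wins over `relative_safety_criteria`) is an assumption rather than documented behavior.

```python
def resolve_safety_threshold(behavior_policy_value, safety_threshold=None,
                             relative_safety_criteria=None):
    """Absolute value below which a policy counts as unsafe, or None (illustrative)."""
    if safety_threshold is not None:
        # Assumption: an explicit absolute threshold takes precedence.
        return safety_threshold
    if relative_safety_criteria is not None:
        # E.g. 0.9 means "at least 90% of the behavior policy performance".
        return relative_safety_criteria * behavior_policy_value
    return None


def safety_violation_rate(policy_values, threshold):
    """Fraction of deployed policies whose value falls below the threshold."""
    return sum(v < threshold for v in policy_values) / len(policy_values)
```

For a behavior policy value of 10.0 and `relative_safety_criteria=0.9`, any candidate whose policy value is below 9.0 counts as a violation.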
- visualize_topk_policy_value_selected_by_cumulative_distribution_ope(input_dict, compared_estimators=None, behavior_policy_name=None, dataset_id=None, metrics=['k-th', 'best', 'worst', 'mean', 'std', 'safety_violation_rate', 'sharpe_ratio'], max_topk=None, safety_threshold=None, relative_safety_criteria=None, clip_sharpe_ratio=False, ymax_sharpe_ratio=None, visualize_ci=False, plot_ci='bootstrap', plot_alpha=0.05, plot_n_bootstrap_samples=100, random_state=None, legend=True, fig_dir=None, fig_name='topk_policy_value_cumulative_distribution_ope.png')[source]#
Visualize the topk deployment result (policy value) selected by cumulative distribution OPE.
- Parameters:
input_dict (OPEInputDict or MultipleInputDict) –
Dictionary of the OPE inputs for each evaluation policy.
key: [evaluation_policy][ evaluation_policy_action, evaluation_policy_action_dist, state_action_value_prediction, initial_state_value_prediction, state_action_marginal_importance_weight, state_marginal_importance_weight, on_policy_policy_value, gamma, behavior_policy, evaluation_policy, dataset_id, ]
See also
scope_rl.ope.input.CreateOPEInput describes the components of input_dict.
compared_estimators (list of str, default=None) – Name of compared estimators. When None is given, all the estimators are compared.
behavior_policy_name (str, default=None) – Name of the behavior policy.
dataset_id (int, default=None) – Id of the logged dataset. If None, the average of the result will be shown.
metrics (list of {"k-th", "best", "worst", "mean", "std", "safety_violation_rate", "sharpe_ratio"}, default=["k-th", "best", "worst", "mean", "std", "safety_violation_rate", "sharpe_ratio"]) –
Indicate which of the policy performance among {“best”, “worst”, “mean”, “std”}, sharpe ratio, and safety violation rate to report. For “k-th”, it means that the policy performance of the (estimated) k-th policy will be visualized.
We define the sharpe ratio for OPE as \(S(\hat{V}) := (\mathrm{Best@k} - V(\pi_0)) / \mathrm{Std@k}\).
max_topk (int, default=None) – Maximum number of policies to be deployed. If None is given, all the policies will be deployed.
safety_threshold (float, default=None) – A policy whose policy value is below the given threshold is considered unsafe.
relative_safety_criteria (float, default=None) – The relative policy value required to be considered a safe policy. For example, when 0.9 is given, candidate policy must exceed 90% of the behavior policy performance.
clip_sharpe_ratio (bool, default=False) – Whether to clip a large value of SharpeRatio with 1e2.
ymax_sharpe_ratio (float, default=None) – Maximum value of the y-axis in the SharpeRatio plot.
visualize_ci (bool, default=False) – Whether to visualize the confidence interval. (Only applicable when MultipleInputDict is given.)
plot_ci ({"bootstrap", "hoeffding", "bernstein", "ttest"}, default="bootstrap") – Method to estimate the confidence interval.
plot_alpha (float, default=0.05) – Significance level. The value should be within [0, 1).
plot_n_bootstrap_samples (int, default=100 (> 0)) – Number of resampling performed in the bootstrap procedure.
random_state (int, default=None (>= 0)) – Random state.
legend (bool, default=True) – Whether to include a legend in the figure.
fig_dir (Path, default=None) – Path to store the bar figure. If None is given, the figure will not be saved.
fig_name (str, default="topk_policy_value_cumulative_distribution_ope.png") – Name of the bar figure.
- visualize_topk_policy_value_selected_by_lower_bound(input_dict, compared_estimators=None, behavior_policy_name=None, dataset_id=None, metrics=['k-th', 'best', 'worst', 'mean', 'std', 'safety_violation_rate', 'sharpe_ratio'], max_topk=None, safety_threshold=None, relative_safety_criteria=None, clip_sharpe_ratio=False, ymax_sharpe_ratio=None, ope_cis=['bootstrap'], ope_alpha=0.05, ope_n_bootstrap_samples=100, visualize_ci=False, plot_ci='bootstrap', plot_alpha=0.05, plot_n_bootstrap_samples=100, random_state=None, legend=True, fig_dir=None, fig_name='topk_policy_value_standard_ope_lower_bound.png')[source]#
Visualize the topk deployment result (policy value) selected by its estimated lower bound.
- Parameters:
input_dict (OPEInputDict or MultipleInputDict) –
Dictionary of the OPE inputs for each evaluation policy.
key: [evaluation_policy][ evaluation_policy_action, evaluation_policy_action_dist, state_action_value_prediction, initial_state_value_prediction, state_action_marginal_importance_weight, state_marginal_importance_weight, on_policy_policy_value, gamma, behavior_policy, evaluation_policy, dataset_id, ]
See also
scope_rl.ope.input.CreateOPEInput describes the components of input_dict.
compared_estimators (list of str, default=None) – Name of compared estimators. When None is given, all the estimators are compared.
behavior_policy_name (str, default=None) – Name of the behavior policy.
dataset_id (int, default=None) – Id of the logged dataset. If None, the average of the result will be shown.
metrics (list of {"k-th", "best", "worst", "mean", "std", "safety_violation_rate", "sharpe_ratio"}, default=["k-th", "best", "worst", "mean", "std", "safety_violation_rate", "sharpe_ratio"]) –
Indicate which of the policy performance among {“best”, “worst”, “mean”, “std”}, sharpe ratio, and safety violation rate to report. For “k-th”, it means that the policy performance of the (estimated) k-th policy will be visualized.
We define the sharpe ratio for OPE as \(S(\hat{V}) := (\mathrm{Best@k} - V(\pi_0)) / \mathrm{Std@k}\).
max_topk (int, default=None) – Maximum number of policies to be deployed. If None is given, all the policies will be deployed.
safety_threshold (float, default=None) – A policy whose policy value is below the given threshold is considered unsafe.
relative_safety_criteria (float, default=None) – The relative policy value required to be considered a safe policy. For example, when 0.9 is given, candidate policy must exceed 90% of the behavior policy performance.
clip_sharpe_ratio (bool, default=False) – Whether to clip a large value of SharpeRatio with 1e2.
ymax_sharpe_ratio (float, default=None) – Maximum value of the y-axis in the SharpeRatio plot.
ope_cis (list of {"bootstrap", "hoeffding", "bernstein", "ttest"}, default=["bootstrap"]) – Estimation methods for confidence intervals.
ope_alpha (float, default=0.05) – Significance level. The value should be within [0, 1).
ope_n_bootstrap_samples (int, default=100 (> 0)) – Number of resampling performed in the bootstrap procedure.
visualize_ci (bool, default=False) – Whether to visualize the confidence interval. (Only applicable when MultipleInputDict is given.)
plot_ci ({"bootstrap", "hoeffding", "bernstein", "ttest"}, default="bootstrap") – Method to estimate the confidence interval.
plot_alpha (float, default=0.05) – Significance level. The value should be within [0, 1).
plot_n_bootstrap_samples (int, default=100 (> 0)) – Number of resampling performed in the bootstrap procedure.
random_state (int, default=None (>= 0)) – Random state.
legend (bool, default=True) – Whether to include a legend in the figure.
fig_dir (Path, default=None) – Path to store the bar figure. If None is given, the figure will not be saved.
fig_name (str, default="topk_policy_value_standard_ope_lower_bound.png") – Name of the bar figure.
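Besides the bootstrap, the confidence-interval options include concentration-inequality bounds such as "hoeffding". As a point of reference, the textbook one-sided Hoeffding lower bound for the mean of n bounded samples is mean - R * sqrt(ln(1 / alpha) / (2n)), where R is the range of the samples. The sketch below implements that textbook form only; `hoeffding_lower_bound` is a hypothetical helper, the plug-in range is an assumption, and the library's exact formula may differ.

```python
import math


def hoeffding_lower_bound(samples, alpha=0.05, value_range=None):
    """One-sided Hoeffding lower bound on the mean (textbook form, illustrative)."""
    n = len(samples)
    mean = sum(samples) / n
    if value_range is None:
        # Assumption: plug in the observed range when the true range is unknown.
        value_range = max(samples) - min(samples)
    width = value_range * math.sqrt(math.log(1.0 / alpha) / (2.0 * n))
    return mean - width
```

The bound tightens at rate 1/sqrt(n) and depends on the data only through the mean and the range, so it is typically more conservative than a bootstrap bound on the same samples.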
- visualize_topk_conditional_value_at_risk_selected_by_standard_ope(input_dict, compared_estimators=None, behavior_policy_name=None, dataset_id=None, ope_alpha=0.05, metrics=['k-th', 'best', 'worst', 'mean', 'std', 'safety_violation_rate', 'sharpe_ratio'], max_topk=None, safety_threshold=None, clip_sharpe_ratio=False, ymax_sharpe_ratio=None, visualize_ci=False, plot_ci='bootstrap', plot_alpha=0.05, plot_n_bootstrap_samples=100, random_state=None, legend=True, fig_dir=None, fig_name='topk_cvar_standard_ope.png')[source]#
Visualize the topk deployment result (CVaR) selected by standard OPE.
- Parameters:
input_dict (OPEInputDict) –
Dictionary of the OPE inputs for each evaluation policy.
key: [evaluation_policy][ evaluation_policy_action, evaluation_policy_action_dist, state_action_value_prediction, initial_state_value_prediction, state_action_marginal_importance_weight, state_marginal_importance_weight, on_policy_policy_value, gamma, behavior_policy, evaluation_policy, dataset_id, ]
See also
scope_rl.ope.input.CreateOPEInput describes the components of input_dict.
compared_estimators (list of str, default=None) – Name of compared estimators. When None is given, all the estimators are compared.
behavior_policy_name (str, default=None) – Name of the behavior policy.
dataset_id (int, default=None) – Id of the logged dataset. If None, the average of the result will be shown.
ope_alpha (float, default=0.05) – Proportion of the shaded region. The value should be within [0, 1].
metrics (list of {"k-th", "best", "worst", "mean", "std", "safety_violation_rate", "sharpe_ratio"}, default=["k-th", "best", "worst", "mean", "std", "safety_violation_rate", "sharpe_ratio"]) –
Indicate which of the policy performance among {“best”, “worst”, “mean”, “std”}, sharpe ratio, and safety violation rate to report. For “k-th”, it means that the policy performance of the (estimated) k-th policy will be visualized.
We define the sharpe ratio for OPE as \(S(\hat{V}) := (\mathrm{Best@k} - V(\pi_0)) / \mathrm{Std@k}\).
max_topk (int, default=None) – Maximum number of policies to be deployed. If None is given, all the policies will be deployed.
safety_threshold (float, default=None (>= 0)) – The conditional value at risk required to be considered a safe policy.
clip_sharpe_ratio (bool, default=False) – Whether to clip a large value of SharpeRatio with 1e2.
ymax_sharpe_ratio (float, default=None) – Maximum value of the y-axis in the SharpeRatio plot.
visualize_ci (bool, default=False) – Whether to visualize the confidence interval. (Only applicable when MultipleInputDict is given.)
plot_ci ({"bootstrap", "hoeffding", "bernstein", "ttest"}, default="bootstrap") – Method to estimate the confidence interval.
plot_alpha (float, default=0.05) – Significance level. The value should be within [0, 1).
plot_n_bootstrap_samples (int, default=100 (> 0)) – Number of resampling performed in the bootstrap procedure.
random_state (int, default=None (>= 0)) – Random state.
legend (bool, default=True) – Whether to include a legend in the figure.
fig_dir (Path, default=None) – Path to store the bar figure. If None is given, the figure will not be saved.
fig_name (str, default="topk_cvar_standard_ope.png") – Name of the bar figure.
- visualize_topk_conditional_value_at_risk_selected_by_cumulative_distribution_ope(input_dict, compared_estimators=None, behavior_policy_name=None, dataset_id=None, ope_alpha=0.05, metrics=['k-th', 'best', 'worst', 'mean', 'std', 'safety_violation_rate', 'sharpe_ratio'], max_topk=None, safety_threshold=None, clip_sharpe_ratio=False, ymax_sharpe_ratio=None, visualize_ci=False, plot_ci='bootstrap', plot_alpha=0.05, plot_n_bootstrap_samples=100, random_state=None, legend=True, fig_dir=None, fig_name='topk_cvar_cumulative_distribution_ope.png')[source]#
Visualize the topk deployment result (CVaR) selected by cumulative distribution OPE.
- Parameters:
input_dict (OPEInputDict) –
Dictionary of the OPE inputs for each evaluation policy.
key: [evaluation_policy][ evaluation_policy_action, evaluation_policy_action_dist, state_action_value_prediction, initial_state_value_prediction, state_action_marginal_importance_weight, state_marginal_importance_weight, on_policy_policy_value, gamma, behavior_policy, evaluation_policy, dataset_id, ]
See also
scope_rl.ope.input.CreateOPEInput describes the components of input_dict.
compared_estimators (list of str, default=None) – Name of compared estimators. When None is given, all the estimators are compared.
behavior_policy_name (str, default=None) – Name of the behavior policy.
dataset_id (int, default=None) – Id of the logged dataset. If None, the average of the result will be shown.
ope_alpha (float, default=0.05) – Proportion of the shaded region. The value should be within [0, 1].
metrics (list of {"k-th", "best", "worst", "mean", "std", "safety_violation_rate", "sharpe_ratio"}, default=["k-th", "best", "worst", "mean", "std", "safety_violation_rate", "sharpe_ratio"]) –
Metrics to report: the policy performance statistics {“best”, “worst”, “mean”, “std”}, the Sharpe ratio, and the safety violation rate. “k-th” reports the performance of the policy ranked k-th by the estimator.
We define the Sharpe ratio for OPE as \(S(\hat{V}) := (\mathrm{Best@k} - V(\pi_0)) / \mathrm{Std@k}\).
max_topk (int, default=None) – Maximum number of policies to be deployed. If None is given, all the policies will be deployed.
safety_threshold (float, default=None (>= 0)) – The conditional value at risk required to be considered a safe policy.
clip_sharpe_ratio (bool, default=False) – Whether to clip large values of the Sharpe ratio at 1e2.
ymax_sharpe_ratio (float, default=None) – Maximum value of the y-axis in the Sharpe ratio plot.
visualize_ci (bool, default=False) – Whether to visualize the confidence interval. (Only applicable when MultipleInputDict is given.)
plot_ci ({"bootstrap", "hoeffding", "bernstein", "ttest"}, default="bootstrap") – Method to estimate the confidence interval.
plot_alpha (float, default=0.05) – Significance level. The value should be within [0, 1).
plot_n_bootstrap_samples (int, default=100 (> 0)) – Number of resamples drawn in the bootstrap procedure.
random_state (int, default=None (>= 0)) – Random state.
legend (bool, default=True) – Whether to include a legend in the figure.
fig_dir (Path, default=None) – Path to store the bar figure. If None is given, the figure will not be saved.
fig_name (str, default="topk_cvar_cumulative_distribution_ope.png") – Name of the bar figure.
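As a rough illustration of the quantity these plots report, the conditional value at risk (CVaR) at level alpha can be computed from sampled trajectory returns as the mean of the worst alpha-fraction of them. The sketch below is a sample-based approximation under that assumption, not the estimator used by cumulative distribution OPE. The names conditional_value_at_risk and is_safe are hypothetical, and treating a policy as safe when its CVaR is at least safety_threshold is one reading of the parameter description above.

```python
import math


def conditional_value_at_risk(returns, alpha=0.05):
    """Mean of the worst ceil(alpha * n) returns (the lower tail)."""
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be within (0, 1]")
    ordered = sorted(returns)
    # Keep at least one sample so the tail is never empty.
    n_tail = max(1, math.ceil(alpha * len(ordered)))
    tail = ordered[:n_tail]
    return sum(tail) / n_tail


def is_safe(returns, safety_threshold, alpha=0.05):
    """Assumed criterion: a policy is safe if its CVaR reaches the threshold."""
    return conditional_value_at_risk(returns, alpha) >= safety_threshold


returns = [4.0, 1.0, 3.0, 2.0]
print(conditional_value_at_risk(returns, alpha=0.5))  # 1.5
print(is_safe(returns, safety_threshold=2.0, alpha=0.5))  # False
```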
- visualize_topk_lower_quartile_selected_by_standard_ope(input_dict, compared_estimators=None, behavior_policy_name=None, dataset_id=None, ope_alpha=0.05, metrics=['k-th', 'best', 'worst', 'mean', 'std', 'safety_violation_rate', 'sharpe_ratio'], max_topk=None, safety_threshold=None, clip_sharpe_ratio=False, ymax_sharpe_ratio=None, visualize_ci=False, plot_ci='bootstrap', plot_alpha=0.05, plot_n_bootstrap_samples=100, random_state=None, legend=True, fig_dir=None, fig_name='topk_lower_quartile_standard_ope.png')[source]#
Visualize the topk deployment result (lower quartile) selected by standard OPE.
- Parameters:
input_dict (OPEInputDict or MultipleInputDict) –
Dictionary of the OPE inputs for each evaluation policy.
key: [evaluation_policy][ evaluation_policy_action, evaluation_policy_action_dist, state_action_value_prediction, initial_state_value_prediction, state_action_marginal_importance_weight, state_marginal_importance_weight, on_policy_policy_value, gamma, behavior_policy, evaluation_policy, dataset_id, ]
See also
scope_rl.ope.input.CreateOPEInput describes the components of input_dict.
compared_estimators (list of str, default=None) – Name of compared estimators. When None is given, all the estimators are compared.
behavior_policy_name (str, default=None) – Name of the behavior policy.
dataset_id (int, default=None) – Id of the logged dataset. If None, the average of the result will be shown.
ope_alpha (float, default=0.05) – Proportion of the shaded region. The value should be within [0, 0.5].
metrics (list of {"k-th", "best", "worst", "mean", "std", "safety_violation_rate", "sharpe_ratio"}, default=["k-th", "best", "worst", "mean", "std", "safety_violation_rate", "sharpe_ratio"]) –
Metrics to report: the policy performance statistics {“best”, “worst”, “mean”, “std”}, the Sharpe ratio, and the safety violation rate. “k-th” reports the performance of the policy ranked k-th by the estimator.
We define the Sharpe ratio for OPE as \(S(\hat{V}) := (\mathrm{Best@k} - V(\pi_0)) / \mathrm{Std@k}\).
max_topk (int, default=None) – Maximum number of policies to be deployed. If None is given, all the policies will be deployed.
safety_threshold (float, default=None (>= 0)) – The conditional value at risk required to be considered a safe policy.
clip_sharpe_ratio (bool, default=False) – Whether to clip large values of the Sharpe ratio at 1e2.
ymax_sharpe_ratio (float, default=None) – Maximum value of the y-axis in the Sharpe ratio plot.
visualize_ci (bool, default=False) – Whether to visualize the confidence interval. (Only applicable when MultipleInputDict is given.)
plot_ci ({"bootstrap", "hoeffding", "bernstein", "ttest"}, default="bootstrap") – Method to estimate the confidence interval.
plot_alpha (float, default=0.05) – Significance level. The value should be within [0, 1).
plot_n_bootstrap_samples (int, default=100 (> 0)) – Number of resamples drawn in the bootstrap procedure.
random_state (int, default=None (>= 0)) – Random state.
legend (bool, default=True) – Whether to include a legend in the figure.
fig_dir (Path, default=None) – Path to store the bar figure. If None is given, the figure will not be saved.
fig_name (str, default="topk_lower_quartile_standard_ope.png") – Name of the bar figure.
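The lower quartile at level alpha can be read as the alpha-quantile of the trajectory-wise return: the smallest return whose empirical CDF value reaches alpha. The following sketch computes it from sampled returns. It is only an illustration under that assumption; the function name lower_quantile is hypothetical, and SCOPE-RL estimates this quantity from the estimated CDF rather than from on-policy samples.

```python
def lower_quantile(returns, alpha=0.05):
    """Smallest return x whose empirical CDF satisfies F(x) >= alpha."""
    if not 0 <= alpha <= 0.5:
        raise ValueError("alpha must be within [0, 0.5]")
    ordered = sorted(returns)
    n = len(ordered)
    for rank, value in enumerate(ordered, start=1):
        # rank / n is the empirical CDF evaluated at this return.
        if rank / n >= alpha:
            return value
    return ordered[-1]  # not reached when alpha <= 0.5 and returns is non-empty


returns = [5.0, 1.0, 4.0, 2.0, 3.0]
print(lower_quantile(returns, alpha=0.4))  # 2.0
print(lower_quantile(returns, alpha=0.5))  # 3.0
```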
- visualize_topk_lower_quartile_selected_by_cumulative_distribution_ope(input_dict, compared_estimators=None, behavior_policy_name=None, dataset_id=None, ope_alpha=0.05, metrics=['k-th', 'best', 'worst', 'mean', 'std', 'safety_violation_rate', 'sharpe_ratio'], max_topk=None, safety_threshold=None, clip_sharpe_ratio=False, ymax_sharpe_ratio=None, visualize_ci=False, plot_ci='bootstrap', plot_alpha=0.05, plot_n_bootstrap_samples=100, random_state=None, legend=True, fig_dir=None, fig_name='topk_lower_quartile_cumulative_distribution_ope.png')[source]#
Visualize the topk deployment result (lower quartile) selected by cumulative distribution OPE.
- Parameters:
input_dict (OPEInputDict) –
Dictionary of the OPE inputs for each evaluation policy.
key: [evaluation_policy][ evaluation_policy_action, evaluation_policy_action_dist, state_action_value_prediction, initial_state_value_prediction, state_action_marginal_importance_weight, state_marginal_importance_weight, on_policy_policy_value, gamma, behavior_policy, evaluation_policy, dataset_id, ]
See also
scope_rl.ope.input.CreateOPEInput describes the components of input_dict.
compared_estimators (list of str, default=None) – Name of compared estimators. When None is given, all the estimators are compared.
behavior_policy_name (str, default=None) – Name of the behavior policy.
dataset_id (int, default=None) – Id of the logged dataset. If None, the average of the result will be shown.
ope_alpha (float, default=0.05) – Proportion of the shaded region. The value should be within [0, 0.5].
metrics (list of {"k-th", "best", "worst", "mean", "std", "safety_violation_rate", "sharpe_ratio"}, default=["k-th", "best", "worst", "mean", "std", "safety_violation_rate", "sharpe_ratio"]) –
Metrics to report: the policy performance statistics {“best”, “worst”, “mean”, “std”}, the Sharpe ratio, and the safety violation rate. “k-th” reports the performance of the policy ranked k-th by the estimator.
We define the Sharpe ratio for OPE as \(S(\hat{V}) := (\mathrm{Best@k} - V(\pi_0)) / \mathrm{Std@k}\).
max_topk (int, default=None) – Maximum number of policies to be deployed. If None is given, all the policies will be deployed.
safety_threshold (float, default=None (>= 0)) – The conditional value at risk required to be considered a safe policy.
clip_sharpe_ratio (bool, default=False) – Whether to clip large values of the Sharpe ratio at 1e2.
ymax_sharpe_ratio (float, default=None) – Maximum value of the y-axis in the Sharpe ratio plot.
visualize_ci (bool, default=False) – Whether to visualize the confidence interval. (Only applicable when MultipleInputDict is given.)
plot_ci ({"bootstrap", "hoeffding", "bernstein", "ttest"}, default="bootstrap") – Method to estimate the confidence interval.
plot_alpha (float, default=0.05) – Significance level. The value should be within [0, 1).
plot_n_bootstrap_samples (int, default=100 (> 0)) – Number of resamples drawn in the bootstrap procedure.
random_state (int, default=None (>= 0)) – Random state.
legend (bool, default=True) – Whether to include a legend in the figure.
fig_dir (Path, default=None) – Path to store the bar figure. If None is given, the figure will not be saved.
fig_name (str, default="topk_lower_quartile_cumulative_distribution_ope.png") – Name of the bar figure.
- visualize_policy_value_for_validation(input_dict, compared_estimators=None, behavior_policy_name=None, dataset_id=None, n_cols=None, share_axes=False, legend=True, fig_dir=None, fig_name='validation_policy_value_standard_ope.png')[source]#
Visualize the true policy value and its estimate (scatter plot).
- Parameters:
input_dict (OPEInputDict or MultipleInputDict) –
Dictionary of the OPE inputs for each evaluation policy.
key: [evaluation_policy][ evaluation_policy_action, evaluation_policy_action_dist, state_action_value_prediction, initial_state_value_prediction, state_action_marginal_importance_weight, state_marginal_importance_weight, on_policy_policy_value, gamma, behavior_policy, evaluation_policy, dataset_id, ]
See also
scope_rl.ope.input.CreateOPEInput describes the components of input_dict.
compared_estimators (list of str, default=None) – Name of compared estimators. When None is given, all the estimators are compared.
behavior_policy_name (str, default=None) – Name of the behavior policy.
dataset_id (int, default=None) – Id of the logged dataset.
n_cols (int, default=None (> 0)) – Number of columns in the figure.
share_axes (bool, default=False) – Whether to share x- and y-axes or not.
legend (bool, default=True) – Whether to include a legend in the scatter plot.
fig_dir (Path, default=None) – Path to store the figure. If None is given, the figure will not be saved.
fig_name (str, default="validation_policy_value_standard_ope.png") – Name of the figure.
- visualize_policy_value_of_cumulative_distribution_ope_for_validation(input_dict, compared_estimators=None, behavior_policy_name=None, dataset_id=None, n_cols=None, share_axes=False, legend=True, fig_dir=None, fig_name='validation_policy_value_cumulative_distribution_ope.png')[source]#
Visualize the true policy value and its estimate obtained by cumulative distribution OPE (scatter plot).
- Parameters:
input_dict (OPEInputDict or MultipleInputDict) –
Dictionary of the OPE inputs for each evaluation policy.
key: [evaluation_policy][ evaluation_policy_action, evaluation_policy_action_dist, state_action_value_prediction, initial_state_value_prediction, state_action_marginal_importance_weight, state_marginal_importance_weight, on_policy_policy_value, gamma, behavior_policy, evaluation_policy, dataset_id, ]
See also
scope_rl.ope.input.CreateOPEInput describes the components of input_dict.
compared_estimators (list of str, default=None) – Name of compared estimators. When None is given, all the estimators are compared.
behavior_policy_name (str, default=None) – Name of the behavior policy.
dataset_id (int, default=None) – Id of the logged dataset.
n_cols (int, default=None (> 0)) – Number of columns in the figure.
share_axes (bool, default=False) – Whether to share x- and y-axes or not.
legend (bool, default=True) – Whether to include a legend in the scatter plot.
fig_dir (Path, default=None) – Path to store the figure. If None is given, the figure will not be saved.
fig_name (str, default="validation_policy_value_cumulative_distribution_ope.png") – Name of the figure.
- visualize_policy_value_lower_bound_for_validation(input_dict, compared_estimators=None, behavior_policy_name=None, dataset_id=None, cis=['bootstrap'], alpha=0.05, n_bootstrap_samples=100, random_state=None, n_cols=None, share_axes=False, legend=True, fig_dir=None, fig_name='validation_policy_value_lower_bound.png')[source]#
Visualize the true policy value and the lower bound of its estimate (scatter plot).
- Parameters:
input_dict (OPEInputDict or MultipleInputDict) –
Dictionary of the OPE inputs for each evaluation policy.
key: [evaluation_policy][ evaluation_policy_action, evaluation_policy_action_dist, state_action_value_prediction, initial_state_value_prediction, state_action_marginal_importance_weight, state_marginal_importance_weight, on_policy_policy_value, gamma, behavior_policy, evaluation_policy, dataset_id, ]
See also
scope_rl.ope.input.CreateOPEInput describes the components of input_dict.
compared_estimators (list of str, default=None) – Name of compared estimators. When None is given, all the estimators are compared.
behavior_policy_name (str, default=None) – Name of the behavior policy.
dataset_id (int, default=None) – Id of the logged dataset.
cis (list of {"bootstrap", "hoeffding", "bernstein", "ttest"}, default=["bootstrap"]) – Estimation methods for confidence intervals.
alpha (float, default=0.05) – Significance level. The value should be within [0, 1).
n_bootstrap_samples (int, default=100 (> 0)) – Number of resampling performed in the bootstrap procedure.
random_state (int, default=None (>= 0)) – Random state.
n_cols (int, default=None (> 0)) – Number of columns in the figure.
share_axes (bool, default=False) – Whether to share x- and y-axes or not.
legend (bool, default=True) – Whether to include a legend in the scatter plot.
fig_dir (Path, default=None) – Path to store the figure. If None is given, the figure will not be saved.
fig_name (str, default="validation_policy_value_lower_bound.png") – Name of the figure.
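To make the role of alpha, n_bootstrap_samples, and random_state concrete, here is a minimal sketch of a percentile-bootstrap lower bound on a mean. It is an assumption-laden illustration, not SCOPE-RL's code: the function name bootstrap_lower_bound is hypothetical, the input is taken to be a list of per-trajectory value estimates, and the bound is taken as the one-sided alpha-quantile of the resampled means. The library may use a different convention (for example, a two-sided interval).

```python
import random
import statistics


def bootstrap_lower_bound(samples, alpha=0.05, n_bootstrap_samples=100, random_state=None):
    """Percentile-bootstrap lower bound on the mean of samples."""
    if not 0 <= alpha < 1:
        raise ValueError("alpha must be within [0, 1)")
    rng = random.Random(random_state)  # seeded for reproducibility
    n = len(samples)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(samples, k=n))
        for _ in range(n_bootstrap_samples)
    )
    # Take the (approximate) alpha-quantile of the bootstrap means.
    index = int(alpha * (n_bootstrap_samples - 1))
    return means[index]


samples = [1.0, 2.0, 3.0, 4.0, 5.0]
lower = bootstrap_lower_bound(samples, alpha=0.05, random_state=12345)
print(min(samples) <= lower <= max(samples))  # True
```

Fixing random_state makes the bound reproducible across calls, which matters when comparing the lower bounds of several estimators in one figure.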
- visualize_variance_for_validation(input_dict, compared_estimators=None, behavior_policy_name=None, dataset_id=None, n_cols=None, share_axes=False, legend=True, fig_dir=None, fig_name='validation_variance.png')[source]#
Visualize the true variance and its estimate (scatter plot).
- Parameters:
input_dict (OPEInputDict or MultipleInputDict) –
Dictionary of the OPE inputs for each evaluation policy.
key: [evaluation_policy][ evaluation_policy_action, evaluation_policy_action_dist, state_action_value_prediction, initial_state_value_prediction, state_action_marginal_importance_weight, state_marginal_importance_weight, on_policy_policy_value, gamma, behavior_policy, evaluation_policy, dataset_id, ]
See also
scope_rl.ope.input.CreateOPEInput describes the components of input_dict.
compared_estimators (list of str, default=None) – Name of compared estimators. When None is given, all the estimators are compared.
behavior_policy_name (str, default=None) – Name of the behavior policy.
dataset_id (int, default=None) – Id of the logged dataset.
n_cols (int, default=None (> 0)) – Number of columns in the figure.
share_axes (bool, default=False) – Whether to share x- and y-axes or not.
legend (bool, default=True) – Whether to include a legend in the scatter plot.
fig_dir (Path, default=None) – Path to store the figure. If None is given, the figure will not be saved.
fig_name (str, default="validation_variance.png") – Name of the figure.
- visualize_lower_quartile_for_validation(input_dict, compared_estimators=None, behavior_policy_name=None, dataset_id=None, alpha=0.05, n_cols=None, share_axes=False, legend=True, fig_dir=None, fig_name='validation_lower_quartile.png')[source]#
Visualize the true lower quartile and its estimate (scatter plot).
- Parameters:
input_dict (OPEInputDict or MultipleInputDict) –
Dictionary of the OPE inputs for each evaluation policy.
key: [evaluation_policy][ evaluation_policy_action, evaluation_policy_action_dist, state_action_value_prediction, initial_state_value_prediction, state_action_marginal_importance_weight, state_marginal_importance_weight, on_policy_policy_value, gamma, behavior_policy, evaluation_policy, dataset_id, ]
See also
scope_rl.ope.input.CreateOPEInput describes the components of input_dict.
compared_estimators (list of str, default=None) – Name of compared estimators. When None is given, all the estimators are compared.
behavior_policy_name (str, default=None) – Name of the behavior policy.
dataset_id (int, default=None) – Id of the logged dataset.
alpha (float, default=0.05) – Proportion of the shaded region. The value should be within [0, 0.5].
n_cols (int, default=None (> 0)) – Number of columns in the figure.
share_axes (bool, default=False) – Whether to share x- and y-axes or not.
legend (bool, default=True) – Whether to include a legend in the scatter plot.
fig_dir (Path, default=None) – Path to store the figure. If None is given, the figure will not be saved.
fig_name (str, default="validation_lower_quartile.png") – Name of the figure.
- visualize_conditional_value_at_risk_for_validation(input_dict, compared_estimators=None, behavior_policy_name=None, dataset_id=None, alpha=0.05, n_cols=None, share_axes=False, legend=True, fig_dir=None, fig_name='validation_conditional_value_at_risk.png')[source]#
Visualize the true conditional value at risk and its estimate (scatter plot).
- Parameters:
input_dict (OPEInputDict or MultipleInputDict) –
Dictionary of the OPE inputs for each evaluation policy.
key: [evaluation_policy][ evaluation_policy_action, evaluation_policy_action_dist, state_action_value_prediction, initial_state_value_prediction, state_action_marginal_importance_weight, state_marginal_importance_weight, on_policy_policy_value, gamma, behavior_policy, evaluation_policy, dataset_id, ]
See also
scope_rl.ope.input.CreateOPEInput describes the components of input_dict.
compared_estimators (list of str, default=None) – Name of compared estimators. When None is given, all the estimators are compared.
behavior_policy_name (str, default=None) – Name of the behavior policy.
dataset_id (int, default=None) – Id of the logged dataset.
alpha (float, default=0.05) – Proportion of the shaded region. The value should be within [0, 1].
n_cols (int, default=None (> 0)) – Number of columns in the figure.
share_axes (bool, default=False) – Whether to share x- and y-axes or not.
legend (bool, default=True) – Whether to include a legend in the scatter plot.
fig_dir (Path, default=None) – Path to store the figure. If None is given, the figure will not be saved.
fig_name (str, default="validation_conditional_value_at_risk.png") – Name of the figure.
Methods