Example Codes for Assessing OPE Estimators#
Here, we show example codes for assessing OPE/OPS results.
See also
For preparation, please also refer to the documentation pages on setting up the logged dataset and the OPE instances.
Prerequisite#
Here, we assume that an RL environment, a behavior policy, and evaluation policies are given as follows.
behavior_policy
: an instance of BaseHead
evaluation_policies
: a list of instance(s) of BaseHead
env
: a gym environment (unnecessary when using real-world datasets)
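For concreteness, the snippet below is a minimal sketch of this setup, following the interfaces shown in the SCOPE-RL quickstart. The CartPole environment and the (untrained) DoubleDQN base policy are illustrative assumptions; in practice, the base policy would be trained beforehand.

import gym
from d3rlpy.algos import DoubleDQNConfig
from scope_rl.policy import EpsilonGreedyHead

env = gym.make("CartPole-v1")  # illustrative environment (assumption)

# base policy; training is omitted in this sketch
ddqn = DoubleDQNConfig().create()

# behavior policy: an epsilon-greedy head on top of the base policy
behavior_policy = EpsilonGreedyHead(
    ddqn,
    n_actions=env.action_space.n,
    epsilon=0.3,
    name="ddqn_epsilon_0.3",
    random_state=12345,
)

# evaluation policies: heads with different exploration rates
evaluation_policies = [
    EpsilonGreedyHead(
        ddqn,
        n_actions=env.action_space.n,
        epsilon=epsilon,
        name=f"ddqn_epsilon_{epsilon}",
        random_state=12345,
    )
    for epsilon in [0.0, 0.1, 0.3]
]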
Additionally, we assume that the logged dataset, inputs, and either the ope or cd_ope instance are ready to use. For initializing the ope and cd_ope instances, please refer to the respective documentation pages.
logged_dataset
: a dictionary containing the logged dataset
input_dict
: a dictionary containing inputs for OPE
ope
: an instance of OffPolicyEvaluation
cd_ope
: an instance of CumulativeDistributionOPE
Note that, to run the following example codes, input_dict should contain the on-policy policy value of each candidate policy. This requirement is automatically satisfied when collecting the logged dataset by handing env over to the CreateInput class.
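A minimal sketch of this preparation, again following the quickstart interfaces, is given below. The dataset size, the horizon, and the estimator classes (DM, TIS, PDIS, DR and their cumulative distribution counterparts) are assumptions here, chosen to match the estimator names used later on this page.

from scope_rl.dataset import SyntheticDataset
from scope_rl.ope import CreateOPEInput
from scope_rl.ope import OffPolicyEvaluation
from scope_rl.ope import CumulativeDistributionOPE
from scope_rl.ope.discrete import DirectMethod as DM
from scope_rl.ope.discrete import TrajectoryWiseImportanceSampling as TIS
from scope_rl.ope.discrete import PerDecisionImportanceSampling as PDIS
from scope_rl.ope.discrete import DoublyRobust as DR
from scope_rl.ope.discrete import CumulativeDistributionTIS as CD_IS
from scope_rl.ope.discrete import CumulativeDistributionDR as CD_DR

# collect the logged dataset with the behavior policy
dataset = SyntheticDataset(env=env, max_episode_steps=100)  # illustrative horizon
logged_dataset = dataset.obtain_episodes(
    behavior_policies=behavior_policy,
    n_trajectories=10000,
    random_state=12345,
)

# prepare inputs for OPE; handing env over also records the on-policy
# policy value of each candidate policy, as required above
prep = CreateOPEInput(env=env)
input_dict = prep.obtain_whole_inputs(
    logged_dataset=logged_dataset,
    evaluation_policies=evaluation_policies,
    require_value_prediction=True,  # needed by model-based estimators (DM/DR)
    n_trajectories_on_policy_evaluation=100,
    random_state=12345,
)

# initialize the OPE instances
ope = OffPolicyEvaluation(
    logged_dataset=logged_dataset,
    ope_estimators=[DM(), TIS(), PDIS(), DR()],
)
cd_ope = CumulativeDistributionOPE(
    logged_dataset=logged_dataset,
    ope_estimators=[CD_IS(), CD_DR()],
)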
In the following examples, we also use a single logged dataset for simplicity. For the case of using multiple behavior policies or multiple logged datasets, refer to Example Codes with Multiple Logged Dataset and Behavior Policies.
Assessing OPE/OPS results#
The assessments use the OffPolicySelection (OPS) class.
from scope_rl.ope import OffPolicySelection
ops = OffPolicySelection(
ope=ope, # either ope or cd_ope must be given
cumulative_distribution_ope=cd_ope,
)
Assessments with conventional metrics#
The conventional metrics, including MSE, RankCorr, Regret, and Type I and II errors, are available through the selection functions of ops.
ranking_dict, metric_dict = ops.select_by_policy_value(
input_dict=input_dict,
return_metrics=True, # enable this option
return_by_dataframe=True,
)
To compare Regret@k, specify the value of k as follows.
ranking_dict, metric_dict = ops.select_by_policy_value(
input_dict=input_dict,
top_k_in_eval_metrics=1, # specify the value of k
return_metrics=True,
return_by_dataframe=True,
)
We can also specify the reward threshold for Type I and II errors as follows.
ranking_dict, metric_dict = ops.select_by_policy_value(
input_dict=input_dict,
safety_threshold=10.0, # specify the safety threshold
return_metrics=True,
return_by_dataframe=True,
)
To use the policy value relative to that of the behavior policy as the threshold, use the following option.
ranking_dict, metric_dict = ops.select_by_policy_value(
input_dict=input_dict,
relative_safety_criteria=0.90, # specify the relative safety threshold
return_metrics=True,
return_by_dataframe=True,
)
Similar evaluations are available in the following functions.
select_by_policy_value
select_by_policy_value_lower_bound
select_by_policy_value_via_cumulative_distribution_ope
select_by_conditional_value_at_risk
select_by_lower_quartile
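For instance, a CVaR-based ranking presumably follows the same calling pattern; the ope_alpha argument below mirrors the one used in the CVaR examples later on this page and is an assumption about the shared signature.

ranking_dict, metric_dict = ops.select_by_conditional_value_at_risk(
    input_dict=input_dict,
    ope_alpha=0.3,  # proportion of the lower tail (assumption)
    return_metrics=True,
    return_by_dataframe=True,
)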
Assessments with top-\(k\) deployment results#
SCOPE-RL also enables us to obtain and compare the statistics of the policy portfolio formed by each estimator as follows.
topk_metric_df = ops.obtain_topk_policy_value_selected_by_standard_ope(
input_dict=input_dict,
return_by_dataframe=True,
)
In the topk_metric_df, you will find the k-th, best, worst, and mean policy values, as well as the std of the policy values, among the top-\(k\) policy portfolio. We also report the proposed SharpeRatio@k metric as sharpe_ratio.
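Because return_by_dataframe=True yields a pandas DataFrame, the result can be inspected with standard pandas operations; the exact column layout depends on the compared estimators and is not assumed here.

# inspect the reported metrics with standard pandas operations
print(topk_metric_df.columns)
print(topk_metric_df.head())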
To additionally report the safety violation rate, specify the following options.
topk_metric_df = ops.obtain_topk_policy_value_selected_by_standard_ope(
input_dict=input_dict,
return_safety_violation_rate=True, # enable this option
safety_threshold=10.0, # specify the safety threshold
return_by_dataframe=True,
)
To use the value relative to the behavior policy as the safety requirement, use the following option.
topk_metric_df = ops.obtain_topk_policy_value_selected_by_standard_ope(
input_dict=input_dict,
return_safety_violation_rate=True, # enable this option
relative_safety_criteria=0.90, # specify the relative safety threshold
return_by_dataframe=True,
)
Similar evaluations are available in the following functions.
obtain_topk_policy_value_selected_by_standard_ope
obtain_topk_policy_value_selected_by_lower_bound
obtain_topk_policy_value_selected_by_cumulative_distribution_ope
obtain_topk_conditional_value_at_risk_selected_by_cumulative_distirbution_ope
obtain_topk_lower_quartile_selected_by_cumulative_distribution_ope
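As an example of the lower-bound variant, the call below presumably follows the same pattern (any additional options for constructing the confidence interval are left at their defaults here).

topk_metric_df = ops.obtain_topk_policy_value_selected_by_lower_bound(
    input_dict=input_dict,
    return_by_dataframe=True,
)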
We can also evaluate CVaR of top-\(k\) policies selected based on estimated policy value as follows.
topk_metric_df = ops.obtain_topk_conditional_value_at_risk_selected_by_standard_ope(
input_dict=input_dict,
return_by_dataframe=True,
ope_alpha=0.3,
)
We can also evaluate the lower quartile of top-\(k\) policies selected based on estimated policy value as follows.
topk_metric_df = ops.obtain_topk_lower_quartile_selected_by_standard_ope(
input_dict=input_dict,
return_by_dataframe=True,
ope_alpha=0.3,
)
Visualizing top-\(k\) deployment results#
SCOPE-RL also provides functions to visualize the above top-\(k\) policy performances.
ops.visualize_topk_policy_value_selected_by_standard_ope(
input_dict=input_dict,
metrics=["k-th", "best", "worst", "mean", "std", "safety_violation_rate", "sharpe_ratio"], # (default)
compared_estimators=["dm", "tis", "pdis", "dr"], # (optional)
relative_safety_criteria=1.0, # (optional)
clip_sharpe_ratio=True, # (optional)
ymax_sharpe_ratio=5.0, # (optional)
legend=True, # (optional)
)
Similar evaluations are available in the following functions.
visualize_topk_policy_value_selected_by_standard_ope
visualize_topk_policy_value_selected_by_lower_bound
visualize_topk_policy_value_selected_by_cumulative_distribution_ope
visualize_topk_conditional_value_at_risk_selected_by_cumulative_distirbution_ope
visualize_topk_lower_quartile_selected_by_cumulative_distribution_ope
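For instance, the cumulative distribution OPE variant can presumably be called in the same way (only the common arguments are shown; the optional arguments are assumed to carry over from the standard OPE example above).

ops.visualize_topk_policy_value_selected_by_cumulative_distribution_ope(
    input_dict=input_dict,
    relative_safety_criteria=1.0,  # (optional; assumption)
)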
Again, the visualization functions can also show the CVaR and the lower quartile of the top-\(k\) policies selected based on the estimated policy value as follows.
# visualize CVaR
ops.visualize_topk_conditional_value_at_risk_selected_by_standard_ope(
input_dict=input_dict,
metrics=["best", "worst", "mean", "std"], # (optional)
compared_estimators=["dm", "tis", "pdis", "dr"], # (optional)
ope_alpha=0.3,
)
# visualize lower quartile
ops.visualize_topk_lower_quartile_selected_by_standard_ope(
input_dict=input_dict,
metrics=["best", "worst", "mean", "std"], # (optional)
compared_estimators=["dm", "tis", "pdis", "dr"], # (optional)
ope_alpha=0.3,
)
Visualizing the true and estimated policy performances#
Finally, SCOPE-RL also implements functions to compare the true and estimated policy performances via scatter plots.
ops.visualize_policy_value_for_validation(
input_dict=input_dict,
n_cols=4, # (optional)
)
Note that the same y-axis can be shared across subplots via the sharey option.
ops.visualize_policy_value_for_validation(
input_dict=input_dict,
n_cols=4,
sharey=True, # enable this option
)
Similar evaluations are available in the following functions.
visualize_policy_value_for_validation
visualize_policy_value_of_cumulative_distribution_ope_for_validation
visualize_variance_for_validation
visualize_lower_quartile_for_validation
visualize_conditional_value_at_risk_for_validation
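These validation plots presumably share the calling pattern shown above. For instance, the CVaR scatter plot would be obtained as follows, where ope_alpha mirrors the CVaR examples earlier and is an assumption about the signature.

ops.visualize_conditional_value_at_risk_for_validation(
    input_dict=input_dict,
    ope_alpha=0.3,  # (assumption, mirroring the CVaR examples above)
    n_cols=4,  # (optional)
    sharey=True,  # (optional)
)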