Example Codes for Assessing OPE Estimators#

Here, we show example codes for assessing OPE/OPS results.

Prerequisite#

Here, we assume that an RL environment, a behavior policy, and evaluation policies are given as follows.

  • behavior_policy: an instance of BaseHead

  • evaluation_policies: a list of instance(s) of BaseHead

  • env: a gym environment (unnecessary when using real-world datasets)
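
As a reference, a minimal setup might look like the following sketch. Here, ddqn and cql are placeholders for already-trained d3rlpy algorithms, and the EpsilonGreedyHead arguments follow the library's quickstart; treat the exact names and signatures as assumptions that may differ across versions.

import gym
from scope_rl.policy import EpsilonGreedyHead  # a BaseHead wrapper for discrete actions

env = gym.make("CartPole-v1")  # any gym environment (unnecessary for real-world datasets)

# ddqn and cql are placeholders for trained d3rlpy algorithms (training omitted here).
behavior_policy = EpsilonGreedyHead(
    ddqn,
    n_actions=env.action_space.n,
    epsilon=0.3,  # exploration rate of the data-collection policy
    name="ddqn_epsilon_0.3",
    random_state=12345,
)
evaluation_policies = [
    EpsilonGreedyHead(ddqn, n_actions=env.action_space.n, epsilon=0.0, name="ddqn", random_state=12345),
    EpsilonGreedyHead(cql, n_actions=env.action_space.n, epsilon=0.1, name="cql_epsilon_0.1", random_state=12345),
]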

Additionally, we assume that the logged dataset, the OPE inputs, and the ope and/or cd_ope instances are ready to use. For initializing the ope and cd_ope instances, please refer to the documentation pages on OPE and cumulative distribution OPE, respectively.

  • logged_dataset: a dictionary containing the logged dataset

  • input_dict: a dictionary containing inputs for OPE

  • ope: an instance of OffPolicyEvaluation

  • cd_ope: an instance of CumulativeDistributionOPE

Note that, to run the following example codes, input_dict must contain the on-policy policy value of each candidate policy. This requirement is automatically satisfied when the logged dataset is collected by handing env over to the CreateInput class.
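
Concretely, the preparation might look like the following sketch. The class and method names (CreateOPEInput, obtain_whole_inputs) and the keyword arguments are assumptions based on the library's quickstart and may differ across versions; please check the pages referenced above for the exact interface.

from scope_rl.ope import CreateOPEInput  # input-preparation class (the CreateInput class above); name assumed
from scope_rl.ope import OffPolicyEvaluation
from scope_rl.ope import CumulativeDistributionOPE
from scope_rl.ope.discrete import DirectMethod as DM
from scope_rl.ope.discrete import TrajectoryWiseImportanceSampling as TIS
from scope_rl.ope.discrete import PerDecisionImportanceSampling as PDIS
from scope_rl.ope.discrete import DoublyRobust as DR

# Handing env over lets the class also record the on-policy value of each candidate policy.
prep = CreateOPEInput(env=env)
input_dict = prep.obtain_whole_inputs(
    logged_dataset=logged_dataset,
    evaluation_policies=evaluation_policies,
    require_value_prediction=True,  # needed by model-based estimators such as DM/DR
    random_state=12345,
)

# Standard OPE with a set of estimators (corresponding to "dm", "tis", "pdis", "dr" below).
ope = OffPolicyEvaluation(
    logged_dataset=logged_dataset,
    ope_estimators=[DM(), TIS(), PDIS(), DR()],
)
# Cumulative distribution OPE is initialized analogously with its own estimator classes.
cd_ope = CumulativeDistributionOPE(
    logged_dataset=logged_dataset,
    ope_estimators=[...],  # cumulative distribution OPE estimators (omitted here)
)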

In the following examples, we also use a single logged dataset for simplicity. For the case of using multiple behavior policies or multiple logged datasets, refer to Example Codes with Multiple Logged Dataset and Behavior Policies.

Assessing OPE/OPS results#

The assessments are performed via the OffPolicySelection (OPS) class.

from scope_rl.ope import OffPolicySelection

ops = OffPolicySelection(
    ope=ope,  # either ope or cd_ope must be given
    cumulative_distribution_ope=cd_ope,
)

Assessments with conventional metrics#

The conventional metrics, including MSE, RankCorr, Regret, and Type I and II errors, are available through the OPS functions.

ranking_dict, metric_dict = ops.select_by_policy_value(
    input_dict=input_dict,
    return_metrics=True,  # enable this option
    return_by_dataframe=True,
)

To compare Regret@k, specify the value of k as follows.

ranking_dict, metric_dict = ops.select_by_policy_value(
    input_dict=input_dict,
    top_k_in_eval_metrics=1,  # specify the value of k
    return_metrics=True,
    return_by_dataframe=True,
)

We can also specify the safety threshold (on the policy value) used for the Type I and II errors as follows.

ranking_dict, metric_dict = ops.select_by_policy_value(
    input_dict=input_dict,
    safety_threshold=10.0,  # specify the safety threshold
    return_metrics=True,
    return_by_dataframe=True,
)

To define the threshold relative to the policy value of the behavior policy, use the following option.

ranking_dict, metric_dict = ops.select_by_policy_value(
    input_dict=input_dict,
    relative_safety_criteria=0.90,  # specify the relative safety threshold
    return_metrics=True,
    return_by_dataframe=True,
)

Similar evaluations are available in the following functions.

  • select_by_policy_value

  • select_by_policy_value_lower_bound

  • select_by_policy_value_via_cumulative_distribution_ope

  • select_by_conditional_value_at_risk

  • select_by_lower_quartile
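
For example, the same metrics can be obtained for the ranking induced by cumulative distribution OPE. The call below simply mirrors the select_by_policy_value signature shown above, assuming that the listed functions share that signature.

# Ranking and metrics based on the policy value estimated via cumulative distribution OPE.
# The arguments mirror select_by_policy_value (assumed to share the same signature).
ranking_dict, metric_dict = ops.select_by_policy_value_via_cumulative_distribution_ope(
    input_dict=input_dict,
    return_metrics=True,
    return_by_dataframe=True,
)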

Assessments with top-\(k\) deployment results#

SCOPE-RL also enables us to obtain and compare the statistics of the policy portfolio formed by each estimator as follows.

topk_metric_df = ops.obtain_topk_policy_value_selected_by_standard_ope(
    input_dict=input_dict,
    return_by_dataframe=True,
)

In topk_metric_df, you will find the k-th, best, worst, and mean policy values, as well as the standard deviation of the policy values, among the top-\(k\) policy portfolio. We also report the proposed SharpeRatio@k metric as sharpe_ratio.
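
For a quick look at the results, the returned dataframe can be inspected with ordinary pandas operations. Note that the column name used below (sharpe_ratio) is an assumption based on the description above; check the printed columns first.

# Check which metrics are reported, then sort the rows by SharpeRatio@k.
# NOTE: "sharpe_ratio" as a column name is an assumption; verify it with the printout.
print(topk_metric_df.columns)
print(topk_metric_df.sort_values("sharpe_ratio", ascending=False).head())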

Note that, to additionally report the safety violation rate, specify the following options.

topk_metric_df = ops.obtain_topk_policy_value_selected_by_standard_ope(
    input_dict=input_dict,
    return_safety_violation_rate=True,  # enable this option
    safety_threshold=10.0,  # specify the safety threshold
    return_by_dataframe=True,
)

To use the value relative to the behavior policy as the safety requirement, use the following option.

topk_metric_df = ops.obtain_topk_policy_value_selected_by_standard_ope(
    input_dict=input_dict,
    return_safety_violation_rate=True,  # enable this option
    relative_safety_criteria=0.90,  # specify the relative safety threshold
    return_by_dataframe=True,
)

Similar evaluations are available in the following functions.

  • obtain_topk_policy_value_selected_by_standard_ope

  • obtain_topk_policy_value_selected_by_lower_bound

  • obtain_topk_policy_value_selected_by_cumulative_distribution_ope

  • obtain_topk_conditional_value_at_risk_selected_by_cumulative_distirbution_ope

  • obtain_topk_lower_quartile_selected_by_cumulative_distribution_ope

We can also evaluate the CVaR of the top-\(k\) policies selected based on the estimated policy value as follows.

topk_metric_df = ops.obtain_topk_conditional_value_at_risk_selected_by_standard_ope(
    input_dict=input_dict,
    return_by_dataframe=True,
    ope_alpha=0.3,
)

Similarly, we can evaluate the lower quartile of the top-\(k\) policies selected based on the estimated policy value as follows.

topk_metric_df = ops.obtain_topk_lower_quartile_selected_by_standard_ope(
    input_dict=input_dict,
    return_by_dataframe=True,
    ope_alpha=0.3,
)

Visualizing top-\(k\) deployment results#

SCOPE-RL also provides functions to visualize the above top-\(k\) policy performances.

ops.visualize_topk_policy_value_selected_by_standard_ope(
    input_dict=input_dict,
    metrics=["k-th", "best", "worst", "mean", "std", "safety_violation_rate", "sharpe_ratio"],  # (default)
    compared_estimators=["dm", "tis", "pdis", "dr"],  # (optional)
    relative_safety_criteria=1.0,  # (optional)
    clip_sharpe_ratio=True,  # (optional)
    ymax_sharpe_ratio=5.0,  # (optional)
    legend=True,  # (optional)
)

Similar evaluations are available in the following functions.

  • visualize_topk_policy_value_selected_by_standard_ope

  • visualize_topk_policy_value_selected_by_lower_bound

  • visualize_topk_policy_value_selected_by_cumulative_distribution_ope

  • visualize_topk_conditional_value_at_risk_selected_by_cumulative_distirbution_ope

  • visualize_topk_lower_quartile_selected_by_cumulative_distribution_ope

Again, the visualization functions can also show the CVaR and lower quartile of the top-\(k\) policies selected based on the estimated policy value as follows.

# visualize CVaR
ops.visualize_topk_conditional_value_at_risk_selected_by_standard_ope(
    input_dict=input_dict,
    metrics=["best", "worst", "mean", "std"],  # (optional)
    compared_estimators=["dm", "tis", "pdis", "dr"],  # (optional)
    ope_alpha=0.3,
)
# visualize lower quartile
ops.visualize_topk_lower_quartile_selected_by_standard_ope(
    input_dict=input_dict,
    metrics=["best", "worst", "mean", "std"],  # (optional)
    compared_estimators=["dm", "tis", "pdis", "dr"],  # (optional)
    ope_alpha=0.3,
)

Visualizing the true and estimated policy performances#

Finally, SCOPE-RL also implements functions to compare the true and estimated policy performances via scatter plots.

ops.visualize_policy_value_for_validation(
    input_dict=input_dict,
    n_cols=4,  # (optional)
)

Note that the same y-axis scale is shared across the plots when the sharey option is enabled.

ops.visualize_policy_value_for_validation(
    input_dict=input_dict,
    n_cols=4,
    sharey=True,  # enable this option
)

Similar evaluations are available in the following functions.

  • visualize_policy_value_for_validation

  • visualize_policy_value_of_cumulative_distribution_ope_for_validation

  • visualize_variance_for_validation

  • visualize_lower_quartile_for_validation

  • visualize_conditional_value_at_risk_for_validation
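
For example, the CVaR variant might be called as follows. The n_cols and sharey arguments mirror visualize_policy_value_for_validation above, and ope_alpha is assumed to follow the convention of the CVaR functions shown earlier.

# Scatter plots comparing the true and estimated CVaR for each estimator.
# n_cols/sharey mirror visualize_policy_value_for_validation; ope_alpha is assumed.
ops.visualize_conditional_value_at_risk_for_validation(
    input_dict=input_dict,
    ope_alpha=0.3,
    n_cols=4,
    sharey=True,
)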
