Example Codes for Off-Policy Selection#

Here, we show example codes for conducting policy selection via OPE (i.e., Off-Policy Selection; OPS).

Prerequisite#

Here, we assume that an RL environment, a behavior policy, and evaluation policies are given as follows.

  • behavior_policy: an instance of BaseHead

  • evaluation_policies: a list of instance(s) of BaseHead

  • env: a gym environment (unnecessary when using real-world datasets)

Additionally, we assume that the logged dataset, inputs, and the ope and/or cd_ope instances are ready to use. For initializing the ope and cd_ope instances, please refer to the example pages for basic OPE and cumulative distribution OPE, respectively; a minimal initialization sketch is also shown after the list below.

  • logged_dataset: a dictionary containing the logged dataset

  • input_dict: a dictionary containing inputs for OPE

  • ope: an instance of OffPolicyEvaluation

  • cd_ope: an instance of CumulativeDistributionOPE
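
For reference, a minimal sketch of initializing the ope and cd_ope instances is shown below. It assumes a discrete-action setting, and the choice of estimators (TIS, PDIS, and the cumulative distribution TIS) is an arbitrary illustration; please check the API reference for the available estimators and exact import paths.

from scope_rl.ope import OffPolicyEvaluation
from scope_rl.ope import CumulativeDistributionOPE
from scope_rl.ope.discrete import TrajectoryWiseImportanceSampling as TIS
from scope_rl.ope.discrete import PerDecisionImportanceSampling as PDIS
from scope_rl.ope.discrete import CumulativeDistributionTIS as CD_IS

# basic OPE: estimates the expected policy value
ope = OffPolicyEvaluation(
    logged_dataset=logged_dataset,
    ope_estimators=[TIS(), PDIS()],
)

# cumulative distribution OPE: estimates the distribution of the trajectory-wise reward
cd_ope = CumulativeDistributionOPE(
    logged_dataset=logged_dataset,
    ope_estimators=[CD_IS()],
)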

Note that in the following example, we use a single logged dataset for simplicity. For the case of using multiple behavior policies or multiple logged datasets, refer to Example Codes with Multiple Logged Datasets and Behavior Policies.

Off-Policy Selection#

The OffPolicySelection class calls the ope and cd_ope instances to conduct OPE.

from scope_rl.ope import OffPolicySelection

ops = OffPolicySelection(
    ope=ope,  # either ope or cd_ope must be given
    cumulative_distribution_ope=cd_ope,
)

OPS via basic OPE#

By default, the following function returns the estimated ranking of the evaluation policies along with their estimated policy values.

ranking_dict = ops.select_by_policy_value(
    input_dict=input_dict,
)
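
The returned dictionary is organized per OPE estimator. The inspection sketch below assumes the result holds the estimated ranking and the corresponding policy values under keys such as estimated_ranking and estimated_policy_value; these key names are assumptions, so please check the API reference for the exact structure.

# a minimal inspection sketch; the key names are assumptions
for estimator_name, result in ranking_dict.items():
    print(estimator_name)
    print(result["estimated_ranking"])       # policy names, best to worst (assumed key)
    print(result["estimated_policy_value"])  # estimated values in ranked order (assumed key)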

To return the results in a dataframe format, enable the following option.

ranking_df = ops.select_by_policy_value(
    input_dict=input_dict,
    return_by_dataframe=True,  # return a dataframe instead of a dictionary
)

With the following option, we can also verify the true (on-policy) policy value. Note that this option is applicable only when the on-policy policy values of the evaluation policies are recorded in input_dict.

ranking_df = ops.select_by_policy_value(
    input_dict=input_dict,
    return_true_values=True,  # also return the true (on-policy) policy values
    return_by_dataframe=True,
)

SCOPE-RL also handles OPS with high-confidence OPE, which ranks the candidate policies by the lower bound of their estimated policy values, as follows.

ranking_df = ops.select_by_policy_value_lower_bound(
    input_dict=input_dict,
    cis=["bootstrap", "bernstein", "hoeffding", "ttest"],  # methods for estimating the lower bound
    return_by_dataframe=True,
    random_state=12345,
)
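
For intuition about what the lower-bound selection does, the sketch below computes a Hoeffding-style lower confidence bound on a policy value from n trajectory-wise rewards bounded in [0, reward_max]. This is a conceptual illustration, not SCOPE-RL's internal implementation.

import numpy as np

def hoeffding_lower_bound(rewards, reward_max, delta=0.05):
    # mean reward minus the Hoeffding deviation term:
    # for rewards in [0, reward_max], the true mean falls below
    # this bound with probability at most delta
    n = len(rewards)
    return np.mean(rewards) - reward_max * np.sqrt(np.log(1 / delta) / (2 * n))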

OPS via cumulative distribution OPE#

We can also conduct OPS via CD-OPE in a manner similar to basic OPE.

First, the following conducts OPS based on the policy value estimated via CD-OPE.

ranking_df = ops.select_by_policy_value_via_cumulative_distribution_ope(
    input_dict=input_dict,
    return_by_dataframe=True,
)

OPS can also be conducted based on the conditional value at risk (CVaR) and the lower quartile, as follows.

# CVaR
ranking_df = ops.select_by_conditional_value_at_risk(
    input_dict=input_dict,
    return_by_dataframe=True,
    alpha=0.3,  # specify the proportion of the sided region
)
# lower quartile
ranking_df = ops.select_by_lower_quartile(
    input_dict=input_dict,
    return_by_dataframe=True,
    alpha=0.3,  # specify the proportion of the sided region
)
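
For intuition about alpha: with alpha = 0.3, CVaR is the expected trajectory-wise reward within the worst 30% of trajectories, while the lower quartile function ranks policies by the alpha-quantile (here, the 30th percentile) of the reward distribution. A standalone sketch of the empirical (lower) CVaR, independent of SCOPE-RL's implementation:

import numpy as np

def empirical_cvar(rewards, alpha=0.3):
    # average of the lowest (alpha * 100)% of trajectory-wise rewards
    sorted_rewards = np.sort(rewards)
    n_tail = max(int(np.ceil(alpha * len(rewards))), 1)
    return sorted_rewards[:n_tail].mean()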

Obtaining oracle selection results#

By default, the following function returns the ranking of the evaluation policies along with their ground-truth policy values. Note that this function is applicable only when the on-policy policy values of the evaluation policies are recorded in input_dict.

oracle_selection_dict = ops.obtain_true_selection_result(
    input_dict=input_dict,
)

To return the results in a dataframe format, enable the following option.

oracle_selection_df = ops.obtain_true_selection_result(
    input_dict=input_dict,
    return_by_dataframe=True,  # return a dataframe instead of a dictionary
)

To also return the variance, enable the following option.

oracle_selection_df = ops.obtain_true_selection_result(
    input_dict=input_dict,
    return_variance=True,  # also return the variance
    return_by_dataframe=True,
)

To return CVaR and the ranking of candidate policies based on CVaR, enable the following option.

oracle_selection_df = ops.obtain_true_selection_result(
    input_dict=input_dict,
    return_conditional_value_at_risk=True,  # also return CVaR and the CVaR-based ranking
    cvar_alpha=0.3,  # specify the proportion of the sided region
    return_by_dataframe=True,
)

To return the lower quartile and the ranking of candidate policies based on the lower quartile, enable the following option.

oracle_selection_df = ops.obtain_true_selection_result(
    input_dict=input_dict,
    return_lower_quartile=True,  # also return the lower quartile and the quartile-based ranking
    quartile_alpha=0.3,  # specify the proportion of the sided region
    return_by_dataframe=True,
)

Calling visualization functions from ope / cd_ope instances#

Finally, note that the visualization functions of the ope and cd_ope instances are also available via the ops instance, as follows.

# ope.visualize_off_policy_estimates(...)
ops.visualize_policy_value_for_selection(...)

# cd_ope.visualize_cumulative_distribution_function(...)
ops.visualize_cumulative_distribution_function_for_selection(...)

# cd_ope.visualize_policy_value(...)
ops.visualize_policy_value_of_cumulative_distribution_ope_for_selection(...)

# cd_ope.visualize_conditional_value_at_risk(...)
ops.visualize_conditional_value_at_risk_for_selection(...)

# cd_ope.visualize_interquartile_range(...)
ops.visualize_interquartile_range_for_selection(...)
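
For example, the policy value visualization can be called with the same input_dict used above; further keyword arguments are omitted here, so please see the API reference for the full signatures.

ops.visualize_policy_value_for_selection(
    input_dict=input_dict,
)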

See also

For the evaluation of OPS results, please also refer to Example Codes for Assessing OPE Estimators.
