Example Codes with Multiple Logged Datasets and Behavior Policies#

Here, we show example codes for conducting OPE and OPS with multiple logged datasets.

See also

For preparation, please also refer to the documentation pages that cover the case with a single logged dataset.

Logged Dataset#

Here, we assume that an RL environment, behavior policies, and evaluation policies are given as follows (a minimal construction sketch follows the list).

  • behavior_policies: an instance of BaseHead or a list of BaseHead instances

  • evaluation_policies: a list of BaseHead instances

  • env: a gym environment (unnecessary when using real-world datasets)
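
For example, both sets of policies can be constructed by wrapping a trained agent with a policy head such as EpsilonGreedyHead. The following is a minimal sketch that assumes a trained d3rlpy agent named ddqn; the epsilon values and policy names are illustrative.

from scope_rl.policy import EpsilonGreedyHead

# wrap a trained agent (here assumed to be a d3rlpy DQN-style agent `ddqn`)
# so that action-choice probabilities are available to the OPE estimators
behavior_policies = [
    EpsilonGreedyHead(
        ddqn,
        n_actions=env.action_space.n,
        epsilon=epsilon,
        name=f"ddqn_epsilon_{epsilon}",
        random_state=random_state,
    )
    for epsilon in [0.1, 0.3]  # illustrative exploration rates
]
evaluation_policies = [
    EpsilonGreedyHead(
        ddqn,
        n_actions=env.action_space.n,
        epsilon=0.0,  # the (near-)greedy policy to be evaluated
        name="ddqn_greedy",
        random_state=random_state,
    )
]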

Then, we can collect multiple logged datasets with a single behavior policy as follows.

from scope_rl.dataset import SyntheticDataset

# initialize dataset
dataset = SyntheticDataset(
    env=env,
    max_episode_steps=env.step_per_episode,
)
# obtain logged dataset
multiple_logged_dataset = dataset.obtain_episodes(
    behavior_policies=behavior_policies[0],  # a single behavior policy
    n_datasets=5,  # specify the number of datasets (i.e., the number of different random seeds)
    n_trajectories=10000,
    random_state=random_state,
)

Similarly, SCOPE-RL can collect multiple logged datasets with multiple behavior policies as follows.

multiple_logged_dataset = dataset.obtain_episodes(
    behavior_policies=behavior_policies,  # multiple behavior policies
    n_datasets=5,  # specify the number of datasets (i.e., the number of different random seeds) for each behavior policy
    n_trajectories=10000,
    random_state=random_state,
)

The multiple logged datasets are returned as an instance of MultipleLoggedDataset. Note that we can also manually create multiple logged datasets as follows.

from scope_rl.utils import MultipleLoggedDataset

multiple_logged_dataset = MultipleLoggedDataset(
    action_type="discrete",
    path="logged_dataset/",  # specify the path to the dataset
)

for behavior_policy in behavior_policies:
    single_logged_dataset = dataset.obtain_episodes(
        behavior_policies=behavior_policy,  # a single behavior policy
        n_trajectories=10000,
        random_state=random_state,
    )

    # add a single_logged_dataset to multiple_logged_dataset
    multiple_logged_dataset.add(
        single_logged_dataset,
        behavior_policy_name=behavior_policy.name,
        dataset_id=0,
    )

Once you create the multiple logged datasets, each dataset is accessible via the following code.

single_logged_dataset = multiple_logged_dataset.get(
    behavior_policy_name=behavior_policies[0].name, dataset_id=0,
)

MultipleLoggedDataset also has the following properties.

# a list of the name of behavior policies
multiple_logged_dataset.behavior_policy_names

# a dictionary of the number of datasets for each behavior policy
multiple_logged_dataset.n_datasets

Inputs#

The next step is to create the inputs for OPE estimators. For brevity, we show the case of importance-sampling-based estimators; for model-based and marginal importance-sampling-based estimators, please also refer to Example Codes for Basic OPE.

We first show how to create inputs for all the logged datasets stored in multiple_logged_dataset (which is essentially the same as the case of using a single_logged_dataset).

from scope_rl.ope import CreateOPEInput

# initialize class to create inputs
prep = CreateOPEInput(
    env=env,  # unnecessary when using real-world dataset
)
# create inputs for the OPE estimators
multiple_input_dict = prep.obtain_whole_inputs(
    logged_dataset=multiple_logged_dataset,
    evaluation_policies=evaluation_policies,
    n_trajectories_on_policy_evaluation=100,  # when evaluating OPE (optional)
    random_state=random_state,
)

The above code returns multiple_input_dict as an instance of MultipleInputDict. Each input dictionary is accessible via the following code.

single_input_dict = multiple_input_dict.get(
    behavior_policy_name=behavior_policies[0].name, dataset_id=0,
)

MultipleInputDict has the following properties.

# a list of the name of behavior policies
multiple_input_dict.behavior_policy_names

# a dictionary of the number of datasets for each behavior policy
multiple_input_dict.n_datasets

# a dictionary of the number of evaluation policies of each input dict
multiple_input_dict.n_eval_policies

# check if the contained logged datasets use the same evaluation policies
multiple_input_dict.use_same_eval_policy_across_dataset

Note that it is also possible to create a single input dict using the CreateOPEInput class by specifying the behavior policy and the dataset id as follows.

single_input_dict = prep.obtain_whole_inputs(
    logged_dataset=multiple_logged_dataset,
    behavior_policy_name=behavior_policies[0].name,  # specify the behavior policy
    dataset_id=0,                                    # specify the dataset id
    evaluation_policies=evaluation_policies,
    random_state=random_state,
)

Off-Policy Evaluation#

SCOPE-RL enables OPE with multiple logged datasets and multiple input dicts without additional effort. Specifically, we can estimate the policy value via basic OPE as follows.

from scope_rl.ope import OffPolicyEvaluation as OPE

# initialize the OPE class
ope = OPE(
    logged_dataset=multiple_logged_dataset,  # pass the multiple logged datasets
    ope_estimators=estimators,  # a list of OPE estimators
)
# estimate policy value and its confidence intervals
policy_value_df_dict, policy_value_interval_df_dict = ope.summarize_off_policy_estimates(
    input_dict=multiple_input_dict,  # pass the multiple input dicts
    random_state=random_state,
)

The result for each logged dataset is accessible by the following keys.

policy_value_df_dict[behavior_policies[0].name][dataset_id]
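
Since MultipleInputDict exposes the available behavior policy names and the number of datasets per policy (see the properties above), one way to iterate over every result is sketched below, reusing the access pattern shown above.

# a minimal sketch: loop over all behavior policies and dataset ids
for behavior_policy_name in multiple_input_dict.behavior_policy_names:
    for dataset_id_ in range(multiple_input_dict.n_datasets[behavior_policy_name]):
        policy_value_df = policy_value_df_dict[behavior_policy_name][dataset_id_]
        print(behavior_policy_name, dataset_id_, policy_value_df.shape)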

We can also specify the behavior policy and dataset id when calling the function as follows.

policy_value_df_dict, policy_value_interval_df_dict = ope.summarize_off_policy_estimates(
    input_dict=input_dict,  # either multiple or single input dict
    behavior_policy_name=behavior_policies[0].name,  # specify the behavior policy name
    dataset_id=0,  # specify the dataset id
    random_state=random_state,
)

Next, to compare the OPE results on a specific logged dataset, use the following function.

ope.visualize_off_policy_estimates(
    input_dict,  # either multiple or single input dict
    behavior_policy_name=behavior_policies[0].name,  # specify the behavior policy name
    dataset_id=0,  # specify the dataset id
    random_state=random_state,
)

We can also compare the results obtained from multiple logged datasets as follows.

ope.visualize_policy_value_with_multiple_estimates(
    input_dict=multiple_input_dict,
    plot_type="ci",  #
    hue="policy",
)
ope.visualize_policy_value_with_multiple_estimates(
    input_dict=multiple_input_dict,
    plot_type="violin",  #
    hue="policy",
)
ope.visualize_policy_value_with_multiple_estimates(
    input_dict=multiple_input_dict,
    plot_type="scatter",  #
    hue="policy",
)

Cumulative Distribution Off-Policy Evaluation#

Cumulative distribution OPE (CD-OPE) uses an interface similar to that of basic OPE.

from scope_rl.ope import CumulativeDistributionOPE

# initialize the OPE class
cd_ope = CumulativeDistributionOPE(
    logged_dataset=multiple_logged_dataset,  # pass the multiple logged datasets
    ope_estimators=estimators,  # a list of OPE estimators
)
# estimate the cumulative distribution function (CDF) of the trajectory-wise reward
cdf_dict = cd_ope.estimate_cumulative_distribution_function(
    input_dict=multiple_input_dict,  # pass the multiple input dicts
)

The result for each logged dataset is accessible by the following keys.

cdf_dict[behavior_policies[0].name][dataset_id]

We can also specify the behavior policy and dataset id when calling the function as follows.

cdf_dict = cd_ope.estimate_cumulative_distribution_function(
    input_dict=multiple_input_dict,  # either multiple or single input dict
    behavior_policy_name=behavior_policies[0].name,  # specify the behavior policy name
    dataset_id=0,  # specify the dataset id
)

Similar codes also work for the following functions.

  • estimate_cumulative_distribution_function

  • estimate_mean

  • estimate_variance

  • estimate_conditional_value_at_risk

  • estimate_interquartile_range
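
For example, the same calling pattern applies to estimate_mean (a sketch; the behavior_policy_name and dataset_id arguments are optional here as well).

# a sketch: select a specific logged dataset when estimating the mean
mean_dict = cd_ope.estimate_mean(
    input_dict=multiple_input_dict,
    behavior_policy_name=behavior_policies[0].name,
    dataset_id=0,
)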

The following code visualizes the estimated CDFs on a specific logged dataset.

cd_ope.visualize_cumulative_distribution_function(
    input_dict,  # either multiple or single input dict
    behavior_policy_name=behavior_policies[0].name,  # specify the behavior policy name
    dataset_id=0,  # specify the dataset id
)

Similar codes also work for the following functions.

  • visualize_cumulative_distribution_function

  • visualize_policy_value

  • visualize_conditional_value_at_risk

  • visualize_interquartile_range

Next, SCOPE-RL also visualizes the CDFs estimated on multiple logged datasets as follows.

The first example shows the case of using a single behavior policy and multiple logged datasets.

cd_ope.visualize_cumulative_distribution_function_with_multiple_estimates(
    multiple_input_dict,
    behavior_policy_name=behavior_policies[0].name,  # specify the behavior policy name
    plot_type="ci_hue",  #
    scale_min=0.0,  # set the reward scale (i.e., x-axis or bins of CDF)
    scale_max=10.0,
    n_partition=20,
    n_cols=4,
)

The next example compares the results across multiple behavior policies.

cd_ope.visualize_cumulative_distribution_function_with_multiple_estimates(
    multiple_input_dict,
    plot_type="ci_behavior_policy",  #
    hue="policy",  #
    scale_min=0.0,
    scale_max=10.0,
    n_partition=20,
)

The final example shows the CDF estimated on each logged dataset of a single behavior policy.

cd_ope.visualize_cumulative_distribution_function_with_multiple_estimates(
    multiple_input_dict,
    behavior_policy_name=behavior_policies[0].name,  # specify the behavior policy name
    plot_type="enumerate",  #
    hue="policy",  #
    scale_min=0.0,
    scale_max=10.0,
    n_partition=20,
)

To compare the point-wise estimation results across multiple logged datasets, use the following code.

ope.visualize_policy_value_with_multiple_estimates(
    multiple_input_dict,
    plot_type="ci",  # "violin", "scatter"
    hue="policy",
)

Similar codes also work for the following functions.

  • visualize_policy_value_with_multiple_estimates

  • visualize_variance_with_multiple_estimates

  • visualize_conditional_value_at_risk_with_multiple_estimates

  • visualize_interquartile_range_with_multiple_estimates
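
For example, a sketch of the conditional value at risk version, assuming it is provided by the CumulativeDistributionOPE class and accepts the same arguments.

# a sketch: the CVaR counterpart is assumed to follow the same calling pattern
cd_ope.visualize_conditional_value_at_risk_with_multiple_estimates(
    multiple_input_dict,
    plot_type="ci",  # "violin", "scatter"
    hue="policy",
)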

Off-Policy Selection#

SCOPE-RL also enables OPS with multiple logged datasets without any additional effort.

from scope_rl.ope import OffPolicySelection

# initialize the OPS class
ops = OffPolicySelection(
    ope=ope,  # either ope or cd_ope must be given
    cumulative_distribution_ope=cd_ope,
)
# OPS based on estimated policy value
ranking_df_dict, metric_df_dict = ops.select_by_policy_value(
    multiple_input_dict,
    return_metrics=True,
    return_by_dataframe=True,
)

The result for each logged dataset is accessible by the following keys.

ranking_df_dict[behavior_policies[0].name][dataset_id]

The following code obtains the OPS result for a specific logged dataset.

ranking_df, metric_df = ops.select_by_policy_value(
    input_dict=input_dict,
    behavior_policy_name=behavior_policies[0].name,  # specify the behavior policy name
    dataset_id=0,  # specify the dataset id
    return_metrics=True,
    return_by_dataframe=True,
)

Similar codes also work for the following functions.

  • select_by_policy_value

  • select_by_policy_value_lower_bound

  • select_by_policy_value_via_cumulative_distribution_ope

  • select_by_conditional_value_at_risk

  • select_by_lower_quartile

  • obtain_true_selection_result
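
For example, risk-sensitive OPS via the conditional value at risk follows the same calling pattern (a sketch, assuming the default risk level).

# a sketch: OPS based on the estimated conditional value at risk
cvar_ranking_df_dict, cvar_metric_df_dict = ops.select_by_conditional_value_at_risk(
    multiple_input_dict,
    return_metrics=True,
    return_by_dataframe=True,
)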

Assessments of OPE via top-k Policy Selection#

Next, we show how to assess the top-k policy selection with multiple logged datasets.

topk_metric_df_dict = ops.obtain_topk_policy_value_selected_by_standard_ope(
    input_dict=multiple_input_dict,
    return_by_dataframe=True,
)

The result for each logged dataset is accessible by the following keys.

topk_metric_df_dict[behavior_policies[0].name][dataset_id]

The following code compares the top-k policies selected by each OPE estimator on a specific logged dataset.

topk_metric_df = ops.obtain_topk_policy_value_selected_by_standard_ope(
    input_dict,  # either multiple or single input dict
    behavior_policy_name=behavior_policies[0].name,  # specify the behavior policy name
    dataset_id=0,  # specify the dataset id
    return_by_dataframe=True,
    random_state=random_state,
)

Similar codes also work for the following functions.

  • obtain_topk_policy_value_selected_by_standard_ope

  • obtain_topk_policy_value_selected_by_lower_bound

  • obtain_topk_policy_value_selected_by_cumulative_distribution_ope

  • obtain_topk_conditional_value_at_risk_selected_by_standard_ope

  • obtain_topk_conditional_value_at_risk_selected_by_cumulative_distribution_ope

  • obtain_topk_lower_quartile_selected_by_standard_ope

  • obtain_topk_lower_quartile_selected_by_cumulative_distribution_ope

Visualization functions also work in a similar manner.

ops.visualize_topk_policy_value_selected_by_standard_ope(
    multiple_input_dict,
    compared_estimators=["dm", "tis", "pdis", "dr"],
    visualize_ci=True,
    safety_threshold=6.0,  # please specify this option instead of `relative_safety_criteria`
    legend=True,
    random_state=random_state,
)

When using a single behavior policy, the relative_safety_criteria option becomes available.

ops.visualize_topk_policy_value_selected_by_standard_ope(
    multiple_input_dict,
    behavior_policy_name=behavior_policies[0].name,
    compared_estimators=["dm", "tis", "pdis", "dr"],
    visualize_ci=True,
    relative_safety_criteria=1.0,  # becomes available when a single behavior policy is specified
    legend=True,
    random_state=random_state,
)

When using a single logged dataset, specify both the behavior policy name and the dataset id.

ops.visualize_topk_policy_value_selected_by_standard_ope(
    input_dict,  # either multiple or single input dict
    behavior_policy_name=behavior_policies[0].name,  # specify the behavior policy name
    dataset_id=0,  # specify the dataset id
    compared_estimators=["dm", "tis", "pdis", "dr"],
    visualize_ci=True,
    safety_threshold=6.0,  # please specify this option instead of `relative_safety_criteria`
    legend=True,
    random_state=random_state,
)

Similar codes also work for the following functions.

  • visualize_topk_policy_value_selected_by_standard_ope

  • visualize_topk_policy_value_selected_by_lower_bound

  • visualize_topk_policy_value_selected_by_cumulative_distribution_ope

  • visualize_topk_conditional_value_at_risk_selected_by_standard_ope

  • visualize_topk_conditional_value_at_risk_selected_by_cumulative_distribution_ope

  • visualize_topk_lower_quartile_selected_by_standard_ope

  • visualize_topk_lower_quartile_selected_by_cumulative_distribution_ope

Validating True and Estimated Policy Performance#

Finally, we also provide functions to compare the true and estimated policy performance.

ops.visualize_policy_value_for_validation(
    multiple_input_dict,
    n_cols=4,
    share_axes=True,
)

When using a single behavior policy, specify the behavior policy name.

ops.visualize_policy_value_for_validation(
    input_dict,  # either multiple or single input dict
    behavior_policy_name=behavior_policies[0].name,  # specify the behavior policy name
    n_cols=4,
    share_axes=True,
)

When using a single logged dataset, specify both the behavior policy name and dataset id.

ops.visualize_policy_value_for_validation(
    input_dict,  # either multiple or single input dict
    behavior_policy_name=behavior_policies[0].name,  # specify the behavior policy name
    dataset_id=0,  # specify the dataset id
    n_cols=4,
    share_axes=True,
)

Similar codes also work for the following functions.

  • visualize_policy_value_for_validation

  • visualize_policy_value_lower_bound_for_validation

  • visualize_variance_for_validation

  • visualize_conditional_value_at_risk_for_validation

  • visualize_lower_bound_for_validation
