Example Codes with Multiple Logged Datasets and Behavior Policies#
Here, we show example codes for conducting OPE and OPS with multiple logged datasets.
See also
For preparation, please also refer to the documentation pages on the case with a single logged dataset.
Logged Dataset#
Here, we assume that an RL environment, behavior policies, and evaluation policies are given as follows.
behavior_policy: an instance of BaseHead or a list of instance(s) of BaseHead
evaluation_policies: a list of instance(s) of BaseHead
env: a gym environment (unnecessary when using real-world datasets)
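For instance, these policies can be prepared by wrapping an agent trained with d3rlpy in one of SCOPE-RL's policy heads. The following is a minimal sketch: the trained ddqn agent, the epsilon values, and the exact constructor arguments of EpsilonGreedyHead are illustrative assumptions here, so please check the API reference for the precise signature.

from scope_rl.policy import EpsilonGreedyHead

# a sketch: wrap a (hypothetical) trained d3rlpy agent `ddqn` as
# epsilon-greedy behavior policies; evaluation policies are defined likewise
behavior_policies = [
    EpsilonGreedyHead(
        ddqn,
        n_actions=env.action_space.n,
        epsilon=epsilon,  # degree of exploration
        name=f"ddqn_epsilon_{epsilon}",
        random_state=random_state,
    )
    for epsilon in [0.1, 0.3]
]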
Then, we can collect multiple logged datasets with a single behavior policy as follows.
from scope_rl.dataset import SyntheticDataset

# initialize dataset
dataset = SyntheticDataset(
    env=env,
    max_episode_steps=env.step_per_episode,
)

# obtain logged datasets
multiple_logged_dataset = dataset.obtain_episodes(
    behavior_policies=behavior_policies[0],  # a single behavior policy
    n_datasets=5,  # the number of datasets (i.e., the number of different random seeds)
    n_trajectories=10000,
    random_state=random_state,
)
Similarly, SCOPE-RL can also collect multiple logged datasets with multiple behavior policies as follows.
multiple_logged_dataset = dataset.obtain_episodes(
    behavior_policies=behavior_policies,  # multiple behavior policies
    n_datasets=5,  # the number of datasets (i.e., the number of different random seeds) for each behavior policy
    n_trajectories=10000,
    random_state=random_state,
)
The multiple logged datasets are returned as an instance of MultipleLoggedDataset.
Note that we can also manually create multiple logged datasets as follows.
from scope_rl.utils import MultipleLoggedDataset

multiple_logged_dataset = MultipleLoggedDataset(
    action_type="discrete",
    path="logged_dataset/",  # specify the path to the dataset
)

for behavior_policy in behavior_policies:
    single_logged_dataset = dataset.obtain_episodes(
        behavior_policies=behavior_policy,  # a single behavior policy
        n_trajectories=10000,
        random_state=random_state,
    )
    # add single_logged_dataset to multiple_logged_dataset
    multiple_logged_dataset.add(
        single_logged_dataset,
        behavior_policy_name=behavior_policy.name,
        dataset_id=0,
    )
Once you have created multiple logged datasets, each dataset is accessible via the following code.
single_logged_dataset = multiple_logged_dataset.get(
    behavior_policy_name=behavior_policies[0].name, dataset_id=0,
)
MultipleLoggedDataset also has the following properties.
# a list of the names of the behavior policies
multiple_logged_dataset.behavior_policy_names
# a dictionary of the number of datasets for each behavior policy
multiple_logged_dataset.n_datasets
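These properties make it easy to iterate over every contained dataset. The following minimal sketch visits each dataset in turn, using only the get method and the properties above.

# a sketch: loop over all logged datasets contained in multiple_logged_dataset
for behavior_policy_name in multiple_logged_dataset.behavior_policy_names:
    for dataset_id in range(multiple_logged_dataset.n_datasets[behavior_policy_name]):
        single_logged_dataset = multiple_logged_dataset.get(
            behavior_policy_name=behavior_policy_name,
            dataset_id=dataset_id,
        )
        # each single_logged_dataset follows the single-dataset format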
Inputs#
The next step is to create the inputs for OPE estimators. For brevity, we show only the case of creating inputs for importance sampling-based estimators here. For model-based and marginal importance sampling-based estimators, please also refer to Example Codes for Basic OPE.
We first show the case of creating inputs for the whole logged datasets stored in multiple_logged_dataset (which is essentially the same as the case of using single_logged_dataset).
from scope_rl.ope import CreateOPEInput

# initialize the class to create inputs
prep = CreateOPEInput(
    env=env,  # unnecessary when using real-world datasets
)

# create inputs
multiple_input_dict = prep.obtain_whole_inputs(
    logged_dataset=multiple_logged_dataset,
    evaluation_policies=evaluation_policies,
    n_trajectories_on_policy_evaluation=100,  # used when evaluating OPE (optional)
    random_state=random_state,
)
The above code returns multiple_input_dict as an instance of MultipleInputDict.
Each input dictionary is accessible via the following code.
single_input_dict = multiple_input_dict.get(
    behavior_policy_name=behavior_policies[0].name, dataset_id=0,
)
MultipleInputDict has the following properties.
# a list of the names of the behavior policies
multiple_input_dict.behavior_policy_names
# a dictionary of the number of datasets for each behavior policy
multiple_input_dict.n_datasets
# a dictionary of the number of evaluation policies of each input dict
multiple_input_dict.n_eval_policies
# check whether the contained input dicts use the same evaluation policies
multiple_input_dict.use_same_eval_policy_across_dataset
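For example, these properties offer a quick sanity check before running OPE on all inputs at once. The sketch below relies only on the properties above; note that treating use_same_eval_policy_across_dataset as a boolean flag is an assumption here.

# a sketch: inspect the contained inputs before running OPE on all datasets
print(multiple_input_dict.behavior_policy_names)  # which behavior policies are included
print(multiple_input_dict.n_datasets)  # how many datasets per behavior policy
# comparing results across datasets is meaningful only when the evaluation
# policies coincide (assumed here to be exposed as a boolean flag)
assert multiple_input_dict.use_same_eval_policy_across_dataset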
Note that it is also possible to create a single input dict with the CreateOPEInput class by specifying the behavior policy and the dataset id as follows.
single_input_dict = prep.obtain_whole_inputs(
    logged_dataset=multiple_logged_dataset,
    behavior_policy_name=behavior_policies[0].name,  # specify the behavior policy
    dataset_id=0,  # specify the dataset id
    evaluation_policies=evaluation_policies,
    random_state=random_state,
)
Off-Policy Evaluation#
SCOPE-RL enables OPE with multiple logged datasets and multiple input dicts without additional effort. Specifically, we can estimate the policy value via basic OPE as follows.
from scope_rl.ope import OffPolicyEvaluation as OPE

# initialize the OPE class
ope = OPE(
    logged_dataset=multiple_logged_dataset,  # multiple logged datasets
    ope_estimators=estimators,  # a list of OPE estimators
)

# estimate policy values and their confidence intervals
policy_value_df_dict, policy_value_interval_df_dict = ope.summarize_off_policy_estimates(
    input_dict=multiple_input_dict,  # multiple input dicts
    random_state=random_state,
)
The result for each logged dataset is accessible by the following keys.
policy_value_df_dict[behavior_policies[0].name][dataset_id]
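Because the results are keyed by behavior policy and dataset id, summarizing across random seeds is straightforward. Below is a minimal sketch that averages the point estimates over dataset ids; it assumes each entry is a pandas DataFrame of per-estimator estimates, as in the single-dataset case.

import pandas as pd

# a sketch: average the OPE estimates over the dataset ids (random seeds)
# collected by the first behavior policy (assumes each entry is a pandas
# DataFrame of per-estimator estimates, as in the single-dataset case)
name = behavior_policies[0].name
dfs = [
    policy_value_df_dict[name][dataset_id]
    for dataset_id in range(multiple_input_dict.n_datasets[name])
]
mean_across_seeds = pd.concat(dfs).groupby(level=0).mean()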
We can also specify the behavior policy and dataset id when calling the function as follows.
policy_value_df_dict, policy_value_interval_df_dict = ope.summarize_off_policy_estimates(
    input_dict=input_dict,  # either multiple or single input dict
    behavior_policy_name=behavior_policies[0].name,  # specify the behavior policy name
    dataset_id=0,  # specify the dataset id
    random_state=random_state,
)
Next, to compare the OPE results on a specific logged dataset, use the following function.
ope.visualize_off_policy_estimates(
    input_dict,  # either multiple or single input dict
    behavior_policy_name=behavior_policies[0].name,  # specify the behavior policy name
    dataset_id=0,  # specify the dataset id
    random_state=random_state,
)
We can also compare results with multiple datasets as follows.
ope.visualize_policy_value_with_multiple_estimates(
    input_dict=multiple_input_dict,
    plot_type="ci",  # plot confidence intervals
    hue="policy",
)

ope.visualize_policy_value_with_multiple_estimates(
    input_dict=multiple_input_dict,
    plot_type="violin",  # violin plot
    hue="policy",
)

ope.visualize_policy_value_with_multiple_estimates(
    input_dict=multiple_input_dict,
    plot_type="scatter",  # scatter plot
    hue="policy",
)
Cumulative Distribution Off-Policy Evaluation#
Cumulative distribution OPE (CD-OPE) follows a similar interface to that of basic OPE.
from scope_rl.ope import CumulativeDistributionOPE

# initialize the OPE class
cd_ope = CumulativeDistributionOPE(
    logged_dataset=multiple_logged_dataset,  # multiple logged datasets
    ope_estimators=estimators,  # a list of OPE estimators
)

# estimate the cumulative distribution function of the trajectory-wise reward
cdf_dict = cd_ope.estimate_cumulative_distribution_function(
    input_dict=multiple_input_dict,  # multiple input dicts
)
The result for each logged dataset is accessible by the following keys.
cdf_dict[behavior_policies[0].name][dataset_id]
We can also specify the behavior policy and dataset id when calling the function as follows.
cdf_dict = cd_ope.estimate_cumulative_distribution_function(
    input_dict=multiple_input_dict,  # either multiple or single input dict
    behavior_policy_name=behavior_policies[0].name,  # specify the behavior policy name
    dataset_id=0,  # specify the dataset id
)
Similar codes also work for the following functions (a short sketch follows the list).
estimate_cumulative_distribution_function
estimate_mean
estimate_variance
estimate_conditional_value_at_risk
estimate_interquartile_range
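For instance, the mean and variance of the return distribution can be estimated with exactly the same calling pattern (a sketch; both functions appear in the list above, and all distribution-specific options are left at their defaults).

# a sketch: the same calling pattern applies to the other statistics
mean_dict = cd_ope.estimate_mean(
    input_dict=multiple_input_dict,
)
variance_dict = cd_ope.estimate_variance(
    input_dict=multiple_input_dict,
)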
The following code compares the OPE results on a specific logged dataset.
cd_ope.visualize_cumulative_distribution_function(
    input_dict,  # either multiple or single input dict
    behavior_policy_name=behavior_policies[0].name,  # specify the behavior policy name
    dataset_id=0,  # specify the dataset id
)
Similar codes also work for the following functions.
visualize_cumulative_distribution_function
visualize_policy_value
visualize_conditional_value_at_risk
visualize_interquartile_range
Next, SCOPE-RL also visualizes the CDFs estimated on multiple logged datasets as follows.
The first example shows the case of using a single behavior policy and multiple logged datasets.
cd_ope.visualize_cumulative_distribution_function_with_multiple_estimates(
    multiple_input_dict,
    behavior_policy_name=behavior_policies[0].name,  # specify the behavior policy name
    plot_type="ci_hue",
    scale_min=0.0,  # set the reward scale (i.e., the x-axis or bins of the CDF)
    scale_max=10.0,
    n_partition=20,
    n_cols=4,
)
The next example compares the results across multiple behavior policies.
cd_ope.visualize_cumulative_distribution_function_with_multiple_estimates(
    multiple_input_dict,
    plot_type="ci_behavior_policy",
    hue="policy",
    scale_min=0.0,
    scale_max=10.0,
    n_partition=20,
)
The final example shows the CDF estimated on each logged dataset of a single behavior policy.
cd_ope.visualize_cumulative_distribution_function_with_multiple_estimates(
    multiple_input_dict,
    behavior_policy_name=behavior_policies[0].name,  # specify the behavior policy name
    plot_type="enumerate",
    hue="policy",
    scale_min=0.0,
    scale_max=10.0,
    n_partition=20,
)
To compare the point-wise estimation results across multiple logged datasets, the following code works.
ope.visualize_policy_value_with_multiple_estimates(
    multiple_input_dict,
    plot_type="ci",  # or "violin", "scatter"
    hue="policy",
)
Similar codes also work for the following functions (see the sketch after this list).
visualize_policy_value_with_multiple_estimates
visualize_variance_with_multiple_estimates
visualize_conditional_value_at_risk_with_multiple_estimates
visualize_interquartile_range_with_multiple_estimates
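For instance, the CVaR counterpart can be drawn with the same calling convention. The sketch below assumes the method is called on the cd_ope instance, since CVaR is a cumulative-distribution statistic, and leaves any CVaR-specific options at their defaults.

# a sketch: compare point-wise CVaR estimates across multiple logged datasets
cd_ope.visualize_conditional_value_at_risk_with_multiple_estimates(
    multiple_input_dict,
    plot_type="ci",  # or "violin", "scatter"
    hue="policy",
)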
Off-Policy Selection#
SCOPE-RL also enables OPS with multiple logged datasets without any additional effort.
from scope_rl.ope import OffPolicySelection

# initialize the OPS class
ops = OffPolicySelection(
    ope=ope,  # either ope or cd_ope must be given
    cumulative_distribution_ope=cd_ope,
)

# OPS based on estimated policy value
ranking_df_dict, metric_df_dict = ops.select_by_policy_value(
    multiple_input_dict,
    return_metrics=True,
    return_by_dataframe=True,
)
The result for each logged dataset is accessible by the following keys.
ranking_df_dict[behavior_policies[0].name][dataset_id]
The following code conducts OPS on a specific logged dataset.
ranking_df, metric_df = ops.select_by_policy_value(
    input_dict=input_dict,  # either multiple or single input dict
    behavior_policy_name=behavior_policies[0].name,  # specify the behavior policy name
    dataset_id=0,  # specify the dataset id
    return_metrics=True,
    return_by_dataframe=True,
)
Similar codes also work for the following functions (see the sketch after this list).
select_by_policy_value
select_by_policy_value_lower_bound
select_by_policy_value_via_cumulative_distribution_ope
select_by_conditional_value_at_risk
select_by_lower_quartile
obtain_true_selection_result
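For reference, the ground-truth selection result used by the later assessments can be obtained through obtain_true_selection_result; the keyword arguments below are assumptions that mirror select_by_policy_value.

# a sketch: obtain the ground-truth ranking of the evaluation policies
# (the keyword arguments are assumed to mirror select_by_policy_value)
true_ranking_df_dict = ops.obtain_true_selection_result(
    input_dict=multiple_input_dict,
    return_by_dataframe=True,
)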
Assessments of OPE via top-\(k\) Policy Selection#
Next, we show how to assess the top-\(k\) policy selection with multiple logged datasets.
topk_metric_df_dict = ops.obtain_topk_policy_value_selected_by_standard_ope(
    input_dict=multiple_input_dict,
    return_by_dataframe=True,
)
The result for each logged dataset is accessible by the following keys.
topk_metric_df_dict[behavior_policies[0].name][dataset_id]
The following code compares the top-\(k\) policies selected by each OPE estimator on a specific logged dataset.
topk_metric_df = ops.obtain_topk_policy_value_selected_by_standard_ope(
    input_dict,  # either multiple or single input dict
    behavior_policy_name=behavior_policies[0].name,  # specify the behavior policy name
    dataset_id=0,  # specify the dataset id
    return_by_dataframe=True,
    random_state=random_state,
)
Similar codes also work for the following functions.
obtain_topk_policy_value_selected_by_standard_ope
obtain_topk_policy_value_selected_by_lower_bound
obtain_topk_policy_value_selected_by_cumulative_distribution_ope
obtain_topk_conditional_value_at_risk_selected_by_standard_ope
obtain_topk_conditional_value_at_risk_selected_by_cumulative_distribution_ope
obtain_topk_lower_quartile_selected_by_standard_ope
obtain_topk_lower_quartile_selected_by_cumulative_distribution_ope
Visualization functions also work in a similar manner.
ops.visualize_topk_policy_value_selected_by_standard_ope(
    multiple_input_dict,
    compared_estimators=["dm", "tis", "pdis", "dr"],
    visualize_ci=True,
    safety_threshold=6.0,  # specify this option instead of `relative_safety_criteria`
    legend=True,
    random_state=random_state,
)
When using a single behavior policy, the relative_safety_criteria option also becomes available.
ops.visualize_topk_policy_value_selected_by_standard_ope(
    multiple_input_dict,
    behavior_policy_name=behavior_policies[0].name,  # specify the behavior policy name
    compared_estimators=["dm", "tis", "pdis", "dr"],
    visualize_ci=True,
    relative_safety_criteria=1.0,  # becomes available when a single behavior policy is specified
    legend=True,
    random_state=random_state,
)
When using a single logged dataset, specify both the behavior policy name and the dataset id.
ops.visualize_topk_policy_value_selected_by_standard_ope(
    input_dict,  # either multiple or single input dict
    behavior_policy_name=behavior_policies[0].name,  # specify the behavior policy name
    dataset_id=0,  # specify the dataset id
    compared_estimators=["dm", "tis", "pdis", "dr"],
    visualize_ci=True,
    safety_threshold=6.0,  # the (absolute) safety threshold on the policy value
    legend=True,
    random_state=random_state,
)
Similar codes also work for the following functions.
visualize_topk_policy_value_selected_by_standard_ope
visualize_topk_policy_value_selected_by_lower_bound
visualize_topk_policy_value_selected_by_cumulative_distribution_ope
visualize_topk_conditional_value_at_risk_selected_by_standard_ope
visualize_topk_conditional_value_at_risk_selected_by_cumulative_distribution_ope
visualize_topk_lower_quartile_selected_by_standard_ope
visualize_topk_lower_quartile_selected_by_cumulative_distribution_ope
Validating True and Estimated Policy Performance#
Finally, we also provide functions to compare the true and estimated policy performance.
ops.visualize_policy_value_for_validation(
    multiple_input_dict,
    n_cols=4,
    share_axes=True,
)
When using a single behavior policy, specify the behavior policy name.
ops.visualize_policy_value_for_validation(
    input_dict,  # either multiple or single input dict
    behavior_policy_name=behavior_policies[0].name,  # specify the behavior policy name
    n_cols=4,
    share_axes=True,
)
When using a single logged dataset, specify both the behavior policy name and dataset id.
ops.visualize_policy_value_for_validation(
    input_dict,  # either multiple or single input dict
    behavior_policy_name=behavior_policies[0].name,  # specify the behavior policy name
    dataset_id=0,  # specify the dataset id
    n_cols=4,
    share_axes=True,
)
Similar codes also work for the following functions.
visualize_policy_value_for_validation
visualize_policy_value_lower_bound_for_validation
visualize_variance_for_validation
visualize_conditional_value_at_risk_for_validation
visualize_lower_bound_for_validation