Example Codes for Cumulative Distribution OPE#
Here, we show example codes for conducting cumulative distribution OPE (CD-OPE).
See also
For preparation, please also refer to the following pages:
Logged Dataset#
Here, we assume that an RL environment, a behavior policy, and evaluation policies are given as follows.
behavior_policy: an instance of BaseHead
evaluation_policies: a list of instance(s) of BaseHead
env: a gym environment (unnecessary when using real-world datasets)
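For reference, these policies are typically constructed by wrapping an already-trained base policy (e.g., a d3rlpy algorithm) with a policy head. The following is an illustrative sketch only, assuming a discrete-action setup; base_policy is a hypothetical trained d3rlpy algorithm, and the exact head arguments may differ in your setting.
from scope_rl.policy import EpsilonGreedyHead  # one of the BaseHead subclasses
# `base_policy` is a hypothetical, already-trained d3rlpy algorithm (e.g., DQN)
behavior_policy = EpsilonGreedyHead(
    base_policy,
    n_actions=env.action_space.n,
    epsilon=0.3,                 # stochastic data-collection policy
    name="dqn_epsilon_0.3",
    random_state=12345,
)
evaluation_policies = [
    EpsilonGreedyHead(
        base_policy,
        n_actions=env.action_space.n,
        epsilon=0.0,             # greedy evaluation policy
        name="dqn_greedy",
        random_state=12345,
    ),
]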
Then, we use the behavior policy to collect a logged dataset as follows.
from scope_rl.dataset import SyntheticDataset
# initialize dataset
dataset = SyntheticDataset(
env=env,
max_episode_steps=env.step_per_episode,
)
# obtain logged dataset
logged_dataset = dataset.obtain_episodes(
behavior_policies=behavior_policy,
n_trajectories=10000,
obtain_info=False, # whether to record `info` returned by environment (optional)
random_state=random_state,
)
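Once collected, the logged dataset is a dictionary of arrays and metadata. A quick way to inspect it is the following (illustrative only; the exact set of keys depends on the SCOPE-RL version, but at least "state" and "pscore" are used later on this page).
# inspect what the synthetic dataset recorded
print(logged_dataset.keys())
print(logged_dataset["state"].shape)  # flattened over trajectories and timesteps (our reading)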
Note that, in the following example, we use a single logged dataset for simplicity. For the case of using multiple behavior policies or multiple logged datasets, refer to Example Codes with Multiple Logged Dataset and Behavior Policies.
Inputs#
The next step is to create the inputs for OPE estimators. This procedure is also very similar to that of basic OPE.
OPE with importance sampling-based estimators#
When using importance sampling-based estimators such as TIS and SNTIS, or hybrid estimators such as TDR and SNTDR, make sure that "pscore" is recorded in the logged dataset.
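As a minimal sanity check, one can verify that the propensity scores were recorded during data collection (a sketch; the key name "pscore" follows the logged dataset format used throughout this page).
# make sure the behavior policy's action choice probabilities are available
assert "pscore" in logged_dataset and logged_dataset["pscore"] is not None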
Then, when using only importance sampling-based estimators, the minimal required code is the following:
from scope_rl.ope import CreateOPEInput
# initialize class to create inputs
prep = CreateOPEInput(
env=env, # unnecessary when using real-world dataset
)
# create inputs for the OPE estimators
input_dict = prep.obtain_whole_inputs(
logged_dataset=logged_dataset,
evaluation_policies=evaluation_policies,
n_trajectories_on_policy_evaluation=100, # number of trajectories used for on-policy evaluation (optional)
random_state=random_state,
)
OPE with model-based estimators#
When using the model-based estimator (DM) or hybrid methods, we additionally need to obtain value estimates in the input dict.
import torch
from d3rlpy.models.encoders import VectorEncoderFactory
from d3rlpy.models.q_functions import MeanQFunctionFactory
# initialize class to create inputs
prep = CreateOPEInput(
env=env,
model_args={ # you can specify the model here (optional)
"fqe": {
"encoder_factory": VectorEncoderFactory(hidden_units=[30, 30]),
"q_func_factory": MeanQFunctionFactory(),
"learning_rate": 1e-4,
"use_gpu": torch.cuda.is_available(),
},
},
)
# create inputs (e.g., estimating the value function via FQE)
input_dict = prep.obtain_whole_inputs(
logged_dataset=logged_dataset,
evaluation_policies=evaluation_policies,
require_value_prediction=True, # enable this option
q_function_method="fqe", # you can specify algorithms here (optional)
v_function_method="fqe",
n_trajectories_on_policy_evaluation=100,
random_state=random_state,
)
Note that we can also apply scaling to the state observations and/or (continuous) actions as follows.
from scope_rl.utils import MinMaxScaler, MinMaxActionScaler
prep = CreateOPEInput(
env=env,
state_scaler=MinMaxScaler( # normalize state observations based on the observed min/max
minimum=logged_dataset["state"].min(axis=0),
maximum=logged_dataset["state"].max(axis=0),
),
action_scaler=MinMaxActionScaler( # normalize actions based on the action space bounds
minimum=env.action_space.low,
maximum=env.action_space.high,
),
sigma=0.1, # additional bandwidth hyperparameter (for dice method)
)
Off-Policy Evaluation#
After preparing the inputs, SCOPE-RL is capable of handling CD-OPE, again in a manner similar to that of basic OPE.
Here, we use the following OPE estimators.
from scope_rl.ope.discrete import CumulativeDistributionDM as CD_DM
from scope_rl.ope.discrete import CumulativeDistributionTIS as CD_TIS
from scope_rl.ope.discrete import CumulativeDistributionTDR as CD_TDR
from scope_rl.ope.discrete import CumulativeDistributionSNTIS as CD_SNTIS
from scope_rl.ope.discrete import CumulativeDistributionSNTDR as CD_SNTDR
estimators = [CD_DM(), CD_TIS(), CD_TDR(), CD_SNTIS(), CD_SNTDR()]
Estimating Cumulative Distribution Function (CDF)#
The CDF curve is easily estimated as follows.
from scope_rl.ope import CumulativeDistributionOPE
# initialize the CD-OPE class
cd_ope = CumulativeDistributionOPE(
logged_dataset=logged_dataset,
ope_estimators=estimators,
)
# estimate CDF
cdf_dict = cd_ope.estimate_cumulative_distribution_function(
input_dict=input_dict,
)
The following code visualizes the results to compare OPE estimators.
cd_ope.visualize_cumulative_distribution_function(
input_dict=input_dict,
hue="estimator", # (default)
n_cols=4, # specify the number of columns (optional)
)
The following code visualizes the results to compare policies.
cd_ope.visualize_cumulative_distribution_function(
input_dict=input_dict,
hue="policy", # (optional)
legend=False,
n_cols=4,
)
Users can also specify the compared OPE estimators as follows.
cd_ope.visualize_cumulative_distribution_function(
input_dict=input_dict,
compared_estimators=["cd_dm", "cd_tis", "cd_tdr"], # select a subset of estimators by name (optional)
)
Note that the x-axis (bins) of the CDF is, by default, determined by the rewards observed under the behavior policy. To use custom bins, specify the reward scale when initializing the class.
cd_ope = CumulativeDistributionOPE(
logged_dataset=logged_dataset,
ope_estimators=estimators,
use_custom_reward_scale=True, # setting bins for cdf
scale_min=0.0,
scale_max=10.0,
n_partition=20,
)
Estimating Mean (i.e., policy value)#
Similarly, we can estimate the policy value via CD-OPE as follows.
policy_value_dict = cd_ope.estimate_mean(
input_dict=input_dict,
compared_estimators=["cd_dm", "cd_tis", "cd_tdr"], # (optional)
)
The visualization function also has similar arguments.
cd_ope.visualize_policy_value(
input_dict=input_dict,
hue="estimator", # (default)
)
For the policy value estimate, we additionally provide the is_relative option to visualize the policy value relative to that of the behavior policy.
cd_ope.visualize_policy_value(
input_dict=input_dict,
hue="policy", # (optional)
is_relative=True, # enable this option
)
Note that the visualization function for the policy value is accompanied by a visualization of the variance, which we discuss next.
Estimating Variance#
CD-OPE is able to estimate the variance of the trajectory-wise reward as follows.
variance_dict = cd_ope.estimate_variance(
input_dict=input_dict,
)
SCOPE-RL shares the visualization function for the variance with that of the policy value. Specifically, the confidence intervals of the trajectory-wise reward are estimated via the variance estimate, assuming that the trajectory-wise reward follows a normal distribution.
cd_ope.visualize_policy_value(
input_dict=input_dict,
)
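To make this concrete, the plotted interval corresponds to a Gaussian interval built from the mean and variance estimates of the trajectory-wise reward. The following is an illustrative sketch only (the numbers are hypothetical, not outputs of the estimators above).
from scipy.stats import norm
# hypothetical mean and variance estimates of the trajectory-wise reward
mean_estimate = 5.0
variance_estimate = 4.0
z = norm.ppf(0.975)  # two-sided 95% interval under the normality assumption
lower = mean_estimate - z * variance_estimate ** 0.5
upper = mean_estimate + z * variance_estimate ** 0.5
print(f"95% interval: [{lower:.2f}, {upper:.2f}]")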
Estimating Conditional Value at Risk (CVaR)#
Next, SCOPE-RL also estimates CVaR in a similar manner.
cvar_dict = cd_ope.estimate_conditional_value_at_risk(
input_dict=input_dict,
alpha=0.3, # specify the proportion of the sided region
)
We can also get the value of CVaR for multiple values of alpha as follows.
import numpy as np
cvar_dict = cd_ope.estimate_conditional_value_at_risk(
input_dict=input_dict,
alpha=np.array([0.1, 0.3]), # specify the proportions of the sided region
)
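For intuition, CVaR at a given alpha corresponds to the average trajectory-wise reward over the lower alpha-fraction (i.e., the worst-case tail) of the reward distribution. A rough reference computation on ground-truth rewards could look like the following sketch, where on_policy_rewards is a hypothetical array of trajectory-wise rewards.
import numpy as np

def empirical_cvar(rewards, alpha):
    """Average of the lowest alpha-fraction of trajectory-wise rewards."""
    threshold = np.quantile(rewards, alpha)      # alpha-quantile of the reward distribution
    return rewards[rewards <= threshold].mean()  # expected reward within the lower tail

# hypothetical on-policy trajectory-wise rewards (for illustration only)
on_policy_rewards = np.random.default_rng(12345).normal(loc=5.0, scale=2.0, size=1000)
print(empirical_cvar(on_policy_rewards, alpha=0.3))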
The visualization function depicts CVaR across a range of alphas as follows.
cd_ope.visualize_conditional_value_at_risk(
input_dict=input_dict,
alphas=np.linspace(0, 1, 21), # (default)
n_cols=4, # (optional)
)
Estimating Interquartile Range#
Finally, SCOPE-RL estimates and visualizes the interquartile range as follows.
# estimate the interquartile range
interquartile_range_dict = cd_ope.estimate_interquartile_range(
input_dict=input_dict,
alpha=0.3, # specify the proportion of the sided region
)
# visualize the interquartile range
cd_ope.visualize_interquartile_range(
input_dict=input_dict,
alpha=0.3, # specify the proportion of the sided region
)
See also
For the evaluation of CD-OPE estimators, please also refer to Example Codes for Assessing OPE Estimators.