scope_rl.ope.ope.OffPolicyEvaluation
- class scope_rl.ope.ope.OffPolicyEvaluation(logged_dataset, ope_estimators, n_step_pdis=0, bandwidth=1.0, action_scaler=None, disable_reward_after_done=True)
Class to perform OPE with multiple estimators simultaneously (applicable to both discrete and continuous action spaces).
Imported as:
scope_rl.ope.OffPolicyEvaluation
Note
OPE estimates the expected performance of a given evaluation policy, called the policy value.
\[V(\pi) := \mathbb{E} \left[ \sum_{t=0}^{T-1} \gamma^t r_t \mid \pi \right]\]
where \(\pi\) is the evaluation policy, \(r_t\) is the reward observed at each timestep \(t\), \(T\) is the total number of timesteps in an episode, and \(\gamma\) is the discount factor.
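As a minimal sketch of the definition above (not SCOPE-RL code), the policy value can be approximated by averaging discounted returns over trajectories collected with the evaluation policy itself. The function names and reward sequences below are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Return sum_t gamma^t * r_t for a single trajectory."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


def on_policy_value(trajectories: Sequence[Sequence[float]], gamma: float) -> float:
    """Monte Carlo estimate of V(pi): the mean discounted return."""
    returns = [discounted_return(rewards, gamma) for rewards in trajectories]
    return sum(returns) / len(returns)


# Hypothetical reward sequences from rolling out the evaluation policy.
trajectories = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0]]
value = on_policy_value(trajectories, gamma=0.5)
```

OPE targets the same quantity, but has to estimate it from data logged by a different (behavior) policy.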
- Parameters:
logged_dataset (LoggedDataset or MultipleLoggedDataset) –
Logged dataset used to conduct OPE.
key: [ size, n_trajectories, step_per_trajectory, action_type, n_actions, action_dim, action_keys, action_meaning, state_dim, state_keys, state, action, reward, done, terminal, info, pscore, behavior_policy, dataset_id, ]
See also
scope_rl.dataset.SyntheticDataset describes the components of logged_dataset.
ope_estimators (list of BaseOffPolicyEstimator) – List of OPE estimators used to estimate the policy value of the evaluation policies. Estimators must follow the interface of scope_rl.ope.BaseOffPolicyEstimator.
n_step_pdis (int, default=0 (>= 0)) – Number of previous steps over which the per-decision importance weight is used in marginal OPE estimators. When set to zero, the estimator reduces to the vanilla state-marginal IS.
bandwidth (float, default=1.0 (> 0)) – Bandwidth hyperparameter of the kernel used in continuous action case.
action_scaler (d3rlpy.preprocessing.ActionScaler, default=None) – Scaling factor of action.
disable_reward_after_done (bool, default=True) – Whether to apply \(r = 0\) once done is observed in an episode.
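Two of the parameters above can be illustrated with a short sketch. It is not the library's implementation. It assumes that the reward at the step where done is first observed is kept and only later rewards are zeroed, and that the kernel is Gaussian (the text only says "kernel"). Both function names are hypothetical.

```python
import math
from typing import List, Sequence


def mask_rewards_after_done(rewards: Sequence[float], done: Sequence[bool]) -> List[float]:
    """Keep rewards up to and including the first done step; zero out the rest."""
    masked: List[float] = []
    finished = False
    for r, d in zip(rewards, done):
        masked.append(0.0 if finished else r)
        finished = finished or d
    return masked


def gaussian_kernel_weight(logged_action: float, evaluation_action: float, bandwidth: float) -> float:
    """Gaussian kernel K((a - pi(s)) / h) / h, a smooth stand-in for an exact action match."""
    u = (logged_action - evaluation_action) / bandwidth
    return math.exp(-0.5 * u * u) / (math.sqrt(2.0 * math.pi) * bandwidth)
```

A smaller bandwidth concentrates weight on logged actions very close to the evaluation policy's action, which tends to lower bias but raise variance; a larger bandwidth does the opposite.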
Examples
Preparation:
# import necessary module from SCOPE-RL
from scope_rl.dataset import SyntheticDataset
from scope_rl.policy import EpsilonGreedyHead
from scope_rl.ope import CreateOPEInput
from scope_rl.ope import OffPolicyEvaluation as OPE
from scope_rl.ope.discrete import TrajectoryWiseImportanceSampling as TIS
from scope_rl.ope.discrete import PerDecisionImportanceSampling as PDIS

# import necessary module from other libraries
import gym
import rtbgym
from d3rlpy.algos import DoubleDQNConfig
from d3rlpy.dataset import create_fifo_replay_buffer
from d3rlpy.algos import ConstantEpsilonGreedy

# initialize environment
env = gym.make("RTBEnv-discrete-v0")

# define (RL) agent (i.e., policy) and train on the environment
ddqn = DoubleDQNConfig().create()
buffer = create_fifo_replay_buffer(
    limit=10000,
    env=env,
)
explorer = ConstantEpsilonGreedy(
    epsilon=0.3,
)
ddqn.fit_online(
    env=env,
    buffer=buffer,
    explorer=explorer,
    n_steps=10000,
    n_steps_per_epoch=1000,
)

# convert ddqn policy to stochastic data collection policy
behavior_policy = EpsilonGreedyHead(
    ddqn,
    n_actions=env.action_space.n,
    epsilon=0.3,
    name="ddqn_epsilon_0.3",
    random_state=12345,
)

# initialize dataset class
dataset = SyntheticDataset(
    env=env,
    max_episode_steps=env.step_per_episode,
)

# data collection
logged_dataset = dataset.obtain_episodes(
    behavior_policies=behavior_policy,
    n_trajectories=100,
    random_state=12345,
)
Create Input for OPE:
# evaluation policy
ddqn_ = EpsilonGreedyHead(
    base_policy=ddqn,
    n_actions=env.action_space.n,
    name="ddqn",
    epsilon=0.0,
    random_state=12345
)
random_ = EpsilonGreedyHead(
    base_policy=ddqn,
    n_actions=env.action_space.n,
    name="random",
    epsilon=1.0,
    random_state=12345
)

# create input for off-policy evaluation (OPE)
prep = CreateOPEInput(
    env=env,
)
input_dict = prep.obtain_whole_inputs(
    logged_dataset=logged_dataset,
    evaluation_policies=[ddqn_, random_],
    n_trajectories_on_policy_evaluation=100,
    random_state=12345,
)
Off-Policy Evaluation:
# OPE
ope = OPE(
    logged_dataset=logged_dataset,
    ope_estimators=[TIS(), PDIS()],
)
policy_value_dict = ope.estimate_policy_value(
    input_dict=input_dict,
)
Output:
>>> policy_value_dict
{'ddqn': {'on_policy': 15.95, 'tis': 18.103809657474702, 'pdis': 16.95314065192053},
 'random': {'on_policy': 12.69, 'tis': 0.4885685147584351, 'pdis': 6.2752568547701335}}
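The two estimators in this example can give different values for the same policy because they weight rewards differently. The following is a textbook-style sketch of trajectory-wise and per-decision importance sampling, not SCOPE-RL's implementation. The `Step` layout (reward, evaluation-policy probability, behavior-policy probability) and the function names are assumptions made for illustration.

```python
from typing import Sequence, Tuple

# One logged step: (reward, pi_e(a|s), pi_b(a|s)). Hypothetical layout.
Step = Tuple[float, float, float]


def tis_estimate(trajectories: Sequence[Sequence[Step]], gamma: float) -> float:
    """Trajectory-wise IS: weight the whole return by the full-trajectory ratio."""
    total = 0.0
    for traj in trajectories:
        weight = 1.0
        for _, pi_e, pi_b in traj:
            weight *= pi_e / pi_b
        total += weight * sum((gamma ** t) * r for t, (r, _, _) in enumerate(traj))
    return total / len(trajectories)


def pdis_estimate(trajectories: Sequence[Sequence[Step]], gamma: float) -> float:
    """Per-decision IS: weight the reward at step t by the ratio up to step t only."""
    total = 0.0
    for traj in trajectories:
        weight = 1.0
        for t, (r, pi_e, pi_b) in enumerate(traj):
            weight *= pi_e / pi_b
            total += (gamma ** t) * weight * r
    return total / len(trajectories)
```

When the evaluation and behavior policies coincide, every ratio is 1 and both functions reduce to the average discounted return. They diverge when later ratios are extreme, since the trajectory-wise weight lets a single late step rescale all earlier rewards.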
- Attributes:
- action_scaler
- estimators_name
Methods
estimate_intervals(input_dict[, ...]) – Estimate the confidence intervals of the policy value by nonparametric bootstrap.
estimate_policy_value(input_dict[, ...]) – Estimate the policy value of the given evaluation policies.
evaluate_performance_of_ope_estimators(input_dict[, ...]) – Evaluate the estimation performance/accuracy of OPE estimators.
summarize_off_policy_estimates(input_dict[, ...]) – Summarize the policy values and their confidence intervals estimated by OPE estimators.
visualize_off_policy_estimates(input_dict[, ...]) – Visualize the policy value estimated by OPE estimators.
visualize_policy_value_with_multiple_estimates(input_dict[, ...]) – Visualize the policy value estimated by OPE estimators across multiple logged datasets.
- estimate_policy_value(input_dict, compared_estimators=None, behavior_policy_name=None, dataset_id=None)
Estimate the policy value of the given evaluation policies.
- Parameters:
input_dict (OPEInputDict or MultipleInputDict) –
Dictionary of the OPE inputs for each evaluation policy.
key: [evaluation_policy][ evaluation_policy_action, evaluation_policy_action_dist, state_action_value_prediction, initial_state_value_prediction, state_action_marginal_importance_weight, state_marginal_importance_weight, on_policy_policy_value, gamma, behavior_policy, evaluation_policy, dataset_id, ]
See also
scope_rl.ope.input.CreateOPEInput describes the components of input_dict.
compared_estimators (list of str, default=None) – Name of compared estimators. If None is given, all the estimators are compared.
behavior_policy_name (str, default=None) – Name of the behavior policy.
dataset_id (int, default=None) – Id of the logged dataset.
- Returns:
policy_value_dict – Dictionary containing the policy value of each evaluation policy estimated by OPE estimators.
key: [evaluation_policy][OPE_estimator_name]
When behavior_policy_name is None and dataset_id is None, key: [behavior_policy_name][dataset_id][evaluation_policy][OPE_estimator_name]
When behavior_policy_name is None and dataset_id is specified, key: [behavior_policy_name][evaluation_policy][OPE_estimator_name]
When behavior_policy_name is specified and dataset_id is None, key: [dataset_id][OPE_estimator_name]
- Return type:
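For the single-dataset case, the returned dictionary is indexed first by evaluation policy and then by estimator name. The sketch below walks that structure using the values from the example output in the class documentation; `closest_estimator` is a hypothetical helper, not part of SCOPE-RL.

```python
from typing import Dict

# Values copied from the example output of estimate_policy_value.
policy_value_dict = {
    "ddqn": {"on_policy": 15.95, "tis": 18.103809657474702, "pdis": 16.95314065192053},
    "random": {"on_policy": 12.69, "tis": 0.4885685147584351, "pdis": 6.2752568547701335},
}


def closest_estimator(estimates: Dict[str, float]) -> str:
    """Name of the estimator whose value is nearest to the on-policy value."""
    on_policy = estimates["on_policy"]
    candidates = {k: v for k, v in estimates.items() if k != "on_policy"}
    return min(candidates, key=lambda name: abs(candidates[name] - on_policy))


best = {policy: closest_estimator(est) for policy, est in policy_value_dict.items()}
```

With multiple logged datasets, the same leaf structure sits under the extra behavior_policy_name and/or dataset_id levels listed above, so the loop would need one or two more levels of nesting.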
- estimate_intervals(input_dict, compared_estimators=None, behavior_policy_name=None, dataset_id=None, alpha=0.05, ci='bootstrap', n_bootstrap_samples=100, random_state=None)
Estimate the confidence intervals of the policy value by nonparametric bootstrap.
- Parameters:
input_dict (OPEInputDict or MultipleInputDict) –
Dictionary of the OPE inputs for each evaluation policy.
key: [evaluation_policy][ evaluation_policy_action, evaluation_policy_action_dist, state_action_value_prediction, initial_state_value_prediction, state_action_marginal_importance_weight, state_marginal_importance_weight, on_policy_policy_value, gamma, behavior_policy, evaluation_policy, dataset_id, ]
See also
scope_rl.ope.input.CreateOPEInput describes the components of input_dict.
compared_estimators (list of str, default=None) – Name of compared estimators. If None is given, all the estimators are compared.
behavior_policy_name (str, default=None) – Name of the behavior policy.
dataset_id (int, default=None) – Id of the logged dataset.
alpha (float, default=0.05) – Significance level. The value should be within [0, 1).
ci ({"bootstrap", "hoeffding", "bernstein", "ttest"}, default="bootstrap") – Method to estimate the confidence interval.
n_bootstrap_samples (int, default=100 (> 0)) – Number of resampling performed in the bootstrap procedure.
random_state (int, default=None (>= 0)) – Random state.
- Returns:
policy_value_interval_dict – Dictionary containing the confidence intervals estimated by nonparametric bootstrap.
key: [evaluation_policy][OPE_estimator_name]
When behavior_policy_name is None and dataset_id is None, key: [behavior_policy_name][dataset_id][evaluation_policy][OPE_estimator_name]
When behavior_policy_name is None and dataset_id is specified, key: [behavior_policy_name][evaluation_policy][OPE_estimator_name]
When behavior_policy_name is specified and dataset_id is None, key: [dataset_id][OPE_estimator_name]
- Return type:
References
Josiah P. Hanna, Peter Stone, and Scott Niekum. “Bootstrapping with Models: Confidence Intervals for Off-Policy Evaluation.” 2017.
Philip S. Thomas, Georgios Theocharous, and Mohammad Ghavamzadeh. “High Confidence Policy Improvement.” 2015.
Philip S. Thomas, Georgios Theocharous, and Mohammad Ghavamzadeh. “High Confidence Off-Policy Evaluation.” 2015.
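The default ci="bootstrap" option can be pictured with a percentile bootstrap over per-trajectory estimates. This is a generic sketch under that assumption, not the library's code: `bootstrap_interval` and its return layout are hypothetical, although the argument names mirror alpha, n_bootstrap_samples, and random_state from the signature above.

```python
import random
from typing import Optional, Sequence, Tuple


def bootstrap_interval(
    samples: Sequence[float],
    alpha: float = 0.05,
    n_bootstrap_samples: int = 100,
    random_state: Optional[int] = None,
) -> Tuple[float, float, float]:
    """Return (mean, lower, upper) using a percentile bootstrap of the mean."""
    rng = random.Random(random_state)
    n = len(samples)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        sum(rng.choices(samples, k=n)) / n for _ in range(n_bootstrap_samples)
    )
    lower_idx = int((alpha / 2) * n_bootstrap_samples)
    upper_idx = min(int((1 - alpha / 2) * n_bootstrap_samples), n_bootstrap_samples - 1)
    return sum(samples) / n, means[lower_idx], means[upper_idx]


# Hypothetical per-trajectory estimates from one OPE estimator.
per_trajectory = [12.0, 18.5, 15.2, 20.1, 9.8, 16.4, 14.9, 17.3]
mean, lower, upper = bootstrap_interval(per_trajectory, random_state=12345)
```

Fixing random_state makes the interval reproducible. Increasing n_bootstrap_samples reduces the resampling noise in the interval endpoints at the cost of more computation.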
- summarize_off_policy_estimates(input_dict, compared_estimators=None, behavior_policy_name=None, dataset_id=None, alpha=0.05, ci='bootstrap', n_bootstrap_samples=100, random_state=None)
Summarize the policy values and their confidence intervals estimated by OPE estimators.
- Parameters:
input_dict (OPEInputDict or MultipleInputDict) –
Dictionary of the OPE inputs for each evaluation policy.
key: [evaluation_policy][ evaluation_policy_action, evaluation_policy_action_dist, state_action_value_prediction, initial_state_value_prediction, state_action_marginal_importance_weight, state_marginal_importance_weight, on_policy_policy_value, gamma, behavior_policy, evaluation_policy, dataset_id, ]
See also
scope_rl.ope.input.CreateOPEInput describes the components of input_dict.
compared_estimators (list of str, default=None) – Name of compared estimators. If None is given, all the estimators are compared.
behavior_policy_name (str, default=None) – Name of the behavior policy.
dataset_id (int, default=None) – Id of the logged dataset.
alpha (float, default=0.05) – Significance level. The value should be within [0, 1).
n_bootstrap_samples (int, default=100 (> 0)) – Number of resampling performed in the bootstrap procedure.
ci ({"bootstrap", "hoeffding", "bernstein", "ttest"}, default="bootstrap") – Method to estimate the confidence interval.
random_state (int, default=None (>= 0)) – Random state.
- Returns:
policy_value_df_dict (dict (, list of dict)) – Dictionary containing the policy value of each evaluation policy estimated by OPE estimators.
key: [evaluation_policy][OPE_estimator_name]
When behavior_policy_name is None and dataset_id is None, key: [behavior_policy_name][dataset_id][evaluation_policy][OPE_estimator_name]
When behavior_policy_name is None and dataset_id is specified, key: [behavior_policy_name][evaluation_policy][OPE_estimator_name]
When behavior_policy_name is specified and dataset_id is None, key: [dataset_id][OPE_estimator_name]
policy_value_interval_df_dict (dict (, list of dict)) – Dictionary containing the confidence intervals estimated by nonparametric bootstrap.
key: [evaluation_policy][OPE_estimator_name]
When behavior_policy_name is None and dataset_id is None, key: [behavior_policy_name][dataset_id][evaluation_policy][OPE_estimator_name]
When behavior_policy_name is None and dataset_id is specified, key: [behavior_policy_name][evaluation_policy][OPE_estimator_name]
When behavior_policy_name is specified and dataset_id is None, key: [dataset_id][OPE_estimator_name]
- Return type:
- evaluate_performance_of_ope_estimators(input_dict, compared_estimators=None, behavior_policy_name=None, dataset_id=None, metric='relative-ee', return_by_dataframe=False)
Evaluate the estimation performance/accuracy of OPE estimators.
Note
Evaluate the estimation performance/accuracy of OPE estimators by relative estimation error (relative-EE) or squared error (SE).
\[\mathrm{Relative-EE}(\hat{V}; \mathcal{D}) := \left| \frac{\hat{V}(\pi; \mathcal{D}) - V_{\mathrm{on}}(\pi)}{V_{\mathrm{on}}(\pi)} \right|,\]
\[\mathrm{SE}(\hat{V}; \mathcal{D}) := \left( \hat{V}(\pi; \mathcal{D}) - V_{\mathrm{on}}(\pi) \right)^2,\]
where \(V_{\mathrm{on}}(\pi)\) is the on-policy policy value of the evaluation policy \(\pi\), and \(\hat{V}(\pi; \mathcal{D})\) is the policy value estimated by the OPE estimator \(\hat{V}\) on the logged dataset \(\mathcal{D}\).
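Both metrics are direct to compute by hand. The sketch below applies the two formulas to the "ddqn" entry of the example output in the class documentation (on-policy value 15.95, TIS estimate 18.103809657474702); the function names are hypothetical and not part of SCOPE-RL.

```python
def relative_ee(estimated_value: float, on_policy_value: float) -> float:
    """Relative estimation error: |(V_hat - V_on) / V_on|."""
    return abs((estimated_value - on_policy_value) / on_policy_value)


def squared_error(estimated_value: float, on_policy_value: float) -> float:
    """Squared error: (V_hat - V_on)^2."""
    return (estimated_value - on_policy_value) ** 2


on_policy = 15.95
tis = 18.103809657474702
tis_relative_ee = relative_ee(tis, on_policy)  # roughly 0.135
tis_se = squared_error(tis, on_policy)         # roughly 4.64
```

Relative-EE is scale-free, which makes it easier to compare across evaluation policies with very different values, but it is undefined when the on-policy value is zero; SE has no such restriction.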
- Parameters:
input_dict (OPEInputDict or MultipleInputDict) –
Dictionary of the OPE inputs for each evaluation policy.
key: [evaluation_policy][ evaluation_policy_action, evaluation_policy_action_dist, state_action_value_prediction, initial_state_value_prediction, state_action_marginal_importance_weight, state_marginal_importance_weight, on_policy_policy_value, gamma, behavior_policy, evaluation_policy, dataset_id, ]
See also
scope_rl.ope.input.CreateOPEInput describes the components of input_dict.
compared_estimators (list of str, default=None) – Name of compared estimators. If None is given, all the estimators are compared.
behavior_policy_name (str, default=None) – Name of the behavior policy.
dataset_id (int, default=None) – Id of the logged dataset.
metric ({"relative-ee", "se"}, default="relative-ee") – Evaluation metric used to evaluate and compare the estimation performance/accuracy of OPE estimators.
return_by_dataframe (bool, default=False) – Whether to return the result in a dataframe format.
- Returns:
eval_metric_ope_dict/eval_metric_ope_df – Dictionary/dataframe containing the evaluation metric for the estimation performance/accuracy of OPE estimators.
key: [evaluation_policy][OPE_estimator_name]
When behavior_policy_name is None and dataset_id is None, key: [behavior_policy_name][dataset_id][evaluation_policy][OPE_estimator_name]
When behavior_policy_name is None and dataset_id is specified, key: [behavior_policy_name][evaluation_policy][OPE_estimator_name]
When behavior_policy_name is specified and dataset_id is None, key: [dataset_id][OPE_estimator_name]
- Return type:
- visualize_off_policy_estimates(input_dict, compared_estimators=None, behavior_policy_name=None, dataset_id=None, alpha=0.05, ci='bootstrap', n_bootstrap_samples=100, random_state=None, is_relative=False, hue='estimator', sharey=False, fig_dir=None, fig_name='estimated_policy_value.png')
Visualize the policy value estimated by OPE estimators.
- Parameters:
input_dict (OPEInputDict or MultipleInputDict) –
Dictionary of the OPE inputs for each evaluation policy.
key: [evaluation_policy][ evaluation_policy_action, evaluation_policy_action_dist, state_action_value_prediction, initial_state_value_prediction, state_action_marginal_importance_weight, state_marginal_importance_weight, on_policy_policy_value, gamma, behavior_policy, evaluation_policy, dataset_id, ]
See also
scope_rl.ope.input.CreateOPEInput describes the components of input_dict.
compared_estimators (list of str, default=None) – Name of compared estimators. If None is given, all the estimators are compared.
behavior_policy_name (str, default=None) – Name of the behavior policy.
dataset_id (int, default=None) – Id of the logged dataset.
alpha (float, default=0.05) – Significance level. The value should be within (0, 1].
ci ({"bootstrap", "hoeffding", "bernstein", "ttest"}, default="bootstrap") – Method to estimate the confidence interval.
n_bootstrap_samples (int, default=100 (> 0)) – Number of resampling performed in the bootstrap procedure.
random_state (int, default=None (>= 0)) – Random state.
is_relative (bool, default=False) – If True, we get the estimated policy values of the evaluation policies relative to the on-policy policy value of the behavior policy. (Only applicable when using a single input_dict.)
hue ({"estimator", "policy"}, default="estimator") – Hue of the plot.
sharey (bool, default=False) – If True, the y-axis will be shared among different estimators or evaluation policies.
fig_dir (Path, default=None) – Path to store the bar figure. If None is given, the figure will not be saved.
fig_name (str, default="estimated_policy_value.png") – Name of the bar figure.
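The is_relative option can be read as a simple rescaling by the behavior policy's on-policy value. The sketch below shows that reading only; it is not the library's plotting code. `to_relative` is a hypothetical helper, and the behavior policy value of 14.0 is invented for illustration (the estimates are the "ddqn" values from the example output).

```python
from typing import Dict


def to_relative(estimates: Dict[str, float], behavior_policy_value: float) -> Dict[str, float]:
    """Express each estimate as a ratio to the behavior policy's on-policy value."""
    return {name: value / behavior_policy_value for name, value in estimates.items()}


# Estimates for "ddqn" from the example output; the behavior value is hypothetical.
ddqn_estimates = {"tis": 18.103809657474702, "pdis": 16.95314065192053}
relative = to_relative(ddqn_estimates, behavior_policy_value=14.0)
```

Under this reading, a relative value above 1 suggests the evaluation policy is estimated to outperform the behavior policy.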
- visualize_policy_value_with_multiple_estimates(input_dict, compared_estimators=None, behavior_policy_name=None, plot_type='ci', hue='estimator', legend=True, sharey=False, fig_dir=None, fig_name='estimated_policy_value_multiple.png')
Visualize the policy value estimated by OPE estimators across multiple logged datasets.
Note
This function is applicable only when MultipleLoggedDataset is used.
- Parameters:
input_dict (OPEInputDict or MultipleInputDict) –
Dictionary of the OPE inputs for each evaluation policy.
key: [evaluation_policy][ evaluation_policy_action, evaluation_policy_action_dist, state_action_value_prediction, initial_state_value_prediction, state_action_marginal_importance_weight, state_marginal_importance_weight, on_policy_policy_value, gamma, behavior_policy, evaluation_policy, dataset_id, ]
See also
scope_rl.ope.input.CreateOPEInput describes the components of input_dict.
compared_estimators (list of str, default=None) – Name of compared estimators. If None is given, all the estimators are compared.
behavior_policy_name (str, default=None) – Name of the behavior policy.
plot_type ({"ci", "scatter", "violin"}, default="ci") – Type of plot. If “ci” is given, we get the empirical average of the estimated values with their estimated confidence intervals. If “scatter” is given, we get a scatter plot of estimated values.
hue ({"estimator", "policy"}, default="estimator") – Hue of the plot.
legend (bool, default=True) – Whether to include a legend in the scatter plot.
sharey (bool, default=False) – If True, the y-axis will be shared among different estimators or evaluation policies.
fig_dir (Path, default=None) – Path to store the bar figure. If None is given, the figure will not be saved.
fig_name (str, default="estimated_policy_value_multiple.png") – Name of the bar figure.
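Before plotting, it can help to see what aggregating across logged datasets amounts to. The sketch below computes, per estimator, the empirical mean and sample standard deviation of estimates obtained on several datasets. It does not call SCOPE-RL; the data and the `summarize_across_datasets` helper are hypothetical.

```python
from statistics import mean, stdev
from typing import Dict, List

# Hypothetical estimates: one dict per logged dataset, keyed by estimator name.
estimates_by_dataset: List[Dict[str, float]] = [
    {"tis": 18.1, "pdis": 17.0},
    {"tis": 14.2, "pdis": 16.1},
    {"tis": 21.5, "pdis": 16.6},
]


def summarize_across_datasets(estimates: List[Dict[str, float]]) -> Dict[str, Dict[str, float]]:
    """Mean and sample standard deviation of each estimator across datasets."""
    names = list(estimates[0].keys())
    return {
        name: {
            "mean": mean(e[name] for e in estimates),
            "std": stdev(e[name] for e in estimates),
        }
        for name in names
    }


summary = summarize_across_datasets(estimates_by_dataset)
```

A wide spread for one estimator relative to another, as with "tis" in this made-up data, is the kind of pattern the "ci", "scatter", and "violin" plot types are meant to expose.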