scope_rl.ope.ope.OffPolicyEvaluation

class scope_rl.ope.ope.OffPolicyEvaluation(logged_dataset, ope_estimators, n_step_pdis=0, bandwidth=1.0, action_scaler=None, disable_reward_after_done=True)

Class to perform OPE with multiple estimators simultaneously (applicable to both discrete and continuous action cases).

Imported as: scope_rl.ope.OffPolicyEvaluation

Note

OPE estimates the expected performance of a given evaluation policy, which is called the policy value.

\[V(\pi) := \mathbb{E} \left[ \sum_{t=0}^{T-1} \gamma^t r_t \mid \pi \right]\]

where \(\pi\) is the evaluation policy, \(r_t\) is the reward observed at each timestep \(t\), \(T\) is the total number of timesteps in an episode, and \(\gamma\) is the discount factor.
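As a minimal illustration of the definition above (not SCOPE-RL code), the policy value can be approximated by averaging the discounted returns of trajectories rolled out under \(\pi\). The helper names `discounted_return` and `monte_carlo_policy_value` and the toy rewards are hypothetical.

```python
from typing import List, Sequence


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Compute sum_{t=0}^{T-1} gamma^t * r_t for a single trajectory."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


def monte_carlo_policy_value(trajectories: List[Sequence[float]], gamma: float) -> float:
    """Average the discounted returns of trajectories collected under pi."""
    returns = [discounted_return(rewards, gamma) for rewards in trajectories]
    return sum(returns) / len(returns)


# toy reward sequences, assumed to come from rolling out the evaluation policy
toy_trajectories = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0]]
value_estimate = monte_carlo_policy_value(toy_trajectories, gamma=0.5)
```

OPE targets the same quantity, but it has to work from trajectories collected by a different (behavior) policy, which is why importance weights or value models enter the estimators.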

Parameters:

logged_dataset (LoggedDataset or MultipleLoggedDataset) –

Logged dataset used to conduct OPE.

key: [
    size,
    n_trajectories,
    step_per_trajectory,
    action_type,
    n_actions,
    action_dim,
    action_keys,
    action_meaning,
    state_dim,
    state_keys,
    state,
    action,
    reward,
    done,
    terminal,
    info,
    pscore,
    behavior_policy,
    dataset_id,
]

See also

scope_rl.dataset.SyntheticDataset describes the components of logged_dataset.

ope_estimators (list of BaseOffPolicyEstimator) – List of OPE estimators used to evaluate the policy value of the evaluation policies. Estimators must follow the interface of scope_rl.ope.BaseOffPolicyEstimator.
n_step_pdis (int, default=0 (>= 0)) – Number of previous steps over which the per-decision importance weight is used in marginal OPE estimators. When set to zero, the estimator reduces to the vanilla state marginal IS.
bandwidth (float, default=1.0 (> 0)) – Bandwidth hyperparameter of the kernel used in continuous action case.
action_scaler (d3rlpy.preprocessing.ActionScaler, default=None) – Scaling factor of action.
disable_reward_after_done (bool, default=True) – Whether to apply \(r = 0\) once done is observed in an episode.
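The effect of `disable_reward_after_done` can be pictured with the sketch below. It is an illustration only, not the library's implementation: the helper `mask_reward_after_done` is hypothetical, and it assumes that the reward observed at the step where done is raised is kept while all later rewards are set to zero.

```python
from typing import List, Sequence


def mask_reward_after_done(rewards: Sequence[float], done: Sequence[bool]) -> List[float]:
    """Set r = 0 at every step strictly after the first done flag.

    Assumption: the reward at the done step itself is kept.
    """
    masked = []
    episode_finished = False
    for r, d in zip(rewards, done):
        # once the episode has finished, later rewards no longer count
        masked.append(0.0 if episode_finished else float(r))
        if d:
            episode_finished = True
    return masked


rewards = [1.0, 2.0, 3.0, 4.0]
done = [False, True, False, False]
masked_rewards = mask_reward_after_done(rewards, done)
```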

Examples

Preparation:

# import necessary modules from SCOPE-RL
from scope_rl.dataset import SyntheticDataset
from scope_rl.policy import EpsilonGreedyHead
from scope_rl.ope import CreateOPEInput
from scope_rl.ope import OffPolicyEvaluation as OPE
from scope_rl.ope.discrete import TrajectoryWiseImportanceSampling as TIS
from scope_rl.ope.discrete import PerDecisionImportanceSampling as PDIS

# import necessary modules from other libraries
import gym
import rtbgym
from d3rlpy.algos import DoubleDQNConfig
from d3rlpy.dataset import create_fifo_replay_buffer
from d3rlpy.algos import ConstantEpsilonGreedy

# initialize environment
env = gym.make("RTBEnv-discrete-v0")

# define (RL) agent (i.e., policy) and train on the environment
ddqn = DoubleDQNConfig().create()
buffer = create_fifo_replay_buffer(
    limit=10000,
    env=env,
)
explorer = ConstantEpsilonGreedy(
    epsilon=0.3,
)
ddqn.fit_online(
    env=env,
    buffer=buffer,
    explorer=explorer,
    n_steps=10000,
    n_steps_per_epoch=1000,
)

# convert ddqn policy to stochastic data collection policy
behavior_policy = EpsilonGreedyHead(
    ddqn,
    n_actions=env.action_space.n,
    epsilon=0.3,
    name="ddqn_epsilon_0.3",
    random_state=12345,
)

# initialize dataset class
dataset = SyntheticDataset(
    env=env,
    max_episode_steps=env.step_per_episode,
)

# data collection
logged_dataset = dataset.obtain_episodes(
    behavior_policies=behavior_policy,
    n_trajectories=100,
    random_state=12345,
)

Create Input for OPE:

# evaluation policy
ddqn_ = EpsilonGreedyHead(
    base_policy=ddqn,
    n_actions=env.action_space.n,
    name="ddqn",
    epsilon=0.0,
    random_state=12345
)
random_ = EpsilonGreedyHead(
    base_policy=ddqn,
    n_actions=env.action_space.n,
    name="random",
    epsilon=1.0,
    random_state=12345
)

# create input for off-policy evaluation (OPE)
prep = CreateOPEInput(
    env=env,
)
input_dict = prep.obtain_whole_inputs(
    logged_dataset=logged_dataset,
    evaluation_policies=[ddqn_, random_],
    n_trajectories_on_policy_evaluation=100,
    random_state=12345,
)

Off-Policy Evaluation:

# OPE
ope = OPE(
    logged_dataset=logged_dataset,
    ope_estimators=[TIS(), PDIS()],
)
policy_value_dict = ope.estimate_policy_value(
    input_dict=input_dict,
)

Output:

>>> policy_value_dict

{'ddqn': {'on_policy': 15.95, 'tis': 18.103809657474702, 'pdis': 16.95314065192053},
'random': {'on_policy': 12.69, 'tis': 0.4885685147584351, 'pdis': 6.2752568547701335}}
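To give a sense of what the `tis` and `pdis` entries represent, the sketch below implements the textbook definitions of trajectory-wise and per-decision importance sampling on toy data. It does not reproduce SCOPE-RL's implementation; the function names, the `Step` tuple layout, and the numbers are assumptions made for illustration.

```python
from typing import List, Tuple

# one logged step: (evaluation policy probability of the logged action,
#                   behavior policy probability of the logged action (pscore),
#                   reward)
Step = Tuple[float, float, float]


def tis_estimate(trajectories: List[List[Step]], gamma: float) -> float:
    """Trajectory-wise IS: weight the whole return by the full-trajectory ratio."""
    total = 0.0
    for traj in trajectories:
        weight = 1.0
        discounted_return = 0.0
        for t, (pi_e, pi_b, r) in enumerate(traj):
            weight *= pi_e / pi_b
            discounted_return += (gamma ** t) * r
        total += weight * discounted_return
    return total / len(trajectories)


def pdis_estimate(trajectories: List[List[Step]], gamma: float) -> float:
    """Per-decision IS: weight each reward by the ratio accumulated up to step t."""
    total = 0.0
    for traj in trajectories:
        weight = 1.0
        for t, (pi_e, pi_b, r) in enumerate(traj):
            weight *= pi_e / pi_b
            total += (gamma ** t) * weight * r
    return total / len(trajectories)


toy_data = [
    [(0.5, 0.25, 1.0), (0.25, 0.5, 1.0)],
    [(0.25, 0.5, 0.0), (0.5, 0.25, 2.0)],
]
tis_value = tis_estimate(toy_data, gamma=1.0)
pdis_value = pdis_estimate(toy_data, gamma=1.0)
```

Because per-decision IS applies only the importance weight accumulated up to each reward, it typically has lower variance than the trajectory-wise version, which is consistent with the `pdis` estimates in the output above lying closer to `on_policy` than the `tis` estimates.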

Attributes:

action_scaler
estimators_name

Methods

`estimate_intervals`(input_dict[, ...])	Estimate the confidence intervals of the policy value by nonparametric bootstrap.
`estimate_policy_value`(input_dict[, ...])	Estimate the policy value of the given evaluation policies.
`evaluate_performance_of_ope_estimators`(...)	Evaluate the estimation performance/accuracy of OPE estimators.
`summarize_off_policy_estimates`(input_dict[, ...])	Summarize the policy value and their confidence intervals estimated by OPE estimators.
`visualize_off_policy_estimates`(input_dict[, ...])	Visualize the policy value estimated by OPE estimators.
`visualize_policy_value_with_multiple_estimates`(...)	Visualize the policy value estimated by OPE estimators across multiple logged datasets.

estimate_policy_value(input_dict, compared_estimators=None, behavior_policy_name=None, dataset_id=None)

Estimate the policy value of the given evaluation policies.

Parameters:

input_dict (OPEInputDict or MultipleInputDict) –

Dictionary of the OPE inputs for each evaluation policy.

key: [evaluation_policy][
    evaluation_policy_action,
    evaluation_policy_action_dist,
    state_action_value_prediction,
    initial_state_value_prediction,
    state_action_marginal_importance_weight,
    state_marginal_importance_weight,
    on_policy_policy_value,
    gamma,
    behavior_policy,
    evaluation_policy,
    dataset_id,
]

See also

scope_rl.ope.input.CreateOPEInput describes the components of input_dict.

compared_estimators (list of str, default=None) – Name of compared estimators. If None is given, all the estimators are compared.
behavior_policy_name (str, default=None) – Name of the behavior policy.
dataset_id (int, default=None) – Id of the logged dataset.

Returns:

policy_value_dict – Dictionary containing the policy value of each evaluation policy estimated by OPE estimators. key: [evaluation_policy][OPE_estimator_name]

When behavior_policy_name is None and dataset_id is None, key: [behavior_policy_name][dataset_id][evaluation_policy][OPE_estimator_name]

When behavior_policy_name is None and dataset_id is specified, key: [behavior_policy_name][evaluation_policy][OPE_estimator_name]

When behavior_policy_name is specified and dataset_id is None, key: [dataset_id][OPE_estimator_name]

Return type:

dict (, dict of list of dict)
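For the single-dataset case, the returned dictionary can be navigated as sketched below. The values are copied from the example output shown earlier on this page; the `closest_estimator` summary is a hypothetical post-processing step and not part of the API.

```python
# values copied from the example output of `estimate_policy_value`
policy_value_dict = {
    "ddqn": {"on_policy": 15.95, "tis": 18.103809657474702, "pdis": 16.95314065192053},
    "random": {"on_policy": 12.69, "tis": 0.4885685147584351, "pdis": 6.2752568547701335},
}

# look up one estimate with key: [evaluation_policy][OPE_estimator_name]
ddqn_pdis = policy_value_dict["ddqn"]["pdis"]

# for each evaluation policy, find the estimator closest to the on-policy value
closest_estimator = {}
for policy_name, estimates in policy_value_dict.items():
    on_policy = estimates["on_policy"]
    candidates = {name: v for name, v in estimates.items() if name != "on_policy"}
    closest_estimator[policy_name] = min(
        candidates, key=lambda name: abs(candidates[name] - on_policy)
    )
```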

estimate_intervals(input_dict, compared_estimators=None, behavior_policy_name=None, dataset_id=None, alpha=0.05, ci='bootstrap', n_bootstrap_samples=100, random_state=None)

Estimate the confidence intervals of the policy value by nonparametric bootstrap.

Parameters:

input_dict (OPEInputDict or MultipleInputDict) –

Dictionary of the OPE inputs for each evaluation policy.

key: [evaluation_policy][
    evaluation_policy_action,
    evaluation_policy_action_dist,
    state_action_value_prediction,
    initial_state_value_prediction,
    state_action_marginal_importance_weight,
    state_marginal_importance_weight,
    on_policy_policy_value,
    gamma,
    behavior_policy,
    evaluation_policy,
    dataset_id,
]

See also

scope_rl.ope.input.CreateOPEInput describes the components of input_dict.

compared_estimators (list of str, default=None) – Name of compared estimators. If None is given, all the estimators are compared.
behavior_policy_name (str, default=None) – Name of the behavior policy.
dataset_id (int, default=None) – Id of the logged dataset.
alpha (float, default=0.05) – Significance level. The value should be within [0, 1).
ci ({"bootstrap", "hoeffding", "bernstein", "ttest"}, default="bootstrap") – Method to estimate the confidence interval.
n_bootstrap_samples (int, default=100 (> 0)) – Number of resampling performed in the bootstrap procedure.
random_state (int, default=None (>= 0)) – Random state.

Returns:

policy_value_interval_dict – Dictionary containing the confidence intervals estimated by nonparametric bootstrap. key: [evaluation_policy][OPE_estimator_name]

When behavior_policy_name is None and dataset_id is None, key: [behavior_policy_name][dataset_id][evaluation_policy][OPE_estimator_name]

When behavior_policy_name is None and dataset_id is specified, key: [behavior_policy_name][evaluation_policy][OPE_estimator_name]

When behavior_policy_name is specified and dataset_id is None, key: [dataset_id][OPE_estimator_name]

Return type:

dict
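The idea behind the default `ci="bootstrap"` option can be sketched as a percentile bootstrap over per-trajectory estimates. This is a simplified illustration under stated assumptions, not the library's code: the function `bootstrap_interval`, its percentile indexing, and the output keys `mean`, `lower`, and `upper` are hypothetical.

```python
import random
from typing import Dict, Optional, Sequence


def bootstrap_interval(
    values: Sequence[float],
    alpha: float = 0.05,
    n_bootstrap_samples: int = 100,
    random_state: Optional[int] = None,
) -> Dict[str, float]:
    """Percentile bootstrap interval for the mean of `values`."""
    rng = random.Random(random_state)
    n = len(values)
    boot_means = []
    for _ in range(n_bootstrap_samples):
        # resample n values with replacement and record their mean
        resample = [values[rng.randrange(n)] for _ in range(n)]
        boot_means.append(sum(resample) / n)
    boot_means.sort()
    lower_idx = int((alpha / 2) * (n_bootstrap_samples - 1))
    upper_idx = int((1 - alpha / 2) * (n_bootstrap_samples - 1))
    return {
        "mean": sum(values) / n,
        "lower": boot_means[lower_idx],
        "upper": boot_means[upper_idx],
    }


# toy per-trajectory estimates (hypothetical numbers)
estimates = [10.0, 12.0, 9.0, 15.0, 11.0, 13.0, 8.0, 14.0]
interval = bootstrap_interval(
    estimates, alpha=0.05, n_bootstrap_samples=100, random_state=12345
)
```

Fixing `random_state` makes the resampling reproducible, and a larger `n_bootstrap_samples` gives smoother interval endpoints at a higher computational cost.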

References

Josiah P. Hanna, Peter Stone, and Scott Niekum. “Bootstrapping with Models: Confidence Intervals for Off-Policy Evaluation.” 2017.

Philip S. Thomas, Georgios Theocharous, and Mohammad Ghavamzadeh. “High Confidence Policy Improvement.” 2015.

Philip S. Thomas, Georgios Theocharous, and Mohammad Ghavamzadeh. “High Confidence Off-Policy Evaluation.” 2015.

summarize_off_policy_estimates(input_dict, compared_estimators=None, behavior_policy_name=None, dataset_id=None, alpha=0.05, ci='bootstrap', n_bootstrap_samples=100, random_state=None)

Summarize the policy value and their confidence intervals estimated by OPE estimators.

Parameters:

input_dict (OPEInputDict or MultipleInputDict) –

Dictionary of the OPE inputs for each evaluation policy.

key: [evaluation_policy][
    evaluation_policy_action,
    evaluation_policy_action_dist,
    state_action_value_prediction,
    initial_state_value_prediction,
    state_action_marginal_importance_weight,
    state_marginal_importance_weight,
    on_policy_policy_value,
    gamma,
    behavior_policy,
    evaluation_policy,
    dataset_id,
]

See also

scope_rl.ope.input.CreateOPEInput describes the components of input_dict.

compared_estimators (list of str, default=None) – Name of compared estimators. If None is given, all the estimators are compared.
behavior_policy_name (str, default=None) – Name of the behavior policy.
dataset_id (int, default=None) – Id of the logged dataset.
alpha (float, default=0.05) – Significance level. The value should be within [0, 1).
ci ({"bootstrap", "hoeffding", "bernstein", "ttest"}, default="bootstrap") – Method to estimate the confidence interval.
n_bootstrap_samples (int, default=100 (> 0)) – Number of resampling performed in the bootstrap procedure.
random_state (int, default=None (>= 0)) – Random state.

Returns:

policy_value_df_dict (dict (, list of dict)) – Dictionary containing the policy value of each evaluation policy estimated by OPE estimators. key: [evaluation_policy][OPE_estimator_name]

When behavior_policy_name is None and dataset_id is None, key: [behavior_policy_name][dataset_id][evaluation_policy][OPE_estimator_name]

When behavior_policy_name is None and dataset_id is specified, key: [behavior_policy_name][evaluation_policy][OPE_estimator_name]

When behavior_policy_name is specified and dataset_id is None, key: [dataset_id][OPE_estimator_name]
policy_value_interval_df_dict (dict (, list of dict)) – Dictionary containing the confidence intervals estimated by nonparametric bootstrap. key: [evaluation_policy][OPE_estimator_name]

When behavior_policy_name is None and dataset_id is None, key: [behavior_policy_name][dataset_id][evaluation_policy][OPE_estimator_name]

When behavior_policy_name is None and dataset_id is specified, key: [behavior_policy_name][evaluation_policy][OPE_estimator_name]

When behavior_policy_name is specified and dataset_id is None, key: [dataset_id][OPE_estimator_name]

Return type:

Tuple[Dict[str, DataFrame], Dict[str, DataFrame]]

evaluate_performance_of_ope_estimators(input_dict, compared_estimators=None, behavior_policy_name=None, dataset_id=None, metric='relative-ee', return_by_dataframe=False)

Evaluate the estimation performance/accuracy of OPE estimators.

Note

Evaluate the estimation performance/accuracy of OPE estimators by relative estimation error (relative-EE) or squared error (SE).

\[\mathrm{Relative-EE}(\hat{V}; \mathcal{D}) := \left| \frac{\hat{V}(\pi; \mathcal{D}) - V_{\mathrm{on}}(\pi)}{V_{\mathrm{on}}(\pi)} \right|,\]

\[\mathrm{SE}(\hat{V}; \mathcal{D}) := \left( \hat{V}(\pi; \mathcal{D}) - V_{\mathrm{on}}(\pi) \right)^2,\]

where \(V_{\mathrm{on}}(\pi)\) is the on-policy policy value of the evaluation policy \(\pi\), and \(\hat{V}(\pi; \mathcal{D})\) is the policy value estimated by the OPE estimator \(\hat{V}\) from the logged dataset \(\mathcal{D}\).
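Both metrics are simple to compute by hand. The sketch below applies the two formulas to the `ddqn` entry of the example output on this page (on-policy value 15.95, TIS estimate 18.1038...); the function names `relative_ee` and `squared_error` are hypothetical helpers, not part of the library.

```python
def relative_ee(estimated_value: float, on_policy_value: float) -> float:
    """Relative estimation error: |(V_hat - V_on) / V_on|."""
    return abs((estimated_value - on_policy_value) / on_policy_value)


def squared_error(estimated_value: float, on_policy_value: float) -> float:
    """Squared error: (V_hat - V_on)^2."""
    return (estimated_value - on_policy_value) ** 2


# values taken from the 'ddqn' entry of the example output
on_policy = 15.95
tis_estimate = 18.103809657474702

ree = relative_ee(tis_estimate, on_policy)  # roughly 0.135
se = squared_error(tis_estimate, on_policy)  # roughly 4.64
```

Relative-EE is scale-free, which makes it easier to compare across evaluation policies with different value magnitudes, but it is undefined when the on-policy value is zero.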

Parameters:

input_dict (OPEInputDict or MultipleInputDict) –

Dictionary of the OPE inputs for each evaluation policy.

key: [evaluation_policy][
    evaluation_policy_action,
    evaluation_policy_action_dist,
    state_action_value_prediction,
    initial_state_value_prediction,
    state_action_marginal_importance_weight,
    state_marginal_importance_weight,
    on_policy_policy_value,
    gamma,
    behavior_policy,
    evaluation_policy,
    dataset_id,
]

See also

scope_rl.ope.input.CreateOPEInput describes the components of input_dict.

compared_estimators (list of str, default=None) – Name of compared estimators. If None is given, all the estimators are compared.
behavior_policy_name (str, default=None) – Name of the behavior policy.
dataset_id (int, default=None) – Id of the logged dataset.
metric ({"relative-ee", "se"}, default="relative-ee") – Evaluation metric used to evaluate and compare the estimation performance/accuracy of OPE estimators.
return_by_dataframe (bool, default=False) – Whether to return the result in a dataframe format.

Returns:

eval_metric_ope_dict/eval_metric_ope_df – Dictionary/dataframe containing the evaluation metric that measures the estimation performance/accuracy of OPE estimators. key: [evaluation_policy][OPE_estimator_name]

When behavior_policy_name is None and dataset_id is None, key: [behavior_policy_name][dataset_id][evaluation_policy][OPE_estimator_name]

When behavior_policy_name is None and dataset_id is specified, key: [behavior_policy_name][evaluation_policy][OPE_estimator_name]

When behavior_policy_name is specified and dataset_id is None, key: [dataset_id][OPE_estimator_name]

Return type:

dict or dataframe (, list of dict or dataframe)

visualize_off_policy_estimates(input_dict, compared_estimators=None, behavior_policy_name=None, dataset_id=None, alpha=0.05, ci='bootstrap', n_bootstrap_samples=100, random_state=None, is_relative=False, hue='estimator', sharey=False, fig_dir=None, fig_name='estimated_policy_value.png')

Visualize the policy value estimated by OPE estimators.

Parameters:

input_dict (OPEInputDict or MultipleInputDict) –

Dictionary of the OPE inputs for each evaluation policy.

key: [evaluation_policy][
    evaluation_policy_action,
    evaluation_policy_action_dist,
    state_action_value_prediction,
    initial_state_value_prediction,
    state_action_marginal_importance_weight,
    state_marginal_importance_weight,
    on_policy_policy_value,
    gamma,
    behavior_policy,
    evaluation_policy,
    dataset_id,
]

See also

scope_rl.ope.input.CreateOPEInput describes the components of input_dict.

compared_estimators (list of str, default=None) – Name of compared estimators. If None is given, all the estimators are compared.
behavior_policy_name (str, default=None) – Name of the behavior policy.
dataset_id (int, default=None) – Id of the logged dataset.
alpha (float, default=0.05) – Significance level. The value should be within [0, 1).
ci ({"bootstrap", "hoeffding", "bernstein", "ttest"}, default="bootstrap") – Method to estimate the confidence interval.
n_bootstrap_samples (int, default=100 (> 0)) – Number of resampling performed in the bootstrap procedure.
random_state (int, default=None (>= 0)) – Random state.
is_relative (bool, default=False) – If True, we get the estimated policy values of the evaluation policies relative to the on-policy policy value of the behavior policy. (Only applicable when using a single input_dict.)
hue ({"estimator", "policy"}, default="estimator") – Hue of the plot.
sharey (bool, default=False) – If True, the y-axis will be shared among different estimators or evaluation policies.
fig_dir (Path, default=None) – Path to store the bar figure. If None is given, the figure will not be saved.
fig_name (str, default="estimated_policy_value.png") – Name of the bar figure.

visualize_policy_value_with_multiple_estimates(input_dict, compared_estimators=None, behavior_policy_name=None, plot_type='ci', hue='estimator', legend=True, sharey=False, fig_dir=None, fig_name='estimated_policy_value_multiple.png')

Visualize the policy value estimated by OPE estimators across multiple logged datasets.

Note

This function is applicable only when MultipleLoggedDataset is used.

Parameters:

input_dict (OPEInputDict or MultipleInputDict) –

Dictionary of the OPE inputs for each evaluation policy.

key: [evaluation_policy][
    evaluation_policy_action,
    evaluation_policy_action_dist,
    state_action_value_prediction,
    initial_state_value_prediction,
    state_action_marginal_importance_weight,
    state_marginal_importance_weight,
    on_policy_policy_value,
    gamma,
    behavior_policy,
    evaluation_policy,
    dataset_id,
]

See also

scope_rl.ope.input.CreateOPEInput describes the components of input_dict.

compared_estimators (list of str, default=None) – Name of compared estimators. If None is given, all the estimators are compared.
behavior_policy_name (str, default=None) – Name of the behavior policy.
plot_type ({"ci", "scatter", "violin"}, default="ci") – Type of plot. If “ci” is given, we get the empirical average of the estimated values with their estimated confidence intervals. If “scatter” is given, we get a scatter plot of estimated values.
hue ({"estimator", "policy"}, default="estimator") – Hue of the plot.
legend (bool, default=True) – Whether to include a legend in the scatter plot.
sharey (bool, default=False) – If True, the y-axis will be shared among different estimators or evaluation policies.
fig_dir (Path, default=None) – Path to store the bar figure. If None is given, the figure will not be saved.
fig_name (str, default="estimated_policy_value_multiple.png") – Name of the bar figure.
