scope_rl.ope.ope.OffPolicyEvaluation

class scope_rl.ope.ope.OffPolicyEvaluation(logged_dataset, ope_estimators, n_step_pdis=0, bandwidth=1.0, action_scaler=None, disable_reward_after_done=True)

Class to perform OPE with multiple estimators simultaneously (applicable to both discrete and continuous action cases).

Imported as: scope_rl.ope.OffPolicyEvaluation

Note

OPE estimates the expected performance of a given evaluation policy, which is called the policy value.

\[V(\pi) := \mathbb{E} \left[ \sum_{t=0}^{T-1} \gamma^t r_t \mid \pi \right]\]

where \(\pi\) is the evaluation policy, \(r_t\) is the reward observed at each timestep \(t\), \(T\) is the total number of timesteps in an episode, and \(\gamma\) is the discount factor.
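
The definition above can be read as an average of discounted returns over trajectories. The following is a minimal sketch of that computation in NumPy, not SCOPE-RL code; the helper names `discounted_return` and `on_policy_policy_value` and the toy rewards are assumptions made for illustration.

```python
import numpy as np


def discounted_return(rewards, gamma):
    """Sum of gamma^t * r_t over one trajectory (t = 0, ..., T-1)."""
    rewards = np.asarray(rewards, dtype=float)
    discounts = gamma ** np.arange(len(rewards))
    return float(np.sum(discounts * rewards))


def on_policy_policy_value(reward_trajectories, gamma):
    """Monte Carlo estimate of V(pi): the mean discounted return over trajectories."""
    returns = [discounted_return(r, gamma) for r in reward_trajectories]
    return float(np.mean(returns))


# toy rewards (hypothetical), two trajectories with T = 3
trajectories = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0]]
value = on_policy_policy_value(trajectories, gamma=0.5)
```

Here the trajectories are assumed to be collected by running \(\pi\) itself; OPE estimators target the same quantity using only data logged by a different (behavior) policy.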

Parameters:
  • logged_dataset (LoggedDataset or MultipleLoggedDataset) –

    Logged dataset used to conduct OPE.

    key: [
        size,
        n_trajectories,
        step_per_trajectory,
        action_type,
        n_actions,
        action_dim,
        action_keys,
        action_meaning,
        state_dim,
        state_keys,
        state,
        action,
        reward,
        done,
        terminal,
        info,
        pscore,
        behavior_policy,
        dataset_id,
    ]
    

    See also

    scope_rl.dataset.SyntheticDataset describes the components of logged_dataset.

  • ope_estimators (list of BaseOffPolicyEstimator) – List of OPE estimators used to evaluate the policy value of the evaluation policies. Estimators must follow the interface of scope_rl.ope.BaseOffPolicyEstimator.

  • n_step_pdis (int, default=0 (>= 0)) – Number of previous steps for which the per-decision importance weight is used in marginal OPE estimators. When set to zero, the estimator reduces to the vanilla state marginal IS.

  • bandwidth (float, default=1.0 (> 0)) – Bandwidth hyperparameter of the kernel used in the continuous action case.

  • action_scaler (d3rlpy.preprocessing.ActionScaler, default=None) – Scaler applied to actions.

  • disable_reward_after_done (bool, default=True) – Whether to apply \(r = 0\) once done is observed in an episode.
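
To illustrate what disable_reward_after_done describes, the sketch below zeroes the rewards that follow the first done flag in a single trajectory. It is not the library's implementation: the helper name `mask_reward_after_done` is hypothetical, and keeping the reward at the step where done is first observed is an assumption.

```python
import numpy as np


def mask_reward_after_done(reward, done):
    """Set r = 0 for every step after the first done flag in one trajectory.

    Assumption: the reward observed at the step where done first becomes
    True is kept; only the rewards of later steps are zeroed.
    """
    reward = np.asarray(reward, dtype=float)
    done = np.asarray(done, dtype=bool).astype(int)
    # number of done flags seen strictly before each step
    done_before = np.cumsum(done) - done
    return np.where(done_before > 0, 0.0, reward)


masked = mask_reward_after_done(
    reward=[1.0, 2.0, 3.0, 4.0],
    done=[False, True, False, False],
)
```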

Examples

Preparation:

# import necessary modules from SCOPE-RL
from scope_rl.dataset import SyntheticDataset
from scope_rl.policy import EpsilonGreedyHead
from scope_rl.ope import CreateOPEInput
from scope_rl.ope import OffPolicyEvaluation as OPE
from scope_rl.ope.discrete import TrajectoryWiseImportanceSampling as TIS
from scope_rl.ope.discrete import PerDecisionImportanceSampling as PDIS

# import necessary modules from other libraries
import gym
import rtbgym
from d3rlpy.algos import DoubleDQNConfig
from d3rlpy.dataset import create_fifo_replay_buffer
from d3rlpy.algos import ConstantEpsilonGreedy

# initialize environment
env = gym.make("RTBEnv-discrete-v0")

# define (RL) agent (i.e., policy) and train on the environment
ddqn = DoubleDQNConfig().create()
buffer = create_fifo_replay_buffer(
    limit=10000,
    env=env,
)
explorer = ConstantEpsilonGreedy(
    epsilon=0.3,
)
ddqn.fit_online(
    env=env,
    buffer=buffer,
    explorer=explorer,
    n_steps=10000,
    n_steps_per_epoch=1000,
)

# convert ddqn policy to stochastic data collection policy
behavior_policy = EpsilonGreedyHead(
    ddqn,
    n_actions=env.action_space.n,
    epsilon=0.3,
    name="ddqn_epsilon_0.3",
    random_state=12345,
)

# initialize dataset class
dataset = SyntheticDataset(
    env=env,
    max_episode_steps=env.step_per_episode,
)

# data collection
logged_dataset = dataset.obtain_episodes(
    behavior_policies=behavior_policy,
    n_trajectories=100,
    random_state=12345,
)

Create Input for OPE:

# evaluation policy
ddqn_ = EpsilonGreedyHead(
    base_policy=ddqn,
    n_actions=env.action_space.n,
    name="ddqn",
    epsilon=0.0,
    random_state=12345
)
random_ = EpsilonGreedyHead(
    base_policy=ddqn,
    n_actions=env.action_space.n,
    name="random",
    epsilon=1.0,
    random_state=12345
)

# create input for off-policy evaluation (OPE)
prep = CreateOPEInput(
    env=env,
)
input_dict = prep.obtain_whole_inputs(
    logged_dataset=logged_dataset,
    evaluation_policies=[ddqn_, random_],
    n_trajectories_on_policy_evaluation=100,
    random_state=12345,
)

Off-Policy Evaluation:

# OPE
ope = OPE(
    logged_dataset=logged_dataset,
    ope_estimators=[TIS(), PDIS()],
)
policy_value_dict = ope.estimate_policy_value(
    input_dict=input_dict,
)

Output:

>>> policy_value_dict

{'ddqn': {'on_policy': 15.95, 'tis': 18.103809657474702, 'pdis': 16.95314065192053},
'random': {'on_policy': 12.69, 'tis': 0.4885685147584351, 'pdis': 6.2752568547701335}}
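
The output is a nested dictionary indexed as [evaluation_policy][OPE_estimator_name]. As a small sketch in plain Python (no SCOPE-RL call involved), the snippet below copies the values printed above and picks, for each evaluation policy, the estimator whose estimate is closest to the on-policy value. The variable name `closest` is chosen for illustration.

```python
# values copied from the output shown above
policy_value_dict = {
    "ddqn": {"on_policy": 15.95, "tis": 18.103809657474702, "pdis": 16.95314065192053},
    "random": {"on_policy": 12.69, "tis": 0.4885685147584351, "pdis": 6.2752568547701335},
}

closest = {}
for policy_name, estimates in policy_value_dict.items():
    on_policy = estimates["on_policy"]
    # absolute gap between each estimator and the on-policy value
    gaps = {
        name: abs(value - on_policy)
        for name, value in estimates.items()
        if name != "on_policy"
    }
    closest[policy_name] = min(gaps, key=gaps.get)
```

In this particular run, PDIS is closer to the on-policy value than TIS for both evaluation policies.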

Attributes:
action_scaler
estimators_name

Methods

estimate_intervals(input_dict[, ...])

Estimate the confidence intervals of the policy value by nonparametric bootstrap.

estimate_policy_value(input_dict[, ...])

Estimate the policy value of the given evaluation policies.

evaluate_performance_of_ope_estimators(...)

Evaluate the estimation performance/accuracy of OPE estimators.

summarize_off_policy_estimates(input_dict[, ...])

Summarize the policy value and their confidence intervals estimated by OPE estimators.

visualize_off_policy_estimates(input_dict[, ...])

Visualize the policy value estimated by OPE estimators.

visualize_policy_value_with_multiple_estimates(...)

Visualize the policy value estimated by OPE estimators across multiple logged datasets.

estimate_policy_value(input_dict, compared_estimators=None, behavior_policy_name=None, dataset_id=None)

Estimate the policy value of the given evaluation policies.

Parameters:
  • input_dict (OPEInputDict or MultipleInputDict) –

    Dictionary of the OPE inputs for each evaluation policy.

    key: [evaluation_policy][
        evaluation_policy_action,
        evaluation_policy_action_dist,
        state_action_value_prediction,
        initial_state_value_prediction,
        state_action_marginal_importance_weight,
        state_marginal_importance_weight,
        on_policy_policy_value,
        gamma,
        behavior_policy,
        evaluation_policy,
        dataset_id,
    ]
    

    See also

    scope_rl.ope.input.CreateOPEInput describes the components of input_dict.

  • compared_estimators (list of str, default=None) – Name of compared estimators. If None is given, all the estimators are compared.

  • behavior_policy_name (str, default=None) – Name of the behavior policy.

  • dataset_id (int, default=None) – Id of the logged dataset.

Returns:

policy_value_dict – Dictionary containing the policy value of each evaluation policy estimated by OPE estimators. key: [evaluation_policy][OPE_estimator_name]

When behavior_policy_name is None and dataset_id is None, key: [behavior_policy_name][dataset_id][evaluation_policy][OPE_estimator_name]

When behavior_policy_name is None and dataset_id is specified, key: [behavior_policy_name][evaluation_policy][OPE_estimator_name]

When behavior_policy_name is specified and dataset_id is None, key: [dataset_id][OPE_estimator_name]
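
As a rough sketch of the deepest layout listed above, [behavior_policy_name][dataset_id][evaluation_policy][OPE_estimator_name], the snippet below flattens such a nested result into rows. It assumes that dataset_id is the position in a list, which is how the "dict of list of dict" return type is read here. The names and numbers are placeholders made up for illustration, not real estimates.

```python
# placeholder structure with made-up values (not real OPE estimates)
nested = {
    "behavior_a": [
        {"ddqn": {"tis": 1.0, "pdis": 2.0}},  # dataset_id = 0
        {"ddqn": {"tis": 3.0, "pdis": 4.0}},  # dataset_id = 1
    ],
}


def flatten(nested_dict):
    """Turn the nested result into (behavior, dataset_id, policy, estimator, value) rows."""
    rows = []
    for behavior_policy_name, per_dataset in nested_dict.items():
        for dataset_id, per_policy in enumerate(per_dataset):
            for evaluation_policy, per_estimator in per_policy.items():
                for estimator_name, value in per_estimator.items():
                    rows.append(
                        (behavior_policy_name, dataset_id, evaluation_policy, estimator_name, value)
                    )
    return rows


rows = flatten(nested)
```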

Return type:

dict (, dict of list of dict)

estimate_intervals(input_dict, compared_estimators=None, behavior_policy_name=None, dataset_id=None, alpha=0.05, ci='bootstrap', n_bootstrap_samples=100, random_state=None)

Estimate the confidence intervals of the policy value by nonparametric bootstrap.

Parameters:
  • input_dict (OPEInputDict or MultipleInputDict) –

    Dictionary of the OPE inputs for each evaluation policy.

    key: [evaluation_policy][
        evaluation_policy_action,
        evaluation_policy_action_dist,
        state_action_value_prediction,
        initial_state_value_prediction,
        state_action_marginal_importance_weight,
        state_marginal_importance_weight,
        on_policy_policy_value,
        gamma,
        behavior_policy,
        evaluation_policy,
        dataset_id,
    ]
    

    See also

    scope_rl.ope.input.CreateOPEInput describes the components of input_dict.

  • compared_estimators (list of str, default=None) – Name of compared estimators. If None is given, all the estimators are compared.

  • behavior_policy_name (str, default=None) – Name of the behavior policy.

  • dataset_id (int, default=None) – Id of the logged dataset.

  • alpha (float, default=0.05) – Significance level. The value should be within [0, 1).

  • ci ({"bootstrap", "hoeffding", "bernstein", "ttest"}, default="bootstrap") – Method to estimate the confidence interval.

  • n_bootstrap_samples (int, default=100 (> 0)) – Number of resampling performed in the bootstrap procedure.

  • random_state (int, default=None (>= 0)) – Random state.

Returns:

policy_value_interval_dict – Dictionary containing the confidence intervals estimated by nonparametric bootstrap. key: [evaluation_policy][OPE_estimator_name]

When behavior_policy_name is None and dataset_id is None, key: [behavior_policy_name][dataset_id][evaluation_policy][OPE_estimator_name]

When behavior_policy_name is None and dataset_id is specified, key: [behavior_policy_name][evaluation_policy][OPE_estimator_name]

When behavior_policy_name is specified and dataset_id is None, key: [dataset_id][OPE_estimator_name]

Return type:

dict
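
For intuition about the ci="bootstrap" option, here is a minimal sketch of a nonparametric (percentile) bootstrap interval for a mean, written with NumPy only. It is an assumption about the general technique rather than a copy of the library's routine; the function name `bootstrap_interval`, the returned keys, and the toy samples are hypothetical.

```python
import numpy as np


def bootstrap_interval(samples, alpha=0.05, n_bootstrap_samples=100, random_state=None):
    """Percentile bootstrap confidence interval for the mean of `samples`."""
    samples = np.asarray(samples, dtype=float)
    rng = np.random.default_rng(random_state)
    # resample with replacement and record the mean of each resample
    idx = rng.integers(0, len(samples), size=(n_bootstrap_samples, len(samples)))
    boot_means = samples[idx].mean(axis=1)
    # take the alpha/2 and 1 - alpha/2 quantiles of the resampled means
    lower, upper = np.quantile(boot_means, [alpha / 2, 1 - alpha / 2])
    return {"mean": float(boot_means.mean()), "lower": float(lower), "upper": float(upper)}


# toy per-trajectory estimates (hypothetical)
estimates = [14.2, 17.8, 15.1, 19.3, 16.4, 15.9, 18.0, 13.7]
interval = bootstrap_interval(estimates, alpha=0.05, n_bootstrap_samples=100, random_state=12345)
```

A larger n_bootstrap_samples makes the quantiles more stable at the cost of more computation, and fixing random_state makes the interval reproducible.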

References

Josiah P. Hanna, Peter Stone, and Scott Niekum. “Bootstrapping with Models: Confidence Intervals for Off-Policy Evaluation.” 2017.

Philip S. Thomas, Georgios Theocharous, and Mohammad Ghavamzadeh. “High Confidence Policy Improvement.” 2015.

Philip S. Thomas, Georgios Theocharous, and Mohammad Ghavamzadeh. “High Confidence Off-Policy Evaluation.” 2015.

summarize_off_policy_estimates(input_dict, compared_estimators=None, behavior_policy_name=None, dataset_id=None, alpha=0.05, ci='bootstrap', n_bootstrap_samples=100, random_state=None)

Summarize the policy value and their confidence intervals estimated by OPE estimators.

Parameters:
  • input_dict (OPEInputDict or MultipleInputDict) –

    Dictionary of the OPE inputs for each evaluation policy.

    key: [evaluation_policy][
        evaluation_policy_action,
        evaluation_policy_action_dist,
        state_action_value_prediction,
        initial_state_value_prediction,
        state_action_marginal_importance_weight,
        state_marginal_importance_weight,
        on_policy_policy_value,
        gamma,
        behavior_policy,
        evaluation_policy,
        dataset_id,
    ]
    

    See also

    scope_rl.ope.input.CreateOPEInput describes the components of input_dict.

  • compared_estimators (list of str, default=None) – Name of compared estimators. If None is given, all the estimators are compared.

  • behavior_policy_name (str, default=None) – Name of the behavior policy.

  • dataset_id (int, default=None) – Id of the logged dataset.

  • alpha (float, default=0.05) – Significance level. The value should be within [0, 1).

  • n_bootstrap_samples (int, default=100 (> 0)) – Number of resampling performed in the bootstrap procedure.

  • ci ({"bootstrap", "hoeffding", "bernstein", "ttest"}, default="bootstrap") – Method to estimate the confidence interval.

  • random_state (int, default=None (>= 0)) – Random state.

Returns:

  • policy_value_df_dict (dict (, list of dict)) – Dictionary containing the policy value of each evaluation policy estimated by OPE estimators. key: [evaluation_policy][OPE_estimator_name]

    When behavior_policy_name is None and dataset_id is None, key: [behavior_policy_name][dataset_id][evaluation_policy][OPE_estimator_name]

    When behavior_policy_name is None and dataset_id is specified, key: [behavior_policy_name][evaluation_policy][OPE_estimator_name]

    When behavior_policy_name is specified and dataset_id is None, key: [dataset_id][OPE_estimator_name]

  • policy_value_interval_df_dict (dict (, list of dict)) – Dictionary containing the confidence intervals estimated by nonparametric bootstrap. key: [evaluation_policy][OPE_estimator_name]

    When behavior_policy_name is None and dataset_id is None, key: [behavior_policy_name][dataset_id][evaluation_policy][OPE_estimator_name]

    When behavior_policy_name is None and dataset_id is specified, key: [behavior_policy_name][evaluation_policy][OPE_estimator_name]

    When behavior_policy_name is specified and dataset_id is None, key: [dataset_id][OPE_estimator_name]

Return type:

Tuple[Dict[str, DataFrame], Dict[str, DataFrame]]

evaluate_performance_of_ope_estimators(input_dict, compared_estimators=None, behavior_policy_name=None, dataset_id=None, metric='relative-ee', return_by_dataframe=False)

Evaluate the estimation performance/accuracy of OPE estimators.

Note

Evaluate the estimation performance/accuracy of OPE estimators by relative estimation error (relative-EE) or squared error (SE).

\[\text{Relative-EE}(\hat{V}; \mathcal{D}) := \left| \frac{\hat{V}(\pi; \mathcal{D}) - V_{\mathrm{on}}(\pi)}{V_{\mathrm{on}}(\pi)} \right|,\]
\[\mathrm{SE}(\hat{V}; \mathcal{D}) := \left( \hat{V}(\pi; \mathcal{D}) - V_{\mathrm{on}}(\pi) \right)^2,\]

where \(V_{\mathrm{on}}(\pi)\) is the on-policy policy value of the evaluation policy \(\pi\), and \(\hat{V}(\pi; \mathcal{D})\) is the policy value estimated by the OPE estimator \(\hat{V}\) on the logged dataset \(\mathcal{D}\).
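
The two metrics can be sketched directly from their definitions. The helper names `relative_ee` and `squared_error` are hypothetical; the numbers are the 'ddqn' on-policy value and PDIS estimate from the class-level example output.

```python
def relative_ee(estimated, on_policy):
    """Relative estimation error: |(V_hat - V_on) / V_on|."""
    return abs((estimated - on_policy) / on_policy)


def squared_error(estimated, on_policy):
    """Squared error: (V_hat - V_on)^2."""
    return (estimated - on_policy) ** 2


# values taken from the 'ddqn' entry of the example output
on_policy_value = 15.95
pdis_estimate = 16.95314065192053

rel = relative_ee(pdis_estimate, on_policy_value)
se = squared_error(pdis_estimate, on_policy_value)
```

Relative-EE is scale-free, which makes it easier to compare across evaluation policies with different on-policy values, but it is undefined when the on-policy value is zero; SE keeps the squared scale of the reward.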

Parameters:
  • input_dict (OPEInputDict or MultipleInputDict) –

    Dictionary of the OPE inputs for each evaluation policy.

    key: [evaluation_policy][
        evaluation_policy_action,
        evaluation_policy_action_dist,
        state_action_value_prediction,
        initial_state_value_prediction,
        state_action_marginal_importance_weight,
        state_marginal_importance_weight,
        on_policy_policy_value,
        gamma,
        behavior_policy,
        evaluation_policy,
        dataset_id,
    ]
    

    See also

    scope_rl.ope.input.CreateOPEInput describes the components of input_dict.

  • compared_estimators (list of str, default=None) – Name of compared estimators. If None is given, all the estimators are compared.

  • behavior_policy_name (str, default=None) – Name of the behavior policy.

  • dataset_id (int, default=None) – Id of the logged dataset.

  • metric ({"relative-ee", "se"}, default="relative-ee") – Evaluation metric used to evaluate and compare the estimation performance/accuracy of OPE estimators.

  • return_by_dataframe (bool, default=False) – Whether to return the result in a dataframe format.

Returns:

eval_metric_ope_dict/eval_metric_ope_df – Dictionary/dataframe containing evaluation metric for evaluating the estimation performance/accuracy of OPE estimators. key: [evaluation_policy][OPE_estimator_name]

When behavior_policy_name is None and dataset_id is None, key: [behavior_policy_name][dataset_id][evaluation_policy][OPE_estimator_name]

When behavior_policy_name is None and dataset_id is specified, key: [behavior_policy_name][evaluation_policy][OPE_estimator_name]

When behavior_policy_name is specified and dataset_id is None, key: [dataset_id][OPE_estimator_name]

Return type:

dict or dataframe (, list of dict or dataframe)

visualize_off_policy_estimates(input_dict, compared_estimators=None, behavior_policy_name=None, dataset_id=None, alpha=0.05, ci='bootstrap', n_bootstrap_samples=100, random_state=None, is_relative=False, hue='estimator', sharey=False, fig_dir=None, fig_name='estimated_policy_value.png')

Visualize the policy value estimated by OPE estimators.

Parameters:
  • input_dict (OPEInputDict or MultipleInputDict) –

    Dictionary of the OPE inputs for each evaluation policy.

    key: [evaluation_policy][
        evaluation_policy_action,
        evaluation_policy_action_dist,
        state_action_value_prediction,
        initial_state_value_prediction,
        state_action_marginal_importance_weight,
        state_marginal_importance_weight,
        on_policy_policy_value,
        gamma,
        behavior_policy,
        evaluation_policy,
        dataset_id,
    ]
    

    See also

    scope_rl.ope.input.CreateOPEInput describes the components of input_dict.

  • compared_estimators (list of str, default=None) – Name of compared estimators. If None is given, all the estimators are compared.

  • behavior_policy_name (str, default=None) – Name of the behavior policy.

  • dataset_id (int, default=None) – Id of the logged dataset.

  • alpha (float, default=0.05) – Significance level. The value should be within (0, 1].

  • ci ({"bootstrap", "hoeffding", "bernstein", "ttest"}, default="bootstrap") – Method to estimate the confidence interval.

  • n_bootstrap_samples (int, default=100 (> 0)) – Number of resampling performed in the bootstrap procedure.

  • random_state (int, default=None (>= 0)) – Random state.

  • is_relative (bool, default=False) – If True, we get the estimated policy values of the evaluation policies relative to the on-policy policy value of the behavior policy. (Only applicable when using a single input_dict.)

  • hue ({"estimator", "policy"}, default="estimator") – Hue of the plot.

  • sharey (bool, default=False) – If True, the y-axis will be shared among different estimators or evaluation policies.

  • fig_dir (Path, default=None) – Path to store the bar figure. If None is given, the figure will not be saved.

  • fig_name (str, default="estimated_policy_value.png") – Name of the bar figure.

visualize_policy_value_with_multiple_estimates(input_dict, compared_estimators=None, behavior_policy_name=None, plot_type='ci', hue='estimator', legend=True, sharey=False, fig_dir=None, fig_name='estimated_policy_value_multiple.png')

Visualize the policy value estimated by OPE estimators across multiple logged datasets.

Note

This function is applicable only when MultipleLoggedDataset is used.

Parameters:
  • input_dict (OPEInputDict or MultipleInputDict) –

    Dictionary of the OPE inputs for each evaluation policy.

    key: [evaluation_policy][
        evaluation_policy_action,
        evaluation_policy_action_dist,
        state_action_value_prediction,
        initial_state_value_prediction,
        state_action_marginal_importance_weight,
        state_marginal_importance_weight,
        on_policy_policy_value,
        gamma,
        behavior_policy,
        evaluation_policy,
        dataset_id,
    ]
    

    See also

    scope_rl.ope.input.CreateOPEInput describes the components of input_dict.

  • compared_estimators (list of str, default=None) – Name of compared estimators. If None is given, all the estimators are compared.

  • behavior_policy_name (str, default=None) – Name of the behavior policy.

  • plot_type ({"ci", "scatter", "violin"}, default="ci") – Type of plot. If "ci" is given, we get the empirical average of the estimated values with their estimated confidence intervals. If "scatter" is given, we get a scatter plot of estimated values. If "violin" is given, we get a violin plot of estimated values.

  • hue ({"estimator", "policy"}, default="estimator") – Hue of the plot.

  • legend (bool, default=True) – Whether to include a legend in the scatter plot.

  • sharey (bool, default=False) – If True, the y-axis will be shared among different estimators or evaluation policies.

  • fig_dir (Path, default=None) – Path to store the bar figure. If None is given, the figure will not be saved.

  • fig_name (str, default="estimated_policy_value_multiple.png") – Name of the bar figure.
