scope_rl.ope.ope.CumulativeDistributionOPE

class scope_rl.ope.ope.CumulativeDistributionOPE(logged_dataset, ope_estimators, use_custom_reward_scale=False, scale_min=None, scale_max=None, n_partition=None, bandwidth=1.0, action_scaler=None, disable_reward_after_done=True)

Class to conduct cumulative distribution OPE with multiple estimators simultaneously (applicable to both discrete and continuous action cases).

Imported as: scope_rl.ope.CumulativeDistributionOPE

Note

Cumulative distribution OPE first estimates the following cumulative distribution function (CDF), and then derives statistics from it.

\[F(m, \pi) := \mathbb{E} \left[ \mathbb{I} \left \{ \sum_{t=0}^{T-1} \gamma^t r_t \leq m \right \} \mid \pi \right]\]

where \(\pi\) is the evaluation policy, \(r_t\) is the reward observed at each timestep \(t\), \(T\) is the total number of timesteps in an episode, and \(\gamma\) is the discount factor.

The CDF is itself informative, but it also enables us to calculate the following risk functions.

  • Mean: \(\mu(F) := \int_{G} G \, \mathrm{d}F(G)\)

  • Variance: \(\sigma^2(F) := \int_{G} (G - \mu(F))^2 \, \mathrm{d}F(G)\)

  • \(\alpha\)-quantile: \(Q^{\alpha}(F) := \min \{ G \mid F(G) \geq \alpha \}\)

  • Conditional Value at Risk (CVaR): \(\int_{G} G \, \mathbb{I}\{ G \leq Q^{\alpha}(F) \} \, \mathrm{d}F(G)\)

where we use \(G := \sum_{t=0}^{T-1} \gamma^t r_t\) and \(\mathrm{d}F(G) := \lim_{\Delta \rightarrow 0} F(G) - F(G - \Delta)\).
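The definitions above can be sketched in plain Python. This is a minimal illustration, not the library's implementation: the helper names (discounted_return, empirical_cdf, statistics_from_cdf) are hypothetical, the CDF is estimated on-policy from a handful of made-up trajectories, and the \(\alpha\)-quantile is taken to be the smallest return whose CDF value reaches \(\alpha\).

```python
from bisect import bisect_right


def discounted_return(rewards, gamma):
    """G = sum_t gamma^t * r_t for one trajectory."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


def empirical_cdf(returns, grid):
    """F(m): fraction of trajectories whose return is <= m, for each m in grid."""
    sorted_returns = sorted(returns)
    n = len(sorted_returns)
    return [bisect_right(sorted_returns, m) / n for m in grid]


def statistics_from_cdf(grid, cdf, alpha):
    """Mean, variance, alpha-quantile, and the CVaR integral from a CDF on a sorted grid."""
    # dF(G): probability mass that the CDF places on each grid point
    pmf = [cdf[0]] + [cdf[i] - cdf[i - 1] for i in range(1, len(cdf))]
    mean = sum(g * p for g, p in zip(grid, pmf))
    variance = sum((g - mean) ** 2 * p for g, p in zip(grid, pmf))
    # smallest G whose CDF value reaches alpha (assumed quantile convention)
    quantile = next(g for g, f in zip(grid, cdf) if f >= alpha)
    # integral of G * 1{G <= Q^alpha(F)} dF(G), as written in the note
    cvar = sum(g * p for g, p in zip(grid, pmf) if g <= quantile)
    return mean, variance, quantile, cvar


# toy reward sequences (illustrative values only)
trajectories = [[1.0, 0.0, 1.0], [0.0, 0.0, 0.0], [1.0, 1.0, 1.0], [0.0, 1.0, 0.0]]
returns = [discounted_return(r, gamma=1.0) for r in trajectories]
grid = sorted(set(returns))
cdf = empirical_cdf(returns, grid)
mean, variance, quantile, cvar = statistics_from_cdf(grid, cdf, alpha=0.5)
```

In OPE the CDF is estimated from logged data rather than from on-policy rollouts, but the statistics are derived from the estimated CDF in the same way.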

Parameters:
  • logged_dataset (LoggedDataset or MultipleLoggedDataset) –

    Logged dataset used to conduct OPE.

    key: [
        size,
        n_trajectories,
        step_per_trajectory,
        action_type,
        n_actions,
        action_dim,
        action_keys,
        action_meaning,
        state_dim,
        state_keys,
        state,
        action,
        reward,
        done,
        terminal,
        info,
        pscore,
        behavior_policy,
        dataset_id,
    ]
    

    See also

    scope_rl.dataset.SyntheticDataset describes the components of logged_dataset.

  • ope_estimators (list of BaseCumulativeDistributionOPEEstimator) – List of OPE estimators used to evaluate the policy value of the evaluation policies. Estimators must follow the interface of scope_rl.ope.BaseCumulativeDistributionOPEEstimator.

  • use_custom_reward_scale (bool, default=False) –

    Whether to use a customized reward scale or the reward observed under the behavior policy.

    If True, the reward scale is uniform, following Huang et al. (2021).

    If False, the reward scale follows the one defined in Chandak et al. (2021).

  • scale_min (float, default=None) – Minimum value of the reward scale in the CDF. If use_custom_reward_scale is True, a value must be given.

  • scale_max (float, default=None) – Maximum value of the reward scale in the CDF. If use_custom_reward_scale is True, a value must be given.

  • n_partition (int, default=None) – Number of partitions in the reward scale (x-axis of the CDF). If use_custom_reward_scale is True, a value must be given.

  • bandwidth (float, default=1.0 (> 0)) – Bandwidth hyperparameter of the kernel used in continuous action case.

  • action_scaler (d3rlpy.preprocessing.ActionScaler, default=None) – Scaling factor of action.

  • disable_reward_after_done (bool, default=True) – Whether to apply \(r = 0\) once done is observed in an episode.
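As a rough illustration of disable_reward_after_done, the sketch below zeroes every reward that follows the first done flag in an episode. The helper name mask_rewards_after_done is hypothetical, and keeping the reward observed at the done step itself is an assumption of this sketch rather than documented behavior.

```python
def mask_rewards_after_done(rewards, done):
    """Return a copy of rewards with r = 0 for every step after the first done flag."""
    masked = []
    finished = False
    for reward, is_done in zip(rewards, done):
        # steps after the episode has finished contribute no reward
        masked.append(0.0 if finished else reward)
        if is_done:
            finished = True
    return masked


# toy episode: done is observed at the second step
rewards = [1.0, 2.0, 3.0, 4.0]
done = [False, True, False, False]
masked_rewards = mask_rewards_after_done(rewards, done)
```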

Examples

Preparation:

# import necessary module from SCOPE-RL
from scope_rl.dataset import SyntheticDataset
from scope_rl.policy import EpsilonGreedyHead
from scope_rl.ope import CreateOPEInput
from scope_rl.ope import CumulativeDistributionOPE
from scope_rl.ope.discrete import CumulativeDistributionTIS as CD_IS
from scope_rl.ope.discrete import CumulativeDistributionSNTIS as CD_SNIS

# import necessary module from other libraries
import gym
import rtbgym
from d3rlpy.algos import DoubleDQNConfig
from d3rlpy.dataset import create_fifo_replay_buffer
from d3rlpy.algos import ConstantEpsilonGreedy

# initialize environment
env = gym.make("RTBEnv-discrete-v0")

# define (RL) agent (i.e., policy) and train on the environment
ddqn = DoubleDQNConfig().create()
buffer = create_fifo_replay_buffer(
    limit=10000,
    env=env,
)
explorer = ConstantEpsilonGreedy(
    epsilon=0.3,
)
ddqn.fit_online(
    env=env,
    buffer=buffer,
    explorer=explorer,
    n_steps=10000,
    n_steps_per_epoch=1000,
)

# convert ddqn policy to stochastic data collection policy
behavior_policy = EpsilonGreedyHead(
    ddqn,
    n_actions=env.action_space.n,
    epsilon=0.3,
    name="ddqn_epsilon_0.3",
    random_state=12345,
)

# initialize dataset class
dataset = SyntheticDataset(
    env=env,
    max_episode_steps=env.step_per_episode,
)

# data collection
logged_dataset = dataset.obtain_episodes(
    behavior_policies=behavior_policy,
    n_trajectories=100,
    random_state=12345,
)

Create Input for OPE:

# evaluation policy
ddqn_ = EpsilonGreedyHead(
    base_policy=ddqn,
    n_actions=env.action_space.n,
    name="ddqn",
    epsilon=0.0,
    random_state=12345
)
random_ = EpsilonGreedyHead(
    base_policy=ddqn,
    n_actions=env.action_space.n,
    name="random",
    epsilon=1.0,
    random_state=12345
)

# create input for off-policy evaluation (OPE)
prep = CreateOPEInput(
    env=env,
)
input_dict = prep.obtain_whole_inputs(
    logged_dataset=logged_dataset,
    evaluation_policies=[ddqn_, random_],
    n_trajectories_on_policy_evaluation=100,
    random_state=12345,
)

Cumulative Distribution OPE:

# OPE
cd_ope = CumulativeDistributionOPE(
    logged_dataset=logged_dataset,
    ope_estimators=[
        CD_IS(estimator_name="cdf_is"),
        CD_SNIS(estimator_name="cdf_snis"),
    ],
)
variance_dict = cd_ope.estimate_variance(
    input_dict=input_dict,
)

Output:

>>> variance_dict

{'ddqn': {'on_policy': 18.6216, 'cdf_is': 19.201934808340265, 'cdf_snis': 25.315555555555555},
'random': {'on_policy': 21.512806887023064, 'cdf_is': 13.591854902638273, 'cdf_snis': 7.158545530356914}}
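The returned dictionary is keyed by evaluation policy and then by estimator name, so it can be post-processed directly. The sketch below copies the output shown above and picks, for each policy, the estimator whose variance estimate is closest to the on-policy value. The helper name closest_estimator is hypothetical.

```python
# values copied from the example output above
variance_dict = {
    "ddqn": {"on_policy": 18.6216, "cdf_is": 19.201934808340265, "cdf_snis": 25.315555555555555},
    "random": {"on_policy": 21.512806887023064, "cdf_is": 13.591854902638273, "cdf_snis": 7.158545530356914},
}


def closest_estimator(estimates):
    """Name of the estimator with the smallest absolute error against the on-policy value."""
    truth = estimates["on_policy"]
    errors = {name: abs(value - truth) for name, value in estimates.items() if name != "on_policy"}
    return min(errors, key=errors.get)


best = {policy: closest_estimator(estimates) for policy, estimates in variance_dict.items()}
```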

References

Audrey Huang, Liu Leqi, Zachary C. Lipton, and Kamyar Azizzadenesheli. “Off-Policy Risk Assessment for Markov Decision Processes.” 2022.

Audrey Huang, Liu Leqi, Zachary C. Lipton, and Kamyar Azizzadenesheli. “Off-Policy Risk Assessment in Contextual Bandits.” 2021.

Yash Chandak, Scott Niekum, Bruno Castro da Silva, Erik Learned-Miller, Emma Brunskill, and Philip S. Thomas. “Universal Off-Policy Evaluation.” 2021.

Attributes:
action_scaler
estimators_name
n_partition
scale_max
scale_min

Methods

estimate_conditional_value_at_risk(input_dict)

Estimate the conditional value at risk of the trajectory-wise reward under the given evaluation policies.

estimate_cumulative_distribution_function(...)

Estimate the cumulative distribution of the trajectory-wise reward under the given evaluation policies.

estimate_interquartile_range(input_dict[, ...])

Estimate the interquartile range of the trajectory-wise reward under the given evaluation policies.

estimate_mean(input_dict[, ...])

Estimate the expected trajectory-wise reward (i.e., policy value) of the evaluation policies.

estimate_variance(input_dict[, ...])

Estimate the variance of the trajectory-wise reward under the given evaluation policies.

obtain_reward_scale()

Obtain the reward scale (x-axis) for the cumulative distribution function.

visualize_conditional_value_at_risk(input_dict)

Visualize the conditional value at risk estimated by OPE estimators.

visualize_conditional_value_at_risk_with_multiple_estimates(...)

Visualize the conditional value at risk of the trajectory-wise reward under the evaluation policy estimated by OPE estimators across multiple logged datasets.

visualize_cumulative_distribution_function(...)

Visualize the cumulative distribution function estimated by OPE estimators.

visualize_cumulative_distribution_function_with_multiple_estimates(...)

Visualize the cumulative distribution function estimated by OPE estimators across multiple logged datasets.

visualize_interquartile_range(input_dict[, ...])

Visualize the interquartile range estimated by OPE estimators.

visualize_lower_quartile_with_multiple_estimates(...)

Visualize the lower quartile of the trajectory-wise reward under the evaluation policy estimated by OPE estimators across multiple logged datasets.

visualize_policy_value(input_dict[, ...])

Visualize the policy value estimated by OPE estimators.

visualize_policy_value_with_multiple_estimates(...)

Visualize the policy value estimated by OPE estimators across multiple logged datasets.

visualize_variance_with_multiple_estimates(...)

Visualize the variance of the trajectory-wise reward under the evaluation policy estimated by OPE estimators across multiple logged datasets.

obtain_reward_scale()

Obtain the reward scale (x-axis) for the cumulative distribution function.

Returns:

reward_scale – Reward Scale (x-axis of the cumulative distribution function).

Return type:

ndarray of shape (n_unique_reward, ) or (n_partition, )
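The two return shapes correspond to the two reward-scale modes described in the class parameters. A minimal sketch of that distinction follows, assuming the custom scale is an evenly spaced grid of n_partition points between scale_min and scale_max and the data-driven scale is the sorted set of observed trajectory-wise rewards. The helper build_reward_scale is hypothetical and only mirrors the documented parameters.

```python
def build_reward_scale(
    trajectory_returns,
    use_custom_reward_scale=False,
    scale_min=None,
    scale_max=None,
    n_partition=None,
):
    """Return the x-axis of the CDF as a sorted list of reward values."""
    if use_custom_reward_scale:
        if scale_min is None or scale_max is None or n_partition is None:
            raise ValueError("scale_min, scale_max, and n_partition must be given")
        if n_partition < 2:
            raise ValueError("n_partition must be at least 2 in this sketch")
        # uniform grid with n_partition points, endpoints included
        step = (scale_max - scale_min) / (n_partition - 1)
        return [scale_min + i * step for i in range(n_partition)]
    # data-driven scale: one grid point per unique observed return
    return sorted(set(trajectory_returns))


observed = [1.0, 3.0, 1.0, 2.0]
data_driven_scale = build_reward_scale(observed)
custom_scale = build_reward_scale(
    observed, use_custom_reward_scale=True, scale_min=0.0, scale_max=10.0, n_partition=5
)
```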

estimate_cumulative_distribution_function(input_dict, compared_estimators=None, behavior_policy_name=None, dataset_id=None, reward_scale=None)

Estimate the cumulative distribution of the trajectory-wise reward under the given evaluation policies.

Parameters:
  • input_dict (OPEInputDict or MultipleInputDict) –

    Dictionary of the OPE inputs for each evaluation policy.

    key: [evaluation_policy][
        evaluation_policy_action,
        evaluation_policy_action_dist,
        state_action_value_prediction,
        initial_state_value_prediction,
        state_action_marginal_importance_weight,
        state_marginal_importance_weight,
        on_policy_policy_value,
        gamma,
        behavior_policy,
        evaluation_policy,
        dataset_id,
    ]
    

    See also

    scope_rl.ope.input.CreateOPEInput describes the components of input_dict.

  • compared_estimators (list of str, default=None) – Name of compared estimators. If None is given, all the estimators are compared.

  • behavior_policy_name (str, default=None) – Name of the behavior policy.

  • dataset_id (int, default=None) – Id of the logged dataset.

  • reward_scale (array-like of shape (n_partition, ), default=None) – Scale of the trajectory-wise reward used for x-axis of the CDF plot.

Returns:

cumulative_distribution_dict – Dictionary containing the cumulative distribution of each evaluation policy estimated by OPE estimators. key: [evaluation_policy][OPE_estimator_name]

When behavior_policy_name is None and dataset_id is None, key: [behavior_policy_name][dataset_id][evaluation_policy][OPE_estimator_name]

When behavior_policy_name is None and dataset_id is specified, key: [behavior_policy_name][evaluation_policy][OPE_estimator_name]

When behavior_policy_name is specified and dataset_id is None, key: [dataset_id][OPE_estimator_name]

Return type:

dict or list of dict
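To give a feel for what an importance-sampling CDF estimator computes, here is a minimal sketch of a trajectory-wise importance sampling estimate and its self-normalized variant. The function importance_sampling_cdf, the toy returns, and the toy weights are assumptions for illustration; the estimators shipped with the library may differ in detail.

```python
def importance_sampling_cdf(returns, weights, reward_scale, self_normalized=False):
    """Estimate F(m) for each m in reward_scale from weighted logged returns.

    returns: trajectory-wise (discounted) rewards observed under the behavior policy
    weights: trajectory-wise importance weights of the evaluation policy
    """
    # plain IS divides by the number of trajectories, self-normalized IS by the weight sum
    denominator = sum(weights) if self_normalized else len(returns)
    cdf = []
    for m in reward_scale:
        value = sum(w for g, w in zip(returns, weights) if g <= m) / denominator
        # clip so that the estimate stays a valid probability
        cdf.append(min(max(value, 0.0), 1.0))
    return cdf


returns = [0.0, 1.0, 2.0]
weights = [0.5, 1.0, 2.5]
reward_scale = [0.0, 1.0, 2.0]
cdf_is = importance_sampling_cdf(returns, weights, reward_scale)
cdf_snis = importance_sampling_cdf(returns, weights, reward_scale, self_normalized=True)
```

Without clipping, the plain estimate can exceed 1 when the weights sum to more than the number of trajectories, while the self-normalized estimate always ends at 1.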

estimate_mean(input_dict, compared_estimators=None, behavior_policy_name=None, dataset_id=None)

Estimate the expected trajectory-wise reward (i.e., policy value) of the evaluation policies.

Parameters:
  • input_dict (OPEInputDict or MultipleInputDict) –

    Dictionary of the OPE inputs for each evaluation policy.

    key: [evaluation_policy][
        evaluation_policy_action,
        evaluation_policy_action_dist,
        state_action_value_prediction,
        initial_state_value_prediction,
        state_action_marginal_importance_weight,
        state_marginal_importance_weight,
        on_policy_policy_value,
        gamma,
        behavior_policy,
        evaluation_policy,
        dataset_id,
    ]
    

    See also

    scope_rl.ope.input.CreateOPEInput describes the components of input_dict.

  • compared_estimators (list of str, default=None) – Name of compared estimators. If None is given, all the estimators are compared.

  • behavior_policy_name (str, default=None) – Name of the behavior policy.

  • dataset_id (int, default=None) – Id of the logged dataset.

Returns:

mean_dict – Dictionary containing the mean trajectory-wise reward of each evaluation policy estimated by OPE estimators. key: [evaluation_policy][OPE_estimator_name]

When behavior_policy_name is None and dataset_id is None, key: [behavior_policy_name][dataset_id][evaluation_policy][OPE_estimator_name]

When behavior_policy_name is None and dataset_id is specified, key: [behavior_policy_name][evaluation_policy][OPE_estimator_name]

When behavior_policy_name is specified and dataset_id is None, key: [dataset_id][OPE_estimator_name]

Return type:

dict or list of dict

estimate_variance(input_dict, compared_estimators=None, behavior_policy_name=None, dataset_id=None)

Estimate the variance of the trajectory-wise reward under the given evaluation policies.

Parameters:
  • input_dict (OPEInputDict or MultipleInputDict) –

    Dictionary of the OPE inputs for each evaluation policy.

    key: [evaluation_policy][
        evaluation_policy_action,
        evaluation_policy_action_dist,
        state_action_value_prediction,
        initial_state_value_prediction,
        state_action_marginal_importance_weight,
        state_marginal_importance_weight,
        on_policy_policy_value,
        gamma,
        behavior_policy,
        evaluation_policy,
        dataset_id,
    ]
    

    See also

    scope_rl.ope.input.CreateOPEInput describes the components of input_dict.

  • compared_estimators (list of str, default=None) – Name of compared estimators. If None is given, all the estimators are compared.

  • behavior_policy_name (str, default=None) – Name of the behavior policy.

  • dataset_id (int, default=None) – Id of the logged dataset.

Returns:

variance_dict – Dictionary containing the variance of trajectory-wise reward of each evaluation policy estimated by OPE estimators. key: [evaluation_policy][OPE_estimator_name]

When behavior_policy_name is None and dataset_id is None, key: [behavior_policy_name][dataset_id][evaluation_policy][OPE_estimator_name]

When behavior_policy_name is None and dataset_id is specified, key: [behavior_policy_name][evaluation_policy][OPE_estimator_name]

When behavior_policy_name is specified and dataset_id is None, key: [dataset_id][OPE_estimator_name]

Return type:

dict or list of dict

estimate_conditional_value_at_risk(input_dict, compared_estimators=None, behavior_policy_name=None, dataset_id=None, alphas=None)

Estimate the conditional value at risk of the trajectory-wise reward under the given evaluation policies.

Parameters:
  • input_dict (OPEInputDict or MultipleInputDict) –

    Dictionary of the OPE inputs for each evaluation policy.

    key: [evaluation_policy][
        evaluation_policy_action,
        evaluation_policy_action_dist,
        state_action_value_prediction,
        initial_state_value_prediction,
        state_action_marginal_importance_weight,
        state_marginal_importance_weight,
        on_policy_policy_value,
        gamma,
        behavior_policy,
        evaluation_policy,
        dataset_id,
    ]
    

    See also

    scope_rl.ope.input.CreateOPEInput describes the components of input_dict.

  • compared_estimators (list of str, default=None) – Name of compared estimators. If None is given, all the estimators are compared.

  • behavior_policy_name (str, default=None) – Name of the behavior policy.

  • dataset_id (int, default=None) – Id of the logged dataset.

  • alphas ({float, array-like of shape (n_alpha, )}, default=None) – Set of proportions of the shaded region. The value(s) should be within [0, 1). If None is given, np.linspace(0, 1, 21) will be used.

Returns:

conditional_value_at_risk_dict – Dictionary containing the conditional value at risk of trajectory-wise reward of each evaluation policy estimated by OPE estimators. key: [evaluation_policy][OPE_estimator_name]

When behavior_policy_name is None and dataset_id is None, key: [behavior_policy_name][dataset_id][evaluation_policy][OPE_estimator_name]

When behavior_policy_name is None and dataset_id is specified, key: [behavior_policy_name][evaluation_policy][OPE_estimator_name]

When behavior_policy_name is specified and dataset_id is None, key: [dataset_id][OPE_estimator_name]

Return type:

dict or list of dict

estimate_interquartile_range(input_dict, compared_estimators=None, behavior_policy_name=None, dataset_id=None, alpha=0.05)

Estimate the interquartile range of the trajectory-wise reward under the given evaluation policies.

Parameters:
  • input_dict (OPEInputDict or MultipleInputDict) –

    Dictionary of the OPE inputs for each evaluation policy.

    key: [evaluation_policy][
        evaluation_policy_action,
        evaluation_policy_action_dist,
        state_action_value_prediction,
        initial_state_value_prediction,
        state_action_marginal_importance_weight,
        state_marginal_importance_weight,
        on_policy_policy_value,
        gamma,
        behavior_policy,
        evaluation_policy,
        dataset_id,
    ]
    

    See also

    scope_rl.ope.input.CreateOPEInput describes the components of input_dict.

  • compared_estimators (list of str, default=None) – Name of compared estimators. If None is given, all the estimators are compared.

  • behavior_policy_name (str, default=None) – Name of the behavior policy.

  • dataset_id (int, default=None) – Id of the logged dataset.

  • alpha (float, default=0.05) – Proportion of the shaded region. The value should be within (0, 1].

Returns:

interquartile_range_dict – Dictionary containing the interquartile range of trajectory-wise reward of each evaluation policy estimated by OPE estimators. key: [evaluation_policy][OPE_estimator_name][quartile_name]

Return type:

dict or list of dict
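One way to read the [quartile_name] level of the returned dictionary is as a small set of quantiles taken from the estimated CDF. The sketch below assumes the range is bounded by the alpha and 1 - alpha quantiles with the median in between. The key names ("median", "lower_quartile", "upper_quartile") and both helper functions are assumptions, not confirmed library output.

```python
def quantile_from_cdf(reward_scale, cdf, level):
    """Smallest reward on the scale whose CDF value reaches the given level."""
    for reward, probability in zip(reward_scale, cdf):
        if probability >= level:
            return reward
    return reward_scale[-1]


def interquartile_range_from_cdf(reward_scale, cdf, alpha=0.05):
    """Median plus the alpha and (1 - alpha) quantiles of an estimated CDF."""
    return {
        "median": quantile_from_cdf(reward_scale, cdf, 0.5),
        "lower_quartile": quantile_from_cdf(reward_scale, cdf, alpha),
        "upper_quartile": quantile_from_cdf(reward_scale, cdf, 1 - alpha),
    }


# toy estimated CDF on a five-point reward scale
reward_scale = [0.0, 1.0, 2.0, 3.0, 4.0]
cdf = [0.02, 0.30, 0.60, 0.96, 1.00]
quartiles = interquartile_range_from_cdf(reward_scale, cdf, alpha=0.05)
```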

visualize_cumulative_distribution_function(input_dict, compared_estimators=None, behavior_policy_name=None, dataset_id=None, hue='estimator', legend=True, n_cols=None, fig_dir=None, fig_name='estimated_cumulative_distribution_function.png')

Visualize the cumulative distribution function estimated by OPE estimators.

Parameters:
  • input_dict (OPEInputDict or MultipleInputDict) –

    Dictionary of the OPE inputs for each evaluation policy.

    key: [evaluation_policy][
        evaluation_policy_action,
        evaluation_policy_action_dist,
        state_action_value_prediction,
        initial_state_value_prediction,
        state_action_marginal_importance_weight,
        state_marginal_importance_weight,
        on_policy_policy_value,
        gamma,
        behavior_policy,
        evaluation_policy,
        dataset_id,
    ]
    

    See also

    scope_rl.ope.input.CreateOPEInput describes the components of input_dict.

  • compared_estimators (list of str, default=None) – Name of compared estimators. If None is given, all the estimators are compared.

  • behavior_policy_name (str, default=None) – Name of the behavior policy.

  • dataset_id (int, default=None) – Id of the logged dataset.

  • hue ({"estimator", "policy"}, default="estimator") – Hue of the plot.

  • legend (bool, default=True) – Whether to include a legend in the figure.

  • n_cols (int, default=None) – Number of columns in the figure.

  • fig_dir (Path, default=None) – Path to store the bar figure. If None is given, the figure will not be saved.

  • fig_name (str, default="estimated_cumulative_distribution_function.png") – Name of the bar figure.

visualize_policy_value(input_dict, compared_estimators=None, behavior_policy_name=None, dataset_id=None, alpha=0.05, is_relative=False, hue='estimator', sharey=False, fig_dir=None, fig_name='estimated_policy_value.png')

Visualize the policy value estimated by OPE estimators.

Parameters:
  • input_dict (OPEInputDict or MultipleInputDict) –

    Dictionary of the OPE inputs for each evaluation policy.

    key: [evaluation_policy][
        evaluation_policy_action,
        evaluation_policy_action_dist,
        state_action_value_prediction,
        initial_state_value_prediction,
        state_action_marginal_importance_weight,
        state_marginal_importance_weight,
        on_policy_policy_value,
        gamma,
        behavior_policy,
        evaluation_policy,
        dataset_id,
    ]
    

    See also

    scope_rl.ope.input.CreateOPEInput describes the components of input_dict.

  • compared_estimators (list of str, default=None) – Name of compared estimators. If None is given, all the estimators are compared.

  • behavior_policy_name (str, default=None) – Name of the behavior policy.

  • dataset_id (int, default=None) – Id of the logged dataset.

  • alpha (float, default=0.05) – Significance level. The value should be within [0, 1).

  • is_relative (bool, default=False) – If True, we get the estimated policy values of the evaluation policies relative to the ground-truth policy value of the behavior policy.

  • hue ({"estimator", "policy"}, default="estimator") – Hue of the plot.

  • sharey (bool, default=False) – If True, the y-axis will be shared among different evaluation policies.

  • fig_dir (Path, default=None) – Path to store the bar figure. If None is given, the figure will not be saved.

  • fig_name (str, default="estimated_policy_value.png") – Name of the bar figure.

visualize_conditional_value_at_risk(input_dict, compared_estimators=None, behavior_policy_name=None, dataset_id=None, alphas=None, hue='estimator', legend=True, n_cols=None, sharey=False, fig_dir=None, fig_name='estimated_conditional_value_at_risk.png')

Visualize the conditional value at risk estimated by OPE estimators.

Parameters:
  • input_dict (OPEInputDict or MultipleInputDict) –

    Dictionary of the OPE inputs for each evaluation policy.

    key: [evaluation_policy][
        evaluation_policy_action,
        evaluation_policy_action_dist,
        state_action_value_prediction,
        initial_state_value_prediction,
        state_action_marginal_importance_weight,
        state_marginal_importance_weight,
        on_policy_policy_value,
        gamma,
        behavior_policy,
        evaluation_policy,
        dataset_id,
    ]
    

    See also

    scope_rl.ope.input.CreateOPEInput describes the components of input_dict.

  • compared_estimators (list of str, default=None) – Name of compared estimators. If None is given, all the estimators are compared.

  • behavior_policy_name (str, default=None) – Name of the behavior policy.

  • dataset_id (int, default=None) – Id of the logged dataset.

  • alphas (array-like of shape (n_alpha, ), default=None) – Set of proportions of the shaded region. The values should be within [0, 1). If None is given, np.linspace(0, 1, 21) will be used.

  • hue ({"estimator", "policy"}, default="estimator") – Hue of the plot.

  • legend (bool, default=True) – Whether to include a legend in the figure.

  • n_cols (int, default=None) – Number of columns in the figure.

  • sharey (bool, default=False) – This parameter is for API consistency.

  • fig_dir (Path, default=None) – Path to store the bar figure. If None is given, the figure will not be saved.

  • fig_name (str, default="estimated_conditional_value_at_risk.png") – Name of the bar figure.

visualize_interquartile_range(input_dict, compared_estimators=None, behavior_policy_name=None, dataset_id=None, alpha=0.05, hue='estimator', sharey=False, fig_dir=None, fig_name='estimated_interquartile_range.png')

Visualize the interquartile range estimated by OPE estimators.

Parameters:
  • input_dict (OPEInputDict or MultipleInputDict) –

    Dictionary of the OPE inputs for each evaluation policy.

    key: [evaluation_policy][
        evaluation_policy_action,
        evaluation_policy_action_dist,
        state_action_value_prediction,
        initial_state_value_prediction,
        state_action_marginal_importance_weight,
        state_marginal_importance_weight,
        on_policy_policy_value,
        gamma,
        behavior_policy,
        evaluation_policy,
        dataset_id,
    ]
    

    See also

    scope_rl.ope.input.CreateOPEInput describes the components of input_dict.

  • compared_estimators (list of str, default=None) – Name of compared estimators. If None is given, all the estimators are compared.

  • behavior_policy_name (str, default=None) – Name of the behavior policy.

  • dataset_id (int, default=None) – Id of the logged dataset.

  • alpha (float, default=0.05) – Significance level. The value should be within [0, 1).

  • hue ({"estimator", "policy"}, default="estimator") – Hue of the plot.

  • sharey (bool, default=False) – If True, the y-axis will be shared among different evaluation policies.

  • fig_dir (Path, default=None) – Path to store the bar figure. If None is given, the figure will not be saved.

  • fig_name (str, default="estimated_interquartile_range.png") – Name of the bar figure.

visualize_cumulative_distribution_function_with_multiple_estimates(input_dict, compared_estimators=None, behavior_policy_name=None, scale_min=None, scale_max=None, n_partition=None, plot_type='ci_hue', hue='estimator', legend=True, n_cols=None, sharey=False, fig_dir=None, fig_name='estimated_policy_value_multiple.png')

Visualize the cumulative distribution function estimated by OPE estimators across multiple logged datasets.

Note

This function is applicable only when MultipleLoggedDataset is used and MultipleInputDict is collected by the same evaluation policy across logged datasets.

This function is not applicable when the data-driven reward scale is used. Set scale_min, scale_max, and n_partition to use it.

Parameters:
  • input_dict (MultipleInputDict) –

    Dictionary of the OPE inputs for each evaluation policy.

    key: [evaluation_policy][
        evaluation_policy_action,
        evaluation_policy_action_dist,
        state_action_value_prediction,
        initial_state_value_prediction,
        state_action_marginal_importance_weight,
        state_marginal_importance_weight,
        on_policy_policy_value,
        gamma,
        behavior_policy,
        evaluation_policy,
        dataset_id,
    ]
    

    See also

    scope_rl.ope.input.CreateOPEInput describes the components of input_dict.

  • compared_estimators (list of str, default=None) – Name of compared estimators. If None is given, all the estimators are compared.

  • behavior_policy_name (str, default=None) – Name of the behavior policy.

  • plot_type ({"ci_hue", "ci_behavior_policy", "enumerate"}, default="ci_hue") – Type of plot. If “ci_hue” or “ci_behavior_policy” is given, the method visualizes the average cumulative distribution function and its 95% confidence intervals based on multiple estimates. If “enumerate” is given, we get a scatter plot of estimated values.

  • hue ({"estimator", "policy"}, default="estimator") – Hue of the plot.

  • legend (bool, default=True) – Whether to include a legend in the scatter plot.

  • sharey (bool, default=False) – If True, the y-axis will be shared among different estimators or evaluation policies.

  • fig_dir (Path, default=None) – Path to store the bar figure. If None is given, the figure will not be saved.

  • fig_name (str, default="estimated_policy_value_multiple.png") – Name of the bar figure.

visualize_policy_value_with_multiple_estimates(input_dict, compared_estimators=None, behavior_policy_name=None, plot_type='ci', hue='estimator', legend=True, sharey=False, fig_dir=None, fig_name='estimated_policy_value_multiple.png')

Visualize the policy value estimated by OPE estimators across multiple logged datasets.

Note

This function is applicable only when MultipleLoggedDataset is used and MultipleInputDict is collected by the same evaluation policy across logged datasets.

Parameters:
  • input_dict (OPEInputDict or MultipleInputDict) –

    Dictionary of the OPE inputs for each evaluation policy.

    key: [evaluation_policy][
        evaluation_policy_action,
        evaluation_policy_action_dist,
        state_action_value_prediction,
        initial_state_value_prediction,
        state_action_marginal_importance_weight,
        state_marginal_importance_weight,
        on_policy_policy_value,
        gamma,
        behavior_policy,
        evaluation_policy,
        dataset_id,
    ]
    

    See also

    scope_rl.ope.input.CreateOPEInput describes the components of input_dict.

  • compared_estimators (list of str, default=None) – Name of compared estimators. If None is given, all the estimators are compared.

  • behavior_policy_name (str, default=None) – Name of the behavior policy.

  • plot_type ({"ci", "scatter", "violin"}, default="ci") – Type of plot. If “ci” is given, we get the empirical average of the estimated values with their estimated confidence intervals. If “scatter” is given, we get a scatter plot of estimated values.

  • hue ({"estimator", "policy"}, default="estimator") – Hue of the plot.

  • legend (bool, default=True) – Whether to include a legend in the scatter plot.

  • sharey (bool, default=False) – If True, the y-axis will be shared among different estimators or evaluation policies.

  • fig_dir (Path, default=None) – Path to store the bar figure. If None is given, the figure will not be saved.

  • fig_name (str, default="estimated_policy_value_multiple.png") – Name of the bar figure.
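The “ci” plot type reports an empirical average of the estimates together with confidence intervals. How the interval is computed is not stated here, so the sketch below uses a normal approximation (mean plus or minus 1.96 standard errors) purely as a stand-in. The helper mean_with_confidence_interval and the toy estimates are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def mean_with_confidence_interval(estimates, z=1.96):
    """Average of per-dataset estimates with a normal-approximation 95% interval."""
    center = mean(estimates)
    # standard error of the mean across logged datasets
    half_width = z * stdev(estimates) / sqrt(len(estimates))
    return center, center - half_width, center + half_width


# one policy value estimate per logged dataset (illustrative numbers)
estimates = [18.0, 19.0, 20.0, 21.0, 22.0]
center, lower, upper = mean_with_confidence_interval(estimates)
```

A bootstrap interval would be an equally reasonable choice when only a few logged datasets are available.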

visualize_variance_with_multiple_estimates(input_dict, compared_estimators=None, behavior_policy_name=None, plot_type='ci', hue='estimator', legend=True, sharey=False, fig_dir=None, fig_name='estimated_variance_multiple.png')

Visualize the variance of the trajectory-wise reward under the evaluation policy estimated by OPE estimators across multiple logged datasets.

Note

This function is applicable only when MultipleLoggedDataset is used and MultipleInputDict is collected by the same evaluation policy across logged datasets.

Parameters:
  • input_dict (OPEInputDict or MultipleInputDict) –

    Dictionary of the OPE inputs for each evaluation policy.

    key: [evaluation_policy][
        evaluation_policy_action,
        evaluation_policy_action_dist,
        state_action_value_prediction,
        initial_state_value_prediction,
        state_action_marginal_importance_weight,
        state_marginal_importance_weight,
        on_policy_policy_value,
        gamma,
        behavior_policy,
        evaluation_policy,
        dataset_id,
    ]
    

    See also

    scope_rl.ope.input.CreateOPEInput describes the components of input_dict.

  • compared_estimators (list of str, default=None) – Name of compared estimators. If None is given, all the estimators are compared.

  • behavior_policy_name (str, default=None) – Name of the behavior policy.

  • plot_type ({"ci", "scatter", "violin"}, default="ci") – Type of plot. If “ci” is given, we get the empirical average of the estimated values with their estimated confidence intervals. If “scatter” is given, we get a scatter plot of estimated values.

  • hue ({"estimator", "policy"}, default="estimator") – Hue of the plot.

  • legend (bool, default=True) – Whether to include a legend in the scatter plot.

  • sharey (bool, default=False) – If True, the y-axis will be shared among different estimators or evaluation policies.

  • fig_dir (Path, default=None) – Path to store the bar figure. If None is given, the figure will not be saved.

  • fig_name (str, default="estimated_variance_multiple.png") – Name of the bar figure.

visualize_conditional_value_at_risk_with_multiple_estimates(input_dict, compared_estimators=None, behavior_policy_name=None, alpha=0.05, plot_type='ci', hue='estimator', legend=True, sharey=False, fig_dir=None, fig_name='estimated_conditional_value_at_risk_multiple.png')

Visualize the conditional value at risk of the trajectory-wise reward under the evaluation policy estimated by OPE estimators across multiple logged datasets.

Note

This function is applicable only when MultipleLoggedDataset is used and MultipleInputDict is collected by the same evaluation policy across logged datasets.

Parameters:
  • input_dict (OPEInputDict or MultipleInputDict) –

    Dictionary of the OPE inputs for each evaluation policy.

    key: [evaluation_policy][
        evaluation_policy_action,
        evaluation_policy_action_dist,
        state_action_value_prediction,
        initial_state_value_prediction,
        state_action_marginal_importance_weight,
        state_marginal_importance_weight,
        on_policy_policy_value,
        gamma,
        behavior_policy,
        evaluation_policy,
        dataset_id,
    ]
    

    See also

    scope_rl.ope.input.CreateOPEInput describes the components of input_dict.

  • compared_estimators (list of str, default=None) – Name of compared estimators. If None is given, all the estimators are compared.

  • behavior_policy_name (str, default=None) – Name of the behavior policy.

  • alpha (float, default=0.05) – Proportion of the shaded region in the CVaR estimate. The value should be within [0, 1).

  • plot_type ({"ci", "scatter", "violin"}, default="ci") – Type of plot. If “ci” is given, we get the empirical average of the estimated values with their estimated confidence intervals. If “scatter” is given, we get a scatter plot of estimated values.

  • hue ({"estimator", "policy"}, default="estimator") – Hue of the plot.

  • legend (bool, default=True) – Whether to include a legend in the scatter plot.

  • sharey (bool, default=False) – If True, the y-axis will be shared among different estimators or evaluation policies.

  • fig_dir (Path, default=None) – Path to store the bar figure. If None is given, the figure will not be saved.

  • fig_name (str, default="estimated_conditional_value_at_risk_multiple.png") – Name of the bar figure.

visualize_lower_quartile_with_multiple_estimates(input_dict, compared_estimators=None, behavior_policy_name=None, alpha=0.05, plot_type='ci', hue='estimator', legend=True, sharey=False, fig_dir=None, fig_name='estimated_conditional_value_at_risk_multiple.png')

Visualize the lower quartile of the trajectory-wise reward under the evaluation policy estimated by OPE estimators across multiple logged datasets.

Note

This function is applicable only when MultipleLoggedDataset is used and MultipleInputDict is collected by the same evaluation policy across logged datasets.

Parameters:
  • input_dict (OPEInputDict or MultipleInputDict) –

    Dictionary of the OPE inputs for each evaluation policy.

    key: [evaluation_policy][
        evaluation_policy_action,
        evaluation_policy_action_dist,
        state_action_value_prediction,
        initial_state_value_prediction,
        state_action_marginal_importance_weight,
        state_marginal_importance_weight,
        on_policy_policy_value,
        gamma,
        behavior_policy,
        evaluation_policy,
        dataset_id,
    ]
    

    See also

    scope_rl.ope.input.CreateOPEInput describes the components of input_dict.

  • compared_estimators (list of str, default=None) – Name of compared estimators. If None is given, all the estimators are compared.

  • behavior_policy_name (str, default=None) – Name of the behavior policy.

  • alpha (float, default=0.05) – Proportion of the shaded region in the CVaR estimate. The value should be within [0, 1).

  • plot_type ({"ci", "scatter", "violin"}, default="ci") – Type of plot. If “ci” is given, we get the empirical average of the estimated values with their estimated confidence intervals. If “scatter” is given, we get a scatter plot of estimated values.

  • hue ({"estimator", "policy"}, default="estimator") – Hue of the plot.

  • legend (bool, default=True) – Whether to include a legend in the scatter plot.

  • sharey (bool, default=False) – If True, the y-axis will be shared among different estimators or evaluation policies.

  • fig_dir (Path, default=None) – Path to store the bar figure. If None is given, the figure will not be saved.

  • fig_name (str, default="estimated_conditional_value_at_risk_multiple.png") – Name of the bar figure.
