scope_rl.ope.ops.OffPolicySelection

class scope_rl.ope.ops.OffPolicySelection(ope=None, cumulative_distribution_ope=None)

Class to conduct OPS and evaluation of OPE/OPS with multiple estimators simultaneously.

Imported as: scope_rl.ope.OffPolicySelection

Note

Off-Policy Selection (OPS)

OPS selects the “best” policy among several candidates based on the policy value or other statistics estimated by OPE.

\[\hat{\pi} := {\arg \max}_{\pi \in \Pi} \hat{J}(\pi)\]

where \(\Pi\) is a set of candidate policies and \(\hat{J}(\cdot)\) is an OPE estimate of the policy performance. Below, we describe two types of OPE used to estimate this performance.
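
As a minimal sketch of the arg max above, assume the OPE estimates are already available as a plain dictionary. The policy names, values, and helper functions below are hypothetical and are not part of SCOPE-RL.

```python
# Hypothetical OPE estimates J_hat(pi) for three candidate policies.
estimated_policy_value = {"ddqn": 21.4, "cql": 18.0, "random": 0.4}


def select_best_policy(estimates):
    """Return the policy name with the largest estimated value (the arg max)."""
    return max(estimates, key=estimates.get)


def rank_policies(estimates):
    """Return policy names sorted from highest to lowest estimated value."""
    return sorted(estimates, key=estimates.get, reverse=True)


best_policy = select_best_policy(estimated_policy_value)
ranking = rank_policies(estimated_policy_value)
print(best_policy, ranking)
```

The selection is only as good as the estimates: a different estimator can produce a different ranking of the same candidates.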

Off-Policy Evaluation (OPE)

(Basic) OPE estimates the expected policy performance, called the policy value.

\[V(\pi) := \mathbb{E} \left[ \sum_{t=1}^T \gamma^{t-1} r_t \mid \pi \right]\]

where \(r_t\) is the reward observed at each timestep \(t\), \(T\) is the total number of timesteps in an episode, and \(\gamma\) is the discount factor.
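
To make the definition concrete, the following sketch computes the discounted return of each trajectory and averages it over trajectories assumed to be sampled from \(\pi\) itself. This is an on-policy Monte Carlo illustration with made-up rewards, not an OPE estimator, and the function names are hypothetical.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma^(t-1) * r_t over one trajectory (t starts at 1)."""
    # enumerate() is 0-indexed, so gamma ** step equals gamma^(t-1).
    return sum(gamma ** step * reward for step, reward in enumerate(rewards))


def on_policy_value(trajectories, gamma):
    """Average discounted return over a list of reward sequences."""
    returns = [discounted_return(rewards, gamma) for rewards in trajectories]
    return sum(returns) / len(returns)


# Two made-up trajectories of length T = 3.
trajectories = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0]]
value = on_policy_value(trajectories, gamma=0.5)
print(value)
```

OPE estimators approximate the same quantity from data logged by a different (behavior) policy, which is why they need importance weights or value predictions.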

See also

OffPolicyEvaluation

Cumulative Distribution OPE

In contrast, cumulative distribution OPE first estimates the following cumulative distribution function.

\[F(m, \pi) := \mathbb{E} \left[ \mathbb{I} \left \{ \sum_{t=1}^T \gamma^{t-1} r_t \leq m \right \} \mid \pi \right]\]

where \(m\) is a threshold on the trajectory-wise (discounted) reward.

Cumulative distribution OPE then derives risk functions such as the variance, conditional value at risk, and interquartile range from the CDF estimate.
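
The sketch below illustrates these quantities on a plain, unweighted sample of trajectory-wise rewards. It assumes the samples come directly from the evaluation policy; the library's estimators instead reweight logged trajectories. All function names are hypothetical, and the tail-based CVaR and the quantile rule are one common convention.

```python
import math


def empirical_cdf(returns, threshold):
    """Fraction of trajectory-wise rewards at or below the threshold."""
    return sum(g <= threshold for g in returns) / len(returns)


def variance(returns):
    """Population variance of the trajectory-wise reward."""
    mean = sum(returns) / len(returns)
    return sum((g - mean) ** 2 for g in returns) / len(returns)


def conditional_value_at_risk(returns, alpha):
    """Mean of the worst ceil(alpha * n) trajectory-wise rewards."""
    n_tail = max(1, math.ceil(alpha * len(returns)))
    worst = sorted(returns)[:n_tail]
    return sum(worst) / n_tail


def quantile(returns, q):
    """Smallest reward whose empirical CDF is at least q."""
    ordered = sorted(returns)
    index = max(0, math.ceil(q * len(ordered)) - 1)
    return ordered[index]


# Made-up trajectory-wise rewards.
returns = [float(g) for g in range(10)]
cdf_at_4 = empirical_cdf(returns, 4.0)
var = variance(returns)
cvar = conditional_value_at_risk(returns, alpha=0.2)
iqr = quantile(returns, 0.75) - quantile(returns, 0.25)
print(cdf_at_4, var, cvar, iqr)
```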

See also

CumulativeDistributionOPE

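Parameters:
  • ope (OffPolicyEvaluation, default=None) – Instance of the (standard) OPE class.

  • cumulative_distribution_ope (CumulativeDistributionOPE, default=None) – Instance of the cumulative distribution OPE class.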

Examples

Preparation:

# import necessary modules from SCOPE-RL
from scope_rl.dataset import SyntheticDataset
from scope_rl.policy import EpsilonGreedyHead
from scope_rl.ope import CreateOPEInput
from scope_rl.ope import OffPolicySelection
from scope_rl.ope import OffPolicyEvaluation as OPE
from scope_rl.ope.discrete import TrajectoryWiseImportanceSampling as TIS
from scope_rl.ope.discrete import PerDecisionImportanceSampling as PDIS
from scope_rl.ope import CumulativeDistributionOPE
from scope_rl.ope.discrete import CumulativeDistributionTIS as CD_IS
from scope_rl.ope.discrete import CumulativeDistributionSNTIS as CD_SNIS

# import necessary modules from other libraries
import gym
import rtbgym
from d3rlpy.algos import DoubleDQNConfig
from d3rlpy.dataset import create_fifo_replay_buffer
from d3rlpy.algos import ConstantEpsilonGreedy

# initialize environment
env = gym.make("RTBEnv-discrete-v0")

# define (RL) agent (i.e., policy) and train on the environment
ddqn = DoubleDQNConfig().create()
buffer = create_fifo_replay_buffer(
    limit=10000,
    env=env,
)
explorer = ConstantEpsilonGreedy(
    epsilon=0.3,
)
ddqn.fit_online(
    env=env,
    buffer=buffer,
    explorer=explorer,
    n_steps=10000,
    n_steps_per_epoch=1000,
)

# convert ddqn policy to stochastic data collection policy
behavior_policy = EpsilonGreedyHead(
    ddqn,
    n_actions=env.action_space.n,
    epsilon=0.3,
    name="ddqn_epsilon_0.3",
    random_state=12345,
)

# initialize dataset class
dataset = SyntheticDataset(
    env=env,
    max_episode_steps=env.step_per_episode,
)

# data collection
logged_dataset = dataset.obtain_episodes(
    behavior_policies=behavior_policy,
    n_trajectories=100,
    random_state=12345,
)

Create Input for OPE:

# evaluation policy
ddqn_ = EpsilonGreedyHead(
    base_policy=ddqn,
    n_actions=env.action_space.n,
    name="ddqn",
    epsilon=0.0,
    random_state=12345
)
random_ = EpsilonGreedyHead(
    base_policy=ddqn,
    n_actions=env.action_space.n,
    name="random",
    epsilon=1.0,
    random_state=12345
)

# create input for off-policy evaluation (OPE)
prep = CreateOPEInput(
    env=env,
)
input_dict = prep.obtain_whole_inputs(
    logged_dataset=logged_dataset,
    evaluation_policies=[ddqn_, random_],
    n_trajectories_on_policy_evaluation=100,
    random_state=12345,
)

Off-Policy Evaluation and Selection:

# OPS
ope = OPE(
    logged_dataset=logged_dataset,
    ope_estimators=[TIS(), PDIS()],
)
cd_ope = CumulativeDistributionOPE(
    logged_dataset=logged_dataset,
    ope_estimators=[
        CD_IS(estimator_name="cd_is"),
        CD_SNIS(estimator_name="cd_snis"),
    ],
)
ops = OffPolicySelection(
    ope=ope,
    cumulative_distribution_ope=cd_ope,
)
ops_dict = ops.select_by_policy_value(
    input_dict=input_dict,
    return_metrics=True,
)

Output:

>>> ops_dict

{'tis': {'estimated_ranking': ['ddqn', 'random'],
        'estimated_policy_value': array([21.3624954,  0.3827044]),
        'estimated_relative_policy_value': array([1.44732354, 0.02592848]),
        'mean_squared_error': 94.79587393975419,
        'rank_correlation': SpearmanrResult(correlation=0.9999999999999999, pvalue=nan),
        'regret': (0.0, 1),
        'type_i_error_rate': 0.0,
        'type_ii_error_rate': 0.0,
        'safety_threshold': 13.284},
'pdis': {'estimated_ranking': ['ddqn', 'random'],
        'estimated_policy_value': array([18.02806424,  7.13847486]),
        'estimated_relative_policy_value': array([1.22141357, 0.48363651]),
        'mean_squared_error': 19.45349619733373,
        'rank_correlation': SpearmanrResult(correlation=0.9999999999999999, pvalue=nan),
        'regret': (0.0, 1),
        'type_i_error_rate': 0.0,
        'type_ii_error_rate': 0.0,
        'safety_threshold': 13.284}}
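
The returned dictionary is keyed by estimator name, so results can be compared with ordinary dictionary operations. The sketch below copies a subset of the output above into a literal, then reads the top-ranked policy of each estimator and the estimator with the smallest mean squared error. The variable names are illustrative only.

```python
# Subset of the ops_dict shown above, copied into a literal.
ops_dict = {
    "tis": {
        "estimated_ranking": ["ddqn", "random"],
        "mean_squared_error": 94.79587393975419,
        "regret": (0.0, 1),
    },
    "pdis": {
        "estimated_ranking": ["ddqn", "random"],
        "mean_squared_error": 19.45349619733373,
        "regret": (0.0, 1),
    },
}

# Policy that each estimator would select (first entry of its ranking).
best_by_estimator = {
    name: result["estimated_ranking"][0] for name, result in ops_dict.items()
}

# Estimator whose policy value estimates are closest to the ground truth.
most_accurate = min(ops_dict, key=lambda name: ops_dict[name]["mean_squared_error"])

print(best_by_estimator, most_accurate)
```

Here both estimators select the same policy with zero regret, even though their mean squared errors differ considerably.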

References

Vladislav Kurenkov and Sergey Kolesnikov. “Showing Your Offline Reinforcement Learning Work: Online Evaluation Budget Matters.” 2022.

Shengpu Tang and Jenna Wiens. “Model Selection for Offline Reinforcement Learning: Practical Considerations for Healthcare Settings.” 2021.

Justin Fu, Mohammad Norouzi, Ofir Nachum, George Tucker, Ziyu Wang, Alexander Novikov, Mengjiao Yang, Michael R. Zhang, Yutian Chen, Aviral Kumar, Cosmin Paduraru, Sergey Levine, and Tom Le Paine. “Benchmarks for Deep Off-Policy Evaluation.” 2021.

Tom Le Paine, Cosmin Paduraru, Andrea Michi, Caglar Gulcehre, Konrad Zolna, Alexander Novikov, Ziyu Wang, and Nando de Freitas. “Hyperparameter Selection for Offline Reinforcement Learning.” 2020.

Attributes:
cumulative_distribution_ope
estimators_name
ope

Methods

obtain_topk_conditional_value_at_risk_selected_by_cumulative_distribution_ope(...)

Obtain the topk deployment result (CVaR) selected by cumulative distribution OPE.

obtain_topk_conditional_value_at_risk_selected_by_standard_ope(...)

Obtain the topk deployment result (CVaR) selected by standard OPE.

obtain_topk_lower_quartile_selected_by_cumulative_distribution_ope(...)

Obtain the topk deployment result (lower quartile) selected by cumulative distribution OPE.

obtain_topk_lower_quartile_selected_by_standard_ope(...)

Obtain the topk deployment result (lower quartile) selected by standard OPE.

obtain_topk_policy_value_selected_by_cumulative_distribution_ope(...)

Obtain the topk deployment result (policy value) selected by cumulative distribution OPE.

obtain_topk_policy_value_selected_by_lower_bound(...)

Obtain the topk deployment result (policy value) selected by its estimated lower bound.

obtain_topk_policy_value_selected_by_standard_ope(...)

Obtain the topk deployment result (policy value) selected by standard OPE.

obtain_true_selection_result(input_dict[, ...])

Obtain the oracle selection result based on the ground-truth policy value.

select_by_conditional_value_at_risk(input_dict)

Rank the candidate policies by their estimated conditional value at risk.

select_by_lower_quartile(input_dict[, ...])

Rank the candidate policies by their estimated lower quartile of the trajectory-wise reward.

select_by_policy_value(input_dict[, ...])

Rank the candidate policies by their estimated policy values.

select_by_policy_value_lower_bound(input_dict)

Rank the candidate policies by their estimated policy value lower bound.

select_by_policy_value_via_cumulative_distribution_ope(...)

Rank the candidate policies by their estimated policy value via cumulative distribution OPE methods.

visualize_conditional_value_at_risk_for_selection(...)

Visualize the conditional value at risk estimated by cumulative distribution OPE estimators (cdf plot).

visualize_conditional_value_at_risk_for_validation(...)

Visualize the true conditional value at risk and its estimate (scatter plot).

visualize_conditional_value_at_risk_with_multiple_estimates(...)

Visualize the conditional value at risk of the trajectory-wise reward under the evaluation policy estimated by OPE estimators across multiple logged datasets.

visualize_cumulative_distribution_function_for_selection(...)

Visualize the cumulative distribution function (cdf plot).

visualize_cumulative_distribution_function_with_multiple_estimates(...)

Visualize the cumulative distribution function estimated by OPE estimators across multiple logged datasets.

visualize_interquartile_range_for_selection(...)

Visualize the interquartile range estimated by cumulative distribution OPE estimators (box plot).

visualize_lower_quartile_for_validation(...)

Visualize the true lower quartile and its estimate (scatter plot).

visualize_lower_quartile_with_multiple_estimates(...)

Visualize the lower quartile of the trajectory-wise reward under the evaluation policy estimated by OPE estimators across multiple logged datasets.

visualize_policy_value_for_selection(input_dict)

Visualize the policy value estimated by OPE estimators (box plot).

visualize_policy_value_for_validation(input_dict)

Visualize the true policy value and its estimate (scatter plot).

visualize_policy_value_lower_bound_for_validation(...)

Visualize the true policy value and its estimated lower bound (scatter plot).

visualize_policy_value_of_cumulative_distribution_ope_for_selection(...)

Visualize the policy value estimated by cumulative distribution OPE estimators (box plot).

visualize_policy_value_of_cumulative_distribution_ope_for_validation(...)

Visualize the true policy value and its estimate obtained by cumulative distribution OPE (scatter plot).

visualize_policy_value_with_multiple_estimates_cumulative_distribution_ope(...)

Visualize the policy value estimated by cumulative distribution OPE estimators across multiple logged datasets.

visualize_policy_value_with_multiple_estimates_standard_ope(...)

Visualize the policy value estimated by OPE estimators across multiple logged datasets.

visualize_topk_conditional_value_at_risk_selected_by_cumulative_distribution_ope(...)

Visualize the topk deployment result (CVaR) selected by cumulative distribution OPE.

visualize_topk_conditional_value_at_risk_selected_by_standard_ope(...)

Visualize the topk deployment result (CVaR) selected by standard OPE.

visualize_topk_lower_quartile_selected_by_cumulative_distribution_ope(...)

Visualize the topk deployment result (lower quartile) selected by cumulative distribution OPE.

visualize_topk_lower_quartile_selected_by_standard_ope(...)

Visualize the topk deployment result (lower quartile) selected by standard OPE.

visualize_topk_policy_value_selected_by_cumulative_distribution_ope(...)

Visualize the topk deployment result (policy value) selected by cumulative distribution OPE.

visualize_topk_policy_value_selected_by_lower_bound(...)

Visualize the topk deployment result (policy value) selected by its estimated lower bound.

visualize_topk_policy_value_selected_by_standard_ope(...)

Visualize the topk deployment result (policy value) selected by standard OPE.

visualize_variance_for_validation(input_dict)

Visualize the true variance and its estimate (scatter plot).

visualize_variance_with_multiple_estimates(...)

Visualize the variance of the trajectory-wise reward under the evaluation policy estimated by OPE estimators across multiple logged datasets.

obtain_true_selection_result(input_dict, behavior_policy_name=None, dataset_id=None, return_variance=False, return_lower_quartile=False, return_conditional_value_at_risk=False, return_by_dataframe=False, quartile_alpha=0.05, cvar_alpha=0.05)

Obtain the oracle selection result based on the ground-truth policy value.

Parameters:
  • input_dict (OPEInputDict or MultipleInputDict) –

    Dictionary of the OPE inputs for each evaluation policy.

    key: [evaluation_policy][
        evaluation_policy_action,
        evaluation_policy_action_dist,
        state_action_value_prediction,
        initial_state_value_prediction,
        state_action_marginal_importance_weight,
        state_marginal_importance_weight,
        on_policy_policy_value,
        gamma,
        behavior_policy,
        evaluation_policy,
        dataset_id,
    ]
    

    See also

    scope_rl.ope.input.CreateOPEInput describes the components of input_dict.

  • behavior_policy_name (str, default=None) – Name of the behavior policy.

  • dataset_id (int, default=None) – Id of the logged dataset.

  • return_variance (bool, default=False) – Whether to return the variance or not.

  • return_lower_quartile (bool, default=False) – Whether to return the lower quartile or not.

  • return_conditional_value_at_risk (bool, default=False) – Whether to return the conditional value at risk or not.

  • return_by_dataframe (bool, default=False) – Whether to return the result in a dataframe format.

  • quartile_alpha (float, default=0.05) – Proportion of the shaded region of the interquartile range.

  • cvar_alpha (float, default=0.05) – Proportion of the shaded region of the conditional value at risk.

Returns:

ground_truth_dict/ground_truth_df – Dictionary/dataframe containing the following ground-truth (on-policy) metrics.

key: [
    ranking,
    policy_value,
    relative_policy_value,
    variance,
    ranking_by_lower_quartile,
    lower_quartile,
    ranking_by_conditional_value_at_risk,
    conditional_value_at_risk,
    parameters,  # only when return_by_dataframe == False
]
ranking: list of str

Name of the candidate policies sorted by the ground-truth policy value.

policy_value: list of float

Ground-truth policy value of the candidate policies (sorted by ranking).

relative_policy_value: list of float

Ground-truth relative policy value of the candidate policies compared to the behavior policy (sorted by ranking).

variance: list of float

Ground-truth variance of the trajectory-wise reward of the candidate policies (sorted by ranking). If return_variance is False, None is recorded.

ranking_by_lower_quartile: list of str

Name of the candidate policies sorted by the ground-truth lower quartile of the trajectory-wise reward. If return_lower_quartile is False, None is recorded.

lower_quartile: list of float

Ground-truth lower quartile of the candidate policies (sorted by ranking_by_lower_quartile). If return_lower_quartile is False, None is recorded.

ranking_by_conditional_value_at_risk: list of str

Name of the candidate policies sorted by the ground-truth conditional value at risk. If return_conditional_value_at_risk is False, None is recorded.

conditional_value_at_risk: list of float

Ground-truth conditional value at risk of the candidate policies (sorted by ranking_by_conditional_value_at_risk). If return_conditional_value_at_risk is False, None is recorded.

parameters: dict

Dictionary containing quartile_alpha and cvar_alpha. If return_by_dataframe is True, parameters will not be returned.

Return type:

dict or dataframe (, list of dict or dataframe)

select_by_policy_value(input_dict, compared_estimators=None, behavior_policy_name=None, dataset_id=None, return_true_values=False, return_metrics=False, return_by_dataframe=False, top_k_in_eval_metrics=1, safety_threshold=None, relative_safety_criteria=None)

Rank the candidate policies by their estimated policy values.

Parameters:
  • input_dict (OPEInputDict or MultipleInputDict) –

    Dictionary of the OPE inputs for each evaluation policy.

    key: [evaluation_policy][
        evaluation_policy_action,
        evaluation_policy_action_dist,
        state_action_value_prediction,
        initial_state_value_prediction,
        state_action_marginal_importance_weight,
        state_marginal_importance_weight,
        on_policy_policy_value,
        gamma,
        behavior_policy,
        evaluation_policy,
        dataset_id,
    ]
    

    See also

    scope_rl.ope.input.CreateOPEInput describes the components of input_dict.

  • compared_estimators (list of str, default=None) – Name of compared estimators. When None is given, all the estimators are compared.

  • behavior_policy_name (str, default=None) – Name of the behavior policy.

  • dataset_id (int, default=None) – Id of the logged dataset.

  • return_true_values (bool, default=False) – Whether to return the true policy value and corresponding ranking of the candidate policies.

  • return_metrics (bool, default=False) – Whether to return the following evaluation metrics in terms of OPE and OPS: mean-squared-error, rank-correlation, regret@k, and Type I and Type II error rate.

  • return_by_dataframe (bool, default=False) – Whether to return the result in a dataframe format.

  • top_k_in_eval_metrics (int, default=1) – How many candidate policies are included in regret@k.

  • safety_threshold (float, default=None) – A policy whose policy value is below the given threshold is to be considered unsafe.

  • relative_safety_criteria (float, default=None (>= 0)) – The relative policy value required to be considered a safe policy. For example, when 0.9 is given, a candidate policy must exceed 90% of the behavior policy performance. Only applicable when using a single behavior policy.
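
One plausible reading of the two safety parameters is sketched below. It assumes that relative_safety_criteria is turned into an absolute threshold by multiplying it with the behavior policy value, and that it takes precedence over safety_threshold when both are given. The helper names and the precedence rule are assumptions for illustration, not the library's confirmed behavior.

```python
def resolve_safety_threshold(
    safety_threshold, relative_safety_criteria, behavior_policy_value
):
    """Return an absolute threshold from either safety parameter (assumed rule)."""
    if relative_safety_criteria is not None:
        # e.g. 0.9 means "at least 90% of the behavior policy performance".
        return relative_safety_criteria * behavior_policy_value
    return safety_threshold


def is_safe(policy_value, threshold):
    """A policy whose value is below the threshold is considered unsafe."""
    return policy_value >= threshold


# Hypothetical behavior policy value of 10.0.
threshold = resolve_safety_threshold(None, 0.9, behavior_policy_value=10.0)
print(threshold, is_safe(9.5, threshold), is_safe(8.0, threshold))
```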

Returns:

ops_dict/(ranking_df_dict, metric_df) – Dictionary/dataframe containing the result of OPS conducted by OPE estimators.

key: [estimator_name][
    estimated_ranking,
    estimated_policy_value,
    estimated_relative_policy_value,
    true_ranking,
    true_policy_value,
    true_relative_policy_value,
    mean_squared_error,
    rank_correlation,
    regret,
    type_i_error_rate,
    type_ii_error_rate,
]
estimated_ranking: list of str

Name of the candidate policies sorted by the estimated policy value. Recorded in ranking_df_dict if return_by_dataframe is True.

estimated_policy_value: list of float

Estimated policy value of the candidate policies (sorted by estimated_ranking). Recorded in ranking_df_dict if return_by_dataframe is True.

estimated_relative_policy_value: list of float

Estimated relative policy value of the candidate policies compared to the behavior policy (sorted by estimated_ranking). Recorded in ranking_df_dict if return_by_dataframe is True.

true_ranking: list of int

Ranking index of the (true) policy value of the candidate policies (sorted by estimated_ranking). Recorded only when return_true_values is True. Recorded in ranking_df_dict if return_by_dataframe is True.

true_policy_value: list of float

True policy value of the candidate policies (sorted by estimated_ranking). Recorded only when return_true_values is True. Recorded in ranking_df_dict when return_by_dataframe is True.

true_relative_policy_value: list of float

True relative policy value of the candidate policies compared to the behavior policy (sorted by estimated_ranking). Recorded only when return_true_values is True. Recorded in ranking_df_dict if return_by_dataframe is True.

mean_squared_error: float

Mean-squared-error of the estimators calculated across candidate evaluation policies. Recorded only when return_metrics is True. Recorded in metric_df if return_by_dataframe is True.

rank_correlation: tuple of float

Rank correlation coefficient between the true ranking and the estimated ranking, and its pvalue. Recorded only when return_metrics is True. Recorded in metric_df if return_by_dataframe is True.

regret: tuple of float and int

Regret@k and k. Recorded only when return_metrics is True. Recorded in metric_df if return_by_dataframe is True.

type_i_error_rate: float

Type I error rate of the hypothesis test, i.e., the rate at which a safe policy is estimated as unsafe. Recorded only when return_metrics is True. Recorded in metric_df if return_by_dataframe is True.

type_ii_error_rate: float

Type II error rate of the hypothesis test, i.e., the rate at which an unsafe policy goes undetected. Recorded only when return_metrics is True. Recorded in metric_df if return_by_dataframe is True.

safety_threshold: float

A policy whose policy value is below the given threshold is to be considered unsafe.
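
The metrics above can be illustrated with a dependency-free sketch. It assumes the common definition of regret@k (best true value overall minus the best true value among the k top-estimated policies) and the error-rate conditions stated in the field descriptions. The function names and numbers are hypothetical and may differ from the library's internals.

```python
def mean_squared_error(true_values, estimated_values):
    """Average squared gap between true and estimated policy values."""
    pairs = zip(true_values, estimated_values)
    return sum((t - e) ** 2 for t, e in pairs) / len(true_values)


def regret_at_k(true_values, estimated_values, k):
    """True value lost by deploying the best of the k top-estimated policies."""
    order = sorted(
        range(len(estimated_values)),
        key=lambda i: estimated_values[i],
        reverse=True,
    )
    return max(true_values) - max(true_values[i] for i in order[:k])


def error_rates(true_values, estimated_values, safety_threshold):
    """Type I: safe but estimated unsafe. Type II: unsafe but undetected."""
    true_safe = [v >= safety_threshold for v in true_values]
    est_safe = [v >= safety_threshold for v in estimated_values]
    n_safe = sum(true_safe)
    n_unsafe = len(true_safe) - n_safe
    n_type_i = sum(t and not e for t, e in zip(true_safe, est_safe))
    n_type_ii = sum(e and not t for t, e in zip(true_safe, est_safe))
    type_i = n_type_i / n_safe if n_safe else 0.0
    type_ii = n_type_ii / n_unsafe if n_unsafe else 0.0
    return type_i, type_ii


# Made-up true and estimated policy values for three candidates.
true_values = [15.0, 12.0, 5.0]
estimated_values = [14.0, 16.0, 5.0]

mse = mean_squared_error(true_values, estimated_values)
regret = regret_at_k(true_values, estimated_values, k=1)
type_i, type_ii = error_rates(true_values, estimated_values, safety_threshold=13.0)
print(mse, regret, type_i, type_ii)
```

The rank_correlation entry in the example output is a SpearmanrResult, which suggests scipy.stats.spearmanr; it is omitted here to keep the sketch free of third-party imports.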

Return type:

dict or dataframe (, list of dict or dataframe)

select_by_policy_value_via_cumulative_distribution_ope(input_dict, compared_estimators=None, behavior_policy_name=None, dataset_id=None, return_true_values=False, return_metrics=False, return_by_dataframe=False, top_k_in_eval_metrics=1, safety_threshold=None, relative_safety_criteria=None)

Rank the candidate policies by their estimated policy value via cumulative distribution OPE methods.

Parameters:
  • input_dict (OPEInputDict or MultipleInputDict) –

    Dictionary of the OPE inputs for each evaluation policy.

    key: [evaluation_policy][
        evaluation_policy_action,
        evaluation_policy_action_dist,
        state_action_value_prediction,
        initial_state_value_prediction,
        state_action_marginal_importance_weight,
        state_marginal_importance_weight,
        on_policy_policy_value,
        gamma,
        behavior_policy,
        evaluation_policy,
        dataset_id,
    ]
    

    See also

    scope_rl.ope.input.CreateOPEInput describes the components of input_dict.

  • compared_estimators (list of str, default=None) – Name of compared estimators. If None is given, all the estimators are compared.

  • behavior_policy_name (str, default=None) – Name of the behavior policy.

  • dataset_id (int, default=None) – Id of the logged dataset.

  • return_true_values (bool, default=False) – Whether to return the true policy value and corresponding ranking of the candidate policies.

  • return_metrics (bool, default=False) – Whether to return the following evaluation metrics in terms of OPE and OPS: mean-squared-error, rank-correlation, regret@k, and Type I and Type II error rate.

  • return_by_dataframe (bool, default=False) – Whether to return the result in a dataframe format.

  • top_k_in_eval_metrics (int, default=1) – How many candidate policies are included in regret@k.

  • safety_threshold (float, default=None) – A policy whose policy value is below the given threshold is to be considered unsafe.

  • relative_safety_criteria (float, default=None (>= 0)) – The relative policy value required to be considered a safe policy. For example, when 0.9 is given, a candidate policy must exceed 90% of the behavior policy performance. Only applicable when using a single behavior policy.

Returns:

ops_dict/(ranking_df_dict, metric_df) – Dictionary/dataframe containing the result of OPS conducted by OPE estimators.

key: [estimator_name][
    estimated_ranking,
    estimated_policy_value,
    estimated_relative_policy_value,
    true_ranking,
    true_policy_value,
    true_relative_policy_value,
    mean_squared_error,
    rank_correlation,
    regret,
    type_i_error_rate,
    type_ii_error_rate,
]
estimated_ranking: list of str

Name of the candidate policies sorted by the estimated policy value. Recorded in ranking_df_dict if return_by_dataframe is True.

estimated_policy_value: list of float

Estimated policy value of the candidate policies (sorted by estimated_ranking). Recorded in ranking_df_dict if return_by_dataframe is True.

estimated_relative_policy_value: list of float

Estimated relative policy value of the candidate policies compared to the behavior policy (sorted by estimated_ranking). Recorded in ranking_df_dict if return_by_dataframe is True.

true_ranking: list of int

Ranking index of the (true) policy value of the candidate policies (sorted by estimated_ranking). Recorded only when return_true_values is True. Recorded in ranking_df_dict if return_by_dataframe is True.

true_policy_value: list of float

True policy value of the candidate policies (sorted by estimated_ranking). Recorded only when return_true_values is True. Recorded in ranking_df_dict if return_by_dataframe is True.

true_relative_policy_value: list of float

True relative policy value of the candidate policies compared to the behavior policy (sorted by estimated_ranking). Recorded only when return_true_values is True. Recorded in ranking_df_dict if return_by_dataframe is True.

mean_squared_error: float

Mean-squared-error of the estimators calculated across candidate evaluation policies. Recorded only when return_metrics is True. Recorded in metric_df if return_by_dataframe is True.

rank_correlation: tuple of float

Rank correlation coefficient between the true ranking and the estimated ranking, and its pvalue. Recorded only when return_metrics is True. Recorded in metric_df if return_by_dataframe is True.

regret: tuple of float and int

Regret@k and k. Recorded only when return_metrics is True. Recorded in metric_df if return_by_dataframe is True.

type_i_error_rate: float

Type I error rate of the hypothesis test, i.e., the rate at which a safe policy is estimated as unsafe. Recorded only when return_metrics is True. Recorded in metric_df if return_by_dataframe is True.

type_ii_error_rate: float

Type II error rate of the hypothesis test, i.e., the rate at which an unsafe policy goes undetected. Recorded only when return_metrics is True. Recorded in metric_df if return_by_dataframe is True.

safety_threshold: float

A policy whose policy value is below the given threshold is to be considered unsafe.

Return type:

dict or dataframe (, list of dict or dataframe)

select_by_policy_value_lower_bound(input_dict, compared_estimators=None, behavior_policy_name=None, dataset_id=None, return_true_values=False, return_metrics=False, return_by_dataframe=False, top_k_in_eval_metrics=1, safety_threshold=None, relative_safety_criteria=None, cis=['bootstrap'], alpha=0.05, n_bootstrap_samples=100, random_state=None)

Rank the candidate policies by their estimated policy value lower bound.

Parameters:
  • input_dict (OPEInputDict or MultipleInputDict) –

    Dictionary of the OPE inputs for each evaluation policy.

    key: [evaluation_policy][
        evaluation_policy_action,
        evaluation_policy_action_dist,
        state_action_value_prediction,
        initial_state_value_prediction,
        state_action_marginal_importance_weight,
        state_marginal_importance_weight,
        on_policy_policy_value,
        gamma,
        behavior_policy,
        evaluation_policy,
        dataset_id,
    ]
    

    See also

    scope_rl.ope.input.CreateOPEInput describes the components of input_dict.

  • compared_estimators (list of str, default=None) – Name of compared estimators. When None is given, all the estimators are compared.

  • behavior_policy_name (str, default=None) – Name of the behavior policy.

  • dataset_id (int, default=None) – Id of the logged dataset.

  • return_true_values (bool, default=False) – Whether to return the true policy value and corresponding ranking of the candidate policies.

  • return_metrics (bool, default=False) – Whether to return the following evaluation metrics in terms of OPE and OPS: rank-correlation, regret@k, and Type I and Type II error rate.

  • return_by_dataframe (bool, default=False) – Whether to return the result in a dataframe format.

  • top_k_in_eval_metrics (int, default=1) – How many candidate policies are included in regret@k.

  • safety_threshold (float, default=None) – A policy whose policy value is below the given threshold is to be considered unsafe.

  • relative_safety_criteria (float, default=None (>= 0)) – The relative policy value required to be considered a safe policy. For example, when 0.9 is given, a candidate policy must exceed 90% of the behavior policy performance. Only applicable when using a single behavior policy.

  • cis (list of {"bootstrap", "hoeffding", "bernstein", "ttest"}, default=["bootstrap"]) – Estimation methods for confidence intervals.

  • alpha (float, default=0.05) – Significance level. The value should be within [0, 1).

  • n_bootstrap_samples (int, default=100 (> 0)) – Number of resampling performed in the bootstrap procedure.

  • random_state (int, default=None (>= 0)) – Random state.
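
To illustrate how alpha, n_bootstrap_samples, and random_state interact, the sketch below computes a one-sided percentile bootstrap lower bound on the mean of per-trajectory estimates. The one-sided percentile rule, the function name, and the sample values are assumptions for illustration; the library's interval construction may differ (for example, it may use a two-sided interval).

```python
import random


def bootstrap_lower_bound(values, alpha=0.05, n_bootstrap_samples=100, random_state=None):
    """Lower alpha-percentile of the resampled means (assumed one-sided rule)."""
    rng = random.Random(random_state)
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        sum(rng.choices(values, k=n)) / n for _ in range(n_bootstrap_samples)
    )
    index = min(int(alpha * n_bootstrap_samples), n_bootstrap_samples - 1)
    return means[index]


# Made-up per-trajectory estimates for one candidate policy.
values = [10.0, 12.0, 9.0, 14.0, 11.0, 13.0, 8.0, 15.0]
lower_bound = bootstrap_lower_bound(
    values, alpha=0.05, n_bootstrap_samples=100, random_state=12345
)
print(lower_bound)
```

Ranking by such a lower bound favors policies whose estimates are both high and stable, since a noisy estimate produces a wider spread of resampled means and therefore a lower bound.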

Returns:

ops_dict/(ranking_df_dict, metric_df) – Dictionary/dataframe containing the result of OPS conducted by OPE estimators.

key: [ci][estimator_name][
    estimated_ranking,
    estimated_policy_value_lower_bound,
    estimated_relative_policy_value_lower_bound,
    true_ranking,
    true_policy_value,
    true_relative_policy_value,
    mean_squared_error,
    rank_correlation,
    regret,
    type_i_error_rate,
    type_ii_error_rate,
]
estimated_ranking: list of str

Name of the candidate policies sorted by the estimated policy value lower bound. Recorded in ranking_df_dict if return_by_dataframe is True.

estimated_policy_value_lower_bound: list of float

Estimated policy value lower bound of the candidate policies (sorted by estimated_ranking). Recorded in ranking_df_dict if return_by_dataframe is True.

estimated_relative_policy_value_lower_bound: list of float

Estimated relative policy value lower bound of the candidate policies compared to the behavior policy (sorted by estimated_ranking). Recorded in ranking_df_dict if return_by_dataframe is True.

true_ranking: list of int

Ranking index of the (true) policy value of the candidate policies (sorted by estimated_ranking). Recorded only when return_true_values is True. Recorded in ranking_df_dict if return_by_dataframe is True.

true_policy_value: list of float

True policy value of the candidate policies (sorted by estimated_ranking). Recorded only when return_true_values is True. Recorded in ranking_df_dict if return_by_dataframe is True.

true_relative_policy_value: list of float

True relative policy value of the candidate policies compared to the behavior policy (sorted by estimated_ranking). Recorded only when return_true_values is True. Recorded in ranking_df_dict if return_by_dataframe is True.

mean_squared_error: None

This is for API consistency. Recorded in metric_df if return_by_dataframe is True.

rank_correlation: tuple of float

Rank correlation coefficient between the true ranking and the estimated ranking, and its pvalue. Recorded only when return_metrics is True. Recorded in metric_df if return_by_dataframe is True.

regret: tuple of float and int

Regret@k and k. Recorded only when return_metrics is True. Recorded in metric_df if return_by_dataframe is True.

type_i_error_rate: float

Type I error rate of the hypothesis test, i.e., the rate at which a safe policy is estimated as unsafe. Recorded only when return_metrics is True. Recorded in metric_df if return_by_dataframe is True.

type_ii_error_rate: float

Type II error rate of the hypothesis test, i.e., the rate at which an unsafe policy goes undetected. Recorded only when return_metrics is True. Recorded in metric_df if return_by_dataframe is True.

safety_threshold: float

A policy whose policy value is below the given threshold is to be considered unsafe.

Return type:

dict or dataframe (, list of dict or dataframe)

select_by_lower_quartile(input_dict, compared_estimators=None, behavior_policy_name=None, dataset_id=None, alpha=0.05, return_true_values=False, return_metrics=False, return_by_dataframe=False, safety_threshold=0.0)

Rank the candidate policies by their estimated lower quartile of the trajectory-wise reward.

Parameters:
  • input_dict (OPEInputDict) –

    Dictionary of the OPE inputs for each evaluation policy.

    key: [evaluation_policy][
        evaluation_policy_action,
        evaluation_policy_action_dist,
        state_action_value_prediction,
        initial_state_value_prediction,
        state_action_marginal_importance_weight,
        state_marginal_importance_weight,
        on_policy_policy_value,
        gamma,
        behavior_policy,
        evaluation_policy,
        dataset_id,
    ]
    

    See also

    scope_rl.ope.input.CreateOPEInput describes the components of input_dict.

  • compared_estimators (list of str, default=None) – Name of compared estimators. When None is given, all the estimators are compared.

  • behavior_policy_name (str, default=None) – Name of the behavior policy.

  • dataset_id (int, default=None) – Id of the logged dataset.

  • alpha (float, default=0.05) – Proportion of the shaded region. The value should be within [0, 0.5].

  • return_true_values (bool, default=False) – Whether to return the true lower quartile of the trajectory-wise reward and corresponding ranking of the candidate evaluation policies.

  • return_metrics (bool, default=False) – Whether to return the following evaluation metrics in terms of OPE and OPS: mean-squared-error, rank-correlation, and Type I and Type II error rate.

  • return_by_dataframe (bool, default=False) – Whether to return the result in a dataframe format.

  • safety_threshold (float, default=0.0 (>= 0)) – The lower quartile required to be considered a safe policy.

Returns:

ops_dict/(ranking_df_dict, metric_df) – Dictionary/dataframe containing the result of OPS conducted by OPE estimators.

key: [estimator_name][
    estimated_ranking,
    estimated_lower_quartile,
    true_ranking,
    true_lower_quartile,
    mean_squared_error,
    rank_correlation,
    regret,
    type_i_error_rate,
    type_ii_error_rate,
]
estimated_ranking: list of str

Name of the candidate policies sorted by the estimated lower quartile of the trajectory-wise reward. Recorded in ranking_df_dict if return_by_dataframe is True.

estimated_lower_quartile: list of float

Estimated lower quartile of the trajectory-wise reward of the candidate policies (sorted by estimated_ranking). Recorded in ranking_df_dict if return_by_dataframe is True.

true_ranking: list of int

Ranking index of the (true) lower quartile of the trajectory-wise reward of the candidate policies (sorted by estimated_ranking). Recorded only when return_true_values is True. Recorded in ranking_df_dict if return_by_dataframe is True.

true_lower_quartile: list of float

True lower quartile of the trajectory-wise reward of the candidate policies (sorted by estimated_ranking). Recorded only when return_true_values is True. Recorded in ranking_df_dict if return_by_dataframe is True.

mean_squared_error: float

Mean-squared-error of the estimated lower quartile of the trajectory-wise reward. Recorded only when return_metrics is True. Recorded in metric_df if return_by_dataframe is True.

rank_correlation: tuple of float

Rank correlation coefficient between the true ranking and the estimated ranking, and its p-value. Recorded only when return_metrics is True. Recorded in metric_df if return_by_dataframe is True.

regret: None

This is for API consistency. Recorded in metric_df if return_by_dataframe is True.

type_i_error_rate: float

Type I error rate of the hypothesis test. A Type I error occurs when a policy is safe but estimated as unsafe. Recorded only when return_metrics is True. Recorded in metric_df if return_by_dataframe is True.

type_ii_error_rate: float

Type II error rate of the hypothesis test. A Type II error occurs when a policy is unsafe but not detected as such. Recorded only when return_metrics is True. Recorded in metric_df if return_by_dataframe is True.

safety_threshold: float

The lower quartile required to be considered a safe policy.

Return type:

dict or dataframe
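As a rough illustration of ranking by a lower quantile, the sketch below computes the alpha-quantile from samples of the trajectory-wise reward and sorts the candidates by it. This is an assumption-laden stand-in: SCOPE-RL derives the quantile from an off-policy CDF estimate, whereas this sketch uses plain (on-policy style) samples, and the helper names `lower_quantile` and `rank_by_lower_quantile` are hypothetical.

```python
import math


def lower_quantile(returns, alpha):
    """Smallest observed return x whose empirical CDF satisfies F(x) >= alpha."""
    ordered = sorted(returns)
    idx = max(math.ceil(alpha * len(ordered)) - 1, 0)
    return ordered[idx]


def rank_by_lower_quantile(returns_by_policy, alpha):
    """Sort policy names by their alpha-quantile, best (largest) first."""
    scores = {name: lower_quantile(r, alpha) for name, r in returns_by_policy.items()}
    ranking = sorted(scores, key=scores.get, reverse=True)
    return ranking, [scores[name] for name in ranking]


# Toy trajectory-wise rewards for three hypothetical candidate policies.
returns_by_policy = {
    "policy_a": [1, 2, 3, 4, 5, 6, 7, 8],
    "policy_b": [4, 4, 5, 5, 5, 6, 6, 6],
    "policy_c": [0, 0, 0, 9, 9, 9, 9, 9],
}

ranking, quantiles = rank_by_lower_quantile(returns_by_policy, alpha=0.25)
print(ranking)    # ['policy_b', 'policy_a', 'policy_c']
print(quantiles)  # [4, 2, 0]
```

Here `policy_c` has the largest mean reward but the worst lower quantile, which is why a quantile-based ranking places it last.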

select_by_conditional_value_at_risk(input_dict, compared_estimators=None, behavior_policy_name=None, dataset_id=None, alpha=0.05, return_true_values=False, return_metrics=False, return_by_dataframe=False, safety_threshold=0.0)[source]#

Rank the candidate policies by their estimated conditional value at risk.

Parameters:
  • input_dict (OPEInputDict or MultipleInputDict) –

    Dictionary of the OPE inputs for each evaluation policy.

    key: [evaluation_policy][
        evaluation_policy_action,
        evaluation_policy_action_dist,
        state_action_value_prediction,
        initial_state_value_prediction,
        state_action_marginal_importance_weight,
        state_marginal_importance_weight,
        on_policy_policy_value,
        gamma,
        behavior_policy,
        evaluation_policy,
        dataset_id,
    ]
    

    See also

    scope_rl.ope.input.CreateOPEInput describes the components of input_dict.

  • compared_estimators (list of str, default=None) – Name of compared estimators. When None is given, all the estimators are compared.

  • behavior_policy_name (str, default=None) – Name of the behavior policy.

  • dataset_id (int, default=None) – Id of the logged dataset.

  • alpha (float, default=0.05) – Proportion of the shaded region. The value should be within [0, 1].

  • return_true_values (bool, default=False) – Whether to return the true conditional value at risk and corresponding ranking of the candidate evaluation policies.

  • return_metrics (bool, default=False) – Whether to return the following evaluation metrics in terms of OPE and OPS: mean-squared-error, rank-correlation, and Type I and Type II error rate.

  • return_by_dataframe (bool, default=False) – Whether to return the result in a dataframe format.

  • safety_threshold (float, default=0.0 (>= 0)) – The conditional value at risk required to be considered a safe policy.

Returns:

ops_dict/(ranking_df_dict, metric_df) – Dictionary/dataframe containing the result of OPS conducted by OPE estimators.

key: [estimator_name][
    estimated_ranking,
    estimated_conditional_value_at_risk,
    true_ranking,
    true_conditional_value_at_risk,
    mean_squared_error,
    rank_correlation,
    regret,
    type_i_error_rate,
    type_ii_error_rate,
]
estimated_ranking: list of str

Name of the candidate policies sorted by the estimated conditional value at risk. Recorded in ranking_df_dict if return_by_dataframe is True.

estimated_conditional_value_at_risk: list of float

Estimated conditional value at risk of the candidate policies (sorted by estimated_ranking). Recorded in ranking_df_dict if return_by_dataframe is True.

true_ranking: list of int

Ranking index of the (true) conditional value at risk of the candidate policies (sorted by estimated_ranking). Recorded only when return_true_values is True. Recorded in ranking_df_dict if return_by_dataframe is True.

true_conditional_value_at_risk: list of float

True conditional value at risk of the candidate policies (sorted by estimated_ranking). Recorded only when return_true_values is True. Recorded in ranking_df_dict if return_by_dataframe is True.

mean_squared_error: float

Mean-squared-error of the estimated conditional value at risk. Recorded only when return_metrics is True. Recorded in metric_df if return_by_dataframe is True.

rank_correlation: tuple of float

Rank correlation coefficient between the true ranking and the estimated ranking, and its p-value. Recorded only when return_metrics is True. Recorded in metric_df if return_by_dataframe is True.

regret: None

This is for API consistency. Recorded in metric_df if return_by_dataframe is True.

type_i_error_rate: float

Type I error rate of the hypothesis test. A Type I error occurs when a policy is safe but estimated as unsafe. Recorded only when return_metrics is True. Recorded in metric_df if return_by_dataframe is True.

type_ii_error_rate: float

Type II error rate of the hypothesis test. A Type II error occurs when a policy is unsafe but not detected as such. Recorded only when return_metrics is True. Recorded in metric_df if return_by_dataframe is True.

safety_threshold: float

The conditional value at risk required to be considered a safe policy.

Return type:

dict or dataframe (, list of dict or dataframe)
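For intuition about the quantity being ranked, the following sketch computes a sample-based conditional value at risk as the mean of the worst alpha fraction of trajectory-wise rewards. It is only a sketch under stated assumptions: SCOPE-RL estimates CVaR from an off-policy CDF estimate rather than from raw samples, the tail size is assumed to be `ceil(alpha * n)` with at least one sample, and `conditional_value_at_risk` is a hypothetical helper name.

```python
import math


def conditional_value_at_risk(returns, alpha):
    """Mean of the worst ceil(alpha * n) returns (at least one return is used)."""
    ordered = sorted(returns)
    n_tail = max(math.ceil(alpha * len(ordered)), 1)
    tail = ordered[:n_tail]
    return sum(tail) / n_tail


# Two hypothetical policies: "risky" has the higher mean but a much worse tail.
steady = [4, 4, 5, 5, 5, 6, 6, 6]
risky = [0, 0, 9, 9, 9, 9, 9, 9]

print(sum(steady) / len(steady), conditional_value_at_risk(steady, alpha=0.25))
print(sum(risky) / len(risky), conditional_value_at_risk(risky, alpha=0.25))
```

Ranking by the mean would prefer `risky`, while ranking by CVaR at `alpha=0.25` prefers `steady`; comparing CVaR with `safety_threshold` then decides whether a policy counts as safe.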

visualize_policy_value_for_selection(input_dict, compared_estimators=None, behavior_policy_name=None, dataset_id=None, alpha=0.05, ci='bootstrap', n_bootstrap_samples=100, random_state=None, is_relative=False, hue='estimator', sharey=False, fig_dir=None, fig_name='estimated_policy_value_standard_ope.png')[source]#

Visualize the policy value estimated by OPE estimators (box plot).

Parameters:
  • input_dict (OPEInputDict or MultipleInputDict) –

    Dictionary of the OPE inputs for each evaluation policy.

    key: [evaluation_policy][
        evaluation_policy_action,
        evaluation_policy_action_dist,
        state_action_value_prediction,
        initial_state_value_prediction,
        state_action_marginal_importance_weight,
        state_marginal_importance_weight,
        on_policy_policy_value,
        gamma,
        behavior_policy,
        evaluation_policy,
        dataset_id,
    ]
    

    See also

    scope_rl.ope.input.CreateOPEInput describes the components of input_dict.

  • compared_estimators (list of str, default=None) – Name of compared estimators. When None is given, all the estimators are compared.

  • behavior_policy_name (str, default=None) – Name of the behavior policy.

  • dataset_id (int, default=None) – Id of the logged dataset.

  • alpha (float, default=0.05) – Significance level. The value should be within [0, 1).

  • ci ({"bootstrap", "hoeffding", "bernstein", "ttest"}, default="bootstrap") – Method to estimate the confidence interval.

  • n_bootstrap_samples (int, default=100 (> 0)) – Number of resampling performed in the bootstrap procedure.

  • random_state (int, default=None (>= 0)) – Random state.

  • is_relative (bool, default=False) – If True, the method visualizes the estimated policy value of the evaluation policy relative to the on-policy policy value of the behavior policy.

  • hue ({"estimator", "policy"}, default="estimator") – Hue of the plot.

  • sharey (bool, default=False) – If True, the y-axis will be shared among different estimators or evaluation policies.

  • fig_dir (Path, default=None) – Path to store the bar figure. If None is given, the figure will not be saved.

  • fig_name (str, default="estimated_policy_value_standard_ope.png") – Name of the bar figure.
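The roles of alpha, n_bootstrap_samples, and random_state in the "bootstrap" option can be sketched with a generic percentile bootstrap. This is a simplified illustration, not the library's routine: the input is assumed to be a list of per-trajectory estimates of the policy value, and the function name `bootstrap_confidence_interval` and the percentile-index convention are assumptions made for this example.

```python
import random
import statistics


def bootstrap_confidence_interval(
    samples, alpha=0.05, n_bootstrap_samples=100, random_state=None
):
    """Return (mean, lower, upper) of a (1 - alpha) percentile bootstrap interval."""
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must be within [0, 1)")
    rng = random.Random(random_state)
    n = len(samples)
    # Resample with replacement and record the mean of each resample.
    boot_means = sorted(
        statistics.fmean(rng.choices(samples, k=n))
        for _ in range(n_bootstrap_samples)
    )
    lower_idx = int((alpha / 2) * (n_bootstrap_samples - 1))
    upper_idx = int((1 - alpha / 2) * (n_bootstrap_samples - 1))
    return statistics.fmean(samples), boot_means[lower_idx], boot_means[upper_idx]


# Hypothetical per-trajectory estimates for one (estimator, policy) pair.
samples = [0.8, 1.1, 0.9, 1.4, 1.0, 0.7, 1.2, 1.3]
mean, lower, upper = bootstrap_confidence_interval(samples, random_state=12345)
print(mean, lower, upper)
```

A smaller alpha widens the interval, a larger n_bootstrap_samples makes its end points less noisy, and fixing random_state makes the result reproducible.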

visualize_cumulative_distribution_function_for_selection(input_dict, compared_estimators=None, behavior_policy_name=None, dataset_id=None, hue='estimator', legend=True, n_cols=None, fig_dir=None, fig_name='estimated_cumulative_distribution_function.png')[source]#

Visualize the cumulative distribution function (cdf plot).

Parameters:
  • input_dict (OPEInputDict or MultipleInputDict) –

    Dictionary of the OPE inputs for each evaluation policy.

    key: [evaluation_policy][
        evaluation_policy_action,
        evaluation_policy_action_dist,
        state_action_value_prediction,
        initial_state_value_prediction,
        state_action_marginal_importance_weight,
        state_marginal_importance_weight,
        on_policy_policy_value,
        gamma,
        behavior_policy,
        evaluation_policy,
        dataset_id,
    ]
    

    See also

    scope_rl.ope.input.CreateOPEInput describes the components of input_dict.

  • compared_estimators (list of str, default=None) – Name of compared estimators. When None is given, all the estimators are compared.

  • behavior_policy_name (str, default=None) – Name of the behavior policy.

  • dataset_id (int, default=None) – Id of the logged dataset.

  • hue ({"estimator", "policy"}, default="estimator") – Hue of the plot.

  • legend (bool, default=True) – Whether to include a legend in the figure.

  • n_cols (int, default=None (> 0)) – Number of columns in the figure.

  • fig_dir (Path, default=None) – Path to store the bar figure. If None is given, the figure will not be saved.

  • fig_name (str, default="estimated_cumulative_distribution_function.png") – Name of the bar figure.

visualize_policy_value_of_cumulative_distribution_ope_for_selection(input_dict, compared_estimators=None, behavior_policy_name=None, dataset_id=None, alpha=0.05, is_relative=False, hue='estimator', sharey=False, fig_dir=None, fig_name='estimated_policy_value_cumulative_distribution_ope.png')[source]#

Visualize the policy value estimated by cumulative distribution OPE estimators (box plot).

Parameters:
  • input_dict (OPEInputDict or MultipleInputDict) –

    Dictionary of the OPE inputs for each evaluation policy.

    key: [evaluation_policy][
        evaluation_policy_action,
        evaluation_policy_action_dist,
        state_action_value_prediction,
        initial_state_value_prediction,
        state_action_marginal_importance_weight,
        state_marginal_importance_weight,
        on_policy_policy_value,
        gamma,
        behavior_policy,
        evaluation_policy,
        dataset_id,
    ]
    

    See also

    scope_rl.ope.input.CreateOPEInput describes the components of input_dict.

  • compared_estimators (list of str, default=None) – Name of compared estimators. When None is given, all the estimators are compared.

  • behavior_policy_name (str, default=None) – Name of the behavior policy.

  • dataset_id (int, default=None) – Id of the logged dataset.

  • alpha (float, default=0.05) – Significance level. The value should be within [0, 1).

  • is_relative (bool, default=False) – If True, the method visualizes the estimated policy value of the evaluation policy relative to the ground-truth policy value of the behavior policy.

  • hue ({"estimator", "policy"}, default="estimator") – Hue of the plot.

  • sharey (bool, default=False) – If True, the y-axis will be shared among different evaluation policies.

  • fig_dir (Path, default=None) – Path to store the bar figure. If None is given, the figure will not be saved.

  • fig_name (str, default="estimated_policy_value_cumulative_distribution_ope.png") – Name of the bar figure.

visualize_conditional_value_at_risk_for_selection(input_dict, compared_estimators=None, behavior_policy_name=None, dataset_id=None, alphas=None, hue='estimator', legend=True, n_cols=None, sharey=False, fig_dir=None, fig_name='estimated_conditional_value_at_risk.png')[source]#

Visualize the conditional value at risk estimated by cumulative distribution OPE estimators (cdf plot).

Parameters:
  • input_dict (OPEInputDict or MultipleInputDict) –

    Dictionary of the OPE inputs for each evaluation policy.

    key: [evaluation_policy][
        evaluation_policy_action,
        evaluation_policy_action_dist,
        state_action_value_prediction,
        initial_state_value_prediction,
        state_action_marginal_importance_weight,
        state_marginal_importance_weight,
        on_policy_policy_value,
        gamma,
        behavior_policy,
        evaluation_policy,
        dataset_id,
    ]
    

    See also

    scope_rl.ope.input.CreateOPEInput describes the components of input_dict.

  • compared_estimators (list of str, default=None) – Name of compared estimators. When None is given, all the estimators are compared.

  • behavior_policy_name (str, default=None) – Name of the behavior policy.

  • dataset_id (int, default=None) – Id of the logged dataset.

  • alphas (array-like of shape (n_alpha, ), default=None) – Set of proportions of the shaded region. The values should be within [0, 1). If None is given, np.linspace(0, 1, 21) will be used.

  • hue ({"estimator", "policy"}, default="estimator") – Hue of the plot.

  • legend (bool, default=True) – Whether to include a legend in the figure.

  • n_cols (int, default=None (> 0)) – Number of columns in the figure.

  • sharey (bool, default=False) – If True, the y-axis will be shared among different evaluation policies.

  • fig_dir (Path, default=None) – Path to store the bar figure. If None is given, the figure will not be saved.

  • fig_name (str, default="estimated_conditional_value_at_risk.png") – Name of the bar figure.

visualize_interquartile_range_for_selection(input_dict, compared_estimators=None, behavior_policy_name=None, dataset_id=None, alpha=0.05, hue='estimator', sharey=False, fig_dir=None, fig_name='estimated_interquartile_range.png')[source]#

Visualize the interquartile range estimated by cumulative distribution OPE estimators (box plot).

Parameters:
  • input_dict (OPEInputDict or MultipleInputDict) –

    Dictionary of the OPE inputs for each evaluation policy.

    key: [evaluation_policy][
        evaluation_policy_action,
        evaluation_policy_action_dist,
        state_action_value_prediction,
        initial_state_value_prediction,
        state_action_marginal_importance_weight,
        state_marginal_importance_weight,
        on_policy_policy_value,
        gamma,
        behavior_policy,
        evaluation_policy,
        dataset_id,
    ]
    

    See also

    scope_rl.ope.input.CreateOPEInput describes the components of input_dict.

  • compared_estimators (list of str, default=None) – Name of compared estimators. When None is given, all the estimators are compared.

  • behavior_policy_name (str, default=None) – Name of the behavior policy.

  • dataset_id (int, default=None) – Id of the logged dataset.

  • alpha (float, default=0.05) – Significance level. The value should be within [0, 1).

  • hue ({"estimator", "policy"}, default="estimator") – Hue of the plot.

  • sharey (bool, default=False) – If True, the y-axis will be shared among different evaluation policies.

  • fig_dir (Path, default=None) – Path to store the bar figure. If None is given, the figure will not be saved.

  • fig_name (str, default="estimated_interquartile_range.png") – Name of the bar figure.

visualize_policy_value_with_multiple_estimates_standard_ope(input_dict, compared_estimators=None, behavior_policy_name=None, plot_type='ci', hue='estimator', legend=True, sharey=False, fig_dir=None, fig_name='estimated_policy_value_multiple_standard_ope.png')[source]#

Visualize the policy value estimated by OPE estimators across multiple logged datasets.

Note

This function is applicable only when MultipleLoggedDataset is used and MultipleInputDict is collected by the same evaluation policy across logged datasets.

Parameters:
  • input_dict (OPEInputDict or MultipleInputDict) –

    Dictionary of the OPE inputs for each evaluation policy.

    key: [evaluation_policy][
        evaluation_policy_action,
        evaluation_policy_action_dist,
        state_action_value_prediction,
        initial_state_value_prediction,
        state_action_marginal_importance_weight,
        state_marginal_importance_weight,
        on_policy_policy_value,
        gamma,
        behavior_policy,
        evaluation_policy,
        dataset_id,
    ]
    

    See also

    scope_rl.ope.input.CreateOPEInput describes the components of input_dict.

  • compared_estimators (list of str, default=None) – Name of compared estimators. If None is given, all the estimators are compared.

  • behavior_policy_name (str, default=None) – Name of the behavior policy.

  • plot_type ({"ci", "scatter", "violin"}, default="ci") – Type of plot. If “ci” is given, we get the empirical average of the estimated values with their estimated confidence intervals. If “scatter” is given, we get a scatter plot of estimated values.

  • hue ({"estimator", "policy"}, default="estimator") – Hue of the plot.

  • legend (bool, default=True) – Whether to include a legend in the scatter plot.

  • sharey (bool, default=False) – If True, the y-axis will be shared among different estimators or evaluation policies.

  • fig_dir (Path, default=None) – Path to store the bar figure. If None is given, the figure will not be saved.

  • fig_name (str, default="estimated_policy_value_multiple_standard_ope.png") – Name of the bar figure.

visualize_cumulative_distribution_function_with_multiple_estimates(input_dict, compared_estimators=None, behavior_policy_name=None, scale_min=None, scale_max=None, n_partition=None, plot_type='ci_hue', hue='estimator', legend=True, n_cols=None, sharey=False, fig_dir=None, fig_name='estimated_policy_value_multiple.png')[source]#

Visualize the cumulative distribution function estimated by OPE estimators across multiple logged datasets.

Note

This function is applicable only when MultipleLoggedDataset is used and MultipleInputDict is collected by the same evaluation policy across logged datasets.

This function is not applicable when the data-driven reward scaler is used. Please set scale_min, scale_max, and n_partition explicitly to use this function.

Parameters:
  • input_dict (OPEInputDict or MultipleInputDict) –

    Dictionary of the OPE inputs for each evaluation policy.

    key: [evaluation_policy][
        evaluation_policy_action,
        evaluation_policy_action_dist,
        state_action_value_prediction,
        initial_state_value_prediction,
        state_action_marginal_importance_weight,
        state_marginal_importance_weight,
        on_policy_policy_value,
        gamma,
        behavior_policy,
        evaluation_policy,
        dataset_id,
    ]
    

    See also

    scope_rl.ope.input.CreateOPEInput describes the components of input_dict.

  • compared_estimators (list of str, default=None) – Name of compared estimators. If None is given, all the estimators are compared.

  • behavior_policy_name (str, default=None) – Name of the behavior policy.

  • scale_min (float, default=None) – Minimum value of the reward scale in the CDF.

  • scale_max (float, default=None) – Maximum value of the reward scale in the CDF.

  • n_partition (int, default=None) – Number of partitions in the reward scale (x-axis of the CDF).

  • plot_type ({"ci_hue", "ci_behavior_policy", "enumerate"}, default="ci_hue") – Type of plot. If a “ci” option ("ci_hue" or "ci_behavior_policy") is given, the method visualizes the average policy value and its 95% confidence intervals based on the multiple estimates. If “enumerate” is given, we get a scatter plot of estimated values.

  • hue ({"estimator", "policy"}, default="estimator") – Hue of the plot.

  • legend (bool, default=True) – Whether to include a legend in the scatter plot.

  • sharey (bool, default=False) – If True, the y-axis will be shared among different estimators or evaluation policies.

  • fig_dir (Path, default=None) – Path to store the bar figure. If None is given, the figure will not be saved.

  • fig_name (str, default="estimated_policy_value_multiple.png") – Name of the bar figure.
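The requirement to fix scale_min, scale_max, and n_partition can be motivated with a small sketch: CDF estimates from different logged datasets can only be averaged point by point when they are evaluated on one shared reward grid. The code below is an illustration under assumptions, not the library's code. In particular, it treats n_partition as the number of grid points, uses plain empirical CDFs of sampled rewards, and the helper names are hypothetical.

```python
def empirical_cdf_on_grid(returns, scale_min, scale_max, n_partition):
    """Evaluate the empirical CDF of `returns` on a fixed, evenly spaced grid."""
    if n_partition < 2:
        raise ValueError("n_partition must be at least 2")
    step = (scale_max - scale_min) / (n_partition - 1)
    grid = [scale_min + i * step for i in range(n_partition)]
    cdf = [sum(r <= x for r in returns) / len(returns) for x in grid]
    return grid, cdf


def average_cdfs(returns_per_dataset, scale_min, scale_max, n_partition):
    """Average the per-dataset CDFs point by point on the shared grid."""
    grid, cdfs = None, []
    for returns in returns_per_dataset:
        grid, cdf = empirical_cdf_on_grid(returns, scale_min, scale_max, n_partition)
        cdfs.append(cdf)
    mean_cdf = [sum(column) / len(column) for column in zip(*cdfs)]
    return grid, mean_cdf


# Two hypothetical logged datasets with different reward ranges.
returns_per_dataset = [[1, 2, 3, 4], [0, 0, 4, 4]]
grid, mean_cdf = average_cdfs(returns_per_dataset, scale_min=0, scale_max=4, n_partition=5)
print(grid)      # [0.0, 1.0, 2.0, 3.0, 4.0]
print(mean_cdf)  # [0.25, 0.375, 0.5, 0.625, 1.0]
```

With a data-driven scaler, each dataset would produce its own grid, and the column-wise average in `average_cdfs` would mix CDF values taken at different reward levels.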

visualize_policy_value_with_multiple_estimates_cumulative_distribution_ope(input_dict, compared_estimators=None, behavior_policy_name=None, plot_type='ci', hue='estimator', legend=True, sharey=False, fig_dir=None, fig_name='estimated_policy_value_multiple_cumulative_distribution_ope.png')[source]#

Visualize the policy value estimated by cumulative distribution OPE estimators across multiple logged datasets.

Note

This function is applicable only when MultipleLoggedDataset is used and MultipleInputDict is collected by the same evaluation policy across logged datasets.

Parameters:
  • input_dict (OPEInputDict or MultipleInputDict) –

    Dictionary of the OPE inputs for each evaluation policy.

    key: [evaluation_policy][
        evaluation_policy_action,
        evaluation_policy_action_dist,
        state_action_value_prediction,
        initial_state_value_prediction,
        state_action_marginal_importance_weight,
        state_marginal_importance_weight,
        on_policy_policy_value,
        gamma,
        behavior_policy,
        evaluation_policy,
        dataset_id,
    ]
    

    See also

    scope_rl.ope.input.CreateOPEInput describes the components of input_dict.

  • compared_estimators (list of str, default=None) – Name of compared estimators. If None is given, all the estimators are compared.

  • behavior_policy_name (str, default=None) – Name of the behavior policy.

  • plot_type ({"ci", "scatter", "violin"}, default="ci") – Type of plot. If “ci” is given, we get the empirical average of the estimated values with their estimated confidence intervals. If “scatter” is given, we get a scatter plot of estimated values.

  • hue ({"estimator", "policy"}, default="estimator") – Hue of the plot.

  • legend (bool, default=True) – Whether to include a legend in the scatter plot.

  • sharey (bool, default=False) – If True, the y-axis will be shared among different estimators or evaluation policies.

  • fig_dir (Path, default=None) – Path to store the bar figure. If None is given, the figure will not be saved.

  • fig_name (str, default="estimated_policy_value_multiple_cumulative_distribution_ope.png") – Name of the bar figure.

visualize_variance_with_multiple_estimates(input_dict, compared_estimators=None, behavior_policy_name=None, plot_type='ci', hue='estimator', legend=True, sharey=False, fig_dir=None, fig_name='estimated_variance_multiple.png')[source]#

Visualize the variance of the trajectory-wise reward under the evaluation policy estimated by OPE estimators across multiple logged datasets.

Note

This function is applicable only when MultipleLoggedDataset is used and MultipleInputDict is collected by the same evaluation policy across logged datasets.

Parameters:
  • input_dict (OPEInputDict or MultipleInputDict) –

    Dictionary of the OPE inputs for each evaluation policy.

    key: [evaluation_policy][
        evaluation_policy_action,
        evaluation_policy_action_dist,
        state_action_value_prediction,
        initial_state_value_prediction,
        state_action_marginal_importance_weight,
        state_marginal_importance_weight,
        on_policy_policy_value,
        gamma,
        behavior_policy,
        evaluation_policy,
        dataset_id,
    ]
    

    See also

    scope_rl.ope.input.CreateOPEInput describes the components of input_dict.

  • compared_estimators (list of str, default=None) – Name of compared estimators. If None is given, all the estimators are compared.

  • behavior_policy_name (str, default=None) – Name of the behavior policy.

  • plot_type ({"ci", "scatter", "violin"}, default="ci") – Type of plot. If “ci” is given, we get the empirical average of the estimated values with their estimated confidence intervals. If “scatter” is given, we get a scatter plot of estimated values.

  • hue ({"estimator", "policy"}, default="estimator") – Hue of the plot.

  • legend (bool, default=True) – Whether to include a legend in the scatter plot.

  • sharey (bool, default=False) – If True, the y-axis will be shared among different estimators or evaluation policies.

  • fig_dir (Path, default=None) – Path to store the bar figure. If None is given, the figure will not be saved.

  • fig_name (str, default="estimated_variance_multiple.png") – Name of the bar figure.

visualize_conditional_value_at_risk_with_multiple_estimates(input_dict, compared_estimators=None, behavior_policy_name=None, alpha=0.05, plot_type='ci', hue='estimator', legend=True, sharey=False, fig_dir=None, fig_name='estimated_conditional_value_at_risk_multiple.png')[source]#

Visualize the conditional value at risk of the trajectory-wise reward under the evaluation policy estimated by OPE estimators across multiple logged datasets.

Note

This function is applicable only when MultipleLoggedDataset is used and MultipleInputDict is collected by the same evaluation policy across logged datasets.

Parameters:
  • input_dict (OPEInputDict or MultipleInputDict) –

    Dictionary of the OPE inputs for each evaluation policy.

    key: [evaluation_policy][
        evaluation_policy_action,
        evaluation_policy_action_dist,
        state_action_value_prediction,
        initial_state_value_prediction,
        state_action_marginal_importance_weight,
        state_marginal_importance_weight,
        on_policy_policy_value,
        gamma,
        behavior_policy,
        evaluation_policy,
        dataset_id,
    ]
    

    See also

    scope_rl.ope.input.CreateOPEInput describes the components of input_dict.

  • compared_estimators (list of str, default=None) – Name of compared estimators. If None is given, all the estimators are compared.

  • behavior_policy_name (str, default=None) – Name of the behavior policy.

  • alpha (float, default=0.05) – Proportion of the shaded region in CVaR estimate. The value should be within [0, 1).

  • plot_type ({"ci", "scatter", "violin"}, default="ci") – Type of plot. If “ci” is given, we get the empirical average of the estimated values with their estimated confidence intervals. If “scatter” is given, we get a scatter plot of estimated values.

  • hue ({"estimator", "policy"}, default="estimator") – Hue of the plot.

  • legend (bool, default=True) – Whether to include a legend in the scatter plot.

  • sharey (bool, default=False) – If True, the y-axis will be shared among different estimators or evaluation policies.

  • fig_dir (Path, default=None) – Path to store the bar figure. If None is given, the figure will not be saved.

  • fig_name (str, default="estimated_conditional_value_at_risk_multiple.png") – Name of the bar figure.

visualize_lower_quartile_with_multiple_estimates(input_dict, compared_estimators=None, behavior_policy_name=None, alpha=0.05, plot_type='ci', hue='estimator', legend=True, sharey=False, fig_dir=None, fig_name='estimated_conditional_value_at_risk_multiple.png')[source]#

Visualize the lower quartile of the trajectory-wise reward under the evaluation policy estimated by OPE estimators across multiple logged datasets.

Note

This function is applicable only when MultipleLoggedDataset is used and MultipleInputDict is collected by the same evaluation policy across logged datasets.

Parameters:
  • input_dict (OPEInputDict or MultipleInputDict) –

    Dictionary of the OPE inputs for each evaluation policy.

    key: [evaluation_policy][
        evaluation_policy_action,
        evaluation_policy_action_dist,
        state_action_value_prediction,
        initial_state_value_prediction,
        state_action_marginal_importance_weight,
        state_marginal_importance_weight,
        on_policy_policy_value,
        gamma,
        behavior_policy,
        evaluation_policy,
        dataset_id,
    ]
    

    See also

    scope_rl.ope.input.CreateOPEInput describes the components of input_dict.

  • compared_estimators (list of str, default=None) – Name of compared estimators. If None is given, all the estimators are compared.

  • behavior_policy_name (str, default=None) – Name of the behavior policy.

  • alpha (float, default=0.05) – Proportion of the shaded region in the lower quartile estimate. The value should be within [0, 1).

  • plot_type ({"ci", "scatter", "violin"}, default="ci") – Type of plot. If “ci” is given, we get the empirical average of the estimated values with their estimated confidence intervals. If “scatter” is given, we get a scatter plot of estimated values.

  • hue ({"estimator", "policy"}, default="estimator") – Hue of the plot.

  • legend (bool, default=True) – Whether to include a legend in the scatter plot.

  • sharey (bool, default=False) – If True, the y-axis will be shared among different estimators or evaluation policies.

  • fig_dir (Path, default=None) – Path to store the bar figure. If None is given, the figure will not be saved.

  • fig_name (str, default="estimated_conditional_value_at_risk_multiple.png") – Name of the bar figure.

obtain_topk_policy_value_selected_by_standard_ope(input_dict, compared_estimators=None, behavior_policy_name=None, dataset_id=None, max_topk=None, return_safety_violation_rate=False, safety_threshold=None, relative_safety_criteria=None, return_by_dataframe=False)[source]#

Obtain the topk deployment result (policy value) selected by standard OPE.

Parameters:
  • input_dict (OPEInputDict or MultipleInputDict) –

    Dictionary of the OPE inputs for each evaluation policy.

    key: [evaluation_policy][
        evaluation_policy_action,
        evaluation_policy_action_dist,
        state_action_value_prediction,
        initial_state_value_prediction,
        state_action_marginal_importance_weight,
        state_marginal_importance_weight,
        on_policy_policy_value,
        gamma,
        behavior_policy,
        evaluation_policy,
        dataset_id,
    ]
    

    See also

    scope_rl.ope.input.CreateOPEInput describes the components of input_dict.

  • compared_estimators (list of str, default=None) – Name of compared estimators. When None is given, all the estimators are compared.

  • behavior_policy_name (str, default=None) – Name of the behavior policy.

  • dataset_id (int, default=None) – Id of the logged dataset. If None, the average of the result will be shown.

  • max_topk (int, default=None) – Maximum number of policies to be deployed. If None is given, all the policies will be deployed.

  • return_safety_violation_rate (bool, default=False) – Whether to calculate and return the safety violation rate.

  • safety_threshold (float, default=None) – A policy whose policy value is below the given threshold is considered unsafe.

  • relative_safety_criteria (float, default=None) – The relative policy value required to be considered a safe policy. For example, when 0.9 is given, candidate policy must exceed 90% of the behavior policy performance. Only applicable when using a single behavior policy.

  • return_by_dataframe (bool, default=False) – Whether to return the result in a dataframe format.

Returns:

topk_metric_dict/topk_metric_df – Dictionary/dataframe containing the following top-k risk-return tradeoff metrics. Note that policy performance refers to the (standard) policy value here. When returning a dataframe, the average value is returned.

key: [estimator][
    k-th,
    best,  # return
    worst,  # risk
    mean,   # risk
    std,    # risk
    safety_violation_rate,  # risk
    sharpe_ratio,  # risk-return tradeoff
]
k-th: ndarray of shape (max_topk, total_n_datasets)

Policy performance of the k-th deployment policy.

best: ndarray of shape (max_topk, total_n_datasets)

Best policy performance among the top-k deployment policies.

worst: ndarray of shape (max_topk, total_n_datasets)

Worst policy performance among the top-k deployment policies.

mean: ndarray of shape (max_topk, total_n_datasets)

Mean policy performance of the top-k deployment policies.

std: ndarray of shape (max_topk, total_n_datasets)

Standard deviation of the policy performance among the top-k deployment policies.

safety_violation_rate: ndarray of shape (max_topk, total_n_datasets)

Safety violation rate regarding the policy performance of the top-k deployment policies.

sharpe_ratio: ndarray of shape (max_topk, total_n_datasets)

Risk-return tradeoff metrics defined as follows: \(S(\hat{V}) := (\mathrm{Best@k} - V(\pi_0)) / \mathrm{Std@k}\).

Return type:

dict or dataframe
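
The metrics listed above can be illustrated with a small NumPy sketch. This is not the SCOPE-RL implementation: the function name `topk_metrics` and its arguments are hypothetical, it handles a single dataset (so each metric is a 1-D array of length max_topk rather than shape (max_topk, total_n_datasets)), and it assumes the population standard deviation and an undefined (NaN) Sharpe ratio when Std@k is zero.

```python
import numpy as np


def topk_metrics(estimated_values, true_values, behavior_value, max_topk=None):
    """Rank candidates by estimated value and summarize the true performance of the top-k.

    Hypothetical helper for illustration only; not part of SCOPE-RL.
    """
    estimated_values = np.asarray(estimated_values, dtype=float)
    true_values = np.asarray(true_values, dtype=float)
    n_policies = len(estimated_values)
    max_topk = n_policies if max_topk is None else min(max_topk, n_policies)

    # Order candidates from the highest to the lowest *estimated* value,
    # then look up the *true* value of each deployed policy.
    ranking = np.argsort(-estimated_values)
    deployed = true_values[ranking][:max_topk]

    names = ("k-th", "best", "worst", "mean", "std", "sharpe_ratio")
    metrics = {name: np.zeros(max_topk) for name in names}
    for i in range(max_topk):
        topk = deployed[: i + 1]
        metrics["k-th"][i] = topk[-1]
        metrics["best"][i] = topk.max()
        metrics["worst"][i] = topk.min()
        metrics["mean"][i] = topk.mean()
        metrics["std"][i] = topk.std()  # population std (assumption)
        # Sharpe ratio: (Best@k - V(pi_0)) / Std@k, left as NaN when Std@k == 0.
        std = metrics["std"][i]
        metrics["sharpe_ratio"][i] = (
            (metrics["best"][i] - behavior_value) / std if std > 0 else np.nan
        )
    return metrics


example = topk_metrics(
    estimated_values=[1.0, 3.0, 2.0],  # OPE estimates of three candidates
    true_values=[1.5, 2.0, 4.0],       # their on-policy values
    behavior_value=1.0,                # V(pi_0)
)
```

In this toy input the estimator ranks the second candidate first, so Best@1 is 2.0 even though the best candidate overall has a true value of 4.0; it only enters at k = 2.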

obtain_topk_policy_value_selected_by_cumulative_distribution_ope(input_dict, compared_estimators=None, behavior_policy_name=None, dataset_id=None, max_topk=None, return_safety_violation_rate=False, safety_threshold=None, relative_safety_criteria=None, return_by_dataframe=False)[source]#

Obtain the topk deployment result (policy value) selected by cumulative distribution OPE.

Parameters:
  • input_dict (OPEInputDict or MultipleInputDict) –

    Dictionary of the OPE inputs for each evaluation policy.

    key: [evaluation_policy][
        evaluation_policy_action,
        evaluation_policy_action_dist,
        state_action_value_prediction,
        initial_state_value_prediction,
        state_action_marginal_importance_weight,
        state_marginal_importance_weight,
        on_policy_policy_value,
        gamma,
        behavior_policy,
        evaluation_policy,
        dataset_id,
    ]
    

    See also

    scope_rl.ope.input.CreateOPEInput describes the components of input_dict.

  • compared_estimators (list of str, default=None) – Name of compared estimators. When None is given, all the estimators are compared.

  • behavior_policy_name (str, default=None) – Name of the behavior policy.

  • dataset_id (int, default=None) – Id of the logged dataset. If None, the average of the result will be shown.

  • max_topk (int, default=None) – Maximum number of policies to be deployed. If None is given, all the policies will be deployed.

  • return_safety_violation_rate (bool, default=False) – Whether to calculate and return the safety violation rate.

  • safety_threshold (float, default=None) – A policy whose policy value is below the given threshold is considered unsafe.

  • relative_safety_criteria (float, default=None) – The relative policy value required to be considered a safe policy. For example, when 0.9 is given, candidate policy must exceed 90% of the behavior policy performance. Only applicable when using a single behavior policy.

  • return_by_dataframe (bool, default=False) – Whether to return the result in a dataframe format.

Returns:

topk_metric_dict/topk_metric_df – Dictionary/dataframe containing the following top-k risk-return tradeoff metrics. Note that policy performance refers to the (standard) policy value here. When returning a dataframe, the average value is returned.

key: [estimator][
    k-th,
    best,  # return
    worst,  # risk
    mean,   # risk
    std,    # risk
    safety_violation_rate,  # risk
    sharpe_ratio,  # risk-return tradeoff
]
k-th: ndarray of shape (max_topk, total_n_datasets)

Policy performance of the k-th deployment policy.

best: ndarray of shape (max_topk, total_n_datasets)

Best policy performance among the top-k deployment policies.

worst: ndarray of shape (max_topk, total_n_datasets)

Worst policy performance among the top-k deployment policies.

mean: ndarray of shape (max_topk, total_n_datasets)

Mean policy performance of the top-k deployment policies.

std: ndarray of shape (max_topk, total_n_datasets)

Standard deviation of the policy performance among the top-k deployment policies.

safety_violation_rate: ndarray of shape (max_topk, total_n_datasets)

Safety violation rate regarding the policy performance of the top-k deployment policies.

sharpe_ratio: ndarray of shape (max_topk, total_n_datasets)

Risk-return tradeoff metrics defined as follows: \(S(\hat{V}) := (\mathrm{Best@k} - V(\pi_0)) / \mathrm{Std@k}\).

Return type:

dict or dataframe
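
As a rough illustration of how safety_threshold and relative_safety_criteria relate to the safety violation rate, the sketch below counts the fraction of deployed policies whose value falls below a threshold. The helper `safety_violation_rate` is hypothetical, and the precedence rule (an explicit safety_threshold takes priority over relative_safety_criteria) is an assumption rather than documented behavior.

```python
import numpy as np


def safety_violation_rate(
    topk_values,
    behavior_value=None,
    safety_threshold=None,
    relative_safety_criteria=None,
):
    """Fraction of deployed policies whose value is below the safety threshold.

    Hypothetical helper for illustration only; not part of SCOPE-RL.
    """
    topk_values = np.asarray(topk_values, dtype=float)
    if safety_threshold is None:
        if relative_safety_criteria is None or behavior_value is None:
            raise ValueError(
                "Give either safety_threshold, or both "
                "relative_safety_criteria and behavior_value."
            )
        # E.g., 0.9 means a candidate must reach 90% of the behavior policy value.
        safety_threshold = relative_safety_criteria * behavior_value
    return float(np.mean(topk_values < safety_threshold))


# Relative criterion: threshold = 0.9 * 2.0 = 1.8, so only 1.5 is unsafe.
relative_rate = safety_violation_rate(
    [2.0, 4.0, 1.5], behavior_value=2.0, relative_safety_criteria=0.9
)
# Absolute criterion: 2.0 and 1.5 are below 2.5.
absolute_rate = safety_violation_rate([2.0, 4.0, 1.5], safety_threshold=2.5)
```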

obtain_topk_policy_value_selected_by_lower_bound(input_dict, compared_estimators=None, behavior_policy_name=None, dataset_id=None, max_topk=None, return_safety_violation_rate=False, safety_threshold=None, relative_safety_criteria=None, clip_sharpe_ratio=False, cis=['bootstrap'], ope_alpha=0.05, n_bootstrap_samples=100, random_state=None, return_by_dataframe=False)[source]#

Obtain the topk deployment (policy value) result selected by its estimated lower bound.

Parameters:
  • input_dict (OPEInputDict or MultipleInputDict) –

    Dictionary of the OPE inputs for each evaluation policy.

    key: [evaluation_policy][
        evaluation_policy_action,
        evaluation_policy_action_dist,
        state_action_value_prediction,
        initial_state_value_prediction,
        state_action_marginal_importance_weight,
        state_marginal_importance_weight,
        on_policy_policy_value,
        gamma,
        behavior_policy,
        evaluation_policy,
        dataset_id,
    ]
    

    See also

    scope_rl.ope.input.CreateOPEInput describes the components of input_dict.

  • compared_estimators (list of str, default=None) – Name of compared estimators. When None is given, all the estimators are compared.

  • behavior_policy_name (str, default=None) – Name of the behavior policy.

  • dataset_id (int, default=None) – Id of the logged dataset. If None, the average of the result will be shown.

  • max_topk (int, default=None) – Maximum number of policies to be deployed. If None is given, all the policies will be deployed.

  • return_safety_violation_rate (bool, default=False) – Whether to calculate and return the safety violation rate.

  • safety_threshold (float, default=None) – A policy whose policy value is below the given threshold is considered unsafe.

  • relative_safety_criteria (float, default=None) – The relative policy value required to be considered a safe policy. For example, when 0.9 is given, candidate policy must exceed 90% of the behavior policy performance. Only applicable when using a single behavior policy.

  • clip_sharpe_ratio (bool, default=False) – Whether to clip a large value of SharpeRatio with 1e2.

  • cis (list of {"bootstrap", "hoeffding", "bernstein", "ttest"}, default=["bootstrap"]) – Estimation methods for confidence intervals.

  • ope_alpha (float, default=0.05) – Significance level. The value should be within [0, 1).

  • n_bootstrap_samples (int, default=100 (> 0)) – Number of resampling performed in the bootstrap procedure.

  • random_state (int, default=None (>= 0)) – Random state.

  • return_by_dataframe (bool, default=False) – Whether to return the result in a dataframe format.

Returns:

topk_metric_dict/topk_metric_df – Dictionary/dataframe containing the following top-k risk-return tradeoff metrics. Note that policy performance refers to the (standard) policy value here. When returning a dataframe, the average value is returned.

key: [estimator][
    k-th,
    best,  # return
    worst,  # risk
    mean,   # risk
    std,    # risk
    safety_violation_rate,  # risk
    sharpe_ratio,  # risk-return tradeoff
]
k-th: ndarray of shape (max_topk, total_n_datasets)

Policy performance of the k-th deployment policy.

best: ndarray of shape (max_topk, total_n_datasets)

Best policy performance among the top-k deployment policies.

worst: ndarray of shape (max_topk, total_n_datasets)

Worst policy performance among the top-k deployment policies.

mean: ndarray of shape (max_topk, total_n_datasets)

Mean policy performance of the top-k deployment policies.

std: ndarray of shape (max_topk, total_n_datasets)

Standard deviation of the policy performance among the top-k deployment policies.

safety_violation_rate: ndarray of shape (max_topk, total_n_datasets)

Safety violation rate regarding the policy performance of the top-k deployment policies.

sharpe_ratio: ndarray of shape (max_topk, total_n_datasets)

Risk-return tradeoff metrics defined as follows: \(S(\hat{V}) := (\mathrm{Best@k} - V(\pi_0)) / \mathrm{Std@k}\).

Return type:

dict or dataframe
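
To make the idea of selecting by an estimated lower bound concrete, here is a minimal sketch of a percentile-bootstrap lower bound on the mean, followed by a ranking of two candidates. The function `bootstrap_lower_bound`, the toy per-trajectory estimates, and the choice of the alpha / 2 quantile (the lower end of a two-sided interval) are all assumptions made for illustration; SCOPE-RL may compute its bounds differently.

```python
import numpy as np


def bootstrap_lower_bound(samples, alpha=0.05, n_bootstrap_samples=100, random_state=None):
    """Lower end of a percentile-bootstrap confidence interval for the mean.

    Hypothetical helper for illustration only; not part of SCOPE-RL.
    """
    samples = np.asarray(samples, dtype=float)
    rng = np.random.default_rng(random_state)
    # Resample with replacement: one row of indices per bootstrap replicate.
    idx = rng.integers(0, len(samples), size=(n_bootstrap_samples, len(samples)))
    boot_means = samples[idx].mean(axis=1)
    return float(np.quantile(boot_means, alpha / 2))


# Toy per-trajectory estimates for two candidate policies (made-up numbers).
per_trajectory_estimates = {
    "policy_a": np.array([1.0, 1.1, 0.9, 1.0, 1.0]),    # mean 1.0, low variance
    "policy_b": np.array([5.0, -3.0, 4.0, -2.5, 2.0]),  # mean 1.1, high variance
}
lower_bounds = {
    name: bootstrap_lower_bound(values, random_state=0)
    for name, values in per_trajectory_estimates.items()
}
# Rank candidates from the highest to the lowest lower bound.
ranking = sorted(lower_bounds, key=lower_bounds.get, reverse=True)
```

Ranking by the point estimate would prefer policy_b here, while ranking by the lower bound favors the low-variance policy_a, which is the more conservative choice.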

obtain_topk_conditional_value_at_risk_selected_by_standard_ope(input_dict, compared_estimators=None, behavior_policy_name=None, dataset_id=None, ope_alpha=0.05, max_topk=None, return_safety_violation_rate=False, safety_threshold=None, clip_sharpe_ratio=False, return_by_dataframe=False)[source]#

Obtain the topk deployment result (CVaR) selected by standard OPE.

Parameters:
  • input_dict (OPEInputDict) –

    Dictionary of the OPE inputs for each evaluation policy.

    key: [evaluation_policy][
        evaluation_policy_action,
        evaluation_policy_action_dist,
        state_action_value_prediction,
        initial_state_value_prediction,
        state_action_marginal_importance_weight,
        state_marginal_importance_weight,
        on_policy_policy_value,
        gamma,
        behavior_policy,
        evaluation_policy,
        dataset_id,
    ]
    

    See also

    scope_rl.ope.input.CreateOPEInput describes the components of input_dict.

  • compared_estimators (list of str, default=None) – Name of compared estimators. When None is given, all the estimators are compared.

  • behavior_policy_name (str, default=None) – Name of the behavior policy.

  • dataset_id (int, default=None) – Id of the logged dataset. If None, the average of the result will be shown.

  • ope_alpha (float, default=0.05) – Proportion of the shaded region. The value should be within [0, 1].

  • max_topk (int, default=None) – Maximum number of policies to be deployed. If None is given, all the policies will be deployed.

  • return_safety_violation_rate (bool, default=False) – Whether to calculate and return the safety violation rate.

  • safety_threshold (float, default=None) – A policy whose policy value is below the given threshold is considered unsafe.

  • clip_sharpe_ratio (bool, default=False) – Whether to clip a large value of SharpeRatio with 1e2.

  • return_by_dataframe (bool, default=False) – Whether to return the result in a dataframe format.

Returns:

topk_metric_dict/topk_metric_df – Dictionary/dataframe containing the following top-k risk-return tradeoff metrics. Note that policy performance refers to CVaR here. When returning a dataframe, the average value is returned.

key: [estimator][
    k-th,
    best,  # return
    worst,  # risk
    mean,   # risk
    std,    # risk
    safety_violation_rate,  # risk
    sharpe_ratio,  # risk-return tradeoff
]
k-th: ndarray of shape (max_topk, total_n_datasets)

Policy performance of the k-th deployment policy.

best: ndarray of shape (max_topk, total_n_datasets)

Best policy performance among the top-k deployment policies.

worst: ndarray of shape (max_topk, total_n_datasets)

Worst policy performance among the top-k deployment policies.

mean: ndarray of shape (max_topk, total_n_datasets)

Mean policy performance of the top-k deployment policies.

std: ndarray of shape (max_topk, total_n_datasets)

Standard deviation of the policy performance among the top-k deployment policies.

safety_violation_rate: ndarray of shape (max_topk, total_n_datasets)

Safety violation rate regarding the policy performance of the top-k deployment policies.

sharpe_ratio: ndarray of shape (max_topk, total_n_datasets)

Risk-return tradeoff metrics defined as follows: \(S(\hat{V}) := (\mathrm{Best@k} - V(\pi_0)) / \mathrm{Std@k}\).

Return type:

dict or dataframe
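
For readers unfamiliar with CVaR as the performance measure, the following sketch computes an empirical CVaR directly from a sample of trajectory-wise returns, taken here as the mean of the worst alpha fraction of returns. The helper `empirical_cvar` is hypothetical, and SCOPE-RL derives CVaR from an estimated cumulative distribution function rather than from raw returns, so this only conveys the meaning of the quantity.

```python
import numpy as np


def empirical_cvar(returns, alpha=0.05):
    """Mean of the lowest alpha fraction of returns (at least one sample).

    Hypothetical helper for illustration only; not part of SCOPE-RL.
    """
    returns = np.sort(np.asarray(returns, dtype=float))
    n_tail = max(1, int(np.ceil(alpha * len(returns))))
    return float(returns[:n_tail].mean())


returns = np.arange(1.0, 21.0)              # toy returns 1, 2, ..., 20
cvar = empirical_cvar(returns, alpha=0.25)  # mean of the five lowest returns
```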

obtain_topk_conditional_value_at_risk_selected_by_cumulative_distribution_ope(input_dict, compared_estimators=None, behavior_policy_name=None, dataset_id=None, ope_alpha=0.05, max_topk=None, return_safety_violation_rate=False, safety_threshold=None, clip_sharpe_ratio=False, return_by_dataframe=False)[source]#

Obtain the topk deployment result (CVaR) selected by cumulative distribution OPE.

Parameters:
  • input_dict (OPEInputDict) –

    Dictionary of the OPE inputs for each evaluation policy.

    key: [evaluation_policy][
        evaluation_policy_action,
        evaluation_policy_action_dist,
        state_action_value_prediction,
        initial_state_value_prediction,
        state_action_marginal_importance_weight,
        state_marginal_importance_weight,
        on_policy_policy_value,
        gamma,
        behavior_policy,
        evaluation_policy,
        dataset_id,
    ]
    

    See also

    scope_rl.ope.input.CreateOPEInput describes the components of input_dict.

  • compared_estimators (list of str, default=None) – Name of compared estimators. When None is given, all the estimators are compared.

  • behavior_policy_name (str, default=None) – Name of the behavior policy.

  • dataset_id (int, default=None) – Id of the logged dataset. If None, the average of the result will be shown.

  • ope_alpha (float, default=0.05) – Proportion of the shaded region. The value should be within [0, 1].

  • max_topk (int, default=None) – Maximum number of policies to be deployed. If None is given, all the policies will be deployed.

  • return_safety_violation_rate (bool, default=False) – Whether to calculate and return the safety violation rate.

  • safety_threshold (float, default=None) – A policy whose policy value is below the given threshold is considered unsafe.

  • clip_sharpe_ratio (bool, default=False) – Whether to clip a large value of SharpeRatio with 1e2.

  • return_by_dataframe (bool, default=False) – Whether to return the result in a dataframe format.

Returns:

topk_metric_dict/topk_metric_df – Dictionary/dataframe containing the following top-k risk-return tradeoff metrics. Note that policy performance refers to CVaR here. When returning a dataframe, the average value is returned.

key: [estimator][
    k-th,
    best,  # return
    worst,  # risk
    mean,   # risk
    std,    # risk
    safety_violation_rate,  # risk
    sharpe_ratio,  # risk-return tradeoff
]
k-th: ndarray of shape (max_topk, total_n_datasets)

Policy performance of the k-th deployment policy.

best: ndarray of shape (max_topk, total_n_datasets)

Best policy performance among the top-k deployment policies.

worst: ndarray of shape (max_topk, total_n_datasets)

Worst policy performance among the top-k deployment policies.

mean: ndarray of shape (max_topk, total_n_datasets)

Mean policy performance of the top-k deployment policies.

std: ndarray of shape (max_topk, total_n_datasets)

Standard deviation of the policy performance among the top-k deployment policies.

safety_violation_rate: ndarray of shape (max_topk, total_n_datasets)

Safety violation rate regarding the policy performance of the top-k deployment policies.

sharpe_ratio: ndarray of shape (max_topk, total_n_datasets)

Risk-return tradeoff metrics defined as follows: \(S(\hat{V}) := (\mathrm{Best@k} - V(\pi_0)) / \mathrm{Std@k}\).

Return type:

dict or dataframe

obtain_topk_lower_quartile_selected_by_standard_ope(input_dict, compared_estimators=None, behavior_policy_name=None, dataset_id=None, ope_alpha=0.05, max_topk=None, return_safety_violation_rate=False, safety_threshold=None, clip_sharpe_ratio=False, return_by_dataframe=False)[source]#

Obtain the topk deployment result (lower quartile) selected by standard OPE.

Parameters:
  • input_dict (OPEInputDict or MultipleInputDict) –

    Dictionary of the OPE inputs for each evaluation policy.

    key: [evaluation_policy][
        evaluation_policy_action,
        evaluation_policy_action_dist,
        state_action_value_prediction,
        initial_state_value_prediction,
        state_action_marginal_importance_weight,
        state_marginal_importance_weight,
        on_policy_policy_value,
        gamma,
        behavior_policy,
        evaluation_policy,
        dataset_id,
    ]
    

    See also

    scope_rl.ope.input.CreateOPEInput describes the components of input_dict.

  • compared_estimators (list of str, default=None) – Name of compared estimators. When None is given, all the estimators are compared.

  • behavior_policy_name (str, default=None) – Name of the behavior policy.

  • dataset_id (int, default=None) – Id of the logged dataset. If None, the average of the result will be shown.

  • ope_alpha (float, default=0.05) – Proportion of the shaded region. The value should be within [0, 0.5].

  • max_topk (int, default=None) – Maximum number of policies to be deployed. If None is given, all the policies will be deployed.

  • return_safety_violation_rate (bool, default=False) – Whether to calculate and return the safety violation rate.

  • safety_threshold (float, default=None) – A policy whose policy value is below the given threshold is considered unsafe.

  • clip_sharpe_ratio (bool, default=False) – Whether to clip a large value of SharpeRatio with 1e2.

  • return_by_dataframe (bool, default=False) – Whether to return the result in a dataframe format.

Returns:

topk_metric_dict/topk_metric_df – Dictionary/dataframe containing the following top-k risk-return tradeoff metrics. Note that policy performance refers to the lower quartile here. When returning a dataframe, the average value is returned.

key: [estimator][
    k-th,
    best,  # return
    worst,  # risk
    mean,   # risk
    std,    # risk
    safety_violation_rate,  # risk
    sharpe_ratio,  # risk-return tradeoff
]
k-th: ndarray of shape (max_topk, total_n_datasets)

Policy performance of the k-th deployment policy.

best: ndarray of shape (max_topk, total_n_datasets)

Best policy performance among the top-k deployment policies.

worst: ndarray of shape (max_topk, total_n_datasets)

Worst policy performance among the top-k deployment policies.

mean: ndarray of shape (max_topk, total_n_datasets)

Mean policy performance of the top-k deployment policies.

std: ndarray of shape (max_topk, total_n_datasets)

Standard deviation of the policy performance among the top-k deployment policies.

safety_violation_rate: ndarray of shape (max_topk, total_n_datasets)

Safety violation rate regarding the policy performance of the top-k deployment policies.

sharpe_ratio: ndarray of shape (max_topk, total_n_datasets)

Risk-return tradeoff metrics defined as follows: \(S(\hat{V}) := (\mathrm{Best@k} - V(\pi_0)) / \mathrm{Std@k}\).

Return type:

dict or dataframe
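
The lower quartile used as the performance measure here is the alpha-quantile of the return distribution. The sketch below computes it from raw returns with `np.quantile`; the helper name `lower_quantile` is hypothetical, and SCOPE-RL obtains the quantile from an estimated cumulative distribution function instead, so treat this as an illustration of the quantity only.

```python
import numpy as np


def lower_quantile(returns, alpha=0.05):
    """Empirical alpha-quantile of the returns, with alpha restricted to [0, 0.5].

    Hypothetical helper for illustration only; not part of SCOPE-RL.
    """
    if not 0.0 <= alpha <= 0.5:
        raise ValueError("alpha should be within [0, 0.5]")
    return float(np.quantile(np.asarray(returns, dtype=float), alpha))


returns = np.arange(0.0, 101.0)                 # toy returns 0, 1, ..., 100
quartile = lower_quantile(returns, alpha=0.25)  # 25th percentile of the toy returns
```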

obtain_topk_lower_quartile_selected_by_cumulative_distribution_ope(input_dict, compared_estimators=None, behavior_policy_name=None, dataset_id=None, ope_alpha=0.05, max_topk=None, return_safety_violation_rate=False, safety_threshold=None, clip_sharpe_ratio=False, return_by_dataframe=False)[source]#

Obtain the topk deployment result (lower quartile) selected by cumulative distribution OPE.

Parameters:
  • input_dict (OPEInputDict) –

    Dictionary of the OPE inputs for each evaluation policy.

    key: [evaluation_policy][
        evaluation_policy_action,
        evaluation_policy_action_dist,
        state_action_value_prediction,
        initial_state_value_prediction,
        state_action_marginal_importance_weight,
        state_marginal_importance_weight,
        on_policy_policy_value,
        gamma,
        behavior_policy,
        evaluation_policy,
        dataset_id,
    ]
    

    See also

    scope_rl.ope.input.CreateOPEInput describes the components of input_dict.

  • compared_estimators (list of str, default=None) – Name of compared estimators. When None is given, all the estimators are compared.

  • behavior_policy_name (str, default=None) – Name of the behavior policy.

  • dataset_id (int, default=None) – Id of the logged dataset. If None, the average of the result will be shown.

  • ope_alpha (float, default=0.05) – Proportion of the shaded region. The value should be within [0, 0.5].

  • max_topk (int, default=None) – Maximum number of policies to be deployed. If None is given, all the policies will be deployed.

  • return_safety_violation_rate (bool, default=False) – Whether to calculate and return the safety violation rate.

  • safety_threshold (float, default=None) – A policy whose policy value is below the given threshold is considered unsafe.

  • clip_sharpe_ratio (bool, default=False) – Whether to clip a large value of SharpeRatio with 1e2.

  • return_by_dataframe (bool, default=False) – Whether to return the result in a dataframe format.

Returns:

topk_metric_dict/topk_metric_df – Dictionary/dataframe containing the following top-k risk-return tradeoff metrics. Note that policy performance refers to the lower quartile here. When returning a dataframe, the average value is returned.

key: [estimator][
    k-th,
    best,  # return
    worst,  # risk
    mean,   # risk
    std,    # risk
    safety_violation_rate,  # risk
    sharpe_ratio,  # risk-return tradeoff
]
k-th: ndarray of shape (max_topk, total_n_datasets)

Policy performance of the k-th deployment policy.

best: ndarray of shape (max_topk, total_n_datasets)

Best policy performance among the top-k deployment policies.

worst: ndarray of shape (max_topk, total_n_datasets)

Worst policy performance among the top-k deployment policies.

mean: ndarray of shape (max_topk, total_n_datasets)

Mean policy performance of the top-k deployment policies.

std: ndarray of shape (max_topk, total_n_datasets)

Standard deviation of the policy performance among the top-k deployment policies.

safety_violation_rate: ndarray of shape (max_topk, total_n_datasets)

Safety violation rate regarding the policy performance of the top-k deployment policies.

sharpe_ratio: ndarray of shape (max_topk, total_n_datasets)

Risk-return tradeoff metrics defined as follows: \(S(\hat{V}) := (\mathrm{Best@k} - V(\pi_0)) / \mathrm{Std@k}\).

Return type:

dict or dataframe
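
The clip_sharpe_ratio option caps large Sharpe ratio values at 1e2, which can arise when Std@k is close to zero. A minimal sketch of such clipping is shown below; the helper `clip_sharpe` is hypothetical, and clipping only from above is an assumption about the behavior.

```python
import numpy as np


def clip_sharpe(sharpe_ratio, upper=1e2):
    """Cap Sharpe ratio values at `upper`, leaving smaller values unchanged.

    Hypothetical helper for illustration only; not part of SCOPE-RL.
    """
    return np.minimum(np.asarray(sharpe_ratio, dtype=float), upper)


clipped = clip_sharpe([0.5, 3.0, 2500.0])
```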

visualize_topk_policy_value_selected_by_standard_ope(input_dict, compared_estimators=None, behavior_policy_name=None, dataset_id=None, metrics=['k-th', 'best', 'worst', 'mean', 'std', 'safety_violation_rate', 'sharpe_ratio'], max_topk=None, safety_threshold=None, relative_safety_criteria=None, clip_sharpe_ratio=False, ymax_sharpe_ratio=None, visualize_ci=False, plot_ci='bootstrap', plot_alpha=0.05, plot_n_bootstrap_samples=100, random_state=None, legend=True, fig_dir=None, fig_name='topk_policy_value_standard_ope.png')[source]#

Visualize the topk deployment result (policy value) selected by standard OPE.

Parameters:
  • input_dict (OPEInputDict or MultipleInputDict) –

    Dictionary of the OPE inputs for each evaluation policy.

    key: [evaluation_policy][
        evaluation_policy_action,
        evaluation_policy_action_dist,
        state_action_value_prediction,
        initial_state_value_prediction,
        state_action_marginal_importance_weight,
        state_marginal_importance_weight,
        on_policy_policy_value,
        gamma,
        behavior_policy,
        evaluation_policy,
        dataset_id,
    ]
    

    See also

    scope_rl.ope.input.CreateOPEInput describes the components of input_dict.

  • compared_estimators (list of str, default=None) – Name of compared estimators. When None is given, all the estimators are compared.

  • behavior_policy_name (str, default=None) – Name of the behavior policy.

  • dataset_id (int, default=None) – Id of the logged dataset. If None, the average of the result will be shown.

  • metrics (list of {"k-th", "best", "worst", "mean", "std", "safety_violation_rate", "sharpe_ratio"}, default=["k-th", "best", "worst", "mean", "std", "safety_violation_rate", "sharpe_ratio"]) –

    Indicates which metrics to report: the policy performance statistics {“best”, “worst”, “mean”, “std”}, the Sharpe ratio, and the safety violation rate. When “k-th” is included, the policy performance of the (estimated) k-th policy is visualized.

    We define the sharpe ratio for OPE as \(S(\hat{V}) := (\mathrm{Best@k} - V(\pi_0)) / \mathrm{Std@k}\).

  • max_topk (int, default=None) – Maximum number of policies to be deployed. If None is given, all the policies will be deployed.

  • safety_threshold (float, default=None) – A policy whose policy value is below the given threshold is considered unsafe.

  • relative_safety_criteria (float, default=None) – The relative policy value required to be considered a safe policy. For example, when 0.9 is given, candidate policy must exceed 90% of the behavior policy performance. Only applicable when using a single behavior policy.

  • clip_sharpe_ratio (bool, default=False) – Whether to clip a large value of SharpeRatio with 1e2.

  • ymax_sharpe_ratio (float, default=None) – Maximum value in y-axis of the plot of SharpeRatio.

  • visualize_ci (bool, default=False) – Whether to visualize ci. (Only applicable when MultipleInputDict is given.)

  • plot_ci ({"bootstrap", "hoeffding", "bernstein", "ttest"}, default="bootstrap") – Method to estimate the confidence interval.

  • plot_alpha (float, default=0.05) – Significance level. The value should be within [0, 1).

  • plot_n_bootstrap_samples (int, default=100 (> 0)) – Number of resampling performed in the bootstrap procedure.

  • random_state (int, default=None (>= 0)) – Random state.

  • legend (bool, default=True) – Whether to include a legend in the figure.

  • fig_dir (Path, default=None) – Path to store the bar figure. If None is given, the figure will not be saved.

  • fig_name (str, default="topk_policy_value_standard_ope.png") – Name of the bar figure.

visualize_topk_policy_value_selected_by_cumulative_distribution_ope(input_dict, compared_estimators=None, behavior_policy_name=None, dataset_id=None, metrics=['k-th', 'best', 'worst', 'mean', 'std', 'safety_violation_rate', 'sharpe_ratio'], max_topk=None, safety_threshold=None, relative_safety_criteria=None, clip_sharpe_ratio=False, ymax_sharpe_ratio=None, visualize_ci=False, plot_ci='bootstrap', plot_alpha=0.05, plot_n_bootstrap_samples=100, random_state=None, legend=True, fig_dir=None, fig_name='topk_policy_value_cumulative_distribution_ope.png')[source]#

Visualize the topk deployment result (policy value) selected by cumulative distribution OPE.

Parameters:
  • input_dict (OPEInputDict or MultipleInputDict) –

    Dictionary of the OPE inputs for each evaluation policy.

    key: [evaluation_policy][
        evaluation_policy_action,
        evaluation_policy_action_dist,
        state_action_value_prediction,
        initial_state_value_prediction,
        state_action_marginal_importance_weight,
        state_marginal_importance_weight,
        on_policy_policy_value,
        gamma,
        behavior_policy,
        evaluation_policy,
        dataset_id,
    ]
    

    See also

    scope_rl.ope.input.CreateOPEInput describes the components of input_dict.

  • compared_estimators (list of str, default=None) – Name of compared estimators. When None is given, all the estimators are compared.

  • behavior_policy_name (str, default=None) – Name of the behavior policy.

  • dataset_id (int, default=None) – Id of the logged dataset. If None, the average of the result will be shown.

  • metrics (list of {"k-th", "best", "worst", "mean", "std", "safety_violation_rate", "sharpe_ratio"}, default=["k-th", "best", "worst", "mean", "std", "safety_violation_rate", "sharpe_ratio"]) –

    Indicates which metrics to report: the policy performance statistics {“best”, “worst”, “mean”, “std”}, the Sharpe ratio, and the safety violation rate. When “k-th” is included, the policy performance of the (estimated) k-th policy is visualized.

    We define the sharpe ratio for OPE as \(S(\hat{V}) := (\mathrm{Best@k} - V(\pi_0)) / \mathrm{Std@k}\).

  • max_topk (int, default=None) – Maximum number of policies to be deployed. If None is given, all the policies will be deployed.

  • safety_threshold (float, default=None) – A policy whose policy value is below the given threshold is considered unsafe.

  • relative_safety_criteria (float, default=None) – The relative policy value required to be considered a safe policy. For example, when 0.9 is given, candidate policy must exceed 90% of the behavior policy performance.

  • clip_sharpe_ratio (bool, default=False) – Whether to clip a large value of SharpeRatio with 1e2.

  • ymax_sharpe_ratio (float, default=None) – Maximum value in y-axis of the plot of SharpeRatio.

  • visualize_ci (bool, default=False) – Whether to visualize ci. (Only applicable when MultipleInputDict is given.)

  • plot_ci ({"bootstrap", "hoeffding", "bernstein", "ttest"}, default="bootstrap") – Method to estimate the confidence interval.

  • plot_alpha (float, default=0.05) – Significance level. The value should be within [0, 1).

  • plot_n_bootstrap_samples (int, default=100 (> 0)) – Number of resampling performed in the bootstrap procedure.

  • random_state (int, default=None (>= 0)) – Random state.

  • legend (bool, default=True) – Whether to include a legend in the figure.

  • fig_dir (Path, default=None) – Path to store the bar figure. If None is given, the figure will not be saved.

  • fig_name (str, default="topk_policy_value_cumulative_distribution_ope.png") – Name of the bar figure.

visualize_topk_policy_value_selected_by_lower_bound(input_dict, compared_estimators=None, behavior_policy_name=None, dataset_id=None, metrics=['k-th', 'best', 'worst', 'mean', 'std', 'safety_violation_rate', 'sharpe_ratio'], max_topk=None, safety_threshold=None, relative_safety_criteria=None, clip_sharpe_ratio=False, ymax_sharpe_ratio=None, ope_cis=['bootstrap'], ope_alpha=0.05, ope_n_bootstrap_samples=100, visualize_ci=False, plot_ci='bootstrap', plot_alpha=0.05, plot_n_bootstrap_samples=100, random_state=None, legend=True, fig_dir=None, fig_name='topk_policy_value_standard_ope_lower_bound.png')[source]#

Visualize the topk deployment result (policy value) selected by its estimated lower bound.

Parameters:
  • input_dict (OPEInputDict or MultipleInputDict) –

    Dictionary of the OPE inputs for each evaluation policy.

    key: [evaluation_policy][
        evaluation_policy_action,
        evaluation_policy_action_dist,
        state_action_value_prediction,
        initial_state_value_prediction,
        state_action_marginal_importance_weight,
        state_marginal_importance_weight,
        on_policy_policy_value,
        gamma,
        behavior_policy,
        evaluation_policy,
        dataset_id,
    ]
    

    See also

    scope_rl.ope.input.CreateOPEInput describes the components of input_dict.

  • compared_estimators (list of str, default=None) – Name of compared estimators. When None is given, all the estimators are compared.

  • behavior_policy_name (str, default=None) – Name of the behavior policy.

  • dataset_id (int, default=None) – Id of the logged dataset. If None, the average of the result will be shown.

  • metrics (list of {"k-th", "best", "worst", "mean", "std", "safety_violation_rate", "sharpe_ratio"}, default=["k-th", "best", "worst", "mean", "std", "safety_violation_rate", "sharpe_ratio"]) –

    Which statistics of the top-k deployment to report: the policy performance statistics {“best”, “worst”, “mean”, “std”}, the Sharpe ratio, and the safety violation rate. “k-th” reports the policy performance of the (estimated) k-th policy.

    We define the Sharpe ratio for OPE as \(S(\hat{V}) := (\mathrm{Best@k} - V(\pi_0)) / \mathrm{Std@k}\), where \(\pi_0\) denotes the behavior policy.

  • max_topk (int, default=None) – Maximum number of policies to be deployed. If None is given, all the policies will be deployed.

  • safety_threshold (float, default=None) – A policy whose policy value is below the given threshold is considered unsafe.

  • relative_safety_criteria (float, default=None) – The relative policy value required to be considered a safe policy. For example, when 0.9 is given, a candidate policy must exceed 90% of the behavior policy's performance.

  • clip_sharpe_ratio (bool, default=False) – Whether to clip large values of the Sharpe ratio at 1e2.

  • ymax_sharpe_ratio (float, default=None) – Maximum value of the y-axis in the Sharpe ratio plot.

  • ope_cis (list of {"bootstrap", "hoeffding", "bernstein", "ttest"}, default=["bootstrap"]) – Estimation methods for confidence intervals.

  • ope_alpha (float, default=0.05) – Significance level. The value should be within [0, 1).

  • ope_n_bootstrap_samples (int, default=100 (> 0)) – Number of resamples drawn in the bootstrap procedure.

  • visualize_ci (bool, default=False) – Whether to visualize the confidence interval. (Only applicable when MultipleInputDict is given.)

  • plot_ci ({"bootstrap", "hoeffding", "bernstein", "ttest"}, default="bootstrap") – Method to estimate the confidence interval.

  • plot_alpha (float, default=0.05) – Significance level. The value should be within [0, 1).

  • plot_n_bootstrap_samples (int, default=100 (> 0)) – Number of resamples drawn in the bootstrap procedure.

  • random_state (int, default=None (>= 0)) – Random state.

  • legend (bool, default=True) – Whether to include a legend in the figure.

  • fig_dir (Path, default=None) – Path to store the bar figure. If None is given, the figure will not be saved.

  • fig_name (str, default="topk_policy_value_standard_ope_lower_bound.png") – Name of the bar figure.
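
The metrics listed above can be illustrated with a small standard-library sketch. It is not SCOPE-RL code: `topk_metrics` is a hypothetical helper, the toy numbers are made up, and the use of the population standard deviation for Std@k is an assumption. Policies are ranked by their estimated lower bound, and the metrics are then computed on the true policy values of the top-k policies.

```python
from statistics import mean, pstdev


def topk_metrics(estimated_lower_bound, true_policy_value, k,
                 behavior_policy_value, safety_threshold):
    """Hypothetical helper: top-k deployment metrics for policies ranked by a lower bound."""
    # Rank candidate policies by the estimated lower bound (highest first).
    ranking = sorted(
        range(len(estimated_lower_bound)),
        key=lambda i: estimated_lower_bound[i],
        reverse=True,
    )
    # True policy values of the k policies that would be deployed.
    deployed = [true_policy_value[i] for i in ranking[:k]]
    std = pstdev(deployed)  # assumption: population standard deviation
    best = max(deployed)
    return {
        "k-th": deployed[-1],
        "best": best,
        "worst": min(deployed),
        "mean": mean(deployed),
        "std": std,
        # Fraction of deployed policies whose true value is below the threshold.
        "safety_violation_rate": sum(v < safety_threshold for v in deployed) / k,
        # Sharpe ratio: (Best@k - V(pi_0)) / Std@k; left undefined when Std@k is zero.
        "sharpe_ratio": (best - behavior_policy_value) / std if std > 0 else float("nan"),
    }


metrics = topk_metrics(
    estimated_lower_bound=[0.9, 0.2, 0.7, 0.4],
    true_policy_value=[1.0, 3.0, 2.0, 0.5],
    k=2,
    behavior_policy_value=1.0,
    safety_threshold=1.5,
)
```

In this toy case the truly best policy (value 3.0) has the smallest lower bound and is not deployed at k=2, which is the kind of ranking error these plots are meant to expose.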

visualize_topk_conditional_value_at_risk_selected_by_standard_ope(input_dict, compared_estimators=None, behavior_policy_name=None, dataset_id=None, ope_alpha=0.05, metrics=['k-th', 'best', 'worst', 'mean', 'std', 'safety_violation_rate', 'sharpe_ratio'], max_topk=None, safety_threshold=None, clip_sharpe_ratio=False, ymax_sharpe_ratio=None, visualize_ci=False, plot_ci='bootstrap', plot_alpha=0.05, plot_n_bootstrap_samples=100, random_state=None, legend=True, fig_dir=None, fig_name='topk_cvar_standard_ope.png')[source]#

Visualize the topk deployment result (CVaR) selected by standard OPE.

Parameters:
  • input_dict (OPEInputDict) –

    Dictionary of the OPE inputs for each evaluation policy.

    key: [evaluation_policy][
        evaluation_policy_action,
        evaluation_policy_action_dist,
        state_action_value_prediction,
        initial_state_value_prediction,
        state_action_marginal_importance_weight,
        state_marginal_importance_weight,
        on_policy_policy_value,
        gamma,
        behavior_policy,
        evaluation_policy,
        dataset_id,
    ]
    

    See also

    scope_rl.ope.input.CreateOPEInput describes the components of input_dict.

  • compared_estimators (list of str, default=None) – Name of compared estimators. When None is given, all the estimators are compared.

  • behavior_policy_name (str, default=None) – Name of the behavior policy.

  • dataset_id (int, default=None) – Id of the logged dataset. If None, the average of the result will be shown.

  • ope_alpha (float, default=0.05) – Proportion of the shaded region. The value should be within [0, 1].

  • metrics (list of {"k-th", "best", "worst", "mean", "std", "safety_violation_rate", "sharpe_ratio"}, default=["k-th", "best", "worst", "mean", "std", "safety_violation_rate", "sharpe_ratio"]) –

    Which statistics of the top-k deployment to report: the policy performance statistics {“best”, “worst”, “mean”, “std”}, the Sharpe ratio, and the safety violation rate. “k-th” reports the policy performance of the (estimated) k-th policy.

    We define the Sharpe ratio for OPE as \(S(\hat{V}) := (\mathrm{Best@k} - V(\pi_0)) / \mathrm{Std@k}\), where \(\pi_0\) denotes the behavior policy.

  • max_topk (int, default=None) – Maximum number of policies to be deployed. If None is given, all the policies will be deployed.

  • safety_threshold (float, default=None) – Minimum conditional value at risk required for a policy to be considered safe.

  • clip_sharpe_ratio (bool, default=False) – Whether to clip large values of the Sharpe ratio at 1e2.

  • ymax_sharpe_ratio (float, default=None) – Maximum value of the y-axis in the Sharpe ratio plot.

  • visualize_ci (bool, default=False) – Whether to visualize the confidence interval. (Only applicable when MultipleInputDict is given.)

  • plot_ci ({"bootstrap", "hoeffding", "bernstein", "ttest"}, default="bootstrap") – Method to estimate the confidence interval.

  • plot_alpha (float, default=0.05) – Significance level. The value should be within [0, 1).

  • plot_n_bootstrap_samples (int, default=100 (> 0)) – Number of resamples drawn in the bootstrap procedure.

  • random_state (int, default=None (>= 0)) – Random state.

  • legend (bool, default=True) – Whether to include a legend in the figure.

  • fig_dir (Path, default=None) – Path to store the bar figure. If None is given, the figure will not be saved.

  • fig_name (str, default="topk_cvar_standard_ope.png") – Name of the bar figure.
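
For intuition about the quantity plotted here, the sketch below computes an empirical conditional value at risk as the mean of the worst `alpha` fraction of trajectory returns. It is a minimal illustration, not the library's estimator: `conditional_value_at_risk` is a hypothetical helper, and both the tail-size rounding and the requirement `alpha > 0` are assumptions.

```python
import math


def conditional_value_at_risk(returns, alpha=0.05):
    """Hypothetical helper: mean of the worst alpha-fraction of trajectory returns."""
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    ordered = sorted(returns)
    # Keep at least one sample so the average is always defined.
    n_tail = max(1, math.floor(alpha * len(ordered)))
    return sum(ordered[:n_tail]) / n_tail


returns = [float(x) for x in range(1, 101)]  # toy returns 1.0, 2.0, ..., 100.0
cvar = conditional_value_at_risk(returns, alpha=0.05)
is_safe = cvar >= 2.0  # compare against an illustrative safety threshold
```

With `alpha=0.05` and 100 samples, the tail contains the five smallest returns, so the result is their average.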

visualize_topk_conditional_value_at_risk_selected_by_cumulative_distribution_ope(input_dict, compared_estimators=None, behavior_policy_name=None, dataset_id=None, ope_alpha=0.05, metrics=['k-th', 'best', 'worst', 'mean', 'std', 'safety_violation_rate', 'sharpe_ratio'], max_topk=None, safety_threshold=None, clip_sharpe_ratio=False, ymax_sharpe_ratio=None, visualize_ci=False, plot_ci='bootstrap', plot_alpha=0.05, plot_n_bootstrap_samples=100, random_state=None, legend=True, fig_dir=None, fig_name='topk_cvar_cumulative_distribution_ope.png')[source]#

Visualize the topk deployment result (CVaR) selected by cumulative distribution OPE.

Parameters:
  • input_dict (OPEInputDict) –

    Dictionary of the OPE inputs for each evaluation policy.

    key: [evaluation_policy][
        evaluation_policy_action,
        evaluation_policy_action_dist,
        state_action_value_prediction,
        initial_state_value_prediction,
        state_action_marginal_importance_weight,
        state_marginal_importance_weight,
        on_policy_policy_value,
        gamma,
        behavior_policy,
        evaluation_policy,
        dataset_id,
    ]
    

    See also

    scope_rl.ope.input.CreateOPEInput describes the components of input_dict.

  • compared_estimators (list of str, default=None) – Name of compared estimators. When None is given, all the estimators are compared.

  • behavior_policy_name (str, default=None) – Name of the behavior policy.

  • dataset_id (int, default=None) – Id of the logged dataset. If None, the average of the result will be shown.

  • ope_alpha (float, default=0.05) – Proportion of the shaded region. The value should be within [0, 1].

  • metrics (list of {"k-th", "best", "worst", "mean", "std", "safety_violation_rate", "sharpe_ratio"}, default=["k-th", "best", "worst", "mean", "std", "safety_violation_rate", "sharpe_ratio"]) –

    Which statistics of the top-k deployment to report: the policy performance statistics {“best”, “worst”, “mean”, “std”}, the Sharpe ratio, and the safety violation rate. “k-th” reports the policy performance of the (estimated) k-th policy.

    We define the Sharpe ratio for OPE as \(S(\hat{V}) := (\mathrm{Best@k} - V(\pi_0)) / \mathrm{Std@k}\), where \(\pi_0\) denotes the behavior policy.

  • max_topk (int, default=None) – Maximum number of policies to be deployed. If None is given, all the policies will be deployed.

  • safety_threshold (float, default=None) – Minimum conditional value at risk required for a policy to be considered safe.

  • clip_sharpe_ratio (bool, default=False) – Whether to clip large values of the Sharpe ratio at 1e2.

  • ymax_sharpe_ratio (float, default=None) – Maximum value of the y-axis in the Sharpe ratio plot.

  • visualize_ci (bool, default=False) – Whether to visualize the confidence interval. (Only applicable when MultipleInputDict is given.)

  • plot_ci ({"bootstrap", "hoeffding", "bernstein", "ttest"}, default="bootstrap") – Method to estimate the confidence interval.

  • plot_alpha (float, default=0.05) – Significance level. The value should be within [0, 1).

  • plot_n_bootstrap_samples (int, default=100 (> 0)) – Number of resamples drawn in the bootstrap procedure.

  • random_state (int, default=None (>= 0)) – Random state.

  • legend (bool, default=True) – Whether to include a legend in the figure.

  • fig_dir (Path, default=None) – Path to store the bar figure. If None is given, the figure will not be saved.

  • fig_name (str, default="topk_cvar_cumulative_distribution_ope.png") – Name of the bar figure.

visualize_topk_lower_quartile_selected_by_standard_ope(input_dict, compared_estimators=None, behavior_policy_name=None, dataset_id=None, ope_alpha=0.05, metrics=['k-th', 'best', 'worst', 'mean', 'std', 'safety_violation_rate', 'sharpe_ratio'], max_topk=None, safety_threshold=None, clip_sharpe_ratio=False, ymax_sharpe_ratio=None, visualize_ci=False, plot_ci='bootstrap', plot_alpha=0.05, plot_n_bootstrap_samples=100, random_state=None, legend=True, fig_dir=None, fig_name='topk_lower_quartile_standard_ope.png')[source]#

Visualize the topk deployment result (lower quartile) selected by standard OPE.

Parameters:
  • input_dict (OPEInputDict or MultipleInputDict) –

    Dictionary of the OPE inputs for each evaluation policy.

    key: [evaluation_policy][
        evaluation_policy_action,
        evaluation_policy_action_dist,
        state_action_value_prediction,
        initial_state_value_prediction,
        state_action_marginal_importance_weight,
        state_marginal_importance_weight,
        on_policy_policy_value,
        gamma,
        behavior_policy,
        evaluation_policy,
        dataset_id,
    ]
    

    See also

    scope_rl.ope.input.CreateOPEInput describes the components of input_dict.

  • compared_estimators (list of str, default=None) – Name of compared estimators. When None is given, all the estimators are compared.

  • behavior_policy_name (str, default=None) – Name of the behavior policy.

  • dataset_id (int, default=None) – Id of the logged dataset. If None, the average of the result will be shown.

  • ope_alpha (float, default=0.05) – Proportion of the shaded region. The value should be within [0, 0.5].

  • metrics (list of {"k-th", "best", "worst", "mean", "std", "safety_violation_rate", "sharpe_ratio"}, default=["k-th", "best", "worst", "mean", "std", "safety_violation_rate", "sharpe_ratio"]) –

    Which statistics of the top-k deployment to report: the policy performance statistics {“best”, “worst”, “mean”, “std”}, the Sharpe ratio, and the safety violation rate. “k-th” reports the policy performance of the (estimated) k-th policy.

    We define the Sharpe ratio for OPE as \(S(\hat{V}) := (\mathrm{Best@k} - V(\pi_0)) / \mathrm{Std@k}\), where \(\pi_0\) denotes the behavior policy.

  • max_topk (int, default=None) – Maximum number of policies to be deployed. If None is given, all the policies will be deployed.

  • safety_threshold (float, default=None) – Minimum lower quartile required for a policy to be considered safe.

  • clip_sharpe_ratio (bool, default=False) – Whether to clip large values of the Sharpe ratio at 1e2.

  • ymax_sharpe_ratio (float, default=None) – Maximum value of the y-axis in the Sharpe ratio plot.

  • visualize_ci (bool, default=False) – Whether to visualize the confidence interval. (Only applicable when MultipleInputDict is given.)

  • plot_ci ({"bootstrap", "hoeffding", "bernstein", "ttest"}, default="bootstrap") – Method to estimate the confidence interval.

  • plot_alpha (float, default=0.05) – Significance level. The value should be within [0, 1).

  • plot_n_bootstrap_samples (int, default=100 (> 0)) – Number of resamples drawn in the bootstrap procedure.

  • random_state (int, default=None (>= 0)) – Random state.

  • legend (bool, default=True) – Whether to include a legend in the figure.

  • fig_dir (Path, default=None) – Path to store the bar figure. If None is given, the figure will not be saved.

  • fig_name (str, default="topk_lower_quartile_standard_ope.png") – Name of the bar figure.
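
The lower quartile plotted here is a lower quantile of the trajectory-wise return at level `ope_alpha`. The sketch below shows one common empirical definition; `lower_quantile` is a hypothetical helper, the toy returns are made up, and the choice of the smallest sample whose empirical CDF reaches `alpha` is an assumption rather than the library's exact rule.

```python
import math


def lower_quantile(returns, alpha=0.05):
    """Hypothetical helper: empirical alpha-quantile of trajectory returns."""
    if not 0 <= alpha <= 0.5:
        raise ValueError("alpha must be in [0, 0.5]")
    ordered = sorted(returns)
    # Smallest sample whose empirical CDF value is at least alpha.
    index = max(0, math.ceil(alpha * len(ordered)) - 1)
    return ordered[index]


returns = [4.0, 1.0, 3.0, 2.0, 8.0, 7.0, 6.0, 5.0]
quartile = lower_quantile(returns, alpha=0.25)
is_safe = quartile >= 1.5  # compare against an illustrative safety threshold
```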

visualize_topk_lower_quartile_selected_by_cumulative_distribution_ope(input_dict, compared_estimators=None, behavior_policy_name=None, dataset_id=None, ope_alpha=0.05, metrics=['k-th', 'best', 'worst', 'mean', 'std', 'safety_violation_rate', 'sharpe_ratio'], max_topk=None, safety_threshold=None, clip_sharpe_ratio=False, ymax_sharpe_ratio=None, visualize_ci=False, plot_ci='bootstrap', plot_alpha=0.05, plot_n_bootstrap_samples=100, random_state=None, legend=True, fig_dir=None, fig_name='topk_lower_quartile_cumulative_distribution_ope.png')[source]#

Visualize the topk deployment result (lower quartile) selected by cumulative distribution OPE.

Parameters:
  • input_dict (OPEInputDict) –

    Dictionary of the OPE inputs for each evaluation policy.

    key: [evaluation_policy][
        evaluation_policy_action,
        evaluation_policy_action_dist,
        state_action_value_prediction,
        initial_state_value_prediction,
        state_action_marginal_importance_weight,
        state_marginal_importance_weight,
        on_policy_policy_value,
        gamma,
        behavior_policy,
        evaluation_policy,
        dataset_id,
    ]
    

    See also

    scope_rl.ope.input.CreateOPEInput describes the components of input_dict.

  • compared_estimators (list of str, default=None) – Name of compared estimators. When None is given, all the estimators are compared.

  • behavior_policy_name (str, default=None) – Name of the behavior policy.

  • dataset_id (int, default=None) – Id of the logged dataset. If None, the average of the result will be shown.

  • ope_alpha (float, default=0.05) – Proportion of the shaded region. The value should be within [0, 0.5].

  • metrics (list of {"k-th", "best", "worst", "mean", "std", "safety_violation_rate", "sharpe_ratio"}, default=["k-th", "best", "worst", "mean", "std", "safety_violation_rate", "sharpe_ratio"]) –

    Which statistics of the top-k deployment to report: the policy performance statistics {“best”, “worst”, “mean”, “std”}, the Sharpe ratio, and the safety violation rate. “k-th” reports the policy performance of the (estimated) k-th policy.

    We define the Sharpe ratio for OPE as \(S(\hat{V}) := (\mathrm{Best@k} - V(\pi_0)) / \mathrm{Std@k}\), where \(\pi_0\) denotes the behavior policy.

  • max_topk (int, default=None) – Maximum number of policies to be deployed. If None is given, all the policies will be deployed.

  • safety_threshold (float, default=None) – Minimum lower quartile required for a policy to be considered safe.

  • clip_sharpe_ratio (bool, default=False) – Whether to clip large values of the Sharpe ratio at 1e2.

  • ymax_sharpe_ratio (float, default=None) – Maximum value of the y-axis in the Sharpe ratio plot.

  • visualize_ci (bool, default=False) – Whether to visualize the confidence interval. (Only applicable when MultipleInputDict is given.)

  • plot_ci ({"bootstrap", "hoeffding", "bernstein", "ttest"}, default="bootstrap") – Method to estimate the confidence interval.

  • plot_alpha (float, default=0.05) – Significance level. The value should be within [0, 1).

  • plot_n_bootstrap_samples (int, default=100 (> 0)) – Number of resamples drawn in the bootstrap procedure.

  • random_state (int, default=None (>= 0)) – Random state.

  • legend (bool, default=True) – Whether to include a legend in the figure.

  • fig_dir (Path, default=None) – Path to store the bar figure. If None is given, the figure will not be saved.

  • fig_name (str, default="topk_lower_quartile_cumulative_distribution_ope.png") – Name of the bar figure.

visualize_policy_value_for_validation(input_dict, compared_estimators=None, behavior_policy_name=None, dataset_id=None, n_cols=None, share_axes=False, legend=True, fig_dir=None, fig_name='validation_policy_value_standard_ope.png')[source]#

Visualize the true policy value and its estimate (scatter plot).

Parameters:
  • input_dict (OPEInputDict or MultipleInputDict) –

    Dictionary of the OPE inputs for each evaluation policy.

    key: [evaluation_policy][
        evaluation_policy_action,
        evaluation_policy_action_dist,
        state_action_value_prediction,
        initial_state_value_prediction,
        state_action_marginal_importance_weight,
        state_marginal_importance_weight,
        on_policy_policy_value,
        gamma,
        behavior_policy,
        evaluation_policy,
        dataset_id,
    ]
    

    See also

    scope_rl.ope.input.CreateOPEInput describes the components of input_dict.

  • compared_estimators (list of str, default=None) – Name of compared estimators. When None is given, all the estimators are compared.

  • behavior_policy_name (str, default=None) – Name of the behavior policy.

  • dataset_id (int, default=None) – Id of the logged dataset.

  • n_cols (int, default=None (> 0)) – Number of columns in the figure.

  • share_axes (bool, default=False) – Whether to share x- and y-axes or not.

  • legend (bool, default=True) – Whether to include a legend in the scatter plot.

  • fig_dir (Path, default=None) – Path to store the figure. If None is given, the figure will not be saved.

  • fig_name (str, default="validation_policy_value_standard_ope.png") – Name of the figure.
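
A hedged usage sketch of the call documented above. It assumes `ops` is an already constructed `OffPolicySelection` instance and `input_dict` was produced by `CreateOPEInput`; the wrapper name `save_validation_plot`, the `"figures"` directory, and the argument values are illustrative choices, not library defaults. Only keyword arguments that appear in the signature above are used.

```python
from pathlib import Path


def save_validation_plot(ops, input_dict, fig_dir="figures"):
    """Hypothetical wrapper around the documented validation scatter plot call."""
    fig_dir = Path(fig_dir)
    fig_name = "validation_policy_value_standard_ope.png"
    ops.visualize_policy_value_for_validation(
        input_dict,
        compared_estimators=None,  # None compares all estimators
        n_cols=2,                  # illustrative layout choice
        share_axes=True,           # same axis ranges across panels
        legend=True,
        fig_dir=fig_dir,
        fig_name=fig_name,
    )
    # Location where the figure is expected to be written.
    return fig_dir / fig_name
```

Sharing axes makes it easier to compare how tightly each estimator's points follow the diagonal, at the cost of compressing estimators with a narrow value range.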

visualize_policy_value_of_cumulative_distribution_ope_for_validation(input_dict, compared_estimators=None, behavior_policy_name=None, dataset_id=None, n_cols=None, share_axes=False, legend=True, fig_dir=None, fig_name='validation_policy_value_cumulative_distribution_ope.png')[source]#

Visualize the true policy value and its estimate obtained by cumulative distribution OPE (scatter plot).

Parameters:
  • input_dict (OPEInputDict or MultipleInputDict) –

    Dictionary of the OPE inputs for each evaluation policy.

    key: [evaluation_policy][
        evaluation_policy_action,
        evaluation_policy_action_dist,
        state_action_value_prediction,
        initial_state_value_prediction,
        state_action_marginal_importance_weight,
        state_marginal_importance_weight,
        on_policy_policy_value,
        gamma,
        behavior_policy,
        evaluation_policy,
        dataset_id,
    ]
    

    See also

    scope_rl.ope.input.CreateOPEInput describes the components of input_dict.

  • compared_estimators (list of str, default=None) – Name of compared estimators. When None is given, all the estimators are compared.

  • behavior_policy_name (str, default=None) – Name of the behavior policy.

  • dataset_id (int, default=None) – Id of the logged dataset.

  • n_cols (int, default=None (> 0)) – Number of columns in the figure.

  • share_axes (bool, default=False) – Whether to share x- and y-axes or not.

  • legend (bool, default=True) – Whether to include a legend in the scatter plot.

  • fig_dir (Path, default=None) – Path to store the figure. If None is given, the figure will not be saved.

  • fig_name (str, default="validation_policy_value_cumulative_distribution_ope.png") – Name of the figure.

visualize_policy_value_lower_bound_for_validation(input_dict, compared_estimators=None, behavior_policy_name=None, dataset_id=None, cis=['bootstrap'], alpha=0.05, n_bootstrap_samples=100, random_state=None, n_cols=None, share_axes=False, legend=True, fig_dir=None, fig_name='validation_policy_value_lower_bound.png')[source]#

Visualize the true policy value and its estimated lower bound (scatter plot).

Parameters:
  • input_dict (OPEInputDict or MultipleInputDict) –

    Dictionary of the OPE inputs for each evaluation policy.

    key: [evaluation_policy][
        evaluation_policy_action,
        evaluation_policy_action_dist,
        state_action_value_prediction,
        initial_state_value_prediction,
        state_action_marginal_importance_weight,
        state_marginal_importance_weight,
        on_policy_policy_value,
        gamma,
        behavior_policy,
        evaluation_policy,
        dataset_id,
    ]
    

    See also

    scope_rl.ope.input.CreateOPEInput describes the components of input_dict.

  • compared_estimators (list of str, default=None) – Name of compared estimators. When None is given, all the estimators are compared.

  • behavior_policy_name (str, default=None) – Name of the behavior policy.

  • dataset_id (int, default=None) – Id of the logged dataset.

  • cis (list of {"bootstrap", "hoeffding", "bernstein", "ttest"}, default=["bootstrap"]) – Estimation methods for confidence intervals.

  • alpha (float, default=0.05) – Significance level. The value should be within [0, 1).

  • n_bootstrap_samples (int, default=100 (> 0)) – Number of resamples drawn in the bootstrap procedure.

  • random_state (int, default=None (>= 0)) – Random state.

  • n_cols (int, default=None (> 0)) – Number of columns in the figure.

  • share_axes (bool, default=False) – Whether to share x- and y-axes or not.

  • legend (bool, default=True) – Whether to include a legend in the scatter plot.

  • fig_dir (Path, default=None) – Path to store the figure. If None is given, the figure will not be saved.

  • fig_name (str, default="validation_policy_value_lower_bound.png") – Name of the figure.
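
To make the roles of `alpha`, `n_bootstrap_samples`, and `random_state` concrete, here is a minimal percentile-bootstrap sketch of a lower bound on a mean. It is an illustration under stated assumptions, not the library's implementation: `bootstrap_lower_bound` is a hypothetical helper, the per-trajectory estimates are made up, and taking the `alpha / 2` quantile (the lower end of a two-sided interval) is an assumption.

```python
import random
from statistics import mean


def bootstrap_lower_bound(samples, alpha=0.05, n_bootstrap_samples=100, random_state=None):
    """Hypothetical helper: lower end of a two-sided percentile bootstrap interval for the mean."""
    rng = random.Random(random_state)
    n = len(samples)
    # Resample with replacement and record the mean of each resample.
    boot_means = sorted(
        mean(rng.choices(samples, k=n)) for _ in range(n_bootstrap_samples)
    )
    # Index of the (alpha / 2) empirical quantile of the resampled means.
    index = int((alpha / 2) * (n_bootstrap_samples - 1))
    return boot_means[index]


per_trajectory_estimates = [0.8, 1.1, 0.9, 1.3, 1.0, 0.7, 1.2, 1.0]
lower = bootstrap_lower_bound(per_trajectory_estimates, random_state=12345)
```

Fixing `random_state` makes the resampling, and therefore the plotted lower bound, reproducible across runs.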

visualize_variance_for_validation(input_dict, compared_estimators=None, behavior_policy_name=None, dataset_id=None, n_cols=None, share_axes=False, legend=True, fig_dir=None, fig_name='validation_variance.png')[source]#

Visualize the true variance and its estimate (scatter plot).

Parameters:
  • input_dict (OPEInputDict or MultipleInputDict) –

    Dictionary of the OPE inputs for each evaluation policy.

    key: [evaluation_policy][
        evaluation_policy_action,
        evaluation_policy_action_dist,
        state_action_value_prediction,
        initial_state_value_prediction,
        state_action_marginal_importance_weight,
        state_marginal_importance_weight,
        on_policy_policy_value,
        gamma,
        behavior_policy,
        evaluation_policy,
        dataset_id,
    ]
    

    See also

    scope_rl.ope.input.CreateOPEInput describes the components of input_dict.

  • compared_estimators (list of str, default=None) – Name of compared estimators. When None is given, all the estimators are compared.

  • behavior_policy_name (str, default=None) – Name of the behavior policy.

  • dataset_id (int, default=None) – Id of the logged dataset.

  • n_cols (int, default=None (> 0)) – Number of columns in the figure.

  • share_axes (bool, default=False) – Whether to share x- and y-axes or not.

  • legend (bool, default=True) – Whether to include a legend in the scatter plot.

  • fig_dir (Path, default=None) – Path to store the figure. If None is given, the figure will not be saved.

  • fig_name (str, default="validation_variance.png") – Name of the figure.
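
Since the variance is derived from an estimated cumulative distribution function, the sketch below shows how a mean and a variance follow from a discrete CDF. It is only an illustration: `mean_and_variance_from_cdf` is a hypothetical helper, and the return values and CDF are toy numbers rather than the output of any estimator.

```python
def mean_and_variance_from_cdf(return_values, cdf):
    """Hypothetical helper: moments of a discrete return distribution given its CDF."""
    # Recover the probability mass at each return value by differencing the CDF.
    mass = [cdf[0]] + [cdf[i] - cdf[i - 1] for i in range(1, len(cdf))]
    mean = sum(p * x for p, x in zip(mass, return_values))
    variance = sum(p * (x - mean) ** 2 for p, x in zip(mass, return_values))
    return mean, variance


estimated_mean, estimated_variance = mean_and_variance_from_cdf(
    return_values=[0.0, 1.0, 2.0],
    cdf=[0.25, 0.75, 1.0],
)
```

Here the probability masses are 0.25, 0.5, and 0.25, giving a mean of 1.0 and a variance of 0.5.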

visualize_lower_quartile_for_validation(input_dict, compared_estimators=None, behavior_policy_name=None, dataset_id=None, alpha=0.05, n_cols=None, share_axes=False, legend=True, fig_dir=None, fig_name='validation_lower_quartile.png')[source]#

Visualize the true lower quartile and its estimate (scatter plot).

Parameters:
  • input_dict (OPEInputDict or MultipleInputDict) –

    Dictionary of the OPE inputs for each evaluation policy.

    key: [evaluation_policy][
        evaluation_policy_action,
        evaluation_policy_action_dist,
        state_action_value_prediction,
        initial_state_value_prediction,
        state_action_marginal_importance_weight,
        state_marginal_importance_weight,
        on_policy_policy_value,
        gamma,
        behavior_policy,
        evaluation_policy,
        dataset_id,
    ]
    

    See also

    scope_rl.ope.input.CreateOPEInput describes the components of input_dict.

  • compared_estimators (list of str, default=None) – Name of compared estimators. When None is given, all the estimators are compared.

  • behavior_policy_name (str, default=None) – Name of the behavior policy.

  • dataset_id (int, default=None) – Id of the logged dataset.

  • alpha (float, default=0.05) – Proportion of the shaded region. The value should be within [0, 0.5].

  • n_cols (int, default=None (> 0)) – Number of columns in the figure.

  • share_axes (bool, default=False) – Whether to share x- and y-axes or not.

  • legend (bool, default=True) – Whether to include a legend in the scatter plot.

  • fig_dir (Path, default=None) – Path to store the figure. If None is given, the figure will not be saved.

  • fig_name (str, default="validation_lower_quartile.png") – Name of the figure.

visualize_conditional_value_at_risk_for_validation(input_dict, compared_estimators=None, behavior_policy_name=None, dataset_id=None, alpha=0.05, n_cols=None, share_axes=False, legend=True, fig_dir=None, fig_name='validation_conditional_value_at_risk.png')[source]#

Visualize the true conditional value at risk and its estimate (scatter plot).

Parameters:
  • input_dict (OPEInputDict or MultipleInputDict) –

    Dictionary of the OPE inputs for each evaluation policy.

    key: [evaluation_policy][
        evaluation_policy_action,
        evaluation_policy_action_dist,
        state_action_value_prediction,
        initial_state_value_prediction,
        state_action_marginal_importance_weight,
        state_marginal_importance_weight,
        on_policy_policy_value,
        gamma,
        behavior_policy,
        evaluation_policy,
        dataset_id,
    ]
    

    See also

    scope_rl.ope.input.CreateOPEInput describes the components of input_dict.

  • compared_estimators (list of str, default=None) – Name of compared estimators. When None is given, all the estimators are compared.

  • behavior_policy_name (str, default=None) – Name of the behavior policy.

  • dataset_id (int, default=None) – Id of the logged dataset.

  • alpha (float, default=0.05) – Proportion of the shaded region. The value should be within [0, 1].

  • n_cols (int, default=None (> 0)) – Number of columns in the figure.

  • share_axes (bool, default=False) – Whether to share x- and y-axes or not.

  • legend (bool, default=True) – Whether to include a legend in the scatter plot.

  • fig_dir (Path, default=None) – Path to store the figure. If None is given, the figure will not be saved.

  • fig_name (str, default="validation_conditional_value_at_risk.png") – Name of the figure.

Methods