scope_rl.ope.ope.CumulativeDistributionOPE
- class scope_rl.ope.ope.CumulativeDistributionOPE(logged_dataset, ope_estimators, use_custom_reward_scale=False, scale_min=None, scale_max=None, n_partition=None, bandwidth=1.0, action_scaler=None, disable_reward_after_done=True)
Class to conduct cumulative distribution OPE by multiple estimators simultaneously (applicable to both discrete/continuous action cases).
Imported as:
scope_rl.ope.CumulativeDistributionOPE
Note
Cumulative distribution OPE first estimates the following cumulative distribution function, and then estimates some statistics.
\[F(m, \pi) := \mathbb{E} \left[ \mathbb{I} \left \{ \sum_{t=0}^{T-1} \gamma^t r_t \leq m \right \} \mid \pi \right]\]where \(\pi\) is the evaluation policy, \(r_t\) is the reward observed at each timestep \(t\), \(T\) is the total number of timesteps in an episode, and \(\gamma\) is the discount factor.
CDF is itself informative, but it also enables us to calculate the following risk functions.
Mean: \(\mu(F) := \int_{G} G \, \mathrm{d}F(G)\)
Variance: \(\sigma^2(F) := \int_{G} (G - \mu(F))^2 \, \mathrm{d}F(G)\)
\(\alpha\)-quantile: \(Q^{\alpha}(F) := \min \{ G \mid F(G) \geq \alpha \}\)
Conditional Value at Risk (CVaR): \(\int_{G} G \, \mathbb{I}\{ G \leq Q^{\alpha}(F) \} \, \mathrm{d}F(G)\)
where we use \(G := \sum_{t=0}^{T-1} \gamma^t r_t\) and \(\mathrm{d}F(G) := \lim_{\Delta \rightarrow 0} F(G) - F(G - \Delta)\).
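As a rough illustration of how these statistics follow from a CDF, the sketch below evaluates them on a discretized CDF with NumPy. It is not the SCOPE-RL implementation: the function name `risk_functions_from_cdf` and its arguments are hypothetical, the quantile is taken as the smallest \(G\) whose CDF value reaches \(\alpha\), and the CVaR term is the unnormalized tail integral exactly as written in the expression above.

```python
import numpy as np


def risk_functions_from_cdf(reward_scale, cdf, alpha=0.05):
    """Compute mean, variance, alpha-quantile, and CVaR from a discretized CDF.

    Hypothetical helper for illustration; `reward_scale` is the x-axis of the
    CDF and `cdf` holds the (non-decreasing) CDF values on that grid.
    """
    reward_scale = np.asarray(reward_scale, dtype=float)
    cdf = np.asarray(cdf, dtype=float)

    # dF(G): probability mass at each grid point (first difference of the CDF)
    pdf = np.diff(cdf, prepend=0.0)

    mean = np.sum(reward_scale * pdf)
    variance = np.sum((reward_scale - mean) ** 2 * pdf)

    # smallest grid point whose CDF value reaches alpha
    idx = int(np.searchsorted(cdf, alpha, side="left"))
    idx = min(idx, len(reward_scale) - 1)
    quantile = reward_scale[idx]

    # unnormalized tail integral of G over {G <= quantile}
    tail = reward_scale <= quantile
    cvar = np.sum(reward_scale[tail] * pdf[tail])

    return {
        "mean": float(mean),
        "variance": float(variance),
        "quantile": float(quantile),
        "cvar": float(cvar),
    }


# toy CDF placing mass 0.25 on each of the returns 0, 1, 2, 3
stats = risk_functions_from_cdf([0, 1, 2, 3], [0.25, 0.5, 0.75, 1.0], alpha=0.5)
```

Working on the CDF grid means every statistic reuses the same estimated object, which is why a single CDF estimate is enough to report all four quantities.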
- Parameters:
logged_dataset (LoggedDataset or MultipleLoggedDataset) –
Logged dataset used to conduct OPE.
key: [ size, n_trajectories, step_per_trajectory, action_type, n_actions, action_dim, action_keys, action_meaning, state_dim, state_keys, state, action, reward, done, terminal, info, pscore, behavior_policy, dataset_id, ]
See also
scope_rl.dataset.SyntheticDataset describes the components of logged_dataset.
ope_estimators (list of BaseOffPolicyEstimator) – List of OPE estimators used to evaluate the policy value of the evaluation policies. Estimators must follow the interface of scope_rl.ope.BaseCumulativeDistributionOPEEstimator.
use_custom_reward_scale (bool, default=False) –
Whether to use a customized reward scale or the reward observed under the behavior policy.
If True, the reward scale is uniform, following Huang et al. (2021).
If False, the reward scale follows the one defined in Chandak et al. (2021).
scale_min (float, default=None) – Minimum value of the reward scale in the CDF. If use_custom_reward_scale is True, a value must be given.
scale_max (float, default=None) – Maximum value of the reward scale in the CDF. If use_custom_reward_scale is True, a value must be given.
n_partition (int, default=None) – Number of partitions in the reward scale (x-axis of the CDF). If use_custom_reward_scale is True, a value must be given.
bandwidth (float, default=1.0 (> 0)) – Bandwidth hyperparameter of the kernel used in continuous action case.
action_scaler (d3rlpy.preprocessing.ActionScaler, default=None) – Scaling factor of action.
disable_reward_after_done (bool, default=True) – Whether to apply \(r = 0\) once done is observed in an episode.
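To make the role of disable_reward_after_done concrete, the following sketch computes the trajectory-wise discounted reward \(\sum_{t} \gamma^t r_t\) from per-step arrays. It is only an illustration under stated assumptions: the helper name `trajectory_wise_reward` is hypothetical, arrays are assumed to have shape (n_trajectories, step_per_trajectory), and the reward at the step where done is first observed is assumed to be kept.

```python
import numpy as np


def trajectory_wise_reward(reward, done, gamma=1.0, disable_reward_after_done=True):
    """Discounted return per trajectory (hypothetical helper, not library code)."""
    reward = np.asarray(reward, dtype=float)
    done = np.asarray(done, dtype=float)

    if disable_reward_after_done:
        # number of done flags observed strictly before each step
        done_before = np.cumsum(done, axis=1) - done
        # assumption: rewards after the first done are set to zero,
        # while the reward at the done step itself is kept
        reward = np.where(done_before > 0, 0.0, reward)

    discount = gamma ** np.arange(reward.shape[1])
    return (reward * discount).sum(axis=1)


reward = [[1.0, 1.0, 1.0, 1.0]]
done = [[0.0, 1.0, 0.0, 0.0]]
masked = trajectory_wise_reward(reward, done, gamma=0.5)
unmasked = trajectory_wise_reward(reward, done, gamma=0.5, disable_reward_after_done=False)
```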
Examples
Preparation:
# import necessary module from SCOPE-RL
from scope_rl.dataset import SyntheticDataset
from scope_rl.policy import EpsilonGreedyHead
from scope_rl.ope import CreateOPEInput
from scope_rl.ope import CumulativeDistributionOPE
from scope_rl.ope.discrete import CumulativeDistributionTIS as CD_IS
from scope_rl.ope.discrete import CumulativeDistributionSNTIS as CD_SNIS

# import necessary module from other libraries
import gym
import rtbgym
from d3rlpy.algos import DoubleDQNConfig
from d3rlpy.dataset import create_fifo_replay_buffer
from d3rlpy.algos import ConstantEpsilonGreedy

# initialize environment
env = gym.make("RTBEnv-discrete-v0")

# define (RL) agent (i.e., policy) and train on the environment
ddqn = DoubleDQNConfig().create()
buffer = create_fifo_replay_buffer(
    limit=10000,
    env=env,
)
explorer = ConstantEpsilonGreedy(
    epsilon=0.3,
)
ddqn.fit_online(
    env=env,
    buffer=buffer,
    explorer=explorer,
    n_steps=10000,
    n_steps_per_epoch=1000,
)

# convert ddqn policy to stochastic data collection policy
behavior_policy = EpsilonGreedyHead(
    ddqn,
    n_actions=env.action_space.n,
    epsilon=0.3,
    name="ddqn_epsilon_0.3",
    random_state=12345,
)

# initialize dataset class
dataset = SyntheticDataset(
    env=env,
    max_episode_steps=env.step_per_episode,
)

# data collection
logged_dataset = dataset.obtain_episodes(
    behavior_policies=behavior_policy,
    n_trajectories=100,
    random_state=12345,
)
Create Input for OPE:
# evaluation policy
ddqn_ = EpsilonGreedyHead(
    base_policy=ddqn,
    n_actions=env.action_space.n,
    name="ddqn",
    epsilon=0.0,
    random_state=12345
)
random_ = EpsilonGreedyHead(
    base_policy=ddqn,
    n_actions=env.action_space.n,
    name="random",
    epsilon=1.0,
    random_state=12345
)

# create input for off-policy evaluation (OPE)
prep = CreateOPEInput(
    env=env,
)
input_dict = prep.obtain_whole_inputs(
    logged_dataset=logged_dataset,
    evaluation_policies=[ddqn_, random_],
    n_trajectories_on_policy_evaluation=100,
    random_state=12345,
)
Cumulative Distribution OPE:
# OPE
cd_ope = CumulativeDistributionOPE(
    logged_dataset=logged_dataset,
    ope_estimators=[
        CD_IS(estimator_name="cd_is"),
        CD_SNIS(estimator_name="cd_snis"),
    ],
)
variance_dict = cd_ope.estimate_variance(
    input_dict=input_dict,
)
Output:
>>> variance_dict
{'ddqn': {'on_policy': 18.6216,
          'cd_is': 19.201934808340265,
          'cd_snis': 25.315555555555555},
 'random': {'on_policy': 21.512806887023064,
            'cd_is': 13.591854902638273,
            'cd_snis': 7.158545530356914}}
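Because each per-policy dictionary carries an on_policy entry next to the estimator outputs, a quick way to read such a result is to compute each estimator's relative error against it. The sketch below is plain Python; the helper `relative_errors` is hypothetical, and the toy dictionary uses made-up policy names, estimator names, and values purely for illustration.

```python
def relative_errors(estimate_dict, reference_key="on_policy"):
    """Relative absolute error of each estimator against the reference entry.

    Hypothetical helper; expects the [evaluation_policy][estimator_name] layout.
    """
    errors = {}
    for policy_name, estimates in estimate_dict.items():
        reference = estimates[reference_key]
        errors[policy_name] = {
            name: abs(value - reference) / abs(reference)
            for name, value in estimates.items()
            if name != reference_key
        }
    return errors


# toy input with illustrative (not measured) values
toy = {"policy_a": {"on_policy": 20.0, "estimator_a": 19.0, "estimator_b": 25.0}}
errors = relative_errors(toy)
```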
References
Audrey Huang, Liu Leqi, Zachary C. Lipton, and Kamyar Azizzadenesheli. “Off-Policy Risk Assessment for Markov Decision Processes.” 2022.
Audrey Huang, Liu Leqi, Zachary C. Lipton, and Kamyar Azizzadenesheli. “Off-Policy Risk Assessment in Contextual Bandits.” 2021.
Yash Chandak, Scott Niekum, Bruno Castro da Silva, Erik Learned-Miller, Emma Brunskill, and Philip S. Thomas. “Universal Off-Policy Evaluation.” 2021.
- Attributes:
- action_scaler
- estimators_name
- n_partition
- scale_max
- scale_min
Methods
estimate_conditional_value_at_risk(input_dict) – Estimate the conditional value at risk of the trajectory-wise reward under the given evaluation policies.
estimate_cumulative_distribution_function(...) – Estimate the cumulative distribution of the trajectory-wise reward under the given evaluation policies.
estimate_interquartile_range(input_dict[, ...]) – Estimate the interquartile range of the trajectory-wise reward under the given evaluation policies.
estimate_mean(input_dict[, ...]) – Estimate the expected trajectory-wise reward (i.e., policy value) of the evaluation policies.
estimate_variance(input_dict[, ...]) – Estimate the variance of the trajectory-wise reward under the given evaluation policies.
obtain_reward_scale() – Obtain the reward scale (x-axis) for the cumulative distribution function.
visualize_conditional_value_at_risk(input_dict) – Visualize the conditional value at risk estimated by OPE estimators.
visualize_conditional_value_at_risk_with_multiple_estimates(...) – Visualize the conditional value at risk of the trajectory-wise reward under the evaluation policy estimated by OPE estimators across multiple logged datasets.
visualize_cumulative_distribution_function(...) – Visualize the cumulative distribution function estimated by OPE estimators.
visualize_cumulative_distribution_function_with_multiple_estimates(...) – Visualize the cumulative distribution function estimated by OPE estimators across multiple logged datasets.
visualize_interquartile_range(input_dict[, ...]) – Visualize the interquartile range estimated by OPE estimators.
visualize_lower_quartile_with_multiple_estimates(...) – Visualize the lower quartile of the trajectory-wise reward under the evaluation policy estimated by OPE estimators across multiple logged datasets.
visualize_policy_value(input_dict[, ...]) – Visualize the policy value estimated by OPE estimators.
visualize_policy_value_with_multiple_estimates(...) – Visualize the policy value estimated by OPE estimators across multiple logged datasets.
visualize_variance_with_multiple_estimates(...) – Visualize the variance of the trajectory-wise reward under the evaluation policy estimated by OPE estimators across multiple logged datasets.
- obtain_reward_scale()
Obtain the reward scale (x-axis) for the cumulative distribution function.
- Returns:
reward_scale – Reward Scale (x-axis of the cumulative distribution function).
- Return type:
ndarray of shape (n_unique_reward, ) or (n_partition, )
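The two return shapes correspond to the two reward-scale modes described for use_custom_reward_scale. A minimal sketch of that distinction, assuming a uniform grid for the custom scale and the sorted unique trajectory-wise rewards for the data-driven scale, is shown below. The function `reward_scale_sketch` is hypothetical and is not the library's code.

```python
import numpy as np


def reward_scale_sketch(
    trajectory_reward=None,
    use_custom_reward_scale=False,
    scale_min=None,
    scale_max=None,
    n_partition=None,
):
    """Return the x-axis of the CDF under either mode (illustrative only)."""
    if use_custom_reward_scale:
        if scale_min is None or scale_max is None or n_partition is None:
            raise ValueError(
                "scale_min, scale_max, and n_partition must be given "
                "when use_custom_reward_scale is True"
            )
        # uniform grid of shape (n_partition, )
        return np.linspace(scale_min, scale_max, n_partition)

    # data-driven grid of shape (n_unique_reward, ), sorted in ascending order
    return np.unique(np.asarray(trajectory_reward, dtype=float))


custom_scale = reward_scale_sketch(
    use_custom_reward_scale=True, scale_min=0.0, scale_max=1.0, n_partition=5
)
data_scale = reward_scale_sketch(trajectory_reward=[3.0, 1.0, 3.0, 2.0])
```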
- estimate_cumulative_distribution_function(input_dict, compared_estimators=None, behavior_policy_name=None, dataset_id=None, reward_scale=None)
Estimate the cumulative distribution of the trajectory-wise reward under the given evaluation policies.
- Parameters:
input_dict (OPEInputDict or MultipleInputDict) –
Dictionary of the OPE inputs for each evaluation policy.
key: [evaluation_policy][ evaluation_policy_action, evaluation_policy_action_dist, state_action_value_prediction, initial_state_value_prediction, state_action_marginal_importance_weight, state_marginal_importance_weight, on_policy_policy_value, gamma, behavior_policy, evaluation_policy, dataset_id, ]
See also
scope_rl.ope.input.CreateOPEInput describes the components of input_dict.
compared_estimators (list of str, default=None) – Name of compared estimators. If None is given, all the estimators are compared.
behavior_policy_name (str, default=None) – Name of the behavior policy.
dataset_id (int, default=None) – Id of the logged dataset.
reward_scale (array-like of shape (n_partition, ), default=None) – Scale of the trajectory-wise reward used for x-axis of the CDF plot.
- Returns:
cumulative_distribution_dict – Dictionary containing the cumulative distribution of each evaluation policy estimated by OPE estimators.
key: [evaluation_policy][OPE_estimator_name]
When behavior_policy_name is None and dataset_id is None, key: [behavior_policy_name][dataset_id][evaluation_policy][OPE_estimator_name]
When behavior_policy_name is None and dataset_id is specified, key: [behavior_policy_name][evaluation_policy][OPE_estimator_name]
When behavior_policy_name is specified and dataset_id is None, key: [dataset_id][OPE_estimator_name]
- Return type:
dict
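The CDF values themselves come from the estimators passed as ope_estimators. As a rough sketch of the underlying idea only, a trajectory-wise importance sampling estimate weights the indicator \(\mathbb{I}\{G_i \leq m\}\) of each logged trajectory by its importance weight. Everything below is an assumption made for illustration: the function name, the argument names, the final clipping to [0, 1], and the self-normalized variant are not taken from the SCOPE-RL source.

```python
import numpy as np


def importance_sampling_cdf(
    trajectory_reward, trajectory_weight, reward_scale, self_normalized=False
):
    """Importance-weighted empirical CDF on a reward grid (illustrative only)."""
    G = np.asarray(trajectory_reward, dtype=float)   # shape (n_trajectories, )
    w = np.asarray(trajectory_weight, dtype=float)   # shape (n_trajectories, )
    m = np.asarray(reward_scale, dtype=float)        # shape (n_partition, )

    # indicator[i, j] is True when trajectory i has return <= grid point j
    indicator = G[:, None] <= m[None, :]
    weighted = w[:, None] * indicator

    if self_normalized:
        cdf = weighted.sum(axis=0) / w.sum()
    else:
        cdf = weighted.mean(axis=0)

    # assumption: clip so the estimate stays a valid probability
    return np.clip(cdf, 0.0, 1.0)


grid = [0.0, 1.0, 2.0, 3.0]
plain = importance_sampling_cdf([1.0, 2.0, 3.0], [1.0, 1.0, 1.0], grid)
normalized = importance_sampling_cdf(
    [1.0, 2.0, 3.0], [2.0, 1.0, 1.0], grid, self_normalized=True
)
```

With unit weights the estimate reduces to the ordinary empirical CDF, which is a convenient sanity check for any weighted variant.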
- estimate_mean(input_dict, compared_estimators=None, behavior_policy_name=None, dataset_id=None)
Estimate the expected trajectory-wise reward (i.e., policy value) of the evaluation policies.
- Parameters:
input_dict (OPEInputDict or MultipleInputDict) –
Dictionary of the OPE inputs for each evaluation policy.
key: [evaluation_policy][ evaluation_policy_action, evaluation_policy_action_dist, state_action_value_prediction, initial_state_value_prediction, state_action_marginal_importance_weight, state_marginal_importance_weight, on_policy_policy_value, gamma, behavior_policy, evaluation_policy, dataset_id, ]
See also
scope_rl.ope.input.CreateOPEInput describes the components of input_dict.
compared_estimators (list of str, default=None) – Name of compared estimators. If None is given, all the estimators are compared.
behavior_policy_name (str, default=None) – Name of the behavior policy.
dataset_id (int, default=None) – Id of the logged dataset.
- Returns:
mean_dict – Dictionary containing the mean trajectory-wise reward of each evaluation policy estimated by OPE estimators.
key: [evaluation_policy][OPE_estimator_name]
When behavior_policy_name is None and dataset_id is None, key: [behavior_policy_name][dataset_id][evaluation_policy][OPE_estimator_name]
When behavior_policy_name is None and dataset_id is specified, key: [behavior_policy_name][evaluation_policy][OPE_estimator_name]
When behavior_policy_name is specified and dataset_id is None, key: [dataset_id][OPE_estimator_name]
- Return type:
dict
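Since the nesting depth of the returned dictionary depends on which of behavior_policy_name and dataset_id are given, downstream code can avoid hard-coding a layout by flattening the result into (key path, value) pairs. The helper below is a hypothetical convenience written in plain Python; the example dictionary is a toy with made-up names and values.

```python
def flatten_estimates(nested, prefix=()):
    """Flatten an arbitrarily nested result dict into (key_path, value) rows."""
    rows = []
    for key, value in nested.items():
        path = prefix + (key,)
        if isinstance(value, dict):
            # descend until a leaf estimate is reached
            rows.extend(flatten_estimates(value, path))
        else:
            rows.append((path, value))
    return rows


# toy result in the deepest layout:
# [behavior_policy_name][dataset_id][evaluation_policy][OPE_estimator_name]
toy_result = {"behavior_a": {0: {"policy_a": {"estimator_a": 1.0, "estimator_b": 2.0}}}}
rows = flatten_estimates(toy_result)
```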
- estimate_variance(input_dict, compared_estimators=None, behavior_policy_name=None, dataset_id=None)
Estimate the variance of the trajectory-wise reward under the given evaluation policies.
- Parameters:
input_dict (OPEInputDict or MultipleInputDict) –
Dictionary of the OPE inputs for each evaluation policy.
key: [evaluation_policy][ evaluation_policy_action, evaluation_policy_action_dist, state_action_value_prediction, initial_state_value_prediction, state_action_marginal_importance_weight, state_marginal_importance_weight, on_policy_policy_value, gamma, behavior_policy, evaluation_policy, dataset_id, ]
See also
scope_rl.ope.input.CreateOPEInput describes the components of input_dict.
compared_estimators (list of str, default=None) – Name of compared estimators. If None is given, all the estimators are compared.
behavior_policy_name (str, default=None) – Name of the behavior policy.
dataset_id (int, default=None) – Id of the logged dataset.
- Returns:
variance_dict – Dictionary containing the variance of trajectory-wise reward of each evaluation policy estimated by OPE estimators.
key: [evaluation_policy][OPE_estimator_name]
When behavior_policy_name is None and dataset_id is None, key: [behavior_policy_name][dataset_id][evaluation_policy][OPE_estimator_name]
When behavior_policy_name is None and dataset_id is specified, key: [behavior_policy_name][evaluation_policy][OPE_estimator_name]
When behavior_policy_name is specified and dataset_id is None, key: [dataset_id][OPE_estimator_name]
- Return type:
dict
- estimate_conditional_value_at_risk(input_dict, compared_estimators=None, behavior_policy_name=None, dataset_id=None, alphas=None)
Estimate the conditional value at risk of the trajectory-wise reward under the given evaluation policies.
- Parameters:
input_dict (OPEInputDict or MultipleInputDict) –
Dictionary of the OPE inputs for each evaluation policy.
key: [evaluation_policy][ evaluation_policy_action, evaluation_policy_action_dist, state_action_value_prediction, initial_state_value_prediction, state_action_marginal_importance_weight, state_marginal_importance_weight, on_policy_policy_value, gamma, behavior_policy, evaluation_policy, dataset_id, ]
See also
scope_rl.ope.input.CreateOPEInput describes the components of input_dict.
compared_estimators (list of str, default=None) – Name of compared estimators. If None is given, all the estimators are compared.
behavior_policy_name (str, default=None) – Name of the behavior policy.
dataset_id (int, default=None) – Id of the logged dataset.
alphas ({float, array-like of shape (n_alpha, )}, default=None) – Set of proportions of the shaded region. The value(s) should be within [0, 1). If None is given, np.linspace(0, 1, 21) will be used.
- Returns:
conditional_value_at_risk_dict – Dictionary containing the conditional value at risk of trajectory-wise reward of each evaluation policy estimated by OPE estimators.
key: [evaluation_policy][OPE_estimator_name]
When behavior_policy_name is None and dataset_id is None, key: [behavior_policy_name][dataset_id][evaluation_policy][OPE_estimator_name]
When behavior_policy_name is None and dataset_id is specified, key: [behavior_policy_name][evaluation_policy][OPE_estimator_name]
When behavior_policy_name is specified and dataset_id is None, key: [dataset_id][OPE_estimator_name]
- Return type:
dict
- estimate_interquartile_range(input_dict, compared_estimators=None, behavior_policy_name=None, dataset_id=None, alpha=0.05)
Estimate the interquartile range of the trajectory-wise reward under the given evaluation policies.
- Parameters:
input_dict (OPEInputDict or MultipleInputDict) –
Dictionary of the OPE inputs for each evaluation policy.
key: [evaluation_policy][ evaluation_policy_action, evaluation_policy_action_dist, state_action_value_prediction, initial_state_value_prediction, state_action_marginal_importance_weight, state_marginal_importance_weight, on_policy_policy_value, gamma, behavior_policy, evaluation_policy, dataset_id, ]
See also
scope_rl.ope.input.CreateOPEInput describes the components of input_dict.
compared_estimators (list of str, default=None) – Name of compared estimators. If None is given, all the estimators are compared.
behavior_policy_name (str, default=None) – Name of the behavior policy.
dataset_id (int, default=None) – Id of the logged dataset.
alpha (float, default=0.05) – Proportion of the shaded region. The value should be within (0, 1].
- Returns:
interquartile_range_dict – Dictionary containing the interquartile range of trajectory-wise reward of each evaluation policy estimated by OPE estimators.
key: [evaluation_policy][OPE_estimator_name][quartile_name]
- Return type:
dict
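One plausible reading of the [quartile_name] level is a lower point at level alpha, the median, and an upper point at level 1 - alpha, each read off the estimated CDF. The sketch below illustrates that reading with NumPy. The function name and the key names "lower", "median", and "upper" are hypothetical and may differ from the names the library actually returns.

```python
import numpy as np


def interquartile_range_from_cdf(reward_scale, cdf, alpha=0.05):
    """Read lower/median/upper points off a discretized CDF (illustrative only)."""
    reward_scale = np.asarray(reward_scale, dtype=float)
    cdf = np.asarray(cdf, dtype=float)

    def quantile(level):
        # smallest grid point whose CDF value reaches the requested level
        idx = int(np.searchsorted(cdf, level, side="left"))
        return float(reward_scale[min(idx, len(reward_scale) - 1)])

    return {
        "lower": quantile(alpha),
        "median": quantile(0.5),
        "upper": quantile(1 - alpha),
    }


iqr = interquartile_range_from_cdf(
    [0.0, 1.0, 2.0, 3.0, 4.0], [0.2, 0.4, 0.6, 0.8, 1.0], alpha=0.25
)
```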
- visualize_cumulative_distribution_function(input_dict, compared_estimators=None, behavior_policy_name=None, dataset_id=None, hue='estimator', legend=True, n_cols=None, fig_dir=None, fig_name='estimated_cumulative_distribution_function.png')
Visualize the cumulative distribution function estimated by OPE estimators.
- Parameters:
input_dict (OPEInputDict or MultipleInputDict) –
Dictionary of the OPE inputs for each evaluation policy.
key: [evaluation_policy][ evaluation_policy_action, evaluation_policy_action_dist, state_action_value_prediction, initial_state_value_prediction, state_action_marginal_importance_weight, state_marginal_importance_weight, on_policy_policy_value, gamma, behavior_policy, evaluation_policy, dataset_id, ]
See also
scope_rl.ope.input.CreateOPEInput describes the components of input_dict.
compared_estimators (list of str, default=None) – Name of compared estimators. If None is given, all the estimators are compared.
behavior_policy_name (str, default=None) – Name of the behavior policy.
dataset_id (int, default=None) – Id of the logged dataset.
hue ({"estimator", "policy"}, default="estimator") – Hue of the plot.
legend (bool, default=True) – Whether to include a legend in the figure.
n_cols (int, default=None) – Number of columns in the figure.
fig_dir (Path, default=None) – Path to store the bar figure. If None is given, the figure will not be saved.
fig_name (str, default="estimated_cumulative_distribution_function.png") – Name of the bar figure.
- visualize_policy_value(input_dict, compared_estimators=None, behavior_policy_name=None, dataset_id=None, alpha=0.05, is_relative=False, hue='estimator', sharey=False, fig_dir=None, fig_name='estimated_policy_value.png')
Visualize the policy value estimated by OPE estimators.
- Parameters:
input_dict (OPEInputDict or MultipleInputDict) –
Dictionary of the OPE inputs for each evaluation policy.
key: [evaluation_policy][ evaluation_policy_action, evaluation_policy_action_dist, state_action_value_prediction, initial_state_value_prediction, state_action_marginal_importance_weight, state_marginal_importance_weight, on_policy_policy_value, gamma, behavior_policy, evaluation_policy, dataset_id, ]
See also
scope_rl.ope.input.CreateOPEInput describes the components of input_dict.
compared_estimators (list of str, default=None) – Name of compared estimators. If None is given, all the estimators are compared.
behavior_policy_name (str, default=None) – Name of the behavior policy.
dataset_id (int, default=None) – Id of the logged dataset.
alpha (float, default=0.05) – Significance level. The value should be within [0, 1).
is_relative (bool, default=False) – If True, we get the estimated policy values of the evaluation policies relative to the ground-truth policy value of the behavior policy.
hue ({"estimator", "policy"}, default="estimator") – Hue of the plot.
sharey (bool, default=False) – If True, the y-axis will be shared among different evaluation policies.
fig_dir (Path, default=None) – Path to store the bar figure. If None is given, the figure will not be saved.
fig_name (str, default="estimated_policy_value.png") – Name of the bar figure.
- visualize_conditional_value_at_risk(input_dict, compared_estimators=None, behavior_policy_name=None, dataset_id=None, alphas=None, hue='estimator', legend=True, n_cols=None, sharey=False, fig_dir=None, fig_name='estimated_conditional_value_at_risk.png')
Visualize the conditional value at risk estimated by OPE estimators.
- Parameters:
input_dict (OPEInputDict or MultipleInputDict) –
Dictionary of the OPE inputs for each evaluation policy.
key: [evaluation_policy][ evaluation_policy_action, evaluation_policy_action_dist, state_action_value_prediction, initial_state_value_prediction, state_action_marginal_importance_weight, state_marginal_importance_weight, on_policy_policy_value, gamma, behavior_policy, evaluation_policy, dataset_id, ]
See also
scope_rl.ope.input.CreateOPEInput describes the components of input_dict.
compared_estimators (list of str, default=None) – Name of compared estimators. If None is given, all the estimators are compared.
behavior_policy_name (str, default=None) – Name of the behavior policy.
dataset_id (int, default=None) – Id of the logged dataset.
alphas (array-like of shape (n_alpha, ), default=None) – Set of proportions of the shaded region. The values should be within [0, 1). If None is given, np.linspace(0, 1, 21) will be used.
hue ({"estimator", "policy"}, default="estimator") – Hue of the plot.
legend (bool, default=True) – Whether to include a legend in the figure.
n_cols (int, default=None) – Number of columns in the figure.
sharey (bool, default=False) – This parameter is for API consistency.
fig_dir (Path, default=None) – Path to store the bar figure. If None is given, the figure will not be saved.
fig_name (str, default="estimated_conditional_value_at_risk.png") – Name of the bar figure.
- visualize_interquartile_range(input_dict, compared_estimators=None, behavior_policy_name=None, dataset_id=None, alpha=0.05, hue='estimator', sharey=False, fig_dir=None, fig_name='estimated_interquartile_range.png')
Visualize the interquartile range estimated by OPE estimators.
- Parameters:
input_dict (OPEInputDict or MultipleInputDict) –
Dictionary of the OPE inputs for each evaluation policy.
key: [evaluation_policy][ evaluation_policy_action, evaluation_policy_action_dist, state_action_value_prediction, initial_state_value_prediction, state_action_marginal_importance_weight, state_marginal_importance_weight, on_policy_policy_value, gamma, behavior_policy, evaluation_policy, dataset_id, ]
See also
scope_rl.ope.input.CreateOPEInput describes the components of input_dict.
compared_estimators (list of str, default=None) – Name of compared estimators. If None is given, all the estimators are compared.
behavior_policy_name (str, default=None) – Name of the behavior policy.
dataset_id (int, default=None) – Id of the logged dataset.
alpha (float, default=0.05) – Significance level. The value should be within [0, 1).
hue ({"estimator", "policy"}, default="estimator") – Hue of the plot.
sharey (bool, default=False) – If True, the y-axis will be shared among different evaluation policies.
fig_dir (Path, default=None) – Path to store the bar figure. If None is given, the figure will not be saved.
fig_name (str, default="estimated_interquartile_range.png") – Name of the bar figure.
- visualize_cumulative_distribution_function_with_multiple_estimates(input_dict, compared_estimators=None, behavior_policy_name=None, scale_min=None, scale_max=None, n_partition=None, plot_type='ci_hue', hue='estimator', legend=True, n_cols=None, sharey=False, fig_dir=None, fig_name='estimated_policy_value_multiple.png')
Visualize the cumulative distribution function estimated by OPE estimators across multiple logged datasets.
Note
This function is applicable only when MultipleLoggedDataset is used and MultipleInputDict is collected by the same evaluation policy across logged datasets.
This function is not applicable when the data-driven reward scale is used. Please set scale_min, scale_max, and n_partition to use it.
- Parameters:
input_dict (MultipleInputDict) –
Dictionary of the OPE inputs for each evaluation policy.
key: [evaluation_policy][ evaluation_policy_action, evaluation_policy_action_dist, state_action_value_prediction, initial_state_value_prediction, state_action_marginal_importance_weight, state_marginal_importance_weight, on_policy_policy_value, gamma, behavior_policy, evaluation_policy, dataset_id, ]
See also
scope_rl.ope.input.CreateOPEInput describes the components of input_dict.
compared_estimators (list of str, default=None) – Name of compared estimators. If None is given, all the estimators are compared.
behavior_policy_name (str, default=None) – Name of the behavior policy.
plot_type ({"ci_hue", "ci_behavior_policy", "enumerate"}, default="ci_hue") – Type of plot. If “ci” is given, the method visualizes the average policy value and its 95% confidence intervals based on the multiple estimates. If “enumerate” is given, we get a scatter plot of estimated values.
hue ({"estimator", "policy"}, default="estimator") – Hue of the plot.
legend (bool, default=True) – Whether to include a legend in the scatter plot.
sharey (bool, default=False) – If True, the y-axis will be shared among different estimators or evaluation policies.
fig_dir (Path, default=None) – Path to store the bar figure. If None is given, the figure will not be saved.
fig_name (str, default="estimated_policy_value_multiple.png") – Name of the bar figure.
- visualize_policy_value_with_multiple_estimates(input_dict, compared_estimators=None, behavior_policy_name=None, plot_type='ci', hue='estimator', legend=True, sharey=False, fig_dir=None, fig_name='estimated_policy_value_multiple.png')
Visualize the policy value estimated by OPE estimators across multiple logged datasets.
Note
This function is applicable only when MultipleLoggedDataset is used and MultipleInputDict is collected by the same evaluation policy across logged datasets.
- Parameters:
input_dict (OPEInputDict or MultipleInputDict) –
Dictionary of the OPE inputs for each evaluation policy.
key: [evaluation_policy][ evaluation_policy_action, evaluation_policy_action_dist, state_action_value_prediction, initial_state_value_prediction, state_action_marginal_importance_weight, state_marginal_importance_weight, on_policy_policy_value, gamma, behavior_policy, evaluation_policy, dataset_id, ]
See also
scope_rl.ope.input.CreateOPEInput describes the components of input_dict.
compared_estimators (list of str, default=None) – Name of compared estimators. If None is given, all the estimators are compared.
behavior_policy_name (str, default=None) – Name of the behavior policy.
plot_type ({"ci", "scatter", "violin"}, default="ci") – Type of plot. If “ci” is given, we get the empirical average of the estimated values with their estimated confidence intervals. If “scatter” is given, we get a scatter plot of estimated values.
hue ({"estimator", "policy"}, default="estimator") – Hue of the plot.
legend (bool, default=True) – Whether to include a legend in the scatter plot.
sharey (bool, default=False) – If True, the y-axis will be shared among different estimators or evaluation policies.
fig_dir (Path, default=None) – Path to store the bar figure. If None is given, the figure will not be saved.
fig_name (str, default="estimated_policy_value_multiple.png") – Name of the bar figure.
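The "ci" plot type summarizes one estimate per logged dataset by an empirical average and a confidence interval. How the library computes that interval is not stated here, so the sketch below assumes a percentile bootstrap over the per-dataset estimates. The function `mean_and_bootstrap_ci` and all of its arguments are hypothetical.

```python
import numpy as np


def mean_and_bootstrap_ci(estimates, alpha=0.05, n_bootstrap=1000, random_state=12345):
    """Empirical mean with a percentile-bootstrap interval (illustrative only)."""
    estimates = np.asarray(estimates, dtype=float)
    rng = np.random.default_rng(random_state)

    # resample the per-dataset estimates with replacement
    samples = rng.choice(
        estimates, size=(n_bootstrap, len(estimates)), replace=True
    )
    boot_means = samples.mean(axis=1)

    lower, upper = np.quantile(boot_means, [alpha / 2, 1 - alpha / 2])
    return {
        "mean": float(estimates.mean()),
        "lower": float(lower),
        "upper": float(upper),
    }


# toy per-dataset estimates from one estimator for one evaluation policy
summary = mean_and_bootstrap_ci([1.0, 2.0, 3.0, 4.0, 5.0])
```

The "scatter" and "violin" plot types show the same per-dataset estimates without collapsing them into a single interval.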
- visualize_variance_with_multiple_estimates(input_dict, compared_estimators=None, behavior_policy_name=None, plot_type='ci', hue='estimator', legend=True, sharey=False, fig_dir=None, fig_name='estimated_variance_multiple.png')
Visualize the variance of the trajectory-wise reward under the evaluation policy estimated by OPE estimators across multiple logged datasets.
Note
This function is applicable only when MultipleLoggedDataset is used and MultipleInputDict is collected by the same evaluation policy across logged datasets.
- Parameters:
input_dict (OPEInputDict or MultipleInputDict) –
Dictionary of the OPE inputs for each evaluation policy.
key: [evaluation_policy][ evaluation_policy_action, evaluation_policy_action_dist, state_action_value_prediction, initial_state_value_prediction, state_action_marginal_importance_weight, state_marginal_importance_weight, on_policy_policy_value, gamma, behavior_policy, evaluation_policy, dataset_id, ]
See also
scope_rl.ope.input.CreateOPEInput describes the components of input_dict.
compared_estimators (list of str, default=None) – Name of compared estimators. If None is given, all the estimators are compared.
behavior_policy_name (str, default=None) – Name of the behavior policy.
plot_type ({"ci", "scatter", "violin"}, default="ci") – Type of plot. If “ci” is given, we get the empirical average of the estimated values with their estimated confidence intervals. If “scatter” is given, we get a scatter plot of estimated values.
hue ({"estimator", "policy"}, default="estimator") – Hue of the plot.
legend (bool, default=True) – Whether to include a legend in the scatter plot.
sharey (bool, default=False) – If True, the y-axis will be shared among different estimators or evaluation policies.
fig_dir (Path, default=None) – Path to store the bar figure. If None is given, the figure will not be saved.
fig_name (str, default="estimated_variance_multiple.png") – Name of the bar figure.
- visualize_conditional_value_at_risk_with_multiple_estimates(input_dict, compared_estimators=None, behavior_policy_name=None, alpha=0.05, plot_type='ci', hue='estimator', legend=True, sharey=False, fig_dir=None, fig_name='estimated_conditional_value_at_risk_multiple.png')
Visualize the conditional value at risk of the trajectory-wise reward under the evaluation policy estimated by OPE estimators across multiple logged datasets.
Note
This function is applicable only when MultipleLoggedDataset is used and MultipleInputDict is collected by the same evaluation policy across logged datasets.
- Parameters:
input_dict (OPEInputDict or MultipleInputDict) –
Dictionary of the OPE inputs for each evaluation policy.
key: [evaluation_policy][ evaluation_policy_action, evaluation_policy_action_dist, state_action_value_prediction, initial_state_value_prediction, state_action_marginal_importance_weight, state_marginal_importance_weight, on_policy_policy_value, gamma, behavior_policy, evaluation_policy, dataset_id, ]
See also
scope_rl.ope.input.CreateOPEInput describes the components of input_dict.
compared_estimators (list of str, default=None) – Name of compared estimators. If None is given, all the estimators are compared.
behavior_policy_name (str, default=None) – Name of the behavior policy.
alpha (float, default=0.05) – Proportion of the shaded region in CVaR estimate. The value should be within [0, 1).
plot_type ({"ci", "scatter", "violin"}, default="ci") – Type of plot. If “ci” is given, we get the empirical average of the estimated values with their estimated confidence intervals. If “scatter” is given, we get a scatter plot of estimated values.
hue ({"estimator", "policy"}, default="estimator") – Hue of the plot.
legend (bool, default=True) – Whether to include a legend in the scatter plot.
sharey (bool, default=False) – If True, the y-axis will be shared among different estimators or evaluation policies.
fig_dir (Path, default=None) – Path to store the bar figure. If None is given, the figure will not be saved.
fig_name (str, default="estimated_conditional_value_at_risk_multiple.png") – Name of the bar figure.
- visualize_lower_quartile_with_multiple_estimates(input_dict, compared_estimators=None, behavior_policy_name=None, alpha=0.05, plot_type='ci', hue='estimator', legend=True, sharey=False, fig_dir=None, fig_name='estimated_conditional_value_at_risk_multiple.png')
Visualize the lower quartile of the trajectory-wise reward under the evaluation policy estimated by OPE estimators across multiple logged datasets.
Note
This function is applicable only when MultipleLoggedDataset is used and MultipleInputDict is collected by the same evaluation policy across logged datasets.
- Parameters:
input_dict (OPEInputDict or MultipleInputDict) –
Dictionary of the OPE inputs for each evaluation policy.
key: [evaluation_policy][ evaluation_policy_action, evaluation_policy_action_dist, state_action_value_prediction, initial_state_value_prediction, state_action_marginal_importance_weight, state_marginal_importance_weight, on_policy_policy_value, gamma, behavior_policy, evaluation_policy, dataset_id, ]
See also
scope_rl.ope.input.CreateOPEInput describes the components of input_dict.
compared_estimators (list of str, default=None) – Name of compared estimators. If None is given, all the estimators are compared.
behavior_policy_name (str, default=None) – Name of the behavior policy.
alpha (float, default=0.05) – Proportion of the shaded region in CVaR estimate. The value should be within [0, 1).
plot_type ({"ci", "scatter", "violin"}, default="ci") – Type of plot. If “ci” is given, we get the empirical average of the estimated values with their estimated confidence intervals. If “scatter” is given, we get a scatter plot of estimated values.
hue ({"estimator", "policy"}, default="estimator") – Hue of the plot.
legend (bool, default=True) – Whether to include a legend in the scatter plot.
sharey (bool, default=False) – If True, the y-axis will be shared among different estimators or evaluation policies.
fig_dir (Path, default=None) – Path to store the bar figure. If None is given, the figure will not be saved.
fig_name (str, default="estimated_conditional_value_at_risk_multiple.png") – Name of the bar figure.