scope_rl.ope.input
Meta class to create input for Off-Policy Evaluation (OPE).
Classes
CreateOPEInput – Class to prepare OPE inputs.
- class scope_rl.ope.input.CreateOPEInput(env=None, model_args=None, gamma=1.0, bandwidth=1.0, state_scaler=None, action_scaler=None, device=None)
Class to prepare OPE inputs.
Imported as:
scope_rl.ope.CreateOPEInput
- Parameters:
env (gym.Env, default=None) – Reinforcement learning (RL) environment.
model_args (dict of dict, default=None) –
Arguments of the models.
key: [
    "fqe",
    "state_action_dual",
    "state_action_value",
    "state_action_weight",
    "state_dual",
    "state_value",
    "state_weight",
    "hidden_dim",  # hidden dim of value/weight function, except FQE
]
Note
Please specify scaler and action_scaler when calling obtain_whole_inputs(), as we will overwrite those specified by model_args[model]["scaler/action_scaler"].
See also
The following describe the parameters of each model.
(external) d3rlpy’s documentation about FQE
(API reference) scope_rl.ope.weight_value_learning
gamma (float, default=1.0) – Discount factor. The value should be within (0, 1].
bandwidth (float, default=1.0 (> 0)) – Bandwidth hyperparameter of the kernel.
state_scaler (d3rlpy.preprocessing.Scaler, default=None) – Scaling factor of state.
action_scaler (d3rlpy.preprocessing.ActionScaler, default=None) – Scaling factor of action. Only applicable in the continuous action case.
device (Optional[str] = None) – Specifies device used for torch.
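To make the role of gamma concrete, the following is a minimal sketch of how a discount factor within (0, 1] turns a reward sequence into a return. The helper name discounted_return and the reward values are illustrative assumptions and are not part of SCOPE-RL.

```python
def discounted_return(rewards, gamma=1.0):
    """Return sum_t gamma^t * r_t for a single trajectory (illustrative helper)."""
    if not 0.0 < gamma <= 1.0:
        raise ValueError("gamma must be within (0, 1]")
    total = 0.0
    # Accumulate backwards so each earlier step applies one more factor of gamma.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


rewards = [1.0, 0.0, 2.0]  # hypothetical per-step rewards
undiscounted = discounted_return(rewards, gamma=1.0)
discounted = discounted_return(rewards, gamma=0.5)
```

With gamma=1.0 (the default above) every step counts equally; smaller values put more weight on early rewards.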
Examples
Preparation:
# import necessary module from SCOPE-RL
from scope_rl.dataset import SyntheticDataset
from scope_rl.policy import EpsilonGreedyHead
from scope_rl.ope import CreateOPEInput
from scope_rl.ope import OffPolicyEvaluation as OPE
from scope_rl.ope.discrete import TrajectoryWiseImportanceSampling as TIS
from scope_rl.ope.discrete import PerDecisionImportanceSampling as PDIS

# import necessary module from other libraries
import gym
import rtbgym
from d3rlpy.algos import DoubleDQNConfig
from d3rlpy.dataset import create_fifo_replay_buffer
from d3rlpy.algos import ConstantEpsilonGreedy

# initialize environment
env = gym.make("RTBEnv-discrete-v0")

# define (RL) agent (i.e., policy) and train on the environment
ddqn = DoubleDQNConfig().create()
buffer = create_fifo_replay_buffer(
    limit=10000,
    env=env,
)
explorer = ConstantEpsilonGreedy(
    epsilon=0.3,
)
ddqn.fit_online(
    env=env,
    buffer=buffer,
    explorer=explorer,
    n_steps=10000,
    n_steps_per_epoch=1000,
)

# convert ddqn policy to stochastic data collection policy
behavior_policy = EpsilonGreedyHead(
    ddqn,
    n_actions=env.action_space.n,
    epsilon=0.3,
    name="ddqn_epsilon_0.3",
    random_state=12345,
)

# initialize dataset class
dataset = SyntheticDataset(
    env=env,
    max_episode_steps=env.step_per_episode,
)

# data collection
logged_dataset = dataset.obtain_episodes(
    behavior_policies=behavior_policy,
    n_trajectories=100,
    random_state=12345,
)
Create Input:
# evaluation policy
ddqn_ = EpsilonGreedyHead(
    base_policy=ddqn,
    n_actions=env.action_space.n,
    name="ddqn",
    epsilon=0.0,
    random_state=12345
)
random_ = EpsilonGreedyHead(
    base_policy=ddqn,
    n_actions=env.action_space.n,
    name="random",
    epsilon=1.0,
    random_state=12345
)

# create input for off-policy evaluation (OPE)
prep = CreateOPEInput(
    env=env,
)
input_dict = prep.obtain_whole_inputs(
    logged_dataset=logged_dataset,
    evaluation_policies=[ddqn_, random_],
    require_value_prediction=True,
    n_trajectories_on_policy_evaluation=100,
    random_state=12345,
)
Output:
>>> input_dict
{'ddqn':
    {'evaluation_policy_action_dist':
        array([[0., 0., 0., ..., 0., 1., 0.],
               [0., 0., 0., ..., 0., 0., 0.],
               [0., 0., 0., ..., 0., 1., 0.],
               ...,
               [0., 0., 0., ..., 0., 0., 0.],
               [0., 0., 0., ..., 0., 0., 0.],
               [0., 0., 0., ..., 0., 1., 0.]]),
     'evaluation_policy_action': None,
     'state_action_value_prediction':
        array([[11.64699173, 10.1278677 , 10.09877205, ..., 10.16476822, 15.13939476,  8.95065594],
               [10.42242146,  7.73790789,  7.27790451, ...,  3.51157165, 12.0761919 ,  3.75301909],
               [ 7.22864819,  6.88499546,  5.68951464, ...,  6.10659647,  7.05469513,  4.81715965],
               ...,
               [ 7.28475332,  3.91264176,  4.6845212 , ..., -0.02834684,  7.94454432,  2.59267783],
               [13.44723797,  3.08360171,  5.99188185, ..., -2.16886044,  7.13434792,  5.72265959],
               [ 2.27913332,  3.07881427,  1.8636421 , ...,  3.37803316,  3.20135021,  2.68845224]]),
     'initial_state_value_prediction':
        array([15.13939476, 14.83423805, 13.82990742, ..., 15.49367523, 15.49053097, 14.88922691]),
     'state_action_marginal_importance_weight': None,
     'state_marginal_importance_weight': None,
     'on_policy_policy_value': array([ 8., 10.,  9., ..., 13., 18.,  4.]),
     'gamma': 1.0,
     'behavior_policy': 'ddqn_epsilon_0.3',
     'evaluation_policy': 'ddqn',
     'dataset_id': 0},
 'random':
    {'evaluation_policy_action_dist':
        array([[0.1, 0.1, 0.1, ..., 0.1, 0.1, 0.1],
               [0.1, 0.1, 0.1, ..., 0.1, 0.1, 0.1],
               [0.1, 0.1, 0.1, ..., 0.1, 0.1, 0.1],
               ...,
               [0.1, 0.1, 0.1, ..., 0.1, 0.1, 0.1],
               [0.1, 0.1, 0.1, ..., 0.1, 0.1, 0.1],
               [0.1, 0.1, 0.1, ..., 0.1, 0.1, 0.1]]),
     'evaluation_policy_action': None,
     'state_action_value_prediction':
        array([[10.63342857, 10.61063576, 11.16767025, ..., 15.32427979, 15.08568764, 10.50707436],
               [ 4.02995491,  4.80947208,  7.07302999, ...,  9.928442  , 10.78198528,  9.04977417],
               [ 6.21145582,  6.08772421,  6.5972681 , ...,  9.68579388,  7.2353406 ,  6.17404699],
               ...,
               [ 1.2350018 ,  1.37531543,  3.48139453, ...,  3.44862366,  5.41990328, -0.20314722],
               [ 0.81208032, -0.28935188,  2.62608957, ...,  6.6619091 , -2.18710518, -2.34665537],
               [ 2.36533523,  2.24474525,  2.31729817, ...,  4.7845993 ,  2.83752441,  3.00596046]]),
     'initial_state_value_prediction':
        array([12.5472518 , 12.56364899, 12.30248432, ..., 12.62372198, 12.6544138 , 12.54314356]),
     'state_action_marginal_importance_weight': None,
     'state_marginal_importance_weight': None,
     'on_policy_policy_value': array([ 9.,  7.,  4., ..., 15.,  8.,  5.]),
     'gamma': 1.0,
     'behavior_policy': 'ddqn_epsilon_0.3',
     'evaluation_policy': 'random',
     'dataset_id': 0}}
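As a rough illustration of how such a dictionary can be read, the sketch below averages the predicted initial state values and the on-policy values per evaluation policy. The dictionary toy_input_dict and the helper summarize are hypothetical stand-ins that only mimic the key layout shown above; the numbers are made up and plain lists replace the arrays.

```python
from statistics import mean

# Toy stand-in with the same key layout as the output above (values are made up).
toy_input_dict = {
    "ddqn": {
        "initial_state_value_prediction": [15.0, 14.0, 13.0],
        "on_policy_policy_value": [8.0, 10.0, 9.0],
        "evaluation_policy": "ddqn",
    },
    "random": {
        "initial_state_value_prediction": [12.5, 12.5, 12.5],
        "on_policy_policy_value": [9.0, 7.0, 5.0],
        "evaluation_policy": "random",
    },
}


def summarize(input_dict):
    """Average the model-based and on-policy values for each evaluation policy."""
    summary = {}
    for name, entry in input_dict.items():
        summary[name] = {
            # mean of predicted initial state values (a model-based estimate)
            "model_based_value": mean(entry["initial_state_value_prediction"]),
            # mean of the returns observed in on-policy rollouts
            "on_policy_value": mean(entry["on_policy_policy_value"]),
        }
    return summary


summary = summarize(toy_input_dict)
```

Comparing the two averages per policy gives a quick sanity check of the value predictions against the on-policy rollouts.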
- Attributes:
- action_scaler
- device
- env
- model_args
- state_scaler
Methods
build_and_fit_FQE(evaluation_policy[, ...]) – Perform Fitted Q Evaluation (FQE).
build_and_fit_state_action_dual_model(...[, ...]) – Perform Augmented Lagrangian Method (ALM) to estimate the state-action value weight function.
build_and_fit_state_action_value_model(...[, ...]) – Perform Minimax Q Learning (MQL) to estimate the state-action value function.
build_and_fit_state_action_weight_model(...[, ...]) – Perform Minimax Weight Learning (MWL) to estimate the state-action weight function.
build_and_fit_state_dual_model(evaluation_policy) – Perform Augmented Lagrangian Method (ALM) to estimate the state value weight function.
build_and_fit_state_value_model(...[, ...]) – Perform Minimax V Learning (MVL) to estimate the state value function.
build_and_fit_state_weight_model(...[, ...]) – Perform Minimax Weight Learning (MWL) to estimate the state weight function.
obtain_evaluation_policy_action(...[, state]) – Obtain evaluation policy action.
obtain_evaluation_policy_action_dist(...[, ...]) – Obtain the action choice probability of the evaluation policy at the observed state.
obtain_evaluation_policy_action_prob_for_observed_state_action(...) – Obtain the pscore of an observed state-action pair.
obtain_initial_state(evaluation_policy[, ...]) – Obtain initial state distribution (stationary distribution) of the evaluation policy.
obtain_initial_state_value_prediction(...[, ...]) – Obtain the initial state value of the evaluation policy in the case of discrete action spaces.
obtain_state_action_marginal_importance_weight(...) – Predict state-action marginal importance weight.
obtain_state_action_value_prediction(method, ...) – Obtain an estimated Q-function of the observed state and all actions (discrete) or that of the actions chosen by behavior and (deterministic) evaluation policies (continuous).
obtain_state_marginal_importance_weight(...) – Predict state marginal importance weight.
obtain_whole_inputs(logged_dataset, ...[, ...]) – Obtain input as a dictionary.
- build_and_fit_FQE(evaluation_policy, k_fold=1, n_steps=10000)
Perform Fitted Q Evaluation (FQE).
- build_and_fit_state_action_dual_model(evaluation_policy, k_fold=1, n_steps=10000, random_state=None)
Perform Augmented Lagrangian Method (ALM) to estimate the state-action value weight function.
- build_and_fit_state_action_value_model(evaluation_policy, k_fold=1, n_steps=10000, random_state=None)
Perform Minimax Q Learning (MQL) to estimate the state-action value function.
- build_and_fit_state_action_weight_model(evaluation_policy, k_fold=1, n_steps=10000, random_state=None)
Perform Minimax Weight Learning (MWL) to estimate the state-action weight function.
- build_and_fit_state_dual_model(evaluation_policy, k_fold=1, n_steps=10000, random_state=None)
Perform Augmented Lagrangian Method (ALM) to estimate the state value weight function.
- build_and_fit_state_value_model(evaluation_policy, k_fold=1, n_steps=10000, random_state=None)
Perform Minimax V Learning (MVL) to estimate the state value function.
- build_and_fit_state_weight_model(evaluation_policy, k_fold=1, n_steps=10000, random_state=None)
Perform Minimax Weight Learning (MWL) to estimate the state weight function.
- obtain_initial_state(evaluation_policy, resample_initial_state=False, minimum_rollout_length=0, maximum_rollout_length=100, random_state=None)
Obtain initial state distribution (stationary distribution) of the evaluation policy.
- Parameters:
evaluation_policy (BaseHead) – Evaluation policy.
resample_initial_state (bool, default=False) – Whether to resample from initial state distribution using the given evaluation policy. If False, the initial state distribution of the behavior policy is used instead.
minimum_rollout_length (int, default=0 (>= 0)) – Minimum length of rollout by the behavior policy before generating the logged dataset when working on the infinite horizon setting. This argument is irrelevant when working on the finite horizon setting.
maximum_rollout_length (int, default=100 (>= minimum_rollout_length)) – Maximum length of rollout by the behavior policy before generating the logged dataset when working on the infinite horizon setting. This argument is irrelevant when working on the finite horizon setting.
random_state (int, default=None (>= 0)) – Random state.
- Returns:
evaluation_policy_initial_state – Initial state distribution of the evaluation policy. This is intended to be used in marginal OPE estimators.
- Return type:
ndarray of shape (n_trajectories, )
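The interplay of minimum_rollout_length and maximum_rollout_length can be pictured with the sketch below, which rolls a policy out for a random number of steps and keeps the reached state as an initial state. ToyChainEnv, collect_initial_state, and the uniform sampling of the rollout length are assumptions made for illustration; the source does not state how SCOPE-RL picks the length.

```python
import random


class ToyChainEnv:
    """Hypothetical environment: the state is a counter increased by the action."""

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        self.state += action
        return self.state


def collect_initial_state(env, policy, minimum_rollout_length, maximum_rollout_length, rng):
    """Roll out `policy` for a random number of steps and return the reached state."""
    if not 0 <= minimum_rollout_length <= maximum_rollout_length:
        raise ValueError("require 0 <= minimum_rollout_length <= maximum_rollout_length")
    # Assumption: the rollout length is drawn uniformly from the allowed range.
    rollout_length = rng.randint(minimum_rollout_length, maximum_rollout_length)
    state = env.reset()
    for _ in range(rollout_length):
        state = env.step(policy(state))
    return state, rollout_length


rng = random.Random(12345)
always_one = lambda state: 1  # toy policy that always takes action 1
initial_states = [
    collect_initial_state(ToyChainEnv(), always_one, 2, 5, rng) for _ in range(10)
]
```

Because the toy policy always adds one, each collected state equals the sampled rollout length, which makes the effect of the two bounds easy to see.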
- obtain_evaluation_policy_action(evaluation_policy, state=None)
Obtain evaluation policy action.
- Parameters:
evaluation_policy (BaseHead) – Evaluation policy.
state (np.ndarray, default=None) – Sample an action from the evaluation_policy at this state. If None is given, the states observed in the logged data will be used.
- Returns:
evaluation_policy_action – Evaluation policy action \(a_t \sim \pi(a_t \mid s_t)\).
- Return type:
ndarray of shape (n_trajectories * step_per_trajectory, )
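Sampling \(a_t \sim \pi(a_t \mid s_t)\) can be sketched for an epsilon-greedy policy as follows. The function sample_epsilon_greedy_action and the greedy actions are hypothetical; the sketch assumes the textbook epsilon-greedy rule rather than reproducing the EpsilonGreedyHead implementation.

```python
import random


def sample_epsilon_greedy_action(greedy_action, n_actions, epsilon, rng):
    """With probability epsilon pick a uniform random action, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(n_actions)
    return greedy_action


rng = random.Random(12345)
greedy_actions = [3, 1, 4, 1, 5]  # hypothetical greedy actions at five observed states

# epsilon=0.0 reproduces the greedy actions; epsilon=1.0 samples uniformly.
deterministic = [sample_epsilon_greedy_action(a, 10, 0.0, rng) for a in greedy_actions]
exploratory = [sample_epsilon_greedy_action(a, 10, 1.0, rng) for a in greedy_actions]
```

One action is produced per observed state, which matches the flat return shape documented above.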
- obtain_evaluation_policy_action_dist(evaluation_policy, state=None)
Obtain the action choice probability of the evaluation policy at the observed state.
- Parameters:
evaluation_policy (BaseHead) – Evaluation policy.
state (np.ndarray, default=None) – Compute the action distribution of the evaluation_policy at this state. If None is given, the states observed in the logged data will be used.
- Returns:
evaluation_policy_action_dist – Action choice probabilities of the evaluation policy, i.e., \(\pi(a \mid s_t) \forall a \in \mathcal{A}\).
- Return type:
ndarray of shape (n_trajectories * step_per_trajectory, n_actions)
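For an epsilon-greedy evaluation policy, one row of the action distribution can be sketched as below. The helper epsilon_greedy_action_dist is illustrative and assumes the standard epsilon-greedy definition (probability mass epsilon spread uniformly, the remainder on the greedy action), not the library's internal code.

```python
def epsilon_greedy_action_dist(greedy_action, n_actions, epsilon):
    """Return pi(a | s) for all actions under a standard epsilon-greedy policy."""
    if not 0.0 <= epsilon <= 1.0:
        raise ValueError("epsilon must be within [0, 1]")
    # Every action receives the uniform exploration mass ...
    dist = [epsilon / n_actions] * n_actions
    # ... and the greedy action receives the remaining probability.
    dist[greedy_action] += 1.0 - epsilon
    return dist


one_hot = epsilon_greedy_action_dist(greedy_action=8, n_actions=10, epsilon=0.0)
uniform = epsilon_greedy_action_dist(greedy_action=8, n_actions=10, epsilon=1.0)
mixed = epsilon_greedy_action_dist(greedy_action=8, n_actions=10, epsilon=0.3)
```

Under this assumption, epsilon=0.0 yields a one-hot row and epsilon=1.0 yields 0.1 for each of ten actions, which is the pattern visible in the example output of this class.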
- obtain_evaluation_policy_action_prob_for_observed_state_action(evaluation_policy)
Obtain the pscore of an observed state-action pair.
- Parameters:
evaluation_policy (BaseHead) – Evaluation policy.
- Returns:
evaluation_policy_pscore – Evaluation policy pscore \(\pi(a_t \mid s_t)\).
- Return type:
ndarray of shape (n_trajectories * step_per_trajectory, )
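The relation between the full action distribution and the pscore of the observed pair can be sketched as a per-row lookup. The helper observed_action_prob and the toy values are assumptions for illustration only.

```python
def observed_action_prob(action_dist, observed_actions):
    """Pick pi(a_t | s_t) of the observed action from each row of the distribution."""
    if len(action_dist) != len(observed_actions):
        raise ValueError("action_dist and observed_actions must have the same length")
    return [row[action] for row, action in zip(action_dist, observed_actions)]


# Two observed steps with three discrete actions (made-up probabilities).
action_dist = [
    [0.7, 0.1, 0.2],
    [0.2, 0.5, 0.3],
]
observed_actions = [2, 1]
pscore = observed_action_prob(action_dist, observed_actions)
```

The lookup reduces an (n, n_actions) table to one probability per observed step.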
- obtain_state_action_value_prediction(method, evaluation_policy, k_fold=1)
Obtain an estimated Q-function of the observed state and all actions (discrete) or that of the actions chosen by behavior and (deterministic) evaluation policies (continuous).
- Parameters:
- Returns:
state_action_value_prediction – If action_type is “discrete”, output is state action value for observed state and all actions, i.e., \(\hat{Q}(s, a) \forall a \in \mathcal{A}\).
If action_type is “continuous”, output is state action value for the observed action and that chosen by the evaluation policy, i.e., (column 0) \(\hat{Q}(s_t, a_t)\) and (column 1) \(\hat{Q}(s_t, \pi(s_t))\).
- Return type:
ndarray of shape (n_trajectories * step_per_trajectory, n_actions) or (n_trajectories * step_per_trajectory, 2)
- obtain_initial_state_value_prediction(method, evaluation_policy, k_fold=1, resample_initial_state=False, minimum_rollout_length=0, maximum_rollout_length=100, random_state=None)
Obtain the initial state value of the evaluation policy in the case of discrete action spaces.
- Parameters:
method ({"fqe", "dice_q", "dice_v", "mql", "mvl"}) – Estimation method.
evaluation_policy (BaseHead) – Evaluation policy.
k_fold (int, default=1 (> 0)) – Number of folds to perform cross-fitting.
resample_initial_state (bool, default=False) – Whether to resample initial state distribution using the given evaluation policy. If False, the initial state distribution of the behavior policy is used instead.
minimum_rollout_length (int, default=0 (>= 0)) – Minimum length of rollout by the behavior policy before generating the logged dataset when working on the infinite horizon setting. This argument is irrelevant when working on the finite horizon setting.
maximum_rollout_length (int, default=100 (>= minimum_rollout_length)) – Maximum length of rollout by the behavior policy before generating the logged dataset when working on the infinite horizon setting. This argument is irrelevant when working on the finite horizon setting.
random_state (int, default=None (>= 0)) – Random state.
- Returns:
initial_state_value_prediction – Estimated state value of the observed initial state.
- Return type:
ndarray of shape (n_trajectories, )
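For discrete actions, a state value can be derived from Q-value predictions by taking the expectation over the evaluation policy, \(\hat{V}(s_0) = \sum_a \pi(a \mid s_0) \hat{Q}(s_0, a)\). The sketch below shows this computation with a hypothetical helper state_value_from_q and made-up numbers; whether the library computes the value exactly this way is an assumption.

```python
def state_value_from_q(action_dist, q_values):
    """Compute V(s) = sum_a pi(a | s) * Q(s, a) for each state (row)."""
    if len(action_dist) != len(q_values):
        raise ValueError("action_dist and q_values must have the same number of rows")
    return [
        sum(prob * q for prob, q in zip(pi_row, q_row))
        for pi_row, q_row in zip(action_dist, q_values)
    ]


# Two initial states, two actions (made-up values).
action_dist = [[0.0, 1.0], [0.5, 0.5]]
q_values = [[2.0, 4.0], [1.0, 3.0]]
initial_state_value = state_value_from_q(action_dist, q_values)
```

A deterministic policy simply selects the Q-value of its chosen action, while a stochastic one averages over actions.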
- obtain_state_action_marginal_importance_weight(method, evaluation_policy, k_fold=1)
Predict state-action marginal importance weight.
- Parameters:
- Returns:
state_action_weight_prediction – State-action marginal importance weight for the observed state and action.
- Return type:
ndarray of shape (n_trajectories * step_per_trajectory, )
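One common use of state-action marginal importance weights is to reweight observed rewards, roughly \(\frac{1}{n} \sum_i \sum_t \gamma^t \hat{w}(s_t, a_t) r_t\). The sketch below illustrates this with a hypothetical helper marginal_is_policy_value; it assumes the flat arrays are ordered trajectory by trajectory and is not the estimator code of SCOPE-RL.

```python
def marginal_is_policy_value(weights, rewards, n_trajectories, step_per_trajectory, gamma=1.0):
    """Average over trajectories of sum_t gamma^t * w(s_t, a_t) * r_t."""
    expected_size = n_trajectories * step_per_trajectory
    if len(weights) != expected_size or len(rewards) != expected_size:
        raise ValueError("weights and rewards must have n_trajectories * step_per_trajectory entries")
    total = 0.0
    for i in range(n_trajectories):
        for t in range(step_per_trajectory):
            # Assumption: flat index = trajectory index * steps + timestep.
            idx = i * step_per_trajectory + t
            total += (gamma ** t) * weights[idx] * rewards[idx]
    return total / n_trajectories


weights = [1.0, 0.5, 2.0, 1.0]  # made-up marginal importance weights
rewards = [1.0, 2.0, 0.0, 4.0]  # made-up rewards
value_undiscounted = marginal_is_policy_value(weights, rewards, 2, 2, gamma=1.0)
value_discounted = marginal_is_policy_value(weights, rewards, 2, 2, gamma=0.5)
```

The size check mirrors the documented shape, one weight per observed step.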
- obtain_state_marginal_importance_weight(method, evaluation_policy, k_fold=1)
Predict state marginal importance weight.
- Parameters:
- Returns:
state_weight_prediction – State marginal importance weight for observed state.
- Return type:
ndarray of shape (n_trajectories * step_per_trajectory, )
- obtain_whole_inputs(logged_dataset, evaluation_policies, behavior_policy_name=None, dataset_id=None, require_value_prediction=False, require_weight_prediction=False, resample_initial_state=False, q_function_method='fqe', v_function_method='fqe', w_function_method='dice', k_fold=1, n_steps=10000, n_trajectories_on_policy_evaluation=100, use_stationary_distribution_on_policy_evaluation=False, minimum_rollout_length=0, maximum_rollout_length=100, random_state=None, path='input_dict/', save_relative_path=False)
Obtain input as a dictionary.
- Parameters:
logged_dataset (LoggedDataset or MultipleLoggedDataset) –
Logged dataset containing the following.
key: [ size, n_trajectories, step_per_trajectory, action_type, n_actions, action_dim, action_keys, action_meaning, state_dim, state_keys, state, action, reward, done, terminal, info, pscore, behavior_policy, dataset_id, ]
See also
scope_rl.dataset.SyntheticDataset describes the components of logged_dataset.
evaluation_policies (list of BaseHead or BaseHead) –
Evaluation policies.
Tip
1. When using LoggedDataset, evaluation_policies should be List[BaseHead] ([BaseHead, BaseHead, ..]).
2. When using MultipleLoggedDataset and applying the same evaluation policies across behavior_policies and dataset_ids, evaluation_policies should be List[BaseHead].
3. When using MultipleLoggedDataset and applying the same evaluation policies across dataset_ids but different evaluation_policies across behavior policies, evaluation_policies should be Dict[str, List[BaseHead]] (key: [behavior_policy_name]).
4. When using MultipleLoggedDataset and applying different evaluation policies across dataset_ids and behavior policies, evaluation_policies should be Dict[str, List[BaseHead]] (key: [behavior_policy_name][dataset_id]).
behavior_policy_name (str, default=None) – Name of the behavior policy.
dataset_id (int, default=None) – Id of the logged dataset.
require_value_prediction (bool, default=False) – Whether to obtain an estimated value function.
require_weight_prediction (bool, default=False) – Whether to obtain an estimated weight function.
resample_initial_state (bool, default=False) –
Whether to resample initial state distribution using the given evaluation policy. If False, the initial state distribution of the behavior policy is used instead.
Note that this parameter is applicable only when self.env is given.
q_function_method ({"fqe", "dice_q", "mql"}, default="fqe") – Method to estimate \(Q(s, a)\).
v_function_method ({"fqe", "dice_q", "dice_v", "mql", "mvl"}, default="fqe") – Method to estimate \(V(s)\).
w_function_method ({"dice", "mwl"}, default="dice") – Method to estimate \(w(s, a)\) and \(w(s)\).
k_fold (int, default=1 (> 0)) –
Number of folds to perform cross-fitting.
If \(K>1\), we split the logged dataset into \(K\) folds, where \(\mathcal{D}_j\) is the \(j\)-th split of the logged data consisting of \(n_j\) samples. Then, the value and weight functions (\(Q^j\) and \(w^j\)) are trained on the data outside the \(j\)-th split, i.e., \(\mathcal{D} \setminus \mathcal{D}_j\), and used for OPE on \(\mathcal{D}_j\).
If \(K=1\), the value and weight functions are trained on the entire data.
n_steps (int, default=10000 (> 0)) – Number of gradient steps to fit weight and value learning methods.
n_trajectories_on_policy_evaluation (int, default=100 (> 0)) – Number of episodes to perform on-policy evaluation.
use_stationary_distribution_on_policy_evaluation (bool, default=False) – Whether to evaluate a policy based on stationary distribution. If True, evaluation policy is evaluated by rollout without resetting environment at each episode.
minimum_rollout_length (int, default=0 (>= 0)) – Minimum length of rollout to collect initial state.
maximum_rollout_length (int, default=100 (>= minimum_rollout_length)) – Maximum length of rollout to collect initial state.
random_state (int, default=None (>= 0)) – Random state.
path (str, default="input_dict/") – Path to the directory. Either absolute or relative path is acceptable.
save_relative_path (bool, default=False) –
Whether to save a relative path. If True, a path relative to the scope-rl directory will be saved. If False, the absolute path will be saved.
Note that this option was added in order to run examples in the documentation properly. Otherwise, the default setting (False) is recommended.
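The cross-fitting procedure controlled by k_fold can be sketched as follows: for each fold, a model is fitted on the data outside that fold and used to predict on the fold itself. The helper cross_fit_predict, the modulo-based fold assignment, and the toy mean "model" are illustrative assumptions; the library may split the data differently.

```python
def cross_fit_predict(data, k_fold, fit, predict):
    """Fit on D minus D_j and predict on D_j for each fold j (k_fold=1 uses all data)."""
    n = len(data)
    if not 1 <= k_fold <= n:
        raise ValueError("k_fold must be between 1 and len(data)")
    if k_fold == 1:
        model = fit(data)
        return [predict(model, x) for x in data]
    predictions = [None] * n
    for j in range(k_fold):
        # Assumption: sample i belongs to fold i % k_fold.
        train = [data[i] for i in range(n) if i % k_fold != j]
        model = fit(train)
        for i in range(n):
            if i % k_fold == j:
                predictions[i] = predict(model, data[i])
    return predictions


# Toy "model": the mean of the training targets, returned for every input.
fit_mean = lambda train: sum(train) / len(train)
predict_mean = lambda model, x: model

data = [1.0, 2.0, 3.0, 4.0]
single_fit = cross_fit_predict(data, 1, fit_mean, predict_mean)
cross_fit = cross_fit_predict(data, 2, fit_mean, predict_mean)
```

With two folds, each prediction comes from a model that never saw the sample it is applied to, which is the point of cross-fitting.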
- Returns:
input_dicts – MultipleInputDict is an instance containing (multiple) input dictionaries for OPE.
Each input dict is accessible by the following command.
input_dict_ = input_dict.get(behavior_policy_name=behavior_policy.name, dataset_id=0)
Each input dict consists of the following.
key: [evaluation_policy_name][ evaluation_policy_action, evaluation_policy_action_dist, state_action_value_prediction, initial_state_value_prediction, state_action_marginal_importance_weight, state_marginal_importance_weight, on_policy_policy_value, gamma, behavior_policy, evaluation_policy, dataset_id, ]
- evaluation_policy_action: ndarray of shape (n_trajectories * step_per_trajectory, action_dim)
Action chosen by the deterministic evaluation policy. If action_type is “discrete”, None is recorded.
- evaluation_policy_action_dist: ndarray of shape (n_trajectories * step_per_trajectory, n_actions)
Conditional action distribution induced by the evaluation policy, i.e., \(\pi(a \mid s_t) \forall a \in \mathcal{A}\). If action_type is “continuous”, None is recorded.
- state_action_value_prediction: ndarray
If action_type is “discrete”, \(\hat{Q}\) for all actions, i.e., \(\hat{Q}(s_t, a) \forall a \in \mathcal{A}\). shape (n_trajectories * step_per_trajectory, n_actions)
If action_type is “continuous”, \(\hat{Q}\) for the observed action and that chosen by the evaluation policy, i.e., (column 0) \(\hat{Q}(s_t, a_t)\) and (column 1) \(\hat{Q}(s_t, \pi(s_t))\). shape (n_trajectories * step_per_trajectory, 2)
If require_value_prediction is False, None is recorded.
- initial_state_value_prediction: ndarray of shape (n_trajectories, )
Estimated initial state value.
If require_value_prediction is False, None is recorded.
- state_action_marginal_importance_weight: ndarray of shape (n_trajectories * step_per_trajectory, )
Estimated state-action marginal importance weight, i.e., \(\hat{w}(s_t, a_t) \approx d^{\pi}(s_t, a_t) / d^{\pi_b}(s_t, a_t)\).
If require_weight_prediction is False, None is recorded.
- state_marginal_importance_weight: ndarray of shape (n_trajectories * step_per_trajectory, )
Estimated state marginal importance weight, i.e., \(\hat{w}(s_t) \approx d^{\pi}(s_t) / d^{\pi_b}(s_t)\).
If require_weight_prediction is False, None is recorded.
- on_policy_policy_value: ndarray of shape (n_trajectories_on_policy_evaluation, )
On-policy policy value. If self.env is None, None is recorded.
- gamma: float
Discount factor.
- behavior_policy: str
Name of the behavior policy.
- evaluation_policy: str
Name of the evaluation policy.
- dataset_id: int
Id of the logged dataset.
- Return type:
OPEInputDict or MultipleInputDict