Guidelines for Preparing Real-World Datasets
Here, we provide guidelines for preparing logged datasets and inputs that are compatible with SCOPE-RL.
Logged Dataset
In real-world experiments, logged_dataset should contain the following keys.
For keys marked (optional), please use None when the data is unavailable.
key: [
size,
n_trajectories,
step_per_trajectory,
action_type,
n_actions,
action_dim,
action_keys,
action_meaning,
state_dim,
state_keys,
state,
action,
reward,
done,
terminal,
info,
pscore,
behavior_policy,
dataset_id,
]
step_per_trajectory: number of steps in a trajectory, int
n_trajectories: number of trajectories, int (optional)
size: number of data tuples, given by the product of n_trajectories and step_per_trajectory, int (optional)
state: state observation, np.ndarray of shape (size, )
action: action chosen by the behavior policy, np.ndarray
reward: reward observation, np.ndarray of shape (size, )
done: whether an episode ends or not (as a consequence of the agent's action), np.ndarray of shape (size, )
terminal: whether an episode terminates or not (due to the fixed episode length), np.ndarray of shape (size, )
pscore: probability of the behavior policy choosing the observed action, np.ndarray of shape (size, ) (optional)
action_type: type of action, str (either “discrete” or “continuous”)
n_actions: number of actions (discrete action case), int (optional)
action_dim: dimension of actions (continuous action case), int (optional)
action_keys: dictionary containing the names of actions, dict (optional)
action_meaning: np.ndarray to map action index to actual actions, dict (optional)
state_dim: dimension of states, int (optional)
state_keys: dictionary containing the names of states, dict (optional)
info: info obtained during the interaction of the agent, dict (optional)
behavior_policy: name of the behavior policy, str
dataset_id: dataset id, int
Note that, when pscore is available, the importance sampling-based estimators are applicable to OPE.
The shape of action is (size, n_actions) in discrete action cases, while it is (size, action_dim) in continuous action cases.
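As a rough illustration of the layout above, the following sketch assembles a logged_dataset for the continuous action case from random placeholder arrays. The sizes, the policy name, the array dtypes, and the (size, state_dim) layout of state are our assumptions for this example only (the key list above gives (size, ) for state), so please adapt them to your own logs.

```python
import numpy as np

# Hypothetical sizes, chosen only for illustration.
n_trajectories, step_per_trajectory = 4, 5
size = n_trajectories * step_per_trajectory
state_dim, action_dim = 3, 2

rng = np.random.default_rng(0)  # random placeholders stand in for real logs

# Assumption: every episode stops because of the fixed length, so `done`
# is never set and `terminal` is set on the last step of each trajectory.
done = np.zeros(size, dtype=bool)
terminal = np.zeros(size, dtype=bool)
terminal[step_per_trajectory - 1 :: step_per_trajectory] = True

logged_dataset = {
    "size": size,
    "n_trajectories": n_trajectories,
    "step_per_trajectory": step_per_trajectory,
    "action_type": "continuous",
    "n_actions": None,        # only used in the discrete action case
    "action_dim": action_dim,
    "action_keys": None,      # optional, unavailable here
    "action_meaning": None,   # optional, unavailable here
    "state_dim": state_dim,
    "state_keys": None,       # optional, unavailable here
    "state": rng.normal(size=(size, state_dim)),    # assumed 2-D layout
    "action": rng.normal(size=(size, action_dim)),  # (size, action_dim)
    "reward": rng.normal(size=size),                # (size, )
    "done": done,
    "terminal": terminal,
    "info": None,
    "pscore": None,           # optional; None when propensities were not logged
    "behavior_policy": "logging_policy",  # hypothetical name
    "dataset_id": 0,
}

# Light sanity checks against the shapes listed above.
assert logged_dataset["size"] == n_trajectories * step_per_trajectory
for key in ("reward", "done", "terminal"):
    assert logged_dataset[key].shape == (size,)
assert logged_dataset["action"].shape == (size, action_dim)
```

With pscore set to None as above, the importance sampling-based estimators would not be applicable to this dataset.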
Input Dict
Then, input_dict should contain the following keys for each evaluation policy (in input_dict[evaluation_policy_name]).
For keys marked (optional), please use None when the data is unavailable.
key: [evaluation_policy_name][
evaluation_policy_action,
evaluation_policy_action_dist,
state_action_value_prediction,
initial_state_value_prediction,
state_action_marginal_importance_weight,
state_marginal_importance_weight,
on_policy_policy_value,
gamma,
behavior_policy,
evaluation_policy,
dataset_id,
]
evaluation_policy_action: action chosen by the evaluation policy (continuous action case), np.ndarray of shape (size, )
evaluation_policy_action_dist: action distribution of the evaluation policy (discrete action case), np.ndarray of shape (size, n_actions)
state_action_value_prediction: predicted value for observed state-action pairs, np.ndarray
initial_state_value_prediction: predicted value for observed initial states, np.ndarray of shape (n_trajectories, ) (optional)
state_action_marginal_importance_weight: estimated state-action marginal importance weight, np.ndarray of shape (size, ) (optional)
state_marginal_importance_weight: estimated state-marginal importance weight, np.ndarray of shape (size, ) (optional)
on_policy_policy_value: on-policy policy value of the evaluation policy, float (optional)
gamma: discount factor, float
behavior_policy: name of the behavior policy, str
evaluation_policy: name of the evaluation policy, str
dataset_id: dataset id, int
Note that, when state_action_value_prediction and initial_state_value_prediction are available,
the model-based and hybrid estimators (e.g., DM and DR) are applicable to OPE.
On the other hand, when state_action_marginal_importance_weight and state_marginal_importance_weight are available,
the marginal importance sampling-based estimators are applicable to OPE.
Finally, the assessments of OPE methods become feasible when on_policy_policy_value is available.
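The availability rules above can be written as a small check. The helper below is a hypothetical convenience function, not part of SCOPE-RL; it only mirrors the conditions stated in this guideline by testing which keys are not None.

```python
def applicable_estimator_families(logged_dataset, policy_inputs):
    """Report which estimator families the stated rules would allow.

    `policy_inputs` stands for input_dict[evaluation_policy_name].
    This is an illustrative helper, not a SCOPE-RL function.
    """

    def has(d, key):
        # A key counts as available only if it is present and not None.
        return d.get(key) is not None

    return {
        "importance_sampling": has(logged_dataset, "pscore"),
        "model_based_and_hybrid": (
            has(policy_inputs, "state_action_value_prediction")
            and has(policy_inputs, "initial_state_value_prediction")
        ),
        "marginal_importance_sampling": (
            has(policy_inputs, "state_action_marginal_importance_weight")
            and has(policy_inputs, "state_marginal_importance_weight")
        ),
        "assessment_of_ope": has(policy_inputs, "on_policy_policy_value"),
    }


# Toy inputs: placeholder lists stand in for the real np.ndarray values.
toy_logged_dataset = {"pscore": None}
toy_policy_inputs = {
    "state_action_value_prediction": [[0.1, 0.2]],
    "initial_state_value_prediction": [0.15],
    "state_action_marginal_importance_weight": None,
    "state_marginal_importance_weight": None,
    "on_policy_policy_value": 1.0,
}
flags = applicable_estimator_families(toy_logged_dataset, toy_policy_inputs)
```

In this toy case only the model-based and hybrid family and the assessment of OPE are flagged as available, since pscore and both marginal importance weights are None.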
The shape of state_action_value_prediction is (size, n_actions) in discrete action cases, while it is (size, 2) in continuous action cases.
In the continuous action case, index 0 of axis=1 should contain the predicted values for the actions chosen by the behavior policy, whereas index 1 of axis=1 should contain those of the evaluation policy.
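For the continuous action case, a minimal sketch of one input_dict entry might look as follows. The policy names, sizes, discount factor, and random placeholder values are hypothetical, and setting evaluation_policy_action_dist to None is our assumption, since that key is described for the discrete action case only.

```python
import numpy as np

# Hypothetical sizes, matching a dataset of 4 trajectories with 5 steps each.
size, n_trajectories = 20, 4
rng = np.random.default_rng(0)  # random placeholders stand in for real predictions

# Predicted values for the behavior policy's actions and for the
# evaluation policy's actions, one value per logged step.
q_behavior = rng.normal(size=size)
q_evaluation = rng.normal(size=size)

# Index 0 of axis=1: behavior policy; index 1 of axis=1: evaluation policy.
state_action_value_prediction = np.stack([q_behavior, q_evaluation], axis=1)

input_dict = {
    "my_eval_policy": {  # hypothetical evaluation policy name
        "evaluation_policy_action": rng.normal(size=size),  # (size, ) as listed
        "evaluation_policy_action_dist": None,  # assumption: unused when continuous
        "state_action_value_prediction": state_action_value_prediction,
        "initial_state_value_prediction": rng.normal(size=n_trajectories),
        "state_action_marginal_importance_weight": None,  # optional, unavailable
        "state_marginal_importance_weight": None,         # optional, unavailable
        "on_policy_policy_value": None,                   # optional, unavailable
        "gamma": 0.99,
        "behavior_policy": "logging_policy",  # hypothetical name
        "evaluation_policy": "my_eval_policy",
        "dataset_id": 0,
    }
}

assert input_dict["my_eval_policy"]["state_action_value_prediction"].shape == (size, 2)
```

Stacking along axis=1 keeps the column order explicit, which makes it harder to swap the behavior and evaluation policy predictions by accident.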