scope_rl.dataset.synthetic
Class to handle synthetic dataset generation.
Classes
SyntheticDataset – Class for synthetic data generation.
- class scope_rl.dataset.synthetic.SyntheticDataset(env, max_episode_steps=None, action_meaning=None, action_keys=None, state_keys=None, info_keys=None)
Class for synthetic data generation.
Bases: scope_rl.dataset.BaseDataset
Imported as: scope_rl.dataset.SyntheticDataset
Note
The logged dataset can be used directly for Off-Policy Evaluation (OPE). It is also compatible with d3rlpy (an offline RL library) via the following command.
from d3rlpy.dataset import MDPDataset

d3rlpy_dataset = MDPDataset(
    observations=logged_datasets["state"],
    actions=logged_datasets["action"],
    rewards=logged_datasets["reward"],
    terminals=logged_datasets["done"],
)
See also
(external) d3rlpy’s documentation about MDPDataset
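The key mapping used in the command above can be captured in a small helper. The following is a minimal sketch: `to_mdp_kwargs` is a hypothetical name that belongs to neither SCOPE-RL nor d3rlpy, and plain lists stand in for the NumPy arrays of a real logged dataset.

```python
def to_mdp_kwargs(logged_dataset):
    """Map logged-dataset keys to MDPDataset keyword names (hypothetical helper)."""
    key_map = {
        "observations": "state",
        "actions": "action",
        "rewards": "reward",
        "terminals": "done",
    }
    return {mdp_key: logged_dataset[key] for mdp_key, key in key_map.items()}


# toy stand-in for a logged dataset that records two steps
toy_logged_dataset = {
    "state": [[0.0, 1.0], [1.0, 0.5]],
    "action": [1, 0],
    "reward": [0.0, 1.0],
    "done": [0.0, 1.0],
}
mdp_kwargs = to_mdp_kwargs(toy_logged_dataset)
# with d3rlpy installed, the result could be passed as MDPDataset(**mdp_kwargs)
```

Keeping the mapping in one place makes it harder to confuse the "done" and "terminal" keys when converting several datasets.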
- Parameters:
env (gym.Env) – Reinforcement learning (RL) environment.
max_episode_steps (int, default=None (> 0)) – Maximum number of timesteps in an episode.
action_meaning (dict) – Dictionary to map discrete action index to a specific action. If action_type is “continuous”, None is recorded.
action_keys (list of str) – Name of each dimension in the action space. If action_type is “discrete”, None is recorded.
state_keys (list of str) – Name of each dimension of the state space.
info_keys (Dict[str, type]) – Dictionary containing the key and type of info components.
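As an illustration of what a key-to-type dictionary such as info_keys can express, the sketch below gathers per-step info dictionaries into per-key lists and casts each value to its declared type. `collect_info` and the toy inputs are hypothetical, and SCOPE-RL's internal handling of info_keys may differ.

```python
def collect_info(step_infos, info_keys):
    """Gather per-step info dicts into per-key lists, cast to the declared types."""
    collected = {key: [] for key in info_keys}
    for info in step_infos:
        for key, type_ in info_keys.items():
            collected[key].append(type_(info[key]))
    return collected


# declared keys and types, in the same form as the info_keys parameter
info_keys = {"click": int, "average_bid_price": float}

# toy per-step info; keys that are not declared ("unused") are ignored
step_infos = [
    {"click": 21.0, "average_bid_price": "544.5", "unused": 1},
    {"click": 0.0, "average_bid_price": 8, "unused": 2},
]
info = collect_info(step_infos, info_keys)
```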
Examples
Preparation:
# import necessary module from SCOPE-RL
from scope_rl.dataset import SyntheticDataset
from scope_rl.policy import EpsilonGreedyHead

# import necessary module from other libraries
import gym
import rtbgym
from d3rlpy.algos import DoubleDQNConfig
from d3rlpy.dataset import create_fifo_replay_buffer
from d3rlpy.algos import ConstantEpsilonGreedy

# initialize environment
env = gym.make("RTBEnv-discrete-v0")

# define (RL) agent (i.e., policy) and train on the environment
ddqn = DoubleDQNConfig().create()
buffer = create_fifo_replay_buffer(
    limit=10000,
    env=env,
)
explorer = ConstantEpsilonGreedy(
    epsilon=0.3,
)
ddqn.fit_online(
    env=env,
    buffer=buffer,
    explorer=explorer,
    n_steps=10000,
    n_steps_per_epoch=1000,
)

# convert ddqn policy to stochastic data collection policy
behavior_policy = EpsilonGreedyHead(
    ddqn,
    n_actions=env.action_space.n,
    epsilon=0.3,
    name="ddqn_epsilon_0.3",
    random_state=12345,
)
Synthetic Dataset Generation:
# initialize dataset class
dataset = SyntheticDataset(
    env=env,
    max_episode_steps=env.step_per_episode,
    action_meaning=env.action_meaning,
    state_keys=env.obs_keys,
    info_keys={
        "search_volume": int,
        "impression": int,
        "click": int,
        "conversion": int,
        "average_bid_price": float,
    },
)

# data collection
logged_datasets = dataset.obtain_episodes(
    behavior_policies=behavior_policy,
    n_trajectories=100,
    obtain_info=True,
    random_state=12345,
)
Output:
>>> logged_datasets
{'size': 700,
 'n_trajectories': 100,
 'step_per_trajectory': 7,
 'action_type': 'discrete',
 'action_dim': 10,
 'action_keys': None,
 'action_meaning': array([ 0.1 , 0.16681005, 0.27825594, 0.46415888, 0.77426368,
        1.29154967, 2.15443469, 3.59381366, 5.9948425 , 10. ]),
 'state_dim': 7,
 'state_keys': ['timestep',
  'remaining_budget',
  'budget_consumption_rate',
  'cost_per_mille_of_impression',
  'winning_rate',
  'reward',
  'adjust_rate'],
 'state': array([[0.00000000e+00, 3.00000000e+03, 9.29616093e-01, ..., 1.83918812e-01, 2.00000000e+00, 4.71334329e-01],
        [1.00000000e+00, 1.91000000e+03, 3.63333333e-01, ..., 1.00000000e+00, 6.00000000e+00, 1.00000000e+01],
        [2.00000000e+00, 1.91000000e+03, 0.00000000e+00, ..., 0.00000000e+00, 0.00000000e+00, 1.66810054e-01],
        ...,
        [4.00000000e+00, 9.54000000e+02, 5.40904716e-01, ..., 1.00000000e+00, 2.00000000e+00, 3.59381366e+00],
        [5.00000000e+00, 6.10000000e+01, 9.36058700e-01, ..., 9.90049751e-01, 7.00000000e+00, 3.59381366e+00],
        [6.00000000e+00, 6.10000000e+01, 0.00000000e+00, ..., 0.00000000e+00, 0.00000000e+00, 1.00000000e-01]]),
 'action': array([9., 1., 9., ..., 7., 0., 9.]),
 'reward': array([ 6., 0., 1., ..., 7., 0., 0.]),
 'done': array([0., 0., 0., ..., 0., 0., 1.]),
 'terminal': array([0., 0., 0., ..., 0., 0., 1.]),
 'info': {'search_volume': array([201., 205., 217., ..., 201., 191., 186.]),
  'impression': array([201., 0., 217., ..., 199., 0., 8.]),
  'click': array([21., 0., 24., ..., 18., 0., 0.]),
  'conversion': array([ 6., 0., 1., ..., 7., 0., 0.]),
  'average_bid_price': array([544.55223881, 8.24390244, 523.24423963, ..., 172.58706468, 4.2565445 , 458.76344086])},
 'pscore': array([0.73, 0.73, 0.73, ..., 0.73, 0.03, 0.73]),
 'behavior_policy': 'ddqn_epsilon_0.3',
 'dataset_id': 0}
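The pscore values 0.73 and 0.03 in the output are consistent with the standard epsilon-greedy rule for epsilon=0.3 and 10 actions, and the size of 700 equals n_trajectories times step_per_trajectory. The sketch below reproduces that arithmetic. It assumes EpsilonGreedyHead follows the textbook definition, and `epsilon_greedy_pscore` and the greedy action index 9 are hypothetical choices made for illustration.

```python
def epsilon_greedy_pscore(action, greedy_action, epsilon, n_actions):
    """Probability that a textbook epsilon-greedy policy picks `action`."""
    uniform_prob = epsilon / n_actions  # mass from uniform exploration
    if action == greedy_action:
        return (1.0 - epsilon) + uniform_prob
    return uniform_prob


greedy = epsilon_greedy_pscore(action=9, greedy_action=9, epsilon=0.3, n_actions=10)
explore = epsilon_greedy_pscore(action=0, greedy_action=9, epsilon=0.3, n_actions=10)
print(round(greedy, 2), round(explore, 2))  # 0.73 0.03

# number of recorded steps: n_trajectories * step_per_trajectory
size = 100 * 7
```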
- Attributes:
- action_keys
- action_meaning
- info_keys
- max_episode_steps
- state_keys
Methods
obtain_episodes(behavior_policies[, ...]) – Rollout the behavior policy and obtain episodes.
obtain_steps(behavior_policies[, ...]) – Rollout the behavior policy and obtain steps.
- obtain_episodes(behavior_policies, n_datasets=1, n_trajectories=10000, step_per_trajectory=None, obtain_info=False, record_unclipped_action=False, path='logged_dataset/', save_relative_path=False, random_state=None)
Rollout the behavior policy and obtain episodes.
Note
This function calls obtain_episodes and saves multiple logged datasets in MultipleLoggedDataset.
Note
This function is intended for environments with a fixed episode length (episodic setting).
For the non-episodic, stationary setting (such as cartpole or taxi as used in (Liu et al., 2018) and (Uehara et al., 2020)), please also consider using obtain_steps() to generate a logged dataset.
References
Masatoshi Uehara, Jiawei Huang, and Nan Jiang. “Minimax Weight and Q-Function Learning for Off-Policy Evaluation.” 2020.
Qiang Liu, Lihong Li, Ziyang Tang, and Dengyong Zhou. “Breaking the Curse of Horizon: Infinite-Horizon Off-Policy Estimation.” 2018.
- Parameters:
behavior_policies (list of BaseHead or BaseHead) – List of RL policies that generate logged data.
n_datasets (int, default=1 (> 0)) – Number of generated (independent) datasets. If the value is more than 1, the method returns MultipleLoggedDataset instead of LoggedDataset.
n_trajectories (int, default=10000 (> 0)) – Number of trajectories to generate by rolling out the behavior policy.
step_per_trajectory (int, default=None (> 0)) – Number of timesteps in a trajectory.
obtain_info (bool, default=False) – Whether to obtain info from the environment.
record_unclipped_action (bool, default=False) – Whether to record unclipped action values in the logged dataset. Only applicable when action_type is continuous.
path (str) – Path to the directory. Either absolute or relative path is acceptable.
save_relative_path (bool, default=False) – Whether to save a relative path. If True, a path relative to the scope-rl directory will be saved. If False, the absolute path will be saved.
Note that this option was added so that the examples in the documentation run properly. Otherwise, the default setting (False) is recommended.
random_state (int, default=None (>= 0)) – Random state.
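The rollout that these parameters control can be pictured with a toy example. This is a minimal sketch under stated assumptions, not SCOPE-RL's implementation: `ToyEnv` and `rollout_episodes` are hypothetical, the behavior policy is uniform random, and plain lists replace NumPy arrays.

```python
import random


class ToyEnv:
    """Hypothetical fixed-length environment: the state is just the timestep."""

    def __init__(self, step_per_episode=3):
        self.step_per_episode = step_per_episode
        self.t = 0

    def reset(self):
        self.t = 0
        return [float(self.t)]

    def step(self, action):
        self.t += 1
        reward = float(action)  # toy reward: the action index itself
        done = self.t >= self.step_per_episode
        return [float(self.t)], reward, done


def rollout_episodes(env, n_actions, n_trajectories, step_per_trajectory, random_state):
    """Roll out a uniform-random behavior policy and log flat per-step records."""
    rng = random.Random(random_state)
    logged = {"state": [], "action": [], "reward": [], "done": [], "pscore": []}
    for _ in range(n_trajectories):
        state = env.reset()
        for _ in range(step_per_trajectory):
            action = rng.randrange(n_actions)
            next_state, reward, done = env.step(action)
            logged["state"].append(state)
            logged["action"].append(action)
            logged["reward"].append(reward)
            logged["done"].append(float(done))
            logged["pscore"].append(1.0 / n_actions)  # propensity of a uniform policy
            state = next_state
    logged["size"] = n_trajectories * step_per_trajectory
    return logged


env = ToyEnv(step_per_episode=3)
logged = rollout_episodes(
    env, n_actions=2, n_trajectories=4, step_per_trajectory=3, random_state=12345
)
```

Fixing random_state makes the toy rollout reproducible, which mirrors the role of the random_state parameter above.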
- Returns:
logged_dataset(s) – MultipleLoggedDataset is an instance containing (multiple) logged datasets.
Each logged dataset is accessible by the following command.
logged_dataset_0 = logged_datasets.get(behavior_policy.name, 0)
Each logged dataset consists of the following.
key: [ size, n_trajectories, step_per_trajectory, action_type, n_actions, action_dim, action_keys, action_meaning, state_dim, state_keys, state, action, reward, done, terminal, info, pscore, behavior_policy, dataset_id, ]
- size: int (> 0)
Number of steps the dataset records.
- n_trajectories: int (> 0)
Number of trajectories the dataset records.
- step_per_trajectory: int (> 0)
Number of timesteps in a trajectory.
- action_type: str
Type of the action space. Either “discrete” or “continuous”.
- n_actions: int (> 0)
Number of actions. If action_type is “continuous”, None is recorded.
- action_dim: int (> 0)
Dimensions of the action space. If action_type is “discrete”, None is recorded.
- action_keys: list of str
Name of each dimension in the action space. If action_type is “discrete”, None is recorded.
- action_meaning: dict
Dictionary to map discrete action index to a specific action. If action_type is “continuous”, None is recorded.
- state_dim: int (> 0)
Dimensions of the state space.
- state_keys: list of str
Name of each dimension of the state space.
- state: ndarray of shape (size, state_dim)
State observed by the behavior policy.
- action: ndarray of shape (size, ) or (size, action_dim)
Action chosen by the behavior policy.
- reward: ndarray of shape (size, )
Reward observed for each (state, action) pair.
- done: ndarray of shape (size, )
Whether an episode ends or not.
- terminal: ndarray of shape (size, )
Whether an episode reaches the pre-defined maximum steps.
- info: dict
Additional feedback from the environment.
- pscore: ndarray of shape (size, )
Propensity of the observed action being chosen under the behavior policy (pscore stands for propensity score).
- behavior_policy: str
Name of the behavior policy.
- dataset_id: int
Id of the logged dataset.
- Return type:
LoggedDataset or MultipleLoggedDataset
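Because size equals n_trajectories times step_per_trajectory in the episodic setting, the flat per-step arrays can be split back into trajectories. The sketch below assumes trajectories are stored contiguously, which matches the timestep column of the example output but is not stated explicitly here. `per_trajectory_returns` is a hypothetical helper, and lists stand in for ndarrays.

```python
def per_trajectory_returns(rewards, n_trajectories, step_per_trajectory, gamma=1.0):
    """Split a flat reward sequence into trajectories and sum discounted rewards."""
    if len(rewards) != n_trajectories * step_per_trajectory:
        raise ValueError("size must equal n_trajectories * step_per_trajectory")
    returns = []
    for i in range(n_trajectories):
        start = i * step_per_trajectory
        trajectory = rewards[start:start + step_per_trajectory]
        returns.append(sum(gamma**t * r for t, r in enumerate(trajectory)))
    return returns


# two toy trajectories of three steps each
rewards = [6.0, 0.0, 1.0, 7.0, 0.0, 0.0]
returns = per_trajectory_returns(rewards, n_trajectories=2, step_per_trajectory=3)
print(returns)  # [7.0, 7.0]
```

With NumPy arrays, the same split can be written as a reshape to (n_trajectories, step_per_trajectory) under the same contiguity assumption.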
- obtain_steps(behavior_policies, n_datasets=1, n_trajectories=10000, step_per_trajectory=10, minimum_rollout_length=0, maximum_rollout_length=100, obtain_info=False, obtain_trajectories_from_single_interaction=False, record_unclipped_action=False, path='logged_dataset/', save_relative_path=False, random_state=None)
Rollout the behavior policy and obtain steps.
Note
This function is intended for environments with a stationary state distribution (such as cartpole or taxi as used in (Liu et al., 2018) and (Uehara et al., 2020)).
For the (standard) episodic RL setting, please also consider using obtain_episodes().
References
Masatoshi Uehara, Jiawei Huang, and Nan Jiang. “Minimax Weight and Q-Function Learning for Off-Policy Evaluation.” 2020.
Qiang Liu, Lihong Li, Ziyang Tang, and Dengyong Zhou. “Breaking the Curse of Horizon: Infinite-Horizon Off-Policy Estimation.” 2018.
- Parameters:
behavior_policies (list of BaseHead or BaseHead) – List of RL policies that generate logged data.
n_datasets (int, default=1 (> 0)) – Number of generated (independent) datasets. If the value is more than 1, the method returns MultipleLoggedDataset instead of LoggedDataset.
n_trajectories (int, default=10000 (> 0)) – Number of trajectories to generate by rolling out the behavior policy.
step_per_trajectory (int, default=10 (> 0)) – Number of timesteps in a trajectory.
minimum_rollout_length (int, default=0 (>= 0)) – Minimum length of rollout by the behavior policy before generating the logged dataset when working on the infinite horizon setting. This argument is irrelevant when working on the finite horizon setting.
maximum_rollout_length (int, default=100 (>= minimum_rollout_length)) – Maximum length of rollout by the behavior policy before generating the logged dataset when working on the infinite horizon setting. This argument is irrelevant when working on the finite horizon setting.
obtain_info (bool, default=False) – Whether to obtain info from the environment.
obtain_trajectories_from_single_interaction (bool, default=False) – Whether to collect whole data from a single trajectory. If True, the initial state of trajectory i is the next state of the trajectory (i-1)’s last state. If False, the initial state will be sampled by rolling out the behavior policy after resetting the environment.
record_unclipped_action (bool, default=False) – Whether to record unclipped action values in the logged dataset. Only applicable when action_type is continuous.
seed_env (bool, default=False) – Whether to set seed on environment or not.
path (str) – Path to the directory. Either absolute or relative path is acceptable.
save_relative_path (bool, default=False) – Whether to save a relative path. If True, a path relative to the scope-rl directory will be saved. If False, the absolute path will be saved.
Note that this option was added so that the examples in the documentation run properly. Otherwise, the default setting (False) is recommended.
random_state (int, default=None (>= 0)) – Random state.
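The roles of minimum_rollout_length, maximum_rollout_length, and obtain_trajectories_from_single_interaction can be sketched on a toy non-episodic environment. This is one plausible reading of the parameter descriptions above, not SCOPE-RL's implementation: `ToyChain` and `rollout_steps` are hypothetical, the policy is uniform random, and whether the upper bound is inclusive in SCOPE-RL is an assumption.

```python
import random


class ToyChain:
    """Hypothetical non-episodic environment: a random walk on the integers."""

    def __init__(self):
        self.position = 0

    def reset(self):
        self.position = 0
        return self.position

    def step(self, action):
        self.position += 1 if action == 1 else -1
        return self.position


def rollout_steps(
    env,
    rng,
    n_trajectories,
    step_per_trajectory,
    minimum_rollout_length=0,
    maximum_rollout_length=100,
    from_single_interaction=False,
):
    """Log the states of fixed-length trajectories under a uniform-random policy."""
    trajectories = []
    state = env.reset()
    for _ in range(n_trajectories):
        if not from_single_interaction:
            # assumed behavior: reset, then burn in for a random number of steps
            state = env.reset()
            burn_in = rng.randint(minimum_rollout_length, maximum_rollout_length)
            for _ in range(burn_in):
                state = env.step(rng.randrange(2))
        trajectory = []
        for _ in range(step_per_trajectory):
            trajectory.append(state)
            state = env.step(rng.randrange(2))
        trajectories.append(trajectory)
    return trajectories


independent = rollout_steps(
    ToyChain(), random.Random(0), n_trajectories=3, step_per_trajectory=4
)
chained = rollout_steps(
    ToyChain(),
    random.Random(0),
    n_trajectories=3,
    step_per_trajectory=4,
    from_single_interaction=True,
)
```

In the chained case, each trajectory starts at the state that follows the previous trajectory's last recorded state, so the whole dataset comes from one uninterrupted interaction.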
- Returns:
logged_dataset(s) – MultipleLoggedDataset is an instance containing (multiple) logged datasets.
Each logged dataset is accessible by the following command.
logged_dataset_0 = logged_datasets.get(behavior_policy.name, 0)
Each logged dataset consists of the following.
key: [ size, n_trajectories, step_per_trajectory, action_type, n_actions, action_dim, action_keys, action_meaning, state_dim, state_keys, state, action, reward, done, terminal, info, pscore, behavior_policy, dataset_id, ]
- size: int (> 0)
Number of steps the dataset records.
- n_trajectories: int (> 0)
Number of trajectories the dataset records.
- step_per_trajectory: int (> 0)
Number of timesteps in a trajectory.
- action_type: str
Type of the action space. Either “discrete” or “continuous”.
- n_actions: int (> 0)
Number of actions. If action_type is “continuous”, None is recorded.
- action_dim: int (> 0)
Dimensions of the action space. If action_type is “discrete”, None is recorded.
- action_keys: list of str
Name of each dimension in the action space. If action_type is “discrete”, None is recorded.
- action_meaning: dict
Dictionary to map discrete action index to a specific action. If action_type is “continuous”, None is recorded.
- state_dim: int (> 0)
Dimensions of the state space.
- state_keys: list of str
Name of each dimension of the state space.
- state: ndarray of shape (size, state_dim)
State observed by the behavior policy.
- action: ndarray of shape (size, ) or (size, action_dim)
Action chosen by the behavior policy.
- reward: ndarray of shape (size, )
Reward observed for each (state, action) pair.
- done: ndarray of shape (size, )
Whether an episode ends or not.
- terminal: ndarray of shape (size, )
Whether an episode reaches the pre-defined maximum steps.
- info: dict
Additional feedback from the environment.
- pscore: ndarray of shape (size, )
Propensity of the observed action being chosen under the behavior policy (pscore stands for propensity score).
- behavior_policy: str
Name of the behavior policy.
- dataset_id: int
Id of the logged dataset.
- Return type:
LoggedDataset or MultipleLoggedDataset
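To show how the pscore and reward entries feed into OPE, the sketch below computes a trajectory-wise importance sampling estimate from flat per-step records. It is a generic, undiscounted textbook estimator rather than SCOPE-RL's own OPE module. `trajectory_wise_is` is a hypothetical helper, and `eval_probs`, the probability of each logged action under an evaluation policy, is an assumed input that the logged dataset does not contain.

```python
import math


def trajectory_wise_is(rewards, pscores, eval_probs, n_trajectories, step_per_trajectory):
    """Trajectory-wise importance sampling estimate of a policy's value."""
    estimates = []
    for i in range(n_trajectories):
        start = i * step_per_trajectory
        stop = start + step_per_trajectory
        # importance weight: product of per-step probability ratios
        weight = math.prod(
            e / b for e, b in zip(eval_probs[start:stop], pscores[start:stop])
        )
        estimates.append(weight * sum(rewards[start:stop]))
    return sum(estimates) / n_trajectories


# two toy trajectories of two steps each
rewards = [1.0, 1.0, 3.0, 0.0]
pscores = [0.5, 0.5, 0.5, 0.5]      # behavior policy propensities
eval_probs = [1.0, 0.5, 0.0, 0.5]   # hypothetical evaluation policy probabilities
estimate = trajectory_wise_is(
    rewards, pscores, eval_probs, n_trajectories=2, step_per_trajectory=2
)
print(estimate)  # 2.0
```

The second trajectory contributes nothing because the evaluation policy never takes its first logged action, which drives its importance weight to zero.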