scope_rl.dataset.synthetic.SyntheticDataset

class scope_rl.dataset.synthetic.SyntheticDataset(env, max_episode_steps=None, action_meaning=None, action_keys=None, state_keys=None, info_keys=None)

Class for synthetic data generation.

Bases: scope_rl.dataset.BaseDataset

Imported as: scope_rl.dataset.SyntheticDataset

Note

The logged dataset can be used directly for Off-Policy Evaluation (OPE). It is also compatible with d3rlpy (an offline RL library) through the following conversion.

# convert a logged dataset to the d3rlpy format
from d3rlpy.dataset import MDPDataset

d3rlpy_dataset = MDPDataset(
    observations=logged_datasets["state"],
    actions=logged_datasets["action"],
    rewards=logged_datasets["reward"],
    terminals=logged_datasets["done"],
)

Parameters:

env (gym.Env) – Reinforcement learning (RL) environment.
max_episode_steps (int, default=None (> 0)) – Maximum number of timesteps in an episode.
action_meaning (dict) – Dictionary to map discrete action index to a specific action. If action_type is “continuous”, None is recorded.
action_keys (list of str) – Name of each dimension in the action space. If action_type is “discrete”, None is recorded.
state_keys (list of str) – Name of each dimension of the state space.
info_keys (Dict[str, type]) – Dictionary containing the key and type of info components.
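
As a small illustration of action_meaning for a discrete action space: in the example output further down, action_meaning is a length-10 array whose entries coincide with a log-spaced grid from 0.1 to 10. The sketch below rebuilds such a grid with NumPy and uses it to translate logged action indices into the values they stand for. The grid construction and the variable names are assumptions for illustration; they are not taken from the environment's source.

```python
import numpy as np

# Hypothetical stand-in for env.action_meaning: 10 log-spaced values in [0.1, 10].
action_meaning = np.logspace(-1, 1, 10)

# Discrete action indices as they would appear in a logged dataset's "action" array.
logged_actions = np.array([9.0, 1.0, 9.0, 7.0, 0.0])

# The indices are stored as floats in the example output, so cast before indexing.
action_values = action_meaning[logged_actions.astype(int)]
print(action_values)
```

Keeping the mapping as an array makes the lookup a single vectorized indexing operation; a dict keyed by action index would need a per-element lookup instead.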

Examples

Preparation:

# import necessary modules from SCOPE-RL
from scope_rl.dataset import SyntheticDataset
from scope_rl.policy import EpsilonGreedyHead

# import necessary modules from other libraries
import gym
import rtbgym
from d3rlpy.algos import DoubleDQNConfig
from d3rlpy.dataset import create_fifo_replay_buffer
from d3rlpy.algos import ConstantEpsilonGreedy

# initialize environment
env = gym.make("RTBEnv-discrete-v0")

# define (RL) agent (i.e., policy) and train on the environment
ddqn = DoubleDQNConfig().create()
buffer = create_fifo_replay_buffer(
    limit=10000,
    env=env,
)
explorer = ConstantEpsilonGreedy(
    epsilon=0.3,
)
ddqn.fit_online(
    env=env,
    buffer=buffer,
    explorer=explorer,
    n_steps=10000,
    n_steps_per_epoch=1000,
)

# convert ddqn policy to stochastic data collection policy
behavior_policy = EpsilonGreedyHead(
    ddqn,
    n_actions=env.action_space.n,
    epsilon=0.3,
    name="ddqn_epsilon_0.3",
    random_state=12345,
)

Synthetic Dataset Generation:

# initialize dataset class
dataset = SyntheticDataset(
    env=env,
    max_episode_steps=env.step_per_episode,
    action_meaning=env.action_meaning,
    state_keys=env.obs_keys,
    info_keys={
        "search_volume": int,
        "impression": int,
        "click": int,
        "conversion": int,
        "average_bid_price": float,
    },
)

# data collection
logged_datasets = dataset.obtain_episodes(
    behavior_policies=behavior_policy,
    n_trajectories=100,
    obtain_info=True,
    random_state=12345,
)

Output:

>>> logged_datasets

{'size': 700,
'n_trajectories': 100,
'step_per_trajectory': 7,
'action_type': 'discrete',
'action_dim': 10,
'action_keys': None,
'action_meaning': array([ 0.1       ,  0.16681005,  0.27825594,  0.46415888,  0.77426368,
        1.29154967,  2.15443469,  3.59381366,  5.9948425 , 10.        ]),
'state_dim': 7,
'state_keys': ['timestep',
'remaining_budget',
'budget_consumption_rate',
'cost_per_mille_of_impression',
'winning_rate',
'reward',
'adjust_rate'],
'state': array([[0.00000000e+00, 3.00000000e+03, 9.29616093e-01, ...,
     1.83918812e-01, 2.00000000e+00, 4.71334329e-01],
    [1.00000000e+00, 1.91000000e+03, 3.63333333e-01, ...,
    1.00000000e+00, 6.00000000e+00, 1.00000000e+01],
    [2.00000000e+00, 1.91000000e+03, 0.00000000e+00, ...,
    0.00000000e+00, 0.00000000e+00, 1.66810054e-01],
    ...,
    [4.00000000e+00, 9.54000000e+02, 5.40904716e-01, ...,
    1.00000000e+00, 2.00000000e+00, 3.59381366e+00],
    [5.00000000e+00, 6.10000000e+01, 9.36058700e-01, ...,
    9.90049751e-01, 7.00000000e+00, 3.59381366e+00],
    [6.00000000e+00, 6.10000000e+01, 0.00000000e+00, ...,
    0.00000000e+00, 0.00000000e+00, 1.00000000e-01]]),
'action': array([9., 1., 9., ..., 7., 0., 9.]),
'reward': array([ 6.,  0.,  1., ..., 7.,  0.,  0.]),
'done': array([0., 0., 0., ..., 0., 0., 1.]),
'terminal': array([0., 0., 0., ..., 0., 0., 1.]),
'info': {'search_volume': array([201.,   205.,  217., ..., 201.,   191., 186.]),
'impression': array([201.,   0.,  217., ..., 199.,   0.,   8.]),
'click': array([21.,  0.,  24., ...,  18.,  0.,  0.]),
'conversion': array([ 6.,  0.,  1., ..., 7.,  0.,  0.]),
'average_bid_price': array([544.55223881,   8.24390244, 523.24423963, ..., 172.58706468,
           4.2565445 , 458.76344086])},
'pscore': array([0.73, 0.73, 0.73, ..., 0.73, 0.03, 0.73]),
'behavior_policy': 'ddqn_epsilon_0.3',
'dataset_id': 0}
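
In the output above, every per-step array is flat and size equals n_trajectories * step_per_trajectory (700 = 100 * 7), so per-trajectory quantities can be recovered by reshaping. The following is a minimal sketch that assumes steps are stored trajectory by trajectory, as the timestep column of the state array suggests. A small hand-made dictionary stands in for a real logged dataset, and the toy values and the discount factor gamma are assumptions.

```python
import numpy as np

# Toy stand-in for a logged dataset: 2 trajectories of 3 steps each.
logged_dataset = {
    "size": 6,
    "n_trajectories": 2,
    "step_per_trajectory": 3,
    "reward": np.array([1.0, 0.0, 2.0, 0.0, 3.0, 1.0]),
}

n = logged_dataset["n_trajectories"]
t = logged_dataset["step_per_trajectory"]

# One row per trajectory, one column per timestep.
reward = logged_dataset["reward"].reshape(n, t)

gamma = 0.9  # assumed discount factor
discount = gamma ** np.arange(t)

# Discounted return of each trajectory.
returns = (reward * discount).sum(axis=1)
print(returns)
```

Averaging returns over trajectories gives the on-policy value estimate of the behavior policy, which is a common sanity check before running OPE.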

See also

Quickstart

Attributes:

action_keys
action_meaning
info_keys
max_episode_steps
state_keys

Methods

`obtain_episodes`(behavior_policies[, ...])	Roll out the behavior policy and obtain episodes.
`obtain_steps`(behavior_policies[, ...])	Roll out the behavior policy and obtain steps.

obtain_episodes(behavior_policies, n_datasets=1, n_trajectories=10000, step_per_trajectory=None, obtain_info=False, record_unclipped_action=False, path='logged_dataset/', save_relative_path=False, random_state=None)

Roll out the behavior policy and obtain episodes.

Note

This function calls obtain_episodes and saves multiple logged datasets in MultipleLoggedDataset.

Note

This function is intended for environments with fixed-length episodes (the episodic setting).

For non-episodic, stationary settings (such as cartpole or taxi as used in (Liu et al., 2018) and (Uehara et al., 2020)), please also consider using obtain_steps() to generate a logged dataset.

References

Masatoshi Uehara, Jiawei Huang, and Nan Jiang. “Minimax Weight and Q-Function Learning for Off-Policy Evaluation.” 2020.

Qiang Liu, Lihong Li, Ziyang Tang, and Dengyong Zhou. “Breaking the Curse of Horizon: Infinite-Horizon Off-Policy Estimation.” 2018.

Parameters:

behavior_policies (list of BaseHead or BaseHead) – List of RL policies that generate logged data.
n_datasets (int, default=1 (> 0)) – Number of generated (independent) datasets. If the value is more than 1, the method returns MultipleLoggedDataset instead of LoggedDataset.
n_trajectories (int, default=10000 (> 0)) – Number of trajectories to generate by rolling out the behavior policy.
step_per_trajectory (int, default=None (> 0)) – Number of timesteps in a trajectory.
obtain_info (bool, default=False) – Whether to obtain info from the environment or not.
record_unclipped_action (bool, default=False) – Whether to record unclipped action values in the logged dataset. Only applicable when action_type is continuous.
path (str) – Path to the directory. Either absolute or relative path is acceptable.
save_relative_path (bool, default=False) –
Whether to save a relative path. If True, a path relative to the scope-rl directory will be saved. If False, the absolute path will be saved.

Note that this option was added in order to run examples in the documentation properly. Otherwise, the default setting (False) is recommended.
random_state (int, default=None (>= 0)) – Random state.

Returns:

logged_dataset(s) – MultipleLoggedDataset is an instance containing (multiple) logged datasets.

Each logged dataset is accessible by the following command.

logged_dataset_0 = logged_datasets.get(behavior_policy.name, 0)

Each logged dataset consists of the following.

key: [
    size,
    n_trajectories,
    step_per_trajectory,
    action_type,
    n_actions,
    action_dim,
    action_keys,
    action_meaning,
    state_dim,
    state_keys,
    state,
    action,
    reward,
    done,
    terminal,
    info,
    pscore,
    behavior_policy,
    dataset_id,
]

size: int (> 0): Number of steps the dataset records.
n_trajectories: int (> 0): Number of trajectories the dataset records.
step_per_trajectory: int (> 0): Number of timesteps in a trajectory.
action_type: str: Type of the action space. Either “discrete” or “continuous”.
n_actions: int (> 0): Number of actions. If action_type is “continuous”, None is recorded.
action_dim: int (> 0): Dimensions of the action space. If action_type is “discrete”, None is recorded.
action_keys: list of str: Name of each dimension in the action space. If action_type is “discrete”, None is recorded.
action_meaning: dict: Dictionary to map discrete action index to a specific action. If action_type is “continuous”, None is recorded.
state_dim: int (> 0): Dimensions of the state space.
state_keys: list of str: Name of each dimension of the state space.
state: ndarray of shape (size, state_dim): State observed by the behavior policy.
action: ndarray of shape (size, ) or (size, action_dim): Action chosen by the behavior policy.
reward: ndarray of shape (size, ): Reward observed for each (state, action) pair.
done: ndarray of shape (size, ): Whether an episode ends or not.
terminal: ndarray of shape (size, ): Whether an episode reaches the pre-defined maximum steps.
info: dict: Additional feedback from the environment.
pscore: ndarray of shape (size, ): Propensity of the observed action being chosen under the behavior policy (pscore stands for propensity score).
behavior_policy: str: Name of the behavior policy.
dataset_id: int: Id of the logged dataset.
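
To make pscore concrete, consider an epsilon-greedy behavior policy over a finite set of actions: the greedy action is taken with probability 1 - epsilon, and otherwise an action is drawn uniformly at random from all actions. This uniform-exploration form is an assumption about the policy, not a statement about how EpsilonGreedyHead is implemented. With epsilon = 0.3 and 10 actions it yields propensities of 0.73 and 0.03, the two values that appear in the pscore array of the class-level example output. The helper name below is hypothetical.

```python
import numpy as np

def epsilon_greedy_pscore(actions, greedy_actions, n_actions, epsilon):
    """Propensity of each logged action under uniform epsilon-greedy exploration."""
    actions = np.asarray(actions)
    greedy_actions = np.asarray(greedy_actions)
    # Probability mass that uniform exploration puts on any single action.
    explore_prob = epsilon / n_actions
    # The greedy action also receives the remaining 1 - epsilon.
    return np.where(
        actions == greedy_actions,
        1.0 - epsilon + explore_prob,
        explore_prob,
    )

pscore = epsilon_greedy_pscore(
    actions=[9, 1, 0],
    greedy_actions=[9, 1, 9],
    n_actions=10,
    epsilon=0.3,
)
print(pscore)
```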

Return type:

LoggedDataset or MultipleLoggedDataset

obtain_steps(behavior_policies, n_datasets=1, n_trajectories=10000, step_per_trajectory=10, minimum_rollout_length=0, maximum_rollout_length=100, obtain_info=False, obtain_trajectories_from_single_interaction=False, record_unclipped_action=False, path='logged_dataset/', save_relative_path=False, random_state=None)

Roll out the behavior policy and obtain steps.

Note

This function is intended for environments with a stationary state distribution (such as cartpole or taxi as used in (Liu et al., 2018) and (Uehara et al., 2020)).

For the (standard) episodic RL setting, please also consider using obtain_episodes().

References

Masatoshi Uehara, Jiawei Huang, and Nan Jiang. “Minimax Weight and Q-Function Learning for Off-Policy Evaluation.” 2020.

Qiang Liu, Lihong Li, Ziyang Tang, and Dengyong Zhou. “Breaking the Curse of Horizon: Infinite-Horizon Off-Policy Estimation.” 2018.

Parameters:

behavior_policies (list of BaseHead or BaseHead) – List of RL policies that generate logged data.
n_datasets (int, default=1 (> 0)) – Number of generated (independent) datasets. If the value is more than 1, the method returns MultipleLoggedDataset instead of LoggedDataset.
n_trajectories (int, default=10000 (> 0)) – Number of trajectories to generate by rolling out the behavior policy.
step_per_trajectory (int, default=10 (> 0)) – Number of timesteps in a trajectory.
minimum_rollout_length (int, default=0 (>= 0)) – Minimum length of rollout by the behavior policy before generating the logged dataset when working on the infinite horizon setting. This argument is irrelevant when working on the finite horizon setting.
maximum_rollout_length (int, default=100 (>= minimum_rollout_length)) – Maximum length of rollout by the behavior policy before generating the logged dataset when working on the infinite horizon setting. This argument is irrelevant when working on the finite horizon setting.
obtain_info (bool, default=False) – Whether to obtain info from the environment or not.
obtain_trajectories_from_single_interaction (bool, default=False) – Whether to collect all data from a single interaction with the environment. If True, the initial state of trajectory i is the state that follows the last state of trajectory (i-1). If False, the initial state of each trajectory is sampled by rolling out the behavior policy after resetting the environment.
record_unclipped_action (bool, default=False) – Whether to record unclipped action values in the logged dataset. Only applicable when action_type is continuous.
seed_env (bool, default=False) – Whether to set seed on environment or not.
path (str) – Path to the directory. Either absolute or relative path is acceptable.
save_relative_path (bool, default=False) –
Whether to save a relative path. If True, a path relative to the scope-rl directory will be saved. If False, the absolute path will be saved.

Note that this option was added in order to run examples in the documentation properly. Otherwise, the default setting (False) is recommended.
random_state (int, default=None (>= 0)) – Random state.
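
One way to picture minimum_rollout_length and maximum_rollout_length: before each recorded trajectory, the behavior policy interacts with the environment for a burn-in period whose length lies between the two bounds, and only the following step_per_trajectory steps are logged. The sketch below mimics this on a toy random-walk chain. Drawing the burn-in length uniformly at random, the toy dynamics, and the function name are all assumptions for illustration; this is not the library's implementation.

```python
import numpy as np

def collect_steps(n_trajectories, step_per_trajectory,
                  minimum_rollout_length, maximum_rollout_length, random_state):
    """Toy data collection with an unrecorded burn-in before each trajectory."""
    rng = np.random.default_rng(random_state)
    states = np.empty((n_trajectories, step_per_trajectory), dtype=int)
    burn_ins = np.empty(n_trajectories, dtype=int)
    for i in range(n_trajectories):
        state = 0  # stands in for resetting the environment
        # Assumption: burn-in length is uniform on [minimum, maximum], inclusive.
        burn_ins[i] = rng.integers(minimum_rollout_length, maximum_rollout_length + 1)
        for _ in range(burn_ins[i]):
            state += rng.choice([-1, 1])  # unrecorded step
        for t in range(step_per_trajectory):
            states[i, t] = state  # recorded step
            state += rng.choice([-1, 1])
    # Flatten to the (size,) layout used by the logged dataset.
    return states.reshape(-1), burn_ins

states, burn_ins = collect_steps(5, 10, 0, 100, random_state=12345)
print(states.shape, burn_ins)
```

With the burn-in in place, the first recorded state of each trajectory is no longer the reset state, which is the point of these two parameters in the stationary setting.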

Returns:

logged_dataset(s) – MultipleLoggedDataset is an instance containing (multiple) logged datasets.

Each logged dataset is accessible by the following command.

logged_dataset_0 = logged_datasets.get(behavior_policy.name, 0)

Each logged dataset consists of the following.

key: [
    size,
    n_trajectories,
    step_per_trajectory,
    action_type,
    n_actions,
    action_dim,
    action_keys,
    action_meaning,
    state_dim,
    state_keys,
    state,
    action,
    reward,
    done,
    terminal,
    info,
    pscore,
    behavior_policy,
    dataset_id,
]

size: int (> 0): Number of steps the dataset records.
n_trajectories: int (> 0): Number of trajectories the dataset records.
step_per_trajectory: int (> 0): Number of timesteps in a trajectory.
action_type: str: Type of the action space. Either “discrete” or “continuous”.
n_actions: int (> 0): Number of actions. If action_type is “continuous”, None is recorded.
action_dim: int (> 0): Dimensions of the action space. If action_type is “discrete”, None is recorded.
action_keys: list of str: Name of each dimension in the action space. If action_type is “discrete”, None is recorded.
action_meaning: dict: Dictionary to map discrete action index to a specific action. If action_type is “continuous”, None is recorded.
state_dim: int (> 0): Dimensions of the state space.
state_keys: list of str: Name of each dimension of the state space.
state: ndarray of shape (size, state_dim): State observed by the behavior policy.
action: ndarray of shape (size, ) or (size, action_dim): Action chosen by the behavior policy.
reward: ndarray of shape (size, ): Reward observed for each (state, action) pair.
done: ndarray of shape (size, ): Whether an episode ends or not.
terminal: ndarray of shape (size, ): Whether an episode reaches the pre-defined maximum steps.
info: dict: Additional feedback from the environment.
pscore: ndarray of shape (size, ): Propensity of the observed action being chosen under the behavior policy (pscore stands for propensity score).
behavior_policy: str: Name of the behavior policy.
dataset_id: int: Id of the logged dataset.
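
The field descriptions above imply a few shape relations that can be checked before handing a dataset to an estimator. The following is a minimal sketch with a hypothetical helper name and a toy dictionary standing in for a real logged dataset. The relation size = n_trajectories * step_per_trajectory is inferred from the class-level example output rather than stated explicitly, so treat it as an assumption.

```python
import numpy as np

def check_logged_dataset(d):
    """Raise AssertionError if the documented shape relations do not hold."""
    size = d["size"]
    # Inferred relation: every trajectory contributes step_per_trajectory steps.
    assert size == d["n_trajectories"] * d["step_per_trajectory"]
    assert d["state"].shape == (size, d["state_dim"])
    for key in ("reward", "done", "terminal", "pscore"):
        assert d[key].shape == (size,), key
    # action is (size,) for discrete and (size, action_dim) for continuous actions.
    assert d["action"].shape[0] == size
    # An observed action must have positive propensity under the behavior policy.
    assert np.all(d["pscore"] > 0)
    return True

toy = {
    "size": 4,
    "n_trajectories": 2,
    "step_per_trajectory": 2,
    "state_dim": 3,
    "state": np.zeros((4, 3)),
    "action": np.array([0, 1, 1, 0]),
    "reward": np.ones(4),
    "done": np.array([0.0, 1.0, 0.0, 1.0]),
    "terminal": np.array([0.0, 1.0, 0.0, 1.0]),
    "pscore": np.full(4, 0.5),
}
print(check_logged_dataset(toy))
```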

Return type:

LoggedDataset or MultipleLoggedDataset
