scope_rl.policy.orl#
Meta class to handle Offline Reinforcement Learning (ORL).
Classes
TrainCandidatePolicies([fitting_args]) – Class to handle ORL by multiple algorithms simultaneously.
- class scope_rl.policy.orl.TrainCandidatePolicies(fitting_args=None)[source]#
Class to handle ORL with multiple algorithms simultaneously (applicable to both discrete and continuous action cases).
Imported as:
scope_rl.policy.TrainCandidatePolicies
- Parameters:
fitting_args (dict, default=None) – Arguments of fitting function to learn model.
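As a rough illustration of how a `fitting_args` dictionary is typically consumed, the sketch below forwards it as keyword arguments to a fit call. `DummyAlgo` and `fit_all` are hypothetical stand-ins, not SCOPE-RL or d3rlpy objects, and the keys `n_steps` and `n_steps_per_epoch` are assumed only because they appear in the `fit_online` call of the example on this page; check the d3rlpy documentation for the arguments the real fitting function accepts.

```python
from typing import Any, Dict, List, Optional


class DummyAlgo:
    """Hypothetical stand-in for a learning algorithm with a fit() method."""

    def __init__(self) -> None:
        self.received: Dict[str, Any] = {}

    def fit(self, dataset: Any, **kwargs: Any) -> None:
        # Record the keyword arguments instead of actually training.
        self.received = dict(kwargs)


def fit_all(
    algorithms: List[DummyAlgo],
    dataset: Any,
    fitting_args: Optional[Dict[str, Any]] = None,
) -> List[DummyAlgo]:
    """Fit every algorithm on the same dataset with shared keyword arguments."""
    fitting_args = fitting_args or {}  # None is treated as "no extra arguments"
    for algo in algorithms:
        algo.fit(dataset, **fitting_args)
    return algorithms


algos = fit_all(
    [DummyAlgo(), DummyAlgo()],
    dataset=None,  # placeholder; a real logged dataset would go here
    fitting_args={"n_steps": 10000, "n_steps_per_epoch": 1000},
)
```

Sharing one dictionary across all algorithms keeps the training budget identical for every candidate, which makes the resulting policies easier to compare.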
Examples
Preparation:
# import necessary module from SCOPE-RL
from scope_rl.dataset import SyntheticDataset
from scope_rl.policy import TrainCandidatePolicies
from scope_rl.policy import EpsilonGreedyHead, SoftmaxHead

# import necessary module from other libraries
import gym
import rtbgym
from d3rlpy.algos import DoubleDQNConfig
from d3rlpy.dataset import create_fifo_replay_buffer
from d3rlpy.algos import ConstantEpsilonGreedy
from d3rlpy.algos import DiscreteBCQConfig, DiscreteCQLConfig

# initialize environment
env = gym.make("RTBEnv-discrete-v0")

# define (RL) agent (i.e., policy) and train on the environment
ddqn = DoubleDQNConfig().create()
buffer = create_fifo_replay_buffer(
    limit=10000,
    env=env,
)
explorer = ConstantEpsilonGreedy(
    epsilon=0.3,
)
ddqn.fit_online(
    env=env,
    buffer=buffer,
    explorer=explorer,
    n_steps=10000,
    n_steps_per_epoch=1000,
)

# convert ddqn policy to a stochastic data collection policy
behavior_policy = EpsilonGreedyHead(
    ddqn,
    n_actions=env.action_space.n,
    epsilon=0.3,
    name="ddqn_epsilon_0.3",
    random_state=12345,
)

# initialize dataset class
dataset = SyntheticDataset(
    env=env,
    max_episode_steps=env.step_per_episode,
)

# data collection
logged_dataset = dataset.obtain_episodes(
    behavior_policies=behavior_policy,
    n_datasets=2,
    n_trajectories=100,
    random_state=12345,
)
Learning Evaluation Policies:
# base algorithms
bcq = DiscreteBCQConfig().create()
cql = DiscreteCQLConfig().create()
algorithms = [bcq, cql]
algorithms_name = ["bcq", "cql"]

# policy wrappers
policy_wrappers = {
    "eps_01": (
        EpsilonGreedyHead,
        {
            "epsilon": 0.1,
            "n_actions": env.action_space.n,
        }
    ),
    "eps_03": (
        EpsilonGreedyHead,
        {
            "epsilon": 0.3,
            "n_actions": env.action_space.n,
        }
    ),
    "softmax": (
        SoftmaxHead,
        {
            "tau": 1.0,
            "n_actions": env.action_space.n,
        }
    ),
}

# off-policy learning
orl = TrainCandidatePolicies()
eval_policies = orl.obtain_evaluation_policy(
    logged_dataset=logged_dataset,
    algorithms=algorithms,
    algorithms_name=algorithms_name,
    policy_wrappers=policy_wrappers,
    random_state=12345,
)
Output:
>>> [eval_policy.name for eval_policy in eval_policies[behavior_policy.name][0]]
['bcq_eps_01', 'bcq_eps_03', 'bcq_softmax', 'cql_eps_01', 'cql_eps_03', 'cql_softmax']
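The names in the output above appear to combine each entry of `algorithms_name` with each key of `policy_wrappers`. A minimal sketch of that naming pattern, inferred from this example output rather than from the library source:

```python
from itertools import product

algorithms_name = ["bcq", "cql"]
wrapper_names = ["eps_01", "eps_03", "softmax"]  # keys of policy_wrappers

# Assumed convention: "<algorithm name>_<wrapper name>",
# iterating over algorithms in the outer loop and wrappers in the inner loop.
expected_names = [
    f"{algo}_{wrapper}" for algo, wrapper in product(algorithms_name, wrapper_names)
]
print(expected_names)
```

With two algorithms and three wrappers this yields the six candidate names listed in the output above.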
- Attributes:
- fitting_args
Methods
apply_head(base_policies, ...[, random_state]) – Apply policy wrappers to the (deterministic) base policies.
learn_base_policy(logged_dataset, algorithms) – Learn base policy.
obtain_evaluation_policy(logged_dataset, ...) – Obtain evaluation policies given base algorithms and policy wrappers.
- learn_base_policy(logged_dataset, algorithms, behavior_policy_name=None, dataset_id=None, random_state=None)[source]#
Learn base policy.
- Parameters:
logged_dataset (LoggedDataset or MultipleLoggedDataset) –
Logged dataset used to conduct OPE.
key: [ size, n_trajectories, step_per_trajectory, action_type, n_actions, action_dim, action_keys, action_meaning, state_dim, state_keys, state, action, reward, done, terminal, info, pscore, behavior_policy, dataset_id, ]
See also
scope_rl.dataset.SyntheticDataset describes the components of logged_dataset.
algorithms (list of QLearningAlgoBase) – List of algorithms to fit.
behavior_policy_name (str, default=None) – Name of the behavior policy.
dataset_id (int, default=None) – Id of the logged dataset.
random_state (int, default=None (>= 0)) – Random state.
- Returns:
base_policies – List of learned policies.
- Return type:
list of QLearningAlgoBase
- apply_head(base_policies, base_policies_name, policy_wrappers, random_state=None)[source]#
Apply policy wrappers to the (deterministic) base policies.
- Parameters:
base_policies (list of QLearningAlgoBase) – List of base (learned) policies.
base_policies_name (list of str) – List of the name of each base policy.
policy_wrappers (HeadDict) –
Dictionary containing information about policy wrappers. The HeadDict must use the following format.
key: wrapper_name
value: (BaseHead, params_dict)
(Example of HeadDict)
{
    "eps_01":  # wrapper_name
    (
        EpsilonGreedyHead,  # BaseHead
        {
            "epsilon": 0.1,  # params_dict
            "n_actions": 5,
        },
    )
}
Note
random_state, name, and base_policy should be omitted from the params_dict.
See also
scope_rl.policy.head describes various policy wrappers and their parameters.
behavior_policy_name (str, default=None) – Name of the behavior policy.
dataset_id (int, default=None) – Id of the logged dataset.
random_state (int, default=None (>= 0)) – Random state.
- Returns:
evaluation_policies – List of (stochastic) evaluation policies.
- Return type:
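To make the Note about `params_dict` concrete, here is a hedged sketch of how wrappers could be applied to base policies. `SketchHead` and `apply_head_sketch` are hypothetical stand-ins and this is not the library's implementation; the sketch assumes the wrapper receives `base_policy`, `name`, and `random_state` from the caller, which would explain why those keys must not also appear in `params_dict`.

```python
from typing import Any, Dict, List, Optional, Tuple, Type


class SketchHead:
    """Hypothetical stand-in for a policy wrapper (not a SCOPE-RL class)."""

    def __init__(
        self,
        base_policy: Any,
        name: str,
        random_state: Optional[int] = None,
        **params: Any,
    ) -> None:
        self.base_policy = base_policy
        self.name = name
        self.random_state = random_state
        self.params = params


def apply_head_sketch(
    base_policies: List[Any],
    base_policies_name: List[str],
    policy_wrappers: Dict[str, Tuple[Type[SketchHead], Dict[str, Any]]],
    random_state: Optional[int] = None,
) -> List[SketchHead]:
    """Wrap every base policy with every wrapper in the HeadDict."""
    evaluation_policies = []
    for base_policy, base_name in zip(base_policies, base_policies_name):
        for wrapper_name, (head_cls, params_dict) in policy_wrappers.items():
            evaluation_policies.append(
                head_cls(
                    base_policy=base_policy,              # supplied by the caller,
                    name=f"{base_name}_{wrapper_name}",   # so these three keys must
                    random_state=random_state,            # not be in params_dict
                    **params_dict,
                )
            )
    return evaluation_policies


wrappers = {"eps_01": (SketchHead, {"epsilon": 0.1, "n_actions": 5})}
policies = apply_head_sketch(
    ["bcq_model", "cql_model"], ["bcq", "cql"], wrappers, random_state=12345
)
```

In this sketch, a `params_dict` that also contained `name` would make Python raise a `TypeError` for a duplicated keyword argument.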
- obtain_evaluation_policy(logged_dataset, algorithms, algorithms_name, policy_wrappers, behavior_policy_name=None, dataset_id=None, random_state=None)[source]#
Obtain evaluation policies given base algorithms and policy wrappers.
- Parameters:
logged_dataset (LoggedDataset or MultipleLoggedDataset) –
Logged dataset used to conduct OPE.
key: [ size, n_trajectories, step_per_trajectory, action_type, n_actions, action_dim, action_keys, action_meaning, state_dim, state_keys, state, action, reward, done, terminal, info, pscore, ]
See also
scope_rl.dataset.SyntheticDataset describes the components of logged_dataset.
algorithms (list of QLearningAlgoBase) – List of algorithms to fit.
algorithms_name (list of str) – List of the name of each base policy.
policy_wrappers (HeadDict) –
Dictionary containing information about policy wrappers. The HeadDict must use the following format.
key: wrapper_name
value: (BaseHead, params_dict)
(Example of HeadDict)
{
    "eps_01":  # wrapper_name
    (
        EpsilonGreedyHead,  # BaseHead
        {
            "epsilon": 0.1,  # params_dict
            "n_actions": 5,
        },
    )
}
Note
random_state, name, and base_policy should be omitted from the params_dict.
See also
scope_rl.policy.head describes various policy wrappers and their parameters.
behavior_policy_name (str, default=None) – Name of the behavior policy.
dataset_id (int, default=None) – Id of the logged dataset.
random_state (int, default=None (>= 0)) – Random state.
- Returns:
evaluation_policies – List of (stochastic) evaluation policies.
- Return type:
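Because a malformed `policy_wrappers` dictionary would otherwise only surface after the base policies have been trained, it can be useful to check it up front. The helper below is a hypothetical sketch, not part of SCOPE-RL: it checks the `(BaseHead, params_dict)` shape and the reserved keys named in the Note above. `PlaceholderHead` stands in for a real wrapper class.

```python
from typing import Any, Dict

# Keys the Note says must be omitted from params_dict.
RESERVED_KEYS = {"random_state", "name", "base_policy"}


class PlaceholderHead:
    """Hypothetical stand-in for a wrapper class such as EpsilonGreedyHead."""


def validate_head_dict(policy_wrappers: Dict[str, Any]) -> None:
    """Raise ValueError if policy_wrappers does not follow the HeadDict format."""
    for wrapper_name, value in policy_wrappers.items():
        if not (isinstance(value, tuple) and len(value) == 2):
            raise ValueError(
                f"{wrapper_name!r}: value must be a (head_class, params_dict) tuple"
            )
        head_cls, params_dict = value
        if not isinstance(head_cls, type):
            raise ValueError(f"{wrapper_name!r}: first element must be a class")
        if not isinstance(params_dict, dict):
            raise ValueError(f"{wrapper_name!r}: params_dict must be a dict")
        overlap = RESERVED_KEYS & set(params_dict)
        if overlap:
            raise ValueError(
                f"{wrapper_name!r}: remove reserved keys {sorted(overlap)} from params_dict"
            )


# A well-formed HeadDict passes silently.
validate_head_dict({"eps_01": (PlaceholderHead, {"epsilon": 0.1, "n_actions": 5})})
```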