scope_rl.policy.orl

Meta class to handle Offline Reinforcement Learning (ORL).

Classes

TrainCandidatePolicies

Class to handle ORL with multiple algorithms simultaneously.

class scope_rl.policy.orl.TrainCandidatePolicies(fitting_args=None)

Class to handle ORL with multiple algorithms simultaneously (applicable to both discrete and continuous action cases).

Imported as: scope_rl.policy.TrainCandidatePolicies

Parameters: fitting_args (dict, default=None) – Arguments of the fitting function used to learn each model.
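As a rough illustration of what fitting_args is for: a dictionary like this is typically unpacked as keyword arguments into each algorithm's fitting function. The sketch below uses a hypothetical stand-in class (DummyAlgo) and a hypothetical helper (fit_all) rather than real SCOPE-RL or d3rlpy objects. The keys n_steps and n_steps_per_epoch are assumptions for illustration, so check the fitting function of the algorithm you actually use for the keys it accepts.

```python
class DummyAlgo:
    """Hypothetical stand-in for a learning algorithm (not part of SCOPE-RL or d3rlpy)."""

    def __init__(self):
        self.received = None

    def fit(self, dataset, **kwargs):
        # Record the keyword arguments so we can inspect what was forwarded.
        self.received = dict(kwargs)
        return self


def fit_all(algorithms, dataset, fitting_args=None):
    """Fit every algorithm with the same keyword arguments (illustrative only)."""
    fitting_args = {} if fitting_args is None else fitting_args
    return [algo.fit(dataset, **fitting_args) for algo in algorithms]


# Assumed keys, chosen only for illustration.
fitting_args = {"n_steps": 10000, "n_steps_per_epoch": 1000}
fitted = fit_all([DummyAlgo(), DummyAlgo()], dataset=None, fitting_args=fitting_args)
print(fitted[0].received)
```

Sharing one dictionary across all algorithms keeps the training budget comparable between candidates, which is usually what you want when the resulting policies are compared later.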

Examples

Preparation:

# import necessary modules from SCOPE-RL
from scope_rl.dataset import SyntheticDataset
from scope_rl.policy import TrainCandidatePolicies
from scope_rl.policy import EpsilonGreedyHead, SoftmaxHead

# import necessary modules from other libraries
import gym
import rtbgym
from d3rlpy.algos import DoubleDQNConfig
from d3rlpy.dataset import create_fifo_replay_buffer

from d3rlpy.algos import ConstantEpsilonGreedy
from d3rlpy.algos import DiscreteBCQConfig, DiscreteCQLConfig

# initialize environment
env = gym.make("RTBEnv-discrete-v0")

# define (RL) agent (i.e., policy) and train on the environment
ddqn = DoubleDQNConfig().create()
buffer = create_fifo_replay_buffer(
    limit=10000,
    env=env,
)
explorer = ConstantEpsilonGreedy(
    epsilon=0.3,
)
ddqn.fit_online(
    env=env,
    buffer=buffer,
    explorer=explorer,
    n_steps=10000,
    n_steps_per_epoch=1000,
)

# convert ddqn policy to a stochastic data collection policy
behavior_policy = EpsilonGreedyHead(
    ddqn,
    n_actions=env.action_space.n,
    epsilon=0.3,
    name="ddqn_epsilon_0.3",
    random_state=12345,
)

# initialize dataset class
dataset = SyntheticDataset(
    env=env,
    max_episode_steps=env.step_per_episode,
)

# data collection
logged_dataset = dataset.obtain_episodes(
    behavior_policies=behavior_policy,
    n_datasets=2,
    n_trajectories=100,
    random_state=12345,
)

Learning Evaluation Policies:

# base algorithms
bcq = DiscreteBCQConfig().create()
cql = DiscreteCQLConfig().create()
algorithms = [bcq, cql]
algorithms_name = ["bcq", "cql"]

# policy wrappers
policy_wrappers = {
    "eps_01": (
        EpsilonGreedyHead,
        {
            "epsilon": 0.1,
            "n_actions": env.action_space.n,
        }
    ),
    "eps_03": (
        EpsilonGreedyHead,
        {
            "epsilon": 0.3,
            "n_actions": env.action_space.n,
        }
    ),
    "softmax": (
        SoftmaxHead,
        {
            "tau": 1.0,
            "n_actions": env.action_space.n,
        }
    ),
}

# offline RL: learn the base policies and wrap them into evaluation policies
orl = TrainCandidatePolicies()
eval_policies = orl.obtain_evaluation_policy(
    logged_dataset=logged_dataset,
    algorithms=algorithms,
    algorithms_name=algorithms_name,
    policy_wrappers=policy_wrappers,
    random_state=12345,
)

Output:

>>> [eval_policy.name for eval_policy in eval_policies[behavior_policy.name][0]]

['bcq_eps_01', 'bcq_eps_03', 'bcq_softmax', 'cql_eps_01', 'cql_eps_03', 'cql_softmax']
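The names in this output are consistent with pairing every algorithm name with every wrapper name, joined by an underscore, with the algorithms in the outer loop. A minimal sketch of that naming pattern in plain Python follows. The helper candidate_policy_names is hypothetical and only reproduces the pattern seen in the output; it is not the library's internal code.

```python
from itertools import product


def candidate_policy_names(algorithms_name, wrapper_names):
    """Combine each algorithm name with each wrapper name as '<algo>_<wrapper>'."""
    return [
        f"{algo}_{wrapper}"
        for algo, wrapper in product(algorithms_name, wrapper_names)
    ]


names = candidate_policy_names(
    ["bcq", "cql"],
    ["eps_01", "eps_03", "softmax"],
)
print(names)
```

With two algorithms and three wrappers this yields six candidate names, matching the six entries listed in the output.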

Attributes:

fitting_args

Methods

`apply_head`(base_policies, ...[, random_state])	Apply policy wrappers to the (deterministic) base policies.
`learn_base_policy`(logged_dataset, algorithms)	Learn base policy.
`obtain_evaluation_policy`(logged_dataset, ...)	Obtain evaluation policies given base algorithms and policy wrappers.

learn_base_policy(logged_dataset, algorithms, behavior_policy_name=None, dataset_id=None, random_state=None)

Learn base policy.

Parameters:

logged_dataset (LoggedDataset or MultipleLoggedDataset) –

Logged dataset used to conduct OPE.

key: [
    size,
    n_trajectories,
    step_per_trajectory,
    action_type,
    n_actions,
    action_dim,
    action_keys,
    action_meaning,
    state_dim,
    state_keys,
    state,
    action,
    reward,
    done,
    terminal,
    info,
    pscore,
    behavior_policy,
    dataset_id,
]

See also

scope_rl.dataset.SyntheticDataset describes the components of logged_dataset.

algorithms (list of QLearningAlgoBase) – List of algorithms to fit.
behavior_policy_name (str, default=None) – Name of the behavior policy.
dataset_id (int, default=None) – Id of the logged dataset.
random_state (int, default=None (>= 0)) – Random state.

Returns:

base_policies – List of learned policies.

Return type:

list of QLearningAlgoBase

apply_head(base_policies, base_policies_name, policy_wrappers, random_state=None)

Apply policy wrappers to the (deterministic) base policies.

Parameters:

base_policies (list of QLearningAlgoBase) – List of base (learned) policies.
base_policies_name (list of str) – List of the names of the base policies.

policy_wrappers (HeadDict) –

Dictionary containing information about policy wrappers. The HeadDict should have the following format.

key: wrapper_name

value: (BaseHead, params_dict)

(Example of HeadDict)

{
    "eps_01":  # wrapper_name
        (
            EpsilonGreedyHead,  # BaseHead
            {
                "epsilon": 0.1,         # params_dict
                "n_actions": 5,
            },
        )
}

Note

random_state, name, and base_policy should be omitted from the params_dict.

See also

scope_rl.policy.head describes the available policy wrappers and their parameters.

random_state (int, default=None (>= 0)) – Random state.

Returns:

evaluation_policies – List of (stochastic) evaluation policies.

Return type:

list of BaseHead
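The HeadDict format and the note about omitted keys described for policy_wrappers can be checked before calling apply_head. The following is a minimal sketch under stated assumptions: validate_head_dict and DummyHead are hypothetical names introduced here for illustration and are not part of SCOPE-RL.

```python
# Keys that the note above says should be omitted from params_dict.
RESERVED_KEYS = {"random_state", "name", "base_policy"}


class DummyHead:
    """Hypothetical placeholder standing in for a policy wrapper class."""


def validate_head_dict(policy_wrappers):
    """Raise ValueError if an entry is malformed or uses a reserved key."""
    for wrapper_name, entry in policy_wrappers.items():
        if not isinstance(wrapper_name, str):
            raise ValueError(f"wrapper name must be a str, got {wrapper_name!r}")
        if not (isinstance(entry, tuple) and len(entry) == 2):
            raise ValueError(f"{wrapper_name}: value must be (head_class, params_dict)")
        head_class, params = entry
        if not isinstance(head_class, type) or not isinstance(params, dict):
            raise ValueError(f"{wrapper_name}: expected a class and a dict")
        clashes = RESERVED_KEYS.intersection(params)
        if clashes:
            raise ValueError(f"{wrapper_name}: omit {sorted(clashes)} from params_dict")
    return True


# A well-formed entry passes.
ok = validate_head_dict({"eps_01": (DummyHead, {"epsilon": 0.1, "n_actions": 5})})

# An entry that sets a reserved key is rejected.
try:
    validate_head_dict({"bad": (DummyHead, {"epsilon": 0.1, "name": "x"})})
    rejected = False
except ValueError:
    rejected = True

print(ok, rejected)
```

Checking the dictionary up front gives a clear error message at the call site instead of a failure later, when the wrapper is constructed.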

obtain_evaluation_policy(logged_dataset, algorithms, algorithms_name, policy_wrappers, behavior_policy_name=None, dataset_id=None, random_state=None)

Obtain evaluation policies given base algorithms and policy wrappers.

Parameters:

logged_dataset (LoggedDataset or MultipleLoggedDataset) –

Logged dataset used to conduct OPE.

key: [
    size,
    n_trajectories,
    step_per_trajectory,
    action_type,
    n_actions,
    action_dim,
    action_keys,
    action_meaning,
    state_dim,
    state_keys,
    state,
    action,
    reward,
    done,
    terminal,
    info,
    pscore,
]

See also

scope_rl.dataset.SyntheticDataset describes the components of logged_dataset.

algorithms (list of QLearningAlgoBase) – List of algorithms to fit.
algorithms_name (list of str) – List of the names of the base policies.

policy_wrappers (HeadDict) –

Dictionary containing information about policy wrappers. The HeadDict should have the following format.

key: wrapper_name

value: (BaseHead, params_dict)

(Example of HeadDict)

{
    "eps_01":  # wrapper_name
        (
            EpsilonGreedyHead,  # BaseHead
            {
                "epsilon": 0.1,         # params_dict
                "n_actions": 5,
            },
        )
}

Note

random_state, name, and base_policy should be omitted from the params_dict.

See also

scope_rl.policy.head describes the available policy wrappers and their parameters.

behavior_policy_name (str, default=None) – Name of the behavior policy.
dataset_id (int, default=None) – Id of the logged dataset.
random_state (int, default=None (>= 0)) – Random state.

Returns:

evaluation_policies – List of (stochastic) evaluation policies.

Return type:

list of BaseHead