scope_rl.policy.orl

Meta class to handle Offline Reinforcement Learning (ORL).

Classes

TrainCandidatePolicies

Class to handle ORL with multiple algorithms simultaneously.

class scope_rl.policy.orl.TrainCandidatePolicies(fitting_args=None)

Class to handle ORL with multiple algorithms simultaneously. Applicable to both discrete and continuous action cases.

Imported as: scope_rl.policy.TrainCandidatePolicies

Parameters:

fitting_args (dict, default=None) – Arguments passed to the fitting function when learning the model.
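
As a minimal sketch, fitting_args can be written as a plain dictionary of arguments. The key names below (n_steps, n_steps_per_epoch) are assumptions borrowed from the fit_online call in the Examples section; which keys are actually accepted depends on the fitting function of the underlying algorithm.

```python
# Hypothetical fitting arguments. The key names are assumptions and must match
# the arguments accepted by the fitting function that is being called.
fitting_args = {
    "n_steps": 10000,
    "n_steps_per_epoch": 1000,
}

# The dictionary would then be supplied at construction time, e.g.:
# orl = TrainCandidatePolicies(fitting_args=fitting_args)
```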

Examples

Preparation:

# import necessary modules from SCOPE-RL
from scope_rl.dataset import SyntheticDataset
from scope_rl.policy import TrainCandidatePolicies
from scope_rl.policy import EpsilonGreedyHead, SoftmaxHead

# import necessary modules from other libraries
import gym
import rtbgym
from d3rlpy.algos import DoubleDQNConfig
from d3rlpy.dataset import create_fifo_replay_buffer

from d3rlpy.algos import ConstantEpsilonGreedy
from d3rlpy.algos import DiscreteBCQConfig, DiscreteCQLConfig

# initialize environment
env = gym.make("RTBEnv-discrete-v0")

# define (RL) agent (i.e., policy) and train on the environment
ddqn = DoubleDQNConfig().create()
buffer = create_fifo_replay_buffer(
    limit=10000,
    env=env,
)
explorer = ConstantEpsilonGreedy(
    epsilon=0.3,
)
ddqn.fit_online(
    env=env,
    buffer=buffer,
    explorer=explorer,
    n_steps=10000,
    n_steps_per_epoch=1000,
)

# convert ddqn policy to a stochastic data collection policy
behavior_policy = EpsilonGreedyHead(
    ddqn,
    n_actions=env.action_space.n,
    epsilon=0.3,
    name="ddqn_epsilon_0.3",
    random_state=12345,
)

# initialize dataset class
dataset = SyntheticDataset(
    env=env,
    max_episode_steps=env.step_per_episode,
)

# data collection
logged_dataset = dataset.obtain_episodes(
    behavior_policies=behavior_policy,
    n_datasets=2,
    n_trajectories=100,
    random_state=12345,
)

Learning Evaluation Policies:

# base algorithms
bcq = DiscreteBCQConfig().create()
cql = DiscreteCQLConfig().create()
algorithms = [bcq, cql]
algorithms_name = ["bcq", "cql"]

# policy wrappers
policy_wrappers = {
    "eps_01": (
        EpsilonGreedyHead,
        {
            "epsilon": 0.1,
            "n_actions": env.action_space.n,
        }
    ),
    "eps_03": (
        EpsilonGreedyHead,
        {
            "epsilon": 0.3,
            "n_actions": env.action_space.n,
        }
    ),
    "softmax": (
        SoftmaxHead,
        {
            "tau": 1.0,
            "n_actions": env.action_space.n,
        }
    ),
}

# offline reinforcement learning (ORL)
orl = TrainCandidatePolicies()
eval_policies = orl.obtain_evaluation_policy(
    logged_dataset=logged_dataset,
    algorithms=algorithms,
    algorithms_name=algorithms_name,
    policy_wrappers=policy_wrappers,
    random_state=12345,
)

Output:

>>> [eval_policy.name for eval_policy in eval_policies[behavior_policy.name][0]]

['bcq_eps_01', 'bcq_eps_03', 'bcq_softmax', 'cql_eps_01', 'cql_eps_03', 'cql_softmax']
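
The names in this output are consistent with joining each algorithm name and each wrapper name with an underscore. The sketch below reproduces that ordering with the standard library only. The naming rule is inferred from the output shown here and is not a statement about the library's internals.

```python
from itertools import product

algorithms_name = ["bcq", "cql"]
wrapper_names = ["eps_01", "eps_03", "softmax"]  # keys of policy_wrappers

# Algorithms vary slowest and wrappers fastest, matching the output above.
expected_names = [
    f"{algo}_{head}" for algo, head in product(algorithms_name, wrapper_names)
]
print(expected_names)
```

With two algorithms and three wrappers, this yields six candidate policy names per logged dataset.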

Attributes:

fitting_args

Methods

apply_head(base_policies, ...[, random_state])

Apply policy wrappers to the (deterministic) base policies.

learn_base_policy(logged_dataset, algorithms)

Learn base policy.

obtain_evaluation_policy(logged_dataset, ...)

Obtain evaluation policies given base algorithms and policy wrappers.

learn_base_policy(logged_dataset, algorithms, behavior_policy_name=None, dataset_id=None, random_state=None)

Learn base policy.

Parameters:
  • logged_dataset (LoggedDataset or MultipleLoggedDataset) –

    Logged dataset used to learn the base policies.

    key: [
        size,
        n_trajectories,
        step_per_trajectory,
        action_type,
        n_actions,
        action_dim,
        action_keys,
        action_meaning,
        state_dim,
        state_keys,
        state,
        action,
        reward,
        done,
        terminal,
        info,
        pscore,
        behavior_policy,
        dataset_id,
    ]
    

    See also

    scope_rl.dataset.SyntheticDataset describes the components of logged_dataset.

  • algorithms (list of QLearningAlgoBase) – List of algorithms to fit.

  • behavior_policy_name (str, default=None) – Name of the behavior policy.

  • dataset_id (int, default=None) – Id of the logged dataset.

  • random_state (int, default=None (>= 0)) – Random state.

Returns:

base_policies – List of learned policies.

Return type:

list of QLearningAlgoBase

apply_head(base_policies, base_policies_name, policy_wrappers, random_state=None)

Apply policy wrappers to the (deterministic) base policies.

Parameters:
  • base_policies (list of QLearningAlgoBase) – List of base (learned) policies.

  • base_policies_name (list of str) – List of names, one for each base policy.

  • policy_wrappers (HeadDict) –

    Dictionary containing information about policy wrappers. The HeadDict should have the following format.

    key: wrapper_name
    
    value: (BaseHead, params_dict)
    

    (Example of HeadDict)

    {
        "eps_01":  # wrapper_name
            (
                EpsilonGreedyHead,  # BaseHead
                {
                    "epsilon": 0.1,         # params_dict
                    "n_actions": 5,
                },
            )
    }
    

    Note

    random_state, name, and base_policy should be omitted from the params_dict.

    See also

    scope_rl.policy.head describes the available policy wrappers and their parameters.

  • behavior_policy_name (str, default=None) – Name of the behavior policy.

  • dataset_id (int, default=None) – Id of the logged dataset.

  • random_state (int, default=None (>= 0)) – Random state.

Returns:

evaluation_policies – List of (stochastic) evaluation policies.

Return type:

list of BaseHead
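
To illustrate what turning a deterministic base policy into a stochastic one means, here is a minimal sketch of textbook epsilon-greedy and softmax action distributions over discrete actions. The helper names epsilon_greedy_probs and softmax_probs are hypothetical, and the formulas are the standard definitions rather than a copy of the EpsilonGreedyHead or SoftmaxHead implementations.

```python
import math
from typing import List


def epsilon_greedy_probs(greedy_action: int, n_actions: int, epsilon: float) -> List[float]:
    """With probability epsilon act uniformly at random, otherwise take the greedy action."""
    probs = [epsilon / n_actions] * n_actions
    probs[greedy_action] += 1.0 - epsilon
    return probs


def softmax_probs(q_values: List[float], tau: float) -> List[float]:
    """Softmax over Q-values with temperature tau."""
    max_q = max(q_values)  # subtract the maximum for numerical stability
    exps = [math.exp((q - max_q) / tau) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical inputs: greedy action 2 out of 5 actions, and three Q-values.
print(epsilon_greedy_probs(greedy_action=2, n_actions=5, epsilon=0.1))
print(softmax_probs([1.0, 2.0, 3.0], tau=1.0))
```

In both cases every action receives non-zero probability, which is what makes the resulting policy stochastic.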

obtain_evaluation_policy(logged_dataset, algorithms, algorithms_name, policy_wrappers, behavior_policy_name=None, dataset_id=None, random_state=None)

Obtain evaluation policies given base algorithms and policy wrappers.

Parameters:
  • logged_dataset (LoggedDataset or MultipleLoggedDataset) –

    Logged dataset used to learn the evaluation policies.

    key: [
        size,
        n_trajectories,
        step_per_trajectory,
        action_type,
        n_actions,
        action_dim,
        action_keys,
        action_meaning,
        state_dim,
        state_keys,
        state,
        action,
        reward,
        done,
        terminal,
        info,
        pscore,
    ]
    

    See also

    scope_rl.dataset.SyntheticDataset describes the components of logged_dataset.

  • algorithms (list of QLearningAlgoBase) – List of algorithms to fit.

  • algorithms_name (list of str) – List of names, one for each base policy.

  • policy_wrappers (HeadDict) –

    Dictionary containing information about policy wrappers. The HeadDict should have the following format.

    key: wrapper_name
    
    value: (BaseHead, params_dict)
    

    (Example of HeadDict)

    {
        "eps_01":  # wrapper_name
            (
                EpsilonGreedyHead,  # BaseHead
                {
                    "epsilon": 0.1,         # params_dict
                    "n_actions": 5,
                },
            )
    }
    

    Note

    random_state, name, and base_policy should be omitted from the params_dict.

    See also

    scope_rl.policy.head describes the available policy wrappers and their parameters.

  • behavior_policy_name (str, default=None) – Name of the behavior policy.

  • dataset_id (int, default=None) – Id of the logged dataset.

  • random_state (int, default=None (>= 0)) – Random state.

Returns:

evaluation_policies – List of (stochastic) evaluation policies.

Return type:

list of BaseHead
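
The HeadDict format and the note about omitted keys can be checked before calling this method. The sketch below is one possible check, assuming only what is stated above: each value is a (BaseHead, params_dict) pair, and random_state, name, and base_policy are left out of params_dict. The helper check_head_dict and the DummyHead class are hypothetical and not part of SCOPE-RL.

```python
from typing import Any, Dict, Tuple

# Keys that the note above says should be omitted from params_dict.
RESERVED_KEYS = {"random_state", "name", "base_policy"}


def check_head_dict(policy_wrappers: Dict[str, Tuple[Any, Dict[str, Any]]]) -> None:
    """Raise ValueError if an entry is not (head_class, params_dict) or uses a reserved key."""
    for wrapper_name, value in policy_wrappers.items():
        if not (isinstance(value, tuple) and len(value) == 2):
            raise ValueError(f"{wrapper_name}: expected (BaseHead, params_dict)")
        _, params = value
        if not isinstance(params, dict):
            raise ValueError(f"{wrapper_name}: params_dict must be a dict")
        clash = RESERVED_KEYS & set(params)
        if clash:
            raise ValueError(f"{wrapper_name}: omit {sorted(clash)} from params_dict")


class DummyHead:
    """Stand-in for a BaseHead subclass such as EpsilonGreedyHead."""


# A well-formed HeadDict passes silently.
check_head_dict({"eps_01": (DummyHead, {"epsilon": 0.1, "n_actions": 5})})
```

Running such a check up front surfaces a malformed wrapper entry before any algorithm has been fitted.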