Supported Implementation#

Our implementation aims to streamline the data collection, (offline) policy learning, and off-policy evaluation and selection (OPE/OPS) procedures. We rely on d3rlpy's implementation of the learning algorithms and provide several useful tools to streamline the whole offline RL workflow.

Synthetic Dataset Generation#

SyntheticDataset is an easy-to-use data collection module, which is compatible with any OpenAI Gym and Gymnasium-like RL environment.

It takes an RL environment as input to instantiate the class.

# initialize the dataset class
from scope_rl.dataset import SyntheticDataset
dataset = SyntheticDataset(
    env=env,
    max_episode_steps=env.step_per_episode,
)

Then, it collects logged data by a behavior policy (i.e., data collection policy) as follows.

# collect logged data by a behavior policy
logged_dataset = dataset.obtain_episodes(
    behavior_policies=behavior_policy,  # BaseHead
    n_trajectories=10000,
    random_state=random_state,
)

Tip

How to obtain a behavior policy?

Our SyntheticDataset class accepts an instance of BaseHead as a behavior policy.

A policy head converts a deterministic policy learned by d3rlpy into either a deterministic or stochastic policy, with functions to calculate propensity scores (i.e., action choice probabilities).

For example, EpsilonGreedyHead converts a discrete-action policy to an epsilon-greedy policy as follows.

from scope_rl.policy import EpsilonGreedyHead
behavior_policy = EpsilonGreedyHead(
    base_policy,  # QLearningAlgoBase of d3rlpy
    n_actions=env.action_space.n,
    epsilon=0.3,
    name="eps_03",
    random_state=random_state,
)

GaussianHead converts a continuous-action policy to a stochastic policy as follows.

from scope_rl.policy import GaussianHead
behavior_policy = GaussianHead(
    base_policy,  # QLearningAlgoBase of d3rlpy
    sigma=1.0,
    name="sigma_10",
    random_state=random_state,
)

See also

For detailed descriptions and additional supported implementations, please refer to the Policy Wrapper section below.

How to customize the dataset class?

To customize the dataset class, use BaseDataset. The obtained logged_dataset should contain the following keys for API consistency.

key: [
    size,
    n_trajectories,
    step_per_trajectory,
    action_type,
    n_actions,
    action_dim,
    action_keys,
    action_meaning,
    state_dim,
    state_keys,
    state,
    action,
    reward,
    done,
    terminal,
    info,
    pscore,
    behavior_policy,
    dataset_id,
]
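
For illustration, a custom dataset class might assemble this dictionary roughly as in the following sketch. The array shapes, dtypes, and placeholder values below are assumptions for illustration only; please refer to the API reference linked below for the exact requirements.

import numpy as np

# a minimal sketch of a discrete-action logged_dataset;
# trajectories are flattened into arrays of length size = n_trajectories * step_per_trajectory
n_trajectories, step_per_trajectory, state_dim, n_actions = 100, 10, 4, 3
size = n_trajectories * step_per_trajectory

logged_dataset = {
    "size": size,
    "n_trajectories": n_trajectories,
    "step_per_trajectory": step_per_trajectory,
    "action_type": "discrete",
    "n_actions": n_actions,
    "action_dim": None,          # not used for discrete actions in this sketch
    "action_keys": None,
    "action_meaning": None,
    "state_dim": state_dim,
    "state_keys": None,
    "state": np.zeros((size, state_dim)),
    "action": np.zeros(size, dtype=int),
    "reward": np.zeros(size),
    "done": np.zeros(size),
    "terminal": np.zeros(size),
    "info": None,
    "pscore": np.ones(size),     # action choice probabilities of the behavior policy
    "behavior_policy": "uniform_random",
    "dataset_id": 0,
}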

Note

logged_dataset can be used for OPE even if action_keys, action_meaning, state_keys, and info are not provided. For API consistency, simply set these keys to None when they are unnecessary.

Moreover, offline RL algorithms, FQE (model-based OPE), and marginal OPE estimators can also work without pscore.

See also

The API reference of BaseDataset and the Guidelines for Preparing Real-World Datasets explain the meaning of each key in detail.

How to handle multiple logged datasets at once?

MultipleLoggedDataset enables us to smoothly handle multiple logged datasets.

Specifically, MultipleLoggedDataset saves the paths to each logged dataset and makes each dataset accessible through the following command.

logged_dataset_ = multiple_logged_dataset.get(behavior_policy_name=behavior_policy.name, dataset_id=0)

There are two ways to obtain MultipleLoggedDataset.

The first way is to directly get MultipleLoggedDataset as the output of SyntheticDataset as follows.

synthetic_dataset = SyntheticDataset(
    env=env,
    max_episode_steps=env.step_per_episode,
    ...,
)
multiple_logged_dataset_1 = synthetic_dataset.obtain_episodes(
    behavior_policies=[behavior_policy_1, behavior_policy_2],  # when multiple behavior policies are given, MultipleLoggedDataset is returned
    n_datasets=1,
    n_trajectories=10000,
    ...,
)
multiple_logged_dataset_2 = synthetic_dataset.obtain_episodes(
    behavior_policies=behavior_policy,
    n_datasets=5,                       # when n_datasets > 1, MultipleLoggedDataset is returned
    n_trajectories=10000,
    ...,
)

The second way is to define MultipleLoggedDataset manually as follows.

from scope_rl.utils import MultipleLoggedDataset

multiple_logged_dataset = MultipleLoggedDataset(
    action_type="discrete",
    path="logged_dataset/",  # either absolute or relative path
)

for behavior_policy in behavior_policies:
    single_logged_dataset = dataset.obtain_episodes(
        behavior_policies=behavior_policy,
        n_trajectories=10000,
        ...,
    )

    # add a single_logged_dataset to multiple_logged_dataset
    multiple_logged_dataset.add(
        single_logged_dataset,
        behavior_policy_name=behavior_policy.name,
        dataset_id=0,
    )

How to collect data in a non-episodic setting?

When the goal is to evaluate the policy under a stationary distribution \(d^{\pi}(s)\) rather than in an episodic setting (e.g., the cartpole and taxi environments used in [12, 13]), we need to collect data from the stationary distribution.

For this, please consider using obtain_steps instead of obtain_episodes as follows.

logged_dataset = dataset.obtain_steps(
    behavior_policies=behavior_policy,
    n_trajectories=10000,
    ...,
)

Offline Learning#

Once we obtain the logged dataset, it’s time to learn a new policy in an offline manner. For this, d3rlpy provides various offline RL algorithms that work as follows.

# import modules
from d3rlpy.dataset import MDPDataset
from d3rlpy.algos import DiscreteCQLConfig as CQLConfig
from d3rlpy.models.encoders import VectorEncoderFactory
from d3rlpy.models.q_functions import MeanQFunctionFactory

# convert a (single) logged dataset to the d3rlpy dataset
offlinerl_dataset = MDPDataset(
    observations=logged_dataset["state"],
    actions=logged_dataset["action"],
    rewards=logged_dataset["reward"],
    terminals=logged_dataset["done"],
)

# define an offline RL algorithm
cql = CQLConfig(
    encoder_factory=VectorEncoderFactory(hidden_units=[30, 30]),
    q_func_factory=MeanQFunctionFactory(),
).create()

# fit algorithm in an offline manner
cql.fit(
    offlinerl_dataset,
    n_steps=10000,
)

While the above procedure is already simple and easy to use, we also provide TrainCandidatePolicies as a meta class to further streamline the offline RL procedure across multiple algorithms.

# prepare offline RL algorithms
cql_b1 = CQLConfig(
    encoder_factory=VectorEncoderFactory(hidden_units=[30, 30]),
    q_func_factory=MeanQFunctionFactory(),
).create()

cql_b2 = CQLConfig(
    encoder_factory=VectorEncoderFactory(hidden_units=[100]),
    q_func_factory=MeanQFunctionFactory(),
).create()

cql_b3 = CQLConfig(
    encoder_factory=VectorEncoderFactory(hidden_units=[50, 10]),
    q_func_factory=MeanQFunctionFactory(),
).create()

# off-policy learning
from scope_rl.policy import TrainCandidatePolicies
opl = TrainCandidatePolicies(
    fitting_args={"n_steps": 10000},
)
base_policies = opl.learn_base_policy(
    logged_dataset=logged_dataset,
    algorithms=[cql_b1, cql_b2, cql_b3],
    random_state=random_state,
)

Using TrainCandidatePolicies, we can also convert the deterministic base policies to stochastic (evaluation) policies as follows.

# policy wrapper
from scope_rl.policy import EpsilonGreedyHead, SoftmaxHead

policy_wrappers = {
    "eps_00": (
        EpsilonGreedyHead, {
            "epsilon": 0.0,
            "n_actions": env.action_space.n,
        }
    ),
    "eps_03": (
        EpsilonGreedyHead, {
            "epsilon": 0.3,
            "n_actions": env.action_space.n,
        }
    ),
    "eps_07": (
        EpsilonGreedyHead, {
            "epsilon": 0.7,
            "n_actions": env.action_space.n,
        }
    ),
    "softmax": (
        SoftmaxHead, {
            "tau": 1.0,
            "n_actions": env.action_space.n,
        }
    )
}

# apply policy wrappers and convert deterministic base policies into stochastic evaluation policies
eval_policies = opl.apply_head(
    base_policies=base_policies,
    base_policies_name=["cql_b1", "cql_b2", "cql_b3"],
    policy_wrappers=policy_wrappers,
    random_state=random_state,
)

We describe the policy wrappers in detail in the next section.

Also, it is possible to learn the base policy and apply policy wrappers at the same time as follows.

eval_policies = opl.obtain_evaluation_policy(
    logged_dataset=logged_dataset,
    algorithms=[cql_b1, cql_b2, cql_b3],
    algorithms_name=["cql_b1", "cql_b2", "cql_b3"],
    policy_wrappers=policy_wrappers,
    random_state=random_state,
)

The obtained evaluation policies are the following (both algorithms and policy wrappers are enumerated).

>>> [eval_policy.name for eval_policy in eval_policies[0]]

['cql_b1_eps_00', 'cql_b1_eps_03', 'cql_b1_eps_07', 'cql_b1_softmax',
 'cql_b2_eps_00', 'cql_b2_eps_03', 'cql_b2_eps_07', 'cql_b2_softmax',
 'cql_b3_eps_00', 'cql_b3_eps_03', 'cql_b3_eps_07', 'cql_b3_softmax']

Tip

How to handle OPL with multiple logged datasets?

TrainCandidatePolicies is particularly useful when fitting offline RL algorithms on multiple logged datasets.

We can apply the same algorithms and policy wrappers across multiple datasets with the following command.

eval_policies = opl.obtain_evaluation_policy(
    logged_dataset=logged_dataset,                   # MultipleLoggedDataset
    algorithms=[cql_b1, cql_b2, cql_b3],             # single list
    algorithms_name=["cql_b1", "cql_b2", "cql_b3"],  # single list
    policy_wrappers=policy_wrappers,                 # single dict
    random_state=random_state,
)

The evaluation policies are returned in a nested list.

The other functions (i.e., learn_base_policy and apply_head) also work in a manner similar to the above examples.
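
For reference, the nested list can be traversed roughly as follows (a sketch that assumes the outer index corresponds to each logged dataset, consistent with the eval_policies[0] access shown above).

# inspect the evaluation policies learned on each logged dataset
for dataset_position, policies in enumerate(eval_policies):
    print(dataset_position, [policy.name for policy in policies])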

Policy Wrapper#

Here, we describe some useful wrapper tools to convert a d3rlpy policy into (stochastic) behavior and evaluation policies.

  • Discrete: EpsilonGreedyHead, SoftmaxHead

  • Continuous: GaussianHead, TruncatedGaussianHead, ContinuousEvalHead

  • Both (Online): OnlineHead

Tip

How to customize the policy head?

To customize the policy head, use BaseHead. Basically, the policy head has two roles.

  1. Enabling online interactions.

  2. Converting a deterministic policy to a stochastic policy.

For the first purpose, we already provide the following four functions in the base class:

  • predict_online

  • predict_value_online

  • sample_action_online

  • sample_action_and_output_pscore_online

These functions are already implemented in the base class, so you can use or override them as needed for online interactions. OnlineHead is also useful for this purpose.

Next, for the second purpose, you can customize how to convert a deterministic policy to a stochastic policy using the following functions.

  • sample_action_and_output_pscore

  • calc_action_choice_probability

  • calc_pscore_given_action
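
As a rough illustration, a custom head for a discrete-action policy might look like the sketch below. The class name, constructor arguments, and method signatures are assumptions for illustration, and BaseHead may require additional methods to be implemented; please check the API reference of BaseHead for the exact interface.

import numpy as np
from scope_rl.policy import BaseHead

class UniformMixtureHead(BaseHead):
    """Hypothetical head that mixes a deterministic discrete policy with uniform exploration."""

    def __init__(self, base_policy, n_actions, epsilon, name, random_state=None):
        self.base_policy = base_policy
        self.n_actions = n_actions
        self.epsilon = epsilon
        self.name = name
        self.random_ = np.random.RandomState(random_state)

    def calc_action_choice_probability(self, x):
        # place (1 - epsilon) on the greedy action and spread epsilon uniformly
        greedy_action = self.base_policy.predict(x)
        pscore = np.full((len(x), self.n_actions), self.epsilon / self.n_actions)
        pscore[np.arange(len(x)), greedy_action] += 1 - self.epsilon
        return pscore

    def calc_pscore_given_action(self, x, action):
        # propensity score of the observed actions
        return self.calc_action_choice_probability(x)[np.arange(len(x)), action]

    def sample_action_and_output_pscore(self, x):
        # sample stochastic actions and return them with their propensity scores
        pscore = self.calc_action_choice_probability(x)
        action = np.array([self.random_.choice(self.n_actions, p=p) for p in pscore])
        return action, pscore[np.arange(len(x)), action]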

DiscreteHead#

This module transforms a deterministic policy into a stochastic one in discrete action cases. Specifically, we have the following two options.

  • EpsilonGreedyHead: \(\pi(a | s) := (1 - \epsilon) \cdot \pi_{\mathrm{det}}(a | s) + \epsilon / |\mathcal{A}|\).

  • SoftmaxHead: \(\pi(a | s) := \displaystyle \frac{\exp(Q^{(\pi_{\mathrm{det}})}(s, a) / \tau)}{\sum_{a' \in \mathcal{A}} \exp(Q^{(\pi_{\mathrm{det}})}(s, a') / \tau)}\).

Note that \(\epsilon \in [0, 1]\) is the degree of exploration and \(\tau\) is the temperature hyperparameter. EpsilonGreedyHead is also used to construct a deterministic evaluation policy in OPE/OPS by setting \(\epsilon=0.0\).
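
For example, SoftmaxHead can be applied in the same way as EpsilonGreedyHead above (a minimal sketch; the policy name is arbitrary).

from scope_rl.policy import SoftmaxHead

evaluation_policy = SoftmaxHead(
    base_policy,  # QLearningAlgoBase of d3rlpy
    n_actions=env.action_space.n,
    tau=1.0,
    name="softmax_10",
    random_state=random_state,
)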

ContinuousHead#

This module transforms a deterministic policy into a stochastic one in continuous action cases.

  • GaussianHead: \(\pi(a | s) := \mathrm{Normal}(\pi_{\mathrm{det}}(s), \sigma)\).

  • TruncatedGaussianHead: \(\pi(a | s) := \mathrm{TruncatedNormal}(\pi_{\mathrm{det}}(s), \sigma)\).

We also provide a wrapper class that keeps the base policy deterministic for use in OPE.

  • ContinuousEvalHead: \(\pi(s) = \pi_{\mathrm{det}}(s)\).
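
For instance, ContinuousEvalHead wraps a deterministic continuous-action policy for OPE as in the following sketch (unlike GaussianHead, no noise parameter is needed; the exact constructor arguments may differ).

from scope_rl.policy import ContinuousEvalHead

evaluation_policy = ContinuousEvalHead(
    base_policy,  # QLearningAlgoBase of d3rlpy
    name="ddpg",
)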

OnlineHead#

This module enables online interaction with the policy (note that d3rlpy's policies are primarily designed for batch interaction).

  • OnlineHead
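
A minimal usage sketch is shown below; the constructor arguments and the step-wise call of predict_online for a single observation are assumptions based on the BaseHead functions listed above.

from scope_rl.policy import OnlineHead

online_policy = OnlineHead(
    base_policy,  # QLearningAlgoBase of d3rlpy
    name="online_policy",
)

# interact with the environment step by step (Gymnasium-style reset)
obs, info = env.reset()
action = online_policy.predict_online(obs)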

Online Evaluation#

Finally, we provide a series of functions for online performance evaluation in scope_rl/ope/online.py.

(Rollout)

  • rollout_policy_online

(Statistics)

  • calc_on_policy_policy_value

  • calc_on_policy_policy_value_interval

  • calc_on_policy_variance

  • calc_on_policy_conditional_value_at_risk

  • calc_on_policy_policy_interquartile_range

  • calc_on_policy_cumulative_distribution_function

(Visualization)

  • visualize_on_policy_policy_value

  • visualize_on_policy_cumulative_distribution_function

  • visualize_on_policy_conditional_value_at_risk

  • visualize_on_policy_interquartile_range
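
For example, the on-policy value of an evaluation policy can be estimated via online rollouts roughly as follows (the argument names are assumptions; please check scope_rl/ope/online.py for the exact signature).

from scope_rl.ope.online import calc_on_policy_policy_value

on_policy_value = calc_on_policy_policy_value(
    env=env,
    policy=evaluation_policy,  # a policy wrapped with a policy head
    n_trajectories=100,        # number of online rollouts (assumed argument name)
    random_state=random_state,
)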
