Supported Implementation#

Our implementation aims to streamline the data collection, (offline) policy learning, and off-policy evaluation and selection (OPE/OPS) procedures. We rely on d3rlpy's implementation of the learning algorithms and provide several useful tools to streamline the whole offline RL workflow.

Synthetic Dataset Generation#

SyntheticDataset is an easy-to-use data collection module, which is compatible with any OpenAI Gym and Gymnasium-like RL environment.

It takes an RL environment as input to instantiate the class.

# initialize the dataset class
from scope_rl.dataset import SyntheticDataset
dataset = SyntheticDataset(
    env=env,
    max_episode_steps=env.step_per_episode,
)

Then, it collects logged data by a behavior policy (i.e., data collection policy) as follows.

# collect logged data by a behavior policy
logged_dataset = dataset.obtain_episodes(
    behavior_policies=behavior_policy,  # BaseHead
    n_trajectories=10000,
    random_state=random_state,
)

Tip

How to obtain a behavior policy?

Our SyntheticDataset class accepts an instance of BaseHead as a behavior policy.

A policy head converts a deterministic policy learned by d3rlpy into either a deterministic or stochastic policy, with functions to calculate propensity scores (i.e., action choice probabilities).

For example, EpsilonGreedyHead converts a discrete-action policy to an epsilon-greedy policy as follows.

from scope_rl.policy import EpsilonGreedyHead
behavior_policy = EpsilonGreedyHead(
    base_policy,  # QLearningAlgoBase of d3rlpy
    n_actions=env.action_space.n,
    epsilon=0.3,
    name="eps_03",
    random_state=random_state,
)

GaussianHead converts a continuous-action policy to a stochastic policy as follows.

from scope_rl.policy import GaussianHead
behavior_policy = GaussianHead(
    base_policy,  # QLearningAlgoBase of d3rlpy
    sigma=1.0,
    name="sigma_10",
    random_state=random_state,
)

See also

For detailed descriptions and additional supported implementations, please refer to the Policy Wrapper section below.

How to customize the dataset class?

To customize the dataset class, use BaseDataset. The obtained logged_dataset should contain the following keys for API consistency.

key: [
    size,
    n_trajectories,
    step_per_trajectory,
    action_type,
    n_actions,
    action_dim,
    action_keys,
    action_meaning,
    state_dim,
    state_keys,
    state,
    action,
    reward,
    done,
    terminal,
    info,
    pscore,
    behavior_policy,
    dataset_id,
]
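
For illustration, a custom dataset class might assemble this dictionary roughly as in the following sketch. The array shapes, dtypes, and placeholder values below are assumptions for illustration only; please refer to the API reference linked below for the exact requirements.

import numpy as np

# a minimal sketch of a discrete-action logged_dataset;
# trajectories are flattened into arrays of length size = n_trajectories * step_per_trajectory
n_trajectories, step_per_trajectory, state_dim, n_actions = 100, 10, 4, 3
size = n_trajectories * step_per_trajectory

logged_dataset = {
    "size": size,
    "n_trajectories": n_trajectories,
    "step_per_trajectory": step_per_trajectory,
    "action_type": "discrete",
    "n_actions": n_actions,
    "action_dim": None,          # not used for discrete actions in this sketch
    "action_keys": None,
    "action_meaning": None,
    "state_dim": state_dim,
    "state_keys": None,
    "state": np.zeros((size, state_dim)),
    "action": np.zeros(size, dtype=int),
    "reward": np.zeros(size),
    "done": np.zeros(size),
    "terminal": np.zeros(size),
    "info": None,
    "pscore": np.ones(size),     # action choice probabilities of the behavior policy
    "behavior_policy": "uniform_random",
    "dataset_id": 0,
}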

Note

logged_dataset can be used for OPE even if action_keys, action_meaning, state_keys, and info are not provided. For API consistency, simply set these keys to None when they are unnecessary.

Moreover, offline RL algorithms, FQE (model-based OPE), and marginal OPE estimators can also work without pscore.

See also

The API reference of BaseDataset and the Guidelines for Preparing Real-World Datasets explain the meaning of each key in detail.

How to handle multiple logged datasets at once?

MultipleLoggedDataset enables us to smoothly handle multiple logged datasets.

Specifically, MultipleLoggedDataset saves the paths to each logged dataset and makes each dataset accessible through the following command.

logged_dataset_ = multiple_logged_dataset.get(behavior_policy_name=behavior_policy.name, dataset_id=0)

There are two ways to obtain MultipleLoggedDataset.

The first way is to directly get MultipleLoggedDataset as the output of SyntheticDataset as follows.

synthetic_dataset = SyntheticDataset(
    env=env,
    max_episode_steps=env.step_per_episode,
    ...,
)
multiple_logged_dataset_1 = synthetic_dataset.obtain_episodes(
    behavior_policies=[behavior_policy_1, behavior_policy_2],  # when multiple behavior policies are given, MultipleLoggedDataset is returned
    n_datasets=1,
    n_trajectories=10000,
    ...,
)
multiple_logged_dataset_2 = synthetic_dataset.obtain_episodes(
    behavior_policies=behavior_policy,
    n_datasets=5,                       # when n_datasets > 1, MultipleLoggedDataset is returned
    n_trajectories=10000,
    ...,
)

The second way is to define MultipleLoggedDataset manually as follows.

from scope_rl.utils import MultipleLoggedDataset

multiple_logged_dataset = MultipleLoggedDataset(
    action_type="discrete",
    path="logged_dataset/",  # either absolute or relative path
)

for behavior_policy in behavior_policies:
    single_logged_dataset = dataset.obtain_episodes(
        behavior_policies=behavior_policy,
        n_trajectories=10000,
        ...,
    )

    # add a single_logged_dataset to multiple_logged_dataset
    multiple_logged_dataset.add(
        single_logged_dataset,
        behavior_policy_name=behavior_policy.name,
        dataset_id=0,
    )

How to collect data in a non-episodic setting?

When the goal is to evaluate the policy under a stationary distribution \(d^{\pi}(s)\) rather than in an episodic setting (e.g., the cartpole and taxi environments used in [12, 13]), we need to collect data from the stationary distribution.

For this, please consider using obtain_steps instead of obtain_episodes as follows.

logged_dataset = dataset.obtain_steps(
    behavior_policies=behavior_policy,
    n_trajectories=10000,
    ...,
)

Offline Learning#

Once we obtain the logged dataset, it’s time to learn a new policy in an offline manner. For this, d3rlpy provides various offline RL algorithms that work as follows.

# import modules
from d3rlpy.dataset import MDPDataset
from d3rlpy.algos import DiscreteCQLConfig as CQLConfig
from d3rlpy.models.encoders import VectorEncoderFactory
from d3rlpy.models.q_functions import MeanQFunctionFactory

# convert a (single) logged dataset to the d3rlpy dataset
offlinerl_dataset = MDPDataset(
    observations=logged_dataset["state"],
    actions=logged_dataset["action"],
    rewards=logged_dataset["reward"],
    terminals=logged_dataset["done"],
)

# define an offline RL algorithm
cql = CQLConfig(
    encoder_factory=VectorEncoderFactory(hidden_units=[30, 30]),
    q_func_factory=MeanQFunctionFactory(),
).create()

# fit algorithm in an offline manner
cql.fit(
    offlinerl_dataset,
    n_steps=10000,
)

While the above procedure is already simple and easy to use, we also provide TrainCandidatePolicies as a meta class to further streamline the offline RL procedure across multiple algorithms.

# prepare offline RL algorithms
cql_b1 = CQLConfig(
    encoder_factory=VectorEncoderFactory(hidden_units=[30, 30]),
    q_func_factory=MeanQFunctionFactory(),
).create()

cql_b2 = CQLConfig(
    encoder_factory=VectorEncoderFactory(hidden_units=[100]),
    q_func_factory=MeanQFunctionFactory(),
).create()

cql_b3 = CQLConfig(
    encoder_factory=VectorEncoderFactory(hidden_units=[50, 10]),
    q_func_factory=MeanQFunctionFactory(),
).create()

# off-policy learning
from scope_rl.policy import TrainCandidatePolicies
opl = TrainCandidatePolicies(
    fitting_args={"n_steps": 10000},
)
base_policies = opl.learn_base_policy(
    logged_dataset=logged_dataset,
    algorithms=[cql_b1, cql_b2, cql_b3],
    random_state=random_state,
)

Using TrainCandidatePolicies, we can also convert the deterministic base policies to stochastic (evaluation) policies as follows.

# policy wrapper
from scope_rl.policy import EpsilonGreedyHead, SoftmaxHead

policy_wrappers = {
    "eps_00": (
        EpsilonGreedyHead, {
            "epsilon": 0.0,
            "n_actions": env.action_space.n,
        }
    ),
    "eps_03": (
        EpsilonGreedyHead, {
            "epsilon": 0.3,
            "n_actions": env.action_space.n,
        }
    ),
    "eps_07": (
        EpsilonGreedyHead, {
            "epsilon": 0.7,
            "n_actions": env.action_space.n,
        }
    ),
    "softmax": (
        SoftmaxHead, {
            "tau": 1.0,
            "n_actions": env.action_space.n,
        }
    )
}

# apply policy wrappers and convert deterministic base policies into stochastic evaluation policies
eval_policies = opl.apply_head(
    base_policies=base_policies,
    base_policies_name=["cql_b1", "cql_b2", "cql_b3"],
    policy_wrappers=policy_wrappers,
    random_state=random_state,
)

We describe the policy wrappers in detail in the next section.

Also, it is possible to learn the base policy and apply policy wrappers at the same time as follows.

eval_policies = opl.obtain_evaluation_policy(
    logged_dataset=logged_dataset,
    algorithms=[cql_b1, cql_b2, cql_b3],
    algorithms_name=["cql_b1", "cql_b2", "cql_b3"],
    policy_wrappers=policy_wrappers,
    random_state=random_state,
)

The obtained evaluation policies are the following (both algorithms and policy wrappers are enumerated).

>>> [eval_policy.name for eval_policy in eval_policies[0]]

['cql_b1_eps_00', 'cql_b1_eps_03', 'cql_b1_eps_07', 'cql_b1_softmax',
 'cql_b2_eps_00', 'cql_b2_eps_03', 'cql_b2_eps_07', 'cql_b2_softmax',
 'cql_b3_eps_00', 'cql_b3_eps_03', 'cql_b3_eps_07', 'cql_b3_softmax']

Tip

How to handle OPL with multiple logged datasets?

TrainCandidatePolicies is particularly useful when fitting offline RL algorithms on multiple logged datasets.

We can apply the same algorithms and policy wrappers across multiple datasets with the following command.

eval_policies = opl.obtain_evaluation_policy(
    logged_dataset=logged_dataset,                   # MultipleLoggedDataset
    algorithms=[cql_b1, cql_b2, cql_b3],             # single list
    algorithms_name=["cql_b1", "cql_b2", "cql_b3"],  # single list
    policy_wrappers=policy_wrappers,                 # single dict
    random_state=random_state,
)

The evaluation policies are returned in a nested list.

The other functions (i.e., learn_base_policy and apply_head) also work in a manner similar to the above examples.
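
For reference, the nested list can be traversed roughly as follows (a sketch that assumes the outer index corresponds to each logged dataset, consistent with the eval_policies[0] access shown above).

# inspect the evaluation policies learned on each logged dataset
for dataset_position, policies in enumerate(eval_policies):
    print(dataset_position, [policy.name for policy in policies])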

Policy Wrapper#

Here, we describe some useful wrapper tools to convert a d3rlpy policy into (stochastic) behavior and evaluation policies.

  • Discrete: EpsilonGreedyHead, SoftmaxHead

  • Continuous: GaussianHead, TruncatedGaussianHead, ContinuousEvalHead

  • Both (Online): OnlineHead

Tip

How to customize the policy head?

To customize the policy head, use BaseHead. Basically, the policy head has two roles.

  1. Enabling online interactions.

  2. Converting a deterministic policy to a stochastic policy.

For the first purpose, we already provide the following four functions in the base class:

  • predict_online

  • predict_value_online

  • sample_action_online

  • sample_action_and_output_pscore_online

These functions are already implemented in the base class, so you can use or override them as needed for online interactions. OnlineHead is also useful for this purpose.

Next, for the second purpose, you can customize how to convert a deterministic policy to a stochastic policy using the following functions.

  • sample_action_and_output_pscore

  • calc_action_choice_probability

  • calc_pscore_given_action
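
As a rough illustration, a custom head for a discrete-action policy might look like the sketch below. The class name, constructor arguments, and method signatures are assumptions for illustration, and BaseHead may require additional methods to be implemented; please check the API reference of BaseHead for the exact interface.

import numpy as np
from scope_rl.policy import BaseHead

class UniformMixtureHead(BaseHead):
    """Hypothetical head that mixes a deterministic discrete policy with uniform exploration."""

    def __init__(self, base_policy, n_actions, epsilon, name, random_state=None):
        self.base_policy = base_policy
        self.n_actions = n_actions
        self.epsilon = epsilon
        self.name = name
        self.random_ = np.random.RandomState(random_state)

    def calc_action_choice_probability(self, x):
        # place (1 - epsilon) on the greedy action and spread epsilon uniformly
        greedy_action = self.base_policy.predict(x)
        pscore = np.full((len(x), self.n_actions), self.epsilon / self.n_actions)
        pscore[np.arange(len(x)), greedy_action] += 1 - self.epsilon
        return pscore

    def calc_pscore_given_action(self, x, action):
        # propensity score of the observed actions
        return self.calc_action_choice_probability(x)[np.arange(len(x)), action]

    def sample_action_and_output_pscore(self, x):
        # sample stochastic actions and return them with their propensity scores
        pscore = self.calc_action_choice_probability(x)
        action = np.array([self.random_.choice(self.n_actions, p=p) for p in pscore])
        return action, pscore[np.arange(len(x)), action]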

DiscreteHead#

This module transforms a deterministic policy into a stochastic one in discrete action cases. Specifically, we have the following two options.

  • EpsilonGreedyHead: \(\pi(a | s) := (1 - \epsilon) \cdot \pi_{\mathrm{det}}(a | s) + \epsilon / |\mathcal{A}|\).

  • SoftmaxHead: \(\pi(a | s) := \displaystyle \frac{\exp(Q^{(\pi_{\mathrm{det}})}(s, a) / \tau)}{\sum_{a' \in \mathcal{A}} \exp(Q^{(\pi_{\mathrm{det}})}(s, a') / \tau)}\).

Note that \(\epsilon \in [0, 1]\) is the degree of exploration and \(\tau\) is the temperature hyperparameter. EpsilonGreedyHead is also used to construct a deterministic evaluation policy in OPE/OPS by setting \(\epsilon=0.0\).
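
For example, SoftmaxHead can be applied in the same way as EpsilonGreedyHead above (a minimal sketch; the policy name is arbitrary).

from scope_rl.policy import SoftmaxHead

evaluation_policy = SoftmaxHead(
    base_policy,  # QLearningAlgoBase of d3rlpy
    n_actions=env.action_space.n,
    tau=1.0,
    name="softmax_10",
    random_state=random_state,
)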

ContinuousHead#

This module transforms a deterministic policy into a stochastic one in continuous action cases.

  • GaussianHead: \(\pi(a | s) := \mathrm{Normal}(\pi_{\mathrm{det}}(s), \sigma)\).

  • TruncatedGaussianHead: \(\pi(a | s) := \mathrm{TruncatedNormal}(\pi_{\mathrm{det}}(s), \sigma)\).

We also provide a wrapper class that keeps the base policy deterministic for use in OPE.

  • ContinuousEvalHead: \(\pi(s) = \pi_{\mathrm{det}}(s)\).
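
For instance, ContinuousEvalHead wraps a deterministic continuous-action policy for OPE as in the following sketch (unlike GaussianHead, no noise parameter is needed; the exact constructor arguments may differ).

from scope_rl.policy import ContinuousEvalHead

evaluation_policy = ContinuousEvalHead(
    base_policy,  # QLearningAlgoBase of d3rlpy
    name="ddpg",
)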

OnlineHead#

This module enables online interaction with the policy (note that d3rlpy's policies are primarily designed for batch interaction).

  • OnlineHead
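
A minimal usage sketch is shown below; the constructor arguments and the step-wise call of predict_online for a single observation are assumptions based on the BaseHead functions listed above.

from scope_rl.policy import OnlineHead

online_policy = OnlineHead(
    base_policy,  # QLearningAlgoBase of d3rlpy
    name="online_policy",
)

# interact with the environment step by step (Gymnasium-style reset)
obs, info = env.reset()
action = online_policy.predict_online(obs)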

Online Evaluation#

Finally, we provide a series of functions for online performance evaluation in scope_rl/ope/online.py.

(Rollout)

  • rollout_policy_online

(Statistics)

  • calc_on_policy_policy_value

  • calc_on_policy_policy_value_interval

  • calc_on_policy_variance

  • calc_on_policy_conditional_value_at_risk

  • calc_on_policy_policy_interquartile_range

  • calc_on_policy_cumulative_distribution_function

(Visualization)

  • visualize_on_policy_policy_value

  • visualize_on_policy_cumulative_distribution_function

  • visualize_on_policy_conditional_value_at_risk

  • visualize_on_policy_interquartile_range
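
For example, the on-policy value of an evaluation policy can be estimated via online rollouts roughly as follows (the argument names are assumptions; please check scope_rl/ope/online.py for the exact signature).

from scope_rl.ope.online import calc_on_policy_policy_value

on_policy_value = calc_on_policy_policy_value(
    env=env,
    policy=evaluation_policy,  # a policy wrapped with a policy head
    n_trajectories=100,        # number of online rollouts (assumed argument name)
    random_state=random_state,
)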
