Supported Implementation#
Our implementation aims to streamline the data collection, (offline) policy learning, and off-policy evaluation and selection (OPE/OPS) procedure. We rely on d3rlpy’s implementation of the learning algorithms and provide tools that connect the steps of this offline RL procedure.
Synthetic Dataset Generation#
SyntheticDataset
is an easy-to-use data collection module, which is compatible with any OpenAI Gym and Gymnasium-like RL environment.
It takes an RL environment as input to instantiate the class.
# initialize the dataset class
from scope_rl.dataset import SyntheticDataset
dataset = SyntheticDataset(
    env=env,
    max_episode_steps=env.step_per_episode,
)
Then, it collects logged data by a behavior policy (i.e., data collection policy) as follows.
# collect logged data by a behavior policy
logged_dataset = dataset.obtain_episodes(
    behavior_policies=behavior_policy,  # BaseHead
    n_trajectories=10000,
    random_state=random_state,
)
Tip
How to obtain a behavior policy?
Our SyntheticDataset
class accepts an instance of BaseHead
as a behavior policy.
A policy head converts d3rlpy’s deterministic policy into either a deterministic or a stochastic policy and provides functions to calculate propensity scores (i.e., action choice probabilities).
For example, EpsilonGreedyHead
converts a discrete-action policy to an epsilon-greedy policy as follows.
from scope_rl.policy import EpsilonGreedyHead
behavior_policy = EpsilonGreedyHead(
    base_policy,  # QLearningAlgoBase of d3rlpy
    n_actions=env.action_space.n,
    epsilon=0.3,
    name="eps_03",
    random_state=random_state,
)
GaussianHead
converts a continuous-action policy to a stochastic policy as follows.
from scope_rl.policy import GaussianHead
behavior_policy = GaussianHead(
    base_policy,  # QLearningAlgoBase of d3rlpy
    sigma=1.0,
    name="sigma_10",
    random_state=random_state,
)
See also
For detailed descriptions and additional supported implementations, please refer to the Policy Wrappers section later in this page.
How to customize the dataset class?
To customize the dataset class, use BaseDataset
. The obtained logged_dataset
should contain the following keys for API consistency.
key: [
    size,
    n_trajectories,
    step_per_trajectory,
    action_type,
    n_actions,
    action_dim,
    action_keys,
    action_meaning,
    state_dim,
    state_keys,
    state,
    action,
    reward,
    done,
    terminal,
    info,
    pscore,
    behavior_policy,
    dataset_id,
]
Note
logged_dataset
can be used for OPE even if action_keys
, action_meaning
, state_keys
, and info
are not provided.
For API consistency, set these keys to None
when they are unnecessary.
Moreover, offline RL algorithms, FQE (model-based OPE), and marginal OPE estimators
can also work without pscore
.
See also
API reference of BaseDataset and Guidelines for Preparing Real-World Datasets explain the meaning of each key in detail.
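As a rough illustration of the key layout listed above, the following sketch builds a toy dictionary and checks that every expected key is present. The helper `missing_keys`, the toy values, and the use of plain Python lists are assumptions made for this example only. The actual value types and shapes are specified in the BaseDataset API reference.

```python
# Keys listed in the documentation above.
REQUIRED_KEYS = [
    "size", "n_trajectories", "step_per_trajectory", "action_type",
    "n_actions", "action_dim", "action_keys", "action_meaning",
    "state_dim", "state_keys", "state", "action", "reward", "done",
    "terminal", "info", "pscore", "behavior_policy", "dataset_id",
]


def missing_keys(logged_dataset):
    """Return the expected keys that are absent (hypothetical helper)."""
    return [key for key in REQUIRED_KEYS if key not in logged_dataset]


# Toy data: 2 trajectories of 3 steps each, with 2 discrete actions.
n_trajectories, step_per_trajectory = 2, 3
size = n_trajectories * step_per_trajectory

logged_dataset = {
    "size": size,
    "n_trajectories": n_trajectories,
    "step_per_trajectory": step_per_trajectory,
    "action_type": "discrete",
    "n_actions": 2,
    "action_dim": None,       # assumed unused for discrete actions
    "action_keys": None,      # optional, left as None
    "action_meaning": None,   # optional, left as None
    "state_dim": 1,
    "state_keys": None,       # optional, left as None
    "state": [[0.0] for _ in range(size)],
    "action": [0, 1, 0, 1, 0, 1],
    "reward": [0.0, 0.0, 1.0, 0.0, 0.0, 1.0],
    "done": [False, False, True] * n_trajectories,
    "terminal": [False, False, True] * n_trajectories,
    "info": None,             # optional, left as None
    "pscore": [0.5] * size,   # uniform random behavior policy
    "behavior_policy": "uniform_random",
    "dataset_id": 0,
}

print(missing_keys(logged_dataset))  # empty list when all keys are present
```

Keeping the optional keys in the dictionary with a value of None, rather than omitting them, is what lets downstream code index any key without a membership check.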
How to handle multiple logged datasets at once?
MultipleLoggedDataset
enables us to smoothly handle multiple logged datasets.
Specifically, MultipleLoggedDataset
saves the paths to each logged dataset and makes each dataset accessible through the following command.
logged_dataset_ = multiple_logged_dataset.get(behavior_policy_name=behavior_policy.name, dataset_id=0)
There are two ways to obtain MultipleLoggedDataset
.
The first way is to directly get MultipleLoggedDataset
as the output of SyntheticDataset
as follows.
synthetic_dataset = SyntheticDataset(
    env=env,
    max_episode_steps=env.step_per_episode,
    ...,
)
multiple_logged_dataset_1 = synthetic_dataset.obtain_episodes(
    behavior_policies=[behavior_policy_1, behavior_policy_2],  # when using multiple behavior policies, MultipleLoggedDataset is returned
    n_datasets=1,
    n_trajectories=10000,
    ...,
)
multiple_logged_dataset_2 = synthetic_dataset.obtain_episodes(
    behavior_policies=behavior_policy,
    n_datasets=5,  # when n_datasets > 1, MultipleLoggedDataset is returned
    n_trajectories=10000,
    ...,
)
The second way is to define MultipleLoggedDataset
manually as follows.
from scope_rl.utils import MultipleLoggedDataset
multiple_logged_dataset = MultipleLoggedDataset(
    action_type="discrete",
    path="logged_dataset/",  # either absolute or relative path
)
for behavior_policy in behavior_policies:
    single_logged_dataset = dataset.obtain_episodes(
        behavior_policies=behavior_policy,
        n_trajectories=10000,
        ...,
    )
    # add a single_logged_dataset to multiple_logged_dataset
    multiple_logged_dataset.add(
        single_logged_dataset,
        behavior_policy_name=behavior_policy.name,
        dataset_id=0,
    )
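To make the idea of "saving the paths to each logged dataset" concrete, here is a minimal standalone sketch of such a container. `ToyMultipleLoggedDataset` is a hypothetical stand-in written for this example, not the scope_rl class. The pickle-based storage and the file naming are assumptions, and only the `add` / `get` keyword names are taken from the text above.

```python
import pickle
import tempfile
from pathlib import Path


class ToyMultipleLoggedDataset:
    """Hypothetical stand-in: writes each dataset to disk and keeps only its path."""

    def __init__(self, path):
        self.path = Path(path)
        self.path.mkdir(parents=True, exist_ok=True)
        self._paths = {}  # (behavior_policy_name, dataset_id) -> file path

    def add(self, logged_dataset, behavior_policy_name, dataset_id):
        file_path = self.path / f"{behavior_policy_name}_{dataset_id}.pkl"
        with open(file_path, "wb") as f:
            pickle.dump(logged_dataset, f)
        self._paths[(behavior_policy_name, dataset_id)] = file_path

    def get(self, behavior_policy_name, dataset_id):
        # The dataset is loaded from disk only when it is requested.
        with open(self._paths[(behavior_policy_name, dataset_id)], "rb") as f:
            return pickle.load(f)


with tempfile.TemporaryDirectory() as tmp_dir:
    toy = ToyMultipleLoggedDataset(tmp_dir)
    toy.add(
        {"size": 3, "reward": [0.0, 1.0, 0.0]},
        behavior_policy_name="eps_03",
        dataset_id=0,
    )
    loaded = toy.get(behavior_policy_name="eps_03", dataset_id=0)

print(loaded["size"])
```

Storing paths instead of the datasets themselves keeps memory usage roughly constant regardless of how many datasets are registered, at the cost of a disk read on each access.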
How to collect data in a non-episodic setting?
When the goal is to evaluate the policy under a stationary distribution (\(d^{\pi}(s)\)) rather than in an episodic setting (e.g., cartpole or taxi as used in [12, 13]), we need to collect data from the stationary distribution.
For this, please consider using obtain_steps
instead of obtain_episodes
as follows.
logged_dataset = dataset.obtain_steps(
    behavior_policies=behavior_policy,
    n_trajectories=10000,
    ...,
)
Offline Learning#
Once we obtain the logged dataset, it’s time to learn a new policy in an offline manner. For this, d3rlpy provides various offline RL algorithms that work as follows.
# import modules
from d3rlpy.dataset import MDPDataset
from d3rlpy.algos import DiscreteCQLConfig as CQLConfig
from d3rlpy.models.encoders import VectorEncoderFactory
from d3rlpy.models.q_functions import MeanQFunctionFactory
# convert a (single) logged dataset to the d3rlpy dataset
offlinerl_dataset = MDPDataset(
    observations=logged_dataset["state"],
    actions=logged_dataset["action"],
    rewards=logged_dataset["reward"],
    terminals=logged_dataset["done"],
)
# define an offline RL algorithm
cql = CQLConfig(
    encoder_factory=VectorEncoderFactory(hidden_units=[30, 30]),
    q_func_factory=MeanQFunctionFactory(),
).create()
# fit algorithm in an offline manner
cql.fit(
    offlinerl_dataset,
    n_steps=10000,
)
While the above procedure is already simple,
we also provide TrainCandidatePolicies
as a meta class to further streamline the offline RL procedure when training several algorithms.
# prepare offline RL algorithms
cql_b1 = CQLConfig(
    encoder_factory=VectorEncoderFactory(hidden_units=[30, 30]),
    q_func_factory=MeanQFunctionFactory(),
).create()
cql_b2 = CQLConfig(
    encoder_factory=VectorEncoderFactory(hidden_units=[100]),
    q_func_factory=MeanQFunctionFactory(),
).create()
cql_b3 = CQLConfig(
    encoder_factory=VectorEncoderFactory(hidden_units=[50, 10]),
    q_func_factory=MeanQFunctionFactory(),
).create()
# off-policy learning
from scope_rl.policy import TrainCandidatePolicies
opl = TrainCandidatePolicies(
    fitting_args={"n_steps": 10000},
)
base_policies = opl.learn_base_policy(
    logged_dataset=logged_dataset,
    algorithms=[cql_b1, cql_b2, cql_b3],
    random_state=random_state,
)
Using TrainCandidatePolicies
, we can also convert the deterministic base policies to stochastic (evaluation) policies as follows.
# policy wrapper
from scope_rl.policy import EpsilonGreedyHead, SoftmaxHead
policy_wrappers = {
    "eps_00": (
        EpsilonGreedyHead, {
            "epsilon": 0.0,
            "n_actions": env.action_space.n,
        }
    ),
    "eps_03": (
        EpsilonGreedyHead, {
            "epsilon": 0.3,
            "n_actions": env.action_space.n,
        }
    ),
    "eps_07": (
        EpsilonGreedyHead, {
            "epsilon": 0.7,
            "n_actions": env.action_space.n,
        }
    ),
    "softmax": (
        SoftmaxHead, {
            "tau": 1.0,
            "n_actions": env.action_space.n,
        }
    ),
}
# apply policy wrappers and convert deterministic base policies into stochastic evaluation policies
eval_policies = opl.apply_head(
    base_policies=base_policies,
    base_policies_name=["cql_b1", "cql_b2", "cql_b3"],
    policy_wrappers=policy_wrappers,
    random_state=random_state,
)
The policy wrappers are described in detail in the next section.
Also, it is possible to learn the base policy and apply policy wrappers at the same time as follows.
eval_policies = opl.obtain_evaluation_policy(
    logged_dataset=logged_dataset,
    algorithms=[cql_b1, cql_b2, cql_b3],
    algorithms_name=["cql_b1", "cql_b2", "cql_b3"],
    policy_wrappers=policy_wrappers,
    random_state=random_state,
)
The obtained evaluation policies are the following (both algorithms and policy wrappers are enumerated).
>>> [eval_policy.name for eval_policy in eval_policies[0]]
['cql_b1_eps_00', 'cql_b1_eps_03', 'cql_b1_eps_07', 'cql_b1_softmax',
'cql_b2_eps_00', 'cql_b2_eps_03', 'cql_b2_eps_07', 'cql_b2_softmax',
'cql_b3_eps_00', 'cql_b3_eps_03', 'cql_b3_eps_07', 'cql_b3_softmax']
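The naming pattern in the output above can be reproduced with a Cartesian product of algorithm names and wrapper names. This is only a sketch of the enumeration order inferred from that output. How TrainCandidatePolicies builds the names internally is not shown in this page, so the join with an underscore is an assumption.

```python
from itertools import product

algorithms_name = ["cql_b1", "cql_b2", "cql_b3"]
wrapper_names = ["eps_00", "eps_03", "eps_07", "softmax"]  # keys of policy_wrappers

# Algorithms vary slowest, wrappers vary fastest, as in the output above.
eval_policy_names = [
    f"{algo}_{wrapper}" for algo, wrapper in product(algorithms_name, wrapper_names)
]

print(len(eval_policy_names))
print(eval_policy_names[:4])
```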
Tip
How to handle OPL with multiple logged datasets?
TrainCandidatePolicies
is particularly useful when fitting offline RL algorithms on multiple logged datasets.
We can apply the same algorithms and policy wrappers across multiple datasets by the following command.
eval_policies = opl.obtain_evaluation_policy(
    logged_dataset=logged_dataset,  # MultipleLoggedDataset
    algorithms=[cql_b1, cql_b2, cql_b3],  # single list
    algorithms_name=["cql_b1", "cql_b2", "cql_b3"],  # single list
    policy_wrappers=policy_wrappers,  # single dict
    random_state=random_state,
)
The evaluation policies are returned in a nested list.
The other functions (i.e., learn_base_policy
and apply_head
) also work in a manner similar to the above examples.
Policy Wrapper#
Here, we describe some useful wrapper tools to convert d3rlpy’s policy to (stochastic) behavior and evaluation policies.
DiscreteHead | EpsilonGreedyHead, SoftmaxHead
ContinuousHead | GaussianHead, TruncatedGaussianHead, ContinuousEvalHead
OnlineHead | OnlineHead
Tip
How to customize the policy head?
To customize the policy head, use BaseHead
. Basically, the policy head has two roles.
Enabling online interactions.
Converting a deterministic policy to a stochastic policy.
For the first purpose, we already provide the following four functions in the base class:
predict_online
predict_value_online
sample_action_online
sample_action_and_output_pscore_online
Please just override these functions for online interactions. OnlineHead
is also useful for this purpose.
Next, for the second purpose, you can customize how to convert a deterministic policy to a stochastic policy using the following functions.
sample_action_and_output_pscore_online
calc_action_choice_probability
calc_pscore_given_action
DiscreteHead#
This module transforms a deterministic policy into a stochastic one in discrete action cases. Specifically, we have the following two options.
EpsilonGreedyHead
: \(\pi(a | s) := (1 - \epsilon) * \pi_{\mathrm{det}}(a | s) + \epsilon / |\mathcal{A}|\).
SoftmaxHead
: \(\pi(a | s) := \displaystyle \frac{\exp(Q^{(\pi_{\mathrm{det}})}(s, a) / \tau)}{\sum_{a' \in \mathcal{A}} \exp(Q^{(\pi_{\mathrm{det}})}(s, a') / \tau)}\).
Note that \(\epsilon \in [0, 1]\) is the degree of exploration and \(\tau\) is the temperature hyperparameter. EpsilonGreedyHead is also used to construct a deterministic evaluation policy in OPE/OPS by setting \(\epsilon=0.0\).
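The two formulas above can be checked numerically with a small standalone sketch. The functions `epsilon_greedy_probs` and `softmax_probs` are hypothetical names written for this example and are not part of scope_rl. They only compute the action choice probabilities for a single state.

```python
import math


def epsilon_greedy_probs(greedy_action, n_actions, epsilon):
    """pi(a|s) = (1 - epsilon) * 1[a == greedy_action] + epsilon / n_actions."""
    probs = [epsilon / n_actions] * n_actions
    probs[greedy_action] += 1.0 - epsilon
    return probs


def softmax_probs(q_values, tau):
    """pi(a|s) proportional to exp(Q(s, a) / tau)."""
    scaled = [q / tau for q in q_values]
    max_scaled = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - max_scaled) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


eps_probs = epsilon_greedy_probs(greedy_action=2, n_actions=4, epsilon=0.3)
soft_probs = softmax_probs(q_values=[1.0, 2.0, 3.0], tau=1.0)

print(eps_probs)
print(soft_probs)
```

With \(\epsilon = 0\) the epsilon-greedy probabilities collapse to a one-hot vector on the greedy action, which matches the deterministic evaluation policy mentioned above. A smaller \(\tau\) concentrates the softmax probabilities on the action with the largest Q-value.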
ContinuousHead#
This module transforms a deterministic policy to a stochastic one in continuous action cases.
GaussianHead
: \(\pi(a | s) := \mathrm{Normal}(\pi_{\mathrm{det}}(s), \sigma)\).
TruncatedGaussianHead
: \(\pi(a | s) := \mathrm{TruncatedNormal}(\pi_{\mathrm{det}}(s), \sigma)\).
We also provide a wrapper class for deterministic policies to be used in OPE.
ContinuousEvalHead
: \(\pi(s) = \pi_{\mathrm{det}}(s)\).
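For the Gaussian case, the propensity score of a sampled action is the density of the normal distribution centered at the deterministic action. The sketch below assumes independent action dimensions and treats \(\sigma\) as a standard deviation shared across dimensions. Both points are assumptions of this example, and the function names are hypothetical rather than scope_rl's.

```python
import math
import random


def gaussian_pscore(action, mean, sigma):
    """Density of an isotropic Normal(mean, sigma) evaluated at `action`."""
    density = 1.0
    for a, m in zip(action, mean):
        z = (a - m) / sigma
        density *= math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))
    return density


def sample_gaussian_action(mean, sigma, rng):
    """Sample an action around `mean` and return it with its density."""
    action = [rng.gauss(m, sigma) for m in mean]
    return action, gaussian_pscore(action, mean, sigma)


rng = random.Random(0)
det_action = [0.5, -0.2]  # stands in for pi_det(s)
action, pscore = sample_gaussian_action(det_action, sigma=1.0, rng=rng)

print(action, pscore)
```

Note that for continuous actions the propensity score is a density rather than a probability, so it can exceed 1 when \(\sigma\) is small.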
OnlineHead#
This module enables online interaction with the policy (note: d3rlpy’s policy is particularly designed for batch interactions).
OnlineHead
Online Evaluation#
Finally, we provide a series of functions to be used for online performance evaluation in scope_rl/ope/online.py.
(Rollout)
rollout_policy_online
(Statistics)
calc_on_policy_policy_value
calc_on_policy_policy_value_interval
calc_on_policy_variance
calc_on_policy_conditional_value_at_risk
calc_on_policy_policy_interquartile_range
calc_on_policy_cumulative_distribution_function
(Visualization)
visualize_on_policy_policy_value
visualize_on_policy_cumulative_distribution_function
visualize_on_policy_conditional_value_at_risk
visualize_on_policy_interquartile_range
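To illustrate the kind of statistics listed above, the following sketch computes them from a list of trajectory-wise returns that an online rollout might produce. The function names, the toy returns, the use of the sample variance, and the lower-tail definition of CVaR are assumptions of this example. This is not the implementation in scope_rl/ope/online.py, whose signatures are not shown in this page.

```python
import statistics


def policy_value(returns):
    """Average return across trajectories."""
    return statistics.fmean(returns)


def return_variance(returns):
    """Sample variance of the returns (n - 1 in the denominator)."""
    return statistics.variance(returns)


def conditional_value_at_risk(returns, alpha):
    """Average of the worst alpha-fraction of returns (at least one)."""
    sorted_returns = sorted(returns)
    n_worst = max(1, int(alpha * len(sorted_returns)))
    return statistics.fmean(sorted_returns[:n_worst])


def interquartile_range(returns):
    """Difference between the third and first quartiles."""
    q1, _, q3 = statistics.quantiles(returns, n=4)
    return q3 - q1


def empirical_cdf(returns, threshold):
    """Fraction of trajectories whose return is at most `threshold`."""
    return sum(r <= threshold for r in returns) / len(returns)


# Toy returns of 10 trajectories (illustrative values only).
returns = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]

print(policy_value(returns))
print(return_variance(returns))
print(conditional_value_at_risk(returns, alpha=0.2))
print(interquartile_range(returns))
print(empirical_cdf(returns, threshold=5.0))
```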