scope_rl.ope.input

Meta class to create input for Off-Policy Evaluation (OPE).

Classes

CreateOPEInput

Class to prepare OPE inputs.

class scope_rl.ope.input.CreateOPEInput(env=None, model_args=None, gamma=1.0, bandwidth=1.0, state_scaler=None, action_scaler=None, device=None)

Class to prepare OPE inputs.

Imported as: scope_rl.ope.CreateOPEInput

Parameters:

env (gym.Env, default=None) – Reinforcement learning (RL) environment.
model_args (dict of dict, default=None) –
Arguments of the models.
```
key: [
    "fqe",
    "state_action_dual",
    "state_action_value",
    "state_action_weight",
    "state_dual",
    "state_value",
    "state_weight",
    "hidden_dim",  # hidden dim of value/weight function, except FQE
]
```
Note

Please specify the scalers through the state_scaler and action_scaler arguments of this class rather than in model_args, as those specified by model_args[model]["scaler/action_scaler"] will be overwritten.
See also

The following describe the parameters of each model.
- (external) d3rlpy’s documentation about FQE
- (API reference) scope_rl.ope.weight_value_learning
gamma (float, default=1.0) – Discount factor. The value should be within (0, 1].
bandwidth (float, default=1.0 (> 0)) – Bandwidth hyperparameter of the kernel.
state_scaler (d3rlpy.preprocessing.Scaler, default=None) – Scaling factor of state.
action_scaler (d3rlpy.preprocessing.ActionScaler, default=None) – Scaling factor of action. Only applicable in the continuous action case.
device (str, default=None) – Specifies device used for torch.
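
As a rough illustration of the nested layout of model_args described above, the sketch below builds such a dictionary and checks its top-level keys against the list given for this parameter. The helper check_model_args_keys and the inner key placeholder_option are hypothetical and are not part of SCOPE-RL; consult the referenced model documentation for the arguments each model actually accepts.

```python
# Top-level keys listed for `model_args` in the parameter description above.
ALLOWED_MODEL_KEYS = {
    "fqe",
    "state_action_dual",
    "state_action_value",
    "state_action_weight",
    "state_dual",
    "state_value",
    "state_weight",
    "hidden_dim",
}


def check_model_args_keys(model_args):
    """Return the sorted top-level keys that are not in the documented list.

    Hypothetical helper for illustration; SCOPE-RL does not provide it.
    """
    return sorted(set(model_args) - ALLOWED_MODEL_KEYS)


# "placeholder_option" is a made-up inner key, used only to show the nesting.
model_args = {
    "fqe": {"placeholder_option": 1},
    "state_action_weight": {"placeholder_option": 2},
}

unknown_keys = check_model_args_keys(model_args)
print(unknown_keys)  # an empty list means every top-level key is documented
```

A check like this catches misspelled model names before any model is fit.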

Examples

Preparation:

# import necessary modules from SCOPE-RL
from scope_rl.dataset import SyntheticDataset
from scope_rl.policy import EpsilonGreedyHead
from scope_rl.ope import CreateOPEInput
from scope_rl.ope import OffPolicyEvaluation as OPE
from scope_rl.ope.discrete import TrajectoryWiseImportanceSampling as TIS
from scope_rl.ope.discrete import PerDecisionImportanceSampling as PDIS

# import necessary modules from other libraries
import gym
import rtbgym
from d3rlpy.algos import DoubleDQNConfig
from d3rlpy.dataset import create_fifo_replay_buffer
from d3rlpy.algos import ConstantEpsilonGreedy

# initialize environment
env = gym.make("RTBEnv-discrete-v0")

# define (RL) agent (i.e., policy) and train on the environment
ddqn = DoubleDQNConfig().create()
buffer = create_fifo_replay_buffer(
    limit=10000,
    env=env,
)
explorer = ConstantEpsilonGreedy(
    epsilon=0.3,
)
ddqn.fit_online(
    env=env,
    buffer=buffer,
    explorer=explorer,
    n_steps=10000,
    n_steps_per_epoch=1000,
)

# convert ddqn policy to stochastic data collection policy
behavior_policy = EpsilonGreedyHead(
    ddqn,
    n_actions=env.action_space.n,
    epsilon=0.3,
    name="ddqn_epsilon_0.3",
    random_state=12345,
)

# initialize dataset class
dataset = SyntheticDataset(
    env=env,
    max_episode_steps=env.step_per_episode,
)

# data collection
logged_dataset = dataset.obtain_episodes(
    behavior_policies=behavior_policy,
    n_trajectories=100,
    random_state=12345,
)

Create Input:

# evaluation policy
ddqn_ = EpsilonGreedyHead(
    base_policy=ddqn,
    n_actions=env.action_space.n,
    name="ddqn",
    epsilon=0.0,
    random_state=12345
)
random_ = EpsilonGreedyHead(
    base_policy=ddqn,
    n_actions=env.action_space.n,
    name="random",
    epsilon=1.0,
    random_state=12345
)

# create input for off-policy evaluation (OPE)
prep = CreateOPEInput(
    env=env,
)
input_dict = prep.obtain_whole_inputs(
    logged_dataset=logged_dataset,
    evaluation_policies=[ddqn_, random_],
    require_value_prediction=True,
    n_trajectories_on_policy_evaluation=100,
    random_state=12345,
)

Output:

>>> input_dict

{'ddqn':
    {'evaluation_policy_action_dist':
        array([[0., 0., 0., ..., 0., 1., 0.],
              [0., 0., 0., ..., 0., 0., 0.],
              [0., 0., 0., ..., 0., 1., 0.],
              ...,
              [0., 0., 0., ..., 0., 0., 0.],
              [0., 0., 0., ..., 0., 0., 0.],
              [0., 0., 0., ..., 0., 1., 0.]]),
    'evaluation_policy_action': None,
    'state_action_value_prediction':
        array([[11.64699173, 10.1278677 , 10.09877205, ..., 10.16476822, 15.13939476,  8.95065594],
              [10.42242146,  7.73790789,  7.27790451, ...,  3.51157165, 12.0761919 ,  3.75301909],
              [ 7.22864819,  6.88499546,  5.68951464, ...,  6.10659647, 7.05469513,  4.81715965],
              ...,
              [ 7.28475332,  3.91264176,  4.6845212 , ..., -0.02834684, 7.94454432,  2.59267783],
              [13.44723797,  3.08360171,  5.99188185, ..., -2.16886044, 7.13434792,  5.72265959],
              [ 2.27913332,  3.07881427,  1.8636421 , ...,  3.37803316, 3.20135021,  2.68845224]]),
    'initial_state_value_prediction': array([15.13939476, 14.83423805, 13.82990742, ..., 15.49367523, 15.49053097, 14.88922691]),
    'state_action_marginal_importance_weight': None,
    'state_marginal_importance_weight': None,
    'on_policy_policy_value': array([ 8., 10.,  9., ...,  13., 18.,  4.]),
    'gamma': 1.0,
    'behavior_policy': 'ddqn_epsilon_0.3',
    'evaluation_policy': 'ddqn',
    'dataset_id': 0},
'random':
    {'evaluation_policy_action_dist':
        array([[0.1, 0.1, 0.1, ..., 0.1, 0.1, 0.1],
              [0.1, 0.1, 0.1, ..., 0.1, 0.1, 0.1],
              [0.1, 0.1, 0.1, ..., 0.1, 0.1, 0.1],
              ...,
              [0.1, 0.1, 0.1, ..., 0.1, 0.1, 0.1],
              [0.1, 0.1, 0.1, ..., 0.1, 0.1, 0.1],
              [0.1, 0.1, 0.1, ..., 0.1, 0.1, 0.1]]),
    'evaluation_policy_action': None,
    'state_action_value_prediction':
        array([[10.63342857, 10.61063576, 11.16767025, ..., 15.32427979, 15.08568764, 10.50707436],
              [ 4.02995491,  4.80947208,  7.07302999, ...,  9.928442  , 10.78198528,  9.04977417],
              [ 6.21145582,  6.08772421,  6.5972681 , ...,  9.68579388, 7.2353406 ,  6.17404699],
              ...,
              [ 1.2350018 ,  1.37531543,  3.48139453, ...,  3.44862366, 5.41990328, -0.20314722],
              [ 0.81208032, -0.28935188,  2.62608957, ...,  6.6619091 , -2.18710518, -2.34665537],
              [ 2.36533523,  2.24474525,  2.31729817, ...,  4.7845993 , 2.83752441,  3.00596046]]),
    'initial_state_value_prediction': array([12.5472518 , 12.56364899, 12.30248432, ..., 12.62372198, 12.6544138 , 12.54314356]),
    'state_action_marginal_importance_weight': None,
    'state_marginal_importance_weight': None,
    'on_policy_policy_value': array([ 9.,  7.,  4., ..., 15.,  8.,  5.]),
    'gamma': 1.0,
    'behavior_policy': 'ddqn_epsilon_0.3',
    'evaluation_policy': 'random',
    'dataset_id': 0}}
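
The two action distributions in this output follow from the epsilon values passed to EpsilonGreedyHead: epsilon=0.0 gives one-hot rows and epsilon=1.0 gives uniform rows. Below is a minimal sketch of the usual epsilon-greedy formula, assuming 10 actions (inferred from the 0.1 entries above) and made-up Q-values. It illustrates the relation and is not SCOPE-RL's implementation.

```python
def epsilon_greedy_action_dist(q_values, epsilon):
    """Usual epsilon-greedy probabilities: epsilon / n_actions on every action,
    plus the remaining (1 - epsilon) on the greedy action."""
    n_actions = len(q_values)
    greedy_action = max(range(n_actions), key=lambda a: q_values[a])
    dist = [epsilon / n_actions] * n_actions
    dist[greedy_action] += 1.0 - epsilon
    return dist


# Illustrative Q-values for one state with 10 actions; the greedy action is index 8.
q_values = [1.0, 0.5, 0.2, 0.1, 0.3, 0.4, 0.6, 0.7, 2.0, 0.9]

greedy_dist = epsilon_greedy_action_dist(q_values, epsilon=0.0)
uniform_dist = epsilon_greedy_action_dist(q_values, epsilon=1.0)
print(greedy_dist)   # all mass on index 8
print(uniform_dist)  # equal mass on every action
```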

See also

Quickstart

Attributes:

action_scaler
device
env
model_args
state_scaler

Methods

`build_and_fit_FQE`(evaluation_policy[, ...])	Perform Fitted Q Evaluation (FQE).
`build_and_fit_state_action_dual_model`(...[, ...])	Perform Augmented Lagrangian Method (ALM) to estimate the state-action value and weight functions.
`build_and_fit_state_action_value_model`(...)	Perform Minimax Q Learning (MQL) to estimate the state-action value function.
`build_and_fit_state_action_weight_model`(...)	Perform Minimax Weight Learning (MWL) to estimate the state-action weight function.
`build_and_fit_state_dual_model`(evaluation_policy)	Perform Augmented Lagrangian Method (ALM) to estimate the state value and weight functions.
`build_and_fit_state_value_model`(...[, ...])	Perform Minimax V Learning (MVL) to estimate the state value function.
`build_and_fit_state_weight_model`(...[, ...])	Perform Minimax Weight Learning (MWL) to estimate the state weight function.
`obtain_evaluation_policy_action`(...[, state])	Obtain evaluation policy action.
`obtain_evaluation_policy_action_dist`(...[, ...])	Obtain the action choice probability of the evaluation policy at the observed state.
`obtain_evaluation_policy_action_prob_for_observed_state_action`(...)	Obtain the pscore of an observed state-action pair.
`obtain_initial_state`(evaluation_policy[, ...])	Obtain initial state distribution (stationary distribution) of the evaluation policy.
`obtain_initial_state_value_prediction`(...[, ...])	Obtain the initial state value of the evaluation policy in the case of discrete action spaces.
`obtain_state_action_marginal_importance_weight`(...)	Predict state-action marginal importance weight.
`obtain_state_action_value_prediction`(method, ...)	Obtain an estimated Q-function of the observed state and all actions (discrete) or that of the actions chosen by behavior and (deterministic) evaluation policies (continuous).
`obtain_state_marginal_importance_weight`(...)	Predict state marginal importance weight.
`obtain_whole_inputs`(logged_dataset, ...[, ...])	Obtain input as a dictionary.

build_and_fit_FQE(evaluation_policy, k_fold=1, n_steps=10000)

Perform Fitted Q Evaluation (FQE).

Parameters:

evaluation_policy (BaseHead) – Evaluation policy.
k_fold (int, default=1 (> 0)) – Number of folds to perform cross-fitting.
n_steps (int, default=10000 (> 0)) – Number of gradient steps.
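
Cross-fitting with k_fold > 1 can be pictured as follows: the trajectories are split into folds, a model is fit on all folds but one, and its predictions are used only on the held-out fold. The sketch below shows just the index bookkeeping with a hypothetical helper, split_into_folds. The round-robin assignment is an assumption made for illustration, not a description of how SCOPE-RL partitions the data.

```python
def split_into_folds(n_trajectories, k_fold):
    """Assign trajectory indices to k_fold folds in round-robin order."""
    return [list(range(j, n_trajectories, k_fold)) for j in range(k_fold)]


n_trajectories, k_fold = 10, 3
folds = split_into_folds(n_trajectories, k_fold)

for j, held_out in enumerate(folds):
    # Fit on every trajectory outside fold j, predict only on fold j.
    train = [i for i in range(n_trajectories) if i not in held_out]
    print(f"fold {j}: train on {train}, predict on {held_out}")
```

Each trajectory is held out exactly once, so every prediction comes from a model that never saw that trajectory during fitting.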

build_and_fit_state_action_dual_model(evaluation_policy, k_fold=1, n_steps=10000, random_state=None)

Perform Augmented Lagrangian Method (ALM) to estimate the state-action value and weight functions.

Parameters:

evaluation_policy (BaseHead) – Evaluation policy.
k_fold (int, default=1 (> 0)) – Number of folds to perform cross-fitting.
n_steps (int, default=10000 (> 0)) – Number of gradient steps.
random_state (int, default=None (>= 0)) – Random state.

build_and_fit_state_action_value_model(evaluation_policy, k_fold=1, n_steps=10000, random_state=None)

Perform Minimax Q Learning (MQL) to estimate the state-action value function.

Parameters:

evaluation_policy (BaseHead) – Evaluation policy.
k_fold (int, default=1 (> 0)) – Number of folds to perform cross-fitting.
n_steps (int, default=10000 (> 0)) – Number of gradient steps.
random_state (int, default=None (>= 0)) – Random state.

build_and_fit_state_action_weight_model(evaluation_policy, k_fold=1, n_steps=10000, random_state=None)

Perform Minimax Weight Learning (MWL) to estimate the state-action weight function.

Parameters:

evaluation_policy (BaseHead) – Evaluation policy.
k_fold (int, default=1 (> 0)) – Number of folds to perform cross-fitting.
n_steps (int, default=10000 (> 0)) – Number of gradient steps.
random_state (int, default=None (>= 0)) – Random state.

build_and_fit_state_dual_model(evaluation_policy, k_fold=1, n_steps=10000, random_state=None)

Perform Augmented Lagrangian Method (ALM) to estimate the state value and weight functions.

Parameters:

evaluation_policy (BaseHead) – Evaluation policy.
k_fold (int, default=1 (> 0)) – Number of folds to perform cross-fitting.
n_steps (int, default=10000 (> 0)) – Number of gradient steps.
random_state (int, default=None (>= 0)) – Random state.

build_and_fit_state_value_model(evaluation_policy, k_fold=1, n_steps=10000, random_state=None)

Perform Minimax V Learning (MVL) to estimate the state value function.

Parameters:

evaluation_policy (BaseHead) – Evaluation policy.
k_fold (int, default=1 (> 0)) – Number of folds to perform cross-fitting.
n_steps (int, default=10000 (> 0)) – Number of gradient steps.
random_state (int, default=None (>= 0)) – Random state.

build_and_fit_state_weight_model(evaluation_policy, k_fold=1, n_steps=10000, random_state=None)

Perform Minimax Weight Learning (MWL) to estimate the state weight function.

Parameters:

evaluation_policy (BaseHead) – Evaluation policy.
k_fold (int, default=1 (> 0)) – Number of folds to perform cross-fitting.
n_steps (int, default=10000 (> 0)) – Number of gradient steps.
random_state (int, default=None (>= 0)) – Random state.

obtain_initial_state(evaluation_policy, resample_initial_state=False, minimum_rollout_length=0, maximum_rollout_length=100, random_state=None)

Obtain initial state distribution (stationary distribution) of the evaluation policy.

Parameters:

evaluation_policy (BaseHead) – Evaluation policy.
resample_initial_state (bool, default=False) – Whether to resample from initial state distribution using the given evaluation policy. If False, the initial state distribution of the behavior policy is used instead.
minimum_rollout_length (int, default=0 (>= 0)) – Minimum length of rollout by the behavior policy before generating the logged dataset when working on the infinite horizon setting. This argument is irrelevant when working on the finite horizon setting.
maximum_rollout_length (int, default=100 (>= minimum_rollout_length)) – Maximum length of rollout by the behavior policy before generating the logged dataset when working on the infinite horizon setting. This argument is irrelevant when working on the finite horizon setting.
random_state (int, default=None (>= 0)) – Random state.

Returns:

evaluation_policy_initial_state – Initial state distribution of the evaluation policy. This is intended to be used in marginal OPE estimators.

Return type:

ndarray of shape (n_trajectories, )

obtain_evaluation_policy_action(evaluation_policy, state=None)

Obtain evaluation policy action.

Parameters:

evaluation_policy (BaseHead) – Evaluation policy.
state (np.ndarray, default=None) – Sample an action from the evaluation_policy at this state. If None is given, the states observed in the logged data are used.

Returns:

evaluation_policy_action – Evaluation policy action \(a_t \sim \pi(a_t \mid s_t)\).

Return type:

ndarray of shape (n_trajectories * step_per_trajectory, )
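
Several return values on this page are flattened to length n_trajectories * step_per_trajectory. Assuming the entries are ordered trajectory by trajectory (an assumption, not stated above), a flat array can be regrouped per trajectory as in this sketch. reshape_per_trajectory is a hypothetical helper, and a plain list stands in for the ndarray.

```python
def reshape_per_trajectory(flat_values, n_trajectories, step_per_trajectory):
    """Group a flat sequence into n_trajectories rows of step_per_trajectory entries."""
    if len(flat_values) != n_trajectories * step_per_trajectory:
        raise ValueError("length must equal n_trajectories * step_per_trajectory")
    return [
        flat_values[i * step_per_trajectory:(i + 1) * step_per_trajectory]
        for i in range(n_trajectories)
    ]


# Illustrative actions for 2 trajectories of 3 steps each.
flat_actions = [4, 1, 8, 8, 0, 2]
per_trajectory = reshape_per_trajectory(
    flat_actions, n_trajectories=2, step_per_trajectory=3
)
print(per_trajectory)  # [[4, 1, 8], [8, 0, 2]]
```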

obtain_evaluation_policy_action_dist(evaluation_policy, state=None)

Obtain the action choice probability of the evaluation policy at the observed state.

Parameters:

evaluation_policy (BaseHead) – Evaluation policy.
state (np.ndarray, default=None) – State at which the action distribution of the evaluation_policy is computed. If None is given, the states observed in the logged data are used.

Returns:

evaluation_policy_action_dist – Evaluation policy pscore \(\pi(a_t \mid s_t)\).

Return type:

ndarray of shape (n_trajectories * step_per_trajectory, n_actions)

obtain_evaluation_policy_action_prob_for_observed_state_action(evaluation_policy)

Obtain the pscore of an observed state-action pair.

Parameters:

evaluation_policy (BaseHead) – Evaluation policy.

Returns:

evaluation_policy_pscore – Evaluation policy pscore \(\pi(a_t \mid s_t)\).

Return type:

ndarray of shape (n_trajectories * step_per_trajectory, )
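
For discrete actions, the pscore of an observed pair is the entry of the evaluation policy's action distribution at the logged action. The sketch below shows that relation with made-up values and plain lists in place of ndarrays. select_observed_pscore is a hypothetical helper, not SCOPE-RL's code.

```python
def select_observed_pscore(action_dist, observed_actions):
    """Pick pi(a_t | s_t) for each step from the per-step action distribution."""
    return [dist[a] for dist, a in zip(action_dist, observed_actions)]


# Illustrative distribution over 3 actions for 3 steps, and the logged actions.
action_dist = [
    [0.8, 0.1, 0.1],
    [0.1, 0.8, 0.1],
    [0.1, 0.1, 0.8],
]
observed_actions = [0, 2, 2]

evaluation_policy_pscore = select_observed_pscore(action_dist, observed_actions)
print(evaluation_policy_pscore)  # [0.8, 0.1, 0.8]
```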

obtain_state_action_value_prediction(method, evaluation_policy, k_fold=1)

Obtain an estimated Q-function of the observed state and all actions (discrete) or that of the actions chosen by behavior and (deterministic) evaluation policies (continuous).

Parameters:

method ({"fqe", "dice_q", "mql"}) – Estimation method.
evaluation_policy (BaseHead) – Evaluation policy.
k_fold (int, default=1 (> 0)) – Number of folds to perform cross-fitting.

Returns:

state_action_value_prediction – If action_type is “discrete”, output is state action value for observed state and all actions, i.e., \(\hat{Q}(s, a) \forall a \in \mathcal{A}\).

If action_type is “continuous”, output is state action value for the observed action and that chosen by the evaluation policy, i.e., (column 0) \(\hat{Q}(s_t, a_t)\) and (column 1) \(\hat{Q}(s_t, \pi(s_t))\).

Return type:

ndarray of shape (n_trajectories * step_per_trajectory, n_actions) or (n_trajectories * step_per_trajectory, 2)

obtain_initial_state_value_prediction(method, evaluation_policy, k_fold=1, resample_initial_state=False, minimum_rollout_length=0, maximum_rollout_length=100, random_state=None)

Obtain the initial state value of the evaluation policy in the case of discrete action spaces.

Parameters:

method ({"fqe", "dice_q", "dice_v", "mql", "mvl"}) – Estimation method.
evaluation_policy (BaseHead) – Evaluation policy.
k_fold (int, default=1 (> 0)) – Number of folds to perform cross-fitting.
resample_initial_state (bool, default=False) – Whether to resample initial state distribution using the given evaluation policy. If False, the initial state distribution of the behavior policy is used instead.
minimum_rollout_length (int, default=0 (>= 0)) – Minimum length of rollout by the behavior policy before generating the logged dataset when working on the infinite horizon setting. This argument is irrelevant when working on the finite horizon setting.
maximum_rollout_length (int, default=100 (>= minimum_rollout_length)) – Maximum length of rollout by the behavior policy before generating the logged dataset when working on the infinite horizon setting. This argument is irrelevant when working on the finite horizon setting.
random_state (int, default=None (>= 0)) – Random state.

Returns:

initial_state_value_prediction – Estimated state value of the observed initial state.

Return type:

ndarray of shape (n_trajectories, )
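
For discrete actions, a state value can be derived from Q-values and the evaluation policy's action distribution through the standard identity \(\hat{V}(s_0) = \sum_a \pi(a \mid s_0) \hat{Q}(s_0, a)\). The sketch below applies this identity to made-up numbers. Whether SCOPE-RL computes the value exactly this way for every method is an assumption.

```python
def state_value_from_q(q_values, action_dist):
    """V(s) = sum_a pi(a | s) * Q(s, a) for a single state."""
    return sum(p * q for p, q in zip(action_dist, q_values))


# Illustrative Q-values and policy probabilities at one initial state.
q_values = [10.0, 12.0, 8.0]
greedy_dist = [0.0, 1.0, 0.0]        # deterministic policy that picks action 1
stochastic_dist = [0.25, 0.5, 0.25]  # a stochastic policy

print(state_value_from_q(q_values, greedy_dist))      # 12.0
print(state_value_from_q(q_values, stochastic_dist))  # 10.5
```

For a greedy policy the value reduces to the largest Q-value at that state.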

obtain_state_action_marginal_importance_weight(method, evaluation_policy, k_fold=1)

Predict state-action marginal importance weight.

Parameters:

method ({"dice", "mwl"}) – Estimation method.
evaluation_policy (BaseHead) – Evaluation policy.
k_fold (int, default=1 (> 0)) – Number of folds to perform cross-fitting.

Returns:

state_action_weight_prediction – State-action marginal importance weight for the observed state and action.

Return type:

ndarray of shape (n_trajectories * step_per_trajectory, )
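
State-action marginal importance weights are typically consumed by marginal importance sampling estimators, which reweight each observed reward by \(\hat{w}(s_t, a_t)\). Below is a minimal sketch of one common form, \(\frac{1}{n} \sum_i \sum_t \gamma^t \hat{w}(s_t^{(i)}, a_t^{(i)}) r_t^{(i)}\), with made-up inputs. Normalization details differ between estimators, and this is not the library's implementation.

```python
def marginal_is_estimate(weights, rewards, gamma):
    """Average over trajectories of sum_t gamma^t * w(s_t, a_t) * r_t.

    `weights` and `rewards` are lists of per-trajectory lists of equal shape.
    """
    total = 0.0
    for w_traj, r_traj in zip(weights, rewards):
        total += sum(
            (gamma ** t) * w * r for t, (w, r) in enumerate(zip(w_traj, r_traj))
        )
    return total / len(rewards)


# Illustrative weights and rewards: 2 trajectories, 2 steps each.
weights = [[1.0, 2.0], [0.5, 1.0]]
rewards = [[1.0, 1.0], [2.0, 0.0]]

print(marginal_is_estimate(weights, rewards, gamma=1.0))  # 2.0
```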

obtain_state_marginal_importance_weight(method, evaluation_policy, k_fold=1)

Predict state marginal importance weight.

Parameters:

method ({"dice", "mwl"}) – Estimation method.
evaluation_policy (BaseHead) – Evaluation policy.
k_fold (int, default=1 (> 0)) – Number of folds to perform cross-fitting.

Returns:

state_weight_prediction – State marginal importance weight for observed state.

Return type:

ndarray of shape (n_trajectories * step_per_trajectory, )

obtain_whole_inputs(logged_dataset, evaluation_policies, behavior_policy_name=None, dataset_id=None, require_value_prediction=False, require_weight_prediction=False, resample_initial_state=False, q_function_method='fqe', v_function_method='fqe', w_function_method='dice', k_fold=1, n_steps=10000, n_trajectories_on_policy_evaluation=100, use_stationary_distribution_on_policy_evaluation=False, minimum_rollout_length=0, maximum_rollout_length=100, random_state=None, path='input_dict/', save_relative_path=False)

Obtain input as a dictionary.

Parameters:

logged_dataset (LoggedDataset or MultipleLoggedDataset) –

Logged dataset containing the following.

key: [
    size,
    n_trajectories,
    step_per_trajectory,
    action_type,
    n_actions,
    action_dim,
    action_keys,
    action_meaning,
    state_dim,
    state_keys,
    state,
    action,
    reward,
    done,
    terminal,
    info,
    pscore,
    behavior_policy,
    dataset_id,
]

See also

scope_rl.dataset.SyntheticDataset describes the components of logged_dataset.

evaluation_policies (list of BaseHead or BaseHead) –
Evaluation policies.
Tip
1. When using LoggedDataset, evaluation_policies should be List[BaseHead] ([BaseHead, BaseHead, ..]).
2. When using MultipleLoggedDataset and applying the same evaluation policies across behavior policies and dataset_ids, evaluation_policies should be List[BaseHead].
3. When using MultipleLoggedDataset and applying the same evaluation policies across dataset_ids but different evaluation policies across behavior policies, evaluation_policies should be Dict[str, List[BaseHead]] (key: [behavior_policy_name]).
4. When using MultipleLoggedDataset and applying different evaluation policies across dataset_ids and behavior policies, evaluation_policies should be Dict[str, List[List[BaseHead]]] (key: [behavior_policy_name][dataset_id]).
behavior_policy_name (str, default=None) – Name of the behavior policy.
dataset_id (int, default=None) – Id of the logged dataset.
require_value_prediction (bool, default=False) – Whether to obtain an estimated value function.
require_weight_prediction (bool, default=False) – Whether to obtain an estimated weight function.
resample_initial_state (bool, default=False) –
Whether to resample initial state distribution using the given evaluation policy. If False, the initial state distribution of the behavior policy is used instead.

Note that this parameter is applicable only when self.env is given.
q_function_method ({"fqe", "dice_q", "mql"}, default="fqe") – Method to estimate \(Q(s, a)\).
v_function_method ({"fqe", "dice_q", "dice_v", "mql", "mvl"}, default="fqe") – Method to estimate \(V(s)\).
w_function_method ({"dice", "mwl"}, default="dice") – Method to estimate \(w(s, a)\) and \(w(s)\).
k_fold (int, default=1 (> 0)) –
Number of folds to perform cross-fitting.

If \(K>1\), we split the logged dataset into \(K\) folds. \(\mathcal{D}_j\) is the \(j\)-th split of logged data consisting of \(n_j\) samples. Then, the value and weight functions (\(Q^j\) and \(w^j\)) are trained on the data outside the \(j\)-th split, i.e., \(\mathcal{D} \setminus \mathcal{D}_j\), and are used for OPE on \(\mathcal{D}_j\).

If \(K=1\), the value and weight functions are trained on the entire data.
n_steps (int, default=10000 (> 0)) – Number of gradient steps to fit weight and value learning methods.
n_trajectories_on_policy_evaluation (int, default=100 (> 0)) – Number of episodes to perform on-policy evaluation.
use_stationary_distribution_on_policy_evaluation (bool, default=False) – Whether to evaluate a policy based on stationary distribution. If True, evaluation policy is evaluated by rollout without resetting environment at each episode.
minimum_rollout_length (int, default=0 (>= 0)) – Minimum length of rollout to collect initial state.
maximum_rollout_length (int, default=100 (>= minimum_rollout_length)) – Maximum length of rollout to collect initial state.
random_state (int, default=None (>= 0)) – Random state.
path (str, default="input_dict/") – Path to the directory. Either absolute or relative path is acceptable.
save_relative_path (bool, default=False) –
Whether to save a relative path. If True, a path relative to the scope-rl directory will be saved. If False, the absolute path will be saved.

Note that this option was added in order to run examples in the documentation properly. Otherwise, the default setting (False) is recommended.
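
The layouts of evaluation_policies described in the tip above can be sketched with plain strings standing in for BaseHead instances. The behavior policy names and policy labels below are made up for illustration, and the nested layout in the last case is inferred from its key, [behavior_policy_name][dataset_id].

```python
# Strings stand in for BaseHead instances; all names are illustrative.
ddqn_, random_ = "ddqn", "random"

# Same evaluation policies for every behavior policy and dataset_id.
shared = [ddqn_, random_]

# Different evaluation policies per behavior policy, shared across dataset_ids.
per_behavior_policy = {
    "behavior_a": [ddqn_],
    "behavior_b": [ddqn_, random_],
}

# Different evaluation policies per behavior policy and per dataset_id.
per_behavior_policy_and_dataset = {
    "behavior_a": [[ddqn_], [random_]],         # dataset_id 0, 1
    "behavior_b": [[ddqn_, random_], [ddqn_]],  # dataset_id 0, 1
}

# Lookup mirrors the documented key: [behavior_policy_name][dataset_id].
print(per_behavior_policy_and_dataset["behavior_b"][0])  # ['ddqn', 'random']
```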

Returns:

input_dicts – MultipleInputDict is an instance containing (multiple) input dictionaries for OPE.

Each input dict is accessible by the following command.

input_dict_ = input_dict.get(behavior_policy_name=behavior_policy.name, dataset_id=0)

Each input dict consists of the following.

key: [evaluation_policy_name][
    evaluation_policy_action,
    evaluation_policy_action_dist,
    state_action_value_prediction,
    initial_state_value_prediction,
    state_action_marginal_importance_weight,
    state_marginal_importance_weight,
    on_policy_policy_value,
    gamma,
    behavior_policy,
    evaluation_policy,
    dataset_id,
]

evaluation_policy_action: ndarray of shape (n_trajectories * step_per_trajectory, action_dim)

Action chosen by the deterministic evaluation policy. If action_type is “discrete”, None is recorded.

evaluation_policy_action_dist: ndarray of shape (n_trajectories * step_per_trajectory, n_actions)

Conditional action distribution induced by the evaluation policy, i.e., \(\pi(a \mid s_t) \forall a \in \mathcal{A}\). If action_type is “continuous”, None is recorded.

state_action_value_prediction: ndarray

If action_type is “discrete”, \(\hat{Q}\) for all actions, i.e., \(\hat{Q}(s_t, a) \forall a \in \mathcal{A}\). shape (n_trajectories * step_per_trajectory, n_actions)

If action_type is “continuous”, \(\hat{Q}\) for the observed action and that chosen by the evaluation policy, i.e., (column 0) \(\hat{Q}(s_t, a_t)\) and (column 1) \(\hat{Q}(s_t, \pi(s_t))\). shape (n_trajectories * step_per_trajectory, 2)

If require_value_prediction is False, None is recorded.

initial_state_value_prediction: ndarray of shape (n_trajectories, )

Estimated initial state value.

If require_value_prediction is False, None is recorded.

state_action_marginal_importance_weight: ndarray of shape (n_trajectories * step_per_trajectory, )

Estimated state-action marginal importance weight, i.e., \(\hat{w}(s_t, a_t) \approx d^{\pi}(s_t, a_t) / d^{\pi_b}(s_t, a_t)\).

If require_weight_prediction is False, None is recorded.

state_marginal_importance_weight: ndarray of shape (n_trajectories * step_per_trajectory, )

Estimated state marginal importance weight, i.e., \(\hat{w}(s_t) \approx d^{\pi}(s_t) / d^{\pi_b}(s_t)\).

If require_weight_prediction is False, None is recorded.

on_policy_policy_value: ndarray of shape (n_trajectories_on_policy_evaluation, )

On-policy policy value. If self.env is None, None is recorded.

gamma: float

Discount factor.

behavior_policy: str

Name of the behavior policy.

evaluation_policy: str

Name of the evaluation policy.

dataset_id: int

Id of the logged dataset.

Return type:

OPEInputDict or MultipleInputDict
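
Given the layout above, a quick sanity check is to average on_policy_policy_value per evaluation policy. The sketch uses a hand-made dictionary with a subset of the documented keys and made-up values, so it runs without SCOPE-RL. summarize_on_policy_values is a hypothetical helper.

```python
def summarize_on_policy_values(input_dict):
    """Map each evaluation policy name to the mean of its on-policy returns.

    Policies whose `on_policy_policy_value` is None (no environment) are skipped.
    """
    summary = {}
    for policy_name, entry in input_dict.items():
        values = entry["on_policy_policy_value"]
        if values is not None:
            summary[policy_name] = sum(values) / len(values)
    return summary


# Hand-made stand-in for one input dict; values are illustrative only.
input_dict = {
    "ddqn": {"on_policy_policy_value": [8.0, 10.0, 9.0], "gamma": 1.0},
    "random": {"on_policy_policy_value": [9.0, 7.0, 5.0], "gamma": 1.0},
    "no_env": {"on_policy_policy_value": None, "gamma": 1.0},
}

print(summarize_on_policy_values(input_dict))  # {'ddqn': 9.0, 'random': 7.0}
```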