scope_rl.ope.weight_value_learning.augmented_lagrangian_learning_discrete.DiscreteDiceStateWightValueLearning

class scope_rl.ope.weight_value_learning.augmented_lagrangian_learning_discrete.DiscreteDiceStateWightValueLearning(v_function, w_function, gamma=1.0, bandwidth=1.0, state_scaler=None, method='best_dice', batch_size=128, v_lr=0.0001, w_lr=0.0001, lambda_lr=0.0001, alpha_v=None, alpha_w=None, alpha_r=None, enable_lambda=None, device='cuda:0')

Augmented Lagrangian method for weight/value function of marginal OPE estimators (for discrete action space).

Bases: scope_rl.ope.weight_value_learning.BaseWeightValueLearner

Imported as: scope_rl.ope.weight_value_learning.DiscreteAugmentedLagrangianStateWightValueLearning

Note

The augmented Lagrangian method (ALM) simultaneously learns the weight and value functions using the Lagrangian relaxation of the primal-dual problem of weight/value learning (see Yang et al., 2020 for the underlying theory).

This class learns the V-function and the state marginal importance weight, rather than the Q-function and the state-action marginal importance weight, by solving

\[\max_{w \geq 0} \min_{V, \lambda} L(w, V, \lambda)\]

where

\[\begin{split}L(w, V, \lambda) &:= (1 - \gamma) \mathbb{E}_{s_0 \sim d(s_0)} [V(s_0)] + \lambda \\ & \quad \quad + \mathbb{E}_{(s_t, a_t, r_t, s_{t+1}) \sim d^{\pi_b}} [w_s(s_t) w_a(s_t, a_t) (\alpha_r r_t + \gamma V(s_{t+1}) - V(s_t) - \lambda)] \\ & \quad \quad + \alpha_V \mathbb{E}_{s_t \sim d^{\pi_b}} [V^2(s_t)] - \alpha_w \mathbb{E}_{s_t \sim d^{\pi_b}} [w_s^2(s_t)]\end{split}\]

Here, \(V(s_t)\) is the V-function, \(w_s(s_t) \approx d^{\pi}(s_t) / d^{\pi_b}(s_t)\) is the state marginal importance weight, and \(w_a(s_t, a_t) = \pi(a_t | s_t) / \pi_b(a_t | s_t)\) is the immediate importance weight.
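
As a reading aid, the objective can be evaluated on a batch once the model outputs are known. The following is a minimal sketch, not the library's implementation: the function name `lagrangian_estimate` and the dictionary keys are hypothetical, and the model outputs are passed in as plain numbers rather than computed by neural networks.

```python
def lagrangian_estimate(v0, batch, lam, gamma, alpha_v, alpha_w, alpha_r):
    """Monte Carlo estimate of L(w, V, lambda) from precomputed model outputs.

    v0    : list of V(s_0) values for sampled initial states
    batch : list of transitions, each a dict with keys
            w_s (state weight), w_a (immediate weight), r (reward),
            v (V(s_t)) and v_next (V(s_{t+1}))
    """
    n = len(batch)
    # (1 - gamma) * E[V(s_0)] + lambda
    init_term = (1.0 - gamma) * sum(v0) / len(v0) + lam
    # E[w_s * w_a * (alpha_r * r + gamma * V(s') - V(s) - lambda)]
    td_term = sum(
        t["w_s"] * t["w_a"]
        * (alpha_r * t["r"] + gamma * t["v_next"] - t["v"] - lam)
        for t in batch
    ) / n
    # Regularizers: + alpha_V * E[V^2] and - alpha_w * E[w_s^2]
    v_reg = alpha_v * sum(t["v"] ** 2 for t in batch) / n
    w_reg = alpha_w * sum(t["w_s"] ** 2 for t in batch) / n
    return init_term + td_term + v_reg - w_reg


# Toy numbers chosen only for illustration.
batch = [
    {"w_s": 1.0, "w_a": 2.0, "r": 1.0, "v": 0.5, "v_next": 0.0},
    {"w_s": 0.5, "w_a": 1.0, "r": 0.0, "v": 1.0, "v_next": 0.5},
]
value = lagrangian_estimate(
    v0=[0.5, 1.0], batch=batch, lam=0.0, gamma=0.5,
    alpha_v=0.0, alpha_w=1.0, alpha_r=1.0,
)
```

During training, the weight function would ascend this quantity while the V-function and \(\lambda\) descend it.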

This estimator is analogous to the following estimators in its special cases, although those estimators use the Q-function and the state-action marginal importance weight; accordingly, \(\alpha_Q\) in the list plays the role of \(\alpha_V\) in the objective.

DualDICE (Nachum et al., 2019): \(\alpha_Q = 1, \alpha_w = 0, \alpha_r = 0, \lambda = 0\).
GenDICE (Zhang et al., 2020), GradientDICE (Zhang et al., 2020): \(\alpha_Q = 1, \alpha_w = 0, \alpha_r = 0\).
AlgaeDICE (Nachum et al., 2019): \(\alpha_Q = 0, \alpha_w = 1, \alpha_r = 1, \lambda = 0\).
BestDICE (Yang et al., 2020): \(\alpha_Q = 0, \alpha_w = 1, \alpha_r = 1\).
Minimax Value Learning (MVL) (Uehara and Jiang, 2019): \(\alpha_Q = 0, \alpha_w = 0, \alpha_r = 1, \lambda = 0\).
Minimax Weight Learning (MWL) (Uehara and Jiang, 2019): \(\alpha_Q = 0, \alpha_w = 0, \alpha_r = 0, \lambda = 0\).
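
The special cases listed here can be read as a lookup table from a `method` name to the four hyperparameters. The sketch below is illustrative only: `PRESETS` and `preset_for` are hypothetical names, `alpha_v` stands in for \(\alpha_Q\), and `enable_lambda=False` is used to encode \(\lambda = 0\). How the library stores these values internally is not shown in this page.

```python
# Hypothetical lookup table built from the special cases listed above.
# alpha_v takes the role of alpha_Q; enable_lambda=False encodes "lambda = 0".
PRESETS = {
    "dual_dice": dict(alpha_v=1.0, alpha_w=0.0, alpha_r=False, enable_lambda=False),
    "gen_dice": dict(alpha_v=1.0, alpha_w=0.0, alpha_r=False, enable_lambda=True),
    "algae_dice": dict(alpha_v=0.0, alpha_w=1.0, alpha_r=True, enable_lambda=False),
    "best_dice": dict(alpha_v=0.0, alpha_w=1.0, alpha_r=True, enable_lambda=True),
    "mvl": dict(alpha_v=0.0, alpha_w=0.0, alpha_r=True, enable_lambda=False),
    "mwl": dict(alpha_v=0.0, alpha_w=0.0, alpha_r=False, enable_lambda=False),
}


def preset_for(method):
    """Return a copy of the hyperparameters for a named special case."""
    if method not in PRESETS:
        raise ValueError(f"no preset for method={method!r}; supply the values yourself")
    return dict(PRESETS[method])


best = preset_for("best_dice")
```

Note that BestDICE and AlgaeDICE differ only in whether \(\lambda\) is optimized, as do GenDICE and DualDICE.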

ALM is beneficial in that it learns both the V-function and the weight function simultaneously in an adversarial manner. However, because the ALM objective is not convex, training may be unstable.

Note

The positivity constraint \(w \geq 0\) should be imposed in the function approximation model.
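
One common way to build such a constraint into a model is to pass the raw output through a non-negative function such as softplus. The sketch below shows the idea in plain Python; it is a generic technique and an assumption here, not a statement about how `StateWeightFunction` is implemented.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x)); the result is never negative."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


# Raw (unconstrained) outputs of a hypothetical weight model.
raw_outputs = [-3.0, 0.0, 2.5]
weights = [softplus(z) for z in raw_outputs]
```

Squaring or exponentiating the raw output are alternatives with the same effect, at the cost of different gradient behavior.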

Parameters:

v_function (VFunction) – V function model.
w_function (StateWeightFunction) – Weight function model.
gamma (float, default=1.0) – Discount factor. The value should be within (0, 1].
bandwidth (float, default=1.0 (> 0)) – Bandwidth hyperparameter of the Gaussian kernel. (This argument exists for API consistency.)
state_scaler (d3rlpy.preprocessing.Scaler, default=None) – Scaling factor of state.
method ({"dual_dice", "gen_dice", "algae_dice", "best_dice", "mvl", "mwl", "custom"}, default="best_dice") – Indicates which parameter set should be used. When "custom", users can specify their own parameters.
batch_size (int, default=128 (> 0)) – Batch size.
v_lr (float, default=1e-4 (> 0)) – Learning rate of v_function.
w_lr (float, default=1e-4 (> 0)) – Learning rate of w_function.
lambda_lr (float, default=1e-4 (> 0)) – Learning rate of lambda_.
alpha_v (float, default=None (>= 0)) – Regularization coefficient of the V-function. A value should be given if method is “custom”.
alpha_w (float, default=None (>= 0)) – Regularization coefficient of the weight function. A value should be given if method is “custom”.
alpha_r (bool, default=None) – Whether to consider the reward observation. A value should be given if method is “custom”.
enable_lambda (bool, default=None) – Whether to optimize \(\lambda\). If False, \(\lambda\) is automatically set to zero. A boolean value should be given if method is “custom”.
device (str, default="cuda:0") – Specifies device used for torch.

References

Masatoshi Uehara, Jiawei Huang, and Nan Jiang. “Minimax Weight and Q-Function Learning for Off-Policy Evaluation.” 2020.

Mengjiao Yang, Ofir Nachum, Bo Dai, Lihong Li, and Dale Schuurmans. “Off-Policy Evaluation via the Regularized Lagrangian.” 2020.

Shangtong Zhang, Bo Liu, and Shimon Whiteson. “GradientDICE: Rethinking Generalized Offline Estimation of Stationary Values.” 2020.

Ruiyi Zhang, Bo Dai, Lihong Li, and Dale Schuurmans. “GenDICE: Generalized Offline Estimation of Stationary Values.” 2020.

Ofir Nachum, Bo Dai, Ilya Kostrikov, Yinlam Chow, Lihong Li, and Dale Schuurmans. “AlgaeDICE: Policy Gradient from Arbitrary Experience.” 2019.

Ofir Nachum, Yinlam Chow, Bo Dai, and Lihong Li. “DualDICE: Behavior-Agnostic Estimation of Discounted Stationary Distribution Corrections.” 2019.

Attributes:

alpha_r
alpha_v
alpha_w
enable_lambda
state_scaler

Methods

`fit`(step_per_trajectory, state, action, ...)	Fit value and weight functions.
`fit_predict`(step_per_trajectory, state, ...)	Fit and predict value/weight functions.
`load`(path_v, path_w)	Load models.
`predict`(state)	Predict V function and state marginal importance weight.
`predict_value`(state)	Predict V function.
`predict_weight`(state)	Predict state marginal importance weight.
`save`(path_v, path_w)	Save models.

load(path_v, path_w)

Load models.

save(path_v, path_w)

Save models.

fit(step_per_trajectory, state, action, reward, pscore, evaluation_policy_action_dist, n_steps=10000, n_steps_per_epoch=10000, random_state=None, **kwargs)

Fit value and weight functions.

Parameters:

step_per_trajectory (int (> 0)) – Number of timesteps in an episode.
state (array-like of shape (n_trajectories * step_per_trajectory, state_dim)) – State observed by the behavior policy.
action (array-like of shape (n_trajectories * step_per_trajectory)) – Action chosen by the behavior policy.
reward (array-like of shape (n_trajectories * step_per_trajectory)) – Reward observed for each (state, action) pair.
pscore (array-like of shape (n_trajectories * step_per_trajectory)) – Propensity of the observed action being chosen under the behavior policy (pscore stands for propensity score).
evaluation_policy_action_dist (array-like of shape (n_trajectories * step_per_trajectory, n_actions)) – Conditional action distribution induced by the evaluation policy, i.e., \(\pi(a \mid s_t) \forall a \in \mathcal{A}\)
n_steps (int, default=10000 (> 0)) – Number of gradient steps.
n_steps_per_epoch (int, default=10000 (> 0)) – Number of gradient steps in an epoch.
random_state (int, default=None (>= 0)) – Random state.
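
All array arguments share the same flattened first dimension of length `n_trajectories * step_per_trajectory`. The sketch below uses plain Python lists and toy numbers to show that layout and how the immediate importance weight follows from `pscore` and `evaluation_policy_action_dist`. It assumes rows are stored trajectory by trajectory (an assumption, not stated in this page), and none of the variable names beyond the documented arguments belong to the library.

```python
n_trajectories, step_per_trajectory, n_actions = 2, 3, 2
n_rows = n_trajectories * step_per_trajectory

# One entry per logged step (toy values).
action = [0, 1, 1, 0, 0, 1]
pscore = [0.5, 0.5, 0.25, 0.5, 0.5, 0.25]
evaluation_policy_action_dist = [
    [0.5, 0.5],
    [0.25, 0.75],
    [1.0, 0.0],
    [0.5, 0.5],
    [0.25, 0.75],
    [0.5, 0.5],
]

# Every argument must have the same number of rows.
assert len(action) == len(pscore) == len(evaluation_policy_action_dist) == n_rows
assert all(len(dist) == n_actions for dist in evaluation_policy_action_dist)

# Immediate importance weight: pi(a_t | s_t) / pi_b(a_t | s_t).
w_a = [
    dist[a] / p
    for dist, a, p in zip(evaluation_policy_action_dist, action, pscore)
]

# Under the trajectory-major assumption, these rows hold the initial states s_0.
initial_rows = [i * step_per_trajectory for i in range(n_trajectories)]
```

A row whose evaluation-policy probability is zero for the logged action, like the third one here, contributes a weight of zero.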

predict_value(state)

Predict V function.

Parameters:

state (array-like of shape (n_trajectories * step_per_trajectory, state_dim)) – State observed by the behavior policy.

Returns:

v_function (ndarray of shape (n_trajectories * step_per_trajectory)) – State value.

predict_weight(state)

Predict state marginal importance weight.

Parameters:

state (array-like of shape (n_trajectories * step_per_trajectory, state_dim)) – State observed by the behavior policy.

Returns:

w_hat (ndarray of shape (n_trajectories * step_per_trajectory)) – Estimated state marginal importance weight.
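
A simple sanity check on the predicted weights follows from the objective in the class note: differentiating \(L\) with respect to \(\lambda\) gives \(1 - \mathbb{E}[w_s(s_t) w_a(s_t, a_t)]\), so when \(\lambda\) is optimized the product of the two weights should average to roughly one over the logged data. The sketch below computes that gap for toy numbers; `normalization_gap` is a hypothetical helper, and a nonzero gap in practice only suggests that training has not converged.

```python
def normalization_gap(w_s, w_a):
    """Return mean(w_s * w_a) - 1; close to zero when the weights are normalized."""
    products = [s * a for s, a in zip(w_s, w_a)]
    return sum(products) / len(products) - 1.0


# Toy values standing in for predict_weight output and immediate weights.
w_hat = [0.5, 1.0, 2.0, 1.0]
w_a = [2.0, 1.0, 0.25, 1.5]
gap = normalization_gap(w_hat, w_a)
```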

predict(state)

Predict V function and state marginal importance weight.

Parameters:

state (array-like of shape (n_trajectories * step_per_trajectory, state_dim)) – State observed by the behavior policy.

Returns:

v_function (ndarray of shape (n_trajectories * step_per_trajectory)) – V function of each state.
w_hat (ndarray of shape (n_trajectories * step_per_trajectory)) – Estimated state marginal importance weight.

fit_predict(step_per_trajectory, state, action, reward, pscore, evaluation_policy_action_dist, n_steps=10000, n_steps_per_epoch=10000, random_state=None, **kwargs)

Fit and predict value/weight functions.

Parameters:

step_per_trajectory (int (> 0)) – Number of timesteps in an episode.
state (array-like of shape (n_trajectories * step_per_trajectory, state_dim)) – State observed by the behavior policy.
action (array-like of shape (n_trajectories * step_per_trajectory)) – Action chosen by the behavior policy.
reward (array-like of shape (n_trajectories * step_per_trajectory)) – Reward observed for each (state, action) pair.
pscore (array-like of shape (n_trajectories * step_per_trajectory)) – Propensity of the observed action being chosen under the behavior policy (pscore stands for propensity score).
evaluation_policy_action_dist (array-like of shape (n_trajectories * step_per_trajectory, n_actions)) – Conditional action distribution induced by the evaluation policy, i.e., \(\pi(a \mid s_t) \forall a \in \mathcal{A}\)
n_steps (int, default=10000 (> 0)) – Number of gradient steps.
n_steps_per_epoch (int, default=10000 (> 0)) – Number of gradient steps in an epoch.
random_state (int, default=None (>= 0)) – Random state.

Returns:

v_function (ndarray of shape (n_trajectories * step_per_trajectory)) – V function of each state.
w_hat (ndarray of shape (n_trajectories * step_per_trajectory)) – Estimated state marginal importance weight.
