scope_rl.ope.weight_value_learning.augmented_lagrangian_learning_discrete.DiscreteDiceStateActionWightValueLearning

class scope_rl.ope.weight_value_learning.augmented_lagrangian_learning_discrete.DiscreteDiceStateActionWightValueLearning(q_function, w_function, gamma=1.0, bandwidth=1.0, state_scaler=None, method='best_dice', batch_size=128, q_lr=0.0001, w_lr=0.0001, lambda_lr=0.0001, alpha_q=None, alpha_w=None, alpha_r=None, enable_lambda=None, device='cuda:0')

Augmented Lagrangian method for weight/value function of marginal OPE estimators (for discrete action space).

Bases: scope_rl.ope.weight_value_learning.BaseWeightValueLearner

Imported as: scope_rl.ope.weight_value_learning.DiscreteAugmentedLagrangianStateActionWightValueLearning

Note

The augmented Lagrangian method (ALM) simultaneously learns the weight and value functions using the Lagrangian relaxation of the primal-dual problem of weight/value learning, as follows (see Yang et al., 2020 for the underlying theory):

\[\max_{w \geq 0} \min_{Q, \lambda} L(w, Q, \lambda)\]

where

\[\begin{split}L(w, Q, \lambda) &:= (1 - \gamma) \mathbb{E}_{s_0 \sim d(s_0), a_0 \sim \pi(s_0)} [Q(s_0, a_0)] + \lambda \\ & \quad \quad + \mathbb{E}_{(s_t, a_t, r_t, s_{t+1}) \sim d^{\pi_b}, a_{t+1} \sim \pi(a_{t+1} | s_{t+1})} [w(s_t, a_t) (\alpha_r r_t + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) - \lambda)] \\ & \quad \quad + \alpha_Q \mathbb{E}_{(s_t, a_t) \sim d^{\pi_b}} [Q^2(s_t, a_t)] - \alpha_w \mathbb{E}_{(s_t, a_t) \sim d^{\pi_b}} [w^2(s_t, a_t)]\end{split}\]

where \(Q(s_t, a_t)\) is the Q-function and \(w(s_t, a_t) \approx d^{\pi}(s_t, a_t) / d^{\pi_b}(s_t, a_t)\) is the state-action marginal importance weight.
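As a rough illustration of the objective above, the following sketch evaluates a sample-mean estimate of \(L(w, Q, \lambda)\) with NumPy. The function name `empirical_lagrangian` and its array arguments are hypothetical; it assumes the Q-values and weights have already been evaluated on a batch, and it is not the library's training code.

```python
import numpy as np


def empirical_lagrangian(q0, q, q_next, w, reward, lambda_, gamma,
                         alpha_q, alpha_w, alpha_r):
    """Sample-mean estimate of L(w, Q, lambda) (hypothetical helper).

    q0     : Q(s_0, a_0), with a_0 drawn from the evaluation policy
    q      : Q(s_t, a_t) on logged transitions
    q_next : Q(s_{t+1}, a_{t+1}), with a_{t+1} drawn from the evaluation policy
    w      : w(s_t, a_t) on the same logged transitions
    """
    # (1 - gamma) * E[Q(s_0, a_0)] + lambda
    initial_term = (1.0 - gamma) * q0.mean() + lambda_
    # E[w * (alpha_r * r + gamma * Q' - Q - lambda)]
    residual = alpha_r * reward + gamma * q_next - q - lambda_
    weighted_term = (w * residual).mean()
    # alpha_Q * E[Q^2] - alpha_w * E[w^2]
    regularizer = alpha_q * (q ** 2).mean() - alpha_w * (w ** 2).mean()
    return initial_term + weighted_term + regularizer


# Toy batch using the BestDICE-style coefficients listed in this section.
value = empirical_lagrangian(
    q0=np.array([1.0, 1.0]),
    q=np.array([1.0, 2.0]),
    q_next=np.array([2.0, 1.0]),
    w=np.array([1.0, 1.0]),
    reward=np.array([0.5, 0.5]),
    lambda_=0.0, gamma=0.5,
    alpha_q=0.0, alpha_w=1.0, alpha_r=1.0,
)
```

In an actual learner, this quantity would be maximized over \(w\) and minimized over \(Q\) and \(\lambda\) with alternating gradient steps.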

Special cases of this objective recover the following estimators.

  • DualDICE (Nachum et al., 2019): \(\alpha_Q = 1, \alpha_w = 0, \alpha_r = 0, \lambda = 0\)

  • GenDICE (Zhang et al., 2020), GradientDICE (Zhang et al., 2020): \(\alpha_Q = 1, \alpha_w = 0, \alpha_r = 0\)

  • AlgaeDICE (Nachum et al., 2019): \(\alpha_Q = 0, \alpha_w = 1, \alpha_r = 1, \lambda = 0\)

  • BestDICE (Yang et al., 2020): \(\alpha_Q = 0, \alpha_w = 1, \alpha_r = 1\)

  • Minimax Q Learning (MQL) (Uehara et al., 2020): \(\alpha_Q = 0, \alpha_w = 0, \alpha_r = 1, \lambda = 0\)

  • Minimax Weight Learning (MWL) (Uehara et al., 2020): \(\alpha_Q = 0, \alpha_w = 0, \alpha_r = 0, \lambda = 0\)
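The special cases above can be written down as a lookup table keyed by the `method` strings. The sketch below is only a transcription of the list: `DICE_PRESETS` and `resolve_preset` are hypothetical names, and reading "no \(\lambda = 0\) constraint" as `enable_lambda=True` is an assumption based on the `enable_lambda` parameter description. The library's internal representation may differ.

```python
# Hypothetical lookup table transcribing the special cases listed above;
# it is not the library's internal data structure.
DICE_PRESETS = {
    "dual_dice": dict(alpha_q=1.0, alpha_w=0.0, alpha_r=False, enable_lambda=False),
    # GenDICE and GradientDICE share one coefficient set in the list above.
    "gen_dice": dict(alpha_q=1.0, alpha_w=0.0, alpha_r=False, enable_lambda=True),
    "algae_dice": dict(alpha_q=0.0, alpha_w=1.0, alpha_r=True, enable_lambda=False),
    "best_dice": dict(alpha_q=0.0, alpha_w=1.0, alpha_r=True, enable_lambda=True),
    "mql": dict(alpha_q=0.0, alpha_w=0.0, alpha_r=True, enable_lambda=False),
    "mwl": dict(alpha_q=0.0, alpha_w=0.0, alpha_r=False, enable_lambda=False),
}

REQUIRED_KEYS = ("alpha_q", "alpha_w", "alpha_r", "enable_lambda")


def resolve_preset(method, **custom):
    """Return the coefficient set for `method` (hypothetical helper)."""
    if method == "custom":
        # With "custom", every coefficient must be supplied by the caller.
        missing = [k for k in REQUIRED_KEYS if custom.get(k) is None]
        if missing:
            raise ValueError(f"custom method requires: {missing}")
        return {k: custom[k] for k in REQUIRED_KEYS}
    if method not in DICE_PRESETS:
        raise ValueError(f"unknown method: {method!r}")
    return dict(DICE_PRESETS[method])


best = resolve_preset("best_dice")
```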

The advantage of ALM is that it learns the Q-function and the weight function simultaneously in an adversarial manner. However, because the ALM objective is not convex, training can be unstable.

Note

The positivity constraint \(w \geq 0\) should be imposed in the function approximation model.
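One common way to impose such a constraint is to pass the model's unconstrained output through a non-negative activation such as softplus. The sketch below shows the idea in NumPy; whether the weight function classes of this library use softplus or another activation is not stated here, so treat the choice as an assumption.

```python
import numpy as np


def softplus(x):
    """Numerically stable log(1 + exp(x)); the output is never negative."""
    return np.logaddexp(0.0, x)


# Unconstrained outputs of a hypothetical weight model.
raw_output = np.array([-5.0, 0.0, 3.0])

# Non-negative weights satisfying w >= 0.
w = softplus(raw_output)
```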

Parameters:
  • q_function (DiscreteQFunction) – Q-function model.

  • w_function (DiscreteStateActionWeightFunction) – Weight function model.

  • gamma (float, default=1.0) – Discount factor. The value should be within (0, 1].

  • bandwidth (float, default=1.0 (> 0)) – Bandwidth hyperparameter of the Gaussian kernel. (This is for API consistency)

  • state_scaler (d3rlpy.preprocessing.Scaler, default=None) – Scaling factor of state.

  • method ({"dual_dice", "gen_dice", "algae_dice", "best_dice", "mql", "mwl", "custom"}, default="best_dice") – Indicates which parameter set should be used. When “custom” is given, users can specify their own parameters.

  • batch_size (int, default=128 (> 0)) – Batch size.

  • q_lr (float, default=1e-4 (> 0)) – Learning rate of q_function.

  • w_lr (float, default=1e-4 (> 0)) – Learning rate of w_function.

  • lambda_lr (float, default=1e-4 (> 0)) – Learning rate of lambda_.

  • alpha_q (float, default=None (>= 0)) – Regularization coefficient of the Q-function. A value should be given if method is “custom”.

  • alpha_w (float, default=None (>= 0)) – Regularization coefficient of the weight function. A value should be given if method is “custom”.

  • alpha_r (bool, default=None) – Whether to consider the reward observation. A value should be given if method is “custom”.

  • enable_lambda (bool, default=None) – Whether to optimize \(\lambda\). If False, \(\lambda\) is automatically set to zero. A boolean value should be given if method is “custom”.

  • device (str, default="cuda:0") – Specifies device used for torch.

References

Masatoshi Uehara, Jiawei Huang, and Nan Jiang. “Minimax Weight and Q-Function Learning for Off-Policy Evaluation.” 2020.

Mengjiao Yang, Ofir Nachum, Bo Dai, Lihong Li, and Dale Schuurmans. “Off-Policy Evaluation via the Regularized Lagrangian.” 2020.

Shangtong Zhang, Bo Liu, and Shimon Whiteson. “GradientDICE: Rethinking Generalized Offline Estimation of Stationary Values.” 2020.

Ruiyi Zhang, Bo Dai, Lihong Li, and Dale Schuurmans. “GenDICE: Generalized Offline Estimation of Stationary Values.” 2020.

Ofir Nachum, Bo Dai, Ilya Kostrikov, Yinlam Chow, Lihong Li, and Dale Schuurmans. “AlgaeDICE: Policy Gradient from Arbitrary Experience.” 2019.

Ofir Nachum, Yinlam Chow, Bo Dai, and Lihong Li. “DualDICE: Behavior-Agnostic Estimation of Discounted Stationary Distribution Corrections.” 2019.

Attributes:
alpha_q
alpha_r
alpha_w
enable_lambda
state_scaler

Methods

fit(step_per_trajectory, state, action, ...)

Fit value and weight functions.

fit_predict(step_per_trajectory, state, ...)

Fit and predict value/weight functions.

load(path_q, path_w)

Load models.

predict(state, action)

Predict Q value and state-action marginal importance weight.

predict_q_function(state, action)

Predict Q-function.

predict_q_function_for_all_actions(state)

Predict Q-function for all actions.

predict_v_function(state, ...)

Predict V function.

predict_value(state, action)

Predict Q-function.

predict_weight(state, action)

Predict state-action marginal importance weight.

save(path_q, path_w)

Save models.

load(path_q, path_w)

Load models.

save(path_q, path_w)

Save models.

fit(step_per_trajectory, state, action, reward, evaluation_policy_action_dist, n_steps=10000, n_steps_per_epoch=10000, random_state=None, **kwargs)

Fit value and weight functions.

Parameters:
  • step_per_trajectory (int (> 0)) – Number of timesteps in an episode.

  • state (array-like of shape (n_trajectories * step_per_trajectory, state_dim)) – State observed by the behavior policy.

  • action (array-like of shape (n_trajectories * step_per_trajectory)) – Action chosen by the behavior policy.

  • reward (array-like of shape (n_trajectories * step_per_trajectory)) – Reward observed for each (state, action) pair.

  • evaluation_policy_action_dist (array-like of shape (n_trajectories * step_per_trajectory, n_actions)) – Conditional action distribution induced by the evaluation policy, i.e., \(\pi(a \mid s_t) \forall a \in \mathcal{A}\)

  • n_steps (int, default=10000 (> 0)) – Number of gradient steps.

  • n_steps_per_epoch (int, default=10000 (> 0)) – Number of gradient steps in an epoch.

  • random_state (int, default=None (>= 0)) – Random state.
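The parameter shapes above are flattened over trajectories and timesteps. As a sketch, per-trajectory arrays can be brought into that layout with a C-order reshape. The array names and sizes below are made up for illustration, and the trajectory-major row order (row `i * step_per_trajectory + t` holds step `t` of trajectory `i`) is an assumption about the expected layout.

```python
import numpy as np

# Hypothetical sizes for illustration only.
n_trajectories, step_per_trajectory, state_dim, n_actions = 3, 4, 2, 5
rng = np.random.default_rng(0)

# Logged data organized per trajectory.
state_3d = rng.normal(size=(n_trajectories, step_per_trajectory, state_dim))
action_2d = rng.integers(n_actions, size=(n_trajectories, step_per_trajectory))
reward_2d = rng.normal(size=(n_trajectories, step_per_trajectory))
pi_3d = rng.dirichlet(np.ones(n_actions), size=(n_trajectories, step_per_trajectory))

# Flatten to the (n_trajectories * step_per_trajectory, ...) shapes listed above.
state = state_3d.reshape(-1, state_dim)
action = action_2d.reshape(-1)
reward = reward_2d.reshape(-1)
evaluation_policy_action_dist = pi_3d.reshape(-1, n_actions)
```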

predict_q_function_for_all_actions(state)

Predict Q-function for all actions.

Parameters:

state (array-like of shape (n_trajectories * step_per_trajectory, state_dim)) – State observed by the behavior policy.

Returns:

q_value – Q value of each (state, action) pair.

Return type:

ndarray of shape (n_trajectories * step_per_trajectory, n_actions)

predict_q_function(state, action)

Predict Q-function.

Parameters:
  • state (array-like of shape (n_trajectories * step_per_trajectory, state_dim)) – State observed by the behavior policy.

  • action (array-like of shape (n_trajectories * step_per_trajectory)) – Action chosen by the behavior policy.

Returns:

q_value – Q value of each (state, action) pair.

Return type:

ndarray of shape (n_trajectories * step_per_trajectory)
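The per-action output of `predict_q_function_for_all_actions` and the per-pair output of `predict_q_function` are related by indexing each row with the chosen action. The sketch below shows that relationship with made-up numbers; it illustrates the shapes documented here and is not taken from the library's source.

```python
import numpy as np

# Hypothetical Q-values for 3 states and 2 actions,
# shaped like the output of predict_q_function_for_all_actions.
q_all = np.array([
    [0.1, 0.9],
    [0.4, 0.2],
    [0.7, 0.3],
])

# Action taken at each row.
action = np.array([1, 0, 0])

# Pick Q(s_t, a_t) for each row via integer array indexing.
q_value = q_all[np.arange(len(action)), action]
```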

predict_v_function(state, evaluation_policy_action_dist)

Predict V function.

Parameters:
  • state (array-like of shape (n_trajectories * step_per_trajectory, state_dim)) – State observed by the behavior policy.

  • evaluation_policy_action_dist (array-like of shape (n_trajectories * step_per_trajectory, n_actions)) – Conditional action distribution induced by the evaluation policy, i.e., \(\pi(a \mid s_t) \forall a \in \mathcal{A}\)

Returns:

v_value – State value of each state under the evaluation policy.

Return type:

ndarray of shape (n_trajectories * step_per_trajectory)
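For a discrete action space, the state value under the evaluation policy is the policy-weighted average of the Q-values, \(V(s_t) = \sum_{a} \pi(a \mid s_t) Q(s_t, a)\). The sketch below computes it with made-up numbers; that the method computes its output in exactly this way is an assumption.

```python
import numpy as np

# Hypothetical Q-values for 2 states and 2 actions.
q_all = np.array([
    [1.0, 3.0],
    [2.0, 0.0],
])

# Evaluation policy's action distribution for the same states (rows sum to 1).
evaluation_policy_action_dist = np.array([
    [0.5, 0.5],
    [0.25, 0.75],
])

# V(s) = sum_a pi(a | s) * Q(s, a)
v_value = (evaluation_policy_action_dist * q_all).sum(axis=1)
```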

predict_value(state, action)

Predict Q-function.

Parameters:
  • state (array-like of shape (n_trajectories * step_per_trajectory, state_dim)) – State observed by the behavior policy.

  • action (array-like of shape (n_trajectories * step_per_trajectory)) – Action chosen by the behavior policy.

Returns:

q_value – Q value of each (state, action) pair.

Return type:

ndarray of shape (n_trajectories * step_per_trajectory)

predict_weight(state, action)

Predict state-action marginal importance weight.

Parameters:
  • state (array-like of shape (n_trajectories * step_per_trajectory, state_dim)) – State observed by the behavior policy.

  • action (array-like of shape (n_trajectories * step_per_trajectory)) – Action chosen by the behavior policy.

Returns:

w_hat – Estimated state-action marginal importance weight.

Return type:

ndarray of shape (n_trajectories * step_per_trajectory)

predict(state, action)

Predict Q value and state-action marginal importance weight.

Parameters:
  • state (array-like of shape (n_trajectories * step_per_trajectory, state_dim)) – State observed by the behavior policy.

  • action (array-like of shape (n_trajectories * step_per_trajectory)) – Action chosen by the behavior policy.

Returns:

  • q_value (ndarray of shape (n_trajectories * step_per_trajectory)) – Q value of each (state, action) pair.

  • w_hat (ndarray of shape (n_trajectories * step_per_trajectory)) – Estimated state-action marginal importance weight.
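As a sketch of how the flattened `w_hat` might be consumed downstream, the example below reshapes it per trajectory and forms a state-action marginal importance sampling estimate of the policy value. The helper `marginal_is_estimate` is hypothetical, the estimator form is one common choice rather than something defined in this section, and the trajectory-major row order is an assumption.

```python
import numpy as np


def marginal_is_estimate(w_hat, reward, step_per_trajectory, gamma):
    """Average over trajectories of sum_t gamma^t * w_t * r_t (hypothetical helper)."""
    # Recover the per-trajectory layout from the flattened arrays.
    w = np.asarray(w_hat).reshape(-1, step_per_trajectory)
    r = np.asarray(reward).reshape(-1, step_per_trajectory)
    # Discount factor gamma^t for each timestep.
    discount = gamma ** np.arange(step_per_trajectory)
    return float((discount * w * r).sum(axis=1).mean())


# Two toy trajectories of two steps each.
estimate = marginal_is_estimate(
    w_hat=[1.0, 2.0, 0.5, 1.0],
    reward=[1.0, 1.0, 2.0, 0.0],
    step_per_trajectory=2,
    gamma=0.5,
)
```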

fit_predict(step_per_trajectory, state, action, reward, evaluation_policy_action_dist, n_steps=10000, n_steps_per_epoch=10000, random_state=None, **kwargs)

Fit and predict value/weight functions.

Parameters:
  • step_per_trajectory (int (> 0)) – Number of timesteps in an episode.

  • state (array-like of shape (n_trajectories * step_per_trajectory, state_dim)) – State observed by the behavior policy.

  • action (array-like of shape (n_trajectories * step_per_trajectory)) – Action chosen by the behavior policy.

  • reward (array-like of shape (n_trajectories * step_per_trajectory)) – Reward observed for each (state, action) pair.

  • evaluation_policy_action_dist (array-like of shape (n_trajectories * step_per_trajectory, n_actions)) – Conditional action distribution induced by the evaluation policy, i.e., \(\pi(a \mid s_t) \forall a \in \mathcal{A}\)

  • method ({"dual_dice", "gen_dice", "algae_dice", "best_dice", "mql", "mwl", "custom"}, default="best_dice") – Indicates which parameter set should be used. When “custom” is given, users can specify their own parameters.

  • n_steps (int, default=10000 (> 0)) – Number of gradient steps.

  • n_steps_per_epoch (int, default=10000 (> 0)) – Number of gradient steps in an epoch.

  • random_state (int, default=None (>= 0)) – Random state.

Returns:

  • q_value (ndarray of shape (n_trajectories * step_per_trajectory)) – Q value of each (state, action) pair.

  • w_hat (ndarray of shape (n_trajectories * step_per_trajectory)) – Estimated state-action marginal importance weight.
