scope_rl.ope.weight_value_learning.augmented_lagrangian_learning_continuous
Augmented Lagrangian method for weight/value function learning (continuous action cases).
Classes
ContinuousDiceStateActionWightValueLearning | Augmented Lagrangian method for weight/value function of marginal OPE estimators (for continuous action space).
ContinuousDiceStateWightValueLearning | Augmented Lagrangian method for weight/value function of marginal OPE estimators (for continuous action space).
- class scope_rl.ope.weight_value_learning.augmented_lagrangian_learning_continuous.ContinuousDiceStateActionWightValueLearning(q_function, w_function, gamma=1.0, bandwidth=1.0, state_scaler=None, action_scaler=None, method='best_dice', batch_size=128, q_lr=0.0001, w_lr=0.0001, lambda_lr=0.0001, alpha_q=None, alpha_w=None, alpha_r=None, enable_lambda=None, device='cuda:0')
Augmented Lagrangian method for weight/value function of marginal OPE estimators (for continuous action space).
Bases:
scope_rl.ope.weight_value_learning.BaseWeightValueLearner
Imported as:
scope_rl.ope.weight_value_learning.ContinuousDiceStateActionWightValueLearning
Note
The augmented Lagrangian method (ALM) simultaneously learns the weight and value functions using the Lagrangian relaxation of the primal-dual problem of weight/value learning, as follows (see Yang et al., 2020 for the underlying theory):
\[\max_{w \geq 0} \min_{Q, \lambda} L(w, Q, \lambda)\]where
\[\begin{split}L(w, Q, \lambda) &:= (1 - \gamma) \mathbb{E}_{s_0 \sim d(s_0), a_0 \sim \pi(s_0)} [Q(s_0, a_0)] + \lambda \\ & \quad \quad + \mathbb{E}_{(s_t, a_t, r_t, s_{t+1}) \sim d^{\pi_b}, a_{t+1} \sim \pi(s_{t+1})} [w(s_t, a_t) (\alpha_r r_t + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) - \lambda)] \\ & \quad \quad + \alpha_Q \mathbb{E}_{(s_t, a_t) \sim d^{\pi_b}} [Q^2(s_t, a_t)] - \alpha_w \mathbb{E}_{(s_t, a_t) \sim d^{\pi_b}} [w^2(s_t, a_t)]\end{split}\]where \(Q(s_t, a_t)\) is the Q-function, \(w(s_t, a_t) \approx d^{\pi}(s_t, a_t) / d^{\pi_b}(s_t, a_t)\) is the state-action marginal importance weight.
This estimator reduces to the following estimators as special cases:
DualDICE (Nachum et al., 2019): \(\alpha_Q = 1, \alpha_w = 0, \alpha_r = 0, \lambda = 0\)
GenDICE (Zhang et al., 2020), GradientDICE (Zhang et al., 2020): \(\alpha_Q = 1, \alpha_w = 0, \alpha_r = 0\)
AlgaeDICE (Nachum et al., 2019): \(\alpha_Q = 0, \alpha_w = 1, \alpha_r = 1, \lambda = 0\)
BestDICE (Yang et al., 2020): \(\alpha_Q = 0, \alpha_w = 1, \alpha_r = 1\)
Minimax Q Learning (MQL) (Uehara et al., 2020): \(\alpha_Q = 0, \alpha_w = 0, \alpha_r = 1, \lambda = 0\)
Minimax Weight Learning (MWL) (Uehara et al., 2020): \(\alpha_Q = 0, \alpha_w = 0, \alpha_r = 0, \lambda = 0\)
ALM is beneficial in that it learns the Q-function and the weight function simultaneously, in an adversarial manner. However, because the ALM objective is not convex, training may be unstable.
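To make the objective concrete, the following is a minimal sketch of a sample-average estimate of \(L(w, Q, \lambda)\) in plain Python. It is an illustration of the formula in this note, not the library's implementation; the function name `empirical_lagrangian`, the tuple layouts, and the toy data are all assumptions made for this example.

```python
from statistics import fmean


def empirical_lagrangian(q, w, lam, initial, transitions, gamma,
                         alpha_q, alpha_w, alpha_r):
    """Sample-average estimate of L(w, Q, lambda) (hypothetical helper).

    q, w        : callables (state, action) -> float
    initial     : list of (s0, a0), with a0 drawn from the evaluation policy
    transitions : list of (s, a, r, s_next, a_next), with a_next drawn from
                  the evaluation policy
    """
    # (1 - gamma) * E[Q(s_0, a_0)] + lambda
    initial_term = (1.0 - gamma) * fmean(q(s0, a0) for s0, a0 in initial) + lam
    # E[w(s, a) * (alpha_r * r + gamma * Q(s', a') - Q(s, a) - lambda)]
    bellman_term = fmean(
        w(s, a) * (alpha_r * r + gamma * q(s2, a2) - q(s, a) - lam)
        for s, a, r, s2, a2 in transitions
    )
    # alpha_Q * E[Q^2(s, a)] - alpha_w * E[w^2(s, a)]
    reg_term = fmean(
        alpha_q * q(s, a) ** 2 - alpha_w * w(s, a) ** 2
        for s, a, _, _, _ in transitions
    )
    return initial_term + bellman_term + reg_term


# Toy data: constant Q and w, two transitions, one initial state-action pair.
toy_value = empirical_lagrangian(
    q=lambda s, a: 1.0,
    w=lambda s, a: 1.0,
    lam=0.0,
    initial=[(0, 0)],
    transitions=[(0, 0, 1.0, 1, 0), (1, 0, 3.0, 0, 0)],
    gamma=0.9,
    alpha_q=0.0,
    alpha_w=1.0,
    alpha_r=1.0,
)
```

In training, the weight function would take gradient steps to increase this quantity while the Q-function and \(\lambda\) take steps to decrease it, which is the adversarial structure described above.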
Note
The positivity constraint \(w \geq 0\) should be imposed in the function approximation model.
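One common way to satisfy such a constraint is to pass the raw model output through a non-negative activation such as softplus. The sketch below shows the idea in plain Python; it is an assumption about how a user-defined weight model might be built, not a description of the weight function classes shipped with the library.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x)); the result is never negative."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def to_nonnegative_weights(raw_outputs):
    """Map unconstrained model outputs to non-negative importance weights."""
    return [softplus(x) for x in raw_outputs]


# Raw outputs of a hypothetical weight model, before the activation.
weights = to_nonnegative_weights([-1000.0, -2.0, 0.0, 3.0, 1000.0])
```

Squaring or exponentiating the output are alternative choices; softplus is used here only because it stays close to the identity for large positive inputs.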
- Parameters:
q_function (ContinuousQFunction) – Q-function model.
w_function (ContinuousStateActionWeightFunction) – Weight function model.
gamma (float, default=1.0) – Discount factor. The value should be within (0, 1].
bandwidth (float, default=1.0 (> 0)) – Bandwidth hyperparameter of the Gaussian kernel. (This is for API consistency)
state_scaler (d3rlpy.preprocessing.Scaler, default=None) – Scaling factor of state.
action_scaler (d3rlpy.preprocessing.ActionScaler, default=None) – Scaling factor of action.
method ({"dual_dice", "gen_dice", "algae_dice", "best_dice", "mql", "mwl", "custom"}, default="best_dice") – Indicates which parameter set should be used. When "custom" is given, users can specify their own parameters.
batch_size (int, default=128 (> 0)) – Batch size.
q_lr (float, default=1e-4 (> 0)) – Learning rate of q_function.
w_lr (float, default=1e-4 (> 0)) – Learning rate of w_function.
lambda_lr (float, default=1e-4 (> 0)) – Learning rate of lambda_.
alpha_q (float, default=None (>= 0)) – Regularization coefficient of the Q-function. A value should be given if method is “custom”.
alpha_w (float, default=None (>= 0)) – Regularization coefficient of the weight function. A value should be given if method is “custom”.
alpha_r (bool, default=None) – Whether to consider the reward observation. A value should be given if method is “custom”.
enable_lambda (bool, default=None) – Whether to optimize \(\lambda\). If False, \(\lambda\) is automatically set to zero. A boolean value should be given if method is “custom”.
device (str, default="cuda:0") – Specifies device used for torch.
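The relationship between `method` and the four hyperparameters can be summarized as a lookup table. The sketch below transcribes the special cases listed in the note above, reading "no \(\lambda = 0\) listed" as `enable_lambda=True`; the names `PRESETS` and `resolve_hyperparameters` are hypothetical, and the library's internal handling may differ.

```python
# Presets transcribed from the special cases in the note (illustrative only).
PRESETS = {
    "dual_dice":  dict(alpha_q=1.0, alpha_w=0.0, alpha_r=False, enable_lambda=False),
    "gen_dice":   dict(alpha_q=1.0, alpha_w=0.0, alpha_r=False, enable_lambda=True),
    "algae_dice": dict(alpha_q=0.0, alpha_w=1.0, alpha_r=True, enable_lambda=False),
    "best_dice":  dict(alpha_q=0.0, alpha_w=1.0, alpha_r=True, enable_lambda=True),
    "mql":        dict(alpha_q=0.0, alpha_w=0.0, alpha_r=True, enable_lambda=False),
    "mwl":        dict(alpha_q=0.0, alpha_w=0.0, alpha_r=False, enable_lambda=False),
}


def resolve_hyperparameters(method="best_dice", alpha_q=None, alpha_w=None,
                            alpha_r=None, enable_lambda=None):
    """Return the hyperparameters implied by `method` (hypothetical helper)."""
    if method == "custom":
        custom = dict(alpha_q=alpha_q, alpha_w=alpha_w,
                      alpha_r=alpha_r, enable_lambda=enable_lambda)
        missing = [name for name, value in custom.items() if value is None]
        if missing:
            raise ValueError(f"method='custom' requires values for: {missing}")
        return custom
    if method not in PRESETS:
        raise ValueError(f"unknown method: {method!r}")
    return dict(PRESETS[method])  # copy so callers cannot mutate the table


best_dice_params = resolve_hyperparameters("best_dice")
```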
References
Masatoshi Uehara, Jiawei Huang, and Nan Jiang. “Minimax Weight and Q-Function Learning for Off-Policy Evaluation.” 2020.
Mengjiao Yang, Ofir Nachum, Bo Dai, Lihong Li, and Dale Schuurmans. “Off-Policy Evaluation via the Regularized Lagrangian.” 2020.
Shangtong Zhang, Bo Liu, and Shimon Whiteson. “GradientDICE: Rethinking Generalized Offline Estimation of Stationary Values.” 2020.
Ruiyi Zhang, Bo Dai, Lihong Li, and Dale Schuurmans. “GenDICE: Generalized Offline Estimation of Stationary Values.” 2020.
Ofir Nachum, Bo Dai, Ilya Kostrikov, Yinlam Chow, Lihong Li, and Dale Schuurmans. “AlgaeDICE: Policy Gradient from Arbitrary Experience.” 2019.
Ofir Nachum, Yinlam Chow, Bo Dai, and Lihong Li. “DualDICE: Behavior-Agnostic Estimation of Discounted Stationary Distribution Corrections.” 2019.
- Attributes:
- action_scaler
- alpha_q
- alpha_r
- alpha_w
- enable_lambda
- state_scaler
Methods
fit(step_per_trajectory, state, action, ...) | Fit value and weight functions.
fit_predict(step_per_trajectory, state, ...) | Fit and predict value/weight functions.
load(path_q, path_w) | Load models.
predict(state, action) | Predict Q value and state-action marginal importance weight.
predict_q_function(state, action) | Predict Q-function.
predict_v_function(state, ...) | Predict V function.
predict_value(state, action) | Predict Q-function.
predict_weight(state, action) | Predict state-action marginal importance weight.
save(path_q, path_w) | Save models.
- fit(step_per_trajectory, state, action, reward, evaluation_policy_action, n_steps=10000, n_steps_per_epoch=10000, random_state=None, **kwargs)
Fit value and weight functions.
- Parameters:
step_per_trajectory (int (> 0)) – Number of timesteps in an episode.
state (array-like of shape (n_trajectories * step_per_trajectory, state_dim)) – State observed by the behavior policy.
action (array-like of shape (n_trajectories * step_per_trajectory, action_dim)) – Action chosen by the behavior policy.
reward (array-like of shape (n_trajectories * step_per_trajectory)) – Reward observed for each (state, action) pair.
evaluation_policy_action (array-like of shape (n_trajectories * step_per_trajectory, action_dim)) – Next action chosen by the evaluation policy.
n_steps (int, default=10000 (> 0)) – Number of gradient steps.
n_steps_per_epoch (int, default=10000 (> 0)) – Number of gradient steps in an epoch.
random_state (int, default=None (>= 0)) – Random state.
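The shapes above imply that trajectories are concatenated along the first axis, so that step `t` of trajectory `i` sits at row `i * step_per_trajectory + t`. The following plain-Python sketch shows that layout for fixed-length trajectories; `flatten_trajectories` and the toy data are hypothetical, and in practice the resulting lists would typically be converted to NumPy arrays before being passed to `fit`.

```python
def flatten_trajectories(trajectories):
    """Concatenate fixed-length trajectories into flat, row-aligned lists.

    trajectories : list of trajectories, each a list of
                   (state, action, reward) steps, where state and action
                   are sequences of floats.
    """
    step_per_trajectory = len(trajectories[0])
    if any(len(traj) != step_per_trajectory for traj in trajectories):
        raise ValueError("all trajectories must have the same number of steps")
    steps = [step for traj in trajectories for step in traj]
    state = [list(s) for s, _, _ in steps]    # (n * T, state_dim)
    action = [list(a) for _, a, _ in steps]   # (n * T, action_dim)
    reward = [float(r) for _, _, r in steps]  # (n * T,)
    return step_per_trajectory, state, action, reward


# Two trajectories of three steps, state_dim=2, action_dim=1 (toy values).
toy_trajectories = [
    [([0.0, 0.1], [0.5], 1.0), ([0.2, 0.3], [0.4], 0.0), ([0.4, 0.5], [0.3], 2.0)],
    [([1.0, 1.1], [-0.5], 0.5), ([1.2, 1.3], [-0.4], 1.5), ([1.4, 1.5], [-0.3], 0.0)],
]
step_per_trajectory, state, action, reward = flatten_trajectories(toy_trajectories)
```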
- predict_q_function(state, action)
Predict Q-function.
- Parameters:
state (array-like of shape (n_trajectories * step_per_trajectory, state_dim)) – State observed by the behavior policy.
action (array-like of shape (n_trajectories * step_per_trajectory, action_dim)) – Action chosen by the behavior/evaluation policy.
- Returns:
q_value – Q value of each (state, action) pair.
- Return type:
ndarray of shape (n_trajectories * step_per_trajectory)
- predict_v_function(state, evaluation_policy_action)
Predict V function.
- Parameters:
state (array-like of shape (n_trajectories * step_per_trajectory, state_dim)) – State observed by the behavior policy.
evaluation_policy_action (array-like of shape (n_trajectories * step_per_trajectory, action_dim)) – Action chosen by the evaluation policy.
- Returns:
v_function – State value.
- Return type:
ndarray of shape (n_trajectories * step_per_trajectory)
- predict_value(state, action)
Predict Q-function.
- Parameters:
state (array-like of shape (n_trajectories * step_per_trajectory, state_dim)) – State observed by the behavior policy.
action (array-like of shape (n_trajectories * step_per_trajectory, action_dim)) – Action chosen by the behavior/evaluation policy.
- Returns:
q_value – Q value of each (state, action) pair.
- Return type:
ndarray of shape (n_trajectories * step_per_trajectory)
- predict_weight(state, action)
Predict state-action marginal importance weight.
- Parameters:
state (array-like of shape (n_trajectories * step_per_trajectory, state_dim)) – State observed by the behavior policy.
action (array-like of shape (n_trajectories * step_per_trajectory, action_dim)) – Action chosen by the behavior policy.
- Returns:
w_hat – Estimated state-action marginal importance weight.
- Return type:
ndarray of shape (n_trajectories * step_per_trajectory)
- predict(state, action)
Predict Q value and state-action marginal importance weight.
- Parameters:
state (array-like of shape (n_trajectories * step_per_trajectory, state_dim)) – State observed by the behavior policy.
action (array-like of shape (n_trajectories * step_per_trajectory, action_dim)) – Action chosen by the behavior policy.
- Returns:
q_value (ndarray of shape (n_trajectories * step_per_trajectory)) – Q value of each (state, action) pair.
w_hat (ndarray of shape (n_trajectories * step_per_trajectory)) – Estimated state-action marginal importance weight.
- fit_predict(step_per_trajectory, state, action, reward, evaluation_policy_action, n_steps=10000, n_steps_per_epoch=10000, random_state=None, **kwargs)
Fit and predict value/weight functions.
- Parameters:
step_per_trajectory (int (> 0)) – Number of timesteps in an episode.
state (array-like of shape (n_trajectories * step_per_trajectory, state_dim)) – State observed by the behavior policy.
action (array-like of shape (n_trajectories * step_per_trajectory, action_dim)) – Action chosen by the behavior policy.
reward (array-like of shape (n_trajectories * step_per_trajectory)) – Reward observed for each (state, action) pair.
evaluation_policy_action (array-like of shape (n_trajectories * step_per_trajectory, action_dim)) – Next action chosen by the evaluation policy.
n_steps (int, default=10000 (> 0)) – Number of gradient steps.
n_steps_per_epoch (int, default=10000 (> 0)) – Number of gradient steps in an epoch.
random_state (int, default=None (>= 0)) – Random state.
- Returns:
q_value (ndarray of shape (n_trajectories * step_per_trajectory)) – Q value of each (state, action) pair.
w_hat (ndarray of shape (n_trajectories * step_per_trajectory)) – Estimated state-action marginal importance weight.
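As an illustration of how the returned weights might be consumed, the sketch below computes a marginal importance sampling style estimate of the policy value, \(\frac{1}{n} \sum_{i} \sum_{t} \gamma^t \hat{w}(s_t, a_t) r_t\), from the flattened outputs. This is a generic textbook form under the row layout described in the parameter shapes; the helper name `marginal_is_estimate` is hypothetical and is not part of this class.

```python
def marginal_is_estimate(w_hat, reward, step_per_trajectory, gamma=1.0):
    """Average over trajectories of sum_t gamma^t * w_hat * reward."""
    if len(w_hat) != len(reward):
        raise ValueError("w_hat and reward must have the same length")
    n_trajectories, remainder = divmod(len(reward), step_per_trajectory)
    if remainder or n_trajectories == 0:
        raise ValueError("length must be a positive multiple of step_per_trajectory")
    total = 0.0
    for i in range(n_trajectories):
        for t in range(step_per_trajectory):
            idx = i * step_per_trajectory + t  # row of step t in trajectory i
            total += (gamma ** t) * w_hat[idx] * reward[idx]
    return total / n_trajectories


# Two trajectories of two steps each (toy numbers).
estimate = marginal_is_estimate(
    w_hat=[1.0, 1.0, 2.0, 0.0],
    reward=[1.0, 2.0, 3.0, 4.0],
    step_per_trajectory=2,
    gamma=0.5,
)
```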
- class scope_rl.ope.weight_value_learning.augmented_lagrangian_learning_continuous.ContinuousDiceStateWightValueLearning(v_function, w_function, gamma=1.0, bandwidth=1.0, state_scaler=None, action_scaler=None, method='best_dice', batch_size=128, v_lr=0.0001, w_lr=0.0001, lambda_lr=0.0001, alpha_v=None, alpha_w=None, alpha_r=None, enable_lambda=None, device='cuda:0')
Augmented Lagrangian method for weight/value function of marginal OPE estimators (for continuous action space).
Bases:
scope_rl.ope.weight_value_learning.BaseWeightValueLearner
Imported as:
scope_rl.ope.weight_value_learning.ContinuousDiceStateWightValueLearning
Note
The augmented Lagrangian method (ALM) simultaneously learns the weight and value functions using the Lagrangian relaxation of the primal-dual problem of weight/value learning (see Yang et al., 2020 for the underlying theory).
This class learns the V-function and the state marginal importance weight, rather than the Q-function and the state-action marginal importance weight.
\[\max_{w \geq 0} \min_{V, \lambda} L(w, V, \lambda)\]where
\[\begin{split}L(w, V, \lambda) &:= (1 - \gamma) \mathbb{E}_{s_0 \sim d(s_0)} [V(s_0)] + \lambda \\ & \quad \quad + \mathbb{E}_{(s_t, a_t, r_t, s_{t+1}) \sim d^{\pi_b}} [w_s(s_t) w_a(s_t, a_t) (\alpha_r r_t + \gamma V(s_{t+1}) - V(s_t) - \lambda)] \\ & \quad \quad + \alpha_V \mathbb{E}_{s_t \sim d^{\pi_b}} [V^2(s_t)] - \alpha_w \mathbb{E}_{s_t \sim d^{\pi_b}} [w_s^2(s_t)]\end{split}\]where \(V(s_t)\) is the V-function, \(w_s(s_t) \approx d^{\pi}(s_t) / d^{\pi_b}(s_t)\) is the state marginal importance weight, and \(w_a(s_t, a_t) = \pi(a_t | s_t) / \pi_b(a_t | s_t)\) is the immediate importance weight.
This estimator is analogous to the following estimators in its special cases (although the original estimators use the Q-function and the state-action marginal importance weight):
DualDICE (Nachum et al., 2019): \(\alpha_V = 1, \alpha_w = 0, \alpha_r = 0, \lambda = 0\)
GenDICE (Zhang et al., 2020), GradientDICE (Zhang et al., 2020): \(\alpha_V = 1, \alpha_w = 0, \alpha_r = 0\)
AlgaeDICE (Nachum et al., 2019): \(\alpha_V = 0, \alpha_w = 1, \alpha_r = 1, \lambda = 0\)
BestDICE (Yang et al., 2020): \(\alpha_V = 0, \alpha_w = 1, \alpha_r = 1\)
Minimax Value Learning (MVL) (Uehara et al., 2020): \(\alpha_V = 0, \alpha_w = 0, \alpha_r = 1, \lambda = 0\)
Minimax Weight Learning (MWL) (Uehara et al., 2020): \(\alpha_V = 0, \alpha_w = 0, \alpha_r = 0, \lambda = 0\)
ALM is beneficial in that it learns the V-function and the weight function simultaneously, in an adversarial manner. However, because the ALM objective is not convex, training may be unstable.
Note
The positivity constraint \(w \geq 0\) should be imposed in the function approximation model.
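For comparison with the state-action variant, the state-based objective \(L(w, V, \lambda)\) from the note above can be estimated from samples as sketched below. The immediate importance weight \(w_a\) is treated as a precomputed number per transition, since how it is obtained for continuous actions is not specified in this note; the function name, tuple layout, and toy data are assumptions for illustration only.

```python
from statistics import fmean


def empirical_state_lagrangian(v, w_s, lam, initial_states, transitions,
                               gamma, alpha_v, alpha_w, alpha_r):
    """Sample-average estimate of L(w, V, lambda) (hypothetical helper).

    v, w_s      : callables state -> float
    transitions : list of (s, r, s_next, w_a), where w_a is the
                  precomputed immediate importance weight of the step
    """
    # (1 - gamma) * E[V(s_0)] + lambda
    initial_term = (1.0 - gamma) * fmean(v(s0) for s0 in initial_states) + lam
    # E[w_s(s) * w_a * (alpha_r * r + gamma * V(s') - V(s) - lambda)]
    bellman_term = fmean(
        w_s(s) * w_a * (alpha_r * r + gamma * v(s2) - v(s) - lam)
        for s, r, s2, w_a in transitions
    )
    # alpha_V * E[V^2(s)] - alpha_w * E[w_s^2(s)]
    reg_term = fmean(
        alpha_v * v(s) ** 2 - alpha_w * w_s(s) ** 2
        for s, _, _, _ in transitions
    )
    return initial_term + bellman_term + reg_term


# Toy data: constant V and w_s, two transitions, one initial state.
toy_state_value = empirical_state_lagrangian(
    v=lambda s: 2.0,
    w_s=lambda s: 1.0,
    lam=0.0,
    initial_states=[0],
    transitions=[(0, 1.0, 1, 1.0), (1, 3.0, 0, 0.5)],
    gamma=0.5,
    alpha_v=0.0,
    alpha_w=1.0,
    alpha_r=1.0,
)
```

Only \(w_s\) is regularized in the last term; \(w_a\) enters solely through the Bellman residual, which matches the formula in the note.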
- Parameters:
v_function (VFunction) – V function model.
w_function (StateWeightFunction) – Weight function model.
gamma (float, default=1.0) – Discount factor. The value should be within (0, 1].
bandwidth (float, default=1.0 (> 0)) – Bandwidth hyperparameter of the Gaussian kernel. (This is for API consistency)
state_scaler (d3rlpy.preprocessing.Scaler, default=None) – Scaling factor of state.
action_scaler (d3rlpy.preprocessing.ActionScaler, default=None) – Scaling factor of action.
method ({"dual_dice", "gen_dice", "algae_dice", "best_dice", "mvl", "mwl", "custom"}, default="best_dice") – Indicates which parameter set should be used. When "custom" is given, users can specify their own parameters.
batch_size (int, default=128 (> 0)) – Batch size.
v_lr (float, default=1e-4 (> 0)) – Learning rate of v_function.
w_lr (float, default=1e-4 (> 0)) – Learning rate of w_function.
lambda_lr (float, default=1e-4 (> 0)) – Learning rate of lambda_.
alpha_v (float, default=None (>= 0)) – Regularization coefficient of the V-function. A value should be given if method is “custom”.
alpha_w (float, default=None (>= 0)) – Regularization coefficient of the weight function. A value should be given if method is “custom”.
alpha_r (bool, default=None) – Whether to consider the reward observation. A value should be given if method is “custom”.
enable_lambda (bool, default=None) – Whether to optimize \(\lambda\). If False, \(\lambda\) is automatically set to zero. A boolean value should be given if method is “custom”.
device (str, default="cuda:0") – Specifies device used for torch.
References
Masatoshi Uehara, Jiawei Huang, and Nan Jiang. “Minimax Weight and Q-Function Learning for Off-Policy Evaluation.” 2020.
Mengjiao Yang, Ofir Nachum, Bo Dai, Lihong Li, and Dale Schuurmans. “Off-Policy Evaluation via the Regularized Lagrangian.” 2020.
Shangtong Zhang, Bo Liu, and Shimon Whiteson. “GradientDICE: Rethinking Generalized Offline Estimation of Stationary Values.” 2020.
Ruiyi Zhang, Bo Dai, Lihong Li, and Dale Schuurmans. “GenDICE: Generalized Offline Estimation of Stationary Values.” 2020.
Ofir Nachum, Bo Dai, Ilya Kostrikov, Yinlam Chow, Lihong Li, and Dale Schuurmans. “AlgaeDICE: Policy Gradient from Arbitrary Experience.” 2019.
Ofir Nachum, Yinlam Chow, Bo Dai, and Lihong Li. “DualDICE: Behavior-Agnostic Estimation of Discounted Stationary Distribution Corrections.” 2019.
- Attributes:
- action_scaler
- alpha_r
- alpha_v
- alpha_w
- enable_lambda
- state_scaler
Methods
fit(step_per_trajectory, state, action, ...) | Fit value and weight functions.
fit_predict(step_per_trajectory, state, ...) | Fit and predict value/weight functions.
load(path_v, path_w) | Load models.
predict(state) | Predict V function and state marginal importance weight.
predict_value(state) | Predict V function.
predict_weight(state) | Predict state marginal importance weight.
save(path_v, path_w) | Save models.
- fit(step_per_trajectory, state, action, reward, pscore, evaluation_policy_action, n_steps=10000, n_steps_per_epoch=10000, random_state=None, **kwargs)
Fit value and weight functions.
- Parameters:
step_per_trajectory (int (> 0)) – Number of timesteps in an episode.
state (array-like of shape (n_trajectories * step_per_trajectory, state_dim)) – State observed by the behavior policy.
action (array-like of shape (n_trajectories * step_per_trajectory, action_dim)) – Action chosen by the behavior policy.
reward (array-like of shape (n_trajectories * step_per_trajectory)) – Reward observed for each (state, action) pair.
pscore (array-like of shape (n_trajectories * step_per_trajectory)) – Propensity of the observed action being chosen under the behavior policy (pscore stands for propensity score).
evaluation_policy_action (array-like of shape (n_trajectories * step_per_trajectory, action_dim)) – Action chosen by the evaluation policy.
n_steps (int, default=10000 (> 0)) – Number of gradient steps.
n_steps_per_epoch (int, default=10000 (> 0)) – Number of gradient steps in an epoch.
random_state (int, default=None (>= 0)) – Random state.
- predict_value(state)
Predict V function.
- Parameters:
state (array-like of shape (n_trajectories * step_per_trajectory, state_dim)) – State observed by the behavior policy.
- Returns:
v_function – State value.
- Return type:
ndarray of shape (n_trajectories * step_per_trajectory)
- predict_weight(state)
Predict state marginal importance weight.
- Parameters:
state (array-like of shape (n_trajectories * step_per_trajectory, state_dim)) – State observed by the behavior policy.
- Returns:
w_hat – Estimated state marginal importance weight.
- Return type:
ndarray of shape (n_trajectories * step_per_trajectory)
- predict(state)
Predict V function and state marginal importance weight.
- Parameters:
state (array-like of shape (n_trajectories * step_per_trajectory, state_dim)) – State observed by the behavior policy.
- Returns:
v_function (ndarray of shape (n_trajectories * step_per_trajectory)) – State value of each state.
w_hat (ndarray of shape (n_trajectories * step_per_trajectory)) – Estimated state marginal importance weight.
- fit_predict(step_per_trajectory, state, action, reward, pscore, evaluation_policy_action, n_steps=10000, n_steps_per_epoch=10000, random_state=None, **kwargs)
Fit and predict value/weight functions.
- Parameters:
step_per_trajectory (int (> 0)) – Number of timesteps in an episode.
state (array-like of shape (n_trajectories * step_per_trajectory, state_dim)) – State observed by the behavior policy.
action (array-like of shape (n_trajectories * step_per_trajectory, action_dim)) – Action chosen by the behavior policy.
reward (array-like of shape (n_trajectories * step_per_trajectory)) – Reward observed for each (state, action) pair.
pscore (array-like of shape (n_trajectories * step_per_trajectory)) – Propensity of the observed action being chosen under the behavior policy (pscore stands for propensity score).
evaluation_policy_action (array-like of shape (n_trajectories * step_per_trajectory, action_dim)) – Action chosen by the evaluation policy.
n_steps (int, default=10000 (> 0)) – Number of gradient steps.
n_steps_per_epoch (int, default=10000 (> 0)) – Number of gradient steps in an epoch.
random_state (int, default=None (>= 0)) – Random state.
- Returns:
v_function (ndarray of shape (n_trajectories * step_per_trajectory)) – State value of each state.
w_hat (ndarray of shape (n_trajectories * step_per_trajectory)) – Estimated state marginal importance weight.