scope_rl.ope.weight_value_learning.minimax_value_learning_continuous
Minimax value function learning (continuous action cases).
Classes
ContinuousMinimaxStateActionValueLearning | Minimax Q Learning for marginal OPE estimators (for continuous action space).
ContinuousMinimaxStateValueLearning | Minimax V Learning for marginal OPE estimators (for continuous action space).
- class scope_rl.ope.weight_value_learning.minimax_value_learning_continuous.ContinuousMinimaxStateActionValueLearning(q_function, gamma=1.0, bandwidth=1.0, state_scaler=None, action_scaler=None, batch_size=128, lr=0.0001, device='cuda:0')
Minimax Q Learning for marginal OPE estimators (for continuous action space).
Bases: scope_rl.ope.weight_value_learning.BaseWeightValueLearner
Imported as: scope_rl.ope.weight_value_learning.ContinuousMinimaxStateActionValueLearning
Note
Minimax Q Learning relies on the following identity, which the Q-function satisfies.
\[\mathbb{E}_{(s_t, a_t, r_t, s_{t+1}) \sim d^{\pi_b}} [w(s_t, a_t) (r_t + \gamma Q(s_{t+1}, \pi(s_{t+1})))] = \mathbb{E}_{(s_t, a_t) \sim d^{\pi_b}} [w(s_t, a_t) Q(s_t, a_t)]\]
where \(Q(s_t, a_t)\) is the Q-function and \(w(s_t, a_t) \approx d^{\pi}(s_t, a_t) / d^{\pi_b}(s_t, a_t)\) is the state-action marginal importance weight.
Then, it adversarially minimizes the difference between the RHS and LHS (denoted \(L_Q(w, Q)\)) under the worst case in terms of \(w(\cdot)\), using a discriminator defined in a reproducing kernel Hilbert space (RKHS), as follows.
\[\max_w L_Q^2(w, Q) = \mathbb{E}_{(s_t, a_t, r_t, s_{t+1}), (\tilde{s}_t, \tilde{a}_t, \tilde{r}_t, \tilde{s}_{t+1}) \sim d^{\pi_b}}[ (r_t + \gamma Q(s_{t+1}, \pi(s_{t+1})) - Q(s_t, a_t)) K((s_t, a_t), (\tilde{s}_t, \tilde{a}_t)) (\tilde{r}_t + \gamma Q(\tilde{s}_{t+1}, \pi(\tilde{s}_{t+1})) - Q(\tilde{s}_t, \tilde{a}_t)) ]\]
where \(K(\cdot, \cdot)\) is a kernel function.
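The kernelized objective above can be estimated from samples by averaging over all pairs of transitions. The following is a minimal sketch of that estimate, not the library's implementation: the helper names `gaussian_kernel` and `kernel_q_loss`, the kernel parametrization `exp(-||x - y||^2 / (2 * bandwidth^2))`, and the toy Q-function and policy are all assumptions made for illustration. It uses only the standard library so that it runs without extra dependencies.

```python
import math


def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel between two feature vectors (assumed form)."""
    sq_dist = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def kernel_q_loss(q, policy, transitions, gamma=1.0, bandwidth=1.0):
    """Empirical kernelized squared Bellman residual over all sample pairs.

    transitions: list of (state, action, reward, next_state) tuples,
    where state, action, and next_state are tuples of floats.
    """
    # Bellman residual r + gamma * Q(s', pi(s')) - Q(s, a) for each sample.
    residuals = [
        r + gamma * q(s_next, policy(s_next)) - q(s, a)
        for s, a, r, s_next in transitions
    ]
    # The kernel acts on the concatenated (state, action) vector.
    inputs = [s + a for s, a, _, _ in transitions]
    n = len(transitions)
    total = 0.0
    for i in range(n):
        for j in range(n):
            k = gaussian_kernel(inputs[i], inputs[j], bandwidth)
            total += residuals[i] * k * residuals[j]
    # Average over all n * n pairs (including i == j).
    return total / (n * n)


# Toy data: 1-D states and actions, a hypothetical deterministic policy,
# and a Q-function that is identically zero.
policy = lambda s: (0.5 * s[0],)
q_zero = lambda s, a: 0.0
transitions = [
    ((0.0,), (0.1,), 1.0, (0.2,)),
    ((0.2,), (0.0,), 0.5, (0.1,)),
]
loss = kernel_q_loss(q_zero, policy, transitions, gamma=0.9)
```

Because the Gaussian kernel is positive semi-definite, this estimate is never negative, and it is zero whenever every sampled Bellman residual is zero.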
- Parameters:
q_function (ContinuousQFunction) – Q-function model.
gamma (float, default=1.0) – Discount factor. The value should be within (0, 1].
bandwidth (float, default=1.0 (> 0)) – Bandwidth hyperparameter of the Gaussian kernel.
state_scaler (d3rlpy.preprocessing.Scaler, default=None) – Scaler used to preprocess states.
action_scaler (d3rlpy.preprocessing.ActionScaler, default=None) – Scaler used to preprocess actions.
batch_size (int, default=128 (> 0)) – Batch size.
lr (float, default=1e-4 (> 0)) – Learning rate.
device (str, default="cuda:0") – Specifies device used for torch.
References
Masatoshi Uehara, Jiawei Huang, and Nan Jiang. “Minimax Weight and Q-Function Learning for Off-Policy Evaluation.” 2020.
- Attributes:
- action_scaler
- state_scaler
Methods
fit(step_per_trajectory, state, action, ...) | Fit Q-function.
fit_predict(step_per_trajectory, state, ...) | Fit and predict Q-function.
load(path) | Load models.
predict(state, action) | Predict Q-function.
predict_q_function(state, action) | Predict Q-function.
predict_v_function(state, ...) | Predict V function.
predict_value(state, action) | Predict Q-function.
save(path) | Save models.
- fit(step_per_trajectory, state, action, reward, pscore, evaluation_policy_action, n_steps=10000, n_steps_per_epoch=10000, regularization_weight=1.0, random_state=None, **kwargs)
Fit Q-function.
- Parameters:
step_per_trajectory (int (> 0)) – Number of timesteps in an episode.
state (array-like of shape (n_trajectories * step_per_trajectory, state_dim)) – State observed by the behavior policy.
action (array-like of shape (n_trajectories * step_per_trajectory, action_dim)) – Action chosen by the behavior policy.
reward (array-like of shape (n_trajectories * step_per_trajectory)) – Reward observed for each (state, action) pair.
pscore (array-like of shape (n_trajectories * step_per_trajectory)) – Propensity of the observed action being chosen under the behavior policy (pscore stands for propensity score).
evaluation_policy_action (array-like of shape (n_trajectories * step_per_trajectory, action_dim)) – Action chosen by the evaluation policy.
n_steps (int, default=10000 (> 0)) – Number of gradient steps.
n_steps_per_epoch (int, default=10000 (> 0)) – Number of gradient steps in an epoch.
regularization_weight (float, default=1.0 (> 0)) – Scaling factor of the regularization weight.
random_state (int, default=None (>= 0)) – Random state.
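The array arguments above are flat: each has `n_trajectories * step_per_trajectory` rows. The sketch below shows one way to build such inputs from per-trajectory data. It assumes that the steps of each trajectory are stored contiguously, which this page does not state explicitly, and `flatten_trajectories` is a hypothetical helper, not part of scope_rl.

```python
def flatten_trajectories(trajectories):
    """Flatten equal-length trajectories into one list of per-step rows.

    trajectories: list of trajectories, each a list of per-step records.
    Returns (step_per_trajectory, flat_rows), where flat_rows has
    n_trajectories * step_per_trajectory entries and the steps of each
    trajectory stay contiguous (assumed layout).
    """
    step_per_trajectory = len(trajectories[0])
    if any(len(t) != step_per_trajectory for t in trajectories):
        raise ValueError("all trajectories must have the same length")
    flat_rows = [step for trajectory in trajectories for step in trajectory]
    return step_per_trajectory, flat_rows


# Two trajectories of three steps each, with state_dim = 2.
state_by_trajectory = [
    [(0.0, 0.0), (0.1, 0.0), (0.2, 0.1)],
    [(1.0, 1.0), (0.9, 1.0), (0.8, 0.9)],
]
step_per_trajectory, state = flatten_trajectories(state_by_trajectory)
```

The same flattening would apply to `action`, `reward`, `pscore`, and `evaluation_policy_action`, so that row `i` of every array describes the same timestep.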
- predict_q_function(state, action)
Predict Q-function.
- Parameters:
state (array-like of shape (n_trajectories * step_per_trajectory, state_dim)) – State observed by the behavior policy.
action (array-like of shape (n_trajectories * step_per_trajectory, action_dim)) – Action chosen by the behavior/evaluation policy.
- Returns:
q_value – Q value of each (state, action) pair.
- Return type:
ndarray of shape (n_trajectories * step_per_trajectory)
- predict_v_function(state, evaluation_policy_action)
Predict V function.
- Parameters:
state (array-like of shape (n_trajectories * step_per_trajectory, state_dim)) – State observed by the behavior policy.
evaluation_policy_action (array-like of shape (n_trajectories * step_per_trajectory, action_dim)) – Action chosen by the evaluation policy.
- Returns:
v_function – State value.
- Return type:
ndarray of shape (n_trajectories * step_per_trajectory)
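Since `predict_v_function` takes the evaluation policy's action alongside the state, one natural reading is that the state value is the Q-function evaluated at that action, \(V(s) = Q(s, \pi(s))\). The sketch below illustrates that relationship under this assumption; `v_from_q` and the toy Q-function are hypothetical and do not reflect the library's internals.

```python
def v_from_q(q, states, evaluation_policy_actions):
    """State values obtained by plugging the evaluation policy's action into Q."""
    return [q(s, a) for s, a in zip(states, evaluation_policy_actions)]


# Toy Q-function on 1-D states and actions (illustrative only).
q = lambda s, a: s[0] + 2.0 * a[0]
states = [(1.0,), (2.0,)]
evaluation_policy_actions = [(0.5,), (0.0,)]
v_values = v_from_q(q, states, evaluation_policy_actions)
```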
- predict_value(state, action)
Predict Q-function.
- Parameters:
state (array-like of shape (n_trajectories * step_per_trajectory, state_dim)) – State observed by the behavior policy.
action (array-like of shape (n_trajectories * step_per_trajectory, action_dim)) – Action chosen by the behavior policy.
- Returns:
q_value – Q value of each (state, action) pair.
- Return type:
ndarray of shape (n_trajectories * step_per_trajectory)
- predict(state, action)
Predict Q-function.
- Parameters:
state (array-like of shape (n_trajectories * step_per_trajectory, state_dim)) – State observed by the behavior policy.
action (array-like of shape (n_trajectories * step_per_trajectory, action_dim)) – Action chosen by the behavior policy.
- Returns:
q_value – Q value of each (state, action) pair.
- Return type:
ndarray of shape (n_trajectories * step_per_trajectory)
- fit_predict(step_per_trajectory, state, action, reward, pscore, evaluation_policy_action, n_steps=10000, n_steps_per_epoch=10000, regularization_weight=1.0, random_state=None, **kwargs)
Fit and predict Q-function.
- Parameters:
step_per_trajectory (int (> 0)) – Number of timesteps in an episode.
state (array-like of shape (n_trajectories * step_per_trajectory, state_dim)) – State observed by the behavior policy.
action (array-like of shape (n_trajectories * step_per_trajectory, action_dim)) – Action chosen by the behavior policy.
reward (array-like of shape (n_trajectories * step_per_trajectory)) – Reward observed for each (state, action) pair.
pscore (array-like of shape (n_trajectories * step_per_trajectory)) – Propensity of the observed action being chosen under the behavior policy (pscore stands for propensity score).
evaluation_policy_action (array-like of shape (n_trajectories * step_per_trajectory, action_dim)) – Action chosen by the evaluation policy.
n_steps (int, default=10000 (> 0)) – Number of gradient steps.
n_steps_per_epoch (int, default=10000 (> 0)) – Number of gradient steps in an epoch.
regularization_weight (float, default=1.0 (> 0)) – Scaling factor of the regularization weight.
random_state (int, default=None (>= 0)) – Random state.
- class scope_rl.ope.weight_value_learning.minimax_value_learning_continuous.ContinuousMinimaxStateValueLearning(v_function, gamma=1.0, bandwidth=1.0, state_scaler=None, action_scaler=None, batch_size=128, lr=0.0001, device='cuda:0')
Minimax V Learning for marginal OPE estimators (for continuous action space).
Bases: scope_rl.ope.weight_value_learning.BaseWeightValueLearner
Imported as: scope_rl.ope.weight_value_learning.ContinuousMinimaxStateValueLearning
Note
Minimax V Learning relies on the following identity, which the V-function satisfies.
\[\mathbb{E}_{(s_t, a_t, r_t, s_{t+1}) \sim d^{\pi_b}} [w(s_t, a_t) (r_t + \gamma V(s_{t+1}))] = \mathbb{E}_{s_t \sim d^{\pi_b}} [V(s_t)]\]
where \(V(s_t)\) is the V-function (state-value function) and \(w(s_t, a_t) \approx d^{\pi}(s_t, a_t) / d^{\pi_b}(s_t, a_t)\) is the state-action marginal importance weight.
Then, it adversarially minimizes the difference between the RHS and LHS (denoted \(L_V(w, V)\)) under the worst case in terms of \(w(\cdot)\), using a discriminator defined in a reproducing kernel Hilbert space (RKHS), as follows.
\[\max_w L_V^2(w, V) = \mathbb{E}_{(s_t, a_t, r_t, s_{t+1}), (\tilde{s}_t, \tilde{a}_t, \tilde{r}_t, \tilde{s}_{t+1}) \sim d^{\pi_b}}[ (r_t + \gamma V(s_{t+1}) - V(s_t)) K(s_t, \tilde{s}_t) (\tilde{r}_t + \gamma V(\tilde{s}_{t+1}) - V(\tilde{s}_t)) ]\]
where \(K(\cdot, \cdot)\) is a kernel function.
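As a quick sanity check on the kernelized objective above, the sketch below evaluates it on a toy chain where the exact V-function is known. It follows the displayed formula literally (kernel over states only, no importance ratio), and it is an illustration rather than the library's implementation: `kernel_v_loss`, the Gaussian kernel parametrization, and the self-loop environment are all assumptions.

```python
import math


def kernel_v_loss(v, transitions, gamma=1.0, bandwidth=1.0):
    """Empirical kernelized squared residual, with the kernel over states.

    transitions: list of (state, reward, next_state) tuples, where
    state and next_state are tuples of floats.
    """
    # Temporal-difference residual r + gamma * V(s') - V(s) per sample.
    residuals = [r + gamma * v(s_next) - v(s) for s, r, s_next in transitions]
    states = [s for s, _, _ in transitions]
    n = len(transitions)
    total = 0.0
    for i in range(n):
        for j in range(n):
            sq_dist = sum((x - y) ** 2 for x, y in zip(states[i], states[j]))
            k = math.exp(-sq_dist / (2.0 * bandwidth ** 2))
            total += residuals[i] * k * residuals[j]
    return total / (n * n)


# Toy self-loop chain: state s stays at s and pays reward s[0], so with
# gamma = 0.5 the exact value is V(s) = s[0] / (1 - 0.5) = 2 * s[0].
transitions = [((1.0,), 1.0, (1.0,)), ((2.0,), 2.0, (2.0,))]
loss_true = kernel_v_loss(lambda s: 2.0 * s[0], transitions, gamma=0.5)
loss_zero = kernel_v_loss(lambda s: 0.0, transitions, gamma=0.5)
```

The exact value function makes every residual vanish, so `loss_true` is zero, while the all-zero candidate leaves nonzero residuals and a strictly positive loss.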
- Parameters:
v_function (DiscreteQFunction) – V function model.
gamma (float, default=1.0) – Discount factor. The value should be within (0, 1].
bandwidth (float, default=1.0 (> 0)) – Bandwidth hyperparameter of the Gaussian kernel.
state_scaler (d3rlpy.preprocessing.Scaler, default=None) – Scaler used to preprocess states.
action_scaler (d3rlpy.preprocessing.ActionScaler, default=None) – Scaler used to preprocess actions.
batch_size (int, default=128 (> 0)) – Batch size.
lr (float, default=1e-4 (> 0)) – Learning rate.
device (str, default="cuda:0") – Specifies device used for torch.
References
Masatoshi Uehara, Jiawei Huang, and Nan Jiang. “Minimax Weight and Q-Function Learning for Off-Policy Evaluation.” 2020.
- Attributes:
- action_scaler
- state_scaler
Methods
fit(step_per_trajectory, state, action, ...) | Fit V-function.
fit_predict(step_per_trajectory, state, ...) | Fit and predict V-function.
load(path) | Load models.
predict(state) | Predict V function.
predict_v_function(state) | Predict V function.
predict_value(state) | Predict V function.
save(path) | Save models.
- fit(step_per_trajectory, state, action, reward, pscore, evaluation_policy_action, n_steps=10000, n_steps_per_epoch=10000, regularization_weight=1.0, random_state=None, **kwargs)
Fit V-function.
- Parameters:
step_per_trajectory (int (> 0)) – Number of timesteps in an episode.
state (array-like of shape (n_trajectories * step_per_trajectory, state_dim)) – State observed by the behavior policy.
action (array-like of shape (n_trajectories * step_per_trajectory, action_dim)) – Action chosen by the behavior policy.
reward (array-like of shape (n_trajectories * step_per_trajectory)) – Reward observed for each (state, action) pair.
pscore (array-like of shape (n_trajectories * step_per_trajectory)) – Propensity of the observed action being chosen under the behavior policy (pscore stands for propensity score).
evaluation_policy_action (array-like of shape (n_trajectories * step_per_trajectory, action_dim)) – Action chosen by the evaluation policy.
n_steps (int, default=10000 (> 0)) – Number of gradient steps.
n_steps_per_epoch (int, default=10000 (> 0)) – Number of gradient steps in an epoch.
regularization_weight (float, default=1.0 (> 0)) – Scaling factor of the regularization weight.
random_state (int, default=None (>= 0)) – Random state.
- predict_v_function(state)
Predict V function.
- Parameters:
state (array-like of shape (n_trajectories * step_per_trajectory, state_dim)) – State observed by the behavior policy.
- Returns:
v_function – State value.
- Return type:
ndarray of shape (n_trajectories * step_per_trajectory)
- predict_value(state)
Predict V function.
- Parameters:
state (array-like of shape (n_trajectories * step_per_trajectory, state_dim)) – State observed by the behavior policy.
- Returns:
v_function – State value.
- Return type:
ndarray of shape (n_trajectories * step_per_trajectory)
- predict(state)
Predict V function.
- Parameters:
state (array-like of shape (n_trajectories * step_per_trajectory, state_dim)) – State observed by the behavior policy.
- Returns:
v_function – State value.
- Return type:
ndarray of shape (n_trajectories * step_per_trajectory)
- fit_predict(step_per_trajectory, state, action, reward, pscore, evaluation_policy_action, n_steps=10000, n_steps_per_epoch=10000, regularization_weight=1.0, random_state=None, **kwargs)
Fit and predict V-function.
- Parameters:
step_per_trajectory (int (> 0)) – Number of timesteps in an episode.
state (array-like of shape (n_trajectories * step_per_trajectory, state_dim)) – State observed by the behavior policy.
action (array-like of shape (n_trajectories * step_per_trajectory, action_dim)) – Action chosen by the behavior policy.
reward (array-like of shape (n_trajectories * step_per_trajectory)) – Reward observed for each (state, action) pair.
pscore (array-like of shape (n_trajectories * step_per_trajectory)) – Propensity of the observed action being chosen under the behavior policy (pscore stands for propensity score).
evaluation_policy_action (array-like of shape (n_trajectories * step_per_trajectory, action_dim)) – Action chosen by the evaluation policy.
n_steps (int, default=10000 (> 0)) – Number of gradient steps.
n_steps_per_epoch (int, default=10000 (> 0)) – Number of gradient steps in an epoch.
regularization_weight (float, default=1.0 (> 0)) – Scaling factor of the regularization weight.
random_state (int, default=None (>= 0)) – Random state.