scope_rl.ope.weight_value_learning.minimax_weight_learning_continuous

Minimax weight function learning (continuous action cases).

Classes

ContinuousMinimaxStateActionWeightLearning

Minimax Weight Learning for marginal OPE estimators (for continuous action space).

ContinuousMinimaxStateWeightLearning

Minimax Weight Learning for marginal OPE estimators (for continuous action space).

class scope_rl.ope.weight_value_learning.minimax_weight_learning_continuous.ContinuousMinimaxStateActionWeightLearning(w_function, gamma=1.0, bandwidth=1.0, state_scaler=None, action_scaler=None, batch_size=128, lr=0.0001, device='cuda:0')

Minimax Weight Learning for marginal OPE estimators (for continuous action space).

Bases: scope_rl.ope.weight_value_learning.BaseWeightValueLearner

Imported as: scope_rl.ope.weight_value_learning.ContinuousMinimaxStateActionWeightLearning

Note

Minimax Weight Learning exploits the following identity for the Q-function.

\[\mathbb{E}_{(s_t, a_t, r_t, s_{t+1}) \sim d^{\pi_b}, a_{t+1} \sim \pi(a_{t+1} | s_{t+1})} [w(s_t, a_t) (Q(s_t, a_t) - \gamma Q(s_{t+1}, a_{t+1}))] = (1 - \gamma) \mathbb{E}_{s_0 \sim d(s_0), a_0 \sim \pi(a_0 | s_0)} [Q(s_0, a_0)]\]

where \(Q(s_t, a_t)\) is the Q-function and \(w(s_t, a_t) \approx d^{\pi}(s_t, a_t) / d^{\pi_b}(s_t, a_t)\) is the state-action marginal importance weight.
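The identity can be checked numerically on a small tabular problem. The sketch below is illustrative only: the two-state MDP, both policies, and the test function Q are made-up values, and it assumes that \(d^{\pi}\) denotes the normalized discounted occupancy, so that the right-hand side is the expectation of Q at the initial state scaled by \((1 - \gamma)\).

```python
# Toy check of the identity on a 2-state, 2-action MDP (all numbers are made up).
GAMMA = 0.9
N_S, N_A = 2, 2
D0 = [0.7, 0.3]                      # initial state distribution d(s_0)
P = [[[0.9, 0.1], [0.2, 0.8]],       # transition probabilities P[s][a][s_next]
     [[0.5, 0.5], [0.1, 0.9]]]
PI = [[0.8, 0.2], [0.3, 0.7]]        # evaluation policy pi(a | s)
PI_B = [[0.5, 0.5], [0.6, 0.4]]      # behavior policy pi_b(a | s)
Q = [[1.0, -2.0], [0.5, 3.0]]        # an arbitrary test function Q(s, a)


def occupancy(policy, horizon=2000):
    """Normalized discounted occupancy (1 - gamma) * sum_t gamma^t Pr(s_t, a_t)."""
    d = [[0.0] * N_A for _ in range(N_S)]
    p_state = list(D0)
    for t in range(horizon):
        for s in range(N_S):
            for a in range(N_A):
                d[s][a] += (1 - GAMMA) * GAMMA**t * p_state[s] * policy[s][a]
        # Propagate the state distribution one step forward under the policy.
        p_state = [
            sum(p_state[s] * policy[s][a] * P[s][a][s2]
                for s in range(N_S) for a in range(N_A))
            for s2 in range(N_S)
        ]
    return d


def q_pi(s):
    """Expectation of Q(s, a) over a ~ pi(. | s)."""
    return sum(PI[s][a] * Q[s][a] for a in range(N_A))


d_pi, d_b = occupancy(PI), occupancy(PI_B)

# Left-hand side: expectation under d^{pi_b}, reweighted by w = d^pi / d^{pi_b}.
lhs = 0.0
for s in range(N_S):
    for a in range(N_A):
        w = d_pi[s][a] / d_b[s][a]
        next_q = sum(P[s][a][s2] * q_pi(s2) for s2 in range(N_S))
        lhs += d_b[s][a] * w * (Q[s][a] - GAMMA * next_q)

# Right-hand side: (1 - gamma) times the expected Q at the initial state.
rhs = (1 - GAMMA) * sum(D0[s] * q_pi(s) for s in range(N_S))
print(lhs, rhs)
```

Because the two sides agree for any choice of Q when w is the true density ratio, a mismatch for some Q signals an inaccurate w, which is what the adversarial objective measures.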

It then learns \(w\) by minimizing the gap between the two sides (denoted \(L_w(w, Q)\)) against the worst-case \(Q(\cdot)\). Taking the discriminator \(Q(\cdot)\) from a reproducing kernel Hilbert space (RKHS) gives the worst case the following closed form.

\[\begin{split}\max_Q L_w^2(w, Q) &= \mathbb{E}_{(s_t, a_t, s_{t+1}), (\tilde{s}_t, \tilde{a}_t, \tilde{s}_{t+1}) \sim d^{\pi_b}, a_{t+1} \sim \pi(a_{t+1} | s_{t+1}), \tilde{a}_{t+1} \sim \pi(\tilde{a}_{t+1} | \tilde{s}_{t+1})}[ w(s_t, a_t) w(\tilde{s}_t, \tilde{a}_t) ( K((s_t, a_t), (\tilde{s}_t, \tilde{a}_t)) + K((s_{t+1}, a_{t+1}), (\tilde{s}_{t+1}, \tilde{a}_{t+1})) - \gamma ( K((s_t, a_t), (\tilde{s}_{t+1}, \tilde{a}_{t+1})) + K((s_{t+1}, a_{t+1}), (\tilde{s}_t, \tilde{a}_t)) )) ] \\ & \quad \quad + \gamma (1 - \gamma) \mathbb{E}_{(s_t, a_t, s_{t+1}), (\tilde{s}_t, \tilde{a}_t, \tilde{s}_{t+1}) \sim d^{\pi_b}, a_{t+1} \sim \pi(a_{t+1} | s_{t+1}), \tilde{a}_{t+1} \sim \pi(\tilde{a}_{t+1} | \tilde{s}_{t+1}), s_0 \sim d(s_0), \tilde{s}_0 \sim d(\tilde{s}_0), a_0 \sim \pi(a_0 | s_0), \tilde{a}_0 \sim \pi(\tilde{a}_0 | \tilde{s}_0)}[ w(s_t, a_t) K((s_{t+1}, a_{t+1}), (\tilde{s}_0, \tilde{a}_0)) + w(\tilde{s}_t, \tilde{a}_t) K((\tilde{s}_{t+1}, \tilde{a}_{t+1}), (s_0, a_0)) ] \\ & \quad \quad - (1 - \gamma) \mathbb{E}_{(s_t, a_t), (\tilde{s}_t, \tilde{a}_t) \sim d^{\pi_b}, s_0 \sim d(s_0), \tilde{s}_0 \sim d(\tilde{s}_0), a_0 \sim \pi(a_0 | s_0), \tilde{a}_0 \sim \pi(\tilde{a}_0 | \tilde{s}_0)}[ w(s_t, a_t) K((s_t, a_t), (\tilde{s}_0, \tilde{a}_0)) + w(\tilde{s}_t, \tilde{a}_t) K((\tilde{s}_t, \tilde{a}_t), (s_0, a_0)) ]\end{split}\]

where \(K(\cdot, \cdot)\) is a kernel function.
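As an illustration of the kernel, the sketch below evaluates a Gaussian kernel on concatenated (state, action) vectors. The parameterization \(\exp(-\|x - \tilde{x}\|^2 / (2 h^2))\) with bandwidth \(h\) is an assumption made for this example; the library may scale the bandwidth differently. The helper names `gaussian_kernel` and `pair` are hypothetical.

```python
import math


def gaussian_kernel(x, x_tilde, bandwidth=1.0):
    """Gaussian kernel exp(-||x - x~||^2 / (2 * bandwidth^2)) (assumed form)."""
    sq_dist = sum((u - v) ** 2 for u, v in zip(x, x_tilde))
    return math.exp(-sq_dist / (2.0 * bandwidth**2))


def pair(state, action):
    """Concatenate state and action so the kernel acts on the joint (s, a) vector."""
    return list(state) + list(action)


x = pair([0.0, 1.0], [0.5])
x_tilde = pair([0.0, 1.0], [1.5])    # same state, action shifted by 1.0

k_same = gaussian_kernel(x, x)                        # identical inputs
k_near = gaussian_kernel(x, x_tilde, bandwidth=1.0)
k_wide = gaussian_kernel(x, x_tilde, bandwidth=2.0)   # larger bandwidth, slower decay
print(k_same, k_near, k_wide)
```

Under this form, a larger bandwidth makes distant pairs look more similar, so the discriminator becomes smoother and less sensitive to local errors in the weight function.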

Parameters:
  • w_function (ContinuousStateActionWeightFunction) – Weight function model.

  • gamma (float, default=1.0) – Discount factor. The value should be within (0, 1].

  • bandwidth (float, default=1.0 (> 0)) – Bandwidth hyperparameter of the Gaussian kernel.

  • state_scaler (d3rlpy.preprocessing.Scaler, default=None) – Scaler applied to the state.

  • action_scaler (d3rlpy.preprocessing.ActionScaler, default=None) – Scaler applied to the action.

  • batch_size (int, default=128 (> 0)) – Batch size.

  • lr (float, default=1e-4 (> 0)) – Learning rate.

  • device (str, default="cuda:0") – Specifies device used for torch.

References

Masatoshi Uehara, Jiawei Huang, and Nan Jiang. “Minimax Weight and Q-Function Learning for Off-Policy Evaluation.” 2020.

Attributes:
action_scaler
state_scaler

Methods

fit(step_per_trajectory, state, action, ...)

Fit weight function.

fit_predict(step_per_trajectory, state, ...)

Fit and predict weight function.

load(path)

Load models.

predict(state, action)

Predict state-action marginal importance weight.

predict_weight(state, action)

Predict state-action marginal importance weight.

save(path)

Save models.

load(path)

Load models.

save(path)

Save models.

fit(step_per_trajectory, state, action, evaluation_policy_action, n_steps=10000, n_steps_per_epoch=10000, regularization_weight=1.0, random_state=None, **kwargs)

Fit weight function.

Parameters:
  • step_per_trajectory (int (> 0)) – Number of timesteps in an episode.

  • state (array-like of shape (n_trajectories * step_per_trajectory, state_dim)) – State observed by the behavior policy.

  • action (array-like of shape (n_trajectories * step_per_trajectory, action_dim)) – Action chosen by the behavior policy.

  • evaluation_policy_action (array-like of shape (n_trajectories * step_per_trajectory, action_dim)) – Action chosen by the evaluation policy.

  • n_steps (int, default=10000 (> 0)) – Number of gradient steps.

  • n_steps_per_epoch (int, default=10000 (> 0)) – Number of gradient steps in an epoch.

  • regularization_weight (float, default=1.0 (> 0)) – Scaling factor of the regularization weight.

  • random_state (int, default=None (>= 0)) – Random state.
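The array arguments are flattened over trajectories. The sketch below shows one way to build inputs with the documented shapes from per-trajectory logs. The toy data and the helper `step_of` are hypothetical, the trajectory-major row order (all steps of trajectory 0, then trajectory 1, and so on) is an assumption, and plain lists stand in for the NumPy arrays one would normally pass.

```python
# Hypothetical logged data: 2 trajectories, 3 steps each, state_dim = 2, action_dim = 1.
step_per_trajectory = 3
trajectories = [
    {"state": [[0.0, 0.1], [0.2, 0.3], [0.4, 0.5]], "action": [[1.0], [0.5], [0.0]]},
    {"state": [[1.0, 1.1], [1.2, 1.3], [1.4, 1.5]], "action": [[-1.0], [-0.5], [0.0]]},
]

# Concatenate trajectories so that row i * step_per_trajectory + t holds
# step t of trajectory i (assumed ordering).
state = [row for traj in trajectories for row in traj["state"]]
action = [row for traj in trajectories for row in traj["action"]]

n_trajectories = len(state) // step_per_trajectory


def step_of(flat, i, t):
    """Return the row for step t of trajectory i from a flattened array."""
    return flat[i * step_per_trajectory + t]


print(len(state), len(state[0]))     # rows, state_dim
print(len(action), len(action[0]))   # rows, action_dim
print(step_of(state, 1, 0))          # first state of the second trajectory
```

Every trajectory must contribute exactly step_per_trajectory rows for this indexing to hold.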

predict_weight(state, action)

Predict state-action marginal importance weight.

Parameters:
  • state (array-like of shape (n_trajectories * step_per_trajectory, state_dim)) – State observed by the behavior policy.

  • action (array-like of shape (n_trajectories * step_per_trajectory, action_dim)) – Action chosen by the behavior policy.

Returns:

importance_weight – Estimated state-action marginal importance weight.

Return type:

ndarray of shape (n_trajectories * step_per_trajectory, )
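To show how the returned weights might be consumed, the sketch below computes one common form of a state-action marginal importance sampling estimate: the per-trajectory sum of \(\gamma^t w(s_t, a_t) r_t\), averaged over trajectories. The function name `marginal_is_estimate`, the reward array, and the normalization are assumptions for illustration; the marginal OPE estimators in the library may differ.

```python
def marginal_is_estimate(weight, reward, step_per_trajectory, gamma=1.0):
    """Average over trajectories of sum_t gamma^t * w(s_t, a_t) * r_t (assumed form)."""
    n_trajectories = len(weight) // step_per_trajectory
    total = 0.0
    for i in range(n_trajectories):
        for t in range(step_per_trajectory):
            idx = i * step_per_trajectory + t   # flattened, trajectory-major index
            total += gamma**t * weight[idx] * reward[idx]
    return total / n_trajectories


# Hypothetical flattened inputs: 2 trajectories with 2 steps each.
weight = [1.0, 2.0, 0.5, 1.0]   # e.g. the output of predict_weight
reward = [1.0, 1.0, 2.0, 0.0]   # logged rewards, not part of this API
estimate = marginal_is_estimate(weight, reward, step_per_trajectory=2, gamma=0.5)
print(estimate)
```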

predict(state, action)

Predict state-action marginal importance weight.

Parameters:
  • state (array-like of shape (n_trajectories * step_per_trajectory, state_dim)) – State observed by the behavior policy.

  • action (array-like of shape (n_trajectories * step_per_trajectory, action_dim)) – Action chosen by the behavior policy.

Returns:

importance_weight – Estimated state-action marginal importance weight.

Return type:

ndarray of shape (n_trajectories * step_per_trajectory, )

fit_predict(step_per_trajectory, state, action, evaluation_policy_action, n_steps=10000, n_steps_per_epoch=10000, regularization_weight=1.0, random_state=None, **kwargs)

Fit and predict weight function.

Parameters:
  • step_per_trajectory (int (> 0)) – Number of timesteps in an episode.

  • state (array-like of shape (n_trajectories * step_per_trajectory, state_dim)) – State observed by the behavior policy.

  • action (array-like of shape (n_trajectories * step_per_trajectory, action_dim)) – Action chosen by the behavior policy.

  • evaluation_policy_action (array-like of shape (n_trajectories * step_per_trajectory, action_dim)) – Action chosen by the evaluation policy.

  • n_steps (int, default=10000 (> 0)) – Number of gradient steps.

  • n_steps_per_epoch (int, default=10000 (> 0)) – Number of gradient steps in an epoch.

  • regularization_weight (float, default=1.0 (> 0)) – Scaling factor of the regularization weight.

  • random_state (int, default=None (>= 0)) – Random state.

Returns:

importance_weight – Estimated state-action marginal importance weight.

Return type:

ndarray of shape (n_trajectories * step_per_trajectory, )

class scope_rl.ope.weight_value_learning.minimax_weight_learning_continuous.ContinuousMinimaxStateWeightLearning(w_function, gamma=1.0, bandwidth=1.0, state_scaler=None, action_scaler=None, batch_size=128, lr=0.0001, device='cuda:0')

Minimax Weight Learning for marginal OPE estimators (for continuous action space).

Bases: scope_rl.ope.weight_value_learning.BaseWeightValueLearner

Imported as: scope_rl.ope.weight_value_learning.ContinuousMinimaxStateWeightLearning

Note

Minimax Weight Learning exploits the following identity for the Q-function.

\[\mathbb{E}_{(s_t, a_t, r_t, s_{t+1}) \sim d^{\pi_b}, a_{t+1} \sim \pi(a_{t+1} | s_{t+1})} [w(s_t, a_t) (Q(s_t, a_t) - \gamma Q(s_{t+1}, a_{t+1}))] = (1 - \gamma) \mathbb{E}_{s_0 \sim d(s_0), a_0 \sim \pi(a_0 | s_0)} [Q(s_0, a_0)]\]

where \(Q(s_t, a_t)\) is the Q-function and \(w(s_t, a_t) \approx d^{\pi}(s_t, a_t) / d^{\pi_b}(s_t, a_t) = (d^{\pi}(s_t) \pi(a_t | s_t)) / (d^{\pi_b}(s_t) \pi_b(a_t | s_t))\) is the state-action marginal importance weight.

It then learns the weight by minimizing the gap between the two sides (denoted \(L_w(w, Q)\)) against the worst-case \(Q(\cdot)\). Taking the discriminator \(Q(\cdot)\) from a reproducing kernel Hilbert space (RKHS) gives the worst case the following closed form.

\[\begin{split}\max_Q L_w^2(w, Q) &= \mathbb{E}_{(s_t, a_t, s_{t+1}), (\tilde{s}_t, \tilde{a}_t, \tilde{s}_{t+1}) \sim d^{\pi_b}, a_{t+1} \sim \pi(a_{t+1} | s_{t+1}), \tilde{a}_{t+1} \sim \pi(\tilde{a}_{t+1} | \tilde{s}_{t+1})}[ w_s(s_t) w_a(s_t, a_t) w_s(\tilde{s}_t) w_a(\tilde{s}_t, \tilde{a}_t) ( K((s_t, a_t), (\tilde{s}_t, \tilde{a}_t)) + K((s_{t+1}, a_{t+1}), (\tilde{s}_{t+1}, \tilde{a}_{t+1})) - \gamma ( K((s_t, a_t), (\tilde{s}_{t+1}, \tilde{a}_{t+1})) + K((s_{t+1}, a_{t+1}), (\tilde{s}_t, \tilde{a}_t)) )) ] \\ & \quad \quad + \gamma (1 - \gamma) \mathbb{E}_{(s_t, a_t, s_{t+1}), (\tilde{s}_t, \tilde{a}_t, \tilde{s}_{t+1}) \sim d^{\pi_b}, a_{t+1} \sim \pi(a_{t+1} | s_{t+1}), \tilde{a}_{t+1} \sim \pi(\tilde{a}_{t+1} | \tilde{s}_{t+1}), s_0 \sim d(s_0), \tilde{s}_0 \sim d(\tilde{s}_0), a_0 \sim \pi(a_0 | s_0), \tilde{a}_0 \sim \pi(\tilde{a}_0 | \tilde{s}_0)}[ w_s(s_t) w_a(s_t, a_t) K((s_{t+1}, a_{t+1}), (\tilde{s}_0, \tilde{a}_0)) + w_s(\tilde{s}_t) w_a(\tilde{s}_t, \tilde{a}_t) K((\tilde{s}_{t+1}, \tilde{a}_{t+1}), (s_0, a_0)) ] \\ & \quad \quad - (1 - \gamma) \mathbb{E}_{(s_t, a_t), (\tilde{s}_t, \tilde{a}_t) \sim d^{\pi_b}, s_0 \sim d(s_0), \tilde{s}_0 \sim d(\tilde{s}_0), a_0 \sim \pi(a_0 | s_0), \tilde{a}_0 \sim \pi(\tilde{a}_0 | \tilde{s}_0)}[ w_s(s_t) w_a(s_t, a_t) K((s_t, a_t), (\tilde{s}_0, \tilde{a}_0)) + w_s(\tilde{s}_t) w_a(\tilde{s}_t, \tilde{a}_t) K((\tilde{s}_t, \tilde{a}_t), (s_0, a_0)) ]\end{split}\]

where \(K(\cdot, \cdot)\) is a kernel function, \(w_s(s_t) \approx d^{\pi}(s_t) / d^{\pi_b}(s_t)\) is the state-marginal importance weight, and \(w_a(s_t, a_t) := \pi(a_t | s_t) / \pi_b(a_t | s_t)\) is the immediate importance weight.
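The factorization of the state-action weight into a state-marginal weight and an immediate weight can be sketched as follows. This is a minimal illustration, assuming the action densities of both the evaluation policy and the behavior policy are available; how the library derives the evaluation-policy density from evaluation_policy_action is not stated here. The Gaussian policies, the value 1.2 for the state weight, and the helper names are hypothetical.

```python
import math


def gaussian_pdf(a, mean, std):
    """Density of a one-dimensional Gaussian policy at action a."""
    return math.exp(-0.5 * ((a - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))


def state_action_weight(state_weight, eval_density, pscore):
    """w(s, a) = w_s(s) * w_a(s, a), where w_a is the ratio of action densities."""
    return state_weight * (eval_density / pscore)


a = 0.3                                             # logged action
pscore = gaussian_pdf(a, mean=0.0, std=1.0)         # behavior policy density at a
eval_density = gaussian_pdf(a, mean=0.5, std=1.0)   # evaluation policy density at a
state_weight = 1.2                                  # stand-in for a learned w_s(s)

w = state_action_weight(state_weight, eval_density, pscore)
print(w)
```

Only \(w_s\) is learned in this class; the immediate weight is computed from the propensities, which is why fit additionally requires pscore.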

Parameters:
  • w_function (StateWeightFunction) – Weight function model.

  • gamma (float, default=1.0) – Discount factor. The value should be within (0, 1].

  • bandwidth (float, default=1.0 (> 0)) – Bandwidth hyperparameter of the Gaussian kernel.

  • state_scaler (d3rlpy.preprocessing.Scaler, default=None) – Scaler applied to the state.

  • action_scaler (d3rlpy.preprocessing.ActionScaler, default=None) – Scaler applied to the action.

  • batch_size (int, default=128 (> 0)) – Batch size.

  • lr (float, default=1e-4 (> 0)) – Learning rate.

  • device (str, default="cuda:0") – Specifies device used for torch.

References

Masatoshi Uehara, Jiawei Huang, and Nan Jiang. “Minimax Weight and Q-Function Learning for Off-Policy Evaluation.” 2020.

Attributes:
action_scaler
state_scaler

Methods

fit(step_per_trajectory, state, action, ...)

Fit weight function.

fit_predict(step_per_trajectory, state, ...)

Fit and predict weight function.

load(path)

Load models.

predict(state)

Predict state marginal importance weight.

predict_state_action_marginal_importance_weight(...)

Predict state-action marginal importance weight.

predict_state_marginal_importance_weight(state)

Predict state marginal importance weight.

predict_weight(state)

Predict state marginal importance weight.

save(path)

Save models.

load(path)

Load models.

save(path)

Save models.

fit(step_per_trajectory, state, action, pscore, evaluation_policy_action, n_steps=10000, n_steps_per_epoch=10000, regularization_weight=1.0, random_state=None, **kwargs)

Fit weight function.

Parameters:
  • step_per_trajectory (int (> 0)) – Number of timesteps in an episode.

  • state (array-like of shape (n_trajectories * step_per_trajectory, state_dim)) – State observed by the behavior policy.

  • action (array-like of shape (n_trajectories * step_per_trajectory, action_dim)) – Action chosen by the behavior policy.

  • pscore (array-like of shape (n_trajectories * step_per_trajectory)) – Propensity of the observed action being chosen under the behavior policy (pscore stands for propensity score).

  • evaluation_policy_action (array-like of shape (n_trajectories * step_per_trajectory, action_dim)) – Action chosen by the evaluation policy.

  • n_steps (int, default=10000 (> 0)) – Number of gradient steps.

  • n_steps_per_epoch (int, default=10000 (> 0)) – Number of gradient steps in an epoch.

  • regularization_weight (float, default=1.0 (> 0)) – Scaling factor of the regularization weight.

  • random_state (int, default=None (>= 0)) – Random state.

predict_state_marginal_importance_weight(state)

Predict state marginal importance weight.

Parameters:

state (array-like of shape (n_trajectories * step_per_trajectory, state_dim)) – State observed by the behavior policy.

Returns:

importance_weight – Estimated state marginal importance weight.

Return type:

ndarray of shape (n_trajectories * step_per_trajectory, )

predict_state_action_marginal_importance_weight(state, action, pscore, evaluation_policy_action)

Predict state-action marginal importance weight.

Parameters:
  • state (array-like of shape (n_trajectories * step_per_trajectory, state_dim)) – State observed by the behavior policy.

  • action (array-like of shape (n_trajectories * step_per_trajectory, action_dim)) – Action chosen by the behavior policy.

  • pscore (array-like of shape (n_trajectories * step_per_trajectory)) – Propensity of the observed action being chosen under the behavior policy (pscore stands for propensity score).

  • evaluation_policy_action (array-like of shape (n_trajectories * step_per_trajectory, action_dim)) – Action chosen by the evaluation policy.

Returns:

importance_weight – Estimated state-action marginal importance weight.

Return type:

ndarray of shape (n_trajectories * step_per_trajectory, )

predict_weight(state)

Predict state marginal importance weight.

Parameters:

state (array-like of shape (n_trajectories * step_per_trajectory, state_dim)) – State observed by the behavior policy.

Returns:

importance_weight – Estimated state marginal importance weight.

Return type:

ndarray of shape (n_trajectories * step_per_trajectory, )

predict(state)

Predict state marginal importance weight.

Parameters:

state (array-like of shape (n_trajectories * step_per_trajectory, state_dim)) – State observed by the behavior policy.

Returns:

importance_weight – Estimated state marginal importance weight.

Return type:

ndarray of shape (n_trajectories * step_per_trajectory, )

fit_predict(step_per_trajectory, state, action, pscore, evaluation_policy_action, n_steps=10000, n_steps_per_epoch=10000, regularization_weight=1.0, random_state=None, **kwargs)

Fit and predict weight function.

Parameters:
  • step_per_trajectory (int (> 0)) – Number of timesteps in an episode.

  • state (array-like of shape (n_trajectories * step_per_trajectory, state_dim)) – State observed by the behavior policy.

  • action (array-like of shape (n_trajectories * step_per_trajectory, action_dim)) – Action chosen by the behavior policy.

  • pscore (array-like of shape (n_trajectories * step_per_trajectory)) – Propensity of the observed action being chosen under the behavior policy (pscore stands for propensity score).

  • evaluation_policy_action (array-like of shape (n_trajectories * step_per_trajectory, action_dim)) – Action chosen by the evaluation policy.

  • n_steps (int, default=10000 (> 0)) – Number of gradient steps.

  • n_steps_per_epoch (int, default=10000 (> 0)) – Number of gradient steps in an epoch.

  • regularization_weight (float, default=1.0 (> 0)) – Scaling factor of the regularization weight.

  • random_state (int, default=None (>= 0)) – Random state.

Returns:

importance_weight – Estimated state-action marginal importance weight.

Return type:

ndarray of shape (n_trajectories * step_per_trajectory, )