scope_rl.ope.weight_value_learning.minimax_weight_learning_discrete

Minimax weight function learning (discrete action cases).

Classes

DiscreteMinimaxStateActionWeightLearning

Minimax Weight Learning for marginal OPE estimators (for discrete action space).

DiscreteMinimaxStateWeightLearning

Minimax Weight Learning for marginal OPE estimators (for discrete action space).

class scope_rl.ope.weight_value_learning.minimax_weight_learning_discrete.DiscreteMinimaxStateActionWeightLearning(w_function, gamma=1.0, bandwidth=1.0, state_scaler=None, batch_size=128, lr=0.0001, device='cuda:0')

Minimax Weight Learning for marginal OPE estimators (for discrete action space).

Bases: scope_rl.ope.weight_value_learning.BaseWeightValueLearner

Imported as: scope_rl.ope.weight_value_learning.DiscreteMinimaxStateActionWeightLearning

Note

Minimax Weight Learning relies on the following identity, which the true importance weight satisfies for any Q-function.

\[\mathbb{E}_{(s_t, a_t, r_t, s_{t+1}) \sim d^{\pi_b}, a_{t+1} \sim \pi(a_{t+1} | s_{t+1})} [w(s_t, a_t) (Q(s_t, a_t) - \gamma Q(s_{t+1}, a_{t+1}))] = (1 - \gamma) \mathbb{E}_{s_0 \sim d(s_0), a_0 \sim \pi(a_0 | s_0)} [Q(s_0, a_0)]\]

where \(Q(s_t, a_t)\) is the Q-function and \(w(s_t, a_t) \approx d^{\pi}(s_t, a_t) / d^{\pi_b}(s_t, a_t)\) is the state-action marginal importance weight.
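As a sanity check of this identity, the following sketch evaluates both sides on a hypothetical two-state, two-action MDP. It assumes that \(d^{\pi}\) and \(d^{\pi_b}\) are normalized discounted occupancies and that \(s_0\) follows the initial state distribution, in which case the right-hand side carries a factor \((1 - \gamma)\). All numbers and names (P, d0, pi, pi_b, occupancy) are illustrative and not part of SCOPE-RL.

```python
import random

GAMMA = 0.9
N_STATES, N_ACTIONS = 2, 2

# Hypothetical toy MDP: P[s][a][s2] = probability of moving from s to s2 under a.
P = [[[0.8, 0.2], [0.1, 0.9]],
     [[0.5, 0.5], [0.3, 0.7]]]
d0 = [0.6, 0.4]                      # initial state distribution
pi = [[0.9, 0.1], [0.2, 0.8]]        # evaluation policy pi(a | s)
pi_b = [[0.5, 0.5], [0.5, 0.5]]      # behavior policy pi_b(a | s)


def occupancy(policy, horizon=2000):
    """Normalized discounted state-action occupancy, truncated at `horizon`."""
    d = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
    state_dist = list(d0)
    for t in range(horizon):
        next_dist = [0.0] * N_STATES
        for s in range(N_STATES):
            for a in range(N_ACTIONS):
                p_sa = state_dist[s] * policy[s][a]
                d[s][a] += (1 - GAMMA) * GAMMA**t * p_sa
                for s2 in range(N_STATES):
                    next_dist[s2] += p_sa * P[s][a][s2]
        state_dist = next_dist
    return d


d_pi, d_pib = occupancy(pi), occupancy(pi_b)
# True state-action marginal importance weight on this toy problem.
w = [[d_pi[s][a] / d_pib[s][a] for a in range(N_ACTIONS)] for s in range(N_STATES)]

# The identity should hold for any Q, so draw an arbitrary one.
rng = random.Random(0)
Q = [[rng.uniform(-1, 1) for _ in range(N_ACTIONS)] for _ in range(N_STATES)]


def v(s):
    """Expected Q under the evaluation policy at state s."""
    return sum(pi[s][a] * Q[s][a] for a in range(N_ACTIONS))


lhs = sum(
    d_pib[s][a] * w[s][a]
    * (Q[s][a] - GAMMA * sum(P[s][a][s2] * v(s2) for s2 in range(N_STATES)))
    for s in range(N_STATES) for a in range(N_ACTIONS)
)
rhs = (1 - GAMMA) * sum(d0[s] * v(s) for s in range(N_STATES))
print(abs(lhs - rhs))  # gap between the two sides; should be at rounding-error level
```

Because the check uses the exact weight, the two sides agree up to floating-point and truncation error; a learned weight function would only match approximately.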

Minimax Weight Learning then minimizes the gap between the two sides (denoted \(L_w(w, Q)\)) against the worst-case \(Q(\cdot)\), using a discriminator defined in a reproducing kernel Hilbert space (RKHS), as follows.

\[\begin{split}\max_w L_w^2(w, Q) &= \mathbb{E}_{(s_t, a_t, s_{t+1}), (\tilde{s}_t, \tilde{a}_t, \tilde{s}_{t+1}) \sim d^{\pi_b}, a_{t+1} \sim \pi(a_{t+1} | s_{t+1}), \tilde{a}_{t+1} \sim \pi(\tilde{a}_{t+1} | \tilde{s}_{t+1})}[ w(s_t, a_t) w(\tilde{s}_t, \tilde{a}_t) ( K((s_t, a_t), (\tilde{s}_t, \tilde{a}_t)) + K((s_{t+1}, a_{t+1}), (\tilde{s}_{t+1}, \tilde{a}_{t+1})) - \gamma ( K((s_t, a_t), (\tilde{s}_{t+1}, \tilde{a}_{t+1})) + K((s_{t+1}, a_{t+1}), (\tilde{s}_t, \tilde{a}_t)) )) ] \\ & \quad \quad + \gamma (1 - \gamma) \mathbb{E}_{(s_t, a_t, s_{t+1}), (\tilde{s}_t, \tilde{a}_t, \tilde{s}_{t+1}) \sim d^{\pi_b}, a_{t+1} \sim \pi(a_{t+1} | s_{t+1}), \tilde{a}_{t+1} \sim \pi(\tilde{a}_{t+1} | \tilde{s}_{t+1}), s_0 \sim d(s_0), \tilde{s}_0 \sim d(\tilde{s}_0), a_0 \sim \pi(a_0 | s_0), \tilde{a}_0 \sim \pi(\tilde{a}_0 | \tilde{s}_0)}[ w(s_t, a_t) K((s_{t+1}, a_{t+1}), (\tilde{s}_0, \tilde{a}_0)) + w(\tilde{s}_t, \tilde{a}_t) K((\tilde{s}_{t+1}, \tilde{a}_{t+1}), (s_0, a_0)) ] \\ & \quad \quad - (1 - \gamma) \mathbb{E}_{(s_t, a_t), (\tilde{s}_t, \tilde{a}_t) \sim d^{\pi_b}, s_0 \sim d(s_0), \tilde{s}_0 \sim d(\tilde{s}_0), a_0 \sim \pi(a_0 | s_0), \tilde{a}_0 \sim \pi(\tilde{a}_0 | \tilde{s}_0)}[ w(s_t, a_t) K((s_t, a_t), (\tilde{s}_0, \tilde{a}_0)) + w(\tilde{s}_t, \tilde{a}_t) K((\tilde{s}_t, \tilde{a}_t), (s_0, a_0)) ]\end{split}\]

where \(K(\cdot, \cdot)\) is a kernel function.
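To make the kernel terms concrete, here is a minimal sketch of a kernel on (state, action) pairs and of an empirical version of the first expectation as written above. It assumes a Gaussian kernel on states multiplied by an indicator that the two discrete actions coincide; the library may combine states and actions differently, and the bandwidth scaling shown is one common convention. The function names are hypothetical.

```python
import math


def gaussian_kernel(sa_x, sa_y, bandwidth=1.0):
    """Kernel between two (state, action) pairs with a discrete action.

    Assumption: Gaussian kernel on the states times an indicator that the
    two actions are equal.
    """
    (s_x, a_x), (s_y, a_y) = sa_x, sa_y
    if a_x != a_y:
        return 0.0
    sq_dist = sum((u - v) ** 2 for u, v in zip(s_x, s_y))
    return math.exp(-sq_dist / (2.0 * bandwidth**2))


def pairwise_term(w, sa, sa_next, gamma, bandwidth=1.0):
    """Average of the first bracketed term over all ordered sample pairs (i, j).

    w[i] is the weight of sample i, sa[i] = (s_t, a_t) and
    sa_next[i] = (s_{t+1}, a_{t+1}) with a_{t+1} drawn from the evaluation policy.
    """
    def k(x, y):
        return gaussian_kernel(x, y, bandwidth)

    n = len(w)
    total = 0.0
    for i in range(n):
        for j in range(n):
            total += w[i] * w[j] * (
                k(sa[i], sa[j]) + k(sa_next[i], sa_next[j])
                - gamma * (k(sa[i], sa_next[j]) + k(sa_next[i], sa[j]))
            )
    return total / n**2


# Toy samples: one-dimensional states, two actions.
sa = [((0.0,), 0), ((1.0,), 1)]
sa_next = [((0.5,), 1), ((0.0,), 0)]
print(pairwise_term([1.0, 1.0], sa, sa_next, gamma=0.9))
```

Averaging over all pairs, including i = j, is a simplification; the expectation is defined over two independent draws, so a mini-batch implementation may pair samples differently.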

Parameters:
  • w_function (DiscreteStateActionWeightFunction) – Weight function model.

  • gamma (float, default=1.0) – Discount factor. The value should be within (0, 1].

  • bandwidth (float, default=1.0 (> 0)) – Bandwidth hyperparameter of the Gaussian kernel.

  • state_scaler (d3rlpy.preprocessing.Scaler, default=None) – Scaling factor of state.

  • batch_size (int, default=128 (> 0)) – Batch size.

  • lr (float, default=1e-4 (> 0)) – Learning rate.

  • device (str, default="cuda:0") – Specifies device used for torch.
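The documented ranges can be checked before constructing the learner. The helper below is a hypothetical sketch, not a SCOPE-RL function; it only mirrors the constraints listed above (gamma within (0, 1], and positive bandwidth, batch_size, and lr).

```python
def check_hyperparameters(gamma=1.0, bandwidth=1.0, batch_size=128, lr=1e-4):
    """Hypothetical helper: raise ValueError if a value violates the documented range."""
    if not 0.0 < gamma <= 1.0:
        raise ValueError(f"gamma must be within (0, 1], but {gamma} is given")
    if bandwidth <= 0:
        raise ValueError(f"bandwidth must be positive, but {bandwidth} is given")
    if not isinstance(batch_size, int) or batch_size <= 0:
        raise ValueError(f"batch_size must be a positive integer, but {batch_size} is given")
    if lr <= 0:
        raise ValueError(f"lr must be positive, but {lr} is given")
    # Return the validated values so they can be unpacked into a constructor call.
    return {"gamma": gamma, "bandwidth": bandwidth, "batch_size": batch_size, "lr": lr}


params = check_hyperparameters(gamma=0.95, bandwidth=0.5)
print(params)
```

The returned dictionary could then be passed as keyword arguments alongside w_function, state_scaler, and device.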

References

Masatoshi Uehara, Jiawei Huang, and Nan Jiang. “Minimax Weight and Q-Function Learning for Off-Policy Evaluation.” 2020.

Attributes:
state_scaler

Methods

fit(step_per_trajectory, state, action, ...)

Fit weight function.

fit_predict(step_per_trajectory, state, ...)

Fit and predict weight function.

load(path)

Load models.

predict(state, action)

Predict state-action marginal importance weight.

predict_weight(state, action)

Predict state-action marginal importance weight.

save(path)

Save models.

load(path)

Load models.

save(path)

Save models.

fit(step_per_trajectory, state, action, evaluation_policy_action_dist, n_steps=10000, n_steps_per_epoch=10000, regularization_weight=1.0, random_state=None, **kwargs)

Fit weight function.

Parameters:
  • step_per_trajectory (int (> 0)) – Number of timesteps in an episode.

  • state (array-like of shape (n_trajectories * step_per_trajectory, state_dim)) – State observed by the behavior policy.

  • action (array-like of shape (n_trajectories * step_per_trajectory)) – Action chosen by the behavior policy.

  • evaluation_policy_action_dist (array-like of shape (n_trajectories * step_per_trajectory, n_actions)) – Conditional action distribution induced by the evaluation policy, i.e., \(\pi(a \mid s_{t+1}) \forall a \in \mathcal{A}\)

  • n_steps (int, default=10000 (> 0)) – Number of gradient steps.

  • n_steps_per_epoch (int, default=10000 (> 0)) – Number of gradient steps in an epoch.

  • regularization_weight (float, default=1.0 (> 0)) – Scaling factor of the regularization weight.

  • random_state (int, default=None (>= 0)) – Random state.
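The shapes above imply that trajectories are stacked into flat arrays before calling fit. The sketch below shows one way to build such inputs in plain Python. It assumes rows are ordered trajectory by trajectory, which the shape notation suggests but does not confirm, and it uses a hypothetical epsilon-greedy evaluation policy; both helper names are illustrative.

```python
def flatten_trajectories(trajectories):
    """Flatten equal-length trajectories of (state, action) steps.

    Assumption: step t of trajectory i lands at row i * step_per_trajectory + t.
    """
    step_per_trajectory = len(trajectories[0])
    if any(len(traj) != step_per_trajectory for traj in trajectories):
        raise ValueError("all trajectories must have the same length")
    state = [list(s) for traj in trajectories for s, _ in traj]
    action = [a for traj in trajectories for _, a in traj]
    return step_per_trajectory, state, action


def epsilon_greedy_dist(q_values, epsilon=0.1):
    """Action distribution of a hypothetical epsilon-greedy evaluation policy."""
    n_actions = len(q_values)
    greedy = max(range(n_actions), key=q_values.__getitem__)
    return [
        epsilon / n_actions + (1.0 - epsilon) * (a == greedy)
        for a in range(n_actions)
    ]


# Two toy trajectories of two steps each: (state, action) per step.
trajectories = [
    [((0.0, 1.0), 0), ((0.5, 0.5), 1)],
    [((1.0, 0.0), 1), ((0.2, 0.8), 0)],
]
step_per_trajectory, state, action = flatten_trajectories(trajectories)

# Hypothetical Q-values per flattened row, used only to define the evaluation policy.
q_values = [[1.0, 0.0], [0.2, 0.4], [0.3, 0.1], [0.0, 2.0]]
evaluation_policy_action_dist = [epsilon_greedy_dist(q) for q in q_values]

print(step_per_trajectory, len(state), len(action), len(evaluation_policy_action_dist))
```

Here state has 4 rows of dimension 2, action has 4 entries, and each row of evaluation_policy_action_dist sums to one, matching the documented shapes with n_trajectories = 2 and n_actions = 2.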

predict_weight(state, action)

Predict state-action marginal importance weight.

Parameters:
  • state (array-like of shape (n_trajectories * step_per_trajectory, state_dim)) – State observed by the behavior policy.

  • action (array-like of shape (n_trajectories * step_per_trajectory)) – Action chosen by the behavior policy.

Returns:

importance_weight – Estimated state-action marginal importance weight.

Return type:

ndarray of shape (n_trajectories, step_per_trajectory)

predict(state, action)

Predict state-action marginal importance weight.

Parameters:
  • state (array-like of shape (n_trajectories * step_per_trajectory, state_dim)) – State observed by the behavior policy.

  • action (array-like of shape (n_trajectories * step_per_trajectory)) – Action chosen by the behavior policy.

Returns:

importance_weight – Estimated state-action marginal importance weight.

Return type:

ndarray of shape (n_trajectories, step_per_trajectory)

fit_predict(step_per_trajectory, state, action, evaluation_policy_action_dist, n_steps=10000, n_steps_per_epoch=10000, regularization_weight=1.0, random_state=None, **kwargs)

Fit and predict weight function.

Parameters:
  • step_per_trajectory (int (> 0)) – Number of timesteps in an episode.

  • state (array-like of shape (n_trajectories * step_per_trajectory, state_dim)) – State observed by the behavior policy.

  • action (array-like of shape (n_trajectories * step_per_trajectory)) – Action chosen by the behavior policy.

  • evaluation_policy_action_dist (array-like of shape (n_trajectories * step_per_trajectory, n_actions)) – Conditional action distribution induced by the evaluation policy, i.e., \(\pi(a \mid s_{t+1}) \forall a \in \mathcal{A}\)

  • n_steps (int, default=10000 (> 0)) – Number of gradient steps.

  • n_steps_per_epoch (int, default=10000 (> 0)) – Number of gradient steps in an epoch.

  • regularization_weight (float, default=1.0 (> 0)) – Scaling factor of the regularization weight.

  • random_state (int, default=None (>= 0)) – Random state.

Returns:

importance_weight – Estimated state-action marginal importance weight.

Return type:

ndarray of shape (n_trajectories * step_per_trajectory)
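For context, the sketch below shows one common way estimated state-action marginal importance weights enter a marginal OPE estimate: a discounted, weight-corrected sum of rewards averaged over trajectories. This is an illustrative assumption about downstream use, not necessarily the estimator SCOPE-RL applies; the function name and the flattened reward input are hypothetical.

```python
def marginal_is_estimate(weight, reward, step_per_trajectory, gamma=1.0):
    """Average over trajectories of sum_t gamma^t * w_t * r_t.

    `weight` and `reward` are flat sequences of length
    n_trajectories * step_per_trajectory, ordered trajectory by trajectory.
    """
    if len(weight) != len(reward) or len(weight) % step_per_trajectory != 0:
        raise ValueError("inputs must be flat and aligned with step_per_trajectory")
    n_trajectories = len(weight) // step_per_trajectory
    total = 0.0
    for i, (w, r) in enumerate(zip(weight, reward)):
        t = i % step_per_trajectory  # timestep within the trajectory
        total += gamma**t * w * r
    return total / n_trajectories


# Two trajectories of three steps, unit weights and unit rewards.
weight = [1.0] * 6
reward = [1.0] * 6
print(marginal_is_estimate(weight, reward, step_per_trajectory=3, gamma=0.5))  # 1.75
```

With unit weights the estimate reduces to the average discounted return of the logged data, here 1 + 0.5 + 0.25.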

class scope_rl.ope.weight_value_learning.minimax_weight_learning_discrete.DiscreteMinimaxStateWeightLearning(w_function, gamma=1.0, bandwidth=1.0, state_scaler=None, batch_size=128, lr=0.0001, device='cuda:0')

Minimax Weight Learning for marginal OPE estimators (for discrete action space).

Bases: scope_rl.ope.weight_value_learning.BaseWeightValueLearner

Imported as: scope_rl.ope.weight_value_learning.DiscreteMinimaxStateWeightLearning

Note

Minimax Weight Learning relies on the following identity, which the true importance weight satisfies for any Q-function.

\[\mathbb{E}_{(s_t, a_t, r_t, s_{t+1}) \sim d^{\pi_b}, a_{t+1} \sim \pi(a_{t+1} | s_{t+1})} [w(s_t, a_t) (Q(s_t, a_t) - \gamma Q(s_{t+1}, a_{t+1}))] = (1 - \gamma) \mathbb{E}_{s_0 \sim d(s_0), a_0 \sim \pi(a_0 | s_0)} [Q(s_0, a_0)]\]

where \(Q(s_t, a_t)\) is the Q-function and \(w(s_t, a_t) \approx d^{\pi}(s_t, a_t) / d^{\pi_b}(s_t, a_t) = (d^{\pi}(s_t) \pi(a_t | s_t)) / (d^{\pi_b}(s_t) \pi_b(a_t | s_t))\) is the state-action marginal importance weight.

Minimax Weight Learning then minimizes the gap between the two sides (denoted \(L_w(w, Q)\)) against the worst-case \(Q(\cdot)\), using a discriminator defined in a reproducing kernel Hilbert space (RKHS), as follows.

\[\begin{split}\max_w L_w^2(w, Q) &= \mathbb{E}_{(s_t, a_t, s_{t+1}), (\tilde{s}_t, \tilde{a}_t, \tilde{s}_{t+1}) \sim d^{\pi_b}, a_{t+1} \sim \pi(a_{t+1} | s_{t+1}), \tilde{a}_{t+1} \sim \pi(\tilde{a}_{t+1} | \tilde{s}_{t+1})}[ w_s(s_t) w_a(s_t, a_t) w_s(\tilde{s}_t) w_a(\tilde{s}_t, \tilde{a}_t) ( K((s_t, a_t), (\tilde{s}_t, \tilde{a}_t)) + K((s_{t+1}, a_{t+1}), (\tilde{s}_{t+1}, \tilde{a}_{t+1})) - \gamma ( K((s_t, a_t), (\tilde{s}_{t+1}, \tilde{a}_{t+1})) + K((s_{t+1}, a_{t+1}), (\tilde{s}_t, \tilde{a}_t)) )) ] \\ & \quad \quad + \gamma (1 - \gamma) \mathbb{E}_{(s_t, a_t, s_{t+1}), (\tilde{s}_t, \tilde{a}_t, \tilde{s}_{t+1}) \sim d^{\pi_b}, a_{t+1} \sim \pi(a_{t+1} | s_{t+1}), \tilde{a}_{t+1} \sim \pi(\tilde{a}_{t+1} | \tilde{s}_{t+1}), s_0 \sim d(s_0), \tilde{s}_0 \sim d(\tilde{s}_0), a_0 \sim \pi(a_0 | s_0), \tilde{a}_0 \sim \pi(\tilde{a}_0 | \tilde{s}_0)}[ w_s(s_t) w_a(s_t, a_t) K((s_{t+1}, a_{t+1}), (\tilde{s}_0, \tilde{a}_0)) + w_s(\tilde{s}_t) w_a(\tilde{s}_t, \tilde{a}_t) K((\tilde{s}_{t+1}, \tilde{a}_{t+1}), (s_0, a_0)) ] \\ & \quad \quad - (1 - \gamma) \mathbb{E}_{(s_t, a_t), (\tilde{s}_t, \tilde{a}_t) \sim d^{\pi_b}, s_0 \sim d(s_0), \tilde{s}_0 \sim d(\tilde{s}_0), a_0 \sim \pi(a_0 | s_0), \tilde{a}_0 \sim \pi(\tilde{a}_0 | \tilde{s}_0)}[ w_s(s_t) w_a(s_t, a_t) K((s_t, a_t), (\tilde{s}_0, \tilde{a}_0)) + w_s(\tilde{s}_t) w_a(\tilde{s}_t, \tilde{a}_t) K((\tilde{s}_t, \tilde{a}_t), (s_0, a_0)) ]\end{split}\]

where \(K(\cdot, \cdot)\) is a kernel function, \(w_s(s_t) \approx d^{\pi}(s_t) / d^{\pi_b}(s_t)\) is the state-marginal importance weight, and \(w_a(s_t, a_t) := \pi(a_t | s_t) / \pi_b(a_t | s_t)\) is the immediate importance weight.
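The factorization of the state-action weight into a state-marginal weight and an immediate importance weight can be sketched as follows. This is a minimal illustration that assumes flat, row-aligned inputs and takes the behavior-policy probability of the logged action from pscore; the function name is hypothetical and the code is not the library's implementation.

```python
def state_action_weight(state_weight, action, pscore, evaluation_policy_action_dist):
    """Combine w_s(s_t) with the immediate weight pi(a_t | s_t) / pscore_t, row by row."""
    weights = []
    for w_s, a, p_b, dist in zip(
        state_weight, action, pscore, evaluation_policy_action_dist
    ):
        if p_b <= 0:
            raise ValueError("pscore must be positive for every logged action")
        immediate = dist[a] / p_b  # evaluation-policy probability over behavior propensity
        weights.append(w_s * immediate)
    return weights


# One logged step: state weight 2.0, action 1, behavior propensity 0.5,
# evaluation policy putting probability 0.75 on action 1.
print(state_action_weight([2.0], [1], [0.5], [[0.25, 0.75]]))  # [3.0]
```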

Parameters:
  • w_function (StateWeightFunction) – Weight function model.

  • gamma (float, default=1.0) – Discount factor. The value should be within (0, 1].

  • bandwidth (float, default=1.0 (> 0)) – Bandwidth hyperparameter of the Gaussian kernel.

  • state_scaler (d3rlpy.preprocessing.Scaler, default=None) – Scaling factor of state.

  • batch_size (int, default=128 (> 0)) – Batch size.

  • lr (float, default=1e-4 (> 0)) – Learning rate.

  • device (str, default="cuda:0") – Specifies device used for torch.

References

Masatoshi Uehara, Jiawei Huang, and Nan Jiang. “Minimax Weight and Q-Function Learning for Off-Policy Evaluation.” 2020.

Attributes:
state_scaler

Methods

fit(step_per_trajectory, state, action, ...)

Fit weight function.

fit_predict(step_per_trajectory, state, ...)

Fit and predict weight function.

load(path)

Load models.

predict(state)

Predict state marginal importance weight.

predict_state_action_marginal_importance_weight(...)

Predict state-action marginal importance weight.

predict_state_marginal_importance_weight(state)

Predict state marginal importance weight.

predict_weight(state)

Predict state marginal importance weight.

save(path)

Save models.

load(path)

Load models.

save(path)

Save models.

fit(step_per_trajectory, state, action, pscore, evaluation_policy_action_dist, n_steps=10000, n_steps_per_epoch=10000, regularization_weight=1.0, random_state=None, **kwargs)

Fit weight function.

Parameters:
  • step_per_trajectory (int (> 0)) – Number of timesteps in an episode.

  • state (array-like of shape (n_trajectories * step_per_trajectory, state_dim)) – State observed by the behavior policy.

  • action (array-like of shape (n_trajectories * step_per_trajectory)) – Action chosen by the behavior policy.

  • pscore (array-like of shape (n_trajectories * step_per_trajectory)) – Propensity of the observed action being chosen under the behavior policy (pscore stands for propensity score).

  • evaluation_policy_action_dist (array-like of shape (n_trajectories * step_per_trajectory, n_actions)) – Conditional action distribution induced by the evaluation policy, i.e., \(\pi(a \mid s_t) \forall a \in \mathcal{A}\)

  • n_steps (int, default=10000 (> 0)) – Number of gradient steps.

  • n_steps_per_epoch (int, default=10000 (> 0)) – Number of gradient steps in an epoch.

  • regularization_weight (float, default=1.0 (> 0)) – Scaling factor of the regularization weight.

  • random_state (int, default=None (>= 0)) – Random state.
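When the behavior policy's full action distribution is logged, the pscore input can be derived by picking the probability of the action actually taken in each row. The helper below is a hypothetical sketch under that assumption; it is not part of SCOPE-RL.

```python
def pscore_from_behavior_dist(action, behavior_policy_action_dist):
    """Propensity of each logged action: pi_b(a_t | s_t) for the chosen a_t."""
    pscore = [dist[a] for a, dist in zip(action, behavior_policy_action_dist)]
    if any(not 0.0 < p <= 1.0 for p in pscore):
        raise ValueError("every logged action needs a propensity within (0, 1]")
    return pscore


action = [0, 1, 1]
behavior_policy_action_dist = [[0.7, 0.3], [0.4, 0.6], [0.5, 0.5]]
print(pscore_from_behavior_dist(action, behavior_policy_action_dist))  # [0.7, 0.6, 0.5]
```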

predict_state_marginal_importance_weight(state)

Predict state marginal importance weight.

Parameters:
  • state (array-like of shape (n_trajectories * step_per_trajectory, state_dim)) – State observed by the behavior policy.

  • state_scaler (d3rlpy.preprocessing.Scaler, default=None) – Scaling factor of state.

Returns:

importance_weight – Estimated state marginal importance weight.

Return type:

ndarray of shape (n_trajectories * step_per_trajectory, )

predict_state_action_marginal_importance_weight(state, action, pscore, evaluation_policy_action_dist)

Predict state-action marginal importance weight.

Parameters:
  • state (array-like of shape (n_trajectories * step_per_trajectory, state_dim)) – State observed by the behavior policy.

  • action (array-like of shape (n_trajectories * step_per_trajectory)) – Action chosen by the behavior policy.

  • pscore (array-like of shape (n_trajectories * step_per_trajectory)) – Propensity of the observed action being chosen under the behavior policy (pscore stands for propensity score).

  • evaluation_policy_action_dist (array-like of shape (n_trajectories * step_per_trajectory, n_actions)) – Conditional action distribution induced by the evaluation policy, i.e., \(\pi(a \mid s_t) \forall a \in \mathcal{A}\)

  • state_scaler (d3rlpy.preprocessing.Scaler, default=None) – Scaling factor of state.

Returns:

importance_weight – Estimated state-action marginal importance weight.

Return type:

ndarray of shape (n_trajectories * step_per_trajectory, )

predict_weight(state)

Predict state marginal importance weight.

Parameters:
  • state (array-like of shape (n_trajectories * step_per_trajectory, state_dim)) – State observed by the behavior policy.

  • state_scaler (d3rlpy.preprocessing.Scaler, default=None) – Scaling factor of state.

Returns:

importance_weight – Estimated state marginal importance weight.

Return type:

ndarray of shape (n_trajectories * step_per_trajectory, )

predict(state)

Predict state marginal importance weight.

Parameters:

  • state (array-like of shape (n_trajectories * step_per_trajectory, state_dim)) – State observed by the behavior policy.

Returns:

importance_weight – Estimated state marginal importance weight.

Return type:

ndarray of shape (n_trajectories * step_per_trajectory, )

fit_predict(step_per_trajectory, state, action, pscore, evaluation_policy_action_dist, n_steps=10000, n_steps_per_epoch=10000, regularization_weight=1.0, random_state=None, **kwargs)

Fit and predict weight function.

Parameters:
  • step_per_trajectory (int (> 0)) – Number of timesteps in an episode.

  • state (array-like of shape (n_trajectories * step_per_trajectory, state_dim)) – State observed by the behavior policy.

  • action (array-like of shape (n_trajectories * step_per_trajectory)) – Action chosen by the behavior policy.

  • pscore (array-like of shape (n_trajectories * step_per_trajectory)) – Propensity of the observed action being chosen under the behavior policy (pscore stands for propensity score).

  • evaluation_policy_action_dist (array-like of shape (n_trajectories * step_per_trajectory, n_actions)) – Conditional action distribution induced by the evaluation policy, i.e., \(\pi(a \mid s_t) \forall a \in \mathcal{A}\)

  • n_steps (int, default=10000 (> 0)) – Number of gradient steps.

  • n_steps_per_epoch (int, default=10000 (> 0)) – Number of gradient steps in an epoch.

  • regularization_weight (float, default=1.0 (> 0)) – Scaling factor of the regularization weight.

  • random_state (int, default=None (>= 0)) – Random state.

Returns:

importance_weight – Estimated state-action marginal importance weight.

Return type:

ndarray of shape (n_trajectories * step_per_trajectory, )