scope_rl.ope.weight_value_learning.minimax_weight_learning_continuous

Minimax weight function learning (continuous action cases).

Classes

`ContinuousMinimaxStateActionWeightLearning`	Minimax Weight Learning for marginal OPE estimators (for continuous action space).
`ContinuousMinimaxStateWeightLearning`	Minimax Weight Learning for marginal OPE estimators (for continuous action space).

class scope_rl.ope.weight_value_learning.minimax_weight_learning_continuous.ContinuousMinimaxStateActionWeightLearning(w_function, gamma=1.0, bandwidth=1.0, state_scaler=None, action_scaler=None, batch_size=128, lr=0.0001, device='cuda:0')

Minimax Weight Learning for marginal OPE estimators (for continuous action space).

Bases: scope_rl.ope.weight_value_learning.BaseWeightValueLearner

Imported as: scope_rl.ope.weight_value_learning.ContinuousMinimaxStateActionWeightLearning

Note

Minimax Weight Learning exploits the following identity, which the true importance weight satisfies for any Q-function.

\[\mathbb{E}_{(s_t, a_t, r_t, s_{t+1}) \sim d^{\pi_b}, a_{t+1} \sim \pi(a_{t+1} | s_{t+1})} [w(s_t, a_t) (Q(s_t, a_t) - \gamma Q(s_{t+1}, a_{t+1}))] = (1 - \gamma) \mathbb{E}_{s_0 \sim d(s_0), a_0 \sim \pi(a_0 | s_0)} [Q(s_0, a_0)]\]

where \(Q(s_t, a_t)\) is the Q-function and \(w(s_t, a_t) \approx d^{\pi}(s_t, a_t) / d^{\pi_b}(s_t, a_t)\) is the state-action marginal importance weight.
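As a sanity check of this identity, the sketch below builds a made-up two-state, two-action MDP and evaluates both sides with the exact weight \(w = d^{\pi} / d^{\pi_b}\). Every number and name here (`P`, `D0`, `PI`, `PI_B`, `occupancy`, `q_next`) is an illustrative assumption and not part of scope_rl. It also assumes \(\gamma < 1\) and the normalized discounted occupancy, under which the initial-state term carries a \((1 - \gamma)\) factor.

```python
GAMMA = 0.9
STATES, ACTIONS = (0, 1), (0, 1)

# P[s][a][s2]: transition probabilities of a made-up two-state MDP.
P = {0: {0: [0.8, 0.2], 1: [0.1, 0.9]},
     1: {0: [0.5, 0.5], 1: [0.3, 0.7]}}
D0 = [0.6, 0.4]                         # initial state distribution d(s_0)
PI = {0: [0.2, 0.8], 1: [0.7, 0.3]}     # evaluation policy pi(a | s)
PI_B = {0: [0.5, 0.5], 1: [0.4, 0.6]}   # behavior policy pi_b(a | s)
# An arbitrary test function Q; the identity should hold for any choice.
Q = {(0, 0): 1.0, (0, 1): -2.0, (1, 0): 0.5, (1, 1): 3.0}


def occupancy(policy, horizon=500):
    """Normalized discounted occupancy (1 - gamma) * sum_t gamma^t Pr(s_t, a_t)."""
    d = {(s, a): 0.0 for s in STATES for a in ACTIONS}
    p_state = list(D0)
    for t in range(horizon):
        for s in STATES:
            for a in ACTIONS:
                d[(s, a)] += (1 - GAMMA) * GAMMA ** t * p_state[s] * policy[s][a]
        # Push the state distribution one step forward under the policy.
        p_state = [
            sum(p_state[s] * policy[s][a] * P[s][a][s2]
                for s in STATES for a in ACTIONS)
            for s2 in STATES
        ]
    return d


def q_next(s, a):
    """E[Q(s', a')] with s' ~ P(. | s, a) and a' ~ pi(. | s')."""
    return sum(P[s][a][s2] * PI[s2][a2] * Q[(s2, a2)]
               for s2 in STATES for a2 in ACTIONS)


d_pi, d_b = occupancy(PI), occupancy(PI_B)
w = {sa: d_pi[sa] / d_b[sa] for sa in d_pi}  # exact marginal importance weight

lhs = sum(d_b[sa] * w[sa] * (Q[sa] - GAMMA * q_next(*sa)) for sa in d_b)
rhs = (1 - GAMMA) * sum(D0[s] * PI[s][a] * Q[(s, a)]
                        for s in STATES for a in ACTIONS)
print(lhs, rhs)
```

The two printed values should agree up to floating-point error. A learned \(w\) that violates the identity for some \(Q\) is what the adversarial objective penalizes.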

Then, it adversarially minimizes the difference between the RHS and LHS (which we denote \(L_w(w, Q)\)) in the worst case over \(Q(\cdot)\), using a discriminator defined in a reproducing kernel Hilbert space (RKHS) as follows.

\[\begin{split}\max_w L_w^2(w, Q) &= \mathbb{E}_{(s_t, a_t, s_{t+1}), (\tilde{s}_t, \tilde{a}_t, \tilde{s}_{t+1}) \sim d^{\pi_b}, a_{t+1} \sim \pi(a_{t+1} | s_{t+1}), \tilde{a}_{t+1} \sim \pi(\tilde{a}_{t+1} | \tilde{s}_{t+1})}[ w(s_t, a_t) w(\tilde{s}_t, \tilde{a}_t) ( K((s_t, a_t), (\tilde{s}_t, \tilde{a}_t)) + K((s_{t+1}, a_{t+1}), (\tilde{s}_{t+1}, \tilde{a}_{t+1})) - \gamma ( K((s_t, a_t), (\tilde{s}_{t+1}, \tilde{a}_{t+1})) + K((s_{t+1}, a_{t+1}), (\tilde{s}_t, \tilde{a}_t)) )) ] \\ & \quad \quad + \gamma (1 - \gamma) \mathbb{E}_{(s_t, a_t, s_{t+1}), (\tilde{s}_t, \tilde{a}_t, \tilde{s}_{t+1}) \sim d^{\pi_b}, a_{t+1} \sim \pi(a_{t+1} | s_{t+1}), \tilde{a}_{t+1} \sim \pi(\tilde{a}_{t+1} | \tilde{s}_{t+1}), s_0 \sim d(s_0), \tilde{s}_0 \sim d(\tilde{s}_0), a_0 \sim \pi(a_0 | s_0), \tilde{a}_0 \sim \pi(\tilde{a}_0 | \tilde{s}_0)}[ w(s_t, a_t) K((s_{t+1}, a_{t+1}), (\tilde{s}_0, \tilde{a}_0)) + w(\tilde{s}_t, \tilde{a}_t) K((\tilde{s}_{t+1}, \tilde{a}_{t+1}), (s_0, a_0)) ] \\ & \quad \quad - (1 - \gamma) \mathbb{E}_{(s_t, a_t), (\tilde{s}_t, \tilde{a}_t) \sim d^{\pi_b}, s_0 \sim d(s_0), \tilde{s}_0 \sim d(\tilde{s}_0), a_0 \sim \pi(a_0 | s_0), \tilde{a}_0 \sim \pi(\tilde{a}_0 | \tilde{s}_0)}[ w(s_t, a_t) K((s_t, a_t), (\tilde{s}_0, \tilde{a}_0)) + w(\tilde{s}_t, \tilde{a}_t) K((\tilde{s}_t, \tilde{a}_t), (s_0, a_0)) ]\end{split}\]

where \(K(\cdot, \cdot)\) is a kernel function.
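To make the first expectation of the objective concrete, here is a minimal sketch of a plug-in estimate over all sample pairs. It assumes a Gaussian kernel of the form \(\exp(-\|x - y\|^2 / (2 h^2))\) applied to concatenated (state, action) vectors; the exact kernel parameterization used by the library may differ, and the helper names `gaussian_kernel` and `pairwise_term` are hypothetical.

```python
import math


def gaussian_kernel(x, y, bandwidth=1.0):
    """exp(-||x - y||^2 / (2 * bandwidth^2)) for two equal-length vectors."""
    sq_dist = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def pairwise_term(weights, current, nxt, gamma, bandwidth=1.0):
    """Plug-in estimate of the first expectation over all sample pairs (i, j).

    current[i] is the concatenation (s_t, a_t) of sample i, and nxt[i] is
    (s_{t+1}, a_{t+1}) with a_{t+1} drawn from the evaluation policy.
    """
    n = len(weights)
    total = 0.0
    for i in range(n):
        for j in range(n):
            k = (gaussian_kernel(current[i], current[j], bandwidth)
                 + gaussian_kernel(nxt[i], nxt[j], bandwidth)
                 - gamma * (gaussian_kernel(current[i], nxt[j], bandwidth)
                            + gaussian_kernel(nxt[i], current[j], bandwidth)))
            total += weights[i] * weights[j] * k
    return total / (n * n)


# Toy data: 1-D state and 1-D action concatenated into 2-D vectors.
current = [[0.0, 0.1], [1.0, -0.2], [0.5, 0.3]]
nxt = [[0.1, 0.0], [0.9, -0.1], [0.6, 0.2]]
weights = [1.0, 0.8, 1.2]
value = pairwise_term(weights, current, nxt, gamma=0.9, bandwidth=1.0)
print(value)
```

The double loop is quadratic in the number of samples, which is one reason such an objective is usually evaluated on minibatches.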

Parameters:

w_function (ContinuousStateActionWeightFunction) – Weight function model.
gamma (float, default=1.0) – Discount factor. The value should be within (0, 1].
bandwidth (float, default=1.0 (> 0)) – Bandwidth hyperparameter of the Gaussian kernel.
state_scaler (d3rlpy.preprocessing.Scaler, default=None) – Scaling factor of state.
action_scaler (d3rlpy.preprocessing.ActionScaler, default=None) – Scaling factor of action.
batch_size (int, default=128 (> 0)) – Batch size.
lr (float, default=1e-4 (> 0)) – Learning rate.
device (str, default="cuda:0") – Specifies device used for torch.
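The value ranges listed above can be checked before constructing a learner. The helper below is a hypothetical sketch, not part of scope_rl, and only encodes the ranges stated in this parameter list.

```python
def check_hyperparameters(gamma=1.0, bandwidth=1.0, batch_size=128, lr=1e-4):
    """Raise ValueError when a value falls outside the documented range."""
    if not 0.0 < gamma <= 1.0:
        raise ValueError(f"gamma must be within (0, 1], got {gamma}")
    if bandwidth <= 0.0:
        raise ValueError(f"bandwidth must be positive, got {bandwidth}")
    if not isinstance(batch_size, int) or batch_size <= 0:
        raise ValueError(f"batch_size must be a positive int, got {batch_size}")
    if lr <= 0.0:
        raise ValueError(f"lr must be positive, got {lr}")
    return {"gamma": gamma, "bandwidth": bandwidth,
            "batch_size": batch_size, "lr": lr}


# The documented defaults pass the check.
config = check_hyperparameters()
print(config)
```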

References

Masatoshi Uehara, Jiawei Huang, and Nan Jiang. “Minimax Weight and Q-Function Learning for Off-Policy Evaluation.” 2020.

Attributes:

action_scaler
state_scaler

Methods

`fit`(step_per_trajectory, state, action, ...)	Fit weight function.
`fit_predict`(step_per_trajectory, state, ...)	Fit and predict weight function.
`load`(path)	Load models.
`predict`(state, action)	Predict state-action marginal importance weight.
`predict_weight`(state, action)	Predict state-action marginal importance weight.
`save`(path)	Save models.

load(path)

Load models.

save(path)

Save models.

fit(step_per_trajectory, state, action, evaluation_policy_action, n_steps=10000, n_steps_per_epoch=10000, regularization_weight=1.0, random_state=None, **kwargs)

Fit weight function.

Parameters:

step_per_trajectory (int (> 0)) – Number of timesteps in an episode.
state (array-like of shape (n_trajectories * step_per_trajectory, state_dim)) – State observed by the behavior policy.
action (array-like of shape (n_trajectories * step_per_trajectory, action_dim)) – Action chosen by the behavior policy.
evaluation_policy_action (array-like of shape (n_trajectories * step_per_trajectory, action_dim)) – Action chosen by the evaluation policy.
n_steps (int, default=10000 (> 0)) – Number of gradient steps.
n_steps_per_epoch (int, default=10000 (> 0)) – Number of gradient steps in an epoch.
regularization_weight (float, default=1.0 (> 0)) – Scaling factor of the regularization weight.
random_state (int, default=None (>= 0)) – Random state.
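The array shapes above imply that all trajectories are stacked into a single two-dimensional array. The sketch below assumes trajectory-major ordering (all steps of trajectory 0 first, then trajectory 1, and so on) and equal-length trajectories; the helper names are hypothetical.

```python
def flatten_trajectories(trajectories):
    """Concatenate per-trajectory step lists into one trajectory-major list."""
    step_per_trajectory = len(trajectories[0])
    if any(len(traj) != step_per_trajectory for traj in trajectories):
        raise ValueError("all trajectories must have the same length")
    flat = [step for traj in trajectories for step in traj]
    return flat, step_per_trajectory


def row_to_index(row, step_per_trajectory):
    """Map a flattened row back to (trajectory index, timestep)."""
    return divmod(row, step_per_trajectory)


# Two trajectories of three steps each, with state_dim = 2.
states = [
    [[0.0, 1.0], [0.1, 0.9], [0.2, 0.8]],
    [[1.0, 0.0], [0.9, 0.1], [0.8, 0.2]],
]
flat_state, step_per_trajectory = flatten_trajectories(states)
print(len(flat_state), step_per_trajectory)
print(row_to_index(4, step_per_trajectory))
```

Here `flat_state` has `n_trajectories * step_per_trajectory` rows, which is the layout the `state`, `action`, and `evaluation_policy_action` arguments are documented to share.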

predict_weight(state, action)

Predict state-action marginal importance weight.

Parameters:

state (array-like of shape (n_trajectories * step_per_trajectory, state_dim)) – State observed by the behavior policy.
action (array-like of shape (n_trajectories * step_per_trajectory, action_dim)) – Action chosen by the behavior policy.

Returns:

importance_weight – Estimated state-action marginal importance weight.

Return type:

ndarray of shape (n_trajectories * step_per_trajectory, )

predict(state, action)

Predict state-action marginal importance weight.

Parameters:

state (array-like of shape (n_trajectories * step_per_trajectory, state_dim)) – State observed by the behavior policy.
action (array-like of shape (n_trajectories * step_per_trajectory, action_dim)) – Action chosen by the behavior policy.

Returns:

importance_weight – Estimated state-action marginal importance weight.

Return type:

ndarray of shape (n_trajectories * step_per_trajectory, )
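Once the weights are predicted, they can be plugged into a marginal importance sampling estimate of the policy value. The sketch below shows one common form, the per-trajectory average of \(\sum_t \gamma^t w(s_t, a_t) r_t\); it assumes trajectory-major row ordering and is not necessarily the normalization scope_rl's estimators apply. The function name and the reward array are hypothetical.

```python
def marginal_is_estimate(weights, rewards, step_per_trajectory, gamma=1.0):
    """Average over trajectories of sum_t gamma^t * w_t * r_t."""
    if len(weights) != len(rewards):
        raise ValueError("weights and rewards must have the same length")
    if len(weights) % step_per_trajectory != 0:
        raise ValueError("length must be a multiple of step_per_trajectory")
    n_trajectories = len(weights) // step_per_trajectory
    total = 0.0
    for row, (w, r) in enumerate(zip(weights, rewards)):
        t = row % step_per_trajectory  # timestep within the trajectory
        total += (gamma ** t) * w * r
    return total / n_trajectories


# Two trajectories of three steps; weights would come from predict(state, action).
weights = [1.0, 0.5, 2.0, 1.5, 1.0, 0.0]
rewards = [1.0, 1.0, 1.0, 2.0, 0.0, 4.0]
estimate = marginal_is_estimate(weights, rewards, step_per_trajectory=3, gamma=0.5)
print(estimate)
```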

fit_predict(step_per_trajectory, state, action, evaluation_policy_action, n_steps=10000, n_steps_per_epoch=10000, regularization_weight=1.0, random_state=None, **kwargs)

Fit and predict weight function.

Parameters:

step_per_trajectory (int (> 0)) – Number of timesteps in an episode.
state (array-like of shape (n_trajectories * step_per_trajectory, state_dim)) – State observed by the behavior policy.
action (array-like of shape (n_trajectories * step_per_trajectory, action_dim)) – Action chosen by the behavior policy.
evaluation_policy_action (array-like of shape (n_trajectories * step_per_trajectory, action_dim)) – Next action chosen by the evaluation policy.
n_steps (int, default=10000 (> 0)) – Number of gradient steps.
n_steps_per_epoch (int, default=10000 (> 0)) – Number of gradient steps in an epoch.
regularization_weight (float, default=1.0 (> 0)) – Scaling factor of the regularization weight.
random_state (int, default=None (>= 0)) – Random state.

Returns:

importance_weight – Estimated state-action marginal importance weight.

Return type:

ndarray of shape (n_trajectories * step_per_trajectory, )

class scope_rl.ope.weight_value_learning.minimax_weight_learning_continuous.ContinuousMinimaxStateWeightLearning(w_function, gamma=1.0, bandwidth=1.0, state_scaler=None, action_scaler=None, batch_size=128, lr=0.0001, device='cuda:0')

Minimax Weight Learning for marginal OPE estimators (for continuous action space).

Bases: scope_rl.ope.weight_value_learning.BaseWeightValueLearner

Imported as: scope_rl.ope.weight_value_learning.ContinuousMinimaxStateWeightLearning

Note

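Minimax Weight Learning exploits the following identity, which the true importance weight satisfies for any Q-function.

\[\mathbb{E}_{(s_t, a_t, r_t, s_{t+1}) \sim d^{\pi_b}, a_{t+1} \sim \pi(a_{t+1} | s_{t+1})} [w(s_t, a_t) (Q(s_t, a_t) - \gamma Q(s_{t+1}, a_{t+1}))] = (1 - \gamma) \mathbb{E}_{s_0 \sim d(s_0), a_0 \sim \pi(a_0 | s_0)} [Q(s_0, a_0)]\]

where \(Q(s_t, a_t)\) is the Q-function and \(w(s_t, a_t) \approx d^{\pi}(s_t, a_t) / d^{\pi_b}(s_t, a_t) = (d^{\pi}(s_t) \pi(a_t | s_t)) / (d^{\pi_b}(s_t) \pi_b(a_t | s_t))\) is the state-action marginal importance weight.

Then, it adversarially minimizes the difference between the RHS and LHS (which we denote \(L_w(w, Q)\)) in the worst case over \(Q(\cdot)\), using a discriminator defined in a reproducing kernel Hilbert space (RKHS) as follows.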

\[\begin{split}\max_w L_w^2(w, Q) &= \mathbb{E}_{(s_t, a_t, s_{t+1}), (\tilde{s}_t, \tilde{a}_t, \tilde{s}_{t+1}) \sim d^{\pi_b}, a_{t+1} \sim \pi(a_{t+1} | s_{t+1}), \tilde{a}_{t+1} \sim \pi(\tilde{a}_{t+1} | \tilde{s}_{t+1})}[ w_s(s_t) w_a(s_t, a_t) w_s(\tilde{s}_t) w_a(\tilde{s}_t, \tilde{a}_t) ( K((s_t, a_t), (\tilde{s}_t, \tilde{a}_t)) + K((s_{t+1}, a_{t+1}), (\tilde{s}_{t+1}, \tilde{a}_{t+1})) - \gamma ( K((s_t, a_t), (\tilde{s}_{t+1}, \tilde{a}_{t+1})) + K((s_{t+1}, a_{t+1}), (\tilde{s}_t, \tilde{a}_t)) )) ] \\ & \quad \quad + \gamma (1 - \gamma) \mathbb{E}_{(s_t, a_t, s_{t+1}), (\tilde{s}_t, \tilde{a}_t, \tilde{s}_{t+1}) \sim d^{\pi_b}, a_{t+1} \sim \pi(a_{t+1} | s_{t+1}), \tilde{a}_{t+1} \sim \pi(\tilde{a}_{t+1} | \tilde{s}_{t+1}), s_0 \sim d(s_0), \tilde{s}_0 \sim d(\tilde{s}_0), a_0 \sim \pi(a_0 | s_0), \tilde{a}_0 \sim \pi(\tilde{a}_0 | \tilde{s}_0)}[ w_s(s_t) w_a(s_t, a_t) K((s_{t+1}, a_{t+1}), (\tilde{s}_0, \tilde{a}_0)) + w_s(\tilde{s}_t) w_a(\tilde{s}_t, \tilde{a}_t) K((\tilde{s}_{t+1}, \tilde{a}_{t+1}), (s_0, a_0)) ] \\ & \quad \quad - (1 - \gamma) \mathbb{E}_{(s_t, a_t), (\tilde{s}_t, \tilde{a}_t) \sim d^{\pi_b}, s_0 \sim d(s_0), \tilde{s}_0 \sim d(\tilde{s}_0), a_0 \sim \pi(a_0 | s_0), \tilde{a}_0 \sim \pi(\tilde{a}_0 | \tilde{s}_0)}[ w_s(s_t) w_a(s_t, a_t) K((s_t, a_t), (\tilde{s}_0, \tilde{a}_0)) + w_s(\tilde{s}_t) w_a(\tilde{s}_t, \tilde{a}_t) K((\tilde{s}_t, \tilde{a}_t), (s_0, a_0)) ]\end{split}\]

where \(K(\cdot, \cdot)\) is a kernel function, \(w_s(s_t) \approx d^{\pi}(s_t) / d^{\pi_b}(s_t)\) is the state-marginal importance weight, and \(w_a(s_t, a_t) := \pi(a_t | s_t) / \pi_b(a_t | s_t)\) is the immediate importance weight.
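The factorization of the state-action weight into \(w_s\) and \(w_a\) can be checked on a toy tabular example. All distributions below are made-up numbers and the function names are hypothetical; a tabular setting is used only for readability, even though this class targets continuous actions.

```python
import math

# Made-up state marginals and policies over two states and two actions.
d_pi_s = [0.3, 0.7]                  # d^pi(s)
d_b_s = [0.5, 0.5]                   # d^{pi_b}(s)
pi = [[0.2, 0.8], [0.6, 0.4]]        # pi(a | s)
pi_b = [[0.5, 0.5], [0.25, 0.75]]    # pi_b(a | s)


def state_weight(s):
    """State-marginal importance weight w_s(s)."""
    return d_pi_s[s] / d_b_s[s]


def action_weight(s, a):
    """Immediate importance weight w_a(s, a)."""
    return pi[s][a] / pi_b[s][a]


def state_action_weight(s, a):
    """Ratio of the joint state-action marginals."""
    return (d_pi_s[s] * pi[s][a]) / (d_b_s[s] * pi_b[s][a])


for s in range(2):
    for a in range(2):
        product = state_weight(s) * action_weight(s, a)
        assert math.isclose(product, state_action_weight(s, a))
        print(s, a, product)
```

Because \(w_a\) is known once both policies are given, only the state-dependent factor \(w_s\) has to be learned.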

Parameters:

w_function (StateWeightFunction) – Weight function model.
gamma (float, default=1.0) – Discount factor. The value should be within (0, 1].
bandwidth (float, default=1.0 (> 0)) – Bandwidth hyperparameter of the Gaussian kernel.
state_scaler (d3rlpy.preprocessing.Scaler, default=None) – Scaling factor of state.
action_scaler (d3rlpy.preprocessing.ActionScaler, default=None) – Scaling factor of action.
batch_size (int, default=128 (> 0)) – Batch size.
lr (float, default=1e-4 (> 0)) – Learning rate.
device (str, default="cuda:0") – Specifies device used for torch.

References

Masatoshi Uehara, Jiawei Huang, and Nan Jiang. “Minimax Weight and Q-Function Learning for Off-Policy Evaluation.” 2020.

Attributes:

action_scaler
state_scaler

Methods

`fit`(step_per_trajectory, state, action, ...)	Fit weight function.
`fit_predict`(step_per_trajectory, state, ...)	Fit and predict weight function.
`load`(path)	Load models.
`predict`(state)	Predict state marginal importance weight.
`predict_state_action_marginal_importance_weight`(...)	Predict state-action marginal importance weight.
`predict_state_marginal_importance_weight`(state)	Predict state marginal importance weight.
`predict_weight`(state)	Predict state marginal importance weight.
`save`(path)	Save models.

load(path)

Load models.

save(path)

Save models.

fit(step_per_trajectory, state, action, pscore, evaluation_policy_action, n_steps=10000, n_steps_per_epoch=10000, regularization_weight=1.0, random_state=None, **kwargs)

Fit weight function.

Parameters:

step_per_trajectory (int (> 0)) – Number of timesteps in an episode.
state (array-like of shape (n_trajectories * step_per_trajectory, state_dim)) – State observed by the behavior policy.
action (array-like of shape (n_trajectories * step_per_trajectory, action_dim)) – Action chosen by the behavior policy.
pscore (array-like of shape (n_trajectories * step_per_trajectory)) – Propensity of the observed action being chosen under the behavior policy (pscore stands for propensity score).
evaluation_policy_action (array-like of shape (n_trajectories * step_per_trajectory, action_dim)) – Action chosen by the evaluation policy.
n_steps (int, default=10000 (> 0)) – Number of gradient steps.
n_steps_per_epoch (int, default=10000 (> 0)) – Number of gradient steps in an epoch.
regularization_weight (float, default=1.0 (> 0)) – Scaling factor of the regularization weight.
random_state (int, default=None (>= 0)) – Random state.

predict_state_marginal_importance_weight(state)

Predict state marginal importance weight.

Parameters:

state (array-like of shape (n_trajectories * step_per_trajectory, state_dim)) – State observed by the behavior policy.

Returns:

importance_weight – Estimated state marginal importance weight.

Return type:

ndarray of shape (n_trajectories * step_per_trajectory, )

predict_state_action_marginal_importance_weight(state, action, pscore, evaluation_policy_action)

Predict state-action marginal importance weight.

Parameters:

state (array-like of shape (n_trajectories * step_per_trajectory, state_dim)) – State observed by the behavior policy.
action (array-like of shape (n_trajectories * step_per_trajectory, action_dim)) – Action chosen by the behavior policy.
pscore (array-like of shape (n_trajectories * step_per_trajectory)) – Propensity of the observed action being chosen under the behavior policy (pscore stands for propensity score).
evaluation_policy_action (array-like of shape (n_trajectories * step_per_trajectory, action_dim)) – Action chosen by the evaluation policy.

Returns:

importance_weight – Estimated state-action marginal importance weight.

Return type:

ndarray of shape (n_trajectories * step_per_trajectory, )
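For continuous actions with a deterministic evaluation policy, the density ratio \(\pi(a | s) / \pi_b(a | s)\) is not directly defined, and a common workaround is to smooth the evaluation action with a kernel. The sketch below assumes that approach: it multiplies a given state weight by a Gaussian density centred at `evaluation_policy_action` and divides by `pscore`. Whether `predict_state_action_marginal_importance_weight` computes exactly this is an assumption, and both helper names are hypothetical.

```python
import math


def gaussian_density(action, target_action, bandwidth=1.0):
    """Isotropic Gaussian density centred at target_action (std = bandwidth)."""
    dim = len(action)
    sq_dist = sum((a - b) ** 2 for a, b in zip(action, target_action))
    norm = (2.0 * math.pi * bandwidth ** 2) ** (dim / 2.0)
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2)) / norm


def smoothed_state_action_weight(state_weight, action, pscore,
                                 evaluation_policy_action, bandwidth=1.0):
    """w_s(s) * K(a, pi(s)) / pi_b(a | s) for every row (assumed form)."""
    weights = []
    for w_s, a, p, a_eval in zip(state_weight, action, pscore,
                                 evaluation_policy_action):
        weights.append(w_s * gaussian_density(a, a_eval, bandwidth) / p)
    return weights


# Toy inputs with action_dim = 1; state_weight stands in for predict(state).
state_weight = [2.0, 1.0]
action = [[0.3], [1.0]]
evaluation_policy_action = [[0.3], [0.0]]
pscore = [0.5, 0.25]
result = smoothed_state_action_weight(
    state_weight, action, pscore, evaluation_policy_action, bandwidth=1.0
)
print(result)
```

Logged actions far from the evaluation policy's action receive small weights, and the bandwidth controls how quickly the weight decays with that distance.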

predict_weight(state)

Predict state marginal importance weight.

Parameters:

state (array-like of shape (n_trajectories * step_per_trajectory, state_dim)) – State observed by the behavior policy.

Returns:

importance_weight – Estimated state marginal importance weight.

Return type:

ndarray of shape (n_trajectories * step_per_trajectory, )

predict(state)

Predict state marginal importance weight.

Parameters:

state (array-like of shape (n_trajectories * step_per_trajectory, state_dim)) – State observed by the behavior policy.

Returns:

importance_weight – Estimated state marginal importance weight.

Return type:

ndarray of shape (n_trajectories * step_per_trajectory, )

fit_predict(step_per_trajectory, state, action, pscore, evaluation_policy_action, n_steps=10000, n_steps_per_epoch=10000, regularization_weight=1.0, random_state=None, **kwargs)

Fit and predict weight function.

Parameters:

step_per_trajectory (int (> 0)) – Number of timesteps in an episode.
state (array-like of shape (n_trajectories * step_per_trajectory, state_dim)) – State observed by the behavior policy.
action (array-like of shape (n_trajectories * step_per_trajectory, action_dim)) – Action chosen by the behavior policy.
pscore (array-like of shape (n_trajectories * step_per_trajectory)) – Propensity of the observed action being chosen under the behavior policy (pscore stands for propensity score).
evaluation_policy_action (array-like of shape (n_trajectories * step_per_trajectory, action_dim)) – Action chosen by the evaluation policy.
n_steps (int, default=10000 (> 0)) – Number of gradient steps.
n_steps_per_epoch (int, default=10000 (> 0)) – Number of gradient steps in an epoch.
regularization_weight (float, default=1.0 (> 0)) – Scaling factor of the regularization weight.
random_state (int, default=None (>= 0)) – Random state.

Returns:

importance_weight – Estimated state-action marginal importance weight.

Return type:

ndarray of shape (n_trajectories * step_per_trajectory, )