scope_rl.ope.continuous.basic_estimators.DoublyRobust

class scope_rl.ope.continuous.basic_estimators.DoublyRobust

Doubly Robust (DR) estimator for continuous action spaces, designed for deterministic evaluation policies.

Bases: scope_rl.ope.BaseOffPolicyEstimator

Imported as: scope_rl.ope.continuous.DoublyRobust

Note

DR estimates the policy value by combining step-wise importance weighting with an estimated Q-function \(\hat{Q}\) as follows.

\[\hat{J}_{\mathrm{DR}} (\pi; \mathcal{D}) := \frac{1}{n} \sum_{i=1}^n \sum_{t=0}^{T-1} \gamma^t \left(w_{0:t}^{(i)} \delta(\pi, a_{0:t}^{(i)}) (r_t^{(i)} - \hat{Q}(s_t^{(i)}, a_t^{(i)})) + w_{0:t-1}^{(i)} \delta(\pi, a_{0:t-1}^{(i)}) \hat{Q}(s_t^{(i)}, \pi(s_t^{(i)})) \right),\]

where \(w_{0:t} := \prod_{t'=0}^t (1 / \pi_b(a_{t'} | s_{t'}))\) is the per-decision importance weight and \(\pi_b\) is the behavior policy. \(\delta(\pi, a_{0:t}) = \prod_{t'=0}^t K(\pi(s_{t'}), a_{t'})\) quantifies the similarity between the actions logged in the dataset and those taken by the evaluation policy, where \(K(\cdot, \cdot)\) is a kernel function. The bandwidth of the kernel is an important hyperparameter: a larger bandwidth typically reduces the variance of the estimator but increases its bias, while a smaller bandwidth does the opposite.
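The role of the bandwidth can be illustrated with a minimal NumPy sketch of a Gaussian kernel weight. The function name `gaussian_kernel_similarity`, the use of the Euclidean distance, and the `1 / bandwidth` normalization are assumptions made for illustration and may differ from what the library does internally.

```python
import numpy as np


def gaussian_kernel_similarity(evaluation_policy_action, action, bandwidth=1.0):
    """Illustrative Gaussian kernel weight for each (pi(s_t), a_t) pair.

    Both inputs have shape (n_steps, action_dim). The Euclidean distance and
    the 1 / bandwidth normalization are assumptions of this sketch.
    """
    evaluation_policy_action = np.asarray(evaluation_policy_action, dtype=float)
    action = np.asarray(action, dtype=float)
    # Scaled distance between the evaluation policy's action and the logged one.
    u = np.linalg.norm(evaluation_policy_action - action, axis=1) / bandwidth
    # Standard normal density evaluated at u, rescaled by the bandwidth.
    return np.exp(-0.5 * u**2) / (np.sqrt(2.0 * np.pi) * bandwidth)


pi_action = np.array([[0.0], [0.0]])
logged_action = np.array([[0.0], [1.0]])  # exact match, then a mismatch of 1.0

narrow = gaussian_kernel_similarity(pi_action, logged_action, bandwidth=0.2)
wide = gaussian_kernel_similarity(pi_action, logged_action, bandwidth=2.0)

# Relative weight of the mismatched action versus the exact match.
narrow_ratio = narrow[1] / narrow[0]
wide_ratio = wide[1] / wide[0]
```

With the narrow bandwidth the mismatched action receives almost no weight relative to the exact match, so few samples contribute (higher variance). With the wide bandwidth dissimilar actions are weighted almost as much as matching ones (higher bias).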

DR corrects the distribution shift between the behavior and evaluation policies and has lower variance than Per-Decision Importance Sampling (PDIS) when \(\hat{Q}(\cdot)\) is reasonably accurate and satisfies \(0 < \hat{Q}(\cdot) < 2 Q(\cdot)\). However, it may still suffer from high variance when the importance weights are large.
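The estimator in this note can be sketched in a few lines of NumPy. This is a hedged illustration of the formula, not the library's implementation: the function name `dr_policy_value_sketch` is hypothetical, the kernel weights are assumed to be precomputed and applied at every step including \(t = 0\), and `pscore` is assumed to be one density value per step.

```python
import numpy as np


def dr_policy_value_sketch(
    step_per_trajectory,
    reward,
    pscore,
    similarity,
    q_hat_logged,
    q_hat_eval,
    gamma=1.0,
):
    """Illustrative DR estimate from flattened, trajectory-major 1-D arrays.

    pscore:      behavior density pi_b(a_t | s_t), one value per step (assumed).
    similarity:  kernel weight K(pi(s_t), a_t), one value per step.
    q_hat_logged / q_hat_eval:  Q-hat at the logged action and at pi(s_t).
    """
    T = step_per_trajectory

    def per_trajectory(x):
        return np.asarray(x, dtype=float).reshape(-1, T)

    reward, pscore, similarity = map(per_trajectory, (reward, pscore, similarity))
    q_hat_logged, q_hat_eval = map(per_trajectory, (q_hat_logged, q_hat_eval))

    # Cumulative product over steps 0..t of K(pi(s), a) / pi_b(a | s),
    # i.e. w_{0:t} * delta(pi, a_{0:t}).
    weight = np.cumprod(similarity / pscore, axis=1)
    # Same product over steps 0..t-1; the empty product at t = 0 equals 1.
    weight_prev = np.hstack([np.ones((weight.shape[0], 1)), weight[:, :-1]])

    discount = gamma ** np.arange(T)
    per_step = discount * (
        weight * (reward - q_hat_logged) + weight_prev * q_hat_eval
    )
    # Sum over time steps, then average over trajectories.
    return float(per_step.sum(axis=1).mean())


# Sanity check: with unit weights and Q-hat = 0, DR reduces to the mean
# discounted return, here ((1 + 0.5 * 2) + (3 + 0.5 * 4)) / 2 = 3.5.
ones = np.ones(4)
zeros = np.zeros(4)
value = dr_policy_value_sketch(
    step_per_trajectory=2,
    reward=[1.0, 2.0, 3.0, 4.0],
    pscore=ones,
    similarity=ones,
    q_hat_logged=zeros,
    q_hat_eval=zeros,
    gamma=0.5,
)
```

When all weights equal one, the two \(\hat{Q}\) terms cancel whenever they coincide, so the estimate depends only on the rewards. The Q-function acts as a control variate rather than as a source of bias in that case.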

Parameters: estimator_name (str, default="dr") – Name of the estimator.

References

Nathan Kallus and Angela Zhou. “Policy Evaluation and Optimization with Continuous Treatments.” 2019.

Nan Jiang and Lihong Li. “Doubly Robust Off-policy Value Evaluation for Reinforcement Learning.” 2016.

Philip S. Thomas and Emma Brunskill. “Data-Efficient Off-Policy Policy Evaluation for Reinforcement Learning.” 2016.

Methods

`estimate_interval`(step_per_trajectory, ...)	Estimate the confidence interval of the policy value (nonparametric bootstrap by default).
`estimate_policy_value`(step_per_trajectory, ...)	Estimate the policy value of the evaluation policy.

estimate_policy_value(step_per_trajectory, action, reward, pscore, evaluation_policy_action, state_action_value_prediction, gamma=1.0, kernel='gaussian', bandwidth=1.0, action_scaler=None, **kwargs)

Estimate the policy value of the evaluation policy.

Parameters:

step_per_trajectory (int (> 0)) – Number of timesteps in an episode.
action (array-like of shape (n_trajectories * step_per_trajectory, action_dim)) – Action chosen by the behavior policy.
reward (array-like of shape (n_trajectories * step_per_trajectory, )) – Observed immediate rewards.
pscore (array-like of shape (n_trajectories * step_per_trajectory, action_dim)) – Conditional action choice probability of the behavior policy, i.e., \(\pi_b(a | s)\).
evaluation_policy_action (array-like of shape (n_trajectories * step_per_trajectory, action_dim)) – Action chosen by the evaluation policy.
state_action_value_prediction (array-like of shape (n_trajectories * step_per_trajectory, 2)) – \(\hat{Q}\) for the observed action and for the action chosen by the evaluation policy, i.e., (column 0) \(\hat{Q}(s_t, a_t)\) and (column 1) \(\hat{Q}(s_t, \pi(s_t))\).
gamma (float, default=1.0) – Discount factor. The value should be within (0, 1].
kernel ({"gaussian", "epanechnikov", "triangular", "cosine", "uniform"}, default="gaussian") – Name of the kernel function to smooth importance weights.
bandwidth (float, default=1.0 (> 0)) – Bandwidth hyperparameter of the kernel function.
action_scaler (d3rlpy.preprocessing.ActionScaler, default=None) – Scaling factor of action.

Returns:

V_hat – Estimated policy value.

Return type:

float
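The array shapes in the parameter list imply that per-trajectory data is flattened along the first axis. The sketch below builds inputs with those shapes from hypothetical per-trajectory arrays. It assumes trajectory-major ordering (all steps of the first trajectory, then the second, and so on) and that the two \(\hat{Q}\) values occupy the two columns of `state_action_value_prediction`. All array names other than the documented parameter names are illustrative.

```python
import numpy as np

n_trajectories, step_per_trajectory, action_dim = 3, 4, 2
rng = np.random.default_rng(0)

# Hypothetical logged data, stored per trajectory and per step.
reward_2d = rng.normal(size=(n_trajectories, step_per_trajectory))
action_3d = rng.normal(size=(n_trajectories, step_per_trajectory, action_dim))
q_hat_logged = rng.normal(size=(n_trajectories, step_per_trajectory))
q_hat_eval = rng.normal(size=(n_trajectories, step_per_trajectory))

# Flatten to the documented shapes (trajectory-major order is assumed).
reward = reward_2d.reshape(-1)              # (n_trajectories * step_per_trajectory,)
action = action_3d.reshape(-1, action_dim)  # (n_trajectories * step_per_trajectory, action_dim)

# Two columns: Q-hat at the logged action, then Q-hat at the evaluation action.
state_action_value_prediction = np.column_stack(
    [q_hat_logged.reshape(-1), q_hat_eval.reshape(-1)]
)

# Reshaping by step_per_trajectory recovers the per-trajectory view.
reward_roundtrip = reward.reshape(-1, step_per_trajectory)
```

Under this ordering, step `t` of trajectory `i` sits at flat index `i * step_per_trajectory + t`.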

estimate_interval(step_per_trajectory, action, reward, pscore, evaluation_policy_action, state_action_value_prediction, gamma=1.0, kernel='gaussian', bandwidth=1.0, action_scaler=None, alpha=0.05, ci='bootstrap', n_bootstrap_samples=10000, random_state=12345, **kwargs)

Estimate the confidence interval of the policy value using the method given by ci (nonparametric bootstrap by default).

Parameters:

step_per_trajectory (int (> 0)) – Number of timesteps in an episode.
action (array-like of shape (n_trajectories * step_per_trajectory, action_dim)) – Action chosen by the behavior policy.
reward (array-like of shape (n_trajectories * step_per_trajectory, )) – Observed immediate rewards.
pscore (array-like of shape (n_trajectories * step_per_trajectory, action_dim)) – Conditional action choice probability of the behavior policy, i.e., \(\pi_b(a | s)\).
evaluation_policy_action (array-like of shape (n_trajectories * step_per_trajectory, action_dim)) – Action chosen by the evaluation policy.
state_action_value_prediction (array-like of shape (n_trajectories * step_per_trajectory, 2)) – \(\hat{Q}\) for the observed action and for the action chosen by the evaluation policy, i.e., (column 0) \(\hat{Q}(s_t, a_t)\) and (column 1) \(\hat{Q}(s_t, \pi(s_t))\).
gamma (float, default=1.0) – Discount factor. The value should be within (0, 1].
kernel ({"gaussian", "epanechnikov", "triangular", "cosine", "uniform"}, default="gaussian") – Name of the kernel function to smooth importance weights.
bandwidth (float, default=1.0 (> 0)) – Bandwidth hyperparameter of the kernel function.
action_scaler (d3rlpy.preprocessing.ActionScaler, default=None) – Scaling factor of action.
alpha (float, default=0.05) – Significance level. The value should be within [0, 1).
ci ({"bootstrap", "hoeffding", "bernstein", "ttest"}, default="bootstrap") – Name of the method to estimate the confidence interval.
n_bootstrap_samples (int, default=10000 (> 0)) – Number of resampling performed in the bootstrap procedure.
random_state (int, default=12345 (>= 0)) – Random state.

Returns:

estimated_confidence_interval – Dictionary storing the estimated mean and the lower and upper confidence bounds.

key: [
    mean,
    {100 * (1. - alpha)}% CI (lower),
    {100 * (1. - alpha)}% CI (upper),
]

Return type:

dict
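For intuition, the default bootstrap option can be sketched as a percentile bootstrap over per-trajectory estimates. This is an illustration under assumptions, not the library's code: the function name `bootstrap_interval_sketch` is hypothetical, the percentile method is one of several bootstrap variants, and the smaller default for `n_bootstrap_samples` is chosen only to keep the example light.

```python
import numpy as np


def bootstrap_interval_sketch(
    per_trajectory_values, alpha=0.05, n_bootstrap_samples=1000, random_state=12345
):
    """Percentile-bootstrap interval for the mean of per-trajectory estimates."""
    values = np.asarray(per_trajectory_values, dtype=float)
    rng = np.random.default_rng(random_state)
    # Resample trajectories with replacement, one row per bootstrap replicate.
    idx = rng.integers(0, len(values), size=(n_bootstrap_samples, len(values)))
    boot_means = values[idx].mean(axis=1)
    lower, upper = np.percentile(
        boot_means, [100 * alpha / 2, 100 * (1 - alpha / 2)]
    )
    # Keys follow the pattern listed in the Returns section above.
    return {
        "mean": float(boot_means.mean()),
        f"{100 * (1. - alpha)}% CI (lower)": float(lower),
        f"{100 * (1. - alpha)}% CI (upper)": float(upper),
    }


# Hypothetical per-trajectory DR estimates.
estimates = np.random.default_rng(0).normal(loc=1.0, scale=0.5, size=200)
interval = bootstrap_interval_sketch(estimates, alpha=0.05)
```

Resampling whole trajectories rather than individual steps keeps the temporal dependence within each trajectory intact.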
