scope_rl.ope.discrete.basic_estimators.SelfNormalizedDR
- class scope_rl.ope.discrete.basic_estimators.SelfNormalizedDR
Self-Normalized Doubly Robust (SNDR) for discrete action spaces.
Bases:
scope_rl.ope.discrete.DoublyRobust -> scope_rl.ope.BaseOffPolicyEstimator
Imported as:
scope_rl.ope.discrete.SelfNormalizedDR
Note
SNDR estimates the policy value by combining self-normalized step-wise importance weighting with an estimated Q-function \(\hat{Q}\) as follows.
\[\hat{J}_{\mathrm{SNDR}} (\pi; \mathcal{D}) := \sum_{i=1}^n \sum_{t=0}^{T-1} \gamma^t \left( \frac{w_{0:t}^{(i)}}{\sum_{i'=1}^n w_{0:t}^{(i')}} (r_t^{(i)} - \hat{Q}(s_t^{(i)}, a_t^{(i)})) + \frac{w_{0:t-1}^{(i)}}{\sum_{i'=1}^n w_{0:t-1}^{(i')}} \sum_{a \in \mathcal{A}} \pi(a | s_t^{(i)}) \hat{Q}(s_t^{(i)}, a) \right),\]
where \(w_{0:t} := \prod_{t'=0}^t (\pi(a_{t'} | s_{t'}) / \pi_0(a_{t'} | s_{t'}))\) is the step-wise (cumulative) importance weight up to timestep \(t\), with the empty product \(w_{0:-1} := 1\), and \(\pi_0\) is the behavior policy (written \(\pi_b\) in the parameter descriptions below).
The self-normalized estimator is no longer unbiased, but its variance is bounded by \(r_{\max}^2\) and it remains consistent.
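The formula above can be sketched in NumPy as follows. This is a minimal illustration written directly from the equation, not SCOPE-RL's own implementation; the function name `sndr_policy_value` is hypothetical, and the code assumes the flattened inputs are ordered trajectory by trajectory (row `i * step_per_trajectory + t` holds step `t` of trajectory `i`) and that at least one trajectory has a nonzero weight at every step.

```python
import numpy as np


def sndr_policy_value(
    step_per_trajectory,
    action,
    reward,
    pscore,
    evaluation_policy_action_dist,
    state_action_value_prediction,
    gamma=1.0,
):
    """Hypothetical sketch of the SNDR formula; not the library's code."""
    T = step_per_trajectory
    action = np.asarray(action, dtype=int)
    reward = np.asarray(reward, dtype=float).reshape(-1, T)
    pscore = np.asarray(pscore, dtype=float)
    pi = np.asarray(evaluation_policy_action_dist, dtype=float)
    q_hat = np.asarray(state_action_value_prediction, dtype=float)

    rows = np.arange(action.shape[0])
    # Per-step ratio pi(a_t | s_t) / pi_0(a_t | s_t), shape (n, T).
    ratio = (pi[rows, action] / pscore).reshape(-1, T)
    # w_{0:t}: cumulative product of the ratios along each trajectory.
    w = np.cumprod(ratio, axis=1)
    # w_{0:t-1}: the same weights shifted by one step, with w_{0:-1} = 1.
    w_prev = np.concatenate([np.ones((w.shape[0], 1)), w[:, :-1]], axis=1)

    # Self-normalization: divide by the sum over trajectories at each step.
    sn_w = w / w.sum(axis=0, keepdims=True)
    sn_w_prev = w_prev / w_prev.sum(axis=0, keepdims=True)

    # Q_hat(s_t, a_t) for the logged action, and sum_a pi(a | s_t) Q_hat(s_t, a).
    q_taken = q_hat[rows, action].reshape(-1, T)
    v_hat = (pi * q_hat).sum(axis=1).reshape(-1, T)

    discount = gamma ** np.arange(T)
    return float((discount * (sn_w * (reward - q_taken) + sn_w_prev * v_hat)).sum())
```

When the evaluation policy equals the behavior policy, every ratio is 1, each normalized weight becomes 1/n, and with \(\hat{Q} = 0\) the estimate reduces to the average discounted return of the logged trajectories.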
- Parameters:
estimator_name (str, default="sndr") – Name of the estimator.
References
Nathan Kallus and Masatoshi Uehara. “Intrinsically Efficient, Stable, and Bounded Off-Policy Evaluation for Reinforcement Learning.” 2019.
Nan Jiang and Lihong Li. “Doubly Robust Off-policy Value Evaluation for Reinforcement Learning.” 2016.
Philip S. Thomas and Emma Brunskill. “Data-Efficient Off-Policy Policy Evaluation for Reinforcement Learning.” 2016.
Methods
estimate_interval(step_per_trajectory, ...) – Estimate the confidence interval of the policy value by nonparametric bootstrap.
estimate_policy_value(step_per_trajectory, ...) – Estimate the policy value of the evaluation policy.
- estimate_interval(step_per_trajectory, action, reward, pscore, evaluation_policy_action_dist, state_action_value_prediction, gamma=1.0, alpha=0.05, ci='bootstrap', n_bootstrap_samples=10000, random_state=None, **kwargs)
Estimate the confidence interval of the policy value, by nonparametric bootstrap by default or by another method selected via ci.
- Parameters:
step_per_trajectory (int (> 0)) – Number of timesteps in an episode.
action (array-like of shape (n_trajectories * step_per_trajectory, )) – Action chosen by the behavior policy.
reward (array-like of shape (n_trajectories * step_per_trajectory, )) – Observed immediate rewards.
pscore (array-like of shape (n_trajectories * step_per_trajectory, )) – Conditional action choice probability of the behavior policy, i.e., \(\pi_b(a | s)\).
evaluation_policy_action_dist (array-like of shape (n_trajectories * step_per_trajectory, n_action)) – Conditional action distribution induced by the evaluation policy, i.e., \(\pi(a | s_t) \forall a \in \mathcal{A}\).
state_action_value_prediction (array-like of shape (n_trajectories * step_per_trajectory, n_action)) – \(\hat{Q}\) for all actions, i.e., \(\hat{Q}(s_t, a) \forall a \in \mathcal{A}\).
gamma (float, default=1.0) – Discount factor. The value should be within (0, 1].
alpha (float, default=0.05) – Significance level. The value should be within [0, 1).
ci ({"bootstrap", "hoeffding", "bernstein", "ttest"}, default="bootstrap") – Name of the method to estimate the confidence interval.
n_bootstrap_samples (int, default=10000 (> 0)) – Number of resampling performed in the bootstrap procedure.
random_state (int, default=None (>= 0)) – Random state.
- Returns:
estimated_confidence_interval – Dictionary storing the estimated mean and upper-lower confidence bounds.
key: [ mean, {100 * (1. - alpha)}% CI (lower), {100 * (1. - alpha)}% CI (upper), ]
- Return type:
dict
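The default `ci="bootstrap"` option can be illustrated with a percentile bootstrap over per-trajectory values. The sketch below is an assumption about the general technique, not the library's code: the helper name `bootstrap_interval` is hypothetical, `samples` stands for whatever per-trajectory estimates are being resampled, and the dictionary keys simply mirror the key pattern listed above.

```python
import numpy as np


def bootstrap_interval(samples, alpha=0.05, n_bootstrap_samples=10000, random_state=None):
    """Hypothetical percentile-bootstrap interval for the mean of `samples`."""
    samples = np.asarray(samples, dtype=float)
    rng = np.random.default_rng(random_state)
    n = samples.shape[0]
    # Resample n values with replacement, n_bootstrap_samples times.
    idx = rng.integers(0, n, size=(n_bootstrap_samples, n))
    boot_means = samples[idx].mean(axis=1)
    # Lower and upper bounds are the alpha/2 and 1 - alpha/2 quantiles.
    lower, upper = np.quantile(boot_means, [alpha / 2, 1 - alpha / 2])
    level = 100 * (1.0 - alpha)
    return {
        "mean": float(boot_means.mean()),
        f"{level}% CI (lower)": float(lower),
        f"{level}% CI (upper)": float(upper),
    }
```

Passing the same `random_state` twice yields the same interval, which is the reason for exposing that argument.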
- estimate_policy_value(step_per_trajectory, action, reward, pscore, evaluation_policy_action_dist, state_action_value_prediction, gamma=1.0, **kwargs)
Estimate the policy value of the evaluation policy.
- Parameters:
step_per_trajectory (int (> 0)) – Number of timesteps in an episode.
action (array-like of shape (n_trajectories * step_per_trajectory, )) – Action chosen by the behavior policy.
reward (array-like of shape (n_trajectories * step_per_trajectory, )) – Observed immediate rewards.
pscore (array-like of shape (n_trajectories * step_per_trajectory, )) – Conditional action choice probability of the behavior policy, i.e., \(\pi_b(a | s)\).
evaluation_policy_action_dist (array-like of shape (n_trajectories * step_per_trajectory, n_action)) – Conditional action distribution induced by the evaluation policy, i.e., \(\pi(a | s_t) \forall a \in \mathcal{A}\).
state_action_value_prediction (array-like of shape (n_trajectories * step_per_trajectory, n_action)) – \(\hat{Q}\) for all actions, i.e., \(\hat{Q}(s_t, a) \forall a \in \mathcal{A}\).
gamma (float, default=1.0) – Discount factor. The value should be within (0, 1].
- Returns:
V_hat – Estimated policy value.
- Return type:
float
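The parameters above are all flattened to length n_trajectories * step_per_trajectory. The sketch below shows one way to build arrays of those shapes from data stored per trajectory and per step. All values are synthetic, the variable names ending in `_2d` and `behavior_dist` are hypothetical, and the trajectory-major ordering (all steps of trajectory 0, then trajectory 1, and so on) is an assumption.

```python
import numpy as np

n_trajectories, step_per_trajectory, n_action = 3, 4, 2
rng = np.random.default_rng(0)

# Synthetic logged data with shape (n_trajectories, step_per_trajectory).
behavior_dist = np.full((n_trajectories, step_per_trajectory, n_action), 0.5)
action_2d = rng.integers(n_action, size=(n_trajectories, step_per_trajectory))
reward_2d = rng.normal(size=(n_trajectories, step_per_trajectory))

# Flatten in trajectory-major order: row i * step_per_trajectory + t is step t of trajectory i.
action = action_2d.reshape(-1)
reward = reward_2d.reshape(-1)

# pscore: probability the behavior policy assigned to the action it actually chose.
pscore = np.take_along_axis(behavior_dist, action_2d[..., None], axis=2).reshape(-1)

# Evaluation policy distribution and Q-function predictions, one row per step.
evaluation_policy_action_dist = np.tile([0.9, 0.1], (n_trajectories * step_per_trajectory, 1))
state_action_value_prediction = np.zeros((n_trajectories * step_per_trajectory, n_action))
```

Reshaping any of the one-dimensional arrays with `.reshape(-1, step_per_trajectory)` recovers the per-trajectory layout.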