scope_rl.ope.discrete.basic_estimators.DirectMethod#
- class scope_rl.ope.discrete.basic_estimators.DirectMethod(estimator_name='dm')[source]#
Direct Method (DM) for discrete action spaces.
Bases:
scope_rl.ope.BaseOffPolicyEstimatorImported as:
scope_rl.ope.discrete.DirectMethodNote
DM estimates the policy value using an estimated initial state value as follows.
\[\hat{J}_{\mathrm{DM}} (\pi; \mathcal{D}) := \frac{1}{n} \sum_{i=1}^n \sum_{a \in \mathcal{A}} \pi(a | s_0^{(i)}) \hat{Q}(s_0^{(i)}, a) = \frac{1}{n} \sum_{i=1}^n \hat{V}(s_0^{(i)}),\]where \(\mathcal{D}=\{\{(s_t, a_t, r_t)\}_{t=0}^{T-1}\}_{i=1}^n\) is the logged dataset with \(n\) trajectories. \(T\) indicates step per episode. \(\hat{Q}(s_t, a_t)\) is the estimated Q value given a state-action pair. \(\hat{V}(s_t)\) is the estimated value function given a state.
DM has low variance compared to other estimators, but can produce larger bias due to approximation errors.
There are several methods to estimate \(\hat{Q}(s, a)\) such as Fitted Q Evaluation (FQE) (Le et al., 2019) and Minimax Q-Function Learning (MQL) (Uehara et al., 2020).
See also
The implementation of FQE is provided by d3rlpy. The implementations of Minimax Learning is available at
scope_rl.ope.weight_value_learning.- Parameters:
estimator_name (str, default="dm") – Name of the estimator.
References
Masatoshi Uehara, Jiawei Huang, and Nan Jiang. “Minimax Weight and Q-Function Learning for Off-Policy Evaluation.” 2020.
Hoang Le, Cameron Voloshin, and Yisong Yue. “Batch Policy Learning under Constraints.” 2019.
Methods
estimate_interval(step_per_trajectory, ...)Estimate the confidence interval of the policy value by nonparametric bootstrap.
estimate_policy_value(step_per_trajectory, ...)Estimate the policy value of the evaluation policy.
- estimate_policy_value(step_per_trajectory, evaluation_policy_action_dist, state_action_value_prediction, **kwargs)[source]#
Estimate the policy value of the evaluation policy.
- Parameters:
step_per_trajectory (int (> 0)) – Number of timesteps in an episode.
evaluation_policy_action_dist (array-like of shape (n_trajectories * step_per_trajectory, n_action)) – Conditional action distribution induced by the evaluation policy, i.e., \(\pi(a | s_t) \forall a \in \mathcal{A}\)
state_action_value_prediction (array-like of shape (n_trajectories * step_per_trajectory, n_action)) – \(\hat{Q}\) for all actions, i.e., \(\hat{Q}(s_t, a) \forall a \in \mathcal{A}\).
- Returns:
V_hat – Estimated policy value.
- Return type:
- estimate_interval(step_per_trajectory, evaluation_policy_action_dist, state_action_value_prediction, alpha=0.05, ci='bootstrap', n_bootstrap_samples=10000, random_state=None, **kwargs)[source]#
Estimate the confidence interval of the policy value by nonparametric bootstrap.
- Parameters:
step_per_trajectory (int (> 0)) – Number of timesteps in an episode.
evaluation_policy_action_dist (array-like of shape (n_trajectories * step_per_trajectory, n_action)) – Conditional action distribution induced by the evaluation policy, i.e., \(\pi(a | s_t) \forall a \in \mathcal{A}\)
state_action_value_prediction (array-like of shape (n_trajectories * step_per_trajectory, n_action)) – \(\hat{Q}\) for all actions, i.e., \(\hat{Q}(s_t, a) \forall a \in \mathcal{A}\).
alpha (float, default=0.05) – Significance level. The value should be within [0, 1).
ci ({"bootstrap", "hoeffding", "bernstein", "ttest"}, default="bootstrap") – Name of the method to estimate the confidence interval.
n_bootstrap_samples (int, default=10000 (> 0)) – Number of resampling performed in the bootstrap procedure.
random_state (int, default=None (>= 0)) – Random state.
- Returns:
estimated_confidence_interval – Dictionary storing the estimated mean and upper-lower confidence bounds.
key: [ mean, {100 * (1. - alpha)}% CI (lower), {100 * (1. - alpha)}% CI (upper), ]
- Return type:
Methods