scope_rl.ope.continuous.basic_estimators.DirectMethod

class scope_rl.ope.continuous.basic_estimators.DirectMethod

Direct Method (DM) for continuous action spaces, designed for deterministic evaluation policies.

Bases: scope_rl.ope.BaseOffPolicyEstimator

Imported as: scope_rl.ope.continuous.DirectMethod

Note

DM estimates the policy value as the average estimated value of the initial states:

\[\hat{J}_{\mathrm{DM}} (\pi; \mathcal{D}) := \frac{1}{n} \sum_{i=1}^n \hat{Q}(s_0^{(i)}, \pi(s_0^{(i)})) = \frac{1}{n} \sum_{i=1}^n \hat{V}(s_0^{(i)}),\]

where \(\mathcal{D}=\{\{(s_t^{(i)}, a_t^{(i)}, r_t^{(i)})\}_{t=0}^{T-1}\}_{i=1}^n\) is the logged dataset with \(n\) trajectories and \(T\) is the number of steps per episode. \(\hat{Q}(s_t, a_t)\) is the estimated Q value of a state-action pair, and \(\hat{V}(s_t)\) is the estimated value of a state.

DM has lower variance than other estimators, such as those based on importance sampling, but it can incur larger bias because any approximation error in \(\hat{Q}\) carries over directly into the estimate.
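As a rough illustration of the formula above, the following sketch averages \(\hat{Q}(s_0, \pi(s_0))\) over the initial states. The names `dm_estimate`, `policy`, and `q_hat` are hypothetical stand-ins and are not part of scope_rl; a toy linear policy and an arbitrary Q function take the place of learned models.

```python
import numpy as np


def dm_estimate(initial_states, policy, q_hat):
    """Average Q_hat(s_0, pi(s_0)) over the initial states of n trajectories."""
    actions = policy(initial_states)          # deterministic action for each s_0
    values = q_hat(initial_states, actions)   # V_hat(s_0) = Q_hat(s_0, pi(s_0))
    return float(np.mean(values))


# Toy stand-ins (assumptions for illustration only).
def policy(states):
    return 0.5 * states                       # pi(s) = 0.5 * s


def q_hat(states, actions):
    return states - (actions - 1.0) ** 2      # arbitrary smooth Q function


s0 = np.array([0.0, 2.0, 4.0])                # initial states of n = 3 trajectories
estimate = dm_estimate(s0, policy, q_hat)     # mean of [-1, 2, 3]
```

Only the initial states enter the estimate, so its quality depends entirely on how well `q_hat` approximates the true Q function there.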

\(\hat{Q}(s, a)\) can be estimated by several methods, such as Fitted Q Evaluation (FQE) (Le et al., 2019) and Minimax Q-Function Learning (MQL) (Uehara et al., 2020).

See also

The implementation of FQE is provided by d3rlpy. The implementation of Minimax Learning is available at scope_rl.ope.weight_value_learning.

Parameters:

estimator_name (str, default="dm") – Name of the estimator.

References

Masatoshi Uehara, Jiawei Huang, and Nan Jiang. “Minimax Weight and Q-Function Learning for Off-Policy Evaluation.” 2020.

Hoang Le, Cameron Voloshin, and Yisong Yue. “Batch Policy Learning under Constraints.” 2019.

Methods

estimate_interval(step_per_trajectory, ...)

Estimate the confidence interval of the policy value (by nonparametric bootstrap by default).

estimate_policy_value(step_per_trajectory, ...)

Estimate the policy value of the evaluation policy.

estimate_policy_value(step_per_trajectory, state_action_value_prediction, **kwargs)

Estimate the policy value of the evaluation policy.

Parameters:
  • step_per_trajectory (int (> 0)) – Number of timesteps in an episode.

  • state_action_value_prediction (array-like of shape (n_trajectories * step_per_trajectory, 2)) – \(\hat{Q}\) for the observed action and for the action chosen by the evaluation policy, i.e., (column 0) \(\hat{Q}(s_t, a_t)\) and (column 1) \(\hat{Q}(s_t, \pi(s_t))\).

Returns:

V_hat – Estimated policy value.

Return type:

float
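The computation can be sketched with NumPy alone. This is not the library's implementation: the function name `dm_from_predictions` is hypothetical, and the sketch assumes that rows are ordered trajectory by trajectory (all steps of the first trajectory, then the second, and so on) and that the second column holds \(\hat{Q}\) for the evaluation policy's action.

```python
import numpy as np


def dm_from_predictions(step_per_trajectory, state_action_value_prediction):
    """Average the evaluation-policy Q value at t = 0 across trajectories."""
    q = np.asarray(state_action_value_prediction, dtype=float)
    if q.ndim != 2 or q.shape[1] != 2:
        raise ValueError("expected shape (n_trajectories * step_per_trajectory, 2)")
    if q.shape[0] % step_per_trajectory != 0:
        raise ValueError("number of rows must be a multiple of step_per_trajectory")
    # Reshape to (n_trajectories, step_per_trajectory, 2).
    q = q.reshape(-1, step_per_trajectory, 2)
    # Keep the first step (t = 0) and the evaluation-policy column (index 1).
    initial_values = q[:, 0, 1]
    return float(initial_values.mean())


# Two trajectories with three steps each (made-up numbers).
pred = np.array([
    [1.0, 2.0], [0.5, 0.7], [0.1, 0.2],   # trajectory 1
    [3.0, 4.0], [0.9, 1.1], [0.3, 0.4],   # trajectory 2
])
v_hat = dm_from_predictions(3, pred)      # (2.0 + 4.0) / 2
```

Rows for steps after \(t = 0\) and the observed-action column do not affect the result in this sketch.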

estimate_interval(step_per_trajectory, state_action_value_prediction, alpha=0.05, ci='bootstrap', n_bootstrap_samples=10000, random_state=None, **kwargs)

Estimate the confidence interval of the policy value (by nonparametric bootstrap by default).

Parameters:
  • step_per_trajectory (int (> 0)) – Number of timesteps in an episode.

  • state_action_value_prediction (array-like of shape (n_trajectories * step_per_trajectory, 2)) – \(\hat{Q}\) for the observed action and for the action chosen by the evaluation policy, i.e., (column 0) \(\hat{Q}(s_t, a_t)\) and (column 1) \(\hat{Q}(s_t, \pi(s_t))\).

  • alpha (float, default=0.05) – Significance level. The value should be within [0, 1).

  • ci ({"bootstrap", "hoeffding", "bernstein", "ttest"}, default="bootstrap") – Name of the method to estimate the confidence interval.

  • n_bootstrap_samples (int (> 0), default=10000) – Number of resamples drawn in the bootstrap procedure.

  • random_state (int (>= 0), default=None) – Random state.

Returns:

estimated_confidence_interval – Dictionary storing the estimated mean and the lower and upper confidence bounds.

key: [
    mean,
    {100 * (1. - alpha)}% CI (lower),
    {100 * (1. - alpha)}% CI (upper),
]

Return type:

dict
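For intuition, a percentile bootstrap over the per-trajectory initial-state values can be sketched as follows. This is an assumption about how such an interval might be computed, not the library's code: `bootstrap_interval` is a hypothetical helper, and the dictionary keys simply follow the format listed above.

```python
import numpy as np


def bootstrap_interval(samples, alpha=0.05, n_bootstrap_samples=10000, random_state=None):
    """Percentile bootstrap interval for the mean of `samples`."""
    samples = np.asarray(samples, dtype=float)
    rng = np.random.default_rng(random_state)
    # Resample with replacement and record the mean of each resample.
    idx = rng.integers(0, samples.size, size=(n_bootstrap_samples, samples.size))
    boot_means = samples[idx].mean(axis=1)
    lower, upper = np.quantile(boot_means, [alpha / 2, 1 - alpha / 2])
    level = 100 * (1.0 - alpha)
    return {
        "mean": float(boot_means.mean()),
        f"{level}% CI (lower)": float(lower),
        f"{level}% CI (upper)": float(upper),
    }


# Made-up initial-state values for 200 trajectories.
v0 = np.random.default_rng(0).normal(loc=10.0, scale=2.0, size=200)
ci = bootstrap_interval(v0, alpha=0.05, n_bootstrap_samples=2000, random_state=12345)
```

Resampling whole trajectories, rather than individual steps, keeps each resample consistent with the \(\frac{1}{n} \sum_{i=1}^n\) average in the DM estimate.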
