scope_rl.ope.discrete.marginal_estimators.StateMarginalDM

class scope_rl.ope.discrete.marginal_estimators.StateMarginalDM(estimator_name='sm_dm')

Direct Method (DM) for discrete-action and stationary OPE.

Bases: scope_rl.ope.BaseStateMarginalOPEEstimator -> scope_rl.ope.BaseOffPolicyEstimator

Imported as: scope_rl.ope.discrete.StateMarginalDM

Note

DM estimates the policy value using an estimated initial state value as follows.

\[\hat{J}_{\mathrm{DM}} (\pi; \mathcal{D}) := \frac{1}{n} \sum_{i=1}^n \sum_{a \in \mathcal{A}} \pi(a | s_0^{(i)}) \hat{Q}(s_0^{(i)}, a) = \frac{1}{n} \sum_{i=1}^n \hat{V}(s_0^{(i)}),\]

where \(\mathcal{D}=\{\{(s_t, a_t, r_t)\}_{t=0}^{T-1}\}_{i=1}^n\) is the logged dataset with \(n\) trajectories and \(T\) is the number of steps per episode. \(\hat{Q}(s_t, a_t)\) is the estimated Q value of a state-action pair, and \(\hat{V}(s_t)\) is the estimated value of a state.
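The formula above can be illustrated with a minimal, standard-library-only sketch. It is not the library's implementation: the helper `dm_policy_value` and the toy inputs `pi_0` (evaluation-policy probabilities at each initial state) and `q_hat_0` (estimated Q values at each initial state) are hypothetical names introduced here for illustration.

```python
def dm_policy_value(pi_0, q_hat_0):
    """Return (J_hat, V_hat) for the DM formula.

    pi_0[i][a]    : pi(a | s_0^(i)), evaluation-policy probability (assumed input)
    q_hat_0[i][a] : Q_hat(s_0^(i), a), estimated Q value (assumed input)
    """
    if len(pi_0) != len(q_hat_0):
        raise ValueError("pi_0 and q_hat_0 must cover the same trajectories")
    # V_hat(s_0^(i)) = sum_a pi(a | s_0^(i)) * Q_hat(s_0^(i), a)
    v_hat_0 = [
        sum(p * q for p, q in zip(pi_row, q_row))
        for pi_row, q_row in zip(pi_0, q_hat_0)
    ]
    # J_hat = (1 / n) * sum_i V_hat(s_0^(i))
    return sum(v_hat_0) / len(v_hat_0), v_hat_0


# Toy data: 3 trajectories, 2 actions.
pi_0 = [[0.5, 0.5], [1.0, 0.0], [0.25, 0.75]]
q_hat_0 = [[1.0, 3.0], [2.0, 10.0], [4.0, 0.0]]

policy_value, v_hat_0 = dm_policy_value(pi_0, q_hat_0)
print(v_hat_0)       # per-trajectory estimated initial state values
print(policy_value)  # their average, i.e. the DM estimate
```

The per-trajectory list `v_hat_0` plays the role of the `initial_state_value_prediction` argument documented below, under the assumption that it holds one estimated initial state value per trajectory.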

DM has low variance compared to other estimators, but its bias can be large when the approximation error in \(\hat{Q}\) is large.

There are several methods to estimate \(\hat{Q}(s, a)\) such as Fitted Q Evaluation (FQE) (Le et al., 2019), Minimax Q-Function Learning (MQL) (Uehara et al., 2020), and Augmented Lagrangian Method (ALM) (Yang et al., 2020).

See also

The implementation of FQE is provided by d3rlpy. The implementations of Minimax Weight and Value Learning (including ALM) are available at scope_rl.ope.weight_value_learning.

Note

This estimator differs from DirectMethod in that the initial state is sampled from the stationary distribution \(d^{\pi}(s_0)\).

Parameters: estimator_name (str, default="sm_dm") – Name of the estimator.

References

Yuta Saito, Shunsuke Aihara, Megumi Matsutani, and Yusuke Narita. “Open Bandit Dataset and Pipeline: Towards Realistic and Reproducible Off-Policy Evaluation.” 2021.

Takuma Seno and Michita Imai. “d3rlpy: An Offline Deep Reinforcement Library.” 2021.

Masatoshi Uehara, Jiawei Huang, and Nan Jiang. “Minimax Weight and Q-Function Learning for Off-Policy Evaluation.” 2020.

Hoang Le, Cameron Voloshin, and Yisong Yue. “Batch Policy Learning under Constraints.” 2019.

Alina Beygelzimer and John Langford. “The Offset Tree for Learning with Partial Labels.” 2009.

Methods

`estimate_interval`(initial_state_value_prediction)	Estimate the confidence interval of the policy value by nonparametric bootstrap.
`estimate_policy_value`(...)	Estimate the policy value of the evaluation policy.

estimate_policy_value(initial_state_value_prediction, **kwargs)

Estimate the policy value of the evaluation policy.

Parameters: initial_state_value_prediction (array-like of shape (n_trajectories, )) – Estimated initial state value.
Returns: V_hat – Estimated policy value.
Return type: float

estimate_interval(initial_state_value_prediction, alpha=0.05, ci='bootstrap', n_bootstrap_samples=10000, random_state=None, **kwargs)

Estimate the confidence interval of the policy value, by nonparametric bootstrap by default (see the ci parameter for the other available methods).

Parameters:

initial_state_value_prediction (array-like of shape (n_trajectories, )) – Estimated initial state value.
alpha (float, default=0.05) – Significance level. The value should be within [0, 1).
ci ({"bootstrap", "hoeffding", "bernstein", "ttest"}, default="bootstrap") – Method to estimate the confidence interval.
n_bootstrap_samples (int, default=10000 (> 0)) – Number of resampling performed in the bootstrap procedure.
random_state (int, default=None (>= 0)) – Random state.

Returns:

estimated_confidence_interval – Dictionary storing the estimated mean and upper-lower confidence bounds.

key: [
    mean,
    {100 * (1. - alpha)}% CI (lower),
    {100 * (1. - alpha)}% CI (upper),
]

Return type:

dict
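As a rough illustration of the default `ci="bootstrap"` option, the following standard-library sketch computes a percentile bootstrap interval over per-trajectory initial state values. It is an assumption-laden sketch rather than the library's code: the function `bootstrap_interval`, the index-based percentile rule, and the choice to report the mean of the bootstrap means are all illustrative choices, and only the key layout follows the description above.

```python
import random
import statistics


def bootstrap_interval(values, alpha=0.05, n_bootstrap_samples=1000, random_state=None):
    """Percentile bootstrap interval for the mean of `values` (hypothetical helper)."""
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must be within [0, 1)")
    if n_bootstrap_samples <= 0:
        raise ValueError("n_bootstrap_samples must be positive")

    rng = random.Random(random_state)
    n = len(values)
    # Resample n values with replacement and record the mean of each resample.
    boot_means = sorted(
        statistics.fmean(rng.choices(values, k=n))
        for _ in range(n_bootstrap_samples)
    )
    # Simple index-based percentiles (an assumption; other rules exist).
    lower_idx = int((alpha / 2) * (n_bootstrap_samples - 1))
    upper_idx = int((1 - alpha / 2) * (n_bootstrap_samples - 1))

    level = 100 * (1.0 - alpha)
    return {
        "mean": statistics.fmean(boot_means),
        f"{level}% CI (lower)": boot_means[lower_idx],
        f"{level}% CI (upper)": boot_means[upper_idx],
    }


# Toy per-trajectory estimated initial state values.
initial_state_values = [2.0, 2.0, 1.0, 3.5, 0.5, 2.5, 1.5, 2.0]
interval = bootstrap_interval(initial_state_values, alpha=0.05, random_state=0)
for key, value in interval.items():
    print(key, value)
```

Fixing `random_state` makes the resampling reproducible, which is useful when comparing intervals across estimators on the same logged data.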
