scope_rl.ope.discrete.marginal_estimators.DoubleReinforcementLearning

class scope_rl.ope.discrete.marginal_estimators.DoubleReinforcementLearning(estimator_name='drl')

Double Reinforcement Learning (DRL) estimator for discrete action spaces.

Bases: scope_rl.ope.BaseOffPolicyEstimator

Imported as: scope_rl.ope.discrete.DoubleReinforcementLearning

Note

DRL estimates the policy value using state-action marginal importance weight and Q-function estimated by cross-fitting.

\[\hat{J}_{\mathrm{DRL}} (\pi; \mathcal{D}) := \frac{1}{n} \sum_{j=1}^K \sum_{i=1}^{n_j} \sum_{t=0}^{T-1} \left( \rho^j(s_t^{(i)}, a_t^{(i)}) \left( r_t^{(i)} - Q^j(s_t^{(i)}, a_t^{(i)}) \right) + \rho^j(s_{t-1}^{(i)}, a_{t-1}^{(i)}) \sum_{a \in \mathcal{A}} \pi(a | s_t^{(i)}) Q^j(s_t^{(i)}, a) \right)\]

where \(\rho(s, a) \approx d^{\pi}(s, a) / d^{\pi_b}(s, a)\) is the state-action marginal importance weight and \(d^{\pi}(s, a)\) is the marginal visitation probability of the policy \(\pi\) on \((s, a)\). \(Q(s, a)\) is the Q-function. \(K\) is the number of folds and \(\mathcal{D}_j\) is the \(j\)-th split of the logged data, consisting of \(n_j\) samples. \(\rho^j\) and \(Q^j\) are estimated on the data outside the fold being evaluated, i.e., \(\mathcal{D} \setminus \mathcal{D}_j\).

DRL achieves the semiparametric efficiency bound with a consistent value predictor.
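The DRL formula can be sketched in plain NumPy as follows. This is an illustration of the arithmetic only, not the library's implementation: the function name `drl_policy_value` and its argument names are hypothetical, the weights and Q-values are taken as given (no cross-fitting is performed), no discounting is applied, and the previous-step weight at \(t = 0\) is assumed to be 1.

```python
import numpy as np


def drl_policy_value(step_per_trajectory, action, reward, weight, pi, q_hat):
    """Plain-NumPy sketch of the DRL formula (hypothetical helper)."""
    action = np.asarray(action, dtype=int)
    reward = np.asarray(reward, dtype=float).reshape(-1, step_per_trajectory)
    weight = np.asarray(weight, dtype=float).reshape(-1, step_per_trajectory)
    pi = np.asarray(pi, dtype=float)
    q_hat = np.asarray(q_hat, dtype=float)

    # Q(s_t, a_t): select the column of the logged action in every row.
    q_taken = q_hat[np.arange(len(action)), action]
    q_taken = q_taken.reshape(-1, step_per_trajectory)

    # sum_a pi(a | s_t) Q(s_t, a): expected Q-value under the evaluation policy.
    v_hat = (pi * q_hat).sum(axis=1).reshape(-1, step_per_trajectory)

    # rho(s_{t-1}, a_{t-1}); assumed to equal 1 at t = 0.
    weight_prev = np.ones_like(weight)
    weight_prev[:, 1:] = weight[:, :-1]

    # Sum the two terms over timesteps, then average over trajectories.
    per_trajectory = (weight * (reward - q_taken) + weight_prev * v_hat).sum(axis=1)
    return float(per_trajectory.mean())
```

As a sanity check on the sketch: when all weights equal 1 and the evaluation policy puts probability 1 on the logged action, the two Q-terms cancel and the estimate reduces to the average observed return.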

There are several ways to estimate the state(-action) marginal importance weight, such as the Augmented Lagrangian Method (ALM) (Yang et al., 2020) and Minimax Weight Learning (MWL) (Uehara et al., 2020).

See also

The implementations of such weight learning methods are available at scope_rl.ope.weight_value_learning.

Parameters:

estimator_name (str, default="drl") – Name of the estimator.

References

Nathan Kallus and Masatoshi Uehara. “Double Reinforcement Learning for Efficient Off-Policy Evaluation in Markov Decision Processes.” 2020.

Masatoshi Uehara, Jiawei Huang, and Nan Jiang. “Minimax Weight and Q-Function Learning for Off-Policy Evaluation.” 2020.

Mengjiao Yang, Ofir Nachum, Bo Dai, Lihong Li, and Dale Schuurmans. “Off-Policy Evaluation via the Regularized Lagrangian.” 2020.

Methods

`estimate_interval`(step_per_trajectory, ...)	Estimate the confidence interval of the policy value by nonparametric bootstrap.
`estimate_policy_value`(step_per_trajectory, ...)	Estimate the policy value of the evaluation policy.

estimate_policy_value(step_per_trajectory, action, reward, state_action_marginal_importance_weight, evaluation_policy_action_dist, state_action_value_prediction, **kwargs)

Estimate the policy value of the evaluation policy.

Parameters:

step_per_trajectory (int (> 0)) – Number of timesteps in an episode.
action (array-like of shape (n_trajectories * step_per_trajectory, )) – Action chosen by the behavior policy.
reward (array-like of shape (n_trajectories * step_per_trajectory, )) – Observed immediate rewards.
state_action_marginal_importance_weight (array-like of shape (n_trajectories * step_per_trajectory, )) – Importance weight with respect to the state-action marginal distribution, i.e., \(d^{\pi}(s, a) / d^{\pi_b}(s, a)\).
evaluation_policy_action_dist (array-like of shape (n_trajectories * step_per_trajectory, n_action)) – Conditional action distribution induced by the evaluation policy, i.e., \(\pi(a | s) \forall a \in \mathcal{A}\).
state_action_value_prediction (array-like of shape (n_trajectories * step_per_trajectory, n_action)) – \(\hat{Q}\) for all actions, i.e., \(\hat{Q}(s_t, a) \forall a \in \mathcal{A}\).

Returns:

V_hat – Estimated policy value.

Return type:

float
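The shapes listed above can be produced from per-trajectory arrays as sketched below. The data here is synthetic and every variable name is hypothetical; the sketch also assumes a trajectory-major layout, i.e., that each consecutive block of `step_per_trajectory` rows belongs to one trajectory.

```python
import numpy as np

n_trajectories, step_per_trajectory, n_action = 4, 3, 2
rng = np.random.default_rng(0)

# Synthetic logged data, stored per trajectory first (hypothetical values).
action_2d = rng.integers(n_action, size=(n_trajectories, step_per_trajectory))
reward_2d = rng.normal(size=(n_trajectories, step_per_trajectory))
weight_2d = rng.uniform(0.5, 1.5, size=(n_trajectories, step_per_trajectory))
q_3d = rng.normal(size=(n_trajectories, step_per_trajectory, n_action))

# A softmax over random logits gives a valid pi(a | s) for every state.
logits = rng.normal(size=(n_trajectories, step_per_trajectory, n_action))
pi_3d = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)

# Flatten the (trajectory, timestep) axes into one leading axis.
inputs = {
    "step_per_trajectory": step_per_trajectory,
    "action": action_2d.reshape(-1),
    "reward": reward_2d.reshape(-1),
    "state_action_marginal_importance_weight": weight_2d.reshape(-1),
    "evaluation_policy_action_dist": pi_3d.reshape(-1, n_action),
    "state_action_value_prediction": q_3d.reshape(-1, n_action),
}
```

The dictionary keys mirror the parameter names documented above, so the arrays could then be passed as keyword arguments.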

estimate_interval(step_per_trajectory, action, reward, state_action_marginal_importance_weight, evaluation_policy_action_dist, state_action_value_prediction, alpha=0.05, ci='bootstrap', n_bootstrap_samples=10000, random_state=None, **kwargs)

Estimate the confidence interval of the policy value, by nonparametric bootstrap unless another method is selected via ci.

Parameters:

step_per_trajectory (int (> 0)) – Number of timesteps in an episode.
action (array-like of shape (n_trajectories * step_per_trajectory, )) – Action chosen by the behavior policy.
reward (array-like of shape (n_trajectories * step_per_trajectory, )) – Observed immediate rewards.
state_action_marginal_importance_weight (array-like of shape (n_trajectories * step_per_trajectory, )) – Importance weight with respect to the state-action marginal distribution, i.e., \(d^{\pi}(s, a) / d^{\pi_b}(s, a)\).
evaluation_policy_action_dist (array-like of shape (n_trajectories * step_per_trajectory, n_action)) – Conditional action distribution induced by the evaluation policy, i.e., \(\pi(a | s) \forall a \in \mathcal{A}\).
state_action_value_prediction (array-like of shape (n_trajectories * step_per_trajectory, n_action)) – \(\hat{Q}\) for all actions, i.e., \(\hat{Q}(s_t, a) \forall a \in \mathcal{A}\).
alpha (float, default=0.05) – Significance level. The value should be within [0, 1).
ci ({"bootstrap", "hoeffding", "bernstein", "ttest"}, default="bootstrap") – Method to estimate the confidence interval.
n_bootstrap_samples (int, default=10000 (> 0)) – Number of resampling iterations performed in the bootstrap procedure.
random_state (int, default=None (>= 0)) – Random state.

Returns:

estimated_confidence_interval – Dictionary storing the estimated mean and upper-lower confidence bounds.

key: [
    mean,
    {100 * (1. - alpha)}% CI (lower),
    {100 * (1. - alpha)}% CI (upper),
]

Return type:

dict
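A nonparametric (percentile) bootstrap interval of the kind described above can be sketched as follows. This is a minimal illustration, not the library's code: the helper name `bootstrap_interval` is hypothetical, and it assumes the resampling is done over a one-dimensional array of per-trajectory value estimates. The returned keys follow the pattern documented above.

```python
import numpy as np


def bootstrap_interval(samples, alpha=0.05, n_bootstrap_samples=10000, random_state=None):
    """Percentile bootstrap for the mean of `samples` (hypothetical helper)."""
    samples = np.asarray(samples, dtype=float)
    rng = np.random.default_rng(random_state)

    # Draw indices with replacement; each row is one bootstrap replicate.
    idx = rng.integers(len(samples), size=(n_bootstrap_samples, len(samples)))
    boot_means = samples[idx].mean(axis=1)

    # The bounds are the alpha/2 and 1 - alpha/2 percentiles of the replicate means.
    level = 100 * (1.0 - alpha)
    return {
        "mean": float(boot_means.mean()),
        f"{level}% CI (lower)": float(np.percentile(boot_means, 100 * alpha / 2)),
        f"{level}% CI (upper)": float(np.percentile(boot_means, 100 * (1 - alpha / 2))),
    }
```

Fixing `random_state` makes the resampling, and therefore the reported interval, reproducible across calls.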
