SCOPE-RL#

A Python library for offline reinforcement learning, off-policy evaluation, and selection

Overview#

SCOPE-RL is an open-source Python library for both Offline Reinforcement Learning (RL) and Off-Policy Evaluation and Selection (OPE/OPS). It is intended to streamline offline RL research by providing an easy, flexible, and reliable platform for conducting experiments, and it offers straightforward implementations for practitioners. SCOPE-RL includes modules for synthetic dataset generation, dataset preprocessing, and for conducting and assessing OPE/OPS.

SCOPE-RL can be used with any RL environment that has an OpenAI Gym or Gymnasium-like interface. The library is also compatible with d3rlpy, which implements both online and offline RL methods.
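As a rough illustration of what a Gymnasium-like interface means here, the sketch below defines a toy environment in plain Python. The class `TwoStateEnv` and the helper `rollout` are hypothetical names introduced for this example and are not part of SCOPE-RL or Gymnasium; only the `reset()` and `step()` return conventions follow Gymnasium.

```python
import random


class TwoStateEnv:
    """Hypothetical toy environment with Gymnasium-style reset() and step()."""

    def __init__(self, horizon=5, seed=0):
        self.horizon = horizon
        self.rng = random.Random(seed)
        self.state = 0
        self.t = 0

    def reset(self):
        # Gymnasium convention: reset() returns (observation, info).
        self.state = 0
        self.t = 0
        return self.state, {}

    def step(self, action):
        # Reward is 1.0 when the action matches the current state, else 0.0.
        reward = 1.0 if action == self.state else 0.0
        self.state = self.rng.randint(0, 1)
        self.t += 1
        terminated = False                  # no terminal state in this toy task
        truncated = self.t >= self.horizon  # episode ends at the time limit
        # Gymnasium convention: (observation, reward, terminated, truncated, info).
        return self.state, reward, terminated, truncated, {}


def rollout(env, policy):
    """Run one episode and return a list of (state, action, reward) tuples."""
    state, _ = env.reset()
    trajectory = []
    done = False
    while not done:
        action = policy(state)
        next_state, reward, terminated, truncated, _ = env.step(action)
        trajectory.append((state, action, reward))
        state = next_state
        done = terminated or truncated
    return trajectory


# A policy that always picks the action equal to the state earns reward 1.0 per step.
trajectory = rollout(TwoStateEnv(horizon=5), policy=lambda s: s)
```

Any environment that follows the same two-method contract should be usable in the same way, whatever its internal dynamics.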

Our software facilitates implementation, evaluation and algorithm comparison related to the following research topics:

Figure: Workflow of offline RL, OPE, and online A/B testing

Offline Reinforcement Learning:
Offline RL aims to train a new policy using only offline logged data collected by a behavior policy. SCOPE-RL enables flexible experiments with customized datasets collected on diverse environments by various behavior policies.
Off-Policy Evaluation:
OPE aims to evaluate the performance of a counterfactual policy using only offline logged data. SCOPE-RL supports implementations of a range of OPE estimators and streamlines the experimental procedure to evaluate the accuracy of OPE estimators.
Off-Policy Selection:
OPS aims to select the top-\(k\) policies from several candidate policies using only offline logged data. Typically, the final production policy is chosen based on online A/B tests of the top-\(k\) policies selected by OPS. SCOPE-RL supports implementations of a range of OPS methods and provides metrics for evaluating OPS results.

Note

This documentation aims to provide a gentle introduction to offline RL and OPE/OPS in the following steps:

Explain the basic concepts in Overview (online/offline RL) and Overview (OPE/OPS).
Provide various examples of conducting offline RL and OPE/OPS in practical problem settings in Quickstart and Example Codes.
Describe the algorithms and implementations in detail in Supported Implementation and Package Reference.

You can also find the distinctive features of SCOPE-RL here: Why SCOPE-RL?

Implementation#

Data Collection Policy and Offline RL#

SCOPE-RL overrides d3rlpy’s implementation for the base RL algorithms. We provide classes for synthetic dataset generation and for off-policy learning with multiple algorithms, as well as wrapper classes that transform a policy into a stochastic one.

Meta class#

SyntheticDataset
TrainCandidatePolicies

Discrete#

Epsilon Greedy
Softmax
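To make the two discrete wrappers concrete, here is a minimal sketch of how a deterministic, value-based policy can be turned into a stochastic one. The function names and the `q_values`, `epsilon`, and `tau` parameters are assumptions made for this illustration and do not reflect SCOPE-RL's class interface.

```python
import math


def epsilon_greedy_probs(q_values, epsilon):
    """Action probabilities: greedy action with prob. 1 - epsilon, uniform otherwise."""
    n = len(q_values)
    greedy = max(range(n), key=lambda a: q_values[a])
    probs = [epsilon / n] * n
    probs[greedy] += 1.0 - epsilon
    return probs


def softmax_probs(q_values, tau=1.0):
    """Action probabilities proportional to exp(Q / tau)."""
    m = max(q_values)  # subtract the max for numerical stability
    exps = [math.exp((q - m) / tau) for q in q_values]
    z = sum(exps)
    return [e / z for e in exps]


q = [0.1, 0.5, 0.2]
eg = epsilon_greedy_probs(q, epsilon=0.3)
sm = softmax_probs(q, tau=0.5)
```

Both functions return a full probability vector, which is what importance-sampling-based OPE needs in order to compute the action choice probabilities of the behavior and evaluation policies.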

Continuous#

Gaussian
Truncated Gaussian
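For continuous actions, the same idea amounts to adding noise around a deterministic action. The sketch below is a simplified illustration in plain Python; the function names, the `sigma`, `low`, and `high` parameters, and the use of rejection sampling for truncation are all assumptions of this example, not a description of SCOPE-RL's implementation.

```python
import random


def gaussian_action(mean, sigma, rng):
    """Sample an action from N(mean, sigma^2)."""
    return rng.gauss(mean, sigma)


def truncated_gaussian_action(mean, sigma, low, high, rng, max_tries=1000):
    """Sample from N(mean, sigma^2) restricted to [low, high] by rejection sampling."""
    for _ in range(max_tries):
        action = rng.gauss(mean, sigma)
        if low <= action <= high:
            return action
    # Fallback if no sample was accepted: clip the mean into the valid range.
    return min(max(mean, low), high)


rng = random.Random(0)
samples = [truncated_gaussian_action(0.8, 0.5, -1.0, 1.0, rng) for _ in range(100)]
```

Truncation keeps every sampled action inside the valid action range, which matters when the environment rejects or clips out-of-range actions.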

Basic OPE#

Figure: Policy value estimated by OPE estimators

SCOPE-RL provides a variety of OPE estimators for both discrete and continuous action spaces. It also implements meta classes that handle OPE with multiple estimators at once, and provides generic classes of OPE estimators to facilitate research development.

Basic estimators#

Direct Method (DM) [3, 4]
Trajectory-wise Importance Sampling (TIS) [5]
Per-Decision Importance Sampling (PDIS) [5]
Doubly Robust (DR) [6, 7]
Self-Normalized Trajectory-wise Importance Sampling (SNTIS) [5, 11]
Self-Normalized Per-Decision Importance Sampling (SNPDIS) [5, 11]
Self-Normalized Doubly Robust (SNDR) [6, 7, 11]
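The importance-sampling estimators in this list can be sketched in a few lines using their textbook definitions. The data layout below (each trajectory as a list of `(reward, evaluation-policy probability, behavior-policy probability)` tuples) and the function names are assumptions made for this example; SCOPE-RL's own estimator classes are documented in the package reference.

```python
def tis(trajectories, gamma=1.0):
    """Trajectory-wise IS: weight the whole discounted return by the full trajectory ratio."""
    total = 0.0
    for traj in trajectories:
        weight, ret = 1.0, 0.0
        for t, (reward, pi_e, pi_b) in enumerate(traj):
            weight *= pi_e / pi_b
            ret += gamma ** t * reward
        total += weight * ret
    return total / len(trajectories)


def pdis(trajectories, gamma=1.0):
    """Per-decision IS: weight each reward by the ratio accumulated up to that step."""
    total = 0.0
    for traj in trajectories:
        weight = 1.0
        for t, (reward, pi_e, pi_b) in enumerate(traj):
            weight *= pi_e / pi_b
            total += gamma ** t * weight * reward
    return total / len(trajectories)


def sntis(trajectories, gamma=1.0):
    """Self-normalized TIS: divide by the sum of trajectory weights instead of n."""
    weights, returns = [], []
    for traj in trajectories:
        weight, ret = 1.0, 0.0
        for t, (reward, pi_e, pi_b) in enumerate(traj):
            weight *= pi_e / pi_b
            ret += gamma ** t * reward
        weights.append(weight)
        returns.append(ret)
    return sum(w * g for w, g in zip(weights, returns)) / sum(weights)


logged = [
    [(1.0, 0.5, 0.5), (1.0, 1.0, 0.5)],
    [(0.0, 0.5, 0.5), (2.0, 0.25, 0.5)],
]
estimates = (tis(logged), pdis(logged), sntis(logged))
```

PDIS typically has lower variance than TIS because early rewards are not multiplied by ratios from later steps, and self-normalization trades a small bias for further variance reduction.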

State Marginal Estimators#

State Marginal Direct Method (SM-DM) [12]
State Marginal Importance Sampling (SM-IS) [12, 13, 14]
State Marginal Doubly Robust (SM-DR) [12, 13, 14]
State Marginal Self-Normalized Importance Sampling (SM-SNIS) [12, 13, 14]
State Marginal Self-Normalized Doubly Robust (SM-SNDR) [12, 13, 14]

State-Action Marginal Estimators#

State-Action Marginal Importance Sampling (SAM-IS) [12, 14]
State-Action Marginal Doubly Robust (SAM-DR) [12, 14]
State-Action Marginal Self-Normalized Importance Sampling (SAM-SNIS) [12, 14]
State-Action Marginal Self-Normalized Doubly Robust (SAM-SNDR) [12, 14]
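For orientation, the marginal importance sampling estimators above are commonly written as follows. This is the general textbook form under assumed notation (\(d^{\pi}\) for the marginal visitation distribution, \(\pi_b\) for the behavior policy, \(n\) trajectories of length \(T\)); the exact variants implemented in SCOPE-RL may differ in detail.

```latex
% State marginal importance sampling (SM-IS)
\hat{J}_{\mathrm{SM\text{-}IS}}
  = \frac{1}{n} \sum_{i=1}^{n} \sum_{t=0}^{T-1}
    \gamma^{t} \, \rho(s_t^{(i)}) \, \frac{\pi(a_t^{(i)} \mid s_t^{(i)})}{\pi_b(a_t^{(i)} \mid s_t^{(i)})} \, r_t^{(i)},
\qquad
\rho(s) = \frac{d^{\pi}(s)}{d^{\pi_b}(s)}

% State-action marginal importance sampling (SAM-IS)
\hat{J}_{\mathrm{SAM\text{-}IS}}
  = \frac{1}{n} \sum_{i=1}^{n} \sum_{t=0}^{T-1}
    \gamma^{t} \, \rho(s_t^{(i)}, a_t^{(i)}) \, r_t^{(i)},
\qquad
\rho(s, a) = \frac{d^{\pi}(s, a)}{d^{\pi_b}(s, a)}
```

Because the weights depend on a single state or state-action pair rather than on a product of per-step ratios, their variance does not grow exponentially with the trajectory length.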

Double Reinforcement Learning#

Double Reinforcement Learning [15]

Weight and Value Learning Methods#

Augmented Lagrangian Method (ALM/DICE) [16]
- BestDICE [16]
- GradientDICE [17]
- GenDICE [18]
- AlgaeDICE [19]
- DualDICE [20]
- MQL/MWL [12]
Minimax Q-Learning and Weight Learning (MQL/MWL) [12]

High Confidence OPE#

Bootstrap [21, 22]
Hoeffding [23]
(Empirical) Bernstein [21, 23]
Student T-test [21]
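As a rough sketch of how such intervals are obtained, the example below computes a Hoeffding interval and a percentile bootstrap interval from per-trajectory estimates (for example, importance-weighted returns). The function names, the `low`/`high` bounds, and the sample values are assumptions made for illustration, and the formulas are the textbook two-sided versions rather than SCOPE-RL's exact implementation.

```python
import math
import random


def hoeffding_interval(samples, low, high, delta=0.05):
    """Two-sided Hoeffding interval for the mean of samples bounded in [low, high]."""
    n = len(samples)
    mean = sum(samples) / n
    half_width = (high - low) * math.sqrt(math.log(2.0 / delta) / (2.0 * n))
    return mean - half_width, mean + half_width


def bootstrap_interval(samples, delta=0.05, n_boot=1000, seed=0):
    """Percentile bootstrap interval for the mean."""
    rng = random.Random(seed)
    n = len(samples)
    # Resample with replacement and record the mean of each resample.
    means = sorted(sum(rng.choices(samples, k=n)) / n for _ in range(n_boot))
    lower = means[int((delta / 2.0) * n_boot)]
    upper = means[min(n_boot - 1, int((1.0 - delta / 2.0) * n_boot))]
    return lower, upper


per_trajectory = [0.2, 0.4, 0.6, 0.8, 1.0, 0.1, 0.5, 0.9, 0.3, 0.7]
hoeffding = hoeffding_interval(per_trajectory, low=0.0, high=1.0)
bootstrap = bootstrap_interval(per_trajectory)
```

Hoeffding's bound needs only a known range for the samples and tends to be conservative, whereas the bootstrap makes no range assumption but offers only approximate coverage.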

Cumulative Distribution OPE#

Figure: Cumulative distribution function estimated by OPE estimators

SCOPE-RL also provides cumulative distribution OPE estimators, which enable practitioners to evaluate various risk metrics (e.g., conditional value at risk) for safety assessment, going beyond the mere expectation of the trajectory-wise reward. A meta class and a generic abstract class are also available for cumulative distribution OPE.

Estimators#

Direct Method (DM) [8]
Trajectory-wise Importance Sampling (TIS) [8, 10]
Trajectory-wise Doubly Robust (TDR) [8]
Self-Normalized Trajectory-wise Importance Sampling (SNTIS) [8, 10]
Self-Normalized Trajectory-wise Doubly Robust (SNTDR) [8]

Metrics of Interest#

Cumulative Distribution Function (CDF)
Mean (i.e., policy value)
Variance
Conditional Value at Risk (CVaR)
Interquartile Range
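The metrics above can all be derived from an estimated distribution over trajectory-wise returns. The sketch below assumes that distribution is given as returns with non-negative weights (for instance, self-normalized importance weights); the function names and this data layout are assumptions made for the example, not SCOPE-RL's interface.

```python
def weighted_cdf(returns, weights):
    """Return sorted (return, cumulative probability) pairs of the weighted distribution."""
    total = sum(weights)
    cdf, acc = [], 0.0
    for ret, w in sorted(zip(returns, weights)):
        acc += w / total
        cdf.append((ret, acc))
    return cdf


def mean_and_variance(returns, weights):
    total = sum(weights)
    mean = sum(w * r for r, w in zip(returns, weights)) / total
    var = sum(w * (r - mean) ** 2 for r, w in zip(returns, weights)) / total
    return mean, var


def cvar(returns, weights, alpha):
    """Average return over the worst alpha fraction of the distribution."""
    total = sum(weights)
    remaining, tail_sum = alpha, 0.0
    for ret, w in sorted(zip(returns, weights)):
        take = min(w / total, remaining)  # probability mass taken from this return
        tail_sum += take * ret
        remaining -= take
        if remaining <= 1e-12:
            break
    return tail_sum / alpha


def quantile(returns, weights, q):
    """Smallest return whose cumulative probability reaches q."""
    for ret, acc in weighted_cdf(returns, weights):
        if acc >= q - 1e-12:
            return ret
    return max(returns)


def interquartile_range(returns, weights):
    return quantile(returns, weights, 0.75) - quantile(returns, weights, 0.25)


rets, wts = [1.0, 2.0, 3.0, 4.0], [1.0, 1.0, 1.0, 1.0]
summary = (mean_and_variance(rets, wts), cvar(rets, wts, 0.5), interquartile_range(rets, wts))
```

With uniform weights this reduces to the ordinary empirical distribution; non-uniform weights shift probability mass toward trajectories that the evaluation policy is more likely to generate.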

Off-Policy Selection Metrics#

Figure: Comparison of the top-k statistics of 10% lower quartile of policy value

Finally, SCOPE-RL standardizes the evaluation protocol of OPE along two axes: first, by measuring the accuracy of OPE across all candidate policies, and second, by evaluating the gains and costs of top-k deployment (e.g., the best and worst performance among the top-k policies). The streamlined implementation and visualization of the OPS class provide insight into offline RL and OPE performance.

OPE metrics#

Mean Squared Error [24, 25, 26]
Spearman’s Rank Correlation Coefficient [24, 26]
Regret [24, 26]
Type I and Type II Error Rates
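Three of these accuracy metrics can be sketched directly from their standard definitions, given the true policy values and the OPE estimates for each candidate policy. The function names and toy numbers are assumptions made for this example, and the rank correlation uses the simple no-ties formula.

```python
def mean_squared_error(true_values, estimates):
    n = len(true_values)
    return sum((t - e) ** 2 for t, e in zip(true_values, estimates)) / n


def ranks(values):
    """Rank 1 is the smallest value; assumes there are no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result


def spearman_rank_correlation(true_values, estimates):
    """Spearman's rho via 1 - 6 * sum(d^2) / (n (n^2 - 1)), valid without ties."""
    n = len(true_values)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(true_values), ranks(estimates)))
    return 1.0 - 6.0 * d2 / (n * (n ** 2 - 1))


def regret_at_k(true_values, estimates, k):
    """Gap between the best candidate and the best among the top-k by estimate."""
    order = sorted(range(len(estimates)), key=lambda i: estimates[i], reverse=True)
    return max(true_values) - max(true_values[i] for i in order[:k])


true_values = [1.0, 2.0, 3.0, 4.0]   # ground-truth policy values
estimates = [1.5, 1.0, 3.5, 3.0]     # OPE estimates for the same policies
report = (
    mean_squared_error(true_values, estimates),
    spearman_rank_correlation(true_values, estimates),
    regret_at_k(true_values, estimates, k=1),
)
```

MSE measures estimation accuracy, while rank correlation and regret measure how well the estimates preserve the ordering that policy selection depends on.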

OPS metrics (performance of top \(k\) deployment policies)#

{Best/Worst/Mean/Std} of {policy value/conditional value at risk/lower quartile}
Safety violation rate
Sharpe ratio (our proposal)
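The top-\(k\) metrics can be illustrated with a short sketch. It assumes that the Sharpe ratio at \(k\) is the improvement of the best top-\(k\) policy over the behavior policy divided by the standard deviation of the top-\(k\) policy values, and that the safety violation rate is the fraction of top-\(k\) policies below a threshold. The function names and toy numbers are hypothetical; refer to the SharpeRatio@k documentation for the authoritative definitions.

```python
import math


def top_k_values(true_values, estimates, k):
    """True values of the k policies ranked highest by the OPE estimates."""
    order = sorted(range(len(estimates)), key=lambda i: estimates[i], reverse=True)
    return [true_values[i] for i in order[:k]]


def top_k_statistics(true_values, estimates, k):
    values = top_k_values(true_values, estimates, k)
    mean = sum(values) / k
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / k)  # population std
    return {"best": max(values), "worst": min(values), "mean": mean, "std": std}


def safety_violation_rate(true_values, estimates, k, threshold):
    """Fraction of the top-k policies whose true value falls below the threshold."""
    values = top_k_values(true_values, estimates, k)
    return sum(v < threshold for v in values) / k


def sharpe_ratio_at_k(true_values, estimates, k, behavior_value):
    """(best@k - behavior policy value) / std@k; undefined when std@k is zero."""
    stats = top_k_statistics(true_values, estimates, k)
    return (stats["best"] - behavior_value) / stats["std"]


true_values = [1.0, 2.0, 3.0, 4.0]   # ground-truth policy values
estimates = [1.5, 1.0, 3.5, 3.0]     # OPE estimates for the same policies
stats = top_k_statistics(true_values, estimates, k=2)
sharpe = sharpe_ratio_at_k(true_values, estimates, k=2, behavior_value=2.0)
violation = safety_violation_rate(true_values, estimates, k=2, threshold=3.5)
```

A high ratio indicates that an estimator finds a policy that improves on the behavior policy without exposing the online A/B test to a widely varying set of candidates.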

Citation#

If you use our pipeline in your work, please cite our paper below.

Haruka Kiyohara, Ren Kishimoto, Kosuke Kawakami, Ken Kobayashi, Kazuhide Nakata, Yuta Saito.

SCOPE-RL: A Python Library for Offline Reinforcement Learning and Off-Policy Evaluation

@article{kiyohara2023scope,
    title={SCOPE-RL: A Python Library for Offline Reinforcement Learning and Off-Policy Evaluation},
    author={Kiyohara, Haruka and Kishimoto, Ren and Kawakami, Kosuke and Kobayashi, Ken and Nakata, Kazuhide and Saito, Yuta},
    journal={arXiv preprint arXiv:2311.18206},
    year={2023}
}

If you use the proposed metric (SharpeRatio@k) or refer to our findings in your work, please cite our paper below.

Haruka Kiyohara, Ren Kishimoto, Kosuke Kawakami, Ken Kobayashi, Kazuhide Nakata, Yuta Saito.
Towards Assessing and Benchmarking Risk-Return Tradeoff of Off-Policy Evaluation
(a preprint is coming soon)

@article{kiyohara2023towards,
    title={Towards Assessing and Benchmarking Risk-Return Tradeoff of Off-Policy Evaluation},
    author={Kiyohara, Haruka and Kishimoto, Ren and Kawakami, Kosuke and Kobayashi, Ken and Nakata, Kazuhide and Saito, Yuta},
    journal={arXiv preprint arXiv:2311.18207},
    year={2023}
}

Google Group#

Feel free to follow our updates via our Google Group: scope-rl@googlegroups.com.

Contact#

For any questions about the paper and pipeline, feel free to contact: hk844@cornell.edu

Contribution#

Any contributions to SCOPE-RL are more than welcome! Please refer to CONTRIBUTING.md for general guidelines on how to contribute to the project.
