References

Papers

[1]

Hado Van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double q-learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 30, 2094–2100. 2016.

[2]

Aviral Kumar, Aurick Zhou, George Tucker, and Sergey Levine. Conservative q-learning for offline reinforcement learning. In Advances in Neural Information Processing Systems, volume 33, 1179–1191. 2020.

[3]

Alina Beygelzimer and John Langford. The offset tree for learning with partial labels. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 129–138. 2009.

[4]

Hoang Le, Cameron Voloshin, and Yisong Yue. Batch policy learning under constraints. In Proceedings of the 36th International Conference on Machine Learning, volume 97, 3703–3712. PMLR, 2019.

[5]

Doina Precup, Richard S. Sutton, and Satinder P. Singh. Eligibility traces for off-policy policy evaluation. In Proceedings of the 17th International Conference on Machine Learning, 759–766. 2000.

[6]

Nan Jiang and Lihong Li. Doubly robust off-policy value evaluation for reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning, volume 48, 652–661. PMLR, 2016.

[7]

Philip Thomas and Emma Brunskill. Data-efficient off-policy policy evaluation for reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning, volume 48, 2139–2148. PMLR, 2016.

[8]

Audrey Huang, Liu Leqi, Zachary Lipton, and Kamyar Azizzadenesheli. Off-policy risk assessment in contextual bandits. In Advances in Neural Information Processing Systems, volume 34, 23714–23726. 2021.

[9]

Audrey Huang, Liu Leqi, Zachary Lipton, and Kamyar Azizzadenesheli. Off-policy risk assessment for Markov decision processes. In Proceedings of the 25th International Conference on Artificial Intelligence and Statistics, 5022–5050. 2022.

[10]

Yash Chandak, Scott Niekum, Bruno da Silva, Erik Learned-Miller, Emma Brunskill, and Philip S Thomas. Universal off-policy evaluation. In Advances in Neural Information Processing Systems, volume 34, 27475–27490. 2021.

[11]

Nathan Kallus and Masatoshi Uehara. Intrinsically efficient, stable, and bounded off-policy evaluation for reinforcement learning. In Advances in Neural Information Processing Systems, 3325–3334. 2019.

[12]

Masatoshi Uehara, Jiawei Huang, and Nan Jiang. Minimax weight and q-function learning for off-policy evaluation. In Proceedings of the 37th International Conference on Machine Learning, 9659–9668. PMLR, 2020.

[13]

Qiang Liu, Lihong Li, Ziyang Tang, and Dengyong Zhou. Breaking the curse of horizon: infinite-horizon off-policy estimation. Advances in Neural Information Processing Systems, 2018.

[14]

Christina Yuan, Yash Chandak, Stephen Giguere, Philip S Thomas, and Scott Niekum. Sope: spectrum of off-policy estimators. Advances in Neural Information Processing Systems, 34:18958–18969, 2021.

[15]

Nathan Kallus and Masatoshi Uehara. Double reinforcement learning for efficient off-policy evaluation in Markov decision processes. Journal of Machine Learning Research, 2020.

[16]

Mengjiao Yang, Ofir Nachum, Bo Dai, Lihong Li, and Dale Schuurmans. Off-policy evaluation via the regularized lagrangian. Advances in Neural Information Processing Systems, 33:6551–6561, 2020.

[17]

Shangtong Zhang, Bo Liu, and Shimon Whiteson. Gradientdice: rethinking generalized offline estimation of stationary values. In Proceedings of the 37th International Conference on Machine Learning, 11194–11203. PMLR, 2020.

[18]

Ruiyi Zhang, Bo Dai, Lihong Li, and Dale Schuurmans. Gendice: generalized offline estimation of stationary values. Proceedings of the 8th International Conference on Learning Representations, 2020.

[19]

Ofir Nachum, Bo Dai, Ilya Kostrikov, Yinlam Chow, Lihong Li, and Dale Schuurmans. Algaedice: policy gradient from arbitrary experience. arXiv preprint arXiv:1912.02074, 2019.

[20]

Ofir Nachum, Yinlam Chow, Bo Dai, and Lihong Li. Dualdice: behavior-agnostic estimation of discounted stationary distribution corrections. Advances in Neural Information Processing Systems, 2019.

[21]

Philip Thomas, Georgios Theocharous, and Mohammad Ghavamzadeh. High confidence policy improvement. In International Conference on Machine Learning, 2380–2388. PMLR, 2015.

[22]

Josiah P Hanna, Peter Stone, and Scott Niekum. Bootstrapping with models: confidence intervals for off-policy evaluation. In Proceedings of the 31st AAAI Conference on Artificial Intelligence. 2017.

[23]

Philip Thomas, Georgios Theocharous, and Mohammad Ghavamzadeh. High-confidence off-policy evaluation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 29. 2015.

[24]

Tom Le Paine, Cosmin Paduraru, Andrea Michi, Caglar Gulcehre, Konrad Zolna, Alexander Novikov, Ziyu Wang, and Nando de Freitas. Hyperparameter selection for offline reinforcement learning. arXiv preprint arXiv:2007.09055, 2020.

[25]

Cameron Voloshin, Hoang M Le, Nan Jiang, and Yisong Yue. Empirical study of off-policy policy evaluation for reinforcement learning. Advances in Neural Information Processing Systems, 2019.

[26]

Justin Fu, Mohammad Norouzi, Ofir Nachum, George Tucker, Ziyu Wang, Alexander Novikov, Mengjiao Yang, Michael R. Zhang, Yutian Chen, Aviral Kumar, Cosmin Paduraru, Sergey Levine, and Thomas Paine. Benchmarks for deep off-policy evaluation. In Proceedings of the 9th International Conference on Learning Representations. 2021.

[27]

Vladislav Kurenkov and Sergey Kolesnikov. Showing your offline reinforcement learning work: online evaluation budget matters. In Proceedings of the 39th International Conference on Machine Learning, 11729–11752. PMLR, 2022.

[28]

Masatoshi Uehara, Chengchun Shi, and Nathan Kallus. A review of off-policy evaluation in reinforcement learning. arXiv preprint arXiv:2212.06355, 2022.

[29]

Shayan Doroudi, Philip S Thomas, and Emma Brunskill. Importance sampling for fair policy selection. Grantee Submission, 2017.

[30]

Shengpu Tang and Jenna Wiens. Model selection for offline reinforcement learning: practical considerations for healthcare settings. In Machine Learning for Healthcare Conference, 2–35. 2021.

[31]

Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. Openai gym. arXiv preprint arXiv:1606.01540, 2016.

[32]

Takuma Seno and Michita Imai. D3rlpy: an offline deep reinforcement learning library. arXiv preprint arXiv:2111.03788, 2021.

[33]

Denis Tarasov, Alexander Nikulin, Dmitry Akimov, Vladislav Kurenkov, and Sergey Kolesnikov. CORL: research-oriented deep offline reinforcement learning library. In 3rd Offline RL Workshop: Offline RL as a “Launchpad”. 2022.

[34]

Eric Liang, Richard Liaw, Robert Nishihara, Philipp Moritz, Roy Fox, Ken Goldberg, Joseph Gonzalez, Michael Jordan, and Ion Stoica. Rllib: abstractions for distributed reinforcement learning. In Proceedings of the 35th International Conference on Machine Learning, 3053–3062. PMLR, 2018.

[35]

Jason Gauci, Edoardo Conti, Yitao Liang, Kittipat Virochsiri, Yuchen He, Zachary Kaden, Vivek Narayanan, Xiaohui Ye, Zhengxing Chen, and Scott Fujimoto. Horizon: facebook's open source applied reinforcement learning platform. arXiv preprint arXiv:1811.00260, 2018.

[36]

Rongjun Qin, Songyi Gao, Xingyuan Zhang, Zhen Xu, Shengkai Huang, Zewen Li, Weinan Zhang, and Yang Yu. Neorl: a near real-world benchmark for offline reinforcement learning. Advances in Neural Information Processing Systems, 2021.

[37]

David Rohde, Stephen Bonner, Travis Dunlop, Flavian Vasile, and Alexandros Karatzoglou. Recogym: a reinforcement learning environment for the problem of product recommendation in online advertising. arXiv preprint arXiv:1808.00720, 2018.

[38]

Kai Wang, Zhene Zou, Qilin Deng, Yue Shang, Minghao Zhao, Runze Wu, Xudong Shen, Tangjie Lyu, and Changjie Fan. Rl4rs: a real-world benchmark for reinforcement learning based recommender system. arXiv preprint arXiv:2110.11073, 2021.

[39]

Olivier Jeunen, Sean Murphy, and Ben Allison. Learning to bid with auctiongym. In KDD 2022 Workshop on Artificial Intelligence for Computational Advertising (AdKDD). 2022. URL: https://www.amazon.science/publications/learning-to-bid-with-auctiongym.

[40]

Yuta Saito, Shunsuke Aihara, Megumi Matsutani, and Yusuke Narita. Open bandit dataset and pipeline: towards realistic and reproducible off-policy evaluation. Advances in Neural Information Processing Systems, 2021.

[41]

Sham M Kakade. A natural policy gradient. Advances in Neural Information Processing Systems, 2001.

[42]

David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and Martin Riedmiller. Deterministic policy gradient algorithms. In Proceedings of the 31st International Conference on Machine Learning, 387–395. PMLR, 2014.

[43]

Christopher JCH Watkins and Peter Dayan. Q-learning. Machine Learning, 8(3):279–292, 1992.

[44]

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.

[45]

Vijay Konda and John Tsitsiklis. Actor-critic algorithms. Advances in Neural Information Processing Systems, 1999.

[46]

Thomas Degris, Martha White, and Richard S Sutton. Off-policy actor-critic. In Proceedings of the 29th International Conference on Machine Learning, 179–186. 2012.

[47]

Hado Van Hasselt, Yotam Doron, Florian Strub, Matteo Hessel, Nicolas Sonnerat, and Joseph Modayil. Deep reinforcement learning and the deadly triad. arXiv preprint arXiv:1812.02648, 2018.

[48]

Sergey Levine, Aviral Kumar, George Tucker, and Justin Fu. Offline reinforcement learning: tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643, 2020.

[49]

Tatsuya Matsushima, Hiroki Furuta, Yutaka Matsuo, Ofir Nachum, and Shixiang Gu. Deployment-efficient reinforcement learning via model-based offline optimization. Proceedings of the 9th International Conference on Learning Representations, 2021.

[50]

Scott Fujimoto, David Meger, and Doina Precup. Off-policy deep reinforcement learning without exploration. In Proceedings of the 36th International Conference on Machine Learning, volume 97, 2052–2062. PMLR, 2019.

[51]

Yifan Wu, George Tucker, and Ofir Nachum. Behavior regularized offline reinforcement learning. arXiv preprint arXiv:1911.11361, 2019.

[52]

Scott Fujimoto and Shixiang Shane Gu. A minimalist approach to offline reinforcement learning. Advances in Neural Information Processing Systems, 34:20132–20145, 2021.

[53]

Aviral Kumar, Justin Fu, Matthew Soh, George Tucker, and Sergey Levine. Stabilizing off-policy q-learning via bootstrapping error reduction. Advances in Neural Information Processing Systems, 2019.

[54]

Tianhe Yu, Aviral Kumar, Rafael Rafailov, Aravind Rajeswaran, Sergey Levine, and Chelsea Finn. Combo: conservative offline model-based policy optimization. Advances in Neural Information Processing Systems, 34:28954–28967, 2021.

[55]

Ilya Kostrikov, Ashvin Nair, and Sergey Levine. Offline reinforcement learning with implicit q-learning. arXiv preprint arXiv:2110.06169, 2021.

[56]

Rafael Figueiredo Prudencio, Marcos ROA Maximo, and Esther Luna Colombini. A survey on offline reinforcement learning: taxonomy, review, and open problems. arXiv preprint arXiv:2203.01387, 2022.

[57]

Nathan Kallus and Angela Zhou. Policy evaluation and optimization with continuous treatments. In Proceedings of the 21st International Conference on Artificial Intelligence and Statistics, volume 84, 1243–1251. PMLR, 2018.

[58]

Haanvid Lee, Jongmin Lee, Yunseon Choi, Wonseok Jeon, Byung-Jun Lee, Yung-Kyun Noh, and Kee-Eung Kim. Local metric learning for off-policy evaluation in contextual bandits with continuous actions. In Advances in Neural Information Processing Systems, xxxx–xxxx. 2022.

[59]

Eugene Ie, Chih-wei Hsu, Martin Mladenov, Vihan Jain, Sanmit Narvekar, Jing Wang, Rui Wu, and Craig Boutilier. Recsim: a configurable simulation platform for recommender systems. arXiv preprint arXiv:1909.04847, 2019.

[60]

Xiao-Yang Liu, Hongyang Yang, Qian Chen, Runjia Zhang, Liuqing Yang, Bowen Xiao, and Christina Dan Wang. Finrl: a deep reinforcement learning library for automated stock trading in quantitative finance. arXiv preprint arXiv:2011.09607, 2020.

[61]

Sarah Dean and Jamie Morgenstern. Preference dynamics under personalized recommendations. In Proceedings of the 23rd ACM Conference on Economics and Computation, 795–816. 2022.