Corruption-Robust Exploration in Episodic Reinforcement Learning

Times Cited: 0
Authors
Lykouris, Thodoris [1 ]
Simchowitz, Max [2 ]
Slivkins, Aleksandrs [3 ]
Sun, Wen [4 ]
Affiliations
[1] MIT, Sloan Sch Management, Cambridge, MA 02139 USA
[2] MIT, Dept Elect Engn & Comp Sci, Cambridge, MA 02139 USA
[3] Microsoft Res Lab, New York, NY 10012 USA
[4] Cornell Univ, Dept Comp Sci, Ithaca, NY 14850 USA
Keywords
reinforcement learning; bandit feedback; exploration; robustness; regret; multiarmed bandit
DOI
10.1287/moor.2021.0202
Chinese Library Classification (CLC)
C93 (Management); O22 (Operations Research)
Discipline Classification Codes
070105; 12; 1201; 1202; 120202
Abstract
We initiate the study of episodic reinforcement learning (RL) under adversarial corruptions in both the rewards and the transition probabilities of the underlying system, extending recent results for the special case of multiarmed bandits. We provide a framework that modifies the aggressive exploration enjoyed by existing reinforcement learning approaches based on optimism in the face of uncertainty by complementing them with principles from action elimination. Importantly, our framework circumvents the major challenges posed by naively applying action elimination in the RL setting, as formalized by a lower bound we demonstrate. Our framework yields efficient algorithms that (a) attain near-optimal regret in the absence of corruptions and (b) adapt to unknown levels of corruption, enjoying regret guarantees that degrade gracefully in the total corruption encountered. To showcase the generality of our approach, we derive results for both tabular settings (where states and actions are finite) and linear Markov decision process settings (where the dynamics and rewards admit a linear underlying representation). Notably, our work provides the first sublinear regret guarantee that accommodates any deviation from purely independent and identically distributed transitions in the bandit-feedback model for episodic reinforcement learning.
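To make the action-elimination principle in the abstract concrete, the sketch below shows successive elimination in the multiarmed-bandit special case that the paper generalizes from, with confidence radii inflated by an assumed total corruption budget so that a bounded adversary cannot force the optimal arm out of the active set. This is a minimal illustration only: the function name `successive_elimination`, the Bernoulli reward model, and the known `corruption_budget` parameter are all our assumptions, not the paper's algorithm, which handles episodic RL and adapts to unknown corruption levels.

```python
import numpy as np

def successive_elimination(means, horizon, corruption_budget, seed=0):
    """Corruption-tolerant successive elimination (illustrative sketch).

    Keeps an active set of arms and discards an arm only when its upper
    confidence bound falls below the best lower confidence bound, with
    radii widened by the assumed corruption allowance.
    """
    rng = rng = np.random.default_rng(seed)
    n_arms = len(means)
    active = list(range(n_arms))
    counts = np.zeros(n_arms)
    sums = np.zeros(n_arms)

    for t in range(1, horizon + 1):
        arm = active[t % len(active)]           # round-robin over active arms
        reward = rng.binomial(1, means[arm])    # Bernoulli reward draw
        counts[arm] += 1
        sums[arm] += reward

        if all(counts[a] > 0 for a in active):
            mu_hat = sums / np.maximum(counts, 1)
            # Hoeffding radius plus a corruption term: an adversary injecting
            # at most `corruption_budget` total reward perturbation can shift
            # an empirical mean by at most corruption_budget / counts[a].
            radius = (np.sqrt(2 * np.log(horizon) / np.maximum(counts, 1))
                      + corruption_budget / np.maximum(counts, 1))
            best_lcb = max(mu_hat[a] - radius[a] for a in active)
            active = [a for a in active if mu_hat[a] + radius[a] >= best_lcb]

    return active  # arms still plausibly optimal

# Example: the best arm (index 2) should survive despite modest corruption.
print(successive_elimination([0.3, 0.5, 0.8], horizon=5000, corruption_budget=20))
```

The widened radius is what makes elimination "safe" under corruption: the optimal arm stays active as long as the adversary's total perturbation respects the budget, at the price of slower elimination and hence regret that degrades with the corruption level, mirroring the graceful-degradation guarantee described above.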
Pages: 29