Self-Supervised Network Distillation for Exploration

Authors
Zhang, Xu [1]
Dai, Ruiyu [2]
Chen, Weisi [1]
Qiu, Jiguang [1]
Affiliations
[1] Xiamen University of Technology, College of Software Engineering, 600 Ligong Rd, Xiamen 361000, People's Republic of China
[2] Xiamen University of Technology, College of Computer and Information Engineering, 600 Ligong Rd, Xiamen 361000, People's Republic of China
Keywords
Reinforcement learning; exploration; self-supervised learning; knowledge distillation
DOI
10.1142/S0218001423510217
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Reinforcement learning is now applied across a range of domains, including robotics, gaming, and natural language processing, but it struggles in environments where rewards are sparse. Random network distillation (RND) is an effective intrinsic-reward remedy: a predictor network is trained to match a fixed, randomly initialized target network, and the prediction error serves as an exploration bonus. However, RND's effectiveness hinges on a good initialization of that random target, and its reliance on random features constrains the agent's exploration. This paper proposes a self-supervised network distillation (SSND) exploration method that removes RND's dependence on a randomly initialized target network while improving exploration in sparse-reward environments. SSND likewise uses the distillation error as the intrinsic reward, but the target network is trained with self-supervised learning. While training the predictor network, we observed fluctuations in both the loss and the intrinsic reward that degrade the agent's performance. To address this, we add batch normalization layers to the target network, which mitigates intrinsic-reward anomalies caused by the target network's instability. Experiments show that SSND outperforms RND in both exploration speed and final performance.
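To make the mechanism described in the abstract concrete, the following is a minimal PyTorch sketch of a distillation-error intrinsic reward with a batch-normalized target network. It is an illustration under stated assumptions, not the authors' implementation: the architectures, embedding size, learning rate, and the names `make_encoder` and `SSNDIntrinsicReward` are invented for this sketch, and the specific self-supervised objective used to train the target network is not given in the abstract, so that update is omitted.

```python
# Hypothetical sketch of an SSND-style intrinsic reward (not the paper's code).
# Assumptions: MLP encoders, 64-d embeddings, Adam with lr=1e-4; the
# self-supervised update of the target network is omitted because the
# abstract does not specify the objective.
import torch
import torch.nn as nn
import torch.nn.functional as F


def make_encoder(obs_dim: int, emb_dim: int, use_batchnorm: bool) -> nn.Sequential:
    """Small MLP encoder. BatchNorm in the target network is the
    stabilization the abstract describes; the predictor omits it."""
    layers = [nn.Linear(obs_dim, 256)]
    if use_batchnorm:
        layers.append(nn.BatchNorm1d(256))
    layers += [nn.ReLU(), nn.Linear(256, emb_dim)]
    return nn.Sequential(*layers)


class SSNDIntrinsicReward:
    """RND-style distillation bonus, but the target network is meant to be
    trained by self-supervision rather than left frozen and random."""

    def __init__(self, obs_dim: int, emb_dim: int = 64, lr: float = 1e-4):
        self.target = make_encoder(obs_dim, emb_dim, use_batchnorm=True)
        self.predictor = make_encoder(obs_dim, emb_dim, use_batchnorm=False)
        self.opt = torch.optim.Adam(self.predictor.parameters(), lr=lr)

    @torch.no_grad()
    def intrinsic_reward(self, obs: torch.Tensor) -> torch.Tensor:
        """Per-state bonus: squared error between predictor and target
        embeddings. Novel states the predictor has not yet fit score high."""
        self.target.eval()  # BatchNorm uses running stats at reward time
        err = self.predictor(obs) - self.target(obs)
        return err.pow(2).mean(dim=-1)

    def update_predictor(self, obs: torch.Tensor) -> float:
        """Distillation step: fit the predictor to the detached target on a
        batch of visited states, driving their future bonus toward zero."""
        self.target.train()  # BatchNorm uses batch statistics while training
        loss = F.mse_loss(self.predictor(obs), self.target(obs).detach())
        self.opt.zero_grad()
        loss.backward()
        self.opt.step()
        return loss.item()


if __name__ == "__main__":
    rewarder = SSNDIntrinsicReward(obs_dim=128)
    batch = torch.randn(32, 128)                   # stand-in observation batch
    print(rewarder.intrinsic_reward(batch).shape)  # torch.Size([32])
    print(rewarder.update_predictor(batch))        # scalar distillation loss
```

In a sketch like this the intrinsic reward would be added to the environment reward when training the policy; because the target is trainable here, any self-supervised update to it would change the target's outputs and re-raise the predictor's distillation error on previously visited states.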
Pages: 18