TVENet: Transformer-Based Visual Exploration Network for Mobile Robot in Unseen Environment

Cited by: 3
Authors
Zhang, Tianyao [1 ,2 ]
Hu, Xiaoguang [1 ]
Xiao, Jin [1 ]
Zhang, Guofeng [1 ]
Affiliations
[1] Beihang Univ, Sch Automat Sci & Elect Engn, Beijing, Peoples R China
[2] Beihang Univ, ShenYuan Honors Coll, Beijing 100191, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Visualization; Feature extraction; Navigation; Task analysis; Robots; Transformers; Training; Active perception; embodied AI; learning for navigation; visual exploration; visual navigation; NAVIGATION; VISION;
DOI
10.1109/ACCESS.2022.3181989
CLC number
TP [Automation and computer technology]
Discipline code
0812
Abstract
This paper presents the Transformer-based Visual Exploration Network (TVENet), a solution for active perception problems, in particular the visual exploration problem: how can a robot equipped with a camera explore an unknown 3D environment? TVENet consists of a Mapper, a Global Policy, and a Local Policy. The Mapper is trained by supervised learning to take visual observations as input and generate an occupancy grid map of the explored environment. The Global Policy and the Local Policy are trained by reinforcement learning to make navigation decisions. Most state-of-the-art methods in the visual exploration domain use a ResNet as the feature extractor, and few of them pay attention to the extractor's representational capability. This paper therefore focuses on enhancing that capability and proposes a Transformer-based Feature Pyramid Module (TFPM). In addition, two training tricks (M.F. and Aux.) are introduced to improve performance. Experiments in a photo-realistic simulated environment (Habitat) demonstrate the higher mapping accuracy of TVENet. The results show that the TFPM and the training tricks have a positive impact on mapping accuracy in visual exploration, improving it by 5.31% over the state of the art. Most importantly, TVENet is deployed on a real robot (an NVIDIA JetBot) to prove the feasibility of Embodied AI approaches. To the authors' knowledge, this paper is the first to demonstrate the viability of Embodied-AI-style approaches for visual exploration tasks and to deploy the pre-trained model on an NVIDIA Jetson robot.
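The abstract names the Transformer-based Feature Pyramid Module (TFPM) but gives no layer-level detail. The following is a minimal PyTorch sketch of one plausible design, assuming a ResNet-style multi-scale backbone, FPN-style top-down fusion, and a small Transformer encoder that applies global self-attention to the coarsest feature map; the channel widths, head count, and fusion scheme (and the class name itself) are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TFPM(nn.Module):
    # Hypothetical Transformer-based Feature Pyramid Module: project each
    # backbone stage to a common width, refine the coarsest map with a
    # Transformer encoder, then fuse top-down as in a standard FPN.
    def __init__(self, in_channels=(256, 512, 1024), dim=256,
                 n_heads=8, n_layers=2):
        super().__init__()
        self.lateral = nn.ModuleList(
            nn.Conv2d(c, dim, kernel_size=1) for c in in_channels)
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.smooth = nn.ModuleList(
            nn.Conv2d(dim, dim, kernel_size=3, padding=1)
            for _ in in_channels)

    def forward(self, feats):
        # feats: feature maps from fine to coarse (e.g. strides 8, 16, 32).
        laterals = [proj(f) for proj, f in zip(self.lateral, feats)]
        b, c, h, w = laterals[-1].shape
        tokens = laterals[-1].flatten(2).transpose(1, 2)  # (B, H*W, C)
        tokens = self.encoder(tokens)                     # global self-attention
        laterals[-1] = tokens.transpose(1, 2).reshape(b, c, h, w)
        for i in range(len(laterals) - 2, -1, -1):        # top-down pathway
            laterals[i] = laterals[i] + F.interpolate(
                laterals[i + 1], size=laterals[i].shape[-2:], mode="nearest")
        return [s(f) for s, f in zip(self.smooth, laterals)]

if __name__ == "__main__":
    feats = [torch.randn(1, ch, sz, sz)
             for ch, sz in zip((256, 512, 1024), (32, 16, 8))]
    print([tuple(o.shape) for o in TFPM()(feats)])  # three (1, 256, H, W) maps

Attending globally only at the coarsest level keeps the token count small (here 8 x 8 = 64 tokens), which is a common way to add Transformer context to a feature pyramid without paying the quadratic attention cost over the finest maps.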
Pages: 62056-62072
Page count: 17