Hybrid CNN-Transformer Features for Visual Place Recognition

Cited by: 34
Authors
Wang, Yuwei [1 ]
Qiu, Yuanying [1 ]
Cheng, Peitao [1 ]
Zhang, Junyu [1 ]
Affiliations
[1] Xidian Univ, Sch Mechanoelect Engn, Key Lab Elect Equipment Struct Design, Minist Educ, Xian 710071, Shaanxi, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Transformers; Semantics; Visualization; Feature extraction; Convolutional neural networks; Task analysis; Training; Adaptive triplet loss; hybrid CNN-transformer; semantic NetVLAD aggregation; visual place recognition;
DOI
10.1109/TCSVT.2022.3212434
Chinese Library Classification (CLC)
TM [Electrical Engineering]; TN [Electronics and Communication Technology];
Discipline codes
0808; 0809;
Abstract
Visual place recognition is a challenging problem in robotics and autonomous systems because scenes undergo appearance and viewpoint changes in a changing world. Existing state-of-the-art methods rely heavily on CNN-based architectures. However, CNNs cannot effectively model the spatial structure of an image because of their inherent locality. To address this issue, this paper proposes a novel Transformer-based place recognition method that combines local details, spatial context, and semantic information for image feature embedding. First, to overcome the inherent locality of the convolutional neural network (CNN), a hybrid CNN-Transformer feature extraction network is introduced. The network uses a CNN-based feature pyramid to capture detailed visual information, while the vision Transformer models image-level context and dynamically aggregates task-related features. Specifically, the multi-level output tokens from the Transformer are fed into a single Transformer encoder block to fuse multi-scale spatial information. Second, to acquire multi-scale semantic information, a global semantic NetVLAD aggregation strategy is constructed. This strategy employs a semantic-enhanced NetVLAD, which imposes prior knowledge on the terms of the Vector of Locally Aggregated Descriptors (VLAD), to aggregate the multi-level token maps, and then concatenates the multi-level semantic features into a global descriptor. Finally, to alleviate the suboptimal convergence caused by the fixed margin of the standard triplet loss, an adaptive triplet loss with a dynamic margin is proposed. Extensive experiments on public datasets show that the learned features are robust to appearance and viewpoint changes and achieve promising performance compared with state-of-the-art methods.
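Two ingredients named in the abstract can be sketched in code for orientation: a NetVLAD aggregation layer in its standard form (Arandjelovic et al.), and a triplet loss whose margin varies per triplet rather than being fixed. The PyTorch sketch below is a minimal illustration only: the semantic weighting of the VLAD terms and the paper's actual dynamic-margin schedule are not given in the abstract and are not reproduced here; the interpolation rule used for the margin is an assumed placeholder.

# Minimal sketch: standard NetVLAD aggregation plus a triplet loss with a
# per-triplet (dynamic) margin. The margin rule is an illustrative assumption,
# not the paper's formulation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class NetVLAD(nn.Module):
    """Standard NetVLAD: soft-assign local features to K learned centroids
    and aggregate the residuals into a (K*D)-dimensional global descriptor."""

    def __init__(self, dim: int, num_clusters: int = 64):
        super().__init__()
        self.assign = nn.Conv2d(dim, num_clusters, kernel_size=1)
        self.centroids = nn.Parameter(torch.randn(num_clusters, dim) * 0.1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, D, H, W) feature map, e.g. Transformer tokens reshaped to a grid
        soft = self.assign(x).flatten(2).softmax(dim=1)                  # (B, K, N)
        feats = x.flatten(2)                                             # (B, D, N)
        resid = feats.unsqueeze(1) - self.centroids[None, :, :, None]    # (B, K, D, N)
        vlad = (soft.unsqueeze(2) * resid).sum(dim=-1)                   # (B, K, D)
        vlad = F.normalize(vlad, dim=2).flatten(1)                       # intra-normalise
        return F.normalize(vlad, dim=1)                                  # (B, K*D)


def adaptive_triplet_loss(anchor, positive, negative,
                          margin_min: float = 0.1,
                          margin_max: float = 0.5) -> torch.Tensor:
    """Triplet loss with a per-triplet margin (illustrative rule).

    The standard loss uses a fixed margin m: relu(d_pos - d_neg + m).
    Here the margin is interpolated between margin_min and margin_max based on
    how easy the negative currently is -- a hypothetical stand-in for the
    paper's dynamic-margin schedule, which the abstract does not specify.
    """
    d_pos = (anchor - positive).pow(2).sum(dim=1)
    d_neg = (anchor - negative).pow(2).sum(dim=1)
    easiness = torch.sigmoid(d_neg - d_pos)                              # in (0, 1)
    margin = margin_min + (margin_max - margin_min) * easiness
    return F.relu(d_pos - d_neg + margin).mean()


if __name__ == "__main__":
    vlad = NetVLAD(dim=256, num_clusters=16)
    # Hypothetical 14x14 token grids for anchor, positive, and negative images
    anchor_desc = vlad(torch.randn(4, 256, 14, 14))
    pos_desc = vlad(torch.randn(4, 256, 14, 14))
    neg_desc = vlad(torch.randn(4, 256, 14, 14))
    print(anchor_desc.shape, adaptive_triplet_loss(anchor_desc, pos_desc, neg_desc).item())

In this sketch each feature map is aggregated independently; the paper additionally fuses and concatenates descriptors from several Transformer levels, which is omitted here for brevity.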
Pages: 1109-1122
Page count: 14