MatchFormer: Interleaving Attention in Transformers for Feature Matching

Cited by: 24
|
Authors
Wang, Qing [1 ]
Zhang, Jiaming [1 ]
Yang, Kailun [1 ]
Peng, Kunyu [1 ]
Stiefelhagen, Rainer [1 ]
Affiliations
[1] Karlsruhe Inst Technol, Karlsruhe, Germany
Source
COMPUTER VISION - ACCV 2022, PT III | 2023, Vol. 13843
Keywords
Feature matching; Vision transformers;
D O I
10.1007/978-3-031-26313-2_16
CLC Classification
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Local feature matching is a computationally intensive task at the subpixel level. While detector-based methods coupled with feature descriptors struggle in low-texture scenes, CNN-based methods with a sequential extract-to-match pipeline fail to make use of the matching capacity of the encoder and tend to overburden the decoder for matching. In contrast, we propose a novel hierarchical extract-and-match transformer, termed MatchFormer. Inside each stage of the hierarchical encoder, we interleave self-attention for feature extraction and cross-attention for feature matching, yielding a human-intuitive extract-and-match scheme. Such a match-aware encoder relieves the overloaded decoder and makes the model highly efficient. Further, combining self- and cross-attention on multi-scale features in a hierarchical architecture improves matching robustness, particularly in low-texture indoor scenes or with less outdoor training data. Thanks to this strategy, MatchFormer is a multi-win solution in efficiency, robustness, and precision. Compared to the previous best method in indoor pose estimation, our lite MatchFormer has only 45% of the GFLOPs, yet achieves a +1.3% precision gain and a 41% running speed boost. The large MatchFormer reaches state-of-the-art on four different benchmarks, including indoor pose estimation (ScanNet), outdoor pose estimation (MegaDepth), homography estimation and image matching (HPatches), and visual localization (InLoc).
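The interleaved extract-and-match scheme described in the abstract can be illustrated with a minimal NumPy sketch. This is only a toy illustration of the idea, not the paper's implementation: linear projections, multi-head splits, positional encodings, efficient attention, and the hierarchical downsampling of the actual MatchFormer are all omitted, and every function and variable name here is illustrative.

```python
import numpy as np

def attention(q_feats, kv_feats, dim=32):
    """Scaled dot-product attention: query tokens attend to key/value tokens."""
    scores = q_feats @ kv_feats.T / np.sqrt(dim)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ kv_feats

def interleaved_stage(feats_a, feats_b, dim=32):
    """One toy encoder stage: self-attention (extract), then cross-attention (match)."""
    # Self-attention: each image's tokens attend to themselves (feature extraction).
    feats_a = feats_a + attention(feats_a, feats_a, dim)
    feats_b = feats_b + attention(feats_b, feats_b, dim)
    # Cross-attention: each image's tokens attend to the other image (feature matching),
    # so match awareness is built into the encoder rather than left to the decoder.
    new_a = feats_a + attention(feats_a, feats_b, dim)
    new_b = feats_b + attention(feats_b, feats_a, dim)
    return new_a, new_b

rng = np.random.default_rng(0)
a = rng.standard_normal((64, 32))  # 64 tokens from image A, feature dim 32
b = rng.standard_normal((64, 32))  # 64 tokens from image B
for _ in range(3):  # stacked stages; real stages would also downsample hierarchically
    a, b = interleaved_stage(a, b)
print(a.shape, b.shape)  # (64, 32) (64, 32)
```

Contrast this with a sequential extract-to-match pipeline, where all stages run self-attention only and cross-image interaction happens once, after extraction; interleaving instead lets every stage refine features with knowledge of the other image.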
Pages: 256 - 273
Page count: 18
Related Papers
50 records total
  • [11] Self-attention in vision transformers performs perceptual grouping, not attention
    Mehrani, Paria
    Tsotsos, John K.
    FRONTIERS IN COMPUTER SCIENCE, 2023, 5
  • [12] Feature matching method: Sparse feature tree
    Department of Computer Science and Engineering, Fudan University, Shanghai 200433, China
    RUAN JIAN XUE BAO, 2006, (5): 1026 - 1033
  • [13] Multi-Manifold Attention for Vision Transformers
    Konstantinidis, Dimitrios
    Papastratis, Ilias
    Dimitropoulos, Kosmas
    Daras, Petros
    IEEE ACCESS, 2023, 11 : 123433 - 123444
  • [14] MSGA-Net: Progressive Feature Matching via Multi-Layer Sparse Graph Attention
    Gong, Zhepeng
    Xiao, Guobao
    Shi, Ziwei
    Chen, Riqing
    Yu, Jun
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2024, 34 (07) : 5765 - 5775
  • [15] Feature Matching and Position Matching Between Optical and SAR With Local Deep Feature Descriptor
    Liao, Yun
    Di, Yide
    Zhou, Hao
    Li, Anran
    Liu, Junhui
    Lu, Mingyu
    Duan, Qing
    IEEE JOURNAL OF SELECTED TOPICS IN APPLIED EARTH OBSERVATIONS AND REMOTE SENSING, 2022, 15 : 448 - 462
  • [16] FEATURE MATCHING IN GROWING DATABASES
    Pires, Bernardo Rodrigues
    Moura, Jose M. F.
    2012 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING (ICIP 2012), 2012, : 1913 - 1916
  • [17] PROGRESSIVE FILTERING FOR FEATURE MATCHING
    Jiang, Xingyu
    Ma, Jiayi
    Chen, Jun
    2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2019, : 2217 - 2221
  • [18] Feature Descriptor Learning Based on Sparse Feature Matching
    Song, Dengpan
    Liu, Shiyuan
    Kang, Ruirui
    Ai, Danni
    2021 THE 5TH INTERNATIONAL CONFERENCE ON VIDEO AND IMAGE PROCESSING, ICVIP 2021, 2021, : 62 - 68
  • [19] AMatFormer: Efficient Feature Matching via Anchor Matching Transformer
    Jiang, Bo
    Luo, Shuxian
    Wang, Xiao
    Li, Chuanfu
    Tang, Jin
    IEEE TRANSACTIONS ON MULTIMEDIA, 2024, 26 : 1504 - 1515
  • [20] A Feature Map Adversarial Attack Against Vision Transformers
    Altoub, Majed
    Mehmood, Rashid
    AlQurashi, Fahad
    Alqahtany, Saad
    Alsulami, Bassma
    INTERNATIONAL JOURNAL OF ADVANCED COMPUTER SCIENCE AND APPLICATIONS, 2024, 15 (10) : 962 - 968