CoAtRSNet: Fully Exploiting Convolution and Attention for Stereo Matching by Region Separation

Cited by: 12
Authors
Cheng, Junda [1 ]
Xu, Gangwei [1 ]
Guo, Peng [1 ]
Yang, Xin [1 ]
Affiliations
[1] Huazhong Univ Sci & Technol, Sch Elect Informat & Commun, Wuhan 430074, Hubei, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Stereo matching; Attention; Region separation; Content-dependent interaction; COST AGGREGATION;
DOI
10.1007/s11263-023-01872-0
CLC number
TP18 [Artificial Intelligence Theory];
Discipline codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Stereo matching is a fundamental technique for many vision and robotics applications. State-of-the-art methods either employ convolutional neural networks with spatially-shared kernels or use content-dependent interactions (e.g., local or global attention) to augment convolution operations. Despite the great improvements achieved, existing methods either suffer from the high computational cost of global attention operations or from suboptimal performance at edge regions caused by spatially-shared convolutions. In this paper, we propose a CoAtRS stereo matching method that fully exploits the complementary advantages of convolution and attention via region separation. Our method adaptively adopts the most suitable feature extraction and aggregation patterns for smooth and edge regions at a lower computational cost. In addition, we propose D-global attention, which performs global filtering along the disparity dimension to better fuse the cost volumes of different regions and alleviate the locality defects of convolutions. Our CoAtRS stereo matching method can also be conveniently embedded in various existing 3D CNN stereo networks; the resulting networks achieve significant improvements in both accuracy and efficiency. Furthermore, we design an accurate network (named CoAtRSNet) that achieves state-of-the-art results on five public datasets. At the time of writing, CoAtRSNet ranks 1st-3rd on all the metrics published on the ETH3D website, ranks 2nd on Scene Flow, and on the Middlebury benchmark ranks 1st for the Root-Mean-Square metric, 2nd for the average error metric, and 3rd for the bad 0.5 metric.
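The abstract describes D-global attention as global filtering along the disparity dimension of the cost volume. The record does not give the paper's actual formulation, so the following is only an illustrative sketch under assumed simplifications: a single-head attention where each disparity level of a pixel's cost vector attends to all other levels at that pixel, using the raw scalar costs as queries, keys, and values (the hypothetical function name `d_global_attention` and the (D, H, W) cost-volume layout are assumptions, not the paper's API).

```python
import numpy as np

def d_global_attention(cost_volume):
    """Sketch: attention applied globally along the disparity dimension.

    cost_volume: array of shape (D, H, W) holding the matching cost of
    each candidate disparity at every pixel. Each pixel's D-length cost
    vector is treated as a sequence; every disparity level attends to
    all others at the same pixel (learned projections and scaling from
    the real method are omitted here).
    """
    D, H, W = cost_volume.shape
    x = cost_volume.reshape(D, H * W).T            # (HW, D) cost vector per pixel
    scores = x[:, :, None] * x[:, None, :]         # (HW, D, D) pairwise affinities
    # Softmax over the disparity dimension gives global attention weights.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    out = np.einsum('nij,nj->ni', weights, x)      # globally filtered costs
    return out.T.reshape(D, H, W)
```

Because the weights span all D disparity levels at once, each output cost is a global mixture over the full disparity range, which is the "global filtering on the disparity dimension" behaviour the abstract attributes to D-global attention.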
Pages: 56-73
Page count: 18