Masked-attention Mask Transformer for Universal Image Segmentation

Cited by: 1416
Authors
Cheng, Bowen [1 ,2 ]
Misra, Ishan [1 ]
Schwing, Alexander G. [2 ]
Kirillov, Alexander [1 ]
Girdhar, Rohit [1 ]
Affiliations
[1] Facebook AI Research (FAIR), Menlo Park, CA 94025, USA
[2] University of Illinois Urbana-Champaign (UIUC), Champaign, IL 61820, USA
Source
2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022) | 2022
Keywords
DOI
10.1109/CVPR52688.2022.00135
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
Image segmentation groups pixels with different semantics, e.g., category or instance membership. Each choice of semantics defines a task. While only the semantics of each task differ, current research focuses on designing specialized architectures for each task. We present Masked-attention Mask Transformer (Mask2Former), a new architecture capable of addressing any image segmentation task (panoptic, instance or semantic). Its key components include masked attention, which extracts localized features by constraining cross-attention within predicted mask regions. In addition to reducing the research effort by at least three times, it outperforms the best specialized architectures by a significant margin on four popular datasets. Most notably, Mask2Former sets a new state-of-the-art for panoptic segmentation (57.8 PQ on COCO), instance segmentation (50.1 AP on COCO) and semantic segmentation (57.7 mIoU on ADE20K).
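The abstract's central mechanism, masked attention, restricts each query's cross-attention to the image region covered by that query's predicted mask from the previous decoder layer. Below is a minimal PyTorch sketch of that idea; the function name, tensor shapes, the 0.5 threshold, and the empty-mask fallback are illustrative assumptions rather than the paper's exact implementation.

# Minimal sketch (assumption): masked cross-attention as described in the abstract,
# restricting each query's attention to its predicted mask region.
import torch

def masked_cross_attention(queries, keys, values, mask_probs, threshold=0.5):
    # queries:     (B, Q, C) per-query embeddings
    # keys/values: (B, N, C) flattened image features
    # mask_probs:  (B, Q, N) mask probabilities predicted for each query
    scale = queries.shape[-1] ** -0.5
    logits = torch.einsum("bqc,bnc->bqn", queries, keys) * scale

    keep = mask_probs > threshold                  # foreground region per query
    # Fallback (assumption): a query whose predicted mask is empty attends everywhere,
    # otherwise the softmax over all -inf entries would produce NaNs.
    keep = keep | ~keep.any(dim=-1, keepdim=True)

    logits = logits.masked_fill(~keep, float("-inf"))
    attn = logits.softmax(dim=-1)
    return torch.einsum("bqn,bnc->bqc", attn, values)

In the full architecture, a module like this would stand in for the standard cross-attention layer of each Transformer decoder block, with mask_probs taken from the previous layer's mask predictions.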
Pages: 1280-1289
Number of pages: 10