Masked-attention Mask Transformer for Universal Image Segmentation

Cited by: 1047
Authors
Cheng, Bowen [1, 2]
Misra, Ishan [1 ]
Schwing, Alexander G. [2 ]
Kirillov, Alexander [1 ]
Girdhar, Rohit [1 ]
Affiliations
[1] Facebook AI Res FAIR, Menlo Pk, CA 94025 USA
[2] Univ Illinois Urbana Champaign UIUC, Champaign, IL 61820 USA
Source
2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022) | 2022
Keywords
DOI
10.1109/CVPR52688.2022.00135
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Image segmentation groups pixels with different semantics, e.g., category or instance membership. Each choice of semantics defines a task. While only the semantics of each task differ, current research focuses on designing specialized architectures for each task. We present Masked-attention Mask Transformer (Mask2Former), a new architecture capable of addressing any image segmentation task (panoptic, instance or semantic). Its key components include masked attention, which extracts localized features by constraining cross-attention within predicted mask regions. In addition to reducing the research effort by at least three times, it outperforms the best specialized architectures by a significant margin on four popular datasets. Most notably, Mask2Former sets a new state-of-the-art for panoptic segmentation (57.8 PQ on COCO), instance segmentation (50.1 AP on COCO) and semantic segmentation (57.7 mIoU on ADE20K).
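The masked attention described in the abstract can be illustrated with a minimal single-head sketch in plain Python. It is not the authors' implementation: the function name masked_cross_attention, the list-of-lists layout, the 0.5 mask threshold (logit > 0), and the fallback to full attention when a predicted mask is empty are all assumptions made for illustration. The only point it shows is that each query attends solely to pixels inside its own predicted mask.

```python
import math

def masked_cross_attention(queries, keys, values, mask_logits):
    """Single-head cross-attention restricted to each query's predicted mask.

    queries:     N vectors of length d (one per object query)
    keys/values: P vectors (one per pixel of the image feature map)
    mask_logits: N rows of P logits (hypothetical mask prediction per query)
    """
    d = len(queries[0])
    outputs, all_weights = [], []
    for q, logits in zip(queries, mask_logits):
        # Assumed threshold: sigmoid(x) > 0.5 is equivalent to x > 0.
        keep = [x > 0.0 for x in logits]
        # Assumed fallback: an empty mask would leave nothing to attend to,
        # so attend to every pixel instead.
        if not any(keep):
            keep = [True] * len(keys)
        # Scaled dot-product scores; pixels outside the mask get -inf.
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(d) if m else -math.inf
            for k, m in zip(keys, keep)
        ]
        # Numerically stable softmax; exp(-inf) is exactly 0.0.
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted sum of the value vectors.
        out = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


queries = [[1.0, 0.0], [0.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
# Query 0 has a mask covering only pixel 0; query 1 has an empty mask.
mask_logits = [[5.0, -5.0, -5.0], [-1.0, -1.0, -1.0]]
out, weights = masked_cross_attention(queries, keys, values, mask_logits)
```

In this toy input the first query places all of its attention on the single pixel inside its mask, while the second query, whose mask is empty, falls back to ordinary attention over all pixels.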
Pages: 1280-1289
Page count: 10