Multi-Task Learning With Multi-Query Transformer for Dense Prediction

Cited by: 24

Authors:

Xu, Yangyang [1]

Li, Xiangtai [2]

Yuan, Haobo [1]

Yang, Yibo [3]

Zhang, Lefei [1,4]

Affiliations:

[1] Wuhan Univ, Inst Artificial Intelligence, Sch Comp Sci, Wuhan 430072, Peoples R China

[2] Nanyang Technol Univ, S Lab, Singapore 637335, Singapore

[3] JD Explore Acad, Beijing 101111, Peoples R China

[4] Hubei Luojia Lab, Wuhan 430072, Peoples R China

Source:

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY | 2024 / Vol. 34 / No. 02

Funding:

National Natural Science Foundation of China;

Keywords:

Scene understanding; multi-task learning; dense prediction; transformers; NETWORK;

DOI:

10.1109/TCSVT.2023.3292995

Chinese Library Classification (CLC) codes:

TM [Electrical engineering]; TN [Electronic and communication technology];

Discipline classification codes:

0808 ; 0809 ;

Abstract:

Previous multi-task dense prediction studies developed complex pipelines such as multi-modal distillations in multiple stages or searching for task relational contexts for each task. The core insight beyond these methods is to maximize the mutual effects of each task. Inspired by the recent query-based Transformers, we propose a simple pipeline named Multi-Query Transformer (MQTransformer) that is equipped with multiple queries from different tasks to facilitate the reasoning among multiple tasks and simplify the cross-task interaction pipeline. Instead of modeling the dense per-pixel context among different tasks, we seek a task-specific proxy to perform cross-task reasoning via multiple queries where each query encodes the task-related context. The MQTransformer is composed of three key components: shared encoder, cross-task query attention module and shared decoder. We first model each task with a task-relevant query. Then both the task-specific feature output by the feature extractor and the task-relevant query are fed into the shared encoder, thus encoding the task-relevant query from the task-specific feature. Secondly, we design a cross-task query attention module to reason the dependencies among multiple task-relevant queries; this enables the module to only focus on the query-level interaction. Finally, we use a shared decoder to gradually refine the image features with the reasoned query features from different tasks. Extensive experiment results on two dense prediction datasets (NYUD-v2 and PASCAL-Context) show that the proposed method is an effective approach and achieves state-of-the-art results.
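The three-stage flow described in the abstract (shared encoder, cross-task query attention, shared decoder) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: it uses plain scaled dot-product attention with no learned projections, and the function names (`attend`, `mq_sketch`), the residual update in the decoder step, and the choice to let every task's features read from all reasoned queries are assumptions made for the sake of the example.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    """Scaled dot-product attention: each query becomes a weighted mix of the value rows."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out


def mq_sketch(task_feats, task_queries):
    """Hypothetical sketch of the pipeline in the abstract.

    task_feats:   dict task -> N feature vectors of size d (task-specific features)
    task_queries: dict task -> M query vectors of size d (task-relevant queries)
    """
    tasks = list(task_feats)

    # 1) Shared encoder: each task's queries read from that task's own features.
    encoded = {t: attend(task_queries[t], task_feats[t], task_feats[t]) for t in tasks}

    # 2) Cross-task query attention: queries from all tasks attend to one another,
    #    so interaction happens at the query level rather than per pixel.
    all_queries = [q for t in tasks for q in encoded[t]]
    reasoned = attend(all_queries, all_queries, all_queries)

    # 3) Shared decoder: features read from the reasoned queries.
    #    The residual addition is an assumption of this sketch.
    refined = {}
    for t in tasks:
        update = attend(task_feats[t], reasoned, reasoned)
        refined[t] = [[x + u for x, u in zip(f, r)]
                      for f, r in zip(task_feats[t], update)]
    return refined


if __name__ == "__main__":
    random.seed(0)
    names = ("segmentation", "depth")  # illustrative task names
    feats = {t: [[random.gauss(0, 1) for _ in range(4)] for _ in range(6)] for t in names}
    queries = {t: [[random.gauss(0, 1) for _ in range(4)] for _ in range(3)] for t in names}
    out = mq_sketch(feats, queries)
    print({t: (len(v), len(v[0])) for t, v in out.items()})
```

In this sketch the cross-task step operates on only a handful of query vectors per task, which is the point the abstract makes about avoiding dense per-pixel interaction between tasks.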


Pages: 1228-1240

Page count: 13
