Local-Global Context Aware Transformer for Language-Guided Video Segmentation

Cited by: 59
Authors
Liang, Chen [1 ]
Wang, Wenguan [1 ]
Zhou, Tianfei [2 ]
Miao, Jiaxu [1 ]
Luo, Yawei [1 ]
Yang, Yi [1 ]
Affiliations
[1] Zhejiang Univ, ReLER, CCAI, Hangzhou 310027, Zhejiang, Peoples R China
[2] Swiss Fed Inst Technol, CH-8092 Zurich, Switzerland
Funding
National Key R&D Program of China;
Keywords
Transformers; Task analysis; Visualization; Three-dimensional displays; Linguistics; Object segmentation; Grounding; Language-guided video segmentation; memory network; multi-modal transformer;
DOI
10.1109/TPAMI.2023.3262578
Chinese Library Classification
TP18 [Theory of Artificial Intelligence];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
We explore the task of language-guided video segmentation (LVS). Previous algorithms mostly adopt 3D CNNs to learn video representations, struggling to capture long-term context and easily suffering from visual-linguistic misalignment. In light of this, we present LOCATER (local-global context aware Transformer), which augments the Transformer architecture with a finite memory so as to query the entire video with the language expression in an efficient manner. The memory is designed with two components - one for persistently preserving global video content, and one for dynamically gathering local temporal context and segmentation history. Based on the memorized local-global context and the particular content of each frame, LOCATER holistically and flexibly comprehends the expression as an adaptive query vector for each frame, which is then used to query the corresponding frame for mask generation. The memory also allows LOCATER to process videos in linear time with constant-size memory, whereas Transformer-style self-attention scales quadratically with sequence length. To thoroughly examine the visual grounding capability of LVS models, we contribute a new LVS dataset, A2D-S+, which is built upon the A2D-S dataset but poses increased challenges in disambiguating among similar objects. Experiments on three LVS datasets and our A2D-S+ show that LOCATER outperforms previous state-of-the-art methods. Further, we won 1st place in the Referring Video Object Segmentation Track of the 3rd Large-scale Video Object Segmentation Challenge, where LOCATER served as the foundation for the winning solution.
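The abstract's core efficiency claim - a finite memory that yields linear time and constant-size storage over video length - can be illustrated with a toy sketch. This is not the authors' implementation: the global memory is approximated here by a simple exponential moving average, the local memory by a fixed-length FIFO of recent frame features, and the per-frame adaptive query by a plain weighted sum; all dimensions, names, and mixing weights are illustrative assumptions.

```python
import numpy as np
from collections import deque

def process_video(frames, expr_emb, k=3, alpha=0.9):
    """Toy frame-by-frame pass with a finite memory (illustrative only).

    frames:   (T, D) array of per-frame visual features
    expr_emb: (D,) language-expression embedding
    k:        local memory size (last k frames), keeping memory constant-size
    alpha:    decay for the persistent global memory (assumed EMA stand-in)
    """
    D = frames.shape[1]
    global_mem = np.zeros(D)           # persistently preserved global content
    local_mem = deque(maxlen=k)        # dynamically gathered local context
    queries = []
    for f in frames:                   # single pass -> linear in video length T
        global_mem = alpha * global_mem + (1 - alpha) * f
        local_ctx = np.mean(local_mem, axis=0) if local_mem else np.zeros(D)
        # adapt the expression into a per-frame query using memorized context
        query = expr_emb + 0.5 * global_mem + 0.5 * local_ctx
        queries.append(query)
        local_mem.append(f)
    return np.stack(queries)           # (T, D) adaptive query vectors
```

The point of the sketch is the complexity profile: each frame touches only the fixed-size memory rather than all previous frames, so cost grows linearly with T, in contrast to full self-attention over the whole sequence.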
Pages: 10055-10069 (15 pages)