MIST: Multi-modal Iterative Spatial-Temporal Transformer for Long-form Video Question Answering

Cited by: 44
Authors
Gao, Difei [1 ]
Zhou, Luowei [2 ,5 ]
Ji, Lei [3 ]
Zhu, Linchao [4 ]
Yang, Yi [4 ]
Shou, Mike Zheng [1 ]
Affiliations
[1] Natl Univ Singapore, Show Lab, Singapore, Singapore
[2] Microsoft, Albuquerque, NM USA
[3] Microsoft Res Asia, Beijing, Peoples R China
[4] Zhejiang Univ, Hangzhou, Peoples R China
[5] Google Brain, Mountain View, CA USA
Source
2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR) | 2023
Funding
National Research Foundation Singapore
DOI
10.1109/CVPR52729.2023.01419
CLC number
TP18 [Artificial Intelligence Theory]
Discipline codes
081104; 0812; 0835; 1405
Abstract
To build Video Question Answering (VideoQA) systems capable of assisting humans in daily activities, seeking answers from long-form videos with diverse and complex events is a must. Existing multi-modal VQA models achieve promising performance on images or short video clips, especially with the recent success of large-scale multi-modal pre-training. However, when extending these methods to long-form videos, new challenges arise. On the one hand, a dense video sampling strategy is computationally prohibitive. On the other hand, methods relying on sparse sampling struggle in scenarios that require multi-event and multi-granularity visual reasoning. In this work, we introduce a new model named Multi-modal Iterative Spatial-Temporal Transformer (MIST) to better adapt pre-trained models to long-form VideoQA. Specifically, MIST decomposes traditional dense spatial-temporal self-attention into cascaded segment and region selection modules that adaptively select the frames and image regions most relevant to the question. Visual concepts at different granularities are then processed efficiently through an attention module. In addition, MIST iteratively conducts selection and attention over multiple layers to support reasoning over multiple events. Experimental results on four VideoQA datasets (AGQA, NExT-QA, STAR, and Env-QA) show that MIST achieves state-of-the-art performance with superior efficiency. The code is available at github.com/showlab/mist.
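The cascaded selection-then-attention idea in the abstract can be illustrated with a minimal NumPy sketch. This is a hypothetical toy version, not the paper's implementation: the shapes, the mean-pooled segment scorer, the top-k values (`k_seg`, `k_reg`), and the residual update are all illustrative assumptions, and the paper uses learned Transformer layers rather than raw dot products.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mist_layer(q, segments, k_seg=2, k_reg=3):
    """One iteration of question-conditioned selection + attention (toy sketch).

    q        : (d,)       question embedding
    segments : (S, R, d)  S video segments, each with R region features
    Returns an updated question embedding of shape (d,).
    """
    S, R, d = segments.shape
    # Segment selection: score each segment by its mean-pooled feature.
    seg_feats = segments.mean(axis=1)                    # (S, d)
    top_segs = np.argsort(seg_feats @ q)[-k_seg:]        # top-k segment indices
    # Region selection within the chosen segments only.
    regions = segments[top_segs].reshape(-1, d)          # (k_seg * R, d)
    top_regs = regions[np.argsort(regions @ q)[-k_reg:]] # (k_reg, d)
    # Attention over the selected regions, conditioned on the question.
    attn = softmax(top_regs @ q / np.sqrt(d))            # (k_reg,)
    return q + attn @ top_regs                           # residual update

def mist(q, segments, n_layers=2):
    """Iterate selection + attention over layers to reason over multiple events."""
    for _ in range(n_layers):
        q = mist_layer(q, segments)
    return q

rng = np.random.default_rng(0)
q = rng.standard_normal(8)               # toy question embedding
segs = rng.standard_normal((4, 5, 8))    # 4 segments x 5 regions x dim 8
out = mist(q, segs)
```

Because each layer attends only to `k_reg` regions drawn from `k_seg` segments instead of all `S * R` patches, the per-layer attention cost stays small, which is the efficiency argument the abstract makes; iterating the layers lets different segments be selected at different depths, supporting multi-event reasoning.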
Pages: 14773-14783
Page count: 11