Discovering the Real Association: Multimodal Causal Reasoning in Video Question Answering

Cited by: 22
Authors
Zang, Chuanqi [1]
Wang, Hanqing [1]
Pei, Mingtao [1]
Liang, Wei [1,2]
Affiliations
[1] Beijing Inst Technol, Sch Comp Sci & Technol, Beijing, Peoples R China
[2] Beijing Inst Technol, Yangtze Delta Reg Acad, Jiaxing, Peoples R China
Source
2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR) | 2023
Funding
National Natural Science Foundation of China
DOI
10.1109/CVPR52729.2023.01824
CLC Number
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Video Question Answering (VideoQA) is challenging because it requires capturing accurate correlations between modalities from redundant information. Recent methods focus on the explicit challenges of the task, e.g., multimodal feature extraction, video-text alignment, and fusion. Their frameworks infer the answer from statistical evidence, which ignores potential bias in the multimodal data. In this work, we investigate the relational structure of multimodal data from a causal representation perspective and propose a novel inference framework. For visual data, question-irrelevant objects may establish spurious matching associations with the answer. For textual data, the model prefers local phrase semantics, which may deviate from the global semantics of long sentences. Therefore, to improve the generalization of the model, we discover the real association by explicitly capturing visual features that are causally related to the question semantics and by weakening the impact of local language semantics on question answering. Experimental results on two large causal VideoQA datasets verify that our proposed framework 1) improves the accuracy of existing VideoQA backbones and 2) demonstrates robustness on complex scenes and questions. The code will be released at https://github.com/Chuanqi-Zang/Discovering-the-Real-Association.
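The abstract's key idea of suppressing question-irrelevant visual evidence can be illustrated with a minimal question-guided attention sketch. This is hypothetical code, not the authors' implementation: the function `question_guided_selection`, the feature shapes, and the softmax reweighting are all assumptions made for illustration.

```python
import numpy as np

def question_guided_selection(visual_feats, question_emb, temperature=1.0):
    # Score each object feature by dot-product similarity to the question
    # embedding, then softmax-reweight so that question-irrelevant objects
    # contribute less to the fused visual representation.
    scores = visual_feats @ question_emb            # shape: (num_objects,)
    scores = scores - scores.max()                  # numerical stability
    weights = np.exp(scores / temperature)
    weights /= weights.sum()                        # softmax over objects
    fused = weights @ visual_feats                  # shape: (feat_dim,)
    return weights, fused

rng = np.random.default_rng(0)
object_feats = rng.normal(size=(5, 8))   # 5 detected objects, 8-dim features
question_emb = rng.normal(size=8)        # pooled question embedding
weights, fused = question_guided_selection(object_feats, question_emb)
```

In this toy setup, objects whose features align poorly with the question embedding receive near-zero weight, which is the spirit (though not the mechanism) of the causal feature selection the abstract describes.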
Pages: 19027-19036
Page count: 10