TLNet: Temporal Span Localization Network With Collaborative Graph Reasoning for Video Question Answering

Cited by: 0
Authors
Liang, Lili [1 ]
Sun, Guanglu [1 ]
Li, Tianlin [1 ]
Liu, Shuai [2 ]
Ding, Weiping [3 ]
Affiliations
[1] Harbin Univ Sci & Technol, Sch Comp Sci & Technol, Harbin 150080, Peoples R China
[2] Hunan Normal Univ, Sch Educ Sci, Changsha 410081, Peoples R China
[3] Nantong Univ, Sch Artificial Intelligence & Comp Sci, Nantong 226019, Peoples R China
Source
IEEE TRANSACTIONS ON EMERGING TOPICS IN COMPUTATIONAL INTELLIGENCE | 2024
Keywords
Cognition; Location awareness; Transformers; Spatiotemporal phenomena; Proposals; Feature extraction; Education; Annotations; Semantics; Collaboration; Multi-modal learning; spatiotemporal reasoning; temporal localization; video question answering; video understanding; LANGUAGE
DOI
10.1109/TETCI.2024.3452751
CLC Classification Number
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Video question answering (VideoQA) has witnessed remarkable progress in the past few years, but precisely locating question-related segments and reasoning about spatiotemporal relationships remain challenging. Targeting these challenges, a Temporal Span Localization Network (TLNet) is proposed, which comprises Temporal Span Localization (TSL) and Collaborative Graph Reasoning (CGR). TSL precisely locates question-related segments by employing a cross-modal attention localization strategy that predicts the start and end moments of temporal span proposals; the proposals are then refined through a binarized alignment fusion approach. Furthermore, CGR combines the graph structure and the Transformer to reason about spatiotemporal relationships and acquire unbiased intra- and inter-modal cues for answering questions. Specifically, the Transformer is enhanced by leveraging information from the edges and nodes of different modality graphs, which effectively guides the multi-head attention. Channel-Wise Normalization (CW Norm) is integrated into the Transformer to debias intra- and inter-modal cues and optimize network performance. Experimental evaluations on the TVQA and TVQA+ datasets demonstrate that TLNet outperforms previous state-of-the-art methods, and extensive ablation studies confirm the effectiveness of its key components.
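As a reading aid only (not code from the paper), the sketch below illustrates the general idea behind span-style temporal localization as the abstract describes it: question tokens are attended by video frame features, and two heads score each frame as the start or end moment of a question-related span. All module names, shapes, and hyperparameters here are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SpanLocalizer(nn.Module):
    """Hypothetical sketch of cross-modal span localization (not the paper's
    exact TSL module): frames attend over question tokens, then start/end
    heads score each frame as a span boundary."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.start_head = nn.Linear(dim, 1)
        self.end_head = nn.Linear(dim, 1)

    def forward(self, video: torch.Tensor, question: torch.Tensor):
        # video: (B, T, D) frame features; question: (B, L, D) token features.
        # Each frame queries the question tokens for relevant evidence.
        fused, _ = self.cross_attn(query=video, key=question, value=question)
        start_logits = self.start_head(fused).squeeze(-1)  # (B, T)
        end_logits = self.end_head(fused).squeeze(-1)      # (B, T)
        # Per-frame probabilities of being the span's start/end moment.
        return start_logits.softmax(-1), end_logits.softmax(-1)

# Toy usage with random features: argmax gives a (start, end) frame proposal.
loc = SpanLocalizer()
p_start, p_end = loc(torch.randn(2, 60, 256), torch.randn(2, 12, 256))
proposal = (p_start.argmax(-1), p_end.argmax(-1))
```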
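The abstract also mentions a Channel-Wise Normalization (CW Norm) inside the Transformer. The paper's exact formulation is not given here; the following is a guess at the generic idea, standardizing each feature channel over the token/time axis (the reverse of LayerNorm, which normalizes over channels per token). Names and the epsilon value are assumptions.

```python
import torch
import torch.nn as nn

class ChannelWiseNorm(nn.Module):
    """Illustrative channel-wise normalization: per-channel statistics are
    computed over the token/time axis, with learnable scale and shift."""

    def __init__(self, dim: int, eps: float = 1e-5):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(dim))
        self.beta = nn.Parameter(torch.zeros(dim))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, D); mean/variance are taken per channel over the T axis.
        mean = x.mean(dim=1, keepdim=True)
        var = x.var(dim=1, keepdim=True, unbiased=False)
        return self.gamma * (x - mean) / torch.sqrt(var + self.eps) + self.beta
```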
Pages: 13