TLNet: Temporal Span Localization Network With Collaborative Graph Reasoning for Video Question Answering

Cited by: 0
Authors
Liang, Lili [1 ]
Sun, Guanglu [1 ]
Li, Tianlin [1 ]
Liu, Shuai [2 ]
Ding, Weiping [3 ]
Affiliations
[1] Harbin Univ Sci & Technol, Sch Comp Sci & Technol, Harbin 150080, Peoples R China
[2] Hunan Normal Univ, Sch Educ Sci, Changsha 410081, Peoples R China
[3] Nantong Univ, Sch Artificial Intelligence & Comp Sci, Nantong 226019, Peoples R China
Source
IEEE TRANSACTIONS ON EMERGING TOPICS IN COMPUTATIONAL INTELLIGENCE | 2024
Keywords
Cognition; Location awareness; Transformers; Spatiotemporal phenomena; Proposals; Feature extraction; Education; Annotations; Semantics; Collaboration; Multi-modal learning; spatiotemporal reasoning; temporal localization; video question answering; video understanding; LANGUAGE
DOI
10.1109/TETCI.2024.3452751
CLC Classification Number
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Video question answering (VideoQA) has witnessed remarkable progress in the past few years, but precisely locating question-related segments and reasoning about spatiotemporal relationships remain challenging. Targeting these challenges, a Temporal Span Localization Network (TLNet) is proposed, which comprises Temporal Span Localization (TSL) and Collaborative Graph Reasoning (CGR). TSL precisely locates question-related segments by employing a cross-modal attention localization strategy that predicts the start and end moments of temporal span proposals; the proposals are then refined through a binarized alignment fusion approach. Furthermore, CGR combines the graph structure and the Transformer to reason about spatiotemporal relationships and acquire unbiased intra- and inter-modal cues for answering questions. Specifically, the Transformer is enhanced by leveraging information from the edges and nodes of different modality graphs, which effectively guides the multi-head attention. Channel-Wise Normalization (CW Norm) is integrated into the Transformer to debias intra- and inter-modal cues and optimize network performance. Experimental evaluations on the TVQA and TVQA+ datasets demonstrate that TLNet outperforms previous state-of-the-art methods, and extensive ablation studies confirm the effectiveness of its key components.
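As a reading aid only (not code from the paper), the sketch below illustrates the general idea behind span-style temporal localization as the abstract describes it: question tokens are attended by video frame features, and two heads score each frame as the start or end moment of a question-related span. All module names, shapes, and hyperparameters here are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SpanLocalizer(nn.Module):
    """Hypothetical sketch of cross-modal span localization (not the paper's
    exact TSL module): frames attend over question tokens, then start/end
    heads score each frame as a span boundary."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.start_head = nn.Linear(dim, 1)
        self.end_head = nn.Linear(dim, 1)

    def forward(self, video: torch.Tensor, question: torch.Tensor):
        # video: (B, T, D) frame features; question: (B, L, D) token features.
        # Each frame queries the question tokens for relevant evidence.
        fused, _ = self.cross_attn(query=video, key=question, value=question)
        start_logits = self.start_head(fused).squeeze(-1)  # (B, T)
        end_logits = self.end_head(fused).squeeze(-1)      # (B, T)
        # Per-frame probabilities of being the span's start/end moment.
        return start_logits.softmax(-1), end_logits.softmax(-1)

# Toy usage with random features: argmax gives a (start, end) frame proposal.
loc = SpanLocalizer()
p_start, p_end = loc(torch.randn(2, 60, 256), torch.randn(2, 12, 256))
proposal = (p_start.argmax(-1), p_end.argmax(-1))
```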
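The abstract also mentions a Channel-Wise Normalization (CW Norm) inside the Transformer. The paper's exact formulation is not given here; the following is a guess at the generic idea, standardizing each feature channel over the token/time axis (the reverse of LayerNorm, which normalizes over channels per token). Names and the epsilon value are assumptions.

```python
import torch
import torch.nn as nn

class ChannelWiseNorm(nn.Module):
    """Illustrative channel-wise normalization: per-channel statistics are
    computed over the token/time axis, with learnable scale and shift."""

    def __init__(self, dim: int, eps: float = 1e-5):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(dim))
        self.beta = nn.Parameter(torch.zeros(dim))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, D); mean/variance are taken per channel over the T axis.
        mean = x.mean(dim=1, keepdim=True)
        var = x.var(dim=1, keepdim=True, unbiased=False)
        return self.gamma * (x - mean) / torch.sqrt(var + self.eps) + self.beta
```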
Pages: 13