SEMANTIC ASSOCIATION NETWORK FOR VIDEO CORPUS MOMENT RETRIEVAL

Cited by: 6

Authors:

Kim, Dahyun [1]

Yoon, Sunjae [1]

Hong, Ji Woo [1]

Yoo, Chang D. [1]

Affiliations:

[1] Korea Advanced Institute of Science and Technology (KAIST), Daejeon, South Korea

Source:

2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) | 2022

Keywords:

Video Corpus Moment Retrieval; Video Moment Retrieval; Temporal Moment Localization; Localizing Moment; Vision Language Task;

DOI:

10.1109/ICASSP43922.2022.9747523

Chinese Library Classification (CLC) number:

O42 [Acoustics];

Discipline classification codes:

070206 ; 082403 ;

Abstract:

This paper considers the Semantic Association Network (SAN) for Video Corpus Moment Retrieval (VCMR), the task of localizing the temporal moment that best corresponds to a given text query within a corpus of videos. Collaboration among the semantics shared by the multi-modal inputs is essential for effectively understanding a video together with its subtitles and the text query. To enable this collaboration, SAN associates common semantics within the same modality (Intra Semantic Association) and across different modalities (Inter Semantic Association) using a dedicated module referred to as Modality Semantic Association (MSA). SAN surpasses the existing state-of-the-art performance on the TVR and DiDeMo benchmark datasets. Extensive ablation studies and qualitative analyses show the effectiveness of the proposed model.


Pages: 1720-1724

Page count: 5
