Adversarial Video Moment Retrieval by Jointly Modeling Ranking and Localization

Cited by: 21
Authors
Cao, Da [1]
Zeng, Yawen [1]
Wei, Xiaochi [2]
Nie, Liqiang [3]
Hong, Richang [4]
Qin, Zheng [1]
Affiliations
[1] Hunan Univ, Changsha, Peoples R China
[2] Baidu Inc, Beijing, Peoples R China
[3] Shandong Univ, Jinan, Shandong, Peoples R China
[4] Hefei Univ Technol, Hefei, Peoples R China
Source
MM '20: PROCEEDINGS OF THE 28TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA | 2020
Funding
National Natural Science Foundation of China;
Keywords
Video Moment Retrieval; Cross-Modal Retrieval; Adversarial Learning; Reinforcement Learning; Bayesian Personalized Ranking; LANGUAGE;
DOI
10.1145/3394171.3413841
CLC Number
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Retrieving video moments from an untrimmed video given a natural language query is a challenging task in both academia and industry. Although much effort has been made to address this issue, traditional video moment ranking methods are unable to generate reasonable video moment candidates, and video moment localization approaches are not applicable to large-scale retrieval scenarios. How to combine ranking and localization into a unified framework, so that they overcome each other's drawbacks and reinforce each other, is rarely considered. Toward this end, we contribute a novel solution that thoroughly investigates the video moment retrieval issue under the adversarial learning paradigm. The key to our solution is to formulate the video moment retrieval task as an adversarial learning problem with two tightly connected components. Specifically, a reinforcement learning model is employed as the generator to produce a set of possible video moments. Meanwhile, a pairwise ranking model is utilized as the discriminator to rank the generated video moments against the ground truth. Finally, the generator and the discriminator are mutually reinforced in the adversarial learning framework, which jointly optimizes the performance of both video moment ranking and video moment localization. Extensive experiments on two well-known datasets verify the effectiveness and rationality of our proposed solution.
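The abstract outlines a generator-discriminator loop: a reinforcement learning policy proposes candidate moments, and a Bayesian Personalized Ranking (BPR) style discriminator ranks them against the ground truth, with the two components reinforcing each other. The sketch below illustrates that loop in PyTorch under stated assumptions: simple MLP scorers over precomputed query and moment features, multinomial sampling as the policy, and the discriminator's sigmoid score as the REINFORCE reward. It is a reading of the abstract only, not the authors' released implementation or hyperparameters.

```python
# Minimal sketch of adversarial video moment retrieval as described in the abstract.
# Module names, feature dimensions, and the reward shaping are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class Generator(nn.Module):
    """Localization policy: scores candidate moments for a given query."""

    def __init__(self, dim=256):
        super().__init__()
        self.scorer = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, query, moments):
        # query: (dim,); moments: (num_candidates, dim)
        q = query.unsqueeze(0).expand(moments.size(0), -1)
        logits = self.scorer(torch.cat([q, moments], dim=-1)).squeeze(-1)
        return F.softmax(logits, dim=-1)  # policy over candidate moments


class Discriminator(nn.Module):
    """Pairwise ranker: higher score means a better query-moment match."""

    def __init__(self, dim=256):
        super().__init__()
        self.scorer = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, query, moment):
        return self.scorer(torch.cat([query, moment], dim=-1)).squeeze(-1)


def adversarial_step(gen, dis, g_opt, d_opt, query, candidates, gt_moment):
    # Discriminator update: BPR-style pairwise loss that ranks the ground-truth
    # moment above a moment sampled from the (frozen) generator policy.
    with torch.no_grad():
        probs = gen(query, candidates)
    neg_idx = torch.multinomial(probs, 1).item()
    pos_score = dis(query, gt_moment)
    neg_score = dis(query, candidates[neg_idx])
    d_loss = -F.logsigmoid(pos_score - neg_score)
    d_opt.zero_grad()
    d_loss.backward()
    d_opt.step()

    # Generator update: REINFORCE, using the detached discriminator score of the
    # sampled moment as reward, nudging the policy toward highly ranked moments.
    probs = gen(query, candidates)
    idx = torch.multinomial(probs, 1).item()
    reward = torch.sigmoid(dis(query, candidates[idx])).detach()
    g_loss = -torch.log(probs[idx] + 1e-8) * reward
    g_opt.zero_grad()
    g_loss.backward()
    g_opt.step()
    return d_loss.item(), g_loss.item()
```

In use, one would precompute query and moment embeddings (for example from a sentence encoder and sliding-window clip features), then alternate adversarial_step over the training set: the BPR term keeps the discriminator a ranker suited to large-scale retrieval, while the sampled reward lets localization improve without requiring differentiable moment selection.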
Pages: 898-906
Number of pages: 9