Adversarial Video Moment Retrieval by Jointly Modeling Ranking and Localization

Cited by: 21
Authors
Cao, Da [1]
Zeng, Yawen [1]
Wei, Xiaochi [2]
Nie, Liqiang [3]
Hong, Richang [4]
Qin, Zheng [1]
Affiliations
[1] Hunan Univ, Changsha, Peoples R China
[2] Baidu Inc, Beijing, Peoples R China
[3] Shandong Univ, Jinan, Shandong, Peoples R China
[4] Hefei Univ Technol, Hefei, Peoples R China
Source
MM '20: PROCEEDINGS OF THE 28TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA | 2020
Funding
National Natural Science Foundation of China;
Keywords
Video Moment Retrieval; Cross-Modal Retrieval; Adversarial Learning; Reinforcement Learning; Bayesian Personalized Ranking; LANGUAGE;
DOI
10.1145/3394171.3413841
CLC Number
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Retrieving video moments from an untrimmed video given a natural language query is a challenging task in both academia and industry. Although much effort has been made to address this issue, traditional video moment ranking methods are unable to generate reasonable video moment candidates, and video moment localization approaches are not applicable to large-scale retrieval scenarios. How to combine ranking and localization into a unified framework, so that they overcome each other's drawbacks and reinforce each other, is rarely considered. Toward this end, we contribute a novel solution that thoroughly investigates the video moment retrieval issue under the adversarial learning paradigm. The key to our solution is to formulate the video moment retrieval task as an adversarial learning problem with two tightly connected components. Specifically, a reinforcement learning model is employed as the generator to produce a set of possible video moments. Meanwhile, a pairwise ranking model is utilized as the discriminator to rank the generated video moments against the ground truth. Finally, the generator and the discriminator are mutually reinforced in the adversarial learning framework, which jointly optimizes the performance of both video moment ranking and video moment localization. Extensive experiments on two well-known datasets verify the effectiveness and rationality of our proposed solution.
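The abstract outlines a generator-discriminator loop: a reinforcement learning policy proposes candidate moments, and a Bayesian Personalized Ranking (BPR) style discriminator ranks them against the ground truth, with the two components reinforcing each other. The sketch below illustrates that loop in PyTorch under stated assumptions: simple MLP scorers over precomputed query and moment features, multinomial sampling as the policy, and the discriminator's sigmoid score as the REINFORCE reward. It is a reading of the abstract only, not the authors' released implementation or hyperparameters.

```python
# Minimal sketch of adversarial video moment retrieval as described in the abstract.
# Module names, feature dimensions, and the reward shaping are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class Generator(nn.Module):
    """Localization policy: scores candidate moments for a given query."""

    def __init__(self, dim=256):
        super().__init__()
        self.scorer = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, query, moments):
        # query: (dim,); moments: (num_candidates, dim)
        q = query.unsqueeze(0).expand(moments.size(0), -1)
        logits = self.scorer(torch.cat([q, moments], dim=-1)).squeeze(-1)
        return F.softmax(logits, dim=-1)  # policy over candidate moments


class Discriminator(nn.Module):
    """Pairwise ranker: higher score means a better query-moment match."""

    def __init__(self, dim=256):
        super().__init__()
        self.scorer = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, query, moment):
        return self.scorer(torch.cat([query, moment], dim=-1)).squeeze(-1)


def adversarial_step(gen, dis, g_opt, d_opt, query, candidates, gt_moment):
    # Discriminator update: BPR-style pairwise loss that ranks the ground-truth
    # moment above a moment sampled from the (frozen) generator policy.
    with torch.no_grad():
        probs = gen(query, candidates)
    neg_idx = torch.multinomial(probs, 1).item()
    pos_score = dis(query, gt_moment)
    neg_score = dis(query, candidates[neg_idx])
    d_loss = -F.logsigmoid(pos_score - neg_score)
    d_opt.zero_grad()
    d_loss.backward()
    d_opt.step()

    # Generator update: REINFORCE, using the detached discriminator score of the
    # sampled moment as reward, nudging the policy toward highly ranked moments.
    probs = gen(query, candidates)
    idx = torch.multinomial(probs, 1).item()
    reward = torch.sigmoid(dis(query, candidates[idx])).detach()
    g_loss = -torch.log(probs[idx] + 1e-8) * reward
    g_opt.zero_grad()
    g_loss.backward()
    g_opt.step()
    return d_loss.item(), g_loss.item()
```

In use, one would precompute query and moment embeddings (for example from a sentence encoder and sliding-window clip features), then alternate adversarial_step over the training set: the BPR term keeps the discriminator a ranker suited to large-scale retrieval, while the sampled reward lets localization improve without requiring differentiable moment selection.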
Pages: 898-906
Number of pages: 9