Find and Focus: Retrieve and Localize Video Events with Natural Language Queries

Cited by: 44

Authors:

Shao, Dian [1]

Xiong, Yu [1]

Zhao, Yue [1]

Huang, Qingqiu [1]

Qiao, Yu [2]

Lin, Dahua [1]

Affiliations:

[1] Chinese Univ Hong Kong, CUHK SenseTime Joint Lab, Shatin, Hong Kong, Peoples R China

[2] Chinese Acad Sci, Shenzhen Inst Adv Technol, SIAT SenseTime Joint Lab, Beijing, Peoples R China

Source:

COMPUTER VISION - ECCV 2018, PT IX | 2018 / Vol. 11213

Keywords:

DOI:

10.1007/978-3-030-01240-3_13

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

The thriving of video sharing services brings new challenges to video retrieval, e.g. the rapid growth in video duration and content diversity. Meeting such challenges calls for new techniques that can effectively retrieve videos with natural language queries. Existing methods along this line, which mostly rely on embedding videos as a whole, remain far from satisfactory for real-world applications due to the limited expressive power. In this work, we aim to move beyond this limitation by delving into the internal structures of both sides, the queries and the videos. Specifically, we propose a new framework called Find and Focus (FIFO), which not only performs top-level matching (paragraph vs. video), but also makes part-level associations, localizing a video clip for each sentence in the query with the help of a focusing guide. These levels are complementary - the top-level matching narrows the search while the part-level localization refines the results. On both ActivityNet Captions and modified LSMDC datasets, the proposed framework achieves remarkable performance gains (Project Page: https://ycxioooong.github.io/projects/fifo).
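The two levels described in the abstract can be illustrated with a minimal coarse-to-fine sketch. This is not the authors' FIFO implementation: the toy embedding vectors, the mean pooling, the cosine scoring, and the function names `find`, `focus`, and `find_and_focus` are all assumptions made for illustration, and the "focusing guide" mentioned in the abstract is not modeled. The sketch only shows how a top-level match can narrow the candidate set before a per-sentence clip match re-ranks it.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def mean_pool(vectors):
    """Average a list of vectors into one vector (assumed pooling choice)."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def find(query_sentences, videos, top_k):
    """Top-level matching: whole paragraph vs. whole video; keep top_k videos."""
    q = mean_pool(query_sentences)
    scored = [(cosine(q, mean_pool(clips)), vid) for vid, clips in videos.items()]
    scored.sort(reverse=True)
    return [vid for _, vid in scored[:top_k]]

def focus(query_sentences, clips):
    """Part-level association: best clip index per sentence and the mean score."""
    best, total = [], 0.0
    for sentence in query_sentences:
        sims = [cosine(sentence, clip) for clip in clips]
        idx = max(range(len(clips)), key=sims.__getitem__)
        best.append(idx)
        total += sims[idx]
    return best, total / len(query_sentences)

def find_and_focus(query_sentences, videos, top_k=2):
    """Narrow the search with find(), then re-rank candidates with focus()."""
    results = []
    for vid in find(query_sentences, videos, top_k):
        clip_ids, score = focus(query_sentences, videos[vid])
        results.append((score, vid, clip_ids))
    results.sort(reverse=True)
    return results

# Toy data (hypothetical): each video is a list of clip embeddings,
# and the query is a list of sentence embeddings.
videos = {
    "video_a": [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
    "video_b": [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]],
    "video_c": [[1.0, 1.0, 0.0], [1.0, 1.0, 0.0]],
}
query = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

ranked = find_and_focus(query, videos, top_k=2)
for score, vid, clip_ids in ranked:
    print(vid, round(score, 3), clip_ids)
```

In this toy setup, `video_a` and `video_c` look almost identical at the top level because their pooled embeddings point in the same direction. Only the part-level step separates them, since each query sentence finds its own matching clip in `video_a`.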


Pages: 202-218

Number of pages: 17
