An Empirical Study of Frame Selection for Text-to-Video Retrieval

Citations: 0

Authors:

Wu, Mengxia [1]

Cao, Min [1]

Bai, Yang [1]

Zeng, Ziyin [1]

Chen, Chen [2]

Nie, Liqiang [3]

Zhang, Min [1]

Affiliations:

[1] Soochow University, Suzhou, People's Republic of China

[2] Institute of Automation, Chinese Academy of Sciences, Beijing, People's Republic of China

[3] Harbin Institute of Technology, Shenzhen, People's Republic of China

Source:

Findings of the Association for Computational Linguistics: EMNLP 2023 | 2023

Funding:

U.S. National Science Foundation

DOI:

Not available

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

Text-to-video retrieval (TVR) aims to find the most relevant video in a large video gallery given a query text. The intricate and abundant context of a video challenges both the performance and the efficiency of TVR. To handle this sequential video context, existing methods typically select a subset of frames within a video to represent its content. How to select the most representative frames is therefore a crucial issue: the selected frames must retain the semantic information of the video while also improving retrieval efficiency by excluding temporally redundant frames. In this paper, we present the first empirical study of frame selection for TVR. We systematically classify existing frame selection methods into text-free and text-guided ones, and within these two categories we analyze six frame selection methods in detail in terms of effectiveness and efficiency. Two of the six are newly developed in this paper. Based on a comprehensive analysis on multiple TVR benchmarks, we empirically conclude that TVR with proper frame selection can significantly improve retrieval efficiency without sacrificing retrieval performance.
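The distinction between text-free and text-guided selection can be illustrated with a minimal sketch. The abstract does not name the six methods studied, so the two functions below are generic stand-ins rather than the authors' methods: `uniform_select` picks evenly spaced frames without looking at the query, and `text_guided_select` keeps the frames whose embeddings are most similar to the query embedding. The function names, the use of cosine similarity, and the toy 2-D embeddings are all assumptions made for illustration.

```python
import math
from typing import List, Sequence


def uniform_select(num_frames: int, k: int) -> List[int]:
    """Text-free sketch: choose k frame indices evenly spaced over the video."""
    if k <= 0 or num_frames <= 0:
        return []
    k = min(k, num_frames)
    # Take the centre of each of k equal-length temporal segments.
    return [int((i + 0.5) * num_frames / k) for i in range(k)]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def text_guided_select(
    frame_embs: Sequence[Sequence[float]],
    text_emb: Sequence[float],
    k: int,
) -> List[int]:
    """Text-guided sketch: keep the k frames most similar to the query text.

    The selected indices are returned in temporal order.
    """
    if k <= 0:
        return []
    scores = [cosine(f, text_emb) for f in frame_embs]
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(top)


if __name__ == "__main__":
    # Hypothetical toy embeddings: four frames and one query in 2-D.
    frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
    query = [1.0, 0.0]
    print(uniform_select(12, 4))
    print(text_guided_select(frames, query, 2))
```

In this sketch the text-free selection can be computed once per video, whereas the text-guided selection must be recomputed for every query. That cost difference is one reason to weigh effectiveness and efficiency together, as the abstract does.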

Pages: 6821-6832

Page count: 12
