FeatInter: Exploring fine-grained object features for video-text retrieval

Cited by: 8
Authors
Liu, Baolong [1 ]
Zheng, Qi [1 ]
Wang, Yabing [1 ]
Zhang, Minsong [1 ]
Dong, Jianfeng [1 ,2 ]
Wang, Xun [1 ]
Affiliations
[1] Zhejiang Gongshang Univ, Sch Comp & Informat Engn, Hangzhou 314423, Peoples R China
[2] Chinese Acad Sci, Inst Informat Engn, State Key Lab Informat Secur, Beijing 100093, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Cross-modal retrieval; Video-text retrieval; Feature interaction; Visual semantic interaction; Fine-grained object feature;
DOI
10.1016/j.neucom.2022.01.094
CLC Classification Number
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
In this paper, we target the challenging task of video-text retrieval. The common approach is to learn a joint text-video embedding space via cross-modal representation learning and to compute cross-modal similarity in that space. As videos typically contain rich information, how to represent them in the joint embedding space is crucial for video-text retrieval. Most existing works depend on pre-extracted frame-level or clip-level features for video representation, which may cause fine-grained object information in videos to be ignored. To alleviate this, we explicitly introduce more fine-grained object-level features to enrich the video representation. To exploit the potential of these object-level features, we propose a new model named FeatInter, which jointly considers the visual and semantic features of objects. In addition, we propose a visual-semantic interaction and a cross-feature interaction to mutually enhance object features and frame features. Extensive experiments on two challenging video datasets, MSR-VTT and TGIF, demonstrate the effectiveness of the proposed model. Moreover, our model achieves a new state-of-the-art result on TGIF, and while state-of-the-art methods use seven video features on MSR-VTT, our model obtains comparable performance with just three. (c) 2022 Elsevier B.V. All rights reserved.
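
The retrieval paradigm the abstract describes (encode video and text into a shared space, then rank by similarity) can be illustrated with a minimal PyTorch sketch. This is not the authors' FeatInter implementation: the module names, feature dimensions, mean-pooling, and the additive fusion of frame-level and object-level features below are illustrative assumptions only, chosen to show how object features can enrich a video embedding.

# Minimal sketch of a joint text-video embedding (NOT the FeatInter model;
# all names, dimensions, and the pooling/fusion scheme are assumptions).
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointEmbedding(nn.Module):
    def __init__(self, frame_dim=2048, object_dim=2048,
                 text_dim=768, joint_dim=512):
        super().__init__()
        # Separate projections for frame-level and object-level features,
        # mirroring the idea of enriching frames with fine-grained objects.
        self.frame_proj = nn.Linear(frame_dim, joint_dim)
        self.object_proj = nn.Linear(object_dim, joint_dim)
        self.text_proj = nn.Linear(text_dim, joint_dim)

    def encode_video(self, frame_feats, object_feats):
        # frame_feats: (batch, n_frames, frame_dim)
        # object_feats: (batch, n_objects, object_dim)
        f = self.frame_proj(frame_feats).mean(dim=1)    # pool over frames
        o = self.object_proj(object_feats).mean(dim=1)  # pool over objects
        return F.normalize(f + o, dim=-1)               # fuse, L2-normalize

    def encode_text(self, text_feats):
        # text_feats: (batch, text_dim), e.g. a sentence embedding
        return F.normalize(self.text_proj(text_feats), dim=-1)

def similarity(video_emb, text_emb):
    # Cosine similarity matrix of shape (n_texts, n_videos);
    # embeddings are unit-norm, so a dot product is cosine similarity.
    return text_emb @ video_emb.t()

if __name__ == "__main__":
    model = JointEmbedding()
    frames = torch.randn(4, 16, 2048)   # 4 videos, 16 frames each
    objects = torch.randn(4, 10, 2048)  # 10 detected objects per video
    texts = torch.randn(4, 768)         # 4 query sentences
    v = model.encode_video(frames, objects)
    t = model.encode_text(texts)
    print(similarity(v, t).shape)       # torch.Size([4, 4])

Because both embeddings are L2-normalized, retrieval reduces to a single matrix multiplication; at query time one ranks all videos by the row of the similarity matrix belonging to the query sentence.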
Pages: 178-191
Number of pages: 14