FeatInter: Exploring fine-grained object features for video-text retrieval

Cited by: 8
Authors
Liu, Baolong [1 ]
Zheng, Qi [1 ]
Wang, Yabing [1 ]
Zhang, Minsong [1 ]
Dong, Jianfeng [1 ,2 ]
Wang, Xun [1 ]
Affiliations
[1] Zhejiang Gongshang Univ, Sch Comp & Informat Engn, Hangzhou 314423, Peoples R China
[2] Chinese Acad Sci, Inst Informat Engn, State Key Lab Informat Secur, Beijing 100093, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Cross-modal retrieval; Video-text retrieval; Feature interaction; Visual semantic interaction; Fine-grained object feature;
DOI
10.1016/j.neucom.2022.01.094
CLC Classification Number
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
In this paper, we target the challenging task of video-text retrieval. The common approach is to learn a joint text-video embedding space via cross-modal representation learning and to compute cross-modal similarity in that space. As videos typically contain rich information, how to represent them in the joint embedding space is crucial for video-text retrieval. Most existing works depend on pre-extracted frame-level or clip-level features for video representation, which may cause fine-grained object information in videos to be ignored. To alleviate this, we explicitly introduce more fine-grained object-level features to enrich the video representation. To exploit the potential of these object-level features, we propose a new model named FeatInter, which jointly considers the visual and semantic features of objects. In addition, we propose a visual-semantic interaction and a cross-feature interaction to mutually enhance object features and frame features. Extensive experiments on two challenging video datasets, MSR-VTT and TGIF, demonstrate the effectiveness of the proposed model. Moreover, our model achieves a new state-of-the-art result on TGIF, and while state-of-the-art methods use seven video features on MSR-VTT, our model obtains comparable performance with just three. (c) 2022 Elsevier B.V. All rights reserved.
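
The retrieval paradigm the abstract describes (encode video and text into a shared space, then rank by similarity) can be illustrated with a minimal PyTorch sketch. This is not the authors' FeatInter implementation: the module names, feature dimensions, mean-pooling, and the additive fusion of frame-level and object-level features below are illustrative assumptions only, chosen to show how object features can enrich a video embedding.

# Minimal sketch of a joint text-video embedding (NOT the FeatInter model;
# all names, dimensions, and the pooling/fusion scheme are assumptions).
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointEmbedding(nn.Module):
    def __init__(self, frame_dim=2048, object_dim=2048,
                 text_dim=768, joint_dim=512):
        super().__init__()
        # Separate projections for frame-level and object-level features,
        # mirroring the idea of enriching frames with fine-grained objects.
        self.frame_proj = nn.Linear(frame_dim, joint_dim)
        self.object_proj = nn.Linear(object_dim, joint_dim)
        self.text_proj = nn.Linear(text_dim, joint_dim)

    def encode_video(self, frame_feats, object_feats):
        # frame_feats: (batch, n_frames, frame_dim)
        # object_feats: (batch, n_objects, object_dim)
        f = self.frame_proj(frame_feats).mean(dim=1)    # pool over frames
        o = self.object_proj(object_feats).mean(dim=1)  # pool over objects
        return F.normalize(f + o, dim=-1)               # fuse, L2-normalize

    def encode_text(self, text_feats):
        # text_feats: (batch, text_dim), e.g. a sentence embedding
        return F.normalize(self.text_proj(text_feats), dim=-1)

def similarity(video_emb, text_emb):
    # Cosine similarity matrix of shape (n_texts, n_videos);
    # embeddings are unit-norm, so a dot product is cosine similarity.
    return text_emb @ video_emb.t()

if __name__ == "__main__":
    model = JointEmbedding()
    frames = torch.randn(4, 16, 2048)   # 4 videos, 16 frames each
    objects = torch.randn(4, 10, 2048)  # 10 detected objects per video
    texts = torch.randn(4, 768)         # 4 query sentences
    v = model.encode_video(frames, objects)
    t = model.encode_text(texts)
    print(similarity(v, t).shape)       # torch.Size([4, 4])

Because both embeddings are L2-normalized, retrieval reduces to a single matrix multiplication; at query time one ranks all videos by the row of the similarity matrix belonging to the query sentence.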
Pages: 178-191
Number of pages: 14