Boosting Video-Text Retrieval with Explicit High-Level Semantics

Cited by: 7
Authors
Wang, Haoran [1]
Xu, Di [2]
He, Dongliang [1]
Li, Fu [1]
Ji, Zhong [3]
Han, Jungong [4]
Ding, Errui [1]
Affiliations
[1] Baidu Inc, Dept Comp Vis Technol VIS, Beijing, Peoples R China
[2] Chinese Acad Sci, Inst Comp Technol, Beijing, Peoples R China
[3] Tianjin Univ, Sch Elect & Informat Engn, Tianjin, Peoples R China
[4] Aberystwyth Univ, Comp Sci Dept, Aberystwyth SY23 3FL, Dyfed, Wales
Source
PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2022 | 2022
Keywords
Video-Text Retrieval; High-level Semantics; Vision-language Understanding;
DOI
10.1145/3503161.3548010
Chinese Library Classification number
TP39 [Computer applications];
Discipline classification code
081203 ; 0835 ;
Abstract
Video-text retrieval (VTR) is an attractive yet challenging task for multi-modal understanding, which aims to search for relevant videos (texts) given a text (video) query. Existing methods typically employ completely heterogeneous visual-textual information to align video and text, while lacking awareness of the homogeneous high-level semantic information residing in both modalities. To fill this gap, we propose a novel visual-linguistic aligning model named HiSE for VTR, which improves the cross-modal representation by incorporating explicit high-level semantics. First, we explore the hierarchical property of explicit high-level semantics and decompose it into two levels, i.e., discrete semantics and holistic semantics. Specifically, for the visual branch, we exploit an off-the-shelf semantic entity predictor to generate discrete high-level semantics. In parallel, a trained video captioning model is employed to output holistic high-level semantics. As for the textual modality, we parse the text into three parts: occurrence, action, and entity. In particular, the occurrence corresponds to the holistic high-level semantics, while both action and entity represent the discrete ones. Then, different graph reasoning techniques are utilized to promote the interaction between holistic and discrete high-level semantics. Extensive experiments demonstrate that, with the aid of explicit high-level semantics, our method achieves superior performance over state-of-the-art methods on three benchmark datasets: MSR-VTT, MSVD, and DiDeMo.
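The holistic/discrete split described in the abstract can be illustrated with a toy sketch. This is not the HiSE implementation: the `SemanticGraph` class, the `propagate` and `rank_videos` functions, the 2-D vectors, and the mean-aggregation step are all hypothetical stand-ins chosen for illustration, and the abstract does not specify which graph reasoning techniques the authors use. The sketch assumes each text or video already has one holistic embedding (occurrence or generated caption) and a set of discrete embeddings (actions and entities).

```python
import math
from dataclasses import dataclass, field
from typing import Dict, List

Vector = List[float]


@dataclass
class SemanticGraph:
    """Hypothetical toy graph: one holistic node linked to every discrete node."""
    holistic: Vector  # e.g. embedding of the occurrence (text) or caption (video)
    discrete: Dict[str, Vector] = field(default_factory=dict)  # actions and entities


def mean(vectors: List[Vector]) -> Vector:
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def propagate(graph: SemanticGraph) -> SemanticGraph:
    """One round of mean-aggregation message passing.

    A generic stand-in for graph reasoning, assumed here for illustration only.
    """
    if not graph.discrete:
        return graph
    discrete_mean = mean(list(graph.discrete.values()))
    # The holistic node absorbs the average of its discrete neighbours.
    new_holistic = mean([graph.holistic, discrete_mean])
    # Each discrete node absorbs the (old) holistic node.
    new_discrete = {
        name: mean([vec, graph.holistic]) for name, vec in graph.discrete.items()
    }
    return SemanticGraph(new_holistic, new_discrete)


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def rank_videos(text: SemanticGraph, videos: Dict[str, SemanticGraph]) -> List[str]:
    """Rank video ids by similarity of their propagated holistic nodes to the text's."""
    query = propagate(text).holistic
    scores = {vid: cosine(query, propagate(g).holistic) for vid, g in videos.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Toy data: made-up 2-D embeddings, not real model outputs.
text_query = SemanticGraph([1.0, 0.0], {"man": [1.0, 0.0], "run": [1.0, 0.0]})
candidates = {
    "video_b": SemanticGraph([0.0, 1.0], {"cat": [0.0, 1.0]}),
    "video_a": SemanticGraph([1.0, 0.0], {"man": [1.0, 0.0]}),
}
print(rank_videos(text_query, candidates))
```

In a real system the embeddings would come from learned encoders and the similarity would typically combine several levels of representation; the sketch only shows how a holistic node and discrete nodes can exchange information before matching.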
Pages: 4887-4898
Number of pages: 12
Related papers
50 in total
[1]   Mask to Reconstruct: Cooperative Semantics Completion for Video-text Retrieval [J].
Fang, Han ;
Yang, Zhifei ;
Zang, Xianghao ;
Ban, Chao ;
He, Zhongjiang ;
Sun, Hao ;
Zhou, Lanxiang .
PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023, :3847-3856
[2]   Learning Semantics-Grounded Vocabulary Representation for Video-Text Retrieval [J].
Shi, Yaya ;
Liu, Haowei ;
Xu, Haiyang ;
Ma, Zongyang ;
Ye, Qinghao ;
Hu, Anwen ;
Yan, Ming ;
Zhang, Ji ;
Huang, Fei ;
Yuan, Chunfeng ;
Li, Bing ;
Hu, Weiming ;
Zha, Zheng-Jun .
PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023, :4460-4470
[3]   HANet: Hierarchical Alignment Networks for Video-Text Retrieval [J].
Wu, Peng ;
He, Xiangteng ;
Tang, Mingqian ;
Lv, Yiliang ;
Liu, Jing .
PROCEEDINGS OF THE 29TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2021, 2021, :3518-3527
[4]   Deep learning for video-text retrieval: a review [J].
Zhu, Cunjuan ;
Jia, Qi ;
Chen, Wei ;
Guo, Yanming ;
Liu, Yu .
INTERNATIONAL JOURNAL OF MULTIMEDIA INFORMATION RETRIEVAL, 2023, 12 (01)
[5]   Progressive Semantic Matching for Video-Text Retrieval [J].
Liu, Hongying ;
Luo, Ruyi ;
Shang, Fanhua ;
Niu, Mantang ;
Liu, Yuanyuan .
PROCEEDINGS OF THE 29TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2021, 2021, :5083-5091
[6]   A NOVEL CONVOLUTIONAL ARCHITECTURE FOR VIDEO-TEXT RETRIEVAL [J].
Li, Zheng ;
Guo, Caili ;
Yang, Bo ;
Feng, Zerun ;
Zhang, Hao .
2020 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO (ICME), 2020,
[7]   TOWARDS VIDEO-TEXT RETRIEVAL ADVERSARIAL ATTACK [J].
Yang, Haozhe ;
Xiang, Yuhan ;
Sun, Ke ;
Hu, Jianlong ;
Lin, Xianming .
2024 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING, ICASSP 2024, 2024, :6500-6504
[8]   Deep learning for video-text retrieval: a review [J].
Cunjuan Zhu ;
Qi Jia ;
Wei Chen ;
Yanming Guo ;
Yu Liu .
International Journal of Multimedia Information Retrieval, 2023, 12
[9]   MILES: Visual BERT Pre-training with Injected Language Semantics for Video-Text Retrieval [J].
Ge, Yuying ;
Ge, Yixiao ;
Liu, Xihui ;
Wang, Jinpeng ;
Wu, Jianping ;
Shan, Ying ;
Qie, Xiaohu ;
Luo, Ping .
COMPUTER VISION - ECCV 2022, PT XXXV, 2022, 13695 :691-708
[10]   Animating Images to Transfer CLIP for Video-Text Retrieval [J].
Liu, Yu ;
Chen, Huai ;
Huang, Lianghua ;
Chen, Di ;
Wang, Bin ;
Pan, Pan ;
Wang, Lisheng .
PROCEEDINGS OF THE 45TH INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL (SIGIR '22), 2022, :1906-1911