Boosting Video-Text Retrieval with Explicit High-Level Semantics

Cited by: 7
Authors
Wang, Haoran [1]
Xu, Di [2]
He, Dongliang [1]
Li, Fu [1]
Ji, Zhong [3]
Han, Jungong [4]
Ding, Errui [1]
Affiliations
[1] Baidu Inc, Dept Comp Vis Technol VIS, Beijing, Peoples R China
[2] Chinese Acad Sci, Inst Comp Technol, Beijing, Peoples R China
[3] Tianjin Univ, Sch Elect & Informat Engn, Tianjin, Peoples R China
[4] Aberystwyth Univ, Comp Sci Dept, Aberystwyth SY23 3FL, Dyfed, Wales
Source
PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2022 | 2022
Keywords
Video-Text Retrieval; High-level Semantics; Vision-language Understanding
DOI
10.1145/3503161.3548010
Chinese Library Classification
TP39 [Computer applications]
Discipline classification code
081203; 0835
Abstract
Video-text retrieval (VTR) is an attractive yet challenging task for multi-modal understanding, which aims to search for the relevant video (text) given a text (video) query. Existing methods typically employ completely heterogeneous visual-textual information to align video and text, while lacking awareness of the homogeneous high-level semantic information residing in both modalities. To fill this gap, we propose a novel visual-linguistic aligning model named HiSE for VTR, which improves the cross-modal representation by incorporating explicit high-level semantics. First, we explore the hierarchical property of explicit high-level semantics and decompose it into two levels, i.e., discrete semantics and holistic semantics. Specifically, for the visual branch, we exploit an off-the-shelf semantic entity predictor to generate discrete high-level semantics. In parallel, a trained video captioning model is employed to output holistic high-level semantics. For the textual modality, we parse the text into three parts: occurrence, action, and entity. The occurrence corresponds to the holistic high-level semantics, while both action and entity represent the discrete ones. Then, different graph reasoning techniques are utilized to promote the interaction between holistic and discrete high-level semantics. Extensive experiments demonstrate that, with the aid of explicit high-level semantics, our method achieves superior performance over state-of-the-art methods on three benchmark datasets: MSR-VTT, MSVD, and DiDeMo.
Pages: 4887-4898
Page count: 12