Boosting Video-Text Retrieval with Explicit High-Level Semantics

Cited by: 7
Authors
Wang, Haoran [1]
Xu, Di [2]
He, Dongliang [1]
Li, Fu [1]
Ji, Zhong [3]
Han, Jungong [4]
Ding, Errui [1]
Affiliations
[1] Baidu Inc, Dept Comp Vis Technol VIS, Beijing, Peoples R China
[2] Chinese Acad Sci, Inst Comp Technol, Beijing, Peoples R China
[3] Tianjin Univ, Sch Elect & Informat Engn, Tianjin, Peoples R China
[4] Aberystwyth Univ, Comp Sci Dept, Aberystwyth SY23 3FL, Dyfed, Wales
Source
PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2022 | 2022
Keywords
Video-Text Retrieval; High-level Semantics; Vision-language Understanding
DOI
10.1145/3503161.3548010
Chinese Library Classification (CLC) number
TP39 [Computer Applications]
Discipline classification code
081203; 0835
Abstract
Video-text retrieval (VTR) is an attractive yet challenging task for multi-modal understanding, which aims to search for the relevant video (text) given a text (video) query. Existing methods typically employ completely heterogeneous visual-textual information to align video and text, while lacking awareness of the homogeneous high-level semantic information residing in both modalities. To fill this gap, we propose a novel visual-linguistic aligning model named HiSE for VTR, which improves the cross-modal representation by incorporating explicit high-level semantics. First, we explore the hierarchical property of explicit high-level semantics and decompose it into two levels, i.e., discrete semantics and holistic semantics. Specifically, for the visual branch, we exploit an off-the-shelf semantic entity predictor to generate discrete high-level semantics. In parallel, a trained video captioning model is employed to output holistic high-level semantics. For the textual modality, we parse the text into three parts: occurrence, action, and entity. The occurrence corresponds to the holistic high-level semantics, while both action and entity represent the discrete ones. Then, different graph reasoning techniques are utilized to promote the interaction between holistic and discrete high-level semantics. Extensive experiments demonstrate that, with the aid of explicit high-level semantics, our method achieves superior performance over state-of-the-art methods on three benchmark datasets: MSR-VTT, MSVD, and DiDeMo.
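To make the task definition in the abstract concrete, the following is a minimal sketch of bidirectional retrieval over a shared embedding space. It is not the HiSE model: the embeddings are made-up toy values, the function names (`cosine`, `rank`, `recall_at_k`) are hypothetical, and Recall@K is assumed here as the evaluation metric because it is commonly reported for this task, not because the abstract names it.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank(query, candidates):
    """Return candidate indices sorted from most to least similar to the query."""
    scores = [cosine(query, c) for c in candidates]
    return sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)


def recall_at_k(rankings, k):
    """Fraction of queries whose ground-truth match (same index) is in the top k."""
    hits = sum(1 for i, r in enumerate(rankings) if i in r[:k])
    return hits / len(rankings)


# Toy embeddings (assumed values); entry i of each list forms a matching pair.
video_emb = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
text_emb = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.2, 0.9]]

# Text-to-video: rank all videos for each caption.
t2v = [rank(t, video_emb) for t in text_emb]
# Video-to-text: rank all captions for each video.
v2t = [rank(v, text_emb) for v in video_emb]

print("text-to-video R@1:", recall_at_k(t2v, 1))
print("video-to-text R@1:", recall_at_k(v2t, 1))
```

In a real system the two embedding lists would come from trained video and text encoders; the ranking and metric code would stay the same in both retrieval directions.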
Pages: 4887-4898
Number of pages: 12