Boosting Video-Text Retrieval with Explicit High-Level Semantics

Cited by: 6
Authors
Wang, Haoran [1]
Xu, Di [2]
He, Dongliang [1]
Li, Fu [1]
Ji, Zhong [3]
Han, Jungong [4]
Ding, Errui [1]
Affiliations
[1] Baidu Inc, Dept Comp Vis Technol VIS, Beijing, Peoples R China
[2] Chinese Acad Sci, Inst Comp Technol, Beijing, Peoples R China
[3] Tianjin Univ, Sch Elect & Informat Engn, Tianjin, Peoples R China
[4] Aberystwyth Univ, Comp Sci Dept, Aberystwyth SY23 3FL, Dyfed, Wales
Source
PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2022 | 2022
Keywords
Video-Text Retrieval; High-level Semantics; Vision-language Understanding
DOI
10.1145/3503161.3548010
Chinese Library Classification (CLC) number
TP39 [Computer applications]
Discipline classification code
081203; 0835
Abstract
Video-text retrieval (VTR) is an attractive yet challenging task for multi-modal understanding, which aims to retrieve the relevant video (text) given a text (video) query. Existing methods typically align video and text using completely heterogeneous visual-textual information, while overlooking the homogeneous high-level semantic information that resides in both modalities. To fill this gap, we propose HiSE, a novel visual-linguistic alignment model for VTR that improves the cross-modal representation by incorporating explicit high-level semantics. First, we explore the hierarchical property of explicit high-level semantics and decompose it into two levels, i.e., discrete semantics and holistic semantics. Specifically, for the visual branch, we use an off-the-shelf semantic entity predictor to generate discrete high-level semantics. In parallel, a trained video captioning model outputs holistic high-level semantics. For the textual modality, we parse the text into three parts: occurrence, action, and entity. The occurrence corresponds to the holistic high-level semantics, while action and entity represent the discrete ones. Then, different graph reasoning techniques are used to promote the interaction between holistic and discrete high-level semantics. Extensive experiments demonstrate that, with the aid of explicit high-level semantics, our method achieves superior performance over state-of-the-art methods on three benchmark datasets: MSR-VTT, MSVD, and DiDeMo.
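The text-side decomposition described in the abstract (one holistic "occurrence" plus discrete "action" and "entity" parts, linked by graph reasoning) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and not the authors' implementation: the `SemanticGraph` class, the star-shaped edge layout, the hand-written parse of the example caption, the 2-d toy features, and the mean-aggregation `message_pass` step, which stands in for whatever graph reasoning technique HiSE actually uses.

```python
from dataclasses import dataclass, field


@dataclass
class SemanticGraph:
    """Toy graph: one holistic node (the occurrence) plus discrete nodes."""
    occurrence: str
    actions: list[str] = field(default_factory=list)
    entities: list[str] = field(default_factory=list)

    def nodes(self) -> list[str]:
        # Node 0 is the holistic node; actions and entities follow.
        return [self.occurrence, *self.actions, *self.entities]

    def edges(self) -> list[tuple[int, int]]:
        # Assumed star topology: the holistic node links to every discrete node.
        return [(0, i) for i in range(1, len(self.nodes()))]


def message_pass(features, edges):
    """One round of mean aggregation over each node and its neighbours."""
    neighbours = [[i] for i in range(len(features))]  # include a self loop
    for a, b in edges:
        neighbours[a].append(b)
        neighbours[b].append(a)
    dim = len(features[0])
    return [
        [sum(features[j][d] for j in nbrs) / len(nbrs) for d in range(dim)]
        for nbrs in neighbours
    ]


# Hand-written parse of a hypothetical caption (no real parser is used here).
graph = SemanticGraph(
    occurrence="a man is playing a guitar",
    actions=["playing"],
    entities=["man", "guitar"],
)
# Hypothetical 2-d features, one per node, in the order returned by nodes().
features = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [2.0, 2.0]]
updated = message_pass(features, graph.edges())
```

After one round, the holistic node's feature becomes the mean over itself and all discrete nodes, while each discrete node mixes only with the holistic node, which is one simple way for the two semantic levels to exchange information.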
Pages: 4887-4898
Page count: 12