Boosting Video-Text Retrieval with Explicit High-Level Semantics

Cited by: 7

Authors:

Wang, Haoran [1]

Xu, Di [2]

He, Dongliang [1]

Li, Fu [1]

Ji, Zhong [3]

Han, Jungong [4]

Ding, Errui [1]

Affiliations:

[1] Baidu Inc, Dept Comp Vis Technol VIS, Beijing, Peoples R China

[2] Chinese Acad Sci, Inst Comp Technol, Beijing, Peoples R China

[3] Tianjin Univ, Sch Elect & Informat Engn, Tianjin, Peoples R China

[4] Aberystwyth Univ, Comp Sci Dept, Aberystwyth SY23 3FL, Dyfed, Wales

Source:

PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2022 | 2022

Keywords:

Video-Text Retrieval; High-level Semantics; Vision-language Understanding;

DOI:

10.1145/3503161.3548010

Chinese Library Classification (CLC) number:

TP39 [Computer applications];

Discipline classification code:

081203 ; 0835 ;

Abstract:

Video-text retrieval (VTR) is an attractive yet challenging task for multi-modal understanding, which aims to search for a relevant video (text) given a text (video) query. Existing methods typically employ completely heterogeneous visual-textual information to align video and text, whilst lacking awareness of the homogeneous high-level semantic information residing in both modalities. To fill this gap, in this work, we propose a novel visual-linguistic aligning model named HiSE for VTR, which improves the cross-modal representation by incorporating explicit high-level semantics. First, we explore the hierarchical property of explicit high-level semantics, and further decompose it into two levels, i.e., discrete semantics and holistic semantics. Specifically, for the visual branch, we exploit an off-the-shelf semantic entity predictor to generate discrete high-level semantics. In parallel, a trained video captioning model is employed to output holistic high-level semantics. As for the textual modality, we parse the text into three parts: occurrence, action and entity. In particular, the occurrence corresponds to the holistic high-level semantics, while both action and entity represent the discrete ones. Then, different graph reasoning techniques are utilized to promote the interaction between holistic and discrete high-level semantics. Extensive experiments demonstrate that, with the aid of explicit high-level semantics, our method achieves superior performance over state-of-the-art methods on three benchmark datasets: MSR-VTT, MSVD and DiDeMo.

Pages: 4887-4898

Page count: 12

50 items in total

[21] Expert-guided contrastive learning for video-text retrieval [J].

Lee, Jewook ;

Lee, Pilhyeon ;

Park, Sungho ;

Byun, Hyeran .

NEUROCOMPUTING, 2023, 536 :50-58

[22] SEMANTIC-PRESERVING METRIC LEARNING FOR VIDEO-TEXT RETRIEVAL [J].

Choo, Sungkwon ;

Ha, Seong Jong ;

Lee, Joonsoo .

2021 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING (ICIP), 2021, :2388-2392

[23] LSECA: local semantic enhancement and cross aggregation for video-text retrieval [J].

Wang, Zhiwen ;

Zhang, Donglin ;

Hu, Zhikai .

INTERNATIONAL JOURNAL OF MULTIMEDIA INFORMATION RETRIEVAL, 2024, 13 (03)

[24] Self-expressive induced clustered attention for video-text retrieval [J].

Zhu, Jingxuan ;

Shen, Xiangjun ;

Mehta, Sumet ;

Abeo, Timothy Apasiba ;

Zhan, Yongzhao .

MULTIMEDIA SYSTEMS, 2024, 30 (06)

[25] A survey of content-based image retrieval with high-level semantics [J].

Liu, Ying ;

Zhang, Dengsheng ;

Lu, Guojun ;

Ma, Wei-Ying .

PATTERN RECOGNITION, 2007, 40 (01) :262-282

[26] Transferring Image-CLIP to Video-Text Retrieval via Temporal Relations [J].

Fang, Han ;

Xiong, Pengfei ;

Xu, Luhui ;

Luo, Wenhan .

IEEE TRANSACTIONS ON MULTIMEDIA, 2023, 25 :7772-7785

[27] Concept Propagation via Attentional Knowledge Graph Reasoning for Video-Text Retrieval [J].

Fang, Sheng ;

Wang, Shuhui ;

Zhuo, Junbao ;

Huang, Qingming ;

Ma, Bin ;

Wei, Xiaoming ;

Wei, Xiaolin .

PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2022, 2022, :4789-4800

[28] MAC: Masked Contrastive Pre-Training for Efficient Video-Text Retrieval [J].

Shu, Fangxun ;

Chen, Biaolong ;

Liao, Yue ;

Wang, Jinqiao ;

Liu, Si .

IEEE TRANSACTIONS ON MULTIMEDIA, 2024, 26 :9962-9972

[29] FeatInter: Exploring fine-grained object features for video-text retrieval [J].

Liu, Baolong ;

Zheng, Qi ;

Wang, Yabing ;

Zhang, Minsong ;

Dong, Jianfeng ;

Wang, Xun .

NEUROCOMPUTING, 2022, 496 :178-191

[30] Multilevel Semantic Interaction Alignment for Video-Text Cross-Modal Retrieval [J].

Chen, Lei ;

Deng, Zhen ;

Liu, Libo ;

Yin, Shibai .

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2024, 34 (07) :6559-6575
