Boosting Video-Text Retrieval with Explicit High-Level Semantics

Cited by: 7

Authors:

Wang, Haoran [1]

Xu, Di [2]

He, Dongliang [1]

Li, Fu [1]

Ji, Zhong [3]

Han, Jungong [4]

Ding, Errui [1]

Affiliations:

[1] Baidu Inc, Dept Comp Vis Technol VIS, Beijing, Peoples R China

[2] Chinese Acad Sci, Inst Comp Technol, Beijing, Peoples R China

[3] Tianjin Univ, Sch Elect & Informat Engn, Tianjin, Peoples R China

[4] Aberystwyth Univ, Comp Sci Dept, Aberystwyth SY23 3FL, Dyfed, Wales

Source:

PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2022 | 2022

Keywords:

Video-Text Retrieval; High-level Semantics; Vision-language Understanding;

DOI:

10.1145/3503161.3548010

Chinese Library Classification (CLC) number:

TP39 [Computer applications];

Discipline classification code:

081203 ; 0835 ;

Abstract:

Video-text retrieval (VTR) is an attractive yet challenging task for multi-modal understanding, which aims to search for the relevant video (text) given a text (video) query. Existing methods typically employ completely heterogeneous visual-textual information to align video and text, while lacking awareness of the homogeneous high-level semantic information residing in both modalities. To fill this gap, we propose a novel visual-linguistic aligning model named HiSE for VTR, which improves the cross-modal representation by incorporating explicit high-level semantics. First, we explore the hierarchical property of explicit high-level semantics and decompose it into two levels, i.e., discrete semantics and holistic semantics. Specifically, for the visual branch, we exploit an off-the-shelf semantic entity predictor to generate discrete high-level semantics. In parallel, a trained video captioning model is employed to output holistic high-level semantics. For the textual modality, we parse the text into three parts: occurrence, action and entity. The occurrence corresponds to the holistic high-level semantics, while action and entity both represent the discrete ones. Then, different graph reasoning techniques are utilized to promote the interaction between holistic and discrete high-level semantics. Extensive experiments demonstrate that, with the aid of explicit high-level semantics, our method outperforms state-of-the-art methods on three benchmark datasets: MSR-VTT, MSVD and DiDeMo.


Pages: 4887-4898

Page count: 12
