Boosting Video-Text Retrieval with Explicit High-Level Semantics

Cited by: 7

Authors:

Wang, Haoran ^{[1]}

Xu, Di ^{[2]}

He, Dongliang ^{[1]}

Li, Fu ^{[1]}

Ji, Zhong ^{[3]}

Han, Jungong ^{[4]}

Ding, Errui ^{[1]}

Affiliations:

[1] Baidu Inc, Dept Comp Vis Technol VIS, Beijing, Peoples R China

[2] Chinese Acad Sci, Inst Comp Technol, Beijing, Peoples R China

[3] Tianjin Univ, Sch Elect & Informat Engn, Tianjin, Peoples R China

[4] Aberystwyth Univ, Comp Sci Dept, Aberystwyth SY23 3FL, Dyfed, Wales

Source:

PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2022 | 2022

Keywords:

Video-Text Retrieval; High-level Semantics; Vision-language Understanding;

DOI:

10.1145/3503161.3548010

Chinese Library Classification (CLC) number:

TP39 [Computer applications];

Discipline classification code:

081203 ; 0835 ;

Abstract:

Video-text retrieval (VTR) is an attractive yet challenging task for multi-modal understanding, which aims to search for relevant video (text) given a text (video) query. Existing methods typically employ completely heterogeneous visual-textual information to align video and text, while lacking awareness of the homogeneous high-level semantic information residing in both modalities. To fill this gap, we propose a novel visual-linguistic aligning model named HiSE for VTR, which improves the cross-modal representation by incorporating explicit high-level semantics. First, we explore the hierarchical property of explicit high-level semantics and decompose it into two levels, i.e., discrete semantics and holistic semantics. Specifically, for the visual branch, we exploit an off-the-shelf semantic entity predictor to generate discrete high-level semantics. In parallel, a trained video captioning model is employed to output holistic high-level semantics. For the textual modality, we parse the text into three parts: occurrence, action, and entity. In particular, the occurrence corresponds to the holistic high-level semantics, while both action and entity represent the discrete ones. Then, different graph reasoning techniques are utilized to promote the interaction between holistic and discrete high-level semantics. Extensive experiments demonstrate that, with the aid of explicit high-level semantics, our method achieves superior performance over state-of-the-art methods on three benchmark datasets: MSR-VTT, MSVD, and DiDeMo.


Pages: 4887-4898

Page count: 12

50 items in total

[31] Hierarchical Cross-Modal Graph Consistency Learning for Video-Text Retrieval [J].

Jin, Weike ;

Zhao, Zhou ;

Zhang, Pengcheng ;

Zhu, Jieming ;

He, Xiuqiang ;

Zhuang, Yueting .

SIGIR '21 - PROCEEDINGS OF THE 44TH INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL, 2021, :1114-1124

[32] CMMT: Cross-Modal Meta-Transformer for Video-Text Retrieval [J].

Gao, Yizhao ;

Lu, Zhiwu .

PROCEEDINGS OF THE 2023 ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL, ICMR 2023, 2023, :76-84

[33] Learning Coarse-to-Fine Graph Neural Networks for Video-Text Retrieval [J].

Wang, Wei ;

Gao, Junyu ;

Yang, Xiaoshan ;

Xu, Changsheng .

IEEE TRANSACTIONS ON MULTIMEDIA, 2021, 23 :2386-2397

[34] CLIP Based Multi-Event Representation Generation for Video-Text Retrieval [J].

Tu R. ;

Mao X. ;

Kong W. ;

Cai C. ;

Zhao W. ;

Wang H. ;

Huang H. .

Jisuanji Yanjiu yu Fazhan/Computer Research and Development, 2023, 60 (09) :2169-2179

[35] Temporal Multimodal Graph Transformer With Global-Local Alignment for Video-Text Retrieval [J].

Feng, Zerun ;

Zeng, Zhimin ;

Guo, Caili ;

Li, Zheng .

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2023, 33 (03) :1438-1453

[36] Learning Joint Embedding with Multimodal Cues for Cross-Modal Video-Text Retrieval [J].

Mithun, Niluthpol Chowdhury ;

Li, Juncheng ;

Metze, Florian ;

Roy-Chowdhury, Amit K. .

ICMR '18: PROCEEDINGS OF THE 2018 ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL, 2018, :19-27

[37] Fine-Grained Cross-Modal Contrast Learning for Video-Text Retrieval [J].

Liu, Hui ;

Lv, Gang ;

Gu, Yanhong ;

Nian, Fudong .

ADVANCED INTELLIGENT COMPUTING TECHNOLOGY AND APPLICATIONS, PT V, ICIC 2024, 2024, 14866 :298-310

[38] CONTEXT-AWARE HIERARCHICAL TRANSFORMER FOR FINE-GRAINED VIDEO-TEXT RETRIEVAL [J].

Chen, Mingliang ;

Zhang, Weimin ;

Ren, Yurui ;

Li, Ge .

2022 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING, ICIP, 2022, :386-390

[39] A Multi-interaction Model with Cross-Branch Feature Fusion for Video-Text Retrieval [J].

Li, Junting ;

Wu, Dehao ;

Zhu, Yuesheng ;

Bai, Zhiqiang .

NEURAL INFORMATION PROCESSING, ICONIP 2021, PT VI, 2022, 1517 :476-484

[40] JM-CLIP: A JOINT MODAL SIMILARITY CONTRASTIVE LEARNING MODEL FOR VIDEO-TEXT RETRIEVAL [J].

Ge, Mingyuan ;

Li, Yewen ;

Wu, Honghao ;

Li, Mingyong .

2024 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING, ICASSP 2024, 2024, :3010-3014
