Boosting Video-Text Retrieval with Explicit High-Level Semantics

Cited by: 7

Authors:

Wang, Haoran [1]

Xu, Di [2]

He, Dongliang [1]

Li, Fu [1]

Ji, Zhong [3]

Han, Jungong [4]

Ding, Errui [1]

Affiliations:

[1] Baidu Inc, Dept Comp Vis Technol VIS, Beijing, Peoples R China

[2] Chinese Acad Sci, Inst Comp Technol, Beijing, Peoples R China

[3] Tianjin Univ, Sch Elect & Informat Engn, Tianjin, Peoples R China

[4] Aberystwyth Univ, Comp Sci Dept, Aberystwyth SY23 3FL, Dyfed, Wales

Source:

PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2022 | 2022

Keywords:

Video-Text Retrieval; High-level Semantics; Vision-language Understanding

DOI:

10.1145/3503161.3548010

Chinese Library Classification (CLC) number:

TP39 [Computer applications]

Discipline classification codes:

081203; 0835

Abstract:

Video-text retrieval (VTR) is an attractive yet challenging task for multi-modal understanding, which aims to search for relevant videos (texts) given a text (video) query. Existing methods typically employ completely heterogeneous visual-textual information to align video and text, while lacking awareness of the homogeneous high-level semantic information that resides in both modalities. To fill this gap, we propose a novel visual-linguistic alignment model named HiSE for VTR, which improves the cross-modal representation by incorporating explicit high-level semantics. First, we explore the hierarchical property of explicit high-level semantics and decompose them into two levels, i.e., discrete semantics and holistic semantics. Specifically, for the visual branch, we exploit an off-the-shelf semantic entity predictor to generate discrete high-level semantics. In parallel, a trained video captioning model is employed to output holistic high-level semantics. For the textual modality, we parse the text into three parts: occurrence, action, and entity. In particular, the occurrence corresponds to the holistic high-level semantics, while both action and entity represent the discrete ones. Then, different graph reasoning techniques are utilized to promote the interaction between holistic and discrete high-level semantics. Extensive experiments demonstrate that, with the aid of explicit high-level semantics, our method achieves superior performance over state-of-the-art methods on three benchmark datasets: MSR-VTT, MSVD, and DiDeMo.


Pages: 4887-4898

Page count: 12
