Reliable Phrase Feature Mining for Hierarchical Video-Text Retrieval

Times Cited: 0
Authors
Lai, Huakai [1 ]
Yang, Wenfei [1 ]
Zhang, Tianzhu [1 ,2 ]
Zhang, Yongdong [1 ]
Affiliations
[1] Univ Sci & Technol China, Sch Informat Sci & Technol, Hefei 230027, Peoples R China
[2] Deep Space Explorat Lab, Hefei 230031, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Video-text retrieval; attention work; reliable phrase mining; denoised decoder;
DOI
10.1109/TCSVT.2024.3422869
Chinese Library Classification
TM [Electrical Engineering]; TN [Electronics, Communication Technology];
Discipline Code
0808 ; 0809 ;
Abstract
Video-text retrieval is a fundamental task in multi-modal understanding and has attracted increasing attention from both academia and industry in recent years. The task is challenging because a video inherently contains multi-grained semantics and each video corresponds to several different texts. Previous best-performing methods model video-sentence, phrase-phrase, and frame-word interactions simultaneously. Unlike word/frame features, which can be obtained directly, phrase features must be adaptively aggregated from correlated word/frame features, which makes them demanding to construct. However, existing methods use simple intra-modal self-attention to generate phrase features without considering three aspects: cross-modal semantic correlation, phrase generation noise, and phrase diversity. In this paper, we propose a novel Reliable Phrase Mining (RPM) model that constructs reliable phrase features and conducts hierarchical cross-modal interactions for video-text retrieval. The proposed RPM model has several merits. First, to guarantee semantic consistency between video phrases and text phrases, we propose a set of modality-shared prototypes that serve as joint queries to aggregate semantically related frame/word features into adaptive-grained phrase features. Second, to deal with phrase generation noise, the proposed denoised decoder module obtains more reliable similarities between prototypes and frame/word features: not only the correlation between frame/word features and prototypes, but also the correlation among the prototypes themselves, is taken into account when computing the similarity. Furthermore, to encourage different prototypes to focus on different semantic information, we design a prototype contrastive loss whose core idea is to make phrases produced by the same prototype more similar than those produced by different prototypes.
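The abstract does not give the exact formulation of the prototype-based aggregation, but the idea of using modality-shared prototypes as joint queries over frame/word features can be illustrated with standard scaled dot-product cross-attention. The function name, shapes, and numpy implementation below are illustrative assumptions, not the paper's code:

```python
import numpy as np

def aggregate_phrases(prototypes, features):
    """Aggregate frame/word features into phrase features, using
    modality-shared prototypes as cross-attention queries.

    prototypes: (P, d) learnable queries shared by both modalities
    features:   (N, d) frame features (video) or word features (text)
    returns:    (P, d) adaptive-grained phrase features
    """
    d = prototypes.shape[1]
    # scaled dot-product similarity between prototypes and features
    scores = prototypes @ features.T / np.sqrt(d)          # (P, N)
    # softmax over the feature axis -> attention weights per prototype
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    # each phrase is a weighted sum of semantically related features
    return w @ features                                    # (P, d)

rng = np.random.default_rng(0)
protos = rng.standard_normal((4, 8))    # 4 prototypes, dim 8
frames = rng.standard_normal((12, 8))   # 12 frame features
video_phrases = aggregate_phrases(protos, frames)
print(video_phrases.shape)  # (4, 8)
```

Because the same prototypes query both the frame features and the word features, the p-th video phrase and the p-th text phrase are aggregated by the same semantic anchor, which is what keeps them comparable across modalities.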
Extensive experimental results demonstrate that the proposed method performs favorably on three benchmark datasets: MSR-VTT, MSVD, and LSMDC.
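The prototype contrastive loss described in the abstract can be sketched as an InfoNCE-style objective: the video/text phrase pair produced by the same prototype is the positive pair, and phrases from other prototypes are negatives. The temperature value and numpy implementation below are assumptions for illustration, not the paper's exact loss:

```python
import numpy as np

def prototype_contrastive_loss(video_phrases, text_phrases, tau=0.07):
    """InfoNCE-style sketch: phrases produced by the same prototype
    (matching rows) should be more similar than phrases produced by
    different prototypes.

    video_phrases, text_phrases: (P, d), row p produced by prototype p
    """
    # cosine similarity matrix between all cross-modal phrase pairs
    v = video_phrases / np.linalg.norm(video_phrases, axis=1, keepdims=True)
    t = text_phrases / np.linalg.norm(text_phrases, axis=1, keepdims=True)
    sim = v @ t.T / tau                                    # (P, P)
    # cross-entropy with the diagonal (same prototype) as the target
    logits = sim - sim.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))

rng = np.random.default_rng(0)
v = rng.standard_normal((4, 8))
loss_aligned = prototype_contrastive_loss(v, v.copy())
loss_shuffled = prototype_contrastive_loss(v, v[::-1].copy())
print(loss_aligned < loss_shuffled)  # True
```

Pushing same-prototype phrases together and different-prototype phrases apart is what encourages each prototype to specialize on distinct semantic content, addressing the diversity aspect the abstract raises.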
Pages: 12019-12031
Number of pages: 13