Self-expressive induced clustered attention for video-text retrieval

Cited by: 0
Authors
Zhu, Jingxuan [1 ]
Shen, Xiangjun [1 ]
Mehta, Sumet [1 ]
Abeo, Timothy Apasiba [2 ]
Zhan, Yongzhao [1 ,3 ,4 ]
Affiliations
[1] Jiangsu Univ, Sch Comp Sci & Commun Engn, Zhenjiang 212013, Jiangsu, Peoples R China
[2] Tamale Tech Univ, Sch Appl Sci, Tamale, Ghana
[3] Jiangsu Univ, Jiangsu Engn Res Ctr Big Data Ubiquitous Percept &, Zhenjiang 212013, Jiangsu, Peoples R China
[4] Jiangsu Univ, Prov Key Lab Computat Intelligence & New Technol L, Zhenjiang 212013, Jiangsu, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Video-text retrieval; Self-attention; Video embedding; Self-expressive cluster;
DOI
10.1007/s00530-024-01549-9
CLC number
TP [Automation Technology, Computer Technology];
Discipline code
0812 ;
Abstract
Extensive research has shown that self-attention achieves impressive performance in video-text retrieval. However, most state-of-the-art methods neglect the intrinsic redundancy in videos caused by consecutive, similar frames, which makes it difficult to construct a well-defined fine-grained semantic space and limits retrieval performance. Moreover, current self-attention mechanisms exhibit high complexity when computing frame-word attention coefficients, which incurs a high computational and storage cost when these attentions are employed for video-text retrieval. To solve these problems, we propose a new method, self-expressive induced clustered attention, for video-text retrieval. Unlike existing methods, we perform self-expressive induced clustering (SEIC) on video embeddings to mine well-defined fine-grained video semantic features. SEIC is a self-adaptive clustering method that does not require the number of clusters to be specified in advance; it captures well-defined fine-grained semantic features from video embeddings and reduces the redundancy of frame-level video content. We then propose a self-expressive induced clustered attention model (SEICA), which enhances the quality of video embeddings while effectively reducing computational cost and saving storage resources. Finally, we apply this method to video-text retrieval tasks. Experimental results on several benchmark datasets, including MSVD, MSRVTT, ActivityNet and DiDeMo, demonstrate that the retrieval performance of the proposed method is superior to that of related state-of-the-art methods while consuming fewer computing and storage resources.
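The abstract's two ideas can be illustrated in miniature. Note this is a hedged sketch under assumptions, not the paper's SEIC/SEICA formulation: self-expressive clustering is sketched here in its standard ridge-regularized form (each frame embedding reconstructed from the others), with the cluster count emerging from connected components of the affinity graph rather than being fixed in advance, and "clustered attention" is sketched as attending from text queries to cluster centroids instead of to all frames, shrinking the attention map from words x frames to words x clusters. The threshold `tau` and regularizer `lam` are illustrative parameters, not values from the paper.

```python
import numpy as np

def self_expressive_coefficients(X, lam=0.1):
    """Ridge-regularized self-expression: minimize ||X - C X||^2 + lam ||C||^2
    so each frame embedding is reconstructed from the other frames.
    X: (n_frames, dim). Returns C: (n_frames, n_frames)."""
    n = X.shape[0]
    G = X @ X.T
    C = np.linalg.solve(G + lam * np.eye(n), G)
    np.fill_diagonal(C, 0.0)  # a frame must not explain itself
    return C

def cluster_from_affinity(C, tau=0.5):
    """Self-adaptive clustering: connected components of the thresholded
    symmetric affinity |C| + |C|^T. The number of clusters emerges from
    the data rather than being specified in advance."""
    A = np.abs(C) + np.abs(C).T
    A = A / (A.max() + 1e-12)          # normalize to [0, 1]
    n = A.shape[0]
    labels = -np.ones(n, dtype=int)
    k = 0
    for i in range(n):                 # depth-first component search
        if labels[i] >= 0:
            continue
        stack, labels[i] = [i], k
        while stack:
            j = stack.pop()
            for m in np.nonzero(A[j] > tau)[0]:
                if labels[m] < 0:
                    labels[m] = k
                    stack.append(m)
        k += 1
    return labels

def clustered_attention(text_q, frames, labels):
    """Attend from text queries to cluster centroids instead of all frames,
    reducing the attention map from (words, frames) to (words, clusters)."""
    cents = np.stack([frames[labels == c].mean(0) for c in np.unique(labels)])
    scores = text_q @ cents.T / np.sqrt(frames.shape[1])
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w = w / w.sum(-1, keepdims=True)   # softmax over clusters
    return w @ cents                   # (n_queries, dim)
```

With redundant consecutive frames, near-duplicates reconstruct each other with large coefficients, so they collapse into one cluster and the downstream attention touches far fewer keys than a full frame-word attention would.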
Pages: 15