Self-expressive induced clustered attention for video-text retrieval

Cited: 0
Authors
Zhu, Jingxuan [1]
Shen, Xiangjun [1]
Mehta, Sumet [1]
Abeo, Timothy Apasiba [2]
Zhan, Yongzhao [1,3,4]
Affiliations
[1] Jiangsu Univ, Sch Comp Sci & Commun Engn, Zhenjiang 212013, Jiangsu, Peoples R China
[2] Tamale Tech Univ, Sch Appl Sci, Tamale, Ghana
[3] Jiangsu Univ, Jiangsu Engn Res Ctr Big Data Ubiquitous Percept &, Zhenjiang 212013, Jiangsu, Peoples R China
[4] Jiangsu Univ, Prov Key Lab Computat Intelligence & New Technol L, Zhenjiang 212013, Jiangsu, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Video-text retrieval; Self-attention; Video embedding; Self-expressive cluster;
DOI
10.1007/s00530-024-01549-9
CLC Classification
TP [Automation Technology, Computer Technology];
Discipline Code
0812;
Abstract
Extensive research has shown that self-attention achieves impressive performance in video-text retrieval. However, most state-of-the-art methods neglect the intrinsic redundancy of videos caused by consecutive, similar frames, which makes it difficult for them to construct a well-defined fine-grained semantic space and limits retrieval performance. Moreover, current self-attention mechanisms exhibit high complexity when computing frame-word attention coefficients, which leads to high computational and storage costs when these attentions are employed for video-text retrieval. To solve these problems, we propose a new self-expressive induced clustered attention method for video-text retrieval. Unlike existing methods, we perform self-expressive induced clustering (SEIC) on video embeddings to mine well-defined fine-grained video semantic features. SEIC is a self-adaptive clustering method that does not require the number of clusters to be specified in advance; it captures well-defined fine-grained semantic features from video embeddings and reduces the redundancy of frame-level video content. We then propose a self-expressive induced clustered attention (SEICA) model, which enhances the quality of video embeddings while effectively reducing computational cost and saving storage resources. Finally, we apply this method to video-text retrieval tasks. Experimental results on several benchmark datasets, including MSVD, MSRVTT, ActivityNet and DiDeMo, demonstrate that the retrieval performance of the proposed method is superior to that of related state-of-the-art methods, with lower consumption of computing and storage resources.
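The abstract outlines two mechanisms: a self-expressive clustering step that groups redundant frame embeddings, and an attention step computed against the resulting clusters rather than all frames. The paper's own code is not reproduced here, so the following is only a minimal sketch of those two ideas under standard assumptions: the classical ridge-regularized self-expressive model Z ≈ CZ from subspace clustering, spectral clustering on the induced affinity, and word-to-centroid attention. All function names (seic_affinity, cluster_centroids, clustered_attention) and the fixed cluster count are illustrative inventions; the paper's SEIC chooses the number of clusters adaptively rather than taking it as a parameter.

    # Minimal sketch, NOT the authors' released method: classical
    # self-expressive subspace clustering plus attention over centroids.
    import numpy as np
    from sklearn.cluster import SpectralClustering

    def seic_affinity(Z: np.ndarray, lam: float = 0.1) -> np.ndarray:
        """Ridge-regularized self-expressive coding: solve
        min_C ||Z - C @ Z||_F^2 + lam * ||C||_F^2 in closed form,
        zero the diagonal so no frame explains itself, and return the
        symmetric affinity |C| + |C|.T used for clustering."""
        n = Z.shape[0]
        G = Z @ Z.T                                  # frame-frame Gram matrix
        C = np.linalg.solve(G + lam * np.eye(n), G)  # (G + lam I)^-1 G
        np.fill_diagonal(C, 0.0)
        return np.abs(C) + np.abs(C).T

    def cluster_centroids(Z: np.ndarray, affinity: np.ndarray,
                          n_clusters: int) -> np.ndarray:
        """Spectral clustering on the self-expressive affinity, then
        mean-pool each non-empty cluster into a single centroid token."""
        labels = SpectralClustering(
            n_clusters=n_clusters, affinity="precomputed").fit_predict(affinity)
        return np.stack([Z[labels == k].mean(axis=0)
                         for k in range(n_clusters) if np.any(labels == k)])

    def clustered_attention(words: np.ndarray,
                            centroids: np.ndarray) -> np.ndarray:
        """Word-to-centroid scaled dot-product attention. The cost is
        O(n_words * n_clusters) instead of O(n_words * n_frames), which
        is where the computational and storage savings come from."""
        scores = words @ centroids.T / np.sqrt(words.shape[-1])
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)
        return w @ centroids                         # attended feature per word

    # Toy usage: 64 frame embeddings, 12 word embeddings, 8 clusters.
    rng = np.random.default_rng(0)
    frames = rng.normal(size=(64, 256))
    words = rng.normal(size=(12, 256))
    video_feats = clustered_attention(
        words, cluster_centroids(frames, seic_affinity(frames), n_clusters=8))
    print(video_feats.shape)                         # (12, 256)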
Pages: 15
Related Papers
46 records in total
  • [1] A Novel Convolutional Architecture for Video-Text Retrieval
    Li, Zheng
    Guo, Caili
    Yang, Bo
    Feng, Zerun
    Zhang, Hao
    2020 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO (ICME), 2020,
  • [2] Deep learning for video-text retrieval: a review
    Zhu, Cunjuan
    Jia, Qi
    Chen, Wei
    Guo, Yanming
    Liu, Yu
    INTERNATIONAL JOURNAL OF MULTIMEDIA INFORMATION RETRIEVAL, 2023, 12 (01)
  • [3] Progressive Semantic Matching for Video-Text Retrieval
    Liu, Hongying
    Luo, Ruyi
    Shang, Fanhua
    Niu, Mantang
    Liu, Yuanyuan
    PROCEEDINGS OF THE 29TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2021, 2021, : 5083 - 5091
  • [4] Joint embeddings with multimodal cues for video-text retrieval
    Mithun, Niluthpol C.
    Li, Juncheng
    Metze, Florian
    Roy-Chowdhury, Amit K.
    INTERNATIONAL JOURNAL OF MULTIMEDIA INFORMATION RETRIEVAL, 2019, 8 (01) : 3 - 18
  • [5] Animating Images to Transfer CLIP for Video-Text Retrieval
    Liu, Yu
    Chen, Huai
    Huang, Lianghua
    Chen, Di
    Wang, Bin
    Pan, Pan
    Wang, Lisheng
    PROCEEDINGS OF THE 45TH INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL (SIGIR '22), 2022, : 1906 - 1911
  • [6] HANet: Hierarchical Alignment Networks for Video-Text Retrieval
    Wu, Peng
    He, Xiangteng
    Tang, Mingqian
    Lv, Yiliang
    Liu, Jing
    PROCEEDINGS OF THE 29TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2021, 2021, : 3518 - 3527
  • [7] VideoCLIP: A Cross-Attention Model for Fast Video-Text Retrieval Task with Image CLIP
    Li, Yikang
    Hsiao, Jenhao
    Ho, Chiuman
    PROCEEDINGS OF THE 2022 INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL, ICMR 2022, 2022, : 29 - 33
  • [8] Multi-Feature Graph Attention Network for Cross-Modal Video-Text Retrieval
    Hao, Xiaoshuai
    Zhou, Yucan
    Wu, Dayan
    Zhang, Wanqian
    Li, Bo
    Wang, Weiping
    PROCEEDINGS OF THE 2021 INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL (ICMR '21), 2021, : 135 - 143