Self-expressive induced clustered attention for video-text retrieval

Times Cited: 0
Authors
Zhu, Jingxuan [1 ]
Shen, Xiangjun [1 ]
Mehta, Sumet [1 ]
Abeo, Timothy Apasiba [2 ]
Zhan, Yongzhao [1 ,3 ,4 ]
Affiliations
[1] Jiangsu Univ, Sch Comp Sci & Commun Engn, Zhenjiang 212013, Jiangsu, Peoples R China
[2] Tamale Tech Univ, Sch Appl Sci, Tamale, Ghana
[3] Jiangsu Univ, Jiangsu Engn Res Ctr Big Data Ubiquitous Percept &, Zhenjiang 212013, Jiangsu, Peoples R China
[4] Jiangsu Univ, Prov Key Lab Computat Intelligence & New Technol L, Zhenjiang 212013, Jiangsu, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Video-text retrieval; Self-attention; Video embedding; Self-expressive cluster;
DOI
10.1007/s00530-024-01549-9
Chinese Library Classification (CLC)
TP [Automation Technology, Computer Technology]
Discipline Classification Code
0812
Abstract
Extensive research has shown that self-attention achieves impressive performance in video-text retrieval. However, most state-of-the-art methods neglect the intrinsic redundancy in videos caused by consecutive, highly similar frames, which makes it difficult to construct a well-defined fine-grained semantic space and limits retrieval performance. Moreover, current self-attention mechanisms exhibit high complexity when computing frame-word attention coefficients, leading to a high cost in computational and storage resources when they are employed for video-text retrieval. To address these problems, we propose a new self-expressive induced clustered attention method for video-text retrieval. Unlike existing methods, we perform self-expressive induced clustering (SEIC) on video embeddings to mine well-defined fine-grained video semantic features. SEIC is a self-adaptive clustering method that requires no pre-specified number of clusters; it captures well-defined fine-grained semantic features from video embeddings and reduces frame-level redundancy in video content. We then propose a self-expressive induced clustered attention model (SEICA), which enhances the quality of video embeddings while effectively reducing computational cost and saving storage resources. Finally, we apply this method to video-text retrieval tasks. Experimental results on several benchmark datasets, including MSVD, MSRVTT, ActivityNet, and DiDeMo, demonstrate that the retrieval performance of the proposed method is superior to that of related state-of-the-art methods, with less consumption of computing and storage resources.
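The abstract describes two mechanisms: self-expressive clustering of frame embeddings with an automatically chosen cluster count, and attention computed against cluster centroids rather than every frame. The sketch below illustrates both under stated assumptions; the function names, the ridge-regularized closed-form solver, and the eigengap heuristic for choosing the cluster count are illustrative choices, not the authors' implementation.

```python
# Illustrative sketch (assumptions, not the paper's code): self-expressive
# clustering of frame embeddings + attention over cluster centroids.
import numpy as np


def self_expressive_coefficients(X, lam=0.1):
    """Closed-form solve of min_C ||X - C X||_F^2 + lam ||C||_F^2.

    X: (n_frames, d) frame embeddings. Each frame is expressed as a
    combination of the other frames; redundant near-duplicate frames
    receive large mutual coefficients.
    """
    n = X.shape[0]
    G = X @ X.T                                  # (n, n) Gram matrix
    C = np.linalg.solve(G + lam * np.eye(n), G)  # (G + lam I)^-1 G
    np.fill_diagonal(C, 0.0)                     # no self-representation
    return C


def tiny_kmeans(Z, k, iters=20, seed=0):
    """A small Lloyd-iteration k-means, sufficient for this sketch."""
    rng = np.random.default_rng(seed)
    centers = Z[rng.choice(len(Z), size=k, replace=False)]
    labels = np.zeros(len(Z), dtype=int)
    for _ in range(iters):
        labels = ((Z[:, None] - centers[None]) ** 2).sum(-1).argmin(1)
        for c in range(k):
            if (labels == c).any():
                centers[c] = Z[labels == c].mean(axis=0)
    return labels


def adaptive_spectral_clustering(C, max_k=8):
    """Cluster frames from the self-expressive affinity |C| + |C|^T.

    The cluster count is picked by the largest eigengap of the normalized
    graph Laplacian, so it need not be supplied in advance (the
    "self-adaptive" property the abstract claims).
    """
    A = 0.5 * (np.abs(C) + np.abs(C).T)
    d = A.sum(axis=1) + 1e-8
    L = np.eye(len(A)) - A / np.sqrt(np.outer(d, d))
    evals, evecs = np.linalg.eigh(L)             # ascending eigenvalues
    k = max(int(np.argmax(np.diff(evals[: max_k + 1]))) + 1, 2)
    emb = evecs[:, :k]
    emb /= np.linalg.norm(emb, axis=1, keepdims=True) + 1e-8
    return tiny_kmeans(emb, k), k


def clustered_attention(words, frames, labels, k):
    """Text-to-video attention over k centroids instead of n frames.

    The attention map shrinks from (n_words, n_frames) to (n_words, k),
    which is the computational/storage saving the abstract refers to.
    """
    cents = np.stack([frames[labels == c].mean(axis=0)
                      for c in range(k) if (labels == c).any()])
    scores = words @ cents.T / np.sqrt(words.shape[1])
    scores -= scores.max(axis=1, keepdims=True)  # numerically stable softmax
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)
    return attn @ cents                          # attended video context


# Toy usage: 30 frames forming 3 redundant "scenes", 5 word embeddings.
rng = np.random.default_rng(1)
frames = np.repeat(rng.normal(size=(3, 64)), 10, axis=0)
frames += 0.05 * rng.normal(size=frames.shape)   # per-frame jitter
words = rng.normal(size=(5, 64))

labels, k = adaptive_spectral_clustering(self_expressive_coefficients(frames))
context = clustered_attention(words, frames, labels, k)
print(k, context.shape)  # roughly 3 clusters; (5, 64) attended context
```

On the toy data, attention is computed against roughly 3 centroids rather than 30 frames, mirroring the redundancy reduction and cost saving the abstract attributes to SEIC/SEICA.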
Pages: 15