Self-expressive induced clustered attention for video-text retrieval

Cited by: 0
Authors
Zhu, Jingxuan [1 ]
Shen, Xiangjun [1 ]
Mehta, Sumet [1 ]
Abeo, Timothy Apasiba [2 ]
Zhan, Yongzhao [1 ,3 ,4 ]
Affiliations
[1] Jiangsu Univ, Sch Comp Sci & Commun Engn, Zhenjiang 212013, Jiangsu, Peoples R China
[2] Tamale Tech Univ, Sch Appl Sci, Tamale, Ghana
[3] Jiangsu Univ, Jiangsu Engn Res Ctr Big Data Ubiquitous Percept &, Zhenjiang 212013, Jiangsu, Peoples R China
[4] Jiangsu Univ, Prov Key Lab Computat Intelligence & New Technol L, Zhenjiang 212013, Jiangsu, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Video-text retrieval; Self-attention; Video embedding; Self-expressive cluster;
DOI
10.1007/s00530-024-01549-9
CLC number
TP [Automation technology, computer technology];
Discipline code
0812;
Abstract
Extensive research has shown that self-attention achieves impressive performance in video-text retrieval. However, most state-of-the-art methods neglect the intrinsic redundancy of videos caused by consecutive, highly similar frames, which makes it difficult to construct a well-defined fine-grained semantic space and limits retrieval performance. Moreover, current self-attention mechanisms exhibit high complexity when computing frame-word attention coefficients, which leads to high computational and storage costs when they are employed for video-text retrieval. To address these problems, we propose self-expressive induced clustered attention for video-text retrieval. Unlike existing methods, we perform self-expressive induced clustering (SEIC) on video embeddings to mine well-defined fine-grained video semantic features. SEIC is a self-adaptive clustering method that requires no preset number of clusters; it captures well-defined fine-grained semantic features from video embeddings and reduces redundancy in frame-level video content. We then propose a self-expressive induced clustered attention model (SEICA), which enhances the quality of video embeddings while effectively reducing computational cost and storage consumption. Finally, we apply this method to video-text retrieval tasks. Experimental results on several benchmark datasets, including MSVD, MSRVTT, ActivityNet, and DiDeMo, demonstrate that the proposed method outperforms related state-of-the-art methods while consuming fewer computing and storage resources.
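
To make the mechanism in the abstract concrete, below is a minimal PyTorch sketch, under stated assumptions, of the two ingredients it names: self-expressive clustering of frame embeddings (solving X ≈ CX in closed form and spectral-clustering the resulting affinity |C| + |C|ᵀ), and cross-attention from word embeddings to cluster centroids rather than to every frame. The function names (seic_coefficients, frame_clusters, clustered_attention) and the fixed cluster count K are illustrative assumptions, not the authors' code; in particular, the paper's SEIC chooses the number of clusters adaptively, which this sketch does not reproduce.

```python
# Illustrative sketch only -- not the authors' implementation.
import torch
import torch.nn.functional as F

def seic_coefficients(X, lam=0.1):
    # X: (n_frames, d) frame embeddings.
    # Ridge-regularized self-expression min_C ||X - C X||_F^2 + lam ||C||_F^2
    # has the closed form C = X X^T (X X^T + lam I)^{-1}; the diagonal is
    # zeroed afterwards so no frame explains itself.
    n = X.shape[0]
    G = X @ X.T                                            # Gram matrix, (n, n)
    C = G @ torch.linalg.inv(G + lam * torch.eye(n))
    C.fill_diagonal_(0.0)
    return C

def frame_clusters(C, n_clusters):
    # Spectral clustering on the symmetric affinity |C| + |C|^T:
    # k-means over the bottom eigenvectors of the normalized Laplacian.
    A = C.abs() + C.abs().T
    deg = A.sum(dim=1)
    L = torch.eye(A.shape[0]) - A / torch.sqrt(deg[:, None] * deg[None, :] + 1e-8)
    _, vecs = torch.linalg.eigh(L)                         # eigenvalues ascending
    emb = vecs[:, :n_clusters]
    centers = emb[torch.randperm(emb.shape[0])[:n_clusters]].clone()
    for _ in range(10):                                    # a few k-means rounds
        labels = torch.cdist(emb, centers).argmin(dim=1)
        for k in range(n_clusters):
            if (labels == k).any():
                centers[k] = emb[labels == k].mean(dim=0)
    return labels

def clustered_attention(words, frames, labels, n_clusters):
    # Cross-attention from word embeddings (n_words, d) to the K cluster
    # centroids instead of all n_frames frames (assumes every cluster is
    # non-empty): cost O(n_words * K) instead of O(n_words * n_frames).
    d = frames.shape[1]
    centroids = torch.stack([frames[labels == k].mean(dim=0)
                             for k in range(n_clusters)])  # (K, d)
    attn = F.softmax(words @ centroids.T / d ** 0.5, dim=-1)
    return attn @ centroids                                # (n_words, d)
```

A typical call order under these assumptions: C = seic_coefficients(frames), labels = frame_clusters(C, K), fused = clustered_attention(words, frames, labels, K). With K ≪ n_frames, the word-to-video attention cost drops from O(n_words · n_frames) to O(n_words · K), which is the resource saving the abstract claims.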
Pages: 15
Related papers
46 in total
  • [11] Uncertainty-Aware with Negative Samples for Video-Text Retrieval
    Song, Weitao
    Chen, Weiran
    Xu, Jialiang
    Ji, Yi
    Li, Ying
    Liu, Chunping
    PATTERN RECOGNITION AND COMPUTER VISION, PT V, PRCV 2024, 2025, 15035 : 318 - 332
  • [12] Complementarity-Aware Space Learning for Video-Text Retrieval
    Zhu, Jinkuan
    Zeng, Pengpeng
    Gao, Lianli
    Li, Gongfu
    Liao, Dongliang
    Song, Jingkuan
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2023, 33 (08) : 4362 - 4374
  • [13] Semantic-Preserving Metric Learning for Video-Text Retrieval
    Choo, Sungkwon
    Ha, Seong Jong
    Lee, Joonsoo
    2021 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING (ICIP), 2021, : 2388 - 2392
  • [14] Robust Video-Text Retrieval Via Noisy Pair Calibration
    Zhang, Huaiwen
    Yang, Yang
    Qi, Fan
    Qian, Shengsheng
    Xu, Changsheng
    IEEE TRANSACTIONS ON MULTIMEDIA, 2023, 25 : 8632 - 8645
  • [15] Expert-guided contrastive learning for video-text retrieval
    Lee, Jewook
    Lee, Pilhyeon
    Park, Sungho
    Byun, Hyeran
    NEUROCOMPUTING, 2023, 536 : 50 - 58
  • [16] Mask to Reconstruct: Cooperative Semantics Completion for Video-text Retrieval
    Fang, Han
    Yang, Zhifei
    Zang, Xianghao
    Ban, Chao
    He, Zhongjiang
    Sun, Hao
    Zhou, Lanxiang
    PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023, : 3847 - 3856
  • [17] Reliable Phrase Feature Mining for Hierarchical Video-Text Retrieval
    Lai, Huakai
    Yang, Wenfei
    Zhang, Tianzhu
    Zhang, Yongdong
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2024, 34 (11) : 12019 - 12031
  • [18] Adaptive Token Excitation with Negative Selection for Video-Text Retrieval
    Yu, Juntao
    Ni, Zhangkai
    Su, Taiyi
    Wang, Hanli
    ARTIFICIAL NEURAL NETWORKS AND MACHINE LEARNING, ICANN 2023, PT VII, 2023, 14260 : 349 - 361
  • [19] LSECA: local semantic enhancement and cross aggregation for video-text retrieval
    Wang, Zhiwen
    Zhang, Donglin
    Hu, Zhikai
    INTERNATIONAL JOURNAL OF MULTIMEDIA INFORMATION RETRIEVAL, 2024, 13 (03)
  • [20] Boosting Video-Text Retrieval with Explicit High-Level Semantics
    Wang, Haoran
    Xu, Di
    He, Dongliang
    Li, Fu
    Ji, Zhong
    Han, Jungong
    Ding, Errui
    PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2022, 2022, : 4887 - 4898