Self-expressive induced clustered attention for video-text retrieval

Cited by: 0
Authors
Zhu, Jingxuan [1 ]
Shen, Xiangjun [1 ]
Mehta, Sumet [1 ]
Abeo, Timothy Apasiba [2 ]
Zhan, Yongzhao [1 ,3 ,4 ]
Affiliations
[1] Jiangsu Univ, Sch Comp Sci & Commun Engn, Zhenjiang 212013, Jiangsu, Peoples R China
[2] Tamale Tech Univ, Sch Appl Sci, Tamale, Ghana
[3] Jiangsu Univ, Jiangsu Engn Res Ctr Big Data Ubiquitous Percept &, Zhenjiang 212013, Jiangsu, Peoples R China
[4] Jiangsu Univ, Prov Key Lab Computat Intelligence & New Technol L, Zhenjiang 212013, Jiangsu, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Video-text retrieval; Self-attention; Video embedding; Self-expressive cluster;
DOI
10.1007/s00530-024-01549-9
CLC number
TP [Automation technology, computer technology];
Discipline code
0812;
Abstract
Extensive research has shown that self-attention achieves impressive performance in video-text retrieval. However, most state-of-the-art methods neglect the intrinsic redundancy of videos caused by consecutive, highly similar frames, which makes it difficult to construct a well-defined fine-grained semantic space and limits retrieval performance. Moreover, current self-attention mechanisms exhibit high complexity when computing frame-word attention coefficients, which leads to high computational and storage costs when they are employed for video-text retrieval. To address these problems, we propose self-expressive induced clustered attention for video-text retrieval. Unlike existing methods, we perform self-expressive induced clustering (SEIC) on video embeddings to mine well-defined fine-grained video semantic features. SEIC is a self-adaptive clustering method that requires no preset number of clusters; it captures well-defined fine-grained semantic features from video embeddings and reduces redundancy in frame-level video content. We then propose a self-expressive induced clustered attention model (SEICA), which enhances the quality of video embeddings while effectively reducing computational cost and storage consumption. Finally, we apply this method to video-text retrieval tasks. Experimental results on several benchmark datasets, including MSVD, MSRVTT, ActivityNet, and DiDeMo, demonstrate that the proposed method outperforms related state-of-the-art methods while consuming fewer computing and storage resources.
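
To make the mechanism in the abstract concrete, below is a minimal PyTorch sketch, under stated assumptions, of the two ingredients it names: self-expressive clustering of frame embeddings (solving X ≈ CX in closed form and spectral-clustering the resulting affinity |C| + |C|ᵀ), and cross-attention from word embeddings to cluster centroids rather than to every frame. The function names (seic_coefficients, frame_clusters, clustered_attention) and the fixed cluster count K are illustrative assumptions, not the authors' code; in particular, the paper's SEIC chooses the number of clusters adaptively, which this sketch does not reproduce.

```python
# Illustrative sketch only -- not the authors' implementation.
import torch
import torch.nn.functional as F

def seic_coefficients(X, lam=0.1):
    # X: (n_frames, d) frame embeddings.
    # Ridge-regularized self-expression min_C ||X - C X||_F^2 + lam ||C||_F^2
    # has the closed form C = X X^T (X X^T + lam I)^{-1}; the diagonal is
    # zeroed afterwards so no frame explains itself.
    n = X.shape[0]
    G = X @ X.T                                            # Gram matrix, (n, n)
    C = G @ torch.linalg.inv(G + lam * torch.eye(n))
    C.fill_diagonal_(0.0)
    return C

def frame_clusters(C, n_clusters):
    # Spectral clustering on the symmetric affinity |C| + |C|^T:
    # k-means over the bottom eigenvectors of the normalized Laplacian.
    A = C.abs() + C.abs().T
    deg = A.sum(dim=1)
    L = torch.eye(A.shape[0]) - A / torch.sqrt(deg[:, None] * deg[None, :] + 1e-8)
    _, vecs = torch.linalg.eigh(L)                         # eigenvalues ascending
    emb = vecs[:, :n_clusters]
    centers = emb[torch.randperm(emb.shape[0])[:n_clusters]].clone()
    for _ in range(10):                                    # a few k-means rounds
        labels = torch.cdist(emb, centers).argmin(dim=1)
        for k in range(n_clusters):
            if (labels == k).any():
                centers[k] = emb[labels == k].mean(dim=0)
    return labels

def clustered_attention(words, frames, labels, n_clusters):
    # Cross-attention from word embeddings (n_words, d) to the K cluster
    # centroids instead of all n_frames frames (assumes every cluster is
    # non-empty): cost O(n_words * K) instead of O(n_words * n_frames).
    d = frames.shape[1]
    centroids = torch.stack([frames[labels == k].mean(dim=0)
                             for k in range(n_clusters)])  # (K, d)
    attn = F.softmax(words @ centroids.T / d ** 0.5, dim=-1)
    return attn @ centroids                                # (n_words, d)
```

A typical call order under these assumptions: C = seic_coefficients(frames), labels = frame_clusters(C, K), fused = clustered_attention(words, frames, labels, K). With K ≪ n_frames, the word-to-video attention cost drops from O(n_words · n_frames) to O(n_words · K), which is the resource saving the abstract claims.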
Pages: 15
Related papers
46 in total
  • [11] Uncertainty-Aware with Negative Samples for Video-Text Retrieval
    Song, Weitao
    Chen, Weiran
    Xu, Jialiang
    Ji, Yi
    Li, Ying
    Liu, Chunping
    PATTERN RECOGNITION AND COMPUTER VISION, PT V, PRCV 2024, 2025, 15035 : 318 - 332
  • [12] Complementarity-Aware Space Learning for Video-Text Retrieval
    Zhu, Jinkuan
    Zeng, Pengpeng
    Gao, Lianli
    Li, Gongfu
    Liao, Dongliang
    Song, Jingkuan
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2023, 33 (08) : 4362 - 4374
  • [13] Semantic-Preserving Metric Learning for Video-Text Retrieval
    Choo, Sungkwon
    Ha, Seong Jong
    Lee, Joonsoo
    2021 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING (ICIP), 2021, : 2388 - 2392
  • [14] Robust Video-Text Retrieval Via Noisy Pair Calibration
    Zhang, Huaiwen
    Yang, Yang
    Qi, Fan
    Qian, Shengsheng
    Xu, Changsheng
    IEEE TRANSACTIONS ON MULTIMEDIA, 2023, 25 : 8632 - 8645
  • [15] Expert-guided contrastive learning for video-text retrieval
    Lee, Jewook
    Lee, Pilhyeon
    Park, Sungho
    Byun, Hyeran
    NEUROCOMPUTING, 2023, 536 : 50 - 58
  • [16] Mask to Reconstruct: Cooperative Semantics Completion for Video-text Retrieval
    Fang, Han
    Yang, Zhifei
    Zang, Xianghao
    Ban, Chao
    He, Zhongjiang
    Sun, Hao
    Zhou, Lanxiang
    PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023, : 3847 - 3856
  • [17] Reliable Phrase Feature Mining for Hierarchical Video-Text Retrieval
    Lai, Huakai
    Yang, Wenfei
    Zhang, Tianzhu
    Zhang, Yongdong
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2024, 34 (11) : 12019 - 12031
  • [18] Adaptive Token Excitation with Negative Selection for Video-Text Retrieval
    Yu, Juntao
    Ni, Zhangkai
    Su, Taiyi
    Wang, Hanli
    ARTIFICIAL NEURAL NETWORKS AND MACHINE LEARNING, ICANN 2023, PT VII, 2023, 14260 : 349 - 361
  • [19] LSECA: local semantic enhancement and cross aggregation for video-text retrieval
    Wang, Zhiwen
    Zhang, Donglin
    Hu, Zhikai
    INTERNATIONAL JOURNAL OF MULTIMEDIA INFORMATION RETRIEVAL, 2024, 13 (03)
  • [20] Boosting Video-Text Retrieval with Explicit High-Level Semantics
    Wang, Haoran
    Xu, Di
    He, Dongliang
    Li, Fu
    Ji, Zhong
    Han, Jungong
    Ding, Errui
    PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2022, 2022, : 4887 - 4898