Self-expressive induced clustered attention for video-text retrieval

Times Cited: 0
Authors
Zhu, Jingxuan [1 ]
Shen, Xiangjun [1 ]
Mehta, Sumet [1 ]
Abeo, Timothy Apasiba [2 ]
Zhan, Yongzhao [1 ,3 ,4 ]
Affiliations
[1] Jiangsu Univ, Sch Comp Sci & Commun Engn, Zhenjiang 212013, Jiangsu, Peoples R China
[2] Tamale Tech Univ, Sch Appl Sci, Tamale, Ghana
[3] Jiangsu Univ, Jiangsu Engn Res Ctr Big Data Ubiquitous Percept &, Zhenjiang 212013, Jiangsu, Peoples R China
[4] Jiangsu Univ, Prov Key Lab Computat Intelligence & New Technol L, Zhenjiang 212013, Jiangsu, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Video-text retrieval; Self-attention; Video embedding; Self-expressive cluster;
DOI
10.1007/s00530-024-01549-9
Chinese Library Classification (CLC)
TP [Automation Technology, Computer Technology]
Discipline Classification Code
0812
Abstract
Extensive research has shown that self-attention achieves impressive performance in video-text retrieval. However, most state-of-the-art methods neglect the intrinsic redundancy in videos caused by consecutive, highly similar frames, which makes it difficult to construct a well-defined fine-grained semantic space and limits retrieval performance. Moreover, current self-attention mechanisms exhibit high complexity when computing frame-word attention coefficients, leading to a high cost in computational and storage resources when they are employed for video-text retrieval. To address these problems, we propose a new self-expressive induced clustered attention method for video-text retrieval. Unlike existing methods, we perform self-expressive induced clustering (SEIC) on video embeddings to mine well-defined fine-grained video semantic features. SEIC is a self-adaptive clustering method that requires no pre-specified number of clusters; it captures well-defined fine-grained semantic features from video embeddings and reduces frame-level redundancy in video content. We then propose a self-expressive induced clustered attention model (SEICA), which enhances the quality of video embeddings while effectively reducing computational cost and saving storage resources. Finally, we apply this method to video-text retrieval tasks. Experimental results on several benchmark datasets, including MSVD, MSRVTT, ActivityNet, and DiDeMo, demonstrate that the retrieval performance of the proposed method is superior to that of related state-of-the-art methods, with less consumption of computing and storage resources.
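The abstract describes two mechanisms: self-expressive clustering of frame embeddings with an automatically chosen cluster count, and attention computed against cluster centroids rather than every frame. The sketch below illustrates both under stated assumptions; the function names, the ridge-regularized closed-form solver, and the eigengap heuristic for choosing the cluster count are illustrative choices, not the authors' implementation.

```python
# Illustrative sketch (assumptions, not the paper's code): self-expressive
# clustering of frame embeddings + attention over cluster centroids.
import numpy as np


def self_expressive_coefficients(X, lam=0.1):
    """Closed-form solve of min_C ||X - C X||_F^2 + lam ||C||_F^2.

    X: (n_frames, d) frame embeddings. Each frame is expressed as a
    combination of the other frames; redundant near-duplicate frames
    receive large mutual coefficients.
    """
    n = X.shape[0]
    G = X @ X.T                                  # (n, n) Gram matrix
    C = np.linalg.solve(G + lam * np.eye(n), G)  # (G + lam I)^-1 G
    np.fill_diagonal(C, 0.0)                     # no self-representation
    return C


def tiny_kmeans(Z, k, iters=20, seed=0):
    """A small Lloyd-iteration k-means, sufficient for this sketch."""
    rng = np.random.default_rng(seed)
    centers = Z[rng.choice(len(Z), size=k, replace=False)]
    labels = np.zeros(len(Z), dtype=int)
    for _ in range(iters):
        labels = ((Z[:, None] - centers[None]) ** 2).sum(-1).argmin(1)
        for c in range(k):
            if (labels == c).any():
                centers[c] = Z[labels == c].mean(axis=0)
    return labels


def adaptive_spectral_clustering(C, max_k=8):
    """Cluster frames from the self-expressive affinity |C| + |C|^T.

    The cluster count is picked by the largest eigengap of the normalized
    graph Laplacian, so it need not be supplied in advance (the
    "self-adaptive" property the abstract claims).
    """
    A = 0.5 * (np.abs(C) + np.abs(C).T)
    d = A.sum(axis=1) + 1e-8
    L = np.eye(len(A)) - A / np.sqrt(np.outer(d, d))
    evals, evecs = np.linalg.eigh(L)             # ascending eigenvalues
    k = max(int(np.argmax(np.diff(evals[: max_k + 1]))) + 1, 2)
    emb = evecs[:, :k]
    emb /= np.linalg.norm(emb, axis=1, keepdims=True) + 1e-8
    return tiny_kmeans(emb, k), k


def clustered_attention(words, frames, labels, k):
    """Text-to-video attention over k centroids instead of n frames.

    The attention map shrinks from (n_words, n_frames) to (n_words, k),
    which is the computational/storage saving the abstract refers to.
    """
    cents = np.stack([frames[labels == c].mean(axis=0)
                      for c in range(k) if (labels == c).any()])
    scores = words @ cents.T / np.sqrt(words.shape[1])
    scores -= scores.max(axis=1, keepdims=True)  # numerically stable softmax
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)
    return attn @ cents                          # attended video context


# Toy usage: 30 frames forming 3 redundant "scenes", 5 word embeddings.
rng = np.random.default_rng(1)
frames = np.repeat(rng.normal(size=(3, 64)), 10, axis=0)
frames += 0.05 * rng.normal(size=frames.shape)   # per-frame jitter
words = rng.normal(size=(5, 64))

labels, k = adaptive_spectral_clustering(self_expressive_coefficients(frames))
context = clustered_attention(words, frames, labels, k)
print(k, context.shape)  # roughly 3 clusters; (5, 64) attended context
```

On the toy data, attention is computed against roughly 3 centroids rather than 30 frames, mirroring the redundancy reduction and cost saving the abstract attributes to SEIC/SEICA.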
Pages: 15