Multi-Level Cross-Modal Semantic Alignment Network for Video-Text Retrieval

Cited by: 2
Authors
Nian, Fudong [1 ,2 ]
Ding, Ling [1 ]
Hu, Yuxia [2 ]
Gu, Yanhong [1 ]
Affiliations
[1] Hefei Univ, Sch Adv Mfg Engn, Hefei 230601, Peoples R China
[2] Anhui Jianzhu Univ, Anhui Int Joint Res Ctr Ancient Architecture Inte, Hefei 230601, Peoples R China
Keywords
video-text retrieval; multi-level space learning; cross-modal similarity calculation; image
DOI
10.3390/math10183346
Chinese Library Classification (CLC)
O1 [Mathematics]
Discipline Code(s)
0701; 070101
Abstract
This paper strives to improve the performance of video-text retrieval. To date, many algorithms have been proposed to measure video-text similarity, progressing from a single global semantic level to multi-level semantics. However, these methods suffer from the following limitations: (1) they largely ignore relationship semantics, so the modeled semantic levels are insufficient; (2) constraining the real-valued features of different modalities to lie in the same space solely through feature distance measurement is incomplete; and (3) they fail to handle the heavily imbalanced distributions of attribute labels across semantic levels. To overcome these limitations, this paper proposes a novel multi-level cross-modal semantic alignment network (MCSAN) for video-text retrieval that jointly models video-text similarity at the global, entity, action, and relationship semantic levels in a unified deep model. Specifically, both video and text are first decomposed into global, entity, action, and relationship semantic levels by carefully designed spatial-temporal semantic learning structures. Then, KLDivLoss and a cross-modal parameter-shared attribute projection layer serve as statistical constraints, ensuring that representations from different modalities at each semantic level are projected into a common semantic space. In addition, a novel focal binary cross-entropy (FBCE) loss function is presented, which is the first effort to model the imbalanced attribute distribution problem for video-text retrieval. MCSAN effectively exploits the complementary information among the four semantic levels. Extensive experiments on two challenging video-text retrieval datasets, MSR-VTT and VATEX, demonstrate the viability of our method.
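The abstract names two training objectives that can be sketched concretely: the FBCE loss for imbalanced attribute labels and the KLDivLoss-based statistical constraint over the shared attribute projection. Below is a minimal PyTorch sketch assuming FBCE follows the standard focal-loss weighting of per-attribute binary cross-entropy and that the KL constraint compares attribute distributions from the parameter-shared projection layer; the function names, the hyperparameters alpha and gamma, and the symmetric form of the KL term are illustrative assumptions, not the paper's exact formulation.

import torch
import torch.nn.functional as F

def focal_bce_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Sketch of a focal binary cross-entropy (FBCE) loss.

    Applies focal-loss weighting to per-attribute BCE so that abundant,
    easily classified attribute labels are down-weighted and rare labels
    contribute more to the gradient. Shapes: (batch, num_attributes);
    targets is a 0/1 multi-hot float tensor.
    """
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    probs = torch.sigmoid(logits)
    # p_t: probability the model assigns to the true label of each element.
    p_t = probs * targets + (1.0 - probs) * (1.0 - targets)
    # alpha_t: class-balance weight for positive vs. negative labels.
    alpha_t = alpha * targets + (1.0 - alpha) * (1.0 - targets)
    return (alpha_t * (1.0 - p_t) ** gamma * bce).mean()

def kl_alignment_loss(video_logits, text_logits):
    """Sketch of the KLDivLoss statistical constraint (assumed symmetric).

    video_logits and text_logits come from the same parameter-shared
    attribute projection layer; a symmetric KL divergence pulls the two
    modalities' attribute distributions toward each other.
    """
    log_v = F.log_softmax(video_logits, dim=-1)
    log_t = F.log_softmax(text_logits, dim=-1)
    kl_vt = F.kl_div(log_v, log_t, reduction="batchmean", log_target=True)
    kl_tv = F.kl_div(log_t, log_v, reduction="batchmean", log_target=True)
    return 0.5 * (kl_vt + kl_tv)

In training, terms like these would typically be added, per semantic level, to the main retrieval (feature-distance) objective; the weighting between them is not specified in the abstract.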
Pages: 19
Related Papers (50 total)
  • [41] Jin, Peng; Huang, Jinfa; Xiong, Pengfei; Tian, Shangxuan; Liu, Chang; Ji, Xiangyang; Yuan, Li; Chen, Jie. Video-Text as Game Players: Hierarchical Banzhaf Interaction for Cross-Modal Representation Learning. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023: 2472-2482.
  • [42] Zhao, W.; Zhou, D.; Cao, B.; Zhang, K.; Chen, J. Adversarial Modality Alignment Network for Cross-Modal Molecule Retrieval. IEEE Transactions on Artificial Intelligence, 2024, 5(1): 278-289.
  • [43] Zhang, Gengyuan; Ren, Jisen; Gu, Jindong; Tresp, Volker. Multi-event Video-Text Retrieval. 2023 IEEE/CVF International Conference on Computer Vision (ICCV 2023), 2023: 22056-22066.
  • [44] Wang, Wenzhuang; Di, Xiaoguang; Liu, Maozhen; Gao, Feng. Multi-level Symmetric Semantic Alignment Network for Image-Text Matching. Neurocomputing, 2024, 599.
  • [45] Liu, Zejun; Chen, Fanglin; Xu, Jun; Pei, Wenjie; Lu, Guangming. Image-Text Retrieval with Cross-Modal Semantic Importance Consistency. IEEE Transactions on Circuits and Systems for Video Technology, 2023, 33(5): 2465-2476.
  • [46] Li, Zheng; Guo, Caili; Wang, Xin; Zhang, Hao; Hu, Lin. Multi-view Visual Semantic Embedding for Cross-Modal Image-Text Retrieval. Pattern Recognition, 2025, 159.
  • [47] Wang, Hongfei; Feng, Aimin; Liu, Xuejun. Information Aggregation Semantic Adversarial Network for Cross-Modal Retrieval. 2022 International Joint Conference on Neural Networks (IJCNN), 2022.
  • [48] Song, Xue; Chen, Jingjing; Jiang, Yu-Gang. Relation Triplet Construction for Cross-Modal Text-to-Video Retrieval. Proceedings of the 31st ACM International Conference on Multimedia (MM 2023), 2023: 4759-4767.
  • [49] Zou, Zhuoyang; Zhu, Xinghui; Zhu, Qinying; Zhang, Hongyan; Zhu, Lei. Disambiguity and Alignment: An Effective Multi-Modal Alignment Method for Cross-Modal Recipe Retrieval. Foods, 2024, 13(11).
  • [50] Fragomeni, Adriano; Wray, Michael; Damen, Dima. ConTra: (Con)text (Tra)nsformer for Cross-Modal Video Retrieval. Computer Vision - ACCV 2022, Part IV, 2023, 13844: 451-468.