Multi-Level Cross-Modal Semantic Alignment Network for Video-Text Retrieval

Cited: 2
Authors
Nian, Fudong [1 ,2 ]
Ding, Ling [1 ]
Hu, Yuxia [2 ]
Gu, Yanhong [1 ]
Affiliations
[1] Hefei Univ, Sch Adv Mfg Engn, Hefei 230601, Peoples R China
[2] Anhui Jianzhu Univ, Anhui Int Joint Res Ctr Ancient Architecture Inte, Hefei 230601, Peoples R China
Keywords
video-text retrieval; multi-level space learning; cross-modal similarity calculation; IMAGE;
DOI
10.3390/math10183346
Chinese Library Classification
O1 [Mathematics];
Discipline Code
0701 ; 070101 ;
Abstract
This paper strives to improve the performance of video-text retrieval. To date, many algorithms have been proposed to facilitate the similarity measure of video-text retrieval, progressing from a single global semantic to multi-level semantics. However, these methods may suffer from the following limitations: (1) they largely ignore relationship semantics, so the modeled semantic levels are insufficient; (2) constraining the real-valued features of different modalities into the same space solely through feature distance measurement is incomplete; (3) they fail to handle the heavily imbalanced distributions of attribute labels across semantic levels. To overcome these limitations, this paper proposes a novel multi-level cross-modal semantic alignment network (MCSAN) for video-text retrieval that jointly models video-text similarity at the global, entity, action and relationship semantic levels in a unified deep model. Specifically, both video and text are first decomposed into global, entity, action and relationship semantic levels by carefully designed spatial-temporal semantic learning structures. Then, we utilize KLDivLoss and a cross-modal parameter-sharing attribute projection layer as statistical constraints to ensure that representations from different modalities at different semantic levels are projected into a common semantic space. In addition, a novel focal binary cross-entropy (FBCE) loss function is presented, which is the first effort to model the imbalanced attribute distribution problem for video-text retrieval. MCSAN effectively exploits the complementary information among the four semantic levels. Extensive experiments on two challenging video-text retrieval datasets, namely MSR-VTT and VATEX, show the viability of our method.
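The abstract's focal binary cross-entropy (FBCE) loss is not spelled out here; as a hedged sketch, it can be read as the standard focal-loss weighting applied per attribute to a multi-label binary cross-entropy, which down-weights easy examples so that rare attribute labels contribute more to the gradient. All names and default values below (`focal_bce`, `gamma`, `alpha`) are illustrative assumptions, not the paper's exact formulation.

```python
import math

def focal_bce(probs, labels, gamma=2.0, alpha=0.25):
    """Sketch of a focal binary cross-entropy over multi-label attributes.

    probs  -- predicted probabilities in (0, 1), one per attribute
    labels -- ground-truth binary labels (0 or 1), one per attribute
    gamma  -- focusing parameter; gamma=0 recovers (alpha-weighted) plain BCE
    alpha  -- class-balance weight for the positive class
    """
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, 1e-7), 1.0 - 1e-7)  # clamp for numerical stability
        if y == 1:
            # well-classified positives (p near 1) are down-weighted by (1-p)^gamma
            total += -alpha * (1.0 - p) ** gamma * math.log(p)
        else:
            # well-classified negatives (p near 0) are down-weighted by p^gamma
            total += -(1.0 - alpha) * p ** gamma * math.log(1.0 - p)
    return total / len(probs)
```

With `gamma=0` and `alpha=1` this reduces to plain binary cross-entropy on the positives; increasing `gamma` shrinks the loss on confident correct predictions, which is the mechanism the abstract invokes for handling imbalanced attribute distributions.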
Pages: 19
Related Papers (50 total)
  • [31] Unsupervised multi-perspective fusing semantic alignment for cross-modal hashing retrieval
    Chen, Yongfeng
    Tan, Junpeng
    Yang, Zhijing
    Shi, Yukai
    Qin, Jinghui
    MULTIMEDIA TOOLS AND APPLICATIONS, 2024, 83 (23) : 63993 - 64014
  • [32] Cross-Modal Image-Text Retrieval with Semantic Consistency
    Chen, Hui
    Ding, Guiguang
    Lin, Zijin
    Zhao, Sicheng
    Han, Jungong
    PROCEEDINGS OF THE 27TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA (MM'19), 2019, : 1749 - 1757
  • [33] Cross-Modal Interaction Network for Video Moment Retrieval
    Ping, Shen
    Jiang, Xiao
    Tian, Zean
    Cao, Ronghui
    Chi, Weiming
    Yang, Shenghong
    INTERNATIONAL JOURNAL OF PATTERN RECOGNITION AND ARTIFICIAL INTELLIGENCE, 2023, 37 (08)
  • [34] HANet: Hierarchical Alignment Networks for Video-Text Retrieval
    Wu, Peng
    He, Xiangteng
    Tang, Mingqian
    Lv, Yiliang
    Liu, Jing
    PROCEEDINGS OF THE 29TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2021, 2021, : 3518 - 3527
  • [35] Exploiting Visual Semantic Reasoning for Video-Text Retrieval
    Feng, Zerun
    Zeng, Zhimin
    Guo, Caili
    Li, Zheng
    PROCEEDINGS OF THE TWENTY-NINTH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, 2020, : 1005 - 1011
  • [36] Semantic-alignment transformer and adversary hashing for cross-modal retrieval
    Sun, Yajun
    Wang, Meng
    Ma, Ying
    APPLIED INTELLIGENCE, 2024, 54 (17-18) : 7581 - 7602
  • [37] Semi-supervised cross-modal retrieval with graph-based semantic alignment network
    Zhang, Lei
    Chen, Leiting
    Ou, Weihua
    Zhou, Chuan
    COMPUTERS & ELECTRICAL ENGINEERING, 2022, 102
  • [38] CRET: Cross-Modal Retrieval Transformer for Efficient Text-Video Retrieval
    Ji, Kaixiang
    Liu, Jiajia
    Hong, Weixiang
    Zhong, Liheng
    Wang, Jian
    Chen, Jingdong
    Chu, Wei
    PROCEEDINGS OF THE 45TH INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL (SIGIR '22), 2022, : 949 - 959
  • [39] Cross-modal alignment with graph reasoning for image-text retrieval
    Cui, Zheng
    Hu, Yongli
    Sun, Yanfeng
    Gao, Junbin
    Yin, Baocai
    MULTIMEDIA TOOLS AND APPLICATIONS, 2022, 81 (17) : 23615 - 23632