Complementarity is the king: Multi-modal and multi-grained hierarchical semantic enhancement network for cross-modal retrieval

Times Cited: 6
Authors
Pei, Xinlei [1 ,2 ]
Liu, Zheng [1 ,2 ]
Gao, Shanshan [1 ,2 ]
Su, Yijun [3 ]
Affiliations
[1] Shandong Univ Finance & Econ, Sch Comp Sci & Technol, Jinan 250014, Shandong, Peoples R China
[2] Shandong Univ Finance & Econ, Shandong Prov Key Lab Digital Media Technol, Jinan 250014, Shandong, Peoples R China
[3] Minzu Univ China, Sch Informat Engn, Beijing 100081, Peoples R China
Keywords
Cross-modal retrieval; Primary similarity; Auxiliary similarity; Semantic enhancement; Multi-spring balance loss;
DOI
10.1016/j.eswa.2022.119415
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Cross-modal retrieval takes a query from one modality to retrieve relevant results from another modality, and its key issue lies in how to learn the cross-modal similarity. Note that the complete semantic information of a specific concept is widely scattered over multi-modal and multi-grained data, and most existing methods cannot capture it thoroughly enough to learn the cross-modal similarity accurately. Therefore, we propose a Multi-modal and Multi-grained Hierarchical Semantic Enhancement network (M2HSE), which contains two stages to obtain more complete semantic information by fusing the complementarity in multi-modal and multi-grained data. In stage 1, two classes of cross-modal similarity (primary similarity and auxiliary similarity) are calculated more comprehensively in two subnetworks. In particular, the primary similarities from the two subnetworks are fused to perform the cross-modal retrieval, while the auxiliary similarity provides a valuable complement to the primary similarity. In stage 2, the multi-spring balance loss is proposed to optimize the cross-modal similarity more flexibly. Using this loss, the most representative samples are selected to establish the multi-spring balance system, which adaptively optimizes the cross-modal similarities until reaching the equilibrium state. Extensive experiments conducted on public benchmark datasets clearly demonstrate the effectiveness of the proposed method and show its competitive performance against state-of-the-art approaches.
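The abstract describes, but does not formalize, two technical ingredients: fusing the primary similarities produced by the two subnetworks, and a spring-inspired loss that adjusts similarities until an equilibrium state is reached. The sketch below is only a rough illustration of those two ideas under stated assumptions; the function names, the convex-combination fusion, and the quadratic "spring" penalties are hypothetical and do not reproduce the actual M2HSE formulation or the multi-spring balance loss.

```python
import torch


def fuse_primary_similarities(sim_global, sim_local, alpha=0.5):
    """Fuse the primary similarities produced by two subnetworks.

    sim_global, sim_local: (B, B) image-text similarity matrices,
    e.g. cosine similarities at two granularities. A plain convex
    combination is used here; the paper's actual fusion rule is not
    given in the abstract.
    """
    return alpha * sim_global + (1.0 - alpha) * sim_local


def spring_style_loss(sim, margin=0.2, k_pos=1.0, k_neg=1.0):
    """Spring-inspired ranking penalty (an analogy only, NOT the
    paper's multi-spring balance loss).

    The diagonal of `sim` holds matched image-text pairs; they are
    "pulled" toward similarity 1, while mismatched pairs are
    "pushed" below `margin`. Quadratic penalties play the role of
    spring forces growing with displacement from the rest position,
    so the loss is minimized at an equilibrium-like state.
    """
    b = sim.size(0)
    pos = sim.diag()                                    # matched pairs
    off_diag = ~torch.eye(b, dtype=torch.bool, device=sim.device)
    neg = sim[off_diag].view(b, b - 1)                  # mismatched pairs

    pull = k_pos * (1.0 - pos).clamp(min=0).pow(2).mean()
    push = k_neg * (neg - margin).clamp(min=0).pow(2).mean()
    return pull + push


if __name__ == "__main__":
    torch.manual_seed(0)
    img = torch.nn.functional.normalize(torch.randn(8, 128), dim=1)
    txt = torch.nn.functional.normalize(torch.randn(8, 128), dim=1)
    sim = fuse_primary_similarities(img @ txt.t(), img @ txt.t())
    print(spring_style_loss(sim))
```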
Pages: 21