Complementarity is the king: Multi-modal and multi-grained hierarchical semantic enhancement network for cross-modal retrieval

Cited by: 6
Authors
Pei, Xinlei [1 ,2 ]
Liu, Zheng [1 ,2 ]
Gao, Shanshan [1 ,2 ]
Su, Yijun [3 ]
Affiliations
[1] Shandong Univ Finance & Econ, Sch Comp Sci & Technol, Jinan 250014, Shandong, Peoples R China
[2] Shandong Univ Finance & Econ, Shandong Prov Key Lab Digital Media Technol, Jinan 250014, Shandong, Peoples R China
[3] Minzu Univ China, Sch Informat Engn, Beijing 100081, Peoples R China
Keywords
Cross-modal retrieval; Primary similarity; Auxiliary similarity; Semantic enhancement; Multi-spring balance loss
DOI
10.1016/j.eswa.2022.119415
Chinese Library Classification
TP18 [Theory of Artificial Intelligence]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Cross-modal retrieval takes a query from one modality and retrieves relevant results from another, and its key issue lies in how to learn the cross-modal similarity. The complete semantic information of a specific concept is widely scattered over multi-modal and multi-grained data, and most existing methods fail to capture it thoroughly enough to learn the cross-modal similarity accurately. Therefore, we propose a Multi-modal and Multi-grained Hierarchical Semantic Enhancement network (M2HSE), which contains two stages that obtain more complete semantic information by fusing the complementarity of multi-modal and multi-grained data. In stage 1, two classes of cross-modal similarity (primary similarity and auxiliary similarity) are computed more comprehensively in two subnetworks. In particular, the primary similarities from the two subnetworks are fused to perform cross-modal retrieval, while the auxiliary similarity provides a valuable complement to the primary similarity. In stage 2, a multi-spring balance loss is proposed to optimize the cross-modal similarity more flexibly. With this loss, the most representative samples are selected to establish a multi-spring balance system, which adaptively optimizes the cross-modal similarities until an equilibrium state is reached. Extensive experiments on public benchmark datasets demonstrate the effectiveness of the proposed method and show performance competitive with state-of-the-art approaches.
Pages: 21
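The abstract only sketches the multi-spring balance loss, so the following is a minimal, hypothetical PyTorch illustration of the general idea: treat each selected image-text pair as a spring whose rest length is a target similarity, and minimize the total elastic potential energy so that gradients act like restoring forces proportional to displacement. Everything here (the function name, rest lengths, stiffness constants, and the top-k hardest-spring selection as a stand-in for the paper's "most representative samples") is an assumption for illustration; the actual M2HSE formulation is defined in the paper itself.

```python
import torch

def spring_balance_loss(sim, pos_mask, rest_pos=0.9, rest_neg=0.2,
                        stiffness_pos=1.0, stiffness_neg=1.0, num_springs=5):
    """Elastic-potential ranking loss over a cross-modal similarity matrix.

    sim:      (B, B) similarities, e.g. image rows vs. text columns.
    pos_mask: (B, B) bool, True where row i and column j are relevant.
    Each pair is treated as a spring whose rest length is its target
    similarity; the loss is the total potential energy, which vanishes
    when every spring sits at its rest length (the equilibrium state).
    """
    pos = pos_mask.float()
    neg = 1.0 - pos
    # Hooke's law: displacement of each pair from its rest similarity.
    # Positive pairs are pulled up toward rest_pos, negatives pushed down
    # toward rest_neg; pairs already past their target exert no force.
    pos_disp = (rest_pos - sim).clamp(min=0.0) * pos
    neg_disp = (sim - rest_neg).clamp(min=0.0) * neg
    # Potential energy of each spring: 0.5 * k * x^2.
    energy = (0.5 * stiffness_pos * pos_disp.pow(2)
              + 0.5 * stiffness_neg * neg_disp.pow(2))
    # Keep only the most-stretched springs per query (hypothetical
    # proxy for selecting the most representative samples).
    hardest = energy.topk(min(num_springs, sim.size(1)), dim=1).values
    return hardest.mean()

# Toy usage: a batch of 8 image-text pairs where the diagonal is relevant.
scores = torch.randn(8, 8, requires_grad=True)
sim = scores.sigmoid()                      # similarities in (0, 1)
pos_mask = torch.eye(8, dtype=torch.bool)   # i-th image matches i-th text
loss = spring_balance_loss(sim, pos_mask)
loss.backward()                             # spring forces flow as gradients
```

Under this reading, "reaching the equilibrium state" corresponds to the loss approaching zero: every positive spring has contracted to at least its rest similarity and every negative spring has relaxed below its rest similarity, so no pair exerts any further force on the embedding.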