Geometric Matching for Cross-Modal Retrieval

Cited by: 4
Authors
Wang, Zheng [1, 2]
Gao, Zhenwei [3]
Yang, Yang [3]
Wang, Guoqing [3]
Jiao, Chengbo [3]
Shen, Heng Tao [1]
Affiliations
[1] Tongji Univ, Coll Elect & Informat Engn, Shanghai 201804, Peoples R China
[2] Univ Elect Sci & Technol China, Inst Elect & Informat Engn, Dongguan 523808, Guangdong, Peoples R China
[3] Univ Elect Sci & Technol China, Sch Comp Sci & Engn, Chengdu 611731, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Semantics; Geometry; Visualization; Measurement; Representation learning; Loss measurement; Vectors; Image-text matching; one-to-many correspondence; point-to-rectangle matching (P2RM); rectangle-to-rectangle matching (R2RM); video-text retrieval; IMAGE; TEXT;
DOI
10.1109/TNNLS.2024.3381347
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Despite its significant progress, cross-modal retrieval still suffers from one-to-many matching cases, where the multiplicity of semantic instances in another modality could be acquired by a given query. However, existing approaches usually map heterogeneous data into the learned space as deterministic point vectors. In spite of their remarkable performance in matching the most similar instance, such deterministic point embedding suffers from the insufficient representation of rich semantics in one-to-many correspondence. To address the limitations, we intuitively extend a deterministic point into a closed geometry and develop geometric representation learning methods for cross-modal retrieval. Thus, a set of points inside such a geometry could be semantically related to many candidates, and we could effectively capture the semantic uncertainty. We then introduce two types of geometric matching for one-to-many correspondence, i.e., point-to-rectangle matching (dubbed P2RM) and rectangle-to-rectangle matching (termed R2RM). The former treats all retrieved candidates as rectangles with zero volume (equivalent to points) and the query as a box, while the latter encodes all heterogeneous data into rectangles. Therefore, we could evaluate semantic similarity among heterogeneous data by the Euclidean distance from a point to a rectangle or the volume of intersection between two rectangles. Additionally, both strategies could be easily employed for off-the-shelf approaches and further improve the retrieval performance of baselines. Under various evaluation metrics, extensive experiments and ablation studies on several commonly used datasets, two for image-text matching and two for video-text retrieval, demonstrate our effectiveness and superiority.
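The two geometric quantities named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes axis-aligned boxes described by a lower and an upper corner, and the function names `point_to_box_distance` and `box_intersection_volume` are hypothetical. How the paper parameterizes its rectangles or turns these quantities into a training loss is not stated in this record.

```python
import math
from typing import Sequence


def point_to_box_distance(
    point: Sequence[float], lower: Sequence[float], upper: Sequence[float]
) -> float:
    """Euclidean distance from a point to an axis-aligned box.

    Assumption: the box is given by its lower and upper corners.
    The distance is 0.0 when the point lies inside or on the box.
    """
    squared = 0.0
    for p, lo, hi in zip(point, lower, upper):
        # Per-dimension gap: how far the coordinate falls outside [lo, hi].
        gap = max(lo - p, 0.0, p - hi)
        squared += gap * gap
    return math.sqrt(squared)


def box_intersection_volume(
    lower_a: Sequence[float],
    upper_a: Sequence[float],
    lower_b: Sequence[float],
    upper_b: Sequence[float],
) -> float:
    """Volume of the overlap between two axis-aligned boxes (0.0 if disjoint)."""
    volume = 1.0
    for la, ua, lb, ub in zip(lower_a, upper_a, lower_b, upper_b):
        side = min(ua, ub) - max(la, lb)
        if side <= 0.0:
            return 0.0  # no overlap along this dimension
        volume *= side
    return volume


# Point-to-rectangle: a candidate point scored against a query box.
d_outside = point_to_box_distance([4.0, 5.0], [0.0, 0.0], [1.0, 1.0])
d_inside = point_to_box_distance([0.5, 0.5], [0.0, 0.0], [1.0, 1.0])

# Rectangle-to-rectangle: overlap between two boxes.
v_overlap = box_intersection_volume([0.0, 0.0], [2.0, 2.0], [1.0, 1.0], [3.0, 3.0])
v_disjoint = box_intersection_volume([0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0])
```

In this sketch a smaller point-to-box distance or a larger intersection volume would indicate a closer match. A zero-volume box always has zero intersection volume, which is consistent with the abstract's use of a distance, not a volume, when candidates are points.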
Pages: 1-13
Number of pages: 13