MVItem: A Benchmark for Multi-View Cross-Modal Item Retrieval

Citations: 0
Authors
Li, Bo [1 ]
Zhu, Jiansheng [2 ]
Dai, Linlin [3 ]
Jing, Hui [3 ]
Huang, Zhizheng [3 ]
Sui, Yuteng [1 ]
Affiliations
[1] China Acad Railway Sci, Postgrad Dept, Beijing 100081, Peoples R China
[2] China Railway, Dept Sci Technol & Informat, Beijing 100844, Peoples R China
[3] China Acad Railway Sci Corp Ltd, Inst Comp Technol, Beijing 100081, Peoples R China
Source
IEEE ACCESS | 2024, Vol. 12
Keywords
Annotations; Benchmark testing; Text to image; Deep learning; Contrastive learning; Open source software; Cross-modal retrieval; deep learning; item retrieval; contrastive text-image pre-training model; multi-view
DOI
10.1109/ACCESS.2024.3447872
CLC number
TP [automation technology, computer technology]
Discipline code
0812
Abstract
Existing text-image pre-training models have demonstrated strong generalization capabilities; however, their item-retrieval performance in real-world scenarios still falls short of expectations. To optimize the performance of text-image pre-training models for item retrieval in real scenarios, we present MVItem, a benchmark for exploring multi-view item retrieval built on the open-source dataset MVImgNet. First, we evenly sample items in MVImgNet to obtain 5 images from different views and automatically annotate these images with MiniGPT-4. Then, through manual cleaning and comparison, we provide a high-quality textual description for each sample. Next, to investigate the spatial-misalignment problem of item retrieval in real-world scenarios and mitigate its impact on retrieval, we devise a multi-view feature-fusion strategy and propose a cosine-distance balancing method based on Sequential Least Squares Programming (SLSQP), named balancing cosine distance (BCD), to fuse multiple view vectors. On this basis, we select representative state-of-the-art text-image pre-training retrieval models as baselines and establish multiple test groups to explore how effectively multi-view information eases potential spatial misalignment in item retrieval. The experimental results show that retrieval with fused multi-view features generally outperforms the baselines, indicating that multi-view feature fusion helps alleviate the impact of spatial misalignment on item retrieval. Moreover, the proposed fusion method, balancing cosine distance (BCD), generally outperforms feature averaging, denoted as balancing Euclidean distance (BED) in this work. From the results, we find that fusing multiple images with different views is more helpful for text-to-image (T2I) retrieval, whereas fusing a small number of images with large differences in view is more helpful for image-to-image (I2I) retrieval.
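Note: as an illustration of the fusion described above, the Python sketch below shows one plausible reading of balancing cosine distance (BCD). The abstract does not give the exact objective, so this assumes BCD seeks a fused vector whose cosine distances to all view embeddings are equal, solved with SciPy's SLSQP optimizer; plain feature averaging stands in for the BED baseline. The helper names (fuse_bcd, imbalance) are hypothetical, not from the paper.

# Hypothetical sketch of BCD fusion: find a fused vector at balanced
# (equal) cosine distance from every view embedding, using SLSQP.
import numpy as np
from scipy.optimize import minimize

def cosine_distance(a, b):
    """1 - cosine similarity between two vectors."""
    return 1.0 - float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def fuse_bcd(views):
    """Fuse an (n_views, dim) array of view embeddings into one vector whose
    cosine distances to all views are balanced (assumed BCD objective)."""
    bed = views.mean(axis=0)  # feature averaging = BED baseline, used as init

    def imbalance(v):
        # Variance of the cosine distances from v to each view:
        # zero when v is equidistant (in cosine terms) from all views.
        d = np.array([cosine_distance(v, x) for x in views])
        return float(((d - d.mean()) ** 2).sum())

    res = minimize(imbalance, x0=bed, method="SLSQP")
    return res.x / np.linalg.norm(res.x)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    views = rng.normal(size=(5, 64))  # 5 views; small dim keeps the demo fast
    fused = fuse_bcd(views)
    print([round(cosine_distance(fused, x), 4) for x in views])

On toy embeddings this prints near-identical distances from the fused vector to each view, whereas the BED average typically sits closer to some views than others; that gap is the behavior the paper's BCD-versus-BED comparison probes.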
Pages: 119563-119576
Number of pages: 14
Related papers
50 records in total
  • [21] Cross-Modal and Multi-Attribute Face Recognition: A Benchmark
    Lin, Feng
    Fu, Kaiqiang
    Luo, Hao
    Zhan, Ziyue
    Wang, Zhibo
    Liu, Zhenguang
    Cavallaro, Lorenzo
    Ren, Kui
    PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023, : 271 - 279
  • [22] Adversarial Cross-Modal Retrieval
    Wang, Bokun
    Yang, Yang
    Xu, Xing
    Hanjalic, Alan
    Shen, Heng Tao
    PROCEEDINGS OF THE 2017 ACM MULTIMEDIA CONFERENCE (MM'17), 2017, : 154 - 162
  • [23] HCMSL: Hybrid Cross-modal Similarity Learning for Cross-modal Retrieval
    Zhang, Chengyuan
    Song, Jiayu
    Zhu, Xiaofeng
    Zhu, Lei
    Zhang, Shichao
    ACM TRANSACTIONS ON MULTIMEDIA COMPUTING COMMUNICATIONS AND APPLICATIONS, 2021, 17 (01)
  • [24] Multi-grained Representation Learning for Cross-modal Retrieval
    Zhao, Shengwei
    Xu, Linhai
    Liu, Yuying
    Du, Shaoyi
    PROCEEDINGS OF THE 46TH INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL, SIGIR 2023, 2023, : 2194 - 2198
  • [25] A Cross-Modal Multi-View Self-Supervised Heterogeneous Graph Network for Personalized Food Recommendation
    Song Y.
    Yang X.
    Xu C.
    Jisuanji Fuzhu Sheji Yu Tuxingxue Xuebao/Journal of Computer-Aided Design and Computer Graphics, 2023, 35 (03): : 413 - 422
  • [26] MSegNet: A Multi-View Coupled Cross-Modal Attention Model for Enhanced MRI Brain Tumor Segmentation
    Wang, Yu
    Xu, Juan
    Guan, Yucheng
    Ahmad, Faizan
    Mahmood, Tariq
    Rehman, Amjad
    INTERNATIONAL JOURNAL OF COMPUTATIONAL INTELLIGENCE SYSTEMS, 2025, 18 (01)
  • [27] Adversarial Graph Attention Network for Multi-modal Cross-modal Retrieval
    Wu, Hongchang
    Guan, Ziyu
    Zhi, Tao
    Zhao, Wei
    Xu, Cai
    Han, Hong
    Yang, Yaming
    2019 10TH IEEE INTERNATIONAL CONFERENCE ON BIG KNOWLEDGE (ICBK 2019), 2019, : 265 - 272
  • [28] Multi-Modal Relational Graph for Cross-Modal Video Moment Retrieval
    Zeng, Yawen
    Cao, Da
    Wei, Xiaochi
    Liu, Meng
    Zhao, Zhou
    Qin, Zheng
    2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2021, 2021, : 2215 - 2224
  • [29] STXD: Structural and Temporal Cross-Modal Distillation for Multi-View 3D Object Detection
    Jang, Sujin
    Jo, Dae Ung
    Hwang, Sung Ju
    Lee, Dongwook
    Ji, Daehyun
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023), 2023,
  • [30] A Multi-modal & Multi-view & Interactive Benchmark Dataset for Human Action Recognition
    Xu, Ning
    Liu, Anan
    Nie, Weizhi
    Wong, Yongkang
    Li, Fuwu
    Su, Yuting
    MM'15: PROCEEDINGS OF THE 2015 ACM MULTIMEDIA CONFERENCE, 2015, : 1195 - 1198