MVItem: A Benchmark for Multi-View Cross-Modal Item Retrieval

被引:0
|
作者
Li, Bo [1 ]
Zhu, Jiansheng [2 ]
Dai, Linlin [3 ]
Jing, Hui [3 ]
Huang, Zhizheng [3 ]
Sui, Yuteng [1 ]
机构
[1] China Acad Railway Sci, Postgrad Dept, Beijing 100081, Peoples R China
[2] China Railway, Dept Sci Technol & Informat, Beijing 100844, Peoples R China
[3] China Acad Railway Sci Corp Ltd, Inst Comp Technol, Beijing 100081, Peoples R China
来源
IEEE ACCESS | 2024年 / 12卷
关键词
Annotations; Benchmark testing; Text to image; Deep learning; Contrastive learning; Open source software; Cross-model retrieval; deep learning; item retrieval; contrastive text-image pre-training model; multi-view;
D O I
10.1109/ACCESS.2024.3447872
中图分类号
TP [自动化技术、计算机技术];
学科分类号
0812 ;
摘要
The existing text-image pre-training models have demonstrated strong generalization capabilities, however, their performance of item retrieval in real-world scenarios still falls short of expectations. In order to optimize the performance of text-image pre-training model to retrieve items in real scenarios, we present a benchmark called MVItem for exploring multi-view item retrieval based on the open-source dataset MVImgNet. Firstly, we evenly sample items in MVImgNet to obtain 5 images from different views, and automatically annotate this images based on MiniGPT-4. Subsequently, through manual cleaning and comparison, we present a high-quality textual description for each sample. Then, in order to investigate the spatial misalignment problem of item retrieval in real-world scenarios and mitigate the impact of spatial misalignment on retrieval, we devise a multi-view feature fusion strategy and propose a cosine distance balancing method based on Sequential Least Squares Programming (SLSQP) to achieve the fusion of multiple view vectors, namely balancing cosine distance(BCD). On this basis, we select the representative state-of-the-art text-image pre-training retrieval models as baselines, and establish multiple test groups to explore the effectiveness of multi-view information on item retrieval to easing potential spatial misalignment. The experimental results show that the retrieval of fusing multi-view features is generally better than that of the baseline, indicating that multi-view feature fusion is helpful to alleviate the impact of spatial misalignment on item retrieval. Moreover, the proposed feature fusion, balancing cosine distance(BCD), is generally better than that of feature averaging, denoted as balancing euclidean distance(BED) in this work. At the results, we find that the fusion of multiple images with different views is more helpful for text-to-image (T2I) retrieval, and the fusion of a small number of images with large differences in views is more helpful for image-to-image (I2I) retrieval.
引用
收藏
页码:119563 / 119576
页数:14
相关论文
共 50 条
  • [31] A semi-supervised cross-modal memory bank for cross-modal retrieval
    Huang, Yingying
    Hu, Bingliang
    Zhang, Yipeng
    Gao, Chi
    Wang, Quan
    NEUROCOMPUTING, 2024, 579
  • [32] Cross-Modal Center Loss for 3D Cross-Modal Retrieval
    Jing, Longlong
    Vahdani, Elahe
    Tan, Jiaxing
    Tian, Yingli
    2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2021, 2021, : 3141 - 3150
  • [33] Soft Contrastive Cross-Modal Retrieval
    Song, Jiayu
    Hu, Yuxuan
    Zhu, Lei
    Zhang, Chengyuan
    Zhang, Jian
    Zhang, Shichao
    APPLIED SCIENCES-BASEL, 2024, 14 (05):
  • [34] Probabilistic Embeddings for Cross-Modal Retrieval
    Chun, Sanghyuk
    Oh, Seong Joon
    de Rezende, Rafael Sampaio
    Kalantidis, Yannis
    Larlus, Diane
    2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2021, 2021, : 8411 - 8420
  • [35] Cross-modal Retrieval with Correspondence Autoencoder
    Feng, Fangxiang
    Wang, Xiaojie
    Li, Ruifan
    PROCEEDINGS OF THE 2014 ACM CONFERENCE ON MULTIMEDIA (MM'14), 2014, : 7 - 16
  • [36] Cross-modal retrieval with dual optimization
    Xu, Qingzhen
    Liu, Shuang
    Qiao, Han
    Li, Miao
    MULTIMEDIA TOOLS AND APPLICATIONS, 2023, 82 (05) : 7141 - 7157
  • [37] Geometric Matching for Cross-Modal Retrieval
    Wang, Zheng
    Gao, Zhenwei
    Yang, Yang
    Wang, Guoqing
    Jiao, Chengbo
    Shen, Heng Tao
    IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2024, : 1 - 13
  • [38] CROSS-MODAL RETRIEVAL WITH NOISY LABELS
    Mandal, Devraj
    Biswas, Soma
    2020 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING (ICIP), 2020, : 2326 - 2330
  • [39] Cross-Modal Retrieval for CPSS Data
    Zhong, Fangming
    Wang, Guangze
    Chen, Zhikui
    Xia, Feng
    Min, Geyong
    IEEE ACCESS, 2020, 8 : 16689 - 16701
  • [40] Hashing for Cross-Modal Similarity Retrieval
    Liu, Yao
    Yuan, Yanhong
    Huang, Qiaoli
    Huang, Zhixing
    2015 11TH INTERNATIONAL CONFERENCE ON SEMANTICS, KNOWLEDGE AND GRIDS (SKG), 2015, : 1 - 8