Cross-Modal Retrieval Augmentation for Multi-Modal Classification

Cited by: 0
Authors
Gur, Shir [1,3]
Neverova, Natalia [2 ]
Stauffer, Chris [2 ]
Lim, Ser-Nam [2 ]
Kiela, Douwe [2 ]
Reiter, Austin [2 ]
Affiliations
[1] Tel Aviv University, Tel Aviv, Israel
[2] Facebook AI, Menlo Park, CA, USA
[3] FAIR, Menlo Park, CA, USA
Keywords
KNOWLEDGE; LANGUAGE
DOI
Not available
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Recent advances in using retrieval components over external knowledge sources have shown impressive results for a variety of downstream tasks in natural language processing. Here, we explore the use of unstructured external knowledge sources consisting of images and their corresponding captions for improving visual question answering (VQA). First, we train a novel alignment model for embedding images and captions in the same space, which achieves substantial improvements in image-caption retrieval performance relative to similar methods. Second, we show that retrieval-augmented multi-modal transformers using the trained alignment model improve results on VQA over strong baselines. We further conduct extensive experiments to establish the promise of this approach, and examine novel inference-time applications such as hot-swapping indices.
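The abstract describes retrieving captions from an external index that shares an embedding space with images, and swapping that index at inference time without retraining. The following is a minimal illustrative sketch, not the authors' code: it assumes pre-computed embeddings from an aligned image-caption encoder, and the names CaptionIndex and augment_vqa_input are hypothetical.

```python
# Minimal sketch (assumed design, not the paper's implementation):
# cosine-similarity retrieval of captions for a query image embedding,
# with an index object that can be hot-swapped at inference time.
import numpy as np


class CaptionIndex:
    """Stores L2-normalized caption embeddings alongside their caption strings."""

    def __init__(self, caption_embeddings: np.ndarray, captions: list):
        # Normalize so that a dot product equals cosine similarity.
        norms = np.linalg.norm(caption_embeddings, axis=1, keepdims=True)
        self.embeddings = caption_embeddings / np.clip(norms, 1e-12, None)
        self.captions = captions

    def retrieve(self, image_embedding: np.ndarray, k: int = 5) -> list:
        # Image and caption embeddings are assumed to live in the same aligned space.
        query = image_embedding / max(np.linalg.norm(image_embedding), 1e-12)
        scores = self.embeddings @ query
        top_k = np.argsort(-scores)[:k]
        return [self.captions[i] for i in top_k]


def augment_vqa_input(question: str, image_embedding: np.ndarray,
                      index: CaptionIndex, k: int = 5) -> str:
    # The index is a plain argument, so it can be replaced ("hot-swapped")
    # at inference time without touching the alignment or VQA models.
    retrieved = index.retrieve(image_embedding, k)
    # Retrieved captions are concatenated with the question before being fed
    # to the multi-modal transformer; the actual fusion strategy is model-specific.
    return question + " [SEP] " + " [SEP] ".join(retrieved)
```

Keeping retrieval behind a small interface like this is what makes index hot-swapping cheap: only the stored caption embeddings change, while the encoders and the downstream transformer stay fixed.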
Pages: 111-123
Page count: 13
Related Papers
50 in total
• [41] Jing, Longlong; Vahdani, Elahe; Tan, Jiaxing; Tian, Yingli. Cross-Modal Center Loss for 3D Cross-Modal Retrieval. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2021), 2021: 3141-3150.
• [42] Wei, Yuhong; An, Junfeng. Flexible Dual Multi-Modal Hashing for Incomplete Multi-Modal Retrieval. International Journal of Image and Graphics, 2024.
• [43] Yu, Yang; Zhu, Hongqing; Qian, Tianwei; Hou, Tong; Huang, Bingcang. Multi-Task Collaboration for Cross-Modal Generation and Multi-Modal Ophthalmic Diseases Diagnosis. IET Image Processing, 2025, 19(01).
• [44] Song, Xin; Chen, Zhikui; Zhong, Fangming; Gao, Jing; Zhang, Jianning; Li, Peng. Multi-Level Cross-Modal Interactive-Network-Based Semi-Supervised Multi-Modal Ship Classification. Sensors, 2024, 24(22).
• [45] Zhu, Lei; Song, Jiayu; Wei, Xiangxiang; Yu, Hao; Long, Jun. CAESAR: Concept Augmentation Based Semantic Representation for Cross-Modal Retrieval. Multimedia Tools and Applications, 2022, 81(24): 34213-34243.
• [47] Zhou, Ziqi; Guo, Xinna; Yang, Wanqi; Shi, Yinghuan; Zhou, Luping; Wang, Lei; Yang, Ming. Cross-Modal Attention-Guided Convolutional Network for Multi-Modal Cardiac Segmentation. Machine Learning in Medical Imaging (MLMI 2019), 2019, 11861: 601-610.
• [48] Liang, Jingjun; Li, Ruichen; Jin, Qin. Semi-Supervised Multi-Modal Emotion Recognition with Cross-Modal Distribution Matching. MM '20: Proceedings of the 28th ACM International Conference on Multimedia, 2020: 2852-2861.
• [49] Li, Jiaxin; Chen, Houjin; Peng, Yahui; Li, Yanfeng. Multi-Modal Pulmonary Mass Segmentation Network Based on Cross-Modal Spatial Alignment. Journal of Electronics & Information Technology, 2022, 44(01): 11-17.
• [50] Shi, Mengqi; Cao, Haozhi; Xie, Lihua; Yang, Jianfei. Adversarial Cross-Modal Domain Adaptation for Multi-Modal Semantic Segmentation in Autonomous Driving. 2022 17th International Conference on Control, Automation, Robotics and Vision (ICARCV), 2022: 850-855.