Cross-Modal Retrieval Augmentation for Multi-Modal Classification

Cited by: 0
Authors
Gur, Shir [1,3]
Neverova, Natalia [2 ]
Stauffer, Chris [2 ]
Lim, Ser-Nam [2 ]
Kiela, Douwe [2 ]
Reiter, Austin [2 ]
Affiliations
[1] Tel Aviv University, Tel Aviv, Israel
[2] Facebook AI, Menlo Park, CA, USA
[3] FAIR, Menlo Park, CA, USA
Keywords
KNOWLEDGE; LANGUAGE
DOI
Not available
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Recent advances in using retrieval components over external knowledge sources have shown impressive results for a variety of downstream tasks in natural language processing. Here, we explore the use of unstructured external knowledge sources consisting of images and their corresponding captions for improving visual question answering (VQA). First, we train a novel alignment model for embedding images and captions in the same space, which achieves substantial improvements in image-caption retrieval performance relative to similar methods. Second, we show that retrieval-augmented multi-modal transformers using the trained alignment model improve results on VQA over strong baselines. We further conduct extensive experiments to establish the promise of this approach, and examine novel inference-time applications such as hot-swapping indices.
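The abstract describes retrieving captions from an external index that shares an embedding space with images, and swapping that index at inference time without retraining. The following is a minimal illustrative sketch, not the authors' code: it assumes pre-computed embeddings from an aligned image-caption encoder, and the names CaptionIndex and augment_vqa_input are hypothetical.

```python
# Minimal sketch (assumed design, not the paper's implementation):
# cosine-similarity retrieval of captions for a query image embedding,
# with an index object that can be hot-swapped at inference time.
import numpy as np


class CaptionIndex:
    """Stores L2-normalized caption embeddings alongside their caption strings."""

    def __init__(self, caption_embeddings: np.ndarray, captions: list):
        # Normalize so that a dot product equals cosine similarity.
        norms = np.linalg.norm(caption_embeddings, axis=1, keepdims=True)
        self.embeddings = caption_embeddings / np.clip(norms, 1e-12, None)
        self.captions = captions

    def retrieve(self, image_embedding: np.ndarray, k: int = 5) -> list:
        # Image and caption embeddings are assumed to live in the same aligned space.
        query = image_embedding / max(np.linalg.norm(image_embedding), 1e-12)
        scores = self.embeddings @ query
        top_k = np.argsort(-scores)[:k]
        return [self.captions[i] for i in top_k]


def augment_vqa_input(question: str, image_embedding: np.ndarray,
                      index: CaptionIndex, k: int = 5) -> str:
    # The index is a plain argument, so it can be replaced ("hot-swapped")
    # at inference time without touching the alignment or VQA models.
    retrieved = index.retrieve(image_embedding, k)
    # Retrieved captions are concatenated with the question before being fed
    # to the multi-modal transformer; the actual fusion strategy is model-specific.
    return question + " [SEP] " + " [SEP] ".join(retrieved)
```

Keeping retrieval behind a small interface like this is what makes index hot-swapping cheap: only the stored caption embeddings change, while the encoders and the downstream transformer stay fixed.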
Pages: 111-123
Page count: 13
Related Papers
50 in total
• [41] Jing, Longlong; Vahdani, Elahe; Tan, Jiaxing; Tian, Yingli. Cross-Modal Center Loss for 3D Cross-Modal Retrieval. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2021), 2021: 3141-3150.
• [42] Wei, Yuhong; An, Junfeng. Flexible Dual Multi-Modal Hashing for Incomplete Multi-Modal Retrieval. International Journal of Image and Graphics, 2024.
• [43] Yu, Yang; Zhu, Hongqing; Qian, Tianwei; Hou, Tong; Huang, Bingcang. Multi-Task Collaboration for Cross-Modal Generation and Multi-Modal Ophthalmic Diseases Diagnosis. IET Image Processing, 2025, 19(01).
• [44] Song, Xin; Chen, Zhikui; Zhong, Fangming; Gao, Jing; Zhang, Jianning; Li, Peng. Multi-Level Cross-Modal Interactive-Network-Based Semi-Supervised Multi-Modal Ship Classification. Sensors, 2024, 24(22).
• [45] Zhu, Lei; Song, Jiayu; Wei, Xiangxiang; Yu, Hao; Long, Jun. CAESAR: Concept Augmentation Based Semantic Representation for Cross-Modal Retrieval. Multimedia Tools and Applications, 2022, 81(24): 34213-34243.
• [47] Zhou, Ziqi; Guo, Xinna; Yang, Wanqi; Shi, Yinghuan; Zhou, Luping; Wang, Lei; Yang, Ming. Cross-Modal Attention-Guided Convolutional Network for Multi-Modal Cardiac Segmentation. Machine Learning in Medical Imaging (MLMI 2019), 2019, 11861: 601-610.
• [48] Liang, Jingjun; Li, Ruichen; Jin, Qin. Semi-Supervised Multi-Modal Emotion Recognition with Cross-Modal Distribution Matching. MM '20: Proceedings of the 28th ACM International Conference on Multimedia, 2020: 2852-2861.
• [49] Li, Jiaxin; Chen, Houjin; Peng, Yahui; Li, Yanfeng. Multi-Modal Pulmonary Mass Segmentation Network Based on Cross-Modal Spatial Alignment. Journal of Electronics & Information Technology, 2022, 44(01): 11-17.
• [50] Shi, Mengqi; Cao, Haozhi; Xie, Lihua; Yang, Jianfei. Adversarial Cross-Modal Domain Adaptation for Multi-Modal Semantic Segmentation in Autonomous Driving. 2022 17th International Conference on Control, Automation, Robotics and Vision (ICARCV), 2022: 850-855.