Vision-and-Language or Vision-for-Language? On Cross-Modal Influence in Multimodal Transformers

被引：0

作者：

Frank, Stella ^{[1
]}

Bugliarello, Emanuele ^{[2
]}

Elliott, Desmond ^{[2
]}

机构：

[1] Univ Trento, Trento, Italy

[2] Univ Copenhagen, Copenhagen, Denmark

来源：

2021 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP 2021) | 2021年

基金：

欧盟地平线“2020”;

关键词：

D O I：

暂无

中图分类号：

TP18 [人工智能理论];

学科分类号：

081104 ; 0812 ; 0835 ; 1405 ;

摘要：

Pretrained vision-and-language BERTs aim to learn representations that combine information from both modalities. We propose a diagnostic method based on cross-modal input ablation to assess the extent to which these models actually integrate cross-modal information. This method involves ablating inputs from one modality, either entirely or selectively based on cross-modal grounding alignments, and evaluating the model prediction performance on the other modality. Model performance is measured by modality-specific tasks that mirror the model pretraining objectives (e.g. masked language modelling for text). Models that have learned to construct cross-modal representations using both modalities are expected to perform worse when inputs are missing from a modality. We find that recently proposed models have much greater relative difficulty predicting text when visual information is ablated, compared to predicting visual object categories when text is ablated, indicating that these models are not symmetrically cross-modal.

引用

页码：9847 / 9857

页数：11

共 29 条

[1] Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering
Anderson, Peter
He, Xiaodong
Buehler, Chris
Teney, Damien
Johnson, Mark
Gould, Stephen
Zhang, Lei
[J]. 2018 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2018, : 6077 - 6086
[2] [Anonymous], 1935, Logik der Forschung-Zur Erkenntnistheorie der modernen Naturwissenschaft
[3] Analysis Methods in Neural Language Processing: A Survey
Belinkov, Yonatan
Glass, James
[J]. TRANSACTIONS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, 2019, 7 : 49 - 72
[4] Bugliarello Emanuele, 2021, T ASS COMPUTATIONAL
[5] Cao J., 2020, P EUR C COMP VIS, P565
[6] Devlin J, 2019, 2019 CONFERENCE OF THE NORTH AMERICAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: HUMAN LANGUAGE TECHNOLOGIES (NAACL HLT 2019), VOL. 1, P4171
[7] The Pascal Visual Object Classes (VOC) Challenge
Everingham, Mark
Van Gool, Luc
Williams, Christopher K. I.
Winn, John
Zisserman, Andrew
[J]. INTERNATIONAL JOURNAL OF COMPUTER VISION, 2010, 88 (02) : 303 - 338
[8] Fernandes P., 2021, Long Papers, V1, P6467
[9] Hendricks LA, 2021, FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, ACL-IJCNLP 2021, P3635
[10] Visualisation and 'Diagnostic Classifiers' Reveal how Recurrent and Recursive Neural Networks Process Hierarchical Structure
Hupkes, Dieuwke
Veldhoen, Sara
Zuidema, Willem
[J]. JOURNAL OF ARTIFICIAL INTELLIGENCE RESEARCH, 2018, 61 : 907 - 926

← 1 2 3 →