Vision-and-Language or Vision-for-Language? On Cross-Modal Influence in Multimodal Transformers

被引:0
作者
Frank, Stella [1 ]
Bugliarello, Emanuele [2 ]
Elliott, Desmond [2 ]
机构
[1] Univ Trento, Trento, Italy
[2] Univ Copenhagen, Copenhagen, Denmark
来源
2021 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP 2021) | 2021年
基金
欧盟地平线“2020”;
关键词
D O I
暂无
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
Pretrained vision-and-language BERTs aim to learn representations that combine information from both modalities. We propose a diagnostic method based on cross-modal input ablation to assess the extent to which these models actually integrate cross-modal information. This method involves ablating inputs from one modality, either entirely or selectively based on cross-modal grounding alignments, and evaluating the model prediction performance on the other modality. Model performance is measured by modality-specific tasks that mirror the model pretraining objectives (e.g. masked language modelling for text). Models that have learned to construct cross-modal representations using both modalities are expected to perform worse when inputs are missing from a modality. We find that recently proposed models have much greater relative difficulty predicting text when visual information is ablated, compared to predicting visual object categories when text is ablated, indicating that these models are not symmetrically cross-modal.
引用
收藏
页码:9847 / 9857
页数:11
相关论文
共 29 条
  • [1] Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering
    Anderson, Peter
    He, Xiaodong
    Buehler, Chris
    Teney, Damien
    Johnson, Mark
    Gould, Stephen
    Zhang, Lei
    [J]. 2018 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2018, : 6077 - 6086
  • [2] [Anonymous], 1935, Logik der Forschung-Zur Erkenntnistheorie der modernen Naturwissenschaft
  • [3] Analysis Methods in Neural Language Processing: A Survey
    Belinkov, Yonatan
    Glass, James
    [J]. TRANSACTIONS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, 2019, 7 : 49 - 72
  • [4] Bugliarello Emanuele, 2021, T ASS COMPUTATIONAL
  • [5] Cao J., 2020, P EUR C COMP VIS, P565
  • [6] Devlin J, 2019, 2019 CONFERENCE OF THE NORTH AMERICAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: HUMAN LANGUAGE TECHNOLOGIES (NAACL HLT 2019), VOL. 1, P4171
  • [7] The Pascal Visual Object Classes (VOC) Challenge
    Everingham, Mark
    Van Gool, Luc
    Williams, Christopher K. I.
    Winn, John
    Zisserman, Andrew
    [J]. INTERNATIONAL JOURNAL OF COMPUTER VISION, 2010, 88 (02) : 303 - 338
  • [8] Fernandes P., 2021, Long Papers, V1, P6467
  • [9] Hendricks LA, 2021, FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, ACL-IJCNLP 2021, P3635
  • [10] Visualisation and 'Diagnostic Classifiers' Reveal how Recurrent and Recursive Neural Networks Process Hierarchical Structure
    Hupkes, Dieuwke
    Veldhoen, Sara
    Zuidema, Willem
    [J]. JOURNAL OF ARTIFICIAL INTELLIGENCE RESEARCH, 2018, 61 : 907 - 926