Word Representation Learning in Multimodal Pre-Trained Transformers: An Intrinsic Evaluation

Cited by: 4
Authors
Pezzelle, Sandro [1]
Takmaz, Ece [1]
Fernandez, Raquel [1]
Affiliations
[1] Univ Amsterdam, Inst Log Language & Computat, Amsterdam, Netherlands
Funding
European Research Council
Keywords
DISTRIBUTIONAL SEMANTICS; MODELS;
DOI
10.1162/tacl_a_00443
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
This study carries out a systematic intrinsic evaluation of the semantic representations learned by state-of-the-art pre-trained multimodal Transformers. These representations are claimed to be task-agnostic and shown to help on many downstream language-and-vision tasks. However, the extent to which they align with human semantic intuitions remains unclear. We experiment with various models and obtain static word representations from the contextualized ones they learn. We then evaluate them against the semantic judgments provided by human speakers. In line with previous evidence, we observe a generalized advantage of multimodal representations over language-only ones on concrete word pairs, but not on abstract ones. On the one hand, this confirms the effectiveness of these models in aligning language and vision, which results in better semantic representations for concepts that are grounded in images. On the other hand, models are shown to follow different representation learning patterns, which sheds some light on how and when they perform multimodal integration.
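The evaluation described in the abstract (deriving static word representations from contextualized ones, then comparing them with human semantic judgments) can be illustrated with a minimal sketch. Everything specific here is an assumption, not something this record states: averaging a word's contextualized vectors to get its static vector, scoring word pairs by cosine similarity, and measuring agreement with human ratings by Spearman rank correlation. The function names and toy data are hypothetical, and the toy vectors stand in for real model outputs.

```python
import math
from collections import defaultdict


def static_embeddings(contextual):
    """Average each word's contextualized vectors into one static vector.

    Assumption: mean pooling over contexts; the abstract does not say
    which aggregation is used.
    """
    sums = {}
    counts = defaultdict(int)
    for word, vec in contextual:
        if word not in sums:
            sums[word] = list(vec)
        else:
            sums[word] = [a + b for a, b in zip(sums[word], vec)]
        counts[word] += 1
    return {w: [x / counts[w] for x in s] for w, s in sums.items()}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation computed on ranks."""
    rx, ry = ranks(xs), ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy) if sx and sy else 0.0


def evaluate(contextual, human_pairs):
    """Correlate model similarities with human ratings over covered pairs."""
    emb = static_embeddings(contextual)
    model, human = [], []
    for w1, w2, rating in human_pairs:
        if w1 in emb and w2 in emb:  # skip pairs with a missing word
            model.append(cosine(emb[w1], emb[w2]))
            human.append(rating)
    return spearman(model, human)


# Hypothetical toy data: (word, contextualized vector) and (w1, w2, rating).
toy_contextual = [
    ("cat", [1.0, 0.0]), ("cat", [0.8, 0.2]),
    ("dog", [0.9, 0.1]),
    ("car", [0.0, 1.0]),
    ("idea", [0.5, 0.5]),
]
toy_pairs = [("cat", "dog", 9.0), ("cat", "car", 2.0), ("dog", "idea", 5.0)]

if __name__ == "__main__":
    print(evaluate(toy_contextual, toy_pairs))
```

Under these assumptions, the concrete-versus-abstract comparison mentioned in the abstract would amount to running `evaluate` separately on the concrete and the abstract word pairs and comparing the two correlations across models.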
Pages: 1563-1579
Page count: 17