Roles and Utilization of Attention Heads in Transformer-based Neural Language Models

Cited by: 0
Authors
Jo, Jae-young [1 ,2 ]
Myaeng, Sung-hyon [1 ]
Affiliations
[1] Korea Adv Inst Sci & Technol, Sch Comp, Daejeon, South Korea
[2] Dingbro AI Res, Daejeon, South Korea
Source
58TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2020) | 2020
Keywords
DOI
Not available
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Sentence encoders based on the transformer architecture have shown promising results on various natural language tasks. The main impetus lies in the pre-trained neural language models that capture long-range dependencies among words, owing to multi-head attention that is unique in the architecture. However, little is known about how linguistic properties are processed, represented, and utilized for downstream tasks among the hundreds of attention heads inside a pre-trained transformer-based model. With the initial goal of examining the roles of attention heads in handling a set of linguistic features, we conducted a set of experiments with ten probing tasks and three downstream tasks on four pre-trained transformer families (GPT, GPT2, BERT, and ELECTRA). Meaningful insights are revealed through heat-map visualization and used to propose a relatively simple sentence representation method that takes advantage of the most influential attention heads, resulting in additional performance improvements on the downstream tasks.
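The abstract describes probing individual attention heads of a pre-trained transformer and pooling a sentence vector from the heads found to be most influential. The following is a minimal sketch of that general idea, not the authors' released code: the model name ("bert-base-uncased" via the HuggingFace transformers library), the chosen (layer, head) pairs, the per-head slicing of layer outputs, and the mean-pooling step are all illustrative assumptions.

```python
# Sketch: inspect per-head structure of a pre-trained transformer and build a
# sentence vector from a hand-picked subset of heads. All head choices below
# are placeholders, not the heads identified in the paper.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_attentions=True)
model.eval()

sentence = "Attention heads specialize in different linguistic properties."
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# outputs.attentions: one tensor per layer, each [batch, num_heads, seq, seq]
attentions = outputs.attentions
hidden_states = outputs.hidden_states  # embeddings + one tensor per layer

num_layers = len(attentions)
num_heads = attentions[0].size(1)
print(f"{num_layers} layers x {num_heads} heads = {num_layers * num_heads} attention heads")

# Hypothetical "influential" heads, e.g. ones scoring highest on probing tasks.
selected_heads = [(8, 3), (9, 7), (10, 1)]  # (layer, head) placeholders

head_dim = model.config.hidden_size // num_heads
pooled = []
for layer, head in selected_heads:
    # Slice the layer output into equal per-head segments (a rough proxy for a
    # head's contribution) and mean-pool the chosen slice over tokens.
    layer_output = hidden_states[layer + 1]        # [batch, seq_len, hidden]
    start, end = head * head_dim, (head + 1) * head_dim
    head_slice = layer_output[:, :, start:end]     # [batch, seq_len, head_dim]
    pooled.append(head_slice.mean(dim=1))          # [batch, head_dim]

sentence_vector = torch.cat(pooled, dim=-1)        # [batch, head_dim * num selected]
print(sentence_vector.shape)
```

In practice, the selected heads would be chosen by scoring each (layer, head) pair on probing or downstream tasks, as the abstract describes; the fixed indices above only stand in for that selection step.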
Pages: 3404-3417
Page count: 14