Roles and Utilization of Attention Heads in Transformer-based Neural Language Models

Cited by: 0
Authors
Jo, Jae-young [1 ,2 ]
Myaeng, Sung-hyon [1 ]
Affiliations
[1] Korea Adv Inst Sci & Technol, Sch Comp, Daejeon, South Korea
[2] Dingbro AI Res, Daejeon, South Korea
Source
58TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2020) | 2020
Keywords
DOI
Not available
CLC classification number
TP18 [Artificial Intelligence Theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Sentence encoders based on the transformer architecture have shown promising results on various natural language tasks. The main impetus lies in pre-trained neural language models that capture long-range dependencies among words, owing to the multi-head attention that is unique to the architecture. However, little is known about how linguistic properties are processed, represented, and utilized for downstream tasks among the hundreds of attention heads inside a pre-trained transformer-based model. With the initial goal of examining the roles of attention heads in handling a set of linguistic features, we conducted experiments with ten probing tasks and three downstream tasks on four pre-trained transformer families (GPT, GPT2, BERT, and ELECTRA). Meaningful insights, revealed through heat map visualization, are used to propose a relatively simple sentence representation method that takes advantage of the most influential attention heads, yielding additional performance improvements on the downstream tasks.
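The record does not detail how the influential heads are selected or how exactly the sentence representation is assembled, so the following is only a minimal sketch of the general idea, assuming the HuggingFace transformers library and a BERT encoder: expose per-head attention weights and pool token states over a hand-picked subset of (layer, head) pairs. The SELECTED_HEADS indices and the [CLS]-weighted pooling are illustrative placeholders, not the authors' method.

import torch
from transformers import BertModel, BertTokenizer

# Load a pre-trained encoder with attention weights exposed.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_attentions=True)
model.eval()

# Hypothetical (layer, head) pairs standing in for the "most influential" heads;
# the paper identifies such heads empirically via probing and downstream tasks.
SELECTED_HEADS = [(9, 3), (10, 7), (11, 0)]

def sentence_representation(sentence):
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    hidden = out.last_hidden_state[0]            # (seq_len, hidden_size)
    parts = []
    for layer, head in SELECTED_HEADS:
        attn = out.attentions[layer][0, head]    # (seq_len, seq_len)
        # Pool token states weighted by how strongly [CLS] attends to each token.
        cls_weights = attn[0].unsqueeze(-1)      # (seq_len, 1)
        parts.append((cls_weights * hidden).sum(dim=0))
    return torch.cat(parts)                      # 3 heads x 768 dims = 2304-d vector

vec = sentence_representation("Attention heads play distinct linguistic roles.")
print(vec.shape)                                 # torch.Size([2304])

The resulting vector could then be fed to a lightweight classifier for a downstream task; the choice of which heads to include is the part the paper studies empirically.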
Pages: 3404-3417 (14 pages)
相关论文
共 50 条
[31]   Localizing in-domain adaptation of transformer-based biomedical language models [J].
Buonocore, Tommaso Mario ;
Crema, Claudio ;
Redolfi, Alberto ;
Bellazzi, Riccardo ;
Parimbelli, Enea .
JOURNAL OF BIOMEDICAL INFORMATICS, 2023, 144
[32]   Accelerating Training of Transformer-Based Language Models with Progressive Layer Dropping [J].
Zhang, Minjia ;
He, Yuxiong .
ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS (NEURIPS 2020), 2020, 33
[33]   Arlo: Serving Transformer-based Language Models with Dynamic Input Lengths [J].
Tan, Xin ;
Li, Jiamin ;
Yang, Yitao ;
Li, Jingzong ;
Xu, Hong .
53RD INTERNATIONAL CONFERENCE ON PARALLEL PROCESSING, ICPP 2024, 2024, :367-376
[34]   Enhancing Address Data Integrity using Transformer-Based Language Models [J].
Kurklu, Omer Faruk ;
Akagiunduz, Erdem .
32ND IEEE SIGNAL PROCESSING AND COMMUNICATIONS APPLICATIONS CONFERENCE, SIU 2024, 2024,
[35]   Efficient Open Domain Question Answering With Delayed Attention in Transformer-Based Models [J].
Siblini, Wissam ;
Challal, Mohamed ;
Pasqual, Charlotte .
INTERNATIONAL JOURNAL OF DATA WAREHOUSING AND MINING, 2022, 18 (02)
[36]   TRANSFORMER-BASED STREAMING ASR WITH CUMULATIVE ATTENTION [J].
Li, Mohan ;
Zhang, Shucong ;
Zorila, Catalin ;
Doddipatla, Rama .
2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, :8272-8276
[37]   Attention Calibration for Transformer-based Sequential Recommendation [J].
Zhou, Peilin ;
Ye, Qichen ;
Xie, Yueqi ;
Gao, Jingqi ;
Wang, Shoujin ;
Kim, Jae Boum ;
You, Chenyu ;
Kim, Sunghun .
PROCEEDINGS OF THE 32ND ACM INTERNATIONAL CONFERENCE ON INFORMATION AND KNOWLEDGE MANAGEMENT, CIKM 2023, 2023, :3595-3605
[38]   Korean Sign Language Recognition Using Transformer-Based Deep Neural Network [J].
Shin, Jungpil ;
Musa Miah, Abu Saleh ;
Hasan, Md. Al Mehedi ;
Hirooka, Koki ;
Suzuki, Kota ;
Lee, Hyoun-Sup ;
Jang, Si-Woong .
APPLIED SCIENCES-BASEL, 2023, 13 (05)
[39]   Quantifying the Bias of Transformer-Based Language Models for African American English in Masked Language Modeling [J].
Salutari, Flavia ;
Ramos, Jerome ;
Rahmani, Hossein A. ;
Linguaglossa, Leonardo ;
Lipani, Aldo .
ADVANCES IN KNOWLEDGE DISCOVERY AND DATA MINING, PAKDD 2023, PT I, 2023, 13935 :532-543
[40]   Incorporating Medical Knowledge to Transformer-based Language Models for Medical Dialogue Generation [J].
Naseem, Usman ;
Bandi, Ajay ;
Raza, Shaina ;
Rashid, Junaid ;
Chakravarthi, Bharathi Raja .
PROCEEDINGS OF THE 21ST WORKSHOP ON BIOMEDICAL LANGUAGE PROCESSING (BIONLP 2022), 2022, :110-115