Enhancing performance of transformer-based models in natural language understanding through word importance embedding

Cited by: 1
Authors
Hong, Seung-Kyu [1 ]
Jang, Jae-Seok [1 ]
Kwon, Hyuk-Yoon [1 ]
Affiliations
[1] Seoul Natl Univ Sci & Technol, Grad Sch Data Sci, 232 Gongneung Ro, Seoul 01811, South Korea
Keywords
Natural language understanding; Transformer; Word importance; Word dependency
DOI
10.1016/j.knosys.2024.112404
Chinese Library Classification
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Transformer-based models have achieved state-of-the-art performance on natural language understanding (NLU) tasks by learning important token relationships through the attention mechanism. However, we observe that attention can become overly dispersed during fine-tuning, failing to adequately preserve the dependencies between meaningful tokens. This phenomenon negatively affects how token relationships within sentences are learned. To overcome this issue, we propose a methodology that embeds word importance (WI) into transformer-based models as a new layer that weights words according to their importance. Our simple yet powerful approach offers a general technique for boosting transformer model capabilities on NLU tasks by mitigating the risk of attention dispersion during fine-tuning. Through extensive experiments on the GLUE, SuperGLUE, and SQuAD benchmarks for pre-trained models (BERT, RoBERTa, ELECTRA, and DeBERTa), and on the MMLU, Big Bench Hard, and DROP benchmarks for the large language model Llama2, we validate that our method consistently enhances performance across models with negligible overhead. Furthermore, we show that our WI layer preserves the dependencies between important tokens better than standard fine-tuning by introducing a model that classifies dependent tokens from the learned attention weights. The code is available at https://github.com/bigbases/WordImportance.
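The abstract describes the WI layer only at a high level. The sketch below is a minimal illustration, under assumptions, of one way such a layer could look: a learnable per-token importance scalar that multiplicatively rescales token embeddings before they reach the transformer encoder. The module name WordImportanceLayer, the sigmoid-based weighting, and the 1 + w scaling are illustrative choices, not the authors' released implementation (which is available at the repository URL above).

```python
# Minimal sketch (an assumption, not the authors' released code) of a word-importance
# (WI) layer: each vocabulary id gets a learnable importance scalar that rescales its
# token embedding before the transformer encoder processes it.
import torch
import torch.nn as nn


class WordImportanceLayer(nn.Module):
    """Rescales token embeddings by a learned per-token importance weight."""

    def __init__(self, vocab_size: int):
        super().__init__()
        # One learnable importance scalar per vocabulary entry, initialized to zero
        # so every token starts with the same neutral weight before fine-tuning.
        self.importance = nn.Embedding(vocab_size, 1)
        nn.init.zeros_(self.importance.weight)

    def forward(self, input_ids: torch.Tensor, token_embeddings: torch.Tensor) -> torch.Tensor:
        # input_ids: (batch, seq_len); token_embeddings: (batch, seq_len, hidden)
        weights = torch.sigmoid(self.importance(input_ids))  # (batch, seq_len, 1), in (0, 1)
        # Multiplicative re-weighting lets fine-tuning amplify important tokens
        # relative to the rest, counteracting overly dispersed attention.
        return token_embeddings * (1.0 + weights)


if __name__ == "__main__":
    vocab_size, hidden = 30522, 768
    wi = WordImportanceLayer(vocab_size)
    ids = torch.randint(0, vocab_size, (2, 16))
    emb = torch.randn(2, 16, hidden)
    print(wi(ids, emb).shape)  # torch.Size([2, 16, 768])
```

In this sketch the layer sits between the embedding lookup and the encoder stack, so it adds only one scalar per vocabulary entry, consistent with the abstract's claim of negligible overhead.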
Pages: 13