Fine-Tuning via Mask Language Model Enhanced Representations Based Contrastive Learning and Application

Cited by: 0
Authors
Zhang, Dechi [1 ]
Wan, Weibing [1 ]
Affiliation
[1] School of Electrical and Electronic Engineering, Shanghai University of Engineering Science, Shanghai
Keywords
contrastive learning; fine-tuning; generalization ability; masked language model; spurious association; Transformer
DOI
10.3778/j.issn.1002-8331.2306-0190
Abstract
Self-attention networks play an important role in Transformer-based language models, where the fully connected structure can capture non-contiguous dependencies in a sequence in parallel. However, the fully connected self-attention network is prone to overfitting to spurious associations, such as spurious associations between words and between words and the prediction target. This overfitting limits the ability of language models to generalize to out-of-domain or out-of-distribution data. To improve the robustness and generalization ability of Transformer language models against spurious associations, this paper proposes a fine-tuning framework based on contrastive learning with masked-language-model-enhanced representations. Specifically, a text sequence and its randomly masked counterpart are fed into a twin (Siamese) network, and the model parameters are learned by combining a contrastive learning objective with the downstream task objective. Each branch of the twin network consists of a pre-trained language model and a task classifier. The fine-tuning framework is therefore more consistent with the masked language model pre-training paradigm and can preserve the generalization ability of pre-trained knowledge in downstream tasks. On the MNLI, FEVER, and QQP datasets and their challenge sets, the proposed method is compared with the latest baseline models, including the large language models ChatGPT, GPT-4, and LLaMA. Experimental results show that the proposed model maintains in-distribution performance while improving out-of-distribution performance. Results on the ATIS and Snips datasets further show that the model is also effective on common natural language processing tasks. © 2024 Journal of Computer Engineering and Applications Beijing Co., Ltd.; Science Press. All rights reserved.
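The abstract above describes the core training recipe: a sequence and its randomly masked copy pass through a shared pre-trained encoder plus task classifier, and a contrastive objective is combined with the downstream task loss. The sketch below is a minimal, illustrative rendering of that idea, not the authors' implementation; it assumes a HuggingFace-style encoder interface, and the names `random_mask`, `info_nce`, `twin_network_step`, and the weighting factor `alpha` are hypothetical.

```python
import torch
import torch.nn.functional as F

def random_mask(input_ids, mask_token_id, mask_prob=0.15, special_ids=(0, 101, 102)):
    """Randomly replace a fraction of tokens with the [MASK] id, skipping special tokens."""
    mask = torch.rand_like(input_ids, dtype=torch.float) < mask_prob
    for sid in special_ids:
        mask &= input_ids.ne(sid)
    masked = input_ids.clone()
    masked[mask] = mask_token_id
    return masked

def info_nce(z_orig, z_masked, temperature=0.05):
    """Contrastive (InfoNCE) loss: each sequence representation should be closest
    to the representation of its own masked counterpart within the batch."""
    z_orig = F.normalize(z_orig, dim=-1)
    z_masked = F.normalize(z_masked, dim=-1)
    logits = z_orig @ z_masked.t() / temperature      # (B, B) cosine-similarity matrix
    targets = torch.arange(z_orig.size(0), device=z_orig.device)
    return F.cross_entropy(logits, targets)

def twin_network_step(encoder, classifier, input_ids, attention_mask, labels,
                      mask_token_id, alpha=0.1):
    """One fine-tuning step: both views share the encoder and classifier, and the
    contrastive objective is added to the downstream task objective (alpha is assumed)."""
    masked_ids = random_mask(input_ids, mask_token_id)
    h_orig = encoder(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state[:, 0]
    h_mask = encoder(input_ids=masked_ids, attention_mask=attention_mask).last_hidden_state[:, 0]
    task_loss = F.cross_entropy(classifier(h_orig), labels)   # downstream task objective
    contrast_loss = info_nce(h_orig, h_mask)                   # contrastive objective
    return task_loss + alpha * contrast_loss
```

As a usage sketch, `encoder` could be a BERT-style model such as `AutoModel.from_pretrained("bert-base-uncased")`, `classifier` a `torch.nn.Linear(768, num_labels)`, and `mask_token_id` the tokenizer's `mask_token_id`; the paper's actual masking scheme and loss weighting may differ.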
Pages: 129-138
Number of pages: 9