A Primer in BERTology: What We Know About How BERT Works

Cited by: 733
Authors
Rogers, Anna [1]
Kovaleva, Olga [2]
Rumshisky, Anna [2]
Affiliations
[1] Univ Copenhagen, Ctr Social Data Sci, Copenhagen, Denmark
[2] Univ Massachusetts, Dept Comp Sci, Lowell, MA, USA
DOI
10.1162/tacl_a_00349
CLC Classification Code
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Transformer-based models have pushed the state of the art in many areas of NLP, but our understanding of what is behind their success is still limited. This paper is the first survey of over 150 studies of the popular BERT model. We review the current state of knowledge about how BERT works, what kind of information it learns and how it is represented, common modifications to its training objectives and architecture, the overparameterization issue, and approaches to compression. We then outline directions for future research.
Pages: 842-866
Page count: 25