Limitations of Transformers on Clinical Text Classification

Cited by: 88
Authors
Gao, Shang [1 ]
Alawad, Mohammed [1 ]
Young, M. Todd [1 ]
Gounley, John [1 ]
Schaefferkoetter, Noah [1 ]
Yoon, Hong Jun [1 ]
Wu, Xiao-Cheng [2 ]
Durbin, Eric B. [3 ]
Doherty, Jennifer [4 ]
Stroup, Antoinette [5 ]
Coyle, Linda [6 ]
Tourassi, Georgia [1 ]
Affiliations
[1] Oak Ridge Natl Lab, Oak Ridge, TN 37830 USA
[2] Louisiana State Univ, Hlth Sci Ctr, Louisiana Tumor Registry, New Orleans, LA 70112 USA
[3] Univ Kentucky, Kentucky Canc Registry, Lexington, KY 40536 USA
[4] Univ Utah, Hlth Huntsman Canc Inst, Utah Canc Registry, Salt Lake City, UT 84132 USA
[5] New Jersey State Canc Registry, Trenton, NJ 08625 USA
[6] Informat Management Serv Inc, Calverton, MD 20705 USA
Keywords
Bit error rate; Task analysis; Cancer; MIMICs; Biological system modeling; Adaptation models; Data models; BERT; clinical text; deep learning; natural language processing; neural networks; text classification;
DOI
10.1109/JBHI.2021.3062322
CLC Number
TP [Automation technology; computer technology]
Discipline Classification Code
0812
Abstract
Bidirectional Encoder Representations from Transformers (BERT) and BERT-based approaches are the current state-of-the-art in many natural language processing (NLP) tasks; however, their application to document classification on long clinical texts is limited. In this work, we introduce four methods to scale BERT, which by default can only handle input sequences up to approximately 400 words long, to perform document classification on clinical texts several thousand words long. We compare these methods against two much simpler architectures (a word-level convolutional neural network and a hierarchical self-attention network) and show that BERT often cannot beat these simpler baselines when classifying MIMIC-III discharge summaries and SEER cancer pathology reports. In our analysis, we show that two key components of BERT (pretraining and WordPiece tokenization) may actually be inhibiting BERT's performance on clinical text classification tasks where the input document is several thousand words long and where correctly identifying labels may depend more on identifying a few key words or phrases than on understanding the contextual meaning of sequences of text.
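The abstract describes scaling a fixed-length encoder such as BERT to documents several thousand words long. One common way to do this (a minimal sketch only; the paper evaluates four specific methods, and the names `chunk_tokens`, `classify_long_document`, and `encoder_logits_fn` below are hypothetical, not from the paper) is to split the token sequence into overlapping windows that fit the encoder's input limit and then aggregate per-window predictions:

```python
import numpy as np

def chunk_tokens(token_ids, max_len=512, stride=256):
    """Split a long token sequence into overlapping windows that fit
    a fixed-length encoder (512 tokens is BERT's default limit)."""
    chunks = []
    for start in range(0, max(1, len(token_ids) - max_len + stride), stride):
        chunks.append(token_ids[start:start + max_len])
    return chunks

def classify_long_document(token_ids, encoder_logits_fn,
                           max_len=512, stride=256):
    """Run the encoder on each window and mean-pool the class logits.
    Mean pooling is one simple aggregation choice among several."""
    chunks = chunk_tokens(token_ids, max_len, stride)
    logits = np.stack([encoder_logits_fn(c) for c in chunks])
    return int(logits.mean(axis=0).argmax())
```

Because mean pooling dilutes a signal that appears in only one window, documents whose label hinges on a few key phrases (as the abstract notes for pathology reports) may favor max pooling or attention over windows instead.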
Pages: 3596-3607
Page count: 12