Automatic extraction of cancer registry reportable information from free-text pathology reports using multitask convolutional neural networks

Cited by: 52
Authors
Alawad, Mohammed [1]
Gao, Shang [1]
Qiu, John X. [1]
Yoon, Hong Jun [1]
Christian, J. Blair [1]
Penberthy, Lynne [2]
Mumphrey, Brent [3]
Wu, Xiao-Cheng [3]
Coyle, Linda [4]
Tourassi, Georgia [1]
Affiliations
[1] Oak Ridge Natl Lab, Computat Sci & Engn Div, Hlth Data Sci Inst, Oak Ridge, TN 37831 USA
[2] NCI, Surveillance Res Program, Div Canc Control & Populat Sci, Bethesda, MD 20892 USA
[3] Louisiana State Univ, Louisiana Tumor Registry, Hlth Sci Ctr, Sch Publ Hlth, New Orleans, LA USA
[4] Informat Management Serv Inc, Calverton, MD USA
Funding
National Institutes of Health (USA);
Keywords
deep learning; multitask learning; convolutional neural network; cancer pathology reports; natural language processing; information extraction; CLINICAL INFORMATION;
DOI
10.1093/jamia/ocz153
Chinese Library Classification
TP [Automation technology, computer technology];
Discipline classification code
0812 ;
Abstract
Objective: We implement 2 multitask learning (MTL) techniques, hard parameter sharing and cross-stitch, to train a word-level convolutional neural network (CNN) designed for automatic extraction of cancer data from unstructured text in pathology reports. We show that learning related information extraction (IE) tasks through shared representations is important for achieving state-of-the-art classification accuracy and computational efficiency.
Materials and Methods: The multitask CNN (MTCNN) tackles document information extraction by learning to extract multiple key cancer characteristics simultaneously. We trained our MTCNN to perform 5 IE tasks: (1) primary cancer site (65 classes), (2) laterality (4 classes), (3) behavior (3 classes), (4) histological type (63 classes), and (5) histological grade (5 classes). We evaluated performance on a corpus of 95 231 pathology documents (71 223 unique tumors) obtained from the Louisiana Tumor Registry. We compared the MTCNN models against single-task CNN models and 2 traditional machine learning approaches: support vector machine (SVM) and random forest classifier (RFC).
Results: The MTCNNs achieved higher classification accuracy than the other machine learning models on all 5 tasks. In retrospective evaluation, the hard parameter sharing and cross-stitch MTCNN models correctly classified 59.04% and 57.93% of the pathology reports, respectively, across all 5 tasks. The baseline models achieved 53.68% (single-task CNN), 46.37% (RFC), and 36.75% (SVM). In prospective evaluation, the percentages of correctly classified cases across the 5 tasks were 60.11% (hard parameter sharing), 58.13% (cross-stitch), 51.30% (single-task CNN), 42.07% (RFC), and 35.16% (SVM). Moreover, the hard parameter sharing MTCNN outperformed the other models in computational efficiency, using about the same number of trainable parameters as a single-task CNN.
Conclusions: The hard parameter sharing MTCNN offers superior classification accuracy for automated coding support of pathology documents across a wide range of cancers and multiple information extraction tasks, while keeping training and inference times similar to those of a single task-specific model.
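The parameter-efficiency claim in the abstract can be illustrated with simple arithmetic. The sketch below is not the authors' implementation: it assumes a generic word-level text CNN (one embedding table, parallel 1-D convolutions, max pooling, one softmax layer per task), and the vocabulary size, embedding dimension, filter widths, and filter counts are hypothetical values chosen only for illustration. Only the per-task class counts come from the abstract. Under hard parameter sharing, the encoder is shared and each task adds just its own output layer, so the total stays close to that of one single-task model.

```python
# Class counts per task are taken from the abstract; everything else is assumed.
TASK_CLASSES = {
    "site": 65,
    "laterality": 4,
    "behavior": 3,
    "histology": 63,
    "grade": 5,
}

# Hypothetical hyperparameters, chosen only for illustration.
VOCAB_SIZE = 20_000
EMBED_DIM = 300
FILTER_WIDTHS = (3, 4, 5)
FILTERS_PER_WIDTH = 100


def encoder_params():
    """Weights and biases of the embedding table and the parallel 1-D convolutions."""
    embedding = VOCAB_SIZE * EMBED_DIM
    convs = sum((w * EMBED_DIM + 1) * FILTERS_PER_WIDTH for w in FILTER_WIDTHS)
    return embedding + convs


def head_params(num_classes):
    """Weights and biases of one softmax layer over the max-pooled features."""
    feature_dim = len(FILTER_WIDTHS) * FILTERS_PER_WIDTH
    return (feature_dim + 1) * num_classes


def single_task_params(task):
    """One stand-alone CNN: its own encoder plus one output layer."""
    return encoder_params() + head_params(TASK_CLASSES[task])


def hard_sharing_params():
    """One shared encoder plus one output layer per task."""
    return encoder_params() + sum(head_params(c) for c in TASK_CLASSES.values())


def separate_models_params():
    """Five independent single-task CNNs, each with its own encoder."""
    return sum(single_task_params(t) for t in TASK_CLASSES)


if __name__ == "__main__":
    print("largest single-task CNN:", single_task_params("site"))
    print("hard parameter sharing: ", hard_sharing_params())
    print("five separate CNNs:     ", separate_models_params())
```

With these assumed settings the shared model is well under 1% larger than the largest single-task model, while five separate models need roughly five times as many parameters. A cross-stitch network keeps one encoder per task and learns how to mix their activations, so it would not show the same saving.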
Pages: 89-98
Page count: 10