Automatic inference of BI-RADS final assessment categories from narrative mammography report findings

Cited by: 16
Authors
Banerjee, Imon [1]
Bozkurt, Selen [1,2]
Alkim, Emel [1]
Sagreiya, Hersh [3]
Kurian, Allison W. [4]
Rubin, Daniel L. [1,3]
Affiliations
[1] Stanford Univ, Sch Med, Dept Biomed Data Sci, Stanford, CA 94305 USA
[2] Akdeniz Univ, Fac Med, Dept Biostat & Med Informat, TR-07059 Antalya, Turkey
[3] Stanford Univ, Sch Med, Dept Radiol, Stanford, CA 94305 USA
[4] Stanford Univ, Med Oncol & Hlth Res & Policy, Sch Med, Stanford, CA 94305 USA
Funding
National Institutes of Health (US)
Keywords
BI-RADS classification; Deep learning; Mammography report; NLP; Distributional semantics; Text mining; Data system; Classification; Variability
DOI
10.1016/j.jbi.2019.103137
Chinese Library Classification
TP39 [Computer applications]
Subject Classification Codes
081203; 0835
Abstract
We propose an efficient natural language processing approach for inferring BI-RADS final assessment categories by analyzing only the mammogram findings reported by the mammographer in narrative form. The proposed hybrid method integrates semantic term embedding with distributional semantics, producing a context-aware vector representation of unstructured mammography reports. A large corpus of unannotated mammography reports (300,000) was used to learn the context of key terms with a distributional semantics approach, and the trained model was applied to generate context-aware vector representations of the reports annotated with a BI-RADS category (22,091). The vectorized reports were used to train a supervised classifier that derives the BI-RADS assessment class. Even though the majority of the proposed embedding pipeline is unsupervised, the classifier was able to capture substantial semantic information for deriving the BI-RADS categorization, not only on a held-out internal test set but also on an external validation set (1900 reports). Our proposed method outperforms a recently published domain-specific rule-based system and could be relevant for evaluating concordance between radiologists. With minimal requirements for task-specific customization, the proposed method can be easily transferred to a different domain to support large-scale text mining or derivation of patient phenotypes.
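A minimal sketch of the two-stage pipeline the abstract describes, under assumptions not stated in the paper: gensim's Word2Vec stands in for the distributional-semantics step, mean pooling of word vectors approximates the context-aware report representation, and scikit-learn's logistic regression plays the supervised classifier. The toy corpora below stand in for the 300,000 unannotated and 22,091 annotated reports.

```python
# Sketch of the two-stage pipeline: unsupervised embedding training on an
# unannotated corpus, then supervised BI-RADS classification on the annotated
# subset. Library choices and mean pooling are illustrative assumptions.
import numpy as np
from gensim.models import Word2Vec
from sklearn.linear_model import LogisticRegression

def tokenize(report):
    return report.lower().split()

# Stage 1: learn term context from a large unannotated corpus
# (toy sentences stand in for the ~300,000 reports used in the paper).
unannotated = [
    "scattered fibroglandular densities no suspicious mass or calcification",
    "spiculated mass in the upper outer quadrant highly suspicious",
    "benign appearing calcifications stable since prior exam",
]
w2v = Word2Vec([tokenize(r) for r in unannotated],
               vector_size=50, window=5, min_count=1, epochs=50)

# Context-aware report vector, approximated here by averaging word vectors.
def report_vector(report):
    vecs = [w2v.wv[t] for t in tokenize(report) if t in w2v.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(w2v.wv.vector_size)

# Stage 2: train a supervised classifier on the annotated subset
# (~22,091 labeled reports in the paper; two toy examples here).
annotated = [
    ("no suspicious mass or calcification", 1),  # BI-RADS 1: negative
    ("spiculated mass highly suspicious", 5),    # BI-RADS 5: highly suggestive
]
X = np.stack([report_vector(r) for r, _ in annotated])
y = [c for _, c in annotated]
clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.predict([report_vector("suspicious spiculated mass noted")]))
```

Because stage 1 never sees labels, the expensive corpus-level training can be reused across tasks; only the small labeled set and the final classifier are task-specific, which is what makes the approach transferable to other domains.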
Pages: 11