Text Mining in Clinical Domain: Dealing with Noise

Cited by: 14
Authors
Nguyen, Hoang [1]
Patrick, Jon [2]
Affiliations
[1] CSIRO, Data61, 13 Garden St, Eveleigh, NSW 2015, Australia
[2] Univ Sydney, 1 Cleveland St, Sydney, NSW 2006, Australia
Source
KDD'16: PROCEEDINGS OF THE 22ND ACM SIGKDD INTERNATIONAL CONFERENCE ON KNOWLEDGE DISCOVERY AND DATA MINING | 2016
Keywords
Clinical; active learning; text classification; named-entity recognition; natural language processing; INFORMATION
DOI
10.1145/2939672.2939720
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Text mining in the clinical domain is usually more difficult than in general domains (e.g. newswire reports and scientific literature) because of the high level of noise in both the corpus and the training data for machine learning (ML). A large number of unknown words, non-words and grammatically poor sentences make up the noise in the clinical corpus. Unknown words are usually complex medical vocabulary, misspellings, acronyms and abbreviations, whereas unknown non-words are generally clinical patterns such as scores and measures. This noise creates obstacles in the initial lexical processing step as well as in subsequent semantic analysis. Furthermore, the labelled data used to build ML models is very costly to obtain because it requires intensive clinical knowledge from the annotators. Even when created by experts, the training examples usually contain errors and inconsistencies due to variations in the annotators' attentiveness. The clinical domain also suffers from inherently imbalanced data distributions. These kinds of noise are very common and can affect overall information extraction performance, but they have not been carefully investigated in most published health informatics systems. This paper introduces a general clinical data mining architecture that has the potential to address all of these challenges using an automatic proof-reading process, a trainable finite-state pattern recogniser, iterative model development and active learning. The reportability classifier based on this architecture achieved 98.25% sensitivity and 96.14% specificity on an Australian cancer registry's held-out test set, and active learning saved up to 92% of the training data provided for supervised ML.
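The abstract reports the classifier's performance as sensitivity and specificity, which are commonly preferred over plain accuracy when class distributions are imbalanced. As a minimal sketch of how these two metrics are computed, assuming binary labels where 1 means "reportable" and 0 means "not reportable" (the function name and the toy labels below are illustrative assumptions, not taken from the paper or its data):

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels, where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives

    # Sensitivity: share of truly positive cases that the classifier caught.
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    # Specificity: share of truly negative cases that the classifier rejected.
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Toy, made-up labels for illustration only (4 positives, 6 negatives).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
sens, spec = sensitivity_specificity(y_true, y_pred)
print(f"sensitivity={sens:.4f}, specificity={spec:.4f}")
```

Reporting the two rates separately makes it visible when a classifier does well on the majority class but misses cases in the minority class, something a single accuracy figure can hide.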
Pages: 549-558
Page count: 10