Efficient identification of nationally mandated reportable cancer cases using natural language processing and machine learning

Cited by: 37
Authors
Osborne, John D. [1]
Wyatt, Matthew [1]
Westfall, Andrew O. [2]
Willig, James [3]
Bethard, Steven [4]
Gordon, Geoff [5]
Affiliations
[1] Univ Alabama Birmingham, Ctr Clin & Translat Sci, Birmingham, AL 35294 USA
[2] Univ Alabama Birmingham, Dept Biostat, Birmingham, AL 35294 USA
[3] Univ Alabama Birmingham, Dept Med, Birmingham, AL 35294 USA
[4] Univ Alabama Birmingham, Dept Comp & Informat Sci, Birmingham, AL 35294 USA
[5] Univ Alabama Birmingham, Informat Inst, Birmingham, AL 35294 USA
Funding
National Institutes of Health (USA)
Keywords
natural language processing; machine learning; information extraction; neoplasms; electronic health records; user-computer interface; TEXT; UMLS
DOI
10.1093/jamia/ocw006
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Objective: To help cancer registrars efficiently and accurately identify reportable cancer cases.

Material and Methods: The Cancer Registry Control Panel (CRCP) was developed to detect mentions of reportable cancer cases using a pipeline built on the Unstructured Information Management Architecture - Asynchronous Scaleout (UIMA-AS) architecture containing the National Library of Medicine's UIMA MetaMap annotator as well as a variety of rule-based UIMA annotators that primarily act to filter out concepts referring to nonreportable cancers. CRCP inspects pathology reports nightly to identify pathology records containing relevant cancer concepts and combines this with diagnosis codes from the Clinical Electronic Data Warehouse to identify candidate cancer patients using supervised machine learning. Cancer mentions are highlighted in all candidate clinical notes and then sorted in CRCP's web interface for faster validation by cancer registrars.

Results: CRCP achieved an accuracy of 0.872 and detected reportable cancer cases with a precision of 0.843 and a recall of 0.848. CRCP increases throughput by 22.6% over a baseline (manual review) pathology report inspection system while achieving a higher precision and recall. Depending on registrar time constraints, CRCP can increase recall to 0.939 at the expense of precision by incorporating a data source information feature.

Conclusion: CRCP demonstrates accurate results when applying natural language processing features to the problem of detecting patients with cases of reportable cancer from clinical notes. We show that implementing only a portion of cancer reporting rules in the form of regular expressions is sufficient to increase the precision, recall, and speed of the detection of reportable cancer cases when combined with off-the-shelf information extraction software and machine learning.
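The abstract describes rule-based annotators, partly expressed as regular expressions, that filter out concepts referring to nonreportable cancers. The following is only a minimal sketch of that idea in Python. The patterns, the function names `is_reportable` and `filter_mentions`, and the sample mentions are all hypothetical illustrations; the actual CRCP rules are not given in this record, and CRCP itself is built on UIMA rather than on this code.

```python
import re

# Hypothetical exclusion rules, for illustration only; these are NOT the
# rules used by CRCP, which are not listed in this record.
NONREPORTABLE_PATTERNS = [
    # Assumed example of a nonreportable concept.
    re.compile(r"\bbasal cell carcinoma\b", re.IGNORECASE),
    # Assumed example of a negated finding, allowing up to three
    # intervening words (e.g. "no evidence of residual carcinoma").
    re.compile(
        r"\b(?:no evidence of|negative for)\s+(?:\w+\s+){0,3}?"
        r"(?:carcinoma|malignancy)\b",
        re.IGNORECASE,
    ),
]


def is_reportable(mention: str) -> bool:
    """Return True when no exclusion pattern matches the mention text."""
    return not any(p.search(mention) for p in NONREPORTABLE_PATTERNS)


def filter_mentions(mentions):
    """Keep only the mentions that pass every exclusion rule."""
    return [m for m in mentions if is_reportable(m)]


if __name__ == "__main__":
    sample = [
        "invasive ductal carcinoma of the breast",
        "Basal cell carcinoma, left cheek",
        "negative for malignancy",
    ]
    for kept in filter_mentions(sample):
        print(kept)
```

In a pipeline like the one the abstract outlines, the mentions surviving such a filter would be one input among several (alongside diagnosis codes) to a supervised classifier, so the rules need not cover every reporting requirement.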
Pages: 1077-1084
Page count: 8