Functional evaluation of out-of-the-box text-mining tools for data-mining tasks

Cited by: 33
Authors
Jung, Kenneth [1]
LePendu, Paea [2]
Iyer, Srinivasan
Bauer-Mehren, Anna
Percha, Bethany [1]
Shah, Nigam H. [2]
Affiliations
[1] Stanford Univ, Program Biomed Informat, Stanford, CA 94305 USA
[2] Stanford Univ, Ctr Biomed Informat Res, Stanford, CA 94305 USA
Keywords
electronic health records; natural language processing; text mining; ELECTRONIC HEALTH RECORDS; CLINICAL TEXT; INFORMATION EXTRACTION; SYSTEM; ACCURACY; ARCHITECTURE; ALGORITHM; KNOWLEDGE; ARTHRITIS; ART;
DOI
10.1136/amiajnl-2014-002902
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
Objective: The trade-off between the speed and simplicity of dictionary-based term recognition and the richer linguistic information provided by more advanced natural language processing (NLP) is an area of active discussion in clinical informatics. In this paper, we quantify this trade-off by comparing text processing systems that balance speed and linguistic understanding differently. We tested both types of systems in three clinical research tasks: phase IV safety profiling of a drug, learning adverse drug-drug interactions, and learning used-to-treat relationships between drugs and indications.
Materials: We first benchmarked the accuracy of the NCBO Annotator and REVEAL on a manually annotated, publicly available dataset from the 2008 i2b2 Obesity Challenge. We then applied the NCBO Annotator and REVEAL to 9 million clinical notes from the Stanford Translational Research Integrated Database Environment (STRIDE) and used the resulting data for the three research tasks.
Results: With large datasets, there is no significant difference between the NCBO Annotator and REVEAL in the results of the three research tasks. In one subtask, REVEAL achieved higher sensitivity with smaller datasets.
Conclusions: For a variety of tasks, employing simple term recognition methods instead of advanced NLP methods has little or no impact on accuracy when using large datasets. Simpler dictionary-based methods have the advantage of scaling well to very large datasets. Promoting the use of simple, dictionary-based methods for population-level analyses can advance adoption of NLP in practice.
Pages: 121-131
Page count: 11