Searching the PDF Haystack: Automated Knowledge Discovery in Scanned EHR Documents

Cited by: 8
Authors
Kostrinsky-Thomas, Alexander L. [1]
Hisama, Fuki M. [2]
Payne, Thomas H. [3]
Affiliations
[1] Pacific Northwest Univ Hlth Sci, Coll Osteopath Med, 200 Univ Pkwy, Yakima, WA USA
[2] Univ Washington, Dept Med, Div Med Genet, Sch Med, Seattle, WA 98195 USA
[3] Univ Washington, Sch Med, Dept Med, Seattle, WA 98195 USA
Source
APPLIED CLINICAL INFORMATICS | 2021 / Vol. 12 / Issue 02
Keywords
electronic health records; portable document format; optical character recognition; natural language processing; machine learning; evaluation
DOI
10.1055/s-0041-1726103
Chinese Library Classification number
R-058
Abstract
Background: Clinicians express concern that they may be unaware of important information in voluminous scanned and other outside documents stored in electronic health records (EHRs). An example is "unrecognized EHR risk factor information," defined as risk factors for heritable cancer that exist within a patient's EHR but are not known by current treating providers. In a related study using manual EHR chart review, we found that half of the women whose EHR contained risk factor information met criteria for further genetic risk evaluation for heritable forms of breast and ovarian cancer. They were not referred for genetic counseling.
Objectives: The purpose of this study was to compare automated methods (optical character recognition with natural language processing) with human review in their ability to identify risk factors for heritable breast and ovarian cancer within EHR scanned documents.
Methods: We evaluated the accuracy of chart review by comparing our criterion standard (physician chart review) with an automated method involving Amazon's Textract service (Amazon.com, Seattle, Washington, United States), a clinical language annotation modeling and processing toolkit (CLAMP) (Center for Computational Biomedicine at The University of Texas Health Science, Houston, Texas, United States), and a custom-written Java application.
Results: We found that automated methods identified most cancer risk factor information that would otherwise require clinician manual review and is therefore at risk of being missed.
Conclusion: The use of automated methods for identification of heritable risk factors within EHRs may provide an accurate yet rapid review of patients' past medical histories. These methods could be further strengthened via improved analysis of handwritten notes, tables, and colloquial phrases.
Pages: 245-250
Page count: 6