Automated derivation of diagnostic criteria for lung cancer using natural language processing on electronic health records: a pilot study

Authors
Houston, Andrew [1 ,2 ]
Williams, Sophie [1 ,2 ]
Ricketts, William [3 ]
Gutteridge, Charles [1 ]
Tackaberry, Chris [4 ]
Conibear, John [5 ]
Affiliations
[1] Barts Hlth NHS Trust, Barts Life Sci, London, England
[2] Queen Mary Univ London, Digital Environm Res Inst, London, England
[3] Barts Hlth NHS Trust, Resp Med, London, England
[4] Clinithink Ltd, London, England
[5] Barts Hlth NHS Trust, Barts Canc Ctr, London, England
Keywords
Electronic health records; Natural language processing; Cancer; Diagnostics; SNOMED-CT; Machine learning; Genetic optimisation; Accuracy; Features
DOI
10.1186/s12911-024-02790-y
Chinese Library Classification
R-058
Abstract
Background: The digitisation of healthcare records has generated vast amounts of unstructured data, presenting opportunities to improve disease diagnosis where clinical coding falls short, such as in the recording of patient symptoms. This study presents an approach that uses natural language processing to extract clinical concepts from free text, which are then used to automatically form diagnostic criteria for lung cancer from unstructured secondary-care data.
Methods: Patients aged 40 and above who underwent a chest x-ray (CXR) between 2016 and 2022 were included. ICD-10 codes and unstructured data were pulled from their electronic health records (EHRs) for the 12 months preceding the CXR. The unstructured data were processed using named entity recognition to extract symptoms, which were mapped to SNOMED-CT codes. Subsumption of features up the SNOMED-CT hierarchy was used to mitigate sparse features, and a frequency-based criterion, combined with univariate logarithmic probabilities, was applied to select candidate features to take forward to the model development phase. A genetic algorithm was employed to identify the most discriminating features to form the diagnostic criteria.
Results: 75,002 patients were included, with 1,012 lung cancer diagnoses made within 12 months of the CXR. The best-performing model achieved an AUROC of 0.72. An existing 'disorder of the lung', such as pneumonia, and a 'cough' increased the probability of a lung cancer diagnosis. 'Anomalies of great vessel', 'disorder of the retroperitoneal compartment' and 'context-dependent findings', such as pain, were statistically associated with a reduced risk of lung cancer, making other diagnoses more likely. The developed model outperformed the existing cancer risk scores against which it was compared.
Conclusions: The proposed methods succeeded in leveraging unstructured secondary-care data to derive diagnostic criteria for lung cancer, outperforming existing risk tools. These advancements show potential for enhancing patient care and outcomes. However, specific limitations need to be addressed by integrating primary care data, to ensure a more thorough and unbiased development of diagnostic criteria. The study also highlights the importance of contextualising SNOMED-CT concepts into terminology that is meaningful to clinicians, giving a clearer and more tangible understanding of the criteria applied.
Pages: 10