On the role of the UMLS in supporting diagnosis generation proposed by Large Language Models

Times Cited: 1
Authors
Afshar, Majid [1 ]
Gao, Yanjun [1 ]
Gupta, Deepak [2 ]
Croxford, Emma [1 ]
Demner-Fushman, Dina [2 ]
Affiliations
[1] Univ Wisconsin, Sch Med & Publ Hlth, 750 Highland Ave, Madison, WI 53726 USA
[2] NIH, Natl Lib Med, HHS, 8600 Rockville Pike, Bethesda, MD 20894 USA
Funding
US National Institutes of Health (NIH)
Keywords
Artificial intelligence; Knowledge representation (computer); Natural language processing; Unified medical language system; Evaluation methodology; Differential diagnoses;
DOI
10.1016/j.jbi.2024.104707
Chinese Library Classification (CLC)
TP39 [Computer applications]
Discipline Classification Codes
081203; 0835
Abstract
Objective: Traditional knowledge-based and machine learning diagnostic decision support systems have benefited from integrating the medical domain knowledge encoded in the Unified Medical Language System (UMLS). The emergence of Large Language Models (LLMs) as potential replacements for these traditional systems raises questions about the quality and extent of the medical knowledge in the models' internal representations and about the need for external knowledge sources. The objective of this study is three-fold: to probe the diagnosis-related medical knowledge of popular LLMs, to examine the benefit of providing UMLS knowledge to LLMs (grounding the diagnosis predictions), and to evaluate the correlation between human judgments and UMLS-based metrics for LLM-generated diagnoses.
Methods: We evaluated diagnoses generated by LLMs from consumer health questions and from daily care notes in electronic health records, using the ConsumerQA and Problem Summarization datasets. We probed the LLMs for UMLS knowledge by prompting them to complete diagnosis-related UMLS knowledge paths. We examined grounding with an approach that integrated UMLS graph paths and clinical notes in the prompts, and compared the results to prompting without the UMLS paths. The final experiments examined how well different evaluation metrics, UMLS-based and non-UMLS, align with human expert evaluation.
Results: In probing UMLS knowledge, GPT-3.5 significantly outperformed Llama2 and a simple baseline, yielding an F1 score of 10.9% in completing one-hop UMLS paths for a given concept. Grounding diagnosis predictions with UMLS paths improved the results for both models on both tasks, with the largest improvement (4%) in the SapBERT score. There was only a weak correlation between the widely used evaluation metrics (ROUGE and SapBERT) and human judgments.
Conclusion: We found that while popular LLMs contain some medical knowledge in their internal representations, augmenting them with UMLS knowledge provides performance gains for diagnosis generation. The UMLS content needs to be tailored to the task to improve the LLMs' predictions. Finding evaluation metrics that align with human judgments better than the traditional ROUGE and BERT-based scores remains an open research question.
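The grounding approach described in the Methods, integrating UMLS graph paths with a clinical note in the prompt, can be sketched as follows. This is a minimal illustration only: the prompt template, relation names, and concept strings are hypothetical and are not the paper's exact prompt format or UMLS extraction pipeline.

```python
def format_umls_paths(paths):
    """Render (subject, relation, object) UMLS triples as prompt lines."""
    return "\n".join(f"- {s} --{r}--> {o}" for s, r, o in paths)

def build_grounded_prompt(note, paths):
    """Prepend UMLS knowledge paths to a clinical note before asking for diagnoses."""
    return (
        "Relevant UMLS knowledge paths:\n"
        + format_umls_paths(paths)
        + "\n\nClinical note:\n"
        + note
        + "\n\nList the most likely diagnoses for this patient."
    )

# Illustrative one-hop paths; real relations come from the UMLS Metathesaurus.
paths = [
    ("Chest Pain", "may_be_finding_of", "Myocardial Infarction"),
    ("Dyspnea", "associated_with", "Heart Failure"),
]
prompt = build_grounded_prompt(
    "72-year-old with chest pain and dyspnea on exertion.", paths
)
print(prompt)
```

The ungrounded baseline the paper compares against would simply omit the knowledge-path block, so the effect of the UMLS context can be isolated by diffing the two prompts' outputs.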
Pages: 9