On the role of the UMLS in supporting diagnosis generation proposed by Large Language Models

Cited by: 1
Authors:
Afshar, Majid [1 ]
Gao, Yanjun [1 ]
Gupta, Deepak [2 ]
Croxford, Emma [1 ]
Demner-Fushman, Dina [2 ]
Affiliations:
[1] Univ Wisconsin, Sch Med & Publ Hlth, 750 Highland Ave, Madison, WI 53726 USA
[2] NIH, Natl Lib Med, HHS, 8600 Rockville Pike, Bethesda, MD 20894 USA
Funding:
U.S. National Institutes of Health;
Keywords:
Artificial intelligence; Knowledge representation (computer); Natural language processing; Unified medical language system; Evaluation methodology; Differential diagnoses;
DOI
10.1016/j.jbi.2024.104707
CLC number:
TP39 [Computer applications];
Subject classification codes:
081203; 0835;
Abstract:
Objective: Traditional knowledge-based and machine learning diagnostic decision support systems have benefited from integrating the medical domain knowledge encoded in the Unified Medical Language System (UMLS). The emergence of Large Language Models (LLMs) as potential replacements for traditional systems raises questions about the quality and extent of the medical knowledge in the models' internal representations and about the need for external knowledge sources. The objective of this study is three-fold: to probe the diagnosis-related medical knowledge of popular LLMs, to examine the benefit of providing UMLS knowledge to LLMs (grounding the diagnosis predictions), and to evaluate the correlations between human judgments and UMLS-based metrics for LLM generations.

Methods: We evaluated diagnoses generated by LLMs from consumer health questions and from daily care notes in electronic health records, using the ConsumerQA and Problem Summarization datasets. LLMs were probed for UMLS knowledge by prompting them to complete diagnosis-related UMLS knowledge paths. Grounding was examined in an approach that integrated UMLS graph paths and clinical notes when prompting the LLMs; the results were compared with prompting without the UMLS paths. The final experiments examined the alignment of different evaluation metrics, UMLS-based and non-UMLS, with human expert evaluation.

Results: In probing for UMLS knowledge, GPT-3.5 significantly outperformed Llama2 and a simple baseline, yielding an F1 score of 10.9% in completing one-hop UMLS paths for a given concept. Grounding diagnosis predictions with UMLS paths improved the results for both models on both tasks, with the highest improvement (4%) in SapBERT score. There was only a weak correlation between the widely used evaluation metrics (ROUGE and SapBERT) and human judgments.

Conclusion: We found that while popular LLMs contain some medical knowledge in their internal representations, augmentation with UMLS knowledge yields performance gains for diagnosis generation. The UMLS needs to be tailored to the task to improve the LLMs' predictions. Finding evaluation metrics that align with human judgments better than the traditional ROUGE and BERT-based scores remains an open research question.
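As a minimal illustration of the grounding approach described in the abstract (a sketch only; the prompt wording, the UMLS path format, the helper names, and the example inputs are assumptions rather than the authors' exact implementation), one-hop UMLS paths could be concatenated with a clinical note before prompting an LLM for candidate diagnoses:

    # Hypothetical Python sketch of UMLS-grounded prompting for diagnosis generation.
    # The path format and prompt text are illustrative assumptions, not the paper's prompts.

    def format_umls_paths(paths):
        # Each path is a (source_concept, relation, target_concept) triple,
        # e.g. ("Dyspnea", "may_be_finding_of", "Congestive heart failure").
        return "\n".join(f"{s} --{r}--> {t}" for s, r, t in paths)

    def build_grounded_prompt(clinical_note, umls_paths):
        # Combine the daily care note with UMLS knowledge paths so the model
        # conditions its differential diagnoses on the external knowledge.
        return (
            "Relevant UMLS knowledge paths:\n"
            + format_umls_paths(umls_paths)
            + "\n\nClinical note:\n"
            + clinical_note
            + "\n\nList the most likely diagnoses, one per line."
        )

    # Example usage with made-up inputs; the resulting prompt would be sent to
    # an LLM such as GPT-3.5 or Llama2 through its chat-completion API.
    paths = [("Dyspnea", "may_be_finding_of", "Congestive heart failure")]
    note = "72-year-old with worsening shortness of breath and leg swelling."
    prompt = build_grounded_prompt(note, paths)

Similarly, the SapBERT score reported in the Results could be approximated by embedding generated and reference diagnoses with a SapBERT encoder (e.g., cambridgeltl/SapBERT-from-PubMedBERT-fulltext) and computing cosine similarity between the embeddings, although the exact scoring procedure used in the paper may differ.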
Pages: 9
Related papers
(50 in total)
  • [31] Large language models in science
    Kowalewski, Karl-Friedrich
    Rodler, Severin
    UROLOGIE, 2024, 63 (09): 860 - 866
  • [32] Frontiers: Supporting Content Marketing with Natural Language Generation
    Reisenbichler, Martin
    Reutterer, Thomas
    Schweidel, David A.
    Dan, Daniel
    MARKETING SCIENCE, 2022, 41 (03) : 441 - 452
  • [33] Large language models (LLMs) as agents for augmented democracy
    Gudino, Jairo F.
    Grandi, Umberto
    Hidalgo, Cesar
    PHILOSOPHICAL TRANSACTIONS OF THE ROYAL SOCIETY A-MATHEMATICAL PHYSICAL AND ENGINEERING SCIENCES, 2024, 382 (2285):
  • [34] A Critical Review of Methods and Challenges in Large Language Models
    Moradi, Milad
    Yan, Ke
    Colwell, David
    Samwald, Matthias
    Asgari, Rhona
    CMC-COMPUTERS MATERIALS & CONTINUA, 2025, 82 (02): 1681 - 1698
  • [35] Large Language Models in Oncology: Revolution or Cause for Concern?
    Caglayan, Aydin
    Slusarczyk, Wojciech
    Rabbani, Rukhshana Dina
    Ghose, Aruni
    Papadopoulos, Vasileios
    Boussios, Stergios
    CURRENT ONCOLOGY, 2024, 31 (04) : 1817 - 1830
  • [36] Applications of large language models in oncology [Einsatzmöglichkeiten von „large language models“ in der Onkologie]
    Loeffler, Chiara M.
    Bressem, Keno K.
    Truhn, Daniel
    DIE ONKOLOGIE, 2024, 30 (5): 388 - 393
  • [37] Transforming Informed Consent Generation Using Large Language Models: Mixed Methods Study
    Shi, Qiming
    Luzuriaga, Katherine
    Allison, Jeroan J.
    Oztekin, Asil
    Faro, Jamie M.
    Lee, Joy L.
    Hafer, Nathaniel
    Mcmanus, Margaret
    Zai, Adrian H.
    JMIR MEDICAL INFORMATICS, 2025, 13
  • [38] Distractor Generation for Multiple-Choice Questions with Predictive Prompting and Large Language Models
    Bitew, Semere Kiros
    Deleu, Johannes
    Develder, Chris
    Demeester, Thomas
    MACHINE LEARNING AND PRINCIPLES AND PRACTICE OF KNOWLEDGE DISCOVERY IN DATABASES, ECML PKDD 2023, PT II, 2025, 2134 : 48 - 63
  • [39] A Survey Study on the State of the Art of Programming Exercise Generation using Large Language Models
    Frankford, Eduard
    Hoehn, Ingo
    Sauerwein, Clemens
    Breu, Ruth
    2024 36TH INTERNATIONAL CONFERENCE ON SOFTWARE ENGINEERING EDUCATION AND TRAINING, CSEE & T 2024, 2024,
  • [40] Enhancing textual textbook question answering with large language models and retrieval augmented generation
    Alawwad, Hessa A.
    Alhothali, Areej
    Naseem, Usman
    Alkhathlan, Ali
    Jamal, Amani
    PATTERN RECOGNITION, 2025, 162