Generalizable clinical note section identification with large language models

Cited by: 1
Authors
Zhou, Weipeng [1 ]
Miller, Timothy A. [2 ,3 ]
Affiliations
[1] Univ Washington Seattle, Sch Med, Dept Biomed Informat & Med Educ, Seattle, WA 98195 USA
[2] Boston Childrens Hosp, Computat Hlth Informat Program, Boston, MA 02215 USA
[3] Harvard Med Sch, Dept Pediat, Boston, MA 02215 USA
Funding
US National Institutes of Health;
Keywords
section identification; large language models; ChatGPT; GPT4; fine-tuning;
DOI
10.1093/jamiaopen/ooae075
Chinese Library Classification
R19 [Health organization and services (health administration)];
Abstract
Objectives: Clinical note section identification helps locate relevant information and could be beneficial for downstream tasks such as named entity recognition. However, traditional supervised methods suffer from transferability issues. This study proposes a new framework that uses large language models (LLMs) for section identification to overcome these limitations.
Materials and Methods: We framed section identification as question answering and provided the section definitions in free text. We evaluated multiple LLMs off-the-shelf, without any training. We also fine-tuned LLMs to investigate how the size and specificity of the fine-tuning dataset affect model performance.
Results: GPT4 achieved the highest F1 score of 0.77. The best open-source model (Tulu2-70b) achieved 0.64, on par with GPT3.5 (ChatGPT). GPT4 also obtained F1 scores greater than 0.9 for 9 of the 27 (33%) section types and greater than 0.8 for 15 of 27 (56%) section types. Our fine-tuned models plateaued as the size of the general-domain dataset increased, and adding a reasonable number of section identification examples was beneficial.
Discussion: These results indicate that GPT4 is nearly production-ready for section identification, seemingly combining knowledge of note structure with the ability to follow complex instructions, and that the best current open-source LLM is catching up.
Conclusion: Our study shows that LLMs are promising for generalizable clinical note section identification. They can potentially be further improved by adding section identification examples to the fine-tuning dataset.
Lay Summary: Clinical note section identification helps locate relevant information and could be beneficial for downstream tasks such as extracting social determinants of health from the social history section. Traditional machine learning methods typically require an annotated dataset and operate only on a fixed set of section categories. In this study, we approached section identification with large language models (LLMs) in a question-answering fashion, providing the LLM section definitions as part of the question. Among the LLMs we tested, GPT4 achieved the highest F1 score of 0.77; the best open-source model (Tulu2-70b) achieved 0.64, on par with GPT3.5 (ChatGPT). GPT4 also obtained F1 scores greater than 0.9 for 9 of the 27 (33%) section types and greater than 0.8 for 15 of 27 (56%) section types. These results indicate that GPT4 is nearly production-ready for section identification, and that the best current open-source LLM is catching up.
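The abstract describes framing section identification as question answering, with free-text section definitions embedded in the prompt. The sketch below illustrates this framing in minimal form; it is not the authors' actual prompt or section inventory — the definitions, wording, and the `build_prompt`/`parse_answer` helpers are illustrative assumptions (the study itself uses 27 section types).

```python
# Hypothetical free-text section definitions (the paper uses 27 section types;
# these three are placeholders for illustration only).
SECTION_DEFINITIONS = {
    "social history": "Lifestyle factors such as smoking, alcohol use, and occupation.",
    "medications": "Drugs the patient is currently taking, with doses.",
    "assessment and plan": "The clinician's diagnostic impression and next steps.",
}

def build_prompt(note_segment: str) -> str:
    """Assemble a question-answering prompt that embeds the section definitions."""
    defs = "\n".join(f"- {name}: {desc}" for name, desc in SECTION_DEFINITIONS.items())
    return (
        "You are given definitions of clinical note sections:\n"
        f"{defs}\n\n"
        "Question: which section does the following text belong to? "
        "Answer with the section name only.\n\n"
        f"Text: {note_segment}"
    )

def parse_answer(llm_output: str) -> str:
    """Normalize the model's free-text answer to a known section name."""
    answer = llm_output.strip().lower().rstrip(".")
    return answer if answer in SECTION_DEFINITIONS else "unknown"
```

In this framing, the prompt is sent to an off-the-shelf LLM and the returned section name is compared against the gold label; because the definitions are free text, the section inventory can be changed without retraining.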
Pages: 10