Methodological insights into ChatGPT's screening performance in systematic reviews

Cited by: 7
Authors
Issaiy, Mahbod [1 ]
Ghanaati, Hossein [1 ]
Kolahi, Shahriar [1 ]
Shakiba, Madjid [1 ]
Jalali, Amir Hossein [1 ]
Zarei, Diana [1 ]
Kazemian, Sina [2 ]
Avanaki, Mahsa Alborzi [1 ]
Firouznia, Kavous [1 ]
Affiliations
[1] Univ Tehran Med Sci, Adv Diagnost & Intervent Radiol Res Ctr ADIR, Tehran, Iran
[2] Univ Tehran Med Sci, Cardiovasc Dis Res Inst, Cardiac Primary Prevent Res Ctr, Tehran, Iran
Keywords
Systematic review; ChatGPT; AI; Large language model; Article screening; Radiology; GPT
DOI
10.1186/s12874-024-02203-8
Chinese Library Classification (CLC)
R19 [Health organization and services (health services administration)]
Subject classification number
Abstract
Background: The screening process for systematic reviews and meta-analyses in medical research is labor-intensive and time-consuming. While machine learning and deep learning have been applied to facilitate this process, these methods typically require training data and user annotation. This study assesses the efficacy of ChatGPT, a large language model based on the Generative Pretrained Transformer (GPT) architecture, in automating the screening process for systematic reviews in radiology without the need for training data.

Methods: A prospective simulation study was conducted between May 2 and 24, 2023, comparing ChatGPT's performance in screening abstracts against that of general physicians (GPs). A total of 1198 abstracts across three subfields of radiology were evaluated. Metrics included sensitivity, specificity, positive and negative predictive values (PPV and NPV), and workload saving, among others. Statistical analyses included the Kappa coefficient for inter-rater agreement, ROC curve plotting, AUC calculation, and bootstrapping for p-values and confidence intervals.

Results: ChatGPT completed the screening process within an hour, while the GPs took an average of 7-10 days. The AI model achieved a sensitivity of 95% and an NPV of 99%, slightly outperforming the GPs' sensitive consensus (i.e., a record was included if at least one GP included it). It also exhibited remarkably few false negatives and high workload savings, ranging from 40% to 83%. However, ChatGPT had lower specificity and PPV than the human raters. The average Kappa agreement between ChatGPT and the other raters was 0.27.

Conclusions: ChatGPT shows promise in automating the article-screening phase of systematic reviews, achieving high sensitivity and workload savings. While not a replacement for human expertise, it could serve as an efficient first-line screening tool, particularly for reducing the burden on human resources. Further studies are needed to fine-tune its capabilities and validate its utility across different medical subfields.
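The screening metrics named in the abstract (sensitivity, specificity, PPV, NPV, workload saving, and Cohen's kappa) can all be derived from a 2x2 confusion matrix of AI decisions against a human reference standard. The sketch below is illustrative only and is not the study's code; the counts used in the usage comment are made-up example numbers, not the paper's data, and the workload-saving definition shown (fraction of records the AI excludes) is one common convention, assumed here rather than taken from the paper.

```python
def screening_metrics(tp, fp, fn, tn):
    """Standard screening metrics from confusion-matrix counts:
    tp/fp/fn/tn = true/false positives and negatives of the AI
    screener relative to the human reference decisions."""
    total = tp + fp + fn + tn
    sensitivity = tp / (tp + fn)          # included records correctly kept
    specificity = tn / (tn + fp)          # excluded records correctly dropped
    ppv = tp / (tp + fp)                  # positive predictive value
    npv = tn / (tn + fn)                  # negative predictive value
    # assumed convention: records the AI excludes need no human reading
    workload_saving = (tn + fn) / total
    return sensitivity, specificity, ppv, npv, workload_saving


def cohen_kappa(tp, fp, fn, tn):
    """Cohen's kappa for agreement between two binary raters,
    chance-corrected via the marginal include/exclude rates."""
    total = tp + fp + fn + tn
    po = (tp + tn) / total                # observed agreement
    pe = ((tp + fp) * (tp + fn)
          + (fn + tn) * (fp + tn)) / total ** 2  # chance agreement
    return (po - pe) / (1 - pe)


# Hypothetical counts for a 1198-abstract screen (illustration only):
sens, spec, ppv, npv, saving = screening_metrics(tp=95, fp=300, fn=5, tn=798)
kappa = cohen_kappa(tp=95, fp=300, fn=5, tn=798)
```

A high sensitivity/NPV with modest kappa, as reported in the abstract, is the expected pattern for a deliberately over-inclusive first-line screener: it rarely discards a relevant record, at the cost of many false inclusions that depress specificity, PPV, and chance-corrected agreement.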
Pages: 11