Methodological insights into ChatGPT's screening performance in systematic reviews

Cited by: 7
Authors
Issaiy, Mahbod [1 ]
Ghanaati, Hossein [1 ]
Kolahi, Shahriar [1 ]
Shakiba, Madjid [1 ]
Jalali, Amir Hossein [1 ]
Zarei, Diana [1 ]
Kazemian, Sina [2 ]
Avanaki, Mahsa Alborzi [1 ]
Firouznia, Kavous [1 ]
Affiliations
[1] Univ Tehran Med Sci, Adv Diagnost & Intervent Radiol Res Ctr ADIR, Tehran, Iran
[2] Univ Tehran Med Sci, Cardiovasc Dis Res Inst, Cardiac Primary Prevent Res Ctr, Tehran, Iran
Keywords
Systematic review; ChatGPT; AI; Large language model; Article screening; Radiology; GPT
DOI
10.1186/s12874-024-02203-8
Chinese Library Classification (CLC)
R19 [Health care organization and services (health service management)]
Abstract
Background: The screening process for systematic reviews and meta-analyses in medical research is a labor-intensive and time-consuming task. While machine learning and deep learning have been applied to facilitate this process, these methods often require training data and user annotation. This study aims to assess the efficacy of ChatGPT, a large language model based on the Generative Pre-trained Transformer (GPT) architecture, in automating the screening process for systematic reviews in radiology without the need for training data.

Methods: A prospective simulation study was conducted between May 2nd and 24th, 2023, comparing ChatGPT's performance in screening abstracts against that of general physicians (GPs). A total of 1198 abstracts across three subfields of radiology were evaluated. Performance was assessed with sensitivity, specificity, positive and negative predictive values (PPV and NPV), and workload savings, among other metrics. Statistical analyses included the kappa coefficient for inter-rater agreement, ROC curve plotting, AUC calculation, and bootstrapping for p-values and confidence intervals.

Results: ChatGPT completed the screening process within an hour, while the GPs took an average of 7-10 days. The AI model achieved a sensitivity of 95% and an NPV of 99%, slightly outperforming the GPs' sensitive consensus (i.e., a record is included if at least one rater includes it). It also exhibited remarkably low false-negative counts and high workload savings, ranging from 40% to 83%. However, ChatGPT had lower specificity and PPV than the human raters. The average kappa agreement between ChatGPT and the other raters was 0.27.

Conclusions: ChatGPT shows promise in automating the article-screening phase of systematic reviews, achieving high sensitivity and workload savings. While not entirely replacing human expertise, it could serve as an efficient first-line screening tool, particularly in reducing the burden on human resources. Further studies are needed to fine-tune its capabilities and validate its utility across different medical subfields.
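To make the metrics concrete, here is a minimal sketch, in Python, of how the screening measures named in the Methods are computed from a 2x2 confusion matrix of include/exclude decisions. This is not the authors' code: the function name, the example counts, and the workload-saving definition (share of records the screener excludes) are illustrative assumptions.

```python
def screening_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Standard abstract-screening metrics from a 2x2 confusion matrix.

    "Positive" means an abstract flagged for inclusion. The workload-saving
    definition below (share of records excluded, i.e. abstracts a human need
    not read) is one common convention, assumed here for illustration.
    """
    total = tp + fp + tn + fn
    return {
        "sensitivity": tp / (tp + fn),         # truly relevant records retained
        "specificity": tn / (tn + fp),         # irrelevant records correctly excluded
        "ppv": tp / (tp + fp),                 # precision of "include" decisions
        "npv": tn / (tn + fn),                 # reliability of "exclude" decisions
        "workload_saving": (tn + fn) / total,  # fraction of abstracts not read
    }

# Hypothetical counts for one screening run (not data from the paper):
print(screening_metrics(tp=38, fp=150, tn=1000, fn=2))
```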
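Likewise, a brief sketch of the kappa coefficient and the percentile bootstrap that the statistical analysis mentions; the rater decisions and helper names below are hypothetical, not data or code from the study.

```python
import random

def cohens_kappa(a, b):
    """Cohen's kappa for two raters making 0/1 (exclude/include) decisions."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    p_a, p_b = sum(a) / n, sum(b) / n
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)  # agreement expected by chance
    denom = 1 - expected
    return (observed - expected) / denom if denom else 1.0  # guard: unanimous resample

def bootstrap_ci(a, b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for kappa over record indices."""
    rng = random.Random(seed)
    n = len(a)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(cohens_kappa([a[i] for i in idx], [b[i] for i in idx]))
    stats.sort()
    return stats[int(n_boot * alpha / 2)], stats[int(n_boot * (1 - alpha / 2)) - 1]

# Hypothetical include (1) / exclude (0) decisions over ten abstracts:
chatgpt = [1, 1, 0, 1, 0, 0, 1, 0, 1, 1]
gp      = [1, 0, 0, 1, 0, 1, 1, 0, 0, 1]
print(cohens_kappa(chatgpt, gp))   # point estimate (0.4 for these toy data)
print(bootstrap_ci(chatgpt, gp))   # 95% percentile interval
```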
Pages: 11