Can artificial intelligence diagnose seizures based on patients' descriptions? A study of GPT-4

Times cited: 0
Authors
Ford, Joseph [1]
Pevy, Nathan [1]
Grunewald, Richard [1]
Howell, Stephen [2]
Reuber, Markus [1]
Affiliations
[1] Univ Sheffield, Acad Neurol Unit, Sheffield, England
[2] Royal Hallamshire Hosp, Dept Neurol, Sheffield, England
Keywords
artificial intelligence; automated diagnosis; large language model; epilepsy; functional/dissociative seizures; misdiagnosis; management
DOI
10.1111/epi.18322
Chinese Library Classification number
R74 [Neurology and Psychiatry]
Abstract
Objective: Generalist large language models (LLMs) have shown diagnostic potential in various medical contexts but have not been explored extensively in relation to epilepsy. This paper aims to test the performance of an LLM (OpenAI's GPT-4) on the differential diagnosis of epileptic and functional/dissociative seizures (FDS) based on patients' descriptions.

Methods: GPT-4 was asked to diagnose 41 cases of epilepsy (n = 16) or FDS (n = 25) based on transcripts of patients describing their symptoms (median word count = 399). It was first asked to perform this task without additional training examples (zero-shot) before being asked to perform it having been given one, two, and three examples of each condition (one-, two-, and three-shot). As a benchmark, three experienced neurologists performed this task without access to any additional clinical or demographic information (e.g., age, gender, socioeconomic status).

Results: In the zero-shot condition, GPT-4's average balanced accuracy was 57% (kappa = .15). Balanced accuracy improved in the one-shot condition (64%, kappa = .27), but did not improve any further in the two-shot (62%, kappa = .24) and three-shot (62%, kappa = .23) conditions. Performance in all four conditions was worse than the mean balanced accuracy of the experienced neurologists (71%, kappa = .42). However, in the subset of 18 cases that all three neurologists had "diagnosed" correctly (median word count = 684), GPT-4's balanced accuracy was 81% (kappa = .66).

Significance: Although its "raw" performance was poor, GPT-4 showed noticeable improvement having been given just one example of a patient describing epilepsy and FDS. Giving two and three examples did not further improve performance, but the finding that GPT-4 did much better in those cases correctly diagnosed by all three neurologists suggests that providing more extensive clinical data and more elaborate approaches (e.g., more refined prompt engineering, fine-tuning, or retrieval augmented generation) could unlock the full diagnostic potential of LLMs.
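
For readers unfamiliar with the two metrics reported in the abstract, the following is a minimal sketch of how balanced accuracy and Cohen's kappa can be computed from a 2x2 confusion matrix. It assumes that "kappa" refers to Cohen's kappa between the predicted and reference diagnoses, and it treats epilepsy as the positive class. The function names and the counts in the usage lines are hypothetical and chosen only to match the class sizes (16 epilepsy, 25 FDS); they are not results from the study.

```python
def balanced_accuracy(tp: int, fn: int, tn: int, fp: int) -> float:
    """Mean of sensitivity and specificity, which corrects for class imbalance."""
    sensitivity = tp / (tp + fn)  # share of epilepsy cases labelled epilepsy
    specificity = tn / (tn + fp)  # share of FDS cases labelled FDS
    return (sensitivity + specificity) / 2


def cohens_kappa(tp: int, fn: int, tn: int, fp: int) -> float:
    """Agreement between predicted and reference labels, corrected for chance."""
    n = tp + fn + tn + fp
    observed = (tp + tn) / n
    # Chance agreement from the marginal totals of each label.
    expected = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / n**2
    return (observed - expected) / (1 - expected)


# Hypothetical counts for illustration only (not data from the paper):
# 16 epilepsy cases (10 correct), 25 FDS cases (17 correct).
example = dict(tp=10, fn=6, tn=17, fp=8)
print(f"balanced accuracy: {balanced_accuracy(**example):.3f}")
print(f"Cohen's kappa:     {cohens_kappa(**example):.3f}")
```

Because the two classes are unequal in size, balanced accuracy weights errors on the smaller epilepsy group as heavily as errors on the larger FDS group, which plain accuracy would not.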
Pages: 16