Evaluation of Large Language Model Performance and Reliability for Citations and References in Scholarly Writing: Cross-Disciplinary Study

Cited by: 9
Authors
Mugaanyi, Joseph [1 ]
Cai, Liuying [2 ]
Cheng, Sumei [2 ]
Lu, Caide [1 ]
Huang, Jing [1 ]
Affiliations
[1] Ningbo Univ, Lihuili Hosp, Hlth Sci Ctr, Ningbo Med Ctr,Dept Hepatopancreato Biliary Surg, 1111 Jiangnan Rd, Ningbo 315000, Peoples R China
[2] Shanghai Acad Social Sci, Inst Philosophy, Shanghai, Peoples R China
Keywords
large language models; accuracy; academic writing; AI; cross-disciplinary evaluation; scholarly writing; ChatGPT; GPT-3.5; writing tool; scholarly; academic discourse; LLMs; machine learning algorithms; NLP; natural language processing; citations; references; natural science; humanities; chatbot; artificial intelligence
DOI
10.2196/52935
Chinese Library Classification (CLC)
R19 [Health Organizations and Services (Health Services Management)]
Abstract
Background: Large language models (LLMs) have gained prominence since the release of ChatGPT in late 2022.

Objective: The aim of this study was to assess the accuracy of citations and references generated by ChatGPT (GPT-3.5) in two distinct academic domains: the natural sciences and humanities.

Methods: Two researchers independently prompted ChatGPT to write an introduction section for a manuscript and include citations; they then evaluated the accuracy of the citations and Digital Object Identifiers (DOIs). Results were compared between the two disciplines.

Results: Ten topics were included: 5 in the natural sciences and 5 in the humanities. A total of 102 citations were generated, with 55 in the natural sciences and 47 in the humanities. Among these, 40 citations (72.7%) in the natural sciences and 36 citations (76.6%) in the humanities were confirmed to exist (P=.42). DOI presence differed significantly between the natural sciences (39/55, 70.9%) and the humanities (18/47, 38.3%), as did DOI accuracy between the two disciplines (18/55, 32.7% vs 4/47, 8.5%). DOI hallucination was more prevalent in the humanities (42/47, 89.4%). The Levenshtein distance between generated and verified DOIs was significantly higher in the humanities than in the natural sciences, reflecting the lower DOI accuracy.

Conclusions: ChatGPT's performance in generating citations and references varies across disciplines. Differences in DOI standards and disciplinary nuances contribute to performance variations. Researchers should consider the strengths and limitations of artificial intelligence writing tools with respect to citation accuracy. The use of domain-specific models may enhance accuracy.
Pages: 7
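The abstract's core measurement compares each model-generated DOI against the verified DOI using Levenshtein (edit) distance, where 0 means an exact match and larger values indicate a hallucinated or mangled identifier. As a minimal sketch of that comparison, assuming a standard dynamic-programming edit distance; the two DOI strings below are hypothetical placeholders, not taken from the study's data:

    def levenshtein(a: str, b: str) -> int:
        """Classic edit distance: minimum insertions, deletions,
        and substitutions needed to turn string a into string b."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, start=1):
            curr = [i]
            for j, cb in enumerate(b, start=1):
                cost = 0 if ca == cb else 1
                curr.append(min(
                    prev[j] + 1,         # deletion from a
                    curr[j - 1] + 1,     # insertion into a
                    prev[j - 1] + cost,  # substitution (or match)
                ))
            prev = curr
        return prev[-1]

    # Hypothetical example: the model's DOI is off by one character.
    generated_doi = "10.1000/example.2020.123"   # as output by the model
    verified_doi  = "10.1000/example.2020.0123"  # as resolved at the publisher
    print(levenshtein(generated_doi, verified_doi))  # -> 1

Under this metric, the abstract's finding that Levenshtein distances were significantly higher in the humanities is the quantitative counterpart of its lower DOI accuracy: generated identifiers there diverged further from any verifiable DOI.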