From Algorithms to Academia: An Endeavor to Benchmark AI-Generated Scientific Papers against Human Standards

Cited by: 0
Authors
Woodrow, Jackson [1]
Nassour, Nour [1]
Kwon, John Y. [1]
Ashkani-Esfahani, Soheil [1]
Harris, Mitchel [1]
Affiliations
[1] Harvard Med Sch, Massachusetts Gen Hosp, Foot & Ankle Res & Innovat Lab FARIL, Boston, MA USA
Source
ARCHIVES OF BONE AND JOINT SURGERY-ABJS | 2025 / Vol. 13 / No. 04
Keywords
Artificial intelligence; ChatGPT; Large language models; Natural language processing; Prompt engineering; PREDICTION;
DOI
10.22038/ABJS.2024.80093.3669
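Chinese Library Classification
R826.8 [Plastic surgery]; R782.2 [Oral and maxillofacial plastic surgery]; R726.2 [Pediatric plastic surgery]; R62 [Plastic surgery (reconstructive surgery)];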
Abstract
Objectives: This study quantitatively investigates the accuracy of text generated by AI large language models and compares its readability and likelihood of acceptance to a scientific journal with those of human-authored papers on the same topics.
Methods: The study comprised two papers written by ChatGPT, two written by Assistant by scite, and two written by humans. Six independent reviewers, blinded to the authorship of each paper, graded each subsection on a scale of 1 to 4. Each reviewer was also asked to guess whether the paper was written by a human or by AI and to explain their reasoning. The study authors also graded each AI-generated paper on the factual accuracy of its claims and citations.
Results: The human-written calcaneus fracture paper received the highest score (3.70/4), followed by the Assistant calcaneus fracture paper (3.02/4), the human-written ankle osteoarthritis paper (2.98/4), the ChatGPT calcaneus fracture paper (2.89/4), the ChatGPT ankle osteoarthritis paper (2.87/4), and the Assistant ankle osteoarthritis paper (2.78/4). The human calcaneus fracture paper was rated statistically significantly higher than the ChatGPT calcaneus fracture paper (P = 0.028) and the Assistant calcaneus fracture paper (P = 0.043). For factual accuracy, the ChatGPT ankle osteoarthritis review scored 100%, the ChatGPT calcaneus fracture review 97.46%, the Assistant calcaneus fracture review 95.56%, and the Assistant ankle osteoarthritis review 94.98%. For citation accuracy, the ChatGPT ankle osteoarthritis paper scored 90%, the ChatGPT calcaneus fracture paper 69.23%, the Assistant calcaneus fracture paper 39.68%, and the Assistant ankle osteoarthritis paper 35.14%.
Conclusion: Through this paper we emphasize that while AI holds the promise of enhancing knowledge sharing, it must be used responsibly and in conjunction with comprehensive fact-checking procedures to maintain the integrity of scientific discourse.
Level of evidence: III
Pages: 212-222
Page count: 11