From Algorithms to Academia: An Endeavor to Benchmark AI-Generated Scientific Papers against Human Standards

Cited by: 0
Authors
Woodrow, Jackson [1]
Nassour, Nour [1]
Kwon, John Y. [1]
Ashkani-Esfahani, Soheil [1]
Harris, Mitchel [1]
Affiliations
[1] Harvard Medical School, Massachusetts General Hospital, Foot & Ankle Research and Innovation Lab (FARIL), Boston, MA, USA
Source
ARCHIVES OF BONE AND JOINT SURGERY-ABJS | 2025 / Vol. 13 / Issue 04
Keywords
Artificial intelligence; ChatGPT; Large language models; Natural language processing; Prompt engineering; Prediction
DOI
10.22038/ABJS.2024.80093.3669
Chinese Library Classification
R826.8 [Plastic surgery]; R782.2 [Oral and maxillofacial plastic surgery]; R726.2 [Pediatric plastic surgery]; R62 [Plastic surgery (reconstructive surgery)]
Discipline classification code
Abstract
Objectives: This study quantitatively investigates the accuracy of text generated by AI large language models and compares its readability and likelihood of acceptance to a scientific journal with those of human-authored papers on the same topics.

Methods: The study comprised two papers written by ChatGPT, two papers written by Assistant by scite, and two papers written by humans. Six independent reviewers, blinded to the authorship of each paper, graded each subsection on a scale of 1 to 4. Each reviewer was also asked to guess whether the paper was written by a human or by AI and to explain their reasoning. The study authors additionally graded each AI-generated paper on the factual accuracy of its claims and citations.

Results: The human-written calcaneus fracture paper received the highest score (3.70/4), followed by the Assistant-written calcaneus fracture paper (3.02/4), the human-written ankle osteoarthritis paper (2.98/4), the ChatGPT calcaneus fracture paper (2.89/4), the ChatGPT ankle osteoarthritis paper (2.87/4), and the Assistant ankle osteoarthritis paper (2.78/4). The human calcaneus fracture paper was rated statistically significantly higher than the ChatGPT calcaneus fracture paper (P = 0.028) and the Assistant calcaneus fracture paper (P = 0.043). For factual accuracy, the ChatGPT ankle osteoarthritis review scored 100%, the ChatGPT calcaneus fracture review 97.46%, the Assistant calcaneus fracture review 95.56%, and the Assistant ankle osteoarthritis review 94.98%. For citation accuracy, the ChatGPT ankle osteoarthritis paper scored 90%, the ChatGPT calcaneus fracture paper 69.23%, the Assistant calcaneus fracture paper 39.68%, and the Assistant ankle osteoarthritis paper 35.14%.

Conclusion: While AI holds the promise of enhancing knowledge sharing, it must be used responsibly and in conjunction with comprehensive fact-checking procedures to maintain the integrity of scientific discourse.

Level of evidence: III
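The accuracy figures in the abstract are percentages of claims or citations judged correct. The following is a minimal sketch of that arithmetic, not the authors' grading procedure: the function name `accuracy_percent` and the counts passed to it are hypothetical, chosen only for illustration, and the abstract does not report the underlying counts.

```python
def accuracy_percent(correct: int, total: int) -> float:
    """Return the share of items judged correct, as a percentage
    rounded to two decimal places (the precision used in the abstract)."""
    if total <= 0:
        raise ValueError("total must be a positive number of items")
    if not 0 <= correct <= total:
        raise ValueError("correct must lie between 0 and total")
    return round(correct / total * 100, 2)


# Hypothetical counts, NOT taken from the study:
# 9 citations verified out of 13 checked.
print(accuracy_percent(9, 13))
```

Reporting the raw counts alongside each percentage would let a reader see how many items each figure rests on.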
Pages: 212-222
Number of pages: 11