Performance of ChatGPT on UK Standardized Admission Tests: Insights From the BMAT, TMUA, LNAT, and TSA Examinations

Cited by: 57
Authors
Giannos, Panagiotis [1 ,2 ]
Delardas, Orestis [2 ]
Affiliations
[1] Imperial Coll London, Fac Nat Sci, Dept Life Sci, London SW7 2AZ, England
[2] Promot Emerging & Evaluat Res Soc, London, England
Keywords
standardized admissions tests; GPT; ChatGPT; medical education; medicine; law; natural language processing; BMAT; TMUA; LNAT; TSA;
DOI
10.2196/47737
CLC Classification Number
G40 [Education];
Discipline Classification Code
040101 ; 120403 ;
Abstract
Background: Large language models, such as ChatGPT by OpenAI, have demonstrated potential in various applications, including medical education. Previous studies have assessed ChatGPT's performance in university or professional settings. However, the model's potential in the context of standardized admission tests remains unexplored.

Objective: This study evaluated ChatGPT's performance on standardized admission tests in the United Kingdom, including the BioMedical Admissions Test (BMAT), Test of Mathematics for University Admission (TMUA), Law National Aptitude Test (LNAT), and Thinking Skills Assessment (TSA), to understand its potential as an innovative tool for education and test preparation.

Methods: Recent public resources (2019-2022) were used to compile a data set of 509 questions from the BMAT, TMUA, LNAT, and TSA covering diverse topics in aptitude, scientific knowledge and applications, mathematical thinking and reasoning, critical thinking, problem-solving, reading comprehension, and logical reasoning. This evaluation assessed ChatGPT's performance using the legacy GPT-3.5 model, focusing on multiple-choice questions for consistency. The model's performance was analyzed based on question difficulty, the proportion of correct responses when aggregating exams from all years, and a comparison of test scores between papers of the same exam using binomial distribution and paired-sample (2-tailed) t tests.

Results: The proportion of correct responses was significantly lower than incorrect ones in BMAT section 2 (P<.001) and TMUA paper 1 (P<.001) and paper 2 (P<.001). No significant differences were observed in BMAT section 1 (P=.2), TSA section 1 (P=.7), or LNAT papers 1 and 2, section A (P=.3). ChatGPT performed better in BMAT section 1 than section 2 (P=.047), with a maximum candidate ranking of 73% compared to a minimum of 1%. In the TMUA, it engaged with questions but had limited accuracy and no performance difference between papers (P=.6), with candidate rankings below 10%. In the LNAT, it demonstrated moderate success, especially in paper 2's questions; however, student performance data were unavailable. TSA performance varied across years with generally moderate results and fluctuating candidate rankings. Similar trends were observed for easy to moderate difficulty questions (BMAT section 1, P=.3; BMAT section 2, P=.04; TMUA paper 1, P<.001; TMUA paper 2, P=.003; TSA section 1, P=.8; and LNAT papers 1 and 2, section A, P>.99) and hard to challenging ones (BMAT section 1, P=.7; BMAT section 2, P<.001; TMUA paper 1, P=.007; TMUA paper 2, P<.001; TSA section 1, P=.3; and LNAT papers 1 and 2, section A, P=.2).

Conclusions: ChatGPT shows promise as a supplementary tool for subject areas and test formats that assess aptitude, problem-solving and critical thinking, and reading comprehension. However, its limitations in areas such as scientific and mathematical knowledge and applications highlight the need for continuous development and integration with conventional learning strategies to fully harness its potential.
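The statistical approach described in the Methods — an exact binomial test of correct versus incorrect responses, plus a paired-sample two-tailed t test comparing papers of the same exam across years — can be sketched with SciPy. The counts and scores below are made-up illustrative values, not the study's data:

```python
from scipy.stats import binomtest, ttest_rel

# Exact two-sided binomial test: does the number of correct responses
# differ from what a 50/50 correct-vs-incorrect split would produce?
# (Hypothetical section with 120 questions, 45 answered correctly.)
result = binomtest(k=45, n=120, p=0.5, alternative="two-sided")
print(f"binomial P = {result.pvalue:.4f}")

# Paired-sample (two-tailed) t test comparing scores between two papers
# of the same exam, paired by year (illustrative correct counts).
paper1_scores = [12, 15, 10, 14]  # paper 1, years 2019-2022
paper2_scores = [9, 11, 8, 12]    # paper 2, same years
t_stat, p_value = ttest_rel(paper1_scores, paper2_scores)
print(f"paired t = {t_stat:.2f}, P = {p_value:.4f}")
```

With these toy numbers, both tests reject at the 5% level; the study applied the same machinery to per-section and per-difficulty response counts.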
Pages: 7