Are Large Language Models Really Good Logical Reasoners? A Comprehensive Evaluation and Beyond

Cited by: 0

Authors:

Xu, Fangzhi [1]

Lin, Qika [1]

Han, Jiawei [1]

Zhao, Tianzhe [1]

Liu, Jun [2]

Cambria, Erik [3]

Affiliations:

[1] Xi An Jiao Tong Univ, Sch Comp Sci & Technol, Key Lab Intelligent Networks & Network Secur, Minist Educ, Xian 710049, Shaanxi, Peoples R China

[2] Shaanxi Prov Key Lab Big Data Knowledge Engn, Xian 710049, Shaanxi, Peoples R China

[3] Nanyang Technol Univ, Coll Comp & Data Sci, Singapore 639798, Singapore

Source:

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING | 2025 / Vol. 37 / No. 04

Funding:

National Natural Science Foundation of China;

Keywords:

Cognition; Benchmark testing; Measurement; Large language models; Self-aware; Systematics; Redundancy; Knowledge engineering; Chatbots; Accuracy; Logical reasoning; large language model; deductive reasoning; inductive reasoning; abductive reasoning;

DOI:

10.1109/TKDE.2025.3536008

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Logical reasoning consistently plays a fundamental and significant role in the domains of knowledge engineering and artificial intelligence. Recently, Large Language Models (LLMs) have emerged as a noteworthy innovation in natural language processing (NLP). However, the question of whether LLMs can effectively address the task of logical reasoning, which requires gradual cognitive inference similar to human intelligence, remains unanswered. To this end, we aim to bridge this gap and provide comprehensive evaluations in this paper. First, to offer systematic evaluations, we select fifteen typical logical reasoning datasets and organize them into deductive, inductive, abductive and mixed-form reasoning settings. Considering the comprehensiveness of evaluations, we include 3 early-era representative LLMs and 4 trending LLMs. Second, different from previous evaluations relying only on simple metrics (e.g., accuracy), we propose fine-level evaluations in objective and subjective manners, covering both answers and explanations, including answer correctness, explain correctness, explain completeness and explain redundancy. Additionally, to uncover the logical flaws of LLMs, problematic cases will be attributed to five error types from two dimensions, i.e., evidence selection process and reasoning process. Third, to avoid the influences of knowledge bias and concentrate purely on benchmarking the logical reasoning capability of LLMs, we propose a new dataset with neutral content. Based on the in-depth evaluations, this paper finally forms a general evaluation scheme of logical reasoning capability from six dimensions (i.e., Correct, Rigorous, Self-aware, Active, Oriented and No hallucination). It reflects the pros and cons of LLMs and gives guiding directions for future works.

Pages: 1620-1634

Page count: 15
