Are Large Language Models Really Good Logical Reasoners? A Comprehensive Evaluation and Beyond

Cited by: 0
Authors
Xu, Fangzhi [1 ]
Lin, Qika [1 ]
Han, Jiawei [1 ]
Zhao, Tianzhe [1 ]
Liu, Jun [2 ]
Cambria, Erik [3 ]
Affiliations
[1] Xi An Jiao Tong Univ, Sch Comp Sci & Technol, Key Lab Intelligent Networks & Network Secur, Minist Educ, Xian 710049, Shaanxi, Peoples R China
[2] Shaanxi Prov Key Lab Big Data Knowledge Engn, Xian 710049, Shaanxi, Peoples R China
[3] Nanyang Technol Univ, Coll Comp & Data Sci, Singapore 639798, Singapore
Funding
National Natural Science Foundation of China
Keywords
Cognition; Benchmark testing; Measurement; Large language models; Self-aware; Systematics; Redundancy; Knowledge engineering; Chatbots; Accuracy; Logical reasoning; large language model; deductive reasoning; inductive reasoning; abductive reasoning;
DOI
10.1109/TKDE.2025.3536008
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
Logical reasoning consistently plays a fundamental and significant role in knowledge engineering and artificial intelligence. Recently, Large Language Models (LLMs) have emerged as a noteworthy innovation in natural language processing (NLP). However, whether LLMs can effectively address logical reasoning, which requires step-by-step cognitive inference similar to human intelligence, remains an open question. This paper aims to bridge this gap and provide comprehensive evaluations. First, to offer systematic evaluations, we select fifteen typical logical reasoning datasets and organize them into deductive, inductive, abductive, and mixed-form reasoning settings. For comprehensiveness, we evaluate three representative early-era LLMs and four trending LLMs. Second, unlike previous evaluations that rely only on simple metrics (e.g., accuracy), we propose fine-grained evaluations from objective and subjective perspectives, covering both answers and explanations and including answer correctness, explanation correctness, explanation completeness, and explanation redundancy. In addition, to uncover the logical flaws of LLMs, problematic cases are attributed to five error types along two dimensions, i.e., the evidence selection process and the reasoning process. Third, to avoid the influence of knowledge bias and to benchmark the logical reasoning capability of LLMs in isolation, we propose a new dataset with neutral content. Based on these in-depth evaluations, the paper finally forms a general evaluation scheme of logical reasoning capability along six dimensions (i.e., Correct, Rigorous, Self-aware, Active, Oriented, and No hallucination). It reflects the pros and cons of LLMs and gives guiding directions for future work.
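As a reading aid only (not taken from the paper), the sketch below shows one plausible way to record and aggregate the fine-level scores the abstract describes: an objective answer-correctness check plus subjective ratings of explanation correctness, completeness, and redundancy. The Judgement fields, the 0-1 rating scale, and the aggregation by simple averaging are illustrative assumptions, not the authors' actual protocol.

# Hypothetical scoring sketch (illustration only; field names and scales are assumptions).
from dataclasses import dataclass
from statistics import mean

@dataclass
class Judgement:
    answer_correct: bool       # objective: final answer matches the gold label
    explain_correct: float     # subjective rating in [0, 1]: reasoning steps are valid
    explain_complete: float    # subjective rating in [0, 1]: no required step is missing
    explain_redundant: float   # subjective rating in [0, 1]: share of irrelevant content

def aggregate(judgements):
    """Average each fine-level metric over a set of judged cases."""
    return {
        "answer_accuracy": mean(j.answer_correct for j in judgements),
        "explain_correctness": mean(j.explain_correct for j in judgements),
        "explain_completeness": mean(j.explain_complete for j in judgements),
        "explain_redundancy": mean(j.explain_redundant for j in judgements),
    }

# Example: two judged cases, one fully correct and one partially flawed.
print(aggregate([Judgement(True, 1.0, 1.0, 0.1), Judgement(False, 0.5, 0.6, 0.4)]))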
Pages: 1620-1634
Number of pages: 15