Are Large Language Models Really Good Logical Reasoners? A Comprehensive Evaluation and Beyond

Cited by: 0
Authors
Xu, Fangzhi [1 ]
Lin, Qika [1 ]
Han, Jiawei [1 ]
Zhao, Tianzhe [1 ]
Liu, Jun [2 ]
Cambria, Erik [3 ]
Affiliations
[1] Xi An Jiao Tong Univ, Sch Comp Sci & Technol, Key Lab Intelligent Networks & Network Secur, Minist Educ, Xian 710049, Shaanxi, Peoples R China
[2] Shaanxi Prov Key Lab Big Data Knowledge Engn, Xian 710049, Shaanxi, Peoples R China
[3] Nanyang Technol Univ, Coll Comp & Data Sci, Singapore 639798, Singapore
Funding
National Natural Science Foundation of China
Keywords
Cognition; Benchmark testing; Measurement; Large language models; Self-aware; Systematics; Redundancy; Knowledge engineering; Chatbots; Accuracy; Logical reasoning; large language model; deductive reasoning; inductive reasoning; abductive reasoning
DOI
10.1109/TKDE.2025.3536008
CLC Classification Number
TP18 [Artificial Intelligence Theory]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
Logical reasoning consistently plays a fundamental and significant role in the domains of knowledge engineering and artificial intelligence. Recently, Large Language Models (LLMs) have emerged as a noteworthy innovation in natural language processing (NLP). However, whether LLMs can effectively address logical reasoning, which requires gradual cognitive inference similar to human intelligence, remains an open question. To bridge this gap, we provide comprehensive evaluations in this paper. First, to offer systematic evaluations, we select fifteen typical logical reasoning datasets and organize them into deductive, inductive, abductive, and mixed-form reasoning settings. For comprehensiveness, the evaluations cover three representative early-era LLMs and four trending LLMs. Second, unlike previous evaluations that rely only on simple metrics (e.g., accuracy), we propose fine-level evaluations in both objective and subjective manners, covering answers as well as explanations and comprising answer correctness, explanation correctness, explanation completeness, and explanation redundancy. Additionally, to uncover the logical flaws of LLMs, problematic cases are attributed to five error types along two dimensions: the evidence-selection process and the reasoning process. Third, to avoid the influence of knowledge bias and concentrate purely on benchmarking the logical reasoning capability of LLMs, we propose a new dataset with neutral content. Based on these in-depth evaluations, the paper finally forms a general evaluation scheme of logical reasoning capability along six dimensions (Correct, Rigorous, Self-aware, Active, Oriented, and No hallucination). The scheme reflects the pros and cons of LLMs and gives guiding directions for future work.
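To make the fine-level evaluation concrete, below is a minimal Python sketch of how the four answer/explanation metrics and the two-dimensional error attribution described in the abstract could be aggregated over judged cases. This is an illustrative assumption, not the authors' released code: the class and function names (EvalCase, aggregate) and the specific error-type labels are hypothetical, and it assumes each case has already received binary judgments (e.g., from human annotators or an LLM judge).

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical error taxonomy: five types grouped under the paper's two
# dimensions (evidence selection vs. reasoning). Labels are illustrative.
EVIDENCE_ERRORS = {"wrong_selection", "hallucinated_evidence"}
REASONING_ERRORS = {"wrong_direction", "incomplete_reasoning", "invalid_step"}

@dataclass
class EvalCase:
    answer_correct: bool        # objective: answer matches the gold label
    explain_correct: bool       # subjective: every reasoning step is valid
    explain_complete: bool      # subjective: all necessary steps are present
    explain_redundant: bool     # subjective: irrelevant steps are included
    error_type: Optional[str] = None  # one of the five types, for failed cases

def aggregate(cases: list[EvalCase]) -> dict[str, float]:
    """Aggregate fine-level metrics over a set of judged cases."""
    n = len(cases)
    report = {
        "answer_correctness": sum(c.answer_correct for c in cases) / n,
        "explanation_correctness": sum(c.explain_correct for c in cases) / n,
        "explanation_completeness": sum(c.explain_complete for c in cases) / n,
        "explanation_redundancy": sum(c.explain_redundant for c in cases) / n,
    }
    failures = [c for c in cases if c.error_type is not None]
    if failures:  # attribute failures to the two error dimensions
        report["evidence_error_rate"] = sum(
            c.error_type in EVIDENCE_ERRORS for c in failures) / len(failures)
        report["reasoning_error_rate"] = sum(
            c.error_type in REASONING_ERRORS for c in failures) / len(failures)
    return report

if __name__ == "__main__":
    demo = [
        EvalCase(True, True, True, False),
        EvalCase(False, False, True, True, error_type="wrong_selection"),
    ]
    print(aggregate(demo))
```

The design point the sketch captures is that answer correctness alone (the first key of the report) would hide the other failure modes: a model can answer correctly with an invalid or redundant explanation, which only the explanation-level metrics expose.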
Pages: 1620-1634 (15 pages)