Are Large Language Models Really Good Logical Reasoners? A Comprehensive Evaluation and Beyond

Cited by: 0
Authors
Xu, Fangzhi [1 ]
Lin, Qika [1 ]
Han, Jiawei [1 ]
Zhao, Tianzhe [1 ]
Liu, Jun [2 ]
Cambria, Erik [3 ]
Affiliations
[1] Xi An Jiao Tong Univ, Sch Comp Sci & Technol, Key Lab Intelligent Networks & Network Secur, Minist Educ, Xian 710049, Shaanxi, Peoples R China
[2] Shaanxi Prov Key Lab Big Data Knowledge Engn, Xian 710049, Shaanxi, Peoples R China
[3] Nanyang Technol Univ, Coll Comp & Data Sci, Singapore 639798, Singapore
Funding
National Natural Science Foundation of China;
Keywords
Cognition; Benchmark testing; Measurement; Large language models; Self-aware; Systematics; Redundancy; Knowledge engineering; Chatbots; Accuracy; Logical reasoning; large language model; deductive reasoning; inductive reasoning; abductive reasoning;
DOI
10.1109/TKDE.2025.3536008
CLC number
TP18 [Theory of Artificial Intelligence];
Discipline codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Logical reasoning plays a fundamental role in knowledge engineering and artificial intelligence. Recently, Large Language Models (LLMs) have emerged as a noteworthy innovation in natural language processing (NLP). However, whether LLMs can effectively handle logical reasoning, which requires gradual cognitive inference similar to human intelligence, remains an open question. To this end, we aim to bridge this gap and provide comprehensive evaluations in this paper. First, to offer systematic evaluations, we select fifteen typical logical reasoning datasets and organize them into deductive, inductive, abductive, and mixed-form reasoning settings. For comprehensiveness, we include three early-era representative LLMs and four trending LLMs. Second, unlike previous evaluations that rely only on simple metrics (e.g., accuracy), we propose fine-level evaluations in both objective and subjective manners, covering both answers and explanations: answer correctness, explain correctness, explain completeness, and explain redundancy. Additionally, to uncover the logical flaws of LLMs, problematic cases are attributed to five error types along two dimensions, i.e., the evidence selection process and the reasoning process. Third, to avoid the influence of knowledge bias and concentrate purely on benchmarking the logical reasoning capability of LLMs, we propose a new dataset with neutral content. Based on these in-depth evaluations, the paper finally forms a general evaluation scheme of logical reasoning capability along six dimensions (i.e., Correct, Rigorous, Self-aware, Active, Oriented, and No hallucination). It reflects the pros and cons of LLMs and gives guiding directions for future work.
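To make the four fine-level metrics in the abstract concrete, the sketch below scores one model response against a gold reference. It is an illustrative approximation only, not the paper's actual protocol (the paper also uses subjective evaluation): `evaluate`, `EvalResult`, and the step-matching by normalized exact match are all hypothetical simplifications.

```python
from dataclasses import dataclass

@dataclass
class EvalResult:
    answer_correct: bool          # answer correctness
    explain_correctness: float    # fraction of predicted steps that match a gold step
    explain_completeness: float   # fraction of gold steps covered by the prediction
    explain_redundancy: float     # fraction of predicted steps matching no gold step

def _norm(s: str) -> str:
    return s.strip().lower()

def evaluate(pred_answer, gold_answer, pred_steps, gold_steps) -> EvalResult:
    # Answer correctness: normalized exact match (real setups may use option IDs).
    answer_correct = _norm(pred_answer) == _norm(gold_answer)

    gold = {_norm(s) for s in gold_steps}
    pred = [_norm(s) for s in pred_steps]
    matched = [s for s in pred if s in gold]

    correctness = len(matched) / len(pred) if pred else 0.0
    completeness = len(set(matched) & gold) / len(gold) if gold else 1.0
    redundancy = 1.0 - correctness  # unmatched steps are counted as redundant

    return EvalResult(answer_correct, correctness, completeness, redundancy)
```

In practice, matching explanation steps by exact string equality is too strict; a semantic similarity threshold or a judge model would replace `s in gold`, but the precision/recall-style decomposition of correctness versus completeness stays the same.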
Pages: 1620-1634
Page count: 15