Chatbots Put to the Test in Math and Logic Problems: A Comparison and Assessment of ChatGPT-3.5, ChatGPT-4, and Google Bard

Cited: 22
Authors
Plevris, Vagelis [1 ]
Papazafeiropoulos, George [2 ]
Rios, Alejandro Jimenez [3 ]
Affiliations
[1] Qatar Univ, Dept Civil & Environm Engn, POB 2713, Doha, Qatar
[2] Natl Tech Univ Athens, Sch Civil Engn, Athens 15780, Greece
[3] Oslo Metropolitan Univ, Dept Built Environm, N-0166 Oslo, Norway
Keywords
chatbot; AI; logic; mathematics; ChatGPT; GPT-3.5; GPT-4; Google Bard
DOI
10.3390/ai4040048
CLC Classification Number
TP18 [Theory of Artificial Intelligence]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
In an age where artificial intelligence is reshaping the landscape of education and problem solving, our study compares three chatbots, ChatGPT-3.5, ChatGPT-4, and Google Bard, on mathematical and logical problem-solving tasks. We assess the ability of the chatbots to understand the given problem, employ appropriate algorithms or methods to solve it, and generate coherent responses with correct answers. We conducted our study using a set of 30 questions. These questions were carefully crafted to be clear, unambiguous, and fully described using plain text only. Each question has a unique and well-defined correct answer. The questions were divided into two sets of 15: Set A consists of "Original" problems that cannot be found online, while Set B includes "Published" problems that are readily available online, often with their solutions. Each question was presented to each chatbot three times in May 2023. We recorded and analyzed their responses, highlighting their strengths and weaknesses. Our findings indicate that chatbots can provide accurate solutions for straightforward arithmetic, algebraic expressions, and basic logic puzzles, although they may not be consistently accurate in every attempt. However, for more complex mathematical problems or advanced logic tasks, the chatbots' answers, although they appear convincing, may not be reliable. Furthermore, consistency is a concern, as chatbots often provide conflicting answers when presented with the same question multiple times. To evaluate and compare the performance of the three chatbots, we conducted a quantitative analysis by scoring their final answers based on correctness. Our results show that ChatGPT-4 performs better than ChatGPT-3.5 on both sets of questions. Bard ranks third on the original questions of Set A, trailing behind the other two chatbots. However, Bard achieves the best performance, taking first place, on the published questions of Set B. This is likely due to Bard's direct access to the internet, unlike the ChatGPT chatbots, which, by design, do not have external communication capabilities.
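The paper does not reproduce its scoring code in this record; the following is a minimal sketch of the quantitative analysis the abstract describes (each chatbot answers every question three times, and final answers are scored for correctness). The function name, data layout, and the 0/1-per-attempt tally are assumptions for illustration only, not the authors' actual scoring scheme.

```python
from collections import defaultdict

def score_responses(responses, correct_answers):
    """Tally correct final answers per chatbot.

    responses: {(chatbot, question_id): [attempt1, attempt2, attempt3]}
    correct_answers: {question_id: correct_answer}
    Each attempt scores 1 if it matches the correct answer, else 0.
    """
    totals = defaultdict(int)
    for (bot, question_id), attempts in responses.items():
        for answer in attempts:
            if answer == correct_answers[question_id]:
                totals[bot] += 1
    return dict(totals)

# Made-up example: one question, three attempts per chatbot.
example = {
    ("ChatGPT-4", "A1"): [42, 42, 42],  # consistent and correct
    ("Bard", "A1"): [42, 41, 42],       # inconsistent across attempts
}
print(score_responses(example, {"A1": 42}))  # → {'ChatGPT-4': 3, 'Bard': 2}
```

A per-attempt tally like this also surfaces the consistency problem the abstract highlights: a chatbot that answers the same question differently on repeated attempts scores below the maximum even when one attempt is correct.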
Pages: 949-969 (21 pages)