Table Meets LLM: Can Large Language Models Understand Structured Table Data? A Benchmark and Empirical Study

Cited by: 15
Authors
Sui, Yuan [1 ,4 ]
Zhou, Mengyu [2 ]
Zhou, Mingjie [3 ,4 ]
Han, Shi [2 ]
Zhang, Dongmei [2 ]
Affiliations
[1] Natl Univ Singapore, Singapore, Singapore
[2] Microsoft, Beijing, Peoples R China
[3] Univ Hong Kong, Hong Kong, Peoples R China
[4] Microsoft Res Asia, Beijing, Peoples R China
Source
PROCEEDINGS OF THE 17TH ACM INTERNATIONAL CONFERENCE ON WEB SEARCH AND DATA MINING, WSDM 2024 | 2024
Keywords
large language models; semi-structured data; structural understanding capabilities; benchmark;
DOI
10.1145/3616855.3635752
CLC Classification
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Large language models (LLMs) are becoming attractive as few-shot reasoners for solving Natural Language (NL)-related tasks. However, there is still much to learn about how well LLMs understand structured data, such as tables. Although tables can be serialized as input to LLMs, there is a lack of comprehensive studies examining whether LLMs can truly comprehend such data. In this paper, we try to understand this by designing a benchmark to evaluate the structural understanding capabilities (SUC) of LLMs. The benchmark includes seven tasks, each with its own unique challenge, e.g., cell lookup, row retrieval, and size detection. We perform a series of evaluations on GPT-3.5 and GPT-4 and find that performance varies with several input choices, including table input format, content order, role prompting, and partition marks. Drawing on the insights gained from the benchmark evaluations, we propose self-augmentation for effective structural prompting, such as critical value / range identification using the internal knowledge of LLMs. When combined with carefully chosen input choices, these structural prompting methods lead to promising improvements in LLM performance on a variety of tabular tasks, e.g., TabFact (↑2.31%), HybridQA (↑2.13%), SQA (↑2.72%), Feverous (↑0.84%), and ToTTo (↑5.68%). We believe that our open-source benchmark and proposed prompting methods can serve as a simple yet generic selection for future research.
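The abstract describes two concrete mechanisms: serializing a table under particular input choices (format, partition marks, role prompting) and a two-pass "self-augmentation" prompt in which the model first generates structural hints (critical values / ranges) that are then fed back as context. The Python sketch below illustrates both; it is a minimal illustration under stated assumptions, not the paper's exact templates, and the helper names, prompt wording, and the callable llm interface are all hypothetical.

# Sketch of (1) table serialization with '|' partition marks, one of the
# input formats the benchmark compares, and (2) two-pass self-augmentation.
# Prompt wording and helper names are illustrative, not the paper's own.
from typing import Callable, List

def serialize_table(headers: List[str], rows: List[List[str]]) -> str:
    """Render a table as markdown, using '|' as an explicit partition mark."""
    lines = ["| " + " | ".join(headers) + " |",
             "| " + " | ".join("---" for _ in headers) + " |"]
    for row in rows:
        lines.append("| " + " | ".join(str(cell) for cell in row) + " |")
    return "\n".join(lines)

def self_augmented_answer(table_md: str, question: str,
                          llm: Callable[[str], str]) -> str:
    """Pass 1 asks the model for structural hints about the table; pass 2
    answers the question with those self-generated hints in context."""
    role = "You are a helpful assistant who understands tables.\n"  # role prompting
    # Pass 1: elicit the model's own structural description of the table.
    hints = llm(role + table_md +
                "\nIdentify the critical values and value ranges in this table.")
    # Pass 2: reuse the self-generated hints as extra context for the task.
    return llm(role + table_md +
               "\nStructural hints: " + hints +
               "\nQuestion: " + question + "\nAnswer:")

Usage, with any str -> str wrapper around a chat completion API standing in for llm:

table_md = serialize_table(["Player", "Points"], [["Ann", "31"], ["Bo", "28"]])
# answer = self_augmented_answer(table_md, "Who scored the most points?", my_llm)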
Pages: 645-654
Number of pages: 10
Related Papers
34 in total
  • [1] A survey of table reasoning with large language models
    Zhang, Xuanliang
    Wang, Dingzirui
    Dou, Longxu
    Zhu, Qingfu
    Che, Wanxiang
    FRONTIERS OF COMPUTER SCIENCE, 2025, 19 (09)
  • [2] Can large language models understand molecules?
    Sadeghi, Shaghayegh
    Bui, Alan
    Forooghi, Ali
    Lu, Jianguo
    Ngom, Alioune
BMC BIOINFORMATICS, 2024, 25 (01)
  • [3] Are Large Language Models Table-based Fact-Checkers?
    Zhang, Hanwen
    Si, Qingyi
    Fu, Peng
    Lin, Zheng
    Wang, Weiping
PROCEEDINGS OF THE 2024 27TH INTERNATIONAL CONFERENCE ON COMPUTER SUPPORTED COOPERATIVE WORK IN DESIGN, CSCWD 2024, 2024, : 3086 - 3091
  • [4] Using large language models for safety-related table summarization in clinical study reports
    Landman, Rogier
    Healey, Sean P.
    Loprinzo, Vittorio
    Kochendoerfer, Ulrike
    Winnier, Angela Russell
Henstock, Peter V.
    Lin, Wenyi
    Chen, Aqiu
    Rajendran, Arthi
    Penshanwar, Sushant
    Khan, Sheraz
    Madhavan, Subha
    JAMIA OPEN, 2024, 7 (02)
  • [5] Large Language Models are Versatile Decomposers: Decomposing Evidence and Questions for Table-based Reasoning
    Ye, Yunhu
    Hui, Binyuan
    Yang, Min
    Li, Binhua
    Huang, Fei
    Li, Yongbin
    PROCEEDINGS OF THE 46TH INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL, SIGIR 2023, 2023, : 174 - 184
  • [6] Enabling controllable table-to-text generation via prompting large language models with guided planning
    Zhao, Shuo
    Sun, Xin
    KNOWLEDGE-BASED SYSTEMS, 2024, 304
  • [7] Bugs in large language models generated code: an empirical study
    Tambon, Florian
    Moradi-Dakhel, Arghavan
    Nikanjam, Amin
    Khomh, Foutse
    Desmarais, Michel C.
    Antoniol, Giuliano
    EMPIRICAL SOFTWARE ENGINEERING, 2025, 30 (03)
  • [8] CRASS: A Novel Data Set and Benchmark to Test Counterfactual Reasoning of Large Language Models
    Frohberg, Jorg
    Binder, Frank
LREC 2022: THIRTEENTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION, 2022, : 2126 - 2140
  • [9] Using Large Language Models to Generate JUnit Tests: An Empirical Study
    Siddiq, Mohammed Latif
    Santos, Joanna C. S.
    Tanvir, Ridwanul Hasan
    Ulfat, Noshin
    Al Rifat, Fahmid
    Lopes, Vinicius Carvalho
PROCEEDINGS OF 2024 28TH INTERNATIONAL CONFERENCE ON EVALUATION AND ASSESSMENT IN SOFTWARE ENGINEERING, EASE 2024, 2024, : 313 - 322
  • [10] An empirical study on the effectiveness of large language models for SATD identification and classification
    Sheikhaei, Mohammad Sadegh
    Tian, Yuan
    Wang, Shaowei
    Xu, Bowen
    EMPIRICAL SOFTWARE ENGINEERING, 2024, 29 (06)