Table Meets LLM: Can Large Language Models Understand Structured Table Data? A Benchmark and Empirical Study

Cited by: 15
Authors
Sui, Yuan [1 ,4 ]
Zhou, Mengyu [2 ]
Zhou, Mingjie [3 ,4 ]
Han, Shi [2 ]
Zhang, Dongmei [2 ]
Affiliations
[1] Natl Univ Singapore, Singapore, Singapore
[2] Microsoft, Beijing, Peoples R China
[3] Univ Hong Kong, Hong Kong, Peoples R China
[4] Microsoft Res Asia, Beijing, Peoples R China
Source
PROCEEDINGS OF THE 17TH ACM INTERNATIONAL CONFERENCE ON WEB SEARCH AND DATA MINING, WSDM 2024 | 2024
Keywords
large language models; semi-structured data; structural understanding capabilities; benchmark;
DOI
10.1145/3616855.3635752
CLC Classification Number
TP18 [Theory of Artificial Intelligence];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Large language models (LLMs) are becoming attractive as few-shot reasoners for solving Natural Language (NL)-related tasks. However, there is still much to learn about how well LLMs understand structured data, such as tables. Although tables can be serialized as input to LLMs, there is a lack of comprehensive studies examining whether LLMs can truly comprehend such data. In this paper, we try to understand this by designing a benchmark to evaluate the structural understanding capabilities (SUC) of LLMs. The benchmark includes seven tasks, each with its own unique challenges, e.g., cell lookup, row retrieval, and size detection. We perform a series of evaluations on GPT-3.5 and GPT-4 and find that performance varies depending on several input choices, including table input format, content order, role prompting, and partition marks. Drawing on the insights gained from the benchmark evaluations, we propose self-augmentation for effective structural prompting, such as critical value / range identification using the internal knowledge of LLMs. When combined with carefully chosen input choices, these structural prompting methods lead to promising improvements in LLM performance on a variety of tabular tasks, e.g., TabFact (↑2.31%), HybridQA (↑2.13%), SQA (↑2.72%), Feverous (↑0.84%), and ToTTo (↑5.68%). We believe that our open-source benchmark and proposed prompting methods can serve as a simple yet generic selection for future research.
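The abstract describes two concrete techniques: serializing a table with explicit partition marks (one of the input design choices the benchmark varies, alongside role prompting) and a two-pass "self-augmentation" prompt that first asks the LLM to surface critical values/ranges and then folds that output into the task prompt. The Python sketch below illustrates both under stated assumptions; it is not the paper's released code, and the prompt wording and the `ask_llm` helper are hypothetical placeholders for any chat-completion call.

```python
# Minimal sketch (assumptions, not the paper's implementation) of
# partition-mark serialization plus two-pass self-augmented prompting.

def serialize_table(headers: list[str], rows: list[list[str]]) -> str:
    """Flatten a table into text, using '|' partition marks so the
    model can recover cell boundaries."""
    lines = ["|" + "|".join(headers) + "|"]
    for row in rows:
        lines.append("|" + "|".join(str(c) for c in row) + "|")
    return "\n".join(lines)

def self_augmented_prompt(table_text: str, question: str, ask_llm) -> str:
    """Pass 1: ask the LLM to identify critical values/ranges using its
    own knowledge. Pass 2: append those hints to the final task prompt.
    `ask_llm(prompt) -> str` is a placeholder for a real LLM call."""
    hints = ask_llm(
        "You are a table analyst.\n"  # role prompting, another input choice
        f"Table:\n{table_text}\n"
        "Identify the critical values and value ranges in this table."
    )
    return (
        "You are a table analyst.\n"
        f"Table:\n{table_text}\n"
        f"Critical values/ranges (self-generated): {hints}\n"
        f"Question: {question}\nAnswer:"
    )

if __name__ == "__main__":
    table = serialize_table(["city", "population"],
                            [["Springfield", "169000"], ["Shelbyville", "65000"]])
    # Stub LLM so the sketch runs without an API key.
    prompt = self_augmented_prompt(table, "Which city is larger?",
                                   ask_llm=lambda p: "population spans 65000-169000")
    print(prompt)
```

Running the stub prints the assembled final prompt, showing how the first-pass hints precede the question; swapping the lambda for a real chat-completion call preserves the same two-pass structure.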
Pages: 645-654
Page count: 10