Language Models Enable Simple Systems for Generating Structured Views of Heterogeneous Data Lakes

Cited by: 9

Authors:

Arora, Simran ^{[1]}

Yang, Brandon ^{[1]}

Eyuboglu, Sabri ^{[1]}

Narayan, Avanika ^{[1]}

Hojel, Andrew ^{[1]}

Trummer, Immanuel ^{[2]}

Re, Christopher ^{[1]}

Affiliations:

[1] Stanford Univ, Stanford, CA 94305 USA

[2] Cornell Univ, Ithaca, NY 14853 USA

Source:

PROCEEDINGS OF THE VLDB ENDOWMENT | 2023 / Volume 17 / Issue 02

Keywords:

DOI:

10.14778/3626292.3626294

Chinese Library Classification number:

TP [Automation technology, computer technology];

Subject classification code:

0812;

Abstract:

A long-standing goal in the data management community is developing systems that input documents and output queryable tables without user effort. Given the sheer variety of potential documents, state-of-the-art systems make simplifying assumptions and use domain-specific training. In this work, we ask whether we can maintain generality by using the in-context learning abilities of large language models (LLMs). We propose and evaluate EVAPORATE, a prototype system powered by LLMs. We identify two strategies for implementing this system: prompt the LLM to directly extract values from documents or prompt the LLM to synthesize code that performs the extraction. Our evaluations show a cost-quality tradeoff between these two approaches. Code synthesis is cheap, but far less accurate than directly processing each document with the LLM. To improve quality while maintaining low cost, we propose an extended implementation, EVAPORATE-CODE+, which achieves better quality than direct extraction. Our insight is to generate many candidate functions and ensemble their extractions using weak supervision. EVAPORATE-CODE+ outperforms the state-of-the-art systems using a sublinear pass over the documents with the LLM. This equates to a 110x reduction in the number of documents the LLM needs to process across our 16 real-world evaluation settings.
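The idea of generating many candidate functions and ensembling their extractions can be illustrated with a minimal sketch. This is not the authors' implementation: the three extractor functions are hypothetical, hand-written stand-ins for LLM-synthesized code, the `date` attribute is an invented example, and a plain majority vote is swapped in for the weak supervision step named in the abstract, which would instead weight each function by its estimated reliability.

```python
import re
from collections import Counter
from typing import Callable, List, Optional

Extractor = Callable[[str], Optional[str]]

# Hypothetical candidates standing in for LLM-synthesized functions
# that each try to extract a "date" attribute from a document.
def by_label(doc: str) -> Optional[str]:
    """Look for an explicit 'Date:' label."""
    m = re.search(r"Date:\s*(\d{4}-\d{2}-\d{2})", doc)
    return m.group(1) if m else None

def by_iso_pattern(doc: str) -> Optional[str]:
    """Take the first ISO-formatted date anywhere in the text."""
    m = re.search(r"\b(\d{4}-\d{2}-\d{2})\b", doc)
    return m.group(1) if m else None

def by_first_line(doc: str) -> Optional[str]:
    """A deliberately noisy candidate: return the first line verbatim."""
    first = doc.splitlines()[0] if doc else ""
    return first.strip() or None

CANDIDATES: List[Extractor] = [by_label, by_iso_pattern, by_first_line]

def ensemble_extract(doc: str, candidates: List[Extractor]) -> Optional[str]:
    """Run every candidate and return the most frequent non-empty answer.

    Majority vote is a simplification used here for illustration only.
    """
    votes: Counter = Counter()
    for fn in candidates:
        try:
            value = fn(doc)
        except Exception:
            value = None  # a crashing candidate simply abstains
        if value is not None:
            votes[value] += 1
    if not votes:
        return None
    return votes.most_common(1)[0][0]

if __name__ == "__main__":
    print(ensemble_extract("Report A\nDate: 2023-01-15\nBody text", CANDIDATES))
```

In this sketch the two regex-based candidates agree and outvote the noisy one, so no LLM call is needed per document once the functions exist, which is the source of the cost saving the abstract describes.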


Pages: 92-105

Page count: 14

Total references: 68
