ArcheType: A Novel Framework for Open-Source Column Type Annotation using Large Language Models

Cited by: 1
Authors
Feuer, Benjamin [1 ]
Liu, Yurong [1 ]
Hegde, Chinmay [1 ]
Freire, Juliana [1 ]
Affiliations
[1] NYU, New York, NY 10016 USA
Source
PROCEEDINGS OF THE VLDB ENDOWMENT | 2024 / Vol. 17 / Issue 09
DOI
10.14778/3665844.3665857
Chinese Library Classification number
TP [Automation technology, computer technology];
Discipline classification code
0812 ;
Abstract
Existing deep-learning approaches to semantic column type annotation (CTA) have important shortcomings: they rely on semantic types that are fixed at training time; require a large number of training samples per type; incur high run-time inference costs; and their performance can degrade when evaluated on novel datasets, even when types remain constant. Large language models have exhibited strong zero-shot classification performance on a wide range of tasks, and in this paper we explore their use for CTA. We introduce ArcheType, a simple, practical method for context sampling, prompt serialization, model querying, and label remapping, which enables large language models to solve CTA problems in a fully zero-shot manner. We ablate each component of our method separately, and establish that improvements to context sampling and label remapping provide the most consistent gains. ArcheType establishes new state-of-the-art performance on zero-shot CTA benchmarks (including three new domain-specific benchmarks which we release along with this paper), and when used in conjunction with classical CTA techniques, it outperforms a SOTA DoDuo model on the fine-tuned SOTAB benchmark.
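The four stages named in the abstract (context sampling, prompt serialization, model querying, label remapping) can be pictured with a minimal Python sketch. This is not the authors' implementation: every function name, the prompt wording, the sampling rule, and the string-similarity remapping heuristic below are assumptions chosen for illustration, and `fake_model` is a hypothetical stand-in for a real language model call.

```python
import random
from difflib import SequenceMatcher


def sample_context(column, k=5, seed=0):
    """Context sampling (assumed rule): up to k distinct non-empty values."""
    unique = sorted({str(v).strip() for v in column
                     if v is not None and str(v).strip()})
    if len(unique) <= k:
        return unique
    return random.Random(seed).sample(unique, k)


def serialize_prompt(samples, labels):
    """Prompt serialization (assumed wording): values plus allowed labels."""
    return (
        "Column values: " + ", ".join(samples) + "\n"
        "Options: " + ", ".join(labels) + "\n"
        "Answer with exactly one option."
    )


def remap_label(raw_answer, labels):
    """Label remapping (assumed heuristic): force output into the label set."""
    cleaned = raw_answer.strip().lower()
    # 1. Exact match against an allowed label.
    for label in labels:
        if label.lower() == cleaned:
            return label
    # 2. An allowed label appears somewhere inside the free-form answer.
    for label in labels:
        if label.lower() in cleaned:
            return label
    # 3. Fall back to the most similar label by character overlap.
    return max(labels,
               key=lambda l: SequenceMatcher(None, l.lower(), cleaned).ratio())


def annotate_column(column, labels, query_model):
    """Run the four stages; query_model is any callable prompt -> str."""
    samples = sample_context(column)
    prompt = serialize_prompt(samples, labels)
    raw_answer = query_model(prompt)  # model querying
    return remap_label(raw_answer, labels)


def fake_model(prompt):
    """Hypothetical stand-in for an LLM that answers verbosely."""
    return "I think this is a Country column."


if __name__ == "__main__":
    labels = ["city", "country", "person name"]
    print(annotate_column(["France", "Japan", "Brazil"], labels, fake_model))
```

Because the label set is passed in at call time rather than fixed during training, a sketch like this can be pointed at a new set of types without retraining, which is the zero-shot property the abstract emphasizes. The remapping step matters because a model's free-form answer is not guaranteed to be one of the allowed labels.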
Pages: 2279 - 2292
Number of pages: 14