Causal Dataset Discovery with Large Language Models

Cited by: 0
Authors
Liu, Junfei [1]
Sun, Shaotong [1]
Nargesian, Fatemeh [1]
Affiliations
[1] Univ Rochester, 601 Elmwood Ave, Rochester, NY 14627 USA
Source
WORKSHOP ON HUMAN-IN-THE-LOOP DATA ANALYTICS, HILDA 2024 | 2024
Keywords
SEARCH;
DOI
10.1145/3665939.3665968
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Subject classification codes
081104; 0812; 0835; 1405;
Abstract
Causal data discovery is crucial in scientific research because it uncovers causal links among a variety of observed variables. Causal dataset discovery is the task of identifying datasets whose columns have causal relationships with columns in a query dataset. Discovering causal links in large-scale repositories faces three major challenges: the vast scale of the data, the inherent sparsity of causal links, and the incompleteness of the variables present. Identifying causal relationships among datasets is complex and time-intensive, especially because it requires joining datasets to bring all variables together before applying causal link discovery. In this paper, we introduce the Causal Dataset Discovery problem and propose a large language model (LLM)-based framework to discover potential pairwise causal links between columns from different datasets. We heuristically improve the LLM's grasp of causality through prompting and fine-tuning, and prevent the extreme imbalance in causal candidate distributions that arises from the natural sparsity of causal connections. We create benchmarks specific to this task and experimentally show that our framework achieves remarkable performance with GPT-3.5 and GPT-4. We summarize the distinctive behaviors of different LLM strategies and discuss improvements for future research.
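The pairwise setup described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the example column names, and the prompt wording are all hypothetical, and no LLM is called. It only shows how cross-dataset column pairs might be enumerated and turned into yes/no causality questions.

```python
from itertools import product


def candidate_pairs(query_columns, candidate_columns):
    """Enumerate every (query column, candidate column) pair across two datasets."""
    return list(product(query_columns, candidate_columns))


def build_prompt(cause, effect):
    """Format a yes/no causality question for one ordered column pair.

    The wording is an assumption made for illustration only.
    """
    return (
        f"Does a change in '{cause}' directly cause a change in '{effect}'? "
        "Answer 'yes' or 'no'."
    )


# Hypothetical column names from a query dataset and a candidate dataset.
query_columns = ["smoking_rate", "median_income"]
candidate_columns = ["lung_cancer_incidence", "zip_code"]

pairs = candidate_pairs(query_columns, candidate_columns)
prompts = [build_prompt(cause, effect) for cause, effect in pairs]

for prompt in prompts:
    print(prompt)
```

Because the number of pairs grows with the product of the column counts while true causal links are sparse, most such prompts would be expected to have a negative answer, which is the imbalance the abstract refers to.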
Pages: 8