Causal Dataset Discovery with Large Language Models

Authors
Liu, Junfei [1]
Sun, Shaotong [1]
Nargesian, Fatemeh [1]
Affiliations
[1] Univ Rochester, 601 Elmwood Ave, Rochester, NY 14627 USA
Source
WORKSHOP ON HUMAN-IN-THE-LOOP DATA ANALYTICS, HILDA 2024 | 2024
Keywords
SEARCH;
DOI
10.1145/3665939.3665968
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Causal data discovery is crucial in scientific research because it uncovers causal links among a variety of observed variables. Causal dataset discovery is the task of identifying datasets whose columns have causal relationships with columns in a query dataset. Discovering causal links in large-scale repositories faces three major challenges: the vast scale of the data, the inherent sparsity of causal links, and the incompleteness of the variables present. Identifying causal relationships among datasets is complex and time-intensive, especially because it requires joining datasets to bring all variables together before causal link discovery can be applied. In this paper, we introduce the Causal Dataset Discovery problem and propose a large language model (LLM)-based framework to discover potential pairwise causal links between columns from different datasets. We heuristically improve the LLM's grasp of causality through prompting and fine-tuning, and we prevent the extreme imbalance in causal candidate distributions that arises from the natural sparsity of causal connections. We create benchmarks specific to this task and experimentally show that our framework achieves remarkable performance with GPT-3.5 and GPT-4. We summarize the distinctive behaviors of different LLM strategies and discuss improvements for future research.
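The pairwise formulation in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the prompt wording, the toy column names, and the stand-in `toy_judge` (used here in place of a real LLM call) are all assumptions made for illustration. The sketch enumerates column pairs between a query dataset and each repository dataset without joining them, then asks a pluggable judge about each direction of the pair.

```python
from itertools import product
from typing import Callable, Dict, List, Tuple

def candidate_pairs(query_cols: List[str],
                    repo: Dict[str, List[str]]) -> List[Tuple[str, str, str]]:
    """Enumerate (query column, dataset name, candidate column) triples."""
    return [(q, name, c)
            for name, cols in repo.items()
            for q, c in product(query_cols, cols)]

def build_prompt(cause: str, effect: str) -> str:
    """Hypothetical prompt wording; the paper's actual prompts are not shown here."""
    return (f"Does a change in '{cause}' directly cause a change in "
            f"'{effect}'? Answer yes or no.")

def discover(query_cols: List[str],
             repo: Dict[str, List[str]],
             judge: Callable[[str], bool]) -> List[Tuple[str, str, str]]:
    """Return (cause, effect, dataset name) for every direction the judge accepts."""
    links = []
    for q, name, c in candidate_pairs(query_cols, repo):
        # Causality is directional, so both orderings are tested.
        for cause, effect in ((q, c), (c, q)):
            if judge(build_prompt(cause, effect)):
                links.append((cause, effect, name))
    return links

# Toy data (assumed): one query column and two repository datasets.
QUERY_COLS = ["crop_yield"]
REPO = {"weather": ["rainfall_mm", "station_id"], "sales": ["store_id"]}

# Stand-in for an LLM: accepts exactly one hand-labelled causal direction.
KNOWN_CAUSAL = {build_prompt("rainfall_mm", "crop_yield")}

def toy_judge(prompt: str) -> bool:
    return prompt in KNOWN_CAUSAL

links = discover(QUERY_COLS, REPO, toy_judge)
```

In this toy run only one of the six directed candidates is accepted, which hints at the sparsity the abstract mentions: most candidate pairs are non-causal, so any labelled set built this way is heavily imbalanced.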
Pages: 8