Toward Reliable Biodiversity Information Extraction From Large Language Models

Cited by: 1
Authors
Elliott, Michael J. [1]
Fortes, Jose A. B. [1]
Affiliations
[1] Univ Florida, ACIS Lab, Gainesville, FL 32610 USA
Source
2024 IEEE 20TH INTERNATIONAL CONFERENCE ON E-SCIENCE, E-SCIENCE 2024 | 2024
Funding
U.S. National Science Foundation (NSF)
DOI
10.1109/e-Science62913.2024.10678666
Chinese Library Classification (CLC) number
TP39 [Computer applications];
Discipline classification code
081203 ; 0835 ;
Abstract
In this paper, we develop a method for extracting information from Large Language Models (LLMs) with associated confidence estimates. We propose that effective confidence models may be designed using a large number of uncertainty measures (i.e., variables that are only weakly predictive of - but positively correlated with - information correctness) as inputs. We trained a confidence model that uses 20 handcrafted uncertainty measures to predict GPT-4's ability to reproduce species occurrence data from iDigBio and found that, if we only consider occurrence claims that are placed in the top 30% of confidence estimates, we can increase prediction accuracy from 57% to 88% for species absence predictions and from 77% to 86% for species presence predictions. Using the same confidence model, we used GPT-4 to extract new data that extrapolates beyond the occurrence records in iDigBio and used the results to visualize geographic distributions for four individual species. More generally, this represents a novel use case for LLMs in generating credible pseudo data for applications in which high-quality curated data are unavailable or inaccessible.
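The filtering step described in the abstract, keeping only the occurrence claims whose confidence estimates fall in the top 30%, can be illustrated with a minimal sketch. It is not the authors' confidence model: the 20 uncertainty measures and the trained model are not reproduced here, the function names `accuracy` and `top_confidence_subset` are hypothetical, and the (confidence, correctness) pairs are made-up values chosen only to show the mechanics.

```python
from typing import List, Tuple

Claim = Tuple[float, bool]  # (confidence estimate, whether the claim was correct)


def accuracy(claims: List[Claim]) -> float:
    """Fraction of claims that are correct."""
    return sum(1 for _, ok in claims if ok) / len(claims)


def top_confidence_subset(claims: List[Claim], fraction: float) -> List[Claim]:
    """Keep the given fraction of claims with the highest confidence estimates."""
    keep = max(1, round(len(claims) * fraction))
    ranked = sorted(claims, key=lambda claim: claim[0], reverse=True)
    return ranked[:keep]


# Hypothetical data for illustration only; these are not results from the paper.
claims: List[Claim] = [
    (0.95, True), (0.90, True), (0.85, True), (0.70, False), (0.65, True),
    (0.60, False), (0.40, True), (0.35, False), (0.20, False), (0.10, True),
]

overall = accuracy(claims)
selected = top_confidence_subset(claims, 0.30)
filtered = accuracy(selected)
print(f"all claims: {overall:.2f}, top 30% by confidence: {filtered:.2f}")
```

The trade-off in this kind of filtering is coverage: accuracy on the retained claims can rise only because the lower-confidence claims are discarded, so the retained fraction is a tuning choice.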
Pages: 10