Toward Reliable Biodiversity Information Extraction From Large Language Models

Cited by: 1

Authors:

Elliott, Michael J. [1]

Fortes, Jose A. B. [1]

Affiliations:

[1] Univ Florida, ACIS Lab, Gainesville, FL 32610 USA

Source:

2024 IEEE 20TH INTERNATIONAL CONFERENCE ON E-SCIENCE, E-SCIENCE 2024 | 2024

Funding:

U.S. National Science Foundation

DOI:

10.1109/e-Science62913.2024.10678666

Chinese Library Classification (CLC) number:

TP39 [Computer applications]

Discipline classification codes:

081203; 0835

Abstract:

In this paper, we develop a method for extracting information from Large Language Models (LLMs) with associated confidence estimates. We propose that effective confidence models may be designed using a large number of uncertainty measures (i.e., variables that are only weakly predictive of, but positively correlated with, information correctness) as inputs. We trained a confidence model that uses 20 handcrafted uncertainty measures to predict GPT-4's ability to reproduce species occurrence data from iDigBio and found that, if we only consider occurrence claims that are placed in the top 30% of confidence estimates, we can increase prediction accuracy from 57% to 88% for species absence predictions and from 77% to 86% for species presence predictions. Using the same confidence model, we used GPT-4 to extract new data that extrapolates beyond the occurrence records in iDigBio and used the results to visualize geographic distributions for four individual species. More generally, this represents a novel use case for LLMs in generating credible pseudo data for applications in which high-quality curated data are unavailable or inaccessible.
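
The selection step described in the abstract (keeping only claims ranked in the top 30% by confidence and measuring accuracy on that subset) can be illustrated with a minimal sketch. This is not the authors' implementation: the names `accuracy`, `top_percent`, and `toy_claims` are hypothetical, the confidence scores are assumed to come from some already-trained confidence model, and the toy data are invented for illustration and are not results from the paper.

```python
from typing import List, Tuple

# Each claim is (confidence score, whether the claim matched the reference data).
# Both the type alias and the data below are illustrative assumptions.
Claim = Tuple[float, bool]


def accuracy(claims: List[Claim]) -> float:
    """Return the fraction of claims that are correct."""
    return sum(1 for _, correct in claims if correct) / len(claims)


def top_percent(claims: List[Claim], percent: int) -> List[Claim]:
    """Keep the `percent` highest-confidence claims (always at least one)."""
    ranked = sorted(claims, key=lambda claim: claim[0], reverse=True)
    keep = max(1, len(ranked) * percent // 100)
    return ranked[:keep]


# Hypothetical toy data, NOT taken from the paper.
toy_claims: List[Claim] = [
    (0.95, True), (0.90, True), (0.85, True), (0.60, False), (0.55, True),
    (0.50, False), (0.40, True), (0.30, False), (0.20, False), (0.10, False),
]

overall = accuracy(toy_claims)          # accuracy over every claim
selected = top_percent(toy_claims, 30)  # claims in the top 30% by confidence
print(f"all claims: {overall:.2f}, top 30%: {accuracy(selected):.2f}")
```

If the confidence scores are positively correlated with correctness, accuracy on the selected subset should exceed accuracy over all claims, at the cost of discarding the lower-confidence claims.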

Pages: 10

References (23 in total; entries 1-10 listed):

[1] Selecting pseudo-absences for species distribution models: how, where and how many?
Barbet-Massin, Morgane
Jiguet, Frederic
Albert, Cecile Helene
Thuiller, Wilfried
[J]. METHODS IN ECOLOGY AND EVOLUTION, 2012, 3 (02): 327 - 338
[2] Cao M, 2022, PROCEEDINGS OF THE 60TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2022), VOL 1: (LONG PAPERS), P3340
[3] XGBoost: A Scalable Tree Boosting System
Chen, Tianqi
Guestrin, Carlos
[J]. KDD'16: PROCEEDINGS OF THE 22ND ACM SIGKDD INTERNATIONAL CONFERENCE ON KNOWLEDGE DISCOVERY AND DATA MINING, 2016: 785 - 794
[4] Bridging the gap between biodiversity data and policy reporting needs: An Essential Biodiversity Variables perspective
Geijzendorffer, Ilse R.
Regan, Eugenie C.
Pereira, Henrique M.
Brotons, Lluis
Brummitt, Neil
Gavish, Yoni
Haase, Peter
Martin, Corinne S.
Mihoub, Jean-Baptiste
Secades, Cristina
Schmeller, Dirk S.
Stoll, Stefan
Wetzel, Florian T.
Walters, Michele
[J]. JOURNAL OF APPLIED ECOLOGY, 2016, 53 (05): 1341 - 1350
[5] Geva M, 2021, 2021 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP 2021), P5484
[6] Gu JD, 2023, Arxiv, DOI [arXiv:2307.12980, 10.48550/arXiv.2307.12980]
[7] Guo CA, 2017, PR MACH LEARN RES, V70
[8] Survey of Hallucination in Natural Language Generation
Ji, Ziwei
Lee, Nayeon
Frieske, Rita
Yu, Tiezheng
Su, Dan
Xu, Yan
Ishii, Etsuko
Bang, Ye Jin
Madotto, Andrea
Fung, Pascale
[J]. ACM COMPUTING SURVEYS, 2023, 55 (12)
[9] Kossen J, 2024, Arxiv, DOI arXiv:2406.15927
[10] Kuhn L, 2023, Arxiv, DOI arXiv:2302.09664
