Fast, scalable, and automated identification of articles for biodiversity and macroecological datasets

Cited by: 14
Authors
Cornford, Richard [1, 2, 3]
Deinet, Stefanie [1]
De Palma, Adriana [2]
Hill, Samantha L. L. [4]
McRae, Louise [1]
Pettit, Benjamin [5]
Marconi, Valentina [1, 3]
Purvis, Andy [2, 3]
Freeman, Robin [1]
Affiliations
[1] Zool Soc London, Inst Zool, London NW1 4RY, England
[2] Nat Hist Museum, Dept Life Sci, London, England
[3] Imperial Coll London, Dept Life Sci, Ascot, Berks, England
[4] UNEP World Conservat Monitoring Ctr, Cambridge, England
[5] Cleo AI Ltd, London, England
Source
GLOBAL ECOLOGY AND BIOGEOGRAPHY | 2021, Vol. 30, Issue 01
Funding
Natural Environment Research Council (UK);
Keywords
automated classification; biodiversity indicators; Biodiversity Intactness Index; ecological data; Living Planet Index; machine learning; text mining; SYSTEMATIC REVIEWS; DATABASE; CLASSIFICATION; RESPONSES;
DOI
10.1111/geb.13219
Chinese Library Classification number
Q14 [Ecology (bioecology)];
Discipline classification code
071012 ; 0713 ;
Abstract
Aim: Understanding broad-scale ecological patterns and processes is necessary if we are to mitigate the consequences of anthropogenically driven biodiversity degradation. However, such analyses require large datasets and current data collation methods can be slow, involving extensive human input. Given rapid and ever-increasing rates of scientific publication, manually identifying data sources among hundreds of thousands of articles is a significant challenge, which can create a bottleneck in the generation of ecological databases.
Innovation: Here, we demonstrate the use of general, text-classification approaches to identify relevant biodiversity articles. We apply this to two freely available example databases, the Living Planet Database and the database of the PREDICTS (Projecting Responses of Ecological Diversity in Changing Terrestrial Systems) project, both of which underpin important biodiversity indicators. We assess machine-learning classifiers based on logistic regression (LR) and convolutional neural networks, and identify aspects of the text-processing workflow that influence classification performance.
Main conclusions: Our best classifiers can distinguish relevant from non-relevant articles with over 90% accuracy. Using readily available abstracts and titles or abstracts alone produces significantly better results than using titles alone. LR and neural network models performed similarly. Crucially, we show that deploying such models on real-world search results can significantly increase the rate at which potentially relevant papers are recovered compared to a current manual protocol. Furthermore, our results indicate that, given a modest initial sample of 100 relevant papers, high-performing classifiers could be generated quickly through iteratively updating the training texts based on targeted literature searches. These findings clearly demonstrate the usefulness of text-mining methods for constructing and enhancing ecological datasets, and wider application of these techniques has the potential to benefit large-scale analyses more broadly. We provide source code and examples that can be used to create new classifiers for other datasets.
Pages: 339-347
Page count: 9
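Illustrative sketch
The abstract describes classifying articles as relevant or non-relevant from their abstract text using logistic regression. The following is a minimal sketch of that general idea, not the authors' published code: it uses only the Python standard library, a plain bag-of-words representation, and batch gradient descent. The function names, hyperparameters, and the six toy abstracts are all assumptions invented for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_regression(texts, labels, epochs=200, learning_rate=0.5):
    """Fit a bag-of-words logistic regression by batch gradient descent.

    Hypothetical helper: returns a (weights, bias) pair, where weights
    maps each vocabulary token to its learned coefficient.
    """
    counts = [Counter(tokenize(text)) for text in texts]
    weights = {token: 0.0 for features in counts for token in features}
    bias = 0.0
    n = len(texts)
    for _ in range(epochs):
        grad_w = Counter()
        grad_b = 0.0
        for features, label in zip(counts, labels):
            score = bias + sum(weights[t] * c for t, c in features.items())
            error = sigmoid(score) - label  # gradient of the log loss
            grad_b += error
            for token, count in features.items():
                grad_w[token] += error * count
        bias -= learning_rate * grad_b / n
        for token, grad in grad_w.items():
            weights[token] -= learning_rate * grad / n
    return weights, bias


def predict_relevance(text, weights, bias):
    """Return the estimated probability that a text is relevant."""
    features = Counter(tokenize(text))
    score = bias + sum(weights.get(t, 0.0) * c for t, c in features.items())
    return sigmoid(score)


# Toy training texts (invented for this sketch): 1 = relevant, 0 = not.
training_texts = [
    "Population trends of seabirds monitored annually over twenty years",
    "Species abundance compared across land use types in tropical forest",
    "Long term monitoring of mammal population abundance in grassland sites",
    "A new phylogeny of beetles based on mitochondrial genes",
    "Gene expression in yeast under heat stress",
    "Protein structure prediction with molecular dynamics simulation",
]
training_labels = [1, 1, 1, 0, 0, 0]

weights, bias = train_logistic_regression(training_texts, training_labels)

relevant_score = predict_relevance(
    "Annual monitoring of bird population abundance", weights, bias
)
irrelevant_score = predict_relevance(
    "Yeast gene expression and protein structure", weights, bias
)
```

In this sketch, texts scoring above 0.5 would be passed to a human screener first. A realistic pipeline would train on far more labelled abstracts (the abstract above suggests an initial sample of about 100 relevant papers) and would likely use an established machine-learning library rather than hand-written gradient descent.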