Using hierarchical text classification to investigate the utility of machine learning in automating online analyses of wildlife exploitation

Cited by: 7
Authors
Hunter, Sara Bronwen [1]
Mathews, Fiona [1]
Weeds, Julie [2]
Affiliations
[1] Univ Sussex, Sch Life Sci, Brighton BN1 9QG, England
[2] Univ Sussex, Sch Engn & Informat, Brighton BN1 9QJ, England
Keywords
Machine learning; Natural language processing; iEcology; Wildlife exploitation; Digital conservation; Social media
DOI
10.1016/j.ecoinf.2023.102076
Chinese Library Classification number
Q14 [Ecology (bioecology)]
Discipline classification code
071012; 0713
Abstract
Expanding digital data sources, including social media, online news articles and blogs, provide an opportunity to understand better the context and intensity of human-nature interactions, such as wildlife exploitation. However, online searches encompassing large taxonomic groups can generate vast datasets, which can be overwhelming to filter for relevant content without the use of automated tools. The variety of machine learning models available to researchers, and the need for manually labelled training data with an even balance of labels, can make applying these tools challenging. Here, we implement and evaluate a hierarchical text classification pipeline which brings together three binary classification tasks with increasingly specific relevancy criteria. Crucially, the hierarchical approach facilitates the filtering and structuring of a large dataset, of which relevant sources make up a small proportion. Using this pipeline, we also investigate how the accuracy with which text classifiers identify relevant and irrelevant texts is influenced by the use of different models, training datasets, and the classification task. To evaluate our methods, we collected data from Facebook, Twitter, Google and Bing search engines, with the aim of identifying sources documenting the hunting and persecution of bats (Chiroptera). Overall, the 'state-of-the-art' transformer-based models were able to identify relevant texts with an average accuracy of 90%, with some classifiers achieving accuracy of >95%. Whilst this demonstrates that application of more advanced models can lead to improved accuracy, comparable performance was achieved by simpler models when applied to longer documents and less ambiguous classification tasks. Hence, the benefits from using more computationally expensive models are dependent on the classification context. 
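The hierarchical pipeline described above, three binary tasks with increasingly specific relevancy criteria, can be sketched roughly as a cascade in which a text reaches a later stage only if it passed every earlier one. This is a minimal illustration, not the authors' implementation: the stage criteria, the keyword lists, and the helper names `keyword_classifier` and `hierarchical_filter` are all assumptions, and the keyword matchers merely stand in for trained models.

```python
from typing import Callable, List, Sequence

# A binary classifier is any callable mapping a text to True (relevant) or False.
BinaryClassifier = Callable[[str], bool]


def keyword_classifier(keywords: Sequence[str]) -> BinaryClassifier:
    """Hypothetical stand-in for a trained model: relevant if any keyword occurs."""
    lowered = [k.lower() for k in keywords]

    def classify(text: str) -> bool:
        t = text.lower()
        return any(k in t for k in lowered)

    return classify


def hierarchical_filter(texts: Sequence[str],
                        stages: Sequence[BinaryClassifier]) -> List[int]:
    """Return, for each text, how many consecutive stages it passed."""
    depths = []
    for text in texts:
        depth = 0
        for stage in stages:
            if not stage(text):
                break  # rejected here, so later (more specific) stages are skipped
            depth += 1
        depths.append(depth)
    return depths


# Assumed stage criteria, ordered from broad to specific.
stages = [
    keyword_classifier(["bat"]),                                      # mentions bats
    keyword_classifier(["villagers", "farmers", "hunters", "people"]),  # involves people
    keyword_classifier(["hunt", "kill", "cull"]),                     # hunting or persecution
]

texts = [
    "Stock markets fell sharply today",
    "Fruit bats roost in the old fig tree",
    "Farmers planted trees to attract bats",
    "Hunters kill fruit bats for food in the village",
]

depths = hierarchical_filter(texts, stages)
for text, depth in zip(texts, depths):
    print(depth, text)
```

Because irrelevant texts are discarded at the first stage, the later and more specific classifiers only see a small, already filtered subset, which is the property the abstract highlights for datasets where relevant sources are rare.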
We also found that stratification of training data, according to the presence of key search terms, improved classification accuracy for less frequent topics within datasets, and therefore improves the applicability of classifiers to future data collection. Overall, whilst our findings reinforce the usefulness of automated tools for facilitating online analyses in conservation and ecology, they also highlight that the effectiveness and appropriateness of such tools are determined by the nature and volume of data collected, the complexity of the classification task, and the computational resources available to researchers.
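One way to picture stratifying training data by search term is to group texts by the term they contain and draw the same number from each group for manual labelling, so that rare topics are not swamped by common ones. The sketch below rests on assumptions: the abstract does not state the allocation scheme, so equal allocation per stratum is a guess, and the search terms, the example corpus, and the function names `stratum_of` and `stratified_sample` are hypothetical.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence


def stratum_of(text: str, search_terms: Sequence[str]) -> str:
    """Assign a text to the first search term it contains, else 'other'."""
    lowered = text.lower()
    for term in search_terms:
        if term.lower() in lowered:
            return term
    return "other"


def stratified_sample(texts: Sequence[str],
                      search_terms: Sequence[str],
                      per_stratum: int,
                      seed: int = 0) -> Dict[str, List[str]]:
    """Draw up to `per_stratum` texts from each search-term stratum."""
    rng = random.Random(seed)  # fixed seed so the draw is repeatable
    strata: Dict[str, List[str]] = defaultdict(list)
    for text in texts:
        strata[stratum_of(text, search_terms)].append(text)
    return {
        term: rng.sample(members, min(per_stratum, len(members)))
        for term, members in strata.items()
    }


# Hypothetical corpus: one frequent topic, one rare topic, one unmatched text.
corpus = (
    [f"bat hunting report {i}" for i in range(8)]
    + ["bat bushmeat market 0", "bat bushmeat market 1"]
    + ["weather update"]
)
terms = ["hunting", "bushmeat"]

sample = stratified_sample(corpus, terms, per_stratum=2)
sizes = {term: len(members) for term, members in sample.items()}
print(sizes)
```

Under simple random sampling, the frequent "hunting" texts would dominate the labelled set; here each stratum contributes at most `per_stratum` texts.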
Pages: 11