Automated classification of fauna in seabed photographs: The impact of training and validation dataset size, with considerations for the class imbalance

Cited by: 27
Authors
Durden, Jennifer M. [1]
Hosking, Brett [1]
Bett, Brian J. [1]
Cline, Danelle [2]
Ruhl, Henry A. [1, 2]
Affiliations
[1] Natl Oceanog Ctr, Southampton, England
[2] Monterey Bay Aquarium Res Inst, Moss Landing, CA USA
Funding
UK Natural Environment Research Council;
Keywords
Computer vision; Deep learning; Benthic ecology; Image annotation; Marine photography; Artificial intelligence; Convolutional neural networks; Sample size; LONG-TERM CHANGE; DEEP; HILL;
DOI
10.1016/j.pocean.2021.102612
Chinese Library Classification
P7 [Oceanography];
Discipline classification code
0707;
Abstract
Machine learning is rapidly developing as a tool for gathering data from imagery and may be useful in identifying (classifying) visible specimens in large numbers of seabed photographs. Application of an automated classification workflow requires manually identified specimens to be supplied for training and validating the model. These training and validation datasets are generally generated by partitioning the available manually identified specimens; typical ratios of training to validation dataset sizes are 75:25 or 80:20. However, this approach does not facilitate the desired scalability, which would require models to successfully classify specimens in hundreds of thousands to millions of images after training on a relatively small subset of manually identified specimens. A second problem is related to the 'class imbalance', where natural community structure means that fewer specimens of rare morphotypes are available for model training. We investigated the impact of independent variation of the training and validation dataset sizes on the performance of a convolutional neural network classifier on benthic invertebrates visible in a very large set of seabed photographs captured by an autonomous underwater vehicle at the Porcupine Abyssal Plain Sustained Observatory. We tested the impact of increasing training dataset size on specimen classification in a single validation dataset, and then tested the impact of increasing validation set size, evaluating ecological metrics in addition to computer vision metrics. Computer vision metrics (recall, precision, F1-score) indicated that classification improved with increasing training dataset size. In terms of ecological metrics, the number of morphotypes recorded increased, while diversity decreased with increasing training dataset size. Variation and bias in diversity metrics decreased with increasing training dataset size. Multivariate dispersion in apparent community composition was reduced, and bias from expert-derived data declined with increasing training dataset size. In contrast, classification success and resulting ecological metrics did not differ significantly with varying validation dataset sizes. Thus, the selection of an appropriate training dataset size is key to ensuring robust automated classifications of benthic invertebrates in seabed photographs, in terms of ecological results, and validation may be conducted on a comparatively small dataset with confidence that similar results will be obtained in a larger production dataset. In addition, our results suggest that automated classification of less common morphotypes may be feasible, provided that the overall training dataset size is sufficiently large. Thus, tactics for reducing class imbalance in the training dataset may produce improvements in the resulting ecological metrics.
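As an illustration of the two kinds of metric named in the abstract, the following is a minimal sketch, not the authors' implementation. It computes per-morphotype precision, recall and F1-score from expert and model labels, then compares morphotype richness and a diversity index between the two label sets. The abstract does not say which diversity index was used, so the Shannon index here is an assumption, as are the function names and the toy morphotype labels.

```python
from collections import Counter
from math import log


def per_class_scores(true_labels, predicted_labels):
    """Return {label: (precision, recall, f1)} for every label seen."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(true_labels, predicted_labels):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    scores = {}
    for label in set(true_labels) | set(predicted_labels):
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        precision = tp[label] / p_den if p_den else 0.0
        recall = tp[label] / r_den if r_den else 0.0
        pr_sum = precision + recall
        f1 = 2 * precision * recall / pr_sum if pr_sum else 0.0
        scores[label] = (precision, recall, f1)
    return scores


def richness(labels):
    """Number of distinct morphotypes recorded."""
    return len(set(labels))


def shannon_diversity(labels):
    """Shannon index H' = -sum(p_i * ln p_i); an assumed choice of index."""
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum((n / total) * log(n / total) for n in counts.values())


# Hypothetical toy data: a common morphotype and two rarer ones.
expert = ["ophiuroid"] * 6 + ["holothurian"] * 3 + ["anemone"]
# The model misassigns one holothurian and the single anemone to the common class.
model = ["ophiuroid"] * 6 + ["holothurian"] * 2 + ["ophiuroid"] * 2

scores = per_class_scores(expert, model)
for name, (p, r, f1) in sorted(scores.items()):
    print(f"{name}: precision={p:.2f} recall={r:.2f} F1={f1:.2f}")
print("richness (expert, model):", richness(expert), richness(model))
print("Shannon H' (expert, model):",
      round(shannon_diversity(expert), 3), round(shannon_diversity(model), 3))
```

In this toy case the rarest morphotype is never predicted, so the model-derived richness and diversity both fall below the expert-derived values. This is the kind of bias relative to expert data that the abstract reports declining as training dataset size increases.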
Pages: 11