Automated classification of fauna in seabed photographs: The impact of training and validation dataset size, with considerations for the class imbalance

Cited by: 29
Authors
Durden, Jennifer M. [1]
Hosking, Brett [1]
Bett, Brian J. [1]
Cline, Danelle [2]
Ruhl, Henry A. [1,2]
Affiliations
[1] Natl Oceanog Ctr, Southampton, England
[2] Monterey Bay Aquarium Res Inst, Moss Landing, CA USA
Funding
UK Natural Environment Research Council;
Keywords
Computer vision; Deep learning; Benthic ecology; Image annotation; Marine photography; Artificial intelligence; Convolutional neural networks; Sample size; LONG-TERM CHANGE; DEEP; HILL;
DOI
10.1016/j.pocean.2021.102612
Chinese Library Classification number
P7 [Oceanography];
Discipline classification code
0707;
Abstract
Machine learning is rapidly developing as a tool for gathering data from imagery and may be useful in identifying (classifying) visible specimens in large numbers of seabed photographs. Application of an automated classification workflow requires manually identified specimens to be supplied for training and validating the model. These training and validation datasets are generally generated by partitioning the available manually identified specimens; typical ratios of training to validation dataset sizes are 75:25 or 80:20. However, this approach does not facilitate the desired scalability, which would require models to successfully classify specimens in hundreds of thousands to millions of images after training on a relatively small subset of manually identified specimens. A second problem is related to the 'class imbalance', where natural community structure means that fewer specimens of rare morphotypes are available for model training. We investigated the impact of independent variation of the training and validation dataset sizes on the performance of a convolutional neural network classifier on benthic invertebrates visible in a very large set of seabed photographs captured by an autonomous underwater vehicle at the Porcupine Abyssal Plain Sustained Observatory. We tested the impact of increasing training dataset size on specimen classification in a single validation dataset, and then tested the impact of increasing validation set size, evaluating ecological metrics in addition to computer vision metrics. Computer vision metrics (recall, precision, F1-score) indicated that classification improved with increasing training dataset size. In terms of ecological metrics, the number of morphotypes recorded increased, while diversity decreased with increasing training dataset size. Variation and bias in diversity metrics decreased with increasing training dataset size. Multivariate dispersion in apparent community composition was reduced, and bias from expert-derived data declined with increasing training dataset size. In contrast, classification success and resulting ecological metrics did not differ significantly with varying validation dataset sizes. Thus, the selection of an appropriate training dataset size is key to ensuring robust automated classifications of benthic invertebrates in seabed photographs, in terms of ecological results, and validation may be conducted on a comparatively small dataset with confidence that similar results will be obtained in a larger production dataset. In addition, our results suggest that automated classification of less common morphotypes may be feasible, provided that the overall training dataset size is sufficiently large. Thus, tactics for reducing class imbalance in the training dataset may produce improvements in the resulting ecological metrics.
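The two families of metrics named in the abstract can be illustrated with a minimal sketch. It is not the authors' code: the morphotype labels and the function names `per_class_metrics` and `shannon_index` are hypothetical. The Shannon index is used only as one common diversity measure, since the abstract does not say which diversity metric was evaluated.

```python
from collections import Counter
from math import log


def per_class_metrics(expert, predicted):
    """Return {label: (precision, recall, f1)} for two equal-length label lists."""
    if len(expert) != len(predicted):
        raise ValueError("expert and predicted must have the same length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for e, p in zip(expert, predicted):
        if e == p:
            tp[e] += 1          # correct classification of morphotype e
        else:
            fp[p] += 1          # p was predicted but is wrong
            fn[e] += 1          # e was present but missed
    metrics = {}
    for label in set(expert) | set(predicted):
        denom_p = tp[label] + fp[label]
        denom_r = tp[label] + fn[label]
        precision = tp[label] / denom_p if denom_p else 0.0
        recall = tp[label] / denom_r if denom_r else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        metrics[label] = (precision, recall, f1)
    return metrics


def shannon_index(labels):
    """Shannon diversity H' = -sum(p_i * ln p_i) over morphotype proportions."""
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum((n / total) * log(n / total) for n in counts.values())


# Hypothetical labels: expert identifications vs. classifier output.
expert = ["anemone", "anemone", "anemone",
          "holothurian", "holothurian", "ophiuroid"]
predicted = ["anemone", "anemone", "holothurian",
             "holothurian", "holothurian", "anemone"]

metrics = per_class_metrics(expert, predicted)
richness_expert = len(set(expert))        # morphotypes recorded by the expert
richness_predicted = len(set(predicted))  # morphotypes recorded by the classifier
```

In this toy case the rare morphotype (a single "ophiuroid") is never predicted, so its recall is zero and the classifier-derived richness and diversity fall below the expert-derived values. This is the kind of bias from expert-derived data that the abstract discusses in relation to class imbalance.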
Pages: 11