Automated classification of fauna in seabed photographs: The impact of training and validation dataset size, with considerations for the class imbalance

Cited by: 27
Authors
Durden, Jennifer M. [1]
Hosking, Brett [1]
Bett, Brian J. [1]
Cline, Danelle [2]
Ruhl, Henry A. [1,2]
Affiliations
[1] Natl Oceanog Ctr, Southampton, England
[2] Monterey Bay Aquarium Res Inst, Moss Landing, CA USA
Funding
UK Natural Environment Research Council (NERC)
Keywords
Computer vision; Deep learning; Benthic ecology; Image annotation; Marine photography; Artificial intelligence; Convolutional neural networks; Sample size; LONG-TERM CHANGE; DEEP; HILL;
DOI
10.1016/j.pocean.2021.102612
Chinese Library Classification number
P7 [Oceanography]
Discipline classification code
0707
Abstract
Machine learning is rapidly developing as a tool for gathering data from imagery and may be useful in identifying (classifying) visible specimens in large numbers of seabed photographs. Application of an automated classification workflow requires manually identified specimens to be supplied for training and validating the model. These training and validation datasets are generally generated by partitioning the available manually identified specimens; typical ratios of training to validation dataset sizes are 75:25 or 80:20. However, this approach does not facilitate the desired scalability, which would require models to successfully classify specimens in hundreds of thousands to millions of images after training on a relatively small subset of manually identified specimens. A second problem is related to the 'class imbalance', where natural community structure means that fewer specimens of rare morphotypes are available for model training. We investigated the impact of independent variation of the training and validation dataset sizes on the performance of a convolutional neural network classifier on benthic invertebrates visible in a very large set of seabed photographs captured by an autonomous underwater vehicle at the Porcupine Abyssal Plain Sustained Observatory. We tested the impact of increasing training dataset size on specimen classification in a single validation dataset, and then tested the impact of increasing validation set size, evaluating ecological metrics in addition to computer vision metrics. Computer vision metrics (recall, precision, F1-score) indicated that classification improved with increasing training dataset size. In terms of ecological metrics, the number of morphotypes recorded increased, while diversity decreased with increasing training dataset size. Variation and bias in diversity metrics decreased with increasing training dataset size. Multivariate dispersion in apparent community composition was reduced, and bias from expert-derived data declined with increasing training dataset size. In contrast, classification success and resulting ecological metrics did not differ significantly with varying validation dataset sizes. Thus, the selection of an appropriate training dataset size is key to ensuring robust automated classifications of benthic invertebrates in seabed photographs, in terms of ecological results, and validation may be conducted on a comparatively small dataset with confidence that similar results will be obtained in a larger production dataset. In addition, our results suggest that automated classification of less common morphotypes may be feasible, provided that the overall training dataset size is sufficiently large. Thus, tactics for reducing class imbalance in the training dataset may produce improvements in the resulting ecological metrics.
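The abstract contrasts computer vision metrics (recall, precision, F1-score) with ecological metrics such as diversity. The following is a minimal Python sketch of how the two kinds of metric can be computed from the same set of classified specimens; it is not the authors' code. The morphotype names and counts are invented for illustration, and the use of Hill numbers as the diversity measure is an assumption, since the abstract does not say which diversity metric was used.

```python
from collections import Counter
import math


def per_class_metrics(true_labels, predicted_labels):
    """Return {label: (precision, recall, f1)} for every label that occurs."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(true_labels, predicted_labels):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    metrics = {}
    for label in sorted(set(true_labels) | set(predicted_labels)):
        predicted_total = tp[label] + fp[label]
        actual_total = tp[label] + fn[label]
        precision = tp[label] / predicted_total if predicted_total else 0.0
        recall = tp[label] / actual_total if actual_total else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[label] = (precision, recall, f1)
    return metrics


def hill_number(counts, q):
    """Hill number (effective number of morphotypes) of order q.

    Assumption: Hill numbers stand in for the unspecified diversity metric.
    """
    total = sum(counts)
    props = [c / total for c in counts if c > 0]
    if q == 1:
        # Limit as q -> 1: exponential of Shannon entropy.
        return math.exp(-sum(p * math.log(p) for p in props))
    return sum(p ** q for p in props) ** (1 / (1 - q))


# Hypothetical, imbalanced example: 8 specimens of a common morphotype and
# 2 of a rare one; the classifier mislabels one rare specimen as common.
expert = ["common"] * 8 + ["rare"] * 2
model = ["common"] * 8 + ["common", "rare"]

metrics = per_class_metrics(expert, model)
expert_diversity = hill_number(list(Counter(expert).values()), q=2)
model_diversity = hill_number(list(Counter(model).values()), q=2)

for label, (precision, recall, f1) in metrics.items():
    print(f"{label}: precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
print(f"Hill q=2, expert: {expert_diversity:.3f}  model: {model_diversity:.3f}")
```

In this toy case a single error halves the recall of the rare morphotype and also lowers the apparent diversity relative to the expert labels. It illustrates why the two kinds of metric are worth evaluating together when classes are imbalanced.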
Pages: 11