A benchmark dataset of herbarium specimen images with label data

Cited by: 17
Authors
Dillen, Mathias [1]
Groom, Quentin [1]
Chagnoux, Simon [2]
Guentsch, Anton [3]
Hardisty, Alex [4]
Haston, Elspeth [5]
Livermore, Laurence [6]
Runnel, Veljo [7]
Schulman, Leif [8]
Willemse, Luc [9]
Wu, Zhengzhe [8]
Phillips, Sarah [10]
Affiliations
[1] Meise Bot Garden, Meise, Belgium
[2] Museum Natl Hist Nat, Paris, France
[3] Free Univ Berlin, Berlin, Germany
[4] Cardiff Univ, Sch Comp Sci & Informat, Cardiff, S Glam, Wales
[5] Royal Bot Garden Edinburgh, Edinburgh, Midlothian, Scotland
[6] Nat Hist Museum, London, England
[7] Univ Tartu, Tartu, Estonia
[8] Finnish Museum Nat Hist LUOMUS, Helsinki, Finland
[9] Naturalis, Leiden, Netherlands
[10] Royal Bot Gardens Kew, Surrey, England
Funding
European Union Horizon 2020
Keywords
DIGITIZATION; WORKFLOW
DOI
10.3897/BDJ.7.e31817
Chinese Library Classification number
X176 [Biodiversity conservation]
Discipline classification code
090705
Abstract
Background: More and more herbaria are digitising their collections. Images of specimens are made available online to facilitate access to them and allow extraction of information from them. Transcription of the data written on specimens is critical for general discoverability and enables incorporation into large aggregated research datasets. Different methods, such as crowdsourcing and artificial intelligence, are being developed to optimise transcription, but herbarium specimens pose difficulties in data extraction for many reasons.
New information: To provide developers of transcription methods with a means of optimisation, we have compiled a benchmark dataset of 1,800 herbarium specimen images with corresponding transcribed data. These images originate from nine different collections and include specimens that reflect the multiple potential obstacles that transcription methods may encounter, such as differences in language, text format (printed or handwritten), specimen age and nomenclatural type status. We are making these specimens available with a Creative Commons Zero licence waiver and with permanent online storage of the data. By doing this, we are minimising the obstacles to the use of these images for transcription training. This benchmark dataset of images may also be used where a defined and documented set of herbarium specimens is needed, such as for the extraction of morphological traits, handwriting recognition and colour analysis of specimens.
Pages: 15