A novel automated label data extraction and data base generation system from herbarium specimen images using OCR and NER

Cited by: 3
Authors
Takano, Atsuko [1]
Cole, Theodor C. H. [2]
Konagai, Hajime [3]
Affiliations
[1] Univ Hyogo, Museum Nat & Human Act, Inst Nat & Environm Sci, 6 Chome, Sanda, Hyogo 6691546, Japan
[2] Free Univ Berlin, Inst Biol, Dahlem Ctr Plant Sci, Altensteinstr 6, D-14195 Berlin, Germany
[3] Funct Tales, Shimogamo Honmachi 19-1-101,Sakyo Ku, Kyoto 6060862, Japan
Funding
Japan Society for the Promotion of Science;
Keywords
SCALE DIGITIZATION; RECOGNITION;
DOI
10.1038/s41598-023-50179-0
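Chinese Library Classification (CLC) codes
O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences];
Discipline classification codes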
07 ; 0710 ; 09 ;
Abstract
Digital extraction of label data from natural history specimens along with more efficient procedures of data entry and processing is essential for improving documentation and global information availability. Herbaria have made great advances in this direction lately. In this study, using optical character recognition (OCR) and named entity recognition (NER) techniques, we have been able to make further advancements towards fully automatic extraction of label data from herbarium specimen images. This system can be developed and run on a consumer grade desktop computer with standard specifications, and can also be applied to extracting label data from diverse kinds of natural history specimens, such as those in entomological collections. This system can facilitate the digitization and publication of natural history museum specimens around the world.
Pages: 8