Human-Machine Information Extraction Simulator for Biological Collections

Cited by: 0
Authors
Alzuru, Icaro [1]
Malladi, Aditi [1]
Matsunaga, Andrea [2]
Tsugawa, Mauricio [2]
Fortes, Jose A. B. [2]
Affiliations
[1] Univ Florida, CISE Dept, Gainesville, FL 32611 USA
[2] Univ Florida, ACIS Lab, Gainesville, FL USA
Source
2019 IEEE INTERNATIONAL CONFERENCE ON BIG DATA (BIG DATA) | 2019
Funding
U.S. National Science Foundation
Keywords
Information extraction; simulator; human-machine; human-in-the-loop; crowdsourcing; optical character recognition; natural language processing
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
In the last decade, institutions around the world have implemented initiatives to digitize biological collections (biocollections) and share their information online. The transcription of metadata from photographs of specimen labels is performed through human-centered approaches (e.g., crowdsourcing) because fully automated Information Extraction (IE) methods still generate a significant number of errors. The integration of human and machine tasks has been proposed to accelerate IE from the billions of specimens waiting to be digitized. Nevertheless, to conduct research and try new techniques, IE practitioners need to prepare sets of images and crowdsourcing experiments, recruit volunteers, process the transcriptions, generate ground truth values, program automated methods, etc. These research resources and processes require time and effort to be developed and architected into a functional system. In this paper, we present a simulator intended to accelerate experimentation with workflows for extracting Darwin Core (DC) terms from images of specimens. The so-called HuMaIN Simulator includes the engine, the human-machine IE workflows for three DC terms, the code of the automated IE methods, crowdsourced and ground truth transcriptions of the DC terms of three biocollections, and several experiments that exemplify its potential use. The simulator adds human-in-the-loop capabilities for iterative IE and research on optimal methods. Its practical design permits the quick definition, customization, and implementation of experimental IE scenarios.
Pages: 4565-4572
Page count: 8