Natural Language Processing using Kepler Workflow System: First Steps

Cited by: 1
Authors
Goyal, Ankit [1]
Singh, Alok [2]
Bhargava, Shitij [1]
Crawl, Daniel [2]
Altintas, Ilkay [2]
Hsu, Chun-Nan [3]
Affiliations
[1] Univ Calif San Diego, Dept Comp Sci & Engn, San Diego, CA 92103 USA
[2] Univ Calif San Diego, San Diego Supercomp Ctr, San Diego, CA 92103 USA
[3] Univ Calif San Diego, Sch Med, Dept Biomed Informat, San Diego, CA 92103 USA
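Source
INTERNATIONAL CONFERENCE ON COMPUTATIONAL SCIENCE 2016 (ICCS 2016) | 2016 / Vol. 80
Funding
National Institutes of Health (USA); National Science Foundation (USA)
Keywords
Kepler; Scalable Document Conversion; Natural Language Processing
DOI
10.1016/j.procs.2016.05.358
Chinese Library Classification number
TP301 [Theory, methods]
Discipline classification code
081202
Abstract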
The scientific community across many disciplines is exploring new ways to extract knowledge from all available sources. Historically, written manuscripts have been the medium of choice for recording experimental findings. Many disciplines, such as social science and medical science, are exploring ways to automate knowledge discovery from the vast repository of published scientific work. This work attempts to accelerate information extraction by extending Kepler, a graphical workflow management tool. Kepler provides a simple way of designing and executing complex workflows in the form of directed graphs. This work presents a scalable approach for converting published research in PDF form into indexable XML documents using Kepler. This conversion is a critical step in the natural language processing pipeline. Kepler's distributed data processing capability enables scientists to scale this computation by simply adding more computing resources in the cloud.
Pages: 712-721
Page count: 10