Web scraping technologies in an API world

Cited by: 81
Authors
Glez-Pena, Daniel [1]
Lourenco, Analia [1, 2]
Lopez-Fernandez, Hugo [1]
Reboiro-Jato, Miguel [1]
Fdez-Riverola, Florentino [3]
Affiliations
[1] Univ Vigo, Dept Comp Sci, Vigo, Spain
[2] Univ Minho, Ctr Biol Engn, P-4719 Braga, Portugal
[3] Univ Vigo, Next Generat Comp Syst Grp, Vigo, Spain
Keywords
Web scraping; data integration; interoperability; database interfaces; SET ENRICHMENT ANALYSIS; RESOURCE; DATABASE; BIOINFORMATICS; INFORMATION; INTEGRATION; SERVICES; COLLECTION; DISEASE; NATION
DOI
10.1093/bib/bbt026
Chinese Library Classification (CLC) number
Q5 [Biochemistry]
Discipline classification code
071010; 081704
Abstract
Web services are the de facto standard in biomedical data integration. However, there are data integration scenarios that cannot be fully covered by Web services. A number of Web databases and tools do not support Web services, and existing Web services do not cover all possible user data demands. As a consequence, Web data scraping, one of the oldest techniques for extracting Web contents, is still in a position to offer a valid and valuable service to a wide range of bioinformatics applications, ranging from simple extraction robots to online meta-servers. This article reviews existing scraping frameworks and tools, identifying their strengths and limitations in terms of extraction capabilities. The main focus is on showing how straightforward it is today to set up a data scraping pipeline, with minimal programming effort, and answer a number of practical needs. For exemplification purposes, we introduce a biomedical data extraction scenario where the desired data sources, well known in clinical microbiology and similar domains, do not yet offer programmatic interfaces. Moreover, we describe the operation of WhichGenes and PathJam, two bioinformatics meta-servers that use scraping as a means to cope with gene set enrichment analysis.
Pages: 788-797
Page count: 10