DISTRIBUTED WEB-SCALE INFRASTRUCTURE FOR CRAWLING, INDEXING AND SEARCH WITH SEMANTIC SUPPORT

Cited by: 1
Authors
Dlugolinsky, Stefan [1]
Seleng, Martin [1]
Laclavik, Michal [1]
Hluchy, Ladislav [1]
Affiliations
[1] Slovak Acad Sci, Inst Informat, Bratislava, Slovakia
Source
COMPUTER SCIENCE-AGH | 2012 / Vol. 13 / Issue 04
Keywords
distributed web crawling; information extraction; information retrieval; semantic search; geocoding; spatial search;
DOI
10.7494/csci.2012.13.4.5
Chinese Library Classification (CLC) number
TP301 [Theory and methods];
Discipline classification code
081202;
Abstract
In this paper, we describe our work in progress on web-scale information extraction and information retrieval using distributed computing. We present a distributed architecture, built on top of the MapReduce paradigm, for information retrieval, information processing, and intelligent search supported by spatial capabilities. The proposed architecture focuses on crawling documents in several different formats, extracting information, lightweight semantic annotation of the extracted information, indexing that information, and finally indexing documents by the geo-spatial information found in them. We demonstrate the architecture on two use cases: search in job offers retrieved from the LinkedIn portal and search in BBC news feeds. We also discuss several problems we faced during the implementation, as well as spatial search applications for both cases, because both LinkedIn job offer pages and BBC news feeds contain a lot of spatial information to extract and process.
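The abstract does not give implementation details, so the following is only a minimal single-process sketch of how geo-spatial document indexing could be phrased in MapReduce terms. It is not the authors' code. The gazetteer, the function names (`map_document`, `reduce_postings`, `build_geo_index`), and the sample documents are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical gazetteer (an assumption): place name -> (latitude, longitude).
GAZETTEER = {
    "bratislava": (48.1486, 17.1077),
    "london": (51.5074, -0.1278),
}


def map_document(doc_id, text):
    """Map step: emit (place, doc_id) for each gazetteer place found in the text."""
    tokens = {token.strip(".,;:!?()").lower() for token in text.split()}
    for place in sorted(tokens & set(GAZETTEER)):
        yield place, doc_id


def reduce_postings(place, doc_ids):
    """Reduce step: merge all document ids for one place into a posting list."""
    return place, {"coords": GAZETTEER[place], "docs": sorted(set(doc_ids))}


def build_geo_index(documents):
    """Run map, shuffle (group by key), and reduce in one process."""
    grouped = defaultdict(list)
    for doc_id, text in documents.items():
        for place, emitted_id in map_document(doc_id, text):
            grouped[place].append(emitted_id)
    return dict(reduce_postings(place, ids) for place, ids in grouped.items())


if __name__ == "__main__":
    # Invented sample documents standing in for a job offer and a news item.
    documents = {
        "job-1": "Software engineer wanted in London.",
        "news-1": "Conference held in Bratislava, then London.",
    }
    for place, entry in sorted(build_geo_index(documents).items()):
        print(place, entry)
```

In a real distributed setting the grouping in `build_geo_index` would be done by the framework's shuffle phase, and place recognition would need far more than exact token lookup.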
Pages: 5-19
Number of pages: 15
Related papers
5 items in total
  • [1] Leveraging Knowledge Graphs for Web-Scale Unsupervised Semantic Parsing
    Heck, Larry
    Hakkani-Tur, Dilek
    Tur, Gokhan
    14TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2013), VOLS 1-5, 2013, : 1593 - 1597
  • [2] Web Page Indexing through Page Ranking for Effective Semantic Search
    Sharma, Robin
    Kandpal, Ankita
    Bhakuni, Priyanka
    Chauhan, Rashmi
    Goudar, R. H.
    Tyagi, Asit
    7TH INTERNATIONAL CONFERENCE ON INTELLIGENT SYSTEMS AND CONTROL (ISCO 2013), 2013, : 389 - 392
  • [3] Large Scale Semantic Annotation, Indexing, and Search at The National Archives
    Maynard, Diana
    Greenwood, Mark A.
    LREC 2012 - EIGHTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION, 2012, : 3487 - 3494
  • [4] Scale-Adaptable Recrawl Strategies for DHT-Based Distributed Web Crawling System
    Xu, Xiao
    Zhang, Weizhe
    Zhang, Hongli
    Fang, Binxing
    NETWORK AND PARALLEL COMPUTING, 2010, 6289 : 91 - 105
  • [5] A Novel Model to Support OGC Web Services Semantic Search Using OWL-S
    Miao, Lizhi
    Guo, Jing
    Cheng, Wenchao
    Zhou, Ya
    2016 24TH INTERNATIONAL CONFERENCE ON GEOINFORMATICS (GEOINFORMATICS), 2016,