Prioritization of Domain-Specific Web Information Extraction

Cited by: 0
Authors
Huang, Jian [1]
Yu, Cong [2]
Affiliations
[1] Pennsylvania State University, Information Sciences and Technology, University Park, PA 16802, USA
[2] Yahoo! Research, New York, NY, USA
Source
PROCEEDINGS OF THE TWENTY-FOURTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE (AAAI-10) | 2010
Keywords
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
It is often desirable to extract structured information from raw web pages for better information browsing, query answering, and pattern mining. Many such Information Extraction (IE) technologies are costly, and applying them at web scale is impractical. In this paper, we propose a novel prioritization approach in which candidate pages from the corpus are ordered according to their expected contribution to the extraction results, and those with higher estimated potential are extracted earlier. Systems employing this approach can stop the extraction process at any time when resources become scarce (i.e., when not all pages in the corpus can be processed), without worrying about wasting extraction effort on unimportant pages. More specifically, we define a novel notion to measure the value of extraction results and design various mechanisms for estimating a candidate page's contribution to this value. We further design and build the EXTRACTION PRIORITIZATION (EP) system with efficient scoring and scheduling algorithms, and experimentally demonstrate that EP significantly outperforms the naive approach and is more flexible than the classifier approach.
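The general idea described in the abstract (order candidate pages by estimated contribution, then extract in that order until resources run out) can be illustrated with a minimal sketch. This is not the EP system's scoring or scheduling algorithm, which this record does not specify. The function `prioritized_extract`, its parameters `estimate_value`, `extract`, and `budget`, and the keyword-count score are all hypothetical names chosen for illustration, and the sketch assumes scores are computed once up front and that each extraction costs one unit of budget.

```python
import heapq
from typing import Callable, List, Sequence, TypeVar

Page = TypeVar("Page")
Result = TypeVar("Result")


def prioritized_extract(
    pages: Sequence[Page],
    estimate_value: Callable[[Page], float],  # hypothetical scoring function
    extract: Callable[[Page], Result],        # hypothetical (costly) extractor
    budget: int,                              # assumed: one unit per extracted page
) -> List[Result]:
    """Extract from the highest-scoring pages first; stop when the budget is spent."""
    # heapq is a min-heap, so scores are negated to pop the largest first.
    # The index breaks ties so that pages themselves are never compared.
    heap = [(-estimate_value(page), index, page) for index, page in enumerate(pages)]
    heapq.heapify(heap)

    results: List[Result] = []
    while heap and budget > 0:
        _, _, page = heapq.heappop(heap)
        results.append(extract(page))
        budget -= 1
    return results


# Toy usage: score a page by how often a domain keyword appears (an assumption),
# and "extract" by returning the page text unchanged.
pages = ["about us", "price: 10 price: 12", "price: 7"]
processed = prioritized_extract(
    pages,
    estimate_value=lambda page: page.count("price"),
    extract=lambda page: page,
    budget=2,
)
print(processed)
```

Because the loop can stop after any iteration, whatever has been extracted so far always comes from the pages with the highest estimated scores, which is the property the abstract attributes to prioritized extraction.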
Pages: 1327-1333
Page count: 7