Prioritization of Domain-Specific Web Information Extraction

Authors
Huang, Jian [1]
Yu, Cong [2]
Affiliations
[1] Pennsylvania State University, Information Sciences and Technology, University Park, PA 16802, USA
[2] Yahoo Research, New York, NY, USA
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
It is often desirable to extract structured information from raw web pages for better information browsing, query answering, and pattern mining. Many such Information Extraction (IE) technologies are costly, and applying them at web scale is impractical. In this paper, we propose a novel prioritization approach in which candidate pages from the corpus are ordered according to their expected contribution to the extraction results, and those with higher estimated potential are extracted earlier. Systems employing this approach can stop the extraction process at any time when resources become scarce (i.e., when not all pages in the corpus can be processed), without worrying about wasting extraction effort on unimportant pages. More specifically, we define a novel notion to measure the value of extraction results and design various mechanisms for estimating a candidate page's contribution to this value. We further design and build the EXTRACTION PRIORITIZATION (EP) system with efficient scoring and scheduling algorithms, and experimentally demonstrate that EP significantly outperforms the naive approach and is more flexible than the classifier approach.
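The prioritization idea in the abstract can be illustrated with a minimal sketch: score each candidate page, process pages in descending score order, and stop when a budget runs out. This is not the EP system's actual scoring or scheduling algorithm, which the abstract does not specify. The names `prioritized_extraction`, `estimate_value`, `extract`, and `budget`, as well as the toy corpus and keyword-count scorer, are all assumptions made for illustration.

```python
import heapq
from typing import Callable, Iterable, List, Tuple


def prioritized_extraction(
    pages: Iterable[str],
    estimate_value: Callable[[str], float],
    extract: Callable[[str], List[str]],
    budget: int,
) -> Tuple[List[str], List[str]]:
    """Process pages in descending order of estimated value until
    the budget (maximum number of pages to extract) is exhausted."""
    # heapq is a min-heap, so scores are negated to pop the highest first.
    # The index i breaks ties in favor of the page seen earlier.
    heap = [(-estimate_value(p), i, p) for i, p in enumerate(pages)]
    heapq.heapify(heap)

    processed: List[str] = []
    results: List[str] = []
    while heap and len(processed) < budget:
        _, _, page = heapq.heappop(heap)
        processed.append(page)
        results.extend(extract(page))
    return processed, results


# Hypothetical toy corpus: page identifier -> page text.
corpus = {
    "a.html": "restaurant menu price address",
    "b.html": "login form",
    "c.html": "restaurant address",
}
TARGET_TERMS = {"restaurant", "address"}


def toy_score(url: str) -> float:
    # Assumed stand-in for a contribution estimate: count of target terms.
    return float(sum(word in TARGET_TERMS for word in corpus[url].split()))


def toy_extract(url: str) -> List[str]:
    # Assumed stand-in for a costly IE operator.
    return [word for word in corpus[url].split() if word in TARGET_TERMS]


processed, results = prioritized_extraction(corpus, toy_score, toy_extract, budget=2)
print(processed)
print(results)
```

Because the loop can halt after any page, the work already done was spent on the pages with the highest estimated value, which is the anytime property the abstract describes. A real system would likely update estimates as results accumulate; this sketch scores each page once, up front.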
Pages: 1327-1333
Number of pages: 7