Learning to Discover Domain-Specific Web Content

Cited by: 4
Authors
Pham, Kien [1]
Santos, Aecio [1]
Freire, Juliana [1]
Affiliation
[1] NYU, New York, NY 10003 USA
Source
WSDM'18: PROCEEDINGS OF THE ELEVENTH ACM INTERNATIONAL CONFERENCE ON WEB SEARCH AND DATA MINING | 2018
DOI
10.1145/3159652.3159724
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
The ability to discover all content relevant to an information domain has many applications, from helping in the understanding of humanitarian crises to countering human and arms trafficking. In such applications, time is of the essence: it is crucial to both maximize coverage and identify new content as soon as it becomes available, so that appropriate actions can be taken. In this paper, we propose new methods for efficient domain-specific re-crawling that maximize the yield for new content. By learning patterns of pages that have a high yield, our methods select a small set of pages that can be re-crawled frequently, increasing the coverage and freshness while conserving resources. Unlike previous approaches to this problem, our methods combine different factors to optimize the re-crawling strategy, do not require full snapshots for the learning step, and dynamically adapt the strategy as the crawl progresses. In an empirical evaluation, we have simulated the framework over 600 partial crawl snapshots in three different domains. The results show that our approach can achieve 150% higher coverage compared to existing, state-of-the-art techniques. In addition, it is also able to capture 80% of new relevant content within less than 4 hours of publication.
Pages: 432-440
Number of pages: 9