Effective and efficient classification on a search-engine model

Authors
Aris Anagnostopoulos
Andrei Broder
Kunal Punera
Affiliations
[1] Yahoo! Research
[2] Department of Electrical and Computer Engineering, University of Texas at Austin
Source
Knowledge and Information Systems | 2008 / Vol. 16, No. 2
Keywords
Text classification; Search engine; Feature selection; Query efficiency; WAND; Term correlations
DOI
Not available
Abstract
Traditional document classification frameworks, which apply the learned classifier to each document in a corpus one by one, are infeasible for extremely large document corpora, like the Web or large corporate intranets. We consider the classification problem on a corpus that has been processed primarily for the purpose of searching, and thus our access to documents is solely through the inverted index of a large-scale search engine. Our main goal is to build the “best” short query that characterizes a document class using operators normally available within search engines. We show that surprisingly good classification accuracy can be achieved on average over multiple classes by queries with as few as 10 terms. As part of our study, we enhance some of the feature-selection techniques that are found in the literature by forcing the inclusion of terms that are negatively correlated with the target class and by making use of term correlations; we show that both of these techniques can offer significant advantages. Moreover, we show that optimizing the efficiency of query execution by careful selection of terms can further reduce the query costs. More precisely, we show that in our setup the best 10-term query can achieve 93% of the accuracy of the best SVM classifier (14,000 terms), and if we are willing to tolerate a reduction to 89% of the best SVM, we can build a 10-term query that can be executed more than twice as fast as the best 10-term query.
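The idea of a short weighted query acting as a classifier can be illustrated with a minimal sketch. This is not the authors' method: the smoothed log-odds weighting, the function names `select_query_terms` and `matches`, the `force_negative` parameter, and the toy corpus are all assumptions made for illustration. The scoring step loosely mimics a weighted-threshold operator of the kind the WAND keyword refers to, where a document matches if the summed weights of the query terms it contains exceed a threshold.

```python
import math
from collections import Counter


def select_query_terms(docs, labels, k=10, force_negative=2):
    """Pick up to k weighted terms, forcing some negatively correlated ones in.

    Hypothetical helper: weights are smoothed log-odds of a term's document
    frequency inside vs. outside the target class (an assumed stand-in for
    the feature-selection measures discussed in the abstract).
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    df_pos, df_neg = Counter(), Counter()
    for tokens, y in zip(docs, labels):
        (df_pos if y else df_neg).update(set(tokens))

    weights = {}
    for term in set(df_pos) | set(df_neg):
        p = (df_pos[term] + 1) / (n_pos + 2)  # smoothed P(term | class)
        q = (df_neg[term] + 1) / (n_neg + 2)  # smoothed P(term | not class)
        weights[term] = math.log(p / q)

    # Sort ascending by weight; ties are broken by the term itself.
    ranked = sorted(weights, key=lambda t: (weights[t], t))
    negatives = ranked[:force_negative]
    positives = ranked[::-1][: k - force_negative]
    return {t: weights[t] for t in positives + negatives}


def matches(query, tokens, threshold=0.0):
    """Return True if the summed weights of present query terms exceed threshold."""
    present = set(tokens)
    return sum(w for t, w in query.items() if t in present) > threshold


# Toy corpus (assumed data): two in-class and two out-of-class documents.
docs = [
    ["cheap", "flights", "hotel"],
    ["flights", "deals", "hotel"],
    ["python", "code", "bug"],
    ["code", "review", "bug"],
]
labels = [1, 1, 0, 0]
query = select_query_terms(docs, labels, k=4, force_negative=2)
```

In this sketch the negative-weight terms let the query reject documents that share a few positive terms with the class but are dominated by off-topic vocabulary, which is the intuition behind forcing negatively correlated terms into a short query.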
Pages: 129-154
Page count: 25