Improving automatic query classification via semi-supervised learning

Cited by: 31
Authors
Beitzel, SM
Jensen, EC
Frieder, O
Lewis, DD
Chowdhury, A
Kolcz, A
Source
Fifth IEEE International Conference on Data Mining, Proceedings | 2005
DOI
10.1109/ICDM.2005.80
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Accurate topical classification of user queries allows for increased effectiveness and efficiency in general-purpose web search systems. Such classification becomes critical if the system is to return results not just from a general web collection but from topic-specific back-end databases as well. Maintaining sufficient classification recall is very difficult as web queries are typically short, yielding few features per query. This feature sparseness, coupled with the high query volumes typical for a large-scale search service, makes manual and supervised learning approaches alone insufficient. We use an application of computational linguistics to develop an approach for mining the vast amount of unlabeled data in web query logs to improve automatic topical web query classification. We show that our approach, in combination with manual matching and supervised learning, allows us to classify a substantially larger proportion of queries than any single technique. We examine the performance of each approach on a real web query stream and show that our combined method accurately classifies 46% of queries, outperforming the recall of the best single approach by nearly 20%, with a 7% improvement in overall effectiveness.
Pages: 42-49
Page count: 8