Effective Measures for Inter-Document Similarity

Cited by: 14
Authors
Whissell, John S. [1 ]
Clarke, Charles L. A. [1 ]
Affiliation
[1] Univ Waterloo, David R Cheriton Sch Comp Sci, Waterloo, ON N2L 3G1, Canada
Source
PROCEEDINGS OF THE 22ND ACM INTERNATIONAL CONFERENCE ON INFORMATION & KNOWLEDGE MANAGEMENT (CIKM'13) | 2013
Keywords
Clustering; Similarity Measures; MODELS;
DOI
10.1145/2505515.2505526
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
While supervised learning-to-rank algorithms have largely supplanted unsupervised query-document similarity measures for search, the exploration of query-document measures by many researchers over many years produced insights that might be exploited in other domains. For example, the BM25 measure substantially and consistently outperforms cosine across many tested environments, and potentially provides retrieval effectiveness approaching that of the best learning-to-rank methods over equivalent feature sets. Other measures based on language modeling and divergence from randomness can outperform BM25 in some circumstances. Despite this evidence, cosine remains the prevalent method for determining inter-document similarity for clustering and other applications. However, recent research demonstrates that BM25 term weights can significantly improve clustering. In this work, we extend that result, presenting and evaluating novel inter-document similarity measures based on BM25, language modeling, and divergence from randomness. In our first experiment we analyze the accuracy of nearest neighborhoods when using our measures. In our second experiment, we analyze using clustering algorithms in conjunction with our measures. Our novel symmetric BM25 and language modeling similarity measures outperform alternative measures in both experiments. This outcome strongly recommends the adoption of these measures, replacing cosine similarity in future work.
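The idea of using BM25 term weights for inter-document similarity can be illustrated with a minimal sketch. This is not the symmetric BM25 measure proposed in the paper, which the abstract does not define. It only shows one plausible baseline: weight each document's terms with BM25 and take the cosine of the weighted vectors. The function names `bm25_weights` and `cosine`, the parameter values `k1=1.2` and `b=0.75`, and the non-negative IDF variant are all assumptions made for illustration.

```python
import math
from collections import Counter


def bm25_weights(docs, k1=1.2, b=0.75):
    """Return one {term: BM25 weight} dict per tokenized document.

    Assumption: k1 and b are common default values, not taken from the paper.
    """
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    # Document frequency: number of documents containing each term.
    df = Counter(t for d in docs for t in set(d))
    weighted = []
    for d in docs:
        tf = Counter(d)
        # Length normalization component of the BM25 saturation term.
        norm = k1 * (1 - b + b * len(d) / avgdl)
        weighted.append({
            # Non-negative IDF variant (assumed), times saturated term frequency.
            t: math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5))
               * f * (k1 + 1) / (f + norm)
            for t, f in tf.items()
        })
    return weighted


def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


docs = [
    ["bm25", "ranking", "search"],
    ["bm25", "ranking", "retrieval"],
    ["cooking", "pasta", "recipe"],
]
w = bm25_weights(docs)
sim_related = cosine(w[0], w[1])    # documents sharing two terms
sim_unrelated = cosine(w[0], w[2])  # documents sharing no terms
```

Because cosine is symmetric in its arguments, this baseline is symmetric by construction; whether it matches the behavior of the measures evaluated in the paper is not something the abstract allows one to conclude.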
Pages: 1361-1370
Page count: 10