An Improved Cosine Similarity Algorithm Based on Document Similarity

Authors
Lee, Ming
Zhao, Heji
Source
INTERNATIONAL SYMPOSIUM ON FUZZY SYSTEMS, KNOWLEDGE DISCOVERY AND NATURAL COMPUTATION (FSKDNC 2014) | 2014
Keywords
Cosine similarity; document similarity; sliding window
DOI
Not available
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
With the arrival of big data and the rapid development of education, duplicate documents are increasingly common, wasting both resources and readers' time. Duplicate checking is therefore extremely important. To obtain a better similarity measure, documents are usually transformed into distances, angles, or curvatures. Cosine similarity is a common method for measuring the similarity of documents, but it is insensitive to number and proportion, which causes problems for duplicate checking. In this paper, we put forward a new similarity measure and add a sliding window to cosine similarity to improve its efficiency. Repeated experiments show that the improved method not only overcomes the drawbacks of cosine similarity but is also superior in speed and accuracy.
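The insensitivity mentioned in the abstract can be illustrated with plain cosine similarity over raw term-frequency vectors. The following is a minimal sketch of the standard baseline only, not the improved method or the sliding window proposed in the paper, whose details this record does not give. The function name `cosine_similarity`, the whitespace tokenization, and the sample sentences are assumptions made for illustration.

```python
import math
from collections import Counter


def cosine_similarity(doc_a: str, doc_b: str) -> float:
    """Cosine of the angle between raw term-frequency vectors.

    Hypothetical helper for illustration: tokens are lowercased,
    whitespace-separated words with no stemming or weighting.
    """
    tf_a = Counter(doc_a.lower().split())
    tf_b = Counter(doc_b.lower().split())
    # Dot product over the terms that both documents share.
    dot = sum(tf_a[t] * tf_b[t] for t in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(c * c for c in tf_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document has no direction to compare
    return dot / (norm_a * norm_b)


# Assumed sample data: one sentence, and the same sentence repeated ten times.
short_doc = "data mining finds patterns"
repeated_doc = " ".join([short_doc] * 10)

# Every term count is scaled by the same factor, so the angle is unchanged.
score_scaled = cosine_similarity(short_doc, repeated_doc)

# Documents with no terms in common are orthogonal.
score_disjoint = cosine_similarity(short_doc, "completely unrelated words")
```

In this sketch a document and a tenfold repetition of it receive the same score as two identical documents, because cosine similarity depends only on the direction of the term vectors and not on their magnitude. That is one plausible reading of the "number" insensitivity the abstract refers to.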
Pages: 196-204
Page count: 9