Near Duplicate Web Page Detection With Analytic Feature Weighting

Cited by: 2
Authors
Naseem, Rasia [1 ]
Anees, Sheena [1 ]
Muneer, K. [2 ]
Farook, Syed K. [2 ]
Affiliations
[1] KMEA Engn Coll, Dept Comp Sci & Engn, Aluva, Kerala, India
[2] MES Coll Engn, Dept Comp Sci & Engn, Kuttippuram, Kerala, India
Source
2013 THIRD INTERNATIONAL CONFERENCE ON ADVANCES IN COMPUTING AND COMMUNICATIONS (ICACC 2013) | 2013
Keywords
Near Duplicate Detection; Term Document Weight Matrix; Analytic Combination Criteria; Prefix filtering; Positional filtering; Minimum Weight Overlapping; Web page classification;
DOI
10.1109/ICACC.2013.69
Chinese Library Classification
TP [Automation Technology, Computer Technology];
Subject Classification Code
0812;
Abstract
Near-duplicate web pages are web pages that differ only slightly in content. They arise from exact replicas of an original site, mirrored sites, versioned sites, multiple representations of the same physical object, and plagiarized documents. Identifying similar or near-duplicate pages in a large collection is a significant problem with widespread applications. Here we propose a Term Document Weight (TDW) matrix based algorithm that finds near duplicates of an input web page in a huge repository in four phases: Pre-processing, Feature weighting, Filtering, and Verification. In the first phase, the system receives an input web page and a similarity threshold and performs pre-processing operations on the page. In the second phase, feature weights are calculated using Analytic Combination Criteria (ACC). In the third phase, Prefix and Positional filtering are performed to reduce the number of candidate records. In the Verification phase, the system computes the similarity of the remaining candidates using the Minimum Weight Overlapping (MWO) method and returns an optimal set of near-duplicate web pages.
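The filter-then-verify structure described in the abstract can be illustrated with a minimal sketch. This is not the authors' algorithm: the abstract does not define ACC weighting or the MWO similarity, so the sketch substitutes plain token sets and Jaccard similarity, and it omits positional filtering. The function names (`tokenize`, `prefix`, `near_duplicates`) and the rare-token-first ordering are assumptions made for illustration only.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Stand-in for pre-processing: lowercase and keep the set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Unweighted Jaccard similarity (assumed here in place of the paper's MWO)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def prefix(tokens, order, threshold):
    """First |x| - ceil(t * |x|) + 1 tokens of a record under a global ordering.

    Two records whose Jaccard similarity is at least t must share a token in
    these prefixes. The small epsilon guards against floating-point error and
    can only make the prefix longer, which keeps the filter safe.
    """
    ranked = sorted(tokens, key=lambda tok: order[tok])
    k = len(ranked) - math.ceil(threshold * len(ranked) - 1e-9) + 1
    return set(ranked[:k])


def near_duplicates(query, repository, threshold):
    """Return (name, similarity) pairs for repository pages similar to the query."""
    docs = {name: tokenize(text) for name, text in repository.items()}
    q = tokenize(query)

    # Global ordering: rare tokens first (document frequency, ties broken by token).
    df = Counter(tok for toks in list(docs.values()) + [q] for tok in toks)
    order = {tok: (df[tok], tok) for tok in df}

    q_prefix = prefix(q, order, threshold)
    results = []
    for name, toks in docs.items():
        # Filtering phase: skip records whose prefix shares nothing with the query's.
        if not (q_prefix & prefix(toks, order, threshold)):
            continue
        # Verification phase: compute the actual similarity for surviving candidates.
        sim = jaccard(q, toks)
        if sim >= threshold:
            results.append((name, sim))
    return sorted(results, key=lambda pair: -pair[1])
```

Calling `near_duplicates(page_text, {"name": "text", ...}, 0.7)` returns the matching page names with their similarities, highest first. Only candidates that pass the prefix test are verified, which is the saving that filtering provides on a large repository.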
Pages: 324-327
Page count: 4
Related papers
9 in total
  • [1] Broder A, 1997, 6 INT WORLD WID WEB, P393
  • [2] Das S. N., 2012, Proceedings of the 2012 Ninth International Conference on Information Technology: New Generations (ITNG), P121, DOI 10.1109/ITNG.2012.168
  • [3] An analytical approach to concept extraction in HTML environments
    Fresno, V
    Ribeiro, A
    [J]. JOURNAL OF INTELLIGENT INFORMATION SYSTEMS, 2004, 22 (03) : 215 - 235
  • [4] Govardhan A., 2009, International Journal of Computational Intelligence Research, V5, P83
  • [5] Manku G. S., 2007, Proceedings of the 16th international conference on World Wide Web-WWW'07, P141
  • [6] Mathew M., 2011, INT J COMPUT APPL, V19, P16
  • [7] Pant G, 2004, WEB DYNAMICS: ADAPTING TO CHANGE IN CONTENT, SIZE TOPOLOG AND USE, P153
  • [8] AN ALGORITHM FOR SUFFIX STRIPPING
    PORTER, MF
    [J]. PROGRAM-AUTOMATED LIBRARY AND INFORMATION SYSTEMS, 1980, 14 (03) : 130 - 137
  • [9] Xiao C., 2008, WWW, P131