Near-Duplicate Web Page Detection: An Efficient Approach Using Clustering, Sentence Feature and Fingerprinting

Cited by: 6
Authors
Kumar, J. Prasanna [1]
Govindarajulu, P. [2]
Affiliations
[1] Osmania Univ, MVSR Engn Coll, Hyderabad 500007, Andhra Pradesh, India
[2] Sri Venkateswara Univ, SVC Coll CMIS, Dept CS, Tirupati, Andhra Pradesh, India
Keywords
Web crawling; Web page; Duplicate web page; Near-duplicate web page; Near-duplicate detection; Fingerprinting
DOI
10.1080/18756891.2013.752657
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Duplicate and near-duplicate web pages are a chief concern for web search engines: they require enormous space to store in the indexes, which ultimately slows down the serving of results and increases its cost. A variety of techniques have been developed to identify pairs of web pages that are "similar" to each other, and the problem of finding near-duplicate web pages has been a subject of research in the database and web-search communities for some years. To identify near-duplicate web pages, we use sentence-level features together with a fingerprinting method. When a large number of web documents must be examined, we first apply K-mode clustering and then compare sentence features and fingerprints. With these steps, we identify near-duplicate web pages accurately and efficiently. Experiments on web page collections confirm the efficiency of the proposed approach in detecting near-duplicate web pages.
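The abstract does not spell out how sentence features and fingerprints are computed, so the following is only a minimal sketch of one plausible reading of the comparison stage, not the authors' method. It assumes each normalized sentence is hashed to a 64-bit fingerprint (MD5 truncated to 8 bytes) and that two pages are compared by the Jaccard overlap of their fingerprint sets. The function names, the hash choice, and the 0.5 threshold are all hypothetical, and the K-mode clustering stage that the abstract places first is omitted.

```python
import hashlib
import re


def sentence_fingerprints(text):
    """Return a set of 64-bit fingerprints, one per normalized sentence.

    Assumption: sentences are split on ., ! and ? and fingerprinted with
    truncated MD5; the paper's actual features may differ.
    """
    prints = set()
    for sentence in re.split(r"[.!?]+", text):
        # Normalize: lowercase and collapse non-alphanumeric runs to a space.
        normalized = re.sub(r"[^a-z0-9]+", " ", sentence.lower()).strip()
        if normalized:
            digest = hashlib.md5(normalized.encode("utf-8")).digest()
            prints.add(int.from_bytes(digest[:8], "big"))
    return prints


def similarity(text_a, text_b):
    """Jaccard overlap of the two pages' sentence-fingerprint sets."""
    a = sentence_fingerprints(text_a)
    b = sentence_fingerprints(text_b)
    if not a and not b:
        return 1.0  # two empty pages are treated as identical
    return len(a & b) / len(a | b)


def is_near_duplicate(text_a, text_b, threshold=0.5):
    """Hypothetical decision rule: overlap at or above the threshold."""
    return similarity(text_a, text_b) >= threshold


page_a = ("Web crawlers fetch pages. Duplicate pages waste index space. "
          "Fingerprints help detect them.")
page_b = ("Web crawlers fetch pages. Duplicate pages waste index space! "
          "Copyright 2013.")
print(similarity(page_a, page_b))         # two of four distinct sentences shared
print(is_near_duplicate(page_a, page_b))
```

In a pipeline like the one the abstract describes, a comparison of this kind would run only between pages that fall in the same cluster, which keeps the number of pairwise comparisons manageable for large collections.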
Pages: 1-13
Page count: 13
Related papers
50 in total
  • [1] Near-Duplicate Web Page Detection: An Efficient Approach Using Clustering, Sentence Feature and Fingerprinting
    J. Prasanna Kumar
    P. Govindarajulu
    International Journal of Computational Intelligence Systems, 2013, 6 : 1 - 13
  • [2] An Efficient Approach to Web Near-Duplicate Image Detection
    Li, Jun
    Thou, Shan
    Xing, Liang
    Sun, Changyin
    Hu, Weiming
    2013 SECOND IAPR ASIAN CONFERENCE ON PATTERN RECOGNITION (ACPR 2013), 2013, : 186 - 190
  • [3] Near-Duplicate Web Page Detection by Enhanced TDW and simHash Technique
    Arun, P. R.
    Sumesh, M. S.
    2015 INTERNATIONAL CONFERENCE ON COMPUTING AND NETWORK COMMUNICATIONS (COCONET), 2015, : 765 - 770
  • [4] A Novel and Efficient Approach For Near Duplicate Page Detection in Web Crawling
    Narayana, V. A.
    Premchand, P.
    Govardhan, A.
    2009 IEEE INTERNATIONAL ADVANCE COMPUTING CONFERENCE, VOLS 1-3, 2009, : 1492 - +
  • [5] Efficient Near-duplicate Image Detection with Created Feature Subset
    Yildiz, Burak
    Demirci, M. Fatih
    2016 24TH SIGNAL PROCESSING AND COMMUNICATION APPLICATION CONFERENCE (SIU), 2016, : 1901 - 1904
  • [6] Near Duplicate Web Page Detection With Analytic Feature Weighting
    Naseem, Rasia
    Anees, Sheena
    Muneer, K.
    Farook, Syed K.
    2013 THIRD INTERNATIONAL CONFERENCE ON ADVANCES IN COMPUTING AND COMMUNICATIONS (ICACC 2013), 2013, : 324 - 327
  • [7] Near-Duplicate Web Page Detection: A Comparative Study of Two Contrary Approaches
    Narayana, V. A.
    Govardhan, A.
    Premchand, P.
    2011 6TH INTERNATIONAL CONFERENCE ON COMPUTER SCIENCES AND CONVERGENCE INFORMATION TECHNOLOGY (ICCIT), 2012, : 769 - 776
  • [8] Efficient Near-Duplicate Document Detection using FPGAs
    Luo, Xi
    Najjar, Walid
    Hristidis, Vagelis
    2013 IEEE INTERNATIONAL CONFERENCE ON BIG DATA, 2013,
  • [9] On the Annotation of Web Videos by Efficient Near-Duplicate Search
    Zhao, Wan-Lei
    Wu, Xiao
    Ngo, Chong-Wah
    IEEE TRANSACTIONS ON MULTIMEDIA, 2010, 12 (05) : 448 - 461
  • [10] An Efficient Method for Near-Duplicate Video Detection
    Tahayna, Bashar
    Belkhatir, Mohammed
    ADVANCES IN MULTIMEDIA INFORMATION PROCESSING - PCM 2008, 9TH PACIFIC RIM CONFERENCE ON MULTIMEDIA, 2008, 5353 : 377 - 386