Near-Duplicate Web Page Detection by Enhanced TDW and simHash Technique

Cited by: 0
Authors
Arun, P. R. [1]
Sumesh, M. S. [1]
Affiliation
[1] Adi Shankara Inst Engn & Technol, Comp Sci & Engn, Kalady, India
Source
2015 INTERNATIONAL CONFERENCE ON COMPUTING AND NETWORK COMMUNICATIONS (COCONET) | 2015
Keywords
Near Duplicate Detection; SimHash; Enhanced TDW List; Word mapping
DOI
Not available
Chinese Library Classification number
TP301 [Theory, Methods]
Subject classification code
081202
Abstract
The Internet has driven explosive growth in communication and information retrieval. This massive growth of the web has led to millions of web pages being hosted on heterogeneous platforms. Because there is no standard mechanism to verify that a page does not already exist before it is hosted on a server, the number of near-duplicate pages on the Internet keeps increasing. Such near-duplicate content can arise either intentionally or accidentally. Finding near-duplicate web pages has been a subject of research in the database and web-search communities for several years. Since most successful content mining strategies adopt term-based approaches, they all suffer from the problems of word synonymy and a large number of comparisons. In this paper we propose a method for detecting duplicate and near-duplicate web pages using an extended term document weighting (TDW) scheme, sentence-level features, and the simHash technique. The existence of duplicate and near-duplicate web pages causes problems ranging from wasted network bandwidth and increased storage cost to degraded search engine performance through indexing of duplicated content and increased load on remote hosts.
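
The abstract names simHash as the fingerprinting step but gives no details, so the following is only a minimal sketch of the general, well-known SimHash idea, not the authors' implementation. Plain term frequency stands in for the enhanced TDW weights, which the abstract does not specify; the function names, the MD5-based term hash, the 64-bit width, and the threshold of 3 differing bits are all assumptions made for illustration.

```python
import hashlib
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def simhash(text, bits=64):
    """Return a `bits`-wide SimHash fingerprint of `text`.

    Assumption: term frequency is used as the term weight. `bits` must be
    a multiple of 8 and at most 128, because the term hash is cut from an
    MD5 digest.
    """
    weights = Counter(tokenize(text))
    totals = [0] * bits
    for term, weight in weights.items():
        digest = hashlib.md5(term.encode("utf-8")).digest()
        term_hash = int.from_bytes(digest[: bits // 8], "big")
        for i in range(bits):
            # A set bit votes +weight for position i, a clear bit -weight.
            if (term_hash >> i) & 1:
                totals[i] += weight
            else:
                totals[i] -= weight
    fingerprint = 0
    for i, total in enumerate(totals):
        if total > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Count the bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def is_near_duplicate(text_a, text_b, threshold=3, bits=64):
    """Treat two texts as near duplicates if their fingerprints differ
    in at most `threshold` bits (the threshold is an assumed value)."""
    distance = hamming_distance(simhash(text_a, bits), simhash(text_b, bits))
    return distance <= threshold
```

Comparing fixed-width fingerprints by Hamming distance replaces term-by-term comparison of whole pages, which is the kind of comparison cost the abstract raises as a problem for term-based methods.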
Pages: 765 - 770
Page count: 6