Near-Duplicate Web Page Detection by Enhanced TDW and simHash Technique

Cited by: 0

Authors:

Arun, P. R. [1]

Sumesh, M. S. [1]

Affiliation:

[1] Adi Shankara Inst Engn & Technol, Comp Sci & Engn, Kalady, India

Source:

2015 INTERNATIONAL CONFERENCE ON COMPUTING AND NETWORK COMMUNICATIONS (COCONET) | 2015

Keywords:

Near Duplicate Detection; SimHash; Enhanced TDW List; Word mapping;

DOI:

Not available

Chinese Library Classification (CLC) number:

TP301 [Theory and methods];

Discipline classification code:

081202;

Abstract:

The Internet is one of the most important developments in communication and information retrieval. The massive growth of the web has led to millions of web pages being hosted on heterogeneous platforms. Because there is no standard mechanism to check that a web page does not already exist before it is hosted on a server, the number of near-duplicate pages on the Internet keeps increasing. Such near-duplicate content can arise either intentionally or accidentally. Finding near-duplicate web pages has been a subject of research in the database and web-search communities for a few years. Because most existing content-mining strategies adopt term-based methodologies, they all suffer from the problems of word synonymy and a large number of comparisons. In this paper we propose a method for detecting duplicate and near-duplicate web pages using an extended term document weighting (TDW) scheme, sentence-level features, and the simHash technique. The existence of duplicate and near-duplicate web pages causes problems that range from wasted network bandwidth and storage cost to reduced search engine performance through indexing of duplicated content and increased load on remote hosts.
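To illustrate the simHash idea named in the abstract, here is a minimal Python sketch of a generic weighted simHash fingerprint with a Hamming-distance comparison. It is not the authors' method: the enhanced TDW scheme and the sentence-level features are not described in this record, so raw term frequency is used as a stand-in weight. The function names (tokenize, simhash, hamming_distance), the SHA-256 term hash, and the 64-bit fingerprint width are all assumptions made for illustration.

```python
import hashlib
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def simhash(text: str, bits: int = 64) -> int:
    """Return a `bits`-wide simHash fingerprint of the text.

    Assumption: each term is weighted by its raw frequency. A real
    system could plug in a different weighting scheme here.
    `bits` must be a multiple of 8 and at most 256.
    """
    weights = Counter(tokenize(text))
    totals = [0] * bits
    for term, weight in weights.items():
        # Hash the term and keep the first `bits` bits of the digest.
        digest = hashlib.sha256(term.encode("utf-8")).digest()
        h = int.from_bytes(digest[: bits // 8], "big")
        for i in range(bits):
            # Add the weight where the hash bit is 1, subtract where it is 0.
            if (h >> i) & 1:
                totals[i] += weight
            else:
                totals[i] -= weight
    # The fingerprint has a 1 in every position with a positive total.
    fingerprint = 0
    for i, total in enumerate(totals):
        if total > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")
```

Two pages would then be flagged as near-duplicates when hamming_distance(simhash(page_a), simhash(page_b)) falls at or below a chosen threshold. The threshold is a tuning parameter and is not given in this record.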

Pages: 765-770

Page count: 6

