Near-Duplicate Web Page Detection: An Efficient Approach Using Clustering, Sentence Feature and Fingerprinting

Cited by: 6
Authors
Kumar, J. Prasanna [1]
Govindarajulu, P. [2]
Affiliations
[1] Osmania Univ, MVSR Engn Coll, Hyderabad 500007, Andhra Pradesh, India
[2] Sri Venkateswara Univ, SVC Coll CMIS, Dept CS, Tirupati, Andhra Pradesh, India
Keywords
Web crawling; Web page; Duplicate web page; Near-duplicate web page; Near-duplicate detection; Fingerprinting
DOI
10.1080/18756891.2013.752657
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Duplicate and near-duplicate web pages are a major concern for web search engines: they inflate the space needed to store indexes, which in turn slows down the serving of results and raises its cost. A variety of techniques have been developed to identify pairs of web pages that are "similar" to each other, and the problem of finding near-duplicate web pages has been studied in the database and web-search communities for some years. To identify near-duplicate web pages, we use sentence-level features together with a fingerprinting method. When a large number of web documents must be examined, we first apply K-mode clustering and then compare sentence features and fingerprints. With these steps, we identify near-duplicate web pages accurately and efficiently. Experiments on web page collections confirm the efficiency of the proposed approach in detecting near-duplicate web pages.
Pages: 1 / 13
Page count: 13
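
The abstract describes a two-stage pipeline: K-mode clustering to narrow the candidate set, followed by sentence-feature and fingerprint comparison within each cluster. The record does not give the paper's actual features, hash function, or thresholds, so the following is only a minimal sketch under stated assumptions: binary keyword-presence vectors as the categorical input to k-modes, sentence count as the sentence-level feature, one truncated SHA-256 fingerprint per normalized sentence, and Jaccard overlap of fingerprint sets as the similarity measure. All function names, parameters, and the sample documents are hypothetical.

```python
import hashlib
import re
from collections import Counter
from itertools import combinations


def sentences(text):
    """Split text into normalized sentences (lowercased, whitespace collapsed)."""
    parts = re.split(r"[.!?]+", text.lower())
    return [" ".join(p.split()) for p in parts if p.strip()]


def fingerprints(text):
    """One 64-bit fingerprint per sentence (assumption: first 8 bytes of SHA-256)."""
    return {
        int.from_bytes(hashlib.sha256(s.encode("utf-8")).digest()[:8], "big")
        for s in sentences(text)
    }


def categorical_features(text, vocabulary):
    """Binary keyword-presence vector (assumed categorical input for k-modes)."""
    words = set(re.findall(r"[a-z0-9]+", text.lower()))
    return tuple(int(w in words) for w in vocabulary)


def hamming(a, b):
    """Number of attributes on which two categorical vectors disagree."""
    return sum(x != y for x, y in zip(a, b))


def k_modes(points, k, max_iter=20):
    """Plain k-modes: assign by Hamming distance, update modes attribute-wise."""
    modes = list(dict.fromkeys(points))[:k]  # deterministic init: first k distinct points
    labels = []
    for _ in range(max_iter):
        new_labels = [
            min(range(len(modes)), key=lambda j: hamming(p, modes[j])) for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for j in range(len(modes)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                modes[j] = tuple(
                    Counter(col).most_common(1)[0][0] for col in zip(*members)
                )
    return labels


def near_duplicates(docs, vocabulary, k=2, threshold=0.5, min_len_ratio=0.5):
    """Return (i, j) index pairs judged near-duplicate within the same cluster."""
    labels = k_modes([categorical_features(d, vocabulary) for d in docs], k)
    counts = [len(sentences(d)) for d in docs]
    prints = [fingerprints(d) for d in docs]
    pairs = []
    for i, j in combinations(range(len(docs)), 2):
        if labels[i] != labels[j] or not counts[i] or not counts[j]:
            continue  # only non-empty pages in the same cluster are compared
        if min(counts[i], counts[j]) / max(counts[i], counts[j]) < min_len_ratio:
            continue  # sentence-count feature rules the pair out cheaply
        jaccard = len(prints[i] & prints[j]) / len(prints[i] | prints[j])
        if jaccard >= threshold:
            pairs.append((i, j))
    return pairs


docs = [
    "Python is a programming language. It is widely used. It has many libraries. It is easy to learn.",
    "Python is a programming language. It is widely used. It has many great libraries. It is easy to learn.",
    "Football is a team sport. It is played worldwide. Each team has eleven players.",
    "Football is a team sport. It is played worldwide. Each team has eleven players. Fans love it.",
    "Python libraries make the language productive. Packaging is a separate topic.",
]
vocabulary = ["python", "language", "libraries", "football", "sport", "team"]
print(near_duplicates(docs, vocabulary))  # [(0, 1), (2, 3)]
```

In this toy run the clustering step keeps the Python and football pages apart, so only same-cluster pairs reach the fingerprint comparison; the fifth page shares a cluster with the first two but no sentences, so it is not reported.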