Anomaly detection in web documents using crisp and fuzzy-based cosine clustering methodology

Cited by: 34
Authors
Friedman, Menahem
Last, Mark [1]
Makover, Yaniv
Kandel, Abraham
Affiliations
[1] Ben Gurion Univ Negev, Dept Informat Syst Engn, IL-84105 Beer Sheva, Israel
[2] Nucl Res Ctr Negev, Dept Phys, IL-84190 Beer Sheva, Israel
[3] Univ S Florida, Dept Comp Sci & Engn, Tampa, FL 33620 USA
Keywords
fuzzy-based clustering; document clustering; cosine similarity; anomaly detection
DOI
10.1016/j.ins.2006.03.006
Chinese Library Classification
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Cluster analysis is a primary tool for detecting anomalous behavior in real-world data such as web documents, patients' medical records, or other personal data. Most existing methods for document clustering are based on the classical vector-space model, which represents each document by a fixed-size vector of weighted key terms, often referred to as key phrases. Since vector representations of documents are frequently very sparse, inverted files are used to avoid the heavy computational overhead that can arise in large and diverse document collections such as pages downloaded from the World Wide Web. To reduce computation costs and space complexity, many popular methods for clustering web documents, including those using inverted files, assume a relatively small, prefixed number of clusters. We propose several new crisp and fuzzy approaches based on the cosine similarity principle for clustering documents that are represented by variable-size vectors of key phrases, without limiting the final number of clusters. Each entry in a vector consists of two fields: the first refers to a key phrase in the document, and the second denotes an importance weight associated with this key phrase within the particular document. Removing the restriction on the total number of clusters may moderately increase computing costs, but it improves the method's performance in classifying incoming vectors as normal or abnormal, based on their similarity to the existing clusters. All the procedures presented in this work are characterized by two features: (a) the number of clusters is not restricted to some relatively small prefixed number, i.e., an arbitrary new incoming vector that is not similar to any of the existing cluster centers necessarily starts a new cluster, and (b) a vector that appears n times in the training set is counted as n distinct vectors rather than a single vector.
These features are the main reasons for the high-quality performance of the proposed algorithms. We later describe them in detail and show their implementation in a real-world application from the area of web activity monitoring, in particular, detecting anomalous documents downloaded from the internet by users with abnormal information interests. (C) 2006 Elsevier Inc. All rights reserved.
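The crisp variant outlined in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the dictionary representation of a document, the function names, the running-sum centroid, and the similarity threshold of 0.5 are all assumptions made for illustration, and the fuzzy variants are not covered. The sketch only shows the two stated features: a vector that is not similar to any existing cluster center starts a new cluster, and a vector that appears n times is counted n times.

```python
import math


def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as {key_phrase: weight}."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the shorter vector
    dot = sum(w * b.get(k, 0.0) for k, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster(docs, threshold=0.5):
    """One-pass crisp clustering with an unbounded number of clusters.

    `threshold` is a hypothetical parameter, not a value from the paper.
    Each cluster keeps the sum of its member vectors; cosine similarity is
    scale-invariant, so the sum behaves like the mean as a cluster center.
    """
    clusters = []
    for doc in docs:
        best, best_sim = None, threshold
        for c in clusters:
            sim = cosine(doc, c["sum"])
            if sim >= best_sim:
                best, best_sim = c, sim
        if best is None:
            # Not similar to any existing center: start a new cluster.
            clusters.append({"sum": dict(doc), "count": 1})
        else:
            # Every occurrence is added, so repeated vectors count n times.
            for k, w in doc.items():
                best["sum"][k] = best["sum"].get(k, 0.0) + w
            best["count"] += 1
    return clusters


def is_anomalous(doc, clusters, threshold=0.5):
    """Flag a document that is not similar to any existing cluster center."""
    return all(cosine(doc, c["sum"]) < threshold for c in clusters)


training = [
    {"fuzzy clustering": 1.0, "cosine": 0.5},
    {"fuzzy clustering": 1.0, "cosine": 0.5},  # duplicate, counted again
    {"nuclear physics": 1.0},
]
model = cluster(training)
print([c["count"] for c in model])
print(is_anomalous({"football": 1.0}, model))
print(is_anomalous({"fuzzy clustering": 1.0}, model))
```

Storing each document as a dictionary keeps the vectors variable-size, so only key phrases that actually occur in a document take part in the dot product.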
Pages: 467-475
Page count: 9
Related papers
50 in total
  • [31] Factor-analysis based anomaly detection and clustering
    Wu, Ningning
    Zhang, Jing
    DECISION SUPPORT SYSTEMS, 2006, 42 (01) : 375 - 389
  • [32] Anomaly detection model based on data stream clustering
    Chunyong Yin
    Sun Zhang
    Zhichao Yin
    Jin Wang
    Cluster Computing, 2019, 22 : 1729 - 1738
  • [33] BAE: Anomaly Detection Algorithm Based on Clustering and Autoencoder
    Wang, Dongqi
    Nie, Mingshuo
    Chen, Dongming
    MATHEMATICS, 2023, 11 (15)
  • [34] Hyperspectral Anomaly Detection Using Quantum Potential Clustering
    Tu, Bing
    Wang, Zhi
    Yang, Xianchang
    Li, Jun
    Plaza, Antonio
    IEEE TRANSACTIONS ON INSTRUMENTATION AND MEASUREMENT, 2022, 71
  • [35] Anomaly intrusion detection based on clustering a data stream
    Oh, Sang-Hyun
    Kang, Jin-Suk
    Byun, Yung-Cheol
    Jeong, Taikyeong T.
    Lee, Won-Suk
    INFORMATION SECURITY, PROCEEDINGS, 2006, 4176 : 415 - 426
  • [36] Power Equipment Anomaly Detection Based on Spatiotemporal Clustering
    Chen, Yufeng
    Du, Xiuming
    Chen, Jiajun
    Yan, Yingjie
    Sheng, Gehao
    Yi, Yang
    2016 INTERNATIONAL CONFERENCE ON CONDITION MONITORING AND DIAGNOSIS (CMD), 2016, : 392 - 395
  • [37] Anomaly detection model based on data stream clustering
    Yin, Chunyong
    Zhang, Sun
    Yin, Zhichao
    Wang, Jin
    CLUSTER COMPUTING-THE JOURNAL OF NETWORKS SOFTWARE TOOLS AND APPLICATIONS, 2019, 22 (Suppl 1) : 1729 - 1738
  • [38] MFCD: A Deep Learning Method with Fuzzy Clustering for Time Series Anomaly Detection
    Luo, Kaisheng
    Liu, Chang
    Chen, Baiyang
    Li, Xuedong
    Peng, Dezhong
    Yuan, Zhong
    WEB AND BIG DATA, APWEB-WAIM 2024, PT III, 2024, 14963 : 62 - 77
  • [39] Maritime Anomaly Detection using Density-based Clustering and Recurrent Neural Network
    Zhao, Liangbin
    Shi, Guoyou
    JOURNAL OF NAVIGATION, 2019, 72 (04) : 894 - 916
  • [40] An Approach For Verifying And Validating Clustering Based Anomaly Detection Systems Using Metamorphic Testing
    Rehman, Faqeer Ur
    Izurieta, Clemente
    2022 FOURTH IEEE INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE TESTING (AITEST 2022), 2022, : 12 - 18