Anomaly detection in web documents using crisp and fuzzy-based cosine clustering methodology

Cited by: 34
Authors
Friedman, Menahem
Last, Mark [1]
Makover, Yaniv
Kandel, Abraham
Affiliations
[1] Ben Gurion Univ Negev, Dept Informat Syst Engn, IL-84105 Beer Sheva, Israel
[2] Nucl Res Ctr Negev, Dept Phys, IL-84190 Beer Sheva, Israel
[3] Univ S Florida, Dept Comp Sci & Engn, Tampa, FL 33620 USA
Keywords
fuzzy-based clustering; document clustering; cosine similarity; anomaly detection
DOI
10.1016/j.ins.2006.03.006
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Cluster analysis is a primary tool for detecting anomalous behavior in real-world data such as web documents, patient medical records, or other personal data. Most existing methods for document clustering are based on the classical vector-space model, which represents each document by a fixed-size vector of weighted key terms, often referred to as key phrases. Since vector representations of documents are frequently very sparse, inverted files are used to prevent the heavy computational overhead that can arise in large and diverse document collections such as pages downloaded from the World Wide Web. To reduce computation costs and space complexity, many popular methods for clustering web documents, including those using inverted files, usually assume a relatively small, predefined number of clusters. We propose several new crisp and fuzzy approaches based on the cosine similarity principle for clustering documents that are represented by variable-size vectors of key phrases, without limiting the final number of clusters. Each entry in a vector consists of two fields: the first refers to a key phrase in the document, and the second denotes an importance weight associated with this key phrase within that particular document. Removing the restriction on the total number of clusters may moderately increase computing costs, but it improves the method's performance in classifying incoming vectors as normal or abnormal, based on their similarity to the existing clusters. All the procedures presented in this work are characterized by two features: (a) the number of clusters is not restricted to some relatively small, predefined number, i.e., an arbitrary new incoming vector that is not similar to any of the existing cluster centers necessarily starts a new cluster, and (b) a vector that appears n times in the training set is counted as n distinct vectors rather than as a single vector.
These features are the main reasons for the high-quality performance of the proposed algorithms. We later describe them in detail and show their implementation in a real-world application from the area of web activity monitoring, namely detecting anomalous documents downloaded from the internet by users with abnormal information interests. (C) 2006 Elsevier Inc. All rights reserved.
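The crisp variant described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the abstract does not specify the similarity threshold, the cluster-center update rule, or the fuzzy variants, so the `threshold` parameter, the summed-vector cluster center, and the names `cosine`, `Cluster`, `cluster_documents`, and `is_anomalous` are all assumptions made for illustration. The sketch keeps only the two stated features: a vector that is not similar to any existing cluster center starts a new cluster, and repeated vectors are each counted.

```python
import math
from typing import Dict, List

# A document is a variable-size sparse vector: key phrase -> importance weight.
Vector = Dict[str, float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two sparse vectors; 0.0 if either has zero norm."""
    dot = sum(w * b[k] for k, w in a.items() if k in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class Cluster:
    """Hypothetical cluster record: keeps the sum of its member vectors.

    Cosine similarity ignores scale, so the sum can stand in for the mean
    when comparing a vector against the cluster center (an assumption here).
    """

    def __init__(self, first: Vector) -> None:
        self.center: Vector = dict(first)
        self.size = 1

    def add(self, v: Vector) -> None:
        for phrase, weight in v.items():
            self.center[phrase] = self.center.get(phrase, 0.0) + weight
        self.size += 1


def cluster_documents(docs: List[Vector], threshold: float) -> List[Cluster]:
    """Single-pass crisp clustering with no limit on the number of clusters."""
    clusters: List[Cluster] = []
    for v in docs:  # duplicates are processed one by one, so each is counted
        best = max(clusters, key=lambda c: cosine(v, c.center), default=None)
        if best is not None and cosine(v, best.center) >= threshold:
            best.add(v)
        else:
            clusters.append(Cluster(v))  # dissimilar to all centers: new cluster
    return clusters


def is_anomalous(v: Vector, clusters: List[Cluster], threshold: float) -> bool:
    """Flag a vector as abnormal if it is not similar to any cluster center."""
    return all(cosine(v, c.center) < threshold for c in clusters)
```

To use the sketch, one would build the clusters from vectors of normal documents with `cluster_documents(training_vectors, threshold)` and then call `is_anomalous` on each incoming vector; the threshold value has to be chosen empirically.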
Pages: 467-475
Number of pages: 9
Related papers
50 in total
  • [41] Anomaly Detection for Spacecraft using Hierarchical Agglomerative Clustering based on Maximal Information Coefficient
    Zhang, Liwen
    Yu, Jinsong
    Tang, Diyin
    Han, Danyang
    Tian, Limei
    Dai, Jing
    PROCEEDINGS OF THE 15TH IEEE CONFERENCE ON INDUSTRIAL ELECTRONICS AND APPLICATIONS (ICIEA 2020), 2020, : 1848 - 1853
  • [42] Anomaly Detection Based on Histogram Methodology and Factor Analysis Using LightGBM for Cooling Systems
    Yanabe, Tomu
    Nishi, Hiroaki
    Hashimoto, Masahiro
    2020 25TH IEEE INTERNATIONAL CONFERENCE ON EMERGING TECHNOLOGIES AND FACTORY AUTOMATION (ETFA), 2020, : 952 - 958
  • [43] Clustering web people search results using fuzzy ants
    Lefever, E.
    Fayruzov, T.
    Hoste, V.
    De Cock, M.
    INFORMATION SCIENCES, 2010, 180 (17) : 3192 - 3209
  • [44] Anomaly Upload Behavior Detection Based on Fuzzy Inference
    Han, Ting
    Zhan, Xuna
    Tao, Jing
    Cao, Ken
    Xiong, Yuheng
    2021 IEEE INTL CONF ON DEPENDABLE, AUTONOMIC AND SECURE COMPUTING, INTL CONF ON PERVASIVE INTELLIGENCE AND COMPUTING, INTL CONF ON CLOUD AND BIG DATA COMPUTING, INTL CONF ON CYBER SCIENCE AND TECHNOLOGY CONGRESS DASC/PICOM/CBDCOM/CYBERSCITECH 2021, 2021, : 923 - 929
  • [45] A fuzzy rules based approach for performance anomaly detection
    Xu, H
    You, J
    Liu, FY
    2005 IEEE Networking, Sensing and Control Proceedings, 2005, : 44 - 48
  • [46] Anomaly detection based on fuzzy neighborhood rough sets
    Yuan, Yuan
    Wang, Sihan
    Chen, Hongmei
    Luo, Chuan
    Yuan, Zhong
    INFORMATION SCIENCES, 2025, 709
  • [47] Unsupervised anomaly detection using HDG-Clustering algorithm
    Tsai, Cheng-Fa
    Yen, Chia-Chen
    NEURAL INFORMATION PROCESSING, PART II, 2008, 4985 : 356 - 365
  • [48] ECG Anomaly Detection using Wireless BAN and HEMFCM Clustering
    Janani, S. R.
    Hemalatha, C. Sweetlin
    Vaidehi, V.
    2013 INTERNATIONAL CONFERENCE ON RECENT TRENDS IN INFORMATION TECHNOLOGY (ICRTIT), 2013, : 257 - 262
  • [49] Anomaly detection in group activities based on fuzzy lattices using Schrödinger equation
    Rajiv Kapoor
    Om Mishra
    M. M. Tripathi
    Iran Journal of Computer Science, 2020, 3 (2) : 103 - 114
  • [50] A Deep Learning Enabled Subspace Spectral Ensemble Clustering Approach for Web Anomaly Detection
    Yuan, Guiqin
    Li, Bo
    Yao, Yiyang
    Zhang, Simin
    2017 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2017, : 3896 - 3903