Anomaly detection in web documents using crisp and fuzzy-based cosine clustering methodology

Cited by: 34
Authors
Friedman, Menahem
Last, Mark [1]
Makover, Yaniv
Kandel, Abraham
Affiliations
[1] Ben Gurion Univ Negev, Dept Informat Syst Engn, IL-84105 Beer Sheva, Israel
[2] Nucl Res Ctr Negev, Dept Phys, IL-84190 Beer Sheva, Israel
[3] Univ S Florida, Dept Comp Sci & Engn, Tampa, FL 33620 USA
Keywords
fuzzy-based clustering; document clustering; cosine similarity; anomaly detection
DOI
10.1016/j.ins.2006.03.006
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Cluster analysis is a primary tool for detecting anomalous behavior in real-world data such as web documents, medical records of patients or other personal data. Most existing methods for document clustering are based on the classical vector-space model, which represents each document by a fixed-size vector of weighted key terms, often referred to as key phrases. Since vector representations of documents are frequently very sparse, inverted files are used to avoid the heavy computational load that large and diverse document collections, such as pages downloaded from the World Wide Web, would otherwise cause. In order to reduce computation costs and space complexity, many popular methods for clustering web documents, including those using inverted files, usually assume a relatively small prefixed number of clusters. We propose several new crisp and fuzzy approaches based on the cosine similarity principle for clustering documents that are represented by variable-size vectors of key phrases, without limiting the final number of clusters. Each entry in a vector consists of two fields. The first field refers to a key phrase in the document and the second denotes an importance weight associated with this key phrase within the particular document. Removing the restriction on the total number of clusters may moderately increase computing costs, but it improves the method's performance in classifying incoming vectors as normal or abnormal, based on their similarity to the existing clusters. All the procedures presented in this work are characterized by two features: (a) the number of clusters is not restricted to some relatively small prefixed number, i.e., an arbitrary new incoming vector that is not similar to any of the existing cluster centers necessarily starts a new cluster, and (b) a vector that appears n times in the training set is counted as n distinct vectors rather than a single vector. These features are the main reasons for the high-quality performance of the proposed algorithms. We later describe them in detail and show their implementation in a real-world application from the area of web activity monitoring, in particular, detecting anomalous documents downloaded from the internet by users with abnormal information interests. © 2006 Elsevier Inc. All rights reserved.
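The crisp variant described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class name `CosineClusterer`, the similarity `threshold` of 0.5, the greedy single-pass assignment, and the choice of a summed member vector as the cluster center are all assumptions made for illustration. It only shows the two stated features, namely that a vector not similar to any existing center starts a new cluster, and that a vector appearing n times is counted n times. The fuzzy variants are not covered.

```python
import math


def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as {key_phrase: weight}."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the shorter vector
    dot = sum(w * b.get(k, 0.0) for k, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class CosineClusterer:
    """Greedy crisp clustering with an unbounded number of clusters (illustrative)."""

    def __init__(self, threshold=0.5):
        self.threshold = threshold  # assumed similarity cut-off, not from the paper
        self.sums = []    # per-cluster sum of member vectors
        self.counts = []  # per-cluster number of members (duplicates included)

    def best_match(self, vec):
        """Return (cluster index, similarity) of the most similar cluster."""
        best_i, best_s = -1, -1.0
        for i, center in enumerate(self.sums):
            # Cosine is scale-invariant, so the summed vector stands in
            # for the mean centroid without any division.
            s = cosine(vec, center)
            if s > best_s:
                best_i, best_s = i, s
        return best_i, best_s

    def add(self, vec):
        """Assign a training vector; start a new cluster if nothing is similar."""
        i, s = self.best_match(vec)
        if i < 0 or s < self.threshold:
            self.sums.append(dict(vec))
            self.counts.append(1)
            return len(self.sums) - 1
        for k, w in vec.items():
            self.sums[i][k] = self.sums[i].get(k, 0.0) + w
        self.counts[i] += 1  # each repeated vector is counted separately
        return i

    def is_anomalous(self, vec):
        """Flag a vector whose best similarity to any cluster is below threshold."""
        _, s = self.best_match(vec)
        return s < self.threshold
```

Because each call to `add` updates the cluster sum and count, feeding the same document vector n times weights the cluster center n times, which mirrors feature (b). In practice the threshold would need tuning on the document collection at hand.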
Pages: 467 - 475
Number of pages: 9
Related papers
50 in total
  • [1] A fuzzy-based algorithm for Web document clustering
    Friedman, M
    Kandel, A
    Schneider, M
    Last, M
    Shapira, B
    Elovici, Y
    Zaafrany, O
    NAFIPS 2004: ANNUAL MEETING OF THE NORTH AMERICAN FUZZY INFORMATION PROCESSING SOCIETY, VOLS 1 AND 2: FUZZY SETS IN THE HEART OF THE CANADIAN ROCKIES, 2004, : 524 - 527
  • [2] Anomaly based Intrusion Detection using Modified Fuzzy Clustering
    Harish, B. S.
    Kumar, S. V. Aruna
    INTERNATIONAL JOURNAL OF INTERACTIVE MULTIMEDIA AND ARTIFICIAL INTELLIGENCE, 2017, 4 (06): : 54 - 59
  • [3] Cosine similarity based anomaly detection methodology for the CAN bus
    Kwak, Byung Il
    Han, Mee Lan
    Kim, Huy Kang
    EXPERT SYSTEMS WITH APPLICATIONS, 2021, 166
  • [4] Fast fuzzy clustering of Web documents
    Wang, Jian-Hui
    Jiang, Long-Bin
    Yang, Shu
    Chang'an Daxue Xuebao (Ziran Kexue Ban)/Journal of Chang'an University (Natural Science Edition), 2007, 27 (02): : 107 - 110
  • [5] Anomaly-based intrusion detection using fuzzy rough clustering
    Chimphlee, Witcha
    Abdullah, Abdul Hanan
    Sap, Mohd Noor Md
    Srinoy, Surat
    Chimphlee, Siriporn
    2006 International Conference on Hybrid Information Technology, Vol 1, Proceedings, 2006, : 329 - 334
  • [6] Leakage detection and location in water distribution systems using a fuzzy-based methodology
    Islam, M. Shafiqul
    Sadiq, Rehan
    Rodriguez, Manuel J.
    Francisque, Alex
    Najjaran, Homayoun
    Hoorfar, Mina
    URBAN WATER JOURNAL, 2011, 8 (06) : 351 - 365
  • [7] Discovering Latent Semantics in Web Documents Using Fuzzy Clustering
    Chiang, I-Jen
    Liu, Charles Chih-Ho
    Tsai, Yi-Hsin
    Kumar, Ajit
    IEEE TRANSACTIONS ON FUZZY SYSTEMS, 2015, 23 (06) : 2122 - 2134
  • [8] Content-based methodology for anomaly detection on the web
    Last, M
    Shapira, B
    Elovici, Y
    Zaafrany, O
    Kandel, A
    ADVANCES IN WEB INTELLIGENCE, 2003, 2663 : 113 - 123
  • [9] Anomaly Detection System in Cloud Environment Using Fuzzy Clustering Based ANN
    Pandeeswari, N.
    Kumar, Ganesh
    MOBILE NETWORKS & APPLICATIONS, 2016, 21 (03): : 494 - 505