Anomaly detection in web documents using crisp and fuzzy-based cosine clustering methodology

Cited by: 34

Authors:

Friedman, Menahem

Last, Mark [1]

Makover, Yaniv

Kandel, Abraham

Affiliations:

[1] Ben Gurion Univ Negev, Dept Informat Syst Engn, IL-84105 Beer Sheva, Israel

[2] Nucl Res Ctr Negev, Dept Phys, IL-84190 Beer Sheva, Israel

[3] Univ S Florida, Dept Comp Sci & Engn, Tampa, FL 33620 USA

Source:

INFORMATION SCIENCES | 2007 / Vol. 177 / Issue 02

Keywords:

fuzzy-based clustering; document clustering; cosine similarity; anomaly detection

DOI:

10.1016/j.ins.2006.03.006

Chinese Library Classification (CLC) number:

TP [Automation Technology, Computer Technology]

Discipline classification code:

0812

Abstract:

Cluster analysis is a primary tool for detecting anomalous behavior in real-world data such as web documents, medical records of patients, or other personal data. Most existing methods for document clustering are based on the classical vector-space model, which represents each document by a fixed-size vector of weighted key terms, often referred to as key phrases. Since vector representations of documents are frequently very sparse, inverted files are used to avoid the heavy computational load that large and diverse document collections, such as pages downloaded from the World Wide Web, would otherwise cause. To reduce computation costs and space complexity, many popular methods for clustering web documents, including those using inverted files, usually assume a relatively small, prefixed number of clusters. We propose several new crisp and fuzzy approaches based on the cosine similarity principle for clustering documents that are represented by variable-size vectors of key phrases, without limiting the final number of clusters. Each entry in a vector consists of two fields: the first refers to a key phrase in the document, and the second denotes an importance weight associated with this key phrase within that document. Removing the restriction on the total number of clusters may moderately increase computing costs, but it improves the method's performance in classifying incoming vectors as normal or abnormal, based on their similarity to the existing clusters. All the procedures presented in this work are characterized by two features: (a) the number of clusters is not restricted to some relatively small, prefixed number, i.e., an arbitrary new incoming vector that is not similar to any of the existing cluster centers necessarily starts a new cluster, and (b) a vector that appears n times in the training set is counted as n distinct vectors rather than as a single vector. These features are the main reasons for the high-quality performance of the proposed algorithms. We later describe them in detail and show their implementation in a real-world application from the area of web activity monitoring, namely detecting anomalous documents downloaded from the internet by users with abnormal information interests. (C) 2006 Elsevier Inc. All rights reserved.
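
The abstract gives only the outline of the crisp procedure, so the following Python is a minimal sketch of that outline and not the authors' algorithm. It assumes each document is a dict mapping key phrase to weight (the variable-size vector of two-field entries), that a cluster center can be tracked as the running sum of its members' vectors, and that a fixed similarity threshold decides when a new cluster starts. The names `cosine`, `crisp_cluster`, `is_anomalous`, and the threshold value 0.5 are hypothetical; the fuzzy variants are not covered.

```python
import math


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as {key_phrase: weight}."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the shorter vector
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def crisp_cluster(docs, threshold=0.5):
    """Assign each document to its most similar cluster, or start a new one.

    The number of clusters is not fixed in advance. A document that occurs
    n times in `docs` is processed n times, so it contributes n times to
    the cluster's sum vector and member count.
    """
    clusters = []  # each cluster: {"sum": {phrase: weight}, "count": int}
    for doc in docs:
        best, best_sim = None, -1.0
        for cluster in clusters:
            # Cosine is scale-invariant, so the sum vector stands in for the mean.
            sim = cosine(doc, cluster["sum"])
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is None or best_sim < threshold:
            # Assumed rule: below the threshold, the vector starts a new cluster.
            clusters.append({"sum": dict(doc), "count": 1})
        else:
            for phrase, weight in doc.items():
                best["sum"][phrase] = best["sum"].get(phrase, 0.0) + weight
            best["count"] += 1
    return clusters


def is_anomalous(doc, clusters, threshold=0.5):
    """Flag an incoming vector that is not similar to any existing cluster."""
    return all(cosine(doc, c["sum"]) < threshold for c in clusters)
```

A possible use: build clusters from documents viewed by normal users, then call `is_anomalous` on each newly downloaded document. The threshold and the sum-based center are illustrative choices that would need tuning against real data.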

Pages: 467-475

Number of pages: 9
