Detecting anomaly collections using extreme feature ranks

被引:5
作者
Dai, Hanbo [1 ]
Zhu, Feida [2 ]
Lim, Ee-Peng [2 ]
Pang, HweeHwa [2 ]
机构
[1] Hubei Univ, Sch Comp Sci & Informat Engn, Wuhan, Peoples R China
[2] Singapore Management Univ, Sch Informat Syst, Singapore 178902, Singapore
关键词
Anomaly collection; Extreme feature rank; Anomaly cluster; Outlier group; Spam detection; Spam cluster; CLUSTER; ALGORITHM;
D O I
10.1007/s10618-014-0360-3
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
Detecting anomaly collections is an important task with many applications, including spam and fraud detection. In an anomaly collection, entities often operate in collusion and hold different agendas to normal entities. As a result, they usually manifest collective extreme traits, i.e., members of an anomaly collection are consistently clustered toward the top or bottom ranks on certain features. We therefore propose to detect these anomaly collections by extreme feature ranks. We introduce a novel anomaly definition called Extreme Rank Anomalous Collection or ERAC. We propose a new measure of anomalousness capturing collective extreme traits based on a statistical model. As there can be a large number of ERACs of various sizes, for simplicity, we first investigate the ERAC detection problem of finding top- ERACs of a predefined size limit. We then tackle the follow-up ERAC expansion problem of uncovering the supersets of the detected ERACs that are more anomalous without any size constraint. Algorithms are proposed for both ERAC detection and expansion problems, followed by studies of their performance in four datasets. Specifically, in synthetic datasets, both ERAC detection and expansion algorithms demonstrate high precisions and recalls. In a web spam dataset, both ERAC detection and expansion algorithms discover web spammers with higher precisions than existing approaches. In an IMDB dataset, both ERAC detection and expansion algorithms identify unusual actor collections that are not easily identified by clustering-based methods. In a Chinese online forum dataset, our ERAC detection algorithm identifies suspicious "water army" spammer collections agreed by human evaluators. ERAC expansion algorithm successfully reveals two larger spammer collections with different spamming behaviors.
引用
收藏
页码:689 / 731
页数:43
相关论文
共 31 条
[1]  
Agrawal R., 1994, P 20 INT C VER LARG, P487, DOI DOI 10.5555/645920.672836
[2]  
[Anonymous], P 2012 SIAM INT C DA
[3]  
[Anonymous], 1999, Handbook of Combinatorial Optimization, DOI [DOI 10.1007/978-1-4757-3023-4_1, 10.1007/978-1-4757-3023-41]
[4]   DETECTION OF AN ANOMALOUS CLUSTER IN A NETWORK [J].
Arias-Castro, Ery ;
Candes, Emmanuel J. ;
Durand, Arnaud .
ANNALS OF STATISTICS, 2011, 39 (01) :278-304
[5]   Reverse search for enumeration [J].
Avis, D ;
Fukuda, K .
DISCRETE APPLIED MATHEMATICS, 1996, 65 (1-3) :21-46
[6]  
Barnett V., 1994, Outliers in statistical data
[7]   Link Analysis for Web Spam Detection [J].
Becchetti, Luca ;
Castillo, Carlos ;
Donato, Debora ;
Baeza-Yates, Ricardo ;
Leonardi, Stefano .
ACM TRANSACTIONS ON THE WEB, 2008, 2 (01)
[8]   LOF: Identifying density-based local outliers [J].
Breunig, MM ;
Kriegel, HP ;
Ng, RT ;
Sander, J .
SIGMOD RECORD, 2000, 29 (02) :93-104
[9]  
Castillo Carlos, 2007, 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, P423, DOI 10.1145/1277741.1277814
[10]  
Chau Duen Horng, 2007, Proceedings of the 16th international conference on World Wide Web