An outlier detection algorithm for categorical matrix-object data

Cited by: 3
Authors
Cao, Fuyuan [1]
Wu, Xiaolin [1]
Yu, Liqin [1]
Liang, Jiye [1]
Affiliations
[1] Shanxi Univ, Key Lab Computat Intelligence & Chinese Informat, Minist Educ, Sch Comp & Informat Technol, Taiyuan 030006, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Outlier detection algorithms; Categorical matrix-object data; Data mining; Machine learning;
DOI
10.1016/j.asoc.2021.107182
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Outlier detection is an important problem in data mining and machine learning; it aims to discover objects in a data set that do not conform to well-defined notions of expected behavior. Existing outlier detection algorithms generally take as input a collection of n objects, each described by a single feature vector. However, in many real-world applications, an object is described not by one feature vector but by several. In this paper, we define an object described by more than one feature vector as a matrix-object. Inspired by the concepts of cohesion and coupling in software engineering, we define the coupling of a matrix-object based on the average distance between it and the other matrix-objects, and define its cohesion based on information entropy and mutual information. On this basis, we give the outlier factor of a matrix-object and propose an outlier detection algorithm for categorical matrix-object data. Experimental results on real and synthetic data sets show that the proposed algorithm detects outliers in matrix-object data sets effectively in comparison with other algorithms. © 2021 Elsevier B.V. All rights reserved.
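
The abstract does not give the formulas, so the following is only a rough Python sketch of the idea, not the authors' algorithm. It treats a matrix-object as a list of categorical records, uses the mean pairwise simple-matching distance as an assumed distance between matrix-objects, takes coupling as the average distance to the other matrix-objects, and uses mean per-attribute Shannon entropy as a stand-in for the cohesion term (mutual information is left out). All function names, the toy data, and the way the two terms are added into a score are hypothetical.

```python
from collections import Counter
from math import log2


def record_distance(r1, r2):
    """Simple matching: fraction of attributes on which two records differ."""
    return sum(a != b for a, b in zip(r1, r2)) / len(r1)


def object_distance(x, y):
    """Assumed matrix-object distance: mean pairwise record distance."""
    total = sum(record_distance(r1, r2) for r1 in x for r2 in y)
    return total / (len(x) * len(y))


def coupling(i, objects):
    """Average distance from matrix-object i to every other matrix-object."""
    others = [o for j, o in enumerate(objects) if j != i]
    return sum(object_distance(objects[i], o) for o in others) / len(others)


def attribute_entropy(values):
    """Shannon entropy (in bits) of one categorical attribute."""
    n = len(values)
    return -sum(c / n * log2(c / n) for c in Counter(values).values())


def internal_entropy(x):
    """Mean per-attribute entropy of the records inside one matrix-object."""
    columns = list(zip(*x))
    return sum(attribute_entropy(col) for col in columns) / len(columns)


def outlier_scores(objects):
    """Hypothetical score: coupling plus internal entropy (not the paper's factor)."""
    return [coupling(i, objects) + internal_entropy(x)
            for i, x in enumerate(objects)]


# Toy data: each matrix-object is a list of (attribute1, attribute2) records.
objects = [
    [("a", "x"), ("a", "x")],
    [("a", "x"), ("a", "y")],
    [("a", "x"), ("a", "x"), ("a", "x")],
    [("b", "z"), ("c", "y")],  # far from the others and internally inconsistent
]
scores = outlier_scores(objects)
most_suspicious = max(range(len(objects)), key=scores.__getitem__)
print(most_suspicious, [round(s, 3) for s in scores])
```

On this toy data the last matrix-object receives the highest score, because it is both distant from the others and heterogeneous internally. The paper's actual outlier factor may weight or combine these quantities differently.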
Pages: 10