Outlier detection for partially labeled categorical data based on conditional information entropy

Cited by: 4
Authors
Zhao, Zhengwei [1]
Wang, Rongrong [2]
Huang, Dan [3]
Li, Zhaowen [4]
Affiliations
[1] Guangxi Minzu Univ, Sch Math & Phys, Nanning 530006, Guangxi, Peoples R China
[2] Guangxi Minzu Univ, Elect & Informat Engn, Nanning 530000, Guangxi, Peoples R China
[3] Yulin Normal Univ, Sch Comp Sci & Engn, Yulin 537000, Guangxi, Peoples R China
[4] Putian Univ, Key Lab Appl Math Fujian Prov Univ, Fujian Key Lab Financial Informat Proc, Putian 351100, Fujian, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Partially labeled categorical data; Partially labeled categorical decision information system; Outlier detection; Conditional information entropy; ALGORITHMS; CLUSTERS
DOI
10.1016/j.ijar.2023.109086
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Labeling a large amount of data is exceptionally costly and practically infeasible, and thus available data may have missing labels. In this article, we investigate outlier detection for partially labeled categorical data based on conditional information entropy. Firstly, the equivalence class in a partially labeled categorical decision information system (p-CDIS) is introduced, so that the missing labels can be predicted by use of conditional probability. Then, conditional information entropy in a p-CDIS is calculated, which provides a more comprehensive measure of uncertainty. Additionally, the relative information entropy and relative cardinality in a p-CDIS are proposed. Next, the degree of outlierness and the weight function are presented to find outlier factors. Finally, an outlier detection method in a p-CDIS based on conditional information entropy is proposed, and a corresponding conditional information entropy algorithm (CEOF) is designed. To evaluate the stability of the CEOF algorithm, experiments are performed on ten UCI Machine Learning Repository datasets. Compared with five other algorithms, the proposed method is shown to have good effectiveness and adaptability for categorical data.
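The first steps named in the abstract (equivalence classes, label prediction by conditional probability, conditional entropy) can be illustrated with a minimal sketch. This record does not give the paper's formulas, so the code below is an assumption throughout: it uses the most frequent known label within each equivalence class as the predicted label and the standard Shannon conditional entropy H(label | attributes). The function names are hypothetical, and this is not the CEOF algorithm itself.

```python
from collections import Counter, defaultdict
from math import log2


def equivalence_classes(rows, attrs):
    """Group row indices whose values agree on every attribute in attrs."""
    classes = defaultdict(list)
    for i, row in enumerate(rows):
        classes[tuple(row[a] for a in attrs)].append(i)
    return list(classes.values())


def predict_missing_labels(rows, labels, attrs):
    """Fill None labels with the most probable known label of the same class.

    Assumption: "most probable" means highest empirical frequency among the
    labeled objects of the equivalence class.
    """
    filled = list(labels)
    for cls in equivalence_classes(rows, attrs):
        known = Counter(labels[i] for i in cls if labels[i] is not None)
        if not known:
            continue  # no labeled object in this class; leave labels missing
        best = known.most_common(1)[0][0]
        for i in cls:
            if filled[i] is None:
                filled[i] = best
    return filled


def conditional_entropy(rows, labels, attrs):
    """Shannon H(label | attrs) in bits, computed over labeled objects only."""
    n = sum(1 for label in labels if label is not None)
    h = 0.0
    for cls in equivalence_classes(rows, attrs):
        counts = Counter(labels[i] for i in cls if labels[i] is not None)
        size = sum(counts.values())
        for c in counts.values():
            h -= (size / n) * (c / size) * log2(c / size)
    return h


# Toy data (made up for illustration): five objects, one missing label.
rows = [
    {"color": "red", "shape": "round"},
    {"color": "red", "shape": "round"},
    {"color": "red", "shape": "round"},
    {"color": "blue", "shape": "square"},
    {"color": "blue", "shape": "square"},
]
labels = ["a", "a", None, "b", "a"]
attrs = ["color", "shape"]

filled = predict_missing_labels(rows, labels, attrs)
entropy = conditional_entropy(rows, filled, attrs)
```

In this toy example the unlabeled third object inherits the label of its two labeled classmates, and only the mixed blue/square class contributes to the entropy. How the paper turns such quantities into relative entropy, relative cardinality, and outlier factors is not stated in this record and is not sketched here.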
Pages: 25