Homophily outlier detection in non-IID categorical data

Cited by: 10
Authors
Pang, Guansong [1]
Cao, Longbing [1]
Chen, Ling [2]
Affiliations
[1] Univ Technol Sydney, Adv Analyt Inst, Sydney, NSW 2007, Australia
[2] Univ Technol Sydney, Ctr Artificial Intelligence, Sydney, NSW 2007, Australia
Funding
Australian Research Council
Keywords
Outlier detection; Feature selection; Non-IID learning; Categorical data; Homophily relation; Random walk; Coupling learning; COMPLEXITY-MEASURES; FEATURE-SELECTION; DATA SETS;
DOI
10.1007/s10618-021-00750-y
Chinese Library Classification
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Most existing outlier detection methods assume that the outlier factors (i.e., outlierness scoring measures) of data entities (e.g., feature values and data objects) are independent and identically distributed (IID). This assumption does not hold in real-world applications, where the outlierness of different entities is interdependent and/or drawn from different probability distributions (non-IID). As a result, these methods may fail to detect important outliers that are too subtle to identify without considering the non-IID nature of the data. The issue is intensified in more challenging contexts, e.g., high-dimensional data with many noisy features. This work introduces a novel outlier detection framework and two instances of it that identify outliers in categorical data by capturing non-IID outlier factors. The approach first defines distribution-sensitive outlier factors and incorporates them, together with their interdependence, into a value-value graph-based representation. It then models an outlierness propagation process in the value graph to learn the outlierness of feature values. The learned value outlierness allows for either direct outlier detection or outlying feature selection. The graph representation and mining approach is employed to capture the rich non-IID characteristics. Empirical results on 15 real-world data sets with different levels of data complexity show that (i) the proposed outlier detection methods significantly outperform five state-of-the-art methods at the 95%/99% confidence level, achieving 10-28% AUC improvement on the 10 most complex data sets; and (ii) the proposed feature selection methods significantly outperform three competing methods in enabling subsequent outlier detection with two different existing detectors.
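The pipeline the abstract describes (a value-value graph, outlierness propagation over it, then object scoring) can be illustrated with a small sketch. This is not the paper's algorithm or either of its two instances: the node factor `delta`, the conditional-probability edge weights, the damping factor `alpha`, and the function names `value_outlierness` and `object_scores` are all assumptions chosen only to make the idea concrete.

```python
from collections import Counter
from itertools import combinations


def value_outlierness(rows, alpha=0.95, iters=200):
    """Score each (feature index, value) node with a biased random walk."""
    n, d = len(rows), len(rows[0])
    freq = Counter((j, row[j]) for row in rows for j in range(d))
    nodes = list(freq)

    # Edges: co-occurrence counts between values of different features.
    co = Counter()
    for row in rows:
        for j, k in combinations(range(d), 2):
            co[(j, row[j]), (k, row[k])] += 1
            co[(k, row[k]), (j, row[j])] += 1

    # Assumed node factor: deviation from the feature's mode frequency,
    # averaged with how weakly the mode dominates that feature.
    mode = Counter()
    for (j, _), c in freq.items():
        mode[j] = max(mode[j], c)
    delta = {}
    for (j, v), c in freq.items():
        p_mode, p_val = mode[j] / n, c / n
        delta[(j, v)] = 0.5 * ((p_mode - p_val) / p_mode + (1.0 - p_mode))

    # Biased transitions: edge weight p(u | v) scaled by the target's factor,
    # so the walk drifts toward values that look outlying.
    trans = {}
    for u in nodes:
        w = {v: delta[v] * co[u, v] / freq[v] for v in nodes if co[u, v]}
        total = sum(w.values())
        trans[u] = {v: x / total for v, x in w.items()} if total > 0 else {}

    # Damped power iteration; nodes without usable edges spread mass uniformly.
    size = len(nodes)
    pi = {v: 1.0 / size for v in nodes}
    for _ in range(iters):
        nxt = {v: (1.0 - alpha) / size for v in nodes}
        for u in nodes:
            if trans[u]:
                for v, p in trans[u].items():
                    nxt[v] += alpha * p * pi[u]
            else:
                for v in nodes:
                    nxt[v] += alpha * pi[u] / size
        pi = nxt
    return pi


def object_scores(rows, pi):
    """Sum the learned value outlierness over each object's feature values."""
    return [sum(pi[(j, v)] for j, v in enumerate(row)) for row in rows]


# Toy data: the last row carries the rare value "b" in feature 0.
rows = [("a", "x")] * 5 + [("a", "y")] * 4 + [("b", "y")]
scores = object_scores(rows, value_outlierness(rows))
```

On this toy data the last row receives the highest score, because the walk concentrates on the rare value `"b"` and on `"y"`, the value it co-occurs with. Ranking features by the outlierness of their values would be one way to turn the same scores into feature selection, though that step is not shown here.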
Pages: 1163-1224
Number of pages: 62
Related papers
50 in total
  • [11] Federated Learning With Taskonomy for Non-IID Data
    Jamali-Rad, Hadi
    Abdizadeh, Mohammad
    Singh, Anuj
    IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2023, 34 (11) : 8719 - 8730
  • [12] Federated Learning With Non-IID Data: A Survey
    Lu, Zili
    Pan, Heng
    Dai, Yueyue
    Si, Xueming
    Zhang, Yan
    IEEE INTERNET OF THINGS JOURNAL, 2024, 11 (11) : 19188 - 19209
  • [13] A Survey of Federated Learning on Non-IID Data
    Han, Xuming
    Gao, Minghan
    Wang, Limin
    He, Zaobo
    Wang, Yanze
    ZTE Communications, 2022, 20 (03) : 17 - 26
  • [14] NeoLOD: A Novel Generalized Coupled Local Outlier Detection Model Embedded Non-IID Similarity Metric
    Meng, Fan
    Gao, Yang
    Huo, Jing
    Qi, Xiaolong
    Yi, Shichao
    ADVANCES IN KNOWLEDGE DISCOVERY AND DATA MINING, PAKDD 2019, PT I, 2019, 11439 : 587 - 599
  • [15] Non-IID Learning
    Cao, Longbing
    IEEE INTELLIGENT SYSTEMS, 2022, 37 (04) : 3 - 4
  • [16] Federated Learning-Based IoT Intrusion Detection on Non-IID Data
    Huang, Wenxuan
    Tiropanis, Thanassis
    Konstantinidis, George
    INTERNET OF THINGS, GIOTS 2022, 2022, 13533 : 326 - 337
  • [17] Gaussian Process Subset Scanning for Anomalous Pattern Detection in Non-iid Data
    Herlands, William
    McFowland, Edward, III
    Wilson, Andrew G.
    Neill, Daniel B.
    INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND STATISTICS, VOL 84, 2018, 84
  • [18] Private Data Synthesis from Decentralized Non-IID Data
    Saleem, Muhammad Usama
    Fan, Liyue
    2023 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS, IJCNN, 2023,
  • [19] Differentially private federated learning with non-IID data
    Cheng, Shuyan
    Li, Peng
    Wang, Ruchuan
    Xu, He
    COMPUTING, 2024, 106 (07) : 2459 - 2488
  • [20] Data augmentation scheme for federated learning with non-IID data
    Tang L.
    Wang D.
    Liu S.
    Tongxin Xuebao/Journal on Communications, 2023, 44 (01) : 164 - 176