Homophily outlier detection in non-IID categorical data

Cited by: 10
Authors
Pang, Guansong [1]
Cao, Longbing [1]
Chen, Ling [2]
Affiliations
[1] Univ Technol Sydney, Adv Analyt Inst, Sydney, NSW 2007, Australia
[2] Univ Technol Sydney, Ctr Artificial Intelligence, Sydney, NSW 2007, Australia
Funding
Australian Research Council
Keywords
Outlier detection; Feature selection; Non-IID learning; Categorical data; Homophily relation; Random walk; Coupling learning; COMPLEXITY-MEASURES; FEATURE-SELECTION; DATA SETS;
DOI
10.1007/s10618-021-00750-y
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Most existing outlier detection methods assume that the outlier factors (i.e., outlierness scoring measures) of data entities (e.g., feature values and data objects) are independent and identically distributed (IID). This assumption does not hold in real-world applications, where the outlierness of different entities depends on each other and/or is drawn from different probability distributions (non-IID). As a result, such methods may fail to detect important outliers that are too subtle to identify without considering the non-IID nature of the data. The issue intensifies in more challenging contexts, e.g., high-dimensional data with many noisy features. This work introduces a novel outlier detection framework and two instances of it that identify outliers in categorical data by capturing non-IID outlier factors. Our approach first defines distribution-sensitive outlier factors and incorporates them, together with their interdependence, into a value-value graph-based representation. It then models an outlierness propagation process in the value graph to learn the outlierness of feature values. The learned value outlierness allows for either direct outlier detection or outlying feature selection. The graph representation and mining approach is employed here to capture the rich non-IID characteristics effectively. Our empirical results on 15 real-world data sets with different levels of data complexity show that (i) the proposed outlier detection methods significantly outperform five state-of-the-art methods at the 95%/99% confidence level, achieving 10-28% AUC improvement on the 10 most complex data sets; and (ii) the proposed feature selection methods significantly outperform three competing methods in enabling subsequent outlier detection by two different existing detectors.
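The pipeline outlined in the abstract (a value-value graph, followed by outlierness propagation over it) can be illustrated with a minimal Python sketch. This is not the authors' algorithm: the function names `value_outlierness` and `object_scores`, the mode-based node factor `delta`, the conditional-probability edge weights, the `damping` restart value, and the plain sum used to score objects are all assumptions chosen for illustration.

```python
from collections import Counter
from itertools import permutations


def value_outlierness(rows, damping=0.95, iters=200):
    """Return {(feature_index, value): score} from a biased random walk.

    Illustrative assumptions (not taken from the paper):
    - node factor delta(v): how far v's frequency falls below the mode of
      its feature, averaged with how weak that mode itself is;
    - edge weight u -> v: delta(v) * P(u | v), row-normalised;
    - uniform restart with probability 1 - damping.
    """
    n, d = len(rows), len(rows[0])
    freq = Counter((j, r[j]) for r in rows for j in range(d))
    co = Counter()  # co-occurrence counts of values from different features
    for r in rows:
        for j, k in permutations(range(d), 2):
            co[(j, r[j]), (k, r[k])] += 1
    nodes = list(freq)
    mode = [max(c for (j, _), c in freq.items() if j == f) for f in range(d)]

    def delta(node):
        p, pm = freq[node] / n, mode[node[0]] / n
        return 0.5 * ((pm - p) / pm + (1.0 - pm))

    # Biased transition probabilities; an empty dict marks a dangling node.
    trans = {}
    for u in nodes:
        w = {v: delta(v) * co[u, v] / freq[v] for v in nodes if co[u, v] > 0}
        total = sum(w.values())
        trans[u] = {v: x / total for v, x in w.items()} if total > 0 else {}

    # Power iteration towards the stationary distribution of the walk.
    pi = {v: 1.0 / len(nodes) for v in nodes}
    for _ in range(iters):
        nxt = {v: (1.0 - damping) / len(nodes) for v in nodes}
        for u in nodes:
            if trans[u]:
                for v, p in trans[u].items():
                    nxt[v] += damping * pi[u] * p
            else:  # dangling node: spread its mass uniformly
                for v in nodes:
                    nxt[v] += damping * pi[u] / len(nodes)
        pi = nxt
    return pi


def object_scores(rows, pi):
    """Score each object as the plain sum of its values' outlierness."""
    return [sum(pi[(j, v)] for j, v in enumerate(r)) for r in rows]


if __name__ == "__main__":
    data = [("a", "x")] * 7 + [("a", "y"), ("b", "x"), ("b", "y")]
    scores = object_scores(data, value_outlierness(data))
    ranking = sorted(range(len(data)), key=scores.__getitem__, reverse=True)
    print(ranking[:3])  # indices of the three highest-scoring objects
```

In this toy data set the last row combines the two rare values `b` and `y`, so under the assumptions above it should receive the highest score: rare values pass outlierness to the values they co-occur with, which is the homophily intuition the abstract describes.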
Pages: 1163-1224
Page count: 62
Related papers
50 in total
  • [1] Homophily outlier detection in non-IID categorical data
    Guansong Pang
    Longbing Cao
    Ling Chen
    Data Mining and Knowledge Discovery, 2021, 35 : 1163 - 1224
  • [2] Learning Homophily Couplings from Non-IID Data for Joint Feature Selection and Noise-Resilient Outlier Detection
    Pang, Guansong
    Cao, Longbing
    Chen, Ling
    Liu, Huan
    PROCEEDINGS OF THE TWENTY-SIXTH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, 2017, : 2585 - 2591
  • [3] Guest Editorial: Non-IID Outlier Detection in Complex Contexts
    Pang, Guansong
    Angiulli, Fabrizio
    Cucuringu, Mihai
    Liu, Huan
    IEEE INTELLIGENT SYSTEMS, 2021, 36 (03) : 3 - 4
  • [4] Unsupervised Coupled Metric Similarity for Non-IID Categorical Data
    Jian, Songlei
    Cao, Longbing
    Lu, Kai
    Gao, Hang
    IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2018, 30 (09) : 1810 - 1823
  • [5] Isolation Forest Based Anomaly Detection Framework on Non-IID Data
    Xiang, Haolong
    Wang, Jiayu
    Ramamohanarao, Kotagiri
    Salcic, Zoran
    Dou, Wanchun
    Zhang, Xuyun
    IEEE INTELLIGENT SYSTEMS, 2021, 36 (03) : 31 - 40
  • [6] Coupled Fuzzy k-Nearest Neighbors Classification of Imbalanced Non-IID Categorical Data
    Liu, Chunming
    Cao, Longbing
    Yu, Philip S.
    PROCEEDINGS OF THE 2014 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2014, : 1122 - 1129
  • [7] Efficient Split Learning with Non-iid Data
    Cai, Yuanqin
    Wei, Tongquan
    2022 23RD IEEE INTERNATIONAL CONFERENCE ON MOBILE DATA MANAGEMENT (MDM 2022), 2022, : 128 - 136
  • [8] Federated learning on non-IID data: A survey
    Zhu, Hangyu
    Xu, Jinjin
    Liu, Shiqing
    Jin, Yaochu
    NEUROCOMPUTING, 2021, 465 : 371 - 390
  • [9] ASYMPTOTICALLY OPTIMAL MULTISTAGE TESTS FOR NON-IID DATA
    Xing, Yiming
    Fellouris, Georgios
    STATISTICA SINICA, 2024, 34 (04) : 2325 - 2346
  • [10] Adaptive Federated Learning With Non-IID Data
    Zeng, Yan
    Mu, Yuankai
    Yuan, Junfeng
    Teng, Siyuan
    Zhang, Jilin
    Wan, Jian
    Ren, Yongjian
    Zhang, Yunquan
    COMPUTER JOURNAL, 2023, 66 (11) : 2758 - 2772