An Empirical Analysis of Synthetic-Data-Based Anomaly Detection

Cited by: 4
Authors
Llugiqi, Majlinda [1, 2]
Mayer, Rudolf [1, 2]
Affiliations
[1] Vienna Univ Technol, Vienna, Austria
[2] SBA Res, Vienna, Austria
Source
MACHINE LEARNING AND KNOWLEDGE EXTRACTION, CD-MAKE 2022 | 2022 / Vol. 13480
Keywords
Anomaly detection; Synthetic data; Privacy preserving; Machine learning;
DOI
10.1007/978-3-031-14463-9_20
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104; 0812; 0835; 1405;
Abstract
Data is increasingly collected on practically every area of human life, from health care to financial or work aspects, and from many different sources. As the amount of data gathered grows, efforts to leverage it have intensified. Many organizations are interested in analysing or sharing the data they collect, as it may be used to provide critical services and support much-needed research. However, this often conflicts with data protection regulations. Thus, there is a need to share, analyse and work with such sensitive data while preserving the privacy of the individuals represented in it. Synthetic data generation is one method increasingly used to achieve this goal. Synthetic data would also be useful for anomaly detection tasks, which often involve highly sensitive data. While synthetic data generation aims to capture the most relevant statistical properties of a dataset in order to create a dataset with similar characteristics, it is less explored whether this method can also capture the properties of anomalous data, which is generally a minority class with potentially very few samples, and can thus reproduce meaningful anomaly instances. In this paper, we perform an extensive study of several anomaly detection techniques (supervised, unsupervised and semi-supervised) on credit card fraud and medical (annthyroid) data, and evaluate the utility of corresponding synthetically generated datasets obtained by various synthetisation methods. Moreover, for supervised methods, we also investigate various sampling methods; sampling on average improves the results, and we show that this also transfers to detectors learned on synthetic data. Overall, our evaluation shows that models trained on synthetic data can achieve a performance that renders them a viable alternative to models trained on real data, sometimes even outperforming them. Based on the evaluation, we provide guidelines on which synthesizer method to use for which anomaly detection setting.
Pages: 306-327
Number of pages: 22