Machine Learning Model Generation With Copula-Based Synthetic Dataset for Local Differentially Private Numerical Data

Citations: 8
Authors
Sei, Yuichi [1, 2]
Onesimu, J. Andrew [3 ]
Ohsuga, Akihiko [1 ]
Affiliations
[1] Univ Electrocommun, Grad Sch Informat & Engn, Dept Informat, Chofu, Tokyo 1828585, Japan
[2] JST, PRESTO, Kawaguchi, Saitama 1020076, Japan
[3] Manipal Acad Higher Educ, Manipal Inst Technol, Dept Comp Sci & Engn, Manipal 576104, India
Funding
Japan Science and Technology Agency; Japan Society for the Promotion of Science
Keywords
Data models; Machine learning; Differential privacy; Decision trees; Numerical models; Machine learning algorithms; Generators; Data mining; Privacy; Data collection; Copula; local differential privacy; privacy-preserving data collection; SECURITY
DOI
10.1109/ACCESS.2022.3208715
Chinese Library Classification number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
With the development of IoT technology, personal data are being collected in many places. These data can be used to create new services, but the privacy of individuals must be considered. By applying differential privacy, personal data can be collected safely with noise added. However, because such data are very noisy, the accuracy of machine learning models trained on them decreases greatly. In this study, our objective is to build a highly accurate machine learning model from these data. We focus on the decision tree machine learning algorithm and, instead of applying it as is, use a preprocessing technique in which pseudodata are generated using a copula while removing the effect of the noise added by differential privacy. In detail, the proposed novel protocol consists of three steps: generating a covariance matrix from the differentially private numerical data, generating a discrete cumulative distribution function from the differentially private numerical data, and generating copula-based numerical samples. Simulation results on synthetic and real datasets verify the utility of the proposed method not only for the decision tree algorithm but also for other machine learning algorithms such as deep neural networks. This method will help create machine learning models, such as recommendation systems, from differentially private data.
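The three steps named in the abstract can be illustrated with a rough Python sketch. It is not the authors' protocol, whose details the abstract does not give. It assumes that each user perturbs values in [-1, 1] with the Laplace mechanism under a per-attribute budget `eps`, that the dependence estimate is corrected by subtracting the known noise variance 2b^2 from each diagonal entry of the noisy covariance, and that a Gaussian copula is used. The paper's discrete-CDF denoising step is not reproduced; clipping the noisy values to the domain is only a crude stand-in for the marginals. All function names and parameters are hypothetical.

```python
import numpy as np
from statistics import NormalDist


def laplace_perturb(x, eps, rng, lo=-1.0, hi=1.0):
    """Add Laplace noise with scale b = (hi - lo) / eps to every value (assumed mechanism)."""
    scale = (hi - lo) / eps
    return x + rng.laplace(0.0, scale, size=x.shape), scale


def corrected_correlation(noisy, scale):
    """Estimate the correlation of the original data from noisy reports.

    Independent Laplace noise with scale b inflates each variance by 2 * b**2
    and leaves the covariances unbiased, so only the diagonal is corrected.
    """
    cov = np.cov(noisy, rowvar=False)
    var = np.clip(np.diag(cov) - 2.0 * scale ** 2, 1e-6, None)
    corr = cov / np.sqrt(np.outer(var, var))
    np.fill_diagonal(corr, 1.0)
    # Sampling noise can break positive definiteness: clip eigenvalues, rescale.
    w, v = np.linalg.eigh(corr)
    corr = (v * np.clip(w, 1e-6, None)) @ v.T
    d = np.sqrt(np.diag(corr))
    return corr / np.outer(d, d)


def copula_sample(corr, marginals, n, rng):
    """Draw n synthetic rows from a Gaussian copula with empirical marginals.

    marginals: one 1-D array per attribute, used as an empirical quantile function.
    """
    z = rng.multivariate_normal(np.zeros(len(corr)), corr, size=n)
    u = np.vectorize(NormalDist().cdf)(z)  # map normal draws to uniforms on [0, 1]
    cols = [np.quantile(m, u[:, j]) for j, m in enumerate(marginals)]
    return np.column_stack(cols)


# Illustrative run on made-up data: two attributes with true correlation 0.8.
rng = np.random.default_rng(0)
true = np.clip(
    rng.multivariate_normal([0, 0], [[0.09, 0.072], [0.072, 0.09]], size=20000),
    -1, 1,
)
noisy, b = laplace_perturb(true, eps=4.0, rng=rng)
corr = corrected_correlation(noisy, b)
# Stand-in marginals (assumption): noisy values clipped to the known domain.
marginals = [np.clip(noisy[:, j], -1, 1) for j in range(noisy.shape[1])]
synthetic = copula_sample(corr, marginals, n=1000, rng=rng)
```

The resulting `synthetic` array could then be passed to any learner, such as a decision tree. How close the trained model comes to one trained on the original data depends on the marginal estimation that this sketch leaves out.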
Pages: 101656-101671
Number of pages: 16