Machine Learning Model Generation With Copula-Based Synthetic Dataset for Local Differentially Private Numerical Data

Cited by: 8
Authors
Sei, Yuichi [1 ,2 ]
Onesimu, J. Andrew [3 ]
Ohsuga, Akihiko [1 ]
Affiliations
[1] Univ Electrocommun, Grad Sch Informat & Engn, Dept Informat, Chofu, Tokyo 1828585, Japan
[2] JST, PRESTO, Kawaguchi, Saitama 1020076, Japan
[3] Manipal Acad Higher Educ, Manipal Inst Technol, Dept Comp Sci & Engn, Manipal 576104, India
Funding
Japan Science and Technology Agency; Japan Society for the Promotion of Science
Keywords
Data models; Machine learning; Differential privacy; Decision trees; Numerical models; Machine learning algorithms; Generators; Data mining; Privacy; Data collection; Copula; Local differential privacy; Privacy-preserving data collection; Security
DOI
10.1109/ACCESS.2022.3208715
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
With the development of IoT technology, personal data are being collected in many places. These data can be used to create new services, but consideration must be given to individuals' privacy. By applying differential privacy, personal data can be collected safely because noise is added to them. However, because such data are very noisy, the accuracy of machine learning models trained on them decreases greatly. In this study, our objective is to build a highly accurate machine learning model using these data. We focus on the decision tree machine learning algorithm and, instead of applying it as is, use a preprocessing technique in which pseudodata are generated using a copula while the effect of the noise added by differential privacy is removed. In detail, the proposed novel protocol consists of three steps: generating a covariance matrix from the differentially private numerical data, generating a discrete cumulative distribution function from the differentially private numerical data, and generating copula-based numerical samples. Simulation results using synthetic and real datasets verify the utility of the proposed method not only for the decision tree algorithm but also for other machine learning algorithms such as deep neural networks. This method will help create machine learning models, such as recommendation systems, using differentially private data.
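The three steps named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not specify the noise mechanism, the CDF estimator, or the copula family, so the Laplace mechanism, the histogram deconvolution via non-negative least squares, and the Gaussian copula used here are all assumptions, as are the function names `perturb`, `denoised_correlation`, `estimate_marginal`, and `sample_copula`. Using the denoised Pearson correlation as the copula parameter is itself a simplification.

```python
import numpy as np
from scipy.optimize import nnls
from scipy.stats import laplace, norm


def perturb(x, eps, rng):
    """Client side (assumed mechanism): Laplace noise on values in [-1, 1]."""
    d = x.shape[1]
    scale = 2.0 * d / eps  # sensitivity 2 per attribute, budget eps/d each
    return x + rng.laplace(0.0, scale, size=x.shape), scale


def denoised_correlation(y, scale):
    """Step 1: covariance of noisy data with the Laplace variance removed."""
    cov = np.cov(y, rowvar=False)
    # Independent noise only inflates the diagonal, by Var[Laplace] = 2 * scale^2.
    cov[np.diag_indices_from(cov)] -= 2.0 * scale**2
    var = np.clip(np.diag(cov), 1e-6, None)
    corr = np.clip(cov / np.sqrt(np.outer(var, var)), -1.0, 1.0)
    np.fill_diagonal(corr, 1.0)
    # Project onto positive definite matrices, then restore the unit diagonal.
    w, v = np.linalg.eigh(corr)
    corr = (v * np.clip(w, 1e-6, None)) @ v.T
    s = np.sqrt(np.diag(corr))
    return corr / np.outer(s, s)


def estimate_marginal(y_col, scale, bins=10):
    """Step 2: discrete CDF on [-1, 1] via histogram deconvolution (assumed)."""
    centers = np.linspace(-1.0, 1.0, 2 * bins + 1)[1::2]
    edges = np.linspace(-1.0 - 5 * scale, 1.0 + 5 * scale, 4 * bins + 1)
    counts, _ = np.histogram(y_col, bins=edges)
    observed = counts / len(y_col)
    # a[j, i] = P(noisy value lands in output bin j | true value = centers[i])
    a = np.diff(laplace.cdf(edges[:, None], loc=centers[None, :], scale=scale), axis=0)
    p, _ = nnls(a, observed)
    p = p / p.sum() if p.sum() > 0 else np.full(bins, 1.0 / bins)
    return centers, np.cumsum(p)


def sample_copula(corr, marginals, n, rng):
    """Step 3: Gaussian-copula samples mapped through the inverse discrete CDFs."""
    z = rng.multivariate_normal(np.zeros(corr.shape[0]), corr, size=n)
    u = norm.cdf(z)
    out = np.empty_like(u)
    for j, (centers, cdf) in enumerate(marginals):
        idx = np.searchsorted(cdf, u[:, j], side="left")
        out[:, j] = centers[np.minimum(idx, len(centers) - 1)]
    return out


# Toy run on correlated data clipped to [-1, 1] (illustrative values only).
rng = np.random.default_rng(0)
z = rng.multivariate_normal([0, 0], [[1, 0.8], [0.8, 1]], size=20000)
x = np.clip(0.4 * z, -1.0, 1.0)
y, scale = perturb(x, eps=4.0, rng=rng)
corr = denoised_correlation(y, scale)
marginals = [estimate_marginal(y[:, j], scale) for j in range(y.shape[1])]
synthetic = sample_copula(corr, marginals, n=5000, rng=rng)
```

The resulting `synthetic` array could then be passed to any standard learner, for example a decision tree, in place of the noisy reports; how well this works depends on the privacy budget and the number of users, which the sketch does not attempt to characterize.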
Pages: 101656 - 101671
Page count: 16
Related papers
50 in total
  • [31] Specifics of Data Collection and Data Processing during Formation of RailVista Dataset for Machine Learning- and Deep Learning-Based Applications
    Abisheva, Gulsipat
    Goranin, Nikolaj
    Razakhova, Bibigul
    Aidynov, Tolegen
    Satybaldina, Dina
    SENSORS, 2024, 24 (16)
  • [32] PRIVATE FL-GAN: DIFFERENTIAL PRIVACY SYNTHETIC DATA GENERATION BASED ON FEDERATED LEARNING
    Xin, Bangzhou
    Yang, Wei
    Geng, Yangyang
    Chen, Sheng
    Wang, Shaowei
    Huang, Liusheng
    2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 2927 - 2931
  • [33] Reinforcement-Learning-Based Query Optimization in Differentially Private IoT Data Publishing
    Jiang, Yili
    Zhang, Kuan
    Qian, Yi
    Zhou, Liang
    IEEE INTERNET OF THINGS JOURNAL, 2021, 8 (14) : 11163 - 11176
  • [34] Physically based synthetic image generation for machine learning - a review of pertinent literature
    Schraml, Dominik
    PHOTONICS AND EDUCATION IN MEASUREMENT SCIENCE, 2019, 11144
  • [35] Data Mining and Machine Learning Methods Applied to A Numerical Clinching Model
    Goetz, Marco
    Leichsenring, Ferenc
    Kropp, Thomas
    Muller, Peter
    Falk, Tobias
    Graf, Wolfgang
    Kaliske, Michael
    Drossel, Welf-Guntram
    CMES-COMPUTER MODELING IN ENGINEERING & SCIENCES, 2018, 117 (03) : 387 - 423
  • [36] Cost-based recommendation of parameters for local differentially private data aggregation
    Shahani, Snehkumar
    Venkateswaran, R.
    Abraham, Jibi
    COMPUTERS & SECURITY, 2021, 102
  • [37] Synthetic data at scale: a development model to efficiently leverage machine learning in agriculture
    Klein, Jonathan
    Waller, Rebekah
    Pirk, Soeren
    Palubicki, Wojtek
    Tester, Mark
    Michels, Dominik L.
    FRONTIERS IN PLANT SCIENCE, 2024, 15
  • [38] Mechanical Transmission Model and Numerical Simulation Based on Machine Learning
    Zhang, Pan
    INTERNATIONAL JOURNAL OF INFORMATION TECHNOLOGIES AND SYSTEMS APPROACH, 2023, 16 (02)
  • [39] Application of copula-based approach as a new data-driven model for downscaling the mean daily temperature
    Nazeri Tahroudi, Mohammad
    Ramezani, Yousef
    De Michele, Carlo
    Mirabbasi, Rasoul
    INTERNATIONAL JOURNAL OF CLIMATOLOGY, 2023, 43 (01) : 240 - 254
  • [40] Data-Centric Machine Learning: Improving Model Performance and Understanding Through Dataset Analysis
    Westermann, Hannes
    Savelka, Jaromir
    Walker, Vern R.
    Ashley, Kevin D.
    Benyekhlef, Karim
    LEGAL KNOWLEDGE AND INFORMATION SYSTEMS, 2021, 346 : 54 - 57