Machine Learning Model Generation With Copula-Based Synthetic Dataset for Local Differentially Private Numerical Data

Cited by: 8
Authors
Sei, Yuichi [1, 2]
Onesimu, J. Andrew [3]
Ohsuga, Akihiko [1]
Affiliations
[1] Univ Electrocommun, Grad Sch Informat & Engn, Dept Informat, Chofu, Tokyo 1828585, Japan
[2] JST, PRESTO, Kawaguchi, Saitama 1020076, Japan
[3] Manipal Acad Higher Educ, Manipal Inst Technol, Dept Comp Sci & Engn, Manipal 576104, India
Funding
Japan Science and Technology Agency; Japan Society for the Promotion of Science
Keywords
Data models; Machine learning; Differential privacy; Decision trees; Numerical models; Machine learning algorithms; Generators; Data mining; Privacy; Data collection; Copula; data mining; decision trees; local differential privacy; machine learning; privacy-preserving data collection; DECISION TREE; SECURITY;
DOI
10.1109/ACCESS.2022.3208715
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Subject classification code
0812
Abstract
With the development of IoT technology, personal data are being collected in many places. These data can be used to create new services, but consideration must be given to individuals' privacy. By applying differential privacy, personal data can be collected safely with noise added. However, because such data are very noisy, the accuracy of machine learning models trained on them decreases greatly. In this study, our objective is to build a highly accurate machine learning model using these data. We focus on the decision tree machine learning algorithm and, instead of applying it as is, we use a preprocessing technique wherein pseudodata are generated using a copula while removing the effect of the noise added by differential privacy. In detail, the proposed protocol consists of three steps: generating a covariance matrix from the differentially private numerical data, generating a discrete cumulative distribution function from the differentially private numerical data, and generating copula-based numerical samples. Simulation results using synthetic and real datasets verify the utility of the proposed method not only for the decision tree algorithm but also for other machine learning algorithms such as deep neural networks. This method will help create machine learning models, such as recommendation systems, using differentially private data.
Pages: 101656-101671
Page count: 16
Related papers
50 in total
  • [41] Predicting extreme surges from sparse data using a copula-based hierarchical Bayesian spatial model
    Beck, N.
    Genest, C.
    Jalbert, J.
    Mailhot, M.
    ENVIRONMETRICS, 2020, 31 (05)
  • [42] Machine Learning Based Distributed Big Data Analysis Framework for Next Generation Web in IoT
    Singh, Sushil Kumar
    Cha, Jeonghun
    Kim, Tae Woo
    Park, Jong Hyuk
    COMPUTER SCIENCE AND INFORMATION SYSTEMS, 2021, 18 (02) : 597 - 618
  • [43] First Steps Toward Synthetic Sample Generation for Machine Learning Based Flare Forecasting
    Hostetter, Maxwell
    Angryk, Rafal A.
    2020 IEEE INTERNATIONAL CONFERENCE ON BIG DATA (BIG DATA), 2020, : 4208 - 4217
  • [44] Machine Learning Model for Chest Radiographs: Using Local Data to Enhance Performance
    Mohn, Sarah F.
    Law, Marco
    Koleva, Maria
    Lee, Brian
    Berg, Adam
    Murray, Nicolas
    Nicolaou, Savvas
    Parker, William A.
    CANADIAN ASSOCIATION OF RADIOLOGISTS JOURNAL-JOURNAL DE L ASSOCIATION CANADIENNE DES RADIOLOGISTES, 2023, 74 (03): : 548 - 556
  • [45] Training data selection based on dataset distillation for rapid deployment in machine-learning workflows
    Jeong, Yuna
    Hwang, Myunggwon
    Sung, Wonkyung
    MULTIMEDIA TOOLS AND APPLICATIONS, 2023, 82 (07) : 9855 - 9870
  • [46] Combining Synthetic and Observed Data to Enhance Machine Learning Model Performance for Streamflow Prediction
    Lopez-Chacon, Sergio Ricardo
    Salazar, Fernando
    Blade, Ernest
    WATER, 2023, 15 (11)
  • [47] Training data selection based on dataset distillation for rapid deployment in machine-learning workflows
    Yuna Jeong
    Myunggwon Hwang
    Wonkyung Sung
    Multimedia Tools and Applications, 2023, 82 : 9855 - 9870
  • [48] Research on Invalid Detection Data Model of Mine Catalytic Sensors Based on Machine Learning
    Wang, Bowen
    IEEE SENSORS JOURNAL, 2023, 23 (03) : 1925 - 1932
  • [49] Machine learning predictive model based on national data for fatal accidents of construction workers
    Choi, Jongko
    Gu, Bonsung
    Chin, Sangyoon
    Lee, Jong-Seok
    AUTOMATION IN CONSTRUCTION, 2020, 110
  • [50] A copula-based fuzzy chance-constrained programming model and its application to electric power generation systems planning
    Chen, F.
    Huang, G. H.
    Fan, Y. R.
    Chen, J. P.
    APPLIED ENERGY, 2017, 187 : 291 - 309