Machine Learning Model Generation With Copula-Based Synthetic Dataset for Local Differentially Private Numerical Data

Cited by: 8
Authors
Sei, Yuichi [1 ,2 ]
Onesimu, J. Andrew [3 ]
Ohsuga, Akihiko [1 ]
Affiliations
[1] Univ Electrocommun, Grad Sch Informat & Engn, Dept Informat, Chofu, Tokyo 1828585, Japan
[2] JST, PRESTO, Kawaguchi, Saitama 1020076, Japan
[3] Manipal Acad Higher Educ, Manipal Inst Technol, Dept Comp Sci & Engn, Manipal 576104, India
Funding
Japan Science and Technology Agency; Japan Society for the Promotion of Science
Keywords
Data models; Machine learning; Differential privacy; Decision trees; Numerical models; Machine learning algorithms; Generators; Data mining; Privacy; Data collection; Copula; data mining; decision trees; local differential privacy; machine learning; privacy-preserving data collection; DECISION TREE; SECURITY;
DOI
10.1109/ACCESS.2022.3208715
Chinese Library Classification
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
With the development of IoT technology, personal data are being collected in many places. These data can be used to create new services, but the privacy of individuals must be taken into account. Applying differential privacy allows personal data to be collected safely by adding noise to them. However, because such data are very noisy, the accuracy of machine learning models trained on them decreases greatly. In this study, our objective is to build a highly accurate machine learning model using these data. We focus on the decision tree machine learning algorithm and, instead of applying it as is, we use a preprocessing technique in which pseudodata are generated using a copula while the effect of the noise added by differential privacy is removed. In detail, the proposed novel protocol consists of three steps: generating a covariance matrix from the differentially private numerical data, generating a discrete cumulative distribution function from the differentially private numerical data, and generating copula-based numerical samples. Simulation results using synthetic and real datasets verify the utility of the proposed method not only for the decision tree algorithm but also for other machine learning algorithms such as deep neural networks. This method will help create machine learning models, such as recommendation systems, using differentially private data.
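The three steps named in the abstract can be illustrated with a minimal sketch. This is not the authors' algorithm: the abstract does not specify the noise mechanism, the noise-removal procedure, or the copula family. The sketch assumes a Laplace mechanism on values in [-1, 1], removes the known noise variance from the covariance diagonal, and uses a Gaussian copula. All function names are hypothetical, and the per-attribute bin probabilities (step 2) are taken as given input, since the abstract does not say how they are denoised.

```python
import math
import numpy as np


def laplace_perturb(x, epsilon, rng):
    """Assumed LDP mechanism: Laplace noise on values in [-1, 1].

    The sensitivity of one value is 2, so the noise scale is 2 / epsilon
    (epsilon is spent per attribute). Returns noisy data and noise variance.
    """
    scale = 2.0 / epsilon
    noisy = x + rng.laplace(0.0, scale, size=x.shape)
    return noisy, 2.0 * scale ** 2  # Var[Laplace(b)] = 2 b^2


def corrected_correlation(noisy, noise_var):
    """Step 1 (sketch): correlation matrix with the noise variance removed.

    Independent zero-mean noise inflates only the diagonal of the covariance,
    so the known noise variance is subtracted there.
    """
    cov = np.cov(noisy, rowvar=False)
    cov[np.diag_indices_from(cov)] -= noise_var
    var = np.clip(np.diag(cov), 1e-6, None)
    corr = np.clip(cov / np.sqrt(np.outer(var, var)), -1.0, 1.0)
    np.fill_diagonal(corr, 1.0)
    # Clip eigenvalues so the matrix is positive definite, then rescale
    # back to unit diagonal.
    w, v = np.linalg.eigh(corr)
    corr = (v * np.clip(w, 1e-6, None)) @ v.T
    d = np.sqrt(np.diag(corr))
    return corr / np.outer(d, d)


def sample_gaussian_copula(corr, edges, probs, n, rng):
    """Step 3 (sketch): draw samples whose dependence follows `corr` and whose
    marginals follow the discrete CDFs given by bin `edges` and `probs`."""
    z = rng.multivariate_normal(np.zeros(len(corr)), corr, size=n)
    # Standard normal CDF maps each column to Uniform(0, 1).
    u = 0.5 * (1.0 + np.vectorize(math.erf)(z / math.sqrt(2.0)))
    out = np.empty_like(u)
    for j in range(u.shape[1]):
        p = np.asarray(probs[j], dtype=float) + 1e-12  # avoid flat CDF segments
        cdf = np.concatenate([[0.0], np.cumsum(p)])
        cdf /= cdf[-1]
        # Inverse of the piecewise-linear CDF over the bin edges.
        out[:, j] = np.interp(u[:, j], cdf, edges[j])
    return out


# Toy data: two correlated attributes with values in (-1, 1).
rng = np.random.default_rng(0)
latent = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.8], [0.8, 1.0]], size=20000)
data = np.tanh(latent)

noisy, noise_var = laplace_perturb(data, epsilon=4.0, rng=rng)
corr = corrected_correlation(noisy, noise_var)

# Step 2 stand-in: histograms of the *true* data are used here only to keep
# the sketch short; a real protocol would estimate them from the noisy data.
edges, probs = [], []
for j in range(data.shape[1]):
    counts, e = np.histogram(data[:, j], bins=10, range=(-1.0, 1.0))
    edges.append(e)
    probs.append(counts / counts.sum())

synthetic = sample_gaussian_copula(corr, edges, probs, n=5000, rng=rng)
print("corrected correlation:", round(float(corr[0, 1]), 3))
print("synthetic correlation:", round(float(np.corrcoef(synthetic, rowvar=False)[0, 1]), 3))
```

A model such as a decision tree would then be trained on `synthetic` rather than on `noisy`. Using the Pearson correlation directly as the Gaussian copula parameter is itself an approximation made for brevity.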
Pages: 101656-101671
Number of pages: 16