Differential Privacy High-Dimensional Data Publishing Based on Feature Selection and Clustering

Cited by: 2
Authors
Chu, Zhiguang [1, 2]
He, Jingsha [1]
Zhang, Xiaolei [2]
Zhang, Xing [2]
Zhu, Nafei [1]
Affiliations
[1] Beijing Univ Technol, Sch Software Engn, Beijing 100124, Peoples R China
[2] Key Lab Secur Network & Data Ind Internet Liaoning, Jinzhou 121000, Peoples R China
Keywords
high-dimensional data; feature selection; random forest; clustering; differential privacy; prediction; algorithm
DOI
10.3390/electronics12091959
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
High-dimensional data are a social information product, and their privacy and usability are core issues in the field of privacy protection. Feature selection is a commonly used dimensionality reduction technique for high-dimensional data. Some feature selection methods process only the features selected by the algorithm and ignore the information associated with those features, which lowers the usability of the final results. This paper proposes a hybrid method based on feature selection and cluster analysis to address the data utility and privacy problems that arise when high-dimensional data are published. The proposed method has three stages: (1) screening features; (2) clustering the features; and (3) adding adaptive noise. The experiments use the Wisconsin Diagnostic Breast Cancer (WDBC) dataset from the UCI Machine Learning Repository. With classification accuracy as the evaluation metric, the experiments show that the proposed algorithm protects the sensitive information in the original data while retaining the data's contribution to the diagnostic results.
Pages: 16
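The third stage named in the abstract, adaptive noise, can be illustrated with a minimal sketch. This is not the authors' algorithm: the abstract does not specify how noise is adapted, so the sketch assumes a total privacy budget is split across feature clusters in proportion to hypothetical importance weights, and that each bounded value is perturbed with the standard Laplace mechanism. The function names `split_budget`, `laplace_noise`, and `perturb`, the weights, and the value bounds are all assumptions made for illustration.

```python
import random


def split_budget(total_epsilon, weights):
    """Split a total privacy budget across groups in proportion to weights.

    Assumption: a larger weight means a larger share of the budget,
    and therefore less noise for that group.
    """
    total = sum(weights)
    return [total_epsilon * w / total for w in weights]


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def perturb(values, lower, upper, epsilon, rng):
    """Clip each value to [lower, upper], then add Laplace noise.

    A single value bounded in [lower, upper] has sensitivity
    upper - lower, so the noise scale is (upper - lower) / epsilon.
    """
    scale = (upper - lower) / epsilon
    return [min(max(v, lower), upper) + laplace_noise(scale, rng)
            for v in values]


rng = random.Random(0)
# Hypothetical importance weights for three feature clusters.
budgets = split_budget(1.0, [0.5, 0.3, 0.2])
# Perturb one (hypothetical) normalized feature with the first cluster's share.
noisy = perturb([0.2, 0.9, 1.4], 0.0, 1.0, budgets[0], rng)
```

Because the per-cluster budgets sum to the total, sequential composition bounds the overall privacy cost of releasing all perturbed features by `total_epsilon`. A smaller share means a larger noise scale, which is the sense in which the noise here is "adaptive".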