Privacy Protection Practice for Data Mining with Multiple Data Sources: An Example with Data Clustering

Cited by: 2

Authors:

O'Shaughnessy, Pauline ^{[1]}

Lin, Yan-Xia ^{[1]}

Affiliations:

[1] Univ Wollongong, Sch Math & Appl Stat, Wollongong, NSW 2522, Australia

Source:

MATHEMATICS | 2022 / Vol. 10 / Issue 24

Keywords:

data masking; multiplicative noise; data mining; sample size calculation;

DOI:

10.3390/math10244744

Chinese Library Classification:

O1 [Mathematics];

Discipline classification codes:

0701 ; 070101 ;

Abstract:

In the age of data, data mining provides feasible tools with which to handle large datasets consisting of data from multiple sources. However, there is limited research on retrieving statistical information from data when data are confidential and cannot be shared directly. In this paper, we address this problem and propose a framework for performing data analysis using data from multiple sources without revealing true values for privacy purposes. The proposed framework includes three steps. First, data custodians individually mask data before publishing; then, the masked data collection is used to reconstruct the density function of the original dataset, from which resampled values are generated; last, existing data mining techniques are applied directly to the resampled data. This framework utilises the technique of reconstructing an original density function from noise-masked data using the moment-based density estimation method, which plays an essential role. Simulation studies show that the proposed framework performs well; analysis results from the resampled data are comparable to those of the original data when the density of the original data is estimated well. The proposed framework is demonstrated in data clustering analysis using the example of a real-life Australian soybean dataset. Results from the k-means algorithms with two and three fitted clusters are presented to show that cluster analysis using resampled data can well replicate that of the original data.
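The moment-based reconstruction mentioned in the abstract relies on a standard identity: if a value X is masked as Y = X * C with noise C independent of X, then E[X^k] = E[Y^k] / E[C^k]. The following is a minimal sketch of that identity only, not the authors' implementation. The uniform noise distribution, its range, the simulated data, and all function names are assumptions chosen for illustration, and the later steps (density reconstruction, resampling, k-means) are not shown.

```python
import random


def mask(values, rng, low=0.5, high=1.5):
    """Mask each value with independent multiplicative noise C ~ U(low, high).

    The uniform noise and its range are illustrative assumptions.
    """
    return [x * rng.uniform(low, high) for x in values]


def uniform_raw_moment(k, low=0.5, high=1.5):
    """Closed-form k-th raw moment E[C^k] of C ~ U(low, high)."""
    return (high ** (k + 1) - low ** (k + 1)) / ((k + 1) * (high - low))


def sample_moment(values, k):
    """Empirical k-th raw moment of a list of numbers."""
    return sum(v ** k for v in values) / len(values)


def recover_moments(masked, orders=(1, 2, 3), low=0.5, high=1.5):
    """Estimate E[X^k] from masked data via E[Y^k] / E[C^k]."""
    return {
        k: sample_moment(masked, k) / uniform_raw_moment(k, low, high)
        for k in orders
    }


rng = random.Random(0)
# Simulated "confidential" values; a stand-in, not data from the paper.
original = [rng.gauss(10.0, 2.0) for _ in range(50_000)]
masked = mask(original, rng)

estimated = recover_moments(masked)
true_moments = {k: sample_moment(original, k) for k in (1, 2, 3)}

for k in (1, 2, 3):
    print(k, round(true_moments[k], 2), round(estimated[k], 2))
```

Only the masked values and the known noise distribution are used to estimate the moments; the original values appear here solely to check the estimates. A density estimate built from such recovered moments would then be the input to the resampling step described in the abstract.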

Pages: 13

50 items in total

[31] Medical privacy versus data mining
Farkas, C
Valtorta, M
Fenner, S
WORLD MULTICONFERENCE ON SYSTEMICS, CYBERNETICS AND INFORMATICS, VOL XVII, PROCEEDINGS: CYBERNETICS AND INFORMATICS: CONCEPTS AND APPLICATIONS (PT II), 2001, : 194 - 199
[32] A Review on Privacy Preserving Data Mining
Shanthi, A. S.
Karthikeyan, M.
2012 IEEE INTERNATIONAL CONFERENCE ON COMPUTATIONAL INTELLIGENCE AND COMPUTING RESEARCH (ICCIC), 2012, : 438 - 441
[33] A Survey on Privacy Preserving Data Mining
Saranya, K.
Premalatha, K.
Rajasekar, S. S.
2015 2ND INTERNATIONAL CONFERENCE ON ELECTRONICS AND COMMUNICATION SYSTEMS (ICECS), 2015, : 1740 - U2102
[34] A Survey on Data Mining Methods for Clustering Complex Spatiotemporal Data
Maciag, Piotr S.
BEYOND DATABASES, ARCHITECTURES AND STRUCTURES: TOWARDS EFFICIENT SOLUTIONS FOR DATA ANALYSIS AND KNOWLEDGE REPRESENTATION, 2017, 716 : 115 - 126
[35] Data Mining and Privacy of Social Network Sites' Users: Implications of the Data Mining Problem
Al-Saggaf, Yeslam
Islam, Md Zahidul
SCIENCE AND ENGINEERING ETHICS, 2015, 21 (04) : 941 - 966
[36] SecEDMO: Enabling Efficient Data Mining with Strong Privacy Protection in Cloud Computing
Wu, Jiahui
Mu, Nankun
Lei, Xinyu
Le, Junqing
Zhang, Di
Liao, Xiaofeng
IEEE TRANSACTIONS ON CLOUD COMPUTING, 2022, 10 (01) : 691 - 705
[37] Construction of a network intelligence platform for privacy protection and integrated big data mining
Chen S.
Wang Q.
Guo Y.
Journal of Intelligent and Fuzzy Systems, 2024, 46 (04) : 11205 - 11217
[38] Empirical asset pricing based on network big data mining and privacy protection
Xiaoxiang Xu
Neural Computing and Applications, 2025, 37 (12) : 7841 - 7855
[39] Data Mining and Privacy of Social Network Sites’ Users: Implications of the Data Mining Problem
Yeslam Al-Saggaf
Md Zahidul Islam
Science and Engineering Ethics, 2015, 21 : 941 - 966
[40] Privacy preserving data mining of sequential patterns for network traffic data
Kim, Seung-Woo
Park, Sanghyun
Won, Jung-Im
Kim, Sang-Wook
INFORMATION SCIENCES, 2008, 178 (03) : 694 - 713