Privacy Protection Practice for Data Mining with Multiple Data Sources: An Example with Data Clustering

Cited by: 2
Authors
O'Shaughnessy, Pauline [1]
Lin, Yan-Xia [1]
Affiliations
[1] Univ Wollongong, Sch Math & Appl Stat, Wollongong, NSW 2522, Australia
Keywords
data masking; multiplicative noise; data mining; sample size calculation
DOI
10.3390/math10244744
Chinese Library Classification (CLC) number
O1 [Mathematics]
Discipline classification code
0701; 070101
Abstract
In the age of data, data mining provides feasible tools with which to handle large datasets consisting of data from multiple sources. However, there is limited research on retrieving statistical information from data when data are confidential and cannot be shared directly. In this paper, we address this problem and propose a framework for performing data analysis using data from multiple sources without revealing true values for privacy purposes. The proposed framework includes three steps. First, data custodians individually mask data before publishing; then, the masked data collection is used to reconstruct the density function of the original dataset, from which resampled values are generated; last, existing data mining techniques are applied directly to the resampled data. This framework utilises the technique of reconstructing an original density function from noise-masked data using the moment-based density estimation method, which plays an essential role. Simulation studies show that the proposed framework performs well; analysis results from the resampled data are comparable to those of the original data when the density of the original data is estimated well. The proposed framework is demonstrated in data clustering analysis using the example of a real-life Australian soybean dataset. Results from the k-means algorithms with two and three fitted clusters are presented to show that cluster analysis using resampled data can well replicate that of the original data.
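The abstract names multiplicative noise masking and moment-based density estimation but gives no formulas. As a minimal sketch of the idea underlying the second step, assume masked values are y = x * c, where the noise c is independent of x and its distribution is public. Then E[X^k] = E[Y^k] / E[C^k], so moments of the original data can be estimated from masked data alone. The uniform noise range, the synthetic gamma data, and all function names below are illustrative assumptions, not the authors' choices, and the sketch stops at moment recovery: the density reconstruction, resampling, and k-means steps are not shown.

```python
import numpy as np

def mask(x, rng, low=0.5, high=1.5):
    """Multiply each value by independent Uniform(low, high) noise (assumed noise model)."""
    noise = rng.uniform(low, high, size=x.shape)
    return x * noise

def uniform_moment(k, low=0.5, high=1.5):
    """k-th raw moment of Uniform(low, high)."""
    return (high ** (k + 1) - low ** (k + 1)) / ((k + 1) * (high - low))

def recovered_moments(y, max_order, low=0.5, high=1.5):
    """Estimate E[X^k] for k = 1..max_order from masked y, using E[X^k] = E[Y^k] / E[C^k]."""
    return [float(np.mean(y ** k)) / uniform_moment(k, low, high)
            for k in range(1, max_order + 1)]

rng = np.random.default_rng(0)
x = rng.gamma(shape=4.0, scale=2.0, size=200_000)  # synthetic stand-in for confidential data
y = mask(x, rng)                                   # what a data custodian would publish

estimated = recovered_moments(y, max_order=3)
actual = [float(np.mean(x ** k)) for k in range(1, 4)]  # only computable by the custodian
```

Here `estimated` is computed from the masked values only, while `actual` needs the confidential values and is included purely for comparison. Higher-order moments are estimated less precisely, so in practice the usable order depends on sample size.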
Pages: 13