Privacy Protection Practice for Data Mining with Multiple Data Sources: An Example with Data Clustering

Cited by: 2
Authors
O'Shaughnessy, Pauline [1]
Lin, Yan-Xia [1]
Affiliations
[1] Univ Wollongong, Sch Math & Appl Stat, Wollongong, NSW 2522, Australia
Keywords
data masking; multiplicative noise; data mining; sample size calculation
DOI
10.3390/math10244744
Chinese Library Classification
O1 [Mathematics]
Discipline classification codes
0701; 070101
Abstract
In the age of data, data mining provides feasible tools with which to handle large datasets consisting of data from multiple sources. However, there is limited research on retrieving statistical information from data that are confidential and cannot be shared directly. In this paper, we address this problem and propose a framework for analysing data from multiple sources without revealing the true values, so that privacy is preserved. The proposed framework has three steps. First, data custodians individually mask their data before publishing. Second, the collection of masked data is used to reconstruct the density function of the original dataset, from which resampled values are generated. Third, existing data mining techniques are applied directly to the resampled data. The framework relies on reconstructing the original density function from noise-masked data using the moment-based density estimation method. Simulation studies show that the proposed framework performs well: analysis results from the resampled data are comparable to those from the original data when the density of the original data is estimated well. The framework is demonstrated in a data clustering analysis of a real-life Australian soybean dataset. Results from the k-means algorithm with two and three fitted clusters show that cluster analysis on the resampled data closely replicates that on the original data.
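The masking step and the moment recovery that a moment-based reconstruction depends on can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes each custodian publishes Y = X * C, where the noise C is independent of X and its distribution is public; under that assumption E[X^k] = E[Y^k] / E[C^k]. The Uniform(0.5, 1.5) noise, the simulated data, and all function names are hypothetical choices made for illustration, and the sketch stops at the recovered moments rather than building the density estimate.

```python
import random


def mask(values, low=0.5, high=1.5, rng=None):
    """Mask each value with independent multiplicative Uniform(low, high) noise.

    The uniform noise and its bounds are assumptions made for this sketch.
    """
    rng = rng or random.Random()
    return [x * rng.uniform(low, high) for x in values]


def uniform_raw_moment(k, low, high):
    """k-th raw moment of Uniform(low, high)."""
    return (high ** (k + 1) - low ** (k + 1)) / ((k + 1) * (high - low))


def sample_raw_moment(values, k):
    """Empirical k-th raw moment of a sample."""
    return sum(v ** k for v in values) / len(values)


def estimate_original_moments(masked, max_order, low=0.5, high=1.5):
    """Estimate E[X^k] for k = 1..max_order from masked values only.

    Uses E[X^k] = E[Y^k] / E[C^k], valid when Y = X * C with X and C independent.
    """
    return [
        sample_raw_moment(masked, k) / uniform_raw_moment(k, low, high)
        for k in range(1, max_order + 1)
    ]


# Simulated confidential data (hypothetical): 50,000 draws from N(10, 2^2).
rng = random.Random(0)
original = [rng.gauss(10.0, 2.0) for _ in range(50_000)]
masked = mask(original, rng=rng)

estimated = estimate_original_moments(masked, max_order=3)
true_moments = [sample_raw_moment(original, k) for k in range(1, 4)]

for k, (est, tru) in enumerate(zip(estimated, true_moments), start=1):
    print(f"order {k}: estimated {est:.2f}, from original data {tru:.2f}")
```

Only the masked values and the noise distribution are used to obtain the estimates; the original values appear solely for comparison. In the framework described above, such estimated moments would feed a density estimator, and resampled values drawn from that density would then be passed to a standard clustering routine.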
Pages: 13