Privacy Protection Practice for Data Mining with Multiple Data Sources: An Example with Data Clustering

Cited by: 2

Authors:

O'Shaughnessy, Pauline [1]

Lin, Yan-Xia [1]

Affiliations:

[1] Univ Wollongong, Sch Math & Appl Stat, Wollongong, NSW 2522, Australia

Source:

MATHEMATICS | 2022, Vol. 10, Issue 24

Keywords:

data masking; multiplicative noise; data mining; sample size calculation

DOI:

10.3390/math10244744

Chinese Library Classification:

O1 [Mathematics]

Discipline classification codes:

0701; 070101

Abstract:

In the age of data, data mining provides feasible tools with which to handle large datasets consisting of data from multiple sources. However, there is limited research on retrieving statistical information from data when data are confidential and cannot be shared directly. In this paper, we address this problem and propose a framework for performing data analysis using data from multiple sources without revealing true values for privacy purposes. The proposed framework includes three steps. First, data custodians individually mask data before publishing; then, the masked data collection is used to reconstruct the density function of the original dataset, from which resampled values are generated; last, existing data mining techniques are applied directly to the resampled data. This framework utilises the technique of reconstructing an original density function from noise-masked data using the moment-based density estimation method, which plays an essential role. Simulation studies show that the proposed framework performs well; analysis results from the resampled data are comparable to those of the original data when the density of the original data is estimated well. The proposed framework is demonstrated in data clustering analysis using the example of a real-life Australian soybean dataset. Results from the k-means algorithms with two and three fitted clusters are presented to show that cluster analysis using resampled data can well replicate that of the original data.
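The masking step and the moment relation behind the density reconstruction can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the noise C is independent of the confidential values X and follows Uniform(0.5, 1.5), that the data are simulated gamma draws, and that all function names are hypothetical. Under independence, E[Y^k] = E[X^k] E[C^k] for the masked values Y = X * C, so the moments of X can be estimated from Y alone. The density reconstruction, resampling, and clustering steps are not shown.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical confidential values held by one data custodian (simulated).
x = rng.gamma(shape=4.0, scale=2.0, size=50_000)

# Masking step: multiply by noise C that is independent of X.
# Assumption: C ~ Uniform(a, b); the paper's noise distribution may differ.
a, b = 0.5, 1.5
c = rng.uniform(a, b, size=x.size)
y = x * c  # only the masked values y would be published


def uniform_raw_moment(k, a, b):
    """Return E[C^k] for C ~ Uniform(a, b)."""
    return (b ** (k + 1) - a ** (k + 1)) / ((k + 1) * (b - a))


def recovered_moments(y, order, a, b):
    """Estimate E[X^k] for k = 1..order from masked data as E[Y^k] / E[C^k]."""
    return np.array([
        np.mean(y ** k) / uniform_raw_moment(k, a, b)
        for k in range(1, order + 1)
    ])


# Moments estimated from the masked data only.
est = recovered_moments(y, 3, a, b)

# Sample moments of the unmasked data, used here only to check the estimate.
true = np.array([np.mean(x ** k) for k in range(1, 4)])

print("estimated from masked data:", est)
print("sample moments of original:", true)
```

In this sketch the recovered moments would feed a moment-based density estimator, which is the step the abstract identifies as essential; the quality of that estimate determines how closely analyses on resampled data track the original.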

Pages: 13
