A distributed approach to enabling privacy-preserving model-based classifier training

Cited by: 6
Authors
Luo, Hangzai [2]
Fan, Jianping [1]
Lin, Xiaodong [3]
Zhou, Aoying [2]
Bertino, Elisa [4]
Affiliations
[1] Univ N Carolina, Dept Comp Sci, Charlotte, NC 28223 USA
[2] E China Normal Univ, Shanghai Key Lab Trustworthy Comp, Shanghai 200062, Peoples R China
[3] Univ Cincinnati, Dept Math Sci, Cincinnati, OH 45221 USA
[4] Purdue Univ, Dept Comp Sci, W Lafayette, IN 47907 USA
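Funding
National Natural Science Foundation of China; U.S. National Science Foundation
Keywords
Privacy-preserving classifier training; Synthetic samples; Adaptive EM algorithm; Data perturbation
DOI
10.1007/s10115-008-0167-x
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract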
This paper proposes a novel approach for privacy-preserving distributed model-based classifier training. Our approach is an important step towards supporting customizable privacy modeling and protection. It consists of three major steps. First, each data site independently learns a weak concept model (i.e., a local classifier) for a given data pattern or concept by using its own training samples. An adaptive EM algorithm is proposed to select the model structure and estimate the model parameters simultaneously. The second step deals with combined classifier training by integrating the weak concept models that are shared from multiple data sites. To reduce the data transmission costs and the potential privacy breaches, only the weak concept models are sent to the central site, and synthetic samples are generated directly from these shared weak concept models at the central site. Both the shared weak concept models and the synthetic samples are then incorporated to learn a reliable and complete global concept model. A computational approach is developed to automatically achieve a good trade-off between the privacy disclosure risk, the sharing benefit, and the data utility. The third step deals with validating the combined classifier by distributing the global concept model to all data sites in the collaboration network while limiting the potential privacy breaches. Our approach has been validated through extensive experiments carried out on four UCI machine learning data sets and two image data sets.
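The first two steps of the abstract (local model fitting, then central training on synthetic samples) can be illustrated with a minimal sketch. It is not the authors' method: a single 1-D Gaussian per class stands in for the adaptive EM algorithm, equal class priors are assumed, the central site uses only synthetic samples rather than combining them with the shared models, and every name (`fit_local_model`, `train_global_model`, `classify`, `make_site`) and all of the data are hypothetical.

```python
import math
import random
import statistics


def fit_gaussian(values):
    # Stand-in for the adaptive EM step (assumption): one 1-D Gaussian,
    # summarised as (mean, std, sample count).
    return statistics.fmean(values), statistics.pstdev(values), len(values)


def fit_local_model(labeled):
    # labeled: dict mapping class label -> list of float features held at one site.
    return {label: fit_gaussian(vals) for label, vals in labeled.items()}


def train_global_model(shared_models, rng):
    # The central site receives only (mean, std, count) triples, never raw samples.
    synthetic = {}
    for model in shared_models:
        for label, (mean, std, count) in model.items():
            draws = [rng.gauss(mean, std) for _ in range(count)]
            synthetic.setdefault(label, []).extend(draws)
    # Refit one Gaussian per class on the pooled synthetic samples.
    return {label: fit_gaussian(vals) for label, vals in synthetic.items()}


def log_density(x, mean, std):
    # Gaussian log-density up to an additive constant shared by all classes.
    return -math.log(std) - 0.5 * ((x - mean) / std) ** 2


def classify(global_model, x):
    # Pick the class with the highest log-density (equal priors assumed).
    return max(global_model, key=lambda c: log_density(x, *global_model[c][:2]))


rng = random.Random(0)


def make_site(shift):
    # Hypothetical private data for one site; never leaves this function's caller.
    return {
        "neg": [rng.gauss(0.0 + shift, 1.0) for _ in range(300)],
        "pos": [rng.gauss(5.0 + shift, 1.0) for _ in range(300)],
    }


sites = [make_site(-0.5), make_site(0.5)]
shared = [fit_local_model(site) for site in sites]  # only these are transmitted
global_model = train_global_model(shared, rng)
```

Calling `classify(global_model, x)` then labels a new feature value using the global model alone. The sketch does not cover the third step (distributed validation) or the trade-off between disclosure risk, sharing benefit, and data utility.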
Pages: 157-185
Number of pages: 29