Differential Privacy in Telco Big Data Platform

Citations: 0
Authors
Hu, Xueyang [1 ,2 ]
Yuan, Mingxuan [1 ]
Yao, Jianguo [2 ]
Deng, Yu [2 ]
Chen, Lei [3 ]
Yang, Qiang [3 ]
Guan, Haibing [2 ]
Zeng, Jia [1 ,4 ]
Affiliations
[1] Huawei Noah's Ark Lab, Hong Kong, Hong Kong, Peoples R China
[2] Shanghai Jiao Tong Univ, Shanghai Key Lab Scalable Comp & Syst, Shanghai 200030, Peoples R China
[3] Hong Kong Univ Sci & Technol, Dept Comp Sci & Engn, Hong Kong, Hong Kong, Peoples R China
[4] Soochow Univ, Collaborat Innovat Ctr Novel Software Technol & I, Suzhou, Jiangsu, Peoples R China
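Source
PROCEEDINGS OF THE VLDB ENDOWMENT | 2015 / Vol. 8 / No. 12
Keywords
DOI
Not available
Chinese Library Classification number
TP [Automation Technology, Computer Technology];
Discipline classification code
0812 ;
Abstract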
Differential privacy (DP) has been widely explored in academia recently but less so in industry, possibly due to its strong privacy guarantee. This paper makes the first attempt to implement three basic DP architectures in a deployed telecommunication (telco) big data platform for data mining applications. We find that all DP architectures lose less than 5% of prediction accuracy when a weak privacy guarantee is adopted (e.g., privacy budget parameter epsilon >= 3). However, when a strong privacy guarantee is assumed (e.g., privacy budget parameter epsilon <= 0.1), all DP architectures lead to a 15% to 30% accuracy loss, which implies that real-world industrial data mining systems cannot work well under the strong privacy guarantee recommended by previous research. Among the three basic DP architectures, the Hybridized DM (Data Mining) and DB (Database) architecture performs best because its privacy protection is designed specifically for the data mining algorithm in use. Through extensive experiments on big data, we also observe that the accuracy loss increases as the variety of features increases, but decreases as the volume of training data increases. Therefore, to make DP practically usable in large-scale industrial systems, our observations suggest three possible research directions for the future: (1) relaxing the privacy guarantee (e.g., increasing the privacy budget epsilon) and studying its effectiveness on specific industrial applications; (2) designing specific privacy schemes for specific data mining algorithms; and (3) using a large volume of data with low variety to train the classification models.
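The link between the privacy budget epsilon and accuracy loss can be made concrete with a small sketch. The abstract does not say which noise mechanism the three architectures use, so the example below assumes the standard Laplace mechanism applied to a counting query with sensitivity 1. The function names (`laplace_noise`, `private_count`, `mean_abs_error`) and the count of 1000 are hypothetical and chosen only for illustration; this is not the paper's implementation.

```python
import random


def laplace_noise(scale, rng):
    """Draw one Laplace(0, scale) sample as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(true_count, epsilon, rng, sensitivity=1.0):
    """Release a count with Laplace noise of scale sensitivity / epsilon.

    Assumption: a counting query, where adding or removing one record
    changes the result by at most 1 (sensitivity = 1).
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    return true_count + laplace_noise(sensitivity / epsilon, rng)


def mean_abs_error(epsilon, trials=20000, seed=0):
    """Estimate the average absolute error of the noisy count for a given epsilon."""
    rng = random.Random(seed)
    true_count = 1000  # hypothetical value, used only for illustration
    total = 0.0
    for _ in range(trials):
        total += abs(private_count(true_count, epsilon, rng) - true_count)
    return total / trials


if __name__ == "__main__":
    # Compare the strong (0.1) and weak (3.0) budgets mentioned in the abstract.
    for eps in (0.1, 3.0):
        print(f"epsilon={eps}: mean absolute error = {mean_abs_error(eps):.3f}")
```

Under these assumptions the expected absolute error equals the noise scale 1/epsilon, so the strong setting (epsilon = 0.1) adds about 30 times more noise per query than the weak setting (epsilon = 3). This is consistent in direction with the larger accuracy loss the abstract reports for small epsilon, although the paper's percentages come from full classification pipelines rather than a single count.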
Pages: 1692-1703
Page count: 12