Lossy Compression of Noisy Data for Private and Data-Efficient Learning

Cited by: 1
Authors
Isik, Berivan [1 ]
Weissman, Tsachy [1 ]
Affiliation
[1] Stanford Univ, Dept Elect Engn, Stanford, CA 94305 USA
Source
IEEE JOURNAL ON SELECTED AREAS IN INFORMATION THEORY, 2022, Vol. 3, No. 4
Funding
U.S. National Science Foundation (NSF)
Keywords
Compression-based denoising; rate-distortion theory; empirical distribution; learning; privacy; robustness
DOI
10.1109/JSAIT.2023.3260720
Chinese Library Classification (CLC)
TP (Automation and Computer Technology)
Discipline code
0812
Abstract
Storage-efficient privacy-preserving learning is crucial due to increasing amounts of sensitive user data required for modern learning tasks. We propose a framework for reducing the storage cost of user data while at the same time providing privacy guarantees, without essential loss in the utility of the data for learning. Our method comprises noise injection followed by lossy compression. We show that, when appropriately matching the lossy compression to the distribution of the added noise, the compressed examples converge, in distribution, to that of the noise-free training data as the sample size of the training data (or the dimension of the training data) increases. In this sense, the utility of the data for learning is essentially maintained, while reducing storage and privacy leakage by quantifiable amounts. We present experimental results on the CelebA dataset for gender classification and find that our suggested pipeline delivers in practice on the promise of the theory: the individuals in the images are unrecognizable (or less recognizable, depending on the noise level), overall storage of the data is substantially reduced, with no essential loss (and in some cases a slight boost) to the classification accuracy. As an added bonus, our experiments suggest that our method yields a substantial boost to robustness in the face of adversarial test data.
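The pipeline described in the abstract — inject noise for privacy, then apply lossy compression matched to the distribution of that noise — can be sketched as follows. This is an illustrative toy only, not the authors' exact scheme: the noise is assumed Gaussian, and "matched" compression is approximated by a uniform scalar quantizer whose step size is chosen so the quantization error is comparable to the injected noise; the function name `privatize_and_compress` is invented for this sketch.

```python
import numpy as np

def privatize_and_compress(x, noise_std=0.1, rng=None):
    """Toy sketch of the noise-then-compress pipeline.

    1) Inject Gaussian noise (privacy step).
    2) Lossily compress via uniform quantization with a step size
       matched to the noise level, so the quantizer's distortion is
       on the same scale as the injected noise.
    """
    rng = np.random.default_rng() if rng is None else rng
    noisy = x + rng.normal(0.0, noise_std, size=x.shape)  # privacy noise
    # A uniform quantizer with step s has error variance s^2/12; choosing
    # s = noise_std * sqrt(12) matches its variance to the noise variance.
    step = noise_std * np.sqrt(12.0)
    quantized = np.round(noisy / step) * step             # lossy compression
    return quantized

# Toy usage on synthetic "training data": the stored representation
# collapses to a small number of distinct levels, reducing storage.
x = np.linspace(0.0, 1.0, 1000)
y = privatize_and_compress(x, noise_std=0.05, rng=np.random.default_rng(0))
print("distinct values before/after:", len(np.unique(x)), "->", len(np.unique(y)))
```

The matching of quantizer step to noise level is the key design choice the abstract alludes to: in the paper's theory, appropriately pairing the lossy compressor with the added noise is what makes the empirical distribution of the compressed examples converge to that of the noise-free data.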
Pages: 815-823 (9 pages)