Data Provenance via Differential Auditing

Cited by: 0
Authors
Mu, Xin [1 ]
Pang, Ming [2 ]
Zhu, Feida [3 ]
Affiliations
[1] Peng Cheng Lab, Dept Strateg & Adv Interdisciplinary Res, Shenzhen 518066, Peoples R China
[2] JD Com Inc, Beijing 100101, Peoples R China
[3] Singapore Management Univ, Sch Comp & Informat Syst, Singapore 188065, Singapore
Funding
National Research Foundation, Singapore; US National Science Foundation;
Keywords
Auditing data; data provenance; machine learning;
DOI
10.1109/TKDE.2023.3334821
CLC Classification Number
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104; 0812; 0835; 1405;
Abstract
With the rising awareness of data assets, data governance, which is to understand where data comes from, how it is collected, and how it is used, has been assuming ever-growing importance. One critical component of data governance gaining increasing attention is auditing machine learning models to determine if specific data has been used for training. Existing auditing techniques, like shadow auditing methods, have shown feasibility under specific conditions such as having access to label information and knowledge of training protocols. However, these conditions are often not met in most real-world applications. In this paper, we introduce a practical framework for auditing data provenance based on a differential mechanism, i.e., after carefully designed transformation, perturbed input data from the target model's training set would result in much more drastic changes in the output than those from the model's non-training set. Our framework is data-dependent and does not require distinguishing training data from non-training data or training additional shadow models with labeled output data. Furthermore, our framework extends beyond point-based data auditing to group-based data auditing, aligning with the needs of real-world applications. Our theoretical analysis of the differential mechanism and the experimental results on real-world data sets verify the proposal's effectiveness.
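The differential mechanism described in the abstract can be sketched as follows. This is a minimal, hypothetical illustration only, assuming a callable `model` and Gaussian input perturbations; the names `output_sensitivity` and `audit_membership`, and the fixed decision threshold, are illustrative assumptions, not the paper's actual API or algorithm.

```python
import numpy as np

def output_sensitivity(model, x, eps=0.05, n_trials=20, seed=0):
    """Average change in model output when x receives small Gaussian perturbations."""
    rng = np.random.default_rng(seed)
    base = model(x)
    diffs = []
    for _ in range(n_trials):
        noisy = x + eps * rng.standard_normal(x.shape)
        diffs.append(np.abs(model(noisy) - base).mean())
    return float(np.mean(diffs))

def audit_membership(model, x, threshold):
    """Flag x as likely training data if its perturbation sensitivity exceeds threshold.

    Per the abstract's intuition, training-set inputs are expected to produce
    more drastic output changes under perturbation than non-training inputs.
    """
    return output_sensitivity(model, x) > threshold
```

In practice the threshold would have to be calibrated per model and data distribution; the sketch above only conveys the comparison of perturbation-induced output changes that the differential mechanism rests on.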
Pages: 5066-5079
Page count: 14